Voice Recognition

Intelligent speech-to-text using local OpenAI Whisper (no API key needed, fully private). Use when you need to transcribe audio files, convert voice messages...

SkillRank score: 7.4/10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

local whisper transcription with smart auto-model selection based on audio length and complexity. supports 99 languages, multiple formats, runs fully offline with no api costs or token consumption.

- structure: 6.0
- trigger phrases: 7.0
- procedure: 8.0
- edge cases: 6.0
- documentation: 8.0

---
name: voice-recognition
description: |
  Intelligent speech-to-text using local OpenAI Whisper (no API key needed, fully private).
  Use when you need to transcribe audio files, convert voice messages to text,
  recognize spoken content, or process speech input in any of 99+ languages.
  Key differentiator: smart auto-model selection analyzes audio length and complexity
  to choose the optimal Whisper model — short clean clips use the fast base model,
  long or mixed-language clips automatically upgrade to small/medium for accuracy.
---

# 🎤 Voice Recognition — Smart Auto-Model Selection

Transcribe audio to text using **local OpenAI Whisper**. No API keys, no internet required after the first model download, 100% private.

**Smart auto-selection** dynamically picks the best model based on your audio characteristics — you never have to think about which model to use.

## Quick Start

```bash
# Auto mode — analyzes audio, picks best model automatically
scripts/transcribe.py voice.ogg

# Force a specific model
scripts/transcribe.py voice.ogg --model small

# Specify language (auto-detect if omitted)
scripts/transcribe.py voice.ogg --language zh   # Chinese (Mandarin)
scripts/transcribe.py voice.ogg --language en   # English
scripts/transcribe.py voice.ogg --language yue  # Cantonese

# Show segment timestamps
scripts/transcribe.py voice.ogg --segments

# Save transcript to file
scripts/transcribe.py voice.ogg -o transcript.txt
```
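
The flags above suggest a small command-line interface. The following is a minimal sketch of how such a parser could look, not the actual contents of `scripts/transcribe.py`. The function name `build_parser` and the long option `--output` are assumptions; only `-o` appears in the examples.

```python
import argparse

MODELS = ["tiny", "base", "small", "medium", "large"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the Quick Start flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="transcribe.py",
        description="Transcribe audio with a local Whisper model.",
    )
    parser.add_argument("audio", help="path to the audio file")
    parser.add_argument("--model", choices=MODELS, default=None,
                        help="force a model; omit to use auto-selection")
    parser.add_argument("--language", default=None,
                        help="language hint such as en or zh; omit to auto-detect")
    parser.add_argument("--segments", action="store_true",
                        help="also print segment timestamps")
    # Assumption: the long name --output is illustrative; only -o is documented.
    parser.add_argument("-o", "--output", default=None,
                        help="write the transcript to this file instead of stdout")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["voice.ogg", "--model", "small", "--segments"])
    print(args.audio, args.model, args.segments)
```

Leaving `--model` and `--language` as `None` by default lets the script tell "not specified" apart from an explicit choice, which is what auto-selection and auto-detection need.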

## Smart Auto-Selection

The script analyzes audio duration + complexity and selects the optimal model automatically:

| Audio Characteristic | Model Used | Why |
|---|---|---|
| Short (<10s), clean speech | **base** | Fast (2-3s). Accurate enough for simple content. |
| Short (<10s), mixed languages | **small** | Better multilingual handling for code-switching. |
| Medium (10-60s), clean | **base** | Balanced speed and accuracy. |
| Medium (10-60s), mixed | **small** | Handles accents and language transitions. |
| Long (1-2min) | **small** | Maintains context, still fast enough. |
| Very long (2min+) | **medium** | Maximum accuracy for extended recordings. |

You don't need to think about models. Just send audio.
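
The table can be read as a small decision function. This is a sketch under stated assumptions rather than the script's real code: the name `select_model`, the boolean `is_complex` flag, and the treatment of the exact 60 s and 120 s boundaries are illustrative, since the table does not say which band a boundary value falls into.

```python
def select_model(duration_s: float, is_complex: bool = False) -> str:
    """Pick a Whisper model name from audio duration and a complexity flag.

    Assumption: durations under 60 s count as short/medium, 60-120 s as long,
    and anything above 120 s as very long.
    """
    if duration_s < 60:
        # The short (<10 s) and medium (10-60 s) rows use the same rule.
        return "small" if is_complex else "base"
    if duration_s <= 120:
        return "small"
    return "medium"


if __name__ == "__main__":
    for duration, mixed in [(7.5, False), (7.5, True), (32.0, False), (90.0, False), (300.0, False)]:
        print(f"{duration:>6.1f}s mixed={mixed!s:<5} -> {select_model(duration, mixed)}")
```

Note that the two sub-minute rows of the table collapse into one branch, because both pick `base` for clean audio and `small` for mixed audio.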

## Installation

### Prerequisites
- Python 3.10+
- pip (Python package manager)

### Via bundled installer

```bash
python3 scripts/install.py
```

### Manual

```bash
pip install openai-whisper soundfile numpy
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

### Using requirements.txt

```bash
pip install -r requirements.txt
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

> **Note:** First run downloads the Whisper model (~139MB for base, ~461MB for small).
> Subsequent runs use the cached model (`~/.cache/whisper/`) and load instantly.
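
To see ahead of time whether a run will trigger a download, you can look for the checkpoint file in the cache directory. The sketch below assumes checkpoints are stored as `<model>.pt`; the helper name `is_model_cached` is hypothetical, and aliases such as `large` may be stored under a versioned filename.

```python
from pathlib import Path

# Cache location named in the note above.
DEFAULT_CACHE = Path.home() / ".cache" / "whisper"


def is_model_cached(name: str, cache_dir: Path = DEFAULT_CACHE) -> bool:
    """Return True if a checkpoint file for `name` exists in cache_dir.

    Assumption: the checkpoint is saved as "<name>.pt".
    """
    return (cache_dir / f"{name}.pt").is_file()


if __name__ == "__main__":
    for name in ("base", "small"):
        state = "cached" if is_model_cached(name) else "will download on first use"
        print(f"{name}: {state}")
```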

## Model Reference

| Model | Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 72MB | ⚡⚡⚡ | ⭐⭐ | Real-time preview, very short clips |
| base | 139MB | ⚡⚡ | ⭐⭐⭐ | General use (auto-select default for short audio) |
| small | 461MB | ⚡ | ⭐⭐⭐⭐ | Mixed languages, accents (auto-select for long/complex) |
| medium | 1.5GB | 🐢 | ⭐⭐⭐⭐⭐ | Maximum accuracy, long recordings |
| large | 2.9GB | 🐢 | ⭐⭐⭐⭐⭐ | Research-grade transcription |

## Language Support

Whisper supports **99 languages** including:

- 🇨🇳 Chinese (Mandarin, Cantonese)
- 🇺🇸 English
- 🇪🇸 Spanish
- 🇯🇵 Japanese
- 🇰🇷 Korean
- 🇫🇷 French
- 🇩🇪 German

Auto-detects language by default. Use `--language` to provide a hint for better accuracy.

## Features

| Feature | Description |
|---|---|
| 🔒 **100% Private** | Everything runs locally. No data leaves your machine. |
| 🆓 **No API Costs** | Free unlimited transcription. No quotas, no keys. |
| 🌐 **99 Languages** | Supports virtually all major world languages. |
| 🧠 **Smart Auto-Model** | Analyzes audio → picks optimal model automatically. |
| ⚡ **Fast by Default** | Short clips → base model (2-3s). Long clips → small/medium. |
| 🎯 **Accurate When Needed** | Complex/mixed audio automatically upgrades the model. |
| 📊 **Segment Timestamps** | Sentence-level timing for long recordings. |
| 📁 **Multiple Formats** | OGG, WAV, MP3, M4A, FLAC, OPUS and more. |

## Supported Audio Formats

| Format | Extension | Notes |
|---|---|---|
| OGG Opus | `.ogg` | Common voice message format ✅ |
| WAV | `.wav` | Uncompressed, high quality |
| MP3 | `.mp3` | Compressed audio |
| M4A | `.m4a` | Apple/MPEG-4 audio |
| FLAC | `.flac` | Lossless compressed |
| OPUS | `.opus` | Pure Opus stream |

## Usage Examples

### Quick transcription (auto model)

```bash
$ scripts/transcribe.py meeting.ogg
📂 Loading audio: meeting.ogg
⏱  Duration: 32.0s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (4.1s total)
Meeting notes: Today we discuss three topics. First, project progress...
```

### Transcription in context

```bash
# Chinese
scripts/transcribe.py voice.ogg --language zh

# English lecture with timestamps
scripts/transcribe.py lecture.m4a --language en --segments

# Mixed Chinese-English interview (auto complexity detection)
scripts/transcribe.py interview.ogg

# Save to file
scripts/transcribe.py podcast.mp3 -o transcript.txt

# Force high accuracy
scripts/transcribe.py important.wav --model medium
```

### Output with segments

```bash
$ scripts/transcribe.py message.ogg --segments
📂 Loading audio: message.ogg
⏱  Duration: 7.5s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (2.4s total)
Now I'm sending this voice message to XiaoA, can you recognize what I said?

📝 Segments:
   [0.0s - 3.6s] Now I'm sending this voice message
   [3.6s - 7.4s] to XiaoA, can you recognize what I said?
```
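
The segment listing above can be produced from per-segment data with a few lines of formatting. This is a sketch, assuming each segment is a mapping with `start`, `end`, and `text` keys (the shape Whisper uses in its `segments` result); the function name `format_segments` is illustrative.

```python
from typing import Iterable, Mapping


def format_segments(segments: Iterable[Mapping]) -> str:
    """Render segments in the "[start - end] text" layout shown above."""
    lines = ["📝 Segments:"]
    for seg in segments:
        start, end = float(seg["start"]), float(seg["end"])
        lines.append(f"   [{start:.1f}s - {end:.1f}s] {seg['text'].strip()}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 3.6, "text": " Now I'm sending this voice message"},
        {"start": 3.6, "end": 7.4, "text": " to XiaoA, can you recognize what I said?"},
    ]
    print(format_segments(demo))
```

The `strip()` call is there because segment text typically starts with a space.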

## Troubleshooting

| Problem | Solution |
|---|---|
| `No module` error | Use the venv Python: `python3 scripts/transcribe.py` or run `scripts/install.py` |
| Slow transcription | The first run downloads and caches the model (~139-461MB), so it is slower. Later runs load the model from cache. |
| Wrong language detected | Pass `--language en` or `--language zh` for a hint |
| Background noise | Use `--model small` or `--model medium` for noisy environments |

## Token Savings Examples

| Scenario | Cloud API Cost | This Skill | Savings |
|---|---|---|---|
| 10 short voice messages/day | ~$0.60/day (Whisper API) | **$0** | ∞ |
| 1 hour meeting transcription | ~$2.88 (Deepgram) | **$0** | ∞ |
| 1000 files for a project | ~$50-200 | **$0** | ∞ |
| Agent processing voice inputs | LLM tokens + API fees | **0 tokens** | Full token budget saved |

## Privacy & Security

- **100% offline** — no data leaves your machine.
- **No API keys** — no third-party services, no accounts.
- **No telemetry** — zero tracking.
- **No cloud** — everything runs locally.
- **Zero token consumption** — frees your LLM budget for reasoning.

Your audio is yours. Always.


added explicit intent, inputs with setup guidance, step-by-step procedure with input/output signatures, comprehensive decision points covering model selection and error handling, output contract with file format and exit codes, and outcome signals for user validation

voice recognition, smart auto-model selection

Item: Voice Recognition
Rating: 7.4
Author: Implexa

transcribe audio to text using local openai whisper. no api keys, no internet required, 100% private.

smart auto-selection dynamically picks the best model based on your audio characteristics. you never have to think about which model to use.

intent

use this skill to transcribe audio files into text with zero external dependencies, zero cost, and zero privacy risk. the skill analyzes audio duration and complexity, then automatically selects the optimal whisper model (base, small, or medium) to balance speed and accuracy. run it when you need to convert voice messages, meeting recordings, interviews, podcasts, or any speech input in 99+ languages into written text. best for workflows where you control the audio file locally and want full privacy with no api quotas or rate limits.

inputs

required

audio file path (string): local path to an audio file in one of these formats: ogg, wav, mp3, m4a, flac, opus. the file must be readable. keep files under 500mb as a practical guideline: there is no hard size limit, but very large files take proportionally longer.
python 3.10 or higher: runtime environment.
openai-whisper package: install via pip install openai-whisper soundfile numpy or use bundled installer.
pytorch: install cpu version via pip install torch --index-url https://download.pytorch.org/whl/cpu (gpu optional but not required).

optional

language code (string): whisper language code, usually iso 639-1 (e.g., "en", "zh", "es"); "yue" (cantonese) is an iso 639-3 code. if omitted, whisper auto-detects. provide this as a hint if auto-detection fails or for faster inference.
model override (string): force a specific model ("tiny", "base", "small", "medium", "large"). if omitted, auto-selection runs.
output file path (string): if specified, transcript writes to this file instead of stdout.
segment timestamps flag (boolean): if true, include sentence-level timing data.

external connections

none required at transcription time; all processing is local. the first run needs internet access to download the selected whisper model (~139mb for base, ~461mb for small, ~1.5gb for medium) from openai's model hosting to ~/.cache/whisper/. subsequent runs use the cached model.

procedure

1. load and validate audio file (input: audio file path; output: audio object, duration in seconds, sample rate in hz)
- read file from disk using soundfile library.
- if file is missing or unreadable, exit with error "file not found" or "file is not a valid audio format".
- extract duration and sample rate metadata.
2. analyze audio characteristics (input: duration, sample rate; output: complexity flag)
- measure duration in seconds.
- scan audio for silence, speech gaps, and language transitions to estimate complexity (low = clean monolingual, high = mixed languages, background noise, accents).
- assign complexity flag: 0 (clean), 1 (mixed/noisy).
3. select model (input: duration, complexity flag, optional model override; output: model name)
- if user provided --model flag, skip to step 4 using that model.
- else apply auto-selection rule (see decision points below).
4. download or load model (input: model name; output: loaded whisper model, load time in seconds)
- check if model exists in ~/.cache/whisper/.
- if cached, load from disk (~1-3s).
- if not cached, download it (first run only, ~10-30s depending on model size and network speed).
- if download fails (network timeout, corrupted file), retry up to 2 times with exponential backoff (2s, 4s). if all retries fail, exit with error "model download failed: check internet connection or cache manually".
5. transcribe audio (input: loaded model, audio object, optional language hint; output: transcript string, confidence scores per segment)
- pass audio and language hint (if provided) to whisper.
- whisper processes audio in 30-second chunks internally.
- if no language hint is given, whisper detects the language from the first chunk.
- collect output: full transcript text + per-segment metadata (start time, end time, text, confidence).
- if transcription fails (cuda out of memory, corrupted audio), fall back to cpu inference (if gpu was used) or exit with "transcription failed: audio may be corrupted".
6. format and output transcript (input: transcript text, segment metadata, optional output file path, optional segment flag; output: formatted text or file)
- if --segments flag is set, format output as:
```
[full transcript here]

📝 Segments:
   [0.0s - 3.6s] segment text here
   [3.6s - 7.4s] next segment text
```
- else output full transcript only.
- if output file path is specified, write result to file (create or overwrite). if file write fails (permission denied, disk full), exit with error.
- if no output file specified, print to stdout.
7. log performance metrics (input: model name, audio duration, transcription time; output: console output with timing)
- display: audio duration, sample rate, selected model, model load time, transcription time, total elapsed time.

decision points

if user provides explicit model flag (--model)

skip auto-selection. use the specified model. proceed to step 4.
else, continue to next decision.

if audio duration < 10 seconds

if complexity is low (clean speech, single language), select base model.
else (mixed language, background noise, accents detected), select small model.
proceed to step 4.

if audio duration 10-60 seconds

if complexity is low, select base model.
else, select small model.
proceed to step 4.

if audio duration 1-2 minutes

select small model (balances accuracy and speed).
proceed to step 4.

if audio duration > 2 minutes

select medium model (maximum accuracy for long recordings).
proceed to step 4.

if model download fails due to network error

retry up to 2 times with exponential backoff (2s, then 4s).
if all retries fail, exit with error "unable to download model. check internet connection. you can manually download the model file and place it in ~/.cache/whisper/".
do not fall back to a smaller model unless user explicitly requests it.
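
the retry rule above can be sketched as a small wrapper. this is an illustration under assumptions, not the skill's actual code: the name `with_retries` and the injectable `sleep` parameter are hypothetical, and network failures are assumed to surface as `OSError` (which covers timeouts and `urllib` errors).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    retries: int = 2,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run action; on OSError retry up to `retries` times, waiting 2s, then 4s."""
    attempt = 0
    while True:
        try:
            return action()
        except OSError:
            if attempt >= retries:
                raise  # retries exhausted: the caller reports the error and exits
            sleep(base_delay * (2 ** attempt))  # 2.0, then 4.0
            attempt += 1
```

the wrapper only re-raises after the last attempt and never substitutes a smaller model, which matches the rule that fallback must be an explicit user choice.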

if language is not specified

whisper auto-detects language during transcription (step 5).
if confidence in auto-detected language is very low (< 0.5), output warning to stderr: "language detection confidence is low. consider providing --language hint".
do not halt transcription. continue with auto-detected language.
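
a minimal sketch of the low-confidence warning, assuming the detector yields a mapping from language code to probability (whisper's `detect_language` returns such a mapping for a single audio segment). the function name `pick_language` and its `stream` parameter are hypothetical.

```python
import sys
from typing import Mapping, TextIO

LOW_CONFIDENCE_WARNING = (
    "language detection confidence is low. consider providing --language hint"
)


def pick_language(
    probs: Mapping[str, float],
    threshold: float = 0.5,
    stream: TextIO = sys.stderr,
) -> str:
    """Return the most probable language; warn, but never halt, if it is below threshold."""
    language = max(probs, key=lambda code: probs[code])
    if probs[language] < threshold:
        print(LOW_CONFIDENCE_WARNING, file=stream)
    return language


if __name__ == "__main__":
    # Hypothetical probabilities for a code-switched clip.
    print(pick_language({"en": 0.41, "zh": 0.38, "ja": 0.21}))
```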

if output file path is specified but file cannot be written

exit with error: "cannot write to [filepath]: permission denied or disk full".
do not fall back to stdout. user explicitly requested file output.

if audio is very short (< 1 second)

still process normally with base model.
output warning: "audio is very short (< 1s). transcript may be incomplete or empty".

output contract

on success

transcript: plain text string containing the full transcription of the audio.
location: stdout (console) or specified output file (if -o flag used).
format:
- basic: single block of text with full transcript.
- with segments: full transcript followed by "📝 Segments:" header and list of [start_time - end_time] segment_text, each on one line.
metadata logged to stderr: duration, sample rate, selected model, load time, transcription time, total time.
exit code: 0 on success.

on failure

exit code: 1.
error message to stderr describing the failure (file not found, model download failed, transcription failed, etc.).
no partial output written to file. file either contains full valid transcript or is not created/modified.
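
one common way to get the "full transcript or nothing" guarantee is to write to a temporary file in the target directory and rename it into place. the sketch below shows that technique under assumptions; the helper names `write_atomically` and `save_transcript` are hypothetical, and the skill may implement the contract differently.

```python
import os
import sys
import tempfile


def write_atomically(path: str, text: str) -> None:
    """Write text to a temp file next to the target, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # leave no partial file behind
        raise


def save_transcript(path: str, text: str) -> int:
    """Return the exit code: 0 on success, 1 if the file cannot be written."""
    try:
        write_atomically(path, text)
    except OSError as exc:
        print(f"cannot write to {path}: {exc}", file=sys.stderr)
        return 1
    return 0
```

because the rename happens only after the write succeeds, an existing transcript at the target path stays untouched when the write fails.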

outcome signal

you know the skill worked when:

the console displays a summary line like "✅ done (4.1s total)" or similar completion message.
the transcript text appears on stdout (or in the specified output file if -o was used).
if you used --segments, you see timestamped segments with start and end times.
exit code is 0 (check with echo $? on unix/linux).
no errors printed to stderr (other than the timing metadata and optional warnings like low language confidence).
the transcribed text matches the actual speech content of the audio (spot check: listen to a 5-10s clip and verify the transcript is accurate).
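
the first few signals can be checked automatically from a wrapper script. this is a sketch: `run_and_check` is a hypothetical helper that only verifies the exit code and that stdout is non-empty; it cannot judge transcript accuracy.

```python
import subprocess
from typing import List, Tuple


def run_and_check(cmd: List[str]) -> Tuple[bool, str]:
    """Run a command; report success if it exits 0 and prints something to stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    ok = result.returncode == 0 and result.stdout.strip() != ""
    return ok, result.stdout
```

a call such as `run_and_check(["python3", "scripts/transcribe.py", "voice.ogg"])` would return the success flag together with the transcript text for a manual spot check.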
