This skill should be used when the user explicitly asks to "transcribe a meeting", "transcribe audio", "transcribe a meeting recording", "convert audio to te...
---
name: audio-transcribe
version: 1.7.1
description: >
This skill should be used when the user explicitly asks to "transcribe a meeting",
"transcribe audio", "transcribe a meeting recording",
"convert audio to text", "generate meeting minutes from audio",
"do speech-to-text", "transcribe with speaker diarization",
"identify speakers in audio", "transcribe Chinese audio",
"transcribe English audio", "transcribe Japanese audio",
"multi-speaker transcription", "transcribe a podcast",
"transcribe podcast episode", "transcribe an interview",
"convert podcast to text", "podcast to transcript",
or mentions FunASR, Paraformer, SenseVoice, Whisper, MiMo, MiMo-V2.5-ASR,
meeting transcription, podcast transcription, or speaker diarization.
Supports multi-speaker meeting and podcast transcription in Chinese,
English, Japanese, Korean, Cantonese, and 99 languages (via Whisper),
plus Xiaomi MiMo-V2.5-ASR (8B, local GPU) for stronger proper-noun and
code-switching accuracy. Automatic speaker diarization via CAM++,
hotword biasing (FunASR path), LLM cleanup. FunASR works on GPU and CPU;
MiMo requires a local CUDA GPU with >=20GB VRAM.
metadata:
openclaw:
requires:
bins: ["python3", "ffmpeg"]
env_vars:
- name: AWS_REGION
required: false
description: "AWS region for Bedrock LLM cleanup (default: us-west-2). Bedrock uses the standard AWS credential chain (IAM role, SSO, ~/.aws/credentials, env vars) ā no explicit keys needed."
- name: ANTHROPIC_API_KEY
required: false
description: "API key for Anthropic Claude LLM cleanup"
- name: OPENAI_API_KEY
required: false
description: "API key for OpenAI-compatible LLM cleanup"
- name: OPENAI_BASE_URL
required: false
description: "Base URL for OpenAI-compatible API (vLLM, Ollama, etc.)"
emoji: "šļø"
homepage: "https://github.com/zxkane/audio-transcriber"
---
# Meeting & Podcast Transcription (FunASR + MiMo)
Transcribe multi-speaker audio into structured Markdown with automatic
speaker diarization, hotword biasing, and optional LLM cleanup. Two
ASR engine families are available: **FunASR** (Paraformer / SenseVoice /
Whisper ā fast, cheap, GPU or CPU, 99 languages) and **MiMo-V2.5-ASR**
(Xiaomi's 8B model, local GPU only, stronger on proper nouns and
code-switching). Both share the same VAD + speaker-clustering stack.
All scripts run directly from the plugin directory ā no copying needed.
Define this shorthand at the start of every session:
```bash
SCRIPTS=${CLAUDE_PLUGIN_ROOT}/skills/audio-transcribe/scripts
```
## Supported Languages
| `--lang` | Model | Languages | Hotword |
|----------|-------|-----------|---------|
| `zh` (default) | SeACo-Paraformer | Chinese (CER 1.95%) | Yes |
| `zh-basic` | Paraformer-large | Chinese | No |
| `en` | Paraformer-en | English | No |
| `auto` | SenseVoiceSmall | Auto-detect: zh/en/ja/ko/yue | No |
| `whisper` | Whisper-large-v3-turbo | 99 languages | No |
| `mimo` | MiMo-V2.5-ASR (local 8B, GPU-only) | zh/en/code-switch/dialects | No |
All presets include **speaker diarization** (CAM++) and **VAD** (FSMN).
`mimo` reuses the FSMN VAD + CAM++ stack around MiMo's text output.
> **Diarization caveat:** `auto` and `whisper` do not output per-sentence timestamps,
> so speaker diarization does not work with these presets. Use `zh`, `zh-basic`,
> `en`, or `mimo` when speaker identification is needed (e.g., podcasts, meetings).
## Workflow
Before starting transcription, **always ask the user**:
1. **Audio file** ā path to the recording (required)
2. **Type** ā meeting, podcast, or interview? (affects defaults)
3. **Language** ā what language is spoken? (default: Chinese)
4. **Number of speakers** ā how many participants? (improves diarization)
5. **Speaker names** ā for podcasts: host + guest names; for meetings: attendee list
6. **Supporting files** ā ask:
> "Do you have any of the following to improve accuracy?"
> - **Attendee / guest list** ā for hotwords and speaker mapping
> - **Meeting agenda or episode topic** ā for hotwords (terms, names)
> - **Reference documents** (show notes, prior notes) ā for speaker identification and ASR correction
**Adapt defaults by recording type:**
- **Meeting**: default `--lang zh`, ask about supporting files
- **Podcast / interview**: default `--lang zh`, `--num-speakers 2`, always ask for
host + guest names, suggest `--speaker-context` for roles
(do NOT use `--lang auto` ā it lacks timestamps for speaker diarization)
> **ā ļø `--speakers` must use the speaker's real name, not a podcast alias.**
> The value passed to `--speakers` is used verbatim as the speaker label in the
> output transcript. Always derive it from the host/guest's actual name (e.g.
> from a shownotes "Host:" field), not from the podcast feed name or title.
>
> Example: if shownotes lists "Host: å¼ äøļ¼å¼ äøēę客ļ¼", pass `--speakers 'å¼ äø'`
> ā not the alias "å¼ äøēę客". Add both the real name and the alias to
> `hotwords.txt` so ASR can recognise both forms.
>
> When both `--speakers` and `--reference` are supplied, the script detects
> this mistake at startup and prints an `ACTION REQUIRED` block naming the
> suggested real name. **If you see that block, stop the run and re-invoke
> with the corrected `--speakers` value before Phase 3** ā the warning does
> not abort the pipeline.
If the user provides supporting materials:
- Extract participant names and key terms ā create `hotwords.txt` (include both real name and alias)
- Extract per-person context ā create `speaker-context.json`
- Pass original reference document with `--reference`
- Use all three together for best results
## Quick Start
### 1. Environment Setup
```bash
AUTO_YES=1 bash $SCRIPTS/setup_env.sh
# Or force CPU: AUTO_YES=1 bash $SCRIPTS/setup_env.sh cpu
```
The setup script patches FunASR's spectral clustering for O(N²·k) performance.
Without this, recordings over ~1 hour hang for hours during speaker clustering.
### 2. Run Transcription
Output files are written to the current working directory.
**LLM cleanup (Phase 3) is opt-in.** By default, transcription runs locally
without contacting any external service. To enable LLM-powered ASR correction
and speaker name refinement, pass `--model <model-id>`. Use LLM cleanup when:
- The raw transcript has many ASR errors (names, technical terms)
- You need polished, publication-ready output
- Speaker names need to be refined from context
> **ā ļø Data Privacy:** When LLM cleanup is enabled via `--model`, transcript
> excerpts are sent to external LLM providers (AWS Bedrock, Anthropic, or
> OpenAI depending on the model ID). Use `--skip-llm` or omit `--model` to
> keep all data local. For Bedrock, boto3 uses the standard AWS credential
> chain (IAM role, SSO, `~/.aws/credentials`, env vars).
```bash
# Chinese meeting with hotwords (local-only, no LLM)
python3 $SCRIPTS/transcribe.py meeting.wav \
--lang zh --num-speakers 9 --hotwords hotwords.txt
# English meeting with speaker names
python3 $SCRIPTS/transcribe.py meeting.wav \
--lang en --speakers "Alice,Bob,Carol,Dave"
# Auto-detect language (zh/en/ja/ko/yue)
python3 $SCRIPTS/transcribe.py meeting.wav \
--lang auto --num-speakers 6
# Whisper for any language
python3 $SCRIPTS/transcribe.py meeting.wav \
--lang whisper --num-speakers 4
# Enable LLM cleanup for polished output (requires --model)
# Bedrock (uses AWS credential chain: IAM role, SSO, ~/.aws/credentials)
python3 $SCRIPTS/transcribe.py meeting.wav \
--lang zh --num-speakers 9 --hotwords hotwords.txt \
--provider bedrock --model us.anthropic.claude-sonnet-4-6
# Bedrock "global" cross-region profile (recent AWS deployments)
python3 $SCRIPTS/transcribe.py meeting.wav \
--provider bedrock --model global.anthropic.claude-sonnet-4-6
# Bedrock via litellm-style wrapper (supported; prefix is stripped for boto3)
python3 $SCRIPTS/transcribe.py meeting.wav \
--provider bedrock --model amazon-bedrock/global.anthropic.claude-sonnet-4-6
# Anthropic API (requires ANTHROPIC_API_KEY env var)
python3 $SCRIPTS/transcribe.py meeting.wav \
--provider anthropic --model claude-sonnet-4-6
# OpenAI-compatible API (requires OPENAI_API_KEY env var)
python3 $SCRIPTS/transcribe.py meeting.wav \
--provider openai --model gpt-4o
# Full pipeline with all supporting files + LLM (best quality)
python3 $SCRIPTS/transcribe.py episode.m4a \
--lang zh --num-speakers 2 \
--hotwords hotwords.txt \
--speakers "å
³ē¾½,å¼ é£" \
--speaker-context speaker-context.json \
--reference show-notes.md \
--model us.anthropic.claude-sonnet-4-6
# Resume interrupted LLM cleanup
python3 $SCRIPTS/transcribe.py meeting.wav \
--skip-transcribe --model us.anthropic.claude-sonnet-4-6
```
### 3. Verify Speaker Labels
If the transcript has swapped speaker labels (common with podcasts),
the verification script can detect and fix mismatches using LLM analysis:
```bash
# Dry-run: check if host/guest are swapped
python3 $SCRIPTS/verify_speakers.py podcast_raw_transcript.json \
--speakers "å
³ē¾½,å¼ é£" \
--speaker-context speaker-context.json
# Apply the fix
python3 $SCRIPTS/verify_speakers.py podcast_raw_transcript.json \
--speakers "å
³ē¾½,å¼ é£" \
--speaker-context speaker-context.json --fix
# Multi-speaker meeting: full reassignment
python3 $SCRIPTS/verify_speakers.py meeting_raw_transcript.json \
--speakers "Alice,Bob,Carol,Dave" \
--speaker-context speaker-context.json --fix
# Then regenerate the markdown with corrected labels
python3 $SCRIPTS/transcribe.py original.m4a \
--skip-transcribe --clean-cache
```
The script analyzes the first 5 minutes (configurable with `--minutes`)
and auto-detects podcast (2 speakers, swap detection) vs meeting
(N speakers, full reassignment).
## Audio Preprocessing
The script automatically converts input audio to 16kHz mono FLAC and
validates that no audio is lost (detects silent truncation).
| Format | 4h14m meeting | Quality | Recommendation |
|--------|--------------|---------|----------------|
| **FLAC** | **219MB** | Lossless | **Default, safest** |
| Opus | 55MB | Lossy | Risk of truncation on long files |
| WAV | 465MB | Lossless | Works but larger |
| Original M4A | 173MB | Source | Also works directly |
**Do NOT split long recordings** ā splitting breaks speaker ID consistency.
## MiMo-V2.5-ASR (optional, GPU-only)
`--lang mimo` runs Xiaomi's
[MiMo-V2.5-ASR](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) locally on a
CUDA GPU. Use it when:
- You want to evaluate MiMo against Paraformer on Chinese audio.
- The recording has heavy code-switching, dialects (Wu, Cantonese, Hokkien,
Sichuanese), lyrics, or rare proper nouns that other presets mis-transcribe.
**Requirements:**
- CUDA ā„12.0 and **ā„20 GB VRAM** (16 GB cards OOM during inference).
- Python 3.12 (enforced by `setup_env.sh`).
- ~20 GB weight download (one-time) and `flash-attn==2.7.4.post1` compile
(needs `nvcc` from the CUDA toolkit, takes 10ā30 min).
**Install (opt-in):**
```bash
# One-time: install MiMo on top of the standard environment
AUTO_YES=1 INSTALL_MIMO=1 \
MIMO_WEIGHTS_PATH=/mnt/models/hf \
bash $SCRIPTS/setup_env.sh
```
**Run:**
```bash
python3 $SCRIPTS/transcribe.py podcast.m4a \
--lang mimo --num-speakers 2 \
--mimo-weights-path /mnt/models/hf
```
**Resume after failure:**
```bash
python3 $SCRIPTS/transcribe.py podcast.m4a \
--lang mimo --resume-mimo --mimo-weights-path /mnt/models/hf
```
**Limitations:**
- No hotword biasing (MiMo has no API for it ā `--hotwords` is ignored).
- No CPU fallback.
- Inference is slower than Paraformer on the same GPU (8B model vs ~0.3B);
expect RTF around 0.1ā0.2 on an A100.
## Key Flags
| Flag | Purpose |
|------|---------|
| `--lang` | `zh` (default), `zh-basic`, `en`, `auto`, `whisper` |
| `--hotwords` | Hotword file or string ā biases ASR (zh only) |
| `--reference F` | Reference file for LLM ASR correction |
| `--num-speakers N` | Expected speaker count (improves diarization) |
| `--speakers "A,B,C"` | Assign real names by first-appearance order |
| `--speaker-context F` | JSON with per-speaker roles for LLM |
| `--no-detect-gender` | Disable automatic speaker gender detection (CAM++ gender classifier) |
| `--speaker-genders "A:female,B:male"` | Override per-speaker gender (also accepts positional `female,male`) |
| `--audio-format` | `flac` (default), `opus`, `wav` |
| `--device cpu` | Force CPU mode |
| `--batch-size N` | Adjust for memory (60 for CPU, 100 if GPU OOM) |
| `--phase1-only` | Exit after Phase 1 (VAD + ASR + diarization), skip Phase 2 + 3 |
| `--json-out PATH` | Write raw transcript JSON to explicit path (overrides default naming) |
| `--skip-transcribe` | Resume from saved `*_raw_transcript.json` |
| `--skip-llm` | Skip LLM cleanup (default when `--model` is omitted) |
| `--model ID` | Enable LLM cleanup with this model (auto-detects Bedrock/Anthropic/OpenAI) |
| `--title "..."` | Output document title |
| `--clean-cache` | Delete LLM chunk cache after completion |
| `--output PATH` | Custom output file path |
| `--model-cache-dir` | ModelScope model cache directory (~3GB, default: `~/.cache/modelscope/`) |
| `--mimo-audio-tag` | MiMo language hint: `<chinese>` (default), `<english>`, `<auto>` |
| `--mimo-batch N` | Concurrent VAD segments per MiMo call (default 1; H100/80GB can go higher) |
| `--mimo-weights-path DIR` | Cache dir for MiMo weights (default: `$HF_HOME` ā `~/.cache/huggingface`) |
| `--resume-mimo` | Resume MiMo Phase 1 from `*_mimo_partial.json` after a mid-run failure |
## Outputs
- `<stem>-transcript.md` ā Final Markdown with speaker labels and timestamps
- `<stem>_raw_transcript.json` ā Raw Phase 1 output (for resume/analysis)
## Speaker Diarization Tips
FunASR's CAM++ may merge acoustically similar speakers. To improve:
1. **`--num-speakers N`** ā Hint expected count
2. **`--hotwords`** ā Include participant names (Chinese names work best)
3. **`--speaker-context`** ā Provide per-person keywords for LLM splitting
4. **Keyword matching** ā Search `*_raw_transcript.json` for unique phrases
### Speaker gender
Enabled by default: each detected speaker is classified as `male` / `female`
via 3D-Speaker's CAM++ gender classifier (`iic/speech_campplus_two_class_gender_16k`).
The result appears next to each name in the **Speaker List** table and is
injected into the LLM cleanup prompt so pronouns (ä»/儹, he/she) get corrected.
Precedence when combined:
1. `--speaker-genders "Alice:female,Bob:male"` (explicit CLI) ā always wins
2. Reference text hints like `äø»ęļ¼å„³ļ¼ļ¼é©ę¢
ę¢
` or `Host (male): Alice` ā override auto
3. CAM++ auto-detection ā fallback
Disable with `--no-detect-gender` if you don't need gender and want to save
the ~500 MB model download and extra inference time.
## CPU-only / Low-Memory Machines
Long recordings on resource-constrained machines may hit exec timeouts
or OOM kills. See `references/pipeline-details.md` for workarounds:
- Detach from agent timeouts with `systemd-run` or `nohup`
- Prevent OOM via swap and/or `--lang zh-basic` (lighter model)
## Additional Resources
- **`references/pipeline-details.md`** ā Architecture, model specs, benchmarks,
speaker role verification, hotword effectiveness, clustering patch
- **`scripts/transcribe.py`** ā Main transcription pipeline
- **`scripts/verify_speakers.py`** ā Speaker label verification & fix
- **`scripts/llm_utils.py`** ā Shared LLM infrastructure (Bedrock/Anthropic/OpenAI)
- **`scripts/setup_env.sh`** ā Environment setup (venv + deps + patch)
don't have the plugin yet? install it then click "run inline in claude" again.