This skill should be used when the user explicitly asks to "transcribe a meeting", "transcribe audio", "transcribe a meeting recording", "convert audio to te...
SKILL.md

---
name: audio-transcribe
version: 1.7.1
description: >
  This skill should be used when the user explicitly asks to "transcribe a meeting",
  "transcribe audio", "transcribe a meeting recording",
  "convert audio to text", "generate meeting minutes from audio",
  "do speech-to-text", "transcribe with speaker diarization",
  "identify speakers in audio", "transcribe Chinese audio",
  "transcribe English audio", "transcribe Japanese audio",
  "multi-speaker transcription", "transcribe a podcast",
  "transcribe podcast episode", "transcribe an interview",
  "convert podcast to text", "podcast to transcript",
  or mentions FunASR, Paraformer, SenseVoice, Whisper, MiMo, MiMo-V2.5-ASR,
  meeting transcription, podcast transcription, or speaker diarization.
  Supports multi-speaker meeting and podcast transcription in Chinese,
  English, Japanese, Korean, Cantonese, and 99 languages (via Whisper),
  plus Xiaomi MiMo-V2.5-ASR (8B, local GPU) for stronger proper-noun and
  code-switching accuracy. Automatic speaker diarization via CAM++,
  hotword biasing (FunASR path), LLM cleanup. FunASR works on GPU and CPU;
  MiMo requires a local CUDA GPU with >=20GB VRAM.
metadata:
  openclaw:
    requires:
      bins: ["python3", "ffmpeg"]
    env_vars:
      - name: AWS_REGION
        required: false
        description: "AWS region for Bedrock LLM cleanup (default: us-west-2). Bedrock uses the standard AWS credential chain (IAM role, SSO, ~/.aws/credentials, env vars) — no explicit keys needed."
      - name: ANTHROPIC_API_KEY
        required: false
        description: "API key for Anthropic Claude LLM cleanup"
      - name: OPENAI_API_KEY
        required: false
        description: "API key for OpenAI-compatible LLM cleanup"
      - name: OPENAI_BASE_URL
        required: false
        description: "Base URL for OpenAI-compatible API (vLLM, Ollama, etc.)"
    emoji: "🎙️"
    homepage: "https://github.com/zxkane/audio-transcriber"
---

# Meeting & Podcast Transcription (FunASR + MiMo)

Transcribe multi-speaker audio into structured Markdown with automatic
speaker diarization, hotword biasing, and optional LLM cleanup. Two
ASR engine families are available: **FunASR** (Paraformer / SenseVoice /
Whisper — fast, cheap, GPU or CPU, 99 languages) and **MiMo-V2.5-ASR**
(Xiaomi's 8B model, local GPU only, stronger on proper nouns and
code-switching). Both share the same VAD + speaker-clustering stack.

All scripts run directly from the plugin directory — no copying needed.
Define this shorthand at the start of every session:

```bash
SCRIPTS=${CLAUDE_PLUGIN_ROOT}/skills/audio-transcribe/scripts
```

## Supported Languages

| `--lang` | Model | Languages | Hotword |
|----------|-------|-----------|---------|
| `zh` (default) | SeACo-Paraformer | Chinese (CER 1.95%) | Yes |
| `zh-basic` | Paraformer-large | Chinese | No |
| `en` | Paraformer-en | English | No |
| `auto` | SenseVoiceSmall | Auto-detect: zh/en/ja/ko/yue | No |
| `whisper` | Whisper-large-v3-turbo | 99 languages | No |
| `mimo` | MiMo-V2.5-ASR (local 8B, GPU-only) | zh/en/code-switch/dialects | No |

All presets include **speaker diarization** (CAM++) and **VAD** (FSMN).
`mimo` reuses the FSMN VAD + CAM++ stack around MiMo's text output.

> **Diarization caveat:** `auto` and `whisper` do not output per-sentence timestamps,
> so speaker diarization does not work with these presets. Use `zh`, `zh-basic`,
> `en`, or `mimo` when speaker identification is needed (e.g., podcasts, meetings).

## Workflow

Before starting transcription, **always ask the user**:

1. **Audio file** — path to the recording (required)
2. **Type** — meeting, podcast, or interview? (affects defaults)
3. **Language** — what language is spoken? (default: Chinese)
4. **Number of speakers** — how many participants? (improves diarization)
5. **Speaker names** — for podcasts: host + guest names; for meetings: attendee list
6. **Supporting files** — ask:
   > "Do you have any of the following to improve accuracy?"
   > - **Attendee / guest list** — for hotwords and speaker mapping
   > - **Meeting agenda or episode topic** — for hotwords (terms, names)
   > - **Reference documents** (show notes, prior notes) — for speaker identification and ASR correction

**Adapt defaults by recording type:**
- **Meeting**: default `--lang zh`, ask about supporting files
- **Podcast / interview**: default `--lang zh`, `--num-speakers 2`, always ask for
  host + guest names, suggest `--speaker-context` for roles
  (do NOT use `--lang auto` — it lacks timestamps for speaker diarization)

> **⚠️ `--speakers` must use the speaker's real name, not a podcast alias.**
> The value passed to `--speakers` is used verbatim as the speaker label in the
> output transcript. Always derive it from the host/guest's actual name (e.g.
> from a shownotes "Host:" field), not from the podcast feed name or title.
>
> Example: if shownotes lists "Host: 张三（张三的播客）", pass `--speakers '张三'`
> — not the alias "张三的播客". Add both the real name and the alias to
> `hotwords.txt` so ASR can recognise both forms.
>
> When both `--speakers` and `--reference` are supplied, the script detects
> this mistake at startup and prints an `ACTION REQUIRED` block naming the
> suggested real name. **If you see that block, stop the run and re-invoke
> with the corrected `--speakers` value before Phase 3** — the warning does
> not abort the pipeline.

If the user provides supporting materials:
- Extract participant names and key terms → create `hotwords.txt` (include both real name and alias)
- Extract per-person context → create `speaker-context.json`
- Pass original reference document with `--reference`
- Use all three together for best results

## Quick Start

### 1. Environment Setup

```bash
AUTO_YES=1 bash $SCRIPTS/setup_env.sh
# Or force CPU:  AUTO_YES=1 bash $SCRIPTS/setup_env.sh cpu
```

The setup script patches FunASR's spectral clustering for O(N²·k) performance.
Without this, recordings over ~1 hour hang for hours during speaker clustering.

### 2. Run Transcription

Output files are written to the current working directory.

**LLM cleanup (Phase 3) is opt-in.** By default, transcription runs locally
without contacting any external service. To enable LLM-powered ASR correction
and speaker name refinement, pass `--model <model-id>`. Use LLM cleanup when:
- The raw transcript has many ASR errors (names, technical terms)
- You need polished, publication-ready output
- Speaker names need to be refined from context

> **⚠️ Data Privacy:** When LLM cleanup is enabled via `--model`, transcript
> excerpts are sent to external LLM providers (AWS Bedrock, Anthropic, or
> OpenAI depending on the model ID). Use `--skip-llm` or omit `--model` to
> keep all data local. For Bedrock, boto3 uses the standard AWS credential
> chain (IAM role, SSO, `~/.aws/credentials`, env vars).

```bash
# Chinese meeting with hotwords (local-only, no LLM)
python3 $SCRIPTS/transcribe.py meeting.wav \
    --lang zh --num-speakers 9 --hotwords hotwords.txt

# English meeting with speaker names
python3 $SCRIPTS/transcribe.py meeting.wav \
    --lang en --speakers "Alice,Bob,Carol,Dave"

# Auto-detect language (zh/en/ja/ko/yue)
python3 $SCRIPTS/transcribe.py meeting.wav \
    --lang auto --num-speakers 6

# Whisper for any language
python3 $SCRIPTS/transcribe.py meeting.wav \
    --lang whisper --num-speakers 4

# Enable LLM cleanup for polished output (requires --model)
# Bedrock (uses AWS credential chain: IAM role, SSO, ~/.aws/credentials)
python3 $SCRIPTS/transcribe.py meeting.wav \
    --lang zh --num-speakers 9 --hotwords hotwords.txt \
    --provider bedrock --model us.anthropic.claude-sonnet-4-6

# Bedrock "global" cross-region profile (recent AWS deployments)
python3 $SCRIPTS/transcribe.py meeting.wav \
    --provider bedrock --model global.anthropic.claude-sonnet-4-6

# Bedrock via litellm-style wrapper (supported; prefix is stripped for boto3)
python3 $SCRIPTS/transcribe.py meeting.wav \
    --provider bedrock --model amazon-bedrock/global.anthropic.claude-sonnet-4-6

# Anthropic API (requires ANTHROPIC_API_KEY env var)
python3 $SCRIPTS/transcribe.py meeting.wav \
    --provider anthropic --model claude-sonnet-4-6

# OpenAI-compatible API (requires OPENAI_API_KEY env var)
python3 $SCRIPTS/transcribe.py meeting.wav \
    --provider openai --model gpt-4o

# Full pipeline with all supporting files + LLM (best quality)
python3 $SCRIPTS/transcribe.py episode.m4a \
    --lang zh --num-speakers 2 \
    --hotwords hotwords.txt \
    --speakers "关羽,张飞" \
    --speaker-context speaker-context.json \
    --reference show-notes.md \
    --model us.anthropic.claude-sonnet-4-6

# Resume interrupted LLM cleanup
python3 $SCRIPTS/transcribe.py meeting.wav \
    --skip-transcribe --model us.anthropic.claude-sonnet-4-6
```

### 3. Verify Speaker Labels

If the transcript has swapped speaker labels (common with podcasts),
the verification script can detect and fix mismatches using LLM analysis:

```bash
# Dry-run: check if host/guest are swapped
python3 $SCRIPTS/verify_speakers.py podcast_raw_transcript.json \
    --speakers "关羽,张飞" \
    --speaker-context speaker-context.json

# Apply the fix
python3 $SCRIPTS/verify_speakers.py podcast_raw_transcript.json \
    --speakers "关羽,张飞" \
    --speaker-context speaker-context.json --fix

# Multi-speaker meeting: full reassignment
python3 $SCRIPTS/verify_speakers.py meeting_raw_transcript.json \
    --speakers "Alice,Bob,Carol,Dave" \
    --speaker-context speaker-context.json --fix

# Then regenerate the markdown with corrected labels
python3 $SCRIPTS/transcribe.py original.m4a \
    --skip-transcribe --clean-cache
```

The script analyzes the first 5 minutes (configurable with `--minutes`)
and auto-detects podcast (2 speakers, swap detection) vs meeting
(N speakers, full reassignment).

## Audio Preprocessing

The script automatically converts input audio to 16kHz mono FLAC and
validates that no audio is lost (detects silent truncation).

| Format | 4h14m meeting | Quality | Recommendation |
|--------|--------------|---------|----------------|
| **FLAC** | **219MB** | Lossless | **Default, safest** |
| Opus | 55MB | Lossy | Risk of truncation on long files |
| WAV | 465MB | Lossless | Works but larger |
| Original M4A | 173MB | Source | Also works directly |

**Do NOT split long recordings** — splitting breaks speaker ID consistency.

## MiMo-V2.5-ASR (optional, GPU-only)

`--lang mimo` runs Xiaomi's
[MiMo-V2.5-ASR](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) locally on a
CUDA GPU. Use it when:
- You want to evaluate MiMo against Paraformer on Chinese audio.
- The recording has heavy code-switching, dialects (Wu, Cantonese, Hokkien,
  Sichuanese), lyrics, or rare proper nouns that other presets mis-transcribe.

**Requirements:**
- CUDA ≥12.0 and **≥20 GB VRAM** (16 GB cards OOM during inference).
- Python 3.12 (enforced by `setup_env.sh`).
- ~20 GB weight download (one-time) and `flash-attn==2.7.4.post1` compile
  (needs `nvcc` from the CUDA toolkit, takes 10–30 min).

**Install (opt-in):**

```bash
# One-time: install MiMo on top of the standard environment
AUTO_YES=1 INSTALL_MIMO=1 \
    MIMO_WEIGHTS_PATH=/mnt/models/hf \
    bash $SCRIPTS/setup_env.sh
```

**Run:**

```bash
python3 $SCRIPTS/transcribe.py podcast.m4a \
    --lang mimo --num-speakers 2 \
    --mimo-weights-path /mnt/models/hf
```

**Resume after failure:**

```bash
python3 $SCRIPTS/transcribe.py podcast.m4a \
    --lang mimo --resume-mimo --mimo-weights-path /mnt/models/hf
```

**Limitations:**
- No hotword biasing (MiMo has no API for it — `--hotwords` is ignored).
- No CPU fallback.
- Inference is slower than Paraformer on the same GPU (8B model vs ~0.3B);
  expect RTF around 0.1–0.2 on an A100.

## Key Flags

| Flag | Purpose |
|------|---------|
| `--lang` | `zh` (default), `zh-basic`, `en`, `auto`, `whisper` |
| `--hotwords` | Hotword file or string — biases ASR (zh only) |
| `--reference F` | Reference file for LLM ASR correction |
| `--num-speakers N` | Expected speaker count (improves diarization) |
| `--speakers "A,B,C"` | Assign real names by first-appearance order |
| `--speaker-context F` | JSON with per-speaker roles for LLM |
| `--no-detect-gender` | Disable automatic speaker gender detection (CAM++ gender classifier) |
| `--speaker-genders "A:female,B:male"` | Override per-speaker gender (also accepts positional `female,male`) |
| `--audio-format` | `flac` (default), `opus`, `wav` |
| `--device cpu` | Force CPU mode |
| `--batch-size N` | Adjust for memory (60 for CPU, 100 if GPU OOM) |
| `--phase1-only` | Exit after Phase 1 (VAD + ASR + diarization), skip Phase 2 + 3 |
| `--json-out PATH` | Write raw transcript JSON to explicit path (overrides default naming) |
| `--skip-transcribe` | Resume from saved `*_raw_transcript.json` |
| `--skip-llm` | Skip LLM cleanup (default when `--model` is omitted) |
| `--model ID` | Enable LLM cleanup with this model (auto-detects Bedrock/Anthropic/OpenAI) |
| `--title "..."` | Output document title |
| `--clean-cache` | Delete LLM chunk cache after completion |
| `--output PATH` | Custom output file path |
| `--model-cache-dir` | ModelScope model cache directory (~3GB, default: `~/.cache/modelscope/`) |
| `--mimo-audio-tag` | MiMo language hint: `<chinese>` (default), `<english>`, `<auto>` |
| `--mimo-batch N` | Concurrent VAD segments per MiMo call (default 1; H100/80GB can go higher) |
| `--mimo-weights-path DIR` | Cache dir for MiMo weights (default: `$HF_HOME` → `~/.cache/huggingface`) |
| `--resume-mimo` | Resume MiMo Phase 1 from `*_mimo_partial.json` after a mid-run failure |

## Outputs

- `<stem>-transcript.md` — Final Markdown with speaker labels and timestamps
- `<stem>_raw_transcript.json` — Raw Phase 1 output (for resume/analysis)

## Speaker Diarization Tips

FunASR's CAM++ may merge acoustically similar speakers. To improve:

1. **`--num-speakers N`** — Hint expected count
2. **`--hotwords`** — Include participant names (Chinese names work best)
3. **`--speaker-context`** — Provide per-person keywords for LLM splitting
4. **Keyword matching** — Search `*_raw_transcript.json` for unique phrases

### Speaker gender

Enabled by default: each detected speaker is classified as `male` / `female`
via 3D-Speaker's CAM++ gender classifier (`iic/speech_campplus_two_class_gender_16k`).
The result appears next to each name in the **Speaker List** table and is
injected into the LLM cleanup prompt so pronouns (他/她, he/she) get corrected.

Precedence when combined:

1. `--speaker-genders "Alice:female,Bob:male"` (explicit CLI) — always wins
2. Reference text hints like `主播（女）：韩梅梅` or `Host (male): Alice` — override auto
3. CAM++ auto-detection — fallback

Disable with `--no-detect-gender` if you don't need gender and want to save
the ~500 MB model download and extra inference time.

## CPU-only / Low-Memory Machines

Long recordings on resource-constrained machines may hit exec timeouts
or OOM kills. See `references/pipeline-details.md` for workarounds:
- Detach from agent timeouts with `systemd-run` or `nohup`
- Prevent OOM via swap and/or `--lang zh-basic` (lighter model)

## Additional Resources

- **`references/pipeline-details.md`** — Architecture, model specs, benchmarks,
  speaker role verification, hotword effectiveness, clustering patch
- **`scripts/transcribe.py`** — Main transcription pipeline
- **`scripts/verify_speakers.py`** — Speaker label verification & fix
- **`scripts/llm_utils.py`** — Shared LLM infrastructure (Bedrock/Anthropic/OpenAI)
- **`scripts/setup_env.sh`** — Environment setup (venv + deps + patch)
Audio Transcribe

SKILL.md

related skills