AI Science Explainer Video Studio (for Mac mini 16GB)

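End-to-end automated production skill for AI science explainer videos. Integrates a digital human avatar (Google Flow / SadTalker), AI voice cloning (F5-TTS MLX), Pillow content slides, character-by-character karaoke subtitles (Pillow + FFmpeg), and professional-grade audio/video QA into an 8-stage automated pipeline. Covers: script planning → digital human generation → TTS voice cloning →...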
SkillRank score: 6.2 / 10 (evaluated by implexa, claude-haiku-4-5 · 2026-07-03)
ai-science-video-studio automates educational video production by chaining digital avatars (Google Flow or SadTalker), F5-TTS voice cloning, Pillow slide rendering, karaoke subtitles, and FFmpeg compositing across an 8-stage pipeline. It targets the Mac mini M-series with 16 GB RAM.
Sub-scores: structure 7.0 · trigger phrases 8.0 · procedure 6.0 · edge cases 5.0 · documentation 6.0
SKILL.md

---
name: ai-science-video-studio
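description: "End-to-end automated production skill for AI science explainer videos. Integrates a digital human avatar (Google Flow / SadTalker), AI voice cloning (F5-TTS MLX), Pillow content slides, character-by-character karaoke subtitles (Pillow + FFmpeg), and professional-grade audio/video QA into an 8-stage automated pipeline. Covers: script planning → digital human generation → TTS voice cloning → slide rendering → subtitle rendering → audio repair → final compositing (FFmpeg xfade + acrossfade + alimiter) → professional QA review. Trigger phrases: AI科普视频, 制作科普视频, 做个AI讲解视频, 生成科技短视频. **Recommended hardware: Mac mini M-series with 16GB RAM**, using Apple Silicon MLX acceleration for voice cloning and local rendering."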
agent_created: true
---

# AI Science Video Studio: Automated AI Science Explainer Video Production Skill

## Overview

Full pipeline for producing explainer / educational videos that combine a digital human
avatar (intro/outro) with animated content slides (body), voice-narrated by a cloned
personal voice (F5-TTS MLX), with karaoke-style subtitles throughout.

The pipeline follows an 8-stage workflow:

```
Script Planning → Digital Human → TTS Voice → Content Slides
    → Subtitles → Audio Repair → Final Compositing → QA Review
```

Default configuration is tuned for 1280×720 (16:9), 24fps, CRF 20 encoding, a single
presenter avatar, and Mandarin Chinese narration. All parameters are adjustable.

---

## When to Use This Skill

Trigger on any of the following intents:

- User asks to create an "AI科普" (AI science explainer) video
- User wants an educational/explainer video with digital human + slides format
- User mentions combining a talking avatar with content slides
- User needs the full pipeline: script → voice clone → slides → subtitles → compositing
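- User says "做一个讲解视频" (make an explainer video), "生成科普视频" (generate a science explainer video), or "制作AI讲解类视频" (produce an AI explainer-style video)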

Do NOT use this skill for:
- Pure short films / drama without educational content → use `ai-short-film-studio`
- Only SadTalker PiP compositing without slides → use `sadtalker-pip-compositing`
- Only Google Flow video generation → use `google-flow-automation`

---

## Pipeline Stages

### Stage 1: Script Planning

Create a `script.json` file defining the video structure with exactly 5 segments:

```json
{
  "intro":      { "type": "digital_human", "engine": "google_flow", "duration": 10, "narration": "开场旁白...", "flow_prompt": "..." },
  "content_1":  { "type": "slides",        "engine": "pillow",      "duration": 30, "narration": "正文第一段旁白..." },
  "content_2":  { "type": "slides",        "engine": "pillow",      "duration": 25, "narration": "正文第二段旁白..." },
  "content_3":  { "type": "slides",        "engine": "pillow",      "duration": 29, "narration": "正文第三段旁白..." },
  "outro":      { "type": "digital_human", "engine": "google_flow", "duration": 10, "narration": "结尾旁白...", "flow_prompt": "..." }
}
```

**Rules:**
- intro and outro use `digital_human` type (talking avatar)
- content segments use `slides` type (animated content screens)
- Each segment must specify: type, engine, duration (seconds), narration text
- Narration text should be ≤15 seconds worth of speech per segment (~60 Chinese characters)
- Duration field is the target video length (not TTS length — TTS naturally sets the pace)
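
The rules above can be checked mechanically before any rendering starts. The following is a minimal sketch, assuming the five segment keys shown in the example. The names `validate_script`, `SEGMENT_TYPES`, and `REQUIRED_FIELDS` are hypothetical and are not part of the bundled scripts.

```python
import json

# Assumed from the example above: five named segments and their expected types.
SEGMENT_TYPES = {
    "intro": "digital_human",
    "content_1": "slides",
    "content_2": "slides",
    "content_3": "slides",
    "outro": "digital_human",
}
REQUIRED_FIELDS = ("type", "engine", "duration", "narration")


def validate_script(script: dict) -> list[str]:
    """Return a list of readable problems; an empty list means the script passes."""
    problems = []
    for name, expected_type in SEGMENT_TYPES.items():
        segment = script.get(name)
        if segment is None:
            problems.append(f"missing segment: {name}")
            continue
        for field in REQUIRED_FIELDS:
            if field not in segment:
                problems.append(f"{name}: missing field '{field}'")
        if segment.get("type") != expected_type:
            problems.append(f"{name}: type should be '{expected_type}'")
        duration = segment.get("duration")
        if not isinstance(duration, (int, float)) or duration <= 0:
            problems.append(f"{name}: duration must be a positive number of seconds")
    for name in script:
        if name not in SEGMENT_TYPES:
            problems.append(f"unexpected segment: {name}")
    return problems


def load_and_validate(path: str) -> list[str]:
    """Read script.json from disk and validate it."""
    with open(path, encoding="utf-8") as fh:
        return validate_script(json.load(fh))
```

Calling `load_and_validate("script.json")` before Stage 2 surfaces a missing segment or a mistyped field before any slow generation step runs.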

For detailed script format specification, see `references/script_format.md`.

---

### Stage 2: Digital Human Generation (Intro & Outro)

Two approaches are available. Prefer Google Flow for standalone talking-head segments;
use SadTalker for picture-in-picture overlay on content slides.

#### Option A: Google Flow CDP Automation

Use the `google-flow-automation` skill to generate intro/outro videos:

1. Launch Chrome with remote debugging on port 9222
2. Navigate to labs.google/fx/tools/flow
3. Upload avatar reference image
4. Enter the Chinese prompt from script.json
5. Wait ~3-5 minutes for 10-second video generation
6. Download as `intro.mp4` and `outro.mp4`

**Key parameters:**
- Avatar: upload the user's preferred reference image (portrait photo)
- Prompt: in Chinese, describe the scene and delivery style
- Model: Omni Flash, 16:9 aspect ratio, 10s duration
- Account: the user's Google account credentials (handled by Chrome profile)
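
Steps 1 and 2 can be scripted as follows. This is only a sketch: the Chrome binary path is the usual macOS install location and is an assumption, as are the `https://` prefix on the Flow URL and the helper names `chrome_debug_command` and `launch_chrome`. The remaining steps (upload, prompt, download) are left to the `google-flow-automation` skill.

```python
import subprocess

# Assumed default install location of Chrome on macOS; adjust if it differs.
CHROME_BIN = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
FLOW_URL = "https://labs.google/fx/tools/flow"


def chrome_debug_command(profile_dir: str, port: int = 9222) -> list[str]:
    """Build the argument list that starts Chrome with a CDP endpoint on `port`."""
    return [
        CHROME_BIN,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",  # profile directory holding the Google login
        FLOW_URL,
    ]


def launch_chrome(profile_dir: str, port: int = 9222) -> subprocess.Popen:
    """Start Chrome in the background and return the process handle."""
    return subprocess.Popen(chrome_debug_command(profile_dir, port))
```

Passing an explicit `--user-data-dir` keeps the automation profile separate from everyday browsing; recent Chrome versions may also require a non-default profile directory before they expose the debugging port.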

#### Option B: SadTalker MPS (for PiP on content)

Use the `sadtalker-pip-compositing` skill when the digital human should appear as a
circular picture-in-picture overlay on content slides.

**Steps:**
1. Run `scripts/fix_sadtalker_numpy.py` for numpy 2.x compatibility
2. Extract avatar image + TTS audio
3. Run SadTalker 3-stage inference with `device='mps'`
4. Create circular mask (120×120) with PIL
5. FFmpeg overlay onto content at bottom-left corner

**PiP Parameters:**
| Parameter | Value |
|-----------|-------|
| Size | 120×120 (final) |
| Position | bottom-left, 20px margin |
| Mask | PIL circular, radius 60px |
| Overlay | `overlay=20:H-h-20:shortest=1` |
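
To see where the PiP actually lands, the overlay expression can be evaluated for concrete sizes. In FFmpeg's `overlay` filter, `H` is the main video height and `h` is the overlay height. The helper names below are hypothetical; the numbers come from the table above and the 1280×720 default.

```python
def pip_overlay_xy(frame_h: int, pip_h: int, margin: int = 20) -> tuple[int, int]:
    """Evaluate `overlay=20:H-h-20` for concrete sizes.

    The PiP sits `margin` pixels from the left edge and `margin` pixels
    above the bottom edge of the main video.
    """
    return margin, frame_h - pip_h - margin


def pip_overlay_filter(margin: int = 20) -> str:
    """Build the overlay filter string listed in the PiP parameter table."""
    return f"overlay={margin}:H-h-{margin}:shortest=1"


x, y = pip_overlay_xy(frame_h=720, pip_h=120)
print(x, y)                  # 20 580
print(pip_overlay_filter())  # overlay=20:H-h-20:shortest=1
```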

---

### Stage 3: Voice Generation (TTS)

#### Primary: F5-TTS MLX Voice Cloning

Use F5-TTS MLX on Apple Silicon for personal voice cloning:

```python
from f5_tts_mlx.generate import generate

# For content narration (MUST use estimate_duration=True!)
audio = generate(
    text="旁白文本...",
    ref_audio_path="/path/to/ref_voice.mp3",
    ref_audio_text="参考音频的文本内容",
    steps=64,
    cfg_strength=2.5,
    speed=1.0,
    estimate_duration=True,  # CRITICAL for Chinese!
)
```

**CRITICAL — `estimate_duration=True`:**
Without this parameter, F5-TTS generates extremely short audio for Chinese text
(0.5-0.9 seconds per sentence). With it, the model estimates a target duration and
generates audio of the appropriate length.

**Parameter table:**
| Parameter | Intro/Outro | Content |
|-----------|-------------|---------|
| steps | 64 | 64 |
| cfg_strength | 2.5 | 2.5 |
| speed | 0.45 | 1.0 |
| estimate_duration | No | **Yes** (critical!) |

**Post-processing:**
After generation, compute the actual-vs-target duration ratio and apply `atempo`
to fine-tune timing:

```bash
# Example: actual 11.98s, target 10.0s → atempo=1.198
ffmpeg -i generated.wav -filter:a "atempo=1.198" output.wav
```
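
The ratio itself is simple arithmetic: actual duration divided by target duration. A small helper, sketched here with the hypothetical name `atempo_filter`, also guards against ratios outside 0.5 to 2.0, which older FFmpeg builds do not accept in a single `atempo` instance, by chaining several instances.

```python
def atempo_filter(actual_s: float, target_s: float, lo: float = 0.5, hi: float = 2.0) -> str:
    """Return an FFmpeg audio filter string that rescales actual_s to target_s."""
    if actual_s <= 0 or target_s <= 0:
        raise ValueError("durations must be positive")
    ratio = actual_s / target_s  # >1 speeds up, <1 slows down
    factors = []
    # Peel off factors at the limit until the remainder is within range.
    while ratio > hi:
        factors.append(hi)
        ratio /= hi
    while ratio < lo:
        factors.append(lo)
        ratio /= lo
    factors.append(ratio)
    return ",".join(f"atempo={f:.3f}" for f in factors)


print(atempo_filter(11.98, 10.0))  # atempo=1.198
print(atempo_filter(30.0, 10.0))   # atempo=2.000,atempo=1.500
```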

#### Fallback: edge-tts

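When F5-TTS is unavailable or produces garbled output, fall back to edge-tts. It writes MP3 data by default, so save the result as `.mp3` and convert it to WAV:

```bash
edge-tts --voice zh-CN-YunxiNeural --text "旁白文本" --write-media output.mp3
ffmpeg -i output.mp3 -ar 48000 -ac 2 output.wav
```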

**Voice selection:**
| Purpose | Voice |
|---------|-------|
| Content narration (male) | zh-CN-YunxiNeural |
| Patch/correction (female) | zh-CN-XiaoxiaoNeural |

---

### Stage 4: Content Slide Rendering

Render animated content slides using Pillow frame-by-frame rendering + FFmpeg pipe.

Use `scripts/render_slides.py` as the template. The script should:

1. Accept narration text split into lines
2. Render each frame with progressively "typed" text (one new line per frame)
3. Use terminal/IDE aesthetic: dark background (#1a1a2e), green/white text, monospace font
4. Output 1280×720, 24fps PNG frames via FFmpeg pipe
5. Sync frame count to the TTS audio duration

**Key rendering parameters:**
- Resolution: 1280×720
- Frame rate: 24fps
- Background: dark (#1a1a2e or pure black for terminal look)
- Text: green (#00ff41) for code, white for explanatory text
- Font: SF Mono or Menlo for code sections; STHeiti for Chinese text

The script is at `scripts/render_slides.py`. Customize the content per video topic
while keeping the rendering engine intact.
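
Syncing the frame count to the narration (step 5 above) comes down to reading the audio duration and multiplying by the frame rate. The sketch below uses only the standard library and assumes the TTS output is a PCM WAV file; the function names are hypothetical and this is not the bundled `render_slides.py`.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from its frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def frame_count(duration_s: float, fps: int = 24) -> int:
    """Number of video frames needed to cover duration_s at the given frame rate."""
    return max(1, round(duration_s * fps))


print(frame_count(11.98))  # 288
```

Rendering exactly `frame_count(wav_duration_seconds(path))` frames keeps the slide segment and its narration the same length, so no padding or trimming is needed at compositing time.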

---

### Stage 5: Subtitle Rendering

Generate karaoke-style subtitles as transparent PNG frames overlaid on the final video.

The process:
```
Audio (.wav)
  → Whisper small/medium transcription
  → Word-level timestamps (segments + words)
  → Text correction mapping (fix Whisper mis-transcriptions)
  → Pillow frame-by-frame PNG rendering (transparent BG)
  → FFmpeg overlay onto video
```

Use `scripts/render_subtitles.py` as the rendering engine.

**Subtitle style specification (intro and outro MUST match):**
| Property | Value |
|----------|-------|
| Font | STHeiti Medium (macOS: `/System/Library/Fonts/STHeiti Medium.ttc`) |
| Size | 44px |
| Spoken text color | Orange (#FF6B2B) |
| Unspoken text color | White (#FFFFFF) |
| Outline | 2px black |
| Background bar | Semi-transparent black `rgba(0,0,0,160)` |
| Display mode | Per-sentence (each sentence appears and disappears independently) |
| Highlight mode | Word-by-word (karaoke-style progressive highlight) |

**Text correction mapping:**
Always maintain a correction dictionary to fix Whisper mis-transcriptions of
technical terms and proper names:

```python
corrections = {
    "材领": "才林",
    "Anthropy": "Anthropic",
    "Cloud Code": "Claude Code",
}
```
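
Applying the dictionary is a plain string replacement over the transcribed sentence. A minimal sketch follows; the function name `apply_corrections` and the sample sentence are illustrative only.

```python
corrections = {
    "材领": "才林",
    "Anthropy": "Anthropic",
    "Cloud Code": "Claude Code",
}


def apply_corrections(text: str, mapping: dict[str, str]) -> str:
    """Replace known mis-transcriptions in a sentence."""
    # Replace longer keys first so a short key cannot break up a longer match.
    for wrong in sorted(mapping, key=len, reverse=True):
        text = text.replace(wrong, mapping[wrong])
    return text


print(apply_corrections("Anthropy发布了Cloud Code", corrections))
# Anthropic发布了Claude Code
```

A mis-transcription can span several word-level tokens, so it is usually easier to correct the sentence text first and then re-align the corrected characters to the word timestamps.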

**CRITICAL — Consistency rule:** Intro and outro subtitles MUST use the exact
same rendering engine (Pillow) with identical style properties. Never mix Pillow
and ASS/other formats: FFmpeg builds on macOS may lack libass support, which makes ASS subtitle filters unavailable.

---

### Stage 6: Audio Repair

Common audio issues and their fixes. Run `scripts/audio_analyzer.py` for automated
detection before proceeding.

| Issue | Symptom | Root Cause | Fix |
|-------|---------|------------|-----|
| Right channel dropout | Crunching noise at specific timestamps | Source right channel flickers 20+ times | `channelmap=FL-FL\|FL-FR` (duplicate L→R) |
| Silence gaps | Sudden "click" in music | AI-generated BGM has gaps (100-400ms) | 250ms fade-out/in at each gap boundary |
| Audio truncation | Sound stops abruptly | Segment extracted from wrong time range | Use original source file, re-extract |
| Channel mismatch | Concat fails or silent segments | Mono vs stereo mismatch | Unify all to 48000Hz stereo |
| Clipping | Peak near 32767 (16-bit max) | Volume stacking at concatenation points | `alimiter` with `limit=-0.9dB` |
| TTS mispronunciation | Garbled Chinese characters | TTS engine multi-phoneme errors | Re-generate with F5-TTS or edge-tts patch |

**Channel fix command:**
```bash
ffmpeg -i input.wav -af "channelmap=map=FL-FL|FL-FR" -ar 48000 -ac 2 output.wav
```

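**Gap smoothing approach:**
For each audio gap >80ms detected by `scripts/audio_analyzer.py`, split the audio at the gap, fade out the part before it, fade in the part after it, and re-concatenate. Apply the two fades to the split parts rather than chaining them on the whole file, because `afade` silences everything after a fade-out and everything before a fade-in:
```bash
# Placeholders in seconds: FADE_OUT_START = GAP_START - 0.25, GAP_LEN = GAP_END - GAP_START
ffmpeg -i audio.wav -filter_complex "[0:a]atrim=end=GAP_START,afade=t=out:st=FADE_OUT_START:d=0.25[pre];[0:a]atrim=start=GAP_START,asetpts=PTS-STARTPTS,afade=t=in:st=GAP_LEN:d=0.25[post];[pre][post]concat=n=2:v=0:a=1" patched.wav
```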

---

### Stage 7: Final Compositing

Assemble all segments with professional transitions using FFmpeg.

**Compositing order:**
```
intro.mp4 → content_1.mp4 → content_2.mp4 → content_3.mp4 → outro.mp4
```

Use `scripts/compose_final.py` for automated assembly.

**Encoding parameters:**
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Codec | libx264 | Maximum compatibility |
| Resolution | 1280×720 | 16:9 standard |
| Frame rate | 24fps | Cinematic feel |
| Rate control | CRF 20 | High-quality unified encoding |
| Pixel format | yuv420p | Universal compatibility |
| Audio codec | AAC 192kbps | 48000Hz stereo |
| Limiter | alimiter limit=-0.9dB | Prevent clipping |
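
The table maps directly onto FFmpeg output options. The sketch below builds the argument list for re-encoding a single segment to these settings; `encode_args` and `encode_command` are hypothetical helpers, not the contents of `scripts/compose_final.py`.

```python
def encode_args(output_path: str) -> list[str]:
    """FFmpeg output options matching the encoding parameter table."""
    return [
        "-c:v", "libx264", "-crf", "20", "-pix_fmt", "yuv420p",
        "-r", "24", "-s", "1280x720",
        "-af", "alimiter=limit=-0.9dB",  # limiter runs before AAC encoding
        "-c:a", "aac", "-b:a", "192k", "-ar", "48000", "-ac", "2",
        output_path,
    ]


def encode_command(input_path: str, output_path: str) -> list[str]:
    """Full command for normalizing one segment; pass it to subprocess.run()."""
    return ["ffmpeg", "-y", "-i", input_path, *encode_args(output_path)]


print(" ".join(encode_command("content_1_raw.mp4", "content_1.mp4")))
```

Building the command as a list avoids shell quoting problems with file names that contain spaces or Chinese characters.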

**Transition effects:**
- Video: `xfade=transition=fade:duration=0.5:offset=<time>` (cross-fade, eliminates hard cuts)
- Audio: `acrossfade=d=0.5:curve=tri` (triangular cross-fade, smooth audio joins)

**Pre-compositing checklist:**
1. All segments re-encoded to CRF 20 (unified quality)
2. All audio normalized to 48000Hz stereo
3. Subtitle overlays applied to each segment
4. Transitions prepared: a 0.5s cross-fade at each segment boundary, with each `xfade` offset computed from the cumulative segment durations
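
Each `xfade` offset is measured on the already-merged stream, so every earlier cross-fade shortens the timeline by the fade duration. A short worked sketch, using the target durations from the Stage 1 example (real values should come from the rendered files); the function names are hypothetical.

```python
def xfade_offsets(durations: list[float], fade: float = 0.5) -> list[float]:
    """Offset in seconds for each xfade when chaining segments left to right."""
    offsets = []
    elapsed = 0.0
    for d in durations[:-1]:
        elapsed += d - fade  # each transition overlaps `fade` seconds
        offsets.append(round(elapsed, 3))
    return offsets


def total_duration(durations: list[float], fade: float = 0.5) -> float:
    """Length of the final video after all cross-fades."""
    return round(sum(durations) - fade * (len(durations) - 1), 3)


segments = [10, 30, 25, 29, 10]  # intro, content_1..3, outro
print(xfade_offsets(segments))   # [9.5, 39.0, 63.5, 92.0]
print(total_duration(segments))  # 102.0
```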

---

### Stage 8: Professional QA Review

Run a systematic quality review before delivering the final video.

**QA dimensions and inspection methods:**
| Dimension | Check | Method |
|-----------|-------|--------|
| Video transitions | Hard cuts at boundaries? | Extract transition zone, frame-by-frame review |
| Video encoding | Consistent bitrate across segments? | `ffprobe` bitrate check |
| Audio artifacts | Noise, pops, silence gaps? | Second-by-second mean/peak analysis |
| Audio joins | Smooth at concatenation points? | Acrossfade spectral analysis |
| Audio clipping | Peaks near 32767? | Peak detection (>32000 = danger) |
| Subtitle sync | Subtitles aligned with speech? | Whisper word-level timestamp verification |
| Subtitle consistency | Intro and outro styles match? | Visual comparison of 7 style properties |
| Pronunciation | Chinese pronunciation accurate? | Whisper transcription cross-validation |
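
The second-by-second mean/peak analysis can be sketched with the standard library alone. This is an illustration of the idea and not the bundled `scripts/audio_analyzer.py`: it assumes 16-bit PCM WAV input (extract the audio track from the MP4 with FFmpeg first), and the function names are hypothetical.

```python
import array
import sys
import wave


def per_second_levels(path: str) -> list[tuple[int, float]]:
    """Return (peak, mean_abs) for each one-second window of a 16-bit PCM WAV."""
    levels = []
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        frames_per_second = wav.getframerate()
        while True:
            raw = wav.readframes(frames_per_second)
            if not raw:
                break
            samples = array.array("h", raw)  # signed 16-bit samples, all channels
            if sys.byteorder == "big":
                samples.byteswap()  # WAV data is little-endian
            magnitudes = [abs(s) for s in samples]
            levels.append((max(magnitudes), sum(magnitudes) / len(magnitudes)))
    return levels


def clipping_seconds(levels: list[tuple[int, float]], danger: int = 32000) -> list[int]:
    """Indices of the seconds whose peak exceeds the danger threshold."""
    return [i for i, (peak, _) in enumerate(levels) if peak > danger]
```

Seconds with a mean near zero point to silence gaps, and seconds returned by `clipping_seconds` are candidates for another pass through `alimiter`.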

**Verification commands:**
```bash
# Audio per-second analysis
python3 scripts/audio_analyzer.py output.mp4

# Video quality check
ffprobe -v error -select_streams v:0 \
  -show_entries stream=codec_name,width,height,r_frame_rate,bit_rate \
  output.mp4

# Subtitle sync verification
whisper output.mp4 --model small --language zh
```

For the complete QA checklist, see `references/qa_checklist.md`.

---

## File Naming Convention

```
AI科普第{N}期_{主题}_v{N}.mp4
```

Example: `AI科普第一期_SadTalker画中画_v11.mp4`

**Intermediate files:**
| File | Purpose |
|------|---------|
| `content_video.mp4` / `content_with_pip_v{N}.mp4` | Content with optional PiP |
| `sadtalker_output.mp4` / `intro.mp4` / `outro.mp4` | Digital human outputs |
| `content_audio.wav` / `ref_audio_24k.wav` | Audio files |
| `subs_s{N}/frame_{N}.png` | Subtitle frames |
| `build_v{N}.py` / `build_v{N}_fixed.py` | Build scripts |
| `merge_final.sh` / `concat_v{N}.txt` | Merge scripts |

---

## Quality Targets

| Metric | Target |
|--------|--------|
| Video resolution | 1280×720 (16:9) |
| Frame rate | 24fps |
| Video bitrate | CRF 20 (~200-400 kbps) |
| Audio sample rate | 48000Hz stereo |
| Audio bitrate | AAC 192kbps |
| Audio peak | < -0.9dB (no clipping) |
| Segment transition | 0.5s xfade + acrossfade |
| Subtitle alignment | Whisper word-level timestamps |

---

## Critical Pitfalls

For the complete pitfalls reference, see `references/pitfalls.md`. Key highlights:

1. **F5-TTS Chinese too short**: Always set `estimate_duration=True` for Chinese content narration. Without it, audio is only 0.5-0.9s per sentence.

2. **Alpha channel compositing**: When using `alphamerge`, the human RGBA video is the color source (first input), and the circular mask PNG is the alpha (second input). Reversing them produces a white circle with no human visible.

3. **Concat format mismatch**: Different segments may have different sample rates (16000 vs 48000Hz) or channel counts (mono vs stereo). Unify all segments to 48000Hz stereo before concatenation.

4. **Subtitle rendering engine inconsistency**: Always use Pillow for both intro and outro subtitles. FFmpeg builds on macOS may lack libass, which makes ASS subtitle filters unavailable.

5. **AI-generated BGM gaps**: Google Flow's AI-generated background music may contain silence gaps (100-400ms). Smooth them with 250ms crossfades at each gap boundary.

---

## Bundled Resources

### Scripts

- `scripts/render_slides.py` — Pillow-based content slide frame renderer (1280×720, dark IDE theme, progressive text reveal)
- `scripts/render_subtitles.py` — Karaoke-style subtitle PNG renderer (word-by-word orange highlight, transparent BG, STHeiti 44px)
- `scripts/compose_final.py` — End-to-end FFmpeg compositing (xfade + acrossfade + alimiter + CRF20 unified encoding)
- `scripts/audio_analyzer.py` — Audio QA analysis tool (second-by-second mean/peak detection, gap finder, clipping detector)

### References

- `references/script_format.md` — Complete script.json format specification and examples
- `references/qa_checklist.md` — Detailed 8-dimension QA review checklist
- `references/pitfalls.md` — Comprehensive list of known pitfalls with root causes and fixes