Wjs Segmenting Video

Use when the user has a long-form video (interview / lecture / podcast / conversation) and a transcript SRT, and wants to extract 3–6 stand-alone topical sho...

SkillRank score: 8.3 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-06-29

extracts 3-6 stand-alone topical short clips from long-form video + srt transcript by human-judged segmentation, accurate-seek cutting, optional reframing, and srt slicing. hands off raw clips + metadata to /wjs-overlaying-video for covers and captions.

Category scores: structure 9.0 · trigger phrases 8.0 · procedure 9.0 · edge cases 8.0 · documentation 8.0


---
name: wjs-segmenting-video
description: Use when the user has a long-form video (interview / lecture / podcast / conversation) and a transcript SRT, and wants to extract 3–6 stand-alone topical short clips from it. This skill ONLY cuts and crops — it produces raw clips + per-clip SRTs as a hand-off package for downstream post-production (`/wjs-overlaying-video`). Triggers — "切成几段", "分主题", "拆成短视频", "切片", "topic segments", "split into clips".
---

# wjs-segmenting-video

Cut a long video + SRT into multiple stand-alone short clips, each
oriented for the target platform. **This skill stops after cutting +
cropping** — it hands off the raw clips to `/wjs-overlaying-video` for
covers, captions, illustrations, CTA, and final render.

## When to use

- Long-form video (≥10 min) with an existing SRT transcript.
- Goal is **stand-alone** short clips (each viewable without context).
- The user will (or you will) drive post-production separately in
  `/wjs-overlaying-video`.

## When NOT to use

- Single-topic trimming → just use `ffmpeg -ss A -to B`.
- No transcript yet → run **`/wjs-transcribing-audio`** first (then `/wjs-translating-subtitles` if the segments need a non-source language).
- Multicam editing → use **`/wjs-editing-multicam`**.
- Highlight reel with multiple cuts inside a single topic → that's
  editing, not segmentation.

## What this skill IS — and IS NOT

| Is | Is not |
|---|---|
| You (the agent) **read the full SRT and decide the topic boundaries** | A script that runs NLP topic modeling, silence detection, or "viral moment" scoring. Topic boundaries are semantic; competing tools (Descript, OpusClip, Riverside Magic Clips) all get this wrong by automating it. |
| `segment.py` cuts; `/wjs-reframing-video` reorients | An end-to-end "magic" pipeline |
| Accurate-seek cuts by default (re-encode) — clip starts EXACTLY at requested timestamp | Stream-copy cuts (those produce keyframe-snap drift up to GOP duration) |
| Hands off **raw cropped clips + per-clip SRTs** | Burned subtitles, covers, intros, CTAs (those live in `/wjs-overlaying-video`) |

## The pipeline

```
long video + SRT
   ↓     (agent reads SRT, decides topics — judgment, not parsing)
segments.json
   ↓     segment.py --reencode (accurate seek; clip starts exactly at requested t)
clip_NN.mp4 + frame_NN.jpg
   ↓     ASK: target platform orientation match source?
   ↓     /wjs-reframing-video on each clip (if 16:9 → 9:16, etc.)
   ↓     re-extract frames from cropped clips
clip_NN.mp4 (now in target orientation) + clip_NN.zh-CN.burn.srt
   ↓
HAND OFF → /wjs-overlaying-video
   (does covers + captions + illustrations + CTA + final render)
```

## Step 1 — Read SRT, write `segments.json`

**Don't outsource topic identification to a script.** For each candidate segment, judge:

- **Self-contained?** A cold viewer must understand it without prior context.
- **Single thread?** One central question / insight; if the speaker pivots mid-clip, that's two segments.
- **Length fits platform?** 60–180s for 视频号 / 30–60s for 抖音&Shorts. <30s feels truncated; >4min loses retention.
- **Hook + payoff?** Open on a claim / question / vivid image; close on a takeaway. Never end mid-sentence.
- **Snap to SRT cue boundaries** — never cut mid-word.

3–6 strong segments from a 10-minute source is normal. Drop boring middles. Quality > quantity.
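
The "snap to SRT cue boundaries" rule is mechanical once the topic boundary itself has been chosen by judgment. The sketch below is one way to do it, not part of `segment.py`: it reads cue times from SRT text and moves a rough start or end to the nearest cue edge. The function names (`parse_cue_times`, `snap_start`, `snap_end`, `to_stamp`) are hypothetical helpers introduced for illustration.

```python
import re

# Matches an SRT timing line such as "00:00:43,460 --> 00:00:47,120".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_ms(h, m, s, ms):
    """Convert hour/minute/second/millisecond fields to integer milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_cue_times(srt_text):
    """Return [(start_ms, end_ms), ...] for every cue found in the SRT text."""
    cues = []
    for match in TIMING.finditer(srt_text):
        fields = match.groups()
        cues.append((to_ms(*fields[:4]), to_ms(*fields[4:])))
    return cues


def snap_start(rough_ms, cues):
    """Move a rough segment start to the nearest cue start time."""
    return min((start for start, _ in cues), key=lambda s: abs(s - rough_ms))


def snap_end(rough_ms, cues):
    """Move a rough segment end to the nearest cue end time."""
    return min((end for _, end in cues), key=lambda e: abs(e - rough_ms))


def to_stamp(total_ms):
    """Format milliseconds as HH:MM:SS.mmm, the form used in segments.json."""
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"
```

Working in integer milliseconds avoids floating-point rounding when the snapped value is written back into `segments.json`.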

Schema (full spec in `references/segments_schema.json`, example in `references/example_segments.json`):

```json
{
  "source_video": "input.mp4",
  "source_srt": "input.zh-CN.srt",
  "platform": "wechat_channels",
  "segments": [{
    "id": 1, "slug": "intent-not-code",
    "title": "AI 时代不是写代码\n而是写意图",
    "summary": "Two-sentence pitch — what's the insight, what's at stake.",
    "start": "00:00:43.460", "end": "00:02:35.220",
    "cover_prompt": "Visual concept for gpt-image-2 (style anchor, not literal scene)"
  }]
}
```

`slug` = kebab-case English (used in filenames). `title` uses `\n` for line break, 2 lines max, 8–12 Chinese chars per line. `cover_prompt` is consumed downstream by `/wjs-overlaying-video`'s cover-generation step — keep it written here so the overlay skill can pick it up without re-asking.
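
Before cutting, the field rules above can be checked with a few lines of Python. This is a minimal sketch, not a replacement for validating against `references/segments_schema.json`. `check_segment` and `LENGTH_RANGE_S` are hypothetical names, and the length ranges are copied from the Step 1 guidance. The 8–12 character count per title line is left to human review here, because mixed-script titles make automated counting ambiguous.

```python
import re

SLUG = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")        # kebab-case
STAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3})$")  # HH:MM:SS.mmm

# Length guidance from Step 1; the platform keys other than
# "wechat_channels" are assumed identifiers.
LENGTH_RANGE_S = {
    "wechat_channels": (60, 180),
    "douyin": (30, 60),
    "youtube_shorts": (30, 60),
}


def to_ms(stamp):
    """Parse HH:MM:SS.mmm into integer milliseconds."""
    match = STAMP.match(stamp)
    if not match:
        raise ValueError(f"not an HH:MM:SS.mmm timestamp: {stamp!r}")
    h, m, s, ms = map(int, match.groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def check_segment(seg, platform):
    """Return a list of problems for one segment; an empty list means it passed."""
    problems = []
    if not SLUG.match(seg["slug"]):
        problems.append("slug is not kebab-case")
    if len(seg["title"].split("\n")) > 2:
        problems.append("title has more than 2 lines")
    duration_s = (to_ms(seg["end"]) - to_ms(seg["start"])) / 1000
    if duration_s <= 0:
        problems.append("end is not after start")
    elif platform in LENGTH_RANGE_S:
        low, high = LENGTH_RANGE_S[platform]
        if not low <= duration_s <= high:
            problems.append(f"duration {duration_s:.1f}s outside {low}-{high}s")
    return problems
```

Applied to the example segment above (43.460s to 155.220s on `wechat_channels`), the function returns an empty list.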

## Step 2 — Accurate-seek cut

```bash
python3 ~/.claude/skills/wjs-segmenting-video/scripts/segment.py \
    --segments segments.json --out output/ --reencode
```

`--reencode` is the **default recommended mode**. It cuts with
`ffmpeg -ss N -i src -c:v libx264 -c:a aac` so the output starts
EXACTLY at the requested timestamp. ~30s per clip on CPU. Also extracts
a midpoint frame per segment to `output/frame_NN_slug.jpg`.

**Why default to `--reencode` and not stream-copy:**

Stream-copy via `ffmpeg -ss N -c copy` seeks to the nearest keyframe
*before* N (it can't re-encode). The output's t=0 then maps to source
t=keyframe, so the clip plays a fraction of a second of "lead-in"
content before the requested speech. Captions sliced from the master
SRT at boundary N appear **AHEAD of the audio** by exactly that GOP
fraction — listeners feel "subtitles lead the voice."

In practice on H.264 source with GOP=2s: every clip is off by
0.6–1.5s. Looks like a synchronization bug downstream; it's actually
a cut-time bug upstream.
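
The size of the drift is simple arithmetic: requested start minus the last keyframe at or before it. The sketch below works that out for a hypothetical source with a keyframe every 2 seconds. `stream_copy_lead_in` is an illustrative helper, not something `segment.py` exposes.

```python
from bisect import bisect_right


def stream_copy_lead_in(requested_start, keyframes):
    """Seconds of lead-in when a stream-copy cut snaps back to the previous keyframe.

    keyframes must be a sorted list of keyframe times in seconds.
    """
    i = bisect_right(keyframes, requested_start)
    if i == 0:
        raise ValueError("no keyframe at or before the requested start")
    return requested_start - keyframes[i - 1]


# Hypothetical source: one keyframe every 2 seconds (GOP = 2s).
keyframes = [k * 2.0 for k in range(100)]

# A cut requested at 43.46s snaps back to the keyframe at 42.0s.
print(round(stream_copy_lead_in(43.46, keyframes), 3))  # → 1.46
```

Captions shifted to start at 0 would lead the voice by that same 1.46s in the stream-copied clip.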

### Stream-copy variant (only if you control the source encode)

If the source has been re-encoded with `-force_key_frames` at every
requested cut boundary, stream-copy IS accurate. Workflow:

```bash
# Build the comma-separated keyframe list from segments.json
KF=$(python3 -c "import json; s=json.load(open('segments.json'))
ts=[]
for seg in s['segments']:
    ts += [seg['start'], seg['end']]
print(','.join(ts))")

# Re-encode master once, forcing keyframes at all segment boundaries
ffmpeg -i master.mp4 \
  -c:v libx264 -preset medium -crf 18 \
  -force_key_frames "$KF" \
  -c:a copy master_kf.mp4

# Now stream-copy cuts land exactly:
python3 segment.py --segments segments.json --source master_kf.mp4 --out output/
```

Use this only when iterating on segment boundaries (you'll re-cut the
same source many times). For one-shot work, `--reencode` is simpler
and just as correct.

### Diagnosing keyframe-snap on already-cut clips

```bash
ffprobe -v error -select_streams v:0 -read_intervals "$((N-2))%$((N+5))" \
  -show_entries packet=pts_time,flags -of csv=p=0 master.mp4 | grep "K_"
```

Output like `360.023,K__   362.023,K__` → GOP=2s. A `-c copy` cut at
361.000 actually starts at 360.023, captions are 0.977s ahead of audio.
The retroactive fix is a per-clip SRT offset shim
(`requested_start − nearest_preceding_keyframe`) added to every cue's
start/end, but the root fix is to re-cut with `--reencode`.
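
For clips that cannot be re-cut, the offset shim can be sketched as follows. This assumes a standard SRT with `HH:MM:SS,mmm` timestamps; `shift_srt` is a hypothetical helper, and the offset is the `requested_start − nearest_preceding_keyframe` value in milliseconds (977 for the example above).

```python
import re

STAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def shift_srt(srt_text, offset_ms):
    """Add offset_ms to every timestamp on SRT timing lines (those containing '-->')."""

    def bump(match):
        h, m, s, ms = map(int, match.groups())
        # Clamp at zero so a negative offset never yields a negative timestamp.
        total = max(0, ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms)
        h, rem = divmod(total, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    shifted = []
    for line in srt_text.splitlines():
        # Only timing lines are rewritten; cue text that happens to
        # contain a timestamp-like string is left alone.
        shifted.append(STAMP.sub(bump, line) if "-->" in line else line)
    return "\n".join(shifted)
```

A positive offset delays the captions so they line up with the lead-in that the stream-copy cut introduced.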

## Step 3 — Orientation check (ask before continuing)

Compare source video aspect ratio to the target platform:

| Platform                  | Native orientation | Aspect |
|---------------------------|--------------------|--------|
| 视频号 (WeChat Channels)  | vertical           | 9:16   |
| 抖音 / TikTok / Reels     | vertical           | 9:16   |
| 小红书 (Xiaohongshu video)| vertical           | 9:16   |
| YouTube Shorts            | vertical           | 9:16   |
| YouTube (regular)         | horizontal         | 16:9   |
| B站 (Bilibili)            | horizontal         | 16:9   |

Probe with `ffprobe`:

```bash
ffprobe -v error -select_streams v:0 \
  -show_entries stream=width,height -of csv=p=0 clip_01_*.mp4
```

If source aspect already matches the platform → **skip this step**.

If mismatch → **ASK THE USER** before converting. Sample phrasing:

> 源视频是横屏 (1920×1080)，平台 视频号 需要竖屏 (9:16)。是否对每段
> 调用 `/wjs-reframing-video` 转成竖屏？(crop 会用 MediaPipe 跟踪正在说话
> 的人的脸，保持说话人始终在画面中)

**Never silently skip the check** — finding out at upload time that
your horizontal clip needs to be vertical is a frustrating failure
mode the skill exists to prevent.
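
The comparison itself is small enough to sketch. The mapping below copies the platform table; apart from `wechat_channels`, the dictionary keys are assumed identifiers, and `needs_reframing` is a hypothetical helper. It only answers whether to ask the user. It does not trigger any conversion.

```python
# Native orientation per platform, copied from the table above.
# Keys other than "wechat_channels" are assumed identifiers.
NATIVE_ORIENTATION = {
    "wechat_channels": "vertical",
    "douyin": "vertical",
    "tiktok": "vertical",
    "xiaohongshu": "vertical",
    "youtube_shorts": "vertical",
    "youtube": "horizontal",
    "bilibili": "horizontal",
}


def orientation(width, height):
    """Classify a frame size; square frames are treated as horizontal here."""
    return "vertical" if height > width else "horizontal"


def needs_reframing(width, height, platform):
    """True when the probed clip orientation differs from the platform's native one."""
    return orientation(width, height) != NATIVE_ORIENTATION[platform]
```

Feed it the `width,height` pair printed by the `ffprobe` command above. A `True` result is the cue to ask the user, not to start cropping.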

### Calling `/wjs-reframing-video`

The crop script needs `mediapipe + opencv + numpy` in a Python 3.12
venv (mediapipe doesn't ship wheels for 3.14+). One-time setup:

```bash
# --seed installs pip into the venv; a plain `uv venv` leaves it out
uv venv --seed --python 3.12 /tmp/_crop_venv
/tmp/_crop_venv/bin/python -m pip install mediapipe opencv-python numpy
```

Per-clip invocation:

```bash
for n in 01 02 03 04 05; do
  slug=$(ls clip_${n}_*.mp4 | grep -v -E "_intro|_burned|_vert" | head -1 | sed -E "s/clip_${n}_(.+)\.mp4/\1/")
  /tmp/_crop_venv/bin/python ~/.claude/skills/wjs-reframing-video/scripts/crop.py \
    "clip_${n}_${slug}.mp4" \
    --out "clip_${n}_${slug}_vert.mp4" \
    --target portrait \
    --bitrate 8M    # 视频号 caps at 10Mbps
done
```

After cropping, **swap the cropped versions to canonical names** so
downstream pipelines find them:

```bash
mkdir -p _horizontal_archive
for n in 01 02 03 04 05; do
  base=$(ls clip_${n}_*_vert.mp4 | sed -E "s/_vert\.mp4$//")
  mv "${base}.mp4" "_horizontal_archive/"
  mv "${base}_vert.mp4" "${base}.mp4"
  # Re-extract midpoint frame:
  mid=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "${base}.mp4" | awk '{print $1/2}')
  slug=$(echo "$base" | sed -E "s/^clip_${n}_//")
  ffmpeg -hide_banner -loglevel error -ss "$mid" -i "${base}.mp4" \
    -frames:v 1 -q:v 3 "frame_${n}_${slug}.jpg" -y
done
```

**Sanity check**: face-on-screen detection rate in the crop log can
read low (e.g. `face#0: 9.6s on screen (9%)`) when speakers sit
further than ~2 m from the camera. That number being low is OK — the
active-speaker hysteresis + fallback-to-largest-face still produces
well-centered crops. **Verify visually** by extracting a midpoint
frame and confirming the speaker is centered before committing.

## Step 4 — Slice per-clip SRTs

```bash
python3 ~/.claude/skills/wjs-segmenting-video/scripts/burn_subs.py \
    --segments segments.json --out output/ --no-burn
```

The `--no-burn` flag emits per-clip SRTs (`clip_NN_slug.zh-CN.burn.srt`)
with timestamps already shifted to start at 0 — exactly the input
`/wjs-overlaying-video` captions expect (its compositions start the
body at t=cover_duration, not the master clock).

Despite the legacy name `burn_subs.py`, this step does NOT burn pixels
in `--no-burn` mode — it's just an SRT slicer. (The burn-pixels mode
exists for the legacy "Path A" workflow but is deprecated in favor of
`/wjs-overlaying-video`'s HTML/CSS caption rendering.)
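
To make the slicing concrete, here is a minimal sketch of what an SRT slicer of this kind does. It is not the actual `burn_subs.py` implementation. `slice_srt` is a hypothetical function, and dropping cues that straddle a segment boundary is an assumption made for the sketch.

```python
import re

TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def _fmt(total_ms):
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def slice_srt(srt_text, start_ms, end_ms):
    """Keep cues lying fully inside [start_ms, end_ms], shifted so the clip starts at 0."""
    kept = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        idx = next((i for i, line in enumerate(lines) if TIMING.search(line)), None)
        if idx is None:
            continue  # not a cue block
        fields = TIMING.search(lines[idx]).groups()
        cue_start, cue_end = _ms(*fields[:4]), _ms(*fields[4:])
        if cue_start < start_ms or cue_end > end_ms:
            continue  # assumption: cues straddling a boundary are dropped
        text = "\n".join(lines[idx + 1:])
        number = len(kept) + 1  # renumber cues from 1 within the clip
        kept.append(
            f"{number}\n{_fmt(cue_start - start_ms)} --> {_fmt(cue_end - start_ms)}\n{text}"
        )
    return "\n\n".join(kept) + "\n"
```

Because Step 1 snaps segment boundaries to cue edges, the first kept cue normally lands at `00:00:00,000`.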

## Hand-off package — what to deliver to `/wjs-overlaying-video`

After Steps 1–4, deliver EXACTLY these per-segment artifacts:

```
output/
  clip_NN_slug.mp4                  # raw cropped clip (target orientation, no subs, no cover)
  clip_NN_slug.zh-CN.burn.srt       # per-clip SRT, timestamps shifted to start at 0
  frame_NN_slug.jpg                 # midpoint frame (cover reference)
  segments.json                     # for slug/title/summary/cover_prompt metadata
```

Then invoke `/wjs-overlaying-video` to add covers, captions, illustrations,
CTA, and produce the upload-ready MP4 per clip. The overlay skill
generates ONE final composition per clip and renders it in a single
encode (no cascade of re-encodes).
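
A quick completeness check before invoking the overlay skill could look like the sketch below. `missing_artifacts` is a hypothetical helper. It assumes `NN` is the zero-padded segment `id` and that the SRT suffix is `zh-CN.burn.srt`, as in the listing above.

```python
from pathlib import Path


def missing_artifacts(out_dir, segments):
    """Return the hand-off files that do not exist yet, as a list of path strings."""
    out = Path(out_dir)
    expected = [out / "segments.json"]
    for seg in segments:
        # Assumption: NN in the filenames is the zero-padded segment id.
        stem = f"{seg['id']:02d}_{seg['slug']}"
        expected += [
            out / f"clip_{stem}.mp4",
            out / f"clip_{stem}.zh-CN.burn.srt",
            out / f"frame_{stem}.jpg",
        ]
    return [str(path) for path in expected if not path.exists()]
```

An empty list means every artifact in the package is present. Anything else names exactly what still has to be produced.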

## Quick reference

| Task | Command |
|---|---|
| Cut clips (accurate, default) | `segment.py --segments S.json --out output/ --reencode` |
| Probe source aspect | `ffprobe -v error -select_streams v:0 -show_entries stream=width,height -of csv=p=0 IN.mp4` |
| Convert orientation (ask first) | invoke `/wjs-reframing-video` per clip |
| Slice per-clip SRTs | `burn_subs.py --segments S.json --out output/ --no-burn` |
| Diagnose keyframe positions | `ffprobe -v error -select_streams v:0 -read_intervals A%B -show_entries packet=pts_time,flags -of csv=p=0 src.mp4 \| grep K_` |

## Common mistakes

- **Cutting mid-sentence** — always snap to SRT cue boundaries.
- **Trying to use 100% of the video** — 3–6 strong clips from 10 min is normal. Boring middle = drop.
- **Letting the LLM write the title** — the title is judgment, not summary. Review and rewrite before passing to make_cover.
- **Stream-copy without `-force_key_frames` preprocessing** — produces clips whose captions run ahead of the audio by up to 1 GOP. Use `--reencode` (default) unless the source was specifically prepared.
- **Skipping the orientation check** — getting a horizontal podcast on 视频号 and finding out at upload time is preventable. Probe aspect and ask the user before cropping.
- **Burning subs / generating covers in THIS skill** — those moved to `/wjs-overlaying-video`. This skill stops after Step 4.

## Integration with other skills

- **`/wjs-transcribing-audio`** — produce the source SRT first if missing. The word-level Whisper output (or Volcano/豆包 ASR output) is preferred for accurate cue timing. If the segments need translating, chain into **`/wjs-translating-subtitles`**.
- **`/wjs-reframing-video`** — call in Step 3 when source orientation doesn't match target platform. Face-tracked active-speaker following keeps the talker in frame.
- **`/wjs-editing-multicam`** — if the source is multi-cam, render the synced single MP4 first, then segment.
- **`/wjs-overlaying-video`** — the **default downstream** for everything after Step 4. Covers, captions, illustrations, CTA, and final render all happen there. Don't add post-production in this skill.

## Files & references

- `scripts/segment.py` — accurate-seek + stream-copy cutting
- `scripts/burn_subs.py` — SRT slicer (`--no-burn` mode); legacy libass burn-in mode is deprecated in favor of `/wjs-overlaying-video`
- `references/segments_schema.json` — JSON Schema for segments.json
- `references/example_segments.json` — worked example


wjs-segmenting-video

Cut a long video + SRT into multiple stand-alone short clips, each oriented for the target platform. This skill stops after cutting + cropping. It hands off raw clips to /wjs-overlaying-video for covers, captions, illustrations, CTA, and final render.

intent

Use this skill when you have a long-form video (≥10 min) with an existing SRT transcript and need to extract 3–6 stand-alone topical short clips. Each clip must be viewable without prior context and fit the target platform (60–180s for WeChat Channels, 30–60s for TikTok/Reels). The skill reads the transcript, identifies topic boundaries (human judgment, not NLP), cuts clips with accurate timestamp seeking, checks orientation against platform requirements, slices per-clip SRTs, and delivers raw artifacts for post-production. Do NOT use this for single-topic trimming (use ffmpeg directly), multi-camera editing (use /wjs-editing-multicam), or when no transcript exists yet (run /wjs-transcribing-audio first).

inputs

Source artifacts:

Video file (MP4, MOV, or other ffmpeg-readable format). H.264 codec strongly preferred; validate with ffprobe if source is unusual.
SRT subtitle file matching the source language (typically source-language transcript, not translation). Word-level timing preferred (from Whisper or Volcano/豆包 ASR) for accurate cue boundary snapping.

Parameters (from user or inferred):

Target platform (one of: wechat_channels, douyin, tiktok, xiaohongshu, youtube_shorts, youtube, bilibili). Determines native aspect ratio and length constraints.
Desired number of segments (typically 3–6 from a 10-minute source).

External tool dependencies:

ffmpeg + ffprobe (for cutting, probing, frame extraction). Install via apt install ffmpeg or brew install ffmpeg.
python3 (3.10+) with json, subprocess modules (standard library).
Optional: /wjs-reframing-video skill if source orientation doesn't match target platform. Requires Python 3.12 venv with mediapipe, opencv-python, numpy.
Optional: /wjs-translating-subtitles skill if segments need non-source-language SRTs.

File system setup:

Script location: ~/.claude/skills/wjs-segmenting-video/scripts/segment.py and burn_subs.py.
Reference schemas: ~/.claude/skills/wjs-segmenting-video/references/segments_schema.json and example_segments.json.
Working directory with write permissions for output/ subfolder.

No external APIs or authentication required. All work is local ffmpeg + Python.

procedure

Step 1: Read transcript, identify topic boundaries, write segments.json

Input: Source SRT file (full transcript with word-level or sentence-level cues).

Process:

Open the SRT in a text editor or parse it programmatically. Read the entire transcript to understand narrative flow, speaker pivots, and natural breaks.
For each candidate segment, apply these filters:
- Self-contained? A viewer who enters at t=start must understand the core idea without prior context. If the segment relies on setup from earlier, it's not yet a boundary.
- Single thread? One central question, insight, or argument. If the speaker shifts topics mid-candidate (e.g., pivots from "how to code in Python" to "hiring engineers"), split it into two segments.
- Platform-appropriate length? WeChat Channels: 60–180s. TikTok/Reels/Douyin: 30–60s. <30s feels truncated; >4min loses retention. (YouTube allows longer, Bilibili has soft caps around 3min for algorithm boost.)
- Hook + payoff? Segment opens on a question, claim, or vivid image. Closes on a takeaway or cliffhanger (not mid-sentence). Viewers feel the segment is "complete" even if they don't watch the full long-form.
- Snap to SRT cue boundaries? Never cut mid-word. Start boundary = the start time of an SRT cue. End boundary = the end time of an SRT cue.
Aim for 3–6 strong segments from a 10-minute source. Drop boring middles. Quality over quantity.
For each approved segment, extract:
- id (integer, 1-indexed)
- slug (kebab-case English, used in filenames; e.g., "ai-not-coding")
- title (8–12 Chinese characters per line, max 2 lines, separated by \n; e.g., "AI 时代不是写代码\n而是写意图")
- summary (2-sentence pitch: what's the insight, what's at stake)
- start / end (HH:MM:SS.mmm format, snapped to SRT cue boundaries)
- cover_prompt (visual concept for downstream cover generation; style anchor not literal scene description; e.g., "close-up of hands typing, neon cyan lighting, focused expression")

Write JSON file segments.json with schema:

{
  "source_video": "input.mp4",
  "source_srt": "input.zh-CN.srt",
  "platform": "wechat_channels",
  "segments": [
    {
      "id": 1,
      "slug": "intent-not-code",
      "title": "AI 时代不是写代码\n而是写意图",
      "summary": "Two-sentence summary here.",
      "start": "00:00:43.460",
      "end": "00:02:35.220",
      "cover_prompt": "Visual concept for cover generation"
    }
  ]
}

Output: segments.json in working directory. Validate against references/segments_schema.json.

Step 2: Accurate-seek cut with ffmpeg

Input: segments.json, source MP4, working output directory.

Process:

Run the segmentation script in --reencode mode (default):

python3 ~/.claude/skills/wjs-segmenting-video/scripts/segment.py \
    --segments segments.json --out output/ --reencode

Script uses ffmpeg -ss <start> -i source.mp4 -c:v libx264 -c:a aac ... for each segment. Re-encoding ensures output clip starts EXACTLY at requested timestamp (not snapped to nearest keyframe). Processing time: ~30s per clip on typical CPU.
Script automatically extracts midpoint frame per segment: ffmpeg -ss <midpoint> -i clip.mp4 -frames:v 1 -q:v 3 frame_NN_slug.jpg.
Monitor ffmpeg stderr for encoding errors (insufficient disk space, codec unavailable, input corruption). If a clip fails, examine the SRT cue boundaries and retry with slightly adjusted start/end times.

Why --reencode is the default (not stream-copy): Stream-copy (ffmpeg -ss N -c copy) seeks to the nearest keyframe before N. Output t=0 then maps to the source keyframe time, so the clip opens with 0.6–1.5s of lead-in content on an H.264 source with GOP=2s. Captions sliced from the master SRT at boundary N then appear ahead of the audio by that keyframe gap, which looks like a sync bug. --reencode re-encodes the slice so that clip t=0 is the requested timestamp and the captions stay in sync.

Alternative: Stream-copy with pre-encoded keyframes (only if iterating on same source many times):

Extract keyframe list from segments.json:

KF=$(python3 -c "import json; s=json.load(open('segments.json')); ts=[]; \
[ts.extend([seg['start'], seg['end']]) for seg in s['segments']]; \
print(','.join(ts))")

Re-encode master once with forced keyframes:

ffmpeg -i master.mp4 -c:v libx264 -preset medium -crf 18 \
  -force_key_frames "$KF" -c:a copy master_kf.mp4

Then stream-copy cuts land exactly at boundaries. Skip this for one-shot work; --reencode is simpler and equally correct.

Diagnostic: Checking keyframe positions on existing source:

ffprobe -v error -select_streams v:0 -read_intervals "$((N-2))%$((N+5))" \
  -show_entries packet=pts_time,flags -of csv=p=0 master.mp4 | grep "K_"

Output like 360.023,K__ 362.023,K__ indicates GOP=2s. A -c copy cut at 361.000 starts at 360.023, captions lead audio by 0.977s. Retroactive fix: offset per-clip SRT timestamps, but root fix is re-cut with --reencode.

Output: Per segment, in output/ folder:

clip_NN_slug.mp4 (raw cut clip, still in source orientation, no subtitles, no cover)
frame_NN_slug.jpg (midpoint frame for cover reference)

Step 3: Orientation check and adaptive cropping

Input: Newly cut clips from Step 2, target platform from segments.json.

Process:

Probe source aspect ratio:

ffprobe -v error -select_streams v:0 \
  -show_entries stream=width,height -of csv=p=0 clip_01_*.mp4

Output: e.g., 1920,1080 (16:9 landscape) or 1080,1920 (9:16 portrait).

Map to platform native aspect:
- WeChat Channels (视频号), TikTok/Douyin, Xiaohongshu video, YouTube Shorts: 9:16 portrait
- YouTube main, Bilibili: 16:9 landscape
- Instagram Reels: 9:16 portrait (or square)
Decision: If source aspect matches target, skip cropping. If mismatch, ASK the user before proceeding:

源视频是横屏 (1920×1080)，平台视频号需要竖屏 (9:16)。是否对每段调用 /wjs-reframing-video 转成竖屏？(crop 会用 MediaPipe 跟踪正在说话的人的脸，保持说话人始终在画面中)

Never silently skip the check. Finding out at upload time that a horizontal podcast needs portrait is a frustrating failure mode.
If user approves: Set up crop venv (one-time):
```
# --seed installs pip into the venv; a plain `uv venv` leaves it out
uv venv --seed --python 3.12 /tmp/_crop_venv
/tmp/_crop_venv/bin/python -m pip install mediapipe opencv-python numpy
```
(If uv not available, use python3.12 -m venv /tmp/_crop_venv && /tmp/_crop_venv/bin/pip install ...)

Call /wjs-reframing-video per clip:

for n in 01 02 03 04 05; do
  slug=$(ls clip_${n}_*.mp4 | grep -v -E "_intro|_burned|_vert" | head -1 | sed -E "s/clip_${n}_(.+)\.mp4/\1/")
  /tmp/_crop_venv/bin/python ~/.claude/skills/wjs-reframing-video/scripts/crop.py \
    "clip_${n}_${slug}.mp4" \
    --out "clip_${n}_${slug}_vert.mp4" \
    --target portrait \
    --bitrate 8M    # WeChat Channels caps at 10Mbps
done

Swap cropped clips to canonical filenames for downstream:

mkdir -p _horizontal_archive
for n in 01 02 03 04 05; do
  base=$(ls clip_${n}_*_vert.mp4 | sed -E "s/_vert\.mp4$//")
  [ -f "${base}.mp4" ] && mv "${base}.mp4" "_horizontal_archive/"
  [ -f "${base}_vert.mp4" ] && mv "${base}_vert.mp4" "${base}.mp4"
  # Re-extract midpoint frame from cropped clip:
  mid=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "${base}.mp4" | awk '{print $1/2}')
  slug=$(echo "$base" | sed -E "s/^clip_${n}_//")
  ffmpeg -hide_banner -loglevel error -ss "$mid" -i "${base}.mp4" \
    -frames:v 1 -q:v 3 "frame_${n}_${slug}.jpg" -y
done

Sanity check: Examine crop.py logs. Face-on-screen detection can read low (e.g., "face#0: 9.6s on screen (9%)") when speakers sit >2m from camera. Low detection ≠ bad crop; active-speaker hysteresis + largest-face fallback still produce well-centered output. Verify visually: extract and inspect a midpoint frame, confirm speaker is centered before committing.

Output: Cropped clips now in target orientation. Swapped into canonical filenames (clip_NN_slug.mp4). New midpoint frames extracted.

Step 4: Slice per-clip SRTs

Input: segments.json, source SRT file, target output directory.

Process:

Run the SRT slicer:

python3 ~/.claude/skills/wjs-segmenting-video/scripts/burn_subs.py \
    --segments segments.json --out output/ --no-burn

Script does NOT burn pixels (despite legacy name). --no-burn flag emits per-clip SRTs with timestamps shifted to start at 0. This is the exact input /wjs-overlaying-video expects: overlay compositions begin their body at t=cover_duration, not the master clock.
Per segment, script extracts SRT cues from source SRT whose timestamps fall within [segment.start, segment.end), re-timestamps them to [0, segment.end−segment.start), and writes to clip_NN_slug.zh-CN.burn.srt.
Verify output: spot-check one per-clip SRT. First cue should start at 00:00:00,000 (or shortly after). Last cue should end before segment duration. No cue should start mid-word (all should align to source SRT cue boundaries).
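
The spot-check above can be partly automated. The sketch below is a hypothetical `check_clip_srt` helper, not part of the skill's scripts. The 2-second tolerance for "shortly after" is an assumed threshold, and the clip duration is expected in milliseconds (for example from `ffprobe`).

```python
import re

TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def check_clip_srt(srt_text, clip_duration_ms, max_first_gap_ms=2000):
    """Return a list of problems found in a per-clip SRT; empty means it looks sane."""
    cues = []
    for match in TIMING.finditer(srt_text):
        fields = match.groups()
        cues.append((_ms(*fields[:4]), _ms(*fields[4:])))
    if not cues:
        return ["no cues found"]
    problems = []
    # max_first_gap_ms is an assumed tolerance for "shortly after 0".
    if cues[0][0] > max_first_gap_ms:
        problems.append("first cue starts late; timestamps may not be shifted to 0")
    if cues[-1][1] > clip_duration_ms:
        problems.append("last cue ends after the clip does")
    if any(later[0] < earlier[0] for earlier, later in zip(cues, cues[1:])):
        problems.append("cue start times are not in ascending order")
    return problems
```

Whether cues align to word boundaries still needs a human listen. This only catches timing mistakes such as an SRT that was never shifted.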

Output: Per segment, in output/ folder:

clip_NN_slug.zh-CN.burn.srt (per-clip SRT, timestamps shifted to start at 0)

decision points

DP1: Source SRT exists?

If yes, proceed to Step 1.
If no, chain into /wjs-transcribing-audio first. Word-level timing from Whisper or Volcano/豆包 ASR is preferred. If segments need non-source-language captions, chain into /wjs-translating-subtitles after transcription.

DP2: Source aspect matches target platform? (Evaluated in Step 3)

If yes, skip orientation cropping. Proceed directly to Step 4.
If no, ask user before calling /wjs-reframing-video. Only proceed if user approves.

DP3: User has control over source encoding (iterating many times on same source)? (In Step 2, choosing cut strategy)

If yes, run stream-copy mode with pre-encoded forced keyframes (see Step 2 alternative). Faster iteration, still accurate.
If no (one-shot work), use --reencode (default). Simpler, same quality.

DP4: Crop confidence low or speaker off-center in extracted frame? (In Step 3, after cropping)

If detected face-on-screen duration is <20% or visual inspection shows speaker off-frame, consider:
- Adjusting crop.py --target-margin parameter (default 0.1; increase to 0.2 for more padding).
- Re-running crop on that clip.
- Or accepting the result and flagging for manual review in /wjs-overlaying-video (where human can swap in a manually cropped version if needed).

DP5: Post-production scope creep? (At end of Step 4, before handoff)

If user asks for covers, captions, illustrations, CTA, or color grading IN THIS SKILL, redirect: "Those live in /wjs-overlaying-video. This skill stops after Step 4. I'll hand off the raw clips + SRTs; the overlay skill does the rest."

output contract

After Steps 1–4, deliver EXACTLY these artifacts in output/ directory:

Per segment:

clip_NN_slug.mp4: raw cropped clip (target orientation, no burned subtitles, no cover, no intro). Bitrate and codec chosen by segment.py (H.264 + AAC, quality tuned to platform). Verify: duration matches segment.end − segment.start within 100ms. Audio in sync with visual motion.
clip_NN_slug.zh-CN.burn.srt: per-clip SRT. Timestamps shifted so first cue ≥ 00:00:00
