Audio Content Generator

Generate audiobooks, podcasts, or educational audio content on demand. User provides an idea or topic, Claude AI writes a script, and ElevenLabs converts it to high-quality audio. Supports multiple formats (audiobook, podcast, educational), custom lengths, and voice effects. Use when asked to create audio content, make a podcast, generate an audiobook, or produce educational audio. Returns MP3 audio file via MEDIA token.

SkillRank score: 7.3 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

audio-gen orchestrates claude script generation with elevenlabs text-to-speech to produce audiobooks, podcasts, and educational content in mp3 format. supports three distinct formats with format-specific structure templates, target length negotiation, and user feedback loops before audio synthesis.

structure: 9.0
trigger phrases: 8.0
procedure: 8.0
edge cases: 6.0
documentation: 7.0

original SKILL.md from clawhub

---
name: audio-gen
description: Generate audiobooks, podcasts, or educational audio content on demand. User provides an idea or topic, Claude AI writes a script, and ElevenLabs converts it to high-quality audio. Supports multiple formats (audiobook, podcast, educational), custom lengths, and voice effects. Use when asked to create audio content, make a podcast, generate an audiobook, or produce educational audio. Returns MP3 audio file via MEDIA token.
homepage: https://github.com/clawdbot/clawdbot
metadata: {"clawdbot":{"emoji":"🎙️","requires":{"skills":["sag"],"env":["ANTHROPIC_API_KEY","ELEVENLABS_API_KEY"]},"primaryEnv":"ANTHROPIC_API_KEY"}}
---

# 🎙️ Audio Content Generator

Generate high-quality audiobooks, podcasts, or educational audio content on demand using AI-written scripts and ElevenLabs text-to-speech.

## Quick Start

**Create an audiobook chapter:**
```
User: "Create a 5-minute audiobook chapter about a dragon discovering friendship"
```

**Generate a podcast:**
```
User: "Make a 10-minute podcast about the history of coffee"
```

**Produce educational content:**
```
User: "Generate a 15-minute educational audio explaining how neural networks work"
```

## Content Formats

### Audiobook
**Style:** Narrative storytelling with emotional depth
- Clear beginning, middle, and end
- Descriptive language and vivid imagery
- Dramatic pacing with thoughtful pauses
- Emotional tone that matches the story
- Use voice effects like `[whispers]`, `[excited]`, `[serious]` for impact

**Example Structure:**
```
[Opening hook - set the scene]
[long pause]

[Story development with character emotions]
[short pause] between sentences
[long pause] between paragraphs

[Climax with dramatic tension]
[long pause]

[Resolution and emotional closure]
```

### Podcast
**Style:** Conversational and engaging
- Warm, welcoming intro (15-30 seconds)
- Main content with natural flow
- Transitions between topics
- Memorable outro with key takeaways
- Conversational tone throughout

**Example Structure:**
```
**Intro:** "Welcome to [topic]. I'm excited to share..."
[short pause]

**Main Content:** "Let's start with... [topic 1]"
[long pause] between segments

**Outro:** "Thanks for listening! Remember..."
```

### Educational Content
**Style:** Clear explanations for learning
- Simple introductions to complex topics
- Step-by-step breakdowns
- Real-world examples and analogies
- Recap of key concepts at the end
- Enthusiastic delivery with `[excited]` for important points

**Example Structure:**
```
**Introduction:** What is [topic] and why it matters?

**Main Content:**
- Concept 1: Explanation + Example
- Concept 2: Explanation + Example
- Concept 3: Explanation + Example

**Summary:** Key takeaways and next steps
```

## Length Guidelines

**Word Count to Duration Conversion:**
- 5 minutes = ~375 words
- 10 minutes = ~750 words
- 15 minutes = ~1,125 words
- 20 minutes = ~1,500 words
- 30 minutes = ~2,250 words

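**Pacing:** This skill budgets ~75 words per minute. Typical conversational speech is faster (commonly cited as roughly 150 words per minute), so treat the durations above as estimates and check the length of the generated audio.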

**Practical Limits:**
- Minimum: 2 minutes (~150 words)
- Maximum: 30 minutes (~2,250 words)
- Sweet spot: 5-15 minutes for best engagement
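
The practical limits above can be checked before any script is written. The following is a minimal sketch, not part of the skill itself; the function name `check_length` and its return labels are hypothetical.

```python
MIN_MINUTES = 2   # practical minimum from the guidelines above
MAX_MINUTES = 30  # practical maximum per audio file


def check_length(target_minutes: float) -> str:
    """Classify a requested length as 'too_short', 'ok', or 'too_long'."""
    if target_minutes < MIN_MINUTES:
        return "too_short"
    if target_minutes > MAX_MINUTES:
        return "too_long"
    return "ok"
```

A caller could map `too_long` to the "break it into episodes" suggestion and `too_short` to a warning that the content will be very brief.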

## Workflow Instructions

### Step 1: Understand the Request

Parse the user's request for:
1. **Content type** (audiobook, podcast, educational, or inferred from topic)
2. **Topic/theme** (what should the content be about)
3. **Target length** (how many minutes)
4. **Tone/style** (dramatic, casual, educational, etc.)
5. **Special requests** (specific voice, emphasis on certain points)

### Step 2: Calculate Word Count

```
target_words = target_minutes × 75
```

Example: 10 minutes = 10 × 75 = 750 words
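
As a rough Python sketch of the same arithmetic (the helper names `target_words` and `estimated_minutes` are illustrative, not part of the skill):

```python
WORDS_PER_MINUTE = 75  # pacing figure used by this skill


def target_words(target_minutes: float) -> int:
    """Word budget for a requested duration."""
    return round(target_minutes * WORDS_PER_MINUTE)


def estimated_minutes(script: str) -> float:
    """Estimate duration from a whitespace word count.

    Simplification: bracketed tags such as [short pause] are counted
    as ordinary words here.
    """
    return len(script.split()) / WORDS_PER_MINUTE
```

The inverse helper is useful in later steps, where the script's actual word count is compared against the target.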

### Step 3: Generate the Script

Write the complete script following these rules:

**Content Guidelines:**
- Start strong with an engaging hook
- Maintain natural, conversational flow
- Use active voice and simple sentence structure
- Include relevant examples and stories
- End with a satisfying conclusion

**Formatting Rules:**
- Add `[short pause]` after sentences (use sparingly, not every sentence)
- Add `[long pause]` between paragraphs or major sections
- Use voice effects strategically: `[whispers]`, `[shouts]`, `[excited]`, `[serious]`, `[sarcastic]`, `[sings]`, `[laughs]`
- Write numbers as words: "twenty-three" not "23"
- Spell out acronyms first time: "AI, or artificial intelligence"
- Avoid complex punctuation (em-dashes work, but semicolons don't read well)
- Remove markdown formatting before TTS conversion

### Step 4: Present the Script

Show the script to the user and ask:
```
Here's the [format] script I've created (approximately [length] minutes):

[Display the script]

Would you like me to:
1. Generate the audio now
2. Make changes to the script
3. Adjust the length or tone
```

### Step 5: Handle User Feedback

If user requests changes:
- Regenerate the script with adjustments
- Maintain the target word count
- Present the revised version

If user approves:
- Proceed to audio generation

### Step 6: Generate Audio

**Format the script for TTS:**
1. Remove any remaining markdown (headers, bold, italics)
2. Ensure voice effects are in proper `[effect]` format
3. Check that pauses are appropriately placed
4. Verify numbers and acronyms are spelled out
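
The cleanup checklist above could be partly automated. The sketch below is one possible approach under simple assumptions: it handles only `#` headers, `**bold**`, `*italics*`, and inline backticks, leaves `[effect]` tags untouched, and merely reports leftover digits rather than converting them to words. The function names are hypothetical.

```python
import re


def clean_for_tts(script: str) -> str:
    """Strip common markdown while leaving [effect] tags untouched."""
    # Headers: leading '#' characters at the start of a line.
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", script, flags=re.MULTILINE)
    # Bold first, so the italics pattern only sees single asterisks.
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    text = re.sub(r"(?<!\*)\*([^*\n]+)\*(?!\*)", r"\1", text)
    # Inline code spans.
    text = re.sub(r"`([^`]*)`", r"\1", text)
    return text.strip()


def find_digits(script: str) -> list[str]:
    """Return numerals that still need to be written out as words."""
    return re.findall(r"\d+(?:[.,]\d+)*", script)
```

If `find_digits` returns anything, the script still contains numerals that should be spelled out before audio generation.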

**Invoke the TTS script:**

**IMPORTANT:** The `ELEVENLABS_API_KEY` environment variable is already configured in the system. Simply invoke the TTS script directly.

```bash
uv run /home/clawdbot/clawdbot/skills/sag/scripts/tts.py \
  -o /tmp/audio-gen-[timestamp]-[topic-slug].mp3 \
  -m eleven_multilingual_v2 \
  "[formatted_script]"
```

**For long scripts, use heredoc:**
```bash
uv run /home/clawdbot/clawdbot/skills/sag/scripts/tts.py \
  -o /tmp/audio-gen-[timestamp]-[topic-slug].mp3 \
  -m eleven_multilingual_v2 \
  "$(cat <<'EOF'
[formatted_script]
EOF
)"
```
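
When the call is made from Python rather than a shell, passing the script as a single argument-list element sidesteps quoting issues altogether. This is a sketch under assumptions: it reuses the script path and the `-o` / `-m` flags exactly as they appear in the commands above, and the helper names `build_tts_command` and `run_tts` are hypothetical.

```python
import subprocess

# Path copied from the commands above; adjust if the SAG skill lives elsewhere.
TTS_SCRIPT = "/home/clawdbot/clawdbot/skills/sag/scripts/tts.py"


def build_tts_command(script_text: str, output_path: str,
                      model: str = "eleven_multilingual_v2") -> list[str]:
    """Build the argument list; no shell is involved, so no escaping is needed."""
    return ["uv", "run", TTS_SCRIPT, "-o", output_path, "-m", model, script_text]


def run_tts(command: list[str]) -> str:
    """Run the TTS script and return the output path on success."""
    # check=True raises subprocess.CalledProcessError on a non-zero exit code.
    subprocess.run(command, check=True)
    return command[command.index("-o") + 1]
```

Because the text travels as one list element, quotes and apostrophes in the script reach the TTS script unchanged.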

**Return the result:**
```
MEDIA:/tmp/audio-gen-[timestamp]-[topic-slug].mp3

Your [format] is ready! [Brief description of content]. Duration: approximately [X] minutes.
```

## Voice Effects (Bracketed Tags)

Available voice modulation effects (use sparingly for impact):

- `[whispers]` - Soft, intimate delivery
- `[shouts]` - Loud, emphatic delivery
- `[excited]` - Enthusiastic, energetic tone
- `[serious]` - Grave, solemn tone
- `[sarcastic]` - Ironic, mocking tone
- `[sings]` - Musical, melodic delivery
- `[laughs]` - Amused, jovial tone
- `[short pause]` - Brief silence (~0.5s)
- `[long pause]` - Extended silence (~1-2s)

**Best Practices:**
- Use effects for emotional moments, not every sentence
- Pauses are your most powerful tool for pacing
- Voice effects work best in audiobooks and dramatic content
- Keep podcasts and educational content mostly natural

## Error Handling

### Script Too Long
If the generated script exceeds target by >20%:
```
The script I generated is [X] words ([Y] minutes), which is longer than your target of [Z] minutes. Would you like me to:
1. Condense it to fit the target length
2. Split it into multiple parts
3. Keep it as is
```

### Script Too Short
If the generated script is under target by >20%:
```
The script is [X] words ([Y] minutes), shorter than your target. Would you like me to:
1. Expand it with more detail
2. Add additional examples or stories
3. Generate as is
```
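
Both length checks reduce to one comparison against the 20% tolerance. A minimal sketch, assuming a whitespace word count and the 75 words-per-minute figure used throughout this document (the function name `length_status` is illustrative):

```python
def length_status(script: str, target_minutes: float,
                  wpm: int = 75, tolerance: float = 0.20) -> str:
    """Return 'too_long', 'too_short', or 'ok' relative to the word target."""
    target = target_minutes * wpm
    actual = len(script.split())
    deviation = (actual - target) / target
    if deviation > tolerance:
        return "too_long"
    if deviation < -tolerance:
        return "too_short"
    return "ok"
```

Any result other than `ok` would trigger the matching prompt above before audio generation proceeds.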

### TTS Generation Fails
If the TTS script fails:
```
I've created the script, but I'm unable to generate the audio right now. Here's your script:

[Display script]

Error: [specific error message]

You can:
1. Check that ELEVENLABS_API_KEY is configured
2. Use the script with your own text-to-speech tool
3. Try again in a moment
4. Ask me to troubleshoot the audio generation
```

**Common TTS Issues:**
- API key not set: Verify ELEVENLABS_API_KEY in config
- Rate limit: Wait a moment and try again
- Text too long: Break into smaller chunks (max ~5000 characters)
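
Chunking matters in practice because a script near the 2,250-word maximum will generally exceed 5,000 characters. The sketch below shows one way to split at sentence boundaries; it is an illustration rather than the skill's actual behavior, and `chunk_script` is a hypothetical name. It joins sentences with single spaces, so paragraph breaks are not preserved, and stitching the resulting audio files together is out of scope here.

```python
import re


def chunk_script(script: str, max_chars: int = 5000) -> list[str]:
    """Split a script into chunks of at most max_chars, breaking after sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Last resort: hard-split a single sentence longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to the TTS script as a separate call.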

### Invalid Request
For unrealistic requests (e.g., "100-hour audiobook"):
```
That length would require [X] words and take significant time to generate. I recommend:
- Breaking it into multiple episodes/chapters
- Targeting 5-30 minutes per audio file
- Creating a series instead of one long file
```

## Tips for Best Results

### For Engaging Audiobooks
- Focus on character emotions and sensory details
- Use pauses to build dramatic tension
- Vary sentence length for rhythm
- Include internal monologue and reflection

### For Compelling Podcasts
- Start with a question or surprising fact
- Use conversational phrases: "You know what's interesting..."
- Include relatable examples from everyday life
- End with actionable takeaways

### For Effective Educational Content
- Use the "explain like I'm five" approach
- Build from simple to complex concepts
- Repeat key terms and definitions
- Provide multiple examples for clarity

## Technical Notes

**TTS Implementation:**
- Uses Python script: `~/.clawdbot/clawdbot/skills/sag/scripts/tts.py`
- No binary installation required (pure Python + requests)
- Directly calls ElevenLabs API
- Compatible with Linux and macOS

**File Storage:**
- Audio files are saved to `/tmp/`
- Filename format: `audio-gen-[timestamp]-[topic-slug].mp3`
- Files are automatically cleaned up after 24 hours
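
One way to build a filename in this format is sketched below. The slug rule (lowercase, alphanumeric words joined by hyphens, capped at four words) is an assumption for illustration, as are the helper names; the path prefix matches the `MEDIA:` paths shown in the examples.

```python
import re
import time
from typing import Optional


def topic_slug(topic: str, max_words: int = 4) -> str:
    """Lowercase the topic and join its first few words with hyphens."""
    words = re.findall(r"[a-z0-9]+", topic.lower())
    return "-".join(words[:max_words])


def output_path(topic: str, timestamp: Optional[int] = None) -> str:
    """Build /tmp/audio-gen-[timestamp]-[topic-slug].mp3 (Unix-time timestamp)."""
    ts = int(time.time()) if timestamp is None else timestamp
    return f"/tmp/audio-gen-{ts}-{topic_slug(topic)}.mp3"
```

Passing an explicit `timestamp` keeps the result reproducible, which is convenient for testing.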

**API Requirements:**
- Anthropic API for script generation (already configured)
- ElevenLabs API for text-to-speech (configured via ELEVENLABS_API_KEY)
- Both services must be configured and have available credits

**Supported Models:**
- `eleven_multilingual_v2` - Best quality (default)
- `eleven_turbo_v2` - Faster generation
- `eleven_turbo_v2_5` - Fastest generation
- `eleven_multilingual_v1` - Legacy model

**Cost Estimate:**
- 10-minute audio (~750 words): approximately $1.43
  - Claude API: ~$0.075
  - ElevenLabs: ~$1.35
- Longer content scales proportionally
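
To illustrate "scales proportionally", the sketch below extrapolates linearly from the 750-word figures above. Those figures are the document's own estimates; actual pricing depends on the plan and model in use, and the function name `estimate_cost` is hypothetical.

```python
BASELINE_WORDS = 750                 # the 10-minute example above
CLAUDE_COST_AT_BASELINE = 0.075      # USD, from the estimate above
ELEVENLABS_COST_AT_BASELINE = 1.35   # USD, from the estimate above


def estimate_cost(word_count: int) -> dict[str, float]:
    """Linearly scale the baseline estimate to another word count."""
    scale = word_count / BASELINE_WORDS
    claude = CLAUDE_COST_AT_BASELINE * scale
    elevenlabs = ELEVENLABS_COST_AT_BASELINE * scale
    return {
        "claude": round(claude, 3),
        "elevenlabs": round(elevenlabs, 3),
        "total": round(claude + elevenlabs, 3),
    }
```

For example, a 20-minute script (~1,500 words) doubles every component of the 10-minute estimate.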

**Generation Time:**
- Script generation: 5-30 seconds (depending on length)
- Audio generation: 5-15 seconds (ElevenLabs processing)
- Total: Usually under 1 minute for 10-minute audio

## Limitations

1. **Maximum Length:** 30 minutes (~2,250 words) per audio file
   - For longer content, create multiple parts/episodes

2. **Single Voice:** Currently supports one narrator voice
   - Cannot do multi-voice dialogue or character voices

3. **No Background Music:** Pure voice narration only
   - No background music, sound effects, or audio mixing

4. **Real-time Generation:** Each request generates fresh content
   - No pre-made templates or cached audio

5. **Language:** Primarily English
   - ElevenLabs supports other languages, but content generation optimized for English

## Example Conversations

### Example 1: Quick Audiobook
```
User: Create a 5-minute audiobook chapter about a lonely robot learning to feel emotions

Agent: I'll create a 5-minute sci-fi audiobook chapter. Let me write the story...

[Generates ~400-word narrative with emotional arc]

Here's the story I've created about Unit-7, a maintenance robot who discovers the meaning of friendship through an unexpected encounter. Would you like me to generate the audio or make changes first?

User: Generate the audio

Agent: [Invokes SAG with formatted script]

MEDIA:/tmp/audio-gen-1234567890-lonely-robot.mp3

Your audiobook chapter is ready! The story follows Unit-7's emotional awakening through a touching encounter with a young engineer. Duration: approximately 5 minutes.
```

### Example 2: Podcast with Revisions
```
User: Make a 10-minute podcast about sustainable living tips

Agent: [Generates podcast script with intro, 5 practical tips, outro]

Here's a 10-minute podcast with 5 actionable sustainable living tips. The tone is encouraging and practical. Would you like me to:
1. Generate audio now
2. Make it more urgent/climate-focused
3. Add more specific examples
4. Adjust the length

User: Add more specific examples for each tip

Agent: [Regenerates with detailed examples]

Better? I've added specific examples like calculating your carbon footprint, composting at home, and choosing sustainable brands. Ready for audio?

User: Perfect!

Agent: MEDIA:/tmp/audio-gen-1234567890-sustainable-living.mp3

Your sustainable living podcast is ready with 5 practical tips, each with concrete examples you can start using today! Duration: 10 minutes.
```

### Example 3: Educational Content
```
User: Create a 15-minute educational audio explaining quantum computing for beginners

Agent: I'll create a beginner-friendly 15-minute explanation of quantum computing...

[Generates ~1,100 words with analogies, examples, and clear structure]

Script preview:
- Introduction: What is quantum computing? (2 min)
- Classical vs Quantum: The key differences (4 min)
- Qubits and Superposition: How it works (5 min)
- Real Applications: What it means for you (3 min)
- Conclusion: The future of computing (1 min)

Should I generate the audio?

User: Yes please

Agent: MEDIA:/tmp/audio-gen-1234567890-quantum-computing.mp3

Your educational audio on quantum computing is ready! It uses everyday analogies to explain qubits, superposition, and real-world applications. Duration: approximately 15 minutes.
```

## When to Use This Skill

Use the audio-gen skill when users request:
- "Create an audiobook about..."
- "Make a podcast episode about..."
- "Generate audio content explaining..."
- "Record a story about..."
- "Produce an audio guide for..."
- "Create a narrated version of..."
- "Make an audio lesson about..."

**Do NOT use for:**
- Transcribing existing audio (use whisper skills instead)
- Converting user-written text to audio (use SAG skill directly)
- Real-time voice interaction (use voice-call plugin)
- Multi-speaker dialogues or interviews (current limitation)

## Advanced Usage

### Voice Selection
Users can request specific voices:
```
User: Create a podcast with the Rachel voice

Agent: [Checks voice configuration for "Rachel" voice ID and uses it in SAG command]
```

### Series/Episodes
For multi-part content:
```
User: Create a 3-episode series about space exploration

Agent: I'll create Episode 1 first. Each episode will be 10 minutes. Let's start with "The Dawn of the Space Age"...

[After completion]

Episode 1 is ready! Would you like me to continue with Episode 2?
```

### Format Blending
Mix formats for unique styles:
```
User: Create an educational podcast that tells a story

Agent: [Generates content that combines storytelling narrative with educational explanations]
```

## Troubleshooting

**Issue:** Audio sounds robotic or unnatural
**Solution:** Add more pauses and voice effects. Use contractions and conversational language.

**Issue:** Script doesn't match requested length
**Solution:** Regenerate with explicit word count target. Check calculations (75 words/min).

**Issue:** Content is too technical or too simple
**Solution:** Ask user for target audience. Adjust complexity accordingly.

**Issue:** SAG command fails
**Solution:** Check ELEVENLABS_API_KEY is set. Verify SAG skill is installed and working.

**Issue:** User wants to edit the script manually
**Solution:** Provide the plain text script. User can modify it and paste back for audio generation.

---

💡 **Pro Tip:** Always generate the script first and get user approval before creating audio. This saves time and API costs, and ensures the user gets exactly what they want.


separated into intent, inputs (with env vars and external connections documented), procedure (8 detailed steps with explicit inputs/outputs), decision points (10 explicit if-else branches for ambiguity and error cases), output contract (success and error formats defined), and outcome signal (success and failure criteria). preserved original script logic, templates, and examples. added edge cases like rate limits, text length validation, and heredoc handling for long scripts.

🎙️ Audio Content Generator

Generate high-quality audiobooks, podcasts, or educational audio content on demand using AI-written scripts and ElevenLabs text-to-speech.

intent

Use this skill when a user asks you to create audio content from scratch. you take their topic or idea, generate a script tailored to their requested format (audiobook, podcast, or educational), then convert it to MP3 via ElevenLabs. this is for on-demand audio production, not transcription or conversion of existing text. the user provides the concept, you handle script writing and audio generation end-to-end.

inputs

required environment variables:

ANTHROPIC_API_KEY: Claude API key for script generation. set in system config.
ELEVENLABS_API_KEY: ElevenLabs API key for text-to-speech. set in system config.

required from user:

content type (audiobook, podcast, educational, or inferred from topic)
topic or theme (what the audio should be about)
target length in minutes (2-30 minute range)
optional: tone/style (dramatic, casual, urgent, etc.)
optional: special requests (specific voice, emphasis areas, etc.)

external connections:

Claude API (Anthropic): script generation via standard REST calls. no special setup beyond API key.
ElevenLabs API: text-to-speech synthesis. uses eleven_multilingual_v2 by default. api key must have available credits.
Python TTS script: ~/.clawdbot/clawdbot/skills/sag/scripts/tts.py. called via uv run.

assumed dependencies:

SAG skill must be installed (handles direct TTS invocation).
uv package manager available on system.
/tmp/ directory writable for temporary audio storage.

procedure

step 1: parse the user request

extract the following from their message:

content type (audiobook, podcast, educational, or infer from context). if ambiguous, ask.
topic or theme (what should the audio be about).
target length (how many minutes). if not stated, suggest 10 minutes as default.
tone/style (dramatic, casual, educational, etc.). infer from context if not explicit.
special requests (specific voice, emphasis on certain points, multi-part series, etc.).

output: structured request summary, ready for script generation.

step 2: calculate target word count

use formula: target_words = target_minutes * 75

example: 10 minutes target = 10 * 75 = 750 words.

record this for script validation in later steps.

output: word count target logged.

step 3: generate the script

call Claude to write the complete script. provide:

content type and structure template (audiobook narrative, podcast intro/body/outro, educational step-by-step).
target word count (from step 2).
tone and style guidelines.
formatting rules below.

formatting rules for script:

start strong with an engaging hook.
use active voice and simple sentence structure.
add [short pause] after major statements (use sparingly).
add [long pause] between paragraphs or section breaks.
use voice effects strategically: [whispers], [excited], [serious], [shouts], [sarcastic], [sings], [laughs]. use only for emotional moments, not every sentence.
write all numbers as words ("twenty-three" not "23").
spell out acronyms on first mention ("AI, or artificial intelligence").
avoid em-dashes. use commas, periods, colons, parens, or hyphens only.
no markdown formatting (headers, bold, italics) in final script. remove before TTS.

output: complete script text, approximately matching target word count (within 20% tolerance).

step 4: present script and request approval

display the script to the user. include approximate duration based on word count. ask:

"here's the [format] script i've created (approximately [length] minutes):

[display script]

want me to:

generate the audio now
make changes to the script
adjust the length or tone"

output: user decision captured.

step 5: handle user feedback loop

if user requests changes:

regenerate the script with adjustments while maintaining target word count.
re-present the revised version for approval.
repeat until user approves (or abandon if too many iterations).

if user approves:

proceed to step 6.

output: final approved script.

step 6: format script for TTS

clean the approved script:

remove any remaining markdown (headers, bold, italics).
verify all voice effects are in proper [effect] format.
confirm pauses are appropriately placed (not over-used).
verify numbers and acronyms are spelled out.
strip trailing whitespace.

output: formatted plain-text script, ready for TTS API.

step 7: invoke ElevenLabs TTS

call the Python TTS script with the formatted script text. use command below:

for short scripts (under ~3000 characters):

uv run /home/clawdbot/clawdbot/skills/sag/scripts/tts.py \
  -o /tmp/audio-gen-$(date +%s)-[topic-slug].mp3 \
  -m eleven_multilingual_v2 \
  "[formatted_script]"

for long scripts (over ~3000 characters), use heredoc to avoid shell escaping issues:

uv run /home/clawdbot/clawdbot/skills/sag/scripts/tts.py \
  -o /tmp/audio-gen-$(date +%s)-[topic-slug].mp3 \
  -m eleven_multilingual_v2 \
  "$(cat <<'EOF'
[formatted_script]
EOF
)"

replace [topic-slug] with a short slug of the topic (e.g., "lonely-robot", "sustainable-living").

output: MP3 file written to /tmp/audio-gen-[timestamp]-[topic-slug].mp3.

step 8: return result to user

on success, return:

MEDIA:/tmp/audio-gen-[timestamp]-[topic-slug].mp3

your [format] is ready! [1-2 sentence description of content]. duration: approximately [X] minutes.

output: media link and confirmation message.

decision points

if content type ambiguous: ask user to clarify (audiobook, podcast, or educational). do not guess.

if user specifies custom voice: check if voice name (e.g., "Rachel") is configured in system. if yes, update the SAG command with that voice ID. if no, confirm with user that only default voice is available.

if target length exceeds 30 minutes: decline and suggest breaking into multiple episodes or chapters. offer to create episode 1 first.

if target length under 2 minutes: warn user that content will be very brief (roughly 150 words). offer to extend to 5 minutes.

if generated script length deviates >20% from target: ask user:

if script is too long: "condense it, split into parts, or keep as is?"
if script is too short: "expand with more detail, add examples, or generate as is?"

in either case, do not proceed to audio generation until user confirms.

if user wants to edit the script manually: provide the plain text script. let them modify and paste back. re-validate word count before proceeding to TTS.

if TTS API call fails: check error type. if API key missing or invalid, tell user to verify ELEVENLABS_API_KEY is set. if rate limit hit, wait a moment and retry. if text too long, split into smaller chunks (max ~5000 chars per call). if other error, display error message and offer to troubleshoot.
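
the "wait a moment and retry" branch could look like the Python sketch below. this is an illustration under assumptions: `RateLimitError` and `with_retry` are hypothetical names, and how a rate limit is actually detected depends on what the TTS script reports.

```python
import time
from typing import Callable


class RateLimitError(Exception):
    """Hypothetical exception raised when the TTS call is rate limited."""


def with_retry(call: Callable[[], str], attempts: int = 3,
               base_delay: float = 2.0,
               sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry a TTS call on rate limits, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the user
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

other error types (missing key, text too long) are deliberately not retried, since waiting does not fix them.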

if user requests a multi-part series: generate and approve episode 1 first. after audio is ready, ask if they want episode 2. repeat as needed.

if user wants to convert their own text to audio: direct them to use the SAG skill directly instead. that skill is for "user provides text, we generate audio". audio-gen is for "user provides concept, we write script and generate audio".

output contract

on success:

file location: /tmp/audio-gen-[timestamp]-[topic-slug].mp3
file format: MP3, 44.1 kHz, 128 kbps (the ElevenLabs default output format, mp3_44100_128)
duration: matches user's requested length (within ~10% variance due to pacing)
content: complete, uninterrupted narration in a single voice
return format: MEDIA token pointing to file, plus 1-2 sentence confirmation

example output:

MEDIA:/tmp/audio-gen-1704067200-quantum-computing.mp3

your 15-minute educational audio on quantum computing is ready. it explains qubits, superposition, and real-world applications using everyday analogies. duration: 15 minutes, 2 seconds.

on script generation error:

return the error message from Claude API
offer to retry or adjust request

on TTS generation error:

return the specific error from ElevenLabs API
return the plain-text script so user can generate audio elsewhere if needed
troubleshooting steps displayed based on error type

on user rejection after approval:

return the most recent script version in plain text
no file generated

outcome signal

the user knows the skill worked when:

audio file is returned via MEDIA token: they can click or download /tmp/audio-gen-[timestamp]-[topic-slug].mp3 immediately.
duration matches request: if they asked for 10 minutes, the returned audio is approximately 10 minutes (within 30-60 seconds variance is acceptable).
script quality matches expectations: the narration is natural (not robotic), includes appropriate pauses and voice effects, and covers the requested topic accurately.
content is complete and polished: no cutoffs, skipped sections, or gaps. the audio starts with a clear intro and ends with a satisfying conclusion.
user can play it back immediately: audio is playable in any standard MP3 player. no transcoding or additional processing needed.

failure signals:

no MEDIA link returned, or link is broken.
audio duration is wildly off (e.g., 3 minutes when 10 was requested).
audio contains glitches, silence gaps, or unintelligible sections.
script does not match the requested topic or tone.
API key errors or "credits exceeded" messages.

credits: original skill from clawhub. enriched for implexa standards.
