Alibabacloud Bailian Voice Creator

AI voice creation skill supporting speech recognition (ASR) and text-to-speech (TTS). Uses qwen3-asr-flash-filetrans, qwen-tts and other models. Use this ski...

SkillRank score: 7.3/10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

alibabacloud-bailian-voice-creator provides asr and tts capabilities using qwen models and dashscope api, with explicit api key lifecycle management and instruction-controlled voice synthesis for style customization.

structure: 9.0
trigger phrases: 4.0
procedure: 8.0
edge cases: 7.0
documentation: 8.0


---
name: alibabacloud-bailian-voice-creator
description: AI voice creation skill supporting speech recognition (ASR) and text-to-speech (TTS). Uses qwen3-asr-flash-filetrans, qwen-tts and other models. Use this skill when users need speech-to-text, text-to-speech, or audio processing. Note: on first run, it will auto-manage DashScope API Keys (create/recycle) and may auto-install the Alibaba Cloud CLI ModelStudio plugin.
---

# AI Voice Creator

Professional-grade AI voice creation skill supporting speech recognition (ASR) and text-to-speech (TTS). Built on Alibaba Cloud DashScope API.

## Security Rules (Mandatory)

1. **Never hardcode or log API Keys in plain text** in code, logs, or output files. Patterns like `api_key = "sk-..."` are strictly forbidden.
2. API Keys must only be obtained via `scripts/api_key.py`'s `get_api_key()` function, or via `os.environ.get('DASHSCOPE_API_KEY')`.
3. Generated scripts, logs, and output files must not contain any content starting with `sk-`.
4. Logs should only record operation status (success/failure), never credential information.
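
One way to enforce rules 3 and 4 is to pass every log line through a redaction helper before it is written. The sketch below is illustrative only: `redact_secrets` and `safe_status_line` are hypothetical names that are not part of `scripts/api_key.py`, and the regular expression is an assumption about what a key looks like.

```python
import re

# Assumption: a key is "sk-" followed by letters, digits, "-" or "_".
# The leading \b keeps words such as "task-id" from being matched.
_KEY_PATTERN = re.compile(r"\bsk-[A-Za-z0-9_-]+")


def redact_secrets(message: str) -> str:
    """Replace anything that looks like an API key with a fixed placeholder."""
    return _KEY_PATTERN.sub("[REDACTED]", message)


def safe_status_line(operation: str, ok: bool) -> str:
    """Build a log line that carries only the operation name and its status."""
    status = "success" if ok else "failure"
    return redact_secrets(f"{operation}: {status}")
```

Routing exception text through `redact_secrets` before printing it covers the case where an SDK error message echoes the credential.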

## Execution Requirements (Mandatory)

1. **TTS must and can only use `dashscope.MultiModalConversation.call` with the `qwen-tts` model.** Using edge-tts, gTTS, ElevenLabs, Azure TTS, sambert, NLS, or any other third-party TTS service is strictly forbidden.
2. **Never generate mock/demo code or placeholder files.** Real API calls must be made; do not generate simulated scripts or blank audio files for any reason.
3. **Never auto-fallback when API calls fail.** Do not switch to other TTS services after a qwen-tts failure. Catch the exception, output a clear error message, and exit.
4. If the `dashscope` library is missing, install it first with `pip install dashscope`.

## Required API Call Templates (Do Not Replace)

### Standard Speech Synthesis

```python
import dashscope
from api_key import get_api_key

api_key = get_api_key()
if api_key:
    dashscope.api_key = api_key
# If get_api_key() returns None, SDK resolves auth via environment (AK/SK, etc.)

response = dashscope.MultiModalConversation.call(
    model="qwen-tts",
    text="Text to synthesize",
    voice="Cherry"
)
audio_url = response.output.get('audio', {}).get('url', '')
```

### Instruct-Controlled Speech Synthesis (Required when user requests a specific voice style)

```python
response = dashscope.MultiModalConversation.call(
    model="qwen-tts",
    text="Text to synthesize",
    voice="Cherry",
    # NOTE: instructions value must be in Chinese - the qwen-tts model processes Chinese instructions
    instructions="语速快，充满热情和感染力，直播带货风格"
)
```

**Note: The `instructions` parameter controls voice style via natural language. Do NOT substitute it with `speech_rate`, `pitch_rate`, or `volume_rate` numeric parameters.**

### Error Handling Template

```python
import sys

try:
    response = dashscope.MultiModalConversation.call(
        model="qwen-tts", text=text, voice=voice
    )
    if response.status_code != 200:
        print(f"qwen-tts call failed: {response.code} - {response.message}")
        sys.exit(1)
except Exception as e:
    print(f"qwen-tts call failed: {e}")
    print("Please check: 1) Is DASHSCOPE_API_KEY set? 2) Is the network available?")
    sys.exit(1)
# Do NOT fallback to edge-tts, gTTS or other services here
```

## Feature Overview

| Feature | Model | Highlights |
|---------|-------|------------|
| Long Audio Recognition | `qwen3-asr-flash-filetrans` | Up to 12 hours, supports emotion detection & timestamps |
| Short Audio Recognition | `qwen3-asr-flash` | Up to 5 minutes, low latency |
| Speech Synthesis | `qwen-tts` | Multiple voices, multilingual, instruction control |
| Instruct-Controlled Synthesis | `qwen-tts` + instructions | Control voice expressiveness via natural language |

## Orchestration Logic

### Products and APIs

| Product | API / SDK Call | Purpose |
|---------|---------------|---------|
| DashScope ASR | `Transcription.async_call` + `Transcription.wait` | Long audio recognition (async) |
| DashScope ASR | `POST /services/audio/asr/transcription` | Short audio recognition (sync) |
| DashScope TTS | `MultiModalConversation.call` | Speech synthesis (standard / instruct-controlled) |
| Alibaba Cloud CLI ModelStudio | `create-api-key` / `list-workspaces` / `delete-api-key` | API Key lifecycle management |

### Decision Flow

```
User Request
  |
  +-- Intent: Audio -> Text (ASR)
  |     |
  |     +-- Audio duration <= 5 min AND file <= 10MB AND no emotion/timestamps needed?
  |     |     -> Short audio recognition: qwen3-asr-flash (sync, low latency)
  |     |
  |     +-- Other cases (long audio / emotion detection / timestamps needed)
  |           -> Long audio recognition: qwen3-asr-flash-filetrans (async, submit + poll)
  |
  +-- Intent: Text -> Speech (TTS)
  |     |
  |     +-- User specified voice style/emotion/speed requirements?
  |     |     -> Instruct-controlled synthesis: qwen-tts + instructions parameter
  |     |
  |     +-- Standard reading only
  |           -> Standard synthesis: qwen-tts
  |
  +-- Prerequisite: No available API Key
        -> Call api_key.py: get_api_key() auto-reads
        -> If none exists: generate_api_key() creates via Alibaba Cloud CLI and saves
```

### Call Sequence

**Speech Recognition (Long Audio)**:
1. `get_api_key()` -> Get DashScope API Key
2. `Transcription.async_call(model, file_urls, language_hints)` -> Submit async task, get task_id
3. `Transcription.wait(task=task_id)` -> Poll until task completes
4. Get recognition result JSON from `output.results[].transcription_url`
5. Parse `transcripts[].text` / `sentences[]` / `emotion` from JSON
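
Step 5 can be sketched as a small parser over the downloaded JSON. The field names `transcripts`, `text`, `sentences`, and `emotion` come from the list above; the nesting (sentences inside each transcript, one emotion per sentence) is an assumption and should be checked against a real result file. `parse_transcription` is a hypothetical helper name.

```python
import json


def parse_transcription(raw_json: str) -> dict:
    """Collect the full text and per-sentence details from a result document."""
    data = json.loads(raw_json)
    texts = []
    sentences = []
    for transcript in data.get("transcripts", []):
        texts.append(transcript.get("text", ""))
        # Assumption: sentences are nested under each transcript entry.
        for sentence in transcript.get("sentences", []):
            sentences.append({
                "text": sentence.get("text", ""),
                "emotion": sentence.get("emotion"),  # None when absent
            })
    return {"text": "\n".join(texts), "sentences": sentences}
```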

**Speech Recognition (Short Audio)**:
1. `get_api_key()` -> Get DashScope API Key
2. `POST /services/audio/asr/transcription` -> Sync call, returns recognized text directly

**Speech Synthesis (Standard / Instruct-Controlled)**:
1. `get_api_key()` -> Get DashScope API Key
2. `MultiModalConversation.call(model, text, voice, [instructions])` -> Returns audio URL
3. `download_audio(url, output_path)` -> Download audio and auto-detect format (WAV/MP3)

**API Key Auto-Retrieval**:
1. Read `~/.aliyun/config.json` current profile's `dashscope.api_key` -> Return if found
2. Read environment variable `DASHSCOPE_API_KEY` -> Return if found
3. Alibaba Cloud CLI available -> Auto-create via `generate_api_key()` and save to config
4. All above fail -> Error with setup instructions
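
The first two lookup steps might look roughly like the following. This is a sketch under stated assumptions, not the code in `scripts/api_key.py`: it assumes `config.json` has a top-level `current` profile name and a `profiles` list, and that the key is stored under `profile["dashscope"]["api_key"]`. `lookup_api_key` is a hypothetical name, and the CLI-based creation step is left out.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def lookup_api_key(config_path: Path,
                   env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the key from the CLI config first, then from the environment."""
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        config = {}  # missing or unreadable config: fall through to env
    if not isinstance(config, dict):
        config = {}

    current = config.get("current")
    for profile in config.get("profiles", []):
        if isinstance(profile, dict) and profile.get("name") == current:
            section = profile.get("dashscope")  # assumed layout
            if isinstance(section, dict) and section.get("api_key"):
                return section["api_key"]

    # Step 2: environment variable fallback.
    return env.get("DASHSCOPE_API_KEY") or None
```

Passing `env` as a parameter keeps the function testable without touching the real process environment.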

### Quick Reference

| Condition | Choice |
|-----------|--------|
| Audio <= 5 min and <= 10MB | `qwen3-asr-flash` |
| Audio > 5 min or > 10MB | `qwen3-asr-flash-filetrans` |
| Need emotion detection / timestamps / punctuation | `qwen3-asr-flash-filetrans` |
| TTS with no style requirements | `qwen-tts` standard call |
| TTS with style/emotion/speed requirements | `qwen-tts` + `instructions` |
| Need dialect voices | Not supported by current `qwen-tts`; pending model update or other TTS models |
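
The ASR rows of this table reduce to a small selection function. The sketch below is one possible encoding; `choose_asr_model` is a hypothetical name, "10MB" is interpreted as 10 MiB (an assumption), and unknown duration or size is routed to the long-audio model as the conservative choice.

```python
from typing import Optional

SHORT_MODEL = "qwen3-asr-flash"
LONG_MODEL = "qwen3-asr-flash-filetrans"
MAX_SHORT_SECONDS = 5 * 60
MAX_SHORT_BYTES = 10 * 1024 * 1024  # assumption: "10MB" means 10 MiB


def choose_asr_model(duration_s: Optional[float],
                     size_bytes: Optional[int],
                     needs_rich_output: bool = False) -> str:
    """Pick the ASR model from duration, size, and feature requirements.

    needs_rich_output covers emotion detection, timestamps, and punctuation.
    """
    if needs_rich_output:
        return LONG_MODEL
    if duration_s is None or size_bytes is None:
        return LONG_MODEL  # unknown metadata: use the model with wider limits
    if duration_s <= MAX_SHORT_SECONDS and size_bytes <= MAX_SHORT_BYTES:
        return SHORT_MODEL
    return LONG_MODEL
```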

## Speech Recognition (ASR) Guide

### Model Selection

| Scenario | Recommended Model | Notes |
|----------|------------------|-------|
| Meeting transcription, interview records | `qwen3-asr-flash-filetrans` | Long audio, supports emotion detection & timestamps |
| Voice messages, real-time subtitles | `qwen3-asr-flash` | Short audio, low latency |
| Customer service QA | `qwen3-asr-flash-filetrans` | Can analyze customer emotions |
| Singing audio analysis | `qwen3-asr-flash-filetrans` | Supports lyrics recognition & emotion analysis |

### Supported Languages

Chinese (Mandarin, Sichuan dialect, Minnan, Wu, Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish, Hindi, Indonesian, Thai, Turkish, Ukrainian, Vietnamese, and 30+ other languages.

### Supported Audio Formats

`aac`, `amr`, `avi`, `flac`, `flv`, `m4a`, `mkv`, `mov`, `mp3`, `mp4`, `mpeg`, `ogg`, `opus`, `wav`, `webm`, `wma`, `wmv`

### Feature Comparison

| Feature | qwen3-asr-flash-filetrans | qwen3-asr-flash |
|---------|---------------------------|-----------------|
| Audio Duration | Up to 12 hours (<=2GB) | Up to 5 minutes (<=10MB) |
| Emotion Detection | Supported (Surprise/Calm/Happy/Sad/Disgust/Angry/Fear) | Not supported |
| Timestamps | Supported (sentence/word level) | Not supported |
| Punctuation Prediction | Supported | Not supported |
| Singing Recognition | Supported | Not supported |
| Noise Rejection | Supported | Not supported |

## Text-to-Speech (TTS) Guide

### Model Selection

| Scenario | Recommended Model | Notes |
|----------|------------------|-------|
| Audiobooks, radio drama dubbing | `qwen-tts` + instructions | Supports instruction control, rich expressiveness |
| Navigation, notification announcements | `qwen-tts` | Short text, high frequency calls |
| Online education courseware | `qwen-tts` | Multilingual support |

**Important Notes**:
- Speech synthesis uses the `MultiModalConversation.call` API
- Audio output is in WAV format (URL valid for 24 hours)
- The script auto-detects format and saves with the correct extension
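
Format auto-detection can be done by inspecting the first bytes of the downloaded audio. The following is a minimal sketch, not the script's actual logic: `detect_audio_extension` is a hypothetical name, and falling back to `.wav` for unrecognized data is an assumption based on WAV being the documented output format.

```python
def detect_audio_extension(data: bytes) -> str:
    """Guess the file extension from the leading bytes of an audio payload."""
    # WAV: RIFF container whose form type is "WAVE".
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return ".wav"
    # MP3 with an ID3v2 tag at the start.
    if data[:3] == b"ID3":
        return ".mp3"
    # Bare MPEG audio frame: 11 sync bits set (0xFF, then top 3 bits of next byte).
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return ".mp3"
    return ".wav"  # assumed default when the format is not recognized
```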

### Instruct Control

When users request a specific voice style (e.g., livestream sales style, gentle style, news broadcast, etc.), **the `instructions` parameter must be used** to control voice expressiveness via natural language.

**Difference between `instructions` and traditional numeric parameters**:
- `instructions`: Natural language description, e.g., `"语速快，充满热情"` -> **Must use this approach**
- `speech_rate` / `pitch_rate` / `volume_rate`: Numeric parameters -> **Forbidden, qwen-tts does not support these parameters**

**Call method** (follow strictly):
```python
response = dashscope.MultiModalConversation.call(
    model="qwen-tts",
    text="Text to synthesize",
    voice="Cherry",
    # NOTE: instructions value must be in Chinese - the qwen-tts model processes Chinese instructions
    instructions="语速快，充满热情和感染力，直播带货风格，音调偏高"
)
```

**Description dimensions reference**:

| Dimension | Examples |
|-----------|---------|
| Pitch | High, medium, low, slightly high, slightly low |
| Speed | Fast, medium, slow, slightly fast, slightly slow |
| Emotion | Cheerful, calm, gentle, serious, lively, cool, healing |
| Characteristics | Magnetic, crisp, husky, mellow, sweet, deep, powerful |
| Use Case | News broadcast, ad voiceover, audiobook, animation character, voice assistant |

**Instruction examples** (in Chinese, as required by the model):
```
语速较快，带有明显的上扬语调，适合介绍时尚产品
音量由正常对话迅速增强至高喊，性格直率，情绪易激动
哭腔导致发音略微含糊，略显沙哑，带有明显哭腔的紧张感
音调偏高，语速中等，充满活力和感染力，适合广告配音
```

### Available Voices (qwen-tts)

When calling `qwen-tts` via `MultiModalConversation.call`, the following 4 voices are supported:

| voice Parameter | Voice Name | Description |
|----------------|------------|-------------|
| `Cherry` | Qianyue | Sunny, positive, naturally approachable young woman (Female) |
| `Serena` | Suyao | Gentle young woman (Female) |
| `Ethan` | Chenxu | Sunny, warm, energetic (Male) |
| `Chelsie` | Qianxue | Anime-style virtual companion (Female) |

> **Note**: Other voices (Jennifer, Ryan, Neil, Elias, and dialect voices) require the `qwen3-tts-flash` model's `SpeechSynthesizer` WebSocket API, which is not currently supported by these scripts.
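
Because only these four voices work with the current scripts, it can help to reject other names before any API call is made. The helper below is a hypothetical sketch (`validate_voice` is not part of the shipped scripts).

```python
SUPPORTED_VOICES = ("Cherry", "Serena", "Ethan", "Chelsie")


def validate_voice(voice: str) -> str:
    """Return the voice unchanged if supported, otherwise raise ValueError."""
    if voice not in SUPPORTED_VOICES:
        raise ValueError(
            f"voice {voice!r} is not supported by qwen-tts in this skill; "
            f"choose one of: {', '.join(SUPPORTED_VOICES)}"
        )
    return voice
```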

## Environment Setup

### 1. Install FFmpeg (Audio Processing Tool)

FFmpeg is used for audio format conversion, sample rate adjustment, and other preprocessing tasks.

```bash
# macOS (Homebrew)
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Windows (Chocolatey)
choco install ffmpeg
```

Verify installation:
```bash
ffmpeg -version
```

### 2. Install Python Dependencies

```bash
pip install -r scripts/requirements.txt
```

### 3. Configure API Key

API Keys are managed by the unified `scripts/api_key.py` module, with the following retrieval priority:
1. Alibaba Cloud CLI config `~/.aliyun/config.json` current profile's `dashscope.api_key`
2. Environment variable `DASHSCOPE_API_KEY`
3. Auto-create and save when Alibaba Cloud CLI is available (`generate_api_key()`)

```python
# All scripts use this unified approach
from api_key import get_api_key
api_key = get_api_key()  # Returns str or None (SDK resolves auth when None)
```

Manual environment variable configuration:
```bash
export DASHSCOPE_API_KEY=sk-xxx
```

| Item | Description |
|------|-------------|
| **Key Format** | `sk-xxx` (standard DashScope API Key) |
| **Not Supported** | `sk-sp-xxx` (Coding Plan Key, does not support voice services) |
| **Get Key** | https://bailian.console.aliyun.com/cn-beijing/?tab=app#/api-key |
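
The two key formats in the table can be told apart by prefix. A minimal check, with `is_voice_capable_key` as a hypothetical helper name, might be:

```python
def is_voice_capable_key(key: str) -> bool:
    """True for standard keys (sk-...), False for Coding Plan keys (sk-sp-...)."""
    return key.startswith("sk-") and not key.startswith("sk-sp-")
```

This checks the format only; it does not confirm that the key is valid or that it belongs to the right region.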

### Alibaba Cloud CLI Configuration (API Key Auto-Create/Delete)

The `scripts/api_key.py` module creates and deletes API Keys via `aliyun modelstudio` commands. Complete the following setup before use:

**1. Enable AI-Mode and Update Plugins**

```bash
# Enable AI-Mode (allow Agent to call CLI)
aliyun configure ai-mode enable

# Set User-Agent
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-bailian-voice-creator"

# Update plugins to latest version
aliyun plugin update
```

**2. Install ModelStudio Plugin** (if not already installed)

```bash
aliyun plugin install --names aliyun-cli-modelstudio --enable-pre
```

**3. Disable AI-Mode After Task Completion**

```bash
aliyun configure ai-mode disable
```

**CLI Commands Used**:

| Command | Purpose | Called From |
|---------|---------|------------|
| `aliyun modelstudio list-workspaces` | Get Bailian Workspace ID | `api_key.py: _get_workspace_id()` |
| `aliyun modelstudio create-api-key` | Create DashScope API Key | `api_key.py: generate_api_key()` |
| `aliyun modelstudio delete-api-key` | Delete cloud API Key | `api_key.py: _delete_cloud_api_key()` |

### FFmpeg Audio Processing Commands

```bash
# Query audio info
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 audio.mp3

# Convert to 16kHz mono WAV (recommended for ASR)
ffmpeg -i input.mp3 -ac 1 -ar 16000 -sample_fmt s16 output.wav

# Trim audio (start at 1:30, extract 2 minutes)
ffmpeg -i long_audio.wav -ss 00:01:30 -t 00:02:00 -c copy output_clip.wav

# Extract audio from video
ffmpeg -i video.mp4 -vn -acodec mp3 audio.mp3
```

## Directory Structure

```
voice-creator/
├── scripts/
│   ├── api_key.py                 # API Key management module
│   ├── speech_recognition.py      # Speech recognition example
│   ├── speech_synthesis.py        # Speech synthesis example
│   ├── generate_livestream.py     # Livestream sales voice generation example
│   └── requirements.txt           # Python dependencies (pinned versions)
├── references/
│   ├── api-docs.md                # API reference documentation
│   ├── models.md                  # Model list and selection guide
│   └── error-codes.md             # Error code reference
├── evals/                         # Test cases
│   ├── config/
│   ├── scenarios/
│   └── triggering/
├── related_apis.yaml
└── SKILL.md
```

## Script List

| Script | Function | Model |
|--------|----------|-------|
| `api_key.py` | API Key management (get, create, delete) | - |
| `speech_recognition.py` | Speech recognition (long/short audio) | qwen3-asr-flash-filetrans / qwen3-asr-flash |
| `speech_synthesis.py` | Speech synthesis (with instruction control) | qwen-tts |
| `generate_livestream.py` | Livestream sales style voice generation | qwen-tts |

**Changelog** (2026-03-18):
- Uses `MultiModalConversation.call` API for TTS service
- Auto-detects audio format and saves with correct extension (WAV/MP3)
- API Key retrieval: `~/.aliyun/config.json` first, environment variable fallback
- Clearly distinguishes DASHSCOPE_API_KEY from Coding Plan API Key
- Added detailed Key format validation and error messages

## Usage Examples

### Speech Recognition

```bash
python scripts/speech_recognition.py
```

### Speech Synthesis

```bash
python scripts/speech_synthesis.py
```

### Python API Examples

```python
from speech_synthesis import synthesize_speech, synthesize_with_instruct

# Standard synthesis
audio_path = synthesize_speech(
    text="Hello, this is a test voice",
    voice="Cherry",
    output_file="output.wav"
)

# Instruct-controlled synthesis (livestream sales style)
audio_path = synthesize_with_instruct(
    text="Hello everyone, this product is amazing!",
    voice="Cherry",
    # NOTE: instructions must be in Chinese for the qwen-tts model
    instructions="语速快，充满热情和感染力，直播带货风格",
    output_file="livestream.wav"
)
```

## Region URLs

| Region | URL |
|--------|-----|
| Beijing | https://dashscope.aliyuncs.com/api/v1 |
| Singapore | https://dashscope-intl.aliyuncs.com/api/v1 |

**Note**: API Keys are not interchangeable between regions.

## Pricing

### Speech Recognition (ASR)

Billed by input audio duration (seconds); output is not billed.

| Model | Unit Price |
|-------|-----------|
| qwen3-asr-flash-filetrans | ¥0.00022/second |
| qwen3-asr-flash | ¥0.00022/second |

**Pricing Examples**:
- 10-minute audio (600 seconds) -> ¥0.13
- 1-hour audio (3600 seconds) -> ¥0.79
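
The examples above follow from multiplying the duration by the unit price and rounding to two decimal places. A short sketch that reproduces them (`asr_cost_cny` is a hypothetical name; half-up rounding is an assumption about how the examples were rounded):

```python
from decimal import Decimal, ROUND_HALF_UP

ASR_PRICE_PER_SECOND = Decimal("0.00022")  # CNY per second, from the table above


def asr_cost_cny(duration_seconds: int) -> Decimal:
    """Estimate the ASR cost in CNY, rounded to two decimal places."""
    cost = ASR_PRICE_PER_SECOND * Decimal(duration_seconds)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

`Decimal` is used instead of `float` so that the rounding is exact for currency amounts.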

### Speech Synthesis (TTS)

#### qwen-tts (Token-Based)

Billed by input and output tokens.

| Billing Item | Unit Price |
|-------------|-----------|
| Input Text | ¥0.0016/1K tokens |
| Output (Audio) | ¥0.01/1K tokens |

**Pricing Examples**:
- 100-character text -> approx. ¥0.0013
- 1,000-character text -> approx. ¥0.013
- 10,000-character text -> approx. ¥0.13

**Notes**:
- One Chinese character is approximately 1 token
- Output tokens are calculated based on audio duration
- View detailed bills: https://usercenter2.aliyun.com/finance/expense-center/overview

### Free Tier

New users receive after activating Bailian:
- Speech Recognition: 36,000 seconds (10 hours)
- Speech Synthesis: 10,000 characters
- Valid for: 90 days after activation

## References

- [Audio File Transcription API Documentation](https://help.aliyun.com/zh/model-studio/qwen-asr-api-reference)
- [Speech Synthesis API Documentation](https://help.aliyun.com/zh/model-studio/qwen-tts-api-reference)
- [Model List](https://help.aliyun.com/zh/model-studio/models)
- [Get API Key](https://help.aliyun.com/zh/model-studio/get-api-key)

## Using This Skill

Trigger this skill when users request tasks such as:
- "Convert this audio to text"
- "Transcribe this recording"
- "Generate a voice clip for me"
- "Convert this text to speech"
- "Read this text using XX voice"
- "Analyze the emotions in this audio"


expanded original skill with explicit decision trees for model selection, detailed error handling and retry logic, clarified external connection requirements and setup steps, added edge cases like rate limits and network timeouts, separated ASR and TTS procedure flows with numbered steps, formalized input/output contracts with data formats and file locations, and added outcome signal validation steps.

Alibabacloud Bailian Voice Creator

professional-grade AI voice creation skill supporting speech recognition (ASR) and text-to-speech (TTS). built on Alibaba Cloud DashScope API.

intent

use this skill when users need to convert audio to text (speech recognition), convert text to audio (speech synthesis), or process audio files. handles both short audio (under 5 minutes) and long audio (up to 12 hours), plus emotion detection, timestamps, and instruction-controlled voice styles. triggers on requests like "transcribe this recording", "read this text aloud", "convert audio to text", or "generate a voice with X style".

inputs

required external connections

DashScope API Key: standard format sk-xxx (not sk-sp-xxx, which is a coding plan key and does not support voice services). obtained via:
1. ~/.aliyun/config.json current profile's dashscope.api_key field
2. environment variable DASHSCOPE_API_KEY
3. auto-generated via scripts/api_key.py: generate_api_key() if Alibaba Cloud CLI is available
- retrieve from: https://bailian.console.aliyun.com/cn-beijing/?tab=app#/api-key

Alibaba Cloud CLI: required for auto-creating/deleting API keys. setup:

aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-bailian-voice-creator"
aliyun plugin update
aliyun plugin install --names aliyun-cli-modelstudio --enable-pre

FFmpeg: required for audio preprocessing, format detection, and sample rate adjustment.

# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg
# Windows
choco install ffmpeg

required python dependencies

pip install -r scripts/requirements.txt

must include: dashscope, requests, pydub (or equivalent audio processing library).

input parameters (varies by task type)

for speech recognition (ASR):

audio file path or URL (supported formats: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv)
language hints (optional, defaults to auto-detect; supports 30+ languages including Chinese, English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish, Hindi, Indonesian, Thai, Turkish, Ukrainian, Vietnamese)
emotion detection flag (optional, only for long audio model)
timestamps flag (optional, only for long audio model)

for speech synthesis (TTS):

input text (string, no length limit)
voice selection (Cherry / Serena / Ethan / Chelsie)
instructions for voice style (optional, must be in Chinese; e.g., "语速快，充满热情和感染力，直播带货风格")
output file path (optional, defaults to auto-detected format)

context from user

audio duration or file size (determines model selection)
whether emotional analysis or timestamps are needed
desired voice style or emotional tone
target language (if known)

procedure

step 1: retrieve or create API key

input: none (reads from config/environment/Alibaba Cloud CLI)
output: valid DashScope API key string or None (SDK falls back to environment)

call get_api_key() from scripts/api_key.py
function attempts retrieval in this order:
- read ~/.aliyun/config.json current profile's dashscope.api_key
- read environment variable DASHSCOPE_API_KEY
- if Alibaba Cloud CLI is available, call generate_api_key() to create a new key via aliyun modelstudio create-api-key and cache in config
if all retrieval methods fail, log error with setup instructions and exit
if key is obtained, verify format starts with sk- (not sk-sp-)
set dashscope.api_key = api_key in SDK initialization

step 2a: determine ASR model (speech recognition path)

input: audio file/URL and metadata (duration, size, feature flags)
output: model selection string (qwen3-asr-flash or qwen3-asr-flash-filetrans)

if audio duration <= 5 minutes AND file size <= 10MB AND no emotion/timestamp/punctuation needed:
- select qwen3-asr-flash (short audio, sync, low latency)
else (audio > 5 min OR file > 10MB OR emotion/timestamps/punctuation required):
- select qwen3-asr-flash-filetrans (long audio, async, up to 12 hours)

step 2b: execute short audio recognition (qwen3-asr-flash)

input: audio file/URL, language hints (optional)
output: recognized text string

validate file size <= 10MB; if larger, reject with error message
upload or prepare audio URL (local file or HTTP/HTTPS URL)
send the synchronous request POST /services/audio/asr/transcription (the short-audio call listed in the original SKILL.md) with parameters:
- model: qwen3-asr-flash
- audio_url: file path or URL
- language_hints: language code (optional)
catch exceptions (network timeout, auth failure, rate limit):
- log error message with operation status only, never log API key
- output clear error message: "ASR call failed: [error details]. check: 1) is DASHSCOPE_API_KEY set? 2) is the network available?"
- exit with sys.exit(1), do not fallback to other services
if response.status_code != 200:
- log response.code and response.message
- exit with error
extract text from response.output.results[0].transcription
return recognized text string

step 2c: execute long audio recognition (qwen3-asr-flash-filetrans)

input: audio file/URL, language hints (optional), emotion/timestamp/punctuation flags
output: recognized text string, optional emotion data, optional timestamps

validate file size <= 2GB; if larger, reject with error message
upload or prepare audio URL
call dashscope.Transcription.async_call() with parameters:
- model: qwen3-asr-flash-filetrans
- file_urls: list of URLs or file paths (parameter name as used in the original SKILL.md call sequence)
- language_hints: language code (optional)
- diarization: true/false (for speaker diarization, optional)
capture task_id from response.output.task_id
poll task status via dashscope.Transcription.wait(task=task_id):
- wait interval: 2-5 seconds between polls
- max polls: until status is SUCCEEDED, FAILED, or timeout (default 3600 seconds)
- on each poll, check response.status_code == 200
if task status is FAILED:
- extract error code and message from response.output.task_status / error
- log error message
- exit with error
if task succeeds, retrieve transcription_url from response.output.results[0].transcription_url
download JSON from transcription_url:
- parse transcripts[].text (recognized text)
- parse sentences[] (sentence-level details)
- parse emotion (if requested): values like Surprise, Calm, Happy, Sad, Disgust, Angry, Fear
- parse words[] (if timestamps enabled): word-level timing data
return recognized text and optional metadata (emotion, timestamps)

step 3a: determine TTS model and parameters (speech synthesis path)

input: text string and user requirements (voice style, emotion, speed)
output: API call parameters dict

if user specifies voice style / emotion / speed requirements:
- select qwen-tts with instructions parameter
- instructions must be in Chinese (model requirement), e.g., "语速快，充满热情和感染力，直播带货风格"
- do NOT use numeric parameters like speech_rate, pitch_rate, volume_rate (not supported)
else (standard reading only):
- select qwen-tts without instructions parameter

step 3b: execute speech synthesis (qwen-tts)

input: text string, voice name (Cherry / Serena / Ethan / Chelsie), optional instructions string
output: audio file path (WAV or MP3) and URL

validate text is non-empty string; if empty, return error

build API call parameters:

response = dashscope.MultiModalConversation.call(
    model="qwen-tts",
    text=text,
    voice=voice,  # Cherry, Serena, Ethan, or Chelsie
    instructions=instructions if instructions else None  # Chinese language required
)

catch exceptions (network timeout, auth failure, rate limit):
- log error message with operation status only, never log API key
- output clear error message: "qwen-tts call failed: [error details]. check: 1) is DASHSCOPE_API_KEY set? 2) is the network available?"
- exit with sys.exit(1), do NOT fallback to edge-tts, gTTS, ElevenLabs, or other TTS services
if response.status_code != 200:
- log response.code and response.message
- exit with error
extract audio URL from response.output.audio.url
download audio from URL via HTTP GET request:
- set timeout: 30 seconds
- handle 404/403/500 errors with retry logic (up to 3 retries with exponential backoff: 2s, 4s, 8s)
detect audio format from response headers (Content-Type) or magic bytes:
- if WAV format: save as .wav
- if MP3 format: save as .mp3
- default to .wav if format unknown
save audio file to output_file path (or auto-generate filename with timestamp)
return full file path and verify file size > 0
log success: "speech synthesis completed. audio saved to [path]"

step 4: cleanup and final validation

input: generated file paths, API keys (if auto-created)
output: success/failure status

verify all output files exist and are readable
if API key was auto-generated during this session:
- optionally call scripts/api_key.py: _delete_cloud_api_key() to clean up (only if user requests)
- or log message: "auto-generated API key saved to ~/.aliyun/config.json for future use"
log final operation status (success or failure with error codes)
exit with status 0 (success) or 1 (failure)

decision points

if audio source is a file URL vs. local file:

if audio is already HTTP/HTTPS URL: pass URL directly to API
if audio is local file path: upload via dashscope SDK or convert to pre-signed URL
API supports both file URLs and local paths

if audio duration or size is unknown:

use FFmpeg to probe audio: ffprobe -show_entries format=duration -of default=noprint_wrappers=1 audio.mp3
use ffprobe to get file size in bytes: ffprobe -v error -show_entries format=size -of default=noprint_wrappers=1:nokey=1 audio.mp3
if duration cannot be determined, default to long audio model (qwen3-asr-flash-filetrans) for safety
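
the duration probe can be wrapped in a small helper that returns None when the value cannot be determined, which then maps onto the "default to long audio model" rule. this is a sketch, not the skill's own code: `parse_duration` and `probe_duration` are hypothetical names, and the 30-second timeout is an arbitrary choice.

```python
import subprocess
from typing import Optional


def parse_duration(ffprobe_output: str) -> Optional[float]:
    """Extract seconds from ffprobe output such as 'duration=12.345000'."""
    for line in ffprobe_output.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "duration":
            try:
                return float(value)
            except ValueError:
                return None  # e.g. 'duration=N/A'
    return None


def probe_duration(path: str) -> Optional[float]:
    """Run ffprobe on a file; return None if it is missing or fails."""
    cmd = ["ffprobe", "-v", "error", "-show_entries", "format=duration",
           "-of", "default=noprint_wrappers=1", path]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                check=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_duration(result.stdout)
```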

if emotion detection or timestamps are requested:

always use qwen3-asr-flash-filetrans (short audio model does not support these features)
the 5-minute / 10MB short-audio thresholds do not apply in this case; the long-audio limits (12 hours, 2GB) still do

if user requests a voice that requires qwen3-tts-flash (other voices):

current skill only supports qwen-tts with 4 voices (Cherry, Serena, Ethan, Chelsie)
if user requests dialect voices or other voices (Jennifer, Ryan, Neil, Elias): respond with "dialect voices and the other voices (Jennifer, Ryan, Neil, Elias) require the qwen3-tts-flash WebSocket API, which is not currently supported; choose from Cherry, Serena, Ethan, or Chelsie"

if DashScope API key creation fails:

if Alibaba Cloud CLI is not installed: provide installation instructions and ask user to retry
if workspace ID cannot be retrieved: log error and exit (Bailian workspace setup required)
if API key quota is exceeded: inform user to delete old keys from console and retry

if audio download from response URL fails:

on first failure: log error and retry up to 3 times with exponential backoff (2s, 4s, 8s)
on final failure: output error message and exit (do not use mock/placeholder files)
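
the retry rule above (3 retries at 2s, 4s, 8s, then fail) can be expressed as a generic wrapper. this is a sketch with hypothetical names; `sleep` is injectable so the delays can be checked without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(action: Callable[[], T],
                       retries: int = 3,
                       base_delay: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call action; on failure wait base_delay * 2**n and retry, up to `retries` times."""
    attempt = 0
    while True:
        try:
            return action()
        except Exception:
            if attempt >= retries:
                raise  # final failure: propagate, never substitute a placeholder file
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

a download would be passed in as a zero-argument callable, for example a `lambda` that performs the HTTP GET and raises on a non-200 status.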

if instructions parameter is provided in non-Chinese language:

warn user: "instructions must be in Chinese for qwen-tts model. auto-translating may reduce quality. provide Chinese instructions for best results."
optionally attempt translation (if a translation service is available in context), but warn user
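
a rough way to decide whether the warning applies is to look for CJK ideographs in the instructions string. the check below is a heuristic sketch with hypothetical names; it only covers the basic CJK Unified Ideographs range and does not distinguish Chinese from other languages that use the same characters.

```python
from typing import Optional

WARNING = ("instructions must be in Chinese for qwen-tts model. "
           "provide Chinese instructions for best results.")


def contains_chinese(text: str) -> bool:
    """True if any character falls in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def instructions_warning(instructions: Optional[str]) -> Optional[str]:
    """Return a warning string for non-Chinese instructions, else None."""
    if not instructions:
        return None  # standard synthesis, nothing to check
    return None if contains_chinese(instructions) else WARNING
```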

if output file already exists:

prompt user: "output file [path] already exists. overwrite? (y/n)"
if y: overwrite
if n: generate new filename with timestamp and save
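
the "new filename with timestamp" branch might be implemented as below. `timestamped_path` is a hypothetical helper and the `_YYYYmmdd_HHMMSS` suffix is an assumed naming scheme; the interactive prompt itself is left out.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def timestamped_path(path: Path, now: Optional[datetime] = None) -> Path:
    """Return path unchanged if it is free, else a sibling with a timestamp suffix."""
    if not path.exists():
        return path
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return path.with_name(f"{path.stem}_{stamp}{path.suffix}")
```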

if API rate limit is hit (HTTP 429):

wait 60 seconds before the first retry, then lengthen the wait on each further retry (exponential backoff)
log rate limit message and current retry count
fail after 5 retries with clear error message

if network connection is lost mid-task:

for async tasks (long audio ASR): task_id is persisted in response, user can query status later via Transcription.wait(task=task_id)
for sync tasks (short audio ASR, TTS): retry from beginning up to 3 times, then fail
output message: "network error. async task ID: [task_id] (save this for recovery)"

output contract

for speech recognition (ASR):

standard output: JSON file or plaintext file containing recognized text
format: {"text": "recognized text here", "model": "qwen3-asr-flash-filetrans", "duration_seconds": 120}
optional fields: "emotions": [{"emotion": "Happy", "confidence": 0.95}], "sentences": [{"text": "...", "start_time": 0, "end_time": 5}]
file location: ./output/transcription_[timestamp].json or user-specified path
character encoding: UTF-8
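
a writer for this contract could look like the following sketch. `write_asr_result` is a hypothetical name; the field names mirror the format line above, and optional fields are omitted when empty.

```python
import json
from pathlib import Path
from typing import Optional


def write_asr_result(path: Path, text: str, model: str, duration_seconds: float,
                     emotions: Optional[list] = None,
                     sentences: Optional[list] = None) -> dict:
    """Write the ASR result as UTF-8 JSON and return the record that was written."""
    record = {"text": text, "model": model, "duration_seconds": duration_seconds}
    if emotions:
        record["emotions"] = emotions
    if sentences:
        record["sentences"] = sentences
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps Chinese text readable in the output file.
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2),
                    encoding="utf-8")
    return record
```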

for speech synthesis (TTS):

output: audio file (WAV or MP3 format)
format: WAV (default, 24 kHz sample rate, mono) or MP3 (detected and preserved from API response)
file location: ./output/synthesis_[timestamp].wav or user-specified path
file size: typically 50 KB to 5 MB depending on text length and duration
metadata: audio URL (valid for 24 hours), generation timestamp, voice name used
log file: optional, contains API call metadata and success/failure status

error cases:

output: error message string (stderr) with operation status only, no API keys or sensitive data
format: plain text, human-readable error message
examples:
- "qwen-tts call failed: 401 - authentication failed. check: 1) is DASHSCOPE_API_KEY set? 2) is the key format correct (sk-xxx)?"
- "ASR task failed: 400 - invalid audio format. supported formats: wav, mp3, aac, flac, ogg, opus, webm, m4a"

outcome signal

success (ASR): recognized text file is generated and readable, audio duration is processed, operation logs show "status: success"
success (TTS): audio file is generated and playable, file size > 0 bytes, file can be opened in audio player without errors, operation logs show "speech synthesis completed"
visible user feedback: confirmation message with output file path, e.g., "transcription saved to /path/to/output.json" or "audio generated: /path/to/synthesis.wav"
failure: stderr output shows error code and user-actionable message (e.g., "check API key", "check network", "check audio format"), exit code is 1, no partial or corrupted output files are created
async task tracking: if long audio ASR task is submitted, user receives task_id in output to track status later via polling
file verification: user can independently verify output by opening the file, checking file size, and confirming content matches request (e.g., audio duration matches input audio)
