---
name: huawei-cloud-ascend-models-deploy
description: |
  Huawei Cloud Ascend model deployment and testing skill for large language models on Ascend DevServer (910B series). Supports single-machine and dual-machine deployment for LLM, VL (vision-language), Embedding, and Rerank models. Provides model inference testing, deployment log viewing, and status monitoring with automated model matching and deployment script generation.
  Use this skill when the user wants to: (1) deploy a model on Ascend DevServer, (2) test model inference, (3) view deployment logs or status, (4) list supported models, (5) check deployment prerequisites.
  Trigger: deploy, test, model list, deployment log, Ascend, DevServer, 910B, ModelArts, LLM, VL, Embedding, Rerank, multimodal, inference, model catalog, 昇腾, 部署模型, 测试模型, 模型列表, 部署日志, 模型部署, 推理测试
tags: [Ascend, LLM, deploy, inference]
---

# Huawei Cloud Ascend Models Deploy

Deploy and test large language models on Huawei Cloud Ascend DevServer (910B series). Supports single-machine and dual-machine deployment, model inference testing, and deployment monitoring.

## Overview

The skill handles four model categories: LLM, VL (vision-language), Embedding, and Rerank. Embedding and Rerank models run on a single card only, so multi-card and dual-machine deployment do not apply to them.

**Related Skills** (Agent orchestrated, no direct call, Rule 3):
- `huawei-cloud-ascend-remote-connect` - SSH connection to DevServer (prerequisite for deployment)
- `huawei-cloud-ascend-command` - NPU status check and monitoring (prerequisite and post-deploy monitoring)

**Capabilities**:
- Model deployment (single-node, dual-node)
- Inference testing (LLM chat, VL multimodal, Embedding, Rerank)
- Deployment log and status monitoring
- Model catalog and script auto-matching

**Deployment Workflow** (Agent orchestrated):
1. Agent calls `huawei-cloud-ascend-remote-connect` to establish SSH connection
2. Agent calls `huawei-cloud-ascend-command` to check NPU health and availability
3. Agent calls this skill (`huawei-cloud-ascend-models-deploy`) to deploy model
4. Agent calls `huawei-cloud-ascend-command` to monitor NPU status during deployment

## Architecture

### System Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────────┐
│                         Agent Orchestration                         │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  1. SSH connect (remote-connect)                             │    │
│  │  2. NPU health check (ascend-command)                        │    │
│  │  3. Deploy model (this skill)                                 │    │
│  │  4. Monitor NPU (ascend-command)                             │    │
│  └────────────────────────────┬────────────────────────────────┘    │
│                               │ Explicit param passing (Rule 1)    │
│                               ▼                                     │
├─────────────────────────────────────────────────────────────────────┤
│              Huawei Cloud Ascend Models Deploy                      │
│                      (Stateless, Rule 2)                            │
├─────────────────────────────────────────────────────────────────────┤
│  ┌──────────────────┐    ┌──────────────────────────────────┐      │
│  │  Natural Language│    │          Deploy Helper           │      │
│  │     Commands     │───▶│  - Model Matching & Catalog      │      │
│  └──────────────────┘    │  - Script Auto-Match             │      │
│                          │  - Command Generation            │      │
│                          └──────────────────────────────────┘      │
│                                           │                         │
│          ┌─────────────────────────────────┼──────────────┐        │
│          ▼                                 ▼              ▼        │
│  ┌───────────────┐              ┌─────────────────┐ ┌────────┐    │
│  │ Model         │              │ Inference       │ │ Log    │    │
│  │ Deployment    │              │ Testing         │ │ Status │    │
│  │               │              │                 │ │        │    │
│  │ • Single-node │              │ • LLM Chat      │ │ • View │    │
│  │ • Dual-node   │              │ • VL Multimodal │ │ • Check│    │
│  │ • 910B Series │              │ • Embedding     │ │        │    │
│  └───────────────┘              │ • Rerank        │ └────────┘    │
│                                 └─────────────────┘               │
└─────────────────────────────────────────────────────────────────────┘
```

### Agent Orchestration Flow

```
User request: "Deploy Qwen2.5-72B on DevServer 116.204.23.145"
         │
         ▼
┌─────────────────────────────────────────────────────────────┐
│ Agent Step 1: SSH Connection                                 │
│   → Call huawei-cloud-ascend-remote-connect                  │
│   → Pass: host, user, password (explicit, Rule 1)            │
└─────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────┐
│ Agent Step 2: NPU Health Check                               │
│   → Call huawei-cloud-ascend-command                         │
│   → Check: NPU list, health, HBM availability                │
│   → Fail if NPU not healthy or insufficient HBM              │
└─────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────┐
│ Agent Step 3: Deploy Model (this skill)                      │
│   → Match model from catalog                                 │
│   → Generate deploy script                                   │
│   → Execute deployment                                        │
│   → Stateless execution (Rule 2)                             │
└─────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────┐
│ Agent Step 4: Monitor NPU                                    │
│   → Call huawei-cloud-ascend-command                         │
│   → Monitor: HBM usage, temperature, processes               │
└─────────────────────────────────────────────────────────────┘
         │
         ▼
      Deployment Complete
```

### Related Skills Table

| Skill | Purpose | Orchestration Stage |
|-------|---------|---------------------|
| `huawei-cloud-ascend-remote-connect` | SSH connection | Pre-deploy: Establish connection to DevServer |
| `huawei-cloud-ascend-command` | NPU management | Pre-deploy: Health check; Post-deploy: Monitoring |

**Note**: No direct calls between Skills. All orchestration by Agent based on user intent (Rule 3).


## Prerequisites

> **Prerequisite check: Ascend 910B series required**
> - Supported: 910B1, 910B2, 910B3, 910B4
> - Unsupported: 910A, 310, 310P, etc.
> - Check with: `npu-smi info`

---

## Mandatory Rules (AI Must Follow)

1. **Never guess commands from memory** — Must read "Deploy Script Auto-Match" section
2. **Must call deploy_helper.py first** — Confirm model category and script URL
3. **Different models use different scripts**:
   - LLM / Embedding / Rerank → `deploy-large-models.sh`
   - VL → `deploy-qwen3-vl-model.sh`
   - OpenSource → `deploy-ai-models.sh`
4. **Must validate before deployment** — Port, NPU, model, card count
5. **Show command and wait for confirmation** — Sensitive operation, never execute directly

---

## Natural Language Understanding Rules

Extract key information from user natural language and assemble commands accurately.

### Operation Type Detection
| Keywords | Operation |
|----------|-----------|
| deploy / start / launch | Single-machine deployment |
| dual-machine / two-node / dual-node | Dual-machine deployment |
| test / inference / call | Test (execute) |
| write command / generate command | Write test command (generate only, no execute) |
| deployment log / view log | View deployment log |
| deployment status / is ready | View deployment status |
| model list / supported models | Show model catalog |
| parameter help / API parameters | Show parameter manual |
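
The keyword table above can be sketched as an ordered lookup. This is a minimal illustration that assumes plain substring matching on lowercased input; the operation labels and the name `detect_operation` are hypothetical and not part of the skill. Order matters because "deployment log" and "dual-machine deploy" both contain "deploy", and "write test command" (the phrasing used in section V) contains "test".

```python
# Ordered from most specific to least specific so that broad keywords
# such as "deploy" and "test" do not shadow the narrower operations.
OPERATION_KEYWORDS = [
    ("write_test_command", ("write command", "generate command", "write test command")),
    ("view_log", ("deployment log", "view log")),
    ("view_status", ("deployment status", "is ready")),
    ("model_list", ("model list", "supported models")),
    ("parameter_help", ("parameter help", "api parameters")),
    ("dual_deploy", ("dual-machine", "two-node", "dual-node")),
    ("test", ("test", "inference", "call")),
    ("single_deploy", ("deploy", "start", "launch")),
]


def detect_operation(text):
    """Return the first operation whose keyword occurs in the text, else None."""
    lowered = text.lower()
    for operation, keywords in OPERATION_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return operation
    return None
```

Substring matching is deliberately crude; a request that matches nothing returns `None`, which is the point at which the Agent would ask the user to rephrase.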

### Information Extraction Rules

**Model Name (fuzzy match, case-insensitive, supports card count filter):**
- "qwen3-14b" → Qwen3-14B
- "qwen3-235b" → Multiple matches, prefer Instruct version (Qwen3-235B-A22B-Instruct-2507), or ask user
- "vl-32b" → Qwen3-VL-32B-Instruct
- "bge-m3" → bge-m3
- "qwen3-vl" + 2 cards → Match VL models with ≤2 cards, list for user to choose
- "qwen3" + 2 cards → Match all Qwen3 models with ≤2 cards, list for user to choose
- Multiple candidates → List all candidates (with card count and category), let user confirm
- No match → Show full model catalog for user to select
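
The matching rules above could be sketched as follows. This is not the implementation in `deploy_helper.py`; it assumes case-insensitive substring matching, uses a small excerpt of the catalog, and reads "prefer Instruct version" as "pick the Instruct variant only when exactly one candidate is an Instruct model". The names `CATALOG`, `match_models`, and `pick_model` are hypothetical.

```python
# Excerpt of the model catalog: name -> (category, minimum cards).
CATALOG = {
    "Qwen3-14B": ("LLM", 1),
    "Qwen3-32B": ("LLM", 2),
    "Qwen3-235B-A22B-Thinking-2507": ("LLM", 16),
    "Qwen3-235B-A22B-Instruct-2507": ("LLM", 16),
    "Qwen3-VL-30B-A3B-Instruct": ("VL", 2),
    "Qwen3-VL-32B-Instruct": ("VL", 2),
    "Qwen3-VL-235B-A22B-Instruct": ("VL", 16),
    "bge-m3": ("Embedding", 1),
}


def match_models(query, max_cards=None):
    """Return catalog names containing the query, optionally limited by card count."""
    needle = query.strip().lower()
    hits = []
    for name, (_category, min_cards) in CATALOG.items():
        if needle not in name.lower():
            continue
        if max_cards is not None and min_cards > max_cards:
            continue
        hits.append(name)
    return hits


def pick_model(query, max_cards=None):
    """Return (chosen, candidates); chosen is None when the user must decide."""
    candidates = match_models(query, max_cards)
    if len(candidates) == 1:
        return candidates[0], candidates
    instruct = [name for name in candidates if "instruct" in name.lower()]
    if len(instruct) == 1:
        return instruct[0], candidates
    return None, candidates
```

When `chosen` is `None` and `candidates` is non-empty, the candidates are listed for the user; when both are empty, the full catalog is shown.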

**Card Count:**
- "2 cards" / "use 2 cards" / "2 npus" → 2
- "16 cards" / "16 npus" → 16
- "dual-machine" → 16
- Not specified → Use minimum card count from model catalog

**Port:**
- "port 8022" / "port:8022" → 8022
- Not specified → Default 8080

**Missing Parameters (check each, prompt what is missing):**
- Missing model name → "Please specify model name" + show model list
- Missing card count → "Please specify card count, e.g.: 2 cards" + show minimum cards for this model
- Missing port → "Please specify port (default 8080), e.g.: port 8001"
- Dual-machine missing head IP → "Please specify head node IP, e.g.: head:192.168.1.1"
- Dual-machine missing worker IP → "Please specify worker node IP, e.g.: worker:192.168.1.2"

**Head/Worker IP (dual-machine deployment):**
- "head:1.1.1.1" / "head node 1.1.1.1" → Head node IP
- "worker:2.2.2.2" / "worker node 2.2.2.2" → Worker node IP
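
A rough sketch of the card count, port, and head/worker IP extraction described above, assuming regular expressions over lowercased input. The function names and the returned dictionary keys are illustrative only. The card count is left out when the user gives none, so the caller can fall back to the catalog minimum.

```python
import re

IPV4 = r"(\d{1,3}(?:\.\d{1,3}){3})"


def extract_params(text):
    """Pull cards, port, and node IPs out of a natural-language request."""
    lowered = text.lower()
    params = {}

    cards = re.search(r"(\d+)\s*(?:cards?|npus?)\b", lowered)
    if cards:
        params["cards"] = int(cards.group(1))
    elif "dual-machine" in lowered:
        params["cards"] = 16  # dual-machine implies 16 cards

    port = re.search(r"\bport\s*:?\s*(\d{1,5})\b", lowered)
    params["port"] = int(port.group(1)) if port else 8080  # documented default

    for role in ("head", "worker"):
        ip = re.search(role + r"(?:\s+node)?\s*:?\s*" + IPV4, lowered)
        if ip:
            params[role + "_ip"] = ip.group(1)
    return params


def missing_dual_machine_params(params):
    """List the node IPs that still have to be requested from the user."""
    return [key for key in ("head_ip", "worker_ip") if key not in params]
```

The IPv4 pattern only checks the shape of the address, not that each octet is at most 255.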

**Prompt:**
- "prompt:hello" / "ask:hello" → Prompt text
- Not specified → LLM default "hello", VL default "describe the image", Embedding default "I love shanghai", Rerank default "What is the capital of France?"

**Image URL (VL test):**
- "image:https://xxx.jpg" / direct URL → Image URL
- User sends image attachment → Auto-convert to base64 data URL
- Not specified and testing multimodal model → Prompt user for image URL

**Multimodal Capability Auto-Detection:**
- VL category → Supports multimodal
- OpenSource: Qwen3.6-35B-A3B, Qwen3.6-27B → Supports multimodal
- LLM category → Text only
- Embedding → Text only
- Rerank → Text only
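
The capability rules above reduce to a small predicate. The sketch below assumes the category has already been resolved from the catalog; `supports_multimodal` is a hypothetical name.

```python
# OpenSource models that the list above marks as multimodal.
MULTIMODAL_OPENSOURCE = {"Qwen3.6-35B-A3B", "Qwen3.6-27B"}


def supports_multimodal(model, category):
    """True when the model accepts image input according to the rules above."""
    if category == "VL":
        return True
    if category == "OpenSource":
        return model in MULTIMODAL_OPENSOURCE
    return False  # LLM, Embedding, and Rerank are text only
```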

**Image URL Conversion (local image → data URL):**
```bash
# Base64-encode without line wrapping (-w 0 is the GNU coreutils flag)
IMG_B64=$(base64 -w 0 "${local_image_path}")
# Adjust the MIME type if the image is not a JPEG
IMG_URL="data:image/jpeg;base64,${IMG_B64}"
```
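
For callers working in Python, the same conversion might look like the sketch below. Unlike the shell snippet, it guesses the MIME type from the file extension instead of assuming JPEG; `to_data_url` is a hypothetical helper name.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    mime, _encoding = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The whole file is read into memory, which is fine for test images but may be worth revisiting for very large files.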

**Advanced Parameters (optional):**
- "max_tokens:64" → max_tokens=64
- "temperature:0.7" → temperature=0.7
- "stream" → stream=true
- "system:You are assistant" → system_prompt
- "disable thinking" / "no thinking" → chat_template_kwargs: {"enable_thinking": false}
- (Default = thinking mode enabled)

**Thinking Mode:**
Qwen3/Qwen3.6 models default to thinking mode, outputting reasoning process before final response.
- Enable thinking: Higher quality, more token consumption
- Disable thinking: Direct output, less token consumption, suitable for simple queries
- Request-level control via `"chat_template_kwargs": {"enable_thinking": false/true}`
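
As an illustration of the advanced parameters and the thinking-mode switch, a request body could be assembled as below. The field names follow the parameters listed in this document; the function `build_chat_payload` and its defaults are assumptions. Optional fields are only added when the user supplies them, so the server defaults (including thinking enabled) stay in effect otherwise.

```python
def build_chat_payload(model, prompt, system_prompt=None, max_tokens=None,
                       temperature=None, stream=False, enable_thinking=None):
    """Build a /v1/chat/completions request body from extracted parameters."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})

    payload = {"model": model, "messages": messages}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if temperature is not None:
        payload["temperature"] = temperature
    if stream:
        payload["stream"] = True
    if enable_thinking is not None:
        # Request-level thinking control, e.g. "disable thinking" -> False.
        payload["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return payload
```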

---

## Supported Machine Types

Only **Ascend 910B series** (910B1 / 910B2 / 910B3 / 910B4). Must check NPU model before deployment, reject non-910B series.

---

## Model Catalog

### Large Language Models (LLM) — Endpoint: /v1/chat/completions
| Model | Min Cards |
|-------|-----------|
| Qwen3-14B | 1 |
| Qwen3-30B-A3B-Instruct-2507 | 2 |
| Qwen3-32B | 2 |
| Qwen3-235B-A22B-Thinking-2507 | 16 |
| Qwen3-235B-A22B-Instruct-2507 | 16 |
| DeepSeek-R1-Distill-Llama-70B | 4 |

### Vision-Language (VL) — Endpoint: /v1/chat/completions
| Model | Min Cards |
|-------|-----------|
| Qwen3-VL-30B-A3B-Instruct | 2 |
| Qwen3-VL-32B-Instruct | 2 |
| Qwen3-VL-235B-A22B-Instruct | 16 |
| Qwen3-VL-235B-A22B-Instruct-W8A8 | 8 |

### Embedding — Endpoint: /v1/embeddings (V0 backend only, single card only)
| Model | Min Cards | Multi-card |
|-------|-----------|------------|
| Qwen3-Embedding-8B | 1 | No |
| bge-large-zh-v1.5 | 1 | No |
| bge-m3 | 1 | No |

### Rerank — Endpoint: /v1/rerank (single card only)
| Model | Min Cards | Multi-card |
|-------|-----------|------------|
| Qwen3-Reranker-8B | 1 | No |
| bge-reranker-v2-m3 | 1 | No |

### OpenSource (Multimodal)
| Model | Min Cards | Capability |
|-------|-----------|------------|
| Qwen3.6-35B-A3B | 2 | Text + Image (MoE) |
| Qwen3.6-27B | 2 | Text + Image (MoE) |
| Qwen3-Next-80B-A3B-Instruct | 4 | Large language model |
| DeepSeek-V4-Flash-w8a8-mtp | 8 | Large language model |

---

## Deploy Script Auto-Match (Must use, never guess script URL)

**Script Path:** `scripts/deploy_helper.py`

**Match Rules (hardcoded, 100% accurate):**

| Model Category | Deploy Script | Notes |
|----------------|---------------|-------|
| LLM | `deploy-large-models.sh` | Shared with Embedding/Rerank |
| Embedding | `deploy-large-models.sh` | Same as above |
| Rerank | `deploy-large-models.sh` | Same as above |
| VL | `deploy-qwen3-vl-model.sh` | Multimodal specific |
| OpenSource | `deploy-ai-models.sh` | OpenSource specific |
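
The table is effectively a fixed lookup. The sketch below mirrors it for illustration only; it is not the code in `deploy_helper.py`, and `script_for` is a hypothetical name. In practice the script must still come from `deploy_helper.py match`.

```python
# Category -> deploy script, copied from the match rules table.
DEPLOY_SCRIPTS = {
    "LLM": "deploy-large-models.sh",
    "Embedding": "deploy-large-models.sh",
    "Rerank": "deploy-large-models.sh",
    "VL": "deploy-qwen3-vl-model.sh",
    "OpenSource": "deploy-ai-models.sh",
}


def script_for(category):
    """Return the deploy script for a category, refusing unknown categories."""
    try:
        return DEPLOY_SCRIPTS[category]
    except KeyError:
        raise ValueError(f"unknown model category: {category}") from None
```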

**Usage:**

```bash
# Match model (returns category, script URL, min cards, etc.)
python3 scripts/deploy_helper.py match <model_name>

# Generate deploy command directly
python3 scripts/deploy_helper.py command <model_name> <cards> <port>

# List all models (optional category filter)
python3 scripts/deploy_helper.py list [LLM|VL|Embedding|Rerank|OpenSource]
```

**AI must call `deploy_helper.py match` first to confirm category and script, then use returned `deploy_url` to assemble command. Never guess from memory!**

---

## Core Commands

Core commands for model deployment and testing. See [Operation Flow](#operation-flow) for detailed steps.

| Command | Description |
|---------|-------------|
| `deploy <model> <port>` | Deploy model on single machine |
| `deploy <model> <port> <cards>` | Deploy with specified card count |
| `dual-machine deploy <model> head:<IP> worker:<IP> port:<PORT>` | Deploy on dual-machine cluster |
| `test <model> <port>` | Test model inference |
| `deployment log` | View deployment log |
| `deployment status` | Check deployment status |
| `model list` | Show supported models |

## Operation Flow

### I. Deployment

#### 1. Pre-deployment Check (Must execute every time, cannot skip)

Check in order, stop if any fails:

1. **NPU Model Check** — Agent calls `huawei-cloud-ascend-command` to check chip model, reject non-910B series
2. **NPU Card Count Check** — Agent calls `huawei-cloud-ascend-command` to check available cards, confirm >= required cards
3. **User Card Count Check** — User-specified cards must be >= minimum and within supported range (1,2,4,8,16)
4. **Embedding/Rerank Single Card Check** — Embedding and Rerank only support single card, reject multi-card
5. **Port Occupancy Check** — Agent calls `huawei-cloud-ascend-remote-connect` to run `ss -tlnp | grep ":${port}"`, notify if occupied
6. **SSH Connectivity Check** — For dual-machine, verify both head and worker nodes are SSH accessible
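
The card-related checks (items 2 to 4) can be expressed as a single validation step. This sketch assumes the available card count has already been obtained through `huawei-cloud-ascend-command`; the function name and the error wording are illustrative.

```python
SUPPORTED_CARD_COUNTS = (1, 2, 4, 8, 16)
SINGLE_CARD_CATEGORIES = {"Embedding", "Rerank"}


def validate_cards(category, min_cards, requested, available):
    """Return a list of problems; an empty list means the card checks pass."""
    errors = []
    if requested not in SUPPORTED_CARD_COUNTS:
        errors.append(f"{requested} is not a supported card count {SUPPORTED_CARD_COUNTS}")
    if requested < min_cards:
        errors.append(f"model needs at least {min_cards} cards, got {requested}")
    if category in SINGLE_CARD_CATEGORIES and requested != 1:
        errors.append(f"{category} models support a single card only")
    if available < requested:
        errors.append(f"only {available} cards available, {requested} requested")
    return errors
```

Collecting all problems instead of stopping at the first one lets the Agent report everything the user has to fix in one message; deployment still stops if the list is non-empty.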

#### 2. Single-machine Deployment

User says: "deploy model_name port XXXX" or "deploy model_name port XXXX N cards"

**Before deploying, must SSH execute `mkdir -p /home/modelarts-agent` to ensure directory exists.**

**LLM / Embedding / Rerank Command Template:**
```bash
nohup bash -c 'export model_name=${model} && export required_cards=${cards} && export port=${port} && wget -P /home/modelarts-agent/ https://documentation-samples-17.obs.cn-north-9.myhuaweicloud.com/solution-as-code-publicbucket/solution-as-code-module/quickly-deploy-llm-on-modelarts-lite-devserver/userdata/deploy-large-models/single-machine/deploy-large-models.sh && chmod 755 /home/modelarts-agent/deploy-large-models.sh && sh /home/modelarts-agent/deploy-large-models.sh ${model} ${cards} ${port}' > /home/modelarts-agent/deploy_${model}.log 2>&1 &
```

**VL Multimodal Command Template:**
```bash
nohup bash -c 'export model_name=${model} && export required_cards=${cards} && export port=${port} && wget -P /home/modelarts-agent/ https://documentation-samples-17.obs.cn-north-9.myhuaweicloud.com/solution-as-code-publicbucket/solution-as-code-module/quickly-deploy-llm-on-modelarts-lite-devserver/userdata/deploy-vl-model/single-machine/deploy-qwen3-vl-model.sh && chmod 755 /home/modelarts-agent/deploy-qwen3-vl-model.sh && sh /home/modelarts-agent/deploy-qwen3-vl-model.sh ${model} ${cards} ${port}' > /home/modelarts-agent/deploy_${model}.log 2>&1 &
```

**OpenSource Command Template:**
```bash
nohup bash -c 'export model_name=${model} && export required_cards=${cards} && export port=${port} && wget -P /home/modelarts-agent/ https://documentation-samples-17.obs.cn-north-9.myhuaweicloud.com/solution-as-code-publicbucket/solution-as-code-module/quickly-deploy-llm-on-modelarts-lite-devserver/userdata/deploy-large-models/single-machine/open_source/deploy-ai-models.sh && chmod 755 /home/modelarts-agent/deploy-ai-models.sh && sh /home/modelarts-agent/deploy-ai-models.sh ${model} ${cards} ${port}' > /home/modelarts-agent/deploy_${model}.log 2>&1 &
```

#### 3. Dual-machine Deployment

User says: "dual-machine deploy model_name head:IP worker:IP port XXXX"

**Before dual-machine deploy, both head and worker nodes need `mkdir -p /home/modelarts-agent`.**

**Head Node Command Template:**
```bash
nohup bash -c 'export ray_head_ip=${head_ip} && export model_name=${model} && export port=${port} && wget -P /home/modelarts-agent/ https://documentation-samples-17.obs.cn-north-9.myhuaweicloud.com/solution-as-code-publicbucket/solution-as-code-module/quickly-deploy-llm-on-modelarts-lite-devserver/userdata/deploy-large-models/dual-machine/qwen3-235b-a22b.sh && chmod 755 /home/modelarts-agent/qwen3-235b-a22b.sh && sh /home/modelarts-agent/qwen3-235b-a22b.sh head ${head_ip} ${model} ${port}' > /home/modelarts-agent/deploy_${model}_head.log 2>&1 &
```

**Worker Node Command Template:**
```bash
nohup bash -c 'export ray_head_ip=${head_ip} && export model_name=${model} && export port=${port} && wget -P /home/modelarts-agent/ https://documentation-samples-17.obs.cn-north-9.myhuaweicloud.com/solution-as-code-publicbucket/solution-as-code-module/quickly-deploy-llm-on-modelarts-lite-devserver/userdata/deploy-large-models/dual-machine/qwen3-235b-a22b.sh && chmod 755 /home/modelarts-agent/qwen3-235b-a22b.sh && sh /home/modelarts-agent/qwen3-235b-a22b.sh worker ${head_ip} ${model} ${port}' > /home/modelarts-agent/deploy_${model}_worker.log 2>&1 &
```


**VL Dual-machine Deployment:**

For VL models (Qwen3-VL-235B-A22B-Instruct, etc.), use the following scripts:

**VL Head Node Command:**
```bash
nohup bash -c 'export ray_head_ip=${head_ip} && export model_name=${model} && export port=${port} && wget -P /home/modelarts-agent/ https://documentation-samples-17.obs.cn-north-9.myhuaweicloud.com/solution-as-code-publicbucket/solution-as-code-module/quickly-deploy-llm-on-modelarts-lite-devserver/userdata/deploy-vl-model/dual-machine/qwen3-vl-235b-a22b.sh && chmod 755 /home/modelarts-agent/qwen3-vl-235b-a22b.sh && sh /home/modelarts-agent/qwen3-vl-235b-a22b.sh head ${head_ip} ${model} ${port}' > /home/modelarts-agent/deploy_${model}_head.log 2>&1 &
```


**VL Worker Node Command:**
```bash
nohup bash -c 'export ray_head_ip=${head_ip} && export model_name=${model} && export port=${port} && wget -P /home/modelarts-agent/ https://documentation-samples-17.obs.cn-north-9.myhuaweicloud.com/solution-as-code-publicbucket/solution-as-code-module/quickly-deploy-llm-on-modelarts-lite-devserver/userdata/deploy-vl-model/dual-machine/qwen3-vl-235b-a22b.sh && chmod 755 /home/modelarts-agent/qwen3-vl-235b-a22b.sh && sh /home/modelarts-agent/qwen3-vl-235b-a22b.sh worker ${head_ip} ${model} ${port}' > /home/modelarts-agent/deploy_${model}_worker.log 2>&1 &
```


#### 4. Deployment Confirmation Flow

**Sensitive operation, must show full command and wait for user "confirm" before executing.**

After deploy command sent:
1. Notify user: Ready, starting deployment of ${model}, log at `/home/modelarts-agent/deploy_${model}.log`
2. **Check log every 2 minutes**, report progress (loading weights, Dynamo compiling, service starting, etc.)
3. When port is listening, notify deployment success
4. **Deployment failure handling (strict compliance):**
   - Deployment failed = Report failure reason, no automatic retry
   - Never auto-change image and retry
   - Never auto-modify parameters and retry
   - Never try other deployment methods
   - Only report error, let user decide next step
5. **Output API sample** for user:

```
Deployment successful! ${model} is ready

Service URL: http://${IP}:${PORT}/v1/chat/completions

Example request:
curl -X POST http://${IP}:${PORT}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"${model}","messages":[{"role":"user","content":"hello"}],"max_tokens":256}'

Multimodal request (if supported):
curl -X POST http://${IP}:${PORT}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"${model}","messages":[{"role":"user","content":[{"type":"image_url","image_url":{"url":"${image_url}"}},{"type":"text","text":"describe the image"}]}],"max_tokens":512}'
```
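
The progress-check loop in step 2 might be sketched as below. It only polls and reports; in line with the failure-handling rules it never retries or alters the deployment. The `max_checks` limit, the injected `is_ready` callback, and the function name are assumptions made for this illustration.

```python
import time


def wait_for_service(is_ready, interval_seconds=120, max_checks=30,
                     sleep=time.sleep, report=print):
    """Poll until is_ready() returns True; give up after max_checks polls."""
    for attempt in range(1, max_checks + 1):
        if is_ready():
            report(f"check {attempt}: port is listening, deployment succeeded")
            return True
        report(f"check {attempt}: service not ready yet")
        if attempt < max_checks:
            sleep(interval_seconds)  # 2 minutes between checks by default
    return False
```

A `False` result means the limit was reached without the port opening; what happens next is the user's decision.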

---

### II. Deployment Log

User says: "deployment log model_name"

Agent uses `huawei-cloud-ascend-remote-connect` to execute:
```bash
tail -50 /home/modelarts-agent/deploy_${model}.log
```

---

### III. Deployment Status

User says: "deployment status port XXXX"

Agent uses `huawei-cloud-ascend-remote-connect` to execute:
```bash
ss -tlnp | grep ":${port}"
```

Port listening = Service ready for testing.
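
Readiness can also be probed from the client side with a plain TCP connection. This is a complementary sketch, not a replacement for the `ss` check on the server: a firewall or security group between client and DevServer can make a listening port look closed. `is_port_listening` is a hypothetical helper name.

```python
import socket


def is_port_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```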

---

### IV. Test (Execute)

User says: "test model_name prompt:xxx" or "test model_name image:URL"

**Test flow (strict compliance):**
1. **Show full curl command** for user to review
2. Wait for user "confirm" or "send" before executing
3. **Structured result output:**

```
Test Result

| Field | Value |
|-------|-------|
| id | chatcmpl-xxx |
| model | Qwen3-VL-32B-Instruct |
| prompt_tokens | 93 |
| completion_tokens | 400 |
| total_tokens | 493 |
| finish_reason | stop |

Model Response:
[Extract full content, no truncation]

Raw Response:
[Full JSON, no truncation]
```
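
The structured output above could be produced from the raw JSON roughly as follows. The sketch assumes an OpenAI-style chat completion body with `usage` and `choices[0].message.content`, which is what the fields in the sample table suggest; Embedding and Rerank responses would need their own handling. `summarize_response` is a hypothetical name.

```python
def summarize_response(response):
    """Return (markdown_table, model_text) for a chat completion response."""
    usage = response.get("usage", {})
    choice = response["choices"][0]
    rows = [
        ("id", response.get("id")),
        ("model", response.get("model")),
        ("prompt_tokens", usage.get("prompt_tokens")),
        ("completion_tokens", usage.get("completion_tokens")),
        ("total_tokens", usage.get("total_tokens")),
        ("finish_reason", choice.get("finish_reason")),
    ]
    lines = ["| Field | Value |", "|-------|-------|"]
    lines += [f"| {field} | {value} |" for field, value in rows]
    # The full content is returned untruncated, as the test flow requires.
    return "\n".join(lines), choice["message"]["content"]
```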

#### LLM Chat Completions
```bash
curl -s -X POST http://${IP}:${PORT}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"${model}","messages":[{"role":"user","content":"${prompt}"}],"max_tokens":1024,"temperature":0.7}'
```

#### Multimodal VL
```bash
curl -s -X POST http://${IP}:${PORT}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"${model}","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":[{"type":"image_url","image_url":{"url":"${image_url}"}},{"type":"text","text":"${prompt}"}]}],"max_tokens":512,"temperature":0.7}'
```

#### Embedding
```bash
curl -s -X POST http://${IP}:${PORT}/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model":"${model}","input":"${text}"}'
```

#### Rerank
```bash
curl -s -X POST http://${IP}:${PORT}/v1/rerank \
  -H 'Content-Type: application/json' \
  -d '{"model":"${model}","query":"${query}","documents":["${doc1}","${doc2}"]}'
```

---

### V. Write Test Command (Generate Only)

User says: "write test command model_name prompt:xxx"

Same logic as "test", but **only output command text, no execution**.
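
One way to generate the command text without executing it is to serialise the body with `json.dumps` and quote each argument with `shlex.quote`, so prompts containing quotes or spaces survive the shell. This is a sketch under those assumptions; `build_curl` is a hypothetical helper, and the endpoint paths are the ones listed in this document.

```python
import json
import shlex


def build_curl(host, port, endpoint, payload):
    """Return a curl command string for the given endpoint and JSON payload."""
    url = f"http://{host}:{port}{endpoint}"
    body = json.dumps(payload, ensure_ascii=False)
    args = [
        "curl", "-s", "-X", "POST", url,
        "-H", "Content-Type: application/json",
        "-d", body,
    ]
    # Quote every argument so the command is safe to paste into a shell.
    return " ".join(shlex.quote(arg) for arg in args)
```

For example, `build_curl("10.0.0.1", 8080, "/v1/rerank", {...})` yields text that can be shown to the user for review before anything is sent.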

---

## API Parameter Manual

### LLM Parameters (/v1/chat/completions)

| Parameter | Required | Default | Description |
|-----------|:--------:|---------|-------------|
| model | Yes | — | Model name, same as deployment |
| messages | Yes | — | Message list, each with role and content |
| max_tokens | No | 16 | Max generation tokens |
| temperature | No | 1.0 | Sampling randomness, 0=greedy |
| top_p | No | 1.0 | Nucleus sampling threshold |
| top_k | No | -1 | Only consider top-K tokens |
| stream | No | false | Streaming output (SSE) |
| chat_template_kwargs | No | {} | Template params, e.g. {"enable_thinking": false} |

### VL Extra Parameters
| Parameter | Description |
|-----------|-------------|
| content[] | Array format: image_url object + text object |
| detail | Image precision: auto/high/low |

### Embedding Parameters (/v1/embeddings)
| Parameter | Required | Description |
|-----------|:--------:|-------------|
| model | Yes | Model name |
| input | Yes | String or string list |
| encoding_format | No | float/base64 |

### Rerank Parameters (/v1/rerank)
| Parameter | Required | Description |
|-----------|:--------:|-------------|
| model | Yes | Model name |
| query | Yes | Query text |
| documents | Yes | Document list to rerank |
| top_n | No | Return top N |

---

## Execution Mode

This skill operates in **stateless mode** (Rule 2). All context (host, credentials, model info) must be explicitly passed by Agent (Rule 1).

### Prerequisites (Agent orchestrated)

Before calling this skill, Agent MUST:

1. **Establish SSH connection** using `huawei-cloud-ascend-remote-connect`
   - Agent receives: host, port, user, password from user
   - Agent validates connection is successful

2. **Check NPU status** using `huawei-cloud-ascend-command`
   - Agent checks: NPU health, HBM availability
   - Agent validates: sufficient cards for model deployment

### Skill Execution

This skill receives explicit parameters from Agent:

```bash
# Model matching (local operation)
python3 scripts/deploy_helper.py match <model_name>

# Script URL generation (local operation)
python3 scripts/deploy_helper.py script <model_name>

# Deploy command generation (local operation)
python3 scripts/deploy_helper.py command <model> <cards> <port>
```

### Remote Deployment Execution

Agent executes deployment commands on remote server:

```bash
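# Agent uses SSH to execute the generated deploy command on the DevServer.
# The command below is a placeholder; the real command comes from the templates in Operation Flow.
ssh root@<host> "cd /path/to/model && bash deploy.sh"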
```

### Post-Deployment (Agent orchestrated)

After deployment, Agent calls `huawei-cloud-ascend-command` to:
- Monitor NPU HBM usage
- Check deployment process status
- Verify model endpoint is responding

### Parameter Flow

```
User Input                    Agent                      Skills
    │                          │                            │
    │ host, password           │                            │
    ├─────────────────────────▶│                            │
    │                          │ SSH connect                │
    │                          ├───────────────────────────▶│
    │                          │                            │ (remote-connect)
    │                          │◀───────────────────────────┤
    │                          │                            │
    │                          │ NPU check                  │
    │                          ├───────────────────────────▶│
    │                          │                            │ (ascend-command)
    │                          │◀───────────────────────────┤
    │                          │                            │
    │ model_name, cards        │                            │
    ├─────────────────────────▶│                            │
    │                          │ match model                │
    │                          ├───────────────────────────▶│
    │                          │                            │ deploy_helper.py
    │                          │◀───────────────────────────┤
    │                          │                            │
    │                          │ execute deploy             │
    │                          ├───────────────────────────▶│
    │                          │                            │ (via SSH)
    │                          │◀───────────────────────────┤
    │                          │                            │
    │                          │ monitor NPU                │
    │                          ├───────────────────────────▶│
    │                          │                            │ (ascend-command)
    │                          │◀───────────────────────────┤
    │                          │                            │
    ▼                          ▼                            ▼
```

**Note**: No direct skill-to-skill calls. All orchestration by Agent (Rule 3).

---

## References

| Document | Description |
|----------|-------------|
| [task-deploy-model.md](references/task-deploy-model.md) | Deployment task steps |
| [task-test-model.md](references/task-test-model.md) | Testing task steps |
| [model-catalog.md](references/model-catalog.md) | Complete model catalog |
| [api-parameters.md](references/api-parameters.md) | API parameter reference |
| [prerequisites.md](references/prerequisites.md) | Prerequisites checklist |
| [verification-method.md](references/verification-method.md) | Verification steps |
| [troubleshooting.md](references/troubleshooting.md) | Troubleshooting guide |
| [scripts/deploy_helper.py](scripts/deploy_helper.py) | Model matching helper |