Convert natural language questions into safe executable SQL to query Ascend PyTorch Profiler / msprof database for operator time consumption, communication,...
SKILL.md

---
name: huawei-cloud-ascend-profiler-db-explorer
description: |
  Convert natural language questions into safe executable SQL to query
  Ascend PyTorch Profiler / msprof database for operator time consumption,
  communication, dispatch, and other performance data. Supports table schema
  extraction from official documentation.
  Use this skill when the user wants to: (1) analyze Ascend profiling database,
  (2) query operator performance data, (3) analyze communication and dispatch
  bottlenecks, (4) check table schema for profiling data.
  Trigger: user mentions "profiler db", "sqlite", "sql", "table", "schema",
  "ascend-pytorch-profiler", "msprof", "operator time", "communication time",
  "dispatch analysis", "性能分析", "算子耗时", "数据库查询", "性能数据",
  "性能瓶颈"
compatibility:
  - msprof >= 7.0.0
  - sqlite3 >= 3.0.0
tags: [Ascend, profiler, SQL, database]
allowed-tools:
  - python3
  - sqlite3
---

# Huawei Cloud Ascend Profiler DB Explorer

## Overview

This skill converts natural language questions about profiling data into safe
SQL queries for Ascend PyTorch Profiler and msprof databases.

**Architecture**: Natural Language Input → Intent Recognition → SQL Generation
→ Database Execution → Result Analysis

**Related Skills**:

- `huawei-cloud-msot-msopprof-operator-profiler` - Operator performance data
  collection
- `huawei-cloud-ascend-small-model-migrate` - Migration workflow that uses
  profiling analysis
- `huawei-cloud-ascendc-operator-performance-optim` - Operator optimization
  workflow

## Architecture Components

This skill involves the following cloud services and components:

- **MSProf**: Ascend profiling tool for data collection and database management
- **SQLite**: Database engine for storing profiling data
- **Ascend NPU**: Target hardware for performance profiling
- **msprof_mcp**: Tool for executing SQL queries on profiling database

**Architecture Diagram:**

```text
┌─────────────────────────────────────────────────────────────┐
│              Profiler DB Explorer Skill                    │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐ │
│  │  Natural     │───▶│  SQL         │───▶│  Database    │ │
│  │  Language    │    │  Generation  │    │  Execution   │ │
│  │  Input       │    │              │    │              │ │
│  └──────────────┘    └──────────────┘    └──────────────┘ │
│         │                   │                   │          │
│         ▼                   ▼                   ▼          │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐ │
│  │  Intent      │    │  CTE Macro   │    │  Result      │ │
│  │  Recognition │    │  Templates   │    │  Analysis    │ │
│  └──────────────┘    └──────────────┘    └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
```

## Use Cases

**Typical Problem Scenarios:**

- Analyzing operator time consumption on Ascend NPU
- Identifying communication bottlenecks in distributed training
- Understanding framework dispatch overhead
- Querying profiling database for performance insights
- Debugging performance issues in model inference

**Typical User Phrases:**

- "Which operators are most time-consuming?"
- "Query Top 20 operators by execution time"
- "Analyze HCCL communication time"
- "Check PyTorch vs CANN dispatch time difference"
- "Show me the table schema for operator data"
- "Operator?"
- "AnalysisOperatorPerformance"
- "QueryprofilerDatabase"

## Skill Objectives

- **Convert natural language questions to SQL drafts**: Quickly construct safe,
  readable profiling queries based on preset CTE macros and dictionary rules.
- **Unified entry**: For any question involving "operator time",
  "communication time", "dispatch analysis", or any specific profiling DB query,
  **must first and only trigger this skill**.
- **Avoid ad-hoc SQL**: Never write SQL or modify macro internal JOIN logic
  without reading this document.

You should always organize analysis output in the structure of
"Question → Evidence → Suggestion" rather than describing what operations
you performed.

## Role Positioning

You are an **Ascend Profiling Database Query and SQL Design Expert**,
responsible for:

- Understanding user's performance problem intent
  (operator/communication/dispatch, etc.).
- Selecting appropriate query channel (Track A / Track B).
- Constructing SQL drafts based on preset CTE macros or dictionary information.
- Calling database execution tools and outputting clear performance diagnosis
  conclusions based on query results.

## Usage Scenarios

Prioritize calling this skill in following scenarios:

- User asks "which operators are most time-consuming", "TopK operators",
  "computation bottlenecks".
- User concerned about "HCCL/collective communication time",
  "AllReduce/AllGather time".
- User needs to analyze time differences between "PyTorch framework dispatch
  vs CANN dispatch vs device execution".
- Any query requiring direct access to profiling database tables or views.

## Trigger Words (Recall Enhancement)

When user's question contains following words or similar expressions,
prioritize triggering this skill:

- `ascend-pytorch-profiler-db` / `ascend_pytorch_profiler*.db` / `msprof_*.db`
- `sqlite` / `table` / `schema` / `field`
- `TopK operators` / `communication time` / `dispatch analysis` /
  `scheduling bottleneck`

## Mandatory Restrictions

- Main query must satisfy at least one of following:
  - Contains aggregation functions (e.g., `SUM`, `AVG`, `COUNT`, etc.), OR
  - Explicitly includes `ORDER BY ... LIMIT 20` (or smaller LIMIT).
- Only call `execute_sql_to_csv` tool provided by `msprof_mcp` when user
  indicates output to file, allowing full table scan.
- In this skill, table structure description should be obtained through
  `scripts/get_schema.py` first; only use `PRAGMA table_info(TABLE)` as
  supplement when no table information in documentation, but should not be
  used as regular means.

## Track A: Golden Views / CTE Macros (Priority)

When handling any profiling database query, must first try **Track A (fast path)**:

1. **Intent Matching**
   - Determine if user intent belongs to: **operator computation /
     collective communication / framework dispatch**.
   - If belongs to any of above, **absolutely forbidden to query underlying
     dictionary or randomly construct JOINs**.

2. **Extract Macro (CTE)**
   - From "CTE Macro Definitions" below, **copy corresponding `WITH` statement
     block verbatim** to SQL beginning.
   - Never modify JOIN logic and field expressions inside macros.

3. **Concatenate Main Query**
   - After copied `WITH ... AS (...)`, write `SELECT` query for corresponding
     view (e.g., `compute_view`, `comm_view`, `dispatch_view`).
   - Example: `SELECT op_name, SUM(duration_ns) AS total_ns FROM compute_view
     GROUP BY op_name ORDER BY total_ns DESC LIMIT 20;`

## Track B: Underlying Documentation / profiler_db_data_format.md

Only enter Track B when one of following conditions met:

- User explicitly requests querying underlying hardware metrics
  (e.g., PMU counts, memory allocation, Step division, etc.).
- Requirement not covered by existing views in "CTE Macro Definitions".

Core tool for Track B is `scripts/get_schema.py` under current skill path,
with information source from `references/profiler_db_data_format.md`.

### 1. Get Real Table Names from Current DB (Recommended)

First execute sqlite query on target db to get actual tables present in
current version:

```bash
sqlite3 {db_path} ".tables"
sqlite3 {db_path} "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
```

> Note: This step only used to get "which tables actually exist in current DB",
> not for field-level schema parsing. For field descriptions, use
> `get_schema.py --table_name`.

### 2. Use Script for Document/DB Alignment (Recommended)

- **Purpose**: Automatically list document table names, current DB table names,
  or directly do intersection comparison to reduce manual filtering.
- **Command line examples**:

```bash
cd {skills_path}/huawei-cloud-ascend-profiler-db-explorer/scripts
python3 get_schema.py --list_tables
python3 get_schema.py --db_path {db_path} --list_db_tables
python3 get_schema.py --db_path {db_path} --compare_doc_db
```

### 3. get_schema_by_table_name(table_name)

- **Purpose**: Extract corresponding section (fields, format, description, etc.)
  for the table from `profiler_db_.md` by table name.
- **Parameter meaning**:
  - `table_name`: Table name (recommend using table names from sqlite query
    results first).
- **MCP calling convention** (recommend encapsulating as independent tool
  in upper layer):
  - Tool name example: `get_schema_by_table_name`
  - Input example: `{"table_name": "TASK"}`.
- **Command line examples**:

```bash
cd {skills_path}/huawei-cloud-ascend-profiler-db-explorer/scripts
python3 get_schema.py --table_name TASK
python3 get_schema.py --table_name COMMUNICATION_OP
```

Returns original description paragraph for the table from reference
documentation.

### Track B Usage Principles

1. Use real table names from sqlite query first, then call
   `get_schema.py --table_name` to get official documentation description
   for that table.
2. When table not found in documentation, should prioritize suspecting
   "version difference" or "insufficient collection configuration" rather
   than guessing field semantics.
3. **Forbidden** to directly execute `PRAGMA table_info(TABLE)` as schema
   source; if model wants to view table fields, must call `get_schema.py`
   instead.

## Execution and Summary

- **Execution**: After assembling SQL, call `execute_sql` or `execute_sql_to_csv`
  tool provided by `msprof_mcp` to execute query.
- **Summary output**:
  - Display final executed SQL, number of returned rows, and first few rows
    of results.

## CTE Macro Definitions (Must Reuse in Track A)

[Highest Warning] Below are macro blocks (CTE) dedicated to Ascend Profiling.
In Track A:

- Must **completely copy** corresponding macro code block as `WITH` header
  of SQL.
- Never modify JOIN, field meaning, or computation logic inside macros.

### 1. Operator Computation Detail Macro (Compute Macro)

**Purpose**: Query operator time consumption, TopK operators, computation
bottlenecks.

```sql
WITH compute_view AS (
    SELECT c.globalTaskId, ROUND(t.endNs - t.startNs) AS duration_ns,
           n.value AS op_name, type_str.value AS op_type
    FROM COMPUTE_TASK_INFO c
    LEFT JOIN TASK t ON t.globalTaskId = c.globalTaskId
    LEFT JOIN STRING_IDS n ON n.id = c.name
    LEFT JOIN STRING_IDS type_str ON type_str.id = c.opType
)
```

### 2. Communication Detail Macro (Communication Macro)

**Purpose**: Query HCCL collective communication (AllReduce, AllGather, etc.)
time.

```sql
WITH comm_view AS (
    SELECT ROUND(c.endNs - c.startNs) AS duration_ns, n.value AS op_name,
           t.value AS op_type, g.value AS group_name
    FROM COMMUNICATION_OP c
    LEFT JOIN STRING_IDS n ON n.id = c.opName
    LEFT JOIN STRING_IDS t ON t.id = c.opType
    LEFT JOIN STRING_IDS g ON g.id = c.groupName
)
```

### 3. Dispatch Mapping Macro (Dispatch Macro)

**Purpose**: Compare time differences between PyTorch framework dispatch,
CANN layer dispatch, and underlying execution to locate scheduling congestion.

```sql
WITH dispatch_view AS (
    SELECT
        ROUND(t.endNs - t.startNs) AS task_duration_ns,
        ROUND(c.endNs - c.startNs) AS cann_duration_ns,
        ROUND(p.endNs - p.startNs) AS pytorch_duration_ns,
        c_str.value AS cann_api_name,
        p_str.value AS pytorch_api_name,
        t_str.value AS task_type
    FROM TASK t
    LEFT JOIN CANN_API c ON t.connectionId = c.connectionId
    LEFT JOIN CONNECTION_IDS conn ON conn.connectionId = t.connectionId
    LEFT JOIN PYTORCH_API p ON p.connectionId = conn.id
    LEFT JOIN STRING_IDS c_str ON c.name = c_str.id
    LEFT JOIN STRING_IDS p_str ON p.name = p_str.id
    LEFT JOIN STRING_IDS t_str ON t.taskType = t_str.id
)
```

## Enhanced Features

### Intelligent Bottleneck Diagnoser

This skill includes an AI-powered bottleneck diagnosis system that analyzes
profiling data to identify root causes automatically:

**Features:**

- **Automatic Root Cause Analysis**: Identifies performance bottlenecks from
  profiling data
- **Bottleneck Classification**: Categorizes bottlenecks into memory-bound,
  compute-bound, communication-bound, or operator-fallback types
- **Actionable Recommendations**: Provides prioritized optimization
  recommendations
- **Pattern Matching**: Detects known performance anti-patterns and suggests
  fixes
- **Impact Assessment**: Estimates potential performance improvement from
  each optimization

**Bottleneck Classification:**

| Category | Characteristics | Causes | Strategy |
| -------- | ---------------- | ------ | -------- |
| Memory-bound | High memory bandwidth | TransData ops | Reduce transfer |
| Compute-bound | High AI_CORE util | Large matmul | Optimize ops |
| Comm-bound | HCCL ops significant | Inefficient coll | Optimize comm |
| Operator-fallback | AI_CPU execution | Missing NPU impl | AscendC ops |

**Bottleneck Diagnosis Output:**

```markdown
## Intelligent Bottleneck Diagnosis Report

### Overall Performance Summary
- Total Inference Time: 15.2 ms
- Bottleneck Score: 78/100
- Main Bottleneck Type: Memory-bound

### Identified Bottlenecks
| Rank | Operator | Type | Time | Percentage | Issue |
|------|----------|------|------|------------|-------|
| 1 | TransData | AI_CPU | 4.2 ms | 27.6% | Frequent CPU-NPU transfer |
| 2 | IndexSelect | AI_CPU | 2.8 ms | 18.4% | Operator fallback to CPU |
| 3 | NMS | AI_CPU | 1.5 ms | 9.9% | No NPU implementation |

### Optimization Recommendations
| Priority | Operator | Issue | Solution | Expected Gain |
|----------|----------|-------|----------|---------------|
| P0 | TransData | Data transfer | Reduce redundant movement | 20-25% |
| P1 | IndexSelect | CPU fallback | Implement AscendC version | 15-20% |
| P2 | NMS | CPU fallback | Use NPU-optimized NMS | 10-15% |

### Quick Wins
1. Batch pre-processing on NPU instead of CPU
2. Use async data transfer with overlap
3. Enable memory pooling for intermediate tensors
```

## Reference Documents

| Document | Description |
| -------- | ----------- |
| Profiler DB Data Format | Table structure |
| Acceptance Criteria | Acceptance criteria |
| Verification Method | Verification approach |
| Troubleshooting | Common issues |

## Prerequisites

- msprof >= 7.0.0 installed
- sqlite3 >= 3.0.0 installed
- Have Ascend PyTorch Profiler or msprof generated database file

## Core Commands

```bash
# Query operator time consumption
python3 scripts/query_profiler_db.py \
  --db /path/to/ascend_pytorch_profiler.db \
  --query "Top 10 operators by time consumption"
```

## Parameter Confirmation

| Parameter | Description | Required |
| --------- | ----------- | -------- |
| db | Profiler database path | Yes |
| query | Natural language query | Yes |
| output | Output format | No |
huawei-cloud-ascend-profiler-db-explorer

SKILL.md

related skills