---
name: amg-check-cosmosdb-mongo-ru
description: Run only when the user explicitly asks for a fleet-wide Cosmos DB for MongoDB (RU) health check — scans NormalizedRU consumption, service availability, server-side latency, throttling (429s), and replication metrics across all accounts, then deep-dives into abnormal accounts with resource logs and correlation analysis. Tracks known issues across sessions via persistent report. Uses AMG-MCP pulse check for Tier 1 triage, then batched Azure Monitor queries for Tier 2 investigation. On first run, auto-discovers datasource UID and prompts for subscription ID.
argument-hint: "[time-range, e.g. 7d, 1d, 3d] [subscription-id]"
disable-model-invocation: true
effort: max
allowed-tools: mcp__amg__amgmcp_pulse_check mcp__amg__amgmcp_query_resource_graph mcp__amg__amgmcp_query_resource_metric mcp__amg__amgmcp_query_resource_metric_definition mcp__amg__amgmcp_query_resource_log mcp__amg__amgmcp_datasource_list mcp__amg__amgmcp_query_activity_log Bash(node *) Glob Read Write Edit
---

<!-- Auto-generated for OpenClaw by pack-openclaw. Notes for OpenClaw users:
     - Claude Code dynamic expressions (!`...`) in this file are NOT evaluated by OpenClaw
       and appear as literal text. Run them manually at the start of the workflow.
     - Invoke this skill only via slash command (e.g. /amg-check-cosmosdb-mongo-ru). Auto-invocation is
       disabled on Claude Code but not on OpenClaw. -->

## OpenClaw Setup (one-time)

This skill calls MCP tools prefixed with `mcp__amg__*`, so OpenClaw must have an MCP server registered under the exact name **`amg`**. Run this once per workspace before invoking the skill:

```bash
openclaw mcp set amg '{"url":"https://<your-grafana-instance>/api/azure-mcp","transport":"streamable-http","headers":{"Authorization":"Bearer <your-token>"}}'
```

Replace `<your-grafana-instance>` with your Azure Managed Grafana endpoint and `<your-token>` with a valid Grafana service-account token (starts with `glsa_`). The server name **must** be `amg` — the skill's `allowed-tools` reference `mcp__amg__*` and will not find tools under any other name.

Verify the server is registered:

```bash
openclaw mcp list
```

> Official skill source: https://github.com/Azure/amg-skills

## Runtime Context
- Current UTC time: !`date -u +%Y-%m-%dT%H:%M:%SZ`
- Config: !`cat memory/amg-check-cosmosdb-mongo-ru/config.md 2>/dev/null || echo "NOT_CONFIGURED"`
- Prior report: !`[ -f memory/amg-check-cosmosdb-mongo-ru/report.md ] && echo "exists ($(grep -c '^### BUG-' memory/amg-check-cosmosdb-mongo-ru/report.md) bugs documented)" || echo "not found"`
- Arguments: time-range=$0, subscription-override=$1

> **Known Issues**: Before presenting findings, cross-reference results against `memory/amg-check-cosmosdb-mongo-ru/report.md`.

# Cosmos DB for MongoDB (RU) Health Check

## Critical Constraints

- **No subagents for MCP.** The Agent tool cannot access MCP tools — all MCP calls must be made from the main context.
- **Scan every resource.** No sampling or early stopping.
- **Time format**: ISO 8601 UTC with explicit `from`/`to` — NEVER use `timespan` (it causes errors).
- **Safe interval**: Always use `PT1H` — it works for all Cosmos DB metrics. `PT6H` is NOT supported. `DataUsage`, `IndexUsage`, and `DocumentCount` do NOT support `P1D`.
- **Parallelism cap**: 30 concurrent MCP calls per batch. Reduce to 4-5 if rate-limited.
- **Result too large**: Save to temp file and parse outside the context window. Prefer `node -e "..."` if installed; otherwise fall back to `python -c "..."`, `jq`, or `pwsh -Command "..."`. Bash permission for the chosen interpreter will be prompted on first use.
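
The parallelism cap amounts to splitting the account list into fixed-size batches. A minimal Node sketch, assuming a hypothetical `toBatches` helper and placeholder account names (neither is part of the skill or of the AMG-MCP tools); it only illustrates the 30-per-batch cap and the reduced cap after rate limiting:

```javascript
// Split a list of work items (e.g. account resource IDs) into batches of at most `cap` items.
// `toBatches` is an illustrative helper name, not an AMG-MCP API.
function toBatches(items, cap = 30) {
  if (!Number.isInteger(cap) || cap < 1) {
    throw new RangeError("cap must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += cap) {
    batches.push(items.slice(i, i + cap));
  }
  return batches;
}

// Hypothetical fleet of 65 accounts.
const accounts = Array.from({ length: 65 }, (_, i) => `account-${i + 1}`);

const normalBatches = toBatches(accounts);       // default cap of 30
const throttledBatches = toBatches(accounts, 5); // reduced cap after rate limiting

console.log(normalBatches.map((b) => b.length)); // [ 30, 30, 5 ]
console.log(throttledBatches.length);            // 13
```

Each batch is issued from the main context and awaited before the next one starts, so the number of in-flight MCP calls never exceeds the cap.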

## Progress Tracking

Update checkboxes as you complete each phase:

- [ ] Phase 1a: Datasource validated
- [ ] Phase 1b: Accounts discovered (N=?)
- [ ] Phase 1c: Non-succeeded accounts investigated (if any)
- [ ] Phase 2: Metric definitions validated
- [ ] Phase 3: Pulse check completed (N scanned, N findings)
- [ ] Phase 4: Deep metrics for abnormal accounts
- [ ] Phase 5: Resource logs for abnormal accounts
- [ ] Report presented
- [ ] Known issues updated in `memory/amg-check-cosmosdb-mongo-ru/report.md`

## Configuration

**If Config shows `NOT_CONFIGURED`**: Run [First-Run Setup](#first-run-setup) at the bottom of this file, then return here.

**If Config is populated**: Extract the datasource UID and subscription ID from the pre-loaded Runtime Context above and use them for all queries. Use `$1` as the subscription override if provided.

- **Datasource UID**: from `## Azure Monitor Datasource` > `UID`
- **Subscription ID**: from `## Subscription` (or `$1` if provided)
- **Resource Type**: `microsoft.documentdb/databaseaccounts` (lowercase) with `kind == 'MongoDB'`
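
Extracting the two values from `config.md` can be sketched as below. This assumes the file follows the template written by First-Run Setup, with one bullet per subscription ID; `parseConfig` and the sample values are hypothetical and only illustrate the lookup.

```javascript
// Parse the config text and apply the optional subscription override ($1).
// Returns null when setup has not been run yet.
function parseConfig(text, subscriptionOverride) {
  if (text.trim() === "NOT_CONFIGURED") return null;

  let section = "";
  let datasourceUid = null;
  const subscriptions = [];

  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line.startsWith("## ")) {
      section = line.slice(3).trim(); // track the current "## ..." section
      continue;
    }
    if (section === "Azure Monitor Datasource") {
      const m = line.match(/^- \*\*UID\*\*:\s*(.+)$/);
      if (m) datasourceUid = m[1].trim();
    } else if (section === "Subscription" && line.startsWith("- ")) {
      subscriptions.push(line.slice(2).trim()); // a subscription ID or "all"
    }
  }

  return {
    datasourceUid,
    subscriptions: subscriptionOverride ? [subscriptionOverride] : subscriptions,
  };
}

// Sample config with placeholder values.
const sampleConfig = [
  "# amg-check-cosmosdb-mongo-ru Configuration",
  "",
  "## Azure Monitor Datasource",
  "- **UID**: azure-monitor-oob",
  "- **Name**: Azure Monitor",
  "",
  "## Subscription",
  "- 00000000-0000-0000-0000-000000000000",
].join("\n");

const parsedConfig = parseConfig(sampleConfig);
console.log(parsedConfig.datasourceUid); // azure-monitor-oob
console.log(parsedConfig.subscriptions); // [ '00000000-0000-0000-0000-000000000000' ]
```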

## Time Range

Default: 7 days for metrics, 24 hours for logs. Override with `$0` (e.g., `3d`). Keep log queries to 1-2 days to avoid timeouts.
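
Turning a time-range argument into the explicit `from`/`to` pair required by the constraints can be sketched as follows. The `resolveWindow` helper is hypothetical, and it assumes the argument always has the `Nd` form shown in the argument hint; the "now" value would come from the Current UTC time in Runtime Context.

```javascript
// Convert a range such as "3d" into ISO 8601 UTC from/to timestamps.
function resolveWindow(range, nowIso, defaultDays = 7) {
  const match = /^(\d+)d$/.exec(range || `${defaultDays}d`);
  if (!match) throw new Error(`Unsupported time range: ${range}`);
  const days = Number(match[1]);
  if (days < 1) throw new Error("Time range must be at least 1 day");

  const to = new Date(nowIso);
  if (Number.isNaN(to.getTime())) throw new Error(`Invalid timestamp: ${nowIso}`);
  const from = new Date(to.getTime() - days * 24 * 60 * 60 * 1000);

  // Drop milliseconds so the result matches the YYYY-MM-DDTHH:MM:SSZ shape.
  const iso = (d) => d.toISOString().replace(/\.\d{3}Z$/, "Z");
  return { from: iso(from), to: iso(to) };
}

const metricWindow = resolveWindow("3d", "2024-05-10T12:00:00Z");
const logWindow = resolveWindow(undefined, "2024-05-10T12:00:00Z", 1);

console.log(metricWindow); // { from: '2024-05-07T12:00:00Z', to: '2024-05-10T12:00:00Z' }
console.log(logWindow);    // { from: '2024-05-09T12:00:00Z', to: '2024-05-10T12:00:00Z' }
```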

---

## Workflow

### Phase 1a: Validate Datasource

Call `amgmcp_datasource_list` (no parameters). Find the entry with `type == "grafana-azure-monitor-datasource"` and compare its UID with the configured one:

- Matches configured UID → proceed.
- Different UID → update `memory/amg-check-cosmosdb-mongo-ru/config.md`, warn user, use new UID.
- Not found → abort with error.
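
The three outcomes above can be sketched as a small decision function. It assumes the tool response has already been reduced to an array of objects with `type` and `uid` fields; that shape, the `validateDatasource` name, and the sample UIDs are assumptions for illustration.

```javascript
const AZURE_MONITOR_TYPE = "grafana-azure-monitor-datasource";

// Decide what Phase 1a should do given the datasource list and the configured UID.
function validateDatasource(datasources, configuredUid) {
  const candidates = datasources.filter((d) => d.type === AZURE_MONITOR_TYPE);
  if (candidates.length === 0) {
    return { action: "abort" }; // no Azure Monitor datasource at all
  }
  if (candidates.some((d) => d.uid === configuredUid)) {
    return { action: "proceed", uid: configuredUid };
  }
  // The configured UID is gone: switch to the first candidate and rewrite config.md.
  return { action: "update-config", uid: candidates[0].uid };
}

// Hypothetical datasource list.
const datasources = [
  { type: "prometheus", uid: "prom-1" },
  { type: AZURE_MONITOR_TYPE, uid: "azure-monitor-oob" },
];

console.log(validateDatasource(datasources, "azure-monitor-oob")); // { action: 'proceed', uid: 'azure-monitor-oob' }
console.log(validateDatasource(datasources, "stale-uid"));         // { action: 'update-config', uid: 'azure-monitor-oob' }
console.log(validateDatasource([], "azure-monitor-oob"));          // { action: 'abort' }
```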

### Phase 1b: Discover All Cosmos DB for MongoDB (RU) Accounts

```
azureMonitorDatasourceUid: {DATASOURCE_UID}
query: |
  resources
  | where type == 'microsoft.documentdb/databaseaccounts'
  | where kind == 'MongoDB'
  | project name, resourceGroup, location, subscriptionId, id, properties.provisioningState
  | order by location asc, name asc
```

If the config specifies subscription IDs (not "all"), add `| where subscriptionId in ('{ID1}', '{ID2}')`. Derive region summary by counting accounts per `location`. Flag accounts not in "Succeeded" state. Stop if zero accounts found.
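
The region summary and state check can be sketched like this. It assumes each result row is a plain object and that the un-aliased `properties.provisioningState` projection arrives as a `properties_provisioningState` column; the row shape returned by the MCP tool may differ, and `summarizeAccounts` and the sample rows are hypothetical.

```javascript
// Count accounts per region and collect those not in the "Succeeded" state.
function summarizeAccounts(rows) {
  const perRegion = {};
  const notSucceeded = [];
  for (const row of rows) {
    perRegion[row.location] = (perRegion[row.location] || 0) + 1;
    if (row.properties_provisioningState !== "Succeeded") {
      notSucceeded.push(row.name);
    }
  }
  return { total: rows.length, perRegion, notSucceeded };
}

// Placeholder rows standing in for the Resource Graph result.
const rows = [
  { name: "acct-a", location: "eastus", properties_provisioningState: "Succeeded" },
  { name: "acct-b", location: "eastus", properties_provisioningState: "Updating" },
  { name: "acct-c", location: "westeurope", properties_provisioningState: "Succeeded" },
];

const inventory = summarizeAccounts(rows);
console.log(inventory.perRegion);    // { eastus: 2, westeurope: 1 }
console.log(inventory.notSucceeded); // [ 'acct-b' ]
```

A `total` of zero is the stop condition; a non-empty `notSucceeded` list feeds Phase 1c.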

> **Why `kind == 'MongoDB'`?** Filters for RU-based MongoDB API accounts. vCore-based MongoDB uses `microsoft.documentdb/mongoclusters`.

### Phase 1c: Activity Log for Non-Succeeded Accounts

If any accounts are not in "Succeeded" state, query the activity log for up to 3 of them:

```
azureMonitorDatasourceUid: {DATASOURCE_UID}
scope: {account's full ARM resource ID}
startTime: now-3d
endTime: now
select: eventTimestamp,operationName,status,caller,subStatus
```

If the response exceeds 500 KB, retry with `startTime: now-1d`. Summarize: operations performed, caller type, success/in-progress status, likely cause.

### Phase 2: Validate Available Metrics

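Call `amgmcp_query_resource_metric_definition` on the first account from Phase 1b. Confirm that the metrics used in later phases exist (for example `ServerSideLatencyDirect` and `ServerSideLatencyGateway`). Run this only once, since definitions are the same across all accounts.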

### Phase 3: Tier 1 — Fleet-Wide Pulse Check

```
azureMonitorDatasourceUid: {DATASOURCE_UID}
pastDays: 7
scenarios: cosmosdb_mongo
```

Scans all accounts across 3 scenarios: `cosmosdb_mongo_ru`, `cosmosdb_mongo_throttling`, `cosmosdb_mongo_availability`.

**Before moving to Phase 4, verify:**
1. `scanSummary.totalResourcesScanned` matches the Phase 1b account count.
2. All 3 scenarios show `status: "completed"` in `scenarioResults`.
3. If `errors` is non-empty, retry the affected scenarios individually.
4. If more than 10% of accounts are missing, fall back to batched `amgmcp_query_resource_metric` calls for the unscanned accounts.
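
The four checks can be sketched as one function over the pulse-check response. The field names `scanSummary.totalResourcesScanned`, `scenarioResults`, `status`, and `errors` come from the list above; treating `scenarioResults` as an object keyed by scenario name and `errors` as an array are assumptions, as are the `verifyPulseCheck` name and the sample response.

```javascript
// Evaluate the Phase 3 exit checks against the expected account count from Phase 1b.
function verifyPulseCheck(result, expectedCount) {
  const scanned = result.scanSummary.totalResourcesScanned;
  const incompleteScenarios = Object.entries(result.scenarioResults)
    .filter(([, scenario]) => scenario.status !== "completed")
    .map(([name]) => name);
  const missing = Math.max(0, expectedCount - scanned);

  return {
    countMatches: scanned === expectedCount,
    incompleteScenarios,                      // scenarios to retry individually
    hasErrors: (result.errors || []).length > 0,
    needsMetricFallback: expectedCount > 0 && missing / expectedCount > 0.1,
  };
}

// Hypothetical response: 35 of 40 accounts scanned, one scenario not completed.
const pulseResult = {
  scanSummary: { totalResourcesScanned: 35 },
  scenarioResults: {
    cosmosdb_mongo_ru: { status: "completed" },
    cosmosdb_mongo_throttling: { status: "failed" },
    cosmosdb_mongo_availability: { status: "completed" },
  },
  errors: ["placeholder error entry"],
};

const verdict = verifyPulseCheck(pulseResult, 40);
console.log(verdict.countMatches);        // false
console.log(verdict.incompleteScenarios); // [ 'cosmosdb_mongo_throttling' ]
console.log(verdict.needsMetricFallback); // true (5 of 40 missing is 12.5%)
```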

Accounts in the `findings` array are abnormal. Also flag any non-Succeeded accounts from Phase 1.

> **Note**: Sustained-high detection (>50% for 6+ hours), RU spike pattern detection (>30pp jump in 1h), and latency analysis require hourly time-series data and are performed in Phase 4 on flagged accounts only.
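
The two hourly-series checks named in the note can be sketched as follows. This assumes a gap-free `PT1H` series of NormalizedRU percentages in time order, and it reads "sustained" as consecutive hours; both are assumptions, and the function names and sample data are illustrative.

```javascript
// Length of the longest run of consecutive samples strictly above `threshold`.
function longestRunAbove(series, threshold) {
  let best = 0;
  let run = 0;
  for (const value of series) {
    run = value > threshold ? run + 1 : 0;
    if (run > best) best = run;
  }
  return best;
}

// Largest increase between two adjacent hourly samples, in percentage points.
function maxHourlyJump(series) {
  let max = 0;
  for (let i = 1; i < series.length; i++) {
    max = Math.max(max, series[i] - series[i - 1]);
  }
  return max;
}

function analyzeRuSeries(series) {
  return {
    sustainedHigh: longestRunAbove(series, 50) >= 6, // >50% for 6+ hours
    spike: maxHourlyJump(series) > 30,               // >30pp jump in 1h
  };
}

// Placeholder hourly NormalizedRU values (%).
const busy = [20, 55, 60, 58, 62, 70, 65, 40];
const quiet = [10, 20, 45, 48, 30];

console.log(analyzeRuSeries(busy));  // { sustainedHigh: true, spike: true }
console.log(analyzeRuSeries(quiet)); // { sustainedHigh: false, spike: false }
```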

### Phase 4: Tier 2 — Deep Metrics for Abnormal Accounts

Read **[reference/phase4-deep-metrics.md](${CLAUDE_SKILL_DIR}/reference/phase4-deep-metrics.md)** before starting Phase 4. It contains:
- Response size management (critical — fleet-wide PT1H queries exceed 500 KB)
- Fleet-wide triage strategy (when >50% of accounts are flagged)
- Core and secondary metrics tables
- Batch strategy and correlation analysis patterns (use ultrathink)

### Phase 5: Resource Logs for Abnormal Accounts

Read **[reference/phase5-resource-logs.md](${CLAUDE_SKILL_DIR}/reference/phase5-resource-logs.md)** before starting Phase 5. It contains:
- 5 KQL query templates: throttling, high latency, request volume, top RU operations, error codes
- Fallback table guidance (CDBDataPlaneRequests if CDBMongoRequests is empty)

---

## Output

Present the report using the structure in **[reference/output-format.md](${CLAUDE_SKILL_DIR}/reference/output-format.md)**.

**Classification:**

| Severity | Criteria |
|----------|----------|
| **CRITICAL** | NormalizedRU = 100% sustained, OR ServiceAvailability < 99.9%, OR latency avg > 50ms |
| **HIGH** | NormalizedRU max 85-100% with frequent spikes, OR ReplicationLatency > 1000ms |
| **WARNING** | NormalizedRU max 70-85% sustained, OR sustained RU > 50% for 6h+, OR RU spike >30pp in 1h, OR ServiceAvailability < 99.99%, OR latency avg > 10ms, OR ReplicationLatency > 100ms |
| **MODERATE** | NormalizedRU max 50-70% |
| **HEALTHY** | All metrics within normal ranges (NormalizedRU < 50%) |
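
The table can be approximated in code as below. This is a simplified sketch, not the skill's definitive logic: the table does not define "sustained" or "frequent spikes" numerically, so the sketch ignores those qualifiers for the 70-100% bands (which may over-classify) and takes the remaining pattern checks as boolean inputs. The `classify` name and every field name on the input object are hypothetical.

```javascript
// Classify one account from summarized metrics. Missing numeric fields never
// trigger a rule, because comparisons against undefined are false.
function classify(m) {
  if (m.ruPinnedAt100 || m.availabilityPct < 99.9 || m.latencyAvgMs > 50) {
    return "CRITICAL";
  }
  if (m.ruMaxPct >= 85 || m.replicationLatencyMs > 1000) {
    return "HIGH";
  }
  if (
    m.ruMaxPct >= 70 ||
    m.sustainedOver50For6h ||
    m.spikeOver30pp ||
    m.availabilityPct < 99.99 ||
    m.latencyAvgMs > 10 ||
    m.replicationLatencyMs > 100
  ) {
    return "WARNING";
  }
  if (m.ruMaxPct >= 50) return "MODERATE";
  return "HEALTHY";
}

console.log(classify({ ruMaxPct: 40, availabilityPct: 100, latencyAvgMs: 5, replicationLatencyMs: 20 }));  // HEALTHY
console.log(classify({ ruMaxPct: 65, availabilityPct: 100, latencyAvgMs: 3, replicationLatencyMs: 10 }));  // MODERATE
console.log(classify({ ruMaxPct: 60, availabilityPct: 99.95, latencyAvgMs: 5, replicationLatencyMs: 20 })); // WARNING
console.log(classify({ ruMaxPct: 92, availabilityPct: 100, latencyAvgMs: 8, replicationLatencyMs: 50 }));  // HIGH
console.log(classify({ ruMaxPct: 55, availabilityPct: 99.5 }));                                            // CRITICAL
```

Rules are evaluated from most to least severe, so an account that meets several criteria gets the highest matching severity.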

## Update Known Issues

After presenting findings, update `memory/amg-check-cosmosdb-mongo-ru/report.md`:

1. Read the current file.
2. Rebuild the Resource Inventory table at the end: every account, full ARM ID, region, subscription, state. Group by region, sorted alphabetically.
3. Update existing bug status from today's telemetry (resolved / improving / worsening / still active).
4. Add new bugs with: severity, account name, region, metric evidence, log evidence, root cause, recommended action.
5. Update the "Updated" date header.

Only add genuine issues: sustained throttling, availability drops, high latency patterns, or replication problems. Skip transient single-hour spikes or expected maintenance windows.

## Error Handling

See **[reference/error-handling.md](${CLAUDE_SKILL_DIR}/reference/error-handling.md)** for the full recovery table.

## Analysis Guidance

- Known patterns, signals, root causes: [reference/analysis-patterns.md](${CLAUDE_SKILL_DIR}/reference/analysis-patterns.md)
- Optional deep-dive KQL queries: [reference/deep-dive-queries.md](${CLAUDE_SKILL_DIR}/reference/deep-dive-queries.md)

## Reference

- Cosmos DB resource type: `microsoft.documentdb/databaseaccounts` (kind: `MongoDB`)
- vCore resource type (different): `microsoft.documentdb/mongoclusters`
- Latency metrics: `ServerSideLatencyDirect` and `ServerSideLatencyGateway` (the old `ServerSideLatency` is deprecated)
- Resource log tables: `CDBMongoRequests` (primary), `CDBDataPlaneRequests` (fallback)
- Key error codes: `429` / `16500` (throttling), `50` (ExceededTimeLimit, the request timed out on the server), `13` (unauthorized)
- Safe metric interval: `PT1H` for all metrics (PT6H NOT supported)
- Known issues: `memory/amg-check-cosmosdb-mongo-ru/report.md`
- User config: `memory/amg-check-cosmosdb-mongo-ru/config.md`

---

## First-Run Setup

Run only when Config shows `NOT_CONFIGURED`. After completing, return to the [Workflow](#workflow) above.

**1. Discover Datasource UID**: Call `amgmcp_datasource_list`. Filter `type == "grafana-azure-monitor-datasource"`. If multiple entries match, prefer the one with `uid == "azure-monitor-oob"`. Abort if none match.

**2. Discover Subscription ID**: Run this Resource Graph query to list all subscriptions that contain Cosmos DB for MongoDB (RU) accounts:
```
resources
| where type == 'microsoft.documentdb/databaseaccounts'
| where kind == 'MongoDB'
| join kind=inner (
    resourcecontainers
    | where type == 'microsoft.resources/subscriptions'
    | project subscriptionId, subscriptionName=name
) on subscriptionId
| summarize AccountCount=count() by subscriptionId, subscriptionName
| order by AccountCount desc
```

Present the results as a table with columns: **Subscription Name**, **Subscription ID**, **Account Count**. Then ask the user: *"Which subscription ID(s) should I configure for this health check? Or type 'all' to scan all subscriptions."*

**3. Write config**: Write `memory/amg-check-cosmosdb-mongo-ru/config.md`:
```markdown
# amg-check-cosmosdb-mongo-ru Configuration

User-specific values for the Cosmos DB for MongoDB (RU) health check skill.
This file is auto-generated on first run and can be edited manually.

## Azure Monitor Datasource
- **UID**: {discovered_uid}
- **Name**: {discovered_name}

## Subscription
- {subscription_id_or_"all"}
```

**4. Confirm**: Show the resolved config and ask for confirmation before proceeding.