Use when the user wants to prepare, create, or generate an AWS FIS (Fault Injection Service) experiment configuration. Triggers on "prepare FIS experiment",...
SKILL.md

---
name: aws-fis-experiment-prepare
description: >
  Use when the user wants to prepare, create, or generate an AWS FIS (Fault
  Injection Service) experiment configuration. Triggers on "prepare FIS
  experiment", "create FIS experiment for [scenario]", "generate chaos
  experiment config", "准备 FIS 实验", "生成 [scenario] 混沌实验配置",
  "create experiment template for AZ power interruption", "set up fault
  injection test". Covers Scenario Library pre-built scenarios (AZ Power
  Interruption, AZ Application Slowdown, Cross-AZ Traffic Slowdown,
  Cross-Region Connectivity), custom single FIS actions
  (aws:rds:failover-db-cluster, aws:ec2:stop-instances, etc.), and SSM
  Automation-based fault injection for Amazon MSK (broker reboot) and
  ElastiCache Redis/Valkey (primary node reboot, replication group failover).
---

# AWS FIS Experiment Prepare

Generate all configuration files needed to run an AWS FIS experiment, then
deploy via CloudFormation with self-healing iteration until the stack
succeeds. Outputs a self-contained directory with a validated, deployed
experiment template ready for execution.

**Core principle:** Validate resource-action compatibility before generating
files. Never deliver untested configuration — deploy and self-heal first.

## References

**Always load for every experiment:**
- `references/output-format.md` — directory layout, slug naming, README
  template
- `references/cfn-base-template.md` — CFN skeleton (Parameters, IAM Role,
  Dashboard, FIS Template, Outputs)
- `references/slug-conventions.md` — scenario/context slug abbreviations,
  resource naming, name length budget

**Load conditionally by scenario:**
- `references/az-power-interruption-guide.md` — AZ Power Interruption
  (sub-action pruning, tagging strategy, permissions)
- `references/eks-pod-action-guide.md` — any `aws:eks:pod-*` action
  (RBAC Lambda, EKS Access Entry, Pod memory stress calculation)
- `references/elasticache-redis-guide.md` — ElastiCache Redis/Valkey
  (native AZ power interruption, primary node reboot via SSM
  Automation, or replication group failover via SSM Automation)
- `references/msk-guide.md` — Amazon MSK (broker reboot via SSM
  Automation — no native FIS action exists)

**Utility scripts (execute, do not read as reference):**
- `scripts/precheck-cfn-permissions.sh` — detects required CFN service role
- `scripts/deploy-with-retry.sh` — validate + deploy + delete-on-fail
- `scripts/rename-output-dir.sh` — appends FIS template ID to directory name

**Script invocation:** `${SKILL_DIR}` refers to the absolute path of this
skill's directory (where SKILL.md lives). Resolve it from the skill's
filesystem location before running any scripts.

## Output Language Rule

Detect the user's conversation language and use the **same language** for all
output files (README.md, comments in JSON/YAML).
- Chinese input → Chinese output
- English input → English output
- Mixed → follow the dominant language

## Prerequisites

Required tools:
- **AWS CLI** — `aws fis list-actions`, resource discovery, CloudFormation
- **aws___search_documentation** / **aws___read_documentation** — FIS docs
  research
- **jq** — required by `scripts/deploy-with-retry.sh` and
  `scripts/precheck-cfn-permissions.sh`

**EKS Pod fault injection:** Cluster auth mode must be
`API_AND_CONFIG_MAP` or `API`. Check:
```bash
aws eks describe-cluster --name {CLUSTER} \
  --query 'cluster.accessConfig.authenticationMode'
```
If `CONFIG_MAP` only, the user must update the cluster first.
**MANDATORY:** For any `aws:eks:pod-*` action, follow
`references/eks-pod-action-guide.md`.

## Workflow

### Step 1: Identify Scenario and Region

**Classify user intent into one of these branches:**

| Branch | Trigger | Additional Reference |
|---|---|---|
| Scenario Library | AZ Power Interruption, AZ App Slowdown, Cross-AZ/Region scenarios | Read AWS doc URL (table below) |
| Custom FIS action | User specifies an action ID or describes a single fault | — |
| Custom FIS action (ElastiCache) | ElastiCache AZ power interruption or Redis/Valkey failover | `references/elasticache-redis-guide.md` |
| SSM Automation | Target service has no native FIS action (MSK, ElastiCache primary reboot, ElastiCache failover) | `references/msk-guide.md` or `references/elasticache-redis-guide.md` |

If ambiguous, ask the user.

**Scenario Library documentation URLs** (JSON templates are NOT available via
CLI/API — read the doc to extract):

| Scenario | Documentation URL |
|---|---|
| AZ Power Interruption | `https://docs.aws.amazon.com/en_us/fis/latest/userguide/az-availability-scenario.html` |
| AZ Application Slowdown | `https://docs.aws.amazon.com/en_us/fis/latest/userguide/az-application-slowdown-scenario.html` |
| Cross-AZ Traffic Slowdown | `https://docs.aws.amazon.com/en_us/fis/latest/userguide/cross-az-traffic-slowdown-scenario.html` |
| Cross-Region Connectivity | `https://docs.aws.amazon.com/en_us/fis/latest/userguide/cross-region-scenario.html` |

**Region detection order:**
1. User explicitly specifies
2. Infer from context (ARNs, previous conversation)
3. `aws configure get region`
4. Ask the user

Store as `TARGET_REGION`.

**Default experiment duration: `PT10M` (10 minutes)** for all scenarios and
sub-actions unless the user specifies otherwise. For AZ Power Interruption,
scale ARC Zonal Autoshift timing proportionally (ARC starts at minute 2,
runs for 8 minutes at PT10M; formula: `startAfter = duration × (5/30)`).

### Step 2: Discover Target Resources

#### For Scenario Library Scenarios

**CRITICAL: Scenario Library experiment templates CANNOT be generated via
FIS API.** You MUST call `aws___read_documentation` with the scenario URL
(Step 1 table) to extract the JSON experiment template before generating
any files. The documentation is the only authoritative source.

**Target identification — prefer `resourceArns` over `resourceTags`:**
- Use `resourceArns` (exact ARNs) for most resource types — more precise,
  no pre-tagging needed
- Exception — these types do NOT support `resourceArns`, use
  `resourceTags` instead:
  - `aws:elasticache:replicationgroup`
  - `aws:ec2:autoscaling-group`
- EKS pod actions use Kubernetes namespace + pod labels (neither
  `resourceArns` nor `resourceTags`)

**`resourceArns` and `filters` are mutually exclusive.** FIS rejects targets
that specify both. For AZ-scoped targeting, either use `resourceArns` with
only the target AZ's ARNs, or use `resourceTags` + `filters` together.

**If scenario is AZ Power Interruption:** follow
`references/az-power-interruption-guide.md` for sub-action pruning, tagging
strategy, permissions, and one-Stack-per-AZ design.

**Ask the user:**
1. Which AZ to target (for AZ-level scenarios)
2. Which services to include (for AZ Power Interruption) — if user mentions
   specific services, include ONLY those + mandatory infrastructure sub-actions
3. Target resource identifiers (cluster IDs, instance IDs, etc.)

#### For Custom FIS Actions

```bash
aws fis get-action --id "ACTION_ID" --region TARGET_REGION
```

Extract required `targets` and `parameters`. Resolve user-provided
identifiers to ARNs via AWS CLI.

#### For Services Without Native FIS Actions (SSM Automation)

1. Confirm no native action exists:
   ```bash
   aws fis list-actions \
     --query "actions[?starts_with(id, 'aws:{SERVICE}:')]" \
     --region TARGET_REGION
   ```

2. If empty, follow the service-specific guide:
   - Amazon MSK → `references/msk-guide.md`
   - ElastiCache primary node reboot → `references/elasticache-redis-guide.md`
     (Scenario 2)
   - Other services → not yet documented. Stop and inform the user.

**Special case — ElastiCache:** Has a native FIS action for AZ-level impact
(`aws:elasticache:replicationgroup-interrupt-az-power`) but **no native
action for single-node reboot or replication group failover**. For primary
node reboot, use SSM Automation per
`references/elasticache-redis-guide.md` → Scenario 2. For replication group
failover (TestFailover), use SSM Automation per
`references/elasticache-redis-guide.md` → Scenario 3.

3. Discover resources via the target service's CLI (`aws kafka list-clusters`,
   etc.).

### Step 2.5: EKS Pod Action Setup Gate

**If the experiment includes ANY `aws:eks:pod-*` action, complete this gate
BEFORE Step 3.**

Applicable actions: `aws:eks:pod-cpu-stress`, `aws:eks:pod-delete`,
`aws:eks:pod-io-stress`, `aws:eks:pod-memory-stress`,
`aws:eks:pod-network-blackhole-port`, `aws:eks:pod-network-latency`,
`aws:eks:pod-network-packet-loss`.

1. Read the official documentation:
   ```
   aws___read_documentation:
     url: https://docs.aws.amazon.com/fis/latest/userguide/eks-pod-actions.html
   ```

2. Follow ALL requirements in `references/eks-pod-action-guide.md`:
   - Lambda-backed CFN Custom Resource for K8s RBAC (fixed names: `fis-sa`,
     `fis-experiment-role`, `fis-experiment-role-binding`)
   - EKS Access Entry for FIS Experiment Role (`Username: fis-experiment`)
   - Cluster auth mode check (`API_AND_CONFIG_MAP` or `API`)
   - Pod `readOnlyRootFilesystem: false` check
   - Network action limitations (no Fargate, no bridge mode)
   - **Pod memory stress threshold calculation** (if action is
     `aws:eks:pod-memory-stress`) — user's percent is total target, not
     injection value

Do NOT skip. EKS pod actions have complex setup requirements that differ
significantly from other FIS actions.

### Step 3: Validate Resource-Action Compatibility

**CRITICAL GATE.** Before generating any files, verify that the user's
actual resources are compatible with the chosen FIS action(s).

#### 3a. Inspect the Actual Resource

| User Says | CLI Command | Key Fields |
|---|---|---|
| RDS database | `aws rds describe-db-instances --db-instance-identifier {ID}` | `Engine`, `DBClusterIdentifier` |
| RDS/Aurora cluster | `aws rds describe-db-clusters --db-cluster-identifier {ID}` | `Engine`, `EngineMode`, `MultiAZ` |
| EC2 instance | `aws ec2 describe-instances --instance-ids {ID}` | `InstanceType`, `Placement.AvailabilityZone` |
| EKS cluster | `aws eks describe-cluster --name {NAME}` | `accessConfig.authenticationMode`, `version` |
| ElastiCache | `aws elasticache describe-replication-groups --replication-group-id {ID}` | `NodeGroupConfiguration`, `MultiAZ` |
| ASG | `aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names {NAME}` | `AvailabilityZones`, `Instances` |

#### 3b. Cross-Check Against FIS Action Requirements

```bash
aws fis get-action --id "ACTION_ID" --region TARGET_REGION \
  --query 'action.targets' --output json
```

**Common incompatibility traps:**

| FIS Action | Required resourceType | Incompatible With | Detection |
|---|---|---|---|
| `aws:rds:failover-db-cluster` | `aws:rds:cluster` | Standalone RDS (non-Aurora) | `DBClusterIdentifier` is null |
| `aws:rds:reboot-db-instances` | `aws:rds:db` | Aurora clusters | `Engine` starts with `aurora` |
| `aws:elasticache:replicationgroup-interrupt-az-power` | `aws:elasticache:replicationgroup` | Standalone ElastiCache nodes | No replication group |
| `aws:ec2:stop-instances` | `aws:ec2:instance` | Spot instances | `InstanceLifecycle` = `spot` |

#### 3c. Decision Gate

- **Compatible** → proceed to Step 4.
- **Incompatible** → explain the mismatch, suggest alternatives based on
  the actual resource type, ask the user to confirm or abort.

Example alternatives:
- Standalone RDS Multi-AZ → `aws:rds:reboot-db-instances` with
  `--force-failover`
- Aurora cluster → `aws:rds:failover-db-cluster`
- ElastiCache standalone → explain replication group is required

#### 3d. For Scenario Library Scenarios

Validate EACH included sub-action against its target resources. Only
validate sub-actions that remain after service-scoped pruning (Step 2).

### Step 4: Determine Monitoring Configuration

**Stop Conditions — default: `source: "none"` (no alarm).** Only create a
CloudWatch Alarm if the user explicitly provides one.

**Dashboard Metrics — comprehensive, per-service.** Group widgets by
service, 3 widgets per service (availability, performance, errors/latency).
Include only services actually affected by the experiment.

| Service | Metrics |
|---|---|
| EC2 | `StatusCheckFailed`, `CPUUtilization`, `NetworkIn/Out`, `NetworkPacketsIn/Out` |
| RDS/Aurora | `DatabaseConnections`, `ReadLatency`, `WriteLatency`, `AuroraReplicaLag`, `FreeableMemory` |
| EKS | `pod_number_of_running_pods`, `pod_number_of_container_restarts`, `node_cpu_utilization`, `node_memory_utilization` |
| ElastiCache | `ReplicationLag`, `EngineCPUUtilization`, `CurrConnections`, `CacheHitRate`, `Evictions`, `IsMaster` |
| ALB | `HealthyHostCount`, `UnHealthyHostCount`, `HTTPCode_ELB_5XX_Count`, `TargetResponseTime` |
| NLB | `ActiveFlowCount`, `TCP_Client_Reset_Count`, `TCP_Target_Reset_Count` |

### Step 5: Generate Configuration Files

**Create output directory:**

```bash
# ─── Fill in from user's request + references/slug-conventions.md ───
SCENARIO_SLUG="..."         # e.g., pod-delete, az-power-int, rds-failover
TARGET_RESOURCE_ID="..."    # e.g., my-aurora-cluster, i-0abc123def
CONTEXT_NAME=""             # optional (e.g., redis, msk); leave empty if N/A
# ────────────────────────────────────────────────────────────────────

# Derived values (do not edit):
TARGET_SLUG=$(echo "${TARGET_RESOURCE_ID}" | tr '[:upper:]' '[:lower:]' | tr ' :/' '-' | cut -c1-20)
CONTEXT_SLUG=$(echo "${CONTEXT_NAME}" | tr '[:upper:]' '[:lower:]' | tr ' :/' '-' | cut -c1-10)
TIMESTAMP=$(TZ=Asia/Shanghai date +%Y-%m-%d-%H-%M-%S)

if [ -n "${CONTEXT_SLUG}" ]; then
    OUTPUT_DIR="./${TIMESTAMP}-${SCENARIO_SLUG}-${TARGET_SLUG}-${CONTEXT_SLUG}"
else
    OUTPUT_DIR="./${TIMESTAMP}-${SCENARIO_SLUG}-${TARGET_SLUG}"
fi
mkdir -p "${OUTPUT_DIR}"
```

**REQUIRED:** Before generating `cfn-template.yaml`, read the
`AWS::FIS::ExperimentTemplate` CloudFormation resource documentation:

```
aws___read_documentation:
  url: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-fis-experimenttemplate.html
```

**ALSO REQUIRED:** Search for CloudFormation examples for the resources used:

```
aws___search_documentation:
  search_phrase: "<CFN resource types in this experiment>"
  topics: ["cloudformation"]
```

**Generate files:**

1. **cfn-template.yaml** — use `references/cfn-base-template.md` as the
   skeleton. Extend with scenario-specific resources per:
   - `references/az-power-interruption-guide.md` (if AZ Power Interruption)
   - `references/eks-pod-action-guide.md` (if EKS pod actions)
   - `references/msk-guide.md` (if MSK)
   - `references/elasticache-redis-guide.md` (if ElastiCache)

2. **README.md** — use the template in `references/output-format.md`.

### Step 5.5: CFN Permission Pre-Check

Run the precheck script to detect whether a CFN service role is required:

```bash
CFN_ROLE_ARN=$("${SKILL_DIR}/scripts/precheck-cfn-permissions.sh")
```

If the caller lacks CloudFormation permissions, the script exits 1 with
guidance — **stop and inform the user**. Otherwise, `CFN_ROLE_ARN` is either
empty (no service role needed) or contains the required role ARN.

### Step 6: Deploy CFN Template (Self-Healing Loop)

**Generate deployment parameters:**

```bash
# See references/slug-conventions.md for the ExperimentName composition rule
RANDOM_SUFFIX=$(LC_ALL=C tr -dc 'a-z0-9' < /dev/urandom | head -c6)

if [ -n "${CONTEXT_SLUG}" ]; then
    EXPERIMENT_NAME="${SCENARIO_SLUG}-${TARGET_SLUG}-${CONTEXT_SLUG}-${RANDOM_SUFFIX}"
else
    EXPERIMENT_NAME="${SCENARIO_SLUG}-${TARGET_SLUG}-${RANDOM_SUFFIX}"
fi
STACK_NAME="fis-${EXPERIMENT_NAME}"
```

**Deploy with self-healing retry loop** (maximum 5 attempts driven by the
agent). The `deploy-with-retry.sh` script performs **one attempt** — the
agent drives the loop externally. On each attempt:

1. Run `scripts/deploy-with-retry.sh`:
   ```bash
   "${SKILL_DIR}/scripts/deploy-with-retry.sh" \
     "${OUTPUT_DIR}/cfn-template.yaml" \
     "${STACK_NAME}" \
     "${TARGET_REGION}" \
     "${CFN_ROLE_ARN}" \
     "ExperimentName=${EXPERIMENT_NAME}" \
     "RandomSuffix=${RANDOM_SUFFIX}"
   ```
2. Exit 0 → deployment succeeded, proceed to "On Successful Deployment".
3. Exit 1 (validation failed) or 2 (deployment failed, stack deleted) →
   analyze stderr output, fix `cfn-template.yaml`, increment attempt
   counter, re-invoke the script.
4. After 5 failed attempts → stop and report to the user with the last
   error, all fixes attempted, and the current `cfn-template.yaml`.

**Common CFN errors and fixes:**

| Error Pattern | Root Cause | Fix |
|---|---|---|
| `Property validation failure` | Invalid CFN property name/value | Fix the resource property |
| `Template format error` | YAML syntax issue | Fix indentation/structure |
| `Resource type not supported` | Resource unavailable in region | Check regional availability |
| `Circular dependency` | Resources reference each other | Use `DependsOn` or restructure |
| `RoleArn ... is invalid` | IAM role not yet propagated | Add `DependsOn` for IAM role |
| Empty `logConfiguration` | AZ Power Interruption doc artifact | Remove the `logConfiguration` block |

#### On Successful Deployment

1. Extract stack outputs:
   ```bash
   aws cloudformation describe-stacks \
     --stack-name "${STACK_NAME}" \
     --query 'Stacks[0].Outputs' \
     --region "${TARGET_REGION}" --output table
   ```

2. Update `README.md` with actual stack name, template ID, dashboard URL,
   and cleanup command. Replace ALL `{STACK_NAME}` placeholders — do NOT
   leave placeholders in the final output.

### Step 7: Rename Output Directory with Template ID

Run the rename script:

```bash
NEW_OUTPUT_DIR=$("${SKILL_DIR}/scripts/rename-output-dir.sh" \
    "${OUTPUT_DIR}" \
    "${STACK_NAME}" \
    "${TARGET_REGION}")
OUTPUT_DIR="${NEW_OUTPUT_DIR}"
```

Update `README.md`'s `**Directory:**` field with the full absolute path of
the renamed directory. If CFN deployment failed (Step 6 exceeded max
retries), skip this step.

Print a brief summary to the terminal:
- Experiment output directory (with template ID)
- CFN stack name and deployment status
- Experiment template ID
- Next step instruction

## Important Guidelines

- **Scenario Library templates come from documentation.** Call
  `aws___read_documentation` on the scenario's doc URL (Step 1 table) before
  generating any files. The documentation is the only authoritative source.
- **Never start the FIS experiment in this skill.** Starting the experiment
  is handled by `aws-fis-experiment-execute` or manually by the user.
- **Validate resource-action compatibility BEFORE generating files** (Step 3).
  The most common source of wasted effort is deploying a template that
  targets an incompatible resource.
- **Always deploy and validate.** Do not just generate files — deploy the CFN
  template and iterate until it succeeds (Step 6). The user should receive a
  working, deployed experiment template ready to start.
- **Self-heal on CFN errors.** Read stack events, diagnose, fix the template,
  delete the failed stack, retry. Do not ask the user to fix CFN errors.
- **Verify FIS action availability** (`aws fis list-actions` /
  `aws fis get-action`) before generating templates. Don't fabricate action
  IDs.
- **Prefer `resourceArns` over `resourceTags` for targets.** Exceptions:
  `aws:elasticache:replicationgroup`, `aws:ec2:autoscaling-group`. Never
  combine `resourceArns` with `filters`.
- **IAM policy must be least-privilege.** Only include permissions for the
  specific actions in the experiment.
- **CFN template must be self-contained.** Deploy the CFN template and get a
  working experiment without any other steps.
- **Sequential MCP calls.** All `aws___read_documentation` and
  `aws___search_documentation` calls must be sequential, never parallel.
  Retry up to 10 times on rate limit errors.
- **Keep local files in sync.** After successful deployment, update README.md
  with real ARNs and stack outputs.
Aws Fis Experiment Prepare

SKILL.md

related skills