Name: Database Replication Advisor
Availability: InStock
Author: charlie-morrison
Analyze database replication topology, detect lag, and recommend replication strategy based on CAP tradeoffs
SKILL.md

---
name: database-replication-advisor
description: Analyze database replication topology, detect lag, and recommend replication strategy based on CAP tradeoffs
---

# Database Replication Advisor

Analyze the health and design of database replication setups. This skill teaches an AI agent to inspect replication lag, evaluate topology choices (single-leader, multi-leader, leaderless), assess failover readiness, and recommend replication strategies grounded in CAP theorem tradeoffs and real operational constraints.

Use when: "check replication lag", "replication health", "failover readiness", "replication topology", "CAP tradeoffs", "design replication", "replica drift", "split-brain risk", "failover drill"

## Commands

### 1. `assess` -- Check current replication health

Inspect the running replication state, measure lag, detect divergence, and flag risks.

#### Step 1: Identify the database engine and topology

```bash
# PostgreSQL: check if this is a primary or standby
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT pg_is_in_recovery();"

# PostgreSQL: list replication slots and connected standbys
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
SELECT slot_name, slot_type, active, restart_lsn
FROM pg_replication_slots;
"

# MySQL: check replication status on a replica
mysql -h "$REPLICA_HOST" -u "$DB_USER" -p"$DB_PASS" -e "SHOW REPLICA STATUS\G"

# Redis: check replication info
redis-cli -h "$REDIS_HOST" INFO replication
```

#### Step 2: Measure replication lag

Lag is the most critical replication health metric. Measure it from both the database internals and from application-level probes.

```bash
# PostgreSQL: lag in bytes and seconds for each standby
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
SELECT client_addr,
       state,
       sent_lsn,
       write_lsn,
       flush_lsn,
       replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS byte_lag,
       replay_lag
FROM pg_stat_replication;
"

# MySQL: seconds behind primary
mysql -h "$REPLICA_HOST" -u "$DB_USER" -p"$DB_PASS" -e "
SELECT
  CHANNEL_NAME,
  SOURCE_UUID,
  LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP,
  APPLYING_TRANSACTION,
  LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker;
"

# Application-level heartbeat probe (write timestamp to primary, read from replica)
python3 -c "
import time, psycopg2

primary = psycopg2.connect(host='$PRIMARY_HOST', dbname='$DB_NAME', user='$DB_USER')
replica = psycopg2.connect(host='$REPLICA_HOST', dbname='$DB_NAME', user='$DB_USER')

# Write heartbeat to primary
with primary.cursor() as cur:
    cur.execute('CREATE TABLE IF NOT EXISTS _repl_heartbeat (id int PRIMARY KEY, ts timestamptz)')
    cur.execute('INSERT INTO _repl_heartbeat VALUES (1, now()) ON CONFLICT (id) DO UPDATE SET ts = now()')
    primary.commit()
    cur.execute('SELECT ts FROM _repl_heartbeat WHERE id = 1')
    write_ts = cur.fetchone()[0]

time.sleep(0.5)

# Read heartbeat from replica
with replica.cursor() as cur:
    cur.execute('SELECT ts FROM _repl_heartbeat WHERE id = 1')
    read_ts = cur.fetchone()[0]

lag = (write_ts - read_ts).total_seconds() if write_ts > read_ts else 0
print(f'Application-level replication lag: {lag:.3f}s')
print(f'Assessment: {\"HEALTHY\" if lag < 1 else \"WARNING\" if lag < 10 else \"CRITICAL\"}')
"
```

#### Step 3: Check for replication conflicts and errors

```bash
# PostgreSQL: check for replication conflicts (queries cancelled on standby)
psql -h "$REPLICA_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
SELECT datname, confl_tablespace, confl_lock, confl_snapshot, confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts
WHERE datname = '$DB_NAME';
"

# PostgreSQL: check WAL archiving health on primary
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
SELECT archived_count, failed_count, last_archived_wal, last_archived_time, last_failed_time
FROM pg_stat_archiver;
"

# MySQL: check for replication errors
mysql -h "$REPLICA_HOST" -u "$DB_USER" -p"$DB_PASS" -e "
SELECT LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker
WHERE LAST_ERROR_NUMBER != 0;
"
```

#### Step 4: Evaluate network and disk bottlenecks

```bash
# Check WAL generation rate on primary (PostgreSQL)
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') / (1024*1024*1024) AS total_wal_gb,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / (1024*1024) AS pending_mb
FROM pg_stat_replication;
"

# Check disk I/O on replica
iostat -x 1 3 | tail -20

# Check network latency between primary and replica
ping -c 5 "$REPLICA_HOST" | tail -1
```

#### Report template

```
## Replication Health Assessment

**Date:** YYYY-MM-DD
**Engine:** PostgreSQL 16 / MySQL 8 / Redis 7
**Topology:** Single-leader with 2 async standbys

### Replication Status
| Replica | State | Byte Lag | Time Lag | Conflicts | Verdict |
|---------|-------|----------|----------|-----------|---------|
| replica-1 | streaming | 1.2 MB | 0.3s | 0 | HEALTHY |
| replica-2 | streaming | 45 MB | 8.2s | 12 | WARNING |

### Risk Assessment
- **Data loss window (RPO):** ~8s (worst replica lag)
- **Failover time estimate (RTO):** ~30s (manual), ~5s (Patroni/orchestrator)
- **Split-brain risk:** LOW (single leader, no multi-master)
- **WAL accumulation:** Normal (archiving healthy)

### Recommendations
1. Investigate replica-2 lag -- likely disk I/O bottleneck (iostat shows 95% util)
2. Consider synchronous replication for replica-1 if zero-data-loss RPO required
3. Add replication lag alerting at 5s threshold
```

---

### 2. `design` -- Recommend replication topology

Given application requirements, recommend the right replication strategy with explicit CAP tradeoff analysis.

#### Step 1: Gather requirements

The agent must determine these constraints before recommending a topology:

```
1. Read/write ratio (read-heavy? write-heavy? balanced?)
2. Geographic distribution (single region? multi-region?)
3. Consistency requirement (strong? eventual? causal?)
4. Durability requirement (zero data loss? seconds acceptable?)
5. Availability target (99.9%? 99.99%? 99.999%?)
6. Write throughput (transactions per second)
7. Budget constraints (how many replicas can you afford?)
```

#### Step 2: Apply the decision framework

```python
import json

def recommend_topology(requirements):
    """Recommend replication topology based on requirements."""
    read_heavy = requirements.get("read_write_ratio", 10) > 5
    multi_region = requirements.get("multi_region", False)
    strong_consistency = requirements.get("consistency") == "strong"
    zero_data_loss = requirements.get("max_data_loss_seconds", 5) == 0
    high_write_throughput = requirements.get("writes_per_second", 100) > 5000

    recommendation = {}

    # Topology selection
    if multi_region and not strong_consistency:
        recommendation["topology"] = "multi-leader"
        recommendation["rationale"] = (
            "Multi-region with eventual consistency favors multi-leader. "
            "Each region has a local leader for low-latency writes. "
            "Conflict resolution required (last-writer-wins or custom)."
        )
        recommendation["cap_choice"] = "AP (availability + partition tolerance)"
    elif high_write_throughput and not strong_consistency:
        recommendation["topology"] = "leaderless (Dynamo-style)"
        recommendation["rationale"] = (
            "High write throughput with eventual consistency suits leaderless replication. "
            "Use quorum reads/writes (W + R > N) for tunable consistency."
        )
        recommendation["cap_choice"] = "AP with tunable consistency via quorums"
    else:
        recommendation["topology"] = "single-leader"
        recommendation["rationale"] = (
            "Single leader is the safest default. Strong consistency via synchronous replication, "
            "read scaling via async replicas."
        )
        recommendation["cap_choice"] = "CP (consistency + partition tolerance)"

    # Sync mode selection
    if zero_data_loss:
        recommendation["sync_mode"] = "synchronous (at least one sync standby)"
        recommendation["tradeoff"] = "Higher write latency, but guaranteed zero RPO"
    else:
        recommendation["sync_mode"] = "asynchronous"
        recommendation["tradeoff"] = f"Possible data loss up to {requirements.get('max_data_loss_seconds', 5)}s, but lower write latency"

    # Replica count
    if read_heavy:
        recommendation["min_replicas"] = max(2, requirements.get("read_write_ratio", 10) // 5)
        recommendation["replica_note"] = "Read replicas sized to absorb read traffic"
    else:
        recommendation["min_replicas"] = 1
        recommendation["replica_note"] = "Single replica for failover, not load distribution"

    return recommendation

reqs = {
    "read_write_ratio": 20,
    "multi_region": False,
    "consistency": "strong",
    "max_data_loss_seconds": 0,
    "writes_per_second": 500
}
result = recommend_topology(reqs)
print(json.dumps(result, indent=2))
```

#### Step 3: Validate against infrastructure

```bash
# Check current DB instance specs to verify replicas can handle load
# AWS RDS example
aws rds describe-db-instances --query 'DBInstances[*].{ID:DBInstanceIdentifier,Class:DBInstanceClass,Engine:Engine,MultiAZ:MultiAZ,ReadReplicas:ReadReplicaDBInstanceIdentifiers}' --output table

# Check available storage and IOPS
aws rds describe-db-instances --db-instance-identifier "$DB_INSTANCE" --query 'DBInstances[0].{Storage:AllocatedStorage,IOPS:Iops,StorageType:StorageType}'

# Check cross-region latency (for multi-region decisions)
for region in us-east-1 eu-west-1 ap-southeast-1; do
  echo -n "$region: "
  ping -c 3 "ec2.$region.amazonaws.com" 2>/dev/null | tail -1 | awk -F'/' '{print $5 "ms"}'
done
```

#### Report template

```
## Replication Design Recommendation

**Application:** [service-name]
**Current state:** Single PostgreSQL instance, no replication

### Requirements Analysis
- Read/write ratio: 20:1 (read-heavy)
- Regions: Single (us-east-1)
- Consistency: Strong required for financial transactions
- Max acceptable data loss: 0 seconds
- Availability target: 99.99%

### Recommended Topology
**Single-leader with 1 synchronous + 2 asynchronous standbys**

### CAP Tradeoff
Choosing CP: consistency and partition tolerance. During a network partition, writes block rather than risk inconsistency. This is correct for financial data.

### Architecture
```
Primary (us-east-1a) --sync--> Standby-1 (us-east-1b) [failover target]
                     --async-> Standby-2 (us-east-1c) [read replica]
                     --async-> Standby-3 (us-east-1c) [read replica]
```

### Configuration
- `synchronous_standby_names = 'FIRST 1 (standby1)'`
- `max_wal_senders = 10`
- `wal_level = replica`
- `hot_standby = on`
- Connection pooler (PgBouncer) routing reads to standbys
```

---

### 3. `failover-test` -- Plan and validate a failover drill

Design a safe failover test, define success criteria, and provide runbooks for both automated and manual failover.

#### Step 1: Pre-flight checks

```bash
# Ensure replication is healthy before testing failover
psql -h "$PRIMARY_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
SELECT client_addr, state, sync_state,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS byte_lag
FROM pg_stat_replication;
"

# Verify the failover target is caught up (lag < 1MB)
lag=$(psql -h "$PRIMARY_HOST" -U "$DB_USER" -d "$DB_NAME" -t -c "
SELECT pg_wal_lsn_diff(sent_lsn, replay_lsn)
FROM pg_stat_replication WHERE client_addr = '$FAILOVER_TARGET_IP';
")
if [ "$lag" -gt 1048576 ]; then
  echo "ABORT: Failover target has ${lag} bytes lag. Wait for it to catch up."
  exit 1
fi
echo "Pre-flight OK: lag is ${lag} bytes"

# Check Patroni cluster state (if using Patroni)
patronictl -c /etc/patroni.yml list

# Check connection pool state
pgbouncer -d "$PGBOUNCER_DB" -c "SHOW POOLS;"
```

#### Step 2: Define the failover procedure

```bash
# Option A: Patroni-managed failover (preferred)
patronictl -c /etc/patroni.yml switchover --master "$PRIMARY_NAME" --candidate "$STANDBY_NAME" --force

# Option B: Manual PostgreSQL failover
# 1. Stop writes on primary (set to read-only)
psql -h "$PRIMARY_HOST" -U "$DB_USER" -c "ALTER SYSTEM SET default_transaction_read_only = on; SELECT pg_reload_conf();"
# 2. Wait for standby to fully catch up
# 3. Promote the standby
psql -h "$STANDBY_HOST" -U "$DB_USER" -c "SELECT pg_promote();"
# 4. Repoint application connection strings or update DNS/proxy

# Option C: AWS RDS failover
aws rds failover-db-cluster --db-cluster-identifier "$CLUSTER_ID" --target-db-instance-identifier "$FAILOVER_TARGET"
```

#### Step 3: Validate failover success

```bash
# Check new primary is accepting writes
psql -h "$NEW_PRIMARY" -U "$DB_USER" -d "$DB_NAME" -c "
CREATE TABLE _failover_test_$(date +%s) (id int);
DROP TABLE _failover_test_$(date +%s);
SELECT 'WRITES OK' AS status;
"

# Check old primary is no longer primary
psql -h "$OLD_PRIMARY" -U "$DB_USER" -d "$DB_NAME" -c "SELECT pg_is_in_recovery();"
# Should return: t (true = standby)

# Check application connectivity
curl -s "$APP_HEALTH_ENDPOINT" | python3 -c "
import sys, json
data = json.load(sys.stdin)
db_status = data.get('database', {}).get('status', 'unknown')
print(f'Application DB status: {db_status}')
assert db_status == 'healthy', 'Application not connected to new primary!'
"

# Measure total downtime
echo "Failover started at: $FAILOVER_START"
echo "First successful write at: $(date -Iseconds)"
echo "Total downtime: calculate difference"
```

#### Step 4: Post-failover recovery

```bash
# Rebuild old primary as a new standby
# PostgreSQL: use pg_rewind if timeline hasn't diverged too far
pg_rewind --target-pgdata=/var/lib/postgresql/data --source-server="host=$NEW_PRIMARY dbname=$DB_NAME user=$DB_USER"

# Verify the new standby is streaming
psql -h "$NEW_PRIMARY" -U "$DB_USER" -d "$DB_NAME" -c "SELECT * FROM pg_stat_replication;"
```

#### Report template

```
## Failover Drill Report

**Date:** YYYY-MM-DD
**Trigger:** Planned drill / Unplanned incident
**Engine:** PostgreSQL 16 with Patroni

### Timeline
| Event | Timestamp | Duration |
|-------|-----------|----------|
| Failover initiated | 14:00:00 | - |
| Old primary stopped accepting writes | 14:00:02 | 2s |
| New primary promoted | 14:00:05 | 3s |
| Application reconnected | 14:00:08 | 3s |
| **Total write downtime** | - | **8 seconds** |

### Success Criteria
- [x] Total downtime < 30 seconds (actual: 8s)
- [x] Zero data loss (sync replication confirmed)
- [x] Application auto-reconnected without manual intervention
- [x] Old primary successfully rebuilt as standby
- [ ] Alerting fired within 60 seconds (actual: 90s -- needs tuning)

### Findings
1. Application connection pool took 3s to detect primary change -- consider reducing health check interval
2. Alerting delay of 90s -- adjust PagerDuty integration threshold
3. Old primary rewind completed in 45s -- acceptable

### Action Items
- [ ] Reduce PgBouncer server_check_delay from 30s to 5s
- [ ] Tune monitoring alert threshold to < 60s detection
- [ ] Schedule next drill in 90 days
```
Database Replication Advisor

SKILL.md

related skills