intent
design cloud-native microservices architectures by applying domain-driven design to identify service boundaries, choose communication patterns, define data strategies, and implement resilience mechanisms. use this skill when decomposing a monolith, architecting a new distributed system, or evaluating whether a system is ready for microservices. produces reference documents, validation checklists, and implementation guides that teams use to build, deploy, and operate independent services at scale.
inputs
- system context: current architecture (monolith, legacy, hybrid), business domain, team structure, deployment environment
- ddd knowledge base: access to domain experts, business process documentation, aggregate definitions, bounded context boundaries
- operational constraints: sla requirements, latency budgets (sub-100ms for sync, longer for async), consistency requirements, disaster recovery targets
- deployment platform: kubernetes cluster credentials (KUBECONFIG env var or in-cluster service account), container registry, ci/cd system (github actions, jenkins, gitlab ci), service mesh (istio, linkerd) if applicable
- observability stack: distributed tracing backend (jaeger, datadog, newrelic endpoint), centralized logging (elasticsearch, splunk, cloudwatch), metrics collector (prometheus scrape config), correlation id propagation strategy
- external systems: kafka/rabbitmq broker for async messaging (connection string, credentials), api gateway or ingress controller configuration, identity provider for service-to-service auth (oauth2 client credentials scope)
- reference materials: existing service inventory, database schemas, api contracts, deployment manifests
procedure
conduct domain analysis using domain-driven design
- input: system context, domain experts, business process documentation
- interview stakeholders to map business capabilities and value streams
- identify candidate bounded contexts by grouping related aggregates and entities
- define service boundaries: each service owns one or more cohesive bounded contexts
- output: bounded context diagram, service candidate list with capability mappings, data ownership matrix
- edge case: if domain knowledge is distributed across many teams, schedule multiple sessions; if legacy system has implicit domains, reverse-engineer by analyzing code coupling and data dependencies
validate service boundary independence
- input: bounded context diagram, candidate service list, current data schema
- for each service: confirm it owns its data model exclusively (no shared schema), has a clear public api contract, and can be deployed independently
- identify and document all inter-service dependencies (data, temporal, transactional)
- output: service independence checklist (pass/fail per service), dependency graph, data ownership contract per service
- edge case: if services currently share a database, document the schema split strategy and phased migration plan; if circular dependencies exist, redraw boundaries
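the circular-dependency check above can be automated once the dependency graph is written down. the following is a minimal sketch, assuming the graph is a plain mapping from service name to the services it calls; the service names and the `find_cycle` helper are hypothetical and only for illustration.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of service names, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for nxt in deps.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # back edge: nxt is already on the current path, so we found a cycle
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for service in deps:
        if color.get(service, WHITE) == WHITE:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


# hypothetical dependency graph: inventory calls back into orders
dependencies = {
    "orders": ["payments", "inventory"],
    "payments": ["ledger"],
    "inventory": ["orders"],
    "ledger": [],
}
cycle = find_cycle(dependencies)
print(" -> ".join(cycle) if cycle else "no circular dependencies")
```

a non-empty result is a signal to redraw boundaries before moving on to communication design.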
design communication patterns and protocols
- input: service dependencies, sla requirements, consistency model, message volume estimates
- for query/command pairs with sub-100ms latency sla: choose synchronous rest or grpc
- for long-running operations or cross-aggregate workflows: choose async messaging (kafka, rabbitmq, cloud events)
- for read-heavy operations: consider eventual consistency with event-driven cache updates
- output: communication matrix (service pairs, protocol choice, justification), sample api contract (openapi for rest, protobuf for grpc), async event schema examples
- edge case: if latency budget is unknown, document assumption and flag for performance testing; if message volume exceeds broker capacity, recommend sharding strategy
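the protocol rules in this step can be captured as a small function that fills in one row of the communication matrix. this is a sketch under stated assumptions: the `Interaction` type, the field names, and the wording of the justifications are hypothetical, and the 100 ms threshold is taken from the rule above.

```python
from dataclasses import dataclass
from typing import Dict, Optional

SYNC_BUDGET_MS = 100.0  # threshold taken from the sub-100ms rule in this step


@dataclass
class Interaction:
    source: str
    target: str
    latency_budget_ms: Optional[float] = None  # None means the budget is unknown
    long_running: bool = False


def choose_protocol(interaction: Interaction) -> Dict[str, str]:
    """Return one communication-matrix row for a service pair."""
    row = {"source": interaction.source, "target": interaction.target}
    if interaction.long_running:
        row.update(protocol="async messaging",
                   justification="long-running or cross-aggregate workflow")
    elif interaction.latency_budget_ms is None:
        row.update(protocol="undecided",
                   justification="latency budget unknown; document assumption "
                                 "and flag for performance testing")
    elif interaction.latency_budget_ms < SYNC_BUDGET_MS:
        row.update(protocol="rest or grpc",
                   justification="latency budget is below 100 ms")
    else:
        row.update(protocol="async messaging",
                   justification="latency budget allows eventual consistency")
    return row


# hypothetical service pairs
print(choose_protocol(Interaction("checkout", "pricing", latency_budget_ms=50)))
print(choose_protocol(Interaction("orders", "fulfilment", long_running=True)))
```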
define data strategy and consistency model
- input: service dependencies, consistency requirements, regulatory constraints (gdpr, audit trails)
- apply database-per-service pattern: each service uses its own database (sql, nosql, or hybrid)
- document eventual consistency windows for cross-service operations
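- for strong consistency requirements that span services: use sagas (orchestrated or choreographed) with compensating transactions; a saga gives all-or-nothing outcomes through compensation, not isolation, so intermediate states are visible to other services; keep operations that need true strong consistency inside one service and one database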
- for high-cardinality state: consider event sourcing with snapshots
- output: data topology diagram, consistency model per operation (strong, eventual, saga compensation), database selection rationale, schema per service
- edge case: if schema needs to be shared for auditing, use change data capture (cdc) instead of direct schema coupling; if consistency window exceeds business tolerance, escalate to stakeholders
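to make the saga option concrete, here is a minimal in-process sketch of an orchestrated saga: steps run in order, and when one fails the completed steps are compensated in reverse. the step names and the `run_saga` helper are hypothetical; a real orchestrator would also persist saga state and make each step idempotent, which this sketch does not attempt.

```python
from typing import Callable, List, Tuple

# (step name, action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return False, log
        completed.append(step)
        log.append(f"done:{name}")
    return True, log


def charge_card() -> None:
    # hypothetical failure to show the compensation path
    raise RuntimeError("payment declined")


# hypothetical order workflow spanning three services
order_saga: List[Step] = [
    ("reserve_stock", lambda: None, lambda: None),
    ("charge_card", charge_card, lambda: None),
    ("schedule_shipping", lambda: None, lambda: None),
]
ok, saga_log = run_saga(order_saga)
print(ok, saga_log)
```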
implement resilience patterns
- input: external dependencies, sla targets, blast radius assessment
- add explicit timeout for every external call (network, database, third-party api): default 5-30 seconds, tuned per dependency and set below the caller's latency budget on synchronous paths
- add retry logic with exponential backoff for transient failures (max 3 attempts, jitter to avoid thundering herd)
- implement circuit breaker for repeated failures: open after 5 consecutive failures, half-open after 30 seconds
- add bulkhead isolation: separate thread pools or connection pools per downstream service (limit: 10-20 concurrent connections per service)
- add fallback strategy: return cached/degraded response, queue for async processing, or fail gracefully
- output: resilience pattern checklist, code snippets (circuit breaker middleware, retry decorator, timeout config), fault injection test scenarios
- edge case: if circuit breaker trips, log alert and notify on-call; if fallback cache is stale, flag to user; if retry budget is exhausted, transition to dead letter queue
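the retry and circuit breaker thresholds above can be sketched in a few lines. this is an illustration rather than a production implementation: the class and function names are hypothetical, the breaker is single-threaded, and in the half-open state it lets every call through as a trial instead of limiting to one probe. the clock and sleep functions are injectable so the behaviour can be tested without waiting.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # open after 5 consecutive failures
        self.reset_timeout = reset_timeout          # half-open after 30 seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("circuit open; use the fallback strategy")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # a failed trial call while half-open, or reaching the threshold, (re)opens
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


def retry_with_backoff(fn: Callable[[], T], max_attempts: int = 3,
                       base_delay: float = 0.1, max_delay: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry fn with exponential backoff and full jitter; re-raise the last error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not retry while the circuit is open
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget exhausted; caller can route to a dead letter queue
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))  # jitter avoids a thundering herd
    raise AssertionError("unreachable")
```

a caller would typically wrap the downstream call as `retry_with_backoff(lambda: breaker.call(fetch))`, where `fetch` is the hypothetical function that performs the request with its own timeout, and catch `CircuitOpenError` to return the cached or degraded response.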
design observability for distributed tracing and correlation
- input: observability stack endpoints, service list, communication patterns
- implement correlation id injection: generate uuid at api gateway or request entry point, propagate via http headers (x-correlation-id) and async message metadata
- configure distributed tracing: instrument all http clients, database drivers, and message producers to emit spans with service name, operation name, duration, error status
- set up centralized logging: forward all service logs to elasticsearch/splunk with correlation id as a field for aggregation
- add request/response logging middleware: log at info level (headers, body summary, latency) without sensitive data
- output: observability config template (jaeger/datadog instrumentation), correlation id propagation code example, log schema with correlation id, sample dashboards
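- edge case: if log or trace volume exceeds storage budget, configure sampling (for example, 1 in 100 traces for high-volume endpoints) and consider tail-based sampling so error and slow traces are still kept; if tracing overhead impacts latency, lower the head-based sampling rate, since tail-based sampling still records every span in the service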
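correlation id propagation can be sketched with the python standard library alone. the example below assumes a hypothetical service named `orders` and hypothetical helper names (`start_request`, `outbound_headers`); it reuses an inbound x-correlation-id header when present, generates a uuid otherwise, and attaches the id to every log record and outbound call.

```python
import contextvars
import logging
import uuid
from typing import Dict, Mapping, Optional

CORRELATION_HEADER = "x-correlation-id"
_correlation_id: contextvars.ContextVar = contextvars.ContextVar(
    "correlation_id", default="")


def start_request(headers: Mapping[str, str]) -> str:
    """Reuse the inbound id if present; otherwise generate one at the entry point."""
    lowered = {key.lower(): value for key, value in headers.items()}
    cid = lowered.get(CORRELATION_HEADER) or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid


def outbound_headers(extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Headers (or message metadata) to attach to downstream calls and events."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = _correlation_id.get()
    return headers


class CorrelationFilter(logging.Filter):
    """Copy the current correlation id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = _correlation_id.get()
        return True


logger = logging.getLogger("orders")  # hypothetical service name
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s %(name)s "
    "correlation_id=%(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

start_request({"X-Correlation-Id": "abc-123"})
logger.info("order received")
print(outbound_headers({"accept": "application/json"}))
```

`contextvars` keeps the id separate per thread and per asyncio task, which is why it is used here instead of a module-level global.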
define deployment and health check strategy
- input: deployment platform (kubernetes), service list, ci/cd system
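- configure health probes for each service: liveness probe (process is responsive; failure triggers a container restart), readiness probe (can handle traffic; failure removes the pod from service endpoints), startup probe (initialization has finished; liveness and readiness checks are held off until it succeeds)
- define probe endpoints (e.g. /healthz, /readyz): return http 200 with json status when healthy and a non-2xx status (for example 503) when not, since kubernetes treats any status from 200 to 399 as success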
- choose progressive delivery strategy: canary (5% traffic to new version, monitor error rate and latency for 10 minutes), blue-green (run old and new in parallel, switch at once), or rolling (10% at a time)
- output: kubernetes deployment manifest template with probes, health check endpoints in code, canary/blue-green rollout script, runbook for rollback
- edge case: if startup time exceeds 30 seconds, raise the startup probe budget (failureThreshold times periodSeconds) to cover the worst-case boot time; if liveness probe is too aggressive, increase failure threshold to avoid cascading restarts; if canary metrics indicate regression, halt and roll back
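the probe endpoints can be sketched with the standard library http server. this is an illustration under assumptions: `check_database` is a hypothetical placeholder for a real connectivity test, port 8080 is arbitrary, and returning 503 from /readyz when a dependency is down is a common convention rather than something this document prescribes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def check_database() -> bool:
    """Hypothetical dependency check; replace with a real connectivity test."""
    return True


READINESS_CHECKS: Dict[str, Callable[[], bool]] = {"database": check_database}


def probe_response(path: str,
                   checks: Dict[str, Callable[[], bool]] = READINESS_CHECKS
                   ) -> Tuple[int, dict]:
    """Map a probe path to an http status code and a json-serializable body."""
    if path == "/healthz":
        # liveness: the process is up and able to answer
        return 200, {"status": "ok"}
    if path == "/readyz":
        # readiness: every dependency check must pass before taking traffic
        results = {name: bool(check()) for name, check in checks.items()}
        ready = all(results.values())
        status = "ready" if ready else "not ready"
        return (200 if ready else 503), {"status": status, "checks": results}
    return 404, {"status": "not found"}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = probe_response(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port: int = 8080) -> HTTPServer:
    """Build the probe server; call serve_forever() on the result to run it."""
    return HTTPServer(("0.0.0.0", port), ProbeHandler)
```

the deployment manifest would then point its liveness probe at /healthz and its readiness probe at /readyz on the same port.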
document architecture decision record and implementation roadmap
- input: all outputs from the seven preceding steps, team capacity, dependency constraints
- create architecture decision record (adr) per major decision (bounded contexts, async messaging, circuit breakers, data strategy)
- prioritize service decomposition: phase 1 (high-cohesion, low-coupling candidates), phase 2 (medium candidates), phase 3 (tightly coupled candidates)
- for each phase: list service name, dependencies, data migration strategy, rollout plan, success criteria
- output: adr documents (markdown, stored in git), implementation roadmap (timeline, dependencies, team assignments), success metrics (latency, error rate, deployment frequency)
- edge case: if team lacks microservices experience, add training and pairing sessions in phase 1; if timeline is compressed, focus on highest-risk services first
decision points
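- if sla latency budget is sub-100ms: use synchronous rest or grpc for that operation; else use async messaging with eventual consistency
- if strong consistency is required (financial transactions, inventory): keep the operation inside one service and database where possible; when it must span services, implement saga orchestration with compensating transactions and accept that intermediate states are visible until the saga completes; else use eventual consistency with event-driven updates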
- if service owns multiple bounded contexts: consider whether they should be separate services or fine-grained modules within one service based on deployment frequency and team structure; err toward separate services if deployed at different cadences
- if circuit breaker trips: log alert, notify on-call, and fall back to cached/degraded response; do not retry until circuit is half-open
- if startup probe fails (service hangs during boot): do not mark service as ready; do not send traffic; increase startup probe timeout or fix initialization logic
- if canary shows error rate increase (>0.5% above baseline): stop rollout immediately, roll back to previous version, and investigate; do not promote to production
- if consistency window exceeds business tolerance: escalate to product and negotiate sla or architecture change (e.g., dual-write, cdc, or stronger consistency guarantee)
- if team has no microservices experience: start with strangler fig pattern (wrap monolith with api gateway, gradually extract services) and pair junior engineers with architects; do not attempt big-bang rewrite
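the canary rule above can be expressed as a small gate function that a rollout script calls after each observation window. this sketch makes two assumptions that the text does not settle: ">0.5% above baseline" is read as 0.5 percentage points of absolute error rate, and a hypothetical minimum request count is required before any verdict is given. the type and function names are also hypothetical.

```python
from dataclasses import dataclass

# assumption: the 0.5% margin is absolute (percentage points), not relative
ERROR_RATE_MARGIN = 0.005


@dataclass
class Window:
    """Request and error counts observed over one monitoring window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: Window, canary: Window,
                    min_requests: int = 100) -> str:
    """Return 'wait', 'rollback', or 'promote' for the current window."""
    if canary.requests < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    if canary.error_rate - baseline.error_rate > ERROR_RATE_MARGIN:
        return "rollback"  # stop rollout and return to the previous version
    return "promote"


# hypothetical numbers: baseline 0.1% errors, canary 1.0% errors
print(canary_decision(Window(10_000, 10), Window(1_000, 10)))
```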
output contract
successful skill execution produces:
- bounded context diagram: visual map of services and their data ownership (draw.io, c4 model, or text-based ascii)
- service independence checklist: per-service validation (name, data ownership, api contract, deployment independence, dependency list)
- communication matrix: table of (source service, target service, protocol, latency sla, consistency model, example payload)
- api contracts: openapi 3.0 for rest or protobuf for grpc; includes request/response schemas, error codes, retry semantics
- async event schema: json schema or protobuf for each event type; includes correlation id, timestamp, source service, payload
- data topology diagram: services and their databases (sql, nosql, event store), plus change data capture pipelines if applicable
- consistency model spec: per operation, document strong or eventual consistency, saga steps if applicable, compensation logic
- resilience pattern checklist: per service, list timeouts, retry budgets, circuit breaker thresholds, bulkhead limits, fallback strategies
- observability config: correlation id propagation code, jaeger/datadog instrumentation snippets, log schema, sample dashboards
- kubernetes deployment manifest template: with liveness, readiness, startup probes, resource requests/limits, replica strategy
- canary/blue-green rollout script: automated traffic shifting, metric collection, rollback logic
- architecture decision records: markdown files documenting major choices (why rest over grpc, why saga over distributed lock, why kafka over http webhooks)
- implementation roadmap: phased service extraction plan with timeline, team assignments, success criteria, risk mitigation
- health check endpoint code: /healthz, /readyz returning json status with service dependencies, database connectivity, external api health
all artifacts stored in git (markdown, yaml, protobuf, json schema); diagrams in draw.io or lucidchart with version history.
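as an illustration of the async event schema artifact, the sketch below defines an event envelope carrying the four fields listed in the output contract (correlation id, timestamp, source service, payload). the `EventEnvelope` class, the extra `event_type` and `event_id` fields, and the sample values are assumptions added for the example.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict

REQUIRED_FIELDS = {"event_type", "source_service", "correlation_id", "payload"}


@dataclass(frozen=True)
class EventEnvelope:
    event_type: str
    source_service: str
    correlation_id: str
    payload: Dict[str, Any]
    # generated defaults; event_id is an assumption useful for deduplication
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "EventEnvelope":
        data = json.loads(raw)
        missing = REQUIRED_FIELDS - data.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        return cls(**data)


# hypothetical event emitted by an orders service
event = EventEnvelope(
    event_type="order.placed",
    source_service="orders",
    correlation_id="abc-123",
    payload={"order_id": "o-42", "total_cents": 1999},
)
print(event.to_json())
```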
outcome signal
the skill worked when:
- stakeholders confirm the bounded context diagram accurately represents their business domain and service boundaries make sense
- each service passes the independence checklist: owns its data, has a public api contract, can be deployed without coordinating other services
- a single request can be traced end-to-end across all services using its correlation id in logs and distributed traces (verify by running a request through canary and observing the trace in jaeger/datadog)
- the communication matrix is approved by the team and all synchronous calls have sub-100ms latency (measured in staging), async operations complete within agreed-upon consistency windows
- circuit breakers and bulkheads prevent cascading failures: when one service goes down, others continue serving with graceful degradation (verify with chaos engineering tests)
- health probes are green for all services in staging: liveness indicates service is alive, readiness indicates ready for traffic, startup indicates initialization is complete
- first canary rollout completes successfully: new version rolls to 5% of traffic, metrics (error rate, latency, business kpis) stay green for 10 minutes, then rolls to 100%
- team can articulate why each architectural choice was made (by reading adr documents) and agrees on phased rollout plan with assigned owners and target dates
- deployment frequency increases or remains stable (not degraded by architectural change) and mean time to recovery (mttr) decreases as observability improves