intent

design cloud-native microservices architectures by applying domain-driven design to identify service boundaries, choose communication patterns, define data strategies, and implement resilience mechanisms. use this skill when decomposing a monolith, architecting a new distributed system, or evaluating whether a system is ready for microservices. produces reference documents, validation checklists, and implementation guides that teams use to build, deploy, and operate independent services at scale.

inputs

system context: current architecture (monolith, legacy, hybrid), business domain, team structure, deployment environment
ddd knowledge base: access to domain experts, business process documentation, aggregate definitions, bounded context boundaries
operational constraints: sla requirements, latency budgets (sub-100ms for sync, longer for async), consistency requirements, disaster recovery targets
deployment platform: kubernetes cluster credentials (KUBECONFIG env var or in-cluster service account), container registry, ci/cd system (github actions, jenkins, gitlab ci), service mesh (istio, linkerd) if applicable
observability stack: distributed tracing backend (jaeger, datadog, newrelic endpoint), centralized logging (elasticsearch, splunk, cloudwatch), metrics collector (prometheus scrape config), correlation id propagation strategy
external systems: kafka/rabbitmq broker for async messaging (connection string, credentials), api gateway or ingress controller configuration, identity provider for service-to-service auth (oauth2 client credentials scope)
reference materials: existing service inventory, database schemas, api contracts, deployment manifests

procedure

conduct domain analysis using domain-driven design
- input: system context, domain experts, business process documentation
- interview stakeholders to map business capabilities and value streams
- identify candidate bounded contexts by grouping related aggregates and entities
- define service boundaries: each service owns one or more cohesive bounded contexts
- output: bounded context diagram, service candidate list with capability mappings, data ownership matrix
- edge case: if domain knowledge is distributed across many teams, schedule multiple sessions; if legacy system has implicit domains, reverse-engineer by analyzing code coupling and data dependencies
validate service boundary independence
- input: bounded context diagram, candidate service list, current data schema
- for each service: confirm it owns its data model exclusively (no shared schema), has a clear public api contract, and can be deployed independently
- identify and document all inter-service dependencies (data, temporal, transactional)
- output: service independence checklist (pass/fail per service), dependency graph, data ownership contract per service
- edge case: if services currently share a database, document the schema split strategy and phased migration plan; if circular dependencies exist, redraw boundaries
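
the circular-dependency check above can be automated once the dependency graph exists. the following is a minimal sketch, assuming the graph is available as a plain mapping from each service to the services it calls; the service names and the `find_cycle` helper are hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of service names, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in deps.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so we closed a loop
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for service in deps:
        if color.get(service, WHITE) == WHITE:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


# hypothetical dependency graph: orders and payments call each other
example_deps = {
    "orders": ["payments", "inventory"],
    "payments": ["orders"],
    "inventory": [],
}
print(find_cycle(example_deps))  # ['orders', 'payments', 'orders']
```

a non-empty result is a signal to redraw boundaries before moving on to communication design.
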
design communication patterns and protocols
- input: service dependencies, sla requirements, consistency model, message volume estimates
- for query/command pairs with sub-100ms latency sla: choose synchronous rest or grpc
- for long-running operations or cross-aggregate workflows: choose async messaging (kafka, rabbitmq, cloud events)
- for read-heavy operations: consider eventual consistency with event-driven cache updates
- output: communication matrix (service pairs, protocol choice, justification), sample api contract (openapi for rest, protobuf for grpc), async event schema examples
- edge case: if latency budget is unknown, document assumption and flag for performance testing; if message volume exceeds broker capacity, recommend sharding strategy
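
as an illustration of the async event schema output, here is a minimal sketch of an event envelope carrying the fields named in the output contract (correlation id, timestamp, source service, payload). the class name, the `event_id` field, and the json encoding are assumptions for this example; a real system might use protobuf or a schema registry instead.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict

REQUIRED_FIELDS = {"event_type", "source_service", "correlation_id", "payload", "timestamp"}


@dataclass(frozen=True)
class EventEnvelope:
    event_type: str
    source_service: str
    correlation_id: str
    payload: Dict[str, Any]
    # event_id is an assumption: it lets consumers deduplicate redelivered messages
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(raw: str) -> "EventEnvelope":
        data = json.loads(raw)
        missing = REQUIRED_FIELDS - set(data)
        if missing:
            raise ValueError(f"event is missing fields: {sorted(missing)}")
        return EventEnvelope(**data)


# hypothetical event emitted by an "orders" service
event = EventEnvelope(
    event_type="order.placed",
    source_service="orders",
    correlation_id="abc-123",
    payload={"order_id": 42},
)
decoded = EventEnvelope.from_json(event.to_json())
print(decoded.event_type, decoded.correlation_id)
```
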
define data strategy and consistency model
- input: service dependencies, consistency requirements, regulatory constraints (gdpr, audit trails)
- apply database-per-service pattern: each service uses its own database (sql, nosql, or hybrid)
- document eventual consistency windows for cross-service operations
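- for cross-service operations that must complete as a unit: use sagas (orchestrated or choreographed) with compensating transactions; sagas give atomicity through compensation but only eventual consistency, so keep invariants that need strict consistency inside a single service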
- for high-cardinality state: consider event sourcing with snapshots
- output: data topology diagram, consistency model per operation (strong, eventual, saga compensation), database selection rationale, schema per service
- edge case: if schema needs to be shared for auditing, use change data capture (cdc) instead of direct schema coupling; if consistency window exceeds business tolerance, escalate to stakeholders
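
the orchestrated saga mentioned above can be sketched as a list of steps, each paired with a compensating action that is run in reverse order when a later step fails. this is a simplified in-process illustration: the `run_saga` helper and the step names are hypothetical, and a production orchestrator would also persist saga state and retry failed compensations.

```python
from typing import Callable, List, Tuple

# (step name, action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()  # assumption: compensations are idempotent and succeed
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


# hypothetical order workflow: stock is reserved, then the payment fails
stock = {"reserved": 0}


def reserve_stock() -> None:
    stock["reserved"] += 1


def release_stock() -> None:
    stock["reserved"] -= 1


def charge_card() -> None:
    raise RuntimeError("card declined")


def refund_card() -> None:
    pass


ok, saga_log = run_saga([
    ("reserve_stock", reserve_stock, release_stock),
    ("charge_card", charge_card, refund_card),
])
print(ok, saga_log)
# False ['done:reserve_stock', 'failed:charge_card', 'compensated:reserve_stock']
```
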
implement resilience patterns
- input: external dependencies, sla targets, blast radius assessment
- add explicit timeout for every external call (network, database, third-party api): default 5-30 seconds, tuned per dependency; calls on a sub-100ms synchronous path need timeouts far below that default
- add retry logic with exponential backoff for transient failures (max 3 attempts, jitter to avoid thundering herd)
- implement circuit breaker for repeated failures: open after 5 consecutive failures, half-open after 30 seconds
- add bulkhead isolation: separate thread pools or connection pools per downstream service (limit: 10-20 concurrent connections per service)
- add fallback strategy: return cached/degraded response, queue for async processing, or fail gracefully
- output: resilience pattern checklist, code snippets (circuit breaker middleware, retry decorator, timeout config), fault injection test scenarios
- edge case: if circuit breaker trips, log alert and notify on-call; if fallback cache is stale, flag to user; if retry budget is exhausted, transition to dead letter queue
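
the retry and circuit breaker rules above can be sketched together. this is a minimal single-threaded illustration using the thresholds from this step (3 attempts, open after 5 consecutive failures, half-open after 30 seconds); the class and function names are hypothetical, and the injectable `clock` and `sleep` parameters exist only to make the sketch testable.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("circuit open; use the fallback path")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # a failed trial call in half-open, or reaching the threshold, (re)opens the circuit
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(fn: Callable[[], T], max_attempts: int = 3, base_delay: float = 0.1,
                       max_delay: float = 2.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry fn with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except CircuitOpenError:
            raise  # never retry while the circuit is open
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget exhausted; caller may route to a dead letter queue
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))  # jitter spreads retries to avoid a thundering herd
    raise AssertionError("unreachable")


# hypothetical downstream call that always fails
breaker = CircuitBreaker()


def call_downstream() -> str:
    raise ConnectionError("downstream unavailable")


try:
    retry_with_backoff(lambda: breaker.call(call_downstream), sleep=lambda seconds: None)
except ConnectionError:
    print("falling back after", breaker.failures, "failures; circuit is", breaker.state)
```

wrapping the breaker inside the retry means every attempt counts toward the failure threshold, which is one possible design; some teams prefer to count only the final outcome of a retried call.
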
design observability for distributed tracing and correlation
- input: observability stack endpoints, service list, communication patterns
- implement correlation id injection: generate uuid at api gateway or request entry point, propagate via http headers (x-correlation-id) and async message metadata
- configure distributed tracing: instrument all http clients, database drivers, and message producers to emit spans with service name, operation name, duration, error status
- set up centralized logging: forward all service logs to elasticsearch/splunk with correlation id as a field for aggregation
- add request/response logging middleware: log at info level (headers, body summary, latency) without sensitive data
- output: observability config template (jaeger/datadog instrumentation), correlation id propagation code example, log schema with correlation id, sample dashboards
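- edge case: if trace or log volume exceeds the storage budget, configure sampling (for example 1 in 100 traces for high-volume endpoints) or tail-based sampling in the collector so that error and slow traces are kept; if tracing overhead impacts request latency, lower the head sampling rate, since tail sampling still records every span in-process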
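
correlation id propagation, as described in this step, might look like the following sketch built on the python standard library. the helper names are hypothetical; the x-correlation-id header comes from the step above, and the log format is only an example of exposing the id as a field.

```python
import contextvars
import logging
import uuid
from typing import Dict, Mapping, Optional

CORRELATION_HEADER = "x-correlation-id"
_correlation_id = contextvars.ContextVar("correlation_id", default="-")


def begin_request(headers: Mapping[str, str]) -> str:
    """Reuse the inbound correlation id, or generate one at the entry point."""
    lowered = {key.lower(): value for key, value in headers.items()}
    cid = lowered.get(CORRELATION_HEADER) or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid


def outbound_headers(extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Headers (or message metadata) to attach to downstream calls."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = _correlation_id.get()
    return headers


class CorrelationFilter(logging.Filter):
    """Attach the current correlation id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = _correlation_id.get()
        return True


def build_logger(name: str, stream=None) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s correlation_id=%(correlation_id)s %(message)s")
    )
    handler.addFilter(CorrelationFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# hypothetical request entering an "orders" service
log = build_logger("orders")
begin_request({"X-Correlation-ID": "abc-123"})
log.info("order received")  # logs: INFO correlation_id=abc-123 order received
print(outbound_headers({"accept": "application/json"}))
```

using a context variable rather than a global keeps the id isolated per request in threaded and asyncio servers.
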
define deployment and health check strategy
- input: deployment platform (kubernetes), service list, ci/cd system
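- configure health probes for each service: liveness probe (process is responsive; failure restarts the container), readiness probe (can handle traffic; failure removes the pod from service endpoints), startup probe (initialization has finished; liveness and readiness checks are held off until it succeeds)
- define probe endpoints (e.g. /healthz, /readyz): return http 200 with json status when healthy and a status outside 200-399 (e.g. 503) when not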
- choose progressive delivery strategy: canary (5% traffic to new version, monitor error rate and latency for 10 minutes), blue-green (run old and new in parallel, switch at once), or rolling (10% at a time)
- output: kubernetes deployment manifest template with probes, health check endpoints in code, canary/blue-green rollout script, runbook for rollback
- edge case: if startup time exceeds 30 seconds, size the startup probe so that failureThreshold x periodSeconds covers the worst-case startup time; if liveness probe is too aggressive, increase failure threshold to avoid cascading restarts; if canary metrics indicate regression, halt and rollback
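
the probe endpoints from this step could be implemented along these lines. this is a sketch using only the standard library: the dependency checks are placeholders, port 8080 is an assumption, and returning 503 from /readyz when a dependency is down is a common convention assumed here so that the readiness probe fails.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def check_database() -> bool:
    return True  # placeholder: replace with a real connectivity check


def check_broker() -> bool:
    return True  # placeholder: replace with a real broker ping


READINESS_CHECKS: Dict[str, Callable[[], bool]] = {
    "database": check_database,
    "broker": check_broker,
}


def healthz() -> Tuple[int, dict]:
    """Liveness: the process is up and able to answer."""
    return 200, {"status": "ok"}


def readyz(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Readiness: every dependency check must pass before taking traffic."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as not ready
    ready = all(results.values())
    status = "ready" if ready else "not_ready"
    return (200 if ready else 503), {"status": status, "checks": results}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":
            code, body = healthz()
        elif self.path == "/readyz":
            code, body = readyz(READINESS_CHECKS)
        else:
            code, body = 404, {"status": "not_found"}
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the probe server; call this from the service entry point."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

keeping dependency checks out of /healthz avoids restarting healthy containers when a downstream system is briefly unavailable.
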
document architecture decision record and implementation roadmap
- input: all outputs from the seven preceding steps, team capacity, dependency constraints
- create architecture decision record (adr) per major decision (bounded contexts, async messaging, circuit breakers, data strategy)
- prioritize service decomposition: phase 1 (high-cohesion, low-coupling candidates), phase 2 (medium candidates), phase 3 (tightly coupled candidates)
- for each phase: list service name, dependencies, data migration strategy, rollout plan, success criteria
- output: adr documents (markdown, stored in git), implementation roadmap (timeline, dependencies, team assignments), success metrics (latency, error rate, deployment frequency)
- edge case: if team lacks microservices experience, add training and pairing sessions in phase 1; if timeline is compressed, focus on highest-risk services first

decision points

if sla latency budget is sub-100 ms: use synchronous rest or grpc for that operation; else use async messaging with eventual consistency
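if an operation spanning services must complete as a unit (financial transactions, inventory): implement saga orchestration with compensating transactions, and keep any invariant that needs strict consistency inside one service, since a saga is only eventually consistent while it runs; else use eventual consistency with event-driven updates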
if service owns multiple bounded contexts: consider whether they should be separate services or fine-grained modules within one service based on deployment frequency and team structure; err toward separate services if deployed at different cadences
if circuit breaker trips: log alert, notify on-call, and fall back to cached/degraded response; do not retry until circuit is half-open
if startup probe fails (service hangs during boot): kubernetes restarts the container once the failure threshold is reached and never marks it ready, so no traffic is sent; raise the startup probe's failure threshold or period, or fix initialization logic
if canary shows error rate increase (>0.5% above baseline): stop rollout immediately, roll back to previous version, and investigate; do not promote to production
if consistency window exceeds business tolerance: escalate to product and negotiate sla or architecture change (e.g., dual-write, cdc, or stronger consistency guarantee)
if team has no microservices experience: start with strangler fig pattern (wrap monolith with api gateway, gradually extract services) and pair junior engineers with architects; do not attempt big-bang rewrite
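
the canary rule above can be expressed as a small gate function. this sketch assumes ">0.5% above baseline" means 0.5 percentage points of absolute error rate, which the rule does not state explicitly; the `min_requests` guard and all names are hypothetical additions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowMetrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: WindowMetrics, canary: WindowMetrics,
                    max_increase: float = 0.005, min_requests: int = 100) -> str:
    """Return 'wait', 'rollback', or 'promote' for one observation window."""
    if canary.requests < min_requests:
        return "wait"  # assumption: too little canary traffic to judge
    if canary.error_rate - baseline.error_rate > max_increase:
        return "rollback"  # regression beyond the allowed increase
    return "promote"


# hypothetical 10-minute window
baseline = WindowMetrics(requests=10_000, errors=20)    # 0.2% errors
bad_canary = WindowMetrics(requests=500, errors=5)      # 1.0% errors
good_canary = WindowMetrics(requests=500, errors=2)     # 0.4% errors
print(canary_decision(baseline, bad_canary))   # rollback
print(canary_decision(baseline, good_canary))  # promote
```
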

output contract

successful skill execution produces:

bounded context diagram: visual map of services and their data ownership (draw.io, c4 model, or text-based ascii)
service independence checklist: per-service validation (name, data ownership, api contract, deployment independence, dependency list)
communication matrix: table of (source service, target service, protocol, latency sla, consistency model, example payload)
api contracts: openapi 3.0 for rest or protobuf for grpc; includes request/response schemas, error codes, retry semantics
async event schema: json schema or protobuf for each event type; includes correlation id, timestamp, source service, payload
data topology diagram: services and their databases (sql, nosql, event store), plus change data capture pipelines if applicable
consistency model spec: per operation, document strong or eventual consistency, saga steps if applicable, compensation logic
resilience pattern checklist: per service, list timeouts, retry budgets, circuit breaker thresholds, bulkhead limits, fallback strategies
observability config: correlation id propagation code, jaeger/datadog instrumentation snippets, log schema, sample dashboards
kubernetes deployment manifest template: with liveness, readiness, startup probes, resource requests/limits, replica strategy
canary/blue-green rollout script: automated traffic shifting, metric collection, rollback logic
architecture decision records: markdown files documenting major choices (why rest over grpc, why saga over distributed lock, why kafka over http webhooks)
implementation roadmap: phased service extraction plan with timeline, team assignments, success criteria, risk mitigation
health check endpoint code: /healthz, /readyz returning json status with service dependencies, database connectivity, external api health

all artifacts stored in git (markdown, yaml, protobuf, json schema); diagrams in draw.io or lucidchart with version history.

outcome signal

the skill worked when:

stakeholders confirm the bounded context diagram accurately represents their business domain and service boundaries make sense
each service passes the independence checklist: owns its data, has a public api contract, can be deployed without coordinating other services
a single request can be traced end-to-end across all services using its correlation id in logs and distributed traces (verify by running a request through canary and observing the trace in jaeger/datadog)
the communication matrix is approved by the team, all synchronous calls have sub-100ms latency (measured in staging), and async operations complete within agreed-upon consistency windows
circuit breakers and bulkheads prevent cascading failures: when one service goes down, others continue serving with graceful degradation (verify with chaos engineering tests)
health probes are green for all services in staging: liveness indicates service is alive, readiness indicates ready for traffic, startup indicates initialization is complete
first canary rollout completes successfully: new version rolls to 5% of traffic, metrics (error rate, latency, business kpis) stay green for 10 minutes, then rolls to 100%
team can articulate why each architectural choice was made (by reading adr documents) and agrees on phased rollout plan with assigned owners and target dates
deployment frequency increases or remains stable (not degraded by architectural change) and mean time to recovery (mttr) decreases as observability improves

microservices-architect

related skills