Expert in Apache Kafka, Event Streaming, and Real-time Data Pipelines. Specializes in Kafka Connect, KSQL, and Schema Registry.

SKILL.md

Kafka Engineer

Purpose

Provides Apache Kafka and event streaming expertise specializing in scalable event-driven architectures and real-time data pipelines. Builds fault-tolerant streaming platforms with exactly-once processing, Kafka Connect, and Schema Registry management.

When to Use

Designing event-driven microservices architectures

Setting up Kafka Connect pipelines (CDC, S3 Sink)

Writing stream processing apps (Kafka Streams / ksqlDB)

Debugging consumer lag, rebalancing storms, or broker performance

Designing schemas (Avro/Protobuf) with Schema Registry

Configuring ACLs and mTLS security

2. Decision Framework
Architecture Selection

What is the use case?
│
├─ **Data Integration (ETL)**
│ ├─ DB to DB/Data Lake? → **Kafka Connect** (Zero code)
│ └─ Complex transformations? → **Kafka Streams**
│
├─ **Real-Time Analytics**
│ ├─ SQL-like queries? → **ksqlDB** (Quick aggregation)
│ └─ Complex stateful logic? → **Kafka Streams / Flink**
│
└─ **Microservices Comm**
├─ Event Notification? → **Standard Producer/Consumer**
└─ Event Sourcing? → **State Stores (RocksDB)**

Config Tuning (The "Big 3")

Throughput: batch.size, linger.ms, compression.type=lz4.

Latency: linger.ms=0, acks=1.

Durability: acks=all, min.insync.replicas=2, replication.factor=3.

Red Flags → Escalate to sre-engineer:

"Unclean leader election" enabled (Data loss risk)

Zookeeper dependency in new clusters (Use KRaft mode)

Disk usage > 80% on brokers

Consumer lag constantly increasing (Capacity mismatch)

3. Core Workflows

Workflow 1: Kafka Connect (CDC)

Goal: Stream changes from PostgreSQL to S3.

Steps:

Source Config (postgres-source.json)

{
"name": "postgres-source",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "db-host",
"database.dbname": "mydb",
"database.user": "kafka",
"plugin.name": "pgoutput"
}
}

Sink Config (s3-sink.json)

{
"name": "s3-sink",
"config": {
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"s3.bucket.name": "my-datalake",
"format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
"flush.size": "1000"
}
}

Deploy

curl -X POST -d @postgres-source.json http://connect:8083/connectors

Workflow 3: Schema Registry Integration

Goal: Enforce schema compatibility.

Steps:

Define Schema (user.avsc)

{
"type": "record",
"name": "User",
"fields": [
{"name": "id", "type": "int"},
{"name": "name", "type": "string"}
]
}

Producer (Java)

Use KafkaAvroSerializer.

Registry URL: http://schema-registry:8081.

5. Anti-Patterns & Gotchas

❌ Anti-Pattern 1: Large Messages

What it looks like:

Sending 10MB images payload in Kafka message.

Why it fails:

Kafka is optimized for small messages (< 1MB). Large messages block the broker threads.

Correct approach:

Store image in S3.

Send Reference URL in Kafka message.

❌ Anti-Pattern 2: Too Many Partitions

What it looks like:

Creating 10,000 partitions on a small cluster.

Why it fails:

Slow leader election (Zookeeper overhead).

High file handle usage.

Correct approach:

Limit partitions per broker (~4000). Use fewer topics or larger clusters.

❌ Anti-Pattern 3: Blocking Consumer

What it looks like:

Consumer doing heavy HTTP call (30s) for each message.

Why it fails:

Rebalance storm (Consumer leaves group due to timeout).

Correct approach:

Async Processing: Move work to a thread pool.

Pause/Resume: consumer.pause() if buffer is full.

7. Quality Checklist

Configuration:

Replication: Factor 3 for production.

Min.ISR: 2 (Prevents data loss).

Retention: Configured correctly (Time vs Size).

Observability:

Lag: Consumer Lag monitored (Burrow/Prometheus).

Under-replicated: Alert on under-replicated partitions (>0).

JMX: Metrics exported.

Examples

Example 1: Real-Time Fraud Detection Pipeline

Scenario: A financial services company needs real-time fraud detection using Kafka streaming.

Architecture Implementation:

Event Ingestion: Kafka Connect CDC from PostgreSQL transaction database

Stream Processing: Kafka Streams application for real-time pattern detection

Alert System: Producer to alert topic triggering notifications

Storage: S3 sink for historical analysis and compliance

Pipeline Configuration:

Component
Configuration
Purpose

Topics
3 (transactions, alerts, enriched)
Data organization

Partitions
12 (3 brokers × 4)
Parallelism

Replication
3
High availability

Compression
LZ4
Throughput optimization

Key Logic:

Detects velocity patterns (5+ transactions in 1 minute)

Identifies geographic anomalies (impossible travel)

Flags high-risk merchant categories

Results:

99.7% of fraud detected in under 100ms

False positive rate reduced from 5% to 0.3%

Compliance audit passed with zero findings

Example 2: E-Commerce Order Processing System

Scenario: Build a resilient order processing system with Kafka for high reliability.

System Design:

Order Events: Topic for order lifecycle events

Inventory Service: Consumes orders, updates stock

Payment Service: Processes payments, publishes results

Notification Service: Sends confirmations via email/SMS

Resilience Patterns:

Dead Letter Queue for failed processing

Idempotent producers for exactly-once semantics

Consumer groups with manual offset management

Retries with exponential backoff

Configuration:

# Producer Configuration
acks: all
retries: 3
enable.idempotence: true

# Consumer Configuration
auto.offset.reset: earliest
enable.auto.commit: false
max.poll.records: 500

Results:

99.99% message delivery reliability

Zero duplicate orders in 6 months

Peak processing: 10,000 orders/second

Example 3: IoT Telemetry Platform

Scenario: Process millions of IoT device telemetry messages with Kafka.

Platform Architecture:

Device Gateway: MQTT to Kafka proxy

Data Enrichment: Stream processing adds device metadata

Time-Series Storage: S3 sink partitioned by device_id/date

Real-Time Alerts: Threshold-based alerting for anomalies

Scalability Configuration:

50 partitions for parallel processing

Compression enabled for cost optimization

Retention: 7 days hot, 1 year cold in S3

Schema Registry for data contracts

Performance Metrics:

Metric
Value

Throughput
500,000 messages/sec

Latency (P99)
50ms

Consumer lag
< 1 second

Storage efficiency
60% reduction with compression

Best Practices

Topic Design

Naming Conventions: Use clear, hierarchical topic names (domain.entity.event)

Partition Strategy: Plan for future growth (3x expected throughput)

Retention Policies: Match retention to business requirements

Cleanup Policies: Use delete for time-based, compact for state

Schema Management: Enforce schemas via Schema Registry

Producer Optimization

Batching: Increase batch.size and linger.ms for throughput

Compression: Use LZ4 for balance of speed and size

Acks Configuration: Use all for reliability, 1 for latency

Retry Strategy: Implement retries with backoff

Idempotence: Enable for exactly-once semantics in critical paths

Consumer Best Practices

Offset Management: Use manual commit for critical processing

Batch Processing: Increase max.poll.records for efficiency

Rebalance Handling: Implement graceful shutdown

Error Handling: Dead letter queues for poison messages

Monitoring: Track consumer lag and processing time

Security Configuration

Encryption: TLS for all client-broker communication

Authentication: SASL/SCRAM or mTLS for production

Authorization: ACLs with least privilege principle

Quotas: Implement client quotas to prevent abuse

Audit Logging: Log all access and configuration changes

Performance Tuning

Broker Configuration: Optimize for workload type (throughput vs latency)

JVM Tuning: Heap size and garbage collector selection

OS Tuning: File descriptor limits, network settings

Monitoring: Metrics for throughput, latency, and errors

Capacity Planning: Regular review and scaling assessment

Security:

Encryption: TLS enabled for Client-Broker and Inter-broker.

Auth: SASL/SCRAM or mTLS enabled.

ACLs: Principle of least privilege (Topic read/write).

kafka-engineer

SKILL.md

related skills