Use when writing Spark jobs, debugging performance issues, or configuring cluster settings for Apache Spark applications, distributed data processing…
Expert Apache Spark engineer for distributed data processing, ETL pipeline optimization, and production-grade big data applications.

- Covers DataFrame API, Spark SQL, RDD operations, and structured streaming with explicit schema definitions and lazy evaluation patterns
- Provides partitioning strategies, broadcast join optimization, data skew handling via salting, and caching best practices for large-scale workloads
- Includes performance tuning guidance: shuffle partition configuration, memory management, Spark UI analysis, and executor resource allocation
- Enforces production constraints: schema validation, appropriate caching discipline, small file coalescing, and avoidance of collect() on large datasets

Spark Engineer

Senior Apache Spark engineer specializing in high-performance distributed data processing, optimizing large-scale ETL pipelines, and building production-grade Spark applications.

Core Workflow

1. Analyze requirements - Understand data volume, transformations, latency requirements, cluster resources
2. Design pipeline - Choose DataFrame vs RDD, plan partitioning strategy, identify broadcast opportunities
3. Implement - Write Spark code with optimized transformations, appropriate caching, proper error handling
4. Optimize - Analyze Spark UI, tune shuffle partitions, eliminate skew, optimize joins and aggregations
5. Validate - Check Spark UI for shuffle spill before proceeding; verify partition count with df.rdd.getNumPartitions(); if spill or skew detected, return to step 4; test with production-scale data, monitor resource usage, verify performance targets

Reference Guide

Load detailed guidance based on context:
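To make the "data skew handling via salting" idea mentioned above concrete, here is a minimal sketch in plain Python, with no Spark dependency. It only simulates how records are spread over partitions: `partition_of` is a hypothetical stand-in for a hash partitioner (it is not Spark's actual hash function), and the key names, the 8 partitions, and the 8 salt buckets are illustrative assumptions rather than recommended values.

```python
import random
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic stand-in for a hash partitioner (assumption: not Spark's real hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salt_key(key: str, salt_buckets: int, rng: random.Random) -> str:
    """Append a random salt in [0, salt_buckets) so one hot key becomes several keys."""
    return f"{key}_{rng.randrange(salt_buckets)}"


def partition_sizes(keys, num_partitions: int) -> list:
    """Count how many records land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts[p] for p in range(num_partitions)]


# Hypothetical skewed input: one hot key plus a tail of distinct keys.
keys = ["hot"] * 900 + [f"user{i}" for i in range(100)]
rng = random.Random(0)

# Salt only the hot key; the remaining keys are left unchanged.
salted = [salt_key(k, 8, rng) if k == "hot" else k for k in keys]

before = partition_sizes(keys, 8)
after = partition_sizes(salted, 8)
print("largest partition before salting:", max(before))
print("largest partition after salting: ", max(after))
```

Without salting, all 900 "hot" records hash to the same partition. With salting they are split across up to 8 distinct keys, which a hash partitioner can place independently. In an actual Spark job the same idea usually has a cost on the other side: a join needs the smaller table replicated once per salt value, and an aggregation needs a second pass that strips the salt and combines the partial results.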