Description
You will design and implement a PySpark Structured Streaming application to ingest data from Confluent Kafka topics into Apache Iceberg tables.
Responsibilities
- Build streaming pipelines that parse JSON and Avro payloads, apply schema mappings, and write atomically to Iceberg using the foreachBatch micro-batch pattern.
- Configure Kafka source parameters including bootstrap servers, consumer group IDs, offset management, checkpoint paths, and trigger intervals.
- Implement PII detection and Protegrity tokenization hooks within the ingestion pipeline before data lands in the Iceberg Bronze layer.
- Write comprehensive unit and integration tests covering row count validation, schema conformance, Kafka offset commit verification, and comparison against source data.
- Support UAT by walking engineers through the code, demonstrating that it uses only supported Apache APIs, and addressing review findings.
Required Skills
- 4+ years of hands-on experience with Apache Kafka (producer and consumer APIs) using PySpark, Java, or Scala.
- Deep understanding of Kafka internals: topics, partitions, consumer groups, offsets, rebalancing, and exactly-once delivery semantics.
- Experience with Confluent Kafka: Schema Registry, Avro/JSON serialization, and Confluent Cloud or on-prem cluster configuration.
- Strong practical experience with PySpark Structured Streaming: Kafka/file sources, foreachBatch, output modes, and checkpoint management.
- Ability to tune streaming micro-batch trigger intervals, watermarking, and late data handling for production workloads.
- Hands-on experience writing streaming data directly to Apache Iceberg tables using the Iceberg Spark runtime.
- Experience implementing robust error handling, including dead-letter queues, parse error isolation, and recovery from checkpoint failures.
- Working knowledge of Apache Iceberg catalog configuration, schema definition, append writes, and partition strategies.
- Familiarity with S3-compatible object storage as an Iceberg warehouse destination and medallion architecture principles.