Description
You will design and maintain data pipelines using Python, PySpark, and Java within the AWS ecosystem. Your work involves processing big data with EMR, Redshift, and Athena, while managing schema evolution with Apache Iceberg.
Responsibilities
- Build and optimize ETL processes using PySpark and AWS Glue for large-scale data processing.
- Implement Slowly Changing Dimension (SCD) Type 2 logic and data modeling strategies to ensure data integrity.
- Manage real-time data streaming and messaging with Kafka, and automate operational tasks with Shell scripting.
- Deploy and manage infrastructure on AWS using Lambda, Step Functions, and Kubernetes.
- Establish CI/CD best practices and handle Snowflake deployments via SnowSQL.
Required Skills
- 5+ years of hands-on programming experience in Python, PySpark, and Java.
- Deep expertise with Snowflake, SnowSQL, and Apache Iceberg for data warehousing.
- Strong proficiency in AWS services including EMR, Redshift, Athena, Glue, Lambda, and Step Functions.
- Experience with Kafka for data streaming and message queuing.
- Proficiency in Shell scripting for automation and operational tasks.
- Practical experience with CI/CD pipelines and Kubernetes (K8s) deployment.
- Understanding of data modeling concepts, specifically SCD Type 2.