You will design, implement, and optimize scalable data pipelines within distributed computing environments.
Responsibilities
- Develop and maintain scalable data processing pipelines using Python and PySpark.
- Optimize existing pipelines for performance, scalability, and efficient use of distributed computing resources.
- Perform data wrangling, cleansing, and analysis on large datasets.
- Conduct code reviews, mentor junior developers, and maintain technical documentation.
- Troubleshoot, debug, and resolve data processing issues.
Required Skills
- 5+ years of experience in Python programming.
- 3+ years of hands-on experience with PySpark and distributed data processing.
- Strong understanding of Hadoop, Spark, and Hive.
- Proficiency with SQL and relational databases.
- Experience with ETL processes and data pipelines.
- Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field.
Preferred Skills
- Experience with containerization and orchestration technologies such as Docker and Kubernetes.
- Knowledge of DevOps practices and tools, including CI/CD pipelines, Jenkins, and Git.
- Experience with Apache Kafka or real-time data streaming.