Description

You will design, implement, and optimize scalable data pipelines within distributed computing environments.

Responsibilities

  • Develop and maintain scalable data processing pipelines using Python and PySpark.
  • Optimize existing pipelines for performance, scalability, and efficient use of distributed computing resources.
  • Perform data wrangling, cleansing, and analysis on large datasets.
  • Conduct code reviews, mentor junior developers, and maintain technical documentation.
  • Troubleshoot, debug, and resolve data processing issues.

Required Skills

  • 5+ years of experience in Python programming.
  • 3+ years of hands-on experience with PySpark and distributed data processing.
  • Strong understanding of Hadoop, Spark, and Hive.
  • Proficiency with SQL and relational databases.
  • Experience with ETL processes and data pipelines.
  • Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field.

Preferred Skills

  • Experience with Docker, Kubernetes, or other containerization technologies.
  • Knowledge of DevOps practices and tools, including CI/CD pipelines, Jenkins, and Git.
  • Experience with Apache Kafka or real-time data streaming.

Education

Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field.