You will build and orchestrate data pipelines and distributed computing applications within an AWS ecosystem.
Responsibilities
- Build and orchestrate data pipelines and ETL processes.
- Develop distributed computing applications using PySpark.
- Design and implement data models and schemas, applying normalization and denormalization as appropriate.
- Write, maintain, and execute automated unit tests following Test-Driven Development (TDD) practices.
- Build APIs and manage serverless architectures.
Required Skills
- 5+ years of experience in big data environments.
- Proficiency in Python programming.
- Strong expertise in SQL, Presto, Hive, and Spark.
- Experience with PySpark and libraries including Pandas, Polars, and NumPy.
- Extensive experience with AWS services: EMR, Lambda, Glue ETL, Step Functions, S3, ECS, Kinesis, IAM, RDS PostgreSQL, DynamoDB, CloudWatch Events/EventBridge, Athena, SNS, SQS, and VPC.
- Experience with relational and NoSQL databases, and with Amazon Redshift.
- Knowledge of trading and investment data.
- Experience with OneTick or kdb+.
- Understanding of CI/CD, source control, and data warehousing concepts.
Preferred Skills
- Proficiency in data visualization tools, specifically Tableau.