You will build and maintain the platform supporting machine learning workflows.
Responsibilities
- Operate and troubleshoot Kubernetes environments, specifically AWS EKS.
- Build and manage continuous integration and continuous deployment pipelines.
- Support production platforms, including incident management and root cause analysis.
- Provision infrastructure using infrastructure-as-code (IaC) tools such as Terraform or CloudFormation.
- Manage and monitor ML workflows spanning training, inference, and model monitoring.
Required Skills
- 3+ years of hands-on experience with AWS services (EKS, EC2, S3, IAM, CloudWatch, ECR).
- Strong experience operating and troubleshooting Kubernetes.
- Proficiency in containerization using Docker.
- Strong programming and scripting skills in Python and Bash.
- Experience building and managing CI/CD pipelines (GitLab or equivalent).
- Familiarity with machine learning workflows (training, inference, monitoring).
- Experience with infrastructure-as-code (Terraform or CloudFormation).
- Experience supporting production platforms and incident management.
- Graduate education in any discipline.