Description

You will build and maintain the platform supporting machine learning workflows.

Responsibilities

  • Operate and troubleshoot Kubernetes environments, specifically AWS EKS.
  • Build and manage continuous integration and continuous deployment pipelines.
  • Support production platforms, including incident management and root cause analysis.
  • Implement infrastructure using IaC tools like Terraform or CloudFormation.
  • Manage and monitor ML workflows spanning training, inference, and model monitoring.

Required Skills

  • 3+ years of hands-on experience with AWS services (EKS, EC2, S3, IAM, CloudWatch, ECR).
  • Strong experience operating and troubleshooting Kubernetes.
  • Proficiency in containerization using Docker.
  • Strong programming and scripting skills in Python and Bash.
  • Experience building and managing CI/CD pipelines (GitLab or equivalent).
  • Familiarity with machine learning workflows (training, inference, monitoring).
  • Experience with infrastructure-as-code (Terraform or CloudFormation).
  • Experience supporting production platforms and incident management.
  • Graduate in any discipline.

Education

Any Graduate