Description

You will own system reliability and operational excellence for critical services.

Responsibilities

  • Ensure system reliability and availability by monitoring for issues, designing automated troubleshooting systems, and writing post-mortems.
  • Reduce operational risk by assessing potential issues and implementing mitigation strategies in collaboration with development teams.
  • Maintain and enhance CI/CD pipelines on AWS, including gatekeeping production deployments and managing Git branching strategies.
  • Automate operational processes across the infrastructure.
  • Diagnose and resolve performance and scalability issues within the system architecture.

Required Skills

  • 8+ years of experience in DevOps and Site Reliability Engineering.
  • Hands-on experience with containerization and orchestration, specifically Docker and Kubernetes/EKS.
  • Proficiency with Infrastructure as Code tools: Terraform, Ansible, or CloudFormation.
  • In-depth understanding of SRE principles, including monitoring, error budgets, and fault analysis.
  • Expert knowledge of monitoring and observability tools, including Splunk.
  • Demonstrable fluency in Java or Python for scripting and automation.
  • Experience building and managing CI/CD pipelines and implementing network policies on AWS.
  • Knowledge of Linux operating system principles and networking fundamentals.

Education

Bachelor's degree in any discipline.