You will own system reliability and operational excellence for critical services.
Responsibilities
- Ensure system reliability and availability by monitoring for issues, designing automated troubleshooting systems, and writing post-mortems.
- Reduce operational risk by assessing potential issues and implementing mitigation strategies in collaboration with development teams.
- Maintain and enhance CI/CD pipelines on AWS, including gatekeeping production deployments and managing Git branching strategies.
- Automate operational processes across the infrastructure.
- Solve performance and scalability issues within the system architecture.
Required Skills
- 8+ years of experience in DevOps and Site Reliability Engineering.
- Hands-on experience with containerization and orchestration, specifically Docker and Kubernetes/EKS.
- Proficiency with Infrastructure as Code tools: Terraform, Ansible, or CloudFormation.
- In-depth understanding of SRE principles, including monitoring, error budgets, and fault analysis.
- Expert knowledge of monitoring and observability tools, including Splunk.
- Demonstrable fluency in Java or Python for scripting and automation.
- Experience building and managing CI/CD pipelines and implementing network policies on AWS.
- Knowledge of Linux operating system principles and networking fundamentals.