You will maintain the reliability and performance of critical services by bridging the gap between development and operations.
Responsibilities
- Design and implement resilient system architectures to support high availability and scalability.
- Develop automation tools and scripts to increase operational efficiency and reduce manual toil.
- Define, track, and analyze SLOs and SLIs to ensure performance meets business requirements.
- Conduct post-mortem analyses to identify root causes and implement long-term solutions.
- Troubleshoot issues involving database performance, network connectivity, and platform-level failures in Kubernetes or virtual machines.
Required Skills
- 10+ years of experience in system architecture and design.
- Proficiency in Python, Golang, or Java.
- Strong understanding of SRE principles, including SLOs, SLIs, and toil reduction.
- Experience managing cloud environments such as AWS, Azure, or Google Cloud.
- Expertise in Linux system administration.
- Proven ability to troubleshoot application, performance, and connectivity issues.
- Strong grasp of networking concepts and troubleshooting techniques.
- Graduate degree in any discipline.
Preferred Skills
- Experience with CI/CD, infrastructure as code, and containerization.
- Familiarity with monitoring tools and performance optimization.