Site Reliability Engineer

You will maintain the reliability and performance of critical services by bridging the gap between development and operations.

Design and implement resilient system architectures to support high availability and scalability.
Develop automation tools and scripts to increase operational efficiency and reduce manual toil.
Define, track, and analyze SLOs and SLIs to ensure performance meets business requirements.
Conduct post-mortem analyses to identify root causes and implement long-term solutions.
Troubleshoot issues involving database performance, network connectivity, and platform-level failures in Kubernetes or virtual machines.

10+ years of experience in system architecture and design.
Proficiency in Python, Golang, or Java.
Strong understanding of SRE principles, including SLOs, SLIs, and toil reduction.
Experience managing cloud environments such as AWS, Azure, or Google Cloud.
Expertise in Linux system administration.
Proven ability to troubleshoot application, performance, and connectivity issues.
Strong grasp of networking concepts and troubleshooting techniques.
A graduate degree in any discipline.

