Description

You will maintain the reliability and performance of critical services by bridging the gap between development and operations.

Responsibilities

  • Design and implement resilient system architectures to support high availability and scalability.
  • Develop automation tools and scripts to increase operational efficiency and reduce manual toil.
  • Define, track, and analyze service level objectives (SLOs) and service level indicators (SLIs) to ensure performance meets business requirements.
  • Conduct post-mortem analyses to identify root causes and implement long-term solutions.
  • Troubleshoot issues involving database performance, network connectivity, and platform-level failures in Kubernetes or virtual machines.

Required Skills

  • 10+ years of experience in system architecture and design.
  • Proficiency in Python, Golang, or Java.
  • Strong understanding of SRE principles, including SLOs, SLIs, and toil reduction.
  • Experience managing cloud environments such as AWS, Azure, or Google Cloud.
  • Expertise in Linux system administration.
  • Proven ability to troubleshoot application, performance, and connectivity issues.
  • Strong grasp of networking concepts and troubleshooting techniques.
  • Graduate in any discipline.

Preferred Skills

  • Experience with CI/CD, infrastructure as code, and containerization.
  • Familiarity with monitoring tools and performance optimization.

Education

Graduate in any discipline