Description

You will manage the availability, latency, and performance of critical systems while driving platform-first initiatives for scalability.

Responsibilities

  • Maintain system availability and efficiency, and carry out capacity planning, through proactive monitoring and emergency response.
  • Identify and resolve system failures to ensure continuous reliability and prevent infrastructure bottlenecks.
  • Analyze operating system and application metrics to tune performance and identify faults.
  • Partner with development teams to enhance services through rigorous testing and release procedures.
  • Design and develop fully automated workflows using JavaScript, PowerShell, or Bash.

Required Skills

  • 5-8 years of experience with SRE and Observability concepts.
  • Strong scripting and automation experience with Python and shell.
  • Hands-on experience with DevOps tools including Jenkins, TeamCity, Ansible, and uDeploy.
  • Proficiency with monitoring tools such as Prometheus, Grafana, ITRS, and AppDynamics.
  • Expertise in Software Design and Rapid Prototyping.
  • Experience working within Agile development environments.
  • Ability to provide technical mentoring to other engineers.
  • B.E., B.Tech, or B.Sc. degree.

Education

Any Graduate