You will manage the availability, latency, and performance of critical systems while driving platform-first initiatives for scalability.
Responsibilities
- Maintain system availability and efficiency, and plan capacity, through proactive monitoring and emergency response.
- Identify and resolve system failures to ensure continuous reliability and prevent infrastructure bottlenecks.
- Analyze operating system and application metrics to perform tuning and fault identification.
- Partner with development teams to enhance services through rigorous testing and release procedures.
- Design and develop fully automated workflows using JavaScript, PowerShell, or Bash.
Required Skills
- 5-8 years of experience with SRE and Observability concepts.
- Strong scripting and automation experience with Python and shell.
- Hands-on experience with DevOps tools including Jenkins, TeamCity, Ansible, and uDeploy.
- Proficiency with monitoring tools such as Prometheus, Grafana, ITRS, and AppDynamics.
- Expertise in Software Design and Rapid Prototyping.
- Experience working within Agile development environments.
- Ability to provide technical mentoring to other engineers.
- B.E., B.Tech, or B.Sc. degree.