You will architect and implement RunOps frameworks to support Day 2 operations, incident management, and automated remediation.
Responsibilities
- Lead the development and enhancement of BAU support models (L2/L3) for cloud infrastructure (Windows/Linux) and PaaS services.
- Design and maintain automation modules for platform features (Day 0/1) and resiliency features (Day 2) using Infrastructure as Code (IaC).
- Collaborate with SRE and application teams to ensure operational readiness and observability integration across environments.
- Drive adoption of AIOps and self-healing mechanisms to reduce MTTR and improve system reliability.
- Define and enforce operational SLAs, SLOs, and KPIs to measure and improve service performance.
Required Skills
- 7+ years of experience in cloud operations, managing AWS, Azure, or GCP as well as hybrid infrastructure.
- Expertise in IaC tools, specifically Terraform and Ansible, and CI/CD pipelines.
- Hands-on experience with observability and incident management tools (e.g., New Relic, PagerDuty, Splunk).
- Proficiency in scripting and automation using Python, Bash, and PowerShell.
- Experience with Linux operating systems and PaaS services.
- Familiarity with DevSecOps practices and integrating security into operational workflows.
- Experience with version control systems such as Git and hosting platforms like GitHub.