Description

You will architect and implement RunOps frameworks to support Day 2 operations, incident management, and automated remediation.

Responsibilities

  • Lead the development and enhancement of BAU support models (L2/L3) for cloud infrastructure (Windows/Linux) and PaaS services.
  • Design and maintain automation modules for platform features (Day 0/1) and resiliency features (Day 2) using Infrastructure as Code (IaC).
  • Collaborate with SRE and application teams to ensure operational readiness and observability integration across environments.
  • Drive adoption of AIOps and self-healing mechanisms to reduce MTTR and improve system reliability.
  • Define and enforce operational SLAs, SLOs, and KPIs to measure and improve service performance.

Required Skills

  • 7+ years of experience in cloud operations, managing AWS, Azure, or GCP as well as hybrid infrastructure.
  • Expertise in IaC tools, specifically Terraform and Ansible, and CI/CD pipelines.
  • Hands-on experience with observability and incident management tools (e.g., New Relic, PagerDuty, Splunk).
  • Proficiency in scripting and automation using Python, Bash, and PowerShell.
  • Experience with Linux operating systems and PaaS services.
  • Familiarity with DevSecOps practices and integrating security into operational workflows.
  • Experience with Git-based version control platforms such as GitHub.

Education

Bachelor's degree