Description
Key Responsibilities
- Manage and support Azure-based production environments, ensuring stability, performance, and availability.
- Implement SRE practices including SLIs, SLOs, error budgets, alert tuning, and toil reduction.
- Develop automation scripts and internal tooling using Python for reporting, integrations, and operational workflows.
- Use Azure Monitor, Log Analytics, and Application Insights to build dashboards, alerts, and KQL queries.
- Deploy, manage, and troubleshoot Azure services such as App Services, Functions, AKS, Storage, Key Vault, Azure SQL, Service Bus, and Event Hub.
- Implement and maintain CI/CD pipelines using Azure DevOps, GitHub Actions, Jenkins, or similar tools.
- Use Terraform, Bicep, or ARM templates for infrastructure provisioning and automation.
- Troubleshoot incidents across application, infrastructure, and network layers, performing root cause analysis.
- Improve platform reliability through automation, observability enhancements, and performance optimization.
- Collaborate with development, product, and platform engineering teams on deployments, releases, and reliability improvements.
Required Skills and Qualifications
- Strong hands-on experience with Microsoft Azure cloud services.
- Proven experience in SRE, DevOps, platform engineering, or production support roles.
- Strong Python scripting experience for automation, integrations, reporting, and tooling.
- Good understanding of the Azure SDK, REST APIs, Azure CLI, Bash, or PowerShell.
- Experience with Azure Monitor, Application Insights, Log Analytics, KQL, dashboards, and alerts.
- Strong knowledge of Azure networking: VNets, subnets, NSGs, Private Endpoints, Load Balancers, Application Gateway, and Azure Front Door.
- Experience with Terraform, Bicep, ARM templates, or similar IaC tools.
- Hands-on experience building and maintaining CI/CD pipelines.
- Good understanding of Linux and Windows environments.
- Strong expertise in incident management, troubleshooting, RCA, and operational excellence.
- Knowledge of SRE fundamentals (SLA, SLO, SLI, MTTR, MTTD, reliability engineering).
Good-to-Have Skills
- Experience with Kafka, Service Bus, Event Hub, or other high-volume messaging systems.
- Exposure to FinOps, cost optimization, or cloud governance.
- Experience working in financial services or other highly regulated environments.
- Familiarity with container platforms (AKS, Docker) and distributed systems.
- Knowledge of advanced observability tools (Prometheus, Grafana, Datadog).
- Experience collaborating with globally distributed teams.