Key Skills: Splunk, AWS, GCP, Python, Terraform, Observability, Site Reliability Engineering, Grafana, OpenTelemetry, Distributed Tracing
Good to Have Skills:
- Experience with AIOps platforms and intelligent alerting solutions.
- Knowledge of Kubernetes and containerized environments.
- Experience integrating observability tools with ServiceNow and CI/CD ecosystems.
- AWS, GCP, Observability, or SRE-related certifications.
- Familiarity with Dynatrace and cloud-native architectures.
Roles & Responsibilities:
- Design and implement end-to-end observability solutions across applications, infrastructure, and cloud environments.
- Develop dashboards, alerts, and telemetry frameworks to provide real-time visibility into system health and performance.
- Build automation solutions to eliminate repetitive operational tasks and improve efficiency.
- Enable runbook automation, self-healing capabilities, and automated incident triage workflows.
- Define and implement SLIs, SLOs, and alerting strategies to improve service reliability.
- Drive improvements in MTTD and MTTR through actionable alerts and telemetry-driven insights.
- Implement proactive monitoring, anomaly detection, and predictive alerting to identify issues before customer impact.
- Leverage AIOps capabilities for alert correlation and intelligent incident response.
- Integrate observability platforms with CI/CD pipelines, cloud services, and ITSM tools such as ServiceNow.
- Collaborate with engineering, product, and operations teams to establish observability standards and operational readiness practices.
Experience Required:
- 3+ years of experience in Observability Engineering, Site Reliability Engineering, or related domains.
- Strong understanding of SLIs, SLOs, alerting strategies, and incident response frameworks.
- Experience implementing metrics, events, logs, and distributed traces (MELT) across distributed systems.
Education: Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience)