Bangalore Urban, Karnataka, India
• Design and implement end-to-end observability solutions across applications and infrastructure
• Establish unified visibility across logs, metrics, and distributed tracing
• Define and standardize monitoring frameworks, dashboards, and alerting strategies
• Enable proactive detection of issues through intelligent alerting and anomaly detection
Monitoring & Tooling
• Implement and manage tools such as Splunk, Datadog, Prometheus, Grafana, New Relic, or similar
• Build actionable dashboards for SRE, operations, and business stakeholders
• Optimize alert configurations to reduce noise and improve signal quality
• Continuously enhance monitoring coverage across systems and services
Incident Support & Reliability
• Work a late-night / US-overlap shift covering production monitoring and incident response
• Analyze logs, metrics, and traces to support incident triage and root cause analysis (RCA)
• Collaborate with SRE and engineering teams to improve system reliability and performance
• Participate in post-incident reviews and continuous improvement initiatives
Automation & Integration
• Automate monitoring setup and configuration using Infrastructure as Code (IaC)
• Integrate observability tools with CI/CD pipelines and DevOps workflows
• Develop scripts/tools to improve data collection, alerting, and reporting
Platform & Integration Support
• Monitor enterprise applications, APIs, and integration layers (e.g., middleware, cloud services)
• Ensure end-to-end visibility across distributed systems and microservices architectures
• Work closely with platform teams (cloud, Salesforce, etc.) to enhance observability
Governance & Compliance
• Ensure monitoring practices align with security and compliance requirements (e.g., SOX)
• Maintain runbooks, documentation, and monitoring standards
• Support audit and governance requirements as needed
Required Skills & Qualifications
Technical Skills
• Strong experience in observability, monitoring, or SRE roles
• Hands-on experience with tools like Splunk, Datadog, Prometheus, Grafana, New Relic
• Strong understanding of logs, metrics, traces, and distributed systems
• Experience with APM tools and performance monitoring
• Scripting skills (Python, Bash, PowerShell, or similar)
• Familiarity with CI/CD tools (Jenkins, GitHub Actions, Azure DevOps)
• Knowledge of Infrastructure as Code (Terraform or similar)
Operational Excellence
• Experience supporting production environments in 24x7 models
• Strong incident management and RCA capabilities
• Ability to analyze performance issues and recommend improvements
Soft Skills
• Ability to work effectively in a late-night / US-overlap shift
• Strong communication and collaboration skills
• Proactive mindset with a focus on continuous improvement
Any Graduate