Incident Management Specialist

You will manage and resolve production incidents in an AWS environment to minimize downtime and maintain system stability.

Lead IT production incidents from detection through resolution, issuing incident status and impact updates.
Direct technical triage calls to analyze application and infrastructure health.
Perform detailed transaction-level monitoring and root cause analysis across web, database, storage, and network layers.
Conduct postmortem analysis for critical incidents to document corrective actions and lessons learned.
Collaborate with cross-functional teams to drive resolution and improve operational processes.

5+ years of experience in incident management and application triage.
Expertise in AWS infrastructure, specifically EC2, ELB, RDS, Redshift, DynamoDB, Aurora, Route 53, ECS, and Lambda.
Hands-on experience with monitoring and troubleshooting tools such as Dynatrace, Splunk, SolarWinds, Moogsoft, ExtraHop, or Catchpoint.
Proficiency in diagnosing incidents within Unix/Linux and Wintel server environments.
Experience troubleshooting databases, including Oracle and Microsoft SQL Server.
Ability to perform transaction tracing and log analysis using the AWS Console and diagnostic tools.
Strong technical communication skills for executive-level reporting and stakeholder updates.
Knowledge of networking, SAN, and virtualization technologies.

Any Graduate
