You will manage and resolve production incidents in an AWS environment to minimize downtime and maintain system stability.
Responsibilities
- Lead IT production incidents from detection through resolution, and communicate incident status and impact updates.
- Direct technical triage calls to analyze application and infrastructure health.
- Perform detailed transaction-level monitoring and root cause analysis across web, database, storage, and network layers.
- Conduct postmortem analysis for critical incidents to document corrective actions and lessons learned.
- Collaborate with cross-functional teams to drive resolution and improve operational processes.
Required Skills
- 5+ years of experience in incident management and application triage.
- Expertise in AWS infrastructure, specifically EC2, ELB, RDS, Redshift, DynamoDB, Aurora, Route 53, ECS, and Lambda.
- Hands-on experience with monitoring and troubleshooting tools such as Dynatrace, Splunk, SolarWinds, Moogsoft, ExtraHop, or Catchpoint.
- Proficiency in diagnosing incidents in Unix/Linux and Wintel server environments.
- Experience troubleshooting databases, including Oracle and Microsoft SQL Server.
- Ability to perform transaction tracing and log analysis using the AWS Management Console and diagnostic tools.
- Strong technical communication skills for executive-level reporting and stakeholder updates.
- Knowledge of networking, SAN, and virtualization technologies.