Description

You will manage and resolve production incidents in an AWS environment to minimize downtime and maintain system stability.

Responsibilities

  • Lead IT production incidents from detection through resolution, managing status and impact updates throughout.
  • Direct technical triage calls to analyze application and infrastructure health.
  • Perform detailed transaction-level monitoring and root cause analysis across web, database, storage, and network layers.
  • Conduct postmortem analysis for critical incidents to document corrective actions and lessons learned.
  • Collaborate with cross-functional teams to drive incident resolution and improve operational processes.

Required Skills

  • 5+ years of experience in incident management and application triage.
  • Expertise in AWS infrastructure, specifically EC2, ELB, RDS, Redshift, DynamoDB, Aurora, Route 53, ECS, and Lambda.
  • Hands-on experience with monitoring and troubleshooting tools such as Dynatrace, Splunk, SolarWinds, Moogsoft, ExtraHop, or Catchpoint.
  • Proficiency in diagnosing incidents within Unix/Linux and WinTel server environments.
  • Experience troubleshooting databases, including Oracle and Microsoft SQL Server.
  • Ability to perform transaction tracing and log analysis using the AWS Console and diagnostic tools.
  • Strong technical communication skills for providing executive-level reporting and stakeholder updates.
  • Knowledge of networking, SAN, and virtualization technologies.

Education

Bachelor's degree in any discipline.