Naperville, IL, USA
Job Responsibilities include:
Ensure high availability and reliability of enterprise applications in a 24×7 production environment.
Monitor systems, batch jobs, and workflows while proactively identifying and resolving issues.
Lead incident management (P1/P2), perform root cause analysis, and implement preventive solutions.
Design and maintain monitoring dashboards and improve system observability and alerting.
Support release deployments, change validation, and disaster recovery activities.
Collaborate with cross-functional teams and maintain documentation, runbooks, and knowledge base articles.
Required Qualifications:
Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent experience).
10+ years of experience in Site Reliability Engineering, Application Support, or production operations in enterprise environments.
Strong expertise in incident management, root cause analysis, and maintaining SLA/SLO compliance in a 24×7 environment.
Hands-on experience with AWS (EC2, S3, VPC), monitoring tools, and supporting enterprise applications across distributed systems.
Any Graduate