Apply chaos engineering principles to validate system resilience and recovery.
Work with chaos engineering tools such as Gremlin, Litmus, and Chaos Mesh.
Automate operational and deployment processes to improve efficiency.
Build and maintain CI/CD pipelines using Jenkins, GitHub Actions, GitLab CI, or AWS CodePipeline.
Define and implement observability strategies covering monitoring, logging, tracing, and alerting.
Design and manage scalable, resilient, and cost-effective AWS cloud infrastructure.
Drive incident management, root cause analysis, and post-incident reviews.
Establish and maintain SLIs, SLOs, and SLAs for system reliability.
Implement Infrastructure as Code using Terraform and CloudFormation.
Ensure security, compliance, and governance standards are met.
Mentor engineers and promote reliability, automation, and operational excellence.
Collaborate with Development, QA, and Product teams to ensure seamless releases.
Qualifications: Mandatory Technical Skills
7+ years of experience in DevOps, Cloud Infrastructure, or Site Reliability Engineering roles.
Strong expertise in AWS Cloud services including EC2, EKS, Lambda, RDS, S3, CloudFront, IAM, and VPC.
Hands-on experience with Kubernetes (including EKS) and Docker.
Strong knowledge of Infrastructure as Code using Terraform and/or CloudFormation.
Experience implementing CI/CD pipelines using Jenkins, GitHub Actions, GitLab CI, or AWS CodePipeline.
Hands-on experience with monitoring and observability tools including Prometheus, Grafana, ELK/EFK, CloudWatch, Datadog, New Relic, Splunk, or Dynatrace.
Experience with incident management, troubleshooting, and reliability engineering practices.
Knowledge of cloud security, IAM, compliance frameworks, and governance controls.
Exposure to chaos engineering and resilience testing tools.
Strong understanding of SLIs, SLOs, SLAs, and operational excellence frameworks.