Description
Software Requirements
- AWS: strong hands-on experience with EC2, ECS/EKS, IAM, VPC, ALB/NLB, Route 53, S3, CloudWatch
- Docker
- Container orchestration using EKS/Kubernetes or ECS
- CI/CD using GitHub Actions, Jenkins, or Azure DevOps
- IaC using Terraform or CloudFormation
- Observability tools: CloudWatch, Prometheus/Grafana, ELK/OpenSearch, X-Ray
- Automation using Python and/or Bash
- Linux system administration and troubleshooting
- Networking knowledge covering DNS, TCP/IP, TLS, security groups, and network ACLs (NACLs)
Preferred
- Experience with CloudFront, RDS, ElastiCache, and Auto Scaling Groups (ASG)
- Blue/green and canary deployment strategies
- Artifact management and release approval workflows
- Vulnerability scanning and secrets management tools
Overall Responsibilities
- Define and maintain SLOs, SLIs, SLAs, and error budgets
- Build and manage AWS infrastructure for scalable, highly available systems
- Operate containerized services using Docker and ECS/EKS/Kubernetes
- Implement and optimize CI/CD pipelines and deployment strategies
- Establish observability through metrics, logs, and traces
- Automate infrastructure and operations using IaC and scripting
- Manage incident response, runbooks, root-cause analysis, and remediation
- Drive performance tuning, capacity planning, and cost optimization
- Implement security best practices across infrastructure and deployments
- Partner with development teams to improve reliability by design
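As an illustration of the SLO and error-budget work listed above, the arithmetic can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed method: the function names, the 30-day window, and the 99.9% target are assumptions chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines" (assumed example).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns 1.0 when no failures have occurred and a negative value
    when the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(round(error_budget_minutes(0.999, 30), 1))
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A team might compare the remaining budget against a threshold to decide whether to slow releases, though the exact policy varies by organization.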
Technical Skills (By Category)
Programming Languages
Essential: Python, Bash
Preferred: Scripting for operational automation and diagnostics
Databases / Data Management
Essential: Operational familiarity with RDS and ElastiCache in production environments
Preferred: Performance tuning and availability planning for managed data services
Cloud Technologies
Essential: AWS including EC2, ECS/EKS, IAM, VPC, ALB/NLB, Route 53, S3, CloudWatch
Preferred: CloudFront, Auto Scaling Groups, advanced cost optimization practices
Frameworks and Libraries
Essential: Docker, Kubernetes/EKS or ECS
Preferred: Reliability patterns such as circuit breakers, retries, backoff, health checks
Development Tools and Methodologies
Essential: CI/CD, Terraform or CloudFormation, monitoring and alerting, incident response, Linux troubleshooting
Preferred: Blue/green and canary deployments, release engineering improvements
Security Protocols
Essential: Least-privilege IAM, SSL/TLS, secrets handling, vulnerability awareness
Preferred: Automated scanning, policy enforcement, and remediation workflows
Experience Requirements
- 7+ years of experience in SRE, DevOps, or Cloud Operations
- Experience owning production infrastructure and reliability outcomes
- Strong experience with AWS, Docker, orchestration, CI/CD, IaC, and incident response
- Experience improving MTTR, availability, and operational efficiency
- Equivalent experience in related production engineering roles will also be considered
Day-to-Day Activities
- Maintain AWS environments and containerized services
- Monitor system health, alerts, logs, and traces
- Improve deployment pipelines and release reliability
- Participate in incident response, troubleshooting, and postmortems
- Update runbooks, dashboards, and automation scripts
- Work with Dev, QA, and Security teams on resilience and operational readiness
- Join standups, planning sessions, reviews, and reliability discussions
Qualifications
Required
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field, or equivalent practical experience
Preferred
- AWS, Kubernetes, Terraform, or cloud operations certifications
- Commitment to ongoing learning in reliability engineering, security, and performance optimization