You will architect and drive reliability strategy across the engineering organization.
Responsibilities
Architect and design highly available, scalable, secure, and cost-effective infrastructure and application patterns on AWS.
Define and evangelize SRE best practices, standards, and blueprints for service design, deployment, monitoring, and operational readiness.
Review existing observability implementations to identify gaps, and define the steps needed to mature the observability setup for deeper system insight.
Design solutions to systematically reduce operational toil through automation and improved system design.
Lead blameless postmortems for significant incidents, prioritizing systemic architectural improvements.
Required Skills
5+ years of proven experience in an architectural role designing for reliability, scalability, and performance.
Deep understanding and practical application of SRE principles (SLIs/SLOs, error budgets, toil reduction, automation, incident management, postmortems).
Expertise in cloud computing platforms, specifically AWS, including infrastructure, networking, and security services.
Strong experience with containerization and orchestration technologies (Docker, Kubernetes) and with serverless computing.
Solid experience designing and implementing observability solutions using tools such as Dynatrace, Prometheus, Grafana, the ELK/EFK stack, Jaeger, or OpenTelemetry.
Strong programming/scripting skills (Python, Go, Bash) for automation and tool development.
Experience with CI/CD pipelines and Infrastructure as Code.
Excellent analytical, problem-solving, and strategic thinking skills.
Preferred Skills
Experience designing and implementing chaos engineering practices and platforms.