You will manage and scale mission-critical global hybrid infrastructure across multiple datacenters and cloud providers. You will own system uptime, capacity planning, and product SLAs for large-scale production environments.
Responsibilities
- Architect and maintain scalable, highly available systems designed to handle high-volume internet traffic.
- Participate in a weekly 24/7 on-call rotation to resolve outages, debug production issues, and solve escalated tickets.
- Automate routine tasks and manual processes through scripting and DevOps principles.
- Develop tools and platforms to improve system observability, insights, and security.
Required Skills
- 5+ years of experience in a Cloud SRE or similar role.
- Hands-on experience with cloud providers including AWS, GCP, or OCI.
- Proficiency with infrastructure-as-code and configuration management tools such as Terraform, Ansible, or Puppet.
- Experience managing containers using Kubernetes and Docker.
- Strong scripting skills in Python or Go for task automation.
- Experience with load balancers, reverse proxies, and caches such as HAProxy, Nginx, F5, dnsdist, or Varnish.
- Experience with web servers like Apache or Nginx.
- Ability to design, develop, and deploy modular cloud-based systems.
Preferred Skills
- AWS and/or GCP certifications.