You will build and run large-scale, massively distributed, fault-tolerant systems in a DevSecOps environment. You will work closely with development and operations teams to keep these systems highly available and cost-effective. You will also participate in a 24/7 follow-the-sun operating model as a first responder for incident and problem management.
Responsibilities
- Build infrastructure as code patterns using Terraform and cloud SDKs that meet security and engineering standards.
- Create auto-remediation tools and scripts, and establish end-to-end monitoring and alerting for critical system components.
- Troubleshoot and resolve tickets in collaboration with cloud operations teams.
- Automate and orchestrate Linux/Windows systems and containers to eliminate toil.
- Ensure functional and performance objectives are met through rigorous monitoring of infrastructure and application uptime.
Required Skills
- 5+ years of experience in public cloud development or administration.
- 2+ years of experience with monitoring infrastructure and application availability.
- Proficiency in one or more of Python, Bash, Java, Go, or JavaScript (Node.js).
- Experience with system administration, including automation and orchestration of Linux/Windows systems.
- Hands-on experience with containers (Docker, Kubernetes) and with infrastructure as code and configuration management (Terraform, Chef, Ansible).
- 2+ years of experience with CI/CD tooling and practices.
- Cross-functional knowledge of systems, storage, networking, security, and databases.
- BS degree in Computer Science, Physics, or Mathematics, or equivalent job experience.