Description
Manage and operate large-scale, distributed, fault-tolerant systems within a DevSecOps environment. Build highly available, cost-effective systems that meet uptime targets, in collaboration with development and operations teams.
Responsibilities
- Develop scripts and tools for incident auto-remediation and end-to-end monitoring and alerting.
- Implement infrastructure as code (IaC) patterns that meet security standards, using Terraform, cloud CLIs, and cloud SDKs.
- Resolve trouble tickets and troubleshoot issues alongside the cloud operations team.
- Participate in a 24/7, follow-the-sun operating model as part of a first responder team for incident and problem management.
Required Skills
- BS degree in Computer Science or a related technical field involving coding, or equivalent experience.
- 2+ years of experience in monitoring infrastructure and application uptime and availability.
- 2+ years of experience with Python, Bash, Java, Go, or JavaScript/Node.js.
- 2+ years of experience with system administration, including Linux/Windows automation and orchestration using Terraform, Chef, or Ansible, or container tooling such as Docker and Kubernetes.
- 2+ years of experience with continuous integration and continuous delivery (CI/CD) tooling and practices.
- 2+ years of cross-functional knowledge involving systems, storage, networking, security, and databases.
- 1+ year of experience developing or administering software in a public cloud environment.
Preferred Skills
- Experience designing, analyzing, and troubleshooting large-scale distributed systems.
- Experience managing infrastructure as code with Terraform or CloudFormation.
- Experience building or maintaining systems in highly secure, regulated, or compliant industries.