Description
Manage and scale production environments across AWS and Azure. Ensure high availability, reliability, and performance of distributed systems.
Responsibilities
- Design and implement scalable infrastructure that supports application reliability.
- Build and maintain monitoring tools, alerts, and dashboards using Grafana, Prometheus, or ELK.
- Automate repetitive operational tasks and build CI/CD pipelines using Jenkins and GitLab.
- Conduct post-incident root cause analysis and implement preventive measures.
- Manage cloud resource provisioning and configuration via CLI and APIs.
Required Skills
- 4+ years of experience in DevOps or Site Reliability Engineering.
- Proficiency with AWS and Azure cloud platforms.
- Hands-on experience with Docker containerization and Kubernetes orchestration.
- Strong scripting skills in Python, Shell, or Perl.
- Experience managing Jenkins CI/CD pipelines and GitLab CI/CD configurations.
- Solid understanding of Linux/Unix systems, including configuration and troubleshooting.
- Knowledge of networking protocols (TCP/IP, HTTP, DNS, DHCP) and cloud networking constructs such as VPCs.
- Familiarity with configuration management tools such as Ansible, Puppet, or Chef.
- Understanding of infrastructure as code, including Terraform syntax.
Preferred Skills
- Experience with Citrix technologies including XenApp, XenDesktop, and NetScaler.
- Familiarity with Windows OS and Active Directory.