Design and maintain fault-tolerant, high-availability architectures across multiple cloud platforms including AWS, Azure, and GCP.
Deploy, manage, and optimize cloud resources using Infrastructure as Code (IaC) tools such as Terraform and Ansible.
Implement monitoring, alerting, and logging frameworks to proactively detect and resolve issues, ensuring optimal system performance.
Lead incident response efforts, conduct root-cause analysis, and facilitate postmortem reviews to improve system resilience.
Develop automation scripts and internal tooling using languages like Python, PowerShell, or Bash to streamline operations and reduce manual intervention.
What's Needed?
Proficiency in programming and scripting languages such as Python, PowerShell, or Bash for automation tasks.
Hands-on experience with cloud platforms such as AWS, Azure, or GCP, with a solid understanding of core services and concepts such as VPCs, IAM, and serverless architectures.
Experience with containerization and orchestration tools such as Docker and Kubernetes.
Knowledge of Infrastructure as Code (IaC) tools like Terraform and Ansible.
Strong expertise in monitoring and observability tools such as Splunk, Azure Monitor, Dynatrace, or AWS CloudWatch.