Description
You will maintain the availability, performance, and scalability of our applications and distributed systems.
Responsibilities
- Build monitoring dashboards in Grafana, Splunk, and cloud monitoring tools to enable proactive system oversight.
- Automate alerts, notifications, and daemon processes to improve incident response.
- Perform root cause analysis (RCA), support solutions, and apply configuration changes across all environments.
- Manage code and data deployments and support releases through CI/CD pipelines.
- Troubleshoot and optimize distributed systems to ensure high reliability.
- Work in shifts and on weekends as required to support system uptime.
Required Skills
- 5+ years of experience in site reliability or systems engineering.
- Proficiency with AWS services including Lambda, SNS, and SQS.
- Hands-on experience with Docker and Kubernetes.
- Experience managing CI/CD pipelines and version control tools.
- Practical knowledge of automation tools such as Ansible, Chef, Puppet, GitLab, Terraform, or CloudFormation.
- Expertise in monitoring tools including Grafana and Splunk.
- Deep understanding of distributed systems architecture and troubleshooting.
- Degree in any field.