Description
You will own the reliability, automation, and operational health of our systems.
Responsibilities
- Handle business-as-usual (BAU) SRE operations, including incident management, root cause analysis, problem resolution, and service restoration.
- Manage and maintain CI/CD pipelines and deployment automation across environments.
- Improve system reliability, scalability, and performance through automation and proactive monitoring.
- Implement and manage observability solutions including logging, metrics, alerting, and dashboards.
- Use AI tools to speed up troubleshooting, documentation, code generation, and incident analysis.
Required Skills
- 5+ years of hands-on experience in SRE or DevOps operations.
- Expertise in CI/CD tools including GitHub Actions, GitLab CI, Jenkins, and Azure DevOps.
- Experience with monitoring and observability tools like Grafana and Prometheus.
- Proficiency in scripting languages such as Python or Bash.
- Familiarity with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
- Understanding of cloud platforms such as AWS, Azure, or GCP.
- Practical experience using AI tools (e.g., Cursor or other generative AI assistants) in engineering workflows.
- Experience with observability stacks like ELK.
Preferred Skills
- Experience with ARM (Azure Resource Manager) templates or Ansible.