We are looking for a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. In this role, you will be responsible for building and maintaining reliable, scalable, and efficient systems that ensure the high availability and performance of our applications. You will work closely with development and operations teams to implement SRE practices, including dashboard building, monitoring, and performance optimization.
Key Responsibilities:
- Design, build, and maintain SRE dashboards to provide real-time visibility into the health and performance of our applications.
- Define and maintain SLAs, SLOs, and SLIs to ensure service reliability and alignment with business requirements.
- Leverage DevOps principles to improve CI/CD pipelines, enabling faster and more reliable deployment cycles.
- Support and optimize microservices development to ensure scalability, reliability, and performance across distributed systems.
- Build and manage AWS infrastructure for efficient resource provisioning, scaling, and monitoring.
- Collaborate with cross-functional teams to identify and resolve production issues in a timely manner.
- Automate monitoring, alerting, and remediation processes to reduce manual intervention and increase uptime.
- Participate in on-call rotations to ensure prompt resolution of incidents and service disruptions.
- Conduct post-mortems on incidents, identify root causes, and implement preventive measures to avoid recurrence.
- Foster a culture of continuous improvement, reliability, and resilience in the software development lifecycle.
Required Skills & Qualifications:
- Proven experience in SRE practices, including dashboard building, monitoring, and alerting.
- In-depth understanding of SLA/SLO/SLI concepts and how they apply to service reliability.
- Strong experience with DevOps, including CI/CD pipelines, version control systems, and automated testing.
- Solid background in microservices development, containerization (Docker, Kubernetes), and distributed systems.
- Proficiency in cloud infrastructure management, particularly AWS services (EC2, S3, Lambda, CloudWatch, etc.).
- Expertise in scripting and automation tools (e.g., Python, Bash, Terraform).
- Strong troubleshooting and incident response skills, with a focus on improving system reliability.
- Experience with monitoring tools such as Prometheus, Grafana, and Datadog.
- Strong collaboration and communication skills to work across teams and support business goals.