Description

  • We are looking for a highly skilled and motivated Site Reliability Engineer (SRE) to join our team.
  • In this role, you will be responsible for building and maintaining reliable, scalable, and efficient systems that ensure the high availability and performance of our applications.
  • You will work closely with development and operations teams to implement SRE practices, including dashboard building, monitoring, and performance optimization.

Key Responsibilities:

  • Design, build, and maintain SRE dashboards to provide real-time visibility into the health and performance of our applications.
  • Define and maintain SLAs, SLOs, and SLIs to ensure service reliability and alignment with business requirements.
  • Leverage DevOps principles to improve CI/CD pipelines, enabling faster and more reliable deployment cycles.
  • Support and optimize microservices development to ensure scalability, reliability, and performance across distributed systems.
  • Build and manage AWS infrastructure for efficient resource provisioning, scaling, and monitoring.
  • Collaborate with cross-functional teams to identify and resolve production issues in a timely manner.
  • Automate monitoring, alerting, and remediation processes to reduce manual intervention and increase uptime.
  • Participate in on-call rotations to ensure prompt resolution of incidents and service disruptions.
  • Conduct post-mortems on incidents, identify root causes, and implement preventive measures to avoid recurrence.
  • Foster a culture of continuous improvement, reliability, and resilience in the software development lifecycle.

Required Skills & Qualifications:

  • Proven experience in SRE practices, including dashboard building, monitoring, and alerting.
  • In-depth understanding of SLA/SLO/SLI concepts and how they apply to service reliability.
  • Strong experience with DevOps, including CI/CD pipelines, version control systems, and automated testing.
  • Solid background in microservices development, containerization (Docker, Kubernetes), and distributed systems.
  • Proficiency in cloud infrastructure management, particularly AWS services (EC2, S3, Lambda, CloudWatch, etc.).
  • Expertise in scripting and automation tools (e.g., Python, Bash, Terraform).
  • Strong troubleshooting and incident response skills, with a focus on improving system reliability.
  • Experience with monitoring tools such as Prometheus, Grafana, and Datadog.
  • Strong collaboration and communication skills to work across teams and support business goals.

Education

Any Graduate