Description

You will own the implementation and maintenance of Site Reliability Engineering practices for our web portal.

Responsibilities

  • Design and implement comprehensive SRE monitoring for the web portal on GCP.
  • Implement logging and tracing standards across all portal components using Cloud Logging and Cloud Trace.
  • Configure Apigee monitoring and track API performance for portal services.
  • Develop and maintain SRE automation scripts within GKE namespaces for monitoring, deployment, and troubleshooting.
  • Create drill-down dashboards correlating metrics, logs, and traces using GCP tools.

Required Skills

  • 5+ years of experience in SRE or DevOps.
  • Strong proficiency with Kubernetes (GKE), including namespace management and RBAC.
  • Experience implementing OpenTelemetry (OTel) and distributed tracing with W3C Trace Context headers.
  • Expertise with GCP Observability tools: Cloud Monitoring, Managed Service for Prometheus (GMP), and Cloud Logging.
  • Proficiency in querying metrics with PromQL and building dashboards in Grafana.
  • Hands-on experience with JVM metrics collection, heap analysis, and garbage collection optimization for Java applications.
  • Experience building CI/CD pipelines and managing infrastructure using Docker, YAML, and Helm.
  • Familiarity with UI instrumentation for frontend monitoring and traceability.
  • Proficiency in Python and Linux scripting.

Education

Any Graduate