You will own the implementation and maintenance of Site Reliability Engineering practices for our web portal.
Responsibilities
- Design and implement comprehensive SRE monitoring for the web portal on GCP.
- Implement logging and tracing standards across all portal components using Cloud Logging and Cloud Trace.
- Configure Apigee monitoring and track API performance for portal services.
- Develop and maintain SRE automation scripts within GKE namespaces for monitoring, deployment, and troubleshooting.
- Create drill-down dashboards correlating metrics, logs, and traces using GCP tools.
Required Skills
- 5+ years of experience in SRE or DevOps.
- Strong proficiency with Kubernetes (GKE), including namespace management and RBAC.
- Experience implementing OpenTelemetry (OTel) and distributed tracing with W3C Trace Context headers.
- Expertise with GCP Observability tools: Cloud Monitoring, Google Cloud Managed Service for Prometheus (GMP), and Cloud Logging.
- Proficiency in querying metrics with PromQL and building dashboards in Grafana.
- Hands-on experience with JVM metrics collection, heap analysis, and garbage collection optimization for Java applications.
- Experience building CI/CD pipelines and managing infrastructure using Docker, YAML, and Helm.
- Familiarity with UI instrumentation for frontend monitoring and traceability.
- Proficiency in Python and Linux shell scripting.