Core Responsibilities
- Support, monitor, and maintain microservices-based data platforms in production environments.
- Proactively monitor, troubleshoot, and resolve incidents across distributed systems and streaming platforms.
- Ensure high availability, reliability, and performance of Kafka-based ingestion and processing pipelines.
- Own operational readiness across non-production and production environments, including:
  - Start/stop procedures
  - Dependency validation
  - Release support
- Drive automation initiatives across:
  - Deployments
  - Vulnerability remediation
  - Monitoring and alerting
  - Operational recovery workflows
Required Technical Skills
- Strong programming expertise in Java and/or Python.
- Hands-on experience with Apache Kafka, including:
  - Topics, partitions, brokers
  - Consumer groups
  - Kafka Connect
  - Lag monitoring and alert handling
- Proven experience in microservices architecture (build, support, and troubleshooting).
- Solid understanding of SRE principles, including:
  - SLAs, SLOs
  - Incident response
  - Threshold-based alerting
  - Observability and system resilience
- Experience with CI/CD pipelines and tools such as:
  - Jenkins, GitLab CI/CD, Nexus
  - Source control platforms (Git-based)
- Strong troubleshooting capability across logs, metrics, traces, infrastructure dependencies, and application failures.
Kafka, Platform Operations & Environment Readiness
- Experience with data ingestion mechanisms including Kafka, SFTP, and APIs.
- Knowledge of data formats such as JSON and Parquet.
- Ability to manage and operate:
  - Kafka topic inventory
  - Metadata/entity mapping
  - Topic classifications (K0/K1/K2)
  - Consumer group operations
- Experience with:
  - Kafka backlog monitoring
  - Alert triage and incident response
- Exposure to Cloudera-managed Kafka environments (preferred).
- Ability to validate end-to-end operational readiness of non-production environments for releases.
Platform, Infrastructure & Observability
- Experience with cloud platforms (AWS or Azure).
- Hands-on knowledge of:
  - Docker
  - Kubernetes (AKS preferred)
- Familiarity with observability and monitoring tools:
  - Grafana, Prometheus, Pushgateway
  - Splunk
- Ability to validate:
  - Dashboard ownership
  - Log routing
  - Reconciliation metrics
  - Monitoring coverage
- Understanding of platform dependencies across:
  - Jenkins
  - Nexus
  - GOGS
  - Supporting infrastructure services
Operational Expertise
- Experience handling production incidents (P1/P2) and escalations.
- Strong exposure to:
  - Root Cause Analysis (RCA) preparation
  - Support runbook execution
- Familiarity with ServiceNow (or similar ticketing systems) and queue management workflows.
- Good understanding of:
  - Application runbooks (start/stop order, escalation paths, thresholds)
  - Access governance (AD/IDM, privileged access management)
  - Change management processes and deployment ownership
  - Exit procedures and access revocation controls