Description

Core Responsibilities

  • Support, monitor, and maintain microservices-based data platforms in production environments.
  • Proactively monitor, troubleshoot, and resolve incidents across distributed systems and streaming platforms.
  • Ensure high availability, reliability, and performance of Kafka-based ingestion and processing pipelines.
  • Own operational readiness across non-production and production environments, including:
    • Start/stop procedures
    • Dependency validation
    • Release support
  • Drive automation initiatives across:
    • Deployments
    • Vulnerability remediation
    • Monitoring and alerting
    • Operational recovery workflows

Required Technical Skills

  • Strong programming expertise in Java and/or Python.
  • Hands-on experience with Apache Kafka, including:
    • Topics, partitions, brokers
    • Consumer groups
    • Kafka Connect
    • Lag monitoring and alert handling
  • Proven experience in microservices architecture (build, support, and troubleshooting).
  • Solid understanding of SRE principles, including:
    • SLAs, SLOs
    • Incident response
    • Threshold-based alerting
    • Observability and system resilience
  • Experience with CI/CD pipelines and tools such as:
    • Jenkins, GitLab CI/CD, Nexus
    • Source control platforms (Git-based)
  • Strong troubleshooting capability across logs, metrics, traces, infrastructure dependencies, and application failures.

Kafka, Platform Operations & Environment Readiness

  • Experience with data ingestion mechanisms including Kafka, SFTP, and APIs.
  • Knowledge of data formats such as JSON and Parquet.
  • Ability to manage and operate:
    • Kafka topic inventory
    • Metadata/entity mapping
    • Topic classifications (K0/K1/K2)
    • Consumer group operations
  • Experience with:
    • Kafka backlog monitoring
    • Alert triage and incident response
  • Exposure to Cloudera-managed Kafka environments (preferred).
  • Ability to validate end-to-end operational readiness of non-production environments for releases.

Platform, Infrastructure & Observability

  • Experience with cloud platforms (AWS or Azure).
  • Hands-on knowledge of:
    • Docker
    • Kubernetes (AKS preferred)
  • Familiarity with observability and monitoring tools:
    • Grafana, Prometheus, Pushgateway
    • Splunk
  • Ability to validate:
    • Dashboard ownership
    • Log routing
    • Reconciliation metrics
    • Monitoring coverage
  • Understanding of platform dependencies across:
    • Jenkins
    • Nexus
    • GOGS
    • Supporting infrastructure services

Operational Expertise

  • Experience handling production incidents (P1/P2) and escalations.
  • Strong exposure to:
    • Root Cause Analysis (RCA) preparation
    • Support runbook execution
  • Familiarity with ServiceNow (or similar ticketing systems) and queue management workflows.

Good understanding of:

  • Application runbooks (start/stop order, escalation paths, thresholds)
  • Access governance (AD/IDM, privileged access management)
  • Change management processes and deployment ownership
  • Exit procedures and access revocation controls

Education

Bachelor's degree