Data Platform SRE

Merican Inc

Richardson, Texas, USA

Posted On: 30+ days ago

Experience: 5+ years

Availability: Hybrid

Openings: 1

Category: Data Platform Engineer

Tenure: Contract - Corp-to-Corp

Description

Core Responsibilities

Support, monitor, and maintain microservices-based data platforms in production environments.
Proactively monitor, troubleshoot, and resolve incidents across distributed systems and streaming platforms.
Ensure high availability, reliability, and performance of Kafka-based ingestion and processing pipelines.
Own operational readiness across non-production and production environments, including:
- Start/stop procedures
- Dependency validation
- Release support
Drive automation initiatives across:
- Deployments
- Vulnerability remediation
- Monitoring and alerting
- Operational recovery workflows

Required Technical Skills

Strong programming expertise in Java and/or Python.
Hands-on experience with Apache Kafka, including:
- Topics, partitions, brokers
- Consumer groups
- Kafka Connect
- Lag monitoring and alert handling
Proven experience in microservices architecture (build, support, and troubleshooting).
Solid understanding of SRE principles, including:
- SLAs, SLOs
- Incident response
- Threshold-based alerting
- Observability and system resilience
Experience with CI/CD pipelines and tools such as:
- Jenkins, GitLab CI/CD, Nexus
- Source control platforms (Git-based)
Strong troubleshooting capability across logs, metrics, traces, infrastructure dependencies, and application failures.

Kafka, Platform Operations & Environment Readiness

Experience with data ingestion mechanisms including Kafka, SFTP, and APIs.
Knowledge of data formats such as JSON and Parquet.
Ability to manage and operate:
- Kafka topic inventory
- Metadata/entity mapping
- Topic classifications (K0/K1/K2)
- Consumer group operations
Experience with:
- Kafka backlog monitoring
- Alert triage and incident response
Exposure to Cloudera-managed Kafka environments (preferred).
Ability to validate end-to-end operational readiness of non-production environments for releases.

Platform, Infrastructure & Observability

Experience with cloud platforms (AWS or Azure).
Hands-on knowledge of:
- Docker
- Kubernetes (AKS preferred)
Familiarity with observability and monitoring tools:
- Grafana, Prometheus, Pushgateway
- Splunk
Ability to validate:
- Dashboard ownership
- Log routing
- Reconciliation metrics
- Monitoring coverage
Understanding of platform dependencies across:
- Jenkins
- Nexus
- Gogs
- Supporting infrastructure services

Operational Expertise

Experience handling production incidents (P1/P2) and escalations.
Strong exposure to:
- Root Cause Analysis (RCA) preparation
- Support runbook execution
Familiarity with ServiceNow (or similar ticketing systems) and queue management workflows.
Good understanding of:
- Application runbooks (start/stop order, escalation paths, thresholds)
- Access governance (AD/IDM, privileged access management)
- Change management processes and deployment ownership
- Exit procedures and access revocation controls

Key Skills

Python, Java, Docker, Kubernetes, AWS, Grafana, Prometheus, Pushgateway, Azure, JSON

Education

Bachelor's degree
