Implement pipeline-level metrics collection: records in/out, processing lag, throughput, failure counts, and data freshness indicators.
Build and maintain monitoring dashboards in CloudWatch and/or Grafana for all production ingestion pipelines.
Configure alerting thresholds tied to agreed SLAs; ensure alerts trigger appropriately for pipeline lag, failures, and data quality breaches.
Capture data lineage and metadata for every ingestion pipeline and publish them to the Client Data/Developer Portal or catalog.
Design and implement data quality rules (completeness, schema conformance, record counts, freshness) in collaboration with Data/Cloud Engineers and data governance teams and data stewards.
Produce automated weekly and monthly SLA reports showing ingestion success rates, data freshness, incident counts, and trend analysis.
Develop cost monitoring views for ingestion compute spend and provide optimization recommendations.
Collaborate with the Client's monitoring/observability team on dashboard and alerting integration.
Support incident triage by providing pipeline health diagnostics and root cause data.
Maintain and evolve data quality (DQ) and observability standards as new sources are onboarded each month.
Required Skills & Qualifications
9-12 years of experience in data quality engineering, data observability, or data operations with a platform focus.
Hands-on experience with CloudWatch (metrics, logs, alarms, dashboards) and Grafana.
Strong experience implementing data quality frameworks: Great Expectations, dbt tests, Deequ, Soda, or equivalent.
Familiarity with data lineage and cataloging tools: AWS Glue Data Catalog, Apache Atlas, DataHub, OpenLineage, or similar.
Proficiency in SQL and Python for metrics collection, reporting automation, and DQ rule implementation.
Experience building SLA dashboards and automated reporting for data pipelines.
Understanding of data observability concepts: freshness, volume, schema change detection, and distribution anomalies.