Culver City, CA, USA
Core Responsibilities

Lakehouse Architecture & Data Engineering
• Schema Design: Design and evolve the Lakehouse schema (Delta Lake / Apache Iceberg) for multi-domain observability data at petabyte scale.
• Pipeline Engineering: Build and maintain robust ingestion pipelines from the OTel Collector through Kafka to the Lakehouse, ensuring exactly-once semantics and strict schema enforcement.
• Data Transformation: Implement dbt transformation models to generate mart-ready, denormalized fact and dimension tables for each of the six domains.
• Data Quality Governance: Define and enforce data quality contracts, establishing SLAs for data freshness, completeness, and cardinality budgets per mart.
• Performance Optimization: Optimize query performance utilizing partitioning strategies, Z-ordering, bloom filters, and materialized views tailored for time-series patterns.

ML Model Development & AIOps
• AIOps Modeling: Design, train, and deploy machine learning models for streaming multivariate anomaly detection, root-cause analysis, and incident forecasting across all six mart domains.
• Streaming Inference: Build low-latency streaming inference pipelines (Flink / Spark Streaming) for real-time anomaly scoring on APM, infrastructure, and security signals.
• Log Intelligence: Develop sophisticated log intelligence models, including clustering (DRAIN3 / LogBERT), NLP classification, and error deduplication, over the Log mart.
• Behavioral Analytics: Implement unsupervised and semi-supervised methods for User Experience frustration detection and KPI correlation analysis.
• Feature Store Management: Own the ML feature store, managing feature engineering, versioning, backfill pipelines, and point-in-time correct joins for training datasets.
• Model Lifecycle MLOps: Instrument model performance tracking, including drift detection, accuracy monitoring, and automated retraining triggers.

AIOps Platform & Productionization
• Workflow Orchestration: Design and operate the end-to-end AIOps workflow, spanning signal ingestion, feature computation, model inference, alert routing, and auto-remediation hooks.
• Model Serving Infrastructure: Build high-performance model serving infrastructure, supporting real-time REST/gRPC endpoints and async batch scoring, with strict p99 latency SLOs.
• Incident Tool Integration: Integrate AIOps insights with incident management platforms (PagerDuty, Opsgenie) and internal runbooks to deliver enriched, noise-reduced alerting.
• Business Impact Quantification: Define and publish metrics from the Business KPI mart to quantify the blast radius, revenue loss, and affected user counts for each incident.

Security & Compliance Observability
• Security Mart Collaboration: Partner with the Security team to build the Security mart schema, including threat feed ingestion, UEBA baselines, and CVE correlation pipelines.
• Threat Detection: Train anomalous-access and lateral-movement detection models, tuning precision/recall thresholds in collaboration with the SOC team.
• Compliance & Governance: Ensure all data handling across the marts adheres strictly to data residency requirements, PII masking standards, and audit-log protocols.

Collaboration & Engineering Standards
• Schema Contracts: Define telemetry schema contracts with the OTel Instrumentation team to guarantee high upstream signal quality for downstream ML models.
• Organizational Standards: Author ML platform RFCs and contribute actively to observability data model standards across the broader engineering organization.
• Mentorship & Reviews: Mentor junior ML and data engineers, and conduct rigorous design reviews for new mart schemas and model architectures.

✅ Kafka + Streaming (Flink/Spark)
✅ Lakehouse (Delta / Iceberg)
✅ ML (Anomaly detection + time-series)
✅ Observability (OTel, APM, Logs)
✅ MLOps (feature store, drift, retraining)
✅ SQL + Python (strong)
Any Graduate