Description
You will own the development and deployment of ML models for observability use cases.
Responsibilities
- Develop and deploy machine learning models for anomaly detection, predictive alerting, incident forecasting, and automated root cause analysis.
- Build and optimize ML models using Python libraries to identify behavioral patterns and predict system failures across Azure ecosystems.
- Architect scalable, real-time data pipelines using Apache Kafka to ingest high-volume observability data.
- Use Azure Databricks for large-scale data processing, feature engineering, and MLOps pipeline automation.
- Establish end-to-end MLOps workflows covering model versioning, CI/CD, and performance monitoring, with the goal of reducing MTTD and MTTR.
Required Skills
- 5+ years of experience in MLOps, Data Engineering, or Data Science.
- Strong knowledge of machine learning applied to time-series data and AIOps use cases.
- Hands-on experience with Kafka, Flink, and Azure Databricks.
- Proficiency in Python and ML libraries including TensorFlow, PyTorch, and Scikit-learn.
- Experience with observability platforms such as Datadog, Prometheus, or Elastic.
- Proven experience deploying ML models in production and managing the full model lifecycle.