Description
You will design and implement ML training, evaluation, and deployment pipelines for LLM applications.
Responsibilities
- Operate SUSE Linux Enterprise Server (SLES) GPU clusters built on NVIDIA H100 GPUs, including driver installation and CUDA/NCCL tuning.
- Integrate MLflow tracking, Azure ML model registry, and Model Catalog to unify model versioning and promotion.
- Deploy GPU-based inference endpoints using Managed Online Endpoints, AKS GPU node pools, or Arc-enabled Kubernetes, managing traffic splits and rollbacks.
- Automate CI/CD in Azure DevOps, from data preparation through model deployment, with infrastructure defined as code (Terraform or Bicep).
- Monitor model performance, data drift, and GPU metrics via Azure Monitor, Log Analytics, and NVIDIA DCGM Exporter integration.
Required Skills
- 7+ years in ML/AI engineering or MLOps roles with significant GPU workload experience.
- Hands-on experience with SUSE Linux Enterprise Server (SLES) in production AI environments.
- In-depth knowledge of the NVIDIA H100 architecture (HBM3, NVLink, MIG) and of multi-GPU orchestration.
- Proficiency in Azure ML, Azure AI Foundry, and Prompt Flow for LLM workflows.
- Expertise in deploying to Kubernetes clusters with GPU node pools (AKS, Arc-enabled Kubernetes).
- Experience with Infrastructure as Code, specifically Terraform and Bicep.
- Familiarity with CI/CD workflows using Azure DevOps Pipelines.
- Knowledge of distributed training frameworks (DeepSpeed, Horovod, PyTorch DDP).
- Experience implementing governance and security for ML platforms.
Preferred Skills
- Proven track record deploying ML models in hybrid (cloud + on-prem) environments.
- Experience applying Azure Well-Architected Framework guidance to ML workloads.