Description
You will design, develop, and deploy AI/ML models for anomaly detection, incident prediction, and automated root cause analysis in IT systems.
Responsibilities
- Integrate AIOps solutions with monitoring tools like Splunk, Datadog, Prometheus, and AppDynamics.
- Collaborate with SREs, DevOps, and IT teams to identify operational pain points and automation opportunities.
- Build and maintain data pipelines to process logs, metrics, and events from various sources.
- Implement NLP techniques for log analysis and intelligent alert correlation.
- Continuously evaluate model performance and retrain models as required.
Required Skills
- 3+ years of experience in AI/ML engineering, preferably in IT operations or DevOps.
- Strong programming skills in Python.
- Experience with time-series analysis, anomaly detection, and predictive modeling.
- Familiarity with cloud platforms (AWS, Azure, or GCP) and containerized environments (Docker, Kubernetes).
- Knowledge of ITSM processes and the ITIL framework.
- Experience with data science principles.
- Familiarity with Splunk.
- Experience with automation practices.
Preferred Skills
- Experience with AIOps platforms like Dynatrace.
- Experience planning and implementing agentic AI projects.