Description
You will design and implement AI solutions to improve system reliability, automate incident management, and drive root cause analysis.
Responsibilities
- Design and deploy AI/ML models for anomaly detection, incident prediction, and root cause analysis.
- Build integrations between AI solutions and monitoring tools for metrics, logs, and traces across infrastructure, application, security, and network domains.
- Develop and maintain AI pipelines, including data preprocessing, feature extraction, model training, and evaluation.
- Collaborate with Incident and Problem Management teams to reduce MTTR and incident volume.
- Create technical documentation, including system architecture, data flow diagrams, and testing plans.
Required Skills
- 10+ years of experience in software development or related technical roles.
- Hands-on experience with AIOps and MLOps.
- Experience designing and implementing AI models using GPT, VAEs, and GANs.
- Proficiency in integrating with monitoring, IT service management, and automation tools such as AppDynamics, New Relic, Splunk, Azure Log Analytics, SCOM, ServiceNow, and Rundeck.
- Ability to design scalable system architectures and data flows for AI integration.
- Experience managing observability platforms to gain real-time visibility into system health.
- Strong background in data cleaning and preparation for model training.
Preferred Skills
- Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field.