AI Reliability Engineer

Akkodis

Tampa, FL, USA

Posted On: 1 day ago

Experience: 8+ years

Availability: Onsite

Openings: 1

Category: AI Reliability Engineer

Tenure: No Preference/Any

Description

Job Responsibilities:

Design, scale, and maintain highly available infrastructure for LLM training, fine-tuning, and inference workloads.

Architect and implement agentic AI systems for automated alert triage, root cause analysis, and self-healing.

Optimize GPU utilization, cluster health, and orchestration across Kubernetes-based environments.

Define and monitor AI-specific SLOs/SLIs such as latency, throughput, and cost efficiency.

Ensure reliability of vector databases and RAG pipelines for AI data processing.

Integrate AI-driven incident management with ChatOps and enforce security guardrails for GenAI systems.

Required Qualifications:

Bachelor’s or Master’s degree in Computer Science, Engineering, Data Science, or a related field.

8–10 years of experience in Site Reliability Engineering, DevOps, or AI/ML infrastructure roles.

Strong expertise in Kubernetes, cloud platforms (AWS/GCP/Azure), and infrastructure automation (Terraform, CI/CD).

Proven experience with GenAI/LLM systems, including orchestration frameworks (LangChain, AutoGen) and vector databases, and in building reliable, scalable AI platforms.

Key Skills

Kubernetes, AWS, GCP, Azure, Terraform, CI/CD, LangChain, AutoGen, RAG Pipelines

Education

Any Graduate
