Tampa, FL, USA
Job Responsibilities:
Design, scale, and maintain highly available infrastructure for LLM training, fine-tuning, and inference workloads.
Architect and implement agentic AI systems for automated alert triage, root cause analysis, and self-healing.
Optimize GPU utilization, cluster health, and orchestration across Kubernetes-based environments.
Define and monitor AI-specific SLOs/SLIs such as latency, throughput, and cost efficiency.
Ensure reliability of vector databases and RAG pipelines for AI data processing.
Integrate AI-driven incident management with ChatOps and enforce security guardrails for GenAI systems.
Required Qualifications:
Bachelor’s or master’s degree in Computer Science, Engineering, Data Science, or a related field.
8–10 years of experience in Site Reliability Engineering, DevOps, or AI/ML infrastructure roles.
Strong expertise in Kubernetes, cloud platforms (AWS/GCP/Azure), and infrastructure automation (Terraform, CI/CD).
Proven experience with GenAI/LLM systems, including orchestration frameworks (LangChain, AutoGen) and vector databases, and in building reliable, scalable AI platforms.
Any Graduate