You will build agentic AI solutions that mitigate risk and reduce cost in large-scale production environments.
Responsibilities
- Design and implement tool-calling agents that execute actions via the Model Context Protocol (MCP), integrating retrieval and structured reasoning.
- Productionize LLMs for operations use cases by building evaluation frameworks, retrieval pipelines, and self-correction loops.
- Integrate agents with observability, incident management, and deployment systems for automated diagnostics and remediation.
- Translate production pain points into agentic AI roadmaps by partnering with application teams and defining objective functions.
- Instrument continuous evaluations and enforce guardrails, circuit breakers, and rollback strategies for safety and correctness.
Required Skills
- 5+ years of software development experience in Python, C/C++, Go, or Java, with a strong preference for experience building large-scale Python applications.
- 3+ years designing, architecting, and launching production ML systems, including model serving and evaluation.
- Practical experience with LLMs: API integration, prompt engineering, and building agents using RAG and function calling.
- Understanding of the LLM landscape, including commercial and open-weight model families (e.g., OpenAI models, Gemini, Llama).
- Solid grasp of applied statistics, core ML concepts, and data structures.
- Experience with DynamoDB and Redshift.
- Strong analytical problem-solving, ownership, and ability to communicate technical concepts clearly.
- Familiarity with SageMaker.