Lead end-to-end quality engineering for enterprise AI applications, including LLM-powered products, RAG pipelines, and agentic workflows.
Design and execute prompt validation strategies, evaluating LLM responses for accuracy, semantic relevance, hallucination risk, and safety compliance.
Build automated evaluation pipelines for AI model outputs using metrics such as BLEU, ROUGE, embedding-based similarity, precision, recall, and F1-score.
Validate agentic systems for correctness, determinism, and failure mode handling.
Architect and maintain Python-based automation frameworks for AI/ML model evaluation, regression testing, and continuous model quality monitoring.
What's Needed?
10+ years of professional experience in Quality Engineering and Test Automation, validating complex enterprise applications.
Proficiency in validating AI/ML systems, including Generative AI and LLM-based applications.
Strong proficiency in Python and experience building automation frameworks from the ground up.
Practical experience with prompt validation, agentic workflow testing, and AI model evaluation.
Working knowledge of evaluation metrics: BLEU, ROUGE, embedding similarity, precision, recall, F1-score, and human-evaluation methodologies.