Build and maintain automated tests for AI agent workflows, APIs, tools, telemetry-backed analysis, remediation flows, and ticketing behavior.
Design evaluation suites for LLM and agentic behavior, including expected-answer checks, rubric-based grading, regression datasets, tool-call validation, and safety/approval checks.
Use or help implement evaluation frameworks such as Pydantic Evals / Pydantic AI, Strands Evals, LangSmith, DeepEval, Ragas, promptfoo, or similar tools.
Validate multi-turn support scenarios, clarification flows, knowledge retrieval, script/remediation recommendations, escalation paths, and failure handling.
Test on-device agent behavior where needed, including Windows service/tray behavior, telemetry collection, anomaly detection, local remediation handoff, logs, and resource impact.
Debug quality issues directly by reading logs, tracing requests, reproducing failures, and making small code/test changes without heavy engineering hand-holding.
Partner with engineering and product to define release gates, quality metrics, evaluation rubrics, and confidence thresholds for pilot readiness.
Contribute to CI quality checks, test fixtures, mocked integrations, regression suites, and test data management.
Identify risks in AI behavior, including hallucinated diagnosis, unsafe remediation suggestions, missing consent, weak ticket summaries, brittle tool use, and poor escalation behavior.
About the Role
Responsibilities
4+ years of experience in QA engineering, SDET, test automation, software engineering, or similar hands-on quality roles.
Strong Python experience, including writing production-quality tests and debugging application code.