You will own the end-to-end architecture of production AI systems with a strong hands-on bias.
Responsibilities
Define target architectures for agentic systems, GenAI/RAG pipelines, and evaluation loops; produce design documents with flow diagrams and AWS topologies.
Size and optimize infrastructure for cost and performance, covering model throughput, latency, concurrency, and vector index sizing.
Lead deep-dive debugging and incident resolution by profiling bottlenecks, fixing defects, and stabilizing services.
Establish reference implementations for multi-agent frameworks, tool schemas, validation, and multi-step planning.
Productionize systems on AWS using EKS, S3, and Bedrock; integrate identity, secrets management, and observability; define SLIs and enforce SLOs.
Required Skills
7–10 years in software/AI engineering, including 4+ years building GenAI applications and 2+ years architecting production agentic systems.
Strong hands-on expertise in Python 3.11+ (asyncio, profiling, pytest), including performance optimization.
Experience with agent frameworks such as Semantic Kernel, LangGraph, or AutoGen, including function/tool calling with schema validation.
Proven experience designing GenAI, RAG, and hybrid retrieval systems using AWS Bedrock, OpenSearch, or vector databases.
Deep knowledge of AWS architecture: EKS, Bedrock, S3, RDS, Secrets Manager, IAM, and OpenSearch Serverless.
Observability expertise using OpenTelemetry for distributed tracing, metrics, and logs, aligned with service-level objectives.
Cost and performance engineering mindset, including capacity modeling, autoscaling (HPA), and FinOps discipline.
Security fundamentals covering least privilege, prompt-injection defense, and data governance.