You will contribute features to vLLM that leverage the latest NVIDIA GPU hardware capabilities for new models.
Responsibilities
Profile and optimize the inference framework (vLLM) using speculative decoding, data/tensor/expert/pipeline-parallelism, and prefill-decode disaggregation.
Develop, optimize, and benchmark GPU kernels (hand-tuned and compiler-generated) using fusion, autotuning, and memory/layout optimization.
Build and extend high-level DSLs and compiler infrastructure to raise kernel-development productivity while approaching peak hardware utilization.
Define inference benchmarking methodologies and tools, contributing to MLPerf submissions.
Architect the scheduling and orchestration of containerized large-scale inference deployments on GPU clusters across clouds.
Required Skills
7+ years of professional experience, or equivalent experience with a Master’s degree.
Strong programming skills in Python and C/C++.
Solid CS fundamentals: algorithms and data structures, operating systems, computer architecture, parallel programming, distributed systems, and deep learning theory.
Knowledge of performance engineering in ML frameworks (e.g., PyTorch) and inference engines (e.g., vLLM).
Familiarity with GPU programming and performance: CUDA, memory hierarchy, streams, NCCL.
Proficiency with profiling and debugging tools (e.g., Nsight Systems and Nsight Compute).
Experience with containers and orchestration (Docker, Kubernetes, Slurm).
Bachelor’s degree in Computer Science, Computer Engineering, or Software Engineering, or equivalent experience.
Preferred Skills
Experience with Go or Rust.
Hands-on work with ML compilers and DSLs (e.g., Triton, MLIR/LLVM, XLA).