Strong understanding of modern machine learning techniques and toolsets
Experience debugging and optimizing ML training performance end-to-end
Deep knowledge of GPU architecture, including PTX, SASS, warps, cooperative groups, Tensor Cores, and GPU memory hierarchy
Hands-on debugging and optimization experience with CUDA-GDB, NVIDIA Nsight Systems, and Nsight Compute
Experience with GPU libraries such as Triton, CUTLASS, CUB, Thrust, cuDNN, and cuBLAS
Strong understanding of CUDA Graph launches, Tensor Core arithmetic, warp-level synchronization, asynchronous memory operations, and latency and throughput optimization
Experience with high-performance networking technologies including InfiniBand, RoCE, GPUDirect, PXN, rail optimization, and NVLink for GPU cluster communication
Knowledge of distributed GPU training and collective communication frameworks such as NCCL and MPI