Job Description
NVIDIA is hiring exceptional software engineers to build and optimize the core inference infrastructure for large language models. Join the TensorRT‑LLM team, the group defining how generative AI performs at global scale on NVIDIA GPUs. We’re looking for engineers who love squeezing every drop of throughput, memory efficiency, and scalability out of modern model runtimes. Your work will directly shape the frameworks behind state‑of‑the‑art LLM inference used across NVIDIA and the AI community, and help redefine what “fast” means for the next generation of generative AI at scale.
What you'll be doing:
+ Design, implement, and optimize high‑performance inference pipelines for large language models running on GPUs
+ Profile and tune model execution across the stack, from scheduler design to kernel fusions and everything in between
+ Design and experiment with memory management strategies for improved memory ba...