Agent Inference at the “Speed of Light” — How NVIDIA moves like a $4.3 Trillion Startup


Description:

This Latent Space podcast episode features Netter and Kyle from NVIDIA—engineering leaders on the Dynamo inference system—in a technical conversation recorded ahead of NVIDIA GTC. The discussion covers how NVIDIA approaches agent inference infrastructure, the tradeoffs involved in serving large language models at production scale, and the security considerations that arise as agents gain simultaneous access to files, the internet, and code execution capabilities.

The guests walk through a three-axis framework NVIDIA uses when evaluating model serving decisions: output quality, cost, and latency. They explain how these axes interact with model selection, tensor parallel size, and test-time scaling—illustrating how running a smaller model with extended inference can match the output quality of a larger model while hitting a lower cost target. The conversation also addresses differences between models trained for inference density (such as open-weight models optimized for throughput) versus those trained for raw capability, and how deployment context—chat, coding copilot, multi-turn agent—shapes the right architecture.
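The tradeoff described above can be sketched as a simple selection rule: among candidate serving configurations, pick the cheapest one that still meets quality and latency targets. The class names, fields, and numbers below are illustrative assumptions, not figures from the episode.

```python
from dataclasses import dataclass

@dataclass
class ServingConfig:
    name: str
    quality: float      # e.g. benchmark score, higher is better (assumed metric)
    cost: float         # $ per million tokens, lower is better (assumed units)
    latency_ms: float   # e.g. time to last token, lower is better

def pick_config(configs, min_quality, max_latency_ms):
    """Return the cheapest config meeting the quality and latency targets, or None."""
    feasible = [c for c in configs
                if c.quality >= min_quality and c.latency_ms <= max_latency_ms]
    return min(feasible, key=lambda c: c.cost) if feasible else None

# Illustrative numbers only: a smaller model given extra test-time compute
# can match a larger model's output quality at a lower cost point.
configs = [
    ServingConfig("large-model", quality=0.90, cost=8.0, latency_ms=400),
    ServingConfig("small-model+test-time-scaling", quality=0.90, cost=3.0, latency_ms=900),
    ServingConfig("small-model", quality=0.78, cost=1.5, latency_ms=250),
]

best = pick_config(configs, min_quality=0.85, max_latency_ms=1000)
# → "small-model+test-time-scaling": it meets both targets at the lowest cost.
```

In practice each axis is itself multidimensional (tensor parallel size affects both cost and latency, for instance), but the same feasibility-then-optimize shape applies.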

A significant portion of the discussion focuses on agent security. The guests argue that agents should only ever be granted two of three capability classes at once (file access, internet access, and code execution), since granting all three simultaneously opens vectors for malware injection and uncontrolled lateral movement. Kyle also recounts the origin story of NVIDIA’s acquisition of Brev, the GPU-access developer tool he co-founded, which now forms part of NVIDIA’s broader developer toolchain under the Launchables initiative.
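The two-of-three rule lends itself to a trivial policy check: reject any agent configuration that requests all three capability classes. This is a minimal sketch of that idea; the capability names and function are assumptions for illustration, not an API discussed in the episode.

```python
# The three capability classes named in the discussion (labels assumed).
CAPABILITIES = {"file_access", "internet_access", "code_execution"}

def is_allowed(granted):
    """Allow an agent's capability grant only if it covers at most
    two of the three sensitive capability classes."""
    granted = set(granted)
    unknown = granted - CAPABILITIES
    if unknown:
        raise ValueError(f"unknown capabilities: {unknown}")
    return len(granted) <= 2

is_allowed({"file_access", "code_execution"})                      # permitted
is_allowed({"file_access", "internet_access", "code_execution"})   # rejected
```

A real policy engine would scope this per-task rather than per-agent, but the invariant is the same: no single context should combine all three vectors.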


📺 Source: Latent Space · Published March 08, 2026
🏷️ Format: Podcast
