Agent Inference at the “Speed of Light” — How NVIDIA moves like a $4.3 Trillion Startup


Description:

This Latent Space podcast episode features Netter and Kyle from NVIDIA—engineering leaders on the Dynamo inference system—in a technical conversation recorded ahead of NVIDIA GTC. The discussion covers how NVIDIA approaches agent inference infrastructure, the tradeoffs involved in serving large language models at production scale, and the security considerations that arise as agents gain simultaneous access to files, the internet, and code execution capabilities.

The guests walk through a three-axis framework NVIDIA uses when evaluating model serving decisions: output quality, cost, and latency. They explain how these axes interact with model selection, tensor parallel size, and test-time scaling—illustrating how running a smaller model with extended inference can match the output quality of a larger model while hitting a lower cost target. The conversation also addresses differences between models trained for inference density (such as open-weight models optimized for throughput) versus those trained for raw capability, and how deployment context—chat, coding copilot, multi-turn agent—shapes the right architecture.
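The tradeoff described above can be sketched as a simple selection rule: among candidate serving configurations, pick the cheapest one that still meets quality and latency targets. The class names, fields, and numbers below are illustrative assumptions, not figures from the episode.

```python
from dataclasses import dataclass

@dataclass
class ServingConfig:
    name: str
    quality: float      # e.g. benchmark score, higher is better (assumed metric)
    cost: float         # $ per million tokens, lower is better (assumed units)
    latency_ms: float   # e.g. time to last token, lower is better

def pick_config(configs, min_quality, max_latency_ms):
    """Return the cheapest config meeting the quality and latency targets, or None."""
    feasible = [c for c in configs
                if c.quality >= min_quality and c.latency_ms <= max_latency_ms]
    return min(feasible, key=lambda c: c.cost) if feasible else None

# Illustrative numbers only: a smaller model given extra test-time compute
# can match a larger model's output quality at a lower cost point.
configs = [
    ServingConfig("large-model", quality=0.90, cost=8.0, latency_ms=400),
    ServingConfig("small-model+test-time-scaling", quality=0.90, cost=3.0, latency_ms=900),
    ServingConfig("small-model", quality=0.78, cost=1.5, latency_ms=250),
]

best = pick_config(configs, min_quality=0.85, max_latency_ms=1000)
# → "small-model+test-time-scaling": it meets both targets at the lowest cost.
```

In practice each axis is itself multidimensional (tensor parallel size affects both cost and latency, for instance), but the same feasibility-then-optimize shape applies.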

A significant portion of the discussion focuses on agent security. The guests argue that agents should only ever be granted two of three capability classes at once (file access, internet access, and code execution), since granting all three simultaneously opens vectors for malware injection and uncontrolled lateral movement. Kyle also recounts the origin story of NVIDIA’s acquisition of Brev, the GPU-access developer tool he co-founded, which now forms part of NVIDIA’s broader developer toolchain under the Launchables initiative.
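The two-of-three rule lends itself to a trivial policy check: reject any agent configuration that requests all three capability classes. This is a minimal sketch of that idea; the capability names and function are assumptions for illustration, not an API discussed in the episode.

```python
# The three capability classes named in the discussion (labels assumed).
CAPABILITIES = {"file_access", "internet_access", "code_execution"}

def is_allowed(granted):
    """Allow an agent's capability grant only if it covers at most
    two of the three sensitive capability classes."""
    granted = set(granted)
    unknown = granted - CAPABILITIES
    if unknown:
        raise ValueError(f"unknown capabilities: {unknown}")
    return len(granted) <= 2

is_allowed({"file_access", "code_execution"})                      # permitted
is_allowed({"file_access", "internet_access", "code_execution"})   # rejected
```

A real policy engine would scope this per-task rather than per-agent, but the invariant is the same: no single context should combine all three vectors.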


📺 Source: Latent Space · Published March 08, 2026
🏷️ Format: Podcast
