Under 5 minutes to a deployed LLM endpoint — Audry Hsu, RunPod


Descriptions:

At the AI Engineer summit, Audry Hsu, developer advocate at RunPod, delivers a live demo showing how to deploy a production-ready LLM inference endpoint in under five minutes using RunPod’s serverless infrastructure. RunPod is a GPU cloud company with over 500,000 developers on its platform, 30-plus data centers worldwide (including EU locations), and $120 million in annual recurring revenue. It was bootstrapped in 2022 from GPU rigs in a basement after a failed crypto mining venture.

The talk focuses on RunPod’s serverless product, which auto-scales inference workers and charges nothing when idle — making it well-suited for bursty or batch inference without pre-committing to always-on compute. Hsu walks through deploying an open-source LLM from RunPod’s Hub (a curated repository of pre-configured, community-vetted AI repos), configuring vLLM parameters such as context window length and LoRA settings via environment variables, and generating a live API endpoint through the web console. CLI support and agent-compatible skills are also mentioned.
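
Once the console has generated an endpoint, calling it from code might look roughly like the sketch below. This is an illustration, not the code shown in the talk. It assumes the endpoint is reachable through RunPod's serverless `runsync` route with a bearer API key, and that the vLLM worker accepts a `prompt` plus `sampling_params` inside `input`. The helper names `build_request` and `run_sync` are hypothetical, and the URL and payload shape should be checked against the current RunPod documentation.

```python
import json
import urllib.request

# Assumption: base URL of RunPod's serverless API; verify against current docs.
API_BASE = "https://api.runpod.ai/v2"


def build_request(endpoint_id: str, api_key: str, prompt: str,
                  max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) a synchronous inference request.

    The payload shape (input.prompt, input.sampling_params) is an assumption
    about the vLLM worker's input format, not something stated in the talk.
    """
    payload = {
        "input": {
            "prompt": prompt,
            "sampling_params": {"max_tokens": max_tokens},
        }
    }
    return urllib.request.Request(
        url=f"{API_BASE}/{endpoint_id}/runsync",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_sync(req: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

To use it, pass the endpoint ID shown in the console and an API key, for example `run_sync(build_request(endpoint_id, api_key, "Hello"))`. Because serverless workers scale from zero, the first call after an idle period may take noticeably longer than later ones, so a generous timeout is a reasonable default.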

The broader RunPod product lineup covered includes Pods (container-based sandboxes with direct GPU allocation), Clusters (multi-node training with high-speed networking), and the Hub. The talk is aimed at developers who want flexible, on-demand GPU access without managing infrastructure — and serves as a practical introduction to serverless LLM deployment for teams evaluating alternatives to AWS, Google Cloud, or bare-metal GPU procurement during the current global supply crunch.


📺 Source: AI Engineer · Published June 07, 2026
🏷️ Format: Tutorial Demo
