AI Kernel Generation: What’s working, what’s not, what’s next – Natalie Serrino, Gimlet Labs


Description:

Natalie Serrino, co-founder of Gimlet Labs, presents one of the most technically specific talks at the AI Engineer conference: using AI agents to automatically generate and optimize GPU kernels for machine learning workloads across heterogeneous hardware platforms.

Gimlet Labs builds an agentic inference cloud that orchestrates AI workloads across hardware from different vendors and chip sizes. The core problem: most ML kernels are heavily optimized for specific architectures, and the proliferation of frameworks (CUDA, Triton, Metal, and vendor-specific DSLs), combined with a severe shortage of expert kernel engineers, creates a bottleneck the company aims to close with AI. Their system takes a PyTorch workload and a target hardware specification, then runs an autonomous loop of compile, execute, validate, and profile, directly mirroring the human expert workflow. In a live demo, the agent targets an H100 and finds an optimization that achieves a 22% throughput improvement over the torch.compile baseline.
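The compile/execute/validate/profile loop described above can be sketched in miniature. This is a hypothetical illustration, not Gimlet's actual system: the candidate "kernels" here are plain Python functions, and names like `candidate_kernels` and `reference` are invented for the example. The structure (run each candidate, reject anything that fails numerical validation, profile the survivors, keep the fastest) mirrors the loop the talk describes.

```python
import time

def reference(xs):
    # Baseline implementation the candidates must match numerically.
    return sum(x * x for x in xs)

# Hypothetical pool of candidate implementations to evaluate.
candidate_kernels = {
    "genexpr": lambda xs: sum(x * x for x in xs),
    "map":     lambda xs: sum(map(lambda x: x * x, xs)),
}

def optimize(xs, trials=50):
    expected = reference(xs)
    best_name, best_time = None, float("inf")
    for name, kernel in candidate_kernels.items():
        # Execute and validate: discard candidates that produce wrong results.
        if kernel(xs) != expected:
            continue
        # Profile: time repeated executions of the validated candidate.
        start = time.perf_counter()
        for _ in range(trials):
            kernel(xs)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

print(optimize(list(range(1000))))
```

A real agent would regenerate and recompile kernel source between iterations using profiler feedback; this sketch only captures the validate-then-profile selection step.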

The talk covers concrete results across hardware targets and problem complexity levels. A kernel fusion technique, combining convolution, softmax, bias scaling, and sigmoid into a single fused operation, achieved a 40% speedup on an Apple M4. A separate optimization achieved an 80% improvement by recognizing that average pooling could be re-expressed as a convolution, exploiting Metal's faster convolution path. Across moderate-complexity problems, the system averages roughly a 25% speedup. Serrino is candid about the limitations: performance degrades significantly on high-complexity problems, and the talk closes with an honest discussion of where current agents succeed and where the research frontier lies.
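The pooling-to-convolution rewrite rests on a simple identity: average pooling with window size k is a convolution whose kernel is constant with weight 1/k. A minimal pure-Python sketch of that equivalence in 1D (illustrative only; the talk's optimization applies the same idea to GPU kernels on Metal):

```python
def avg_pool1d(xs, k):
    # Non-overlapping k-wide average pooling (stride = k).
    return [sum(xs[i:i + k]) / k for i in range(0, len(xs) - k + 1, k)]

def conv1d(xs, weights, stride):
    # Plain strided 1D convolution (cross-correlation form, no padding).
    k = len(weights)
    return [sum(w * x for w, x in zip(weights, xs[i:i + k]))
            for i in range(0, len(xs) - k + 1, stride)]

xs = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0]
k = 2
uniform = [1.0 / k] * k  # constant kernel of weight 1/k

# Average pooling equals convolution with the uniform kernel.
assert avg_pool1d(xs, k) == conv1d(xs, uniform, stride=k)
print(avg_pool1d(xs, k))  # [2.0, 6.0, 3.0]
```

The payoff on hardware comes from the substitution itself: if the platform's convolution path is faster than its pooling path, the mathematically identical convolution wins.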


📺 Source: AI Engineer · Published December 17, 2025
🏷️ Format: Hands On Build
