Description:
John Wang, a co-author of the DeepSeek V2 paper and a contributor to the open-sourced DeepSeek R1 training methodology who has accumulated over 1,300 academic citations as a first-year PhD student, joins the David Ondrej podcast to discuss his time inside DeepSeek, the architectural innovations that made the lab competitive at a fraction of Western compute budgets, and the frontier of memory systems for AI agents.
Wang explains two of DeepSeek’s defining technical contributions from his time there. The first is Multi-Head Latent Attention (MLA), which compresses the KV cache to dramatically reduce memory requirements during inference. The second is his own research into expert specialization within Mixture-of-Experts (MoE) models: a method for training individual experts to specialize in specific downstream domains without overfitting other experts or sacrificing general capabilities. He argues that DeepSeek’s real competitive edge was less about raw compute and more about infrastructure speed: researchers could implement an idea in the morning and have experimental results by the afternoon, a cadence he says was genuinely faster than that of larger Western labs.
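The description only states MLA’s high-level idea, so the sketch below is a toy illustration of latent KV compression rather than DeepSeek’s implementation: each token’s keys and values are derived from a single small latent vector, and only that latent is cached during decoding. All class names and dimensions are assumptions, and details such as MLA’s decoupled rotary-embedding path are omitted.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Toy MLA-style attention: cache a small per-token latent instead of full K/V.

    Illustrative only; dimensions and structure are assumptions, not DeepSeek's code.
    """
    def __init__(self, d_model=1024, n_heads=8, head_dim=128, latent_dim=64):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.q_proj = nn.Linear(d_model, n_heads * head_dim)
        self.kv_down = nn.Linear(d_model, latent_dim)          # compress hidden state -> latent
        self.k_up = nn.Linear(latent_dim, n_heads * head_dim)  # expand latent -> keys
        self.v_up = nn.Linear(latent_dim, n_heads * head_dim)  # expand latent -> values
        self.out_proj = nn.Linear(n_heads * head_dim, d_model)

    def forward(self, x, latent_cache=None):
        # x: (batch, new_tokens, d_model); causal masking omitted for brevity
        b, t, _ = x.shape
        latent = self.kv_down(x)                               # (b, t, latent_dim)
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)  # grow the (small) cache
        s = latent.shape[1]

        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(latent).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)

        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), latent                      # cache only the latent

# In this toy configuration the cache stores 64 floats per token instead of
# 2 * 8 * 128 = 2048 for separate full-rank keys and values.
layer = LatentKVAttention()
_, cache = layer(torch.randn(1, 4, 1024))   # cache shape: (1, 4, 64)
```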
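The episode summary does not describe how the expert-specialization method actually works, so the following is not Wang’s technique. It is a generic, hypothetical sketch of one common way to specialize a subset of MoE experts while leaving the rest of the model untouched: freeze the router and all non-target experts before domain fine-tuning. Every name and dimension is made up for illustration.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal top-1 routed MoE layer (purely illustrative)."""
    def __init__(self, d_model=64, n_experts=8, d_ff=128):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        expert_ids = self.router(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_ids == i
            if mask.any():
                out[mask] = expert(x[mask])      # each token goes to its top expert
        return out, expert_ids

def freeze_all_but(moe: TinyMoE, target_experts: set) -> None:
    """Freeze the router and every expert outside `target_experts`, so that
    domain fine-tuning only updates the experts meant to specialize."""
    for p in moe.router.parameters():
        p.requires_grad = False
    for i, expert in enumerate(moe.experts):
        for p in expert.parameters():
            p.requires_grad = i in target_experts

# Hypothetical usage: if math-domain tokens route mostly to experts 2 and 5,
# freeze everything else before fine-tuning on math data.
moe = TinyMoE()
freeze_all_but(moe, {2, 5})
```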
The conversation covers how the top labs differentiate themselves across compute, data quality, algorithmic innovation, and iteration speed, and why Wang believes China’s engineering culture and infrastructure-building capacity are more formidable than commonly acknowledged in Western discourse. He also identifies a critical open problem in AI evaluation: the lack of realistic memory benchmarks. Current benchmarks test needle-in-a-haystack retrieval but fail to simulate how real users interact with agents over time, through vague, evolving, and contextually messy queries, a gap Wang sees as one of the more important unsolved problems in agent development.
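To make the benchmark gap concrete, here is a hypothetical contrast, not taken from any published benchmark, between a needle-in-a-haystack probe and the kind of multi-session, vaguely worded interaction Wang argues current memory evaluations fail to capture. All field names and content are invented for illustration.

```python
# Hypothetical test cases (illustrative field names, not an existing benchmark).

needle_in_a_haystack_case = {
    # One long context, one precise question, one exact answer: easy to score,
    # but nothing like how users actually talk to an agent over weeks.
    "context": "...200k tokens of filler... The passcode is 7141. ...more filler...",
    "question": "What is the passcode?",
    "answer": "7141",
}

realistic_memory_case = {
    "sessions": [
        {"day": 1, "user": "I'm planning a trip to Kyoto in April with my parents."},
        {"day": 9, "user": "Actually my dad can't make it anymore, just my mom and me."},
        {"day": 30, "user": "Can you put together an itinerary for that trip we discussed?"},
    ],
    # The final query is vague ("that trip"), the relevant facts changed mid-way,
    # and the agent must decide what still applies weeks later.
    "expected_behavior": "An April Kyoto itinerary for two people: the user and their mother.",
}
```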
📺 Source: David Ondrej · Published March 04, 2026
🏷️ Format: Interview
