Description:
While most coverage of DeepSeek V4 focuses on benchmark scores, Fahd Mirza goes a level deeper to explain the two open-sourced infrastructure technologies that made the model’s performance possible: DeepEP2 and TileLang.
DeepSeek V4 Pro ships with 1.6 trillion parameters and a 1 million token context window, yet requires only 10% of the GPU memory that V3.2 needed at the same context length. Mirza explains how two new attention mechanisms achieve this: Compressed Sparse Attention (CSA), which groups four tokens into a single compressed entry and attends only to the most relevant subset, and Heavily Compressed Attention (HCA), which compresses 128 tokens into one entry for distant context where fine detail matters less. Together, these reduce memory overhead at scale by roughly 90%.
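The grouping-and-selection idea is easier to see in a short sketch. Note that this is an illustration based only on the description above, not DeepSeek's implementation: the mean-pooling of key blocks, the `block=4` and `top_blocks` parameters, and the function name are all assumptions.

```python
import numpy as np

def compressed_sparse_attention(q, k, v, block=4, top_blocks=8):
    """Sketch of block-compressed sparse attention for one query vector.

    Keys are pooled in groups of `block` tokens into compressed entries,
    the query scores those entries, and full attention is computed only
    over the tokens inside the highest-scoring blocks.
    """
    d = q.shape[-1]
    n = (k.shape[0] // block) * block          # drop a ragged tail for simplicity
    k_blocks = k[:n].reshape(-1, block, d)     # (num_blocks, block, d)
    v_blocks = v[:n].reshape(-1, block, d)

    # One compressed entry per block (mean pooling is an assumption here).
    k_compressed = k_blocks.mean(axis=1)

    # Score the compressed entries and keep only the most relevant blocks.
    block_scores = k_compressed @ q / np.sqrt(d)
    keep = np.argsort(block_scores)[-top_blocks:]

    # Full-resolution attention restricted to the selected blocks.
    k_sel = k_blocks[keep].reshape(-1, d)
    v_sel = v_blocks[keep].reshape(-1, d)
    scores = k_sel @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_sel

# Toy usage: one query attending over a 4096-token cache.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal((4096, 64))
v = rng.standard_normal((4096, 64))
print(compressed_sparse_attention(q, k, v).shape)   # (64,)
```

The HCA variant described above would follow the same pattern with a much coarser block size (128 tokens per compressed entry) applied to distant context, which is where the bulk of the memory savings at a 1 million token window would come from.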
DeepEP2 addresses a separate bottleneck: routing tokens efficiently across hundreds of GPUs in a mixture-of-experts architecture. By breaking communication into overlapping waves (sending wave 2 while wave 1 is still computing), DeepEP2 hides network latency inside compute time and achieves nearly double the throughput of its predecessor. Finally, TileLang, a new GPU programming language built by DeepSeek, dramatically reduces the expertise and time required to write custom CUDA kernels, enabling rapid iteration on novel attention designs. All of these technologies have been open-sourced and are available for any team building large-scale inference infrastructure.
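The wave-overlap idea can be illustrated with a toy host-side analogy. The sketch below is not DeepEP2's API; the `dispatch` and `expert_compute` stand-ins, the sleep-based timings, and the thread-pool "communication channel" are all hypothetical, meant only to show how issuing wave i+1's transfer while wave i is being computed hides the transfer latency.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def dispatch(wave):
    """Stand-in for the all-to-all that routes one wave of tokens to experts."""
    time.sleep(0.05)              # pretend network latency
    return f"tokens-{wave}"

def expert_compute(payload):
    """Stand-in for running the experts on one wave of tokens."""
    time.sleep(0.05)              # pretend GPU compute
    return f"out({payload})"

def pipelined_moe(num_waves=8):
    # While wave i is being computed, the dispatch for wave i+1 is already
    # in flight, so communication time is hidden behind compute time.
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        in_flight = comm.submit(dispatch, 0)
        for wave in range(num_waves):
            payload = in_flight.result()                     # wave i has arrived
            if wave + 1 < num_waves:
                in_flight = comm.submit(dispatch, wave + 1)  # start wave i+1 early
            results.append(expert_compute(payload))          # overlaps with the send
    return results

if __name__ == "__main__":
    start = time.time()
    pipelined_moe()
    print(f"pipelined: {time.time() - start:.2f}s")  # ~0.45s vs ~0.80s if serialized
```

With eight waves of equal transfer and compute cost, the pipelined loop finishes in roughly the total compute time plus a single transfer, rather than the sum of all transfers and all compute, which is the same overlap argument behind DeepEP2's throughput gain.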
📺 Source: Fahd Mirza · Published April 28, 2026
🏷️ Format: Deep Dive







