Run DeepSeek v4 Flash Locally and Get Blown Away

Description:

Fahd Mirza walks through the complete process of running DeepSeek V4 Flash locally on a dual-H100 GPU server, from hardware provisioning through live inference, with no API and no managed cloud service. The setup uses two NVIDIA H100 cards with 96GB of VRAM each (192GB total), which Mirza identifies as the minimum viable configuration for the 284-billion-parameter model. FP4/FP8 mixed-precision quantization brings the actual memory footprint down to roughly 142–150GB, so the hardware is just sufficient as long as the context window is capped at 32K to stay within VRAM limits.
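
A rough way to see why 192GB counts as "just sufficient" is to price out the weights directly. The sketch below is a back-of-envelope estimate only; the FP4/FP8 split is an assumption chosen to land in the 142–150GB range cited above, not a figure stated in the video:

```python
# Back-of-envelope VRAM estimate for a 284B-parameter model under mixed
# FP4/FP8 quantization. The 95/5 weight split is an illustrative assumption.
TOTAL_PARAMS = 284e9        # 284 billion parameters
FP4_SHARE = 0.95            # assumed fraction of weights stored at 4 bits
FP8_SHARE = 1 - FP4_SHARE   # remainder stored at 8 bits

weights_gb = (TOTAL_PARAMS * FP4_SHARE * 0.5    # 0.5 bytes per FP4 weight
              + TOTAL_PARAMS * FP8_SHARE * 1.0  # 1.0 bytes per FP8 weight
              ) / 1e9

headroom_gb = 192 - weights_gb  # left for KV cache, activations, CUDA overhead
print(f"weights ~ {weights_gb:.0f} GB, headroom ~ {headroom_gb:.0f} GB on 2 x 96GB H100s")
```

That leaves on the order of 40GB for the KV cache and runtime overhead, which is why the demo holds the context window to 32K rather than the model's full extended length.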

The video documents every step of the installation sequence with vLLM, including a key troubleshooting moment: an outdated Transformers library causes the initial serve command to fail, so Mirza falls back to a fresh clone of DeepSeek's own inference repository. He then converts the 160GB of Hugging Face weights to DeepSeek's native inference format with the repository's generator.py script, distributes the model's 256 experts across both GPUs via tensor parallelism, and launches the model with PyTorch's torchrun for interactive inference.
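
Mirza ultimately serves the model through DeepSeek's own repository with torchrun rather than a Python API, but for readers who get the vLLM route working after updating Transformers, a minimal sketch of an equivalent two-GPU, 32K-context launch might look like the following. The model ID is a placeholder, not a confirmed Hugging Face path, and the settings would need adjusting for the actual checkpoint:

```python
# Hypothetical vLLM launch: tensor parallelism across two H100s with a 32K window.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # placeholder model ID (assumption)
    tensor_parallel_size=2,                 # shard the 256 experts across both GPUs
    max_model_len=32768,                    # cap context at 32K to stay within VRAM
    gpu_memory_utilization=0.95,            # leave a small margin for runtime overhead
    trust_remote_code=True,                 # allow DeepSeek's custom model code
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Summarize tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```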

Beyond installation mechanics, the video briefly covers DeepSeek V4 Flash’s architectural innovations — including new attention mechanisms (CSA and HCA) that enable a dramatically extended context window — and its benchmark performance relative to DeepSeek V3.2. The result is a reproducible, hardware-specific guide that gives engineers with equivalent GPU resources everything they need to self-host one of the most capable open-weight models currently available.


📺 Source: Fahd Mirza · Published April 24, 2026
🏷️ Format: Tutorial Demo
