Run DeepSeek v4 Flash Locally and Get Blown Away

Description:

Fahd Mirza walks through the complete process of running DeepSeek V4 Flash locally on a dual-H100 GPU server, from hardware provisioning through live inference, with no API and no managed cloud service. The setup uses two NVIDIA H100 cards with 96GB of VRAM each (192GB total), which Mirza identifies as the minimum viable configuration for the 284-billion-parameter model. FP4/FP8 mixed-precision quantization brings the actual memory footprint down to roughly 142–150GB, so the hardware is just sufficient as long as the context window is capped at 32K to stay within VRAM limits.
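
A rough way to see why 192GB counts as "just sufficient" is to price out the weights directly. The sketch below is a back-of-envelope estimate only; the FP4/FP8 split is an assumption chosen to land in the 142–150GB range cited above, not a figure stated in the video:

```python
# Back-of-envelope VRAM estimate for a 284B-parameter model under mixed
# FP4/FP8 quantization. The 95/5 weight split is an illustrative assumption.
TOTAL_PARAMS = 284e9        # 284 billion parameters
FP4_SHARE = 0.95            # assumed fraction of weights stored at 4 bits
FP8_SHARE = 1 - FP4_SHARE   # remainder stored at 8 bits

weights_gb = (TOTAL_PARAMS * FP4_SHARE * 0.5    # 0.5 bytes per FP4 weight
              + TOTAL_PARAMS * FP8_SHARE * 1.0  # 1.0 bytes per FP8 weight
              ) / 1e9

headroom_gb = 192 - weights_gb  # left for KV cache, activations, CUDA overhead
print(f"weights ~ {weights_gb:.0f} GB, headroom ~ {headroom_gb:.0f} GB on 2 x 96GB H100s")
```

That leaves on the order of 40GB for the KV cache and runtime overhead, which is why the demo holds the context window to 32K rather than the model's full extended length.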

The video documents every step of the installation sequence with vLLM, including a key troubleshooting moment: an outdated Transformers library causes the initial serve command to fail, so Mirza falls back to a fresh clone of DeepSeek's own inference repository. He then converts the 160GB of Hugging Face weights to DeepSeek's native inference format with the repository's generator.py script, distributes the model's 256 experts across both GPUs via tensor parallelism, and launches the model with PyTorch's torchrun for interactive inference.
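
Mirza ultimately serves the model through DeepSeek's own repository with torchrun rather than a Python API, but for readers who get the vLLM route working after updating Transformers, a minimal sketch of an equivalent two-GPU, 32K-context launch might look like the following. The model ID is a placeholder, not a confirmed Hugging Face path, and the settings would need adjusting for the actual checkpoint:

```python
# Hypothetical vLLM launch: tensor parallelism across two H100s with a 32K window.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # placeholder model ID (assumption)
    tensor_parallel_size=2,                 # shard the 256 experts across both GPUs
    max_model_len=32768,                    # cap context at 32K to stay within VRAM
    gpu_memory_utilization=0.95,            # leave a small margin for runtime overhead
    trust_remote_code=True,                 # allow DeepSeek's custom model code
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Summarize tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```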

Beyond installation mechanics, the video briefly covers DeepSeek V4 Flash’s architectural innovations — including new attention mechanisms (CSA and HCA) that enable a dramatically extended context window — and its benchmark performance relative to DeepSeek V3.2. The result is a reproducible, hardware-specific guide that gives engineers with equivalent GPU resources everything they need to self-host one of the most capable open-weight models currently available.


📺 Source: Fahd Mirza · Published April 24, 2026
🏷️ Format: Tutorial Demo
