Nanowhale-100m: Fascinating Implementation of DeepSeek-V4 Architecture

Fahd Mirza walks through Nanowhale-100M, a 110-million-parameter language model built entirely from scratch, with no borrowed weights, that implements the full DeepSeek-V4 architecture on a single consumer GPU. Pre-trained on 2.6 billion tokens from FineWeb-Edu over 5,000 steps and fine-tuned on 460,000 chat examples from Hugging Face’s SmolTalk dataset, the model fits in a few hundred megabytes and runs on just over 1GB of VRAM, demonstrated live on an NVIDIA RTX 6000 with 48GB available.
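The training and size figures above can be sanity-checked with a few lines of arithmetic. The sketch below is a rough estimate only: the parameter, token, and step counts come from the description, while the storage precisions (float32 and bfloat16) are assumptions, since the text does not say which format the released weights use.

```python
# Back-of-the-envelope check of the figures quoted in the description.
PARAMS = 110_000_000        # total parameters (from the description)
TOKENS = 2_600_000_000      # pre-training tokens (from the description)
STEPS = 5_000               # optimizer steps (from the description)

# Assumed storage precisions; the source does not state which one is used.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_megabytes(params: int, dtype: str) -> float:
    """Size of the raw weights in megabytes (1 MB = 1e6 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e6


tokens_per_step = TOKENS // STEPS
tokens_per_param = TOKENS / PARAMS

print(f"tokens per step:      {tokens_per_step:,}")
print(f"tokens per parameter: {tokens_per_param:.1f}")
for dtype in BYTES_PER_PARAM:
    print(f"weights in {dtype}: {weight_megabytes(PARAMS, dtype):.0f} MB")
```

Under either assumed precision the raw weights land in the "few hundred megabytes" range. The gap between that and the roughly 1GB of VRAM seen at runtime would be taken up by activations, caches, and framework overhead, none of which this estimate models.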

The architectural choices faithfully mirror DeepSeek-V4 at a radically compressed scale: eight transformer layers with a 320-dimensional hidden size, multi-head latent attention (MLA) for query compression, a mixture-of-experts layer with four routed experts plus one shared expert using top-2 routing, hyper-connections with Sinkhorn routing replacing standard residuals, and a multi-token prediction head. A notable quirk: most of the 110M parameters live in the embedding table due to DeepSeek’s large 129K-token vocabulary, leaving the actual transformer stack surprisingly lightweight.
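The claim about the embedding table can be checked with simple arithmetic. This sketch assumes a vocabulary of exactly 129,000 tokens (the text only says "129K") and considers two cases, because the description does not say whether the input embedding and the output projection share weights.

```python
# Estimate how much of the parameter budget the embeddings consume.
VOCAB_SIZE = 129_000        # "129K" in the text; the exact value is an assumption
HIDDEN_SIZE = 320           # hidden dimension (from the description)
TOTAL_PARAMS = 110_000_000  # total parameters (from the description)


def embedding_params(vocab_size: int, hidden_size: int, tied: bool) -> int:
    """Parameters in the token embedding, plus the output head if untied."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden_size


for tied in (True, False):
    n = embedding_params(VOCAB_SIZE, HIDDEN_SIZE, tied)
    share = n / TOTAL_PARAMS
    print(f"tied={tied}: {n / 1e6:.1f}M parameters, {share:.0%} of the total")
```

A single vocabulary-sized matrix comes to about 41M parameters. If the output head is a separate matrix, the two together reach about 83M, roughly three quarters of the model, which would match the "most of the parameters" observation. Which case applies to Nanowhale-100M is not stated in the description.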

Mirza is candid about the model’s practical limits—instruction following essentially fails at this scale, and outputs are largely incoherent. But that is framed as the point: Nanowhale-100M is a runnable, inspectable educational artifact that makes DeepSeek’s architectural innovations accessible to anyone who wants to study them hands-on, experiment with custom fine-tuning, or understand how MLA, MoE, and hyper-connections interact without needing frontier-scale compute.
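For readers who want a feel for the mixture-of-experts layout mentioned earlier (four routed experts plus one shared expert, top-2 routing), here is a minimal pure-Python sketch. It is illustrative only and is not taken from the Nanowhale-100M code: the function names are hypothetical, the experts are toy scaling functions rather than feed-forward networks, the router logits are passed in directly instead of being computed by a learned layer, and softmax gating with renormalized top-2 weights is an assumption.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Pick the two highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


def moe_forward(x, router_logits, routed_experts, shared_expert):
    """Shared expert always runs; the top-2 routed experts are added on top."""
    out = shared_expert(x)
    for idx, weight in top2_route(router_logits):
        expert_out = routed_experts[idx](x)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Toy experts: each routed expert scales its input by a different constant.
routed_experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
shared_expert = lambda x: list(x)

token = [1.0, 2.0]
logits = [0.1, 2.0, -1.0, 1.5]  # hypothetical router scores for one token
print(top2_route(logits))
print(moe_forward(token, logits, routed_experts, shared_expert))
```

Only two of the four routed experts run for any given token, which is what keeps the active parameter count below the total, while the shared expert sees every token.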

📺 Source: Fahd Mirza · Published June 06, 2026
🏷️ Format: Deep Dive
