Descriptions:
Fahd Mirza walks through Nanowhale-100M, a 110-million-parameter language model built entirely from scratch, with no borrowed weights, that implements the full DeepSeek-V4 architecture on a single consumer GPU. Pre-trained on 2.6 billion tokens from FineWeb-Edu over 5,000 steps and fine-tuned on 460,000 chat examples from Hugging Face’s SmolTalk dataset, the model fits in a few hundred megabytes and runs in just over 1 GB of VRAM, as demonstrated live on an NVIDIA RTX 6000 with 48 GB available.
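As a rough sanity check on the training figures quoted above, the following sketch works out the implied tokens per optimizer step and tokens per parameter. The sequence length is not stated in the description, so `ASSUMED_SEQ_LEN` is a hypothetical value used only to show how a per-step batch size would follow from it.

```python
# Figures quoted in the description above.
PRETRAIN_TOKENS = 2_600_000_000   # 2.6 billion FineWeb-Edu tokens
PRETRAIN_STEPS = 5_000
TOTAL_PARAMS = 110_000_000        # 110 million parameters

# Hypothetical: the description does not state the training sequence length.
ASSUMED_SEQ_LEN = 2_048

tokens_per_step = PRETRAIN_TOKENS // PRETRAIN_STEPS
tokens_per_param = PRETRAIN_TOKENS / TOTAL_PARAMS
sequences_per_step = tokens_per_step / ASSUMED_SEQ_LEN

print(f"tokens per step:      {tokens_per_step:,}")
print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"sequences per step at {ASSUMED_SEQ_LEN} tokens each: {sequences_per_step:.0f}")
```

Under the quoted figures, each step covers 520,000 tokens, and the model sees roughly 24 tokens per parameter during pre-training.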
The architectural choices faithfully mirror DeepSeek-V4 at a radically compressed scale: eight transformer layers with a 320-dimensional hidden size, multi-head latent attention (MLA) for query compression, a mixture-of-experts (MoE) layer with four routed experts plus one shared expert using top-2 routing, hyper-connections with Sinkhorn routing in place of standard residual connections, and a multi-token prediction head. A notable quirk: most of the 110M parameters live in the embedding table because of DeepSeek’s large 129K-token vocabulary, leaving the actual transformer stack surprisingly lightweight.
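The claim that the embedding table dominates the parameter count can be checked with back-of-the-envelope arithmetic. This sketch assumes a vocabulary of exactly 129,000 tokens (the description says only "129K") and treats weight tying between the input embedding and the output projection as an open question, since the description does not say which the model uses. The helper `embedding_params` is illustrative and not taken from the Nanowhale code.

```python
VOCAB_SIZE = 129_000        # assumption: "129K" taken literally; exact size not given
HIDDEN_SIZE = 320           # from the description
TOTAL_PARAMS = 110_000_000  # from the description


def embedding_params(vocab_size: int, hidden_size: int, tied: bool) -> int:
    """Count token-embedding weights, plus a separate output projection if untied."""
    one_table = vocab_size * hidden_size
    return one_table if tied else 2 * one_table


for tied in (True, False):
    count = embedding_params(VOCAB_SIZE, HIDDEN_SIZE, tied)
    share = count / TOTAL_PARAMS
    label = "tied" if tied else "untied"
    print(f"{label:>6}: {count:,} params, {share:.0%} of total")
```

Under these assumptions, a single table accounts for a little over a third of the total, while an untied input and output pair accounts for about three quarters. Only the untied case matches the description's "most", so that reading is a guess rather than a confirmed detail of the model.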
Mirza is candid about the model’s practical limits: instruction following essentially fails at this scale, and outputs are largely incoherent. He frames that as the point. Nanowhale-100M is a runnable, inspectable educational artifact that makes DeepSeek’s architectural innovations accessible to anyone who wants to study them hands-on, experiment with custom fine-tuning, or understand how MLA, MoE, and hyper-connections interact without needing frontier-scale compute.
📺 Source: Fahd Mirza · Published June 06, 2026
🏷️ Format: Deep Dive