Luce Spark: Run a 35B Model Under 16GB VRAM Locally

Fahd Mirza demonstrates LuceSpark, a memory management technique that allows a 35-billion-parameter mixture-of-experts model to run within a 16GB VRAM budget — hardware far below the 24GB typically required. The core mechanism is intelligent expert offloading: LuceSpark profiles real user traffic to identify which MoE experts are activated most frequently, keeps those ‘hot’ experts resident on the GPU, and offloads ‘cold’ experts to system RAM, pulling them in only when a specific token activates them. Critically, the system learns and updates the hot/cold split from actual inference traffic, so the cache becomes more efficient over time.
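The hot/cold split described above can be illustrated with a minimal Python sketch. It is not LuceSpark's actual code: the function name `plan_hot_cold`, the toy activation trace, and the per-expert sizes are all hypothetical, and the greedy "most frequent first" policy is an assumption about how such a planner could work.

```python
from collections import Counter


def plan_hot_cold(activations, expert_size_gb, budget_gb):
    """Greedy planner (hypothetical): the most frequently activated experts
    are placed on the GPU until the VRAM budget is used up; the rest stay
    in system RAM and would be fetched on demand."""
    counts = Counter(activations)
    # Rank every known expert by activation count, highest first.
    # Experts never seen in the trace have count 0; ties break by expert id.
    ranked = sorted(expert_size_gb, key=lambda e: (-counts[e], e))

    hot, cold, used_gb = [], [], 0.0
    for expert in ranked:
        size = expert_size_gb[expert]
        if used_gb + size <= budget_gb:
            hot.append(expert)   # resident in VRAM
            used_gb += size
        else:
            cold.append(expert)  # offloaded to system RAM
    return hot, cold, used_gb


# Toy trace: the expert id chosen by the router for each token (made-up data).
trace = [0, 1, 0, 2, 0, 1, 3, 0, 1, 2]
sizes = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}  # GB per expert, illustrative only

hot, cold, used = plan_hot_cold(trace, sizes, budget_gb=2.0)
print("hot:", hot, "cold:", cold, "VRAM used (GB):", used)
```

Re-running the planner on a fresh trace is one simple way to model the "learns from actual inference traffic" behaviour: as the activation counts shift, so does the set of experts that fits in the budget.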

The video places Spark in context alongside three other LuceBox inference techniques: MegaKernel (fused CUDA launches for small models), and DFlash and PFlash (speculative decoding for generation and prefill speed). Mirza explains that while those three address throughput, Spark is the only one that addresses model fit, letting models that would otherwise not fit on consumer hardware run at near-full decode speed.

Mirza walks through the full setup on Ubuntu with an NVIDIA RTX 6000 (48GB): compiling the server binary, downloading the Qwen model checkpoint, and launching with the key flags `--spark` to enable expert offloading and `--spark-vm 16` to set the VRAM budget. Live nvtop output confirms the model runs comfortably under 16GB, with the budget split showing 11.21GB of hot experts on-GPU and 7.02GB of cold experts in RAM. One important caveat: LuceSpark does not currently support MTP-enabled quantized checkpoints, so a standard non-MTP GGUF file is required.
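The reported split can be checked with a few lines of arithmetic. The sketch below uses only the three figures quoted above and assumes that both the 11.21GB and 7.02GB numbers refer to expert weights alone; what occupies the remaining space under the budget (for example non-expert weights or the KV cache) is not stated in the source.

```python
# Figures quoted in the description (GB).
hot_gb = 11.21     # hot experts resident on the GPU
cold_gb = 7.02     # cold experts held in system RAM
budget_gb = 16.0   # VRAM budget passed at launch

# Assumption: hot + cold together cover all expert weights.
total_expert_gb = hot_gb + cold_gb
hot_share = hot_gb / total_expert_gb
headroom_gb = budget_gb - hot_gb

print(f"expert weights in total: {total_expert_gb:.2f} GB")
print(f"share resident on GPU:   {hot_share:.1%}")
print(f"left under the budget:   {headroom_gb:.2f} GB")
```

Under that assumption, roughly three fifths of the expert weights stay on the GPU, and about 4.79GB of the 16GB budget remains for everything that is not a hot expert.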

📺 Source: Fahd Mirza · Published June 12, 2026
🏷️ Format: Tutorial Demo
