Descriptions:
Fahd Mirza demonstrates LuceSpark, a memory management technique that allows a 35-billion-parameter mixture-of-experts model to run within a 16GB VRAM budget — hardware far below the 24GB typically required. The core mechanism is intelligent expert offloading: LuceSpark profiles real user traffic to identify which MoE experts are activated most frequently, keeps those ‘hot’ experts resident on the GPU, and offloads ‘cold’ experts to system RAM, pulling them in only when a specific token activates them. Critically, the system learns and updates the hot/cold split from actual inference traffic, so the cache becomes more efficient over time.
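The hot/cold split described above can be illustrated with a minimal Python sketch. This is not LuceSpark's actual implementation: the function name `plan_expert_placement`, its parameters, and the greedy frequency-ranked policy are all assumptions made for illustration. It counts how often each expert was activated in a traffic trace, then keeps the most frequently used experts on the GPU until a VRAM budget is spent.

```python
from collections import Counter


def plan_expert_placement(activations, expert_size_gb, vram_budget_gb):
    """Hypothetical greedy hot/cold split driven by observed activation counts.

    activations     -- iterable of expert ids seen in a traffic trace
    expert_size_gb  -- dict mapping expert id -> size in GB
    vram_budget_gb  -- VRAM available for resident ("hot") experts
    """
    counts = Counter(activations)

    # Rank every known expert by activation count, most frequent first.
    # Ties are broken by expert id so the result is deterministic.
    ranked = sorted(expert_size_gb, key=lambda e: (-counts[e], e))

    hot, used = [], 0.0
    for expert in ranked:
        size = expert_size_gb[expert]
        # Keep the expert on the GPU only if it still fits in the budget;
        # otherwise skip it and try the next (less frequent) expert.
        if used + size <= vram_budget_gb:
            hot.append(expert)
            used += size

    # Everything that did not fit is offloaded to system RAM.
    cold = [e for e in expert_size_gb if e not in hot]
    return hot, cold, used


if __name__ == "__main__":
    trace = [0, 0, 0, 1, 1, 2, 3, 0, 1]          # toy activation trace
    sizes = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}     # toy per-expert sizes in GB
    hot, cold, used = plan_expert_placement(trace, sizes, vram_budget_gb=2.0)
    print("hot:", hot, "cold:", cold, "used GB:", used)
```

Re-running the planner on a fresh trace would mirror the idea of updating the split from live traffic; a real system would also have to handle fetching a cold expert on demand when a token routes to it.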
The video places Spark in context alongside three other LuceBox inference techniques — MegaKernel (fused CUDA launches for small models), DFlash and PFlash (speculative decoding for generation and prefill speed). Mirza explains that while those three address throughput, Spark is the only one that addresses model fit — enabling models that simply would not run on consumer hardware to do so at near-full decode speed.
Mirza walks through the full setup on Ubuntu with an NVIDIA RTX 6000 (48GB): compiling the server binary, downloading the Qwen model checkpoint, and launching with the key flags, `--spark` to enable expert offloading and `--spark-vm 16` to set the VRAM budget. Live nvtop output confirms the model runs comfortably under 16GB, with the budget split showing 11.21GB of hot experts on the GPU and 7.02GB of cold experts in system RAM. One important caveat: LuceSpark does not currently support MTP-enabled quantized checkpoints, so it requires a standard non-MTP GGUF file.
📺 Source: Fahd Mirza · Published June 12, 2026
🏷️ Format: Tutorial Demo







