Qwen3-8B at 74 tok/s with RedHat DFlash Speculator on vLLM Locally

Description:

Fahd Mirza walks through running Red Hat’s DFlash speculative decoding implementation on Qwen3-8B using vLLM, achieving 74 tokens per second on a single NVIDIA RTX A6000 GPU. This is part of a broader series on DFlash, a speculative decoding technique originating from UC San Diego that has been rapidly adopted across the AI inference ecosystem.

The video explains how DFlash differs from standard speculative decoding: traditional draft models still generate their proposed tokens sequentially, while DFlash uses block diffusion, drawing on hidden states from the main model, to propose an entire block of tokens in one pass. This removes the sequential dependency and yields roughly a 3x speedup, versus the roughly 1.3x typical of conventional speculative decoding. Red Hat trained its Qwen3-8B drafter with its Speculators library on a mix of Magpie and UltraChat data, with responses regenerated by Qwen3-8B itself so the drafter matches the main model's output distribution.
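To make the verify step concrete, here is a minimal sketch of the block-wise accept loop that any speculative decoder runs, assuming hypothetical `drafter.propose()` and `target_model()` interfaces (neither is Red Hat's actual API): the drafter emits a whole block at once, the target model scores the entire block in a single forward pass, and tokens are kept up to the first disagreement.

```python
import torch

def speculative_step(target_model, drafter, ctx_ids, block_size=7):
    """One hypothetical block-speculative decoding step.

    `drafter` proposes `block_size` tokens in a single pass (the
    block-diffusion idea), and `target_model` verifies them all in a
    single forward pass over context + draft.
    """
    # 1) Drafter proposes an entire block at once (no token-by-token loop).
    draft_ids = drafter.propose(ctx_ids, block_size)        # shape [block_size]

    # 2) Target scores context + draft block in ONE forward pass.
    logits = target_model(torch.cat([ctx_ids, draft_ids]))  # shape [T, vocab]
    # Logits at position i predict token i+1, so these slices align the
    # target's greedy picks with the drafted tokens.
    verify = logits[len(ctx_ids) - 1 : -1].argmax(-1)       # shape [block_size]

    # 3) Accept draft tokens up to the first disagreement.
    accepted = []
    for d, t in zip(draft_ids.tolist(), verify.tolist()):
        if d != t:
            accepted.append(t)  # target's correction is still a "free" token
            break
        accepted.append(d)
    return accepted  # at least 1 token guaranteed per target forward pass
```

Greedy matching is shown for brevity; production systems use a rejection-sampling acceptance rule so the output distribution exactly matches the target model's.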

Mirza covers the full setup on Ubuntu: installing vLLM from an unmerged GitHub pull request via the uv package manager, serving the model with the speculator proposing seven tokens per step, and benchmarking throughput with a coding prompt. He also breaks down the ~45 GB VRAM footprint, covering model precision (BF16), draft-model overhead, and the upfront KV cache allocation for a 16K context window. For practitioners looking to push inference speed on smaller open models without enterprise-grade multi-GPU hardware, this is a detailed, reproducible demonstration of cutting-edge speculative decoding on a single workstation-class GPU.
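As a rough sketch of what that serving configuration looks like through vLLM's offline Python API: `speculative_config` and `num_speculative_tokens` are real vLLM parameters, but the `"dflash"` method name and the drafter checkpoint ID below are assumptions, since the actual support lives in the unmerged pull request mentioned above.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",
    dtype="bfloat16",             # BF16 weights: ~16 GB for 8B parameters
    max_model_len=16384,          # 16K context window
    gpu_memory_utilization=0.90,  # vLLM preallocates VRAM (incl. KV cache) up front
    speculative_config={
        "method": "dflash",                   # ASSUMED method name (unmerged PR)
        "model": "RedHatAI/Qwen3-8B-dflash",  # HYPOTHETICAL drafter checkpoint ID
        "num_speculative_tokens": 7,          # seven proposed tokens per step
    },
)

out = llm.generate(
    ["Write a Python function that merges two sorted lists."],
    SamplingParams(max_tokens=256),
)
print(out[0].outputs[0].text)
```

The memory breakdown follows from this config: the 8B base model in BF16 takes about 16 GB, and the remainder of the ~45 GB footprint is the drafter plus the KV cache that vLLM preallocates up front for the 16K context window.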


📺 Source: Fahd Mirza · Published May 11, 2026
🏷️ Format: Hands On Build
