Qwen3-8B at 74 tok/s with RedHat DFlash Speculator on vLLM Locally

Description:

Fahd Mirza walks through running Red Hat’s DFlash speculative decoding implementation on Qwen3-8B using vLLM, achieving 74 tokens per second on a single NVIDIA RTX A6000 GPU. This is part of a broader series on DFlash, a speculative decoding technique originating from UC San Diego that has been rapidly adopted across the AI inference ecosystem.

The video explains how DFlash differs from standard speculative decoding: traditional draft models still generate their proposed tokens sequentially, while DFlash uses block diffusion, drawing on hidden states from the main model, to propose an entire block of tokens in one pass. This removes the sequential dependency and yields roughly a 3x speedup, versus the roughly 1.3x typical of conventional speculative decoding. Red Hat trained its Qwen3-8B drafter with its Speculators library on a mix of Magpie and UltraChat data, with responses regenerated by Qwen3-8B itself so the drafter matches the main model's output distribution.
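To make the verify step concrete, here is a minimal sketch of the block-wise accept loop that any speculative decoder runs, assuming hypothetical `drafter.propose()` and `target_model()` interfaces (neither is Red Hat's actual API): the drafter emits a whole block at once, the target model scores the entire block in a single forward pass, and tokens are kept up to the first disagreement.

```python
import torch

def speculative_step(target_model, drafter, ctx_ids, block_size=7):
    """One hypothetical block-speculative decoding step.

    `drafter` proposes `block_size` tokens in a single pass (the
    block-diffusion idea), and `target_model` verifies them all in a
    single forward pass over context + draft.
    """
    # 1) Drafter proposes an entire block at once (no token-by-token loop).
    draft_ids = drafter.propose(ctx_ids, block_size)        # shape [block_size]

    # 2) Target scores context + draft block in ONE forward pass.
    logits = target_model(torch.cat([ctx_ids, draft_ids]))  # shape [T, vocab]
    # Logits at position i predict token i+1, so these slices align the
    # target's greedy picks with the drafted tokens.
    verify = logits[len(ctx_ids) - 1 : -1].argmax(-1)       # shape [block_size]

    # 3) Accept draft tokens up to the first disagreement.
    accepted = []
    for d, t in zip(draft_ids.tolist(), verify.tolist()):
        if d != t:
            accepted.append(t)  # target's correction is still a "free" token
            break
        accepted.append(d)
    return accepted  # at least 1 token guaranteed per target forward pass
```

Greedy matching is shown for brevity; production systems use a rejection-sampling acceptance rule so the output distribution exactly matches the target model's.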

Mirza covers the full setup on Ubuntu: installing vLLM from an unmerged GitHub pull request via the uv package manager, serving the model with the speculator proposing seven tokens per step, and benchmarking throughput with a coding prompt. He also breaks down the ~45 GB VRAM footprint, covering model precision (BF16), draft-model overhead, and the upfront KV cache allocation for a 16K context window. For practitioners looking to push inference speed on smaller open models without enterprise-grade multi-GPU hardware, this is a detailed, reproducible demonstration of cutting-edge speculative decoding on a single workstation-class GPU.
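As a rough sketch of what that serving configuration looks like through vLLM's offline Python API: `speculative_config` and `num_speculative_tokens` are real vLLM parameters, but the `"dflash"` method name and the drafter checkpoint ID below are assumptions, since the actual support lives in the unmerged pull request mentioned above.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",
    dtype="bfloat16",             # BF16 weights: ~16 GB for 8B parameters
    max_model_len=16384,          # 16K context window
    gpu_memory_utilization=0.90,  # vLLM preallocates VRAM (incl. KV cache) up front
    speculative_config={
        "method": "dflash",                   # ASSUMED method name (unmerged PR)
        "model": "RedHatAI/Qwen3-8B-dflash",  # HYPOTHETICAL drafter checkpoint ID
        "num_speculative_tokens": 7,          # seven proposed tokens per step
    },
)

out = llm.generate(
    ["Write a Python function that merges two sorted lists."],
    SamplingParams(max_tokens=256),
)
print(out[0].outputs[0].text)
```

The memory breakdown follows from this config: the 8B base model in BF16 takes about 16 GB, and the remainder of the ~45 GB footprint is the drafter plus the KV cache that vLLM preallocates up front for the 16K context window.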


📺 Source: Fahd Mirza · Published May 11, 2026
🏷️ Format: Hands On Build
