D-flash - Frontier Models

There are 10 items on this page

59:10

Interviews · 3 weeks ago

The 100,000 Sandbox Problem — Akshat Bubna, Modal CTO

Latent Space hosts Modal CTO Akshat Bubna for an in-depth conversation on how Modal evolved from a general serverless runtime into a...

09:39

Coding & Dev Tools · 4 weeks ago

DeepSeek DFlash on Gemma 12B Locally: Up To 5x Faster

Fahd Mirza demonstrates how to run DeepSeek's DFlash speculative decoding method locally, pairing the open-source DeepSeek drafter mo...

08:49

Foundation Models · 1 month ago

DSpark – DeepSeek Just Made Inference 85% Faster

DeepSeek has released DSpark, a speculative decoding system that makes their models generate text 60 to 85% faster without any change...

09:40

Benchmarks · 1 month ago

DFlash Just Got Faster: 4x Speed with 160 tok/s Locally

Fahd Mirza benchmarks DFlash with SGLang's new SpecV2 overlapping scheduler on an NVIDIA H100 80GB GPU, demonstrating a 4.3x throughp...

12:30

Coding & Dev Tools · 2 months ago

Luce KVFlash: Fit 256K Context on a Small GPU – Local Hands-On Guide

KVFlash is a new memory management engine for local LLM inference that keeps only the most relevant tokens on GPU VRAM while paging...

13:13

Tutorials · 2 months ago

Adaptive PFlash + Hermes Agent – Self-Tuning Prefill on a Single GPU Locally

Fahd Mirza demonstrates the newly shipped adaptive compression feature in PFlash, the prefill-acceleration component of the open-sour...

08:41

Tutorials · 3 months ago

Luce DFlash Meets OpenClaw – Local AI Agents at 2x Speed with Qwen3.6-27B

Fahd Mirza walks through a complete, reproducible integration of DFlash — a speculative decoding inference engine — with OpenClaw, an...

09:45

Tutorials · 3 months ago

TurboQuant + DFlash: Supercharge Local LLM Speed

Fahd Mirza demonstrates the practical integration of two recently released local inference tools: Google Research's TurboQuant KV cach...

08:28

Coding & Dev Tools · 3 months ago

Qwen3-8B at 74 tok/s with RedHat DFlash Speculator on vLLM Locally

Fahd Mirza walks through running Red Hat's DFlash speculative decoding implementation on Qwen3-8B using vLLM, achieving 74 tokens per...

11:12

Benchmarks · 3 months ago

Qwen3.6 27B Gets 20% Faster with MTP and llama.cpp Locally

Fahd Mirza demonstrates how to enable multi-token prediction (MTP) on Qwen3.6 27B using ik_llama.cpp — a community fork of the popula...