CUDA - Frontier Models

There are 21 items on this page

02:01:45

Interviews · 2 weeks ago

Compute Improves Compute + Europe 2031

The Cognitive Revolution's daily AI briefing for June 23, 2026 covers a wide sweep of the AI industry, from semiconductor market turb...

09:07

Tutorials · 1 month ago

DwarfStar: Run DeepSeek V4 Locally with DS4 at 34 tok/s

Fahd Mirza covers DwarfStar, a brand-new inference engine built specifically for DeepSeek V4 Flash (DS4) by the creator of Radius. Un...

10:06

Coding & Dev Tools · 1 month ago

DFlash Leaves Qwen Territory – Gemma 4 31B Now Runs 5x Faster with Speculative Decoding

Fahd Mirza demonstrates the first end-to-end deployment of Llama Box DFlash with Google's Gemma 4 31B model, following the merge of P...

08:53

Research & Benchmarks · 1 month ago

$400 Chinese GPU That Wants to Dethrone NVIDIA

Fahd Mirza takes a close look at the Lision LX7G 100, a roughly $485 consumer GPU developed entirely in China without CUDA, AMD archi...

01:47:27

Interviews · 1 month ago

we are NOT PREPARED for the end of 2026

Wes Roth and co-host Dylan deliver a wide-ranging AI industry podcast covering the most significant developments from the week of Goo...

18:25

Coding & Dev Tools · 1 month ago

Your Coding Agent Should Do AI System Engineering — Ben Burtenshaw, Hugging Face

Ben Burtenshaw, an engineer at Hugging Face, makes the case that coding agents have crossed a capability threshold where they can now...

09:52

Benchmarks · 2 months ago

Luce Megakernel — 25x Faster Than PyTorch on a Single GPU – Test Locally

A new open-source project called Luce Megakernel is challenging long-held assumptions about GPU inference efficiency by fusing all 24...

09:01

Coding & Dev Tools · 2 months ago

Running a 27B model at 130 tokens/sec on a single GPU Locally with Luce DFlash

LlamaDeFlash is a custom inference engine built from scratch in C++ and CUDA — no vLLM, no llama.cpp, no Python in the critical path...

10:01

Foundation Models · 2 months ago

The Hidden Engine Behind DeepSeek V4 – DeepEP V2 and TileKernels Explained

While most coverage of DeepSeek V4 focuses on benchmark scores, Fahd Mirza goes a level deeper to explain the two open-sourced infras...

10:33

Coding & Dev Tools · 2 months ago

Kimi FlashKDA: 2x Faster AI Prefill — Installed, Explained and Tested Locally

Fahd Mirza walks through the live installation of Flash KDA, Moonshot AI's open-source CUDA kernel that accelerates the prefill phase...