Luce Megakernel — 25x Faster Than PyTorch on a Single GPU – Test Locally


Description:

A new open-source project called Luce Megakernel is challenging long-held assumptions about GPU inference efficiency by fusing all 24 layers of the WhenLM 3.5 8B model into a single CUDA kernel dispatch. Traditional inference frameworks such as llama.cpp and PyTorch require the CPU to dispatch a separate kernel for each model layer — roughly 100 kernel launches per token — creating substantial overhead from CPU round trips, weight re-fetching, and thread synchronization. The megakernel eliminates this bottleneck using cooperative grid synchronization, keeping activations in GPU registers and shared memory throughout the entire forward pass.
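The dispatch pattern described above can be sketched in CUDA. This is an illustrative sketch, not the actual Luce Megakernel source — the kernel name, signature, and layer loop are hypothetical — but it shows the core idea: a cooperative kernel runs every layer inside one launch, with grid-wide barriers standing in where a traditional framework would return to the CPU to launch the next kernel.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical fused forward pass: all layers execute inside a single
// kernel launch. grid.sync() is a device-side, grid-wide barrier from
// the cooperative groups API, replacing per-layer CPU kernel dispatch.
__global__ void fused_forward(float* activations,
                              const float* const* layer_weights,
                              int num_layers) {
    cg::grid_group grid = cg::this_grid();
    for (int layer = 0; layer < num_layers; ++layer) {
        // ... compute this layer's slice of the forward pass here,
        // keeping intermediates in registers and shared memory ...
        grid.sync();  // all blocks wait before the next layer begins
    }
}
```

Note that cooperative kernels must be launched with `cudaLaunchCooperativeKernel` (not the usual `<<<...>>>` syntax), and the whole grid must be resident on the GPU at once for `grid.sync()` to be valid — one reason fusing an entire model into a single dispatch is non-trivial.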

The benchmark results, demonstrated live on an NVIDIA RTX A6000 running Ubuntu, are striking. On an RTX 3090, the megakernel achieves 413 tokens per second at decode, versus 267 tokens/sec for llama.cpp on the same card and 229 tokens/sec on an Apple M5 Max. More significantly, the efficiency picture reverses completely: the megakernel delivers approximately 1.84 tokens per joule on the RTX 3090, compared to 0.76 tokens/joule for llama.cpp on the same hardware and 1.76 tokens/joule for the M5 Max. Prefill speeds are even more dramatic — 37,800 tokens/sec for the megakernel versus 11,000 for llama.cpp, a 3.4× improvement.

Host Fahd Mirza also explains why WhenLM 3.5 8B was the specific target: its hybrid architecture combining 18 Delta net linear-attention layers with six standard full-attention layers in a 3:1 ratio represents a design pattern increasingly common in next-generation LLMs, including WhenLM 3 Next and Kimmy Linear. Luce Megakernel is the first fused kernel built specifically for this hybrid Delta net and attention pattern, with no equivalent support in MLX or generic llama.cpp.
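Inside a single fused kernel, the hybrid pattern described above could be handled by branching on the layer type between grid-wide barriers. The sketch below is a hypothetical illustration of that structure — the enum, kernel name, and layer table are assumptions, not Luce Megakernel's actual code:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical layer-type table for a hybrid model: most layers are
// linear-attention (Delta net), interleaved with full-attention layers.
enum LayerType { DELTA_NET, FULL_ATTENTION };

__global__ void hybrid_fused_forward(float* activations,
                                     const LayerType* layer_types,
                                     int num_layers) {
    cg::grid_group grid = cg::this_grid();
    for (int i = 0; i < num_layers; ++i) {
        if (layer_types[i] == DELTA_NET) {
            // linear-attention update: recurrent state, no growing KV cache
        } else {
            // standard softmax attention over the KV cache
        }
        grid.sync();  // grid-wide barrier between layers, no CPU round trip
    }
}
```

A generic framework dispatches different kernels for the two layer types; fusing them means both code paths must coexist in one kernel, which is why a kernel built specifically for this hybrid pattern has no drop-in equivalent elsewhere.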


📺 Source: Fahd Mirza · Published May 15, 2026
🏷️ Format: Benchmark Test
