Luce Megakernel — 25x Faster Than PyTorch on a Single GPU – Test Locally


Description:

A new open-source project called Luce Megakernel is challenging long-held assumptions about GPU inference efficiency by fusing all 24 layers of the WhenLM 3.5 8B model into a single CUDA kernel dispatch. Traditional inference frameworks such as llama.cpp and PyTorch require the CPU to dispatch a separate kernel for each model layer — roughly 100 kernel launches per token — creating substantial overhead from CPU round trips, weight re-fetching, and thread synchronization. The megakernel eliminates this bottleneck using cooperative grid synchronization, keeping activations in GPU registers and shared memory throughout the entire forward pass.
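The dispatch pattern described above can be sketched in CUDA. This is an illustrative sketch, not the actual Luce Megakernel source — the kernel name, signature, and layer loop are hypothetical — but it shows the core idea: a cooperative kernel runs every layer inside one launch, with grid-wide barriers standing in where a traditional framework would return to the CPU to launch the next kernel.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical fused forward pass: all layers execute inside a single
// kernel launch. grid.sync() is a device-side, grid-wide barrier from
// the cooperative groups API, replacing per-layer CPU kernel dispatch.
__global__ void fused_forward(float* activations,
                              const float* const* layer_weights,
                              int num_layers) {
    cg::grid_group grid = cg::this_grid();
    for (int layer = 0; layer < num_layers; ++layer) {
        // ... compute this layer's slice of the forward pass here,
        // keeping intermediates in registers and shared memory ...
        grid.sync();  // all blocks wait before the next layer begins
    }
}
```

Note that cooperative kernels must be launched with `cudaLaunchCooperativeKernel` (not the usual `<<<...>>>` syntax), and the whole grid must be resident on the GPU at once for `grid.sync()` to be valid — one reason fusing an entire model into a single dispatch is non-trivial.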

The benchmark results, demonstrated live on an NVIDIA RTX A6000 running Ubuntu, are striking. On an RTX 3090, the megakernel achieves 413 tokens per second at decode, versus 267 tokens/sec for llama.cpp on the same card and 229 tokens/sec on an Apple M5 Max. More significantly, the efficiency picture reverses completely: the megakernel delivers approximately 1.84 tokens per joule on the RTX 3090, compared to 0.76 tokens/joule for llama.cpp on the same hardware and 1.76 tokens/joule for the M5 Max. Prefill speeds are even more dramatic — 37,800 tokens/sec for the megakernel versus 11,000 for llama.cpp, a 3.4× improvement.

Host Fahd Mirza also explains why WhenLM 3.5 8B was the specific target: its hybrid architecture combining 18 Delta net linear-attention layers with six standard full-attention layers in a 3:1 ratio represents a design pattern increasingly common in next-generation LLMs, including WhenLM 3 Next and Kimmy Linear. Luce Megakernel is the first fused kernel built specifically for this hybrid Delta net and attention pattern, with no equivalent support in MLX or generic llama.cpp.
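Inside a single fused kernel, the hybrid pattern described above could be handled by branching on the layer type between grid-wide barriers. The sketch below is a hypothetical illustration of that structure — the enum, kernel name, and layer table are assumptions, not Luce Megakernel's actual code:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical layer-type table for a hybrid model: most layers are
// linear-attention (Delta net), interleaved with full-attention layers.
enum LayerType { DELTA_NET, FULL_ATTENTION };

__global__ void hybrid_fused_forward(float* activations,
                                     const LayerType* layer_types,
                                     int num_layers) {
    cg::grid_group grid = cg::this_grid();
    for (int i = 0; i < num_layers; ++i) {
        if (layer_types[i] == DELTA_NET) {
            // linear-attention update: recurrent state, no growing KV cache
        } else {
            // standard softmax attention over the KV cache
        }
        grid.sync();  // grid-wide barrier between layers, no CPU round trip
    }
}
```

A generic framework dispatches different kernels for the two layer types; fusing them means both code paths must coexist in one kernel, which is why a kernel built specifically for this hybrid pattern has no drop-in equivalent elsewhere.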


📺 Source: Fahd Mirza · Published May 15, 2026
🏷️ Format: Benchmark Test
