Llama CPP - Frontier Models

There are 22 items on this page.

29:59

Interviews · 2 months ago

⚡️ Google’s Open AI Strategy — Omar Sanseviero, Google DeepMind

In this Latent Space podcast interview, Omar Sanseviero from Google DeepMind walks through the technical decisions and strategic thin...

08:11

Business & Strategy · 2 months ago

Weekly AI Recap – Qwen3.7, MTP in llama.cpp, SANA and More | May 2026

Fahd Mirza's weekly AI recap for May 2026 covers the most consequential model releases, infrastructure updates, and industry deals of...

09:01

Tutorials · 2 months ago

Llama.cpp Router Mode: Switch Models Instantly: Hands-on Local Demo

Fahd Mirza demonstrates llama.cpp's built-in router mode, a native feature that enables instant model hot-swapping without third-part...

10:48

Tutorials · 2 months ago

LM Studio Just Got MTP — Qwen3.6-27B Runs 63% Faster with One Toggle

Fahd Mirza demonstrates how to enable Multi-Token Prediction (MTP) speculative decoding in LM Studio's new beta release (version 0.4....

09:45

Tutorials · 2 months ago

Llama.cpp Just Got MTP – Qwen3.6 27B Runs 2x Faster Locally with Two Flags

Multi-token prediction (MTP) support has officially merged into the mainline llama.cpp repository—not a fork or custom branch, but th...

14:24

Business & Strategy · 2 months ago

AI Dev 26 x SF | Anush Elangovan: Impact of AI on Software

Anush Elangovan, VP of Software at AMD, delivered a keynote at the AI Dev 26 x SF conference hosted by DeepLearning.AI, sharing how h...

15:56

Research & Benchmarks · 2 months ago

MiniCPM-V 4.6: The Agent Vision Model

Sam Witteveen examines MiniCPM-V 4.6, a 1.3 billion parameter vision-language model released by OpenBMB—a joint initiative between AI...

08:06

Research & Benchmarks · 2 months ago

MTP vs DFlash — Speculative Decoding Explained Simply

This video by Fahd Mirza offers a clear, structured comparison of two speculative decoding techniques — Multi-Token Prediction (MTP)...

09:52

Benchmarks · 3 months ago

Luce Megakernel — 25x Faster Than PyTorch on a Single GPU – Test Locally

A new open-source project called Luce Megakernel is challenging long-held assumptions about GPU inference efficiency by fusing all 24...

15:31

Coding & Dev Tools · 3 months ago

PFlash + Qwen3.6-27B-DFlash: 10x Faster Prefill on a Single GPU: Run Locally

Fahd Mirza builds and benchmarks PFlash, a prefill acceleration tool that dramatically reduces the blank-screen wait time when feedin...