llama.cpp - Frontier Models

There are 33 items on this page

10:06

Coding & Dev Tools · 1 month ago

DFlash Leaves Qwen Territory – Gemma 4 31B Now Runs 5x Faster with Speculative Decoding

Fahd Mirza demonstrates the first end-to-end deployment of Llama Box DFlash with Google's Gemma 4 31B model, following the merge of P...

08:53

Research & Benchmarks · 1 month ago

$400 Chinese GPU That Wants to Dethrone NVIDIA

Fahd Mirza takes a close look at the Lision LX7G 100, a roughly $485 consumer GPU developed entirely in China without CUDA, AMD archi...

19:11

Business & Strategy · 2 months ago

Your Agent Can Now Train Models — Merve Noyan, Hugging Face

Merve Noyan from the Hugging Face open-source team delivers a broad survey of the current open-model landscape alongside several firs...

22:54

Tutorials · 2 months ago

This 100% uncensored AI model is insane… let’s run it

David Ondrej walks through the rationale, setup, and practical use of uncensored large language models running locally in 2026. The v...

11:12

Benchmarks · 2 months ago

Qwen3.6 27B Gets 20% Faster with MTP and llama.cpp Locally

Fahd Mirza demonstrates how to enable multi-token prediction (MTP) on Qwen3.6 27B using ik_llama.cpp — a community fork of the popula...

09:01

Coding & Dev Tools · 2 months ago

Running a 27B model at 130 tokens/sec on a single GPU Locally with Luce DFlash

Luce DFlash is a custom inference engine built from scratch in C++ and CUDA — no vLLM, no llama.cpp, no Python in the critical path...

14:53

Coding & Dev Tools · 2 months ago

This Mutant AI Model Should Not Exist: Qwopus-GLM-18B-Merged Locally

Fahd Mirza walks through the creation and live testing of Qwopus-GLM-18B-Merged, a community-built model that stitches together two s...

09:08

Tutorials · 2 months ago

Open WebUI Desktop App – Install on Linux, Windows & Mac

Open WebUI has shipped its first native desktop application for Windows, macOS, and Linux, and Fahd Mirza walks through the complete...

15:26

Business & Strategy · 2 months ago

Gemma, DeepMind’s Family of Open Models — Omar Sanseviero, Google DeepMind

Omar Sanseviero, a researcher at Google DeepMind, delivers the first public conference talk on Gemma 4 just one week after its releas...

14:56

Coding & Dev Tools · 3 months ago

MiniMax M2.7 Running Locally on CPU + GPU – Everyone Can Do It

Fahd Mirza walks through the complete process of running MiniMax M2.7 — a newly open-sourced 229-billion-parameter mixture-of-experts...