vLLM - Frontier Models


08:43

Tutorials · 2 months ago

DFlash Drafter for Gemma 4 26B – Official Speculative Decoding is Here: Run Locally

ZLab, the UC San Diego research team that invented DFlash speculative decoding, has released the first official drafter model paired...

08:41

Tutorials · 2 months ago

Gemma 4 31B at 196 tok/s with RedHat DFlash Speculator Locally

This hands-on tutorial from the Fahd Mirza channel demonstrates running Google's Gemma 4 31B model locally at 196 tokens per second u...

32:36

Research & Benchmarks · 2 months ago

RTX 5090, Mac Studio, or DGX Spark? I tried all three.

Nate B Jones tests the RTX 5090, Apple Mac Studio, and NVIDIA DGX Spark as personal AI computing platforms, but the video is as much...

09:01

Coding & Dev Tools · 2 months ago

Running a 27B model at 130 tokens/sec on a single GPU Locally with Luce DFlash

LlamaDeFlash is a custom inference engine built from scratch in C++ and CUDA — no vLLM, no llama.cpp, no Python in the critical path...

11:39

Coding & Dev Tools · 2 months ago

Poolside Laguna XS.2: New Open Weight Coding Model Tested Locally with vLLM

Poolside AI has released two new open-weight coding models: Laguna M.1 (2–5 billion parameters) and Laguna XS.2, a 33-billion-paramet...

13:58

Tutorials · 2 months ago

NVIDIA’s NEW Open Multimodal Intelligence – Nemotron 3 Nano Omni

NVIDIA has released the Nemotron 3 Nano Omni, a unified open multimodal model that fuses three of the company's strongest components...

18:10

Coding & Dev Tools · 2 months ago

NVIDIA Nemotron 3 Nano Omni — See, Hear & Read Everything Locally

NVIDIA's Nemotron 3 Nano Omni is a newly released multimodal model capable of processing video, audio, images, and long-form text sim...

10:22

Tutorials · 2 months ago

Run DeepSeek v4 Flash Locally and Get Blown Away

Fahd Mirza walks through the complete process of running DeepSeek V4 Flash locally on a dual-H100 GPU server, from hardware provision...

10:20

Coding & Dev Tools · 2 months ago

Qwen3.6-27B + OpenClaw: Multifile Agentic Coding at Scale Locally

Fahd Mirza demonstrates how to integrate Alibaba's Qwen 3.6 27B model with OpenClaw, the open-source agentic coding platform officia...

12:47

Tutorials · 2 months ago

Run Qwen3.6-27B Locally – Prioritizes Stability and Real-World Utility

Fahd Mirza walks through a complete local deployment of Qwen 3.6 27B, Alibaba's latest dense language model, on an Ubuntu server equi...