ZAYA1-VL-8B: Efficient Open Visual Intelligence – Run Locally

Description:

Fahd Mirza puts ZAYA1-VL-8B, the new vision-language model from Zyphra, through its paces on an NVIDIA RTX 6000 with 48GB of VRAM, showing installation, VRAM consumption (just over 26GB), and a series of progressively harder tests. The model uses a Mixture-of-Experts architecture with 8 billion total parameters but only about 700 million active during inference, and it was trained on roughly 140 billion vision-language tokens, a far leaner corpus than the trillions used by competing models.
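For readers who want to reproduce the local setup, here is a minimal loading sketch. The Hugging Face repo id, the AutoModelForVision2Seq class, and the prompt format are assumptions, not confirmed by the video; check the model card for the actual loading code.

```python
# Minimal local-inference sketch. The repo id "Zyphra/ZAYA1-VL-8B" and the
# AutoModelForVision2Seq class are assumptions; consult the model card for
# the real identifiers.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_id = "Zyphra/ZAYA1-VL-8B"  # hypothetical repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 8B params in bf16 is ~16 GB of weights;
    device_map="auto",           # activations and cache plausibly account for
    trust_remote_code=True,      # the ~26 GB of VRAM observed in the video
)

image = Image.open("newspaper.jpg")
prompt = "Transcribe all headlines in this newspaper page."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```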

The benchmark results are notable: Zyphra claims ZAYA1-VL beats Molmo, DeepSeek-VL2, and Qwen3-VL at equivalent active parameter counts. Mirza's live tests cover dense OCR on a vintage newspaper (strong performance across five headlines), handwritten letter extraction (an initial failure on prompt following, then a clean success once the prompt was clarified), and multilingual text recognition on an AI-generated airport sign in English, Japanese, Korean, and Russian. A broader multilingual test with more obscure languages, including several Southeast Asian scripts, reveals clear gaps, suggesting multilinguality is not the model's strongest suit.
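Each of these tests reduces to one image plus a natural-language instruction. A hypothetical helper, assuming the model and processor from the sketch above; the prompts are illustrative paraphrases, not quotes from the video:

```python
# Illustrative OCR-style prompts mirroring the video's tests. The ocr() helper
# is hypothetical and reuses the model/processor loaded in the sketch above.
from PIL import Image

def ocr(model, processor, image_path: str, instruction: str) -> str:
    """Run a single image + instruction through the model and decode the reply."""
    image = Image.open(image_path)
    inputs = processor(images=image, text=instruction, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    return processor.batch_decode(out, skip_special_tokens=True)[0]

# Dense OCR on a newspaper page:
# ocr(model, processor, "newspaper.jpg", "List every headline on this page, verbatim.")
# Handwriting, with the explicit instruction that fixed the initial failure:
# ocr(model, processor, "letter.jpg", "Transcribe the handwritten text only; do not summarize.")
# Multilingual sign reading:
# ocr(model, processor, "airport_sign.jpg", "Read each line of this sign and name its language.")
```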

The video also walks through the architecture's two key design ideas: treating image tokens differently from text tokens during causal processing, and an efficiency-first training philosophy. Released under the Apache 2.0 license, ZAYA1-VL-8B is fully usable for commercial purposes and runnable locally, making it a practical option for developers who need vision capabilities without cloud API costs.
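The video describes the image/text distinction only at a high level. One common realization in open VLMs is to let image tokens attend bidirectionally among themselves while text tokens remain strictly causal; whether ZAYA1-VL uses exactly this scheme is an assumption. A minimal sketch of such a mask:

```python
# Sketch of a mixed attention mask: image tokens attend bidirectionally among
# themselves, text tokens stay causal. This is one common way VLMs treat image
# tokens differently during causal processing; it is an assumption here, not
# a confirmed detail of ZAYA1-VL.
import torch

def build_mixed_mask(is_image: torch.Tensor) -> torch.Tensor:
    """is_image: bool tensor of shape (seq_len,), True where the token is an
    image patch. Returns a (seq_len, seq_len) bool mask, True = may attend."""
    n = is_image.shape[0]
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))      # default: causal
    image_block = is_image.unsqueeze(0) & is_image.unsqueeze(1)  # image<->image pairs
    return causal | image_block  # image tokens also see later image tokens

# Example: 4 image patches followed by 3 text tokens.
mask = build_mixed_mask(torch.tensor([True, True, True, True, False, False, False]))
print(mask.int())
```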


📺 Source: Fahd Mirza · Published May 09, 2026
🏷️ Format: Benchmark Test
