Penguin-VL in 2B and 8B: Worst Vision AI Model Ever: Full Local Testing


Description:

Fahd Mirza puts Tencent's newly released Penguin-VL vision-language models — available in 2B and 8B parameter sizes — through a series of real-world vision tests on an Nvidia RTX 6000 (48GB VRAM), and his verdict is blunt: both models perform poorly across multiple visual understanding tasks.

Architecturally, Penguin-VL is notable for replacing traditional contrastive vision encoders with an LLM-based vision encoder, a design choice inspired by Qwen's approach that enables tighter vision-language alignment. The 2B model consumes under 10GB of VRAM. Tencent's benchmarks claim strong performance on DocVQA, ChartQA, and InfoVQA — topping other 2B models. Mirza tests against those stated strengths: a chart comprehension task where the model refuses to answer, claiming it cannot perform real-time measurements; a traffic scene interpretation that returns an incorrect lane identification; and a simple object-in-hand recognition task where the model hallucinates an apple. The 8B model fares no better on the same prompts.

Throughout, Mirza draws explicit comparisons to Qwen 3.5, noting that even Qwen’s 8B vision model significantly outperforms Penguin-VL on equivalent tasks. The video serves as a practical warning for developers considering small open-weight VLMs: published benchmark scores on curated datasets do not reliably predict real-world vision reasoning capability, and Penguin-VL’s gap between claimed and observed performance is unusually wide.


📺 Source: Fahd Mirza · Published March 14, 2026
🏷️ Format: Benchmark Test
