Marco-Nano & Marco-Mini: Alibaba’s Insane Sparse MoE Models: Run Locally

Description:

Fahd Mirza installs and tests two new sparse mixture-of-experts models from Alibaba’s AIDC AI division — Marco-Nano Instruct and Marco-Mini Instruct — running them locally on an NVIDIA RTX 6000 with 48GB VRAM. Both models are built on the same decoder-only transformer architecture with MoE layers upcycled from a Qwen 3.6B base, but their efficiency profiles are strikingly different: Nano has 8 billion total parameters but activates only 0.6 billion per token (a 7.5% activation ratio), while Mini uses 17.3B total parameters with 0.86B activated per token.
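
For reference, the quoted activation ratios can be recomputed directly from the parameter counts above. The short Python sketch below does just that; the figures are as reported in the video, not taken from official model cards.

```python
# Recompute the sparse-MoE activation ratios quoted in the video.
# Parameter counts are as reported there, not from official model cards.
models = {
    "Marco-Nano Instruct": {"total_b": 8.0, "active_b": 0.60},
    "Marco-Mini Instruct": {"total_b": 17.3, "active_b": 0.86},
}

for name, params in models.items():
    ratio = params["active_b"] / params["total_b"]
    print(f"{name}: {params['total_b']}B total, {params['active_b']}B active per token "
          f"-> {ratio:.1%} of parameters touched per token")
```

By this measure Mini is actually the sparser of the two (roughly 5% activated versus Nano's 7.5%), even though it activates more parameters in absolute terms.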

The most interesting finding comes from direct side-by-side comparison. On a structured JSON output task (listing five planets with distances), Nano responded almost instantly with clean, valid output, while Mini took significantly longer, briefly misread the prompt as a single-planet request, self-corrected, and then returned six planets anyway. On multilingual translation across 25 languages, the roles partially reversed: Mini produced richer, more culturally adapted phrasing, while Nano was faster and more consistent but occasionally more literal. A SQL bug-finding task rounds out the evaluation.
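
To try the same structured-output test locally, a minimal Hugging Face transformers sketch along these lines should work. The model id below is an assumption (check the AIDC-AI organization on Hugging Face for the exact repository name), and bfloat16 is used simply because either model fits comfortably in 48 GB of VRAM.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Hypothetical model id -- verify the exact name on the AIDC-AI Hugging Face org.
model_id = "AIDC-AI/Marco-Nano-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # fits easily in 48 GB VRAM at this model size
    device_map="auto",
)

# The structured-output task from the video: five planets with distances, as JSON.
messages = [
    {"role": "user",
     "content": "List five planets and their average distance from the Sun in AU. "
                "Respond with a JSON array of objects with keys 'planet' and 'distance_au'."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Greedy decoding (do_sample=False) keeps the comparison deterministic, which makes it easier to judge whether the JSON comes back valid on repeated runs.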

The video makes a useful practical point: in sparse MoE architectures, a larger total parameter count, or even more activated parameters per token, does not automatically translate into better instruction-following or faster inference. For developers evaluating efficient open-weight models for multilingual or structured-output workloads, Marco-Nano's extreme activation sparsity offers a compelling tradeoff worth testing.


📺 Source: Fahd Mirza · Published April 09, 2026
🏷️ Format: Hands On Build
