The First Real LLM Breakthrough Is Here… SubQ (1000x Less Compute)

Descriptions:

TheAIGRID covers the release of SubQ 1.1 Small, which its developers claim is the first large language model built on a fully sub-quadratic sparse attention architecture. If the claim holds, it would sidestep the quadratic growth of attention cost with sequence length, which has constrained context window size and cost since the original Transformer paper in 2017. The company’s core claim is that standard dense attention wastes compute by scoring every token-to-token relationship in a sequence, while SubQ’s Sparse Selective Attention (SSA) learns, per token, which small subset of relationships actually matters and computes full attention only on those.
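The contrast between dense and selective attention can be sketched in a few lines of plain Python. This is only an illustration of the general idea, not SubQ’s SSA, whose internals the summary does not describe. The function names and the dot-product selector are assumptions made for the example. Note that this toy selector still scans every key for each query, so it is itself quadratic; a genuinely sub-quadratic method would need a cheaper way to pick the subset.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values, indices):
    """Softmax attention for one query, restricted to the given key indices."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, keys[j]) / scale for j in indices])
    dim = len(values[0])
    return [sum(w * values[j][c] for w, j in zip(weights, indices))
            for c in range(dim)]


def dense_attention(queries, keys, values):
    """Every query attends to every key: n * n scored pairs (non-causal)."""
    all_indices = list(range(len(keys)))
    return [attend(q, keys, values, all_indices) for q in queries]


def topk_sparse_attention(queries, keys, values, top_k):
    """Each query attends only to its top_k highest-scoring keys.

    Hypothetical selector: rank keys by raw dot product. This ranking step
    touches all keys, so the sketch shows the selection idea only and is
    NOT sub-quadratic as written.
    """
    outputs = []
    for q in queries:
        ranked = sorted(range(len(keys)),
                        key=lambda j: dot(q, keys[j]), reverse=True)
        outputs.append(attend(q, keys, values, ranked[:top_k]))
    return outputs
```

With `top_k` equal to the sequence length the sparse version reproduces dense attention; with a small `top_k` each output is a weighted mix of only the selected values.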

The numbers are specific: at 1 million tokens, SubQ reports 64.5 times less compute than dense attention and 56 times higher throughput than FlashAttention-2. The model ships with a 12 million token context window, despite being trained primarily at 1 million tokens, and scores 100% on needle-in-a-haystack retrieval at 1M and 2M tokens, dropping to 98% at 6M and 12M. On RULER, a long-context benchmark that combines retrieval with multi-step reasoning, it scores 99.12% at 128K tokens. On GPQA Diamond (graduate-level science questions), it scores 85.4, below GPT-4.5 at 93.2 and Opus 4.8 at 92, but above Haiku 4.5 at 67.2.
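To see why a sparse scheme’s advantage should grow with context length, a toy cost model helps. The sketch below counts scored query-key pairs only. The fixed per-token `budget` is a hypothetical parameter: the summary does not say how many keys SSA selects per token, how selection overhead is accounted for, or how the reported 64.5x figure was measured, so this model is not meant to reproduce that number.

```python
def dense_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by dense attention: every token against every token."""
    return n_tokens * n_tokens


def sparse_pairs(n_tokens: int, budget: int) -> int:
    """Pairs scored when each token attends to at most `budget` keys.

    `budget` is a hypothetical per-token limit, not a published SubQ value.
    """
    return n_tokens * min(budget, n_tokens)


def reduction_factor(n_tokens: int, budget: int) -> float:
    """How many times fewer pairs the sparse scheme scores than the dense one."""
    return dense_pairs(n_tokens) / sparse_pairs(n_tokens, budget)


if __name__ == "__main__":
    BUDGET = 4096  # assumed value, chosen only for illustration
    for n in (128_000, 1_000_000, 12_000_000):
        print(f"{n:>12,} tokens: {reduction_factor(n, BUDGET):.1f}x fewer pairs")
```

Under this model, doubling the sequence length quadruples the dense cost but only doubles the sparse cost, so the reduction factor grows linearly with context length.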

The video clearly distinguishes SubQ’s content-aware token selection from earlier positional shortcuts like Longformer and BigBird, which fix the attention pattern in advance regardless of what the tokens contain, and from fixed-memory compression approaches like Mamba. That makes it a useful explainer for developers evaluating whether this architecture represents a genuine scaling inflection point or an incremental improvement.
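The difference between positional and content-aware selection can be made concrete with a small sketch. The first function models only the sliding-window component of patterns like Longformer’s (the real models also add global tokens, and BigBird adds random ones). The second is a hypothetical dot-product selector standing in for content-aware selection, not SubQ’s actual mechanism.

```python
def sliding_window_indices(i, n_tokens, radius):
    """Positional pattern: token i sees its neighbours within `radius`.

    The result depends only on position, never on token content.
    """
    lo = max(0, i - radius)
    hi = min(n_tokens, i + radius + 1)
    return list(range(lo, hi))


def content_topk_indices(query, keys, top_k):
    """Content-aware pattern (hypothetical selector): the top_k keys whose
    dot product with the query is highest, wherever they sit in the sequence.
    """
    scores = [sum(a * b for a, b in zip(query, key)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:top_k])
```

If the only relevant key sits at position 0 and the query is at position 9, a window of radius 2 never reaches it, while the content-based selector picks it directly.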


📺 Source: TheAIGRID · Published June 18, 2026
🏷️ Format: News Analysis
