Taalas Is Running AI at 17,000 Tokens Per Second — What’s the Catch?

Description:

Fahd Mirza examines Taalas, a chip startup claiming to run AI inference at approximately 17,000 tokens per second — roughly ten times faster than current GPU clusters, at one-twentieth the cost and one-tenth the power draw. The core technology etches a model's weights directly onto custom silicon rather than loading them into memory at runtime, collapsing the memory-compute bottleneck that dominates latency in conventional GPU-based deployments. Mirza demonstrates the system live, recording approximately 15,780 tokens per second on Llama 3.1 8B.
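
A back-of-the-envelope roofline estimate shows why conventional GPU decoding is memory-bound and why hardwiring weights next to compute changes the picture. The sketch below is illustrative only: the 8B parameter count comes from the video, but the 16-bit weight format and the ~3.35 TB/s bandwidth figure (roughly an H100's HBM3) are assumptions, not numbers Mirza cites.

```python
# Rough roofline estimate: to decode one token, every weight must be
# streamed from memory through the compute units, so per-stream
# throughput is capped at memory_bandwidth / model_size_in_bytes.

PARAMS = 8e9              # Llama 3.1 8B parameter count (from the video)
BYTES_PER_WEIGHT = 2      # fp16/bf16 weights (assumed)
HBM_BANDWIDTH = 3.35e12   # ~3.35 TB/s, roughly an H100's HBM3 (assumed)

model_bytes = PARAMS * BYTES_PER_WEIGHT            # ~16 GB of weights
max_tokens_per_sec = HBM_BANDWIDTH / model_bytes
print(f"Bandwidth-bound ceiling: ~{max_tokens_per_sec:,.0f} tokens/s per stream")
# -> ~209 tokens/s: each decoded token re-reads all 16 GB of weights.
# Etching weights onto the die removes that round trip, which is the
# bottleneck Taalas claims to collapse.
```

Under these assumptions a single GPU decode stream tops out around a couple hundred tokens per second, which is why a claimed 17,000 tokens per second requires eliminating the weight-streaming step rather than merely adding compute.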

Rather than treating the demo as a verdict, the video works through four specific concerns. First, Taalas is currently running Llama 3.1 8B — a small, older open-source model — not any frontier system, so the speed numbers do not represent state-of-the-art intelligence running fast. Second, the model uses aggressive custom 3-bit and 6-bit quantization, which Taalas itself acknowledges introduces quality degradation relative to standard GPU benchmarks (the sketch after this paragraph illustrates why low bit widths cost accuracy). Third, the company's claim of a two-month turnaround for new models is a forward-looking target, not a proven track record — its first product took two and a half years to build. Fourth, the AI model landscape moves fast enough that today's capable model can be irrelevant within weeks, raising genuine questions about whether hardwired silicon can keep pace.
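
For intuition on the quantization concern, the sketch below applies plain symmetric uniform quantization to a toy weight tensor and measures reconstruction error at different bit widths. This is a generic illustration, not Taalas's actual custom scheme, which the video does not detail.

```python
import numpy as np

def quantize_uniform(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization onto 2**bits - 1 integer levels."""
    half = 2 ** (bits - 1) - 1              # e.g. 3 bits -> levels -3..3
    scale = np.abs(w).max() / half          # step size from the weight range
    q = np.clip(np.round(w / scale), -half, half)
    return q * scale                        # dequantize back to floats

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=100_000)     # toy Gaussian "weight" tensor

for bits in (3, 6, 8):
    mse = np.mean((w - quantize_uniform(w, bits)) ** 2)
    print(f"{bits}-bit uniform quantization MSE: {mse:.2e}")
# The 3-bit grid has only 7 representable values, so its rounding error is
# orders of magnitude larger than at 6 or 8 bits. That gap is the quality
# trade-off the video flags against standard GPU benchmarks.
```

Real deployments use more sophisticated per-channel or learned schemes that narrow this gap, but the basic trade-off the example shows, fewer bits means coarser grids and more error, is exactly why Mirza treats the benchmark comparisons with caution.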

Mirza concludes that the underlying architectural insight — unified memory and compute on a single chip — is technically sound and potentially significant, but the current implementation is a first-generation proof of concept whose real-world viability at the frontier remains undemonstrated.


📺 Source: Fahd Mirza · Published February 23, 2026
🏷️ Format: Review
