Road to 5 Million Tokens: Breaking Barriers in Long Context Training — Max Ryabinin, Together AI


Description:

In this AI Engineer conference talk, Max Ryabinin, VP of Research and Development at Together AI, presents the company’s research project on extending transformer training to sequences of 5 million tokens. Using a Llama 3B model on a single 8×H100 GPU node as the reference configuration, Ryabinin walks through a layered stack of memory optimization techniques and quantifies the memory reduction each one provides.

The optimization sequence begins with fully sharded data parallelism (FSDP), which shards model parameters, gradients, and optimizer state across GPUs. It then applies DeepSpeed Ulysses context parallelism, originally introduced by Microsoft, which splits multi-head attention computation across GPUs by head rather than by sequence position. Because each GPU computes its heads over the full sequence, Flash Attention can be used throughout. Activation checkpointing reduces activation memory by approximately 8× by recomputing tensors during the backward pass rather than storing them. CPU offloading of transformer block inputs (a technique Ryabinin credits to Unsloth) provides further relief. The last step is chunked tiling of operations that act on each token independently, such as loss computation and MLP layers: processing the sequence one chunk at a time avoids allocating enormous intermediate buffers along the sequence dimension.
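The chunked tiling idea above can be illustrated with a toy loss computation. The sketch below is not code from the talk: it is a minimal pure-Python illustration, and the helper names (`project`, `full_loss`, `tiled_loss`) and the tiny dimensions are assumptions made for the example. It shows that computing the cross-entropy loss chunk by chunk along the sequence gives the same value as computing it in one pass, while only `chunk_size × vocab` logits exist at any moment instead of `seq_len × vocab`.

```python
import math

def token_nll(logits_row, target):
    """Negative log-likelihood of `target` under softmax(logits_row)."""
    m = max(logits_row)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return lse - logits_row[target]

def project(hidden_row, weight):
    """Toy LM head (hypothetical): one logit per vocabulary row of `weight`."""
    return [sum(h * w for h, w in zip(hidden_row, w_row)) for w_row in weight]

def full_loss(hidden, weight, targets):
    # Materializes logits for every position at once: seq_len x vocab values.
    logits = [project(h, weight) for h in hidden]
    return sum(token_nll(l, t) for l, t in zip(logits, targets)) / len(targets)

def tiled_loss(hidden, weight, targets, chunk_size):
    # Only chunk_size x vocab logits are alive at any time; each chunk's
    # logits are discarded before the next chunk is projected.
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        chunk_logits = [project(h, weight) for h in hidden[start:start + chunk_size]]
        chunk_targets = targets[start:start + chunk_size]
        total += sum(token_nll(l, t) for l, t in zip(chunk_logits, chunk_targets))
    return total / len(targets)

# Toy data: 7 positions, hidden size 3, vocabulary of 5 tokens.
hidden = [[0.1 * i, -0.2 * i, 0.05 * (i + 1)] for i in range(7)]
weight = [[0.3 * v, 0.1 - 0.2 * v, 0.5] for v in range(5)]
targets = [i % 5 for i in range(7)]

print(full_loss(hidden, weight, targets))
print(tiled_loss(hidden, weight, targets, chunk_size=3))
```

In a real training framework the backward pass for each chunk would also have to run, or be recomputed, chunk by chunk. Otherwise the autograd graph would keep every chunk's logits alive and the saving would be lost.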

Together AI’s primary contribution — referred to as Arctic sequence length training — pushes the achievable sequence length from 3 million to 5 million tokens on the same hardware budget. Ryabinin frames the work as relevant even for teams not targeting million-token contexts: understanding where GPU memory goes during training can unlock throughput improvements across standard fine-tuning workloads.

📺 Source: AI Engineer · Published June 08, 2026
🏷️ Format: Deep Dive

Tags

H100, Llama, Microsoft, Together AI, Unsloth
