Road to 5 Million Tokens: Breaking Barriers in Long Context Training — Max Ryabinin, Together AI

Description:

Max Ryabinin, VP of Research and Development at Together AI, presents the company’s research project on extending transformer training to 5 million token sequence lengths at the AI Engineer conference. Using a Llama 3B model on a single 8×H100 GPU node as the reference configuration, Ryabinin walks through a layered stack of memory optimization techniques and quantifies the memory reduction each one provides.

The optimization sequence begins with fully sharded data parallelism (FSDP), which shards model parameters, gradients, and optimizer state across GPUs. It then applies DeepSpeed Ulysses context parallelism, originally introduced by Microsoft, which splits multi-head attention computation across GPUs by head rather than by sequence position and so enables the use of Flash Attention throughout. Activation checkpointing reduces activation memory by approximately 8× by recomputing tensors during the backward pass rather than storing them. CPU offloading of transformer block inputs (a technique Ryabinin credits to Unsloth) provides further relief. Finally, operations that act on each token independently, such as loss computation and MLP layers, are tiled into chunks along the sequence dimension to avoid allocating enormous intermediate buffers.
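To make the chunked loss computation described above concrete, here is a minimal pure-Python sketch. It is not the implementation from the talk. It assumes a vocabulary of 128,256 entries (the Llama 3 tokenizer size) and fp32 logits, and the names `logits_bytes`, `project`, `full_loss`, and `chunked_loss` are hypothetical helpers written for this illustration. Under those assumptions, a full logits buffer for 5 million tokens would occupy about 2.6 TB, while an 8,192-token chunk needs about 4.2 GB.

```python
import math
import random

VOCAB_SIZE = 128_256  # assumption: Llama 3 tokenizer vocabulary size
BYTES_FP32 = 4        # assumption: logits held in fp32


def logits_bytes(num_tokens, vocab_size=VOCAB_SIZE, bytes_per_value=BYTES_FP32):
    """Bytes needed for a [num_tokens, vocab_size] logits buffer."""
    return num_tokens * vocab_size * bytes_per_value


def project(hidden_row, weight):
    """Map one hidden vector to vocabulary logits (weight is [vocab][hidden])."""
    return [sum(h * w for h, w in zip(hidden_row, w_row)) for w_row in weight]


def token_nll(logits_row, target):
    """Negative log-likelihood of one token using a stable log-sum-exp."""
    m = max(logits_row)
    lse = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return lse - logits_row[target]


def full_loss(hidden, weight, targets):
    """Mean cross-entropy with logits for ALL tokens held at once."""
    logits = [project(h, weight) for h in hidden]
    return sum(token_nll(r, t) for r, t in zip(logits, targets)) / len(targets)


def chunked_loss(hidden, weight, targets, chunk_size):
    """Same loss, but only chunk_size rows of logits exist at any time."""
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        stop = start + chunk_size
        chunk_logits = [project(h, weight) for h in hidden[start:stop]]
        total += sum(
            token_nll(r, t) for r, t in zip(chunk_logits, targets[start:stop])
        )
        # chunk_logits goes out of scope here, so its memory can be reused
    return total / len(targets)


# Toy problem: 37 tokens, hidden size 8, vocabulary of 11 entries.
rng = random.Random(0)
hidden = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(37)]
weight = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(11)]
targets = [rng.randrange(11) for _ in range(37)]

full = full_loss(hidden, weight, targets)
chunked = chunked_loss(hidden, weight, targets, chunk_size=5)

print("full buffer bytes :", logits_bytes(5_000_000))
print("chunk buffer bytes:", logits_bytes(8_192))
print("losses match      :", math.isclose(full, chunked, rel_tol=1e-9))
```

The sketch covers only the forward pass. In real training the saving holds only if the backward pass is also processed per chunk, for example by recomputing each chunk's logits, so that the full buffer is never stored for gradients.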

Together AI’s primary contribution — referred to as Arctic sequence length training — pushes the achievable sequence length from 3 million to 5 million tokens on the same hardware budget. Ryabinin frames the work as relevant even for teams not targeting million-token contexts: understanding where GPU memory goes during training can unlock throughput improvements across standard fine-tuning workloads.


📺 Source: AI Engineer · Published June 08, 2026
🏷️ Format: Deep Dive
