Description:
In this AI Engineer conference talk, Isaac Robinson, research lead at Roboflow, presents a technical history of how Vision Transformers (ViTs) came to dominate computer vision, tracing a winding architectural evolution that ultimately validates a familiar lesson: simple architectures with massive pre-training beat clever inductive biases at scale. The talk begins with convolutional neural networks and their well-motivated inductive biases (locality, translation equivariance) before introducing the Vision Transformer and its apparent paradox: no hard-coded inductive biases, attention cost that scales as O(n⁴) in image side length (attention is quadratic in token count, and token count itself grows quadratically with resolution), yet empirically superior results.
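As a rough illustration of that scaling claim (a back-of-the-envelope sketch, not material from the talk), the snippet below counts tokens and attention pairs for a ViT; the 16-pixel patch size and the resolutions are illustrative assumptions.

```python
# Back-of-the-envelope sketch of ViT attention scaling, assuming 16x16 patches.
# Tokens grow as (side / patch)^2, and self-attention compares every token with
# every other token, so pairwise work grows as tokens^2, i.e. O(side^4).
def vit_attention_pairs(side_px: int, patch: int = 16) -> tuple[int, int]:
    tokens = (side_px // patch) ** 2   # sequence length seen by the transformer
    pairs = tokens ** 2                # entries in the attention score matrix
    return tokens, pairs

for side in (224, 448, 896):
    tokens, pairs = vit_attention_pairs(side)
    print(f"{side:>3}px -> {tokens:5d} tokens, {pairs:,} attention pairs")
# Doubling the side length quadruples the token count and grows attention ~16x.
```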
Robinson walks through the field’s successive responses: Swin Transformers, which reintroduced locality via shifted attention windows and brought complexity back to O(n²); ConvNeXt, which applied transformer design principles directly to convolutional networks and showed competitive results; and Hiera from Meta, which systematically stripped inductive biases from a strong baseline one at a time, replaced them with pre-training, and explicitly benchmarked the accuracy-versus-speed tradeoff against standard ViTs.
The talk then explains why plain ViTs ultimately prevailed: self-supervised pre-training techniques such as Masked Autoencoders (MAE), DINOv2, and DINOv3 teach models the inductive biases that were previously hard-coded into architecture, while infrastructure advances developed for large language models, particularly FlashAttention, dramatically reduced the practical cost of attention computation for vision as a side effect.
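The Swin point, that fixed local windows cap the quadratic term at the window size, can be shown with a similar counting sketch. The 7×7 window (Swin's published default) and the comparison against global attention below are illustrative assumptions, not figures from the talk.

```python
# Illustrative comparison of global vs. windowed attention cost, assuming
# 16px patches and 7x7 attention windows (Swin's default M=7). Global pairs
# grow as tokens^2; windowed pairs grow linearly in token count because each
# token only attends within its fixed-size window.
def attention_pairs(side_px: int, patch: int = 16, window: int = 7) -> tuple[int, int]:
    tokens_per_side = side_px // patch
    tokens = tokens_per_side ** 2
    global_pairs = tokens ** 2
    windows = (tokens_per_side // window) ** 2
    windowed_pairs = windows * (window ** 2) ** 2   # (window^2)^2 pairs per window
    return global_pairs, windowed_pairs

for side in (224, 448, 896):
    g, w = attention_pairs(side)
    print(f"{side:>3}px  global {g:>12,}  windowed {w:>10,}  ratio {g / w:,.0f}x")
```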
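On the pre-training side, here is a minimal sketch of MAE-style random patch masking, assuming the 75% mask ratio used in the MAE paper; the tensor shapes and the gather-based token selection are illustrative, not the reference implementation.

```python
import torch

# Minimal sketch of MAE-style input preparation (assumed 75% mask ratio and
# ViT-B-ish shapes). The encoder only sees the small visible subset; a light
# decoder is later asked to reconstruct the masked patches, which is what pushes
# the backbone to learn spatial structure without it being hard-coded.
def random_masking(patch_tokens: torch.Tensor, mask_ratio: float = 0.75):
    batch, num_tokens, dim = patch_tokens.shape
    num_keep = int(num_tokens * (1.0 - mask_ratio))

    noise = torch.rand(batch, num_tokens)              # one random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]      # lowest-noise patches survive
    visible = torch.gather(
        patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim)
    )

    mask = torch.ones(batch, num_tokens)               # 1 = masked, 0 = visible
    mask.scatter_(1, keep_idx, 0.0)
    return visible, mask, keep_idx

tokens = torch.randn(8, 196, 768)                      # assumed (batch, patches, dim)
visible, mask, keep_idx = random_masking(tokens)
print(visible.shape, int(mask.sum(dim=1)[0]))          # torch.Size([8, 49, 768]) 147
```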
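And on the infrastructure point, a minimal sketch of how a vision model can pick up fused attention kernels today: PyTorch's scaled_dot_product_attention (available since PyTorch 2.0) can dispatch to a FlashAttention-style backend on supported GPUs. The ViT-Base-like shapes below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Illustrative only: a ViT-Base-like attention call. On a supported GPU with
# half-precision inputs, PyTorch may dispatch this to a fused FlashAttention-style
# kernel; on CPU it falls back to the standard math implementation.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

batch, heads, tokens, head_dim = 8, 12, 196, 64        # assumed ViT-B-ish shapes
q = torch.randn(batch, heads, tokens, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v)          # fused attention when available
print(out.shape)                                       # torch.Size([8, 12, 196, 64])
```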
The presentation is particularly valuable for practitioners moving from language models into computer vision, or for anyone reasoning about when architectural priors are worth the engineering complexity versus simply investing in pre-training scale.
📺 Source: AI Engineer · Published May 08, 2026
🏷️ Format: Deep Dive
