Description:
Pratyush Maini, a CMU PhD candidate and founding team member at Datology, joins the Latent Space Lightning Pod to discuss original research into how frontier language models—including the GPT-4.1 and GPT-5 series—appear to be trained, using behavioral analysis to reverse-engineer aspects of OpenAI’s training data pipeline without direct access to it.
A central finding involves benchmark contamination detection: by prompting models with only the first few words of exam questions (including problems from JEE, India’s national engineering entrance exam), Maini’s team found that multiple frontier models would complete the full question text along with the answer options, evidence of near-perfect memorization acquired in late-stage training. The pattern holds across multiple model families and suggests widespread inclusion of benchmark and exam data in final training phases. Maini also uses OLMoTrace, an open-source tool from the OLMo project that maps model outputs back to pre-training data, to investigate whether reasoning-style thinking tokens appear in the mid-training data of non-reasoning models, identifying an apparent inflection point around the GPT-4.1 generation where self-reflective behavior emerges.
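To make the probe concrete, here is a minimal sketch of the prefix-completion test described above. It assumes an OpenAI-compatible chat completions endpoint; the model name, prefix length, sample question, and 0.80 flagging threshold are illustrative placeholders, not the researchers’ exact protocol.

```python
# Minimal sketch of a prefix-completion contamination probe.
# ASSUMPTIONS: model name, prefix length, and the 0.80 threshold are
# illustrative, not the exact protocol discussed in the episode.
from difflib import SequenceMatcher

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def completion_overlap(question: str, prefix_words: int = 8,
                       model: str = "gpt-4.1") -> float:
    """Give the model only the first few words of an exam question and
    measure how closely its continuation matches the real question text."""
    words = question.split()
    prefix = " ".join(words[:prefix_words])
    truth = " ".join(words[prefix_words:])

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prefix}],
        max_tokens=256,
        temperature=0.0,  # greedy-ish decoding to surface the memorized path
    )
    continuation = response.choices[0].message.content or ""

    # A ratio near 1.0 means the model reproduced the question verbatim,
    # which points to memorization rather than generalization.
    return SequenceMatcher(None, truth, continuation[: len(truth)]).ratio()


# Stand-in question text, not an actual JEE item.
question = ("A particle of mass m is projected with velocity v at an angle "
            "of 45 degrees with the horizontal. Find the maximum height.")
score = completion_overlap(question)
print(f"overlap = {score:.2f}  (>= 0.80 would flag likely contamination)")
```

OLMoTrace itself works at web scale over the OLMo training corpus with purpose-built indexes; the toy function below is only a stand-in for the underlying idea, flagging long word spans that a model output shares verbatim with a reference corpus. The span length and example corpus here are assumptions for illustration.

```python
# Toy stand-in for the span-tracing idea behind OLMoTrace: flag long
# word n-grams in a model's output that also occur verbatim in a corpus.
# This naive version only makes the concept concrete; it is not how the
# real tool is implemented.
def trace_spans(output: str, corpus_docs: list[str], n: int = 8) -> list[str]:
    """Return each n-word span of `output` found verbatim in any corpus doc."""
    def ngrams(text: str) -> set[str]:
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    corpus_ngrams: set[str] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc)
    return sorted(ngrams(output) & corpus_ngrams)


# Hypothetical usage: a long verbatim match would suggest the thinking-token
# phrasing was present in training data rather than invented at test time.
docs = ["Let me think step by step about this problem before answering it."]
print(trace_spans("Let me think step by step about this problem before "
                  "answering it carefully.", docs))
```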
The broader context is Maini’s PhD thesis on responsible and efficient use of web-scale pre-training data, covering data attribution, copyright and credit for data contributors, and training efficiency. Datology, the company he co-founded, applies this research to help organizations curate and manage training datasets. The episode is essential listening for ML researchers and practitioners interested in data provenance, training transparency, and the largely opaque mechanics of how frontier models are built.
📺 Source: Latent Space · Published February 10, 2026
🏷️ Format: Deep Dive
