Description:
Pratyush Maini, a CMU PhD candidate and founding team member at Datology, joins the Latent Space Lightning Pod to discuss original research into how frontier language models—including the GPT-4.1 and GPT-5 series—appear to be trained, using behavioral analysis to reverse-engineer aspects of OpenAI’s training data pipeline without direct access to it.
A central finding involves benchmark contamination detection: by prompting models with only the first few words of exam questions (including problems from JEE, India’s national engineering entrance exam), Maini’s team found that multiple frontier models would complete the full question text along with the answer options, evidence of near-perfect memorization acquired in late-stage training. The pattern holds across multiple model families and suggests widespread inclusion of benchmark and exam data in final training phases. Maini also uses OLMoTrace, an open-source tool from the OLMo project that maps model outputs back to pre-training data, to investigate whether reasoning-style thinking tokens appear in the mid-training data of non-reasoning models, identifying an apparent inflection point around the GPT-4.1 generation where self-reflective behavior emerges.
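To make the probe concrete, here is a minimal sketch of the prefix-completion test described above. It assumes an OpenAI-compatible chat completions endpoint; the model name, prefix length, sample question, and 0.80 flagging threshold are illustrative placeholders, not the researchers’ exact protocol.

```python
# Minimal sketch of a prefix-completion contamination probe.
# ASSUMPTIONS: model name, prefix length, and the 0.80 threshold are
# illustrative, not the exact protocol discussed in the episode.
from difflib import SequenceMatcher

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def completion_overlap(question: str, prefix_words: int = 8,
                       model: str = "gpt-4.1") -> float:
    """Give the model only the first few words of an exam question and
    measure how closely its continuation matches the real question text."""
    words = question.split()
    prefix = " ".join(words[:prefix_words])
    truth = " ".join(words[prefix_words:])

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prefix}],
        max_tokens=256,
        temperature=0.0,  # greedy-ish decoding to surface the memorized path
    )
    continuation = response.choices[0].message.content or ""

    # A ratio near 1.0 means the model reproduced the question verbatim,
    # which points to memorization rather than generalization.
    return SequenceMatcher(None, truth, continuation[: len(truth)]).ratio()


# Stand-in question text, not an actual JEE item.
question = ("A particle of mass m is projected with velocity v at an angle "
            "of 45 degrees with the horizontal. Find the maximum height.")
score = completion_overlap(question)
print(f"overlap = {score:.2f}  (>= 0.80 would flag likely contamination)")
```

OLMoTrace itself works at web scale over the OLMo training corpus with purpose-built indexes; the toy function below is only a stand-in for the underlying idea, flagging long word spans that a model output shares verbatim with a reference corpus. The span length and example corpus here are assumptions for illustration.

```python
# Toy stand-in for the span-tracing idea behind OLMoTrace: flag long
# word n-grams in a model's output that also occur verbatim in a corpus.
# This naive version only makes the concept concrete; it is not how the
# real tool is implemented.
def trace_spans(output: str, corpus_docs: list[str], n: int = 8) -> list[str]:
    """Return each n-word span of `output` found verbatim in any corpus doc."""
    def ngrams(text: str) -> set[str]:
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    corpus_ngrams: set[str] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc)
    return sorted(ngrams(output) & corpus_ngrams)


# Hypothetical usage: a long verbatim match would suggest the thinking-token
# phrasing was present in training data rather than invented at test time.
docs = ["Let me think step by step about this problem before answering it."]
print(trace_spans("Let me think step by step about this problem before "
                  "answering it carefully.", docs))
```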
The broader context is Maini’s PhD thesis on responsible and efficient use of web-scale pre-training data, covering data attribution, copyright and credit for data contributors, and training efficiency. Datology, the company he co-founded, applies this research to help organizations curate and manage training datasets. The episode is essential listening for ML researchers and practitioners interested in data provenance, training transparency, and the largely opaque mechanics of how frontier models are built.
📺 Source: Latent Space · Published February 10, 2026
🏷️ Format: Deep Dive
