Description:
Fahd Mirza demonstrates Qianfan-OCR, a 4-billion-parameter document intelligence model released by Baidu’s Qianfan team. Unlike conventional OCR pipelines that chain separate layout detection, text recognition, and language understanding stages — where errors compound between each step — Qianfan-OCR collapses the entire process into a single vision-language model. A vision encoder processes the full document image, a cross-modal adapter connects it to a language model, and the system reasons about page structure holistically. Its headline feature, “Layout-as-Thought,” adds an optional thinking phase in which the model generates bounding boxes, element types, and reading order before producing its final output, recovering the explicit layout reasoning that typical end-to-end models sacrifice.
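As an illustration of what that single-model flow might look like in code, here is a minimal sketch using Hugging Face transformers. The video does not show loading code, so the repo ID (`baidu/Qianfan-OCR`), the prompt wording, and the processor conventions below are assumptions, not the documented API.

```python
# Minimal sketch of a single-pass vision-language OCR call via Hugging Face
# transformers. The model ID and prompt are ASSUMPTIONS, not Qianfan-OCR's
# documented interface.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "baidu/Qianfan-OCR"  # hypothetical repo name

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # 4B params at bf16 is ~8 GB of weights,
    device_map="auto",           # consistent with the ~9 GB the video reports
    trust_remote_code=True,
)

image = Image.open("document_page.png")
# One prompt drives layout analysis and recognition together, so there are
# no separate detection/recognition stages whose errors could compound.
prompt = "Transcribe this document to Markdown, preserving reading order."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

In this flow, Layout-as-Thought would surface as generated layout tokens (bounding boxes, element types, reading order) preceding the final transcription; the source does not say how the thinking phase is toggled, so no flag is shown here.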
Mirza installs the model locally on an Nvidia RTX 6000, where it consumes just over 9 GB of VRAM, and runs it through three progressively difficult test cases. Handwritten physics equations — including integrals, Greek symbols, and nested fractions — are converted to valid LaTeX that renders correctly in an online compiler. A structured form is parsed into JSON with field labels, types, and values extracted accurately. A historical newspaper front page is analyzed for article headlines, bylines, primary versus secondary story ranking, and advertisements, with the model’s Layout-as-Thought output showing it explicitly mapped the page before writing the response.
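For the form-parsing test, a sensible workflow is to request JSON explicitly and validate the reply before using it. This sketch continues the one above (reusing `processor` and `model`); the prompt wording and the `label`/`type`/`value` keys are inferred from the summary’s description of what was extracted, and the fence-stripping is a common defensive step rather than anything the video shows.

```python
# Hypothetical continuation of the sketch above: request structured JSON
# for a form image and validate it before use.
import json

from PIL import Image

def run_ocr(image, prompt):
    """Wrap the processor/model.generate call from the earlier sketch."""
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1024)
    return processor.batch_decode(out, skip_special_tokens=True)[0]

form = Image.open("form_page.png")
prompt = (
    "Extract every field from this form as a JSON array of objects "
    "with keys 'label', 'type', and 'value'."
)
raw = run_ocr(form, prompt)

# Vision-language models often wrap JSON replies in Markdown code fences;
# strip them so json.loads() sees bare JSON. json.loads() raises if the
# model drifted away from valid JSON.
fence = "`" * 3  # Markdown code-fence delimiter
cleaned = raw.strip().removeprefix(fence + "json").removesuffix(fence).strip()
for field in json.loads(cleaned):
    print(f"{field['label']} ({field['type']}): {field['value']}")
```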
The model outputs structured formats including Markdown, JSON, and HTML, and is available via Hugging Face. Mirza estimates overall accuracy on the test cases at around 95%, with handwritten content — typically the hardest OCR challenge — performing particularly well.
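How a caller chooses among those formats is not shown in the summary; prompt-based selection is the usual pattern for vision-language models, and the phrasings below are assumptions illustrating it, with only the format list taken from the source.

```python
# Hypothetical: choosing an output format via the prompt. Only the list of
# formats (Markdown, JSON, HTML) comes from the source; the wording does not.
FORMAT_PROMPTS = {
    "markdown": "Transcribe this document to Markdown, preserving structure.",
    "json": "Extract this document's content as structured JSON.",
    "html": "Reproduce this document's layout and text as HTML.",
}

for name, prompt in FORMAT_PROMPTS.items():
    print(f"--- {name} ---")
    print(run_ocr(form, prompt))  # run_ocr() defined in the sketch above
```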
📺 Source: Fahd Mirza · Published March 19, 2026
🏷️ Format: Tutorial Demo
