
Introducing Typhoon OCR: An Open-Source Vision-Language Document Parsing Model for English and Thai

From now on, extracting text content from large volumes of documents or image files will no longer be a headache. We are glad to release Typhoon OCR: a next-generation, open-source vision-language document parsing model built for real-world use cases in English and Thai. Typhoon OCR delivers structured, layout-aware, and semantically rich outputs—making it a powerful tool for applications like document retrieval, summarization, and information extraction.
Limitations of Traditional OCR
Traditional OCR systems are typically built using Convolutional Neural Networks (CNNs) for visual recognition, paired with sequence decoders like RNNs or Transformers to convert image features into text. These models are effective at identifying characters in images, but they work best with clean, well-structured documents—such as high-quality scans with simple layouts.
Popular OCR frameworks like EasyOCR, PaddleOCR, and Tesseract already support multiple languages, including Thai. However, they face several challenges when applied to real-world documents:
- Weak layout awareness: These systems often treat documents as plain sequences of text, failing to recognize structural elements like tables, headings, columns, or mixed media sections.
- Lack of image-text understanding: Visual content such as charts, diagrams, and figures is usually ignored, leading to incomplete or fragmented outputs.
- Limited PDF support: These tools can process PDFs only by converting them into images, which strips away critical metadata (such as reading order, bounding boxes, and annotations) that helps preserve the document's original structure.
- Loss of context: Traditional OCR typically processes content at the token or line level, without considering the broader document context. This makes it hard to support advanced tasks such as summarization, entity linking, or intelligent retrieval.
Meet Typhoon OCR
To overcome the limitations of traditional OCR systems, Vision-Language Models (VLMs) present a fundamentally new approach. By combining visual perception with natural language understanding, VLM-based OCR moves beyond simple text recognition. These models can interpret document structure, grasp semantic meaning, and capture the intent behind the content—without relying on complex, rule-based pipelines.
Typhoon OCR is an open-source, bilingual document parsing model built specifically for real-world documents in Thai and English. Inspired by models like olmOCR, Typhoon OCR introduces a redesigned architecture that is:
- Robust to noisy inputs and complex, irregular layouts
- Multilingual, with dedicated support for both Thai and English
- Layout-aware, preserving the document’s structural integrity in its output
Unlike conventional OCR tools, Typhoon OCR doesn't just extract raw text—it produces semantic, structured, and layout-preserving outputs that are optimized for downstream tasks such as:
- Retrieval-Augmented Generation (RAG)
- Comprehensive document parsing and understanding
- Accurate interpretation of tables, charts, and forms
Real-World Document Support
Typhoon OCR is optimized to handle a wide variety of real-world documents in both PDF and image formats—from structured reports to informal content. It preserves both semantic meaning and visual structure, delivering outputs ready for downstream applications.
- PDFs: Utilizes embedded metadata such as reading order, bounding boxes, and annotations to improve accuracy and maintain document structure
- Images: Maintains layout fidelity even without metadata, using visual cues to reconstruct structure
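To make the idea of bounding-box and reading-order metadata concrete, here is a minimal sketch in Python. It does not describe how Typhoon OCR consumes PDF metadata internally, which this post does not specify. The `TextBlock` type, the `line_tolerance` parameter, and the hint format are all hypothetical, and the ordering rule assumes a simple single-column page with the y-axis growing downward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    """Hypothetical container for one text span and its bounding box."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (assumption: y grows downward)
    x1: float  # right edge
    y1: float  # bottom edge


def reading_order(blocks, line_tolerance=5.0):
    """Order blocks top to bottom, then left to right within a line.

    Blocks whose top edges lie within `line_tolerance` of the first block
    of the current line are treated as belonging to that line.
    """
    lines = []
    for block in sorted(blocks, key=lambda b: b.y0):
        if lines and abs(block.y0 - lines[-1][0].y0) <= line_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])
    return [b for line in lines for b in sorted(line, key=lambda b: b.x0)]


def format_hint(blocks):
    """Render ordered blocks as a plain-text layout hint (made-up format)."""
    return "\n".join(
        f"[{b.x0:.0f}x{b.y0:.0f}] {b.text}" for b in reading_order(blocks)
    )


blocks = [
    TextBlock("Total", 300, 101, 350, 112),
    TextBlock("Item", 50, 100, 90, 112),
    TextBlock("Title", 50, 20, 200, 40),
]
print(format_hint(blocks))
```

Multi-column pages, rotated text, and tables need a more careful ordering rule than this; the sketch only shows why position metadata is useful when it is available.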
Structured Documents
Typhoon OCR is optimized for supporting structured documents including financial reports, academic papers, books, and government forms.
Output format:
- Markdown for general text
- HTML for tables (including merged cells and complex layouts)
- Figures, charts, and diagrams are represented using <figure> tags for structured visual understanding
Each figure undergoes multi-layered interpretation:
- Observation: Detects elements like landscapes, buildings, people, logos, and embedded text
- Context Analysis: Infers context such as location, event, or document section
- Text Recognition: Extracts and interprets embedded text (e.g., chart labels, captions) in Thai or English
- Artistic & Structural Analysis: Captures layout style, diagram type, or design choices contributing to document tone
- Final Summary: Combines all insights into a structured figure description for tasks like summarization and retrieval
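Because the structured output mixes Markdown text, HTML tables, and <figure> blocks, downstream code usually needs to separate the three before indexing or summarizing. The following is a minimal sketch of such a post-processing step. The `split_blocks` function and the sample page are illustrative assumptions, not part of any Typhoon OCR library, and the regular expression does not handle nested tables.

```python
import re

# Matches a whole <table>...</table> or <figure>...</figure> block.
# Non-greedy matching means nested tables are NOT handled correctly.
BLOCK_PATTERN = re.compile(
    r"<(table|figure)\b[^>]*>.*?</\1>", re.DOTALL | re.IGNORECASE
)


def split_blocks(page_text):
    """Split page output into (kind, text) pairs.

    `kind` is "markdown", "table", or "figure".
    """
    segments = []
    cursor = 0
    for match in BLOCK_PATTERN.finditer(page_text):
        before = page_text[cursor:match.start()].strip()
        if before:
            segments.append(("markdown", before))
        segments.append((match.group(1).lower(), match.group(0)))
        cursor = match.end()
    tail = page_text[cursor:].strip()
    if tail:
        segments.append(("markdown", tail))
    return segments


# Made-up page output, used only to exercise the function.
sample = """# Annual Report

Revenue grew in 2024.

<table><tr><td colspan="2">Total</td></tr><tr><td>Q1</td><td>10</td></tr></table>

<figure>Bar chart comparing quarterly revenue.</figure>

Closing remarks."""

for kind, text in split_blocks(sample):
    print(kind, "->", text[:40].replace("\n", " "))
```

Keeping tables and figure descriptions as separate segments lets a retrieval pipeline embed or render each kind differently. For production use, a real HTML parser is a safer choice than a regular expression.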
Layout-Heavy & Informal Documents
Typhoon OCR also supports informal, layout-heavy documents and has been tested on infographics, receipts, menus, tickets, and handwritten notes.
Output format: Markdown with embedded tables and layout-aware structures
Real-World Demos & Highlights
Financial Statement and Financial Tables: Accurately extracts complex tabular data, including merged cells
Image source: scb.co.th
Charts: Converts statistical chart content into human-readable Markdown summaries
Image source: scb.co.th
Government Documents: Performs high-accuracy full-page OCR, including support for Thai numerals
Infographics: Excels in Visual Text Understanding, achieving near-perfect results
Image source: longtunman
Handwritten Notes: Demonstrates promising results across varied handwriting styles
Bills & Receipts: Performs well even on out-of-domain formats like utility bills
Evaluation Methodology
To evaluate Typhoon OCR, we used standard metrics widely adopted in OCR and text generation tasks:
- BLEU – Measures n-gram precision against the reference text (higher is better)
- ROUGE-L – Measures sequence similarity based on the longest common subsequence (higher is better)
- Levenshtein Distance – Character-level edit distance (lower is better)
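For readers who want to see what these metrics compute, here is a minimal pure-Python sketch. It is not the evaluation code used for the results in this post. Whitespace tokenization is an assumption made for the example (Thai text would need a proper word segmenter), and `clipped_ngram_precision` is only the per-order building block of BLEU, without the brevity penalty or the geometric mean across n-gram orders.

```python
from collections import Counter


def levenshtein(a, b):
    """Minimum number of character edits that turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(y) + 1)
    for tx in x:
        current = [0]
        for j, ty in enumerate(y, start=1):
            if tx == ty:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(reference, candidate, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return matched / total


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(levenshtein("kitten", "sitting"))
print(round(rouge_l_f1(reference, candidate), 3))
print(clipped_ngram_precision(reference, candidate, 2))
```

Note that `levenshtein` compares Unicode code points, so Thai vowel and tone marks each count as separate characters.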
We benchmarked Typhoon OCR in two settings — with PDF metadata support and without it (image-only input) — against state-of-the-art models, including:
- GPT-4o (2024-11-20)
- Gemini 2.5 Flash Preview (2025-04-17)
The evaluation was conducted on our in-house Thai dataset comprising:
- 📈 Thai Financial Reports
- 🏛️ Thai Government Forms
- 📖 Thai Books
Summary
Typhoon OCR outperforms both GPT-4o and Gemini 2.5 Flash in Thai document understanding, particularly on documents with complex layouts and mixed-language content.
However, in the Thai books benchmark, performance slightly declined due to the high frequency and diversity of embedded figures. These images vary significantly in type and structure, which poses challenges for our current <figure> tag parsing. This highlights a potential area for future improvement, specifically in enhancing the model's image understanding capabilities.
For this version, our primary focus has been on achieving high-quality OCR for both English and Thai text. Future releases may extend support to more advanced image analysis and figure interpretation.
Try Typhoon OCR Today
English PDF extraction from Typhoon OCR Playground
Thai PDF extraction from Typhoon OCR Playground
Whether you're parsing complex tables, interpreting multilingual forms, or unlocking insights from visually rich documents, Typhoon OCR is ready to transform how you work with text and structure.
- 🔍 Test it instantly on our OCR Playground – just upload an image or a single-page PDF and see the results in seconds.
- 🤗 Check out the model weights on Hugging Face – fine-tune or integrate it into your own workflows.
- ⚙️ Use the API – full API access is available now. Visit our documentation to get started.
Typhoon OCR is open-source, bilingual, and built for real-world performance—start building with it today.