Introducing Typhoon OCR: An Open-Source Vision-Language Document Parsing Model for English and Thai

Extracting text from large volumes of documents and image files no longer has to be a headache. We are pleased to release Typhoon OCR, a next-generation, open-source vision-language document parsing model built for real-world use cases in English and Thai. Typhoon OCR delivers structured, layout-aware, and semantically rich outputs, making it a powerful tool for applications such as document retrieval, summarization, and information extraction.

Limitations of Traditional OCR

Traditional OCR systems are typically built using Convolutional Neural Networks (CNNs) for visual recognition, paired with sequence decoders like RNNs or Transformers to convert image features into text. These models are effective at identifying characters in images, but they work best with clean, well-structured documents—such as high-quality scans with simple layouts.

Popular OCR frameworks like EasyOCR, PaddleOCR, and Tesseract already support multiple languages, including Thai. However, they face several challenges when applied to real-world documents:

Weak layout awareness: These systems often treat documents as plain sequences of text, failing to recognize structural elements like tables, headings, columns, or mixed media sections.
Lack of image-text understanding: Visual content such as charts, diagrams, and figures is usually ignored, leading to incomplete or fragmented outputs.
Limited PDF support: While these tools can process PDFs by converting them into images, this approach strips away critical metadata (such as reading order, bounding boxes, and annotations) that helps preserve the document's original structure.
Loss of context: Traditional OCR typically processes content at the token or line level, without considering the broader document context. This makes it hard to support advanced tasks such as summarization, entity linking, or intelligent retrieval.

Meet Typhoon OCR

To overcome the limitations of traditional OCR systems, Vision-Language Models (VLMs) present a fundamentally new approach. By combining visual perception with natural language understanding, VLM-based OCR moves beyond simple text recognition. These models can interpret document structure, grasp semantic meaning, and capture the intent behind the content—without relying on complex, rule-based pipelines.

Typhoon OCR is an open-source, bilingual document parsing model built specifically for real-world documents in Thai and English. Inspired by models like olmOCR, Typhoon OCR introduces a redesigned architecture that is:

Robust to noisy inputs and complex, irregular layouts
Bilingual, with dedicated support for both Thai and English
Layout-aware, preserving the document’s structural integrity in its output

Unlike conventional OCR tools, Typhoon OCR doesn't just extract raw text—it produces semantic, structured, and layout-preserving outputs that are optimized for downstream tasks such as:

Retrieval-Augmented Generation (RAG)
Comprehensive document parsing and understanding
Accurate interpretation of tables, charts, and forms

Real-World Document Support

Typhoon OCR is optimized to handle a wide variety of real-world documents in both PDF and image formats—from structured reports to informal content. It preserves both semantic meaning and visual structure, delivering outputs ready for downstream applications.

PDFs: Utilizes embedded metadata such as reading order, bounding boxes, and annotations to improve accuracy and maintain document structure
Images: Maintains layout fidelity even without metadata, using visual cues to reconstruct structure

Structured Documents

Typhoon OCR is optimized for structured documents such as financial reports, academic papers, books, and government forms.

Output format:

Markdown for general text
HTML for tables (including merged cells and complex layouts)
Figures, charts, and diagrams are represented using <figure> tags for structured visual understanding
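
Because a single page can mix Markdown, HTML tables, and <figure> blocks, downstream code usually needs to separate the three before indexing or analysis. The following is a minimal sketch of one way to do that with the Python standard library. It is not part of Typhoon OCR: the names split_segments, parse_table, and SAMPLE_PAGE are hypothetical, and the sample page is an invented illustration of the output format described above, not real model output.

```python
import re
from html.parser import HTMLParser

# Hypothetical page text in the described format (invented for illustration).
SAMPLE_PAGE = """# Annual Report 2024

Revenue grew during the year.

<table>
<tr><th colspan="2">Income</th></tr>
<tr><td>Interest</td><td>120</td></tr>
</table>

<figure>Bar chart comparing quarterly revenue.</figure>
"""

# Capturing group so that re.split keeps the matched table/figure blocks.
SEGMENT_RE = re.compile(
    r"(<table\b.*?</table>|<figure\b.*?</figure>)", re.DOTALL | re.IGNORECASE
)


def split_segments(page_text):
    """Split page text into (kind, content) pairs in reading order."""
    segments = []
    for part in SEGMENT_RE.split(page_text):
        stripped = part.strip()
        if not stripped:
            continue
        lowered = stripped.lower()
        if lowered.startswith("<table"):
            kind = "table"
        elif lowered.startswith("<figure"):
            kind = "figure"
        else:
            kind = "markdown"
        segments.append((kind, stripped))
    return segments


class _TableRows(HTMLParser):
    """Collects cell text row by row; repeats a cell across its colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None
        self._span = 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []
            self._span = int(dict(attrs).get("colspan") or 1)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:
                self.rows.append([])
            text = "".join(self._cell).strip()
            self.rows[-1].extend([text] * self._span)
            self._cell = None


def parse_table(table_html):
    """Return a table as a list of rows (rowspan is ignored in this sketch)."""
    parser = _TableRows()
    parser.feed(table_html)
    return parser.rows


if __name__ == "__main__":
    for kind, content in split_segments(SAMPLE_PAGE):
        print(kind, "->", parse_table(content) if kind == "table" else content)
```

Repeating a merged cell across its columns keeps every row the same width, which is convenient for loading into a dataframe. Handling rowspan would need extra bookkeeping that is left out here.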

Each figure undergoes multi-layered interpretation:

Observation: Detects elements like landscapes, buildings, people, logos, and embedded text
Context Analysis: Infers context such as location, event, or document section
Text Recognition: Extracts and interprets embedded text (e.g., chart labels, captions) in Thai or English
Artistic & Structural Analysis: Captures layout style, diagram type, or design choices contributing to document tone
Final Summary: Combines all insights into a structured figure description for tasks like summarization and retrieval

Layout-Heavy & Informal Documents

Typhoon OCR also handles informal documents and has been tested on infographics, receipts, menus, tickets, and handwritten notes.

Output format: Markdown with embedded tables and layout-aware structures

Real-World Demos & Highlights

Financial Statement and Financial Tables: Accurately extracts complex tabular data, including merged cells

Financial Statement Tabular Information Extraction Demo

Image source: scb.co.th

Charts: Converts statistical chart content into human-readable Markdown summaries

OCR Chart

Image source: scb.co.th

Government Documents: Performs high-accuracy full-page OCR, including support for Thai numerals

OCR Thai Government Documents

Infographics: Excels in Visual Text Understanding, achieving near-perfect results

Typhoon OCR infographic sample

Image source: longtunman

Handwritten Notes: Demonstrates promising results across varied handwriting styles

Typhoon OCR handwritten notes

Bills & Receipts: Performs well even on out-of-domain formats like utility bills

Typhoon OCR bills

Evaluation Methodology

To evaluate Typhoon OCR, we used standard metrics widely adopted in OCR and text generation tasks:

BLEU – Measures n-gram precision (higher is better)
ROUGE-L – Captures structural and sequence similarity (higher is better)
Levenshtein Distance – Character-level edit distance (lower is better)
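
To make the three metrics concrete, here is a small, dependency-free sketch of how each can be computed. It is an illustration, not the evaluation script used for the benchmarks. The function names are hypothetical, ROUGE-L is shown as a plain F1 over the longest common subsequence, BLEU is reduced to its clipped n-gram precision component (no brevity penalty or averaging over n), and tokens are assumed to be pre-split. Thai text has no spaces between words, so a real evaluation would need a Thai tokenizer or a character-level variant.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete, substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lcs_length(x: list, y: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(y) + 1)
    for tok_x in x:
        current = [0]
        for j, tok_y in enumerate(y, start=1):
            if tok_x == tok_y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference: list, candidate: list) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall (an assumed variant)."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def ngram_precision(reference: list, candidate: list, n: int) -> float:
    """Clipped n-gram precision, the core quantity that BLEU aggregates."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


if __name__ == "__main__":
    ref = "the cat sat on the mat".split()
    hyp = "the cat on the mat".split()
    print(levenshtein("kitten", "sitting"))
    print(rouge_l_f1(ref, hyp))
    print(ngram_precision(ref, hyp, 2))
```

Levenshtein distance operates on characters, so it is the only one of the three that needs no tokenization decision, which is one reason it is a common choice for scripts like Thai.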

We benchmarked Typhoon OCR in two settings — with PDF metadata support and without it (image-only input) — against state-of-the-art models, including:

GPT-4o (2024-11-20)
Gemini 2.5 Flash Preview (2025-04-17)

The evaluation was conducted on our in-house Thai dataset comprising:

📈 Thai Financial Reports

Typhoon OCR Performance in Thai Financial Reports

🏛️ Thai Government Forms

Typhoon OCR Performance in Thai Government forms

📖 Thai Books

Typhoon OCR Performance in Thai books

Summary

Typhoon OCR outperforms both GPT-4o and Gemini 2.5 Flash in Thai document understanding, particularly on documents with complex layouts and mixed-language content.

However, in the Thai books benchmark, performance slightly declined due to the high frequency and diversity of embedded figures. These images vary significantly in type and structure, which poses challenges for our current <figure> tag parsing. This highlights a potential area for future improvement—specifically, in enhancing the model's image understanding capabilities.

For this version, our primary focus has been on achieving high-quality OCR for both English and Thai text. Future releases may extend support to more advanced image analysis and figure interpretation.

Try Typhoon OCR Today

English PDF extraction from Typhoon OCR Playground

Thai PDF extraction from Typhoon OCR Playground

Whether you're parsing complex tables, interpreting multilingual forms, or unlocking insights from visually rich documents, Typhoon OCR is ready to transform how you work with text and structure.

🔍 Test it instantly on our OCR Playground – just upload an image or a single-page PDF and see the results in seconds.
🤗 Check out the model weights on Hugging Face – fine-tune or integrate it into your own workflows.
⚙️ Use the API – full API access is available now. Visit our documentation to get started.

Typhoon OCR is open-source, bilingual, and built for real-world performance—start building with it today.
