We’re excited to share that our work, ThaiOCRBench: A Task-Diverse Benchmark for Vision–Language Understanding in Thai, has been accepted to AACL 2025, which will be held in Mumbai, India from December 20–24, 2025.
ThaiOCRBench fills a gap in Thai AI development: while vision–language models (VLMs) have made rapid progress globally, there’s never been a comprehensive way to evaluate how well they understand Thai documents—with all their unique scripts, layouts, and cultural context.
We built this benchmark to change that.
Why ThaiOCRBench Matters
ThaiOCRBench is the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks.
We created it to fill three gaps:
1. There was no Thai-specific VLM benchmark — until now
Most existing VLM benchmarks are designed for English or other high-resource languages. Prior to ThaiOCRBench, there was no comprehensive benchmark tailored to Thai document understanding. Even newer multilingual datasets provide only limited task diversity for Thai, especially for structured content such as tables, charts, forms, and handwritten documents.
2. Existing Thai OCR datasets are too narrow and don’t reflect real-world complexity
Thai documents contain:
- Thai numerals
- visually similar Thai–English letters
- mixed scripts (Pali/Sanskrit within Thai text)
- varied layouts such as forms, tables, charts, infographics
Most Thai OCR datasets cover only text-line OCR or handwriting, and do not yet include other formats such as charts, tables, forms, diagrams, or infographics.
3. No unified way to evaluate Thai multimodal reasoning
Before ThaiOCRBench, there was no single framework that measured OCR, structure parsing, semantic extraction, and VQA together.
Overview of ThaiOCRBench
We built ThaiOCRBench to reflect what Thai document understanding actually looks like in the real world. The final benchmark includes 2,808 human-verified samples across 13 tasks, each designed to test a different layer of capability.
The tasks cover four major areas:
1) OCR and text recognition
- Full-page OCR
- Fine-grained text recognition
- Handwriting extraction
2) Structural understanding
- Table parsing
- Chart parsing
- Document parsing
3) Key-information tasks
- Key information extraction
- Key information mapping
4) Multimodal understanding and reasoning
- Document classification
- Diagram VQA
- Cognition VQA
- Infographics VQA
These tasks were selected because together they capture the full pipeline of Thai document intelligence—from reading text accurately, to understanding structure, to answering grounded questions based on Thai content.
To keep the benchmark realistic, the dataset covers 30+ domains, including government, finance, food & beverage, transportation, education, retail, legal, and more. The domain distribution is shown here:

Task Examples


How ThaiOCRBench Was Built
The pipeline diagram presented in the paper summarizes the four-stage process:
- Data sourcing – original photos, public-domain materials, synthesized documents
- Annotation & PII removal
- LLM-assisted QA generation + human validation
- Final quality check by humans

What the Benchmark Reveals
The benchmark measures model performance across four metric families—each aligned with different types of tasks:
- TED for structure-heavy tasks (table parsing, chart parsing, document parsing)
- BMFL for text recognition (fine-grained text, full-page OCR, handwriting)
- F1 for key information extraction and mapping
- ANLS for semantic understanding and VQA
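For readers unfamiliar with ANLS: it is an edit-distance-based similarity averaged over questions. Below is a minimal sketch of how such a score can be computed, assuming the common formulation with a 0.5 threshold; the official evaluation toolkit may normalize answers differently (casing, whitespace, multiple references), so treat this as illustrative rather than the exact scorer used in the paper.

```python
# Minimal sketch of an ANLS-style scorer (illustrative only; the official
# toolkit may normalize answers or handle multiple references differently).

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity with the usual 0.5 cutoff."""
    scores = []
    for pred, ref in zip(predictions, references):
        pred, ref = pred.strip(), ref.strip()
        nl = levenshtein(pred, ref) / max(len(pred), len(ref), 1)
        scores.append(1.0 - nl if nl < threshold else 0.0)
    return sum(scores) / max(len(scores), 1)

# Example: an exact match scores 1.0; answers that differ by more than the
# threshold receive 0.
print(anls(["กรุงเทพมหานคร"], ["กรุงเทพมหานคร"]))   # 1.0
```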
The evaluation table below compares all major proprietary and open-source VLMs across these metrics. This gives a single view of where models are strong, where they struggle, and which aspects of Thai documents are most difficult for current VLMs.
Performance Evaluations of VLMs on ThaiOCRBench

Key Findings
- Proprietary models still lead, but all models struggle with Thai complexity. Gemini 2.5 Pro ranks highest across most tasks. GPT-4o follows closely and especially excels in document classification.
- Qwen2.5-VL 72B is the strongest open-source model. Its multilingual training and larger size help close the gap, but it still trails proprietary systems.
- Some tasks are significantly harder than others:
  - Fine-grained text recognition is the hardest overall. Thai diacritics, small fonts, headless Thai scripts, and visually similar Thai–English characters lead to heavy penalties under strict edit-distance metrics.
  - Handwriting and multi-column layouts also consistently reduce accuracy. The dataset includes real handwriting and mixed-script content (Thai + Pali/Sanskrit), all of which models struggle with.
  - Document classification is comparatively easy. Coarse layout and visual cues allow even smaller models to perform fairly well.
- Structure-aware metrics mask deeper weaknesses. Models sometimes appear strong on TED-based tasks because the structure is correct even when textual details are wrong, but stricter metrics like ANLS or BMFL quickly expose those errors.
Why Models Fail
Three recurring weaknesses:
- Language bias & code-switching. Models sometimes drift into English or mix languages even when the input is fully Thai.
- Structural mismatch. Layout-heavy tasks (tables, forms, charts) often produce misaligned cells, missing tags, or malformed structures, even when the model “understands” the image.
- Incorrect or hallucinated content. Inserted characters, missing diacritics, and invented words appear frequently, especially in OCR-heavy tasks.
Together, these patterns explain why VLMs continue to score well on structure-aware metrics (like TED) yet still fall short under text-accurate metrics (BMFL and ANLS).
Overall, the table shows a consistent pattern: today’s VLMs can manage broad structure but still struggle with Thai OCR precision, handwriting variability, mixed scripts, and fine-grained visual reasoning.
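To make that structure-vs-text gap concrete, here is a toy illustration (not the benchmark's actual TED or BMFL scorers): a predicted table row whose markup exactly matches the reference but contains one wrong Thai character passes a structure-only check, while a character-level similarity immediately drops.

```python
# Toy illustration of structure-correct but text-wrong output.
# This is NOT the benchmark's TED/BMFL implementation, just a demonstration.
import difflib
import re

reference  = "<tr><td>ชื่อ</td><td>สมชาย</td></tr>"
prediction = "<tr><td>ชื่อ</td><td>สมขาย</td></tr>"   # one wrong Thai character

def tag_sequence(html: str):
    """Keep only the markup tags, discarding the cell text."""
    return re.findall(r"</?\w+>", html)

# Structure-only view: the tag sequences match exactly.
print(tag_sequence(prediction) == tag_sequence(reference))               # True

# Text-aware view: the single character error is penalized.
print(round(difflib.SequenceMatcher(None, prediction, reference).ratio(), 3))
```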
Explore ThaiOCRBench
- Paper (arXiv)
- Hugging Face Dataset
- GitHub (Evaluation Toolkit)
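If you want to inspect the data yourself, a dataset hosted on the Hugging Face Hub can typically be browsed with the `datasets` library. The snippet below is only a sketch: the repository ID, split name, and field names are placeholders, so check the dataset card linked above for the actual values.

```python
# Hypothetical example of browsing the benchmark with the `datasets` library.
# The repo ID, split, and column names below are placeholders, not confirmed ones.
from datasets import load_dataset

ds = load_dataset("your-org/ThaiOCRBench", split="test")   # placeholder repo ID

sample = ds[0]
print(sample.keys())   # inspect available fields (image, task, question, answer, ...)
```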
Why This Work Matters
We built ThaiOCRBench to address a simple problem: Thai documents are everywhere—government services, financial workflows, healthcare, education—but no benchmark previously captured their full complexity. This meant Thai-language VLMs were being evaluated against tasks that didn’t match real usage.
ThaiOCRBench changes that. It offers:
- a single, standardized benchmark covering OCR, layout understanding, and multimodal reasoning
- high-quality, human-verified annotations across 13 task types
- representation of 30+ real-world Thai domains
- the first systematic comparison of proprietary and open-source VLMs on Thai document understanding
The result is a resource that not only reveals current model limitations but also provides a clear roadmap for improvement—making Thai-language AI more accurate, more accessible, and more reliable for real-world applications.
We hope this benchmark will help:
- provide developers with a standardized way to evaluate VLM performance
- push the development of models that understand Thai documents more accurately
- enable teams to improve OCR systems, document AI solutions, and Thai-language agents for both public and private sector applications
See You at AACL 2025 in Mumbai
We’re looking forward to presenting this work at AACL 2025. If you’re attending, come say hi and talk with us!


