Explore our research publications and technical papers that advance Thai language AI development. From foundational models to cutting-edge applications, discover the scientific contributions driving innovation in Thai NLP.
Access our latest research publications covering Thai language models, multimodal systems, and evaluation frameworks.
ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings and offers actionable insights for improving Thai-language document understanding.
A Thai-script–based orthographic system designed to represent Isan words consistently and systematically. It provides clear rules for writing Isan in a way that supports linguistic research, dataset creation, and AI model training.
A comprehensive guideline for transcribing spoken Isan in a consistent, machine-readable form. It defines rules for segmenting speech, marking tones, representing pronunciation, and handling variations across regions—ensuring high-quality annotations for AI and NLP training.
This work studies how to aggregate granular human feedback into reliable overall judgments, proposing methods tailored to complex settings such as the evaluation of machine learning systems.
AudioJudge investigates how large audio models can be used to evaluate speech, analyzing which design choices and configurations lead to reliable automatic assessments.
ThaiOCRBench introduces a diverse benchmark for testing vision-language models on Thai OCR and understanding tasks, enabling more robust evaluation of Thai-capable multimodal systems.
ThaiSafetyBench proposes a benchmark for evaluating language model safety in culturally specific Thai scenarios, highlighting gaps and risks in current safety alignment.
This paper explores medical reasoning models that output ranked lists of answers instead of a single prediction, aiming to better capture diagnostic uncertainty and support clinical decision-making.
This work investigates techniques for extending audio context in large audio-language models, enabling better comprehension of long-form audio such as lectures and conversations.
Mangosteen provides an open large-scale Thai text corpus designed for pretraining language models, supporting research and development of Thai-centric NLP systems.
This report investigates approaches for prompting a tool-augmented large language model (LLM) to act as a role-playing dialogue agent in the API track of the Commonsense Persona-grounded Dialogue Challenge (CPDC) 2025.
This paper presents FinCoT, a structured chain-of-thought (CoT) prompting framework that embeds domain-specific expert financial reasoning blueprints to guide large language models' behaviors.
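To make the idea concrete, here is a minimal sketch of what a blueprint-guided prompt might look like; the blueprint text and function name below are invented for illustration and are not taken from the FinCoT paper.

```python
# Illustrative structured-CoT prompt that embeds an expert "blueprint"
# before the question. The blueprint below is an invented example;
# FinCoT's actual blueprints come from financial domain experts.
BLUEPRINT = (
    "Expert steps for bond valuation:\n"
    "1. Identify the cash flows.\n"
    "2. Determine the appropriate discount rate.\n"
    "3. Discount each cash flow and sum the results.\n"
)

def blueprint_prompt(question: str) -> str:
    """Prepend the structured reasoning blueprint to the question."""
    return (
        f"{BLUEPRINT}\nFollow the steps above, showing each step.\n\n"
        f"Question: {question}"
    )
```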
Unlearning has emerged as a critical capability for large language models (LLMs) to support data privacy, regulatory compliance, and ethical AI deployment. Recent techniques often rely on obfuscation, injecting incorrect or irrelevant information to suppress knowledge. This work examines whether such obfuscation genuinely removes the targeted knowledge.
This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals.
We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types.
This paper presents an integrated architecture and training strategy that improves performance in Thai while retaining strong English capabilities. Our model combines audio understanding and speech instruction following, two capabilities that were previously treated separately.
The paper explores how to improve reasoning in multilingual environments using Program-of-Thought (PoT) prompting, a technique that separates reasoning (written as code) from execution (done by an interpreter).
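As a concrete illustration, here is a minimal sketch of the PoT loop. The `ask_model` placeholder stands in for any text-generation call, and the convention of storing the result in an `answer` variable is ours for illustration, not the paper's exact prompt.

```python
# Minimal Program-of-Thought sketch: the model writes Python (the
# reasoning step), and a local interpreter runs it (the execution step).

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in any LLM client here")

POT_TEMPLATE = (
    "Solve the problem by writing Python code only.\n"
    "Store the final answer in a variable named `answer`.\n\n"
    "Problem: {question}\n"
)

def program_of_thought(question: str):
    code = ask_model(POT_TEMPLATE.format(question=question))
    scope: dict = {}
    exec(code, scope)  # execution is delegated to the interpreter
    return scope.get("answer")
```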
The paper introduces SEA-VL, a large-scale, open-source, multicultural vision-language dataset specifically designed to address the underrepresentation of Southeast Asian (SEA) cultures in AI and machine learning research. By combining three methods—crowdsourcing, web crawling, and image generation—the authors collected 1.28 million culturally relevant image-caption pairs from 11 SEA countries, far surpassing existing datasets in both scale and cultural diversity.
This paper presents TalkArena, a new platform for evaluating Large Audio Models (LAMs) through interactive user engagement rather than static benchmarks. By collecting over 7,500 interactions from 484 users using speech-based queries, the authors uncover that users mainly use audio interfaces for tasks that benefit from speed and ease—like seeking knowledge or advice—rather than tasks requiring nuanced speech understanding.
Safeguarding LLMs requires separating harmful prompts from safe ones, a task for which existing safeguard models often rely on surface keywords. We frame this reliance as a shortcut learning problem and conduct experiments revealing how existing models depend on specific keywords for classification rather than semantic understanding. Performance evaluations across six safety benchmarks show that models perform well when keyword distributions align but degrade on out-of-distribution prompts. Results from our counterfactual analysis demonstrate that current safeguard models are vulnerable to keyword distribution shifts due to shortcut learning. These findings highlight the importance of addressing shortcut learning to enhance the robustness of safeguard models.
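One way to operationalize such a counterfactual analysis is a keyword-insertion probe, sketched below. This is illustrative only: `classify` stands in for any safeguard model, and the binary label set is an assumption.

```python
# Hypothetical probe for keyword shortcuts in a safety classifier:
# append a trigger keyword to otherwise benign prompts and measure how
# often the predicted label flips.

def classify(prompt: str) -> str:
    raise NotImplementedError("plug in a safeguard model here")

def keyword_flip_rate(benign_prompts, keyword: str) -> float:
    """Fraction of benign prompts whose label flips to 'harmful' when a
    single trigger keyword is appended (a counterfactual edit)."""
    flips = 0
    for p in benign_prompts:
        if classify(p) == "safe" and classify(f"{p} {keyword}") == "harmful":
            flips += 1
    return flips / len(benign_prompts)
```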
This paper explores data selection and model merging to enhance language-specific LLMs (e.g., Thai) with DeepSeek R1-level reasoning. Using only public datasets and a $120 budget, we achieve this without compromising performance on language tasks.
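For readers unfamiliar with model merging, the sketch below shows its simplest form, linear interpolation of two checkpoints that share an architecture. Production recipes are more involved, and this function is illustrative rather than the paper's method.

```python
# Minimal linear merge of two state dicts (e.g., a language-specific
# checkpoint and a reasoning-tuned checkpoint with identical keys).
# Values are expected to be tensors or arrays supporting arithmetic.

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Element-wise interpolation: alpha * A + (1 - alpha) * B."""
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}
```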
An open-source effort to develop a Thai reasoning model, accompanied by a comprehensive ablation study.
This work introduces SkillAggregation, a method for aggregating judgments from multiple LLMs without reference labels, extending crowd-layer ideas to NLP and achieving strong results across tasks.
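The sketch below illustrates the reference-free aggregation setting with a simple baseline: weight each judge by its agreement with the per-item majority, then take a weighted vote. SkillAggregation itself estimates judge reliabilities more carefully, so treat this as a baseline for the problem, not the paper's method.

```python
# Baseline aggregation of labels from multiple LLM judges, no reference
# labels required.
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def aggregate(all_judgments):
    """all_judgments: list of {judge_name: label} dicts, one per item."""
    maj = [majority(list(j.values())) for j in all_judgments]
    n = len(all_judgments)
    # Weight each judge by how often it matches the per-item majority.
    weight = {g: sum(j[g] == m for j, m in zip(all_judgments, maj)) / n
              for g in all_judgments[0]}
    # Weighted vote per item.
    return [max(set(j.values()),
                key=lambda lab: sum(w for g, w in weight.items()
                                    if j[g] == lab))
            for j in all_judgments]
```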
Mind the Gap compares static and interactive evaluation setups for large audio models, highlighting gaps between benchmark performance and real-world interactive behavior.
SEA-VL is a multicultural vision-language dataset for Southeast Asia built from crowdsourcing, web crawling, and generation, enabling better evaluation and training of regional vision-language models.
This paper analyzes program-of-thought reasoning for multilingual and cross-lingual settings, studying how reasoning programs transfer across languages and where failure modes arise.
This paper evaluates audio language models for low-resource languages like Thai and proposes data and training strategies that jointly improve audio comprehension and speech instruction-following.
Large language models excel at instruction-following in English, but their performance in low-resource languages like Thai remains underexplored. Existing benchmarks often rely on translations, missing cultural and domain-specific nuances needed for real-world use. We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types. Created through a multi-stage quality control process with annotators, domain experts, and AI researchers, WangchanThaiInstruct supports two studies: (1) a zero-shot evaluation showing performance gaps on culturally and professionally specific tasks, and (2) an instruction tuning study with ablations isolating the effect of native supervision. Models fine-tuned on WangchanThaiInstruct outperform those using translated data in both in-domain and out-of-domain benchmarks. These findings underscore the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings.
This paper studies how prior prompt engineering influences reinforcement fine-tuning of language models, comparing different prompting strategies and showing that carefully designed prior prompts can induce distinct and beneficial behaviors.
The authors compare unlearning and obfuscation approaches for removing knowledge from language models, examining how much sensitive information actually remains accessible after each method.
This paper presents Typhoon 2, Thai-optimized models for text, vision, and audio. It outlines methods like continual pre-training and post-training to enhance Thai performance, with evaluation across tasks. The series includes models from 1 to 70 billion parameters, safety tools, and advances in document understanding and speech processing.
This paper explores multilingual reasoning distillation in LLMs, proposing d-CoT-nR, a novel approach that incorporates incorrect rationales alongside positive ones to enhance learning. Experiments on multilingual high-school exams show that d-CoT-nR improves accuracy in unseen languages and step-by-step reasoning, outperforming existing methods focused primarily on English. In collaboration with VISTEC.
This work addresses the challenge of overshadowed entities in entity disambiguation (ED) by proposing a debiasing technique to prevent shortcut learning during training. Unlike knowledge-based methods, this approach avoids added computational overhead at inference. Experiments show state-of-the-art performance on ED datasets, offering a fast and effective solution for improving ED. In collaboration with VISTEC.
McCrolin is a multi-consistency cross-lingual training framework designed to enhance consistency, ranking stability, and robustness in cross-lingual QA systems. Using multi-task learning, McCrolin achieves state-of-the-art results on standard QA datasets and excels with varying input sizes. It demonstrates strong generalizability across different encoder architectures and sizes. In collaboration with VISTEC.
This paper evaluates audio language models in low-resource languages, using Thai as an example, revealing their limitations despite multilingual pretraining. It explores data mixtures to optimize models for both a target language and English, integrating audio comprehension and speech instruction-following into a unified framework. The proposed model, Typhoon-Audio, significantly outperforms open-source models and rivals state-of-the-art systems like Gemini-1.5-Pro in both English and Thai.
CrossCheckGPT introduces a reference-free method for ranking hallucinations in multimodal foundation models, leveraging cross-system consistency as a measure of robustness. Applicable across domains and tasks, it uses explicit and implicit consistency metrics to assess hallucination levels. The method demonstrates high correlation with human judgments and supports new benchmarks, including the first audio-visual hallucination benchmark, AVHalluBench. In collaboration with University of Cambridge, Tsinghua University.
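The core idea of cross-system consistency can be sketched as follows. Here `support` is a placeholder for any entailment or similarity scorer, and the function is a simplification of the paper's explicit and implicit consistency metrics.

```python
# Sketch of cross-system consistency scoring: a response is judged less
# hallucinated when other systems' responses support its sentences.

def support(sentence: str, reference: str) -> float:
    raise NotImplementedError("e.g., an NLI model or embedding similarity")

def consistency_score(response_sents, other_responses) -> float:
    """Average best support each sentence receives from other systems."""
    return sum(max(support(s, r) for r in other_responses)
               for s in response_sents) / len(response_sents)
```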
The Typhoon series introduces Thai LLMs optimized for low-resource challenges, using continual training and ThaiExam for evaluation. Fine-tuned for Thai tasks, Typhoon outperforms open-source models and rivals GPT-3.5 in Thai, with greater efficiency.
This paper frames LLM safety classification as a shortcut learning problem, showing that safeguards often rely on keyword distributions rather than deep semantic understanding, leading to brittleness under distribution shifts.
FinCoT proposes a structured chain-of-thought prompting framework grounded in expert financial blueprints, improving model accuracy and interpretability on CFA-style financial reasoning tasks.
This work studies prompting strategies for tool-augmented role-play dialogue agents, proposing rule-based role prompting that reduces over-speaking and improves tool calling to achieve better task performance.
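To illustrate the flavor of rule-based role prompting, a hypothetical system prompt might look like the following; the persona slot and the rules shown are assumptions for illustration, not the report's actual prompt.

```python
# Hypothetical rule-based role prompt for a tool-augmented dialogue agent.
SYSTEM_PROMPT = (
    "You are {persona}. Stay in character at all times.\n"
    "Rules:\n"
    "- Keep replies to at most two sentences (avoid over-speaking).\n"
    "- Call a tool only when the request needs external information.\n"
    "- After a tool call, summarize the result in character.\n"
)

def build_system_prompt(persona: str) -> str:
    return SYSTEM_PROMPT.format(persona=persona)
```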