TYPHOON

Publications

Explore our research publications and technical papers that advance Thai language AI development. From foundational models to cutting-edge applications, discover the scientific contributions driving innovation in Thai NLP.

Research Papers

Access our latest research publications covering Thai language models, multimodal systems, and evaluation frameworks.

ThaiOCRBench

November 2025 · Research Paper @ IJCNLP-AACL 2025 (Main)

ThaiOCRBench provides a standardized framework for assessing vision-language models (VLMs) in low-resource, script-complex settings and offers actionable insights for improving Thai-language document understanding.

Isan Spelling Standard

November 2025 · Research Artifact

A Thai-script–based orthographic system designed to represent Isan words consistently and systematically. It provides clear rules for writing Isan in a way that supports linguistic research, dataset creation, and AI model training.

Isan Speech Transcription Convention

November 2025 · Research Artifact

A comprehensive guideline for transcribing spoken Isan in a consistent, machine-readable form. It defines rules for segmenting speech, marking tones, representing pronunciation, and handling variations across regions—ensuring high-quality annotations for AI and NLP training.

Granular feedback merits sophisticated aggregation

September 2025

This work studies how to aggregate granular human feedback into reliable overall judgments, proposing methods tailored for complex evaluation settings such as machine learning systems.

AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation

July 2025

AudioJudge investigates how large audio models can be used to evaluate speech, analyzing which design choices and configurations lead to reliable automatic assessments.

ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

July 2025

ThaiSafetyBench proposes a benchmark for evaluating language model safety in culturally specific Thai scenarios, highlighting gaps and risks in current safety alignment.

Single Answer is Not Enough: On Generating Ranked Lists with Medical Reasoning Models

October 2025

This paper explores medical reasoning models that output ranked lists of answers instead of a single prediction, aiming to better capture diagnostic uncertainty and support clinical decision-making.

Extending Audio Context for Long-Form Understanding in Large Audio-Language Models

October 2025

This work investigates techniques for extending audio context in large audio-language models, enabling better comprehension of long-form audio such as lectures and conversations.

Mangosteen: An Open Thai Corpus for Language Model Pretraining

July 2025

Mangosteen provides an open large-scale Thai text corpus designed for pretraining language models, supporting research and development of Thai-centric NLP systems.

Talk Less, Call Right: Enhancing Role-Play LLM Agents with Automatic Prompt Optimization and Role Prompting

October 2025 · Research Paper @ EMNLP 2025 (Wordplay Workshop)

This report investigates approaches for prompting a tool-augmented large language model (LLM) to act as a role-playing dialogue agent in the API track of the Commonsense Persona-grounded Dialogue Challenge (CPDC) 2025.

FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning

September 2025 · Research Paper @ EMNLP 2025 (FinNLP Workshop)

This paper presents FinCoT, a structured chain-of-thought (CoT) prompting framework that embeds domain-specific expert financial reasoning blueprints to guide large language models' behaviors.

Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?

September 2025 · Research Paper @ EMNLP 2025 (Main)

Unlearning has emerged as a critical capability for large language models (LLMs) to support data privacy, regulatory compliance, and ethical AI deployment. Recent techniques often rely on obfuscation by injecting incorrect or irrelevant information to suppress knowledge.

Prior Prompt Engineering for Reinforcement Fine-Tuning

September 2025 · Research Paper @ EMNLP 2025 (Main)

This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals.

WangchanThaiInstruct: An Instruction-Following Dataset for Culturally-Aware, Multitask, and Multi-domain Evaluation in Thai

September 2025 · Research Paper @ EMNLP 2025 (Main)

We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types. Created through a multi-stage quality-control process involving annotators, domain experts, and AI researchers, it supports a zero-shot evaluation that reveals performance gaps on culturally and professionally specific tasks, and an instruction-tuning study showing that models fine-tuned on WangchanThaiInstruct outperform those trained on translated data on both in-domain and out-of-domain benchmarks.

Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

May 2025 · Research Paper @ Interspeech 2025

This paper presents an integrated architecture and training strategy that improves performance in Thai while retaining strong English capabilities. Our model, Typhoon-Audio, combines audio understanding and speech instruction following, two capabilities that were previously treated separately, and significantly outperforms open-source models while rivaling systems such as Gemini-1.5-Pro in both English and Thai.

Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments

May 2025 · Research Paper @ ACL 2025 (Findings)

The paper explores how to improve reasoning in multilingual environments using Program-of-Thought (PoT) prompting, a technique that separates reasoning (written as code) from execution (done by an interpreter).

Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

March 2025 · Research Paper @ ACL 2025 (Main)

The paper introduces SEA-VL, a large-scale, open-source, multicultural vision-language dataset specifically designed to address the underrepresentation of Southeast Asian (SEA) cultures in AI and machine learning research. By combining three methods—crowdsourcing, web crawling, and image generation—the authors collected 1.28 million culturally relevant image-caption pairs from 11 SEA countries, far surpassing existing datasets in both scale and cultural diversity.

Mind the Gap! Static and Interactive Evaluations of Large Audio Models

February 2025 · Research Paper @ ACL 2025 (Main)

This paper presents TalkArena, a new platform for evaluating Large Audio Models (LAMs) through interactive user engagement rather than static benchmarks. By collecting over 7,500 interactions from 484 users using speech-based queries, the authors uncover that users mainly use audio interfaces for tasks that benefit from speed and ease—like seeking knowledge or advice—rather than tasks requiring nuanced speech understanding.

Shortcut Learning in Safety: The Impact of Keyword Bias in Safeguards

February 2025 · Research Paper @ ACL 2025 (LLMSEC Workshop)

Safeguarding LLMs requires separating harmful prompts from safe ones, and existing safeguard models often rely on specific keywords rather than semantic understanding to make this distinction. We frame this reliance as a shortcut learning problem and conduct experiments revealing how current models depend on keyword cues for classification. Performance evaluations across six safety benchmarks show that models perform well when keyword distributions align but degrade on out-of-distribution prompts, and our counterfactual analysis demonstrates that current safeguard models are vulnerable to keyword distribution shifts due to shortcut learning. These findings highlight the importance of addressing shortcut learning to enhance the robustness of safeguard models.

Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging - An Open Recipe

February 2025 · Research Paper @ ICLR SCI-FM Workshop 2025

This paper explores data selection and model merging to enhance language-specific LLMs (e.g., Thai) with DeepSeek R1-level reasoning. Using only public datasets and a $120 budget, we achieve this without compromising performance on language tasks.

Typhoon T1: An Open Thai Reasoning Model

February 2025 · Research Paper @ ICLR SCI-FM Workshop 2025

An open-source effort to develop a Thai reasoning model, accompanied by a comprehensive ablation study.

SkillAggregation: Reference-free LLM-Dependent Aggregation

May 2025 · Research Paper @ ACL 2025 (Main)

This work introduces SkillAggregation, a method for aggregating judgments from multiple LLMs without reference labels, extending crowd-layer ideas to NLP and achieving strong results across tasks.

Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Model

December 2024 · Technical Report

This paper presents Typhoon 2, a family of Thai-optimized models for text, vision, and audio. It outlines methods such as continual pre-training and post-training to enhance Thai performance, with evaluation across tasks. The series spans models from 1 to 70 billion parameters and includes safety tools, along with advances in document understanding and speech processing.

An Empirical Study of Multilingual Reasoning Distillation for Question Answering

November 2024 · Research Paper @ EMNLP 2024 (Main)

This paper explores multilingual reasoning distillation in LLMs, proposing d-CoT-nR, a novel approach that incorporates incorrect rationales alongside positive ones to enhance learning. Experiments on multilingual high-school exams show that d-CoT-nR improves accuracy in unseen languages and step-by-step reasoning, outperforming existing methods focused primarily on English. In collaboration with VISTEC.

Efficient Overshadowed Entity Disambiguation by Mitigating Shortcut Learning

November 2024 · Research Paper @ EMNLP 2024 (Main)

This work addresses the challenge of overshadowed entities in entity disambiguation (ED) by proposing a debiasing technique to prevent shortcut learning during training. Unlike knowledge-based methods, this approach avoids added computational overhead at inference. Experiments show state-of-the-art performance on ED datasets, offering a fast and effective solution for improving ED. In collaboration with VISTEC.

McCrolin: Multi-consistency Cross-lingual Training for Retrieval Question Answering

November 2024 · Research Paper @ EMNLP 2024 (Findings)

McCrolin is a multi-consistency cross-lingual training framework designed to enhance consistency, ranking stability, and robustness in cross-lingual QA systems. Using multi-task learning, McCrolin achieves state-of-the-art results on standard QA datasets and excels with varying input sizes. It demonstrates strong generalizability across different encoder architectures and sizes. In collaboration with VISTEC.

CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models

May 2024 · Research Paper @ NeurIPS RBFM Workshop 2024

CrossCheckGPT introduces a reference-free method for ranking hallucinations in multimodal foundation models, leveraging cross-system consistency as a measure of robustness. Applicable across domains and tasks, it uses explicit and implicit consistency metrics to assess hallucination levels. The method demonstrates high correlation with human judgments and supports new benchmarks, including the first audio-visual hallucination benchmark, AVHalluBench. In collaboration with University of Cambridge, Tsinghua University.

Typhoon: Thai Large Language Models

December 2023 · Technical Report

The Typhoon series introduces Thai LLMs optimized for low-resource challenges, using continual training and ThaiExam for evaluation. Fine-tuned for Thai tasks, Typhoon outperforms open-source models and rivals GPT-3.5 in Thai, with greater efficiency.
