We are excited to announce that four research papers from Typhoon team have been accepted to EMNLP 2025 including two papers in the main conference and two papers in workshops.
This marks a major milestone that reflects our joint commitment to advancing open, inclusive, and impactful AI research for Thailand and the global NLP community.
Our accepted papers are:
Main Conference Papers
ThaiInstruct: An Instruction-Following Dataset for Culturally-Aware, Multitask, and Multi-domain Evaluation in Thai

Large language models excel at instruction-following in English, but their performance in low-resource languages like Thai remains underexplored. Existing benchmarks often rely on translations, which miss cultural and domain-specific nuances critical for real-world Thai applications.
Key idea:
ThaiInstruct introduces the first large-scale, human-authored Thai dataset designed for both evaluation and instruction tuning.
Dataset design:
-
**Domains: **Legal, Medical, Finance, Retail
-
**Task types: **Classification, Summarization, Open QA, Closed QA, MCQ, Brainstorming, Creative Writing
**Coverage: **Both general-purpose and culturally-specific instructions
**Quality control: **Built with annotators, domain experts, and AI researchers through a multi-stage process
Findings:
-
Zero-shot evaluation reveals significant performance gaps in Thai, especially on cultural and professional tasks.
-
Instruction tuning on ThaiInstruct outperforms translated-data baselines in both in-domain and out-of-domain benchmarks.
-
Results confirm that native, culturally grounded supervision is crucial for aligning LLMs in diverse linguistic settings.
Prior Prompt Engineering for Reinforcement Fine-Tuning

This paper introduces Prior Prompt Engineering (pPE) as a new dimension in reinforcement fine-tuning (RFT) of language models. Instead of focusing only on algorithms, reward design, or data selection (as most RFT work does), the authors ask: What if the training prompts themselves could systematically guide models toward specific behaviors?
Key idea:
-
At inference, prompt engineering (iPE) uses instructions (e.g., “think step by step”) to guide behaviors.
-
This paper adapts iPE into training-time prompts (pPE), so that models internalize these behaviors during RFT, not just at inference.
Approach:
- Translate five inference-time prompt engineering strategies into prior prompts for training:
-
Reasoning (Chain-of-Thought)
-
Planning (Plan-and-Solve)
-
Code-based reasoning (Program-of-Thought)
-
Knowledge recall (Generated Knowledge)
-
Null-example utilization (Null-Shot)
- Evaluate on in-domain and out-of-domain benchmarks (AIME2024, HumanEval+, GPQA-Diamond).
Findings:
-
All pPE-trained models outperform their inference-time (iPE) baselines.
-
Null-example pPE yields the largest overall gain, even surpassing reasoning prompts on AIME2024 and GPQA-Diamond.
-
Using a behavior-classification framework, the authors show that **different pPE strategies leave distinct behavioral “signatures” **in the trained models.
Workshop Papers
FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning

Venue: FinNLP Workshop @ EMNLP 2025
Reasoning in financial NLP tasks often requires more than general-purpose chain-of-thought (CoT). Prior work mainly explores standard prompting (zero-shot) and unstructured CoT (free-form reasoning), but structured CoT with domain-specific knowledge remains underexplored.
Key idea:
FinCoT introduces a structured chain-of-thought prompting framework grounded in expert-designed financial reasoning blueprints. This helps large language models produce answers that are not only more accurate, but also domain-aligned and interpretable.
Approach:
-
Identify and compare three prompting styles in finance tasks:
-
Standard prompting (zero-shot)
-
Unstructured CoT (free-form reasoning)
-
Structured CoT (explicit reasoning steps)
-
-
Develop FinCoT, embedding expert financial reasoning blueprints into structured CoT prompts.
-
Evaluate across 10 CFA-style financial domains with both general-purpose and finance-specific models.
Findings:
-
FinCoT boosts Qwen3-8B-Base from 63.2% → 80.5% accuracy.
-
FinCoT improves Fin-R1 (7B) from 65.7% → 75.7% accuracy.
-
Reduces output length by up to 8.9x (general model) and 1.16x (finance-specific model).
-
Most effective for models without finance post-training.
Talk Less, Call Right: Enhancing Role-Play LLM Agents with Automatic Prompt Optimization and Role Prompting

Venue: WordPlay Workshop @ EMNLP 2025
In role-playing dialogue settings, tool-augmented LLM agents often over-speak (producing overly long responses) and under-act (misusing or failing to use tools properly in line with their persona). This work explores how prompt design can make role-playing agents more effective, concise, and reliable.
Key idea:
The paper systematically investigates four prompting approaches to address over-speaking and under-acting:
-
Basic role prompting
-
Human-crafted role prompting
-
Automatic Prompt Optimization (APO)
-
Rule-based Role Prompting (RRP)
Approach & Findings:
-
RRP achieved the best performance using two novel techniques:
-
Character-card & scene-contract design
-
Strict enforcement of function calling
-
-
RRP scored 0.571, improving on the zero-shot baseline of 0.519.
-
Compared with APO and other prompting strategies, RRP proved more effective at balancing concise role-play and accurate tool usage.
Open-source contribution:
We released all best-performing prompts and the APO tool to support future development of persona-grounded dialogue agents.
Recognition:
The team also placed Top 10 (8th place) on the API Track of the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 🏆.

Looking Ahead
With two main conference papers (A*) and two workshop papers, EMNLP 2025 is an important milestone for our collaboration.
We’re proud of the joint efforts with partners like SCBX, VISTEC and AI Singapore for pushing the boundaries of NLP research.
💡 If you’re attending EMNLP 2025, come meet us! We’d love to connect at our sessions!



