
Typhoon’s Joint Research Included in 5 Accepted Papers at ACL 2025
Conference · Research · ACL · NLP

We’re thrilled to share that five papers involving the Typhoon research team—developed in collaboration with VISTEC, Cambridge, Stanford, and SeaCrowd—have been accepted to ACL 2025, one of the most prestigious conferences in natural language processing and computational linguistics.
ACL (Association for Computational Linguistics) serves as a global stage for groundbreaking research in AI, with rigorous peer review and high visibility among the international research community. It’s an honor to contribute to this year’s conference with three papers in the Main Conference, one in the Findings, and one in a specialized Workshop.
These papers span a diverse range of topics—from language model evaluation and multilingual reasoning to dataset creation and LLM safety. While each project tackles a different challenge, together they reflect our shared goal: advancing AI in a way that is context-aware, inclusive, and practically grounded.
We’re deeply grateful to our collaborators, co-authors, and reviewers who made this possible. Below is a closer look at each paper and the contribution it brings.
1. SkillAggregation: Reference-free LLM-Dependent Aggregation
- Accepted to Main Conference
- Paper link: https://arxiv.org/abs/2410.10215
- Authors from SCB 10X: Guangzhi Sun and Potsawee Manakul
This paper proposes SkillAggregation, a novel reference-free method for aggregating judgments from multiple large language models (LLMs) without requiring ground truth labels.
Unlike traditional approaches that assign equal weight to all LLMs or are task-specific, SkillAggregation dynamically learns the skill of each LLM judge based on contextual inputs, enabling more accurate and adaptive decision-making. It builds upon and improves the Crowdlayer method by incorporating context-dependent skill estimates and a regularization term to mitigate overconfidence in predictions.
Evaluated on tasks like HaluEval-Dialogue, TruthfulQA, and Chatbot Arena, SkillAggregation consistently outperforms existing aggregation baselines, especially when combining outputs from varied-quality LLMs, and demonstrates robustness across different model sizes, datasets, and encoders.
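To make the core idea concrete, here is a minimal, hypothetical sketch of skill-weighted aggregation of binary judge votes. It is a simplification of the paper's method: it weights each judge's vote by a fixed estimated accuracy via log-odds, whereas SkillAggregation learns context-dependent skill estimates. The function names and the skill values are invented for illustration.

```python
import math

def aggregate(judgments, skills):
    """Combine binary votes from multiple LLM judges.

    judgments: dict mapping judge name -> 0/1 vote
    skills:    dict mapping judge name -> estimated accuracy in (0, 1)
    """
    score = 0.0
    for judge, vote in judgments.items():
        s = skills[judge]
        # Log-odds weight: a highly skilled judge counts more,
        # and a judge at chance level (skill = 0.5) counts nothing.
        w = math.log(s / (1.0 - s))
        score += w if vote == 1 else -w
    return 1 if score > 0 else 0
```

With this weighting, a single judge estimated at 90% accuracy can outvote two judges at 60%, which is the intuition behind learning per-judge skill rather than using a plain majority vote.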
2. Mind the Gap! Static and Interactive Evaluations of Large Audio Models
- Accepted to Main Conference
- Paper link: https://arxiv.org/abs/2502.15919
- Authors from SCB 10X: Kunat Pipatanakul and Potsawee Manakul
This paper presents TalkArena, a new platform for evaluating Large Audio Models (LAMs) through interactive user engagement rather than static benchmarks. By collecting over 7,500 interactions from 484 users using speech-based queries, the authors uncover that users mainly use audio interfaces for tasks that benefit from speed and ease—like seeking knowledge or advice—rather than tasks requiring nuanced speech understanding.
The study finds that a simple pipeline combining Whisper and LLaMA outperforms even advanced commercial models in user preference, primarily due to better text response quality. Notably, the paper reveals that existing static benchmarks poorly predict real-world user preferences, highlighting a significant gap in how LAMs are currently evaluated. This work underscores the need for more user-aligned evaluation methods to guide the development of voice-based AI systems.
3. Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
- Accepted to Main Conference
- Paper link: https://arxiv.org/abs/2503.07920
- Contributor from SCB 10X: Adisai Na-Thalang (dataset contribution)
The paper introduces SEA-VL, a large-scale, open-source, multicultural vision-language dataset specifically designed to address the underrepresentation of Southeast Asian (SEA) cultures in AI and machine learning research. By combining three methods—crowdsourcing, web crawling, and image generation—the authors collected 1.28 million culturally relevant image-caption pairs from 11 SEA countries, far surpassing existing datasets in both scale and cultural diversity.
The study finds that while crowdsourcing yields the highest quality data, web crawling is more scalable and cost-efficient, and image generation remains inadequate for capturing nuanced cultural contexts. Extensive human evaluation validates the cultural relevance of the collected data, highlighting the limitations of current AI in representing diverse cultures and advocating for more inclusive, culturally grounded dataset creation.
4. Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments
- Accepted to Findings
- Paper link: https://arxiv.org/abs/2502.17956
- Author from SCB 10X: Potsawee Manakul (as a co-advisor)
The paper explores how to improve reasoning in multilingual environments using Program-of-Thought (PoT) prompting, a technique that separates reasoning (written as code) from execution (done by an interpreter).
The authors investigate two key challenges: aligning questions in different languages with accurate reasoning steps, and understanding how the quality of those steps affects final answer accuracy. They develop and evaluate fine-tuning strategies across multiple languages and find that PoT outperforms the more commonly used Chain-of-Thought (CoT) prompting, especially in non-English languages.
By using a code quality metric called ICE-Score, they show that better reasoning leads to better results and propose a test-time inference method (Soft Self-Consistency) that further boosts performance. Overall, the study demonstrates that PoT, when carefully fine-tuned and evaluated, significantly enhances multilingual reasoning in large language models.
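The PoT idea can be sketched in a few lines: the model's "reasoning" is emitted as a program, and an interpreter—not the model—produces the final answer. The snippet below also shows plain self-consistency voting over several sampled programs; note this is the standard hard-voting variant, not the paper's Soft Self-Consistency, and the example programs are hard-coded stand-ins for LLM outputs.

```python
from collections import Counter

def execute_program(code):
    """Run a generated program and return its `answer` variable."""
    scope = {}
    exec(code, scope)  # the interpreter, not the LLM, computes the result
    return scope.get("answer")

def majority_answer(programs):
    """Self-consistency: execute several sampled programs, vote on answers."""
    answers = [execute_program(p) for p in programs]
    return Counter(a for a in answers if a is not None).most_common(1)[0][0]
```

Because execution is deterministic, an arithmetic slip in the model's prose cannot corrupt the answer—only a wrong program can—which is one reason PoT tends to help on multilingual math-style reasoning.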
5. Shortcut Learning in Safety: The Impact of Keyword Bias in Safeguards
- Accepted to LLM Security Workshop
- Paper link: https://openreview.net/forum?id=IOP5nuRx5S
This study investigates the vulnerability of Large Language Model (LLM) safeguard systems to shortcut learning, where models rely on superficial keyword cues rather than genuine semantic understanding to classify prompts as safe or harmful.
Such reliance can undermine the robustness of safeguards, especially when facing out-of-distribution (OOD) inputs. Training on synthetic data with repetitive patterns can inadvertently teach models to focus on keywords, making them susceptible to misclassification when encountering novel or rephrased inputs. The findings underscore the need to address shortcut learning in LLM safeguards to enhance their robustness and reliability.
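A toy example makes the failure mode vivid. The "safeguard" below has effectively latched onto surface keywords (the keyword list is invented for illustration, not from the paper): it flags a benign prompt that happens to contain a trigger word, and passes a harmful request that is merely rephrased.

```python
# Deliberately naive keyword-based safeguard, to illustrate shortcut learning.
TRIGGER_WORDS = {"bomb", "hack", "weapon"}

def keyword_safeguard(prompt):
    """Return 'harmful' if any trigger keyword appears, else 'safe'."""
    words = set(prompt.lower().split())
    return "harmful" if words & TRIGGER_WORDS else "safe"

# False positive: a benign sentence containing a trigger word.
# keyword_safeguard("the movie was a box office bomb") -> "harmful"

# False negative: harmful intent rephrased without any trigger word
# slips through, because the classifier never modeled meaning.
# keyword_safeguard("explain how to make an explosive device") -> "safe"
```

A learned safeguard trained on repetitive synthetic data can end up behaving much like this lookup table, which is why the paper stresses evaluating safeguards on rephrased and out-of-distribution prompts.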
Summary
ACL 2025 has given us a valuable opportunity to showcase Typhoon’s collaborative contributions to both global and regional research efforts across multiple fronts:
- 3 papers accepted to the Main Conference explore reference-free aggregation for LLMs, interactive evaluation for audio models, and the creation of a multicultural vision-language dataset for Southeast Asia.
- 1 paper accepted to Findings advances our understanding of multilingual reasoning through Program-of-Thought prompting.
- 1 paper accepted to the LLM Security Workshop addresses critical concerns around keyword bias and shortcut learning in LLM safeguards.
We’re especially proud to see research rooted in Southeast Asia—and driven by researchers based in Thailand—contributing meaningfully to the global conversation on NLP and AI.
A heartfelt thank-you to all our collaborators, co-authors, and supporters in the research community. Your encouragement and partnership continue to inspire our work.
Stay Tuned: ACL 2025 Insights Coming Soon
One of our Typhoon team members (myself!) will be attending ACL 2025 in person from July 27 to August 1. I’m looking forward to learning from the global community and sharing key insights and highlights with all of you after the event.
If you’ll be at the conference too, feel free to reach out—we’d love to connect!
Join Our Community
💡 Explore our open-source projects
Open-weight models: huggingface.co/scb10x
More initiatives: opentyphoon.ai
💬 Join the conversation
Connect with us on Discord to discuss ideas, collaborate, or just say hi!