
Typhoon’s Paper Acceptance at Interspeech 2025: Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
Conference · Research · Interspeech · NLP

We're proud to share that our paper on the development of Typhoon-Audio, "Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models," has been accepted to Interspeech 2025 🎉
As the world’s premier conference in spoken language processing, Interspeech recognizes work that pushes the boundaries of audio AI. This acceptance is a significant step forward in Typhoon’s mission to bring inclusive, multilingual AI to underrepresented communities—starting with Thai, and building a replicable foundation for other low-resource languages.
TL;DR
- Most audio language models are English-centric and underperform in low-resource languages like Thai.
- Our model combines audio understanding with speech instruction following, two capabilities that were previously treated separately.
- We present an integrated architecture and training strategy that improves performance in Thai while retaining strong English capabilities.
Core Problems We Address
Most open-source audio language models (ALMs) are built for English. While some leverage multilingual backbones, they often fail on truly low-resource languages like Thai without dedicated training.
Other key limitations:
- Lack of balance between audio comprehension and instruction following.
- High compute costs for adaptation in low-resource settings.
- Few performance benchmarks for Southeast Asian languages in this space.
Our Goals
We aimed to build an ALM that:
- Improves performance in Thai without compromising English.
- Unifies audio understanding and speech instruction following (Speech IF) in one model.
- Scales to other languages such as Lao, Burmese, and Khmer with minimal retraining.
Model Architecture
We designed a modular architecture that fuses speech and general audio into a shared instruction-following pipeline.
Audio Encoder Backbone
- Whisper-th-large-v3-combined (from biodatlab): fine-tuned for Thai speech; converts speech into rich spectrogram-based embeddings.
- BEATs: captures general non-speech audio (e.g., music, ambient sounds).
Adapter Module (Q-Former)
Maps audio embeddings into a semantic space aligned with language, enabling tighter integration with the LLM. This bridging module lets the model process audio much as it processes text.
LLM Backbone
We use Typhoon-1.5-8B-Instruct, a LLaMA-3-based language model retrained on a balanced Thai-English corpus and fine-tuned on multilingual instruction-following tasks.
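To make the wiring concrete, here is a minimal PyTorch-style sketch of how the three components could be connected: the two audio encoders produce feature sequences, a Q-Former-style adapter compresses them into a fixed set of embeddings in the LLM's input space, and those embeddings are prepended to the text prompt. Module names, dimensions, and the number of query tokens are illustrative assumptions, not the exact configuration reported in the paper.

```python
import torch
import torch.nn as nn

class QFormerAdapter(nn.Module):
    """Learnable query tokens cross-attend to audio features and emit a fixed
    number of vectors projected into the LLM embedding space.
    Dimensions here are illustrative assumptions."""
    def __init__(self, audio_dim=1280, llm_dim=4096, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, audio_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(audio_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats):                      # (B, T, audio_dim)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        fused, _ = self.cross_attn(q, audio_feats, audio_feats)
        return self.proj(fused)                          # (B, n_queries, llm_dim)

class AudioLanguageModel(nn.Module):
    """Dual audio encoders + adapter + LLM backbone. The encoders and LLM are
    passed in as stand-ins (e.g., a Whisper encoder, BEATs, and a causal LM);
    this sketch assumes both encoders emit features of the same width."""
    def __init__(self, speech_encoder, sound_encoder, llm, audio_dim=1280, llm_dim=4096):
        super().__init__()
        self.speech_encoder = speech_encoder             # speech (Whisper-style) features
        self.sound_encoder = sound_encoder               # general audio (BEATs-style) features
        self.adapter = QFormerAdapter(audio_dim, llm_dim)
        self.llm = llm                                   # instruction-tuned LLM backbone

    def forward(self, audio, text_embeds):
        speech_feats = self.speech_encoder(audio)        # (B, T1, audio_dim)
        sound_feats = self.sound_encoder(audio)          # (B, T2, audio_dim)
        audio_feats = torch.cat([speech_feats, sound_feats], dim=1)
        audio_tokens = self.adapter(audio_feats)         # (B, n_queries, llm_dim)
        # Prepend the audio "tokens" to the text prompt embeddings and decode.
        return self.llm(inputs_embeds=torch.cat([audio_tokens, text_embeds], dim=1))
```

In a setup like this, the encoders and LLM are typically kept frozen while the adapter (and, optionally, lightweight layers such as LoRA in the LLM) carry the trainable parameters, which keeps adaptation to a new language relatively cheap.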
Evaluation & Results
We benchmarked the model on tasks spanning both comprehension and generation:
| Task | Metric | Result |
|---|---|---|
| ASR (Automatic Speech Recognition) | ↓ Word Error Rate (WER) | Significant reduction |
| Translation | ↑ BLEU Score | Improved bilingual quality |
| Gender Classification | ↑ Accuracy | Higher accuracy across languages |
| Spoken QA | ↑ F1 Score | Better comprehension & generation |
| Speech Instruction Following (Speech IF) | ↑ Human/GPT-4o Score (1–10) | Clearer, more accurate responses |
| Complex Instruction Following (Complex IF) | ↑ Judge Score (Quality & Format) | Better handling of multi-step tasks |
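As a small illustration of the first two rows, WER and BLEU can be computed with the widely used jiwer and sacrebleu packages; the strings below are placeholders rather than samples from our evaluation sets.

```python
# Minimal scoring sketch: WER for ASR, corpus BLEU for translation.
import jiwer
import sacrebleu

references = ["the weather is nice today"]      # ground-truth transcripts / translations
hypotheses = ["the weather is nice to day"]     # model outputs

wer = jiwer.wer(references, hypotheses)                        # lower is better
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score   # higher is better

print(f"WER:  {wer:.3f}")
print(f"BLEU: {bleu:.1f}")
```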
Summary
While we focus on Thai, this architecture and training strategy are designed to generalize. The modular pipeline and adapter-based design allow easy extension to other low-resource languages in Southeast Asia—with the goal of making inclusive audio AI accessible to all.
This work lays the foundation for a new generation of multilingual ALMs that understand audio, follow spoken instructions, and serve diverse linguistic communities. We're excited to present it at Interspeech 2025 and contribute to a more inclusive future for speech AI.
Join Our Community
💡 Explore our open-source projects
Open-weight models: huggingface.co/scb10x
More initiatives: opentyphoon.ai
💬 Join the conversation
Connect with us on Discord to discuss ideas, collaborate, or just say hi!