
Typhoon’s Paper Acceptance at Interspeech 2025: Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
Conference · Research · Interspeech · NLP

We're proud to share that our paper on the development of Typhoon-Audio, "Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models," has been accepted to Interspeech 2025 🎉
As the world’s premier conference in spoken language processing, Interspeech recognizes work that pushes the boundaries of audio AI. This acceptance is a significant step forward in Typhoon’s mission to bring inclusive, multilingual AI to underrepresented communities—starting with Thai, and building a replicable foundation for other low-resource languages.
TL;DR
- Most audio language models are English-centric and underperform in low-resource languages like Thai.
- Our model combines audio understanding with speech instruction following, two capabilities that were previously treated separately.
- We present an integrated architecture and training strategy that improves performance in Thai while retaining strong English capabilities.
Core Problems We Address
Most open-source audio language models (ALMs) are built for English. While some leverage multilingual backbones, they often fail on truly low-resource languages like Thai without dedicated training.
Other key limitations:
- Lack of balance between audio comprehension and instruction following.
- High compute costs for adaptation in low-resource settings.
- Few performance benchmarks for Southeast Asian languages in this space.
Our Goals
We aimed to build an ALM that:
- Improves performance in Thai without compromising English.
- Unifies audio understanding and speech instruction following (Speech IF) in one model.
- Scales to other languages such as Lao, Burmese, and Khmer with minimal retraining.
Model Architecture
We designed a modular architecture that fuses speech and general audio into a shared instruction-following pipeline.
Audio Encoder Backbone
- Whisper-th-large-v3-combined (from biodatlab): fine-tuned for Thai speech; converts speech into rich spectrogram-based embeddings.
- BEATs: captures general non-speech audio (e.g., music, ambient sounds).
Adapter Module (Q-Former)
Maps audio embeddings into a semantic space aligned with language, enabling tighter integration with the LLM. This bridging module lets the model process audio much as it processes text.
LLM Backbone
We use Typhoon-1.5-8B-Instruct, a LLaMA-3-based language model retrained on a balanced Thai-English corpus and fine-tuned on multilingual instruction-following tasks.
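To make the wiring concrete, here is a minimal PyTorch-style sketch of how the three components could be connected: the two audio encoders produce feature sequences, a Q-Former-style adapter compresses them into a fixed set of embeddings in the LLM's input space, and those embeddings are prepended to the text prompt. Module names, dimensions, and the number of query tokens are illustrative assumptions, not the exact configuration reported in the paper.

```python
import torch
import torch.nn as nn

class QFormerAdapter(nn.Module):
    """Learnable query tokens cross-attend to audio features and emit a fixed
    number of vectors projected into the LLM embedding space.
    Dimensions here are illustrative assumptions."""
    def __init__(self, audio_dim=1280, llm_dim=4096, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, audio_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(audio_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats):                      # (B, T, audio_dim)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        fused, _ = self.cross_attn(q, audio_feats, audio_feats)
        return self.proj(fused)                          # (B, n_queries, llm_dim)

class AudioLanguageModel(nn.Module):
    """Dual audio encoders + adapter + LLM backbone. The encoders and LLM are
    passed in as stand-ins (e.g., a Whisper encoder, BEATs, and a causal LM);
    this sketch assumes both encoders emit features of the same width."""
    def __init__(self, speech_encoder, sound_encoder, llm, audio_dim=1280, llm_dim=4096):
        super().__init__()
        self.speech_encoder = speech_encoder             # speech (Whisper-style) features
        self.sound_encoder = sound_encoder               # general audio (BEATs-style) features
        self.adapter = QFormerAdapter(audio_dim, llm_dim)
        self.llm = llm                                   # instruction-tuned LLM backbone

    def forward(self, audio, text_embeds):
        speech_feats = self.speech_encoder(audio)        # (B, T1, audio_dim)
        sound_feats = self.sound_encoder(audio)          # (B, T2, audio_dim)
        audio_feats = torch.cat([speech_feats, sound_feats], dim=1)
        audio_tokens = self.adapter(audio_feats)         # (B, n_queries, llm_dim)
        # Prepend the audio "tokens" to the text prompt embeddings and decode.
        return self.llm(inputs_embeds=torch.cat([audio_tokens, text_embeds], dim=1))
```

In a setup like this, the encoders and LLM are typically kept frozen while the adapter (and, optionally, lightweight layers such as LoRA in the LLM) carry the trainable parameters, which keeps adaptation to a new language relatively cheap.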
Evaluation & Results
We benchmarked the model on tasks spanning both comprehension and generation:
| Task | Metric | Result |
|---|---|---|
| ASR (Automatic Speech Recognition) | ↓ Word Error Rate (WER) | Significant reduction |
| Translation | ↑ BLEU Score | Improved bilingual quality |
| Gender Classification | ↑ Accuracy | Higher accuracy across languages |
| Spoken QA | ↑ F1 Score | Better comprehension & generation |
| Speech Instruction Following (Speech IF) | ↑ Human/GPT-4o Score (1–10) | Clearer, more accurate responses |
| Complex Instruction Following (Complex IF) | ↑ Judge Score (Quality & Format) | Better handling of multi-step tasks |
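As a small illustration of the first two rows, WER and BLEU can be computed with the widely used jiwer and sacrebleu packages; the strings below are placeholders rather than samples from our evaluation sets.

```python
# Minimal scoring sketch: WER for ASR, corpus BLEU for translation.
import jiwer
import sacrebleu

references = ["the weather is nice today"]      # ground-truth transcripts / translations
hypotheses = ["the weather is nice to day"]     # model outputs

wer = jiwer.wer(references, hypotheses)                        # lower is better
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score   # higher is better

print(f"WER:  {wer:.3f}")
print(f"BLEU: {bleu:.1f}")
```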
Summary
While we focus on Thai, this architecture and training strategy are designed to generalize. The modular pipeline and adapter-based design allow easy extension to other low-resource languages in Southeast Asia—with the goal of making inclusive audio AI accessible to all.
This work lays the foundation for a new generation of multilingual ALMs that understand audio, follow spoken instructions, and serve diverse linguistic communities. We're excited to present it at Interspeech 2025 and contribute to a more inclusive future for speech AI.
Join Our Community
💡 Explore our open-source projects
Open-weight models: huggingface.co/scb10x
More initiatives: opentyphoon.ai
💬 Join the conversation
Connect with us on Discord to discuss ideas, collaborate, or just say hi!