Typhoon’s Paper Acceptance at Interspeech 2025: Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

Conference · Research · Interspeech · NLP

A recap of our paper accepted to Interspeech 2025.

Oravee (Orn) Smithiphol

June 17, 2025

We're proud to share that our paper on the development of Typhoon-Audio, "Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models," has been accepted to Interspeech 2025 🎉

As the world’s premier conference in spoken language processing, Interspeech recognizes work that pushes the boundaries of audio AI. This acceptance is a significant step forward in Typhoon’s mission to bring inclusive, multilingual AI to underrepresented communities—starting with Thai, and building a replicable foundation for other low-resource languages.

TL;DR

  • Most audio LMs are English-centric, underperforming in low-resource languages like Thai.

  • Our model combines audio understanding + speech instruction following—two capabilities that were previously treated separately.

  • We present an integrated architecture and training strategy that improves performance in Thai while retaining strong English capabilities.

  • 📄 Read the full paper

Core Problems We Address

Most open-source audio language models (ALMs) are built for English. While some leverage multilingual backbones, they often fail on truly low-resource languages like Thai without dedicated training.

Other key limitations:

  • Lack of balance between audio comprehension and instruction-following.

  • High compute costs for adaptation in low-resource settings.

  • Minimal performance benchmarks for Southeast Asian languages in this space.

Our Goals

We aimed to build an ALM that:

  • Improves performance in Thai, without compromising on English.

  • Unifies audio understanding and instruction-following (Speech IF) in one model.

  • Scales to other languages like Lao, Burmese, and Khmer with minimal retraining.

Model Architecture

Figure: Typhoon-Audio model architecture

We designed a modular architecture that fuses speech and general audio into a shared instruction-following pipeline.

Audio Encoder Backbone

  • Whisper-th-large-v3-combined (from biodatlab): fine-tuned for Thai speech; it transforms speech into rich spectrogram-based embeddings (see the loading sketch after this list).

  • BEATs: captures general non-speech audio (e.g., music, ambient sounds).
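Below is a minimal sketch, assuming the Hugging Face transformers API and the biodatlab checkpoint named above, of how frame-level speech embeddings can be pulled from the Whisper encoder. BEATs features for non-speech audio come from its own reference implementation and are combined with these before the adapter (see the paper for the exact fusion); the function name, repo id, and shapes here are illustrative assumptions, not the paper's training code.

```python
# A minimal sketch (not the paper's training code): extracting speech
# embeddings with the Thai-fine-tuned Whisper encoder via Hugging Face
# transformers. Repo id, function name, and shapes are assumptions.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

CHECKPOINT = "biodatlab/whisper-th-large-v3-combined"  # assumed HF repo id
feature_extractor = WhisperFeatureExtractor.from_pretrained(CHECKPOINT)
speech_encoder = WhisperModel.from_pretrained(CHECKPOINT).get_encoder()

def encode_speech(waveform, sampling_rate=16_000):
    """Turn a raw waveform into frame-level Whisper encoder embeddings."""
    features = feature_extractor(
        waveform, sampling_rate=sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        encoded = speech_encoder(features.input_features)
    return encoded.last_hidden_state  # shape: (1, num_frames, hidden_size)
```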

Adapter Module (Q-Former)

Maps audio embeddings into a semantic space aligned with language, enabling better integration with LLMs. This bridging module ensures that the model can understand audio like it does text.
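As a rough illustration of that bridging role, here is a minimal Q-Former-style adapter sketch: a small set of learnable query vectors cross-attends to the audio embeddings, and the result is projected into the LLM's embedding space. The class name, dimensions, and single attention layer are placeholders rather than the paper's configuration.

```python
# Illustrative Q-Former-style adapter (placeholder sizes, single layer).
import torch
import torch.nn as nn

class AudioQFormerAdapter(nn.Module):
    def __init__(self, audio_dim=1280, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable query tokens that summarize the audio sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(audio_dim, llm_dim)  # map into the LLM's embedding space

    def forward(self, audio_embeds):  # audio_embeds: (batch, frames, audio_dim)
        queries = self.queries.unsqueeze(0).expand(audio_embeds.size(0), -1, -1)
        fused, _ = self.cross_attn(queries, audio_embeds, audio_embeds)
        return self.proj(fused)       # (batch, num_queries, llm_dim)
```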

LLM Backbone

We use Typhoon-1.5-8B-Instruct, a LLaMA3-based language model retrained on a balanced Thai-English corpus and fine-tuned on multilingual instruction-following tasks.
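To show how the pieces above could fit together at inference time, here is a hedged sketch: the adapter's audio tokens are prepended to the embedded text prompt and passed to the LLM through inputs_embeds. The Hugging Face repo id and the bare prompt handling are assumptions, not the exact recipe from the paper.

```python
# Hedged sketch of feeding adapter outputs to the LLM (assumed repo id,
# simplified prompt handling; the paper's prompt template is not shown).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LLM_ID = "scb10x/llama-3-typhoon-v1.5-8b-instruct"  # assumed HF repo id
tokenizer = AutoTokenizer.from_pretrained(LLM_ID)
llm = AutoModelForCausalLM.from_pretrained(LLM_ID, torch_dtype=torch.bfloat16)

def answer_with_audio(audio_tokens, prompt, max_new_tokens=128):
    """audio_tokens: (1, num_queries, llm_hidden) from the adapter sketch above."""
    text_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(text_ids)
    # Prepend the audio tokens to the embedded text prompt.
    inputs_embeds = torch.cat([audio_tokens.to(text_embeds.dtype), text_embeds], dim=1)
    output_ids = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```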

Evaluation & Results

We benchmarked the model on tasks spanning both comprehension and generation:

Task | Metric | Result
ASR (Automatic Speech Recognition) | ↓ Word Error Rate (WER) | Significant reduction
Translation | ↑ BLEU Score | Improved bilingual quality
Gender Classification | ↑ Accuracy | Higher accuracy across languages
Spoken QA | ↑ F1 Score | Better comprehension & generation
Speech Instruction Following (Speech IF) | ↑ Human/GPT-4o Score (1–10) | Clearer, more accurate responses
Complex Instruction Following (Complex IF) | ↑ Judge Score (Quality & Format) | Better handling of multi-step tasks
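For readers who want to reproduce the two headline metrics locally, this is a minimal sketch using the commonly available jiwer (WER) and sacrebleu (BLEU) packages; the paper's exact evaluation tooling, text normalization, and Thai word segmentation may differ, and the example strings are placeholders.

```python
# Minimal metric sketch with jiwer and sacrebleu. Note: Thai transcripts are
# usually word-segmented before WER is computed, which is omitted here.
import jiwer
import sacrebleu

references = ["the weather is nice today"]   # ground-truth transcripts / translations
hypotheses = ["the weather is nice to day"]  # model outputs

wer = jiwer.wer(references, hypotheses)                       # lower is better
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score  # higher is better
print(f"WER: {wer:.3f}  BLEU: {bleu:.1f}")
```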

Summary

While we focus on Thai, this architecture and training strategy are designed to generalize. The modular pipeline and adapter-based design allow easy extension to other low-resource languages in Southeast Asia—with the goal of making inclusive audio AI accessible to all.

This work lays the foundation for a new generation of multilingual ALMs that understand audio, follow spoken instructions, and serve diverse linguistic communities. We're excited to present it at Interspeech 2025 and contribute to a more inclusive future for speech AI.

Read the paper on arXiv

Join Our Community

💡 Explore our open-source projects

Open-weight models: huggingface.co/scb10x

More initiatives: opentyphoon.ai

💬 Join the conversation

Connect with us on Discord to discuss ideas, collaborate, or just say hi!