
Typhoon’s Paper Acceptance at Interspeech 2025: Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models


Conference · Research · Interspeech · NLP
Oravee (Orn) Smithiphol
June 17, 2025

Table of Contents

  • TL;DR
  • Core Problems We Address
  • Our Goals
  • Model Architecture
  • Evaluation & Results
  • Summary
  • Join Our Community

We're proud to share that our paper on the development of Typhoon-Audio, "Enhancing Low-Resource Language and Instruction-Following Abilities of Audio Language Models," has been accepted to Interspeech 2025 🎉

As the world’s premier conference in spoken language processing, Interspeech recognizes work that pushes the boundaries of audio AI. This acceptance is a significant step forward in Typhoon’s mission to bring inclusive, multilingual AI to underrepresented communities—starting with Thai, and building a replicable foundation for other low-resource languages.

TL;DR

  • Most audio LMs are English-centric, underperforming in low-resource languages like Thai.

  • Our model combines audio understanding + speech instruction following—two capabilities that were previously treated separately.

  • We present an integrated architecture and training strategy that improves performance in Thai while retaining strong English capabilities.

  • 📄 Read the full paper

Core Problems We Address

Most open-source audio language models (ALMs) are built for English. While some leverage multilingual backbones, they often fail on truly low-resource languages like Thai without dedicated training.

Other key limitations:

  • Lack of balance between audio comprehension and instruction-following.

  • High compute costs for adaptation in low-resource settings.

  • Minimal performance benchmarks for Southeast Asian languages in this space.

Our Goals

We aimed to build an ALM that:

  • Improves performance in Thai without compromising English.

  • Unifies audio understanding and instruction-following (Speech IF) in one model.

  • Scales to other languages like Lao, Burmese, and Khmer with minimal retraining.

Model Architecture

[Figure: Typhoon-Audio model architecture]

We designed a modular architecture that fuses speech and general audio into a shared instruction-following pipeline.

Audio Encoder Backbone

  • Whisper-th-large-v3-combined (from biodatlab): fine-tuned for Thai speech; transforms speech into rich spectrogram-based embeddings (a minimal extraction sketch follows this list).

  • BEATs: captures general non-speech audio (e.g., music, ambient sounds).
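
To make the encoder stage concrete, here is a minimal sketch of pulling speech embeddings from the Thai fine-tuned Whisper encoder via Hugging Face transformers. The checkpoint ID follows the biodatlab model named above; the BEATs branch is omitted because BEATs is distributed outside transformers (see the microsoft/unilm repository).

```python
# Minimal sketch: Thai speech features from the biodatlab Whisper encoder.
# Assumes `torch` and Hugging Face `transformers`; the BEATs branch (general
# non-speech audio) is omitted because BEATs ships outside `transformers`.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

model_id = "biodatlab/whisper-th-large-v3-combined"  # checkpoint named in the post
extractor = WhisperFeatureExtractor.from_pretrained(model_id)
encoder = WhisperModel.from_pretrained(model_id).encoder

waveform = torch.randn(16_000 * 5)  # stand-in for 5 s of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    # (1, 1500, 1280): one embedding per downsampled spectrogram frame
    speech_embeds = encoder(inputs.input_features).last_hidden_state
print(speech_embeds.shape)
```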

Adapter Module (Q-Former)

Maps audio embeddings into a semantic space aligned with language, enabling better integration with LLMs. This bridging module lets the model reason over audio much as it does over text.
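
As an illustration of the core mechanism only, the sketch below shows a BLIP-2-style Q-Former in miniature: a set of learnable queries cross-attends to the audio frames, compressing a variable-length sequence into a fixed set of tokens projected into the LLM's embedding space. All dimensions and names here are illustrative assumptions, not the released implementation (a full Q-Former also stacks self-attention and feed-forward layers).

```python
# Simplified Q-Former-style adapter sketch (illustrative, not the paper's code).
import torch
import torch.nn as nn

class QFormerAdapter(nn.Module):
    def __init__(self, audio_dim=1280, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable query tokens that summarize the audio sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(audio_dim, llm_dim)  # map into the LLM token space

    def forward(self, audio_embeds):  # audio_embeds: (B, T, audio_dim)
        q = self.queries.unsqueeze(0).expand(audio_embeds.size(0), -1, -1)
        fused, _ = self.cross_attn(q, audio_embeds, audio_embeds)
        return self.proj(fused)        # (B, num_queries, llm_dim)

adapter = QFormerAdapter()
audio_tokens = adapter(torch.randn(2, 1500, 1280))
print(audio_tokens.shape)  # torch.Size([2, 32, 4096])
```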

LLM Backbone

We use Typhoon-1.5-8B-Instruct, a LLaMA-3-based language model retrained on a balanced Thai–English corpus and fine-tuned on multilingual instruction-following tasks.
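
Assuming the open scb10x/llama-3-typhoon-v1.5-8b-instruct checkpoint on Hugging Face, a hedged sketch of the final integration step, splicing the adapter's audio tokens in front of the text-prompt embeddings, might look like this (the released model's exact prompt formatting may differ):

```python
# Hedged sketch: prepend adapter outputs to text embeddings and generate.
# Assumes the scb10x/llama-3-typhoon-v1.5-8b-instruct checkpoint (hidden
# size 4096) and an adapter producing 32 audio tokens, as sketched above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "scb10x/llama-3-typhoon-v1.5-8b-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
llm = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt_ids = tok("Transcribe the audio clip above.", return_tensors="pt").input_ids
text_embeds = llm.get_input_embeddings()(prompt_ids)              # (1, L, 4096)
audio_tokens = torch.randn(1, 32, 4096, dtype=text_embeds.dtype)  # adapter-output stand-in
inputs_embeds = torch.cat([audio_tokens, text_embeds], dim=1)

out = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```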

Evaluation & Results

We benchmarked the model on tasks spanning both comprehension and generation:

| Task | Metric | Result |
| --- | --- | --- |
| ASR (Automatic Speech Recognition) | ↓ Word Error Rate (WER) | Significant reduction |
| Translation | ↑ BLEU score | Improved bilingual quality |
| Gender Classification | ↑ Accuracy | Higher accuracy across languages |
| Spoken QA | ↑ F1 score | Better comprehension & generation |
| Speech Instruction Following (Speech IF) | ↑ Human/GPT-4o score (1–10) | Clearer, more accurate responses |
| Complex Instruction Following (Complex IF) | ↑ Judge score (quality & format) | Better handling of multi-step tasks |
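
As a concrete example of the first metric: WER counts the substitutions, deletions, and insertions needed to turn a hypothesis into the reference transcript, divided by the number of reference words. A minimal illustration with the open-source jiwer package (an illustration of the metric, not the paper's evaluation harness):

```python
# Word Error Rate = (substitutions + deletions + insertions) / reference words.
# `jiwer` is a common open-source WER implementation: pip install jiwer
import jiwer

reference = "typhoon brings inclusive audio ai to thai speakers"
hypothesis = "typhoon brings inclusive audio to thai speaker"

wer = jiwer.wer(reference, hypothesis)
print(f"WER = {wer:.2f}")  # 2 errors over 8 reference words -> 0.25
```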

Summary

While we focus on Thai, this architecture and training strategy are designed to generalize. The modular pipeline and adapter-based design allow easy extension to other low-resource languages in Southeast Asia—with the goal of making inclusive audio AI accessible to all.

This work lays the foundation for a new generation of multilingual ALMs that understand audio, follow spoken instructions, and serve diverse linguistic communities. We're excited to present it at Interspeech 2025 and contribute to a more inclusive future for speech AI.

Read the paper on arXiv

Join Our Community

💡 Explore our open-source projects

Open-weight models: huggingface.co/scb10x

More initiatives: opentyphoon.ai

💬 Join the conversation

Connect with us on Discord to discuss ideas, collaborate, or just say hi!


© 2025 SCB 10X Co., Ltd. All rights reserved.