
SEA-LION x Typhoon: Exploring Cross-Lingual Audio Modeling in Southeast Asia

Typhoon · SEA-LION · Audio · Research · Southeast Asia
Surapon Nonesung
July 01, 2025

Table of Contents

  • Introduction
  • From Local Training to Regional Generalization
  • Experimental Setup
  • Key Observations
    • Table 1: Thai-English Trained Models
    • Table 2: Comparison with Multilingual SEA Models
  • Insights: Cross-Lingual Transfer in Low-Resource Scenarios
  • Limitations
  • Future Directions
  • Conclusion

Introduction

Southeast Asia (SEA) is home to hundreds of languages, many of which are underrepresented in cutting-edge AI applications such as Large Language Models (LLMs). As the frontiers of LLMs expand into multimodal competencies spanning text, audio, images, and video, large-scale, creative, and contextualised collaboration is urgently needed to prepare the region for these new paradigms.

This blog highlights an important collaboration between AI Singapore (AISG) and SCB 10X R&D’s Typhoon team, investigating cross-lingual effects in audio-language models. Since audio data is available only for the major SEA languages (e.g., Thai, Indonesian, and Tamil), we explore whether the cross-lingual capabilities of SEA models can be exploited, using only a few SEA languages to lift the performance of other, lower-resource SEA languages as well. Using Thai as a starting point, we dive into an empirical analysis: how far can a Thai-English trained model generalize to other SEA languages without direct exposure?

We developed and investigated SEA-LION-TH-Audio, a multimodal LLM fine-tuned on Thai and English audio tasks. It reveals compelling cross-lingual generalization behavior, matching or outperforming other audio LLMs on languages absent from its training data, especially in zero-shot scenarios such as Indonesian-to-Thai and Thai-to-Tamil translation.

[Figure: Typhoon audio model architecture]

From Local Training to Regional Generalization

SEA-LION-TH-Audio is derived from the Typhoon audio model family, trained on under 1,000 hours of Thai-English data. Its architecture is tuned for instruction-style generalization, which contrasts with large multilingual models like MERaLiON (120k+ hours) and SeaLLMs-Audio.

Our research focuses on using it as a probe for cross-lingual transfer in SEA, especially in low-resource settings where multilingual data is not always available.
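
To make this concrete, below is a minimal, hypothetical sketch of such a zero-shot probe: feeding the model speech in a language it never saw during training and asking for a Thai translation. The model identifier, processor interface, and prompt format are illustrative assumptions rather than the exact SEA-LION-TH-Audio API; the actual usage is documented on the model card linked in the Resources section.

```python
# Hypothetical sketch of a zero-shot cross-lingual probe. The model ID,
# processor interface, and prompt are illustrative assumptions, not the
# exact SEA-LION-TH-Audio API.
import librosa
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "aisingapore/SEA-LION-TH-Audio"  # hypothetical identifier

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# An Indonesian utterance: the model saw no Indonesian during training,
# so a correct Thai translation here is evidence of zero-shot transfer.
audio, sr = librosa.load("indonesian_utterance.wav", sr=16_000)

prompt = "Listen to the audio and translate the speech into Thai."
inputs = processor(
    text=prompt, audios=audio, sampling_rate=sr, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```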

Experimental Setup

We evaluated several models—including Typhoon, SEA-LION variants, MERaLiON, and SeaLLMs—on a range of multilingual tasks (a sketch of the metric computation follows this list):

  • Translation (BLEU) across seen (Thai↔English) and unseen (Indonesian↔Thai, Thai→Tamil) SEA language directions. We use Tamil and Indonesian because test data is available for these SEA languages. In addition, we also evaluate X→Th translation, where X ranges over multiple languages (e.g., English, Chinese, Spanish, French, Arabic)

  • ASR (Word Error Rate) in Thai and English

  • Speaker Gender Classification (Thai)
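
For concreteness, the first two metrics can be computed with standard open-source libraries. Below is a minimal sketch using sacrebleu for BLEU and jiwer for WER; the hypotheses and references are toy placeholders, not outputs from our actual runs.

```python
# Minimal metric sketch: sacrebleu for BLEU, jiwer for WER.
# Hypotheses and references are toy placeholders, not our data.
import jiwer
import sacrebleu

# Translation quality: corpus-level BLEU over model outputs vs. references.
hypotheses = ["the cat sits on the mat"]
references = [["the cat sat on the mat"]]  # one inner list per reference set
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")

# ASR quality: word error rate between reference transcripts and outputs.
# Note: Thai is written without spaces, so Thai WER assumes the text has
# been word-segmented first (otherwise character error rate is used).
refs = ["hello world this is a test"]
hyps = ["hello word this is test"]
print(f"WER: {jiwer.wer(refs, hyps):.3f}")
```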

To isolate cross-lingual effects, we grouped models by training configuration (see the sketch after this list):

  • Trained only on Thai-English (e.g., SEA-LION, Typhoon)

  • Trained on broader SEA languages (e.g., MERaLiON, SeaLLMs)
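
Expressed as a simple configuration mapping (the labels below are shorthand for this write-up, not exact checkpoint names), the grouping that drives the comparison looks like this:

```python
# Illustrative grouping of evaluated models by training configuration;
# labels are shorthand, not exact checkpoint names.
MODEL_GROUPS = {
    "thai_english_only": ["SEA-LION-TH-Audio", "Typhoon2-Audio"],
    "broad_sea_multilingual": ["MERaLiON", "SeaLLMs-Audio"],
}

for group, models in MODEL_GROUPS.items():
    print(f"{group}: {', '.join(models)}")
```

Any gap between the two groups on unseen-language tasks can then be read as the trade-off between broad multilingual training and cross-lingual transfer from focused bilingual training.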

Key Observations

Table 1: Thai-English Trained Models

  1. Same data, different model: SEA-LION trailed Typhoon2 slightly on seen languages but outperformed it on unseen ones, showing better zero-shot transfer. The gaps, however, remain consistently modest (<1 BLEU).

  2. Adding Thai data matters: SEA-LION trained on Thai+English outperformed the English-only version. This suggests SEA-specific cross-lingual grounding helps generalization.

Table 2: Comparison with Multilingual SEA Models

  1. ASR strength: SEA-LION (Thai+English) outperforms larger SEA models (MERaLiON, SeaLLMs) in Thai ASR—even without multilingual training.

  2. Seen-language advantage: While SeaLLMs and MERaLiON have broader multilingual support, SEA-LION performs well in seen-language tasks, emphasizing the value of focused fine-tuning.

  3. Smaller model, strong results: Even with fewer training hours, SEA-LION demonstrates effective cross-lingual audio understanding in tasks like Thai→Indonesian and Thai→Tamil.

Insights: Cross-Lingual Transfer in Low-Resource Scenarios

Our study suggests that focused bilingual training can unlock surprising cross-lingual capabilities, particularly cross-lingual knowledge transfer among SEA languages. For instance, Thai ↔ Indonesian and Thai → Tamil showed strong generalization, despite no Indonesian or Tamil data being seen during training.

These findings echo a key question for AI model research on Southeast Asian languages:

Can we generalize well with limited language data by leveraging cross-lingual effects?

Our results suggest: yes, with the right pretraining and fine-tuning strategy.

Limitations

  • Restricted language coverage: SEA-LION sees only Thai and English during training. Its zero-shot generalization works, but it does not always beat true multilingual models.

  • No speech-to-speech capability: Unlike Typhoon2, SEA-LION is limited to audio-to-text outputs.

  • English ASR performance is relatively weak, likely due to limited English audio exposure.

Future Directions

We see several avenues for expanding on this cross-lingual potential:

  • Multilingual Fine-Tuning: Add more SEA languages (e.g., Malay, Vietnamese) to extend coverage.

  • Speech-to-Speech Modeling: Introduce direct audio-to-audio capabilities.

  • Data-efficient Learning: Investigate how far small, focused datasets can go with the right architectural biases.

  • Regional Collaboration: Continue building open resources with partners like AI Singapore to support low-resource languages across the region.

Conclusion

SEA-LION-TH-Audio is not a story of the strongest model—it’s a case study in how local training can lead to regional generalization. By studying cross-lingual effects in focused Thai-English models, we find surprising potential in zero-shot transfer to other SEA languages.

In a region as linguistically diverse as Southeast Asia, efficient, generalizable, and accessible models are the key to inclusive AI development. Through collaboration, we aim to push this frontier.

Resources

  • SEA-LION-THAI-AUDIO’s Weights on Hugging Face

  • GitHub Repository


© 2025 SCB 10X Co., Ltd. All rights reserved.