
Typhoon 2 Multimodal Release (Research Preview)

We’re excited to share the research preview of our Typhoon 2 multimodal models, two cutting-edge models designed for Thai-language and multimodal applications:
- Typhoon2-Vision: A model optimized for visual data, featuring advanced OCR capabilities for Thai documents, Chart VQA, and more. It enables highly accurate text extraction and context-aware reasoning for tasks like document understanding and visual question answering.
- Typhoon2-Audio: An end-to-end model that processes and generates both text and audio. It performs well on speech-centric tasks like transcription, audio captioning, and speech-to-speech translation, offering robust multi-turn dialogue support and text-to-speech capabilities.
Building on the success of Typhoon 2’s text models, these multimodal advancements take Thai AI capabilities to the next level. With dedicated features for processing visual and audio data, Typhoon 2’s multimodal models are built to solve real-world challenges in Thai-specific contexts.
These models reflect our commitment to building accessible, state-of-the-art tools tailored to Thai language and multimodal use cases.
Typhoon 2 Vision
Typhoon2-Vision is a cutting-edge vision-language model built on Qwen2-VL, designed to excel at understanding and processing visual and text data. With advanced architectures like Vision Transformer (ViT) and Multimodal Rotary Position Embedding (M-ROPE), the model offers enhanced visual understanding by interpreting complex images and integrating them with textual data for multimodal tasks.
Through bilingual data preparation with translation and distillation techniques, Typhoon2-Vision achieves superior performance in Thai document understanding while maintaining both visual and contextual accuracy.
Explore Typhoon2-Vision at https://vision.opentyphoon.ai.
Key Features
- Advanced OCR Capabilities: Enables seamless, accurate text extraction even from highly complex visuals and Thai documents, and can extract data from charts and graphs, including financial visuals.
- Bilingual Data Handling: Enhances performance in tasks like Thai document understanding by leveraging model training on both Thai and English datasets.
Performance
To evaluate and understand Typhoon2-Vision’s performance, we tested its capabilities across Thai and English tasks using four evaluation datasets per language. The evaluation relied on two key metrics:
- Accuracy: Evaluates how often the model makes correct predictions, especially for OCR and visual question answering tasks.
- ROUGE-L: Measures the quality of text outputs by comparing them with reference answers to ensure relevance and coherence.
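For readers unfamiliar with ROUGE-L, it scores the longest common subsequence (LCS) of tokens shared between a model output and a reference answer. A minimal sketch of the F1 variant follows; the actual evaluation likely uses a standard library implementation with its own tokenization, so treat this as illustrative only:

```python
def lcs_len(a, b):
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """Token-level ROUGE-L F1 between a candidate and a reference string."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

The official ROUGE-L definition uses a recall-weighted F-measure; the balanced F1 shown here is a common simplification.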
| Benchmark | Metric | Llama-3.2 11B-Instruct | Qwen2-VL 7B-Instruct | Pathumma Vision-1.0.0-8B | Typhoon2-llama-3.2 11B-Instruct (Exp) | Typhoon2-qwen2vl 7B-vision-instruct |
|---|---|---|---|---|---|---|
| OCRBench | ROUGE-L | 72.84% | 72.31% | 32.74% | 81.2% | 64.38% |
| OCRBench | Accuracy | 51.1% | 57.9% | 25.87% | 71.7% | 49.6% |
| MMBench (Dev) | Accuracy | 76.54% | 84.1% | 19.51% | 83.66% | 83.66% |
| ChartQA | ROUGE-L | 13.41% | 47.45% | 64.2% | 74.12% | 75.71% |
| ChartQA | Accuracy | x | 45% | 57.83% | 67.36% | 72.56% |
| TextVQA | ROUGE-L | 32.82% | 91.4% | 32.54% | 89.44% | 91.45% |
| TextVQA | Accuracy | x | 88.7% | 28.84% | 85.74% | 88.97% |
| OCR (TH) | ROUGE-L | 64.41% | 56.47% | 6.38% | 79.51% | 64.24% |
| OCR (TH) | Accuracy | 35.58% | 55.34% | 2.88% | 58.65% | 63.11% |
| M3Exam Images (TH) | Accuracy | 25.46% | 32.17% | 29.01% | 27.93% | 33.67% |
| GQA (TH) | ROUGE-L | 31.33% | 34.55% | 10.2% | 44.51% | 50.25% |
| MTVQ (TH) | ROUGE-L | 11.21% | 23.39% | 7.63% | 15.2% | 30.59% |
| MTVQ (TH) | Accuracy | 4.31% | 13.79% | 1.72% | 7.56% | 21.55% |
| Average | ROUGE-L | 37.67% | 54.26% | 25.61% | 64.16% | 62.77% |
| Average | Accuracy | x | 53.85% | 23.67% | 58.75% | 59.02% |

ROUGE-L was not evaluated for MMBench (Dev) or M3Exam Images (TH), and Accuracy was not evaluated for GQA (TH).
Typhoon2-Vision demonstrates strong performance in Thai-centric vision-language applications, leading or closely matching the baselines on benchmarks such as ChartQA, TextVQA, OCR (TH), M3Exam Images (TH), and MTVQ (TH), with notable improvements in Thai document understanding. The 7B Typhoon2-qwen2vl variant stands out in particular: despite its smaller parameter count, it posts the best average accuracy of the models tested, making it an efficient choice for Thai-focused vision-language use cases.
Typhoon2-Audio
Typhoon2-Audio is a versatile end-to-end model designed for speech processing and generation, seamlessly bridging audio, speech, and text modalities. Built on the SALMONN and Llama-Omni architectures, the model performs well in tasks such as transcription, audio captioning, and speech translation. It uses Thai-English bilingual datasets combined with carefully designed pre-training and fine-tuning strategies to achieve solid results across benchmarks.
With notable results in tasks such as audio captioning and speech instruction following, Typhoon2-Audio offers valuable solutions for Thai-centric audio and speech applications.
Explore Typhoon2-Audio at https://audio.opentyphoon.ai/
Key Features
- Parallel Text and Audio Outputs: Generates text and audio simultaneously, reducing latency and enhancing efficiency.
- Extended Context Windows: Handles audio inputs of up to 30 seconds, enabling more detailed and comprehensive speech analysis.
- Enhanced Instruction-Following: Supports multi-turn conversations, system prompts, and complex commands with improved accuracy.
- Speech-to-Speech Processing: Delivers accurate transcription, seamless translation, and conversational audio generation.
Performance
Instruction-Following
| Model | Size | ASR En (WER↓) | ASR Th (WER↓) | Translation Th2En (BLEU↑) | Translation En2Th (BLEU↑) | Translation X2Th (BLEU↑) | Gender En (Acc↑) | Gender Th (Acc↑) |
|---|---|---|---|---|---|---|---|---|
| Qwen-Audio | 7B | 6.94 | 95.12 | 0.00 | 2.48 | 0.29 | 37.09 | 67.97 |
| SALMONN | 13B | 5.79 | 98.07 | 14.97 | 0.07 | 0.10 | 95.69 | 93.26 |
| DiVA | 8B | 30.28 | 65.21 | 7.97 | 9.82 | 5.31 | 47.30 | 50.12 |
| Gemini-1.5-Pro | - | 5.98 | 13.56 | 22.54 | 20.69 | 13.52 | 90.73 | 81.32 |
| Typhoon-Audio | 8B | 8.72 | 14.17 | 24.14 | 17.52 | 10.67 | 98.76 | 93.74 |
| Typhoon2-Audio | 8B | 5.83 | 14.04 | 33.25 | 27.15 | 15.93 | 76.51 | 75.65 |
Typhoon2-Audio demonstrates notable improvements over its predecessor, Typhoon-Audio, in several key tasks, including ASR (Automatic Speech Recognition), translation, and speech instruction-following.
The model achieves a lower Word Error Rate (WER) in both English and Thai ASR tasks, outperforming SALMONN and approaching the performance of Gemini-1.5-Pro, showcasing its enhanced ability to process spoken language effectively.
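Word Error Rate, as reported above, is the word-level edit distance between the hypothesis and reference transcripts, normalized by the reference length. A minimal sketch (real ASR evaluations typically apply text normalization first, which is omitted here):

```python
def edit_distance(ref, hyp):
    # Levenshtein distance over token sequences, single-row rolling DP.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # dp[j] (old) = deletion, dp[j-1] (new) = insertion,
            # prev = substitution/match from the previous row.
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)
```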
In translation tasks, Typhoon2-Audio exhibits significant gains, achieving higher BLEU scores across English-to-Thai and Thai-to-English translations, surpassing most benchmarks.
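BLEU, used for the translation scores, combines clipped n-gram precisions with a brevity penalty. A simplified sentence-level sketch follows; the reported numbers likely come from a standard corpus-level implementation (for example SacreBLEU), which differs in smoothing and tokenization:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. Returns 0 if any precision is zero (no smoothing)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / sum(hyp_ngrams.values())))
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```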
Despite these advancements, it underperforms in gender classification and struggles with nested and complex instructions in speech instruction tasks. These limitations suggest potential weaknesses in handling nuanced contexts or multiple layers of meaning.
| Model | Size | SpokenQA En (F1↑) | SpokenQA Th (F1↑) | SpeechIF En (Judge↑) | SpeechIF Th (Judge↑) | ComplexIF Qual (Judge↑) | ComplexIF Format (Judge↑) | ComplexIF Avg (Judge↑) |
|---|---|---|---|---|---|---|---|---|
| Qwen-Audio | 7B | 25.34 | 0.00 | 1.07 | 1.03 | 3.13 | 1.68 | 2.41 |
| SALMONN | 13B | 52.92 | 2.95 | 2.47 | 1.18 | 4.10 | 5.09 | 4.60 |
| DiVA | 8B | 44.52 | 15.13 | 6.81 | 2.68 | 6.33 | 7.83 | 7.08 |
| Gemini-1.5-Pro | - | 74.09 | 62.10 | 3.24 | 3.93 | 7.25 | 8.99 | 8.12 |
| Typhoon-Audio | 8B | 48.83 | 64.60 | 5.62 | 6.11 | 6.34 | 8.73 | 7.54 |
| Typhoon2-Audio | 8B | 69.22 | 70.01 | 6.00 | 6.79 | 5.35 | 9.01 | 7.18 |
In spoken QA and instruction following, Typhoon2-Audio outshines Typhoon-Audio by demonstrating higher F1 scores and improved accuracy in more complex tasks, particularly in Thai. However, its lower performance on complex nested instructions, as indicated by reduced judgment scores, reveals that further refinements are needed to enhance its reasoning and contextual understanding. Overall, while Typhoon2-Audio marks a significant step forward in audio-language processing, areas like fine-grained classification and robust handling of intricate instructions remain opportunities for improvement.
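The SpokenQA F1 above is a token-overlap score between the predicted and reference answers. A minimal sketch, assuming SQuAD-style scoring; the exact tokenizer and normalization used in the evaluation may differ:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred, ref = prediction.split(), reference.split()
    # Multiset intersection counts each shared token at most min(count) times.
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```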
End-to-End Speech-to-Speech Evaluation
Typhoon2-Audio’s performance was assessed on SpeechIF tasks for both English and Thai, with a focus on content generation and speech quality. Content generation was evaluated using an LLM-as-a-judge framework, while speech quality was measured through accuracy metrics such as CER (Character Error Rate) and WER (Word Error Rate), as well as naturalness ratings like UTMOS.
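CER is the character-level analogue of WER: the Levenshtein distance computed over characters rather than words, divided by the reference length. A minimal sketch, again without the text normalization a real evaluation pipeline would apply:

```python
def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Levenshtein distance with a single-row rolling DP table.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1] / len(ref)
```

Because it operates on characters, CER is often the more informative metric for Thai, where word boundaries are not marked by spaces.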
Content Generation Score
Results using Text Output:
| Model | SpeechIF En Quality (↑) | SpeechIF En Style (↑) | SpeechIF Th Quality (↑) | SpeechIF Th Style (↑) |
|---|---|---|---|---|
| Llama-Omni | 5.58 | 6.52 | 1.88 | 2.53 |
| GPT-4o-Audio | 7.23 | 8.25 | 6.96 | 8.38 |
| Typhoon2-Audio | 6.34 | 7.12 | 7.43 | 8.18 |
Results using Transcribed Speech:
| Model | SpeechIF En Quality (↑) | SpeechIF En Style (↑) | SpeechIF Th Quality (↑) | SpeechIF Th Style (↑) |
|---|---|---|---|---|
| Llama-Omni | 5.15 | 5.79 | 1.71 | 2.14 |
| GPT-4o-Audio | 6.82 | 7.86 | 6.66 | 8.07 |
| Typhoon2-Audio | 4.92 | 5.39 | 7.19 | 8.04 |
Typhoon2-Audio demonstrates strong performance in generating content for speech-to-text instruction following (S2TIF) and speech-to-speech instruction following (S2SIF) tasks, particularly in Thai, where it significantly outperforms Llama-Omni and is competitive with GPT-4o-Audio.
The model also achieves the highest quality and style scores for Thai responses using both text output and transcribed speech. However, its English performance, while competitive, falls behind Llama-Omni and GPT-4o-Audio, particularly in transcribed speech, indicating room for improvement in handling transcription imperfections.
Speech Quality Metrics
| Model | En WER (↓) | En CER (↓) | En UTMOS (↑) | Th WER (↓) | Th CER (↓) | Th UTMOS (↑) |
|---|---|---|---|---|---|---|
| Llama-Omni | 4.98 | 3.4 | 3.932 | N/A | N/A | N/A |
| GPT-4o-Audio | 4.88 | 3.2 | 3.652 | 11.71 | 8.05 | 3.464 |
| Typhoon2-Audio | 33 | 26.5 | 2.285 | 10.04 | 8.67 | 2.348 |
In terms of speech quality, Typhoon2-Audio struggles in English, with high Word Error Rate (WER) and Character Error Rate (CER) alongside a lower UTMOS (naturalness) score.
The system performs better in Thai, with WER and CER close to GPT-4o-Audio’s. Llama-Omni achieves low error rates in English and can follow instructions given in Thai, but it generates English speech only, so no Thai speech metrics are available for it; this limitation underscores Typhoon2-Audio’s advantage in providing truly multilingual responses.
Overall, Typhoon2-Audio’s strength lies in Thai content generation, where it delivers competitive speech quality on par with GPT-4o-Audio. There remains significant room for improvement in its English output, particularly in speech quality and transcription handling, and its UTMOS scores are lower than those of both GPT-4o-Audio and Llama-Omni, highlighting an area for further enhancement in speech generation.
Continuing with our Open Releases
We are excited to release the weights for our audio and vision models under a research license. We hope this supports researchers, developers, and tech enthusiasts in exploring advanced multimodal AI and drives progress in the field.
You can download the weights for Typhoon2-Vision and Typhoon2-Audio at: https://huggingface.co/collections/scb10x/typhoon-2-multimodal-675e59368326ac2328c8210f
Insights and findings from our model development are available in the Typhoon 2 Technical Report at https://arxiv.org/abs/2412.13702.
Future Work
We are exploring a range of potential directions to build on Typhoon2’s current capabilities. These include enhancing support for diverse Thai language use cases (including dialects), improving multimodal functionalities, and exploring ways to make the models more adaptable across different domains and tasks.
As we look ahead, our focus remains on identifying opportunities to refine and expand Typhoon2’s performance, ensuring it can better serve evolving needs in language understanding, speech, and real-world applications.
SCB 10X R&D Team
Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, Kasima Tharnpipitchai
Contact Us
General & Collaborations: krisanapong@scb10x.com, kasima@scb10x.com
Technical: kunat@scb10x.com
For an in-depth understanding of our methodologies and findings, please refer to the Typhoon2 Technical Report.