
Typhoon-Vision Preview Release

New Release · Typhoon 1.5 · Vision · Multimodal
Parinthapat Pengpun
August 13, 2024

Table of Contents

  • Release Note: Typhoon-Vision (v1.5 Preview)
  • 1. Introduction
  • 2. Model Architecture
  • 3. Training Methodology
    • 3.1 Datasets
    • 3.2 Training Process
  • 4. Evaluation and Performance
    • 4.1 Image Captioning Performance
    • 4.2 Multimodal Benchmark Performance
    • 4.3 Examples from Typhoon-Vision
  • 5. Model Specifications and Deployment
  • 6. Future Work
  • 7. Conclusion
  • Appendix
  • References

Release Note: Typhoon-Vision (v1.5 Preview)

💻 Demo: https://vision.opentyphoon.ai/

🤗 Model weights: https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-vision-preview

📝 Technical report (paper) and evaluation data to be released later

We present Typhoon-Vision, a multimodal AI model designed specifically for Thai language and visual understanding. Built upon the Typhoon 1.5 8B Instruct language model and integrating advanced visual processing capabilities, Typhoon-Vision represents a significant advancement in Thai-centric AI technology. This report details the model’s architecture, training methodology, performance metrics, and future research directions.

1. Introduction

Multimodal AI models that can process both textual and visual information have seen rapid advancements in recent years. However, most of these models have been primarily developed for English and other high-resource languages. Typhoon-Vision addresses this gap by providing a powerful multimodal model optimized for the Thai language, enabling sophisticated visual-language understanding tasks in Thai contexts.

2. Model Architecture

Typhoon-Vision follows the LLaVA architecture (image courtesy of the LLaVA paper)

Typhoon-Vision builds upon the architecture introduced by the Bunny model from BAAI, which follows the LLaVA (Large Language and Vision Assistant) architecture. The key components of our model are:

  1. Vision Encoder: SigLIP [1]
  2. Language Model: Typhoon 1.5 8B Instruct, based on Llama-3-8B and pretrained on Thai language data
  3. Multimodal Projector: A 2-layer GELU MLP connecting the vision and language components

The total parameter count of Typhoon-Vision is 8.5 billion, with 8 billion parameters in the language model and 0.5 billion in the vision components.
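
To make the architecture concrete, here is a minimal sketch of a LLaVA-style 2-layer GELU MLP projector. This is an illustration, not the released implementation; the dimensions are assumptions based on SigLIP's 1152-dimensional patch features and Llama-3-8B's 4096-dimensional hidden size.

```python
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """2-layer MLP with GELU that maps vision-encoder patch embeddings
    into the language model's embedding space (LLaVA-style).
    Dimensions are illustrative: SigLIP-SO400M produces 1152-d features
    and Llama-3-8B uses a 4096-d hidden size."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        # returns:        (batch, num_patches, llm_dim), ready to be
        # concatenated with the text token embeddings
        return self.proj(image_features)
```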

3. Training Methodology

3.1 Datasets

Our training process utilized the Bunny dataset [2] together with a Thai-translated subset of it. For select complex tasks (e.g., image captioning), we chose to distill responses from GPT-4o [3].

A significant challenge in developing Typhoon-Vision was the scarcity of high-quality Thai multimodal data. To address this, we employed:

  1. Translation of existing datasets: We translate with our in-house machine translation model, filter out low-quality translations, and keep only the high-quality ones.

  2. Translated-data ratio ablations: To find the largest proportion of translated data that does not degrade English performance, we ablate the ratio of Thai-translated data added to the training set. In this preliminary study, we found the best ratio to be around 10-25% (see the sketch below).
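
For illustration, here is a minimal sketch of how such a mixing ratio can be implemented. The function and dataset names are hypothetical; the actual training pipeline may differ.

```python
import random

def mix_datasets(english_data, thai_translated_data, thai_ratio, seed=42):
    """Return a shuffled training mix in which roughly `thai_ratio` of
    the examples are Thai-translated. Hypothetical helper; the actual
    Typhoon-Vision pipeline may differ."""
    rng = random.Random(seed)
    n_english = len(english_data)
    # Solve n_thai / (n_english + n_thai) = thai_ratio for n_thai.
    n_thai = int(n_english * thai_ratio / (1.0 - thai_ratio))
    thai_sample = rng.sample(thai_translated_data,
                             min(n_thai, len(thai_translated_data)))
    mixed = list(english_data) + thai_sample
    rng.shuffle(mixed)
    return mixed

# Ablate over candidate ratios (the post reports 10-25% working best):
# for ratio in (0.05, 0.10, 0.25, 0.50):
#     train_set = mix_datasets(english_examples, thai_examples, ratio)
```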

3.2 Training Process

Following Bunny, we use the same two-stage training. In the first stage (pretraining), the goal is to align the embeddings between the vision encoder and the language model, so we train only the projection layer and freeze everything else. In the second stage (multimodal instruction tuning), we tune the projection layer and the LLM (via LoRA) while keeping the vision encoder frozen.
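
As a rough illustration of this freeze/unfreeze pattern, here is a sketch using the PEFT library for the stage-2 LoRA. The module attribute names (`projector`) and the LoRA hyperparameters are assumptions for illustration, not the actual training configuration.

```python
from peft import LoraConfig, get_peft_model

def configure_stage1(model):
    """Stage 1 (pretraining): train only the multimodal projector;
    the vision encoder and the LLM stay frozen."""
    for param in model.parameters():
        param.requires_grad = False
    for param in model.projector.parameters():  # attribute name is illustrative
        param.requires_grad = True
    return model

def configure_stage2(model):
    """Stage 2 (multimodal instruction tuning): the vision encoder stays
    frozen, the LLM gets LoRA adapters, and the projector is fully tuned."""
    lora_cfg = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative hyperparameters
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama attention projections
        modules_to_save=["projector"],  # keep the projector fully trainable
    )
    # get_peft_model freezes everything except the LoRA adapters and
    # the modules listed in modules_to_save.
    return get_peft_model(model, lora_cfg)
```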

4. Evaluation and Performance

We evaluated Typhoon-Vision on several benchmarks to assess its performance in various visual-language tasks. The following sections detail our findings.

4.1 Image Captioning Performance

We utilized the IPU-24 dataset [4], a Thai-centric image captioning benchmark developed by NECTEC. This dataset combines images from the COCO2017 validation set with custom images focused on Thai cultural elements. The captions for COCO2017 images were rewritten in Thai to ensure cultural relevance. The validation set of IPU-24 contains around 9000 records. Here is an example:

Gold Standard Captions:

  • อุโบสถหลังสีขาว หลังคาสีแดงปิดด้วยสีทอง ตั้งภายในวัด และมีกระถางต้นไม้วางอยู่ข้างหน้า (“A white ordination hall with a red roof trimmed with gold, set inside a temple, with potted plants placed in front.”)
  • โบสถ์สีขาวหลังหนึ่งมีหลังคาสีแดงปนทองและขอบประตูสีทอง (“A white chapel with a red-and-gold roof and gold door frames.”)
  • โบสถ์มีผนังสีขาว มีซุ้มประตูสีทอง รอบ ๆ มีกระถางต้นไม้ มีต้นไม้ (“A chapel with white walls and a gold doorway arch, surrounded by potted plants and trees.”)

We prompt each model (Typhoon-Vision, Bunny Llama3 8B, SeaLMMM 7B v0.1 [5], and GPT-4o Mini) to generate captions for these images with the instruction “อธิบายเนื้อหาของภาพนี้อย่างละเอียดเป็นภาษาไทย” (“Describe the content of this image in detail in Thai”). We then use Llama-3.1-70B-Instruct [6] as a judge to evaluate all models; see the Appendix for the exact prompt used.
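
For clarity, here is a minimal sketch of how such a pairwise LLM-as-a-judge comparison can be run. The chat client and the judge model identifier are assumptions (any OpenAI-compatible endpoint serving the judge would work); the [[A]]/[[B]]/[[C]] verdict format matches the prompt in the Appendix.

```python
import re

VERDICT_RE = re.compile(r"\[\[(A|B|C)\]\]")

def judge_pair(client, judge_prompt, gold_caption, caption_a, caption_b,
               judge_model="llama-3.1-70b-instruct"):
    """Compare two generated Thai captions against a gold caption with an
    LLM judge; return 'A', 'B', 'C', or None if no verdict is found.
    `client` is assumed to be an OpenAI-compatible chat client."""
    user_msg = (
        f"Gold standard caption: {gold_caption}\n"
        f"Assistant A's caption: {caption_a}\n"
        f"Assistant B's caption: {caption_b}"
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": judge_prompt},
            {"role": "user", "content": user_msg},
        ],
        temperature=0.0,  # deterministic judging
    )
    match = VERDICT_RE.search(resp.choices[0].message.content)
    return match.group(1) if match else None
```

In practice, running each pair twice with the caption order swapped helps counter the position bias that the judge prompt warns against.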

Performance comparison of models on IPU-24.

As shown, Typhoon-Vision significantly outperforms the other models, including Bunny Llama3 8B v1.0, SeaLMMM 7B v0.1, and GPT-4o Mini, demonstrating superior capability in Thai image captioning tasks.

4.2 Multimodal Benchmark Performance

We evaluated Typhoon-Vision on several established multimodal benchmarks to assess its general capabilities. The benchmarks used are:

  1. MMBench [7] (Dev Set): MMBench is a visual LLM evaluation dataset measuring multiple skills such as image reasoning, action recognition, scene understanding, and more.

Example questions from the MMBench Benchmark

  2. Pope [8]: Pope is a VQA dataset designed to probe visual LLMs for object hallucination.

  3. GQA [9]: A large-scale visual question answering dataset with real images from the Visual Genome dataset and balanced question-answer pairs. Each training and validation image is also associated with scene-graph annotations describing the classes and attributes of the objects in the scene and their pairwise relations.

Example questions from the GQA benchmark

  4. GQA (Thai): A Thai-language adaptation (translation) of the GQA benchmark, assessing the model’s performance on visual question answering tasks in Thai.

| Model | MMBench (Dev) | Pope | GQA | GQA (Thai) |
|---|---|---|---|---|
| Typhoon-Vision 8B Preview | 70.9 | 84.8 | 62.0 | 43.6 |
| SeaLMMM 7B v0.1 | 64.8 | 86.3 | 61.4 | 25.3 |
| Bunny Llama3 8B Vision | 76.0 | 86.9 | 64.8 | 24.0 |
| GPT-4o Mini | 69.8 | 45.4 | 42.6 | 18.1 |

Typhoon-Vision demonstrates competitive performance across these benchmarks, and it particularly excels on the Thai-specific GQA variant compared with the other available multimodal LLMs. It is important to note that Pope and GQA use exact matching for their evaluation while MMBench uses GPT-3.5 as a grader; this may depress GPT-4o Mini’s scores on Pope and GQA (as observed in other papers such as Cambrian-1).
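
To illustrate why exact matching can penalize verbose models, here is a rough sketch of the kind of normalized exact-match scoring these benchmarks rely on (a simplified illustration, not the official evaluation code):

```python
import string

def exact_match(prediction: str, gold: str) -> bool:
    """Case-insensitive exact match after stripping punctuation and
    surrounding whitespace, roughly how Pope/GQA-style scoring works."""
    def norm(s: str) -> str:
        return s.lower().strip().translate(
            str.maketrans("", "", string.punctuation))
    return norm(prediction) == norm(gold)

# A terse answer matches, a verbose-but-correct one does not:
print(exact_match("Dog.", "dog"))                  # True
print(exact_match("The animal is a dog.", "dog"))  # False
```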

4.3 Examples from Typhoon-Vision

Here are a few examples to showcase our model’s performance:

Image Captioning:

Visual Question Answering:

Currently, our most significant area for improvement is in optical character recognition (OCR). We recognize this limitation and are actively researching and developing solutions to boost OCR performance. To illustrate the current state and the potential for improvement, here is an example of our model’s OCR output:

5. Model Specifications and Deployment

Typhoon-Vision can be deployed on hardware with specifications similar to those required for Typhoon 1.5 8B models. The model can run on GPUs with at least 16GB of VRAM. Our model is available on HuggingFace: https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-vision-preview.
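
As a starting point, a loading sketch with Hugging Face transformers is shown below. The image-preprocessing and prompt-template calls are defined by the repository’s remote code, so treat this as an outline under those assumptions and consult the model card for the full generation example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "scb10x/llama-3-typhoon-v1.5-8b-vision-preview"

# trust_remote_code is needed because the vision tower and projector
# are defined in the repository's custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision to fit consumer GPUs
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Image handling and the chat template come from the remote code;
# see the model card for the end-to-end captioning example.
```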

6. Future Work

Our ongoing research and development efforts for Typhoon-Vision focus on several key areas:

  1. Vision Encoder Optimization: We plan to fine-tune and potentially replace the current SigLIP encoder to better capture Thai-specific visual elements.
  2. OCR Enhancement: Improving optical character recognition capabilities for Thai text in images is a priority, with plans to acquire and integrate more Thai OCR data.
  3. Charts and Infographics Understanding: We aim to enhance Typhoon-Vision’s ability to interpret and answer questions about charts and infographics in Thai. This involves exploring approaches to curate more chart-related data in Thai contexts.

7. Conclusion

Typhoon-Vision represents a significant step forward in multimodal AI for the Thai language. By combining advanced visual processing with a Thai-optimized language model, we have created a powerful tool for Thai-centric visual-language tasks. The model’s strong performance across various benchmarks demonstrates its potential to drive innovation in Thai AI applications.

As we continue to refine and expand Typhoon-Vision’s capabilities, we invite the research community to explore and build upon our work. The model weights are available under the META LLAMA 3 COMMUNITY LICENSE AGREEMENT, fostering open collaboration and further advancements in Thai-centric AI technology.

Appendix

IPU Evaluation: LLM-as-a-Judge Prompt

You are an impartial judge evaluating the quality of Thai language image captions generated by two AI assistants. You will be presented with a gold standard Thai caption, and two AI-generated Thai captions. Your task is to assess which AI-generated caption better captures the image content and quality of the gold standard, with a specific focus on proper Thai language use.
Evaluation criteria:
1. Accuracy: How well does the Thai caption reflect the key elements described in the gold standard?
2. Completeness: Does the Thai caption cover all important aspects mentioned in the gold standard?
3. Conciseness: Is the Thai caption concise and to the point, similar to the gold standard?
4. Natural Thai language: Does the caption read fluently and naturally in Thai? Penalize any unnatural language use.
5. Correct Thai language: Is the grammar, vocabulary, and syntax of the Thai caption correct? Heavily penalize any incorrect language use.
6. Relevance: Does the Thai caption focus on the most important aspects of the image, as highlighted in the gold standard?
7. Thai cultural appropriateness: If applicable, does the caption use culturally appropriate Thai terms and expressions?

Provide a detailed explanation of your evaluation in English, comparing both AI-generated Thai captions to the gold standard and to each other. Pay special attention to the quality of Thai language use. Avoid any biases related to the order of presentation, length of captions, or assistant names. Be as objective as possible.

Conclude your evaluation with a verdict in this exact format:
[[A]] if Assistant A's Thai caption is better
[[B]] if Assistant B's Thai caption is better
[[C]] if both Thai captions are of equal quality

References

[1] Sigmoid Loss for Language Image Pre-Training, Zhai et al., 2023

[2] Efficient Multimodal Learning from Data-centric Perspective, He et al., 2024

[3] GPT-4o, OpenAI, 2024. https://openai.com/index/hello-gpt-4o/

[4] IPU-24 image captioning competition, Theerasit et al., 2024. https://kaggle.com/competitions/ai-cooking-image-captioning

[5] SeaLLMs - Large Language Models for Southeast Asia, Nguyen et al., 2023

[6] The Llama 3 Herd of Models, Dubey et al., 2024

[7] MMBench: Is Your Multi-modal Model an All-around Player?, Liu et al., 2024

[8] Evaluating Object Hallucination in Large Vision-Language Models, Li et al., 2023

[9] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, Hudson et al., 2019

© 2025 SCB 10X Co., Ltd. All rights reserved.