
Introducing the ThaiLLM Leaderboard: ThaiLLM Evaluation Ecosystem
Evaluation · Thai LLMs · New Release

Introduction
Leaderboards have been a standard method for evaluating the performance of large language models (LLMs). In English, there are major leaderboards such as HELM, Chatbot Arena, and the Open LLM Leaderboard, which provide standardized evaluation frameworks for LLMs. However, no comparable platform has existed for Thai.
To address this gap, the "Thai LLM Leaderboard" was initiated through a collaboration between SCB 10X (Typhoon), VISTEC, and the SEACrowd project. This leaderboard is specifically designed to evaluate and compare LLMs with Thai language capabilities.
ThaiLLM Leaderboard
The leaderboard tracks the performance of various LLMs across a range of benchmarks and tasks, providing a standard environment where models are assessed under the same conditions. This ensures that results are reproducible and comparable, allowing developers and researchers to gauge how their models perform relative to others in the community, and ultimately fostering growth in Thai NLP research and development.
Methodology
We created the leaderboard based on four tasks and ten datasets to ensure diversity and mitigate the risk of leaderboard overfitting. The tasks are as follows:
- Exam: Thai knowledge testing
- LLM-as-a-judge: using an LLM to judge more complex generation
- NLU: natural language understanding
- NLG: traditional natural language generation
We cover all these tasks by using 10 evaluation datasets:
Exam Datasets
- ThaiExam: ThaiExam is a Thai-language benchmark based on examinations for high-school students and investment professionals in Thailand.
- M3Exam: M3Exam is a novel benchmark sourced from authentic and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. This leaderboard uses the Thai subset of M3Exam.
LLM-as-a-Judge
- Thai MT-Bench: A Thai version of MT-Bench, developed by VISTEC specifically for probing Thai generative skills using the LLM-as-a-judge method.
NLU
- Belebele: Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants; this leaderboard uses the Thai subset.
- XNLI: XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages. This leaderboard uses the Thai subset of this corpus.
- XCOPA: XCOPA is a translation and re-annotation of the English COPA corpus covering 11 languages, designed to measure commonsense reasoning ability in non-English languages. This leaderboard uses the Thai subset of this corpus.
- Wisesight: The Wisesight sentiment analysis corpus contains social media messages in the Thai language with sentiment labels.
NLG
- XLSum: XLSum is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from the BBC. This corpus evaluates summarization performance in non-English languages, and this leaderboard uses the Thai subset.
- Flores200: FLORES is a machine translation benchmark dataset used to evaluate translation quality between English and low-resource languages. This leaderboard uses the Thai subset of Flores200.
- iapp Wiki QA Squad: iapp Wiki QA Squad is an extractive question-answering dataset derived from Thai Wikipedia articles.
Evaluation Metrics & Implementation
- Exam and NLU tasks are evaluated as multiple-choice classification using logits-based comparison, following the SEACrowd implementation (see the first sketch after this list).
- LLM-as-a-Judge is evaluated using gpt-4o-2024-05-13, with evaluation prompts taken from lmsys MT-Bench (see the second sketch after this list).
- NLG tasks are evaluated using the following metrics: BLEU for machine translation, ROUGE for summarization, and the SEACrowd implementation for question answering (see the third sketch after this list).
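
To make the logits-based comparison concrete, here is a minimal sketch in Python. It assumes a Hugging Face causal LM; the model name, question, and options below are placeholders, and the exact prompt formatting follows the SEACrowd implementation rather than this simplified version. Each answer option is scored by the total log-probability the model assigns to its tokens, and the highest-scoring option becomes the prediction:

```python
# Minimal sketch of logits-based multiple-choice scoring (illustrative only;
# see the SEACrowd implementation for the exact procedure).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "scb10x/typhoon-7b"  # placeholder; any HF causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_loglikelihood(prompt: str, option: str) -> float:
    """Total log-probability the model assigns to the option tokens."""
    # Note: tokenizing prompt and prompt + option separately can differ at the
    # boundary; the real implementation handles this more carefully.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # The logit at position i predicts the token at position i + 1.
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

question = "...a Thai multiple-choice exam question, formatted as a prompt..."
options = ["A. ...", "B. ...", "C. ...", "D. ..."]
scores = [option_loglikelihood(question, opt) for opt in options]
prediction = options[scores.index(max(scores))]  # highest log-likelihood wins
```

Because the option tokens are scored directly, no free-form generation or answer parsing is needed, which keeps the evaluation deterministic and reproducible.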
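For the LLM-as-a-judge task, single-answer grading works roughly as follows. This is a hedged sketch: the judge prompt below is paraphrased for illustration (the leaderboard uses the actual lmsys MT-Bench prompts), and it assumes the official OpenAI Python client with an API key in the environment:

```python
# Minimal sketch of single-answer grading in the MT-Bench style (the judge
# prompt is paraphrased here, not the exact lmsys MT-Bench prompt).
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = (
    "Please act as an impartial judge and evaluate the quality of the "
    "response provided by an AI assistant to the user question below. "
    "Rate the response on a scale of 1 to 10 and format your verdict "
    'strictly as "[[rating]]", e.g. "[[5]]".\n\n'
    "[Question]\n{question}\n\n[Assistant's Answer]\n{answer}"
)

def judge(question: str, answer: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o-2024-05-13",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            question=question, answer=answer)}],
        temperature=0,
    )
    verdict = completion.choices[0].message.content
    match = re.search(r"\[\[(\d+)\]\]", verdict)
    return int(match.group(1)) if match else -1  # -1 = unparseable verdict
```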
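For the NLG metrics, a minimal sketch using the sacrebleu and rouge_score libraries is shown below; the library choices and example strings are assumptions for illustration, since the leaderboard follows the SEACrowd implementation. Note that Thai is written without spaces between words, so in practice outputs are typically word-segmented before computing token-overlap metrics:

```python
# Minimal sketch of BLEU and ROUGE scoring (illustrative library choices;
# the leaderboard follows the SEACrowd implementation).
import sacrebleu
from rouge_score import rouge_scorer

# Machine translation: corpus-level BLEU over hypotheses vs. references.
hypotheses = ["model translation ..."]        # one string per test example
references = [["reference translation ..."]]  # one list per reference stream
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")

# Summarization: ROUGE-1/2/L between reference and model summaries.
# Thai text should be word-segmented first, since ROUGE splits on whitespace.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
scores = scorer.score("reference summary ...", "model summary ...")
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```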
What's Next for the Thai LLM Leaderboard
- The ThaiLLM Leaderboard is open for contributions, and we would like it to be a community-driven effort. We encourage contributions of evaluation datasets or model submissions through pull requests or by completing the model evaluation form.
- Live leaderboard: We will continue to update the leaderboard with new evaluation datasets and tasks so that it stays relevant as newer models and trends emerge.
- We are also exploring the development of a challenging subset to evaluate LLMs, which will help distinguish genuine performance improvements from noise.
Conclusion
By providing a leaderboard platform for evaluating and comparing Thai large language models, we aim to standardize assessments and foster the development and adoption of LLMs in Thailand. We hope that this open initiative will help measure how effectively these models capture the unique characteristics and nuances of the Thai language and culture.