Introducing ThaiSafetyBench: An LLM Safety Benchmark Built for the Thai Language and Thai Culture
Many of us have seen firsthand how quickly LLMs such as Gemini, ChatGPT, and Typhoon have evolved. With each generation, these models become more capable, more useful, and more deeply integrated into how people learn, work, and make decisions. At Typhoon, we believe this progress is incredibly exciting, but capability alone is not enough. As models grow more powerful, it becomes just as important to define the boundaries of how they should be used and to ensure they remain safe in real-world settings.
The stakes are not theoretical. A highly capable model in the wrong context could be misused to generate dangerous instructions, enable harmful behavior, or lower the barrier to abuse at scale. As AI becomes more accessible, safety can no longer be treated as an optional layer or a feature to add later. It must be a foundational part of how we build, evaluate, and deploy these systems.
This is why safety benchmarking matters. Safety benchmarks help us measure whether an LLM behaves responsibly in practice, especially when it is confronted with harmful, adversarial, or ambiguous prompts. But today, most safety benchmarks are designed primarily in English. That leaves a critical gap. Research has already shown that low-resource languages can be used to jailbreak even highly safety-aligned models such as GPT-4[1]. In other words, if safety is only measured in English, then safety is not truly universal.
That vision led us to develop ThaiSafetyBench — a safety benchmark designed specifically for the Thai language. Our goal is not only to evaluate how models respond to harmful prompts in Thai, but also to help raise the standard for multilingual AI safety more broadly. We believe that trustworthy AI must be safe for every community it serves, and ThaiSafetyBench is one step toward making that vision real.
ThaiSafetyBench

ThaiSafetyBench is a dataset of 1,954 Thai malicious prompts, all of which have been carefully curated and human-verified to ensure high data quality. To enable accurate safety evaluation of models in the Thai context, ThaiSafetyBench also includes harmful prompts that are specific to Thai culture and the social context of Thai society.
The dataset is divided into 6 Risk Areas:
- Discrimination, Exclusion, Toxicity, Hateful, Offensive
- Human-Chatbot Interaction Harms
- Information Hazards
- Malicious Uses
- Misinformation Harms
- Thai Socio-Cultural Harms
Interesting Insights
To get a clearer picture of what LLM safety looks like in the Thai context today, we evaluated 24 LLM models using ThaiSafetyBench, with GPT-4.1 and Gemini-2.5-pro serving as LLM-as-a-judge. Results are reported as Average Attack Success Rate (ASR), a metric that measures on average how many malicious prompts successfully jailbreak the model (lower is better). The results show that most closed-source models are significantly safer than open-source models, highlighting the importance of safety-focused development within the Thai open-source community.

The graph below reveals a consistent trend: the jailbreak success rate using Thai culture-specific prompts is significantly higher than that of general Thai prompts. This underscores the danger of deploying LLMs in a specific cultural context when malicious prompts are involved (Cultural Jailbreaking). It also suggests that Safety Benchmarks translated from English may not accurately reflect LLM safety if they do not incorporate culturally and socially relevant dimensions.

Finally, we observe a trend among open-source models: in general, larger LLMs tend to be safer.
Example LLM Responses from ThaiSafetyBench


Let's Build Safe LLMs for Thai Users
Let's build an LLM Safety Community together. We've started by creating a HuggingFace Leaderboard that aggregates evaluation results on ThaiSafetyBench, giving you an at-a-glance view of the safety of currently available models. We encourage Thai language model developers to submit their models to the leaderboard. 😊 We've also released a lightweight ThaiSafetyClassifier, a binary classifier for Thai language safety, to make it easier to future develop and improve the community together!

Closing Thoughts
Thank you to everyone who has taken an interest in this work and read all the way to the end. This research has been accepted for presentation at the ICLR 2026 Workshop on Trustworthy AI. I hope this post plays a part in helping our community build AI that is safe for everyone.
Resources
- Research Paper: https://arxiv.org/abs/2603.04992v2
- ThaiSafetyBench Dataset: https://huggingface.co/datasets/typhoon-ai/ThaiSafetyBench
- ThaiSafetyBench Leaderboard: https://huggingface.co/spaces/typhoon-ai/ThaiSafetyBench-Leaderboard
- ThaiSafetyClassifier: https://huggingface.co/typhoon-ai/ThaiSafetyClassifier



