Typhoon 2.5: A Step Forward in Agentic AI, Thai Fluency, and Efficiency

New Release
Typhoon 2.5

Smarter agents, smoother conversations, unmatched efficiency, and open-source freedom for every scale from edge devices to enterprise systems.

Kunat Pipatanakul

October 20, 2025

Introduction

We’re excited to introduce Typhoon 2.5, the latest milestone in our open-source text LLM family. This release marks a major leap forward in three critical areas:

🔹 Agentic by Design – Smarter tool use, multi-step reasoning, and seamless integration into workflows.

🔹 Scalable Performance & Efficiency – High throughput, ultra-low token cost, and greater efficiency than any previous Typhoon release.

🔹 More Fluent Interactions – Responses that better capture rhythm and tone, especially in Thai, where nuance matters.

While proprietary models still dominate the AI landscape, they often fall short in accessibility and openness. On the other hand, open-source models offer clear advantages — transparency, flexibility, and lower cost — yet many still lag behind in real-world usability.

That’s why we built Typhoon 2.5 — an open-source model designed to bridge that gap, delivering seamless integration into agent-driven workflows and enabling natural, fluent, human-like interactions across real-world applications.

Key Highlights

  • Two Variants: Lean or Mighty

    4B: Ultra-efficient inference, runs on edge devices.

    30B (A3B): Production-grade scale with MoE efficiency, delivering the strength of a 30B model while activating only about 3B parameters per token.

  • Proprietary-Grade Performance with Open Source Benefits
    Matches GPT-4o and Claude Sonnet 4 in benchmarks, while staying fully transparent, controllable, and cost-efficient.

  • Optimized for Fluency, Not Only Accuracy

    Delivers smooth, natural responses that feel human — not just accurate.

  • High Throughput, Low Cost

    A single H100 can handle 3,000+ tokens per second at 64 concurrent requests — bringing inference costs down to as low as $0.10 per million tokens.

  • Enhanced Function Calling & Built for Agents

    Higher accuracy and reliability for real-world automation — works smoothly with tools like n8n, LangChain, or custom orchestration.

  • Built on Qwen3 Instruct 2507
    Leverages the latest open-source foundation for robust instruction-following, knowledge accuracy, and versatile task execution.

🤖 Agentic by Design

The future of LLMs isn’t just chat — it’s action. Typhoon 2.5 is built to act, not just reply.

  • Multi-step reasoning: Plan, chain, and execute tasks across tools.
  • Smarter function calling: Higher accuracy and reliability in structured outputs.
  • Workflow integration: Works seamlessly with n8n, LangGraph, or custom orchestration pipelines.
  • Real-world use cases: From weekly business reports to customer service automation, Typhoon 2.5 handles tasks end-to-end.
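The loop behind these capabilities can be sketched in a few lines: the model either returns a final answer or requests a tool call, and the orchestrator executes the tool and feeds the result back. A minimal, self-contained sketch (the `call_model` stub and `search` tool are hypothetical stand-ins, not Typhoon's actual API):

```python
# Minimal agentic loop: the model either answers or requests a tool call;
# the orchestrator executes the tool and feeds the observation back.

def call_model(messages):
    # Placeholder for an OpenAI-compatible chat call with tool support.
    # Simulated here: ask for a tool once, then answer from its result.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "search", "args": {"q": "weekly sales"}}}
    return {"content": "Report: sales grew 4% week-over-week."}

TOOLS = {"search": lambda q: f"results for '{q}'"}  # mock tool registry

def run_agent(user_request, max_steps=5):
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "tool_call" not in reply:
            return reply["content"]                   # model is done
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["args"])  # execute the tool
        messages.append({"role": "tool", "content": result})
    return "step budget exhausted"

print(run_agent("Summarize this week's sales."))
```

Frameworks like n8n or LangGraph implement this same loop with real model calls and tool schemas; the structure stays the same.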

Typhoon 2.5 is more than a chatbot: it's a reliable agent for research, operations, and automation.

⚡ High Throughput & Ultra-Low Cost

One of the most exciting leaps in Typhoon 2.5 is not just what it can do — but how efficiently it does it.

  • 3,000+ tokens/sec on a single H100 at 64 concurrent requests
  • Inference cost as low as $0.10 per million tokens
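The $0.10 figure follows directly from throughput and GPU rental cost. A quick sanity check of what hourly GPU price those two numbers imply (arithmetic only, not an actual price quote; see the appendix for the full setup):

```python
# Relate per-token cost to GPU rental price at a given throughput.
tokens_per_second = 3000
cost_per_million_usd = 0.10

tokens_per_hour = tokens_per_second * 3600       # 10,800,000 tokens/hour
millions_per_hour = tokens_per_hour / 1_000_000  # 10.8M tokens/hour

# Implied hourly GPU cost consistent with $0.10 per million tokens.
implied_gpu_cost_per_hour = cost_per_million_usd * millions_per_hour
print(f"implied GPU cost: ${implied_gpu_cost_per_hour:.2f}/hour")
```

In other words, at 3,000+ tokens/sec, any H100 rental at roughly this hourly rate yields the stated per-token cost; higher concurrency is what pushes aggregate throughput into this range.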

📊 Here’s how Typhoon 2.5 stacks up against past releases and peers:

Typhoon 2.5 Throughput

Same GPU. Higher speed. Typhoon 2.5 delivers over 3,000 tokens/sec on a single H100, a ~40% improvement in throughput compared to Typhoon 2.1.

Typhoon 2.5 Inference Cost

Typhoon 2.5 delivers great performance at just $0.10 per million tokens: 33% cheaper than Typhoon 2.1, 91% cheaper than GPT-5 Mini, and 93% cheaper than Gemini 2.5 Flash.

💡 **What this means for your business:** Process more data for the same budget, or achieve the same results at a fraction of the cost. Whether you're running large-scale operations, serving thousands of chatbot sessions, conducting research, or building AI-powered applications, Typhoon 2.5's industry-leading price point makes sophisticated language processing economically viable for projects of any size.

🧮 For our calculation setup, please see this appendix

🗣️ Fluency That Feels Natural

Great AI isn’t just correct — it feels right.

This is especially important in Thai, where literal translations often miss the intended meaning. Drawing on our earlier work in Typhoon Translate, we built Typhoon 2.5 with human-in-the-loop labeling and fluency-focused evaluation.

Many models can produce accurate answers, but their language often reads like a rough translation: stiff, awkward, or missing cultural nuance. Typhoon 2.5 changes that by redefining fluency as more than correctness; it's about rhythm, tone, and natural word choice. We are working toward this goal and are excited to share our progress.

Example Prompt:

TEXT

Typhoon 2.5 Response:

TEXT

Claude 4 Response:

TEXT

Evaluation Methodology

We benchmarked Typhoon 2.5 across instruction-following, agentic reasoning, real-world use cases, and fluency, combining both academic tests and production-style simulations.

📘 1. General Instruction Following & Agentic Capabilities

  • MT-Bench (English & Thai)

    An LLM-as-judge framework that evaluates correctness and instruction adherence on open-ended tasks. We tested Typhoon 2.5 on both the official LMSYS English benchmark and a Thai adaptation, spanning domains such as Thai knowledge, math, role-play, and creative writing.

  • Instruction-Following Accuracy — IFEval & IFBench (English & Thai)

    Benchmarks that measure how well the model follows instructions in verifiable, testable scenarios. We report results by averaging across both datasets:

    • IFEval: Evaluates compliance with verifiable instructions across 500+ test cases.
    • IFBench: A more challenging successor with complex constraint chains.
  • Agentic & Tool-Reasoning Capabilities — HotpotQA

    A multi-hop QA benchmark that requires retrieving and synthesizing information from multiple Wikipedia sources. We used it to test Typhoon 2.5’s ability to:

    • Decide when to call external tools (e.g., search APIs) versus answering directly.
    • Plan and chain tool calls with well-formed queries.
    • Stop at the right time, extract relevant information, and synthesize accurate answers while avoiding hallucinations.

    Evaluation was conducted on the medium subset of 100 queries per language (English & Thai), reflecting real-world research workflows.
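The multi-hop behavior HotpotQA probes can be pictured as a chain of retrieve-and-extract hops: each retrieved document names the next entity to look up. A toy sketch with a two-document mock corpus and naive string extraction (everything here is illustrative, not the evaluation harness):

```python
# Toy multi-hop QA in the spirit of HotpotQA: answering
# "Where is the developer of Typhoon headquartered?" needs two hops.
CORPUS = {
    "Typhoon": "Typhoon is developed by SCB 10X.",
    "SCB 10X": "SCB 10X is headquartered in Bangkok.",
}

def retrieve(query):
    # Stand-in for a search tool call.
    return CORPUS.get(query, "")

def multi_hop_answer(start_entity, relation_chain):
    """Follow one retrieval hop per relation, extracting the next entity."""
    entity = start_entity
    for _ in relation_chain:
        doc = retrieve(entity)
        # Naive extraction: take the phrase after "by" or "in".
        entity = doc.rstrip(".").split(" by ")[-1].split(" in ")[-1]
    return entity

print(multi_hop_answer("Typhoon", ["developer", "headquarters"]))
```

The benchmark stresses exactly the failure modes listed above: calling the tool when needed, forming the right second query from the first result, and stopping once the answer is in hand.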

Instruction-following and Agentic Evaluation Results — Typhoon 2.5 shows improvements over Typhoon 2.1 and other state-of-the-art open-source models in both general instruction-following and agentic use cases. The large model is competitive with proprietary alternatives while retaining the benefits of open source: privacy, customization, and cost efficiency. Meanwhile, the small model is the best-performing model in Thai.

Figure: Model Performance vs. Cost

Typhoon 2.5 Cost-Performance

Typhoon 2.5 (Qwen3-30B-A3B) achieves Gemini 2.5 Flash–level performance while being 14× cheaper. Among open-source models, it also outperforms comparable alternatives at the same price point (+5.3% higher average performance at $0.10 per million tokens).

This makes Typhoon 2.5 one of the most cost-efficient models for real-world Thai and English applications.

Figures: Results by Benchmark

Typhoon 2.5 30b benchmark

Typhoon 2.5 30B A3B - General Evaluation

Note: Typhoon 2.1-gemma3-12b and gemma3-12b-it cannot reliably perform tool calling. For these models, results are reported using a ReAct-style agent with manual parsing.

Typhoon 2.5 4b benchmark

Typhoon 2.5 4B - General Evaluation

Note: Typhoon 2.1-gemma3-4b and gemma3-4b-it cannot reliably perform tool calling. For these models, results are reported using a ReAct-style agent with manual parsing.

🛎️ 2. Customer Service Simulation (Tau-Bench)

Beyond academic benchmarks, we tested Typhoon 2.5 in production-style retail support using Tau-Bench. We sampled 50 realistic scenarios per language (English & Thai), with the customer role played by GPT-4o to generate natural, multi-turn requests.

What we tested

  • Complex, policy-aware tasks: order changes, cancellations, returns/refunds, address updates, and product recommendations.
  • Agent reliability: choosing when to use tools, forming precise queries, tracking state across turns, and avoiding hallucinations.
  • End-to-end success: resolution rate and task completion accuracy (did the agent finish the workflow correctly, with the right actions and fields?).

Why this matters

  • These scenarios stress the skills that real customer agents need: multi-step reasoning, tool orchestration, and policy compliance—in natural Thai.

Example: Thai role-play + tool calling (trimmed)

JSON
  • Trace:
    Intent → Verify → Retrieve → Propose → Confirm → Execute → Confirm
  • Each tool_call is a concrete step (e.g., get_user_details, get_order_details, exchange_delivered_order_items).
  • The agent maintains state (selected items, SKUs, prices), handles corrections, and produces a policy-compliant final confirmation in Thai.
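The trace above reduces to a small, policy-aware state machine over concrete tool calls. A mocked sketch reusing the tool names from the trace (the implementations are placeholders, not Tau-Bench's actual environment):

```python
# Mocked Tau-Bench-style exchange flow: verify the user, retrieve the
# order, check policy, and execute only after the customer confirms.
def get_user_details(user_id):
    return {"user_id": user_id, "verified": True}

def get_order_details(order_id):
    return {"order_id": order_id, "status": "delivered", "items": ["SKU-1"]}

def exchange_delivered_order_items(order_id, old_sku, new_sku):
    return {"order_id": order_id, "exchanged": (old_sku, new_sku)}

def handle_exchange(user_id, order_id, old_sku, new_sku, customer_confirms):
    state = {"user": get_user_details(user_id)}          # Verify
    state["order"] = get_order_details(order_id)         # Retrieve
    if state["order"]["status"] != "delivered":          # policy check
        return "only delivered orders can be exchanged"
    if not customer_confirms:                            # Propose -> Confirm
        return "awaiting customer confirmation"
    result = exchange_delivered_order_items(order_id, old_sku, new_sku)  # Execute
    return f"exchanged {result['exchanged'][0]} for {result['exchanged'][1]}"

print(handle_exchange("u1", "o1", "SKU-1", "SKU-2", customer_confirms=True))
```

The benchmark scores whether the agent reaches the final confirmation with the right actions and fields, which is why state tracking across turns matters as much as any single tool call.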

Evaluation setup

  • 50 scenarios per language with GPT-4o role-playing the customer.
  • Scored on resolution success and task completion accuracy, mirroring how real agents are measured.
Typhoon 2.5 Agentic eval

Takeaway: On Tau-Bench retail simulations, Typhoon 2.5 delivers balanced performance across Thai (50) and English (60), averaging 55. This places it well ahead of earlier Typhoon and Qwen baselines, approaching the reliability of leading proprietary models like Claude Sonnet 4 and GPT-4o. Unlike GPT-4o, which excels in English but lags in Thai, Typhoon 2.5 shows more consistent multilingual capability, especially in Thai — a critical differentiator for local deployment.

🗣️ 3. Toward SOTA Fluency & Naturalness

In developing Typhoon 2.5, we’ve also explored the challenge of fluency. Our approach shows clear progress: even when trained on relatively small datasets, responses often feel less like literal translations and more aligned with natural communication — especially in Thai. Benchmarks reflect these improvements, though real-world fluency must also account for neutral, out-of-domain prompts and edge cases at scale.

What we tested: We introduce a Fluency Win Rate metric, powered by a fluency predictor that simulates how Thai speakers would choose the more fluent response.

How does the Fluency Win Rate work?

We built a three-stage pipeline to evaluate fluency beyond correctness:

  1. Human-in-the-loop labeling: Our in-house linguists rated responses on fluency, tone, and contextual word choice.
  2. Fluency Predictor Model: An RFT-tuned classifier trained on these labels to scale evaluations reliably.
  3. Benchmarking: Applied the predictor to responses sampled from WangchanInstruct and IFEval-TH.

Validation: On blind testing, the fluency predictor reached 82% agreement with expert annotators (n=300) — effectively matching human inter-annotator consistency (77%).
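The win rate itself is simple counting over pairwise judgments. A sketch with hypothetical predictor outputs, splitting ties evenly (one common convention; the actual tie handling in our pipeline may differ):

```python
# Win rate over head-to-head fluency comparisons.
# 'a' = model preferred, 'b' = baseline preferred, 't' = tie (split evenly).
def win_rate(preferences):
    wins = preferences.count("a") + 0.5 * preferences.count("t")
    return wins / len(preferences)

# Hypothetical predictor judgments over 10 prompt pairs.
judgments = ["a", "a", "b", "a", "t", "a", "a", "b", "a", "t"]
print(f"win rate: {win_rate(judgments):.0%}")  # 6 wins + 2 half-ties -> 70%
```

In our evaluation, the preference labels come from the fluency predictor rather than a hand-coded list, and each pair compares a model response against the baseline response for the same prompt.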

**Results:** Reported as win rates in head-to-head comparisons against a strong baseline, Claude Sonnet 4.

Typhoon 2.5 30b fluency eval

Typhoon 2.5 30B A3B - Fluency evaluation
Typhoon 2.5 4b fluency eval

Typhoon 2.5 4B - Fluency evaluation

Typhoon 2.5 represents significant progress toward state-of-the-art fluency. That said, fluency remains an unsolved problem: edge cases persist, and achieving broader alignment with Thai conversational norms continues to be a challenge. This release includes our early fluency-focused experiments, which show meaningful improvements, but further scaling of these techniques is essential to consistently generate natural, human-like responses across all domains.

Conclusion

Typhoon 2.5 isn’t just another upgrade — it’s a redefinition of what open-source LLMs can achieve. With stronger agentic behavior, optimized for Thai fluency, and production-grade efficiency, it’s built for real-world deployment at scale.

At Typhoon, we believe AI should be powerful, reliable, and human-centered. This release is a milestone, but also part of an ongoing journey and we welcome your feedback to help us shape the next steps.

Limitations & Future Work

  • Like any LLM, Typhoon 2.5 has boundaries. Human-in-the-loop evaluation and domain-specific testing remain critical to manage risk and ensure safe deployment.
  • Not yet for deep reasoning: Typhoon 2.5 prioritizes speed, stability, and cost-effectiveness for everyday workflows. It isn’t designed for long-horizon planning or complex logical chains — those capabilities will come in future iterations or specialized modes. If you’re working on high-stakes tasks that demand deeper reasoning or domain-specific reliability, we’d love to explore what’s possible together.
  • Fluency Generalization: In Typhoon 2.5, we began by focusing on fluency as a key step toward solving the broader fluency challenge. However, fluency remains an unresolved issue due to persistent edge cases, and achieving broader alignment with Thai conversational norms continues to be a significant challenge.

We’re committed to continuous improvement and your input is essential. Join the conversation on our Typhoon Discord and help us shape the future of open-source AI.

🚀 Try Typhoon 2.5 Today

Experience Typhoon 2.5 across your favorite platforms and workflows:

Typhoon 2.5 is open, fluent, and ready to act — wherever you build.