Typhoon 2.5: A Step Forward in Agentic AI, Thai Fluency, and Efficiency

New Release
Typhoon 2.5

Smarter agents, smoother conversations, unmatched efficiency, and open-source freedom for every scale from edge devices to enterprise systems.

Kunat Pipatanakul

October 20, 2025

Introduction

We’re excited to introduce Typhoon 2.5, the latest milestone in our open-source text LLM family. This release marks a major leap forward in three critical areas:

🔹 Agentic by Design – Smarter tool use, multi-step reasoning, and seamless integration into workflows.

🔹 Scalable Performance & Efficiency – High throughput, ultra-low token cost, and greater efficiency than any previous Typhoon release.

🔹 More Fluent Interactions – Responses that better capture rhythm and tone, especially in Thai, where nuance matters.

While proprietary models still dominate the AI landscape, they often fall short in accessibility and openness. On the other hand, open-source models offer clear advantages — transparency, flexibility, and lower cost — yet many still lag behind in real-world usability.

That’s why we built Typhoon 2.5 — an open-source model designed to bridge that gap, delivering seamless integration into agent-driven workflows and enabling natural, fluent, human-like interactions across real-world applications.

Key Highlights

  • Two Variants: Lean or Mighty

    4B: Ultra-efficient inference, runs on edge devices.

    30B (A3B): Production-grade scale with MoE efficiency, delivering the strength of a 30B model while activating only about 3B parameters per token.

  • Proprietary-Grade Performance with Open Source Benefits
    Matches GPT-4o and Claude Sonnet 4 in benchmarks, while staying fully transparent, controllable, and cost-efficient.

  • Optimized for Fluency, Not Only Accuracy

    Delivers smooth, natural responses that feel human — not just accurate.

  • High Throughput, Low Cost

    A single H100 can handle 3,000+ tokens per second at 64 concurrent requests — bringing inference costs down to as low as $0.10 per million tokens.

  • Enhanced Function Calling & Built for Agents

    Higher accuracy and reliability for real-world automation — works smoothly with tools like n8n, LangChain, or custom orchestration.

  • Built on Qwen3 Instruct 2507
    Leverages the latest open-source foundation for robust instruction-following, knowledge accuracy, and versatile task execution.

🤖 Agentic by Design

The future of LLMs isn’t just chat — it’s action. Typhoon 2.5 is built to act, not just reply.

  • Multi-step reasoning: Plan, chain, and execute tasks across tools.
  • Smarter function calling: Higher accuracy and reliability in structured outputs.
  • Workflow integration: Works seamlessly with n8n, LangGraph, or custom orchestration pipelines.
  • Real-world use cases: From weekly business reports to customer service automation, Typhoon 2.5 handles tasks end-to-end.
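The loop behind these capabilities can be sketched in a few lines: the model either returns a final answer or requests a tool call, and the orchestrator executes the tool and feeds the result back. A minimal, self-contained sketch (the `call_model` stub and `search` tool are hypothetical stand-ins, not Typhoon's actual API):

```python
# Minimal agentic loop: the model either answers or requests a tool call;
# the orchestrator executes the tool and feeds the observation back.

def call_model(messages):
    # Placeholder for an OpenAI-compatible chat call with tool support.
    # Simulated here: ask for a tool once, then answer from its result.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "search", "args": {"q": "weekly sales"}}}
    return {"content": "Report: sales grew 4% week-over-week."}

TOOLS = {"search": lambda q: f"results for '{q}'"}  # mock tool registry

def run_agent(user_request, max_steps=5):
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "tool_call" not in reply:
            return reply["content"]                   # model is done
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["args"])  # execute the tool
        messages.append({"role": "tool", "content": result})
    return "step budget exhausted"

print(run_agent("Summarize this week's sales."))
```

Frameworks like n8n or LangGraph implement this same loop with real model calls and tool schemas; the structure stays the same.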

Typhoon 2.5 is more than a chatbot: it's a reliable agent for research, operations, and automation.

⚡ High Throughput & Ultra-Low Cost

One of the most exciting leaps in Typhoon 2.5 is not just what it can do — but how efficiently it does it.

  • 3,000+ tokens/sec on a single H100 at 64 concurrent requests
  • Inference cost as low as $0.10 per million tokens
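The $0.10 figure follows directly from throughput and GPU rental cost. A quick sanity check of what hourly GPU price those two numbers imply (arithmetic only, not an actual price quote; see the appendix for the full setup):

```python
# Relate per-token cost to GPU rental price at a given throughput.
tokens_per_second = 3000
cost_per_million_usd = 0.10

tokens_per_hour = tokens_per_second * 3600       # 10,800,000 tokens/hour
millions_per_hour = tokens_per_hour / 1_000_000  # 10.8M tokens/hour

# Implied hourly GPU cost consistent with $0.10 per million tokens.
implied_gpu_cost_per_hour = cost_per_million_usd * millions_per_hour
print(f"implied GPU cost: ${implied_gpu_cost_per_hour:.2f}/hour")
```

In other words, at 3,000+ tokens/sec, any H100 rental at roughly this hourly rate yields the stated per-token cost; higher concurrency is what pushes aggregate throughput into this range.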

📊 Here’s how Typhoon 2.5 stacks up against past releases and peers:

Typhoon 2.5 Throughput

Same GPU. Higher speed. Typhoon 2.5 delivers over 3,000 tokens/sec on a single H100, a ~40% improvement in throughput compared to Typhoon 2.1.

Typhoon 2.5 Inference Cost

Typhoon 2.5 delivers great performance at just $0.10 per million tokens: 33% cheaper than Typhoon 2.1, 91% cheaper than GPT-5 Mini, and 93% cheaper than Gemini 2.5 Flash.

💡 **What this means for your business:** Process more data for the same budget, or achieve the same results at a fraction of the cost. Whether you're running large-scale operations, serving thousands of chatbot sessions, conducting research, or building AI-powered applications, Typhoon 2.5's industry-leading price point makes sophisticated language processing economically viable for projects of any size.

🧮 For our calculation setup, please see this appendix

🗣️ Fluency That Feels Natural

Great AI isn’t just correct — it feels right.

This is especially important in Thai, where literal translations often miss the intended meaning. Drawing on our earlier work in Typhoon Translate, we built Typhoon 2.5 with human-in-the-loop labeling and fluency-focused evaluation.

Many models can produce accurate answers, but their language often reads like a rough translation: stiff, awkward, or missing cultural nuance. Typhoon 2.5 changes that by redefining fluency as more than correctness; it's about rhythm, tone, and natural word choice. We are working toward this goal and are excited to share our progress.

Example Prompt:

TEXT

Typhoon 2.5 Response:

TEXT

Claude 4 Response:

TEXT

Evaluation Methodology

We benchmarked Typhoon 2.5 across instruction-following, agentic reasoning, real-world use cases, and fluency, combining both academic tests and production-style simulations.

📘 1. General Instruction Following & Agentic Capabilities

  • MT-Bench (English & Thai)

    An LLM-as-judge framework that evaluates correctness and instruction adherence on open-ended tasks. We tested Typhoon 2.5 on both the official LMSYS English benchmark and a Thai adaptation, spanning domains such as Thai knowledge, math, role-play, and creative writing.

  • Instruction-Following Accuracy — IFEval & IFBench (English & Thai)

    Benchmarks that measure how well the model follows instructions in verifiable, testable scenarios. We report results by averaging across both datasets:

    • IFEval: Evaluates compliance with verifiable instructions across 500+ test cases.
    • IFBench: A more challenging successor with complex constraint chains.
  • Agentic & Tool-Reasoning Capabilities — HotpotQA

    A multi-hop QA benchmark that requires retrieving and synthesizing information from multiple Wikipedia sources. We used it to test Typhoon 2.5’s ability to:

    • Decide when to call external tools (e.g., search APIs) versus answering directly.
    • Plan and chain tool calls with well-formed queries.
    • Stop at the right time, extract relevant information, and synthesize accurate answers while avoiding hallucinations.

    Evaluation was conducted on the medium subset of 100 queries per language (English & Thai), reflecting real-world research workflows.
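The multi-hop behavior HotpotQA probes can be pictured as a chain of retrieve-and-extract hops: each retrieved document names the next entity to look up. A toy sketch with a two-document mock corpus and naive string extraction (everything here is illustrative, not the evaluation harness):

```python
# Toy multi-hop QA in the spirit of HotpotQA: answering
# "Where is the developer of Typhoon headquartered?" needs two hops.
CORPUS = {
    "Typhoon": "Typhoon is developed by SCB 10X.",
    "SCB 10X": "SCB 10X is headquartered in Bangkok.",
}

def retrieve(query):
    # Stand-in for a search tool call.
    return CORPUS.get(query, "")

def multi_hop_answer(start_entity, relation_chain):
    """Follow one retrieval hop per relation, extracting the next entity."""
    entity = start_entity
    for _ in relation_chain:
        doc = retrieve(entity)
        # Naive extraction: take the phrase after "by" or "in".
        entity = doc.rstrip(".").split(" by ")[-1].split(" in ")[-1]
    return entity

print(multi_hop_answer("Typhoon", ["developer", "headquarters"]))
```

The benchmark stresses exactly the failure modes listed above: calling the tool when needed, forming the right second query from the first result, and stopping once the answer is in hand.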

Instruction-following and Agentic Evaluation Results — Typhoon 2.5 shows improvements over Typhoon 2.1 and other state-of-the-art open-source models in both general instruction-following and agentic use cases. The large model is competitive with proprietary alternatives while retaining the benefits of open source: privacy, customization, and cost efficiency. Meanwhile, the small model is the best-performing model in Thai.

Figure: Model Performance vs. Cost

Typhoon 2.5 Cost-Performance

Typhoon 2.5 (Qwen3-30B-A3B) achieves Gemini 2.5 Flash–level performance while being 14× cheaper. Among open-source models, it also outperforms comparable alternatives at the same price point (+5.3% higher average performance at $0.10 per million tokens).

This makes Typhoon 2.5 one of the most cost-efficient models for real-world Thai and English applications.

Figures: Results by Benchmark

Typhoon 2.5 30b benchmark

Typhoon 2.5 30B A3B - General Evaluation

Note: Typhoon 2.1-gemma3-12b and gemma3-12b-it cannot reliably perform tool calling. For these models, results are reported using a ReAct-style agent with manual parsing.

Typhoon 2.5 4b benchmark

Typhoon 2.5 4B - General Evaluation

Note: Typhoon 2.1-gemma3-4b and gemma3-4b-it cannot reliably perform tool calling. For these models, results are reported using a ReAct-style agent with manual parsing.

🛎️ 2. Customer Service Simulation (Tau-Bench)

Beyond academic benchmarks, we tested Typhoon 2.5 in production-style retail support using Tau-Bench. We sampled 50 realistic scenarios per language (English & Thai), with the customer role played by GPT-4o to generate natural, multi-turn requests.

What we tested

  • Complex, policy-aware tasks: order changes, cancellations, returns/refunds, address updates, and product recommendations.
  • Agent reliability: choosing when to use tools, forming precise queries, tracking state across turns, and avoiding hallucinations.
  • End-to-end success: resolution rate and task completion accuracy (did the agent finish the workflow correctly, with the right actions and fields?).

Why this matters

  • These scenarios stress the skills that real customer agents need: multi-step reasoning, tool orchestration, and policy compliance—in natural Thai.

Example: Thai role-play + tool calling (trimmed)

JSON
  • Trace:
    Intent → Verify → Retrieve → Propose → Confirm → Execute → Confirm
  • Each tool_call is a concrete step (e.g., get_user_details, get_order_details, exchange_delivered_order_items).
  • The agent maintains state (selected items, SKUs, prices), handles corrections, and produces a policy-compliant final confirmation in Thai.
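The trace above reduces to a small, policy-aware state machine over concrete tool calls. A mocked sketch reusing the tool names from the trace (the implementations are placeholders, not Tau-Bench's actual environment):

```python
# Mocked Tau-Bench-style exchange flow: verify the user, retrieve the
# order, check policy, and execute only after the customer confirms.
def get_user_details(user_id):
    return {"user_id": user_id, "verified": True}

def get_order_details(order_id):
    return {"order_id": order_id, "status": "delivered", "items": ["SKU-1"]}

def exchange_delivered_order_items(order_id, old_sku, new_sku):
    return {"order_id": order_id, "exchanged": (old_sku, new_sku)}

def handle_exchange(user_id, order_id, old_sku, new_sku, customer_confirms):
    state = {"user": get_user_details(user_id)}          # Verify
    state["order"] = get_order_details(order_id)         # Retrieve
    if state["order"]["status"] != "delivered":          # policy check
        return "only delivered orders can be exchanged"
    if not customer_confirms:                            # Propose -> Confirm
        return "awaiting customer confirmation"
    result = exchange_delivered_order_items(order_id, old_sku, new_sku)  # Execute
    return f"exchanged {result['exchanged'][0]} for {result['exchanged'][1]}"

print(handle_exchange("u1", "o1", "SKU-1", "SKU-2", customer_confirms=True))
```

The benchmark scores whether the agent reaches the final confirmation with the right actions and fields, which is why state tracking across turns matters as much as any single tool call.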

Evaluation setup

  • 50 scenarios per language with GPT-4o role-playing the customer.
  • Scored on resolution success and task completion accuracy, mirroring how real agents are measured.
Typhoon 2.5 Agentic eval

Takeaway: On Tau-Bench retail simulations, Typhoon 2.5 delivers balanced performance across Thai (50) and English (60), averaging 55. This places it well ahead of earlier Typhoon and Qwen baselines, approaching the reliability of leading proprietary models like Claude Sonnet 4 and GPT-4o. Unlike GPT-4o, which excels in English but lags in Thai, Typhoon 2.5 shows more consistent multilingual capability, especially in Thai — a critical differentiator for local deployment.

🗣️ 3. Toward SOTA Fluency & Naturalness

In developing Typhoon 2.5, we’ve also explored the challenge of fluency. Our approach shows clear progress: even when trained on relatively small datasets, responses often feel less like literal translations and more aligned with natural communication — especially in Thai. Benchmarks reflect these improvements, though real-world fluency must also account for neutral, out-of-domain prompts and edge cases at scale.

What we tested: We introduce a Fluency Win Rate metric, powered by a fluency predictor that simulates how Thai speakers would choose the more fluent response.

How does the Fluency Win Rate work?

We built a three-stage pipeline to evaluate fluency beyond correctness:

  1. Human-in-the-loop labeling: Our in-house linguists rated responses on fluency, tone, and contextual word choice.
  2. Fluency Predictor Model: An RFT-tuned classifier trained on these labels to scale evaluations reliably.
  3. Benchmarking: Applied the predictor to responses sampled from WangchanInstruct and IFEval-TH.

Validation: On blind testing, the fluency predictor reached 82% agreement with expert annotators (n=300) — effectively matching human inter-annotator consistency (77%).
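The win rate itself is simple counting over pairwise judgments. A sketch with hypothetical predictor outputs, splitting ties evenly (one common convention; the actual tie handling in our pipeline may differ):

```python
# Win rate over head-to-head fluency comparisons.
# 'a' = model preferred, 'b' = baseline preferred, 't' = tie (split evenly).
def win_rate(preferences):
    wins = preferences.count("a") + 0.5 * preferences.count("t")
    return wins / len(preferences)

# Hypothetical predictor judgments over 10 prompt pairs.
judgments = ["a", "a", "b", "a", "t", "a", "a", "b", "a", "t"]
print(f"win rate: {win_rate(judgments):.0%}")  # 6 wins + 2 half-ties -> 70%
```

In our evaluation, the preference labels come from the fluency predictor rather than a hand-coded list, and each pair compares a model response against the baseline response for the same prompt.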

**Results:** Reported as win rates in head-to-head comparisons against a strong baseline, Claude Sonnet 4.

Typhoon 2.5 30b fluency eval

Typhoon 2.5 30B A3B - Fluency evaluation
Typhoon 2.5 4b fluency eval

Typhoon 2.5 4B - Fluency evaluation

Typhoon 2.5 represents significant progress toward state-of-the-art fluency. That said, fluency remains an unsolved problem: edge cases persist, and achieving broader alignment with Thai conversational norms continues to be a challenge. This release includes our early fluency-focused experiments, which show meaningful improvements, but further scaling of these techniques is essential to consistently generate natural, human-like responses across all domains.

Conclusion

Typhoon 2.5 isn’t just another upgrade — it’s a redefinition of what open-source LLMs can achieve. With stronger agentic behavior, optimized for Thai fluency, and production-grade efficiency, it’s built for real-world deployment at scale.

At Typhoon, we believe AI should be powerful, reliable, and human-centered. This release is a milestone, but also part of an ongoing journey and we welcome your feedback to help us shape the next steps.

Limitations & Future Work

  • Like any LLM, Typhoon 2.5 has boundaries. Human-in-the-loop evaluation and domain-specific testing remain critical to manage risk and ensure safe deployment.
  • Not yet for deep reasoning: Typhoon 2.5 prioritizes speed, stability, and cost-effectiveness for everyday workflows. It isn’t designed for long-horizon planning or complex logical chains — those capabilities will come in future iterations or specialized modes. If you’re working on high-stakes tasks that demand deeper reasoning or domain-specific reliability, we’d love to explore what’s possible together.
  • Fluency Generalization: In Typhoon 2.5, we began by focusing on fluency as a key step toward solving the broader fluency challenge. However, fluency remains an unresolved issue due to persistent edge cases, and achieving broader alignment with Thai conversational norms continues to be a significant challenge.

We’re committed to continuous improvement and your input is essential. Join the conversation on our Typhoon Discord and help us shape the future of open-source AI.

🚀 Try Typhoon 2.5 Today

Experience Typhoon 2.5 across your favorite platforms and workflows:

Typhoon 2.5 is open, fluent, and ready to act — wherever you build.