Introduction
We’re excited to introduce Typhoon 2.5, the latest milestone in our open-source text LLM family. This release marks a major leap forward in three critical areas:
🔹 Agentic by Design — Smarter tool use, multi-step reasoning, and seamless integration into workflows.
🔹 Scalable Performance & Efficiency — High throughput and ultra-low token cost, more efficient than any previous Typhoon release.
🔹 More Fluent Interactions — Responses that better capture rhythm and tone — especially in Thai, where nuance matters.
While proprietary models still dominate the AI landscape, they often fall short in accessibility and openness. On the other hand, open-source models offer clear advantages — transparency, flexibility, and lower cost — yet many still lag behind in real-world usability.
That’s why we built Typhoon 2.5 — an open-source model designed to bridge that gap, delivering seamless integration into agent-driven workflows and enabling natural, fluent, human-like interactions across real-world applications.
Key Highlights
- Two Variants: Lean or Mighty
  → 4B: Ultra-efficient inference, runs on edge devices.
  → 30B (A3B): Production-grade scale with MoE efficiency — delivers the strength of a 30B model while consuming compute closer to a 3B model.
- Proprietary-Grade Performance with Open-Source Benefits
  Matches GPT-4o and Claude Sonnet 4 in benchmarks, while staying fully transparent, controllable, and cost-efficient.
- Optimized for Fluency, Not Only Accuracy
  Delivers smooth, natural responses that feel human — not just accurate.
- High Throughput, Low Cost
  A single H100 can handle 3,000+ tokens per second at 64 concurrent requests — bringing inference costs down to as low as $0.10 per million tokens.
- Enhanced Function Calling, Built for Agents
  Higher accuracy and reliability for real-world automation — works smoothly with tools like n8n, LangChain, or custom orchestration.
- Built on Qwen3 Instruct 2507
  The latest open-source foundation provides robust instruction-following, knowledge accuracy, and versatile task execution.
🤖 Agentic by Design
The future of LLMs isn’t just chat — it’s action. Typhoon 2.5 is built to act, not just reply.
- Multi-step reasoning: Plan, chain, and execute tasks across tools.
- Smarter function calling: Higher accuracy and reliability in structured outputs.
- Workflow integration: Works seamlessly with n8n, LangGraph, or custom orchestration pipelines.
- Real-world use cases: From weekly business reports to customer service automation, Typhoon 2.5 handles tasks end-to-end.
Typhoon 2.5 becomes more than a chatbot: it's a reliable agent for research, operations, and automation.
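To make the integration concrete, here is a minimal sketch of driving Typhoon 2.5 as a tool-using agent through an OpenAI-compatible chat-completions endpoint. The base URL, model id, and the get_weekly_sales tool are illustrative placeholders rather than part of the release itself; substitute the values from your own deployment or orchestration framework.

```python
# Minimal sketch: Typhoon 2.5 as a tool-using agent via an OpenAI-compatible API.
# The endpoint, model id, and tool below are placeholders for illustration only.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-typhoon-endpoint/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weekly_sales",  # hypothetical tool for illustration
        "description": "Return total sales for a given ISO week.",
        "parameters": {
            "type": "object",
            "properties": {"week": {"type": "string", "description": "e.g. 2025-W37"}},
            "required": ["week"],
        },
    },
}]

response = client.chat.completions.create(
    model="typhoon-2.5-30b-a3b",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize last week's sales."}],
    tools=tools,
)

# If the model decides a tool is needed, it returns a structured tool call
# instead of free text; your orchestrator executes it and feeds the result back.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

This request/response loop is the same pattern that orchestration tools such as n8n or LangGraph typically build on: the model emits structured tool calls, the workflow executes them, and the results are returned for the next step.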
⚡ High Throughput & Ultra-Low Cost
One of the most exciting leaps in Typhoon 2.5 is not just what it can do — but how efficiently it does it.
- 3,000+ tokens/sec on a single H100 at 64 concurrent requests
- Inference cost as low as $0.10 per million tokens
📊 Here’s how Typhoon 2.5 stacks up against past releases and peers:

Same GPU. Higher Speed. Typhoon 2.5 delivers over 3,000 tokens/sec on a single H100 — a ~40% improvement in throughput compared to Typhoon 2.1.

Typhoon 2.5 delivers great performance at just $0.10 per million tokens — that's 33% cheaper than Typhoon 2.1, 91% cheaper than GPT-5 Mini, and 93% cheaper than Gemini 2.5 Flash.
💡 **What this means for your business:** Process more data for the same budget, or achieve the same results at a fraction of the cost. Whether you're running large-scale operations, serving thousands of chatbot sessions, conducting research, or building AI-powered applications, Typhoon 2.5's industry-leading price point makes sophisticated language processing economically viable for projects of any size.
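As a back-of-the-envelope check, the per-token figure follows directly from throughput and GPU rental price. The $1.10/hour H100 rate used below is purely an illustrative assumption; the appendix linked below documents the actual calculation setup.

```python
# Back-of-the-envelope cost per million tokens from throughput and GPU rental price.
# The hourly H100 rate here is an assumption for illustration, not an official figure.
throughput_tok_per_s = 3000      # observed at 64 concurrent requests
gpu_cost_per_hour = 1.10         # assumed hourly H100 rate (USD)

tokens_per_hour = throughput_tok_per_s * 3600           # 10.8M tokens/hour
cost_per_million = gpu_cost_per_hour / (tokens_per_hour / 1e6)

print(f"~${cost_per_million:.2f} per million tokens")   # ~$0.10
```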
🧮 For our calculation setup, please see this appendix.
🗣️ Fluency That Feels Natural
Great AI isn’t just correct — it feels right.
This is especially important in Thai, where literal translations often miss the intended meaning. Drawing on our earlier work in Typhoon Translate, we built Typhoon 2.5 with human-in-the-loop labeling and fluency-focused evaluation.
Many models can produce accurate answers, but their language often reads like a rough translation: stiff, awkward, or missing cultural nuance. Typhoon 2.5 changes that by redefining fluency as more than correctness — it's about rhythm, tone, and natural word choice. We are working toward this goal and are excited to share our progress.
Example Prompt:
Typhoon 2.5 Response:
Claude 4 Response:
Evaluation Methodology
We benchmarked Typhoon 2.5 across instruction-following, agentic reasoning, real-world use cases, and fluency, combining academic tests with production-style simulations.
📘 1. General Instruction Following & Agentic Capabilities
- An LLM-as-judge framework that evaluates correctness and instruction adherence on open-ended tasks. We tested Typhoon 2.5 on both the official LMSYS English benchmark and a Thai adaptation, spanning domains such as Thai knowledge, math, role-play, and creative writing.
- Instruction-Following Accuracy — IFEval & IFBench (English & Thai)
  Benchmarks that measure how well the model follows instructions in verifiable, testable scenarios. We report results by averaging across both datasets.
- Agentic & Tool-Reasoning Capabilities — HotpotQA
  A multi-hop QA benchmark that requires retrieving and synthesizing information from multiple Wikipedia sources. We used it to test Typhoon 2.5's ability to:
  - Decide when to call external tools (e.g., search APIs) versus answering directly.
  - Plan and chain tool calls with well-formed queries.
  - Stop at the right time, extract relevant information, and synthesize accurate answers while avoiding hallucinations.
Evaluation was conducted on the medium subset of 100 queries per language (English & Thai), reflecting real-world research workflows (the underlying tool-use loop is sketched below).
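The following is a minimal sketch of that multi-hop loop under the same function-calling interface shown earlier. The search_wikipedia tool, endpoint, and model id are illustrative stand-ins for whatever retrieval backend and deployment you wire up; they are not the benchmark harness itself.

```python
# Sketch of the multi-hop tool-use loop: the model either answers directly or
# requests a search, the result is appended to the conversation, and the loop
# repeats until a final answer is produced or a hop cap is reached.
import json
from openai import OpenAI

client = OpenAI(base_url="https://your-typhoon-endpoint/v1", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "search_wikipedia",  # hypothetical retrieval tool
        "description": "Search Wikipedia and return the top passages.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def search_wikipedia(query: str) -> str:
    # Plug in your retrieval backend here and return concatenated passages.
    return ""

messages = [{"role": "user", "content": "Which film did the director of X release first?"}]
for _ in range(5):  # cap the number of hops
    reply = client.chat.completions.create(
        model="typhoon-2.5-30b-a3b",  # placeholder model id
        messages=messages,
        tools=tools,
    ).choices[0].message
    if not reply.tool_calls:  # the model chose to answer directly
        print(reply.content)
        break
    messages.append(reply)
    for call in reply.tool_calls:  # execute each requested search and feed it back
        result = search_wikipedia(**json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```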
Instruction-following and Agentic Evaluation Results — Typhoon 2.5 shows improvements over Typhoon 2.1 and other state-of-the-art open-source models in both general instruction-following and agentic use cases. The large model is competitive with proprietary alternatives while retaining the benefits of open-source—privacy, customization, and cost efficiency. Meanwhile, the small model is the best-performing model in Thai.
Figure: Model Performance vs. Cost

Typhoon 2.5 (Qwen3-30B-A3B) achieves Gemini 2.5 Flash–level performance while being 14× cheaper. Among open-source models, it also outperforms comparable alternatives at the same price point (+5.3% higher average performance at $0.10 per million tokens).
This makes Typhoon 2.5 one of the most cost-efficient models for real-world Thai and English applications.
Figures: Results by Benchmark

Typhoon 2.5 30B A3B - General Evaluation
Note: Typhoon 2.1-gemma3-12b and gemma3-12b-it cannot reliably perform tool calling. For these models, results are reported using a ReAct-style agent with manual parsing.

Typhoon 2.5 4B - General Evaluation
Note: Typhoon 2.1-gemma3-4b and gemma3-4b-it cannot reliably perform tool calling. For these models, results are reported using a ReAct-style agent with manual parsing.
🛎️ 2. Customer Service Simulation (Tau-Bench)
Beyond academic benchmarks, we tested Typhoon 2.5 in production-style retail support using Tau-Bench. We sampled 50 realistic scenarios per language (English & Thai), with the customer role played by GPT-4o to generate natural, multi-turn requests.
What we tested
- Complex, policy-aware tasks: order changes, cancellations, returns/refunds, address updates, and product recommendations.
- Agent reliability: choosing when to use tools, forming precise queries, tracking state across turns, and avoiding hallucinations.
- End-to-end success: resolution rate and task completion accuracy (did the agent finish the workflow correctly, with the right actions and fields?).
Why this matters
- These scenarios stress the skills that real customer agents need: multi-step reasoning, tool orchestration, and policy compliance—in natural Thai.
Example: Thai role-play + tool calling (trimmed)
- Trace: Intent → Verify → Retrieve → Propose → Confirm → Execute → Confirm
- Each tool_call is a concrete step (e.g., get_user_details, get_order_details, exchange_delivered_order_items; declared as function schemas in the sketch below).
- The agent maintains state (selected items, SKUs, prices), handles corrections, and produces a policy-compliant final confirmation in Thai.
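For reference, here is a simplified sketch of how tools like those in the trace can be exposed to the model as function-calling schemas. The parameter lists are abbreviated for illustration and do not reproduce the full Tau-Bench specification.

```python
# Simplified, illustrative function schemas for the retail tools seen in the trace.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_user_details",
            "description": "Look up a customer profile by user id.",
            "parameters": {
                "type": "object",
                "properties": {"user_id": {"type": "string"}},
                "required": ["user_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_order_details",
            "description": "Fetch an order, including items, SKUs, and status.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "exchange_delivered_order_items",
            "description": "Exchange items in a delivered order for new variants.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"},
                    "item_ids": {"type": "array", "items": {"type": "string"}},
                    "new_item_ids": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["order_id", "item_ids", "new_item_ids"],
            },
        },
    },
]
```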
Evaluation setup
- 50 scenarios per language with GPT-4o role-playing the customer.
- Scored on resolution success and task completion accuracy, mirroring how real agents are measured.

Takeaway: On Tau-Bench retail simulations, Typhoon 2.5 delivers balanced performance across Thai (50) and English (60), averaging 55. This places it well ahead of earlier Typhoon and Qwen baselines, approaching the reliability of leading proprietary models like Claude Sonnet 4 and GPT-4o. Unlike GPT-4o, which excels in English but lags in Thai, Typhoon 2.5 shows more consistent multilingual capability, especially in Thai — a critical differentiator for local deployment.
🗣️ 3. Toward SOTA Fluency & Naturalness
In developing Typhoon 2.5, we’ve also explored the challenge of fluency. Our approach shows clear progress: even when trained on relatively small datasets, responses often feel less like literal translations and more aligned with natural communication — especially in Thai. Benchmarks reflect these improvements, though real-world fluency must also account for neutral, out-of-domain prompts and edge cases at scale.
What we tested: We introduce a Fluency Win Rate metric, built on a fluency predictor that simulates how Thai speakers would choose the more fluent response.
How does the Fluency Win Rate work?
We built a three-stage pipeline to evaluate fluency beyond correctness:
- Human-in-the-loop labeling: Our in-house linguists rated responses on fluency, tone, and contextual word choice.
- Fluency Predictor Model: An RFT-tuned classifier trained on these labels to scale evaluations reliably.
- Benchmarking: Applied the predictor to responses sampled from WangchanInstruct and IFEval-TH.
Validation: On blind testing, the fluency predictor reached 82% agreement with expert annotators (n=300) — effectively matching human inter-annotator consistency (77%).
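As a concrete reading of the metric, here is a minimal sketch of the win-rate computation, assuming a predict_more_fluent function that stands in for the RFT-tuned classifier described above.

```python
# Minimal sketch of the win-rate computation: for each prompt, the fluency
# predictor picks the more natural of two candidate responses, and the win
# rate is the fraction of pairs where the model beats the baseline.
from typing import Callable

def fluency_win_rate(
    pairs: list[tuple[str, str]],                    # (model_response, baseline_response)
    predict_more_fluent: Callable[[str, str], int],  # returns 0 if the first response wins
) -> float:
    wins = sum(
        1
        for model_resp, base_resp in pairs
        if predict_more_fluent(model_resp, base_resp) == 0
    )
    return wins / len(pairs)
```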
**Results:** Reported as win rates in head-to-head comparisons against a strong proprietary baseline, Claude Sonnet 4.

Typhoon 2.5 30B A3B - Fluency evaluation

Typhoon 2.5 4B - Fluency evaluation
Typhoon 2.5 represents significant progress toward state-of-the-art fluency. That said, fluency remains an unsolved problem. Edge cases persist, and achieving broader alignment with Thai conversational norms continues to be a challenge. This release includes our early fluency-focused experiments, which show meaningful improvements — but further scaling of these techniques is essential to consistently generate natural, human-like responses across all domains.
Conclusion
Typhoon 2.5 isn't just another upgrade — it's a redefinition of what open-source LLMs can achieve. With stronger agentic behavior, improved Thai fluency, and production-grade efficiency, it's built for real-world deployment at scale.
At Typhoon, we believe AI should be powerful, reliable, and human-centered. This release is a milestone, but it is also part of an ongoing journey, and we welcome your feedback to help shape the next steps.
Limitations & Future Work
- Like any LLM, Typhoon 2.5 has boundaries. Human-in-the-loop evaluation and domain-specific testing remain critical to manage risk and ensure safe deployment.
- Not yet for deep reasoning: Typhoon 2.5 prioritizes speed, stability, and cost-effectiveness for everyday workflows. It isn’t designed for long-horizon planning or complex logical chains — those capabilities will come in future iterations or specialized modes. If you’re working on high-stakes tasks that demand deeper reasoning or domain-specific reliability, we’d love to explore what’s possible together.
- Fluency Generalization: In Typhoon 2.5, we focused on fluency as a first step toward the broader challenge of natural Thai generation. It remains an unresolved problem: edge cases persist, and achieving broader alignment with Thai conversational norms continues to be a significant challenge.
We’re committed to continuous improvement and your input is essential. Join the conversation on our Typhoon Discord and help us shape the future of open-source AI.
🚀 Try Typhoon 2.5 Today
Experience Typhoon 2.5 across your favorite platforms and workflows:
- 🌐 Web Playground — try it instantly in your browser.
- 🔌 Typhoon API — integrate with your own applications.
- 🤗 Hugging Face — run or fine-tune directly from the Hub.
- 💻 Ollama — deploy locally in one line.
- 🔄 Your Workflow — automate actions with your workflow orchestration tools.
For n8n, see Typhoon + n8n Integration Guide.
Typhoon 2.5 is open, fluent, and ready to act — wherever you build.