Meet Typhoon ASR Real-Time: A Game-Changing Open-Source Streaming Speech Recognition Model for Thai

Introducing Typhoon ASR Real-Time

We are excited to release Typhoon ASR Real-Time, a next-generation, open-source Automatic Speech Recognition (ASR) model optimized for real-world performance, and designed to democratize access to Thai speech recognition technology.

It delivers fast, accurate transcriptions while running efficiently on standard CPUs, so anyone can host their own ASR service without expensive hardware or sending sensitive data to third-party clouds.

Key Differentiations

⚡ True streaming capability: Delivers near-instant Thai transcriptions as audio arrives
💻 CPU-optimized performance: Runs efficiently on standard hardware without relying on costly GPUs
🔒 Privacy-first design: Supports full on-premises deployment, ensuring sensitive audio stays under your control
🎯 Fine-tuning accessibility: Compact enough to be customized with minimal resources, even on Google Colab
💰 Low-cost deployment: Cheap to host and run, making real-time Thai ASR truly democratized

The Most Affordable Speech-to-Text on the Market

We are proud to make Speech-to-Text solutions accessible to everyone. It is free for casual users on CPU, and super affordable for high-volume workloads. Typhoon ASR Real-Time redefines what cost-effective ASR can be.

Free for light use: Run locally on your own CPU, no GPU or cloud fees.
Ultra-low cost at scale: Run massive workloads at a fraction of the price.
- On NVIDIA L4 (available on GCP, AWS): only 0.08 THB ($0.0023) per hour of audio, measured on a test run of 20,000 seconds of audio (1,000 files).
- On NVIDIA RTX 2000 Ada (affordable, ownable GPU): just 0.02 THB ($0.0006) per hour of audio.
Cheaper than alternatives: Up to 156x cheaper than Whisper API and 400x+ cheaper than Google or Azure.

Example Cost Comparison

Real-Time ASR Solution	Cost per Audio Hour (USD)	Pricing Source
Typhoon ASR Real-Time	$0.0023	Test Run (20k sec, NVIDIA L4 on Cloud)
Whisper API	$0.36	Several providers deliver Whisper API services. OpenAI’s pricing is shown here.
Google Speech-to-Text	$0.96	Google Cloud Speech-to-Text V2 Pricing
Azure Speech-to-Text	$1.00	Azure Speech-to-Text Standard Real-Time Transcription Pricing

When you think about it…

With Typhoon ASR Real-Time, transcribing 720 hours of conversation (an entire month of nonstop speech) costs less than a cup of coffee ☕ — only $1.65.

Meanwhile, the same 720 hours on Google or Azure, at the rates listed above, would cost roughly $690 to $720.
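The arithmetic behind these comparisons can be checked with a few lines of Python. This is a minimal sketch that only uses the per-hour prices from the table above; the function names and the 720-hour month are illustrative choices, not part of any Typhoon tooling.

```python
# Prices per audio hour (USD), copied from the cost comparison table.
PRICE_PER_AUDIO_HOUR = {
    "Typhoon ASR Real-Time": 0.0023,
    "Whisper API": 0.36,
    "Google Speech-to-Text": 0.96,
    "Azure Speech-to-Text": 1.00,
}

HOURS_PER_MONTH = 720  # 30 days x 24 hours of nonstop audio


def monthly_cost(price_per_hour: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of transcribing `hours` of audio at a flat per-hour price."""
    return price_per_hour * hours


def times_cheaper(other_price: float, baseline_price: float) -> float:
    """How many times more expensive `other_price` is than `baseline_price`."""
    return other_price / baseline_price


if __name__ == "__main__":
    baseline = PRICE_PER_AUDIO_HOUR["Typhoon ASR Real-Time"]
    for name, price in PRICE_PER_AUDIO_HOUR.items():
        print(
            f"{name}: ${monthly_cost(price):,.2f} per month, "
            f"{times_cheaper(price, baseline):.1f}x the Typhoon price"
        )
```

The ratios come out at roughly 156x for the Whisper API and above 400x for Google and Azure, matching the figures quoted earlier.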

Example Results

See Typhoon ASR Real-Time in action as it transcribes real Thai speech inputs.

The demo includes:

Numerical speech: Spoken numbers are transcribed into standard written form for accuracy. (Tip: You can further refine outputs with an LLM to match your desired format.)
Business conversations: Real-world Thai dialogues mixed with some borrowed English words.

Real-World Applications

Typhoon ASR Real-Time makes real-time Thai speech recognition possible across industries and user groups—whether you’re running a large-scale service or just starting out as an independent creator.

1. Live Transcription & Accessibility

Conferences & meetings: Real-time Thai subtitles for business and webinars
Broadcasting: Live captions for TV, radio, and streaming content
Inclusive access: Support for hearing-impaired users in classrooms and workplaces

2. Voice-Driven Workflows

Dictation & documentation: Instant voice-to-text for reports, forms, and data entry
Customer interactions: Call transcription and quality analysis for support teams

3. Private & Affordable for Everyone

Healthcare & legal: On-premises transcription with full confidentiality
Small businesses: Deploy ASR without ongoing API or cloud fees
Content creators & educators: Generate transcripts for Thai podcasts, videos, and lectures at minimal cost

Research & Technology Contributions

To build Typhoon ASR Real-Time, we first examined the limitations of existing open-source ASR models. While models like OpenAI Whisper, Thonburian Whisper, and Pathumma Whisper have set new benchmarks for offline transcription accuracy, they fall short when it comes to real-time, streaming use cases. This gap inspired us to design a new architecture optimized for live, low-latency speech recognition in Thai.

Limitations of Current ASR Models

State-of-the-art open-source systems excel at batch transcription: given a complete audio file, they deliver highly accurate transcripts. This is ideal for podcasts, recorded lectures, or post-processing tasks. But in real-time scenarios, several challenges emerge:

Non-causal architecture: Models rely on future context, which is unavailable during live streaming
Chunking artifacts: Splitting audio into artificial chunks often cuts words in half, causing errors
Processing overhead: Overlapping chunks must be re-processed, wasting compute resources
Latency: Waiting to collect large enough chunks creates noticeable delays

For use cases like live captions, speech assistants, or call transcription, these shortcomings disrupt the natural flow of conversation.
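To make the overhead and latency points concrete, here is a small sketch that counts how much audio a chunk-and-overlap pipeline has to process. The chunk length and overlap values are illustrative assumptions, not settings taken from Whisper or any other specific model.

```python
def chunk_spans(total_s: float, chunk_s: float, overlap_s: float) -> list[tuple[float, float]]:
    """Return (start, end) windows that cover `total_s` seconds with overlap."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap must be non-negative and smaller than the chunk")
    step = chunk_s - overlap_s  # how far each new window advances
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break
        start += step
    return spans


def processed_seconds(spans: list[tuple[float, float]]) -> float:
    """Total seconds of audio the model must run on, overlaps included."""
    return sum(end - start for start, end in spans)


if __name__ == "__main__":
    spans = chunk_spans(total_s=60, chunk_s=10, overlap_s=2)
    total = processed_seconds(spans)
    print(f"{len(spans)} windows, {total:.0f} s processed for 60 s of audio")
    print(f"overhead: {total / 60 - 1:.0%}, first output after at least 10 s")
```

With a 60-second recording, 10-second chunks, and 2 seconds of overlap, this yields 8 windows covering 74 seconds of processing, about 23% more than the audio itself, and nothing can be emitted until the first 10 seconds have arrived.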

Proprietary solutions introduce further barriers:

High cost & resource demands: Expensive GPU infrastructure or recurring cloud API fees
Privacy risks: Sensitive audio data must be sent to third-party servers

Typhoon ASR Real-Time: A New Approach

Typhoon ASR Real-Time was designed to overcome these barriers with a streaming-first architecture that transcribes speech continuously, as it happens. Words or sub-words are emitted the moment confidence is high—enabling smooth, natural, low-latency transcription.

At its core, the model leverages the FastConformer-Transducer architecture, chosen for its balance of speed, accuracy, and efficiency.

Key Features

Causal transducer-based design: Processes speech sequentially without future context, enabling true streaming
Low-latency decoding: Encoder and decoder operate in sync, requiring minimal buffering
Scalable & flexible: Runs on standard CPUs, compact GPUs, or production-scale deployments

By processing audio in small chunks while preserving context across segments, Typhoon ASR Real-Time delivers the responsiveness of streaming ASR with the accuracy required for real-world applications.
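The idea of processing small chunks while carrying context forward can be illustrated with a toy example. The sketch below is not the FastConformer-Transducer model: `CausalSmoother` is a hypothetical stand-in (a simple exponential moving average) chosen only because it is causal and stateful, so it shows why carried state lets chunked output match full-sequence output.

```python
from dataclasses import dataclass


@dataclass
class CausalSmoother:
    """Toy causal 'encoder': each output depends only on past and current input."""
    alpha: float = 0.5
    state: float = 0.0  # context carried from one chunk to the next

    def process(self, chunk: list[float]) -> list[float]:
        out = []
        for x in chunk:
            self.state = self.alpha * x + (1.0 - self.alpha) * self.state
            out.append(self.state)
        return out


def stream(samples: list[float], chunk_size: int, model: CausalSmoother) -> list[float]:
    """Feed `samples` to `model` chunk by chunk, reusing its internal state."""
    outputs = []
    for i in range(0, len(samples), chunk_size):
        outputs.extend(model.process(samples[i:i + chunk_size]))
    return outputs


if __name__ == "__main__":
    samples = [1.0, 0.0, 2.0, 4.0, 3.0, 1.0, 0.5]
    full = CausalSmoother().process(samples)            # whole signal at once
    chunked = stream(samples, 3, CausalSmoother())      # 3 samples at a time
    print("chunked output matches full output:", full == chunked)
```

Because no output ever needs future input and the state survives between calls, the chunk boundaries leave no trace in the result. A model that reset its state at every chunk would lose that property.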

Performance Evaluation

Typhoon ASR Real-Time was benchmarked on diverse datasets, including 970 utterances from GigaSpeech2 and 1,021 utterances from the Google FLEURS test set, to assess both accuracy and throughput performance.

Evaluation Metrics

Character Error Rate (CER): Measures transcription accuracy across different Thai speaking styles. Lower is better.
RTFx (Real-Time Factor X): Ratio of audio duration to processing time. Higher values indicate faster throughput.
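Both metrics are simple to compute. The sketch below is a minimal reference implementation, assuming CER is the character-level Levenshtein distance divided by the reference length and that no text normalization is applied; the exact normalization used in the benchmark is not stated here. Note that Python iterates strings by code point, so Thai vowel and tone marks count as separate characters.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    print(cer("abcd", "abxd"))   # one substitution in four characters
    print(rtfx(20_000, 5.0))     # 20,000 s of audio processed in 5 s
```

As a rough intuition for scale, an RTFx of about 4,000 means one hour of audio is processed in under a second of compute.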

Results

Typhoon ASR Real-Time demonstrates exceptional speed and efficiency:

⚡ 4,097 RTFx processing speed, over 6× faster than the next-best model
🎯 Competitive accuracy with a CER of 0.0984, on par with state-of-the-art models
📊 15–19× faster than Whisper variants, while maintaining comparable transcription quality

These results highlight Typhoon ASR Real-Time as a breakthrough for production-grade Thai speech recognition, combining low latency with the scalability needed for high-volume workloads.

Typhoon ASR Real-Time delivers 4,097× real-time speed with competitive accuracy—up to 19× faster than Whisper.

Try Typhoon ASR Real-Time Today

We have tried to make getting started as easy as possible. Choose the option that fits your needs:

🌐 Web Playground: Try it out instantly in your browser. Perfect for individuals and casual use.
🔌 Typhoon API: Call ASR directly from your app without hosting. Great for prototypes and POCs (rate limits apply).
- 📖 See our Documentation
🖥️ Self-Hosting: Run it on your own device (CPU or GPU). Ideal for enterprises or tech-savvy users who want full control.
- See GitHub Guide
🤗 Model Weights on Hugging Face: Download, experiment, and fine-tune your own ASR.
- Model Card on Hugging Face
- See Example Fine-tuning Code here

Limitations & Future Work

We believe in transparency and continuous improvement. Typhoon ASR Real-Time prioritizes speed, efficiency, and accessibility while acknowledging current limitations that guide our development roadmap.

Current Limitations

Speech-to-Text Focus Only: Designed as a dedicated transcription engine that converts speech to text accurately, without generative capabilities or prompt-based interactions like LLMs.
Single-Speaker Transcription: Produces continuous transcripts without speaker diarization—transcribes what is said, not who said it in multi-speaker scenarios.
Noise Sensitivity: Performance degrades with significant background noise, overlapping speech, or poor audio quality. Optimized for clear primary speaker audio.
Code-Switching Challenges: Limited accuracy on English loanwords and Thai-English code-switching common in modern conversations.

Future Development

Our roadmap is driven by community feedback and real-world usage:

Enhanced Code-Switching Support: Improving accuracy on mixed Thai-English speech patterns and loanwords for natural, modern conversations.
Noise Robustness: Training on diverse, challenging audio datasets to handle everyday environments more effectively.
Community-Driven Features: Open-source development guided by user feedback and contributions from the Thai developer community.

Help Us Improve

This is just the beginning of our journey with Typhoon ASR models. We're committed to continuous improvement based on real-world usage and community feedback.

Found words or vocabulary that our model doesn't recognize well? We'd love to hear about them! Simply share the specific words, terms, or vocabulary that the model struggles with. Your feedback will directly help us enhance the model's performance for everyone.

Join our community channel and be part of building open-source, real-world Thai language technology for everyone.