Production ASR · 2026

Parakeet TDT v3 vs Whisper Turbo vs Qwen3-ASR: what people actually use in production

Three open ASR models, one real question — which one holds up when real users speak into real microphones at 3am with accents and background noise. Here is what production deployments look like in 2026.

By SnailText's founder · Published 2026-06-28

The short version

For production voice AI in 2026: Parakeet TDT v3 is fastest on CPU (RTF < 0.1×, 50-150ms for a 2-second clip, ~3-5% WER on clean English) and wins on short commands. Whisper Large-v3-Turbo wins on accented speech, domain vocabulary, and 99 languages (~2-3% WER); Groq's hosted version delivers roughly 300ms (200-350ms including network round-trip) at $0.04/audio hour without self-hosting. Qwen3-ASR is the third option for code-mixed speech like Hinglish. On clean benchmarks, Parakeet leads on speed. On production audio with accents or domain vocab, Whisper leads on quality. Fine-tuning on 500-2000 domain examples beats switching base models.

Model comparison at a glance (verified 2026-06-28)
Model	Size	RTF (CPU)	RTF (A100)	Languages	WER (LibriSpeech*)	Best for
Parakeet TDT v3	0.6B	< 0.1×	< 0.01×	25	~3-5%	Short commands, low-latency CPU agents
Parakeet TDT v2 (EN-only)	0.6B	< 0.1×	< 0.01×	1 (EN)	~2.5-4%	English-only, highest speed/accuracy ratio on CPU
Whisper Large-v3-Turbo	809M	~0.5-1×	< 0.05×	99	~2-3%	Accented speech, domain vocabulary, multilingual
Qwen3-ASR 0.5B	0.5B	~0.3-0.8×	< 0.05×	100+	~3-5%	Code-mixed languages, multilingual at low size
Nemotron ASR 3.5	~600M	< 0.1×	< 0.01×	~20	~2-4%	Streaming, NIM deployment, high concurrency
Whisper Large-v3 (full)	1.5B	~3-8×	~0.1×	99	~2%	Maximum accuracy, offline file transcription

For a production voice agent in 2026: use Parakeet TDT v3 if your audio is short English commands on CPU; use Whisper Large-v3-Turbo (via Groq or self-hosted) if your users have accented speech or domain vocabulary; use Qwen3-ASR if you need code-mixed multilingual support. The full answer depends on what your audio actually looks like, and that is where the benchmark numbers stop being useful.

The benchmarks say Parakeet. The production reports say “it depends.” Both are right. Here is what the dependency actually looks like.

* WER figures in the table above are approximate values from the LibriSpeech clean test set per published NVIDIA and OpenAI benchmarks. Production WER varies significantly with audio conditions, accent, and domain vocabulary.

The three models and what they are trying to do

Parakeet TDT v3 is NVIDIA’s streaming-native ASR model. TDT stands for Token-and-Duration Transducer — an architecture built for real-time inference rather than offline transcription. The v3 version covers 25 languages at 0.6B parameters. Parakeet v2 is English-only at the same size and is consistently faster on CPU than v3 because it does not carry the multilingual overhead. Both deliver real-time factors well under 0.1 on modern CPU — meaning a 10-second clip transcribes in under 1 second without a GPU.
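
To make the real-time factor concrete: RTF is processing time divided by audio duration, so the expected transcription time is simply RTF times clip length. A minimal sketch of that arithmetic follows; the function names are illustrative and not taken from any library.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def expected_latency_ms(rtf: float, audio_seconds: float) -> float:
    """Transcription time in milliseconds implied by a given RTF."""
    return rtf * audio_seconds * 1000.0


# A 10-second clip at the RTF ceiling quoted above (0.1) takes about 1 second.
print(expected_latency_ms(0.1, 10.0))
# A 2-second voice command at the same RTF takes about 200 ms.
print(expected_latency_ms(0.1, 2.0))
```

An RTF below 1 means faster than real time, which is why the sub-0.1 figures matter most for short clips on CPU.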

Whisper Large-v3-Turbo is OpenAI’s compressed version of Large-v3, trained to run at roughly half the compute cost while recovering most of the accuracy. It is not Parakeet-fast on CPU, but it has two things Parakeet does not: 99 languages with strong performance across all of them, and four years of production deployment across hundreds of downstream apps, meaning its failure modes are well-documented. On Groq’s inference hardware, hosted Whisper Turbo delivers roughly 300ms end-to-end latency per request, faster than most self-hosted setups can manage on commodity GPUs.

Qwen3-ASR is Alibaba’s 2026 multilingual ASR model family, with the smallest variant at 0.5B parameters. It competes with Parakeet on size and Whisper on language breadth. Its headline capability is code-mixed language support — audio that mixes Hindi and English, or Spanish and English, in the same utterance. This is a genuine hard problem for Whisper (whose training data is mostly monolingual per segment) and where Qwen3-ASR’s training approach gives it an edge.

Nemotron ASR 3.5 — not in the original question but showing up in production threads by June 2026 — is NVIDIA’s streaming-first replacement for the Canary family. Designed for NVIDIA NIM deployment, it benchmarks faster than Parakeet and handles 400+ concurrent sessions per H100 according to early production reports. It has a documented cold-start artifact where the first 1-2 seconds of each session are less accurate, which matters more in a voice agent than in a batch transcription pipeline.

Accuracy in numbers. On the LibriSpeech clean English test set, Parakeet TDT v2 (English-only) achieves approximately 2.5-4% WER, Parakeet TDT v3 approximately 3-5% WER (the multilingual overhead adds a small accuracy cost for English), Whisper Large-v3-Turbo approximately 2-3% WER, and Qwen3-ASR 0.5B approximately 3-5% WER. Whisper Large-v3 (full, 1.5B parameters) achieves approximately 2% WER — the accuracy ceiling at a meaningful compute cost. These figures are from published NVIDIA and OpenAI benchmarks on the standard academic test set. The gap between models narrows significantly on production audio with accents, background noise, or short conversational phrases.
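
For readers who want to reproduce WER on their own transcripts, here is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. Lowercasing is the only text normalization applied here, which is an assumption; published benchmarks typically also normalize punctuation and numerals, so numbers from this sketch will not match them exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("set a timer for ten", "set a timer for tan"))
```

Note that on a five-word command a single wrong word is already 20% WER, which is part of why short-utterance results look so different from paragraph-length benchmarks.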

The problem with benchmark comparisons

Every time someone posts a benchmark comparing these models, the comment section reliably produces two observations:

“Parakeet wins on speed and WER on the LibriSpeech test set.”
“Yes, but my production audio is not LibriSpeech.”

Both are true. LibriSpeech clean test set is read speech from audiobooks — articulate, quiet, accent-neutral, full sentences. Voice agent audio is conversational speech, often from a phone or laptop microphone, often with background noise, often from non-native speakers, often 1-3 second command phrases rather than 10-30 second paragraphs.

These two audio types favor different models.

For short conversational commands (1-3 seconds, “What’s the weather” or “Set a timer for 10 minutes”), Parakeet’s architecture advantage is substantial. Whisper’s 30-second context window is designed around document-length audio, so on very short clips Whisper pays the full context overhead for a tiny amount of actual speech. Parakeet’s transducer architecture emits tokens as the audio arrives and does not buffer to 30 seconds. The practical result is what one production deployer described as a “huge jump in WER” (in context, a large improvement: the error rate dropped) when switching from Whisper Turbo to Parakeet for 1-2 second command phrases.
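
The context overhead is easy to quantify. The sketch below assumes the reference Whisper behaviour of padding every input to the full 30-second window; some runtimes may handle short inputs differently, so treat it as an illustration of the ratio rather than a latency prediction.

```python
WHISPER_WINDOW_SECONDS = 30.0  # fixed input window of the Whisper encoder


def speech_fraction(clip_seconds: float,
                    window_seconds: float = WHISPER_WINDOW_SECONDS) -> float:
    """Share of a fixed-size window that is occupied by real audio."""
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    # Clips longer than the window are processed in chunks, so cap at 1.0.
    return min(clip_seconds, window_seconds) / window_seconds


for clip in (1.0, 2.0, 3.0, 30.0):
    print(f"{clip:>4.0f}s clip fills {speech_fraction(clip):.1%} of the window")
```

For the 1-3 second commands discussed above, at most a tenth of the window holds speech; the rest is padding that the encoder still has to process.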

For accented speech and domain-specific vocabulary, the direction reverses. Whisper’s training corpus is enormously larger and more diverse. It has seen Indian English, British English, Australian English, German-accented English, and hundreds of other variants in enormous quantity. Parakeet v3 is strong on accented English but narrower in training data. When the vocabulary includes company-specific terminology, product names, technical acronyms, and other domain words that appear rarely in training data, Whisper’s larger context window and richer training distribution consistently win.

This is why the same person can truthfully say “I benchmarked Parakeet and it’s better” and “in my production agent Whisper is better” — they are optimizing for different audio distributions.
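
One way to see this split in your own data is to score each model per audio bucket instead of reporting a single number. The sketch below uses exact-match sentence error rate, which is a reasonable metric for short commands; the bucket names, model names, and sample rows are made up purely to show the input shape and are not measured results.

```python
from collections import defaultdict


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing transcripts."""
    return " ".join(text.lower().split())


def sentence_error_rate_by_bucket(results):
    """results: iterable of (model, bucket, reference, hypothesis) tuples.

    Returns {(model, bucket): fraction of clips that are not an exact match}.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for model, bucket, reference, hypothesis in results:
        key = (model, bucket)
        totals[key] += 1
        if _normalize(reference) != _normalize(hypothesis):
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Hypothetical rows, for illustration only.
sample = [
    ("model_a", "short_command", "set a timer", "set a timer"),
    ("model_a", "accented_domain", "open the kubectl logs", "open the cube cuddle logs"),
    ("model_b", "short_command", "set a timer", "set the timer"),
    ("model_b", "accented_domain", "open the kubectl logs", "open the kubectl logs"),
]
for key, rate in sorted(sentence_error_rate_by_bucket(sample).items()):
    print(key, rate)
```

A model can win one bucket and lose another, and a single aggregate score hides exactly that.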

What production deployments actually look like

A pattern emerges across the production deployment threads:

Groq-hosted Whisper Turbo for low-to-medium concurrency. At $0.04 per audio hour, Groq’s hosted endpoint is hard to beat for voice agents under a few hundred concurrent sessions. The 300ms round-trip latency is competitive with self-hosted GPU inference on commodity hardware, and the developer experience removes the GPU provisioning problem entirely. For a team building a product rather than operating GPU infrastructure, this is the practical default in 2026.

Self-hosted Parakeet for edge and CPU-constrained deployments. If you are running inference on edge devices (NVIDIA Jetson, consumer laptops, mobile SoCs), Parakeet v2 (English-only INT8) at ~600MB RAM is the current best option. Real-time factor under 0.1 means a 2-second voice command transcribes before the user has finished processing their own thought. Nothing else at this model size matches it.

Faster-Whisper (BF16) for self-hosted GPU with Whisper accuracy. Faster-whisper is a CTranslate2-based reimplementation of Whisper that runs up to 4x faster than the original openai/whisper implementation on the same hardware. It, rather than whisper.cpp, is the production path that kept coming up for teams that need Whisper quality with self-hosted latency. Running in BF16 precision on an A100 or H100 gets to sub-second latency for short clips while keeping accuracy close to the FP16 baseline. The RedHatAI FP8-dynamic Whisper quantization is an active experiment for pushing further.

Fine-tuned domain models for specialized vocabulary. The consistent advice from engineers who have shipped production voice agents: pick any of the three base models, collect 500-2000 examples of your actual production audio (with your specific accents, terminology, and noise profile), fine-tune, quantize to INT8 with ONNX, and deploy. The gains from fine-tuning on domain data are larger than the gains from switching base models. All three models handle general conversations well out of the box. Technical words, company-specific terminology, and domain acronyms are where none of them are reliable without fine-tuning — the gap between a generic model and a fine-tuned one on specialized vocabulary is substantially larger than the gap between Parakeet and Whisper on the same domain data.
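
Of the patterns above, the self-hosted faster-whisper path is the easiest to sketch in code. The example below assumes the `faster-whisper` package is installed and that your installed version recognizes the `large-v3-turbo` model name; the helper names and the INT8 fallback on CPU are my own choices, not a recommendation from the faster-whisper project.

```python
def choose_compute_type(device: str) -> str:
    """Pick a CTranslate2 compute type: BF16 on GPU, INT8 on CPU (an assumption)."""
    return "bfloat16" if device == "cuda" else "int8"


def load_model(device: str = "cuda", model_name: str = "large-v3-turbo"):
    """Load the model once at startup; reloading per request would dominate latency."""
    # Imported lazily so choose_compute_type() works without the package installed.
    from faster_whisper import WhisperModel

    return WhisperModel(model_name, device=device,
                        compute_type=choose_compute_type(device))


def transcribe_file(model, path: str) -> str:
    """Transcribe one audio file and return the joined segment text."""
    # transcribe() returns a lazy generator of segments plus run metadata;
    # decoding only happens while the generator is consumed.
    segments, _info = model.transcribe(path, beam_size=5)
    return " ".join(segment.text.strip() for segment in segments)
```

Typical use would be `model = load_model()` once, then `transcribe_file(model, "clip.wav")` per request, where `clip.wav` is a placeholder path.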

The latency math

For a real-time voice agent with an STT → LLM → TTS pipeline, the STT step needs to contribute well under 500ms of perceived latency to feel responsive. Here is what the realistic numbers look like:

Parakeet TDT v3 on CPU (modern laptop, 1-3 second clip): 50-150ms. This is fast enough that the bottleneck shifts entirely to your LLM inference. If you are running a small local LLM and your TTS is fast, Parakeet on CPU can produce a fully local pipeline that feels responsive.

Whisper Turbo via Groq (1-3 second clip): 200-350ms including network round-trip. Fast enough for most voice agents. The network latency is the variable — if your users are geographically distant from Groq’s data centers, this can stretch.

Self-hosted Whisper Turbo (L40S or H100 via cloud): 150-400ms depending on whether the model is warm, batch size, and clip length. The GPU cold-start problem — model loading latency — matters if you are spinning inference instances up and down dynamically.

Qwen3-ASR 0.5B on CPU (1-3 second clip): 300-800ms depending on hardware. Slower than Parakeet on CPU, faster than large Whisper, competitive with Whisper Turbo on decent hardware.

Nemotron ASR 3.5 via NIM (H100): sub-100ms at scale, with the cold-start artifact in the first 1-2 seconds of each session. The 400+ concurrent sessions per H100 figure is what makes it attractive for high-volume deployments.
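
The ranges above can be checked against the 500ms STT ceiling with a few lines of arithmetic. This is a sketch that treats 500ms as a hard budget and looks only at the worst case of each range; the ranges are copied from this section, and the function name is illustrative.

```python
STT_BUDGET_MS = 500  # ceiling for the STT stage, from the text above

# (low, high) STT latency ranges in milliseconds, copied from this section.
STT_RANGES_MS = {
    "Parakeet TDT v3 (CPU)": (50, 150),
    "Whisper Turbo via Groq": (200, 350),
    "Whisper Turbo self-hosted GPU": (150, 400),
    "Qwen3-ASR 0.5B (CPU)": (300, 800),
}


def headroom_ms(stt_range, budget_ms=STT_BUDGET_MS):
    """Worst-case slack left in the STT budget; negative means over budget."""
    _low, high = stt_range
    return budget_ms - high


for name, stt_range in STT_RANGES_MS.items():
    slack = headroom_ms(stt_range)
    status = "within" if slack >= 0 else "over"
    print(f"{name}: worst case {stt_range[1]} ms, {status} budget by {abs(slack)} ms")
```

Only the Qwen3-ASR CPU range exceeds the ceiling at its slow end, which is why the hardware caveat on that line matters.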

The multilingual and accent question

The question of how these models handle non-neutral English is the most consequential for production deployments and the least well-served by published benchmarks.

Indian-accent English

Indian-accent English is the highest-volume non-American English accent on the internet and the explicit focus of several production deployment threads. The production reports in 2026 consistently lean toward Whisper. The combination of accent, pace variation, and mixed technical/conversational vocabulary (code identifiers, company product names, Hinglish code-switching) is where Whisper’s training breadth shows up. Several teams report moving from faster-whisper (for latency) to Groq-hosted Whisper (for Indian-accent accuracy) and accepting the hosted dependency as the right trade.

Code-mixed speech (Hinglish, Spanglish)

Qwen3-ASR is specifically interesting for utterances that genuinely switch mid-sentence between languages. Whisper handles code-switching less reliably because its training segments are mostly monolingual. Parakeet v3’s 25-language coverage improves on v2 but is not designed for within-utterance switching. Qwen3-ASR was trained on code-mixed data and handles it structurally better. For a voice agent where users naturally mix Hindi and English in the same phrase, Qwen3-ASR is worth benchmarking on your actual audio before committing to Whisper.

European-accent English

For German-, French-, Spanish-, and Dutch-accented English with technical vocabulary, Whisper’s training-data advantage holds. All three models handle clean European accents well at the phoneme level; the divergence shows up in domain vocabulary: company names, product identifiers, and technical acronyms that appear rarely in any model’s training data.

When to use each model

Use Parakeet TDT v3 when:

Your voice agent handles short commands (under 3 seconds) from English speakers
You are deploying on CPU or edge hardware where latency budget is tight
Your audio is relatively clean and accent variation is limited
If you need English only, use v2 instead of v3 for maximum CPU speed

Use Whisper Turbo (via Groq or self-hosted) when:

Your users have accented English or you need multiple languages
Your vocabulary includes domain-specific terms, acronyms, or proper nouns
You want a proven model with thousands of production deployments’ worth of known failure modes
Your concurrency is low to medium and managed hosting is acceptable

Use Qwen3-ASR when:

You are dealing with code-mixed languages (Hinglish, Spanglish, etc.)
You want competitive multilingual coverage at a small model size
You are willing to accept a less community-tested model in exchange for the multilingual edge

Consider Nemotron ASR 3.5 when:

You are running NVIDIA GPU infrastructure and want NIM deployment
High concurrency (hundreds of sessions per GPU) is a hard requirement
You can work around or detect the cold-start artifact
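
The four lists above can be condensed into a first-pass decision helper. This is a sketch only: the field names are hypothetical, and the order in which the checks run is my own judgment call rather than something the lists prescribe. It gives you a starting candidate to benchmark, not a verdict.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    code_mixed: bool = False                   # users switch languages mid-utterance
    high_concurrency_nvidia_gpu: bool = False  # hundreds of sessions per GPU, NIM available
    accented_or_domain_vocab: bool = False     # accents, acronyms, proper nouns
    multilingual: bool = False                 # more than one language required


def suggest_model(workload: Workload) -> str:
    """Return the first model to benchmark for this workload."""
    if workload.code_mixed:
        return "Qwen3-ASR"
    if workload.high_concurrency_nvidia_gpu:
        return "Nemotron ASR 3.5"
    if workload.accented_or_domain_vocab or workload.multilingual:
        return "Whisper Large-v3-Turbo"
    # Default case: short, clean English commands on CPU or edge hardware.
    return "Parakeet TDT v3"


print(suggest_model(Workload(accented_or_domain_vocab=True)))
print(suggest_model(Workload()))
```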

The fine-tuning decision

Every practical guide on production ASR eventually arrives at the same conclusion: if accuracy on your specific audio is the constraint, fine-tuning beats model selection. This is not a caveat — it is a structural fact about how these models work.

The base models are trained to generalize across all audio types. Your production audio is not all audio types — it is your users, in your use case, with your vocabulary. The information in your production audio that distinguishes your domain from the general case is not in the base model, because it could not be — your product did not exist when the model was trained.

Collecting 500-2000 examples of your actual production audio, labeling them (or using an existing model for pseudo-labeling and then correcting errors), and fine-tuning for 1-10 hours of compute on a single A100 is within reach for any team with an engineering budget. The accuracy gains from fine-tuning on 1000 domain examples consistently outperform switching from one frontier model to another.
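
The pseudo-labeling route usually comes down to deciding which machine transcripts a human should check first. A minimal sketch of that triage step follows, assuming your labeling model exposes some per-clip confidence score between 0 and 1; the `confidence` field, the 0.9 threshold, and the sample clips are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PseudoLabel:
    clip_id: str
    text: str
    confidence: float  # hypothetical 0-1 score from the model that produced the label


def split_for_review(labels, threshold=0.9):
    """Return (auto_accepted, needs_review); review queue is lowest confidence first."""
    accepted = [label for label in labels if label.confidence >= threshold]
    review = sorted(
        (label for label in labels if label.confidence < threshold),
        key=lambda label: label.confidence,
    )
    return accepted, review


# Made-up clips, for illustration only.
labels = [
    PseudoLabel("clip-001", "restart the ingest worker", 0.97),
    PseudoLabel("clip-002", "open the cube cuddle logs", 0.41),
    PseudoLabel("clip-003", "set a timer for ten minutes", 0.88),
]
accepted, review = split_for_review(labels)
print([label.clip_id for label in accepted])
print([label.clip_id for label in review])
```

Domain terms tend to land in the review queue, which is the point: those are the examples the fine-tune most needs labeled correctly.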

INT8 quantization via ONNX after fine-tuning gets you deployment efficiency without substantial accuracy loss on the target domain.

Where SnailText fits in this picture

SnailText is a desktop dictation app, not a voice agent infrastructure product, but it ships both Whisper and Parakeet TDT in the same binary — which means we run exactly this comparison on real consumer hardware at scale.

The practical lesson from that: for desktop dictation (5-30 second phrases, quiet home office, single language, English-dominant), Parakeet TDT v3 and Whisper Medium land at similar accuracy for most users. Parakeet is faster on CPU. Larger Whisper models noticeably improve accuracy for accented speech. The right model depends on the hardware in front of you, not on a benchmark run on someone else’s server.

You can try both locally on Mac or Windows with the free tier — Whisper Tiny and Base are included, Parakeet and larger Whisper models are in Pro. The difference between models is something you feel immediately on your own voice, which is the only benchmark that ultimately matters for your specific users.

The production voice agent question is harder because the audio diversity is wider and the latency constraints are stricter. But the underlying lesson is the same: the model that wins on your audio is the one to ship, and you find that out by testing on your audio, not by reading a benchmark.

SnailText is offline voice dictation for Mac and Windows — local, private, free to start.


Common questions

Which ASR model is best for real-time voice agents in 2026?

Parakeet TDT v3 wins on latency for English voice agents — real-time factor under 0.1x on CPU means your pipeline waits milliseconds, not seconds. But for agents that deal with accented speakers, technical domain vocabulary, or multiple languages, Whisper Large-v3-Turbo via Groq is harder to beat: 300ms API latency with Whisper-level quality and no GPU infra to manage. For Indian-accent English specifically, the production consensus in 2026 leans toward Whisper for accuracy and Groq for the hosted latency.

Is Parakeet TDT v3 actually better than Whisper in production?

On benchmarks with clean, accent-neutral English and short 1-2 second utterances, yes. In production with accented speech, technical vocabulary (company names, acronyms, code identifiers), or sentences longer than a few seconds, Whisper Large-v3-Turbo tends to win on quality. The benchmark gap between them narrows significantly when audio conditions degrade. Parakeet's real advantage is CPU latency: roughly 10x faster than Whisper Turbo on CPU for a short phrase. If you are self-hosting on CPU or need edge deployment, Parakeet's latency advantage is substantial.

How good is Qwen3-ASR compared to Whisper?

Qwen3-ASR is competitive with Whisper Turbo on clean audio and genuinely better on certain multilingual tasks, particularly code-mixed languages like Hinglish (Hindi-English mixing) where Whisper's training data thins out. The 0.5B model takes roughly 300-800ms on CPU for a 1-3 second clip, depending on hardware. The main limitation is that it is newer and has less community testing than Whisper. For a production deployment in mid-2026, Whisper is the safer choice unless you specifically need code-mixed multilingual support.

What is Nemotron ASR 3.5 and should I use it?

Nemotron ASR 3.5 is NVIDIA's streaming-first ASR model released in mid-2026. It is designed for NVIDIA NIM deployment and can handle 400+ concurrent sessions per H100. Early production reports show it matches or beats Parakeet on speed with better streaming behavior, but has a 'cold start' problem where the first 1-2 seconds of each session are less accurate. For high-concurrency GPU deployments it is promising; for CPU edge deployments or low-concurrency setups, Parakeet v2/v3 is more proven.

Should I fine-tune ASR on my own data?

Yes, if your use case involves specialized vocabulary such as medical terms, company names, product identifiers, or technical acronyms. All three models (Parakeet, Whisper, Qwen3-ASR) can be fine-tuned and then converted to ONNX INT8 for deployment. In practice, fine-tuning Whisper on domain-specific data consistently outperforms a larger generic model on the target domain. The cost is the dataset and training compute. If you are building a production voice agent for a specific vertical, budgeting for 500-2000 example recordings and a fine-tune cycle is usually worth it.

Groq vs self-hosted Whisper — which is cheaper?

Groq's hosted Whisper Turbo runs at roughly $0.04 per hour of audio (as of mid-2026). Self-hosted Whisper on an A100 at cloud rates runs $2-4/hour of GPU time; at roughly 400x real-time throughput (RTF ≈ 0.0025) a fully utilized GPU processes about 400 hours of audio per GPU-hour, making it ~$0.005-0.01 per audio hour at scale. Groq is cheaper at low volume (under a few hundred hours per month), when a self-hosted GPU would sit mostly idle; self-hosting wins at scale. The break-even depends on your GPU pricing and utilization. For a voice agent with <100 concurrent sessions, Groq's developer experience usually wins: 300ms latency, no infra, no GPU procurement.
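
The break-even can be worked out directly from the figures in this answer. The sketch below assumes you pay for the GPU every hour whether or not it is busy, and it ignores engineering and operations cost; the function names are illustrative.

```python
GROQ_PRICE_PER_AUDIO_HOUR = 0.04  # USD per audio hour, from the text
SPEEDUP = 400                     # audio hours per fully busy GPU-hour, from the text


def self_hosted_cost_per_audio_hour(gpu_price_per_hour: float,
                                    utilization: float,
                                    speedup: float = SPEEDUP) -> float:
    """Effective USD per audio hour when the GPU is busy `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_price_per_hour / (speedup * utilization)


def break_even_utilization(gpu_price_per_hour: float,
                           hosted_price: float = GROQ_PRICE_PER_AUDIO_HOUR,
                           speedup: float = SPEEDUP) -> float:
    """GPU utilization above which self-hosting undercuts the hosted price."""
    return gpu_price_per_hour / (speedup * hosted_price)


for gpu_price in (2.0, 4.0):
    full = self_hosted_cost_per_audio_hour(gpu_price, utilization=1.0)
    edge = break_even_utilization(gpu_price)
    print(f"${gpu_price:.0f}/GPU-hour: ${full:.4f}/audio hour at full load, "
          f"break-even at {edge:.1%} utilization")
```

Under these assumptions, self-hosting only beats the hosted price once the GPU is busy roughly an eighth to a quarter of the time, which low-volume deployments rarely reach.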

Can Parakeet TDT v3 run on edge hardware?

Yes. INT8 quantized Parakeet v2 (English-only) fits in approximately 600MB RAM and runs in real time on a modern ARM CPU. That makes it viable for edge inference on devices like NVIDIA Jetson, Apple M-series chips, or well-specced mobile SoCs. Parakeet v3 adds 24 additional languages at similar model size. For pure edge CPU inference with English audio, Parakeet v2 is currently the best production-ready open model.
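
The ~600MB figure lines up with simple weight-size arithmetic: 0.6B parameters at one byte each. The sketch below counts weights only and ignores activations, buffers, and runtime overhead, so real memory use will be somewhat higher.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_mb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Parakeet-sized model (0.6B parameters) at different precisions.
for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weight_memory_mb(0.6e9, dtype):.0f} MB")
```

The same arithmetic shows why INT8 matters on edge devices: the FP32 weights alone would need four times as much memory.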

What are the WER numbers for Parakeet TDT v3 vs Whisper Large-v3-Turbo?

On the LibriSpeech clean English test set — the standard academic benchmark — Parakeet TDT v3 achieves approximately 3-5% WER, Parakeet v2 (English-only) approximately 2.5-4% WER, and Whisper Large-v3-Turbo approximately 2-3% WER. Whisper Large-v3 (full, 1.5B parameters) achieves approximately 2% WER. These are clean-audio numbers. In production with accented speakers, background noise, or short phrases, the gap between Parakeet and Whisper narrows and sometimes reverses depending on the audio type. Do not treat the LibriSpeech benchmark as the expected WER on your actual audio.

What is the difference between Parakeet TDT v2 and v3?

Parakeet v2 is English-only at 0.6B parameters — it achieves slightly lower WER on English audio than v3 because it does not carry multilingual overhead. Parakeet v3 covers 25 languages at the same model size. For pure English voice agents, v2 is the faster and more accurate option. For multilingual deployments or agents that handle Spanish, French, German, or other Parakeet-supported languages, v3 is required. Both run at RTF under 0.1 on modern CPU.

Parakeet TDT v3 vs Nemotron ASR 3.5 — which should I use?

Nemotron ASR 3.5 is designed for high-concurrency NVIDIA NIM deployments (400+ concurrent sessions per H100) and achieves sub-100ms latency at scale. Parakeet TDT v3 is the better choice for CPU-constrained or edge deployments, lower-concurrency setups, or teams that want a more widely-tested model without GPU infrastructure. Nemotron's main production caveat is the cold-start artifact — the first 1-2 seconds of each session are less accurate — which Parakeet does not have. If you are running NVIDIA GPU infrastructure at high concurrency, Nemotron is worth benchmarking. Otherwise, Parakeet v2 or v3 is the simpler starting point.

Want Whisper and Parakeet in a desktop app without the infra?

SnailText ships both engines locally on Mac and Windows. You pick the model — Tiny through Large-v3, or Parakeet TDT v3. GPU-accelerated with Vulkan (Windows) or Metal (Mac). No API key, no cloud call during dictation.
