For a production voice agent in 2026: use Parakeet TDT v3 if your audio is short English commands on CPU, use Whisper Large-v3-Turbo (via Groq or self-hosted) if your users have accented speech or domain vocabulary, and use Qwen3-ASR if you need to handle code-mixed multilingual speech. The full answer depends on what your audio actually looks like, and that is where the benchmark numbers stop being useful.
The benchmarks say Parakeet. The production reports say “it depends.” Both are right. Here is what the dependency actually looks like.
WER figures in the table above are approximate values from the LibriSpeech clean test set per published NVIDIA and OpenAI benchmarks. Production WER varies significantly with audio conditions, accent, and domain vocabulary.
The three models and what they are trying to do
Parakeet TDT v3 is NVIDIA’s streaming-native ASR model. TDT stands for Token-and-Duration Transducer — an architecture built for real-time inference rather than offline transcription. The v3 version covers 25 languages at 0.6B parameters. Parakeet v2 is English-only at the same size and is consistently faster on CPU than v3 because it does not carry the multilingual overhead. Both deliver real-time factors well under 0.1 on modern CPU — meaning a 10-second clip transcribes in under 1 second without a GPU.
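The real-time factor mentioned above is simply processing time divided by audio duration. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of any ASR library.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio clip."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def max_processing_seconds(audio_seconds: float, target_rtf: float) -> float:
    """Longest transcription time that still meets a target RTF."""
    return audio_seconds * target_rtf


# A 10-second clip transcribed in 0.8 s lands under the 0.1 RTF mark.
print(real_time_factor(0.8, 10.0))
# At an RTF of 0.1, a 10-second clip has to finish within 1 second.
print(max_processing_seconds(10.0, 0.1))
```

An RTF below 1.0 means the model keeps up with live audio; the lower the value, the more headroom is left for the rest of the pipeline.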
Whisper Large-v3-Turbo is OpenAI’s compressed version of Large-v3, built to run at roughly half the compute cost while recovering most of the accuracy. It is not Parakeet-fast on CPU, but it has two things Parakeet does not: coverage of 99 languages (strongest on the high-resource ones), and four years of production deployment of the Whisper family across hundreds of downstream apps, meaning its failure modes are well-documented. On Groq’s inference hardware, hosted Whisper Turbo delivers roughly 300ms end-to-end latency per request, which is faster than most self-hosted setups can manage on commodity GPUs.
Qwen3-ASR is Alibaba’s 2026 multilingual ASR model family, with the smallest variant at 0.5B parameters. It competes with Parakeet on size and Whisper on language breadth. Its headline capability is code-mixed language support — audio that mixes Hindi and English, or Spanish and English, in the same utterance. This is a genuine hard problem for Whisper (whose training data is mostly monolingual per segment) and where Qwen3-ASR’s training approach gives it an edge.
Nemotron ASR 3.5 — not in the original question but showing up in production threads by June 2026 — is NVIDIA’s streaming-first replacement for the Canary family. Designed for NVIDIA NIM deployment, it benchmarks faster than Parakeet and handles 400+ concurrent sessions per H100 according to early production reports. It has a documented cold-start artifact where the first 1-2 seconds of each session are less accurate, which matters more in a voice agent than in a batch transcription pipeline.
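To make the concurrency figure concrete, here is a rough capacity sketch. It takes the 400-sessions-per-H100 number from the early production reports at face value; the `headroom` parameter is a hypothetical policy knob, not something those reports specify.

```python
import math

SESSIONS_PER_H100 = 400  # "400+" per the early production reports cited above


def h100s_needed(concurrent_sessions: int,
                 sessions_per_gpu: int = SESSIONS_PER_H100,
                 headroom: float = 0.0) -> int:
    """GPUs required for a peak session count.

    headroom=0.25 keeps 25% of each GPU free for traffic spikes
    (an assumed safety margin, not a measured requirement).
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    if concurrent_sessions <= 0:
        return 0
    usable_per_gpu = sessions_per_gpu * (1.0 - headroom)
    return math.ceil(concurrent_sessions / usable_per_gpu)


print(h100s_needed(1000))                 # no spare capacity
print(h100s_needed(1000, headroom=0.25))  # with a 25% margin
```

Treat the result as a starting point for load testing rather than a provisioning guarantee, since the per-GPU figure comes from early reports.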
Accuracy in numbers. On the LibriSpeech clean English test set, approximate WER is:
- Parakeet TDT v2 (English-only): 2.5-4%
- Parakeet TDT v3: 3-5% (the multilingual overhead adds a small accuracy cost for English)
- Whisper Large-v3-Turbo: 2-3%
- Qwen3-ASR 0.5B: 3-5%
- Whisper Large-v3 (full, 1.5B parameters): about 2%, the accuracy ceiling, at a meaningful compute cost
These figures are from published NVIDIA and OpenAI benchmarks on the standard academic test set. The gap between models narrows significantly on production audio with accents, background noise, or short conversational phrases.
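WER, the metric behind the LibriSpeech figures above, is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python version for checking a handful of your own clips; it lowercases and splits on whitespace, which is a simplifying assumption (published benchmarks usually apply heavier text normalization).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One wrong word in a six-word command.
print(word_error_rate("set a timer for ten minutes",
                      "set a timer for two minutes"))
```

Note how the denominator shapes the number: a single wrong word in a three-word command is a 33% WER, while the same single error in a thirty-word paragraph is about 3%. That is one reason averages over long read speech say little about short command phrases.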
The problem with benchmark comparisons
Every time someone posts a benchmark comparing these models, the comment section reliably produces two observations:
- “Parakeet wins on speed and WER on the LibriSpeech test set.”
- “Yes, but my production audio is not LibriSpeech.”
Both are true. LibriSpeech clean test set is read speech from audiobooks — articulate, quiet, accent-neutral, full sentences. Voice agent audio is conversational speech, often from a phone or laptop microphone, often with background noise, often from non-native speakers, often 1-3 second command phrases rather than 10-30 second paragraphs.
These two audio types favor different models.
For short conversational commands (1-3 seconds, “What’s the weather” or “Set a timer for 10 minutes”), Parakeet’s architecture advantage is substantial. Whisper’s encoder works on a fixed 30-second window, so a shorter clip is padded out to 30 seconds before it is processed. On very short clips, Whisper pays the full window overhead for a tiny amount of actual speech. Parakeet’s transducer architecture emits tokens as audio frames arrive and does not buffer to 30 seconds. The practical result is what one production deployer described as a “huge jump in WER” (meaning a large improvement, that is, a lower error rate) when switching from Whisper Turbo to Parakeet for 1-2 second command phrases.
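A quick way to see the overhead: compute how much of Whisper's 30-second window a clip actually fills. This is only an illustration of the padding ratio, not a cost model, since real inference time is not strictly proportional to window length.

```python
WHISPER_WINDOW_SECONDS = 30.0  # fixed input window of the Whisper encoder


def speech_fraction(clip_seconds: float,
                    window_seconds: float = WHISPER_WINDOW_SECONDS) -> float:
    """Fraction of the fixed window holding real audio; the rest is padding."""
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    return min(clip_seconds, window_seconds) / window_seconds


for clip in (1.0, 3.0, 30.0):
    print(f"{clip:.0f}s clip fills {speech_fraction(clip):.0%} of the window")
```

For a 3-second command, roughly nine tenths of what the encoder processes is padding.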
For accented speech and domain-specific vocabulary, the direction reverses. Whisper’s training corpus is far larger and more diverse. It has seen Indian English, British English, Australian English, German-accented English, and many other variants in large quantity. Parakeet v3 is strong on accented English but was trained on narrower data. When the vocabulary includes company-specific terminology, product names, technical acronyms, and other domain words that appear rarely in training data, Whisper’s larger context window and richer training distribution consistently win.
This is why the same person can truthfully say “I benchmarked Parakeet and it’s better” and “in my production agent Whisper is better” — they are optimizing for different audio distributions.
What production deployments actually look like
A pattern emerges across the production deployment threads:
Groq-hosted Whisper Turbo for low-to-medium concurrency. At $0.04 per audio hour, Groq’s hosted endpoint is hard to beat for voice agents under a few hundred concurrent sessions. The 300ms round-trip latency is competitive with self-hosted GPU inference on commodity hardware, and the developer experience removes the GPU provisioning problem entirely. For a team building a product rather than operating GPU infrastructure, this is the practical default in 2026.
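The per-hour price makes the hosted cost easy to estimate. The sketch below uses the $0.04 per audio hour figure from the text; the traffic numbers in the example are hypothetical, so substitute your own.

```python
GROQ_PRICE_PER_AUDIO_HOUR = 0.04  # USD, the figure quoted above


def monthly_stt_cost(sessions_per_day: int,
                     audio_seconds_per_session: float,
                     price_per_hour: float = GROQ_PRICE_PER_AUDIO_HOUR,
                     days: int = 30) -> float:
    """Estimated monthly transcription spend in USD."""
    audio_hours = sessions_per_day * audio_seconds_per_session * days / 3600
    return audio_hours * price_per_hour


# Hypothetical load: 10,000 sessions a day, 60 seconds of user audio each.
print(f"${monthly_stt_cost(10_000, 60):.2f} per month")
```

This counts only billed audio time; any minimum billing increment per request or other plan details are not modeled here and should be checked against the current pricing page.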
Self-hosted Parakeet for edge and CPU-constrained deployments. If you are running inference on edge devices (NVIDIA Jetson, consumer laptops, mobile SoCs), Parakeet v2 (English-only INT8) at ~600MB RAM is the current best option. Real-time factor under 0.1 means a 2-second voice command transcribes before the user has finished processing their own thought. Nothing else at this model size matches it.
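The ~600MB figure is consistent with simple weight arithmetic: 0.6B parameters at one byte each. A sketch of that estimate follows, covering weights only; runtime buffers and activations add more on top, so take it as a lower bound.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_mb(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights alone, in megabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


PARAKEET_PARAMS = 0.6e9  # parameter count stated earlier in the article

for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weight_memory_mb(PARAKEET_PARAMS, dtype):.0f} MB")
```

The same arithmetic shows why INT8 matters on edge devices: it cuts the weight footprint to a quarter of FP32 and half of FP16.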
Faster-Whisper (BF16) for self-hosted GPU with Whisper accuracy. Faster-whisper is a CTranslate2-based reimplementation of Whisper that runs up to 4x faster than the original OpenAI implementation on the same hardware. It, rather than whisper.cpp, is the production path that kept coming up for teams that need Whisper quality with self-hosted latency. Running in BF16 on an A100 or H100 gets to sub-second latency for short clips while keeping accuracy close to the FP16 baseline (BF16 is a reduced-precision floating-point format rather than integer quantization). The RedHatAI FP8-dynamic Whisper quantization is an active experiment for pushing further.
Fine-tuned domain models for specialized vocabulary. The consistent advice from engineers who have shipped production voice agents: pick any of the three base models, collect 500-2000 examples of your actual production audio (with your specific accents, terminology, and noise profile), fine-tune, quantize to INT8 with ONNX, and deploy. The gains from fine-tuning on domain data are larger than the gains from switching base models. All three models handle general conversations well out of the box. Technical words, company-specific terminology, and domain acronyms are where none of them are reliable without fine-tuning — the gap between a generic model and a fine-tuned one on specialized vocabulary is substantially larger than the gap between Parakeet and Whisper on the same domain data.
The latency math
For a real-time voice agent with an STT → LLM → TTS pipeline, the STT step needs to contribute well under 500ms of perceived latency to feel responsive. Here is what the realistic numbers look like:
Parakeet TDT v3 on CPU (modern laptop, 1-3 second clip): 50-150ms. This is fast enough that the bottleneck shifts entirely to your LLM inference. If you are running a small local LLM and your TTS is fast, Parakeet on CPU can produce a fully local pipeline that feels responsive.
Whisper Turbo via Groq (1-3 second clip): 200-350ms including network round-trip. Fast enough for most voice agents. The network latency is the variable — if your users are geographically distant from Groq’s data centers, this can stretch.
Self-hosted Whisper Turbo (L40S or H100 via cloud): 150-400ms depending on whether the model is warm, batch size, and clip length. The GPU cold-start problem — model loading latency — matters if you are spinning inference instances up and down dynamically.
Qwen3-ASR 0.5B on CPU (1-3 second clip): 300-800ms depending on hardware. Slower than Parakeet on CPU, faster than large Whisper, competitive with Whisper Turbo on decent hardware.
Nemotron ASR 3.5 via NIM (H100): sub-100ms at scale, with the cold-start artifact in the first 1-2 seconds of each session. The 400+ concurrent sessions per H100 figure is what makes it attractive for high-volume deployments.
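The ranges above can be checked against a latency budget directly. The sketch below copies the quoted ranges for 1-3 second clips and filters on the slow end of each range, which is a deliberately conservative assumption. Nemotron is left out because the text gives only "sub-100ms" rather than a range.

```python
# (best_ms, worst_ms) for a 1-3 second clip, copied from the figures above.
STT_LATENCY_MS = {
    "Parakeet TDT v3 (CPU)": (50, 150),
    "Whisper Turbo (Groq)": (200, 350),
    "Whisper Turbo (self-hosted GPU)": (150, 400),
    "Qwen3-ASR 0.5B (CPU)": (300, 800),
}


def options_within(budget_ms: int) -> list[str]:
    """Options whose worst-case latency stays under the budget, fastest first."""
    fitting = [
        (worst, name)
        for name, (_best, worst) in STT_LATENCY_MS.items()
        if worst < budget_ms
    ]
    return [name for _worst, name in sorted(fitting)]


print(options_within(500))  # the outer limit named in the text
print(options_within(200))  # a stricter reading of "well under 500ms"
```

Replace the table with measurements from your own hardware and network path before relying on the result; the quoted ranges are indicative only.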
The multilingual and accent question
The question of how these models handle non-neutral English is the most consequential for production deployments and the least well-served by published benchmarks.
Indian-accent English
Indian-accent English is the highest-volume non-American English accent on the internet and the explicit focus of several production deployment threads. The production reports in 2026 consistently lean toward Whisper. The combination of accent, pace variation, and mixed technical/conversational vocabulary (code identifiers, company product names, Hinglish code-switching) is where Whisper’s training breadth shows up. Several teams report moving from faster-whisper (for latency) to Groq-hosted Whisper (for Indian-accent accuracy) and accepting the hosted dependency as the right trade.
Code-mixed speech (Hinglish, Spanglish)
Qwen3-ASR is specifically interesting for utterances that genuinely switch mid-sentence between languages. Whisper handles code-switching less reliably because its training segments are mostly monolingual. Parakeet v3’s 25-language coverage improves on v2 but is not designed for within-utterance switching. Qwen3-ASR was trained on code-mixed data and handles it structurally better. For a voice agent where users naturally mix Hindi and English in the same phrase, Qwen3-ASR is worth benchmarking on your actual audio before committing to Whisper.
European-accent English
For German, French, Spanish, and Dutch accents at a technical vocabulary level, Whisper’s training data dominance holds. All three models handle clean European accents well at the phoneme level; the divergence shows up in domain vocabulary — company names, product identifiers, and technical acronyms that appear rarely in any model’s training data.
When to use each model
Use Parakeet TDT v3 when:
- Your voice agent handles short commands (under 3 seconds) from English speakers
- You are deploying on CPU or edge hardware where latency budget is tight
- Your audio is relatively clean and accent variation is limited
- You need English only, in which case prefer the English-only v2 for maximum CPU speed
Use Whisper Turbo (via Groq or self-hosted) when:
- Your users have accented English or you need multiple languages
- Your vocabulary includes domain-specific terms, acronyms, or proper nouns
- You want a proven model whose failure modes are well documented across years of production deployments
- Your concurrency is low to medium and managed hosting is acceptable
Use Qwen3-ASR when:
- You are dealing with code-mixed languages (Hinglish, Spanglish, etc.)
- You want competitive multilingual coverage at a small model size
- You are willing to accept a less community-tested model in exchange for the multilingual edge
Consider Nemotron ASR 3.5 when:
- You are running NVIDIA GPU infrastructure and want NIM deployment
- High concurrency (hundreds of sessions per GPU) is a hard requirement
- You can work around or detect the cold-start artifact
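The checklists above can be condensed into a small routing function. This is a sketch only: the field names are made up for illustration, and the priority order (code-mixing first, then high-concurrency NVIDIA infrastructure, then accent and vocabulary, then CPU and short commands) is an assumption about how to break ties, not something the checklists prescribe. The fallback is Whisper Turbo, which the article treats as the practical default.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a voice agent's audio and infrastructure."""
    code_mixed: bool = False
    nvidia_gpu_high_concurrency: bool = False
    accented_or_multilingual: bool = False
    domain_vocabulary: bool = False
    cpu_or_edge: bool = False
    short_english_commands: bool = False
    english_only: bool = False


def suggest_model(w: Workload) -> str:
    """Return a first candidate to benchmark, not a final answer."""
    if w.code_mixed:
        return "Qwen3-ASR"
    if w.nvidia_gpu_high_concurrency:
        return "Nemotron ASR 3.5"
    if w.accented_or_multilingual or w.domain_vocabulary:
        return "Whisper Large-v3-Turbo"
    if w.cpu_or_edge or w.short_english_commands:
        # v2 is English-only and faster on CPU; v3 covers 25 languages.
        return "Parakeet TDT v2" if w.english_only else "Parakeet TDT v3"
    return "Whisper Large-v3-Turbo"


print(suggest_model(Workload(cpu_or_edge=True, short_english_commands=True,
                             english_only=True)))
print(suggest_model(Workload(code_mixed=True, cpu_or_edge=True)))
```

Whatever the function returns is only the model to test first on your own audio.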
The fine-tuning decision
Every practical guide on production ASR eventually arrives at the same conclusion: if accuracy on your specific audio is the constraint, fine-tuning beats model selection. This is not a caveat — it is a structural fact about how these models work.
The base models are trained to generalize across all audio types. Your production audio is not all audio types — it is your users, in your use case, with your vocabulary. The information in your production audio that distinguishes your domain from the general case is not in the base model, because it could not be — your product did not exist when the model was trained.
Collecting 500-2000 examples of your actual production audio, labeling them (or using an existing model for pseudo-labeling and then correcting errors), and fine-tuning for 1-10 hours of compute on a single A100 is within reach for any team with an engineering budget. The accuracy gains from fine-tuning on 1000 domain examples consistently outperform switching from one frontier model to another.
INT8 quantization via ONNX after fine-tuning gets you deployment efficiency without substantial accuracy loss on the target domain.
Where SnailText fits in this picture
SnailText is a desktop dictation app, not a voice agent infrastructure product, but it ships both Whisper and Parakeet TDT in the same binary — which means we run exactly this comparison on real consumer hardware at scale.
The practical lesson from that: for desktop dictation (5-30 second phrases, quiet home office, single language, English-dominant), Parakeet TDT v3 and Whisper Medium land at similar accuracy for most users. Parakeet is faster on CPU. Large Whisper models improve accuracy for accented speech noticeably. The right model depends on the hardware in front of you, not on a benchmark run on someone else’s server.
You can try both locally on Mac or Windows with the free tier — Whisper Tiny and Base are included, Parakeet and larger Whisper models are in Pro. The difference between models is something you feel immediately on your own voice, which is the only benchmark that ultimately matters for your specific users.
The production voice agent question is harder because the audio diversity is wider and the latency constraints are stricter. But the underlying lesson is the same: the model that wins on your audio is the one to ship, and you find that out by testing on your audio, not by reading a benchmark.