Before installing a local dictation app, most people ask the same question: do I need a gaming graphics card, or will my laptop handle this? The GPU anxiety is understandable — “AI” and “GPU” have become synonymous in the news. But running a finished model on your own machine to transcribe your own voice is a completely different task from training models in a data center, and the hardware requirements are an order of magnitude smaller.
The direct answer: NVIDIA’s Parakeet TDT model runs at RTF 0.033 on an ordinary Intel i7 CPU, about 30 times faster than real time, with no GPU required. Whisper small with int8 quantization runs about 7.6× real time on the same class of hardware (13 minutes of audio in 102 seconds). For everyday dictation, that is more than fast enough. Here is the full picture, with actual benchmark numbers.
Why people think they need a GPU
The association is misplaced but understandable. The GPUs you see in AI headlines are doing one of two things: training models from scratch on petabytes of data in a data center, or serving thousands of simultaneous users through a cloud API. Both tasks genuinely need racks of graphics cards.
Running a finished, quantized model on your own machine to transcribe a few seconds of your own voice is a completely different workload. You are doing inference on one person’s audio, a few seconds at a time. The compute requirement is orders of magnitude smaller.
The other source of confusion is 2022-era benchmarks of the original Python OpenAI Whisper on CPU. That implementation was genuinely slow — a user with a Ryzen 5 3600 reported 11 minutes to process 31 seconds of audio with the medium model. That benchmark gets cited constantly and is still accurate for Python Whisper, but it has nothing to do with the whisper.cpp implementation that dictation apps use today. whisper.cpp with quantized models on a modern CPU is a completely different experience.
What voice-to-text actually requires
The real requirements are modest:
- A modern CPU. Any processor from roughly the last decade with AVX instructions — which is almost every laptop and desktop sold since 2015.
- A few gigabytes of free RAM. This is the main constraint, and it is smaller than people expect (see the table below).
- Disk space for the model. A one-time download, typically 75 MB to 3 GB depending on the model you choose.
No graphics card appears anywhere in that list.
RAM requirements by model (Whisper figures per the whisper.cpp project):
| Model | Disk (GGML) | RAM in use |
|---|---|---|
| Whisper tiny | 75 MB | ~273 MB |
| Whisper base | 142 MB | ~388 MB |
| Whisper small | 466 MB | ~852 MB |
| Whisper medium | 1.5 GB | ~2.1 GB |
| Whisper large-v3 | 2.9 GB | ~3.9 GB |
| Parakeet TDT v3 | ~2.3 GB | ~1.5 GB |
A machine with 8 GB of total RAM and 4 GB free runs Whisper small or Parakeet comfortably. That is a fairly modest machine by 2026 standards.
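As a rough illustration of reading the table, the sketch below filters the models by how much free RAM a machine has. The RAM figures are copied from the table (with 1 GB treated as 1000 MB). The `models_that_fit` helper and its `headroom` factor are hypothetical, chosen only to put a number on "comfortably"; they are not taken from whisper.cpp or any app.

```python
# RAM in use per model, from the table above (1 GB treated as 1000 MB).
MODEL_RAM_MB = {
    "Whisper tiny": 273,
    "Whisper base": 388,
    "Whisper small": 852,
    "Whisper medium": 2100,
    "Whisper large-v3": 3900,
    "Parakeet TDT v3": 1500,
}


def models_that_fit(free_ram_mb: float, headroom: float = 2.0) -> list[str]:
    """Return the models whose RAM use is at most free_ram_mb / headroom.

    headroom is an illustrative safety factor (an assumption, not a
    measured requirement): 2.0 means a model may use half the free RAM.
    """
    limit = free_ram_mb / headroom
    return [name for name, ram in MODEL_RAM_MB.items() if ram <= limit]


# About 4 GB free, leaving half of it for everything else.
print(models_that_fit(4000))
# ['Whisper tiny', 'Whisper base', 'Whisper small', 'Parakeet TDT v3']
```

With the headroom set to 1.0, every model in the table fits in 4 GB of free RAM, but large-v3 would leave almost nothing for the rest of the system.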
The actual benchmark numbers
Here is how fast these models actually run on different hardware. RTF (real-time factor) is processing time divided by audio duration, so an RTF below 1.0 means faster than real time: the model transcribes audio faster than it plays. An RTF of 3.0 means 1 minute of audio takes 3 minutes to process.
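For readers who prefer to see the arithmetic, here is a minimal sketch of the RTF definition. The function names are illustrative and not from any library; the RTF values are the ones quoted in this article.

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return processing_seconds / audio_seconds


def speed_multiple(rtf_value: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf_value


def time_to_transcribe(audio_seconds: float, rtf_value: float) -> float:
    """Seconds needed to transcribe a clip of the given length."""
    return audio_seconds * rtf_value


# RTF 3.0: one minute of audio takes three minutes to process.
print(time_to_transcribe(60, 3.0))               # 180.0
# RTF 0.033: a 10-second phrase takes about a third of a second.
print(round(time_to_transcribe(10, 0.033), 2))   # 0.33
print(round(speed_multiple(0.033), 1))           # 30.3
```

For dictation, the useful figure is usually `time_to_transcribe` for a 5 to 15 second phrase, since that is the delay you actually wait through.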
The performance winner is not what most people expect. NVIDIA’s Parakeet TDT v3 on a modern Intel i7 CPU (i7-12700KF) achieves RTF 0.033 — roughly 30× faster than real time. For comparison, faster-whisper running the large-v2 model on an RTX 3070 Ti GPU achieves RTF ~0.076 — about 13× real time. Parakeet on CPU is faster than mid-range GPU Whisper.
The reason is architectural. Parakeet TDT is a Token-and-Duration Transducer: the encoder processes the entire clip in a single forward pass, and a lightweight decoder predicts each token together with its duration, which lets it skip ahead through the audio frames instead of visiting every one. Whisper uses a much heavier autoregressive Transformer decoder that generates tokens one at a time in a loop, which is inherently sequential and slower. For short audio clips this matters a lot.
Apple Silicon is a special case. Every Mac with an M-series chip (M1 through M4) automatically uses Metal GPU acceleration in whisper.cpp and in SnailText. An M1 Mac running Whisper large-v3 via Metal achieves approximately real-time (RTF ~1.0) — good enough for dictation without any discrete GPU. An M2 Pro with Metal runs large-v3 at 2.5× real time. MacBook users do not have a GPU problem.
Windows users without a discrete GPU have a more nuanced picture. A modern i7 (12th/13th gen) runs Whisper small at about 7–8× real time — comfortable for dictation where each phrase is 5–15 seconds. A mid-range Ryzen CPU runs large-v3 at RTF ~3.0 — too slow for interactive dictation. The right answer for Windows CPU-only is: Parakeet TDT (30× real time) or Whisper small (7–8× real time). Not large Whisper.
One more development worth knowing: whisper.cpp 1.8.3 added integrated GPU support via Vulkan, delivering roughly a 12× performance boost on AMD Radeon 680M and Intel Arc integrated graphics. If your Windows laptop has a modern AMD Ryzen (6000+) or Intel 12th-gen+, whisper.cpp can use that integrated GPU automatically — bringing latency down significantly even without a discrete card.
“CPU is too slow” depends entirely on the model
People say “local dictation is slow on CPU” as if the chip is the problem. The variable is the model.
Smaller Whisper models — tiny, base, small — run at or faster than real time on any 2020+ laptop CPU. The experience is fast. Large Whisper models — medium and large-v3 — are genuinely slow on CPU: several times slower than real time on typical hardware. That is the case people remember and report, and it gives CPU a bad name that the smaller models do not deserve.
Quantization makes a meaningful difference too. The int8 quantized version of a model runs significantly faster than the float32 version on CPU with negligible accuracy loss. faster-whisper benchmarks show the small int8 model completing 13 minutes of audio in 102 seconds on an i7-12700K; the float32 version takes 157 seconds for the same file. Choosing the quantized version of a model is free speed.
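The size of that gain can be worked out from the two timings quoted above. This is plain arithmetic on the figures in this article, not a new benchmark, and the variable names are illustrative.

```python
AUDIO_SECONDS = 13 * 60   # 13 minutes of audio
INT8_SECONDS = 102        # Whisper small, int8, i7-12700K
FLOAT32_SECONDS = 157     # Whisper small, float32, same file


def times_real_time(audio_seconds: float, processing_seconds: float) -> float:
    """Speed as a multiple of real time (the inverse of RTF)."""
    return audio_seconds / processing_seconds


int8_speed = times_real_time(AUDIO_SECONDS, INT8_SECONDS)
float32_speed = times_real_time(AUDIO_SECONDS, FLOAT32_SECONDS)
quantization_gain = FLOAT32_SECONDS / INT8_SECONDS

print(round(int8_speed, 1))         # 7.6
print(round(float32_speed, 1))      # 5.0
print(round(quantization_gain, 2))  # 1.54
```

In other words, int8 takes the same model from roughly 5× to between 7× and 8× real time on this CPU, a gain of about one and a half times.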
When a GPU genuinely earns its keep
To be fair to the other side: a dedicated GPU is a real improvement in specific situations.
Running large-v3 in real time on Windows. An RTX 3090 processes 1 minute of audio in about 11 seconds (RTF ~0.19). The same task on a Ryzen 7 CPU takes about 3 minutes. If large-v3 accuracy is your requirement and you are on a Windows machine rather than Apple Silicon, a discrete GPU is not optional: it is the only way to get there.
Batch transcribing long recordings. At that same rate of about 11 seconds per minute, 1 hour of audio on an RTX 3090 with large-v3 takes roughly 11 minutes. On a Ryzen 7 CPU the same job takes approximately 3 hours. If you regularly transcribe podcasts, lectures, or long meetings, a GPU saves real time across the week.
High-accuracy multilingual dictation. If you dictate in a language where smaller models have noticeably higher error rates, you may need medium or large — and that is where GPU matters.
For everyday dictation — short phrases, one person, stop-and-transcribe — you will not notice the GPU.
The four hardware situations
MacBook (any M-series chip): No GPU concern. Metal acceleration is automatic. M1 handles large-v3 at real time. M2 Pro handles it at 2.5× real time. Use SnailText, pick any model, and it works. The only decision is whether you want the speed of a smaller model or the accuracy ceiling of large.
Windows laptop, integrated graphics only (no discrete GPU): Use Parakeet TDT for English — 30× real time on CPU, no GPU needed. Or use Whisper small (7–8× real time). If your chip supports Vulkan (AMD Ryzen 6000+ or Intel 12th-gen+), whisper.cpp will use your integrated GPU automatically for an additional speedup. Avoid Whisper large — it is too slow without a discrete card.
Windows desktop or laptop with NVIDIA/AMD discrete GPU: Any model runs well. Whisper large-v3-turbo on an RTX 2080 Ti processes 13 minutes of audio in 19 seconds. Large-v3 on an RTX 3090 takes 11 seconds per minute of audio. Pick your accuracy level and the GPU handles it.
Old hardware (pre-2015, or low-power ARM): Tiny or base Whisper only. whisper.cpp’s tiny model runs on a Raspberry Pi 4 in approximately real time with optimized settings (-ac 512). Very old Intel (Core i5-460M, 2010) achieves RTF ~0.86 with base, which is just barely faster than real time. Workable for the right use case, but the experience is slow by today’s standards.
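The four situations above can be condensed into a small lookup. This is only a sketch of the advice in this section: the `Machine` fields and the `suggest_models` function are hypothetical, the order of the checks is an assumption, and none of it describes how any particular app detects hardware.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    # All fields are illustrative flags, not real hardware detection.
    apple_silicon: bool = False
    discrete_gpu: bool = False
    old_or_low_power: bool = False  # pre-2015 CPU or low-power ARM
    english_only: bool = True


def suggest_models(m: Machine) -> list[str]:
    """Map the four situations in the text to model suggestions."""
    if m.apple_silicon:
        return ["any model (Metal acceleration is automatic)"]
    if m.discrete_gpu:
        return ["any model, including Whisper large-v3"]
    if m.old_or_low_power:
        return ["Whisper tiny", "Whisper base"]
    # Remaining case: modern CPU with integrated graphics only.
    suggestions = ["Whisper small"]
    if m.english_only:
        suggestions.insert(0, "Parakeet TDT")
    return suggestions


print(suggest_models(Machine()))
# ['Parakeet TDT', 'Whisper small']
print(suggest_models(Machine(old_or_low_power=True)))
# ['Whisper tiny', 'Whisper base']
```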
How SnailText handles this
SnailText runs Whisper and Parakeet TDT locally on Mac and Windows, on whatever hardware you have. When you first install it, SnailText detects your hardware — CPU generation, available GPU, VRAM — and recommends a model that fits your machine.
On a MacBook with M-series, it recommends a model that takes advantage of Metal. On a Windows machine with a discrete GPU, it recommends a larger Whisper model and uses Vulkan GPU acceleration. On a CPU-only Windows machine, it recommends Parakeet TDT (30× real time on CPU, English) or Whisper small — models that stay fast without a graphics card.
That recommendation is a starting point, not a cage. You can switch to any available model at any time from Settings — a faster, smaller one if you value instant response, or a larger, more accurate one if your machine can handle it.
SnailText is free to start, needs no account, and runs entirely on your device. The model downloads once and then works offline. The practical answer to “do I need a GPU” for dictation is no — but SnailText will tell you exactly which model fits your specific machine.
Pro tier adds Parakeet TDT and the larger Whisper models ($7.49 / month · $89 / year), with local LLM post-processing that cleans up technical terms and identifier style automatically. The free tier includes Whisper compact and base — fast on any CPU, no account needed.
The short version
You do not need a GPU for voice-to-text. A modern CPU and 4 GB of free RAM are enough for fast, accurate local dictation. NVIDIA’s Parakeet TDT runs at 30× real time on an ordinary i7 — faster on CPU than mid-range GPU Whisper. Whisper small runs at 7–8× real time on modern Intel/AMD. The “you need a GPU for AI” belief comes from data-center training workloads, not from transcribing your own speech.
Where a GPU genuinely matters: running Whisper large-v3 in real time on Windows, batch transcribing hours of audio, or high-concurrency server deployments. For stop-and-dictate one sentence at a time — the core use case for dictation apps — CPU is the right tool for most people.
Match the model to your machine, or let SnailText do it automatically, and CPU dictation is fast.
Benchmark sources: whisper.cpp README (RAM table); whisper.cpp GitHub discussions #89, #166, #3752; JustVoice Apple Silicon benchmarks; Parakeet TDT DeepWiki benchmarks; faster-whisper README; 1qubit.de GPU benchmarks; Phoronix whisper.cpp 1.8.3 iGPU.