
Dictation deep-dive · 2026

Do you need a GPU for voice-to-text? No — here is what the numbers actually say

The short answer is no. The longer answer explains why "CPU is too slow" depends entirely on which model you run — and has nothing to do with owning a graphics card.

By SnailText's founder · Published

The short version

You do not need a GPU for voice-to-text. Parakeet TDT runs at RTF 0.033 on a modern CPU — 30x faster than real time. Whisper base and small run at or above real time on any 2020+ laptop. A GPU makes the largest Whisper models usable in real time: large-v3 on CPU takes 3-6 minutes per audio minute; on an RTX 3090 that drops to 11 seconds. For dictation (short phrases, one at a time), CPU is sufficient. For bulk transcription of long recordings, a GPU saves real hours. The "you need a GPU for AI" idea comes from training in data centers — running inference on your own speech is a far smaller task.

Real-time factors by model and hardware (verified 2026-06-28)

| Model | Hardware | RTF | Speed label | Good for dictation? |
| --- | --- | --- | --- | --- |
| Parakeet TDT v3 | CPU (i7-12700KF) | 0.033 | 30x real time | Yes (fastest CPU option) |
| Whisper tiny | Apple M1 CPU-only | 0.04 | 25x real time | Yes |
| Whisper base | Apple M1 CPU-only | 0.07 | 14x real time | Yes |
| Whisper small | Apple M1 CPU-only | 0.17 | 6x real time | Yes |
| Whisper small int8 | CPU (i7-12700K) | 0.13 | 7.8x real time | Yes |
| Whisper medium | Apple M1 CPU-only | 0.40 | 2.5x real time | Marginal (1-2s delay) |
| Whisper large-v3 | Apple M1, Metal GPU | 1.0 | approx. real time | Borderline |
| Whisper large-v3 | CPU (Ryzen 7 5700G) | 3.0 | 3x slower than real time | No |
| Whisper large-v3 | RTX 3090 GPU | 0.19 | 5x real time | Yes (where GPU earns its keep) |

Before installing a local dictation app, most people ask the same question: do I need a gaming graphics card, or will my laptop handle this? The GPU anxiety is understandable — “AI” and “GPU” have become synonymous in the news. But running a finished model on your own machine to transcribe your own voice is a completely different task from training models in a data center, and the hardware requirements are an order of magnitude smaller.

The direct answer: NVIDIA’s Parakeet TDT model runs at RTF 0.033 on an ordinary Intel i7 CPU — 30 times faster than real time, no GPU required. Whisper small with int8 quantization runs about 7.8× real time on the same class of hardware. For everyday dictation, that is more than fast enough. Here is the full picture, with actual benchmark numbers.

Why people think they need a GPU

The association is misplaced but understandable. The GPUs you see in AI headlines are doing one of two things: training models from scratch on petabytes of data in a data center, or serving thousands of simultaneous users through a cloud API. Both tasks genuinely need racks of graphics cards.

Running a finished, quantized model on your own machine to transcribe a few seconds of your own voice is a completely different workload. You are doing inference on one person’s audio, a few seconds at a time. The compute requirement is orders of magnitude smaller.

The other source of confusion is 2022-era benchmarks of the original Python OpenAI Whisper on CPU. That implementation was genuinely slow — a user with a Ryzen 5 3600 reported 11 minutes to process 31 seconds of audio with the medium model. That benchmark gets cited constantly and is still accurate for Python Whisper, but it has nothing to do with the whisper.cpp implementation that dictation apps use today. whisper.cpp with quantized models on a modern CPU is a completely different experience.

What voice-to-text actually requires

The real requirements are modest:

  • A modern CPU. Any processor from roughly the last decade with AVX instructions — which is almost every laptop and desktop sold since 2015.
  • A few gigabytes of free RAM. This is the main constraint, and it is smaller than people expect (see the table below).
  • Disk space for the model. A one-time download, typically 75 MB to 3 GB depending on the model you choose.

No graphics card appears anywhere in that list.

RAM requirements by model (per the whisper.cpp project):

| Model | Disk (GGML) | RAM in use |
| --- | --- | --- |
| Whisper tiny | 75 MB | ~273 MB |
| Whisper base | 142 MB | ~388 MB |
| Whisper small | 466 MB | ~852 MB |
| Whisper medium | 1.5 GB | ~2.1 GB |
| Whisper large-v3 | 2.9 GB | ~3.9 GB |
| Parakeet TDT v3 | ~2.3 GB | ~1.5 GB |

A machine with 8 GB of total RAM and 4 GB free runs Whisper small or Parakeet comfortably. That is a fairly modest machine by 2026 standards.
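As a rough illustration of that rule of thumb, the sketch below filters the models by the RAM figures from the table above. The dictionary keys, the function name, and the 512 MB headroom are assumptions made for this example, not part of whisper.cpp or SnailText.

```python
# RAM-in-use figures (MB) copied from the table above.
# The key names are illustrative labels, not official model identifiers.
MODEL_RAM_MB = {
    "whisper-tiny": 273,
    "whisper-base": 388,
    "whisper-small": 852,
    "parakeet-tdt-v3": 1500,
    "whisper-medium": 2100,
    "whisper-large-v3": 3900,
}


def models_that_fit(free_ram_mb, headroom_mb=512):
    """Return the models whose RAM use fits in free RAM, minus some headroom.

    headroom_mb is an assumed safety margin for the app and audio buffers.
    """
    budget = free_ram_mb - headroom_mb
    return [name for name, ram in MODEL_RAM_MB.items() if ram <= budget]


if __name__ == "__main__":
    for free_gb in (1, 2, 4, 8):
        print(f"{free_gb} GB free:", models_that_fit(free_gb * 1024))
```

Fitting in memory says nothing about speed: a model can load comfortably and still be too slow for dictation on a given chip.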

The actual benchmark numbers

Here is how fast each model actually runs on various hardware. RTF (real-time factor) is processing time divided by audio duration. An RTF below 1.0 means faster than real time: the model transcribes audio faster than it plays. RTF 3.0 means 1 minute of audio takes 3 minutes to process.
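The arithmetic behind RTF is simple enough to check against your own measurements. This is a minimal sketch; the function names are made up for illustration, and the sample timings are chosen to match the RTF 3.0 and RTF 0.033 cases discussed in this article.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent transcribing / length of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speed_multiple(rtf):
    """How many times faster than real time an RTF corresponds to."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


if __name__ == "__main__":
    # 1 minute of audio that takes 3 minutes to process.
    slow = real_time_factor(processing_seconds=180, audio_seconds=60)
    # 1 minute of audio that takes 2 seconds to process.
    fast = real_time_factor(processing_seconds=2, audio_seconds=60)
    for rtf in (slow, fast):
        print(f"RTF {rtf:.3f} = {speed_multiple(rtf):.1f}x real time")
```

To measure your own setup, time one transcription with a stopwatch or `time.perf_counter()` and divide by the clip length.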

The performance winner is not what most people expect. NVIDIA’s Parakeet TDT v3 on a modern Intel i7 CPU (i7-12700KF) achieves RTF 0.033 — roughly 30× faster than real time. For comparison, faster-whisper running the large-v2 model on an RTX 3070 Ti GPU achieves RTF ~0.076 — about 13× real time. Parakeet on CPU is faster than mid-range GPU Whisper.

The reason is architectural. Parakeet TDT is a Token-and-Duration Transducer: its encoder processes the entire clip in a single forward pass, and a lightweight decoder predicts each token together with its duration, which lets it skip ahead over audio frames instead of stepping through them one by one. Whisper uses a large autoregressive Transformer decoder that generates tokens one at a time in a loop, which is inherently sequential and slower. For short audio clips this matters a lot.

Apple Silicon is a special case. Every Mac with an M-series chip (M1 through M4) automatically uses Metal GPU acceleration in whisper.cpp and in SnailText. An M1 Mac running Whisper large-v3 via Metal achieves approximately real-time (RTF ~1.0) — good enough for dictation without any discrete GPU. An M2 Pro with Metal runs large-v3 at 2.5× real time. MacBook users do not have a GPU problem.

Windows users without a discrete GPU have a more nuanced picture. A modern i7 (12th/13th gen) runs Whisper small at about 7–8× real time — comfortable for dictation where each phrase is 5–15 seconds. A mid-range Ryzen CPU runs large-v3 at RTF ~3.0 — too slow for interactive dictation. The right answer for Windows CPU-only is: Parakeet TDT (30× real time) or Whisper small (7–8× real time). Not large Whisper.

One more development worth knowing: whisper.cpp 1.8.3 added integrated GPU support via Vulkan, delivering roughly a 12× performance boost on AMD Radeon 680M and Intel Arc integrated graphics. If your Windows laptop has a modern AMD Ryzen (6000+) or Intel 12th-gen+, whisper.cpp can use that integrated GPU automatically — bringing latency down significantly even without a discrete card.

“CPU is too slow” depends entirely on the model

People say “local dictation is slow on CPU” as if the chip is the problem. The variable is the model.

Smaller Whisper models (tiny, base, small) run at or above real time on any 2020+ laptop CPU. The experience is fast. The larger models are a different story on CPU: medium is marginal, and large-v3 runs several times slower than real time on typical hardware. That is the case people remember and report, and it gives CPU a bad name that the smaller models do not deserve.

Quantization makes a meaningful difference too. The int8 quantized version of a model runs significantly faster than the float32 version on CPU with negligible accuracy loss. faster-whisper benchmarks show the small int8 model completing 13 minutes of audio in 102 seconds on an i7-12700K, while the float32 version takes 157 seconds for the same file. Choosing the quantized version of a model is free speed.
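To give a feel for why int8 costs so little accuracy, here is a toy sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration of the general idea only: the function names and the random "weights" are invented for the example, and real runtimes such as whisper.cpp and faster-whisper use their own, more refined quantization schemes.

```python
import random


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    random.seed(0)
    # Stand-in for one layer's weights; real model weights are not random.
    weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print("float32 storage:", len(weights) * 4, "bytes")
    print("int8 storage:   ", len(weights), "bytes, plus one scale value")
    print("largest round-trip error:", max_err)
    print("rounding bound (scale / 2):", scale / 2)
```

Each weight shrinks from four bytes to one, and the round-trip error stays within half a quantization step, which is small relative to the weights themselves.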

When a GPU genuinely earns its keep

To be fair to the other side: a dedicated GPU is a real improvement in specific situations.

Running large-v3 in real time on Windows. An RTX 3090 processes 1 minute of audio in about 11 seconds (RTF ~0.19). The same task on a Ryzen 7 CPU takes about 3 minutes. If large-v3 accuracy is your requirement and you are on Windows, where Apple Silicon’s Metal acceleration is not available, a discrete GPU is not optional: it is the only way to get there.

Batch transcribing long recordings. At RTF ~0.19, 1 hour of audio on an RTX 3090 with large-v3 takes roughly 11 minutes. On a Ryzen 7 CPU (RTF ~3.0) the same job takes approximately 3 hours. If you regularly transcribe podcasts, lectures, or long meetings, a GPU saves real time across the week.

High-accuracy multilingual dictation. If you dictate in a language where smaller models have noticeably higher error rates, you may need medium or large — and that is where GPU matters.

For everyday dictation — short phrases, one person, stop-and-transcribe — you will not notice the GPU.
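The contrast between dictation and batch work falls out of multiplying clip length by RTF. The sketch below uses the large-v3 RTF values quoted in this article; the function names and the dictionary are assumptions for illustration.

```python
# RTF values for Whisper large-v3 as quoted in this article.
RTF_LARGE_V3 = {
    "Ryzen 7 CPU": 3.0,
    "RTX 3090 GPU": 0.19,
}


def processing_seconds(audio_seconds, rtf):
    """Estimated transcription time for a clip of the given length."""
    return audio_seconds * rtf


def format_duration(seconds):
    """Render seconds as '1.9s', '11m 24s' or '3h 00m'."""
    if seconds < 60:
        return f"{seconds:.1f}s"
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    if hours:
        return f"{hours}h {minutes:02d}m"
    return f"{minutes}m {secs:02d}s"


if __name__ == "__main__":
    jobs = {"10-second phrase": 10, "1-hour recording": 3600}
    for job, audio in jobs.items():
        for hardware, rtf in RTF_LARGE_V3.items():
            wait = format_duration(processing_seconds(audio, rtf))
            print(f"{job:18} on {hardware:13}: {wait}")
```

The estimate is linear and ignores model load time and per-clip overhead, so treat it as a rough guide rather than a benchmark.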

The four hardware situations

MacBook (any M-series chip): No GPU concern. Metal acceleration is automatic. M1 handles large-v3 at real time. M2 Pro handles it at 2.5× real time. Use SnailText, pick any model, and it works. The only decision is whether you want the speed of a smaller model or the accuracy ceiling of large.

Windows laptop, integrated graphics only (no discrete GPU): Use Parakeet TDT for English — 30× real time on CPU, no GPU needed. Or use Whisper small (7–8× real time). If your chip supports Vulkan (AMD Ryzen 6000+ or Intel 12th-gen+), whisper.cpp will use your integrated GPU automatically for an additional speedup. Avoid Whisper large — it is too slow without a discrete card.

Windows desktop or laptop with NVIDIA/AMD discrete GPU: Any model runs well. Whisper large-v3-turbo on an RTX 2080 Ti processes 13 minutes of audio in 19 seconds. Large-v3 on an RTX 3090 takes 11 seconds per minute of audio. Pick your accuracy level and the GPU handles it.

Old hardware (pre-2015, or low-power ARM): Tiny or base Whisper only. whisper.cpp’s tiny model runs on a Raspberry Pi 4 in approximately real time with optimized settings (-ac 512). A very old Intel chip (Core i5-460M, 2010) achieves RTF ~0.86 with base, which is only slightly faster than real time. Workable for the right use case, but the experience is slow by today’s standards.
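The four situations above can be written down as a small decision function. This is a hypothetical sketch of the rules as described in this section, not SnailText's actual detection logic; the `Machine` fields and the returned strings are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical description of a computer, for illustration only."""
    apple_silicon: bool = False
    discrete_gpu: bool = False
    cpu_year: int = 2020          # rough release year of the CPU
    low_power_arm: bool = False   # e.g. a Raspberry Pi
    english_only: bool = True


def recommend_model(machine: Machine) -> str:
    """Apply the four hardware situations in order."""
    if machine.apple_silicon:
        return "any model (Metal acceleration is automatic)"
    if machine.discrete_gpu:
        return "Whisper large-v3 or large-v3-turbo"
    if machine.cpu_year < 2015 or machine.low_power_arm:
        return "Whisper tiny or base"
    if machine.english_only:
        return "Parakeet TDT"
    return "Whisper small"


if __name__ == "__main__":
    print(recommend_model(Machine(apple_silicon=True)))
    print(recommend_model(Machine(discrete_gpu=True)))
    print(recommend_model(Machine(cpu_year=2012)))
    print(recommend_model(Machine(english_only=False)))
```

The order of the checks is a design choice: the sketch assumes a discrete GPU outweighs an older CPU, which may not hold for very old machines.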

How SnailText handles this

SnailText runs Whisper and Parakeet TDT locally on Mac and Windows, on whatever hardware you have. When you first install it, SnailText detects your hardware — CPU generation, available GPU, VRAM — and recommends a model that fits your machine.

On a MacBook with M-series, it recommends a model that takes advantage of Metal. On a Windows machine with a discrete GPU, it recommends a larger Whisper model and uses Vulkan GPU acceleration. On a CPU-only Windows machine, it recommends Parakeet TDT (30× real time on CPU, English) or Whisper small — models that stay fast without a graphics card.

That recommendation is a starting point, not a cage. You can switch to any available model at any time from Settings — a faster, smaller one if you value instant response, or a larger, more accurate one if your machine can handle it.

SnailText is free to start, needs no account, and runs entirely on your device. The model downloads once and then works offline. The practical answer to “do I need a GPU” for dictation is no — but SnailText will tell you exactly which model fits your specific machine.

Pro tier adds Parakeet TDT and the larger Whisper models ($7.49 / month · $89 / year), with local LLM post-processing that cleans up technical terms and identifier style automatically. The free tier includes Whisper compact and base — fast on any CPU, no account needed.

The short version

You do not need a GPU for voice-to-text. A modern CPU and 4 GB of free RAM are enough for fast, accurate local dictation. NVIDIA’s Parakeet TDT runs at 30× real time on an ordinary i7 — faster on CPU than mid-range GPU Whisper. Whisper small runs at 7–8× real time on modern Intel/AMD. The “you need a GPU for AI” belief comes from data-center training workloads, not from transcribing your own speech.

Where a GPU genuinely matters: running Whisper large-v3 in real time on Windows, batch transcribing hours of audio, or high-concurrency server deployments. For stop-and-dictate one sentence at a time — the core use case for dictation apps — CPU is the right tool for most people.

Match the model to your machine, or let SnailText do it automatically, and CPU dictation is fast.


Benchmark sources: whisper.cpp README (RAM table); whisper.cpp GitHub discussions #89, #166, #3752; JustVoice Apple Silicon benchmarks; Parakeet TDT DeepWiki benchmarks; faster-whisper README; 1qubit.de GPU benchmarks; Phoronix whisper.cpp 1.8.3 iGPU.

SnailText is offline voice dictation for Mac and Windows — local, private, free to start.


Common questions

Do you need a GPU to run voice-to-text?

No. Speech recognition runs on an ordinary CPU. NVIDIA's Parakeet TDT model achieves RTF 0.033 on an i7 CPU — 30× faster than real time — without a GPU at all. Whisper base and small models run at real-time or faster on any 2020+ laptop CPU. A GPU helps with the largest Whisper models, but it is not required to dictate your own speech.

How much RAM does local voice-to-text need?

Less than most people expect. According to the whisper.cpp project, the tiny model uses ~273 MB RAM, base ~388 MB, small ~852 MB, medium ~2.1 GB, and large-v3 ~3.9 GB. Parakeet TDT sits around 1.5 GB in use. A practical rule — if you have 4 GB of free RAM, you can run small Whisper or Parakeet comfortably. Even an 8 GB machine handles local dictation well.

Is voice-to-text slow on CPU?

It depends entirely on the model. Parakeet TDT runs at RTF 0.033 on a modern CPU — 30× faster than real time. Whisper small with int8 quantization runs about 7–8× real time on a mid-range i7. Where CPU is genuinely slow is large Whisper models: large-v3 on a Ryzen 7 takes about 3 minutes to process 1 minute of audio. That is the experience people remember when they say CPU dictation is too slow — they were running the heaviest model.

Which is faster on CPU, Parakeet or Whisper?

Parakeet TDT, by a wide margin for English. Benchmarks put Parakeet TDT at RTF 0.033 on an i7-12700KF (30× real time), while faster-whisper's large-v2 running on an RTX 3070 Ti GPU only manages 13× real time. Parakeet on CPU outperforms Whisper large on a mid-range GPU. The reason is architectural: Parakeet is a Token-and-Duration Transducer that runs the full audio through its encoder once and uses a lightweight decoder that can skip ahead over audio frames, while Whisper's large autoregressive decoder loops through tokens one at a time.

Will voice-to-text work on a MacBook without a dedicated GPU?

Yes, very well. Every Apple Silicon Mac (M1 through M4) automatically uses Metal GPU acceleration in whisper.cpp and SnailText. An M2 Pro with Metal runs large-v3 at 2.5× real time. Even CPU-only (without Metal), an M1 runs the base model at 14× real time and small at ~6× real time. MacBook users have no practical GPU concern: the integrated chip handles it.

When does a GPU actually help with voice-to-text?

A GPU makes the biggest difference in three cases: running large-v3 or large-v3-turbo at real time on Windows (where CPU is too slow), batch transcription of long recordings (an hour of audio that takes 3 hours on CPU takes roughly 11 minutes on an RTX 3090 at RTF ~0.19), and high-concurrency server deployments with multiple simultaneous users. For everyday dictation (short phrases, one person, stop-and-transcribe), CPU is sufficient.

What does a GPU do that CPU cannot for speech recognition?

A GPU does not improve accuracy: the same model produces essentially the same transcript on CPU and GPU. What a GPU provides is speed, because it runs the matrix operations inside the encoder and decoder in parallel. For voice dictation where the audio is 1-10 seconds at a time, the speed difference is mostly invisible. For long-form transcription or large models, GPU is a meaningful upgrade.

Can Whisper run on a Raspberry Pi without a GPU?

Yes, with the right model. Raspberry Pi 4 with the tiny.en model achieves approximately real-time transcription using whisper.cpp's streaming mode with reduced audio context (-ac 512). Pi 5 handles the base model. Larger models are too slow for real-time use on Pi hardware but work for post-recording transcription if you are willing to wait.
