For most of the last decade, “dictation” meant one thing: a model listened to your voice and typed out what it heard, word for word. Filler and all. AI dictation adds a second step. After the transcript exists, a language model reads it and cleans it up - the way a careful editor would, but in a fraction of a second.
That second step is the whole difference. It is also where the privacy question hides, because in most apps the cleanup happens on someone else’s server.
Speech-to-text: the first model
Speech-to-text (also called speech recognition, or STT) is the foundational technology. You speak, a model converts the audio into a string of words. The two open models that power most desktop dictation in 2026 are OpenAI’s Whisper and NVIDIA’s Parakeet TDT. Both can run entirely on your own hardware.
What you get from this step is a faithful transcript. If you said “um, so I think we should, you know, ship it on Friday”, that is roughly what comes out. Accurate, but not something you would paste into an email without tidying it first.
That tidying used to be your job. Now a second model does it.
AI dictation: adding the language model
AI dictation runs the raw transcript through a language model (the same class of model behind ChatGPT, Claude, and Gemini). The language model does the editing pass:
- Removes filler. “Um”, “uh”, “you know”, “like” - gone.
- Fixes punctuation and grammar. Run-on speech becomes properly punctuated sentences.
- Adjusts style. Casual speech can become a professional message, a formal note, or code-style text with the right identifier casing.
- Translates. Speak in your native language, get the text in another.
So “um, so I think we should, you know, ship it on Friday” becomes “I think we should ship it on Friday.” Same meaning, ready to send.
This is why the category is called AI dictation rather than plain dictation: there are two models in the pipeline, and the second one is a language model. The speech model hears you; the language model edits you.
You said:
“so umm i pushed the fix to githab and the the latency droped on postgress”
AI dictation gives you:
“So I pushed the fix to GitHub, and the latency dropped on Postgres.”
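The two-model pipeline can be sketched in a few lines. This is a minimal illustration, not any app's actual code: the names `dictate`, `SpeechToText`, and `Cleanup` are hypothetical, and the two stand-in functions return canned strings in place of a real speech model and a real language model.

```python
from typing import Callable, Optional

# Hypothetical stage signatures, named here only for illustration.
SpeechToText = Callable[[bytes], str]  # audio in, raw transcript out
Cleanup = Callable[[str], str]         # raw transcript in, edited text out


def dictate(
    audio: bytes,
    speech_model: SpeechToText,
    language_model: Optional[Cleanup] = None,
) -> str:
    """Run the speech model, then the optional language-model editing pass."""
    transcript = speech_model(audio)
    if language_model is None:
        # Plain speech-to-text: the verbatim transcript is the result.
        return transcript
    # AI dictation: a second model edits the transcript.
    return language_model(transcript)


# Stand-in stages so the sketch runs without any real model.
def fake_speech_model(audio: bytes) -> str:
    return "um, so I think we should, you know, ship it on Friday"


def fake_language_model(transcript: str) -> str:
    return "I think we should ship it on Friday."


raw = dictate(b"", fake_speech_model)
edited = dictate(b"", fake_speech_model, fake_language_model)
```

With only the first stage, `raw` is the verbatim transcript. Passing a second stage is what turns plain speech-to-text into AI dictation.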
Speech-to-text vs AI dictation, side by side
| Axis | Plain speech-to-text | AI dictation |
|---|---|---|
| What it produces | Raw transcript of what you said | Cleaned-up text, ready to send |
| Models involved | One (speech-to-text) | Two (speech, then language model) |
| Filler words | Left in | Removed |
| Punctuation & grammar | Best-effort from the speech model | Corrected |
| Style / tone | Verbatim only | Casual to formal, or code style |
| Translation | No | Speak one language, get another |
| Where cleanup runs | N/A | Cloud in most apps; on-device in SnailText |
The part most comparisons skip: where the second model runs
Here is the question that decides whether AI dictation is private: where does the language-model step happen?
In most AI dictation apps, the speech-to-text step may run on your device, but the cleanup step calls a cloud language model - OpenAI, Anthropic, or Google. That means your transcript is uploaded on every dictation, even when your audio never left the machine. “Local speech recognition” and “local AI dictation” are not the same claim. The first can be true while the second is false.
For a Slack message about lunch, that may not matter. For a commit message that quotes proprietary code, a legal note about a client, or a clinical observation, it matters a lot. The transcript is the sensitive part, and the cleanup step is exactly where it gets sent away.
How SnailText’s AI dictation works
SnailText runs both models on your device. Whisper (or Parakeet TDT) handles speech-to-text locally, in RAM. Then a local language model - a compact Gemma model running on your own hardware - does the cleanup pass. No API key, no cloud call, nothing uploaded at either stage. Here is what that second model actually does for you.
Cleanup and correction
Every dictation gets the basic editing pass: filler words dropped, punctuation and capitalization repaired, obvious grammar slips fixed, and known brand and product names restored to their proper casing (so “github” becomes “GitHub” and “postgres” becomes “Postgres”). This is the difference between a transcript you have to fix and a sentence you can send.
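To make the editing pass concrete, here is a deliberately crude rule-based stand-in for it. It is not how SnailText does the job (that is a language model, not a regex), and the word lists and the name `basic_cleanup` are assumptions made for this sketch.

```python
import re

# Illustrative word lists only; a language model does not work from fixed lists.
FILLERS = ("um", "umm", "uh", "you know")
BRAND_CASING = {"github": "GitHub", "postgres": "Postgres"}

# A filler plus any commas and spaces hugging it on either side.
_FILLER_RE = re.compile(
    r",?\s*\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?\s*",
    flags=re.IGNORECASE,
)
_BRAND_RE = re.compile(
    r"\b(" + "|".join(re.escape(b) for b in BRAND_CASING) + r")\b",
    flags=re.IGNORECASE,
)


def basic_cleanup(transcript: str) -> str:
    """Drop fillers, restore brand casing, capitalize, and end with a period."""
    text = _FILLER_RE.sub(" ", transcript)
    text = _BRAND_RE.sub(lambda m: BRAND_CASING[m.group(1).lower()], text)
    text = re.sub(r"\s+", " ", text).strip()
    if text:
        text = text[0].upper() + text[1:]
        if text[-1] not in ".!?":
            text += "."
    return text


print(basic_cleanup("um, so I pushed the fix to github and it works on postgres"))
```

Rules like these break quickly: this version would also strip “you know” from “do you know the answer”. Telling a filler from a meaningful phrase takes context, which is the reason the cleanup step is a language model rather than a list of patterns.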
Topic profiles
Cleanup is not one-size-fits-all - a developer dictating code wants different handling than a novelist dictating prose. SnailText ships five topic profiles, and you pick the one that matches what you mostly dictate:
- General - no topic bias, for dictations that cover many areas.
- Development & IT - restores snake_case/camelCase identifiers and library names (Python, React, Docker, Postgres, and the like). The default for fresh installs.
- Writing - articles, essays, prose. Preserves your voice and sentence rhythm, and skips identifier rewriting entirely so it never mangles a normal sentence into code.
- Business - meetings, emails, project management. Knows KPI / OKR / ROI vocabulary and cases brand names correctly.
- Academic - scientific writing, formula references, Latin species names, preserved technical terminology.
The profile is the single biggest lever on how the cleanup behaves, because it tells the language model what kind of text you are producing before it touches a word.
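One plausible way a profile could reach the language model is as a short instruction placed ahead of the transcript. The sketch below is an assumption about the general technique, not SnailText's implementation: `TopicProfile`, `PROFILES`, `build_instructions`, and the instruction wording are all hypothetical, and only the two profiles whose identifier behavior is described above are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicProfile:
    name: str
    hint: str                  # what kind of text the user is producing
    rewrite_identifiers: bool  # whether spoken symbols become code identifiers


# Hypothetical table covering two of the five profiles.
PROFILES = {
    "development": TopicProfile(
        "Development & IT",
        "Software and IT; keep library names exact.",
        True,
    ),
    "writing": TopicProfile(
        "Writing",
        "Articles, essays, prose; preserve voice and sentence rhythm.",
        False,
    ),
}


def build_instructions(profile_key: str) -> str:
    """Assemble the instruction text the model would see before the transcript."""
    profile = PROFILES[profile_key]
    lines = [
        "Clean up the dictated transcript without changing its meaning.",
        f"Topic: {profile.name}. {profile.hint}",
    ]
    if profile.rewrite_identifiers:
        lines.append("Restore spoken symbol names to code identifiers.")
    else:
        lines.append("Do not rewrite words into code identifiers.")
    return "\n".join(lines)


print(build_instructions("writing"))
```

The point of the design is ordering: the model learns what kind of text it is editing before it sees a single dictated word.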
Identifier styles for code
If you dictate code, you can set the convention the model restores symbols into: snake_case, camelCase, kebab-case, PascalCase, or Auto (let the model infer from context). Say “recording completed” with the Development & IT profile active and snake_case selected, and it comes out as recording_completed rather than two plain words. This is what makes voice-driven coding usable instead of a constant cleanup chore.
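The four fixed conventions are simple string transforms once the model has decided that a phrase is a symbol. A minimal sketch, assuming a hypothetical helper named `to_identifier`; the Auto mode is left out because it depends on context that only the language model sees.

```python
def to_identifier(phrase: str, style: str = "snake_case") -> str:
    """Join the words of a spoken phrase into one identifier in the given style."""
    words = [w.lower() for w in phrase.split()]
    if not words:
        return ""
    if style == "snake_case":
        return "_".join(words)
    if style == "kebab-case":
        return "-".join(words)
    if style == "camelCase":
        return words[0] + "".join(w.capitalize() for w in words[1:])
    if style == "PascalCase":
        return "".join(w.capitalize() for w in words)
    raise ValueError(f"unknown style: {style}")


for style in ("snake_case", "camelCase", "kebab-case", "PascalCase"):
    print(style, "->", to_identifier("recording completed", style))
```

The hard part is not the transform but the decision that precedes it: recognizing that “recording completed” is a symbol in this sentence and ordinary prose in the next.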
Style, tone, and translation
The same model can shift register - turning a casual spoken sentence into a professional message - and translate: speak in your native language and get the text in another, processed locally rather than sent to a translation API.
You stay in control
The cleanup is deliberately conservative. It is tuned to preserve your meaning rather than rewrite it, and it leaves text alone when it is already clean. If you want the raw transcript with no editing, you turn the step off and get plain verbatim speech-to-text. AI dictation is a mode you switch on, not a filter you are stuck with.
This is also why we can call SnailText AI dictation honestly. Before the local language-model step shipped, it was a fast, private speech-to-text app. With two models in the pipeline - both on-device - it is AI dictation that uploads nothing.
The local language-model cleanup is a Pro feature, currently in beta. The free tier gives you the full local speech-to-text engine with no account and no word limit; Pro ($7.49/mo or $89/yr, up to 3 devices) adds the on-device cleanup model, the topic profiles, and the identifier styles described above.
When you want plain speech-to-text instead
AI dictation is not always the right mode. If you are transcribing a quote and need the exact words, or you are dictating into a system that has its own formatting rules, the cleanup step can get in the way. That is what the off switch is for. The point is not that one replaces the other - AI dictation gives you a second mode, and a good app lets you pick per task.
The short version: speech-to-text writes down what you said. AI dictation hands you what you meant to send - in the style your work needs. The only thing left to check is whether that second step keeps your words on your machine.