Whisper: Why the Machine Hears Åre as Are

Eighty-Nine Out of a Hundred

When you ran the Swedish transcription model over your podcast archive, you measured it on the hard thing, the proper nouns. Place names, people, the local words that matter most in a paper about Jämtland. It got them right about eighty-nine times in a hundred, which is good, and the eleven it missed are the interesting ones. It did not mangle them at random. It heard a real Swedish place and confidently wrote down a different, more ordinary word that sounds almost the same. To understand why a transcription model fails in exactly that pattern, you have to know that it is not really listening. It is doing something far stranger, and the strangeness is the whole reason it is good and also the reason it betrays you on names.
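A measurement like that can be sketched in a few lines. This is only one plausible way to score names, not a record of how the archive was actually evaluated: the function names, the sample sentence, and the list of expected names are all invented for illustration. It counts a name as correct only if it appears as a whole word with its exact spelling and capitalisation, which is what separates Åre from a lookalike.

```python
import re

def name_hits(expected_names, transcript):
    """Split expected names into those found verbatim in the transcript and those missed."""
    # \w matches Unicode letters in Python 3, so Å, Ä and Ö stay inside words.
    words = set(re.findall(r"\w+", transcript))
    found = [name for name in expected_names if name in words]
    missed = [name for name in expected_names if name not in words]
    return found, missed

def name_accuracy(expected_names, transcript):
    """Fraction of expected names that survived transcription unchanged."""
    if not expected_names:
        return 0.0
    found, _ = name_hits(expected_names, transcript)
    return len(found) / len(expected_names)

# Hypothetical example: the place name came out as a lookalike word.
transcript = "Vi åkte från Östersund till are och vidare mot Storlien"
expected = ["Östersund", "Åre", "Storlien"]

found, missed = name_hits(expected, transcript)
print(missed)
print(name_accuracy(expected, transcript))
```

Listing the misses, not just the percentage, is the useful part, because the pattern in the misses is what the rest of this piece explains.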

A Picture of Sound

Before any words appear, the audio is converted into a picture: a spectrogram, a chart of which pitches are present at each instant and how strongly. Then comes the part that surprises people. The model that turns that picture into text is built on the same design as a translation model, the kind that turns French into English. One half studies the whole sound-picture and builds a rich understanding of it. The other half writes out words, one at a time, glancing back at that understanding as it goes. It is translating, from the language of sound-pictures into the language of written Swedish. And like any translator, the half that writes the words is not only matching sounds. It is also a fluent speaker, predicting what word most likely comes next based on the words it has already written.
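The two-halves arrangement can be shown as a toy loop. Everything here is a stand-in: `encode`, `toy_scores`, and the idea that the "understanding" is simply a list of words are assumptions made so the sketch runs. A real model works on learned numbers and may search over several candidate sentences at once, not just take the single best word each time.

```python
def encode(sound_picture):
    """Stand-in for the first half: turn the sound-picture into an 'understanding'."""
    return tuple(sound_picture)

def greedy_decode(memory, next_word_scores, max_words=10, end="<end>"):
    """Stand-in for the second half: write one word at a time.

    At every step the scorer sees both the understanding of the sound (memory)
    and the words written so far, and the highest-scoring word is kept.
    """
    written = []
    for _ in range(max_words):
        scores = next_word_scores(memory, written)
        word = max(scores, key=scores.get)
        if word == end:
            break
        written.append(word)
    return written

def toy_scores(memory, written):
    """Hypothetical scorer: strongly prefer the word the sound suggests next."""
    position = len(written)
    if position >= len(memory):
        return {"<end>": 1.0}
    return {memory[position]: 0.9, "<end>": 0.1}

result = greedy_decode(encode(["vi", "åker", "hem"]), toy_scores)
print(result)
```

The point of the shape is the loop: each new word is chosen with the already-written words in hand, which is where the fluency comes from.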

The Fluency That Betrays You

That fluency is the trap. The writing half has read mountains of text and learned what Swedish usually looks like on the page, which words commonly follow which. So when the sound is clear, it does beautifully, sound and expectation agreeing. But when a sound is even slightly ambiguous, the model leans on what it expects, and what it expects is the common word, not the rare one. A small local place name is, by definition, rare. It appeared seldom in everything the model ever read. A nearby ordinary word appeared constantly. So when the audio for your village sits halfway between the real name and a common word, the model's thumb is on the scale, and it writes the common word with total confidence. It is not mishearing. It is overruling its own ears with its expectations.
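The thumb on the scale can be put into rough arithmetic. This is a deliberate simplification: a real model does not keep separate scores for "what the sound says" and "what the text expects", it produces one blended guess. Treating them as two numbers that get multiplied is an assumption made to show the effect, and every probability below is invented.

```python
import math

def pick_word(acoustic, prior):
    """Pick the candidate with the highest log(acoustic) + log(prior).

    acoustic: how well each word fits the sound (hypothetical numbers)
    prior:    how strongly the writer expects each word (hypothetical numbers)
    """
    def score(word):
        return math.log(acoustic[word]) + math.log(prior[word])
    return max(acoustic, key=score)

# The writer has seen the ordinary word constantly and the village name rarely.
prior = {"Åre": 0.02, "are": 0.98}

# Slightly ambiguous audio that still leans toward the place name.
ambiguous = {"Åre": 0.55, "are": 0.45}
# Clear audio: the sound leaves little room for doubt.
clear = {"Åre": 0.99, "are": 0.01}

print(pick_word(ambiguous, prior))  # expectation outweighs the sound
print(pick_word(clear, prior))      # the sound is strong enough to win
```

With these made-up numbers the ears vote for the village in both cases, yet only the clear recording gets it written down. The ambiguous one loses to the expectation.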

That is why the failures cluster on names and almost never on the filler around them. The grammar comes out perfect because grammar is exactly what a fluent predictor is best at. The one irreplaceable word, the name a local reader would notice instantly, is the one most likely to get quietly swapped for its common-sounding cousin. The better the model is at sounding Swedish, the more dangerous it is precisely at the words that do not follow the usual pattern. And there is a sibling failure, the famous one, where you feed it silence or music and it invents a plausible sentence out of nothing. It was built to always produce words, so faced with no speech it produces the words it expects to see, like a polite sign-off, rather than admitting it heard nothing.
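One common defence against the invented sentence is to throw away stretches where the model itself suspects there was no speech and was also unsure of what it wrote. The sketch below assumes each transcript segment is a dictionary carrying a `no_speech_prob` and an `avg_logprob` value, as the open-source whisper Python package reports per segment. The thresholds are illustrative starting points, and the sample segments are made up.

```python
def drop_suspect_segments(segments, max_no_speech=0.6, min_avg_logprob=-1.0):
    """Keep segments unless they look like text invented over silence.

    A segment is dropped only when BOTH signals agree: the model thinks it
    probably heard no speech, and it was not confident in the words it wrote.
    """
    kept = []
    for seg in segments:
        looks_silent = seg["no_speech_prob"] > max_no_speech
        low_confidence = seg["avg_logprob"] < min_avg_logprob
        if looks_silent and low_confidence:
            continue
        kept.append(seg)
    return kept

# Hypothetical segments: real speech, then a sign-off written over music.
segments = [
    {"text": "Välkomna till podden.", "no_speech_prob": 0.02, "avg_logprob": -0.20},
    {"text": "Tack för att ni lyssnade.", "no_speech_prob": 0.91, "avg_logprob": -1.40},
]

kept_texts = [seg["text"] for seg in drop_suspect_segments(segments)]
print(kept_texts)
```

Requiring both signals is a cautious choice. A quiet but genuine remark can score high on "no speech" while still being transcribed confidently, and you would not want to lose it.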

Why the Swedish Version Wins

This is the whole reason a Swedish-trained version beats the general one on your archive. The fix is not better ears. The ears, the sound-picture, are nearly the same. The fix is retraining the fluent writing half on Swedish, so that its expectations are Swedish expectations. Now when the audio is ambiguous, the thumb on the scale pushes toward Swedish words and, crucially, toward Swedish names it has actually seen, instead of toward whatever the general model thought was common. You did not give it sharper hearing. You gave it the right prejudices. A model that expects Jämtland is far less likely to hear Jämtland and write down something blander.
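The "right prejudices" can be illustrated with word counts. Real retraining adjusts the model's internal weights rather than tallying words, so treat this purely as an analogy. The two tiny "reading lists", the smoothing value, and the fifty-fifty audio are all invented.

```python
from collections import Counter

def prior_from_text(words, candidates, smoothing=1.0):
    """Estimate how strongly each candidate is expected, from how often it was read."""
    counts = Counter(words)
    total = sum(counts[c] + smoothing for c in candidates)
    return {c: (counts[c] + smoothing) / total for c in candidates}

def choose(acoustic, prior):
    """Pick the candidate whose sound fit times expectation is largest."""
    return max(acoustic, key=lambda word: acoustic[word] * prior[word])

candidates = ["Åre", "are"]

# Hypothetical reading histories for the two versions of the writing half.
general_text = ["are"] * 50 + ["Åre"] * 1
swedish_text = ["are"] * 2 + ["Åre"] * 30

# Identical, perfectly ambiguous audio for both: the ears are the same.
same_audio = {"Åre": 0.5, "are": 0.5}

general_choice = choose(same_audio, prior_from_text(general_text, candidates))
swedish_choice = choose(same_audio, prior_from_text(swedish_text, candidates))
print(general_choice, swedish_choice)
```

Nothing about the sound changed between the two calls. Only the reading history did, and that alone is enough to flip the written word.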

The Keeper

So hold the real shape of it. A transcription model turns your voice into a picture of sound, then translates that picture into text using two halves, one that understands the sound and one that writes fluent language. The writing half is a predictor, always leaning toward the words it expects, and that lean is both why the transcripts read so smoothly and why they swallow the rare local name in favor of its common twin. It is not deaf. It is opinionated. When it writes "are" instead of Åre, or invents a sentence over silence, it is doing exactly what a confident fluent speaker does when the sound gets thin, filling the gap with the likely instead of admitting the unknown. The eighty-nine it gets are the ones where sound and expectation agreed. The eleven it misses are the ones where you needed it to trust its ears and it trusted its habits instead.