BERT: The Model You Use Without Knowing It

The Name Everyone Has Heard

BERT. You've heard the name. Maybe you've seen it mentioned alongside GPT, or Llama, or Claude. But unlike those models, nobody has a conversation with BERT. Nobody asks BERT to write an email or explain quantum physics. And yet BERT, or models directly descended from it, quietly power an enormous chunk of the AI that people actually use in production every single day. Search engines. Spam filters. Content moderation systems. Customer service routing. Legal document analysis. If you've ever wondered how your email client knows which messages are important, there's a decent chance BERT or something like it is involved.

So what is BERT, how does it work, and — this is the part that matters for someone building things — when should you reach for it instead of a large language model like Claude or GPT?

The Core Idea

BERT stands for Bidirectional Encoder Representations from Transformers. Google published it in twenty eighteen. It was a breakthrough, but to understand why, you need to understand what came before.

Before BERT, language models read text in one direction. Left to right, or right to left, but not both at the same time. If you were trying to predict the meaning of a word in a sentence, you could only look at the words that came before it, or the words that came after, but not both simultaneously. This is like trying to understand a joke where you can only read the setup or the punchline, but not both.

BERT's trick was simple but powerful: it reads in both directions at once. When processing the word "bank" in the sentence "I went to the bank to deposit money," BERT sees both "went to the" on the left and "to deposit money" on the right. This lets it figure out that "bank" means a financial institution, not a riverbank. The technical term is bidirectional attention, and it's what the B in BERT stands for.

The key insight is that understanding a word requires seeing its full context, not just what came before. BERT processes the entire sentence simultaneously, building a rich representation of each word that incorporates information from every other word in the sentence.

How It Learns

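BERT is trained with a beautifully simple game. Take a sentence. Randomly pick fifteen percent of the tokens and hide most of them behind a special mask token. In the original recipe, a small share of the picked tokens are instead swapped for a random token or left unchanged, so the model can't count on the mask always being there. Then ask the model to predict what the original tokens were. This is called masked language modeling, and it's how BERT learns the structure of language.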

For example, take the sentence "Kungen bor i Stockholm och regerar Sverige." Mask it as "Kungen bor i [MASK] och [MASK] Sverige." BERT has to figure out that the first mask is probably "Stockholm" and the second is probably "regerar." To do this, it needs to understand Swedish grammar, know facts about the Swedish monarchy, and grasp how Swedish sentences are structured.
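Here's a rough sketch of that masking game in Python. It follows the split described in the original BERT paper: of the positions picked, eighty percent become the mask token, ten percent become a random token, and ten percent stay as they are. The function name, the whitespace tokenization, and the fixed-size sampling are simplifying assumptions for illustration. Real BERT works on WordPiece tokens and picks each position independently.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must
    predict, and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    # Pick roughly mask_prob of the positions, but always at least one.
    n_select = max(1, round(len(tokens) * mask_prob))
    for i in rng.sample(range(len(tokens)), n_select):
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

# Whitespace splitting is an assumption made to keep the sketch short.
sentence = "Kungen bor i Stockholm och regerar Sverige".split()
corrupted, labels = mask_tokens(sentence, vocab=sentence, seed=0)
```

With only seven tokens, fifteen percent rounds to a single position, so one word is selected per call. On a long document the same function hides roughly one token in seven.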

There's a second training task too: next sentence prediction. Give BERT two sentences and ask whether the second one really came right after the first in the original text, or was picked at random from somewhere else. This teaches the model to understand relationships between sentences, which matters for tasks like question answering where you need to connect a question to the relevant part of a document.
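Building training pairs for next sentence prediction might look like the sketch below. It assumes the fifty-fifty split from the original BERT recipe: half the time the second sentence is the true successor, half the time it's a random one. The function name is hypothetical, and drawing negatives from the same document is a simplification. The original setup draws them from elsewhere in the corpus.

```python
import random

def make_nsp_pairs(sentences, seed=None):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered document."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to draw negatives")
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that really comes next.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence except the current one
            # and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs

doc = ["Kungen bor i Stockholm.", "Han regerar Sverige.",
       "Slottet ligger i Gamla stan.", "Det har många rum."]
pairs = make_nsp_pairs(doc, seed=1)
```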

The original English BERT was trained on about three point three billion words from English Wikipedia and a corpus of books. Training the large version took four days on sixty-four TPU chips, Google's specialized AI hardware. That was considered enormous in twenty eighteen. By today's standards, it's tiny. GPT-SW3 trained on three hundred and twenty billion tokens. Llama three trained on about fifteen trillion.

BERT versus GPT: The Fundamental Split

Here's where people get confused. BERT and GPT are both transformers. They're built from the same fundamental architecture, the one described in the famous twenty seventeen paper "Attention Is All You Need." But they use that architecture in completely different ways, and this difference determines everything about when you should use each one.

GPT is a decoder. It generates text, one token at a time, left to right. You give it a prompt, and it predicts what comes next, token by token. This is why GPT can write essays, code, and poetry. It's an autoregressive model: each output token becomes input for generating the next one.

BERT is an encoder. It doesn't generate text. Instead, it takes a piece of text and produces a rich numerical representation of that text: one vector of numbers per token, each capturing what that token means in context. These representations can then be used as input to other systems that make decisions.
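The decoder and encoder split comes down to which positions each token is allowed to look at. A minimal sketch, with hypothetical function names and plain Python lists standing in for the attention masks a real transformer would use:

```python
def causal_mask(n):
    """Decoder-style mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]

tokens = ["I", "went", "to", "the", "bank", "to", "deposit", "money"]
bank = tokens.index("bank")

# Which tokens can inform the representation of "bank" under each mask?
visible_to_decoder = [t for t, m in zip(tokens, causal_mask(len(tokens))[bank]) if m]
visible_to_encoder = [t for t, m in zip(tokens, bidirectional_mask(len(tokens))[bank]) if m]
```

Under the causal mask, "bank" only sees "I went to the bank". Under the bidirectional mask it also sees "to deposit money", which is the part that settles what kind of bank it is.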

Think of it this way. GPT is a writer. You give it a topic and it writes. BERT is a reader. You give it text and it understands. A writer can also read, which is why you can use GPT for classification tasks too. But a reader is often better at reading than a writer is, because reading is all it does.

So BERT can't write a single word? What's the point of a language model that can't produce language?

Exactly right. BERT can't write a single word. And that's precisely its strength. Because it doesn't need to be able to generate text, it can focus entirely on understanding text. And because it's smaller and more focused, it's dramatically faster and cheaper to run.

The Numbers That Matter

BERT-base has a hundred and ten million parameters. GPT-four has, well, nobody outside OpenAI knows exactly, but unconfirmed estimates put it at over a trillion. Anthropic hasn't published a parameter count for Claude Opus either. Llama three's largest version has four hundred and five billion parameters. Even the eight billion parameter version of Llama is roughly seventy times larger than BERT.
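The arithmetic behind those comparisons is easy to check. The sketch below uses only the published parameter counts mentioned above. The memory estimate assumes weights stored as 32-bit floats, four bytes each, which is an assumption about storage format and not something specific to any one release.

```python
BERT_BASE_PARAMS = 110_000_000
LLAMA_8B_PARAMS = 8_000_000_000
LLAMA_405B_PARAMS = 405_000_000_000

def size_ratio(big, small):
    """How many times larger one model is than another, by parameter count."""
    return big / small

def fp32_megabytes(params):
    """Approximate weight size at 4 bytes per parameter (32-bit floats)."""
    return params * 4 / 1_000_000

ratio_8b = size_ratio(LLAMA_8B_PARAMS, BERT_BASE_PARAMS)      # about 72.7
ratio_405b = size_ratio(LLAMA_405B_PARAMS, BERT_BASE_PARAMS)  # about 3,682
bert_mb = fp32_megabytes(BERT_BASE_PARAMS)                    # 440.0
```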

This size difference has practical consequences. BERT-base runs comfortably on a laptop CPU. Not fast, but it works. You can process a sentence in tens of milliseconds. Try running Llama eight B on a laptop CPU and you're looking at seconds per token of generated text. On a GPU, BERT processes thousands of sentences per second. A large language model generates maybe a hundred tokens per second.

Cost follows from speed. If you need to classify a million customer support tickets into categories, using Claude's API costs you something for every ticket. Maybe only a fraction of a cent each, but it adds up. Running BERT locally on a single GPU costs little beyond hardware and electricity after the initial setup. At scale, this can be the difference between a viable product and an unsustainable one.
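To put rough numbers on it, here's a back-of-the-envelope sketch. Every figure in it is a placeholder assumption, not a quoted price: a fifth of a cent per API call, a GPU that costs one dollar per hour to run, and a throughput of a thousand tickets per second. Swap in your own numbers.

```python
def api_cost(n_items, price_per_item):
    """Total cost when every item is a paid API call."""
    return n_items * price_per_item

def local_cost(n_items, items_per_second, gpu_price_per_hour):
    """Total cost when items are processed on a GPU billed by the hour."""
    hours = n_items / items_per_second / 3600
    return hours * gpu_price_per_hour

N_TICKETS = 1_000_000

# All three prices and rates below are hypothetical.
api_total = api_cost(N_TICKETS, price_per_item=0.002)
local_total = local_cost(N_TICKETS, items_per_second=1000, gpu_price_per_hour=1.0)
```

Under these assumptions the API bill lands around two thousand dollars, while the local run finishes in under twenty minutes of GPU time and costs well under a dollar. The exact figures will differ, but the gap of several orders of magnitude is the point.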

And there's latency. When a user submits a search query, they expect results in milliseconds. BERT can deliver. A large language model, generating its answer token by token, usually cannot.

Swedish BERT: KB-BERT

This brings us to Sweden. In twenty twenty, KBLab at the National Library of Sweden released KB-BERT: a BERT model trained specifically on Swedish text. The training data was about fifteen to twenty gigabytes of text — roughly three billion tokens — from the library's collections. Books, newspapers, government publications, Swedish Wikipedia, and internet forums.

Why did they need to make a Swedish-specific model? Because the alternatives were bad. Google's multilingual M-BERT covered a hundred and four languages, but it spread its capacity thin. Swedish got a small slice of the vocabulary and limited exposure to Swedish text patterns. The Swedish Public Employment Service, Arbetsförmedlingen, had trained their own BERT variant, but it was optimized for their specific domain.

KB-BERT outperformed both multilingual BERT and Arbetsförmedlingen's model on Swedish NLP tasks. Named entity recognition: finding person names, organization names, locations, and dates in Swedish text. Part-of-speech tagging: knowing that "var" is a verb in "jag var där" and an adverb in "var bor du." Sentiment analysis. Text classification. On the benchmarks KBLab reported, the Swedish-specific model won.

KBLab later released additional models. A BERT fine-tuned specifically for named entity recognition using the SUC three point zero dataset, which is the standard annotated Swedish text corpus. An ALBERT model, which is a lighter-weight variant. And a Swedish sentence transformer — a model that turns entire sentences into numerical vectors that capture their meaning, which is essential for building search systems and similarity comparisons.

The sentence transformer deserves special mention. It was trained using a technique called knowledge distillation, where a strong English sentence embedding model acts as a teacher, and the Swedish model learns to produce similar embeddings for Swedish sentences. It achieved the highest published scores on SweParaphrase, the Swedish sentence similarity benchmark. If you need to build a semantic search system that works in Swedish — one that understands that "bostadspriserna stiger" and "lägenheter blir dyrare" mean roughly the same thing even though they share almost no words — KBLab's sentence transformer is the model to use.
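The distillation idea can be sketched as a loss function. This assumes the commonly used multilingual distillation setup, where the student sees an English sentence and its Swedish translation and is pushed toward the teacher's embedding of the English sentence using mean squared error. The exact objective KBLab used isn't spelled out here, so treat the function names and the loss as illustrative.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def distillation_loss(teacher_en, student_en, student_sv):
    """Pull both student embeddings toward the teacher's English embedding."""
    return mse(teacher_en, student_en) + mse(teacher_en, student_sv)

# Toy 3-dimensional embeddings; real models use hundreds of dimensions.
teacher_en = [1.0, 0.0, 0.0]
student_en = [1.0, 0.0, 0.0]   # student already matches the teacher in English
student_sv = [0.0, 0.0, 0.0]   # but not yet for the Swedish translation
loss = distillation_loss(teacher_en, student_en, student_sv)
```

Training drives this loss toward zero, at which point a Swedish sentence and its English translation land on the same point in the embedding space.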

When You Should Use BERT

Alright, practical question. You're building something. When should you reach for KB-BERT instead of calling Claude's API?

Use BERT when the task is classification. You have Swedish text. You need to sort it into categories. Is this email spam or not? Is this customer review positive, negative, or neutral? Is this document about healthcare, education, or infrastructure? These are classification problems, and BERT excels at them. You fine-tune KB-BERT on a few hundred or a few thousand labeled examples of your specific task, and you get a fast, cheap, accurate classifier.

Use BERT when you need to extract structured information from text. Named entity recognition, meaning finding every person, organization, date, and location mentioned in a document, is a classic BERT task. The NER-specific KB-BERT model can process a Swedish text and tag every entity with its type. Feed it "Pär Boman driver Årebladet från Kall i Jämtland" and it should identify Pär Boman as a person, Årebladet as an organization, Kall as a location, and Jämtland as a location.
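In code, that might look like the sketch below. It assumes the Hugging Face Transformers library is installed and that the NER model is published under the identifier "KB/bert-base-swedish-cased-ner". Neither detail comes from the text above, so check the model hub before relying on it. The helper that glues word pieces back together is a hypothetical convenience, needed because BERT splits rare words into fragments marked with "##".

```python
def merge_wordpieces(tokens):
    """Join WordPiece continuation tokens ("##...") onto the preceding word."""
    merged = []
    for token in tokens:
        if token["word"].startswith("##") and merged:
            merged[-1]["word"] += token["word"][2:]
        else:
            merged.append({"word": token["word"], "entity": token["entity"]})
    return merged

def tag_entities(text, model_name="KB/bert-base-swedish-cased-ner"):
    """Run the NER model on a text. Downloads the weights on first use."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import pipeline
    nlp = pipeline("ner", model=model_name, tokenizer=model_name)
    return merge_wordpieces(nlp(text))

# Hand-written pipeline-style output, used only to show the merging step.
example = [{"word": "Jämt", "entity": "LOC"},
           {"word": "##land", "entity": "LOC"},
           {"word": "Kall", "entity": "LOC"}]
merged = merge_wordpieces(example)
```

Calling tag_entities("Pär Boman driver Årebladet från Kall i Jämtland") would run the real model. The exact tags and how words get split depend on the model version.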

Use BERT when you need embeddings — numerical representations of text that capture meaning. This is the backbone of semantic search, recommendation systems, and duplicate detection. If you have ten thousand documents and you need to find which ones are about the same topic, you embed all of them with the sentence transformer and then compute similarities. This takes seconds, not hours.
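A minimal sketch of the embed-then-compare pattern follows. The similarity and ranking functions are plain Python. The embed function assumes the sentence-transformers library and the model identifier "KBLab/sentence-bert-swedish-cased", both of which are assumptions to verify rather than details given above. For ten thousand documents you would swap the list comprehension for a matrix multiplication, but the logic is the same.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

def embed(sentences, model_name="KBLab/sentence-bert-swedish-cased"):
    """Embed sentences with a sentence transformer. Downloads weights on first use."""
    # Imported lazily because sentence-transformers is a heavy dependency.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name).encode(sentences).tolist()

# Toy 2-dimensional vectors stand in for real embeddings here.
ranking = rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
```

With real embeddings, you'd call embed on your documents once, store the vectors, and then only embed each incoming query.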

Use BERT when latency matters. Search systems. Real-time content moderation. Ranking auto-complete suggestions. Anything where the user is waiting and you need an answer in milliseconds.

Use BERT when cost matters at scale. If you're processing millions of documents per day, the cost difference between a local BERT model and an LLM API is enormous.

Use BERT when privacy matters. BERT runs locally. Your data never leaves your server. No API calls to American companies. No third-party processor to sign agreements with. No GDPR questions about transferring data outside the EU. You still have to handle personal data properly, but for Swedish government agencies and healthcare organizations, keeping everything in-house can be the deciding factor.

When You Should NOT Use BERT

Do not use BERT when you need to generate text. BERT literally cannot do this. If you need to write, summarize, translate, or produce any kind of text output, you need a generative model.

Do not use BERT when the task requires reasoning. If you need to analyze a contract and explain its implications, or debug code, or answer open-ended questions, BERT doesn't have the capacity. These tasks require the kind of broad knowledge and reasoning ability that only comes with much larger models.

Do not use BERT when you have no labeled training data and can't create any. BERT needs fine-tuning for most tasks, and fine-tuning needs labeled examples. If you have no examples of what you want the model to do, a large language model with its zero-shot ability is the better choice — you can describe the task in a prompt and it'll attempt it.

So BERT is for the boring stuff? The plumbing?

Yes. And the boring stuff is where most of the value is. The glamorous part of AI is asking Claude to write a sonnet about Swedish midsummer. The valuable part is automatically routing forty thousand incoming documents per day to the right department at Skatteverket. BERT does the valuable part, faster and cheaper than any large language model.

A Practical Example

Let's make this concrete. Imagine you run a local newspaper. Every week, dozens of press releases, tips, and submissions arrive. You want to automatically sort them: sports, local politics, culture, business, events. And you want to extract every person and organization mentioned, so you can cross-reference against your contact database.

Step one: download KB-BERT from Hugging Face. It's free. The base model is about four hundred megabytes.

Step two: collect a few hundred examples of each category from your own archives. Label them. This is the tedious part, but a few hundred per category is usually enough.

Step three: fine-tune KB-BERT on your labeled data. With Hugging Face's Transformers library, this is about twenty lines of Python code. On a decent laptop, fine-tuning takes maybe thirty minutes. On a GPU, a few minutes.
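Here's roughly what those twenty lines could look like. This is a sketch, not a tuned recipe. It assumes the transformers, datasets, and torch packages are installed and that the base model lives under the identifier "KB/bert-base-swedish-cased". The epoch count, batch size, and output directory are placeholder choices, and the function names are hypothetical.

```python
CATEGORIES = ["sports", "local politics", "culture", "business", "events"]

def build_label_maps(categories):
    """Map category names to integer ids and back, as the model expects."""
    label2id = {name: i for i, name in enumerate(categories)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label

def fine_tune(texts, labels, model_name="KB/bert-base-swedish-cased",
              output_dir="kb-bert-news"):
    """Fine-tune a classifier on (text, category name) pairs."""
    # Imported lazily because these are heavy dependencies.
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              DataCollatorWithPadding, Trainer, TrainingArguments)

    label2id, id2label = build_label_maps(CATEGORIES)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(CATEGORIES),
        id2label=id2label, label2id=label2id)

    data = Dataset.from_dict(
        {"text": texts, "label": [label2id[name] for name in labels]})
    # BERT accepts at most 512 tokens, so longer texts are truncated.
    data = data.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=data,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    trainer.train()
    trainer.save_model(output_dir)
    return trainer

label2id, id2label = build_label_maps(CATEGORIES)
```

In practice you'd also hold back some labeled examples for evaluation, so you can check accuracy before trusting the classifier with real submissions.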

Step four: deploy. Your classifier now processes incoming text in milliseconds and sorts it into categories with high accuracy. The NER model extracts names and organizations. Total cost: your time to label the training data. The model itself is free, the inference is free, and it runs on hardware you already own.

Now compare this to the alternative. You could write a prompt for Claude: "Classify this Swedish text into one of these categories: sports, politics, culture, business, events." Claude would do a decent job. But each API call costs money, takes a second or two, and sends your text to Anthropic's servers. For a few documents a day, that's fine. For thousands, the BERT approach wins on every dimension except the initial setup time.

The Bigger Picture

BERT was published in twenty eighteen. That's ancient in AI years. The original model has been surpassed by newer encoder architectures — RoBERTa, DeBERTa, ELECTRA — that use the same basic idea but train more efficiently. KBLab's models incorporate some of these improvements.

But the fundamental insight of BERT hasn't changed: for understanding text rather than generating it, a focused encoder model is more efficient than a massive generative model. And for Swedish specifically, having a model trained on Swedish text from the Swedish National Library's own collections means it captures the nuances of Swedish that multilingual models miss.

Every time you use a search engine in Swedish and the results actually make sense, every time a government agency automatically routes your complaint to the right department, every time a news organization tags an article with the right categories — there's a good chance that a BERT-family model is involved. Not because it's flashy. Because it's fast, it's cheap, it's accurate, and it runs on hardware you control.

That's not the AI that makes headlines. It's the AI that makes everything else work.