Tesseract: The OCR Engine That Refused to Die

A Research Project in Bristol

In the early nineteen-eighties, a man named Ray Smith was working at the Hewlett-Packard research laboratory in Bristol, in the west of England. It was the kind of lab that hardware companies could afford to maintain in an era when they funded long-term research with no immediate commercial application. Smith was working on a problem that, at the time, was considered nearly impossible to solve well. The problem was optical character recognition, which is the task of converting an image of text into actual text that a computer can read.

The problem was hard for reasons that are easy to underestimate. The image of a letter A on a printed page is not actually the letter A to a computer. It is a pattern of dark and light pixels. The computer has no inherent knowledge that the pattern means A. It has to figure this out, despite the fact that the same letter A can appear in thousands of different fonts, in different sizes, slightly rotated, partially smudged, faded with age, photographed at strange angles, or scanned from a fax of a fax of a fax. The letter A is conceptually a single thing. The visual representations of the letter A are practically infinite.

[calm]

Smith and his team worked on this problem for many years. By the late nineteen-eighties, they had a working optical character recognition engine that they called Tesseract. The name came from the tesseract, the four-dimensional analogue of the cube, because the algorithm worked by thinking of characters as patterns in a high-dimensional space. The engine was, by the standards of the time, very good. It was used inside several Hewlett-Packard products through the nineteen-nineties.

And then it sat. Hewlett-Packard had moved on. The commercial era of dedicated optical character recognition products had peaked and was declining. Microsoft and Google and Adobe were absorbing the capability into their own products. Tesseract was a piece of legacy software inside a hardware company that no longer needed it. It was kept alive by Smith and a small team, used internally, occasionally updated, mostly forgotten.

The Open Source Donation

In two thousand five, something unusual happened. Hewlett-Packard decided to donate Tesseract to the open source community. They released the source code under a permissive Apache license. The Information Science Research Institute at the University of Nevada, Las Vegas, received the code as a research project. They worked on it for about a year. And then, in two thousand six, Google adopted the project.

Google had its own reasons for caring about optical character recognition. The company was building Google Books, an enormous project to digitize and index every book ever published. The project required, among many other things, an excellent optical character recognition engine to convert the scanned pages of books into searchable text. Existing commercial engines were expensive to license at scale, and they were not perfectly suited to the variety of fonts and conditions in the historical book corpus. Tesseract was open source. Google could improve it.

[serious]

Ray Smith joined Google. Tesseract became, for several years, a heavily funded internal project at one of the largest companies in the world. The engine was rewritten in modern code. The training data was expanded to cover dozens of languages. The recognition accuracy improved dramatically. Modern machine learning techniques were integrated. By the time Google released subsequent major versions of Tesseract back to the open source community, the engine had become the most capable open source optical character recognition system in the world. It is, today, the underlying recognition engine in DocumentCloud, in many internal newsroom tools, in academic research projects, in humanitarian document processing, and in dozens of commercial products that use it as a free component.

How Optical Character Recognition Works

To appreciate what Tesseract does, it helps to think about how a human reads. When you look at a page of text, you do not consciously notice the individual letters. You see whole words. Your visual system has been trained over decades to recognize the patterns of words at a glance. You do not assemble W and O and R and D into the word WORD. You see the word as a single thing.

A computer cannot do this. It has to start at the lowest level. The image is a grid of pixels. Some pixels are dark. Some pixels are light. The computer first has to figure out where the text is on the page, as distinct from the background or the illustrations or the page borders. This is called layout analysis. Tesseract is unusually good at this. It can find blocks of text inside complex page layouts, separate columns, distinguish headings from body text, and handle mixed content like text wrapped around images.

Once the text regions are identified, the engine has to find the individual lines of text within them. Then the words within the lines. Then the characters within the words. Each of these steps is a separate piece of analysis, and each can go wrong. A line that is slightly tilted can confuse the line detection. Words that are joined together by smeared ink can confuse the word detection. Characters that are partially missing or merged with their neighbors can confuse the character detection.
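The line-finding step can be illustrated with a toy sketch. This is not Tesseract's actual algorithm, which is far more elaborate and copes with tilt and noise. It assumes the page has already been reduced to a grid of ones for dark pixels and zeros for light pixels, and the function name find_text_lines is invented for the illustration. The idea is simply that a row with no ink at all marks a gap between lines.

```python
def find_text_lines(image):
    """Return (top, bottom) row ranges that contain at least one dark pixel.

    image is a list of rows; each row is a list of 0 (light) or 1 (dark).
    A toy illustration only, not the method Tesseract uses.
    """
    lines = []
    start = None  # row index where the current text line began, if any
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new text line begins here
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:                  # text ran to the bottom edge
        lines.append((start, len(image) - 1))
    return lines


# A made-up six-row "page" with two lines of text separated by a blank row.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
print(find_text_lines(page))  # [(1, 2), (4, 4)]
```

The sketch also shows why a tilted scan causes trouble: if the lines slope, the blank rows between them disappear, and two lines of text merge into one.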

For each candidate character, the engine has to decide which letter it is. Since version four, Tesseract's default recognizer is a neural network, trained on very large numbers of examples in different fonts and conditions, that reads a whole line of text at once instead of judging each character in isolation. For each position along the line, the network produces a probability distribution over all possible characters. The most likely character is selected. Often, several candidates are tracked, and the final choice is made based on which combination of letters produces a valid word.

This is where context becomes important. The engine knows about words. If the pixel pattern is ambiguous between c and o, the engine prefers the interpretation that produces a real word. If the surrounding letters spell something close to a known word, the engine uses that to disambiguate the uncertain character. The recognition is not just pattern matching. It is pattern matching plus linguistic constraints.
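The interplay between pixel evidence and vocabulary can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Tesseract's language model: the probabilities are invented, the function name best_word is hypothetical, and the dictionary bonus of two is an arbitrary choice. Each position in the word carries a set of candidate characters with probabilities, and a reading that appears in the dictionary gets its score multiplied by the bonus.

```python
from itertools import product


def best_word(candidates, dictionary, bonus=2.0):
    """Pick the most plausible reading of a word.

    candidates: one dict per character position, mapping character -> probability.
    dictionary: a set of known words.
    bonus: multiplier applied to readings found in the dictionary (an assumption).
    """
    best, best_score = None, 0.0
    # Try every combination of candidate characters, one per position.
    for combo in product(*(position.items() for position in candidates)):
        word = "".join(char for char, _ in combo)
        score = 1.0
        for _, probability in combo:
            score *= probability
        if word in dictionary:
            score *= bonus
        if score > best_score:
            best, best_score = word, score
    return best


# Invented example: the first character looks slightly more like "o" than "c".
candidates = [
    {"o": 0.55, "c": 0.45},
    {"o": 0.9},
    {"i": 0.9},
    {"n": 0.9},
]
print(best_word(candidates, set()))             # ooin
print(best_word(candidates, {"coin", "join"}))  # coin
```

On pixel evidence alone the reading is ooin. With the vocabulary in play, the slightly less likely c wins, because coin is a real word.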

The Language Problem

One of the things that makes Tesseract genuinely impressive is its multilingual capability. The engine supports over one hundred and twenty languages. Each language has its own model, trained on text in that language, with its own vocabulary for the disambiguation step. Tesseract can recognize Swedish text. It can recognize Arabic text. It can recognize Chinese text. It can recognize text in languages that use right-to-left scripts, scripts with combining characters, scripts with thousands of distinct characters.

This matters for journalism that crosses language boundaries. A Swedish reporter investigating an Australian mining company might receive documents in English. A document from a Russian-owned shell company might be in Russian. A Norwegian regulatory filing might be in Norwegian. Each language needs a separate recognition model, and Tesseract provides them all, free, with the same engine.
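In practice, choosing a language model is a matter of one command-line option. The sketch below builds a call to the tesseract command-line program, whose -l option accepts language codes joined with a plus sign, such as swe+eng for a Swedish document with English passages. The file names and the helper names build_tesseract_command and run_ocr are invented for the illustration, and the run_ocr function assumes that Tesseract and the relevant language data are installed on the machine.

```python
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Build the argument list for the tesseract command-line program.

    languages is a list of Tesseract language codes, for example ["swe", "eng"].
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages):
    """Run Tesseract on one image; requires tesseract to be installed.

    Tesseract writes the recognized text to output_base + ".txt".
    """
    subprocess.run(build_tesseract_command(image_path, output_base, languages),
                   check=True)
    return output_base + ".txt"


# Hypothetical file names, shown only to illustrate the shape of the command.
command = build_tesseract_command("filing_page1.png", "filing_page1", ["swe", "eng"])
print(" ".join(command))  # tesseract filing_page1.png filing_page1 -l swe+eng
```

A Russian document would use rus in place of swe, and a Norwegian one nor. The engine and the command stay the same.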

The training data for these models came from a combination of sources. Some was produced specifically for the project. Some came from publicly available text corpora. Some came from books in the Google Books project. The breadth of the training data is part of why Tesseract works as well as it does. Models trained on narrow corpora often fail on documents that are slightly outside their training distribution. Tesseract's models have been trained on enough variety that they handle most real-world documents reasonably well.

The Confidence Score

There is one feature of Tesseract worth mentioning, because it shapes how the tool is actually used in practice. For every word it recognizes, and through its programming interface for every character, Tesseract reports a confidence score. The score is a number between zero and one hundred, representing how certain the engine is about its interpretation.

[calm]

This confidence score is critical for working journalists. When you process a thousand pages of scanned documents through Tesseract, some of the recognition will be perfect, some will be flawed, and some will be wrong. The confidence score is a strong hint about which is which. Pages where the average confidence is high can probably be trusted. Pages where the average confidence is low need human review. Low-confidence words deserve particular attention when they sit where a name, a date, or a number would be, because those are often the most important parts of the document.

Modern workflows often use the confidence score to prioritize human attention. The reporter does not have time to read every character on every page. The reporter does have time to read the parts the machine is uncertain about. The combination of machine recognition plus targeted human review produces better results than either alone, and it produces them at a fraction of the cost of full human reading.
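A minimal sketch of that triage follows. It assumes the page was processed with Tesseract's tab-separated output format, whose columns include conf and text, and where rows that are not words carry a confidence of minus one. The sample rows are invented for the illustration, the function names are hypothetical, and the review threshold of sixty is an arbitrary choice that a newsroom would tune on its own documents.

```python
import csv
import io

# Column names of Tesseract's tab-separated (TSV) output.
COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
           "left", "top", "width", "height", "conf", "text"]


def word_confidences(tsv_text):
    """Return (word, confidence) pairs, skipping rows that are not words."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf_field = row.get("conf")
        if not text or conf_field is None:
            continue
        conf = float(conf_field)
        if conf >= 0:  # structural rows (page, block, line) report -1
            words.append((text, conf))
    return words


def mean_confidence(words):
    """Average confidence for a page, or 0.0 if no words were found."""
    return sum(conf for _, conf in words) / len(words) if words else 0.0


def needs_review(words, threshold=60.0):
    """Words whose confidence falls below the (assumed) review threshold."""
    return [(text, conf) for text, conf in words if conf < threshold]


def _row(*fields):
    return "\t".join(str(field) for field in fields)


# Invented sample output: one page-level row and three word rows.
sample_tsv = "\n".join([
    "\t".join(COLUMNS),
    _row(1, 1, 0, 0, 0, 0, 0, 0, 600, 800, -1, ""),
    _row(5, 1, 1, 1, 1, 1, 10, 10, 80, 20, 96, "Payment"),
    _row(5, 1, 1, 1, 1, 2, 100, 10, 60, 20, 41, "K1runa"),
    _row(5, 1, 1, 1, 1, 3, 170, 10, 50, 20, 88, "2019"),
])

words = word_confidences(sample_tsv)
print(mean_confidence(words))  # 75.0
print(needs_review(words))     # [('K1runa', 41.0)]
```

In this invented page, the one word flagged for review is a place name in which the engine has read a digit where a letter should be. That is exactly the kind of error that matters in an investigation.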

The Modern Competition

Tesseract is no longer alone in the open source optical character recognition space. There are newer engines, some of them more accurate on specific kinds of documents. There are commercial cloud services from Amazon, Google, and Microsoft that achieve higher accuracy by throwing enormous computing resources at the problem. There are specialized engines for handwritten text, for historical documents, for non-standard scripts.

But Tesseract has a specific advantage that the cloud services cannot match. It runs locally. You do not have to send your documents to a third party. For sensitive investigative work, where the documents themselves are the story, the ability to do the recognition without uploading anything to a cloud service is critical. Some leaked documents cannot be sent to Amazon. Some confidential filings cannot be sent to Google. Tesseract runs on your laptop, on your server, behind your firewall, with your data never leaving your control.

This is the privacy story of Tesseract, and it is one of the reasons the tool has remained the standard despite the rise of cloud alternatives. The cloud services are better at the easy cases. They are not available for the cases where it matters most. The local engine, free, open source, multilingual, running anywhere, is the only practical choice for serious investigative document processing.

What This Has To Do With Working Journalists

For a reporter dealing with documents, Tesseract is the foundation. You receive a PDF of a scanned letter. You convert its pages to images, because Tesseract reads images and not PDF files, and you run them through the engine. You get back searchable text. You can now grep that text for names or dates or phrases. The document has gone from a static image to a searchable artifact in a few seconds.

The same engine is the back end of DocumentCloud, which means even reporters who never directly invoke Tesseract are using it. The OCR pipeline of the modern newsroom is, almost everywhere, built on this thirty-five-year-old engine that started in a Hewlett-Packard research lab and was rescued from obsolescence by an open source donation.

For a Swedish reporter, the relevant detail is that Tesseract handles Swedish well. The language model is mature. The recognition accuracy on cleanly scanned, modern printed Swedish text is close to perfect. Older documents, handwritten documents, and faxed documents get progressively harder, but for normal printed material, the tool is reliable enough that you stop worrying about it.

The pattern of usage is simple. Documents come in. They get fed to Tesseract. The text comes out. The text becomes the searchable layer of your investigation database. The original images are preserved for verification, but the search happens against the text. Every name, every date, every dollar amount, every place reference becomes findable. The investigation accelerates dramatically.
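The search step can be sketched very simply. The example below assumes the recognized text has already been collected into a dictionary keyed by page identifier. The page names, the sample sentences, and the function name search_pages are all invented for the illustration, and a real investigation database would sit on a proper search index. The pattern shown finds dates written as year, month, and day.

```python
import re


def search_pages(pages, pattern):
    """Return (page_id, matched_text) for every match of pattern.

    pages maps a page identifier to that page's recognized text.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for page_id in sorted(pages):
        for match in regex.finditer(pages[page_id]):
            hits.append((page_id, match.group(0)))
    return hits


# Invented OCR output for two pages of a scanned letter.
pages = {
    "letter_p1": "Payment of 250 000 SEK was approved on 2019-03-14.",
    "letter_p2": "Signed by A. Lindqvist, Stockholm, 2019-04-02.",
}

dates = search_pages(pages, r"\d{4}-\d{2}-\d{2}")
print(dates)  # [('letter_p1', '2019-03-14'), ('letter_p2', '2019-04-02')]
print(search_pages(pages, "lindqvist"))  # [('letter_p2', 'Lindqvist')]
```

Because each hit carries its page identifier, the reporter can go straight back to the original image to verify what the machine read.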

The Longer Lesson

The Tesseract story is worth knowing as a specific instance of a more general pattern. A piece of valuable research is created by a commercial company. The commercial company loses interest. The research is donated to the open source community. The community keeps it alive through volunteer maintenance for a while. Eventually a different commercial company invests heavily in improving it, because the alternative is paying licensing fees to competitors. The improved version is released back to the open source community. The result is that the world has a free tool that is better than any of the commercial alternatives would have been on their own.

[serious]

This is the open source ecosystem working in its most efficient mode. Commercial investment plus open availability plus volunteer maintenance plus community ownership. No single entity has to bear the full cost. No single entity captures the full value. The work compounds across the ecosystem in ways that no single company could have organized.

Tesseract is a small example. The pattern shows up in many places. The open compiler infrastructure called LLVM started at the University of Illinois, was adopted by Apple, became the foundation for several commercial compiler products, and is now used by everyone. The Linux kernel started as a hobby project, was adopted by IBM and Red Hat and Google and Amazon, and is now the foundation of most server infrastructure on earth. The open source model works when the incentives align, and the incentives align more often than the cynical view would expect.

For a reporter using Tesseract today, the practical use is to know that the engine is there, that it works, that it does not need to be paid for, and that the documents you scan will become searchable. The deeper use is to understand that this is what good open infrastructure looks like, why it works, and why the journalism we have access to today depends on tools that exist because of a series of small donations and volunteer hours and corporate decisions to release rather than hoard. The thirty-five-year-old engine is still working. The documents are still being read. The journalism is still being done. The pattern continues.