On the afternoon of July second, two thousand nineteen, at thirteen forty two coordinated universal time, a software engineer at Cloudflare merged a pull request. It was a minor change to the web application firewall. A new rule to detect cross-site scripting attacks. The rule contained a regular expression, a pattern written in a notation that most programmers use every day without thinking much about. Three minutes later, the first alarm fired.
Every server in Cloudflare's global network, machines spread across more than one hundred and eighty cities on six continents, saw its processor usage spike to one hundred percent. Not gradually. Not in waves. All at once. The pattern in that regular expression had a flaw that caused the matching engine to enter a state called catastrophic backtracking, where it tried billions of possible paths through the input text, each one failing and starting over. Within minutes, Cloudflare was returning five oh two errors to every request. Traffic dropped by eighty two percent. For twenty seven minutes, a significant fraction of the internet went dark. Websites, APIs, mobile apps, anything proxied through Cloudflare simply stopped responding.
The company's chief technology officer, John Graham-Cumming, an Oxford mathematician who had once led a successful campaign to get the British government to apologize for its persecution of Alan Turing, emailed the CEO with a subject line that captured the moment perfectly. Where is my DNS, he wrote.
When the team finally killed the firewall rules globally at fourteen oh seven and the traffic came back, the postmortem was brutal. The cause was a single line of text, a pattern roughly one hundred and forty characters long, deployed simultaneously to every edge server on the planet without a canary rollout, without gradual testing, and without the safety net that was supposed to prevent exactly this kind of runaway processing. That safety net had been accidentally removed weeks earlier during a refactoring project. A refactoring project whose entire purpose was to make the firewall use less processor time.
Here is the question worth sitting with. How did a mathematical formalism invented in nineteen fifty one to describe the behavior of neurons in a theoretical brain become so powerful that a typo in one instance of it could break the internet? The answer is a story about mathematicians, hackers, linguists, and a series of decisions, each one reasonable in isolation, that created a tool both indispensable and dangerous. A tool that every programmer alive uses. A tool that no artificial intelligence has managed to replace.
Stephen Cole Kleene was born in Hartford, Connecticut, in nineteen oh nine, the son of a poet and an economics professor. His mother, Alice Lena Cole, wrote plays and poetry. His father, Gustav, taught at Trinity College. They considered their real home not the college town but a farm in Union, Maine, where the family spent every summer. Stephen grew up splitting his time between the intellectual life of a professor's household and the physical life of rural New England, and he never stopped doing both.
He was brilliant at mathematics. He graduated summa cum laude from Amherst College in nineteen thirty, then went to Princeton for his doctorate. Princeton in the early nineteen thirties was one of the most remarkable concentrations of mathematical talent in history. The Institute for Advanced Study had just opened. Albert Einstein was there. John von Neumann was there. And in the mathematics department, a quiet logician named Alonzo Church was pioneering a new field at the intersection of mathematics and formal logic. Kleene became Church's student, completing his doctorate in nineteen thirty four with a thesis on formal logic. The environment was, as Kleene later put it, certainly an exciting place to be undertaking research applying mathematical techniques to logic, with visitors such as Kurt Gödel.
Here is a detail that matters for our story. Another of Church's students at Princeton, overlapping with Kleene by just a few years, was a young Englishman named Alan Turing. Church and his students were grappling with the most fundamental question in mathematics. What does it mean to compute something? What problems can be solved by a mechanical process, and what problems cannot? Kleene, Turing, Church, and a handful of others were literally inventing the theoretical foundations of computer science, years before any programmable computer existed.
Kleene left Princeton in nineteen thirty five and joined the University of Wisconsin at Madison, where he would spend essentially the rest of his career. He became a full professor in nineteen forty eight. Colleagues described him as a private man but a skillful and enthusiastic teller of anecdotes with a powerful voice. Outside of mathematics, he was a devoted mountaineer. He organized biannual logic picnics that involved strenuous hikes up the cliffs of Devil's Lake State Park, and he kept leading those hikes well into his seventies. In nineteen forty nine, while he was climbing in New Hampshire's Presidential Range with his colleague Saunders Mac Lane, a thunderstorm struck their party on Mount Madison. Lightning injured one of their companions, and they had to organize a rescue on the mountainside. His knowledge of mushrooms was, by all accounts, legendary. A butterfly species was named after him.
And here is the thing about his name. Everyone gets it wrong. Most people say Kleene like the tissue brand. Some say Klee-nee. Kleene himself pronounced it Klay-nee, which is incorrect in all known languages. His own son wrote that his father simply invented the pronunciation. So if you have been saying it wrong, do not feel bad. There is no right answer. The man himself made one up.
In nineteen fifty one, Kleene was working on a problem that had nothing to do with computers. Warren McCulloch and Walter Pitts had published a famous paper modeling neurons as simple logical switches, on or off, connected in networks. Kleene wanted to describe mathematically what patterns of events these neural networks could recognize. What sequences of signals could flow through them. He wrote a paper for the RAND Corporation, which was later published in nineteen fifty six in a collection called Automata Studies, edited by Claude Shannon and John McCarthy. The paper was called Representation of Events in Nerve Nets and Finite Automata.
In that paper, Kleene introduced a notation for describing patterns. He used the vertical bar to mean or. He used concatenation to mean followed by. And he used a little star, an asterisk, to mean zero or more repetitions. The star, later called the Kleene star, was the crucial innovation. With just these three operations, union, concatenation, and repetition, Kleene could describe exactly the set of patterns that a finite state machine could recognize. He called them regular events. We call them regular expressions.
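Written in the ASCII form his notation eventually took, the three operations look like this. The sketch below uses Python's re module; the pattern and the test strings are my own illustration, not anything from Kleene's paper.

```python
import re

# Kleene's three operations in the notation we inherited from him:
# "|" is union, juxtaposition is concatenation, "*" is zero-or-more repetition.
pattern = re.compile(r"(ab|ba)*")  # any number of "ab" or "ba" blocks, including none

print(bool(pattern.fullmatch("")))      # True  - zero repetitions is allowed by the star
print(bool(pattern.fullmatch("abba")))  # True  - "ab" followed by "ba"
print(bool(pattern.fullmatch("abb")))   # False - cannot be split into those blocks
```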
Kleene never imagined anyone would type his notation into a computer. He was doing pure mathematics, describing the theoretical limits of a mathematical model of neurons. The paper is dense with formal proofs and set theory notation. It is about as far from a search bar as you can get. But the star he drew on the page in nineteen fifty one would eventually light up every text editor, every search engine, every log parser, every firewall rule on earth. Kleene died in nineteen ninety four at eighty five. He never saw what his stars became.
Ken Thompson was born in New Orleans in nineteen forty three, studied electrical engineering and computer science at Berkeley, and was hired by Bell Labs in nineteen sixty six. If Kleene was the theorist, Thompson was the builder. He is one of those rare figures in computing who keeps showing up at the origin of things. He co-created Unix. He designed the B programming language, the precursor to C. He built one of the first chess computers, a machine called Belle that won the world computer chess championship. He co-designed the Go programming language. He won the Turing Award in nineteen eighty three. He once described one of his most productive days as throwing away a thousand lines of code. His colleagues at Bell Labs described him as so down to earth that he never bothered to talk about who he was.
In nineteen sixty eight, Thompson was working on a text editor called QED for the Compatible Time-Sharing System at Bell Labs. He wanted users to be able to search for patterns in text, not just fixed strings but flexible patterns. Find every line that starts with a digit followed by one or more letters. Find every word that ends in tion or sion. He reached for Kleene's notation.
But here is where Thompson made the decision that this entire episode is really about. He had to choose how to implement the matching. There were two approaches, and the difference between them would haunt computing for the next sixty years.
The first approach is what we now call backtracking. You read the pattern and the text left to right, and whenever you reach a point where the pattern could go in multiple directions, you pick one. If it does not work out, you go back and try the other. It is like navigating a maze by always turning left. If you hit a dead end, you backtrack to the last intersection and try right instead. This is intuitive. This is how a human would do it by hand. And for most patterns and most inputs, it works fine.
The second approach is the one Thompson chose. Instead of picking one path and backtracking when it fails, you follow all possible paths simultaneously. You build a little machine, a finite automaton, that keeps track of every state it could possibly be in at each character of the input. When you reach the end of the text, you check whether any of the paths you were following ended up in an accepting state. If so, the pattern matched. This is called the NFA simulation, for nondeterministic finite automaton, and it is the approach Kleene's mathematics said was correct.
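Here is a minimal sketch of that idea for the pattern (a|b)*abb, with the automaton written out by hand as a table of transitions. It illustrates the set-of-states simulation only; it is not Thompson's code, which, as we are about to see, did something far more aggressive.

```python
# Follow every possible path at once: keep the set of states the automaton
# could be in, and advance the whole set one character at a time.
# The automaton below, for the pattern (a|b)*abb, is written out by hand;
# a real engine builds it from the pattern automatically.
TRANSITIONS = {
    (0, "a"): {0, 1},  # this a might still be inside (a|b)*, or it might start the final abb
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
ACCEPTING = {3}

def matches(text: str) -> bool:
    current = {0}
    for ch in text:
        current = {s for state in current
                     for s in TRANSITIONS.get((state, ch), ())}
        if not current:
            return False          # every possible path is already dead
    return bool(current & ACCEPTING)

print(matches("abb"))     # True
print(matches("aababb"))  # True
print(matches("ababab"))  # False
```

The work per character is bounded by the number of states, so the total time is proportional to the length of the text, exactly the guarantee described below.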
Thompson did something even more remarkable with the implementation. He compiled each regular expression directly into IBM seven oh nine four machine code on the fly. The pattern became a tiny program. Each state of the automaton was represented as a few machine instructions, and the list of possible current states was just a sequence of function calls. This was, by the way, one of the earliest examples of just-in-time compilation, the technique that would later make Java and JavaScript fast. Thompson did it in nineteen sixty eight, for a text editor.
He published his method in June nineteen sixty eight in Communications of the ACM, in a paper called Regular Expression Search Algorithm. It was four pages long. The algorithm was clean, it was fast, and it had a beautiful mathematical property. Its running time was proportional to the length of the pattern times the length of the text. Always. No matter what pattern you gave it, no matter what text, the time was predictable. There were no pathological cases. No inputs that could make it run forever. Thompson had taken Kleene's mathematical notation and turned it into practical code without losing the mathematical guarantee.
One of my most productive days was throwing away a thousand lines of code.
That quote is not about regex specifically, but it captures something essential about Thompson's approach to everything. Simplicity was the point. The elegance was the feature. And for a brief shining moment, regular expression implementations were both powerful and safe.
The year is nineteen seventy three. Thompson and his colleagues at Bell Labs are building Unix, and one of them, a researcher named Doug McIlroy, keeps asking people to look through files for specific patterns. The text editor ed, which Thompson had written as a successor to QED, could do this, but you had to open the file first. McIlroy wanted a standalone tool. Thompson later recalled that grep had been a private command of his for quite a while before he made it public.
The story, as it is usually told, goes like this. McIlroy said it would be great if they could search files for patterns. Thompson said let me think about it overnight. The next morning, he presented a working program. In reality, Thompson already had a private tool called s, for search, and what he did overnight was clean up the bugs and polish it for public use. But the overnight legend persists because it sounds right. It sounds like something Thompson would do.
The name was a command from the ed editor. g slash re slash p. Global regular expression print. Search the entire file for lines matching a regular expression and print them. When people asked Thompson what the funny name meant, he said it was obvious.
It stood for the editor command it simulated. g, re, p.
Grep became one of the foundational Unix commands. It spread with Unix to universities, then to workstations, then to servers, then to every Linux machine on the planet. The word grep became a verb. Programmers say they grepped the logs, grepped the codebase, grepped for the bug. It is one of the few pieces of mathematical notation that made it into everyday speech. When you grep, you are applying Kleene's stars to text, using Thompson's algorithm, through a tool that was built in a single night.
Now we need to talk about Larry Wall, because Larry Wall changed everything, and not everyone agrees it was for the better.
Wall was born in nineteen fifty four in Los Angeles. He studied chemistry and music at Seattle Pacific University, then switched to natural and artificial languages, a degree that combined linguistics, computer science, and philosophy. He went to graduate school at Berkeley, where he and his wife were studying linguistics with the intention of finding an unwritten language, creating a writing system for it, and using it to translate texts, including the Bible. Missionary linguistics. He never finished the graduate degree, but the linguistic training shaped everything he built afterward.
In nineteen eighty seven, Wall created Perl. It was designed as a practical extraction and report language, a tool for the messy, ugly, real world problems that academics did not write papers about. Parsing log files. Generating reports. Gluing systems together. And at the heart of Perl was something Wall borrowed from Unix and then dramatically expanded. Regular expressions.
Here is where the story gets philosophically interesting. Kleene's regular expressions could describe a specific and well defined set of patterns. Anything a finite state machine could recognize. Thompson's implementation matched that set perfectly. But Wall, the linguist, looked at this mathematical formalism and thought it was too limited. Real text processing needed more. So he added features that broke the mathematical contract.
Backreferences. Match whatever was captured by the third group earlier in the pattern. This seems innocuous, but it is computationally devastating. A finite state machine cannot do this. It has no memory of what it matched before. To support backreferences, you need backtracking. You need to try paths, fail, go back, and try again. Wall knew this. He chose it anyway.
He added lookahead. Match this, but only if it is followed by that, without consuming the characters. He added lookbehind. Match this, but only if it was preceded by that. He added non-greedy quantifiers, conditional patterns, embedded code execution. By the mid nineteen nineties, Perl's regular expressions were a full programming language disguised as a pattern matcher. They could do things Kleene never imagined and Thompson never intended. And they were no longer regular in any formal sense. The name had become a lie.
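Python's re module inherited most of those extensions, so a short sketch can show what they buy you. The patterns and example strings below are my own, chosen only to illustrate the features.

```python
import re

# Backreference: the closing quote must be the same character the opening group captured.
# No finite automaton can do this; it requires remembering what was matched earlier.
quoted = re.compile(r"(['\"]).*?\1")
print(quoted.search("""say "hello" to everyone""").group())  # "hello"

# Lookahead: insist a digit appears somewhere ahead, without consuming any characters.
has_digit = re.compile(r"(?=.*\d)\w+")
print(bool(has_digit.fullmatch("passw0rd")))  # True
print(bool(has_digit.fullmatch("password")))  # False

# Non-greedy quantifier: stop at the first closing bracket, not the last.
tag = re.compile(r"<.+?>")
print(tag.search("<b>bold</b>").group())      # <b>
```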
Wall was characteristically self-aware about what he had done. In two thousand two, he published Apocalypse Five, a long essay redesigning Perl's regex for the next version of the language. In it, he wrote one of the most honest self-assessments in the history of programming.
Regular expression culture is a mess, and I share some of the blame for making it that way. Since my mother always told me to clean up my own messes, I suppose I will have to do just that.
He listed twenty problems with the regex culture he had helped create. Too compact and cute. Too much reliance on too few metacharacters. Different things look too similar. Poor integration with real programming languages. And then, with the dry wit that made him beloved by a generation of programmers, he explained what had gone wrong.
We are frogs who are getting boiled in a pot full of single character morphemes, and we do not notice.
That image is perfect. Each individual feature Wall added was useful. Each individual special character he assigned was clever. But cumulatively, the notation became unreadable, the implementation became unpredictable, and the mathematical guarantee that Thompson had so carefully preserved was gone. The frog was boiled.
But here is the thing. The frog was also incredibly useful. Perl's extended regex spread everywhere. Philip Hazel wrote PCRE, Perl Compatible Regular Expressions, as a standalone library that any language could use. Python adopted it. JavaScript adopted it. Java, Ruby, PHP, dot net, they all adopted some version of the Perl extended syntax. The backtracking engine became the default. Thompson's safe, predictable, mathematically pure approach was forgotten. Not because it was worse. Because Spencer's code was easier to copy.
That brings us to Henry Spencer, a name that almost nobody outside of the Unix world recognizes but whose code is running on more computers than most celebrity programmers could dream of.
Spencer worked at the University of Toronto, where he ran the zoology department's computer system. He was a man of wide-ranging interests. He ran the first Usenet site in Canada, one of the earliest outside the United States, which became part of the Usenet backbone in its early days. Between nineteen eighty one and nineteen ninety one, he personally archived more than two million Usenet messages onto magnetic tapes, a collection that Google eventually acquired to build its archive of nineteen eighties Usenet. He was a founding member of the Canadian Space Society. He did mission analysis for a Canadian solar sail project. He was the software architect for MOST, a Canadian microsatellite that studied variable starlight. The asteroid one one seven three two nine Spencer is named after him. A character in Vernor Vinge's science fiction novel A Fire Upon the Deep is modeled on him.
In nineteen eighty six, Spencer did something that seemed modest at the time. He wrote a free, non-proprietary implementation of the Unix regex library and posted it to Usenet. The license was beautifully simple. Permission is granted to anyone to use this software for any purpose on any computer system, and to alter it and redistribute it. This was before the term open source existed. Spencer just gave it away.
The library used backtracking. Not Thompson's NFA simulation. Spencer's implementation was well written, well documented, portable, and free. It became the foundation for regex in an extraordinary number of systems. Early versions of Perl used it. Tcl used it. MySQL used it. PostgreSQL used it. When new programming languages needed regex support, they did not go back to Thompson's papers. They grabbed Spencer's library, or they reimplemented the same backtracking approach because that was what Spencer's widely available code did.
Spencer later wrote a second version that complied with the POSIX standard, and donated it to the four point four BSD release. He wrote a third, advanced version for Tcl. But it was the first library, the one from nineteen eighty six, that changed the trajectory of regex implementations across the industry. The quiet librarian's gift to the world was a well-crafted implementation of the wrong algorithm. And everyone copied it.
Let us talk about what actually happens when a backtracking regex engine goes wrong, because the Cloudflare outage is incomprehensible without understanding this.
Consider a simple pattern. The letter a, repeated one or more times, followed by the letter b. Written in regex notation, that is a plus followed by b. If you give this pattern the string aaab, the engine walks forward through the three a characters, reaches the b, matches it, and reports success. If you give it the string aaac, the engine walks through the a characters, reaches c, fails to match b, and reports failure. Simple. Fast.
Now consider a slightly different pattern. Open parenthesis, a plus, close parenthesis, plus, followed by b. This says one or more groups of one or more a characters, followed by b. Mathematically, this describes exactly the same set of strings. One or more a characters followed by b. A Thompson NFA engine knows this. It compiles both patterns to essentially the same state machine and processes them in exactly the same time.
But a backtracking engine does not know this. When it sees a plus inside a plus, it has to decide, at each a character, whether that a belongs to the current inner group or starts a new outer group. For the string aaab, it does not matter, because the match succeeds either way. But for the string aaaa, with no b at the end, the engine tries every possible way to partition those four a characters into groups. Is it four in one group? Three in one group and one in another? Two and two? Two and one and one? One and three? One and two and one? One and one and two? One and one and one and one? For four a characters, there are eight possible partitions. For ten a characters, there are five hundred and twelve. For twenty, over five hundred thousand. For thirty, over five hundred million. The number doubles with each additional character. The engine dutifully tries every single partition before concluding that none of them can be followed by a b that is not there.
This is catastrophic backtracking. The time grows exponentially with the length of the input. And the crucial thing is that the pattern looks harmless. It is not some exotic, obviously dangerous construction. It is a nested quantifier, and nested quantifiers appear naturally in patterns written by competent programmers who are trying to match real-world text. The bomb is invisible until it detonates.
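You can set the bomb off yourself on any backtracking engine. Here is a small sketch using Python's re module, which backtracks, timing the doomed pattern against strings of a characters with no final b. The exact numbers depend on the machine; the doubling does not.

```python
import re
import time

# The harmless-looking pattern from above: one or more groups of one or more a's,
# then a b. With no trailing b, a backtracking engine tries every way of
# splitting the a's into groups before giving up.
pattern = re.compile(r"(a+)+b")

for n in (18, 20, 22, 24):
    text = "a" * n
    start = time.perf_counter()
    pattern.fullmatch(text)                     # always fails, slowly
    print(n, round(time.perf_counter() - start, 3), "seconds")

# Each extra a roughly doubles the running time; by the high twenties a single
# call takes minutes. A linear-time engine answers all of these in microseconds.
```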
The security community gave this a name. ReDoS. Regular Expression Denial of Service. The term was first used in academic papers around two thousand three, but the underlying vulnerability had been known to theoreticians since the nineteen sixties. Thompson's original paper from nineteen sixty eight implicitly avoids it. His NFA simulation cannot exhibit exponential behavior. But the industry had collectively forgotten Thompson's approach, adopted Spencer's backtracking approach, and built it into every major programming language. The theoretical warning was right there in the textbooks, and nobody read it.
Now you understand what happened at Cloudflare on July second, two thousand nineteen. Let us go back to that afternoon and walk through it with the knowledge you now have.
An engineer on the firewall team wrote a new rule to detect cross-site scripting attacks. The rule contained a regular expression designed to match patterns that might indicate malicious JavaScript being injected into web pages. The regex was roughly one hundred and forty characters long. Buried in the middle of it was a section that said, in essence, match one or more of these characters, followed by zero or more of any character, followed by zero or more of any character equals any character. That nested repetition, the dot star followed by another dot star, was the bomb.
The rule passed the automated tests. It compiled without errors. It was deployed in simulate mode, which meant it would evaluate traffic but not block anything. But simulate mode still runs the regex against every request. And the regex engine Cloudflare was using at the time was a backtracking engine.
At thirteen forty two, the deployment went global. Every Cloudflare edge server simultaneously began running this regex against incoming HTTP requests. When the regex encountered certain inputs, inputs that almost matched but did not quite match the pattern, the engine fell into catastrophic backtracking. Every processor core dedicated to handling web traffic hit one hundred percent utilization. Not on one server. On all of them. Across more than one hundred and eighty cities.
At thirteen forty five, the first PagerDuty alert fired. At fourteen hundred, the team identified the web application firewall as the problem. At fourteen oh two, they began discussing whether to use the global termination mechanism, an emergency kill switch for the entire firewall. At fourteen oh seven, they pulled the switch. At fourteen oh nine, traffic began recovering. Twenty seven minutes of darkness.
The postmortem, written by Graham-Cumming, was published twelve days later and was remarkable in its honesty, its transparency, and its refusal to hide behind euphemisms or blame individual engineers.
We are ashamed it happened. It was even more upsetting because we had not had a global outage for six years.
Three things had failed simultaneously. The regex itself had the nested quantifier vulnerability. The deployment system pushed changes to every server at once, with no gradual rollout. And the CPU usage protection that would have killed the runaway regex after a threshold, the one safety net designed for exactly this scenario, had been accidentally removed weeks earlier during a code cleanup. The cleanup was part of a project to make the firewall use less CPU. The irony could not be sharper.
In the aftermath, Cloudflare committed to switching their regex engine to either RE two or the Rust regex crate. Both of these engines use Thompson's NFA approach. Both guarantee linear time matching. Neither can exhibit catastrophic backtracking. The company that had to take down its entire global firewall because of a regex went back to the algorithm that Ken Thompson published in nineteen sixty eight.
Which brings us to Russ Cox, the person who tried to tell everyone this would happen, who published the proof that the industry was building on a flawed foundation, and who was largely ignored for years.
Cox is a software engineer at Google. He learned about regular expressions and automata from the Dragon Book, from theory classes in college, and from reading Rob Pike's and Ken Thompson's code. He is one of the designers of the Go programming language and served for years as the technical lead of the Go project. But his most influential contribution might be a series of articles he published on his personal website starting in January two thousand seven.
The first article was called Regular Expression Matching Can Be Simple And Fast, and its subtitle explained why it mattered. But Is Slow In Java, Perl, PHP, Python, Ruby, and other languages. Cox had run a simple experiment. He took the pattern a question mark repeated n times, followed by a repeated n times, and matched it against a string of n a characters. This is a pattern that the Thompson NFA handles trivially. It finishes in microseconds for any reasonable value of n.
Then he ran the same test against backtracking engines. At n equals twenty three, PCRE, the Perl-compatible library that many other languages embed, stopped getting the correct answer entirely. It aborted after exceeding its maximum number of recursive steps. At n equals twenty nine, Perl itself needed over sixty seconds. Cox calculated that for n equals one hundred, Perl would need over ten to the fifteenth years. Longer than the age of the universe. The Thompson NFA implementation handled n equals one hundred in under two hundred microseconds.
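The experiment is easy to reproduce on any backtracking engine. Here is a sketch of it in Python, whose re module backtracks the same way; the absolute timings will differ from Cox's Perl numbers, but the explosive growth is the same.

```python
import re
import time

# Cox's experiment: the pattern a?a?...a?aa...a (n of each) matched against n a's.
# The match exists (every a? matches nothing), but a backtracking engine first
# tries every combination in which some a? consumes an a, one guess at a time.
def cox_pattern(n: int) -> re.Pattern:
    return re.compile("a?" * n + "a" * n)

for n in (18, 20, 22, 24):
    text = "a" * n
    start = time.perf_counter()
    assert cox_pattern(n).fullmatch(text)   # it does match - slowly
    print(n, round(time.perf_counter() - start, 3), "seconds")
```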
Given a choice between an implementation with a predictable, consistent, fast running time on all inputs, or one that usually runs quickly but can take years of CPU time on some inputs, the decision should be easy.
Cox's discovery, or rather rediscovery, was that Thompson's nineteen sixty eight algorithm was not just theoretically better. It was practically, measurably, astronomically better on pathological inputs, and no worse on normal inputs. The industry had adopted the slower, more dangerous approach not because it was better but because Henry Spencer's free backtracking library was widely available and easy to copy, and because Perl's extended features required backtracking.
Cox built RE two, an open source regex engine for Google that uses the Thompson NFA approach. Google used it for Code Search, the service that let developers search across millions of open source projects with regular expressions. RE two cannot do backreferences. It cannot do lookahead or lookbehind. It sacrifices the features that Wall added to Perl. But it can never be used as a denial of service weapon. It can never bring down a server. It can never be Cloudflare's worst day.
The Go programming language, which Cox helped design alongside Thompson himself, uses a regex engine built on the same principles. So does the Rust programming language. Slowly, quietly, the industry is rediscovering what Thompson knew in nineteen sixty eight. Power without guarantees is not actually power. It is a time bomb.
No history of regular expressions is complete without the most famous joke in programming. In nineteen ninety seven, on a Usenet thread about embedding Perl into the Emacs text editor, a programmer named Jamie Zawinski wrote a reply that would become immortal.
Some people, when confronted with a problem, think I know, I will use regular expressions. Now they have two problems.
Zawinski was not a random commenter. He was one of the original Netscape engineers who built the Unix version of Navigator one point oh, the browser that brought the web to the masses. He was a founder of Mozilla dot org, the person who registered the domain name on the day Netscape open sourced its browser code. He was the creator of XScreenSaver, the open source screensaver collection. When Jamie Zawinski criticized a technology, people listened.
But here is the beautiful irony. The quote is not even originally his. Jeffrey Friedl, author of the definitive book Mastering Regular Expressions, traced the construction back to a Usenet post in nineteen eighty eight, where someone used the same joke format but slamming awk instead of regex. The joke template, some people when confronted with a problem, think I know and then make it worse, had been floating around Usenet for nearly a decade. Zawinski just gave it its most famous target.
And what was Zawinski actually criticizing? Not regular expressions themselves. He was criticizing the Perl-influenced culture of using regex for everything. As he put it in the same thread, the heavy use of regexps in Perl is due to them being far and away the most obvious hammer in the box. When your best tool is a pattern matcher, everything looks like a pattern. The joke is really about Perl, not about Kleene.
There is a philosophical split at the center of this story that has never been resolved, and it is the same split that runs through all of engineering. Do you optimize for power or for safety? Do you give the user the ability to do more things, knowing some of those things are dangerous? Or do you limit the user to safe operations, knowing that some legitimate tasks become impossible?
Thompson chose safety. His NFA simulation matched exactly the patterns that Kleene's mathematics described, no more and no less, and it did so in guaranteed linear time. You could not crash a Thompson engine. You could not denial-of-service a Thompson engine. But you also could not write certain useful patterns. You could not say match whatever I matched before. You could not look ahead or behind.
Wall chose power. His extended regex could express patterns that no finite automaton could recognize. They could parse nested structures, match palindromes, execute arbitrary code. They were astonishingly useful for the messy, real-world text processing that programmers actually needed to do. But they could also crash. They could run forever. They could bring down Cloudflare.
Neither side is entirely wrong. The Thompson purists are right that most real-world regex do not need backreferences or lookahead. Studies of regex usage in the wild consistently show that the vast majority of patterns are genuinely regular. They could run on a Thompson engine with no loss of functionality and a massive gain in safety. But the Wall pragmatists are right that the minority of patterns that do need extended features are often the most valuable ones. Matching a quoted string where the closing quote must match the opening quote is a backreference. Validating that a password has at least one digit without consuming the string is a lookahead. These are real needs.
The compromise, reached gradually over the last decade, is to make the linear-time automaton the default and confine backtracking to the patterns that genuinely need it. RE two and the Rust regex crate draw the line the hard way, refusing backreferences and lookaround outright so they can promise linear time on every input. Other engines, like the hybrid Spencer later built for Tcl, run an automaton where they can and reach for backtracking only when a pattern explicitly uses the dangerous features. Fast and safe by default, dangerous only on request. It is not a perfect solution. It is a human solution. Which is, perhaps, the most regex thing of all.
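To make the split concrete, here is a toy routing sketch: inspect the pattern for the features only a backtracker can handle, and choose an engine accordingly. Both branches are stood in for by Python's re module purely to show the decision; in a real system the safe branch would be an automaton engine such as RE two, and the feature check, which is my own simplification, would be the pattern parser's job.

```python
import re

# Crude detector for features a finite automaton cannot express:
# backreferences (\1 through \9) and lookaround ((?=, (?!, (?<=, (?<!).
BACKTRACKING_ONLY = re.compile(r"\\[1-9]|\(\?<?[=!]")

def compile_pattern(pattern: str):
    if BACKTRACKING_ONLY.search(pattern):
        # Needs the extended features: hand it to the backtracker, accepting the risk.
        return ("backtracking", re.compile(pattern))
    # Genuinely regular: use the linear-time engine (stood in for here by re).
    return ("linear", re.compile(pattern))

print(compile_pattern(r"(a|b)*abb")[0])     # linear
print(compile_pattern(r"(['\"]).*?\1")[0])  # backtracking (backreference)
print(compile_pattern(r"(?=.*\d)\w+")[0])   # backtracking (lookahead)
```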
And now, the reason this episode exists. The reason this series exists. We live in an era where large language models can write poetry, pass bar exams, generate working code, and hold conversations that feel startlingly human. Every week brings another announcement about something AI can do that used to require a human. So when should you not use AI? When should you reach for the seventy five year old mathematical notation instead?
The answer, for pattern matching, is almost always. And the reasons go deeper than speed, though speed is where the story begins.
Consider a real task. You have a server that generates ten million log lines per day. You need to find every line where a user session timed out, extract the session identifier and the timestamp, and count how many timeouts happened per hour. With a regex, this takes one line of code. It runs in seconds. It costs nothing. It is deterministic, meaning it gives the same answer every time. It is auditable, meaning another programmer can read the pattern and verify that it does exactly what you intended. And it is provably correct for the pattern it describes, in the sense that Kleene's mathematics guarantee what it will and will not match.
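Here is what that job looks like in practice, as a sketch in Python. The log format, the field names, and the file path are all invented for illustration; the pattern is the one line that matters, and the loop around it is bookkeeping.

```python
import re
from collections import Counter

# Assumed log format, invented for this example:
#   2019-07-02T13:42:07Z session=ab12cd34 session timed out
TIMEOUT = re.compile(
    r"^(?P<hour>\d{4}-\d{2}-\d{2}T\d{2}):\d{2}:\d{2}Z"  # timestamp, captured to the hour
    r".*session=(?P<session>\w+)"                        # the session identifier
    r".*session timed out"
)

def timeouts_per_hour(path: str) -> Counter:
    counts = Counter()
    with open(path) as log:
        for line in log:
            m = TIMEOUT.search(line)
            if m:
                counts[m.group("hour")] += 1
    return counts

# Hypothetical usage:
# for hour, n in sorted(timeouts_per_hour("server.log").items()):
#     print(hour, n)
```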
Now try it with an LLM. You feed the log lines to the model. The model processes them, one by one, at perhaps a hundred tokens per second. Ten million lines might take days. It costs hundreds or thousands of dollars in API fees. And when it is done, you cannot prove the results are correct. The model might hallucinate matches. It might miss edge cases. It might interpret your natural language prompt differently than you intended. If you run it again, you might get different results.
This is not a close competition. It is not even the same category. A regex is a mathematical function. It maps patterns to matches with perfect precision in bounded time. An LLM is a statistical approximation. It maps prompts to probable outputs with impressive but imperfect accuracy in unbounded time and cost.
But the comparison goes deeper than performance. A regex is explainable. You can look at the pattern and know exactly what it does. Every character has a defined meaning. There is no black box. An LLM is, by design, unexplainable. No one can tell you exactly why it matched or did not match a particular input. For any task where correctness matters, where you need to trust the answer, where an audit trail is required, where regulatory compliance demands reproducibility, regex is not just better. It is the only responsible choice.
And regex is composable in ways that LLMs are not. You can test a regex against a handful of examples and be confident it works. You can combine regex with other Unix tools, piping grep into sed into awk into sort into uniq, building complex analysis pipelines from simple, verified components. Each piece does one thing. Each piece is trustworthy. The pipeline is the product. This is the Unix philosophy that Thompson and his colleagues articulated in nineteen seventy three, and it remains one of the most powerful ideas in computing.
So here is the decision framework this series proposes. When your task is pattern matching, use a pattern matcher. When your task is text classification, sentiment analysis, translation, summarization, creative writing, or anything else that requires understanding meaning rather than matching structure, use an LLM. The boundary is clear. Structure on one side, semantics on the other. Regex is the fastest, cheapest, most reliable tool ever built for the structure side.
The irony is that LLMs themselves use regular expressions internally, in their tokenizers, in their output parsers, in their data pipelines. The AI revolution is built on top of a mathematical formalism from nineteen fifty one. Kleene's stars are underneath everything.
Stephen Kleene drew his stars on paper in a RAND Corporation office, trying to describe what patterns a network of theoretical neurons could recognize. He climbed mountains, identified mushrooms, and pronounced his own name in a way that no language could explain. He never imagined anyone would type his notation into a machine.
Ken Thompson took those stars and compiled them into machine code, choosing the mathematically correct approach because he was the kind of person who threw away a thousand lines of code and called it a productive day. He built grep in a night because the tool already existed in his head.
Larry Wall, the linguist turned programmer, looked at the stars and saw that they were not enough for the messy, beautiful, irregular reality of human text. He added power. He added danger. He added everything he could think of, because in Perl, the easy things should be easy and the hard things should be possible, even if the possible things can occasionally blow up in your face.
Henry Spencer gave away a library that became the foundation of every regex engine from Perl to Python, and then went back to building satellites and archiving Usenet.
Russ Cox read Thompson's nineteen sixty eight paper and realized the industry had collectively forgotten the right answer. He wrote an article, built a library, and helped design a programming language, all to restore a guarantee that should never have been lost.
And John Graham-Cumming, an Oxford mathematician who got the British government to apologize for persecuting Alan Turing, had to watch his company's entire global network go dark for twenty seven minutes because of a nested quantifier in a firewall rule.
Every time you open a search bar, you are using Kleene's stars. Every time you validate an email address, parse a log file, search a codebase, filter a data set, write a firewall rule, or clean a column of text, you are using Kleene's stars. They are seventy five years old this year. They are faster than any neural network. They are more reliable than any language model. They cost nothing to run. They are the oldest eyes in computing, and they are still the fastest.
The tool that works is the tool you should use. Sometimes that tool was invented before your parents were born. Sometimes the best technology is the one that never needed to be replaced. Sometimes the math just got it right the first time.