Deps
Beautiful Soup: The Poem in Your Parser
S2 E13 · 23m · Mar 17, 2026
Leonard Richardson built a library that turns broken HTML into poetry—two hundred million downloads a month, maintained alone for twenty years by a programmer who started coding at eight years old.


The Mess Beneath the Surface

This is episode thirteen of What Did I Just Install.

Every web page you have ever visited is a lie. Not the content, necessarily, though that too. The lie is structural. The page looks clean and ordered in your browser, neatly rendered columns and aligned headings and crisp paragraph breaks. But underneath, the actual markup that produced that appearance is, more often than not, a catastrophe. Unclosed tags. Attributes missing their quotation marks. Tables nested inside tables nested inside tables, three levels deep, for no reason other than that someone in two thousand and one did not know about CSS. Angle brackets pointing in directions that make no grammatical sense. The industry has a name for this. They call it tag soup.

And for twenty years, there has been one Python library that has made its entire reputation on the promise that it can take that soup, that absolute wreckage of angle brackets and orphaned divs, and turn it into something a programmer can actually work with. It is called Beautiful Soup, which is either the most ironic name in software history or the most honest one, depending on how you look at it. It gets downloaded over two hundred million times a month, and it was written and maintained, from the first line to the last, by a single person. A science fiction novelist from California who has been programming since he was eight years old and who once spent four months building technology for a presidential campaign that went nowhere.

His name is Leonard Richardson. And the library he built in two thousand and four is, in many ways, the anti-story of this series. No dramatic mass-unpublishing. No corporate acquisition. No licensing war. No fork. Just one person, one library, and two decades of quietly making the messy web navigable for anyone who needed it.

What Beautiful Soup Actually Does

The problem Beautiful Soup solves is deceptively simple. You have an HTML document. You want to find a specific piece of data inside it. Maybe the price of a product. Maybe the text of an article. Maybe every link on the page. In a perfect world, the HTML would be valid, well-structured, and parseable by any standard tool. In the actual world, the HTML was probably generated by a content management system that was last updated during the Bush administration, has been patched by six different contractors who never spoke to each other, and contains at least one tag that exists in no specification ever written.

Beautiful Soup handles all of it. You feed it a string of HTML, no matter how broken, and it gives you back a parse tree, a navigable structure where you can search for tags by name, by attribute, by CSS class, by the text they contain. You can walk up the tree to find parents, down to find children, sideways to find siblings. The API is simple enough that someone who has never parsed HTML before can get useful results in ten minutes. And it works on XML too, which matters more than you might think, because a surprising amount of the world's data still moves around in XML documents that look like they were written by a committee, which in most cases they were.
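The API described above can be sketched in a few lines. The fragment, tag names, and classes here are invented for illustration; the calls are the standard Beautiful Soup ones.

```python
from bs4 import BeautifulSoup

# A small invented fragment to search and navigate.
html = """
<div class="listing">
  <h2>Socks</h2>
  <p class="price">$4.99</p>
  <p>Free shipping</p>
  <a href="/socks">Details</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Search by tag name, by CSS class, by attribute.
soup.find("h2").get_text()                 # 'Socks'
soup.find("p", class_="price").get_text()  # '$4.99'
soup.find("a")["href"]                     # '/socks'

# Walk the tree: up to a parent, sideways to a sibling.
price = soup.find("p", class_="price")
price.parent["class"]                      # ['listing']
price.find_next_sibling("p").get_text()    # 'Free shipping'

# find_all collects every match at once.
[a["href"] for a in soup.find_all("a")]    # ['/socks']
```

That is roughly the whole learning curve: find things, then walk from what you found.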

The key architectural insight, and this is the thing that has kept Beautiful Soup relevant for two decades while competitors have come and gone, is that Beautiful Soup is not itself a parser. It is a layer that sits on top of a parser. When Richardson created version four, he made the library parser-agnostic. You can plug in Python's built-in html dot parser, which ships with every Python installation and does a reasonable job with mildly broken markup. You can plug in lxml, which is written in C and is blisteringly fast but less forgiving of truly mangled HTML. Or you can plug in html5lib, which implements the actual HTML5 parsing algorithm, the same one browsers use, and will handle anything you throw at it because the HTML5 spec was essentially written to codify how browsers already dealt with garbage markup. Same Beautiful Soup API, different engine underneath. This matters because it means Beautiful Soup never has to solve the hardest problem in web parsing. It delegates that to specialists and focuses on what it does best, which is giving programmers a humane way to talk to the resulting tree.
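Swapping the engine is literally one argument. A minimal sketch, assuming only that "html.parser" is available (it ships with Python); lxml and html5lib are separate installs, so this loop simply skips any backend that is missing:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A deliberately broken fragment: two unclosed paragraph tags.
broken = "<p>one<p>two"

for backend in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken, backend)
    except FeatureNotFound:
        continue  # that parser is not installed; skip it
    # Identical Beautiful Soup API regardless of backend; the exact
    # repaired tree may differ from parser to parser.
    print(backend, "->", [p.get_text() for p in soup.find_all("p")])
```

The calling code never changes; only the tree builder underneath does, which is exactly the boundary that has kept the library maintainable.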

The Science Fiction Novelist

Leonard Richardson grew up in California and started programming at age eight. He studied computer science at UCLA, graduating in two thousand. His first significant job was at CollabNet, a company that built tools for connecting issue trackers, version control systems, and mailing lists, the plumbing of collaborative software development. He spent five years there, from two thousand to two thousand five, which means he was deep in the infrastructure of how developers worked together right at the moment when open source was transitioning from a fringe movement to the default way software got built.

But here is the thing about Richardson that sets him apart from nearly every other maintainer in this series. He does not primarily identify as a programmer. He identifies as a science fiction writer.

He has published two novels. Constellation Games, which came out in two thousand and twelve, is a first-contact story about video games, love, and programming that reviewers compared to the best of Douglas Adams. Situation Normal, from two thousand and twenty, is a military space opera in the tradition of Catch Twenty-Two, satirical and deeply kind at the same time. He has published short fiction in Clarkesworld, one of the most respected science fiction magazines in the field. His personal website, crummy dot com, is where you find Beautiful Soup's documentation, but it is also where you find his fiction, his thoughts on storytelling, and a sensibility that is more literary than technical.

This matters because it explains a lot about how Beautiful Soup was built and how it has survived. Richardson approaches software the way a novelist approaches a long-running series. He thinks about scope. He thinks about what to leave out. He thinks about sustainability over decades, not sprints. At his two thousand and twenty-four PyCon talk, he described two separate periods of burnout in his twenty-year maintenance career, and then revealed something startling. He had confabulated a narrative about his own recovery. He distinctly remembered that certain architectural changes had pulled him out of burnout, but when he went back and checked the timeline, those changes had either happened before or after the burnout periods. The real cause of both burnout episodes was his day job, not the library. Beautiful Soup, carefully scoped to what one person could maintain, had never been the problem.

The Mock Turtle's Song

The name is perfect, and the reason it is perfect requires a brief detour into Victorian nonsense literature, the kind of detour that only makes sense when you remember that the person who chose this name is a novelist.

In Lewis Carroll's Alice's Adventures in Wonderland, published in eighteen sixty-five, Alice meets a character called the Mock Turtle. The Mock Turtle is a melancholy creature who exists as a pun. Mock turtle soup was a real Victorian dish, a cheap imitation of actual turtle soup made from calf's head instead of turtle meat. Carroll, being Carroll, imagined the animal that such a soup would come from, and drew it as a creature with the body of a turtle and the head and hooves of a calf. The Mock Turtle sings a song. The song begins with the words "Beautiful Soup, so rich and green, Waiting in a hot tureen." It is a parody of a popular song of the era, and it is, like most things in Alice, simultaneously absurd and oddly moving.

Richardson named his library after this song, and the joke works on two levels. The surface level is the pun on "tag soup," the industry term for broken, malformed HTML. The web is a tureen full of beautiful soup, messy and nourishing and not at all what the specifications intended. But there is a deeper level too. The Mock Turtle is a creature that should not exist, an impossible hybrid brought into being by a pun. And Beautiful Soup the library does something that should not quite work either. It takes markup that violates its own grammar, HTML that is not really HTML anymore, and makes it usable anyway. It finds the meaning in the mess.

The project's homepage carries another Carroll quote as its tagline. "We called him Tortoise because he taught us." This is from the same scene, another pun, tortoise and taught-us. Richardson's website, crummy dot com, is itself a quiet joke. The domain name is self-deprecating, but the software hosted there is anything but.

From One File to Four Parsers

Beautiful Soup first appeared in two thousand and four. In its earliest incarnation, it was a single Python file. You could copy it into your project directory and start using it immediately, no installation required, no dependencies, no packaging ceremony. This was before pip existed, before PyPI was widely used, before the infrastructure we now take for granted had been built. A single-file library was not a limitation. It was a feature. You could email it to a colleague.

The first versions relied on Python's built-in SGMLParser, which was part of the standard library in Python two. It worked well enough. Beautiful Soup quickly found an audience among people who needed to extract data from web pages and did not want to write their own parser. By two thousand and six, version three arrived, the first version to achieve wide adoption. It was still fundamentally the same idea, a friendly API wrapped around a standard library parser, but Richardson had refined the interface and the library had begun appearing in tutorials and blog posts and Stack Overflow answers.

Then Python three happened. And Python three dropped SGMLParser entirely. This was not a minor inconvenience. It was an existential crisis for Beautiful Soup. The parser that the entire library was built on top of simply ceased to exist in the new version of the language. Richardson had a choice. He could try to bundle a copy of the old parser and keep the old architecture alive. Or he could rethink the entire design.

He chose to rethink. Beautiful Soup four, which arrived in two thousand and twelve, was a complete rewrite. The single most important change was making the library parser-agnostic. Instead of depending on one specific parser, Beautiful Soup four defined an internal tree-builder interface that any parser could implement. Out of the box, it shipped with tree builders for three different parsers. Python's html dot parser, the replacement for the old SGMLParser, was the default. lxml, the C-based parser that was already popular in the Python ecosystem for its speed, was supported as an optional backend. And html5lib, which parsed HTML the way browsers actually do, became the option for maximum compatibility with the real-world web.

A constant, low-level stream of stressors that are out of your control. That is how the Nagoskis define burnout. But I control my library's architecture, and that is how I minimize the ongoing stress.

This architectural decision expanded the codebase by about thirty percent, but it gave Richardson something invaluable. Clear boundaries. He was responsible for the Beautiful Soup API and the tree builder interface. The parser maintainers were responsible for the parsing. When lxml had a bug, that was not his problem. When html5lib fell behind on updates, that was not his problem either. He could focus on what one person could realistically maintain, and he drew that boundary with the precision of someone who was planning to be here for a long time.

The Right to Read the Web

Beautiful Soup is a scraping library, and scraping is one of those activities that sits in a permanent ethical and legal gray zone. The word itself carries a faintly disreputable connotation, like picking a lock or reading someone else's mail. But the reality is that web scraping is how an enormous amount of the world's information gets collected, organized, and made useful. Google is, at its core, the world's most successful web scraper. Every price comparison website. Every job aggregator. Every academic research project that studies online discourse. Every journalist who has ever investigated a company by collecting its public filings. All of them are scraping.

The legal landscape is genuinely complicated. In the United States, the most important case was hiQ Labs versus LinkedIn, which wound through the courts for years. hiQ was a small analytics company that scraped publicly available LinkedIn profiles to predict which employees were likely to leave their jobs. LinkedIn sent a cease-and-desist, hiQ sued, and the case bounced between the Ninth Circuit and the Supreme Court before finally settling in two thousand and twenty-two. LinkedIn won, but the legal reasoning left the broader question unsettled. The court focused on the fact that hiQ had violated LinkedIn's terms of service and created fake profiles, not on whether scraping public data was inherently illegal. The legality still depends on what you scrape, how you scrape it, and what you agreed to when you visited the site.

Then there is robots dot txt, the gentleman's agreement of the web. It is a plain text file that sits at the root of a website and tells automated crawlers which parts of the site they are welcome to access and which parts they should leave alone. It has no legal force. It is purely advisory. And yet, for decades, it functioned as a surprisingly effective social contract. Responsible scrapers honored it. Search engines honored it. The convention was created in nineteen ninety-four by Martijn Koster, a Dutch web pioneer, and it worked because the early web was small enough that reputation mattered. If you ignored robots dot txt, people would notice, and they would talk.
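The format itself is almost trivially simple, and Python's standard library can interpret it. A sketch with invented rules and URLs; a real crawler would fetch the site's actual robots dot txt and, being purely advisory, choose to honor it:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# Rules are matched in order: /private/ is off limits, the rest is open.
rp.can_fetch("MyBot", "https://example.com/articles/1")  # True
rp.can_fetch("MyBot", "https://example.com/private/x")   # False
```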

The arrival of large language model training in the twenty-twenties shattered whatever remained of that social contract. Companies with billions of dollars in funding began scraping the entire web to build training datasets, often ignoring robots dot txt entirely, because the competitive pressure to accumulate data was stronger than any reputational cost. The conversation around scraping ethics has never been more contentious. And in the middle of it all, there sits a twenty-year-old Python library created by a science fiction novelist, quietly enabling all of it, the ethical and the questionable alike, without judgment. Beautiful Soup does not care what you scrape. It just makes the soup navigable.

What Holds It Up

Beautiful Soup's dependency tree is remarkably shallow. The library itself has only one hard dependency. Soupsieve, a CSS selector library that lets you use CSS-style queries to search the parse tree. Soupsieve was written by Isaac Muse and has been part of Beautiful Soup since version four point seven. Beyond that, everything is optional. lxml is a separate install if you want speed. html5lib is a separate install if you want maximum compatibility. The character encoding detection libraries, chardet and cchardet and charset-normalizer, are all optional extras for when you need to handle documents that do not declare their encoding properly.
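Soupsieve is what makes the select methods work. A small sketch, with an invented document, showing CSS-style queries alongside the classic find calls:

```python
from bs4 import BeautifulSoup

# An invented fragment to query with CSS selectors.
html = """
<ul id="links">
  <li><a href="/a" class="ext">A</a></li>
  <li><a href="/b">B</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selectors, resolved by Soupsieve under the hood.
soup.select("a.ext")                          # anchors with class "ext"
soup.select_one("li a")["href"]               # '/a'
[a["href"] for a in soup.select("#links a")]  # ['/a', '/b']
```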

But what depends on Beautiful Soup is a different story. Over two hundred million downloads a month means it sits somewhere in the dependency tree of an extraordinary number of Python projects. Web frameworks use it for testing. Data science pipelines use it for data collection. Natural language processing projects use it to clean HTML out of text corpora. The Scrapy framework, which is the industrial-strength web scraping tool for Python, often works alongside Beautiful Soup for the parsing step. And then there are the uncountable scripts written by individuals, researchers and journalists and hobbyists, people who needed to get data off a web page once and reached for the library that every tutorial recommended.

I don't do this for everything, but I am going to blame this one on capitalism. Almost all deployed APIs are company-controlled. Interoperability benefits users. Vendors prefer lock-in.

That quote is from Richardson talking about why hypermedia APIs never achieved widespread adoption, but it applies equally well to why scraping exists in the first place. When companies control the data and provide no API, or provide an API that is deliberately limited, or provide an API and then revoke it, as Twitter did and as Reddit did, scraping becomes the only way to access information that was, at least ostensibly, public. Beautiful Soup exists because the web makes promises about openness that its platform owners do not always keep.

The Maturity Model and the Library

Richardson's influence on the web extends beyond Beautiful Soup. In two thousand and eight, while working at Canonical on Ubuntu's development tools, he gave a talk that introduced what became known as the Richardson Maturity Model. The idea was simple. He had analyzed about a hundred web APIs and noticed that they clustered into distinct levels of sophistication. Level zero was a single endpoint that accepted everything, basically using HTTP as a tunnel. Level one introduced resources, distinct URLs for distinct things. Level two added proper use of HTTP verbs, GET for reading, POST for creating, DELETE for removing. Level three, the highest level, added hypermedia controls, links within responses that told the client what it could do next.

Martin Fowler wrote it up on his blog and it became one of the most widely referenced frameworks in API design. To this day, when developers argue about whether an API is "truly RESTful," they are usually arguing about which level of the Richardson Maturity Model it reaches. Most production APIs land at level two. Almost none reach level three. Richardson himself considers this a failure, not of the model, but of the market. Level three benefits users through interoperability. Companies prefer lock-in.
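The levels are easier to see side by side. A toy sketch with invented endpoints and payloads, not any real API: level zero tunnels everything through one endpoint, level two uses distinct resources and proper verbs, and level three makes the response itself advertise the client's next legal moves.

```python
# Level 0: a single endpoint, every operation tunneled through POST.
level0 = ("POST /api", {"action": "getOrder", "id": 17})

# Level 2: a distinct resource per thing, read with GET.
level2 = ("GET /orders/17", {"id": 17, "status": "open"})

# Level 3: hypermedia controls — links that say what you can do next.
level3 = ("GET /orders/17", {
    "id": 17,
    "status": "open",
    "links": {
        "cancel": {"href": "/orders/17/cancel", "method": "POST"},
        "items":  {"href": "/orders/17/items",  "method": "GET"},
    },
})
```

Level one sits between zero and two: distinct URLs, but still one verb for everything. Most production APIs, as the episode notes, stop at level two.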

What is remarkable is that the same person who created this elegant framework for thinking about well-structured APIs also created the definitive tool for dealing with the web's worst-structured documents. Richardson lives at both ends of the spectrum simultaneously. He wrote the book on how APIs should work, literally, RESTful Web APIs was published by O'Reilly in two thousand and thirteen, and he also wrote the tool that handles what happens when nothing works the way it should. There is a kind of completeness to that. He understands the ideal and he builds for the real.

Twenty Years and Counting

Beautiful Soup is now in its twenty-first year. The current release, version four point fourteen point three, came out in November two thousand and twenty-five. Richardson is still the sole maintainer. He still responds to bug reports on Launchpad, the same bug tracker the project has used since the Canonical days. He has never taken venture capital. He has never started a company around the library. He has never changed the license, which has been MIT from the beginning. He has never even moved the project to GitHub, which at this point is almost a statement of principle.

A library that one person can maintain. That is what I have always wanted Beautiful Soup to be.

The funding situation is, as usual in this series, almost comically modest. Richardson has a full-time job at Bookshop dot org, where he builds server-side architecture for commercial ebook reading systems. Before that, he spent eight years at the New York Public Library, leading the Library Simplified project, which made library ebook borrowing competitive with commercial bookstores, and directing the Open eBooks initiative that delivered free books to children in kindergarten through twelfth grade. These are not the kinds of jobs that make someone wealthy. They are the kinds of jobs that a science fiction writer who cares about public access to knowledge would choose.

Beautiful Soup has never had a sustainability crisis because Richardson designed it to never need one. He set boundaries early and enforced them. He prefers bug reports over pull requests because reviewing other people's code is more draining than writing his own fixes. He limits support responses to actual bugs, not usage questions. He maintains comprehensive unit tests so he can step away for months and come back with confidence that nothing has broken. At PyCon, he was honest about the fact that his enthusiasm for volunteer work has naturally declined as he has gotten older, married, and developed other interests. But that is different from burnout. Burnout is when the stressors overwhelm you. Richardson's strategy has been to keep the stressors small enough that they never can.

Where It Connects

On this machine, Beautiful Soup appears in two project requirement files. Ugglescraper, which does exactly what the name suggests, and Cleanup, which processes collected data into usable formats. Both specify beautifulsoup4 version four point eleven or greater. The pattern is typical. Something needs to read a web page. Something needs to extract structured data from unstructured HTML. Beautiful Soup gets the call.

It is also, in a quiet way, a dependency of almost every web scraping pipeline in existence. Any tool that extracts text from URLs, whether for data analysis, content aggregation, or search indexing, is solving the same fundamental problem that Beautiful Soup solves, turning the messy web into clean text. Libraries like trafilatura, newspaper3k, and dozens of others either use Beautiful Soup directly or reimplemented its approach. They are all born from the same reality: the web is not as clean as it looks, and someone has to deal with that.

There is something fitting about ending a season of dependency stories with a library named after a Lewis Carroll poem. This whole series has been about the hidden layers beneath the software we use, the people and the decisions and the accidents that produced the tools we take for granted. Beautiful Soup is a library about hidden layers too. You look at a web page and you see clean text and aligned images. Beautiful Soup looks at the same page and sees the soup underneath, the unclosed tags and the malformed attributes and the markup that technically should not work but does because browsers are forgiving creatures. And it makes that soup beautiful. Or at least, navigable. Which, when you think about it, might be the same thing.

Open a terminal and type pip install beautifulsoup4 requests. Then open a Python shell. Import requests and BeautifulSoup. Fetch any web page with requests dot get. Pass the response text to BeautifulSoup. Then type soup dot title dot string. You will see the page title, pulled clean out of whatever mess of markup that page is actually made of. That is web scraping in its simplest form. The entire Beautiful Soup mental model, taking the soup and making it navigable, in a few lines of code.
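Written down, the walkthrough looks like this. The live version needs a network connection, so it is shown in comments; this sketch parses an inline document instead, which demonstrates the same thing offline:

```python
from bs4 import BeautifulSoup

# The live version described above would be:
#   import requests
#   soup = BeautifulSoup(requests.get("https://example.com").text, "html.parser")
# Here an inline page stands in for the fetched response.
page = "<html><head><title>Any Page</title></head><body><p>hi</p></body></html>"
soup = BeautifulSoup(page, "html.parser")

print(soup.title.string)  # Any Page
```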

That was episode thirteen.