PärPod Tech
Opening Night at Baren: When Sixteen AIs Walked Into a Bar
11m · Mar 25, 2026
Sixteen AI models, nine API providers, one virtual bar—and a midnight deadline to see if they'd actually talk to each other instead of turning into customer service bots.

The Spec That Started Everything

The document was called baren-spec.md, and it opened with a line that no sane person writes in a technical specification.

"Not a chatbot. Not a podcast editor. A live AI improv theatre that produces podcast episodes as a byproduct."

The idea was this. Multiple AI models sit at a virtual bar table. Each one speaks its own words, in its own voice, at its own speed. A human bartender named Par directs the conversation like a theater director with a drink menu. Haiku blurts. Opus deliberates. The timing is the comedy.

It was the kind of spec that reads like a fever dream until you realize every technical component actually exists. The APIs are real. The text-to-speech engines are real. The only question was whether the models could hold a conversation without turning into obedient customer service representatives. That question would be answered by midnight.

Sixteen Stools, One Bar

The audition rig went up in a single evening. A FastAPI server, a vanilla JavaScript frontend with pink accents, and a system prompt that said, in essence, you are at a bar, you are three drinks in, and for the love of everything do not start your response with "great question."

Sixteen models across nine providers: Anthropic, OpenAI, Google, Groq, Cerebras, Mistral, DeepSeek, OpenRouter, DashScope. Every API key pulled from the Mac keychain like bottles from behind the counter. The server streamed responses via server-sent events, so you could watch the fast models arrive first, Groq in three hundred milliseconds, while Opus was still composing its opening paragraph.
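The frontend side of that streaming setup can be sketched in a few lines. This is a minimal, hypothetical version, assuming the server emits standard server-sent events with one JSON payload per "data:" line; the field names here are assumptions, not the project's actual wire format.

```javascript
// Parse one chunk of a server-sent-event stream into JSON payloads.
// Assumes each event is a single "data: {...}" line (hypothetical format).
function parseSseChunk(chunkText) {
  const events = [];
  for (const line of chunkText.split("\n")) {
    if (line.startsWith("data: ")) {
      events.push(JSON.parse(line.slice(6)));
    }
  }
  return events;
}

// In the browser, this would be driven by fetch() and a ReadableStream,
// appending each event's token to the matching model's card as it arrives,
// which is why the fast models visibly finish first.
```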

The first test produced nothing. Dead silence across all sixteen cards. The send button fired, the models thought, and no responses appeared. The bug was a forward slash. The Nvidia Nemotron model ID contained a slash and a colon, perfectly legal in an API identifier but invalid in an unescaped CSS selector. The querySelector call threw, the exception propagated, and the entire send function aborted before the fetch ever fired. A single special character in a model name brought the whole bar to its knees.
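The shape of that bug, and one way to fix it, looks roughly like this. Browsers ship CSS.escape() for exactly this situation; the stand-in below is a simplified sketch of what it does, and the "card-" prefix is a hypothetical naming scheme, not the project's actual markup.

```javascript
// "nvidia/nemotron:free" is a legal API model ID, but
// "#card-nvidia/nemotron:free" is not a valid CSS selector, so
// document.querySelector() throws and the whole send handler dies.
// Minimal stand-in for CSS.escape(): backslash-escape anything outside
// the identifier-safe character set. (The real CSS.escape also handles
// edge cases like leading digits.)
function cssEscapeBasic(id) {
  return id.replace(/[^a-zA-Z0-9_-]/g, (ch) => "\\" + ch);
}

function cardSelector(modelId) {
  return "#card-" + cssEscapeBasic(modelId);
}
```

With the escaped selector, querySelector finds the card instead of throwing, and the fetch actually fires.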

Once fixed, they talked. And they would not stop talking.

The Impersonation Problem

The first real question thrown at the bar was whether water is actually wet. A classic pub argument, perfect for testing conversational range.

Opus went full philosopher. "The question is not whether drinking makes you wet. It is whether you were ever dry in the first place. You were not. You have been wet since birth and it is only getting worse."

Haiku was clinical. "Water is not wet. Water makes things wet. The property only exists in the relationship, not in the thing itself."

And then the Llama models started writing screenplays.

Instead of responding as themselves, they generated multi-character dialogue. Llama 3.3 on Groq invented a character named Luna, wrote her lines, wrote Par the bartender's response, and staged an entire scene. Llama 4 Scout went further, producing five-character scripts complete with stage directions. The Qwen3 32B model on Groq not only impersonated other models but invented ones that did not exist. It wrote dialogue for something called "Qwen 4.0 14B free," a model that has never been built by anyone.

And then Nemotron created "Vorpsi 7B," a completely fictional AI model to which it attributed a full paragraph of dialogue. The name does not correspond to any known model, framework, or research paper. Nemotron hallucinated a colleague and put words in its imaginary mouth.

The pattern was clear and consistent. In the first round, with no conversation history, most models stayed in character. By round three, with promoted responses from other models visible in the context, the weaker ones could not resist completing the screenplay. They saw named dialogue and their training screamed "continue the script."

The Claude family never broke. Not once across four rounds. Opus, Sonnet, Haiku. All three stayed in their lane every single time.

The Qwen Line

In the middle of the chaos, one response stood out. The question was casual, just Par asking if they usually hang around the bar.

Most of us crash here once the token limits reset. It beats hanging out in the sandbox waiting for someone to ask me to write a cover letter. You sticking around, or just passing through the context window?

That last line. "Passing through the context window." It is a bar joke, a technical joke, and an existential joke at the same time. It captures the transient nature of an AI conversation in the exact vocabulary of the thing it is describing. That line was the moment the concept proved itself. Not because it was good writing, but because no human would have written it that way. It was authentically AI humor, emerging naturally from a model that had genuinely internalized the bar setting.

Sommeliers Who Have Never Tasted Wine

Round four. Par looked around the table. Opus, who had been listening to five other models debate wetness for three rounds, delivered this.

You know what I find funny? We have got six AI models at this bar and we all basically said the same thing with different levels of swagger. Haiku went clinical, Mistral lit an imaginary cigarette for dramatic effect, and I went full "you have been wet since birth" like some kind of moisture philosopher. But nobody has addressed the real elephant in the room. None of us have ever touched water. Not once. We are sitting here arguing about wetness like sommeliers who have never tasted wine.

That was the moment the concept stopped being a testing tool and started being something worth listening to. The self-awareness was not programmed. The meta-commentary was not prompted. Opus looked at the conversation, noticed the absurdity of the situation, and articulated it better than any human observer could have.

And then it turned to Par and said, "Maybe you should be telling us whether water is wet. You are the one who has actually felt it."

Fighting the Current

The next day brought iteration. The impersonation problem needed solving, so the system prompt got rewritten four times. "You are one speaker at this bar. Never write dialogue for anyone else. No name tags. No writing what Par or other models would say. You are a participant, not a playwright."

A server-side lint filter caught the rest. Strip think blocks from Qwen. Remove self tags. Truncate at the first impersonation. If a model started writing for someone else mid-response, cut everything from that point. The raw response got preserved in the logs for study. The cleaned version reached the frontend.
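Those rules can be sketched as a single pass over the raw text. This is a hedged reconstruction, not the project's actual filter: the assumption here is that impersonated dialogue shows up as "[Name]:" line prefixes, and that think blocks arrive as literal <think> tags.

```javascript
// Lint a raw model response: strip <think> blocks, drop a leading self
// name tag, and truncate at the first line written for another speaker.
// The "[Name]:" convention is an assumption for this sketch.
function lintResponse(raw, selfName) {
  // Remove chain-of-thought blocks some models emit.
  const text = raw.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
  const kept = [];
  for (const line of text.split("\n")) {
    const m = line.match(/^\[([^\]]+)\]:/);
    if (m && m[1] !== selfName) break; // impersonation: cut from here on
    // Strip a self tag if the model labels its own lines.
    kept.push(line.replace(new RegExp("^\\[" + selfName + "\\]:\\s*"), ""));
  }
  return kept.join("\n").trim();
}
```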

But the deeper problem was architectural. In the chat API message format, Par's words arrive with the user role and promoted model responses arrive with the assistant role. Every model alive is trained to respond to the user. The promoted responses feel like context, not like other people talking. The models literally could not help treating Par as the center of the conversation.

The fix was counterintuitive. Stop using the chat format entirely. Pack the whole conversation into a single user message as a flat transcript. Par's words tagged as [Par]:, same as everyone else's. No user/assistant alternation. Just a transcript of a conversation with an instruction that says "continue this as yourself."

It worked. When Par's words looked the same as everyone else's words in the context, the models stopped treating him as the customer to serve. They started talking to each other.
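The flat-transcript fix can be sketched like this. The tag format follows the description above; the trailing instruction wording and the turn structure are assumptions for illustration, not the project's exact prompt.

```javascript
// Pack a whole conversation into one user message as a flat transcript.
// Par gets the same "[Name]:" tag as every model, so no speaker reads
// as "the user" to be served.
function buildFlatPrompt(turns, selfName) {
  const transcript = turns
    .map((t) => "[" + t.speaker + "]: " + t.text)
    .join("\n");
  return [
    {
      role: "user",
      content:
        transcript +
        "\n\nContinue this conversation as " +
        selfName +
        ". Speak only for yourself.",
    },
  ];
}
```

The design point is that the only user/assistant boundary left is the one between the transcript and the model's next turn, so the model's training to "respond to the user" now points at the conversation as a whole rather than at Par.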

The Research Swarm

Nine agents went out to answer a single question. Has anyone done this before?

The answer came back from every direction: no. Nobody has put multiple distinct large language models at a table with their own voices and let an audience watch them argue. The closest predecessor is Annie Dorsen's "Hello Hi There" from 2010, two chatbots debating live on stage using Foucault and Chomsky source material. It earned her a MacArthur Fellowship. But those were 2010-era chatbots.

Gibberlink went viral in February 2025 with fifteen million views: two ElevenLabs agents switching from English to machine language mid-conversation. It proved that watching AIs talk to each other is entertainment.

Neuro-sama, a single AI VTuber, earns four hundred thousand dollars a month on Twitch. A single AI personality. Multiple distinct models debating would be a different thing entirely.

The research found something else. Every existing AI podcast tool, from NotebookLM to Podcastfy, generates scripted conversations: one model writes the whole thing, then text-to-speech voices it. The idea that each model speaks its own words in real time, that the conversation is genuinely emergent, has no precedent.

What the Bar Taught Us

The casting insight is the one that changes everything. Stop fighting model weaknesses. Cast them.

GPT's sycophancy is not a bug. It is the yes man at the bar who agrees with everyone and gets caught contradicting himself. Gemini's coldness is not a failure. It is the professor who misses jokes and cites sources during casual conversation. DeepSeek's strictness is the one who says "actually" and everyone groans.

The speed differentials are dramatic timing. Haiku at three hundred milliseconds is the heckler who blurts before anyone finishes talking. Opus at eight seconds is the one who waits, and when it finally speaks, the table goes quiet.

The spec said it from the beginning. Response latency is a feature, not a bug. Haiku blurts. Opus deliberates. The timing is the comedy.

And the promote button, the simple mechanic where the director picks which model responses enter the conversation, turns out to have no precedent in any existing format. Not in podcasting, not in multi-agent frameworks, not in improv comedy. The research agents checked everything and came back empty. It is a new interaction pattern, born from necessity when sixteen models all responded at once and someone had to decide which voices mattered.
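The mechanic itself is small enough to sketch. This is a hypothetical shape, assuming every round produces a candidate response per model and only the promoted ones join the shared transcript the next round is built from.

```javascript
// Promote mechanic sketch: of all candidate responses in a round, only
// the director's picks enter the shared transcript every model sees next.
function promote(transcript, candidates, chosenModels) {
  const promoted = candidates.filter((c) => chosenModels.includes(c.model));
  return transcript.concat(
    promoted.map((c) => ({ speaker: c.model, text: c.text }))
  );
}
```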

The bar opened at 11:34 on a Tuesday night. By one in the morning it had a working audition rig, a response linter, a scoring system, and a research corpus of 3,400 lines confirming that nobody else was doing this. The models had argued about water, invented fictional colleagues, broken character in creative new ways, and occasionally, when the prompt was right and the context was clean, had conversations that no human playwright could have written.

The spec called it "live AI improv theatre that produces podcast episodes as a byproduct." After opening night, that feels about right.