The previous episode laid out four sensible projects. Dialect classifier. Local NER model. Regional dataset. TTS benchmark. All reasonable. All publishable. All the kind of thing that looks great on a grant application. And they were met with the energy of a Tuesday morning fika where someone forgot the kanelbullar.
So let's try again. The constraint is the same: has to be doable by one person, has to cost almost nothing in compute, has to produce something you can put on Hugging Face. But we're dropping the constraint that it has to be tasteful, important, or something you'd describe with a straight face at a Vinnova meeting. Some of these are useful. Some are deeply stupid. The best ones are both.
Every Swede has received a letter from Försäkringskassan, Skatteverket, or Arbetsförmedlingen and spent ten minutes trying to figure out what it actually says. Swedish bureaucratic language, myndighetssvenska, is technically Swedish the same way an IKEA manual is technically furniture. The words are familiar but the meaning has been tortured into shapes that no human brain processes naturally.
Den försäkrade ska anmäla till Försäkringskassan om det skett en förändring som kan innebära att rätten till ersättning påverkas. Anmälan ska göras inom fjorton dagar från det att den försäkrade fick kännedom om förändringen.
Translation: "Tell us within two weeks if anything changes." That's it. That's the whole meaning. Eight words of human language disguised as thirty-four words of legalese.
The project: collect parallel examples of bureaucratic Swedish and their plain-language equivalents. Some already exist — Klarspråk, the government's plain language initiative, has published before-and-after rewrites. Scrape every Klarspråk example you can find, then use a locally running open model to generate more training pairs from real government documents. Fine-tune a small model to translate myndighetssvenska to human Swedish. Publish on Hugging Face as "myndighets-to-människa."
This is genuinely useful. It's also funny because the examples are inherently absurd. And it's a project that no AI lab in Silicon Valley would ever build because you have to deeply understand Swedish institutional culture to even know the problem exists.
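For the show notes, here is a minimal sketch of what one row of that parallel dataset might look like. The field names, the origin tag, and the length filter are all assumptions made for illustration, and the plain-Swedish line is my own rendering of the example above, not an official Klarspråk rewrite.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Pair:
    myndighet: str  # the bureaucratic original
    manniska: str   # the plain-language rewrite
    origin: str     # hypothetical tag, e.g. "klarsprak" or "generated"


def keep(pair: Pair) -> bool:
    """Crude sanity filter (an assumption): the rewrite must be non-empty
    and use fewer words than the original."""
    return 0 < len(pair.manniska.split()) < len(pair.myndighet.split())


def to_jsonl(pairs) -> str:
    """Serialize the pairs that pass the filter, one JSON object per line."""
    return "\n".join(
        json.dumps(asdict(p), ensure_ascii=False) for p in pairs if keep(p)
    )


EXAMPLE = Pair(
    myndighet=(
        "Den försäkrade ska anmäla till Försäkringskassan om det skett en "
        "förändring som kan innebära att rätten till ersättning påverkas. "
        "Anmälan ska göras inom fjorton dagar från det att den försäkrade "
        "fick kännedom om förändringen."
    ),
    manniska="Meddela oss inom två veckor om något ändras.",
    origin="hand-written",
)

if __name__ == "__main__":
    print(to_jsonl([EXAMPLE]))
```

A length check like this will not catch a rewrite that is short but wrong. It is only a cheap first pass before a human reads the generated pairs.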
Swedish municipal communications have a distinctive quality. They're earnest. They're formulaic. They use phrases like "Vi ser positivt på utvecklingen" and "Medborgardialogen har visat att" in contexts where a normal person would say "this is good" or "we asked people." The resulting prose is so uniform across Sweden's two hundred and ninety municipalities that it sounds like it was generated by an algorithm. Which, increasingly, it is.
Kommunen ser med tillförsikt på det framtida samarbetet och värdesätter den konstruktiva dialog som förts med samtliga berörda parter i processen. Vi fortsätter att arbeta aktivt med frågan.
The project: build a binary classifier that distinguishes real kommun press releases from AI-generated ones. Scrape a few hundred real municipal communications. Generate a few hundred fake ones using GPT-SW3 or a locally running model. Fine-tune KB-BERT to tell them apart. The funny part: publish the accuracy score. If the model can't tell the difference — and it probably can't, reliably — that itself is the finding. That's a paper title waiting to happen: "Indistinguishable: Municipal Communications and Language Model Output in Swedish."
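What "can't tell the difference, reliably" could mean in numbers: a sketch, assuming a balanced held-out test set with equally many real and generated texts, so that blind guessing lands at fifty percent. The function name is hypothetical. The math is a plain one-sided exact binomial test.

```python
from math import comb


def p_value_vs_chance(correct: int, total: int) -> float:
    """Probability of getting at least `correct` right out of `total`
    by flipping a fair coin for every text."""
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    tail = sum(comb(total, k) for k in range(correct, total + 1))
    return tail / 2 ** total


if __name__ == "__main__":
    # Hypothetical result: 112 of 200 test texts classified correctly.
    print(p_value_vs_chance(112, 200))
```

Feed it the number of correct predictions and the size of the test set. A large p-value means the classifier's score is compatible with coin flipping, which is the finding the paper title is hoping for.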
Bonus: make it a web app where people paste text and get a confidence score. Kommun-eller-AI-dot-se. It'll go mildly viral in Swedish local government circles, which is a very specific form of fame.
Sentiment analysis is boring. Positive, negative, neutral. That's a Western reductionism that misses the entire emotional landscape of Swedish communication. Swedish has feelings that English doesn't have words for, and more importantly, has social situations that generate specific textual patterns.
The project: train a classifier not on sentiment but on vibes. The taxonomy:
Lagom — the text expresses a measured, appropriate amount of enthusiasm. Not too much. Not too little. Perfect.
Mysigt — the text conveys warmth, coziness, and a sense of being exactly where you should be. Candles may be implied.
Jobbigt — the text describes a situation that is uncomfortable, awkward, or emotionally draining, but in a specifically Swedish way where you can't quite complain because that would be even more jobbigt.
Jantelagen — the text subtly communicates that someone has gotten too big for their boots and should be brought back down to acceptable Swedish levels.
Passiv-aggressiv fika — the text appears friendly and collegial on the surface but contains a weapon concealed in politeness.
Systemkollaps — the text describes a bureaucratic, technical, or personal system failure with a tone of exhausted resignation.
And finally: fredagsmys — the text radiates the specific joy of a Friday evening where plans involve a couch, tacos, and complete absence of ambition.
Train this on Swedish social media and forum posts. The annotations themselves would be hilarious to do because you'd have to argue about whether a specific text is "jobbigt" or "passiv-aggressiv fika" and those arguments are themselves peak Swedish culture.
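Those annotation arguments can be measured. Below is a sketch of Cohen's kappa for two annotators labeling the same texts with the seven vibes, which corrects raw agreement for the agreement you would expect by chance. The label strings and the function name are assumptions, and the labels in the usage example are made up.

```python
from collections import Counter

VIBES = [
    "lagom", "mysigt", "jobbigt", "jantelagen",
    "passiv-aggressiv fika", "systemkollaps", "fredagsmys",
]


def cohens_kappa(annotator_a, annotator_b) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if len(annotator_a) != len(annotator_b) or not annotator_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(annotator_a)
    observed = sum(a == b for a, b in zip(annotator_a, annotator_b)) / n
    counts_a, counts_b = Counter(annotator_a), Counter(annotator_b)
    # Chance agreement: both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both used one single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    a = ["lagom", "jobbigt", "jobbigt", "mysigt"]
    b = ["lagom", "jobbigt", "passiv-aggressiv fika", "mysigt"]
    print(round(cohens_kappa(a, b), 3))
```

A low kappa on jobbigt versus passiv-aggressiv fika would not mean the taxonomy is broken. It would mean the categories overlap, which is worth reporting alongside the model.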
Flashback Forum is Sweden's collective unconscious committed to text. It's where Swedes go to say what they would never say with their name attached. And the thread titles are an art form. They combine mundane questions with existential drama in a way that no AI has ever replicated because no AI has been adequately traumatized by Swedish society.
Genuine Flashback thread titles — no, I can't read them because I don't have access to them right now, but anyone who has spent five minutes on Flashback knows the energy. It's questions like "Can my landlord forbid me from cooking surströmming?" next to "How do I know if I'm alive?" next to deeply technical threads about car repair.
The project: scrape the publicly available Flashback thread titles — they're just titles, no personal data, no post content. Fine-tune a small generative model — maybe the hundred and twenty-six million parameter GPT-SW3, which would run on a phone — to generate new thread titles. The output would be a kind of AI-generated Flashback simulator that captures the energy without the content.
This is stupid. It's also a legitimate demonstration of fine-tuning a Swedish generative model on a culturally specific domain. And the outputs would be genuinely entertaining. Ship it as a bot that posts one generated title per day.
Sweden has about two thousand inhabited localities. It also has an estimated three hundred thousand named places: hills, lakes, bogs, farms, meadows, crossroads. The naming patterns are deeply systematic: -by, -ås, -ström, -berg, -vik, -holm, -lund, -torp. A model trained on them could generate plausible place names that sound real but don't exist.
The project: collect every named place in Sweden from Lantmäteriet's open data. Train a tiny character-level model — you don't even need BERT for this, a simple RNN would work — to generate new place names. Then build a quiz: Real or Fake? Show people a place name and ask them to guess. "Björkåsen" — real or fake? "Granliden"? "Kvarnbäcken"?
The quiz itself is the product. But the underlying model is a publishable artifact, and the training data and methodology are a tiny but valid exploration of Swedish linguistic patterns.
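For the place-name generator, here is a sketch that goes even simpler than the RNN mentioned above: a character-level Markov chain, which is a different and cruder technique but shows the same idea. The training list is a hand-typed placeholder, not Lantmäteriet data, and every function name is hypothetical.

```python
import random
from collections import defaultdict

START, END = "^", "$"

# Placeholder names typed by hand for illustration only.
SAMPLE_NAMES = [
    "Björkåsen", "Granliden", "Kvarnbäcken", "Björkliden", "Granåsen",
    "Kvarnberget", "Björkvik", "Granholmen", "Kvarntorp", "Lindåsen",
    "Lindholmen", "Sandvik", "Sandbäcken",
]


def train(names, order=2):
    """Map each `order`-character context to the characters seen after it."""
    model = defaultdict(list)
    for name in names:
        padded = START * order + name.lower() + END
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model


def generate(model, order, rng, max_len=30):
    """Sample one name, character by character, until the end marker."""
    context, chars = START * order, []
    while len(chars) < max_len:
        ch = rng.choice(model[context])
        if ch == END:
            break
        chars.append(ch)
        context = (context + ch)[-order:]
    return "".join(chars).capitalize()


def fake_names(names, n, order=2, seed=0, max_tries=10_000):
    """Return up to `n` generated names that are not in the training list."""
    rng = random.Random(seed)
    model = train(names, order)
    real = {name.lower() for name in names}
    fakes = set()
    for _ in range(max_tries):
        if len(fakes) >= n:
            break
        candidate = generate(model, order, rng)
        if len(candidate) > 3 and candidate.lower() not in real:
            fakes.add(candidate)
    return sorted(fakes)


if __name__ == "__main__":
    print(fake_names(SAMPLE_NAMES, n=5))
```

With thirteen names the output will be clumsy recombinations. On the full dataset a longer context, or the RNN, should do better, and the same not-in-the-real-list check is what makes the quiz fair.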
This one is the dumbest and I love it the most. Build a model that scores any text on a scale from "mild" to "surströmming" based on how provocative, pungent, or socially divisive the content is. Not toxicity — that's been done and it's boring. Pungency. How much would this text clear a room? How many people would involuntarily make a face if you said this at a dinner party?
A Klarspråk-perfect municipal press release: mild. A well-written newspaper editorial that most people agree with: knäckebröd. A political take that half the room loves and half hates: inlagd sill. A genuinely controversial opinion stated without nuance: surströmming.
Train it on Swedish social media posts annotated on the surströmming scale. Deploy it as a browser extension that rates every webpage you visit. Watch people lose their minds arguing about whether a specific text is sill or surströmming.
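The scale itself is ordinal, so one plausible setup is a model that outputs a single pungency score and a lookup that turns it into a label. A sketch, assuming scores between zero and one. The cutoffs are invented and would need tuning against the annotated data.

```python
from bisect import bisect_right

SCALE = ["mild", "knäckebröd", "inlagd sill", "surströmming"]
CUTOFFS = [0.25, 0.5, 0.75]  # assumed boundaries between adjacent labels


def pungency_label(score: float) -> str:
    """Map a model score in [0, 1] to a step on the surströmming scale."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return SCALE[bisect_right(CUTOFFS, score)]


if __name__ == "__main__":
    for score in (0.1, 0.4, 0.6, 0.9):
        print(score, pungency_label(score))
```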
Swedish parliamentary debates contain moments of accidental beauty. Sentences that, pulled from context, read like poetry. The project: build a pipeline that processes the entire Riksdag open data archive — every speech, every debate, every motion — and identifies the most poetic sentences using a combination of rhythm analysis, unusual word choices, and emotional intensity scoring.
Then publish a daily feed: one accidentally beautiful sentence from a Swedish parliamentarian, with the speaker's name, party, and the context of the debate. Automated poetry extraction from democracy. It's art. It's also a legitimate NLP pipeline involving sentence scoring, aesthetic metrics, and Swedish text analysis.
The Hugging Face artifact could be the scoring model itself: given a Swedish sentence, how poetic is it on a scale from government report to Tranströmer?
This one swings back toward useful. Swedish text on the internet is full of specific errors that English spell-checkers don't catch. De/dem confusion, which is the Swedish equivalent of their/they're/there. Sär skrivning — incorrectly splitting compound words, which is a national epidemic. "Glass butik" instead of "glassbutik." "Tvätt maskin" instead of "tvättmaskin." Every Swede over thirty has opinions about this.
Train a sequence-to-sequence model — or even just a classifier — that detects and corrects specifically Swedish errors. Not general grammar correction, which existing tools handle. Specifically the errors that annoy Swedish people the most. Deploy it as a browser extension or a Hugging Face Space.
The name writes itself: Särskrivningsakuten. The Compound Word Emergency Room. Every time it corrects a split compound word, it plays a tiny ambulance siren.
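Before training anything, a lexicon lookup makes a reasonable baseline for the compound-word half of the problem. The sketch below joins two neighboring words when their concatenation is a known compound. The lexicon is a two-word placeholder and the function name is hypothetical.

```python
TRAILING_PUNCTUATION = ".,!?;:"


def join_split_compounds(text, lexicon):
    """Join neighboring words whose concatenation is a known compound.

    Returns the corrected text and the number of joins made
    (one siren per join).
    """
    known = {word.lower() for word in lexicon}
    words = text.split(" ")
    fixed, joins, i = [], 0, 0
    while i < len(words):
        if i + 1 < len(words):
            # Ignore punctuation after the second word when looking it up.
            tail = words[i + 1].rstrip(TRAILING_PUNCTUATION)
            if (words[i] + tail).lower() in known:
                fixed.append(words[i] + words[i + 1])
                joins += 1
                i += 2
                continue
        fixed.append(words[i])
        i += 1
    return " ".join(fixed), joins


if __name__ == "__main__":
    lexicon = {"glassbutik", "tvättmaskin"}  # placeholder lexicon
    print(join_split_compounds("En glass butik med en tvätt maskin.", lexicon))
```

The lookup cannot tell when two separate words are actually intended, since some word pairs are valid both split and joined with different meanings. That ambiguity is the argument for a trained model that sees the whole sentence.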
You've rendered hundreds of PärPod episodes across multiple TTS voices. You have opinions about which voices work for which content. A calm explainer needs a different voice than a dramatic narrative. A comedic piece needs different pacing than a technical breakdown.
The project: build a classifier that, given a text, recommends the optimal TTS voice and settings. Train it on your own PärPod archive — which texts went to which voices, which episodes worked best. This is a tiny model solving a niche problem, but it's a niche problem that everyone who works with TTS encounters, and nobody has published a solution.
The Hugging Face artifact: a model that takes text and outputs recommended voice parameters. Ship it alongside a dataset of text-to-voice pairings. It's useful for anyone building TTS pipelines, which is a small but growing community.
Take every single article from Årebladet's archive. Every issue. Decades of local journalism. OCR the older ones. Digitize the whole thing. Then build a retrieval-augmented generation system on top of it: no fine-tuning, just a search index that hands the relevant articles to an off-the-shelf model, so anyone can ask questions about the history of Åre kommun in natural language. "When was the ski lift built?" "Who was the kommun chairman in nineteen ninety-three?" "What happened when the bridge collapsed?"
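The search-index half of that can start very small. Here is a sketch of a TF-IDF inverted index in plain Python, assuming the articles are already OCR'd into text. The class name, the document ids, and the three one-line "articles" are invented placeholders, not real Årebladet content.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class ArchiveIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term count}
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term, count in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = count

    def search(self, query, top_k=3):
        """Rank documents by summed TF-IDF over the query terms."""
        scores = Counter()
        for term in set(tokenize(query)):
            hits = self.postings.get(term, {})
            if not hits:
                continue
            idf = math.log(1 + len(self.docs) / len(hits))
            for doc_id, tf in hits.items():
                scores[doc_id] += tf * idf
        return [doc_id for doc_id, _ in scores.most_common(top_k)]


if __name__ == "__main__":
    index = ArchiveIndex()
    # Invented placeholder articles.
    index.add("1993-a", "Ny skidlift invigdes i Åre i helgen")
    index.add("1993-b", "Kommunfullmäktige valde ny ordförande")
    index.add("1987-a", "Bron över älven rasade efter vårfloden")
    print(index.search("skidlift Åre"))
```

This matches exact tokens only, so a query for "skidliften" would miss an article that says "skidlift". Swedish inflection and compounding mean a real version would need stemming or embeddings on top, and the retrieved articles would then be passed to the language model as context.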
The Hugging Face artifact isn't a model. It's the dataset. A cleaned, structured, searchable archive of decades of local Swedish journalism, published openly. Nobody has anything like it. Local newspaper archives are one of the most valuable and least digitized sources of Swedish cultural memory. AI Sweden would notice. KBLab would notice. The local history society would lose their minds.
And here's the grant application angle that none of the previous ideas had: this isn't just a fun project. It's a demonstration that one person with AI tools can digitize, clean, and publish a local cultural archive that would have taken a library team years to process. That's a story about AI democratizing access to cultural heritage. That's Vinnova language. That's Kulturrådet language. That's "we should fund this person to do it for other local newspapers" language.
Ten ideas. Some are deeply stupid. Some are accidentally smart. A few are both. The thread that connects them is that they're all things that only someone embedded in Swedish culture, with technical capability but not formal credentials, would think to build. They exist in the gap between what Silicon Valley cares about and what Sweden actually needs. That gap is where the interesting work lives.
Pick one. Ship it. Put it on Hugging Face. Then pick another. The grant application isn't about having one perfect project. It's about having a trail of shipped artifacts that prove you can go from idea to published output faster than most funded research teams. That's the story. Not "I trained a model." But "I shipped ten things, and here's what I learned, and here's what I'd do with actual resources."
The surströmming index can wait. But not forever.