This is a deep briefing on the ADHD-aware tooling research project. Over the course of one overnight session, we went from eight testable ideas to a complete evidence base — two hundred seventy-eight research leads collected, cross-validated by independent scoring agents, and twenty-one deep-dive reviews written. Tomorrow we start building. This episode covers everything you need to know.
Here's the core idea. Most ADHD technology is built by neurotypical teams trying to make ADHD people behave more normally. Track your tasks. Set reminders. Follow the schedule. The entire approach is deficit-framed — it assumes the problem is the person, not the tools.
Our thesis is different. We call it tool sovereignty. A person with ADHD, who cannot code in the traditional sense, uses AI — specifically Claude Code — to build tools shaped to their own neurodivergent patterns. Not conforming to tools designed for someone else's brain. Building the tools themselves. The whole existing toolchain — PärKit Capture, Focus, the editorial pipeline, chatarkiv, Director — this is already the proof of concept. It was all built by someone who can't code, using AI. No paper in the literature has studied this.
A critical paper from Katta Spiel at CHI 2022 reviewed one hundred computing papers about ADHD technology. Only twelve percent even interviewed ADHD participants. Only five projects genuinely co-designed with them. When users resisted interventions — hiding timers, subverting feedback — researchers called it failure rather than design feedback. Every critique Spiel raises, tool sovereignty answers. No proxy stakeholders. No deficit framing. No behavioral conditioning. The tool builder IS the user.
And here's what makes it strategic: a systematic review by Lauder in 2022 examined a hundred and forty-three studies on ADHD interventions. Zero — literally zero — were conducted in actual workplace contexts. Every single one was in a clinic or lab. Even a well-documented single-person study testing ADHD tool adaptations in a real work context would be the first of its kind.
Here's how we built the evidence base. We deployed eight search agents simultaneously, each covering a different research angle. AI cognitive scaffolding. Context switching and interruption science. Wearables and physiological monitoring. Task management tools. Neurodivergent design patterns. Executive function technology. Emotional regulation. And community wisdom — Reddit, Hacker News, GitHub projects, developer blogs.
Each agent searched the web independently and compiled its findings into a leads file. In total, two hundred seventy-eight raw entries came back across eight files.
Then came the triage. We ran two independent scoring agents over the leads, each with a different evaluation lens. Scorer A evaluated practical relevance — can we actually use this for our tools? Which of our eight testable ideas does it inform? Scorer B evaluated research quality and novelty — is the methodology sound? Is the evidence real? Does it tell us something we didn't already know?
The two scorers worked without shared context. When both finished, a reconciliation agent compared their ratings, flagged disagreements, deduplicated across files, and produced a tiered final ranking.
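The reconciliation step can be sketched in a few lines. This is a minimal illustration, not the actual pipeline: the tier cutoffs, the disagreement threshold, and the field names are all placeholder choices.

```python
def reconcile(lead_id, score_a, score_b, threshold=2):
    """Combine two independent 1-5 ratings into a tier, flagging disagreements.

    Cutoffs and the disagreement threshold are illustrative, not the
    values the real reconciliation agent used.
    """
    flagged = abs(score_a - score_b) >= threshold  # e.g. a 5 versus a 2
    combined = (score_a + score_b) / 2
    if combined >= 4.5:
        tier = 1   # deep-dive queue
    elif combined >= 3:
        tier = 2   # worth keeping
    elif combined >= 2:
        tier = 3   # reference only
    else:
        tier = 4   # discard
    return {"lead": lead_id, "tier": tier, "disagreement": flagged}

# A Tether-style split: 5 for practical relevance, 2 for research quality.
print(reconcile("tether", 5, 2))
# -> {'lead': 'tether', 'tier': 2, 'disagreement': True}
```

The point of the `disagreement` flag is exactly what the next section describes: the 5-versus-2 splits are where the learning is.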
The most interesting part? The disagreements. When Scorer A gives something a five and Scorer B gives it a two, that's where you learn. Tether, the closest prior art to our "Where Was I" concept, scored five for practical relevance — it IS what we want to build — but two for research quality, because it has zero user evaluation. Architecture only, no data. Both scorers are right. It's strategically important for competitive intelligence but evidentially empty.
Goblin Tools scored five versus two for the same reason. Massive user adoption, a genuinely brilliant UX pattern called the spiciness slider for task decomposition — but zero published research. Popular does not mean validated.
And Dodson's Interest-Based Nervous System — the PINCH model — scored five versus two. It's cited everywhere in ADHD communities as established science. It's not. It's a clinical opinion without empirical validation. Extremely useful as a design heuristic. Dangerous as a foundation for claims.
After deduplication, we had about a hundred and fifty unique leads. Twenty-one made Tier 1 — the deep dive queue. Seventy-nine in Tier 2. Seventy-one reference only. Fifty-six discarded as noise, duplicates, or fabrication suspects. The scoring agents flagged five citations that might not be real papers — the scout agents used web search, and some leads may be hallucinated.
Let me walk through the key findings from all twenty-one Tier 1 reviews. This is the meat.
The strongest-backed idea in our entire project is "Where Was I?" — automatically surfacing your last active contexts when you start a new session. Six papers directly support it, and an open-source implementation already exists.
The theoretical blueprint comes from Sophie Leroy. In 2009, she published the concept of attention residue. When you switch away from an incomplete task, your attention doesn't come with you cleanly. Part of it stays with the unfinished work. She proved this experimentally — people performed worse on the next task when the previous one was left incomplete.
Then in 2018, Leroy and Glomb published the intervention. They call it the Ready-to-Resume Plan. Before switching tasks, you write down four things: where you are right now, what you'll do next when you return, what challenges remain, and what you're deliberately postponing. It takes under sixty seconds. And it works — people who used the plan were seventy-nine percent more likely to choose the optimal solution on their subsequent task. Seventy-nine percent. That is a massive effect from a sixty-second intervention.
The cognitive science explaining WHY this works comes from Altmann and Trafton's 2002 memory-for-goals model. Goal activation decays over time during interruptions. New goals interfere with old ones. But environmental cues can reactivate suspended goals — IF the cues are blatant and co-attended. Subtle cues don't work. The cursor blinking in the last position? Not enough. A bright red arrow pointing at where you left off? That works. This has profound implications for tool design. The session summary needs to be visually prominent and specific, not a faint hint in a sidebar.
Parnin and Rugaber provided the empirical backbone in 2011. They analyzed ten thousand programming sessions from eighty-six programmers. Only ten percent resumed work in less than one minute. Eighty-three percent navigated to a completely new location before their first edit. The median time to first edit was not the commonly cited ten to fifteen minutes — that was a simplification. Thirty percent of sessions exceeded thirty minutes before the first keystroke.
Parnin identified five types of memory failure during interruption: prospective memory — forgetting you need to do something. Attentive memory — forgetting details of the current task. Associative memory — forgetting relationships between components. Episodic memory — forgetting what you already tried. And conceptual memory — forgetting the mental model of how the system works.
Each of these maps to a specific tool capability. Prospective memory needs explicit intention capture. Attentive memory needs detailed state snapshots. Associative memory needs dependency graphs. Episodic memory needs activity logs. Conceptual memory needs — well, this is what CLAUDE.md files already do, surprisingly well.
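The mapping above is small enough to be a literal lookup table. The failure-type names are Parnin's; the capability labels are our own working vocabulary, not established terminology.

```python
# Parnin's five interruption memory failures, mapped to the tool capability
# that compensates for each. Capability labels are our working names.
MEMORY_FAILURE_SUPPORTS = {
    "prospective": "intention capture",    # forgetting you need to do something
    "attentive":   "state snapshots",      # forgetting current-task details
    "associative": "dependency graphs",    # forgetting component relationships
    "episodic":    "activity logs",        # forgetting what you already tried
    "conceptual":  "system mental model",  # what CLAUDE.md files already hold
}

def needed_supports(observed_failures):
    """Given observed failure types, list the capabilities a tool should offer."""
    return [MEMORY_FAILURE_SUPPORTS[f] for f in observed_failures]
```

A tool that only ever sees episodic failures ("didn't I already try this?") can get away with activity logs alone; the table makes that triage explicit.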
Here's the exciting part: brain-mcp already exists on GitHub. It's an MCP server built by an ADHD developer that does cognitive state reconstruction. Parquet plus LanceDB plus DuckDB architecture. Twelve millisecond recall. Local only. It has twenty-five tools including tunnel_state for cognitive save-state, switching_cost for quantifying attention residue as a zero-to-one score, dormant_contexts for finding abandoned threads, and context_recovery for full re-entry briefs. The recommendation is clear: install and test this before building anything. Thirty minutes to set up. It may cover eighty percent of what we need.
But — and this is critical — Zhu's CHI 2026 paper adds a guardrail. They co-designed AI scaffolding tools with twenty ADHD students and five experts. Their key warning: GenAI can undermine metacognition if it does too much. The act of writing down your plan IS the metacognitive exercise. Auto-generating the session summary might actually be counterproductive. The tool should prompt you to state your intention, not present you with a finished summary. This directly challenges the brain-mcp approach of fully automated reconstruction and suggests the right design is a hybrid — show the data, but make the user articulate the plan.
One finding that deserved its own deep dive: the famous "23 minutes to resume after an interruption" statistic. Everyone cites this as gospel. It's attributed to Gloria Mark's 2008 CHI paper "The Cost of Interrupted Work."
The 2008 paper never mentions twenty-three minutes. Its actual finding is that interrupted workers complete tasks seven percent FASTER — but with significantly higher stress, frustration, and effort. The cost is psychological, not temporal.
The twenty-three minutes number comes from a 2005 field observation study and was popularized through media interviews, never published with full reproducible methodology in peer-reviewed form. And it measures wall-clock time away from the interrupted task — during which people complete two to three other tasks — not cognitive recovery time. These are very different things.
The more useful finding for us: forty-four percent of interruptions are self-initiated. We don't just need protection from external interruptions. We need checkpoint-on-switch, not just checkpoint-on-exit. And the cost of interruption is as much about anxiety as about lost time — context-saving reduces anxiety even when it doesn't actually save time.
Kushlev's notification research turned out to be two papers, not one. The 2016 CHI study proved that having notifications on causes ADHD-like symptoms in the general population — effect sizes of point-four-four for inattention. The 2019 follow-up tested four conditions: control, hourly batching, three-times-per-day batching, and no notifications.
Three-times-per-day batching won on every metric. Hourly batching was indistinguishable from having all notifications on — too frequent to matter. But complete silence was worse than the control. It caused anxiety with an effect size of point-five-six and fear of missing out at point-five-three.
For Context-Aware Do Not Disturb design, this is definitive. Not total silence. Batched delivery at natural breakpoints. With a reassurance signal that nothing urgent is pending. The tool should say "you have three notifications waiting" — not hide them entirely.
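The design translates into a tiny decision function. A sketch only: Kushlev's 2019 study fixed the frequency at three deliveries per day, but the specific clock times below are placeholders, and the message strings are ours.

```python
from datetime import time

# Three delivery windows per day, at natural breakpoints. Exact times are
# placeholders; the study constrains the frequency, not the clock.
DELIVERY_TIMES = [time(9, 0), time(13, 0), time(17, 0)]

def batch_status(pending_count, now):
    """Decide what the notification surface should show right now.

    Never total silence: between windows, always report how many items are
    waiting, so nothing feels hidden (the reassurance signal).
    """
    deliver_now = any(
        now.hour == t.hour and now.minute == t.minute for t in DELIVERY_TIMES
    )
    if deliver_now:
        return {"deliver": True,
                "message": f"Delivering {pending_count} notifications."}
    return {"deliver": False,
            "message": f"You have {pending_count} notifications waiting."}
```

The two branches encode the two findings: batched delivery beats both hourly delivery and always-on, and the waiting count is what keeps silence from curdling into FOMO.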
Here's a finding that changes the design rationale for several tools. Kofler's lab at Florida State analyzed the heterogeneity of ADHD cognitive profiles. Only thirty-eight percent have set-shifting impairment. Sixty-two percent have working memory impairment. The overlap is small. Only four percent are impaired across all of the measured domains.
Combined with King's 2007 finding that ADHD switching difficulties are preparation-dependent — people can switch fine, they just can't use advance warning to prepare for a switch — this reframes the problem. It's not "ADHD people can't switch contexts." It's "ADHD people can't prepare for switches and can't hold both contexts in working memory."
The intervention isn't "minimize switches." It's "make switches cheaper through context preservation and preparation scaffolding." That's exactly what "Where Was I?" does.
Both scorers flagged the emotional regulation file as surprisingly strong. Knouse's EMA study was the standout. She tracked a hundred and six adults with ecological momentary assessment — random pings throughout the day asking what they're thinking and doing.
Avoidant thoughts were present in forty-five percent of moments. Nearly half the time. And here's the counterintuitive core: these avoidant thoughts are positively valenced. "I'll do it later." "I work better under pressure." "This can wait." They feel like reasonable decisions, not avoidance. Traditional CBT restructuring fails because people defend these thoughts as strengths.
For ADHD, the frequency is four-point-three times higher than controls. And each avoidant thought hits harder — avoidance jumps point-eight-eight points for high-ADHD versus point-five-nine for low-ADHD. The mechanism is escape-avoidance: an aversive task triggers emotional discomfort, the avoidant thought provides momentary relief, and negative reinforcement makes the pattern automatic.
This informs our Emotional Tone Classification idea. The tool shouldn't fight avoidance directly — it should mirror the pattern back without judgment. "You've been on this task for two minutes and switched away three times" is an observation. "Stop procrastinating" is an accusation.
Two papers documented what to avoid.
The neurofeedback JAMA Psychiatry meta-analysis from 2025 analyzed thirty-eight randomized controlled trials with twenty-four hundred seventy-two participants. When outcomes were rated by someone who didn't know whether the child received neurofeedback, the standardized mean difference was zero-point-zero-four. That is statistically zero. Not small, not trending — zero. Parents who knew their child got the treatment reported improvement. Teachers who didn't know saw nothing. A one point three billion dollar per year industry built entirely on expectancy bias.
But the meta-lesson is valuable. The structured supportive attention that comes WITH the neurofeedback — someone caring, paying attention, showing up regularly — that part works. Which validates body doubling and AI companionship through the correct mechanism.
The Habitica gamification study found that all forty-five participants experienced counterproductive effects. Punishment during productive periods. Task relabeling to dodge penalties. System gaming. Focus shifting from actual work to the game. Motivation erosion over time. For ADHD specifically, this connects to emotional dysregulation, rejection sensitive dysphoria, and shame cycles. Streaks weaponize loss aversion. Missed days become evidence of failure.
The anti-pattern is clear. PärKit should never: punish missed tasks, display streak counters, show red indicators for overdue items, compare performance to others, or use loss-framed language. Finch, the self-care app, gets this right with its no-penalty design — the virtual pet is always happy to see you, no matter how long you've been gone.
The JetBrains study quantified something we intuited: clean interfaces help ADHD users disproportionately. They compared two panels in Zen mode versus seven panels in default PyCharm.
Thirty-five percent faster time to first keystroke. Twenty-nine percent faster coding speed. Twelve percent faster debugging. And the type of ADHD matters — people with time management difficulties were dramatically worse in the cluttered UI, while people with self-restraint difficulties showed near-equal performance in both modes.
Claude Code's terminal interface is accidentally ideal. It imposes inherently low perceptual load. One task, one context, one conversation. This validates the "Next 1" design principle — show one thing, not a dashboard.
The sleep variability paper from BMC Psychiatry 2025 delivered a crucial discovery. It's the same cohort as the Sankesara smartphone study — same forty participants, same ten weeks, same Fitbit. Sankesara analyzed phone behavior. Denyer analyzed sleep. This isn't two studies agreeing. It's one dataset showing the same trait across every behavioral channel.
Every mean sleep parameter was statistically identical between ADHD and control groups. Sleep efficiency had a p-value of point-nine-nine-nine for means. But every variability measure was significant at p less than point-zero-zero-one. Sleep duration variability. Onset variability. Offset variability. Efficiency variability. All significant.
Combined with Sankesara's findings on notification response variability and task interval variability, this confirms the project's core thesis from three independent behavioral domains. ADHD isn't "worse on average." It's "inconsistent." The signal is the spread, not the center.
For the Weekly Rhythm Digest, this means: show people their variability patterns, not their averages. "Your sleep start time ranged from eleven PM to three AM this week" is useful information. "Your average bedtime was one AM" hides the actual experience.
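Here's what "report the spread, not the center" looks like concretely. A sketch under stated assumptions: sleep start times are encoded as hours past noon so that times straddling midnight sit on one continuous scale, and the message format is just the example sentence from above.

```python
from statistics import mean, pstdev

def rhythm_digest(sleep_onsets_h):
    """Summarize the week by spread, not center.

    `sleep_onsets_h` are sleep start times in hours past noon
    (23:00 -> 11.0, 03:00 -> 15.0), keeping midnight-straddling
    times on one scale.
    """
    lo, hi = min(sleep_onsets_h), max(sleep_onsets_h)

    def clock(h):
        return f"{int((h + 12) % 24):02d}:00"

    return {
        "range": (f"Your sleep start time ranged from {clock(lo)} "
                  f"to {clock(hi)} this week."),
        "spread_hours": round(pstdev(sleep_onsets_h), 1),
        "mean_hours": round(mean(sleep_onsets_h), 1),  # kept, but not headlined
    }
```

The mean is still computed — it's just demoted. The headline is the range sentence, because that's the number that matches the person's lived week.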
The research phase is complete. All twenty-one Tier 1 leads have been deep-dived. The evidence base is solid. Here's what we know for implementation:
"Where Was I?" is ready to build. The theoretical foundation is airtight — Leroy for the intervention design, Altmann and Trafton for the cognitive mechanism, Parnin for the empirical data on programmer behavior. brain-mcp exists as a reference implementation. The Zhu guardrail says: prompt, don't generate. The implementation spec writes itself — capture context on session end, surface it on session start, but make the user articulate their re-entry plan.
The first step tomorrow should be installing brain-mcp and testing it for thirty minutes. If it covers eighty percent of what we need, fork it and add PärKit awareness, git integration, and the Leroy-style planning prompt. If it doesn't, we build our own, but we borrow the architecture — Parquet, local-only, MCP integration.
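The capture-then-prompt shape of the spec fits in a dozen lines. This is a sketch of the hybrid design, not brain-mcp's actual API: the state file name, the context keys, and the prompt wording are all hypothetical.

```python
import json
from pathlib import Path

STATE = Path(".parkit-session.json")  # hypothetical state file name

def on_session_end(context):
    """Capture context on session end.

    `context` is whatever the surrounding tooling can observe (files touched,
    git branch, open threads); the keys are illustrative.
    """
    STATE.write_text(json.dumps(context, indent=2))

def on_session_start():
    """Surface the last context, then prompt. Never auto-generate the plan."""
    if not STATE.exists():
        return None, "Fresh start. What do you intend to do first?"
    context = json.loads(STATE.read_text())
    # Zhu's guardrail: show the data, but the user articulates the plan.
    return context, "Here's where you left off. What will you do first?"
```

The asymmetry is deliberate: capture is automatic, resumption is not. The tool hands back the evidence and the user does the metacognitive work.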
Context-Aware Do Not Disturb has clear design parameters from Kushlev: batch three times per day, never total silence, include a reassurance signal. This could be a macOS Shortcut or a Claude Code hook.
"Next 1" Mode has strong design consensus from both research and commercial products. Gilbert's intention offloading research says timing triggers matter more than task content. The spiciness slider from Goblin Tools is the best UX exemplar for decomposition granularity. Sunsama's constraint-based approach — limiting how many tasks you can schedule — prevents over-commitment.
Kinder Language is straightforward: replace "overdue" with "waiting for you," reduce visible task counts, never punish. The Knouse data on positively-valenced avoidance suggests the tool should mirror, not judge.
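The replacement rule is almost trivially implementable. A sketch: the lexicon below contains only the one substitution named above plus two invented entries, and naive string replacement will sometimes produce clumsy grammar — a real version would rewrite at the template level, not the string level.

```python
# Loss-framed phrases and kinder replacements. "overdue" comes from the
# design note above; the other two entries are illustrative additions.
KINDER = {
    "overdue": "waiting for you",
    "missed": "still open",
    "failed": "not yet",
}

def kinder_language(message):
    """Rewrite loss-framed status text before it reaches the user."""
    out = message
    for harsh, kind in KINDER.items():
        out = out.replace(harsh, kind)
    return out

print(kinder_language("This task is overdue"))
# -> "This task is waiting for you"
```

Running every outbound string through one chokepoint like this also makes the anti-patterns auditable: if "streak" or "overdue" ever reaches the screen, there's exactly one place to look.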
Body Doubling has validation from a VR presence study, which found AI presence matches human presence, and from the neurofeedback meta-analysis finding that structured attention is the active ingredient, not the specific intervention.
We are positioned to produce the first workplace-context ADHD intervention study in the literature. Even a documented n-of-one case study with CIMO framework documentation — Context, Intervention, Mechanism, Outcome — would fill a gap that a hundred and forty-three studies have left empty.
The research is done. Time to build.