Two Hands: Code Execution and the One That Clicks

Two Hands, Two Trust Models

Picture two adjacent tools in the Anthropic catalogue. One runs Python in a containerised sandbox sitting inside Anthropic's infrastructure, the kind of clean room where the only damage Claude can do is to itself. The other runs on your laptop, watches your screen, and clicks the mouse. Both are listed under tool use. Both are pieces of agent plumbing. They sit at opposite ends of a trust spectrum, and the gap between them is the subject of this episode.

This is also the final episode of the Claude Series, so the gap matters in another way. The previous eleven episodes have, between them, walked the agent stack from the model upward: tool use, schemas, fine-grained streaming, programmatic tool calling, the Memory primitive, the Skills primitive, the org-wide distribution of Skills, the Managed Agents runtime. Most of those layers either depend on or quietly assume the existence of the code execution tool. Computer use, the click-on-your-screen tool, sits to one side of the whole stack and does something different in kind. Putting them together at the end is the way to see what the series has been about. The operator now has two hands. One is sandboxed and precise. The other is improvisational and prone to misreading buttons. Both are sometimes the only correct option for the work in front of you.

The code execution tool is server-executed. You enable it, Claude runs the code inside Anthropic's container, the result comes back. The computer use tool is client-executed. Claude emits an action, your application executes it on your machine, your application takes a screenshot, the screenshot goes back. Same tool-use contract, different sides of the wire. That distinction is the foundation, and the rest of the episode builds on it.

The Sandboxed Hand

Code execution went into public beta with the beta header code dash execution dash twenty twenty-five dash oh eight dash twenty-five. The most recent version, dated January twentieth twenty twenty-six, code underscore execution underscore two oh two six oh one two oh, is the one that matters for current builds. It adds two capabilities to the version before it: persistent R E P L state across calls inside a single container, and programmatic tool calling, which Episode Seven of this series walked at length. The earlier version, code underscore execution underscore two oh two five oh eight two five, supports Bash commands and file operations. A legacy version from May supports only Python. New code, new container; old code, old container.

The code execution tool runs in a secure, containerised environment designed specifically for code execution, with a higher focus on Python. File processing packages include pyarrow, openpyxl, xlsxwriter, xlrd, pillow, python-pptx, python-docx, pypdf, pdfplumber, pypdfium2.

That list is the part that makes this a real Python sandbox rather than a toy. Those are the same packages a working data engineer would reach for. The container has a thirty-day maximum lifetime, cleans up after four and a half minutes of idle time, and exposes a container I D you can pass back into subsequent requests to maintain state across calls. So the same container that built the spreadsheet in one A P I request can read the spreadsheet in the next. State persists for the duration the operator chooses to keep paying for it.

A couple of constraints worth knowing. Code execution does not run on Amazon Bedrock or Vertex A I. For the Claude Mythos preview deployment, code execution is supported on the Claude A P I and Microsoft Foundry only, and explicitly not on Claude Platform on A W S. The feature is not eligible for Zero Data Retention. So this is a primitive most deeply integrated with the direct Claude A P I path, and the more carefully provisioned the cloud deployment, the more the sandbox simply may not be available there.

This is where the Episode Nine callback earns its place. The Skills primitive walked back then, the folder of markdown plus scripts plus resources that Claude can discover at runtime, runs inside the code execution sandbox. The sandbox is the surface Skills live on. The pre-built Anthropic Skills for P P T X, X L S X, D O C X, and P D F creation are not separate features; they are Skills that ship with the container, sitting in a directory the runtime can list. Adding your own Skill, whether through the user-folder route documented in Episode Nine or the org-folder distribution route walked in Episode Ten, drops a directory into the same sandbox. The build-versus-load distinction Episode Nine talked about, the one between writing the logic and loading the logic, only works because there is a runtime to load into. The runtime is this one.

The Episode Seven callback follows directly. Programmatic tool calling, the technique where Claude writes a script that orchestrates multiple tools inside the sandbox rather than round-tripping each call through the model, requires this exact version of code execution. The pattern only works because the script and the tools and the intermediate results all stay inside the container until Claude is done, and only the final answer crosses back into the context window. For Claude for Excel reading and modifying spreadsheets with thousands of rows, that pattern is what makes the spreadsheet usable. For a multi-tool workflow that would otherwise consume ten or twenty thousand tokens in intermediate results, that pattern is what keeps the work in budget.

For Pär, the closest lived example is KallBadet, the cold-water bathing association where he is treasurer. The annual closing involves a Python script that reads a custom database, produces a S I E four file in the Swedish bookkeeping interchange format, and imports it into Fortnox. The script runs on his laptop today. The same script, rewritten as a Skill bundle and pushed into the code execution sandbox, would run identically inside Anthropic's container, with the Anthropic-pre-built P D F Skill nearby for generating the audit report. The shape of the workflow does not change; the location of the workflow moves from one operator's filesystem into a primitive available to any agent run. That is the move the series has, episode by episode, been mapping.

The Clumsy Hand

The computer use tool sits in a different category from everything else this series has covered. Computer use is a client-executed tool with the beta header computer dash use dash twenty twenty-five dash oh one dash twenty-four, available on Claude Opus four point seven, Sonnet four point six, Haiku four point five, and the older four-series models. Claude does not run the actions inside Anthropic's infrastructure. Claude emits actions, your application runs them, your application takes a screenshot, the screenshot returns. The agent loop runs in your code on your machine.

You have access to a set of functions you can use to answer the user's question. This includes access to a sandboxed computing environment. You do not currently have the ability to inspect files or interact with external resources, except by invoking the below functions.

That is the system prompt Claude sees when computer use is enabled. The phrasing is precise. The sandbox is the desktop your application has handed Claude, and Claude is told it cannot reach beyond it except through the tool. The actions Claude can request are mouse moves, mouse clicks, keyboard input, scrolling, and screenshot capture, plus the file and shell tools that often run alongside.

Two facts about computer use shape how it gets used in practice. First, prompt injection. Because Claude is reading screen content as input, anything visible on screen, a notification, a misleading button label, a hostile email, can attempt to inject instructions into the model. Anthropic runs an automatic classifier on screenshots that flags suspected prompt injections and steers Claude to ask the operator for confirmation before continuing. Pär's instinct, here, is to read this the same way he read the Managed Agents Vaults discussion in the previous episode: this is the part where the platform earns its keep, because writing your own prompt-injection classifier is the kind of work nobody actually does well. The classifier is the reason the tool ships at all.

Second, reliability. The docs are explicit. Claude may make mistakes selecting tools when generating actions. Reliability is lower with niche applications or multiple applications at once. Scroll often does not behave the way you expect. Selecting individual cells in spreadsheets is brittle. Creating accounts on social platforms is something the model has been deliberately restrained from doing. This is not the sandboxed-Python tool that runs a deterministic script. This is a model looking at pixels and guessing, often correctly, sometimes not. The reference implementation Anthropic publishes wraps everything in a Docker container with a web interface for inspection, which is the right shape for the work because the work needs supervision.

The cases where computer use is the right answer are the cases where no A P I exists. The legacy enterprise application that runs only in a Citrix session and has not had a meaningful integration update since two thousand twelve. The state-government portal that requires session cookies and a paper-form-style click sequence. The spreadsheet that some accountant has been editing by hand for fifteen years and the only path to automating it is to literally click in the cells the accountant clicks in. Pär's working version: a future hire at Årebladet Live who does not have an A P I to Bokio for some particular operation. Computer use is the operator's last-resort tool, and the value is that the last resort is sometimes the only resort. The tool exists for the cases where the proper interface does not.

Where Both Hands Meet

The architectural shape of these two tools, taken together, is the closing argument of the series. Episode Seven walked the three context-saving primitives, tool search, programmatic tool calling, fine-grained streaming, and identified them as a coherent set. Programmatic tool calling depends on code execution. Episode Eight walked the Memory tool and contrasted the directory-as-schema discipline against Pär's Pärkit, the custom PostgreSQL schema running on his Scaleway machine. The Memory tool, where it runs in a managed deployment, runs inside the same container the code execution tool provides. Episode Nine walked Skills, which install into the same container. Episode Ten walked the org-wide distribution of those Skills, which arrive into the same container. Episode Eleven walked Managed Agents, the hosted runtime, which wraps the same container.

Five episodes in a row, the load-bearing primitive underneath has been the same containerised sandbox. The sandbox is the simple layer. The other features are arrangements on top of it. The Hugo principle from Episode Eleven, strip assumed complexity, find the simple layer, build on that, applies retroactively to the entire stack: the simple layer everything else stands on is the code execution container, and once you see that, the rest of the agent platform is a series of progressively more polished ways to organise what runs inside it.

Computer use does not fit that pattern, and the not-fitting is the point. Computer use is the escape hatch from the platform. When the sandbox cannot reach the operation, because the operation requires a graphical user interface and a mouse on a real screen, computer use leaves the platform entirely and lets Claude work in your environment. The two tools are not in tension; they are complements. The sandboxed hand for the cases where the work is computable. The clumsy hand for the cases where the work requires looking at a screen made for humans.

Closing: The Operator-Complete Loop

[calm] Twelve episodes of the Claude Series end here. The unifying frame, looking back at the whole arc, is that an agent stack has to be operator-complete. By which the series has meant: the operator should be able to do the actual work, whatever shape the work takes, without falling out of the stack into ad-hoc scripting. The work has many shapes. Schema design. Long-running multi-step reasoning. Custom domain tooling. Document creation. Spreadsheet manipulation. Web research. Memory across sessions. Distribution to a team. And, for the work that simply cannot be reached any other way, clicking on a screen.

Anthropic's bet, episode after episode, has been to publish primitives at every layer of that completeness and let operators choose which combination matches their work. The Messages A P I for the case where the operator wants to drive the loop themselves. The Managed Agents runtime for the case where the operator wants the loop driven for them. Skills for the case where domain expertise is the bottleneck. Memory for the case where context cannot fit in a window. Code execution for the case where computation is the bottleneck. Computer use for the case where the proper interface does not exist. M C P for the case where the operator's tools live outside Anthropic's infrastructure altogether, on a Scaleway box in Pär's case, or in a colleague's database, or somewhere stranger.

The honest reading of where this leaves the operator is that the platform is sincere about completeness in a way that, three years ago, no agent platform was. The bet may not pay off forever; some of these layers will be replaced, some of these conventions will turn out to be wrong, some of the open-source alternatives will catch up or pass through. The series has been deliberately honest about that. But for the operator working today, in May twenty twenty-six, the stack is real, the documentation is actually good, and the choice of which layer to stand on is a real choice rather than a marketing one.

For Pär, sitting in Kall, the answer to which layer is the simple layer changes by project. Director runs on his Scaleway box because the topology is the value. The KallBadet annual close runs as a local Python script because the volume does not justify the platform. The Årebladet article pipeline mixes Claude Code on his laptop, the Whisper transcription on his M five MacBook, and judgement calls on every story. The PärPod renderer that produced this episode is a Director M C P tool that wraps a custom T T S pipeline behind a single function call. None of those would gain from being moved into Managed Agents today. Some of them might be moved next year. The diagnostic is the rule. The result is contingent.

So the series ends where it began, in the documentation, where the operator can read the primitive list, compare it against the work in front of them, and decide. Two hands, one sandboxed and one clumsy, plus everything in between. That is the Claude platform in May twenty twenty-six. Whatever you build with it, build the simple thing first, and let the platform earn its complexity only where the work demands it.