
Disposable Dev Environments for AI Coding Agents

Why running AI coding agents at scale falls apart, and how a self-provisioning, disposable environment per run changes your team's velocity, risk, and cost.

The first time I let a coding agent loose on my own laptop, my hand hovered over the keyboard. It was deleting files, installing packages, running commands, and I sat there half-expecting it to break something I cared about. It felt like a technical worry, but it wasn't. The real question was: if an agent needs someone watching over its shoulder, how was I ever going to run ten of them, then fifty? Anything that needs babysitting doesn't scale. The problem wasn't the agent's ability; it was the environment I'd handed it. My machine couldn't be its playground, because it was also my home, and "home" doesn't scale.

Let me put a number on it. When I had to watch agents one at a time, they finished a handful of small jobs a day, because the bottleneck wasn't the agent, it was me sitting next to it. Take supervision out of the equation and parallelism stops being an architectural limit and becomes a config value: the number of agents running at once is a single setting, defaulting to 5 on my side and dialable up to 100. The interesting part is that the container is almost a rounding error in what those agents cost. The throwaway computer is the cheap part; the expensive part is the model tokens, and I cap that per job (a ceiling of 500k tokens per session). For a CTO that's where the math gets clear: the most expensive resource is human attention, and the easiest one to scale is a disposable environment. Pull the human out of supervision and put a predictable ceiling on per-job cost, and "how many agents should I run" stops being a risk question and becomes a budget question.
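To make "a config value" concrete, here is a minimal sketch of what those two knobs could look like in code. It is not the runner's actual implementation: the names RunLimits, TokenMeter and run_all are invented for illustration, and only the numbers (5, 100, 500k) come from the text.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class RunLimits:
    # Hypothetical field names; the values are the ones quoted in the text.
    max_parallel_agents: int = 5          # default, dialable up to 100
    max_tokens_per_session: int = 500_000

    def __post_init__(self) -> None:
        if not 1 <= self.max_parallel_agents <= 100:
            raise ValueError("max_parallel_agents must be between 1 and 100")


class TokenBudgetExceeded(RuntimeError):
    pass


class TokenMeter:
    """Tracks token use for one session and refuses to go past the ceiling."""

    def __init__(self, ceiling: int) -> None:
        self.ceiling = ceiling
        self.used = 0

    def charge(self, tokens: int) -> None:
        if self.used + tokens > self.ceiling:
            raise TokenBudgetExceeded(f"session would exceed {self.ceiling} tokens")
        self.used += tokens


async def run_all(jobs, limits: RunLimits):
    """Run async job functions with at most max_parallel_agents in flight."""
    gate = asyncio.Semaphore(limits.max_parallel_agents)

    async def guarded(job):
        async with gate:
            # Each job gets its own meter, so the cap is per session.
            return await job(TokenMeter(limits.max_tokens_per_session))

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Raising parallelism is then a matter of constructing RunLimits(max_parallel_agents=50); nothing else in the sketch changes, which is the point of treating it as a setting rather than an architecture decision.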

Most teams' first instinct is the same: build a sandbox, lock the agent inside. Sounds reasonable. But a single shared sandbox's problems grow as the team grows. Two jobs show up at once and step on each other; one wipes node_modules while the other is mid-use; leftover state from a previous run quietly poisons the next. Give it a few weeks and that sandbox becomes the exact thing you were avoiding: a box nobody fully understands, everybody's afraid to touch, and whose upkeep depends on one person. A new pet server that just happens to be called "the sandbox."

I lived with that for a while, then asked the obvious question: why am I trying to reuse the environment at all?

If the environment is precious, you get scared

There's an old distinction in the sysadmin world: pets versus cattle. A pet server has a name, you hand-feed it, and you wake up at night when it goes down. A cattle server has a number, and when one dies you spin up a replacement and forget the old one. Infrastructure moved to the cattle side years ago, and your organization probably did too. Yet we still treat the dev environment we give an AI agent like a pet: just the one, precious, please don't break it.

And here's the trap. The moment the environment becomes precious, you start clipping the agent's wings. You write prompts like "don't delete that," "stay out of that folder," "undo any packages you install," throttling the agent to protect a fragility you created yourself. At team scale the real cost shows up: every constraint is a maintenance burden, every "be careful" prompt is tacit knowledge carried on one person's shoulders. For a tool to pay off across an organization, I need to let it off the leash without making it depend on anyone's supervision.

The fix is simple, but I had to flip something in my head: stop treating the environment as a reusable asset, and make it fully disposable. Every run gets its own environment, it's gone when the run ends, the next one starts from scratch. You don't have to protect something with no value, and something you're not protecting doesn't slow you down.

The dev runner

I call the thing I built a dev runner. When an agent gets assigned a job, the runner prepares a fresh environment from nothing.

First it figures out what the project is. I didn't want to hand-write "this is Rails, use this image" for every repo; that kind of config rots fast in an organization, nobody updates it. So instead it reads the project's CI configuration. Here I made a small but important call: rather than parse .gitlab-ci.yml myself, I call GitLab's own CI Lint API, so all the include: chains and templates get resolved on GitLab's side and handed back merged. The language, the version, the services, it's all already written down there. Whatever CI knows, the environment should know too: one source of truth, zero facts kept in two places. That means standing the agent up on top of your existing systems instead of building "agent infrastructure" as a separate cost center.
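As a rough illustration of that call, the sketch below builds a request against GitLab's project-level CI Lint endpoint (GET /projects/:id/ci/lint), whose JSON response carries valid, errors and merged_yaml fields. The helper names, the example host and the extraction logic are assumptions for this sketch, not the runner's code. Turning merged_yaml into a mapping needs a YAML parser (for example PyYAML's safe_load), which is left out here to keep the example standard-library only.

```python
import json
import urllib.request


def build_ci_lint_request(base_url: str, project_id: int, token: str) -> urllib.request.Request:
    """Request the project's own CI config, validated and merged by GitLab."""
    url = f"{base_url.rstrip('/')}/api/v4/projects/{project_id}/ci/lint"
    return urllib.request.Request(url, headers={"PRIVATE-TOKEN": token})


def fetch_merged_yaml(request: urllib.request.Request) -> str:
    """Perform the call and return the merged YAML text (includes resolved)."""
    with urllib.request.urlopen(request, timeout=30) as resp:
        body = json.load(resp)
    if not body.get("valid"):
        raise ValueError(f"CI config is invalid: {body.get('errors')}")
    return body["merged_yaml"]


def _name(entry):
    # image and services entries are either a plain string or a mapping with "name".
    return entry["name"] if isinstance(entry, dict) else entry


def extract_environment(config: dict) -> dict:
    """Pull the base image and services out of an already-parsed merged config.

    Assumption: only `default:` and top-level keys are read; per-job
    overrides are ignored in this sketch.
    """
    default = config.get("default") or {}
    image = default.get("image") or config.get("image")
    services = default.get("services") or config.get("services") or []
    return {
        "image": _name(image) if image else None,
        "services": [_name(s) for s in services],
    }
```

For a config that declares a Ruby image and a Postgres service, extract_environment would return something like {"image": "ruby:3.3", "services": ["postgres:16"]}, which is enough to decide what the container needs.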

Then it generates a docker-compose and brings the container up, with a unique random suffix every time, something like devbox-a3f91c20. Even if two agents work on the same repo at the same moment, neither knows the other exists; each has its own copy, its own container. Parallelism isn't extra effort, it's a side effect of the design, and at team scale that's what matters: "how many agents can I run" stops being a coordination problem and becomes a plain capacity problem.
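A minimal sketch of the naming and compose generation, under a few assumptions: the function names are invented, the service layout is simplified, and the file is emitted as JSON (which Compose accepts, since JSON is valid YAML) to avoid a YAML dependency. Setting the top-level name gives each run its own Compose project, so containers and the default network are namespaced per run.

```python
import json
import secrets


def new_run_name(prefix: str = "devbox") -> str:
    # 4 random bytes -> 8 hex characters, in the style of devbox-a3f91c20
    return f"{prefix}-{secrets.token_hex(4)}"


def compose_for_run(run_name: str, image: str, services: list[str]) -> str:
    """Render a compose document for one run as JSON text."""
    sidecars = {}
    for ref in services:
        # "postgres:16" -> service key "postgres" (hypothetical naming rule)
        key = ref.split("/")[-1].split(":")[0]
        sidecars[key] = {"image": ref}

    agent = {"image": image, "container_name": run_name}
    if sidecars:
        agent["depends_on"] = sorted(sidecars)

    compose = {
        "name": run_name,  # Compose project name, unique per run
        "services": {"agent": agent, **sidecars},
    }
    return json.dumps(compose, indent=2)
```

Two calls to new_run_name() for the same repository produce different project names, so two runs never share a container or a network even when they start at the same moment.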

Inside, Claude Code does its work and opens a merge request at the end. The instant the job is done, the container, the network, the temp directories are torn down; all that's left is the MR. The thing a human reviews is the output, not the environment: an MR that flows through the normal review process. No trace of the environment, because it was never worth keeping.
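One way to make "torn down the instant the job is done" hold even when the agent crashes is to tie teardown to a context manager. This is a sketch, not the runner's code: disposable_environment and its keep parameter are hypothetical names, and the run argument exists only so the Docker calls can be swapped out in tests.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def disposable_environment(run_name: str, compose_file: str,
                           run=subprocess.run, keep: bool = False):
    """Bring a run's Compose project up, and always take it down afterwards."""
    base = ["docker", "compose", "-p", run_name, "-f", compose_file]
    try:
        run([*base, "up", "-d"], check=True)
        yield run_name
    finally:
        # keep=True leaves the environment standing for inspection.
        if not keep:
            # check=False: a failed teardown should not mask the job's own error.
            run([*base, "down", "--remove-orphans"], check=False)
```

Used as `with disposable_environment(name, path): ...`, the down command runs whether the body finishes, raises, or the up step itself fails partway. Note that down is called without --volumes here, so named volumes survive the run.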

Let me be explicit about one thing, because it's the first question a CTO will ask: the agent doesn't run unsupervised. A triage agent first predicts which component the job belongs to and how confidently it can say so (below a threshold, it stops), then a planner produces a plan and a task list, and the disposable environment only spins up after a human approves. The freedom of a throwaway environment isn't about removing control; it's about putting control where it belongs, at the start of the work and at the end on the output.
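The gating order described above can be written down as a small decision function. Everything here is illustrative: the type names, the Decision values and especially the 0.7 threshold are assumptions, since the text does not state the actual cutoff.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    STOP = "stop"                            # triage not confident enough
    WAIT_FOR_APPROVAL = "wait_for_approval"  # plan exists, no human sign-off yet
    PROVISION = "provision"                  # safe to create the environment


@dataclass
class Triage:
    component: str     # which component the job is predicted to belong to
    confidence: float  # 0.0 to 1.0


def next_step(triage: Triage, plan_approved: bool, threshold: float = 0.7) -> Decision:
    """Decide whether a disposable environment may be created for this job."""
    if triage.confidence < threshold:
        return Decision.STOP
    if not plan_approved:
        return Decision.WAIT_FOR_APPROVAL
    return Decision.PROVISION
```

The useful property is the ordering: no branch reaches PROVISION without passing both the confidence check and the human approval, so the environment is only ever created for work someone has already agreed to.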

The speed objection

A fair objection: isn't building an environment from scratch every time slow? It could be. That's why I deliberately bend the "fully disposable" rule in two places.

The first is an image cache. As long as the Dockerfile hasn't changed, the same image gets reused: I hash the recipe (SHA256), and if it's identical I skip the build entirely and the container comes straight up. If it changed, I have to rebuild anyway.
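A sketch of that cache check, with some assumptions spelled out: the tag format (devbox:<first 16 hex chars>) and the function names are invented, and only the Dockerfile text is hashed, so changes to files the Dockerfile copies in would not trigger a rebuild. The existence check relies on `docker image inspect` exiting non-zero for an unknown image, and the build passes the Dockerfile on stdin with `-f-`.

```python
import hashlib
import subprocess


def image_tag_for(dockerfile_text: str, repo: str = "devbox") -> str:
    """Derive a stable tag from the recipe: same text, same tag."""
    digest = hashlib.sha256(dockerfile_text.encode("utf-8")).hexdigest()
    return f"{repo}:{digest[:16]}"


def image_exists(tag: str) -> bool:
    result = subprocess.run(
        ["docker", "image", "inspect", tag],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def docker_build(tag: str, dockerfile_text: str, context_dir: str) -> None:
    subprocess.run(
        ["docker", "build", "-t", tag, "-f-", context_dir],
        input=dockerfile_text.encode("utf-8"),
        check=True,
    )


def ensure_image(dockerfile_text: str, context_dir: str,
                 exists=image_exists, build=docker_build) -> tuple[str, bool]:
    """Return (tag, built). Builds only when no image exists for this recipe."""
    tag = image_tag_for(dockerfile_text)
    if exists(tag):
        return tag, False
    build(tag, dockerfile_text, context_dir)
    return tag, True
```

Because the tag is a function of the recipe, an unchanged Dockerfile resolves to an image that is already present, and an edited one resolves to a tag that does not exist yet, which forces the rebuild.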

The second, and this is the real judgment call: I throw the container away every time, but I don't throw away the dependency cache. Folders like bundle and node_modules live in a persistent named volume; runs come and go, that cache stays put. What should be disposable is the environment's state, not the dependencies themselves. Re-downloading half of Rubygems on every run wouldn't be cleanliness, just waste. Deciding what to keep is as much a part of the architecture as deciding what to throw away.
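In Docker terms, the split between disposable state and kept dependencies can be expressed as a container started with --rm plus named volumes mounted over the dependency directories. The sketch below only assembles the command line; the volume naming scheme and helper names are assumptions. Docker's --rm removes the container and its anonymous volumes but leaves named volumes in place, which is what lets the cache outlive the run.

```python
def cache_volume_args(project_slug: str, cache_paths: dict[str, str]) -> list[str]:
    """Build --mount flags that map named volumes onto dependency folders."""
    args: list[str] = []
    for label, container_path in sorted(cache_paths.items()):
        volume = f"{project_slug}-{label}-cache"  # hypothetical naming scheme
        args += ["--mount", f"type=volume,source={volume},target={container_path}"]
    return args


def run_command(run_name: str, image: str, project_slug: str,
                cache_paths: dict[str, str]) -> list[str]:
    """Assemble a docker run command: throwaway container, persistent caches."""
    return [
        "docker", "run", "--rm", "--name", run_name,
        *cache_volume_args(project_slug, cache_paths),
        image,
    ]
```

For example, run_command("devbox-a3f91c20", "devbox:abc", "shop", {"bundle": "/usr/local/bundle", "node": "/app/node_modules"}) mounts shop-bundle-cache and shop-node-cache; the next run gets a new container name but the same two volumes.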

So "a fresh environment every run" doesn't mean "a build from scratch every run." What's fresh is the container and its state; the image and the dependencies usually arrive ready-made. A small distinction, but it's the one that makes the whole approach practical: the most common objection to a disposable architecture is "it's too expensive," and here the build cost is paid per recipe change, not per run.

What I actually got

The biggest win didn't come from where I expected. Parallelism is nice, isolation is a relief, but the real change was in my own attitude. I no longer care what the agent does inside the container; let it delete, install, make a mess, the container's going to die anyway. That comfort gave me the nerve to give it a much wider lane, and it did better work in it. For a CTO the payoff is concrete: as human oversight per agent drops, the return on the investment becomes something you can scale. A tool that needs no supervision is the tool a team actually gets leverage from.

"It worked on my machine" evaporated too. Because the environment is derived from the project and built the same way every time, there's no gap between what the agent sees, what I see, and what the team sees. Reproducibility is what turns a bug report from something you argue about into something you fix.

To be honest

I won't pretend this is free. Disk usage is real, because there are containers and images everywhere; get the cache strategy wrong and you either lose the speed or fill the disk. Secret management is its own headache, and it can turn into a security and compliance question: passing credentials into a disposable environment without leaving a trace was harder than I assumed. In my setup credentials go to the container only at runtime as environment variables, never written to disk, and tokens are masked in the logs, but you have to know up front to build it that way. And containers don't always die cleanly; they leave junk behind, so I ended up writing a separate orphan sweeper with a time window to protect runs that are still active, plus a deliberate debug flag to leave an environment standing when I need to inspect it. That part is honestly a whole separate post.
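Two of those pieces are small enough to sketch: picking orphans with a protective time window, and masking tokens before a line reaches the logs. The names, the two-hour window and the "***" placeholder are assumptions for illustration; the selection logic is kept separate from any Docker call so it can be checked on plain data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ContainerInfo:
    name: str
    created_at: datetime
    keep_for_debug: bool = False  # set when a run was deliberately left standing


def find_orphans(containers: list[ContainerInfo], now: datetime,
                 max_age: timedelta = timedelta(hours=2),
                 prefix: str = "devbox-") -> list[str]:
    """Names of runner containers old enough to be considered abandoned.

    Anything younger than max_age is assumed to belong to an active run.
    """
    return [
        c.name
        for c in containers
        if c.name.startswith(prefix)
        and not c.keep_for_debug
        and now - c.created_at > max_age
    ]


def mask_secrets(line: str, secret_values: list[str], placeholder: str = "***") -> str:
    """Replace every known secret value in a log line."""
    # Longest first, so a secret that contains another is masked whole.
    for value in sorted(secret_values, key=len, reverse=True):
        if value:
            line = line.replace(value, placeholder)
    return line
```

The time window is the part that matters: a sweeper that deletes every devbox-* container it finds would also kill runs that are still in progress.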

But all of that is engineering detail, and a one-time infrastructure investment: a fixed cost, not a per-agent one. The idea itself holds up: don't treat the environment you give an agent as precious. The moment you do, you're stuck protecting it, and it's that protective instinct that ends up holding the agent back.

What's next

As the number of agents grows, "a separate home for each one" will become the default, less as an infrastructure preference than as a scaling strategy. Spinning up a whole container for a single agent can look wasteful right now, but a separate server per app used to look wasteful too, until containers made it ordinary. The same will happen with agent environments, and the teams that make this bet early are the ones that will have turned the agent from an experiment into a capacity.

Don't open your own machine to the agent. Give it its own throwaway computer. If it breaks it, you build another one.