Enterprises are converging on cloud agents as the future of software engineering — and many are concluding they should build their own. Posts like Stripe's, detailing how they built a homegrown cloud agent, make the path look achievable.
Building cloud agent infrastructure requires two investments: the technical infrastructure to run agents securely and autonomously in the cloud, and the change management to make agents productive across your engineering org. We've spent over two years on both while building Devin. What follows is what we've learned.
The natural starting point for building cloud agents is straightforward: take a CLI agent, containerize it, and give it access to your repos and toolchain. This successfully moves execution to the cloud — but you quickly run into security, persistence, and orchestration issues that need to be solved.
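For concreteness, that first iteration looks something like the sketch below. The image and CLI names (`agent-runtime`, `agent-cli`) are hypothetical stand-ins, not any particular product:

```python
import subprocess

def run_agent_session(repo_path: str, task: str) -> int:
    """Launch one agent session in a throwaway container.

    `agent-runtime` / `agent-cli` are placeholders for whatever
    image and CLI agent you actually use.
    """
    cmd = [
        "docker", "run", "--rm",
        # Mount the repo the agent will work on.
        "-v", f"{repo_path}:/workspace",
        # Pass the host's GIT_TOKEN through. Note: anything in the
        # container's environment is visible to whatever code the
        # agent decides to run.
        "-e", "GIT_TOKEN",
        "agent-runtime:latest",
        "agent-cli", "--workdir", "/workspace", "--task", task,
    ]
    return subprocess.run(cmd).returncode
```

This works, and it is a reasonable way to prototype. The trouble starts once many of these sessions share a host.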
Containerized agents share the host kernel, so a single kernel-level escape from one session exposes every other container's filesystem, credentials, and network connections. Agents generate their own code, run arbitrary commands, and probe their environment in unpredictable ways, which makes that escape a realistic threat rather than a theoretical one.
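You can see the shared kernel directly: run the snippet below on the host, then again inside any container on that host, and both print the same kernel release.

```python
import platform

# Containers are isolated process trees on one shared kernel, so this
# prints the same release string inside a container as on the host.
# A kernel exploit in one agent session is therefore an exploit
# against every session on the machine.
print(platform.release())
```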

The industry consensus for running untrusted code is VM-level isolation: each workload gets its own kernel, with no shared attack surface. This is where the broader infrastructure community has been heading.
Standing up VM-based isolation for agent workloads is a significant undertaking. Our own implementation of microVMs took over a year of hypervisor engineering, ensuring every agent session runs on its own dedicated kernel with fully isolated storage, networking, and compute. A side benefit is that agents running in dedicated VMs can use a full browser, desktop applications, and arbitrary tool stacks, just like a developer on their workstation.
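To make the shape of this concrete without describing our stack, here is roughly what booting one per-session microVM looks like against the API of an open-source hypervisor like Firecracker. The kernel path, rootfs image, and sizing below are placeholders:

```python
import http.client
import json
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket; Firecracker's API listens on one."""
    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)

def api_put(conn: UnixHTTPConnection, path: str, body: dict) -> None:
    conn.request("PUT", path, json.dumps(body),
                 {"Content-Type": "application/json"})
    resp = conn.getresponse()
    resp.read()  # drain so the connection can be reused
    assert resp.status < 300, f"PUT {path}: {resp.status} {resp.reason}"

def boot_session_vm(socket_path: str, rootfs_path: str) -> None:
    """Boot one microVM for one agent session.

    Assumes a `firecracker` process is already running and listening
    on `socket_path`.
    """
    conn = UnixHTTPConnection(socket_path)
    api_put(conn, "/boot-source", {
        "kernel_image_path": "/var/lib/agents/vmlinux",
        "boot_args": "console=ttyS0 reboot=k panic=1",
    })
    api_put(conn, "/drives/rootfs", {
        "drive_id": "rootfs",
        "path_on_host": rootfs_path,  # per-session root filesystem image
        "is_root_device": True,
        "is_read_only": False,
    })
    api_put(conn, "/machine-config", {"vcpu_count": 2, "mem_size_mib": 4096})
    api_put(conn, "/actions", {"action_type": "InstanceStart"})
```

Every session booted this way has its own kernel, so an escape inside the guest stops at the VM boundary instead of at a shared namespace.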
Another problem with containerized agents is that they cannot survive the async gaps that define most real engineering work. An agent opens a PR, waits on CI, responds to code review, reruns tests, and pushes a follow-up commit. Between each step, there are gaps — minutes, hours, sometimes days — where the agent must preserve its full working state. For bounded work like dependency upgrades, a single-pass agent that completes and exits is enough. But work that spans the async gaps of the SDLC remains out of reach.

The root issue is that containers do not provide a reliable way to snapshot an individual container's full state, shut down compute, and restore it later. A containerized agent can only survive async breaks by burning compute to stay alive — and if the container is rescheduled, times out, or crashes, the session is lost.
We solved this by snapshotting full machine state at the hypervisor level — memory, process trees, and filesystem. Compute shuts down while the agent is idle, and the session resumes exactly where it left off when a CI result or review comment arrives. Making this work reliably across thousands of concurrent sessions, each with different repos, dependencies, and runtime environments, took us longer than any other piece of infrastructure we have built to date.
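Illustrating the mechanism again with a Firecracker-style API rather than our own implementation: parking a session and restoring it later looks roughly like this. The field names follow that project's API; the paths and the `api` helper are placeholders:

```python
import http.client
import json
import socket

def api(sock_path: str, method: str, path: str, body: dict) -> None:
    """Minimal client for a hypervisor API served over a Unix socket."""
    conn = http.client.HTTPConnection("localhost")
    conn.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    conn.sock.connect(sock_path)
    conn.request(method, path, json.dumps(body),
                 {"Content-Type": "application/json"})
    resp = conn.getresponse()
    resp.read()
    assert resp.status < 300, f"{method} {path}: {resp.status} {resp.reason}"

def park_session(sock_path: str, session_id: str) -> None:
    """Pause the vCPUs, then write guest memory and device state to disk.
    Afterwards the VMM process can exit: the session costs nothing while
    it waits on CI or code review."""
    api(sock_path, "PATCH", "/vm", {"state": "Paused"})
    api(sock_path, "PUT", "/snapshot/create", {
        "snapshot_type": "Full",
        "snapshot_path": f"/var/lib/agents/{session_id}/vmstate",
        "mem_file_path": f"/var/lib/agents/{session_id}/memory",
    })

def resume_session(sock_path: str, session_id: str) -> None:
    """On a webhook (CI finished, review comment posted), start a fresh
    VMM process and load the snapshot. The agent's processes, shell
    state, and filesystem come back exactly as they were."""
    api(sock_path, "PUT", "/snapshot/load", {
        "snapshot_path": f"/var/lib/agents/{session_id}/vmstate",
        "mem_backend": {
            "backend_type": "File",
            "backend_path": f"/var/lib/agents/{session_id}/memory",
        },
        "resume_vm": True,
    })
```

Because the resumed VMM is a brand-new process, nothing has to stay alive across the gap: compute genuinely goes to zero between events.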
Running hundreds of cloud agents across an engineering org requires orchestration, governance, and integrations, each a multi-quarter infrastructure project on its own. A leading cloud data platform company we spoke with attempted this and ultimately moved on after the project scope overwhelmed their infrastructure team. The challenges they ran into:

- Orchestration: provisioning, scheduling, and recovering fleets of agent VMs as demand fluctuates
- Governance: controlling what agents can access, and auditing what they did
- Integrations: wiring agents into repos, CI, code review, and the rest of the internal toolchain
The pattern we've seen, across conversations with teams attempting this, is that the combined surface area is what becomes untenable — not any single piece, but the fact that all three have to be built, integrated, and maintained indefinitely. We currently staff a dedicated team to manage each layer of this stack. Our solution for the orchestration layer took over three quarters of dedicated engineering to build and can manage thousands of concurrent VMs — handling provisioning, demand prediction, crash recovery, and teardown.
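The heart of an orchestration layer is a reconcile loop. The sketch below is deliberately simplified and entirely hypothetical; `store` and `pool` stand in for a session database and a fleet of hypervisor hosts:

```python
import time
from dataclasses import dataclass

@dataclass
class Session:
    id: str
    state: str  # "queued" | "running" | "parked" | "done"

def reconcile(store, pool) -> None:
    """One pass of the control loop: converge actual VM state toward
    what the session store says should exist."""
    for s in store.list_sessions():
        vm = pool.find_vm(s.id)
        if s.state == "queued" and vm is None:
            pool.boot_vm(s.id)                # provisioning on demand
        elif s.state == "running" and (vm is None or not vm.healthy()):
            pool.restore_from_snapshot(s.id)  # crash recovery
        elif s.state == "parked" and vm is not None:
            pool.snapshot_and_stop(s.id)      # free compute during async gaps
        elif s.state == "done" and vm is not None:
            pool.teardown(s.id)               # reclaim disk and network

def control_loop(store, pool, interval_s: float = 5.0) -> None:
    while True:
        reconcile(store, pool)
        time.sleep(interval_s)
```

A production version layers demand prediction, pre-warmed capacity, and failure handling on top of this loop, which is where the scope grows.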
Everything we have discussed is what we consider the first phase: building the infrastructure to deploy cloud agents at scale. The second is transforming how your engineering organization actually works with them, and this process cannot start until the agents are deployed.
Every engineering process inside an enterprise was designed for a world where humans do the work: how projects get scoped, how teams get staffed, how code gets reviewed and shipped. When agents are doing a significant share of the execution, those processes need to be rebuilt around a different operating model, one where agents execute and humans direct, review, and decide.
Getting there is both a technical and an operational challenge. It requires people who understand the engineering systems and the business processes around them, many of which are deeply embedded and often not even documented. The questions it raises touch every part of how an engineering org operates, and none of them have straightforward answers.
Very few of these changes can be designed in advance. Teams develop fluency by operating with agents on real projects over months. Starting earlier means your org is further along the learning curve — and that gap widens over time.
Itaú, the largest private bank in Latin America, is eleven months in with nearly 17,000 engineers. In that time, they have completed migrations 5 to 6x faster, auto-remediated 70% of static-analysis security vulnerabilities, and doubled test coverage.
Building cloud agent infrastructure has been a serious engineering commitment for us, and whether you decide to build in-house or work with an existing platform, we hope this post gave a useful picture of what the investments involve.
If you're thinking about how to get started — reach out here.