April 23, 2026

What We Learned Building Cloud Agents

by The Cognition Team

    Enterprises are converging on cloud agents as the future of software engineering — and many are concluding they should build their own. Posts like Stripe's, detailing how they built a homegrown cloud agent, make the path look achievable.

    Building cloud agent infrastructure requires two investments: the technical infrastructure to run agents securely and autonomously in the cloud, and the change management to make agents productive across your engineering org. We've spent over two years on both for Devin. What follows is what we've learned.

    The Right Approach to Building Cloud Agents

    The natural starting point for building cloud agents is straightforward: take a CLI agent, containerize it, and give it access to your repos and toolchain. This successfully moves execution to the cloud — but you quickly run into security, persistence, and orchestration issues that need to be solved.

    A shared kernel is a security threat

    Containerized agents share a kernel, which means a single kernel exploit in one session can expose every other container's filesystem, credentials, and network connections. Agents generate their own code, run arbitrary commands, and probe the environment in unpredictable ways, making a kernel-level escape a real security threat.

    The industry consensus for running untrusted code is VM-level isolation — each workload gets its own kernel, with no shared attack surface. This is the direction the broader infrastructure community has been moving toward.

    Standing up VM-based isolation for agent workloads is a significant undertaking. Our own implementation of microVMs took over a year of hypervisor engineering, ensuring every agent session runs on its own dedicated kernel with fully isolated storage, networking, and compute. A side benefit is that agents running in dedicated VMs can use a full browser, desktop applications, and arbitrary tool stacks, just like a developer on their workstation.
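    Our hypervisor stack isn't public, so as an illustrative sketch: this is the minimal request sequence a Firecracker-style microVM monitor expects in order to boot a dedicated kernel for one session. The endpoint names and payload fields follow Firecracker's public API; the resource sizes and the idea of one root filesystem per session are illustrative assumptions, not our actual configuration.

```python
def microvm_boot_calls(kernel_path, rootfs_path, vcpus=2, mem_mib=4096):
    """Build the (method, endpoint, payload) sequence to boot one microVM.

    Endpoint names and fields follow Firecracker's REST-over-unix-socket
    API; in production these would be PUT against the monitor's socket.
    """
    return [
        # size the guest: its own vCPUs and memory, no shared kernel
        ("PUT", "/machine-config", {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        # dedicated guest kernel image per session
        ("PUT", "/boot-source", {
            "kernel_image_path": kernel_path,
            "boot_args": "console=ttyS0 reboot=k panic=1",
        }),
        # per-session root filesystem, isolated from every other session
        ("PUT", "/drives/rootfs", {
            "drive_id": "rootfs",
            "path_on_host": rootfs_path,
            "is_root_device": True,
            "is_read_only": False,
        }),
        # start the guest
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]
```

    The point of the sketch is the unit of isolation: every resource in the call sequence belongs to exactly one session, so there is no shared attack surface to escape into.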

    Containerized agents can't complete real engineering work

    Another problem with containerized agents is they cannot survive the async gaps that define most real engineering work. An agent opens a PR, waits on CI, responds to code review, reruns tests, and pushes a follow-up commit. Between each step, there are gaps — minutes, hours, sometimes days — where the agent must preserve its full working state. For bounded work like dependency upgrades, a single-pass agent that completes and exits is enough. But work that spans the async gaps of the SDLC remains out of reach.

    The root issue is that containers do not provide a reliable way to snapshot an individual container's full state, shut down compute, and restore it later. A containerized agent can only survive async breaks by burning compute to stay alive — and if the container is rescheduled, times out, or crashes, the session is lost.

    We solved this by snapshotting full machine state at the hypervisor level — memory, process trees, and filesystem. Compute shuts down while the agent is idle, and the session resumes exactly where it left off when a CI result or review comment arrives. Making this work reliably across thousands of concurrent sessions, each with different repos, dependencies, and runtime environments, took us longer than any other piece of infrastructure we have built to date.
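    A toy model of that suspend/resume cycle, with a pickled Python object standing in for the hypervisor-level machine image (real snapshots capture guest memory, process trees, and disk; the class and event names here are hypothetical):

```python
import pickle

class AgentSession:
    """Minimal stand-in for a running agent session's full working state."""
    def __init__(self, repo):
        self.repo = repo
        self.step = 0
        self.log = []

    def run(self, action):
        self.step += 1
        self.log.append(action)

def suspend(session):
    # serialize the whole session, then release its compute
    return pickle.dumps(session)

def resume(snapshot):
    # rebuild the session exactly where it left off when an
    # external event (CI result, review comment) arrives
    return pickle.loads(snapshot)
```

    The agent pays for compute only while it is working; between the snapshot and the restore, nothing is running, yet no state is lost.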

    Scaling from one session to hundreds requires its own infrastructure

    Running hundreds of cloud agents across an engineering org requires orchestration, governance, and integrations — each a multi-quarter infrastructure project on its own. A leading cloud data platform company we spoke with attempted this and ultimately moved on after the project scope overwhelmed their infrastructure team. The challenges they ran into:

    • Orchestration: Each agent session is unique — tied to a specific task and engineer's permissions. Running hundreds concurrently requires provisioning the right environment for each one, routing sessions correctly, predicting demand to keep warm VM pools ready, and keeping every provisioned environment current as codebases change daily.
    • Governance: Each session must inherit the dispatching engineer's permissions across every system it touches, with every action recorded in a tamper-evident audit trail. Building and maintaining identity chaining, access scoping, and audit logging at enterprise scale is its own engineering project that requires ongoing maintenance.
    • Integrations: An agent is only as useful as the systems it can reach — CI, monitoring, package registries, documentation, source control. Each has its own authentication model, permission scoping, and maintenance burden. Stripe has an internal MCP server with over 400 tools to keep their agents connected. That is the scale of investment this layer demands.
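    To make the governance bullet concrete, here is a minimal sketch of a tamper-evident audit trail as a hash chain: each entry's hash covers the previous entry, so any retroactive edit invalidates every later entry. This is an illustration of the property, not our implementation; production systems add signing, durable storage, and identity chaining on top.

```python
import hashlib
import json

def append_event(chain, actor, action):
    """Append an audit event whose hash commits to the previous entry."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    body = {"actor": actor, "action": action, "prev": prev}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append({**body, "hash": digest})

def verify(chain):
    """Recompute every hash; any edited entry breaks the chain."""
    prev = "0" * 64
    for e in chain:
        body = {"actor": e["actor"], "action": e["action"], "prev": prev}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if e["prev"] != prev or e["hash"] != expected:
            return False
        prev = e["hash"]
    return True
```

    Chaining is what makes the trail tamper-evident rather than merely append-only: a log where entries are independent can be rewritten silently, while here a single forged field fails verification.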

    The pattern we've seen, across conversations with teams attempting this, is that the combined surface area is what becomes untenable — not any single piece, but the fact that all three have to be built, integrated, and maintained indefinitely. We currently staff a dedicated team to manage each layer of this stack. Our solution for the orchestration layer took over three quarters of dedicated engineering to build and can manage thousands of concurrent VMs — handling provisioning, demand prediction, crash recovery, and teardown.
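    One small piece of that orchestration work, sketched under stated assumptions: a warm-pool sizing policy that targets a moving average of recent session starts with headroom for bursts, so dispatch rarely waits on a cold VM boot. The headroom factor, floor, and averaging window are hypothetical; real predictors are typically time-of-day and team aware.

```python
def warm_pool_target(recent_starts, headroom=1.5, floor=2):
    """Target warm-VM count from recent session starts per interval.

    recent_starts: counts of sessions started in recent intervals.
    headroom: multiplier absorbing demand bursts (assumed value).
    floor: minimum pool size so dispatch never cold-boots (assumed).
    """
    if not recent_starts:
        return floor
    avg = sum(recent_starts) / len(recent_starts)
    return max(floor, round(avg * headroom))
```

    Even a policy this simple has to be re-run continuously and fed back into provisioning and teardown, which is where the ongoing maintenance cost of the orchestration layer comes from.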


    The Right Approach to Deploying Cloud Agents

    Everything we have discussed is what we consider the first phase: building the infrastructure to deploy cloud agents at scale. The second is transforming how your engineering organization actually works with them, and this process cannot start until the agents are deployed.

    Engineering processes need to be rebuilt for agents

    Every engineering process inside an enterprise was designed for a world where humans do the work: how projects get scoped, how teams get staffed, how code gets reviewed and shipped. When agents are doing a significant share of the execution, those processes need to be rebuilt around a different operating model, one where agents execute and humans direct, review, and decide.

    Getting there is both a technical and operational challenge. It requires people who understand the engineering systems and the business processes around them, many of which are deeply embedded and often not even documented. The questions it raises touch every part of how an engineering org operates, and none of them have straightforward answers:

    • Engineer fluency: Engineers need to learn which work to delegate and which to keep, and how to define tasks precisely enough that agents execute without constant correction. Managing concurrent agent sessions is a fundamentally different skill from writing code, and it takes months of practice on real projects to develop.
    • Planning and resource allocation: Every assumption about team sizing, sprint capacity, and project staffing changes when agent capacity enters the equation. These aren't one-time decisions. They have to be revisited as agents get more capable and engineers get more fluent.
    • Review and quality standards: The volume of code that needs review increases dramatically, but the review process designed for human-authored code doesn't transfer cleanly. Teams need to establish what rigorous review looks like for agent-produced work at a much higher volume.

    Very few of these changes can be designed in advance. Teams develop fluency by operating with agents on real projects over months. Starting earlier means your org is further along the learning curve — and that gap widens over time.

    Itaú, the largest private bank in Latin America, is eleven months in with nearly 17,000 engineers — and has completed migrations 5 to 6x faster, auto-remediated 70% of static-analysis security vulnerabilities, and increased test coverage by 2x.

    Building cloud agent infrastructure has been a serious engineering commitment for us, and whether you decide to build in-house or work with an existing platform, we hope this post gave a useful picture of what the investments involve.

    If you're thinking about how to get started — reach out here.
