Product

We Tried Every AI Sandbox. Then We Built Our Own.

Hosted sandboxes couldn't reach our private APIs. Self-hosted options needed dedicated servers. The best open-source project lacked persistence. So we forked it, added persistent filesystems, tiered pause/resume, and network security — and open-sourced the result.

Khai Trinh

Tags: opensandbox · opensource · kubernetes · overlayfs · aiagents · sandbox · eks · networking · security


One of our AI agents spent 40 minutes debugging a cloud networking issue. It installed diagnostic tools, traced network configurations, and correlated logs across multiple services. Three commands from the root cause — the kind of deep investigation that separates a useful agent from a chatbot.

Then the session timed out. Everything gone. The agent started over from zero, reinstalling the same tools, re-tracing the same logs, re-learning the same context. Forty minutes of compute burned twice.

The problem wasn't just cost — it was the architecture: sandboxes that lived outside our network, couldn't reach our internal APIs, required enterprise plans for basic features like private network access, and gave us no control over what our agents could access. When you're running AI agents against production infrastructure, "it works but we can't see what it's doing" isn't a feature — it's a liability.

So we evaluated every option on the market, found an open-source project with the right foundation, added the production features it was missing, and open-sourced the result.

TL;DR: OpenSandbox on EKS is self-hosted AI sandbox infrastructure with persistent filesystems, tiered pause/resume, network security controls, and security hardening, all running in your own AWS account. Get started.


Choosing the Right Sandbox

We evaluated the main categories of AI sandbox options. Every category broke for different reasons.

Hosted Sandbox APIs

Managed infrastructure, fast cold starts, SDKs in every language — we had agents running in an afternoon. Then we tried to connect an agent to our internal Grafana instance. Nothing. The sandbox lived in the provider's network — completely isolated from our private infrastructure. Internal APIs, private databases, monitoring dashboards — all unreachable. Connecting to our network required expensive enterprise tiers ($3K+/month), and even then you don't control the underlying infrastructure.

The pricing model was also wrong for AI workloads. Per-second billing works for quick, short-lived tasks. AI agents are different — they run long sessions, install tools, explore systems, and iterate. You're paying for compute the entire time, with no way to pause a sandbox to cheaper storage when it's idle.

Some hosted sandboxes now offer basic outbound traffic filtering, but you're configuring security rules through an API you don't own. When an incident happens, you can't inspect the rules, review traffic logs, or audit what your agent actually accessed.

Self-Hosted VM-Based Alternatives

The alternative camp offered something different: run sandboxes on your own infrastructure. Open source, self-hosted, full control. Exactly what we wanted — in theory.

In practice, these tools required dedicated physical servers or traditional VMs. They weren't designed for Kubernetes — the orchestration platform we'd already invested years building around. That meant no automatic scaling, no cost-saving spot instances, and no integration with the infrastructure we already had.

The operational model was also limiting. Most VM-based solutions only support destroy-and-recreate — there's no concept of pausing a sandbox and resuming it later with state intact. For AI agents that work on multi-day investigations, this is a dealbreaker. You either keep the VM running (expensive) or destroy it and lose everything (wasteful).

Finding a Kubernetes-Native Foundation

After these dead ends, we found Alibaba's OpenSandbox — an open-source project that got the fundamentals right.

It was Kubernetes-native from the ground up — designed for container orchestration, not a VM tool with a Kubernetes wrapper bolted on. It included a lifecycle management server, an execution engine inside each sandbox, and SDKs in five languages (Python, TypeScript, Java, C#, Go).

But honest assessment: it was a foundation, not a production system. For running on AWS at scale, critical features were missing:

  • No filesystem persistence — sandbox state didn't survive restarts
  • No tiered pause/resume — no way to suspend sandboxes at different cost tiers
  • No network security beyond basic defaults — no outbound filtering, no cloud credential protection
  • Minimal base image — missing the programming languages, CLIs, and tools agents actually need
  • Security gaps — admin access, excessive permissions, weak isolation

The foundation was right. The production layer was ours to build.

Problems We Kept Hitting
01
No Private Connectivity: Hosted sandboxes live outside your network. Agents can't reach internal APIs, private databases, or monitoring systems without ugly workarounds.
02
Expensive at Scale: Per-second billing compounds fast with long-running AI sessions. No way to pause idle sandboxes to cheaper storage tiers.
03
No Affordable Persistence: Hosted persistence requires expensive enterprise tiers. Self-hosted VMs offer destroy-and-recreate only. No tiered pause/resume to optimize idle costs.
04
Limited Network Visibility: Hosted sandboxes offer basic egress rules, but you can't see or audit the traffic. Self-hosting gives you full visibility into what your agents access.

Architecture in 60 Seconds

System Architecture

OpenSandbox Request Flow

[Diagram: SDKs (Python · TS · Java · C# · Go) → HTTP / SSE → API Server → Cluster Runtime → Sandbox Resource → Controller (manages lifecycle, creates pod) → Sandbox Pod (per agent)]

SDK calls create Sandbox resources, which the controller reconciles into isolated containers with persistent filesystems.

The request flow is straightforward:

  1. Call Sandbox.create() via any SDK (Python, TypeScript, Java, C#, Go)
  2. FastAPI lifecycle server creates a Sandbox resource in Kubernetes
  3. The controller creates an isolated container with persistent storage and network rules
  4. A setup step installs the execution engine into the container
  5. The startup script sets up the persistent filesystem and security restrictions
  6. SDK communicates with the execution engine for code execution, file ops, and shell commands

The entire lifecycle — create, pause, resume, archive, terminate — is managed through standard Kubernetes resources. No extra orchestration layers. No external databases. Standard tools work for debugging.
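Concretely, the Sandbox resource the lifecycle server creates might look something like the following sketch. The API group, field names, and thresholds here are illustrative assumptions, not the project's actual CRD schema:

```yaml
# Hypothetical Sandbox custom resource -- field names are illustrative,
# not OpenSandbox's actual schema.
apiVersion: sandbox.example.io/v1alpha1
kind: Sandbox
metadata:
  name: agent-a-debug-session
spec:
  image: sandbox-base:latest   # illustrative image reference
  storage:
    size: 20Gi
    persistent: true
  lifecycle:
    idlePauseAfter: 2h         # pause to warm tier after 2 hours idle
    archiveAfter: 72h          # snapshot to cold tier after 3 days idle
```

Because it's a standard custom resource, `kubectl get`, `kubectl describe`, and the usual event stream all work on it for debugging.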

In an upcoming post, we'll do a full technical deep dive into filesystem persistence, controller design, network security, and storage tiering.


What We Added

Now that you've seen how the pieces fit together, here's what we built on top of the open-source foundation.

1. Persistent Filesystem

Standard containers lose all changes when they restart. We added a persistent filesystem layer so every package install, config change, and downloaded file survives restarts. The agent installs tools once. They're still there next week.
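In SDK terms, the workflow this enables looks roughly like the following pseudocode. Only `Sandbox.create()` appears in the SDK flow above; the other method names are illustrative, not the actual OpenSandbox API surface:

```python
# Pseudocode -- method names other than Sandbox.create() are illustrative.
sandbox = Sandbox.create(image="sandbox-base")
sandbox.run_command("pip install boto3 pandas")   # install tools once
sandbox.pause()                                   # idle -> warm tier

# ...days later, possibly on a different node...
sandbox.resume()
sandbox.run_command("python -c 'import boto3'")   # still installed
```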

2. Tiered Pause & Resume

Not all sandboxes are equally active. Running sandboxes sit at full compute cost, but idle ones should cost almost nothing.

| Tier            | What's kept         | Resume time | Cost (20GB) |
|-----------------|---------------------|-------------|-------------|
| Active          | Full compute + disk | Instant     | ~$5/mo      |
| Warm (paused)   | Disk only           | ~5 seconds  | ~$1.60/mo   |
| Cold (archived) | Snapshot only       | ~30 seconds | ~$0.50/mo   |

Snapshots only store the data your agent actually wrote — not the full 20GB volume. A sandbox that used 10GB of its 20GB disk costs about $0.50/mo as a snapshot.
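The warm and cold figures in the table follow from a simple storage model. This sketch assumes rough EBS gp3 (~$0.08/GB-month) and snapshot (~$0.05/GB-month) rates; those are our assumptions for illustration, not quoted AWS prices:

```python
# Rough storage-cost model for the three tiers. Prices are assumptions
# (approximate EBS gp3 and snapshot rates) -- check current pricing
# for your region.
EBS_GB_MONTH = 0.08       # gp3 volume, $/GB-month (assumed)
SNAPSHOT_GB_MONTH = 0.05  # EBS snapshot, $/GB-month (assumed)

def monthly_storage_cost(tier: str, volume_gb: int, used_gb: int) -> float:
    """Monthly storage cost for one sandbox in a given tier.

    Active sandboxes also pay for compute (hence ~$5/mo total);
    this models storage only.
    """
    if tier in ("active", "warm"):
        # The EBS volume is provisioned at full size either way.
        return volume_gb * EBS_GB_MONTH
    if tier == "cold":
        # Snapshots store only the blocks the agent actually wrote.
        return used_gb * SNAPSHOT_GB_MONTH
    raise ValueError(f"unknown tier: {tier!r}")

print(monthly_storage_cost("warm", volume_gb=20, used_gb=10))  # ~1.60
print(monthly_storage_cost("cold", volume_gb=20, used_gb=10))  # ~0.50
```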

Sandbox Lifecycle

Tiered Pause / Resume

[Diagram: Running (~$5/mo, ready instantly) → pause → Paused (~$1.60/mo, resumes in ~5s) → archive → Archived (~$0.50/mo, resumes in ~30s) → resume, everything restored]

Cost Per Agent · 20GB EBS

No tiering: ~$5/mo. Every sandbox stays at full cost, active or idle.

3-tier storage: ~$0.50/mo. Idle sandboxes archive to cold, 10x cheaper.

Why this matters: Agents are idle most of the time — they only wake when an event fires (PR opened, issue created, scheduled task). Cold storage at $0.50/mo is the steady state, not the exception.

An agent that's idle for a few hours gets paused to warm. Idle for days, archived to cold. Resume from any tier restores the full filesystem — every installed tool, every config file, every work-in-progress artifact.
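That pause/archive behavior can be sketched as a small policy function. The thresholds here are illustrative assumptions, not the project's actual defaults:

```python
from datetime import timedelta

# Illustrative idle thresholds -- not OpenSandbox's actual defaults.
PAUSE_AFTER = timedelta(hours=2)   # active -> warm (paused)
ARCHIVE_AFTER = timedelta(days=3)  # warm -> cold (archived)

def next_tier(current_tier: str, idle_for: timedelta) -> str:
    """Decide which storage tier an idle sandbox should move to next."""
    if current_tier == "active" and idle_for >= PAUSE_AFTER:
        return "warm"
    if current_tier == "warm" and idle_for >= ARCHIVE_AFTER:
        return "cold"
    return current_tier  # no transition yet

print(next_tier("active", timedelta(hours=3)))  # warm
print(next_tier("warm", timedelta(days=5)))     # cold
print(next_tier("warm", timedelta(hours=6)))    # warm
```

In production, a check like this would run inside the controller's reconcile loop, driven by last-activity timestamps.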

3. Network Security & Filtering

Every sandbox gets its own network filtering layer. Unlike hosted solutions where you configure rules through an API you don't control, OpenSandbox gives you full visibility — standard network policies you can audit, customize, and integrate with your existing security tooling.
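As one concrete example of an auditable rule, a standard Kubernetes NetworkPolicy can block sandbox pods from reaching the EC2 instance metadata endpoint, a common path to leaked cloud credentials. This is a minimal sketch; the pod label is an illustrative assumption:

```yaml
# Illustrative egress policy for sandbox pods: allow outbound traffic
# but block the EC2 instance metadata endpoint (169.254.169.254),
# which would otherwise expose node cloud credentials.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sandbox-block-metadata
spec:
  podSelector:
    matchLabels:
      app: opensandbox   # illustrative label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32
```

Because it's a plain Kubernetes object, you can version it, review it in PRs, and audit it with the same tooling as the rest of your cluster config.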

4. Pre-Built Environment

Every sandbox comes with programming languages, cloud tools, monitoring integrations, and developer utilities pre-installed. No more spending the first 5 minutes of every session installing tools.

5. Security Hardening

Running untrusted AI-generated code requires a locked-down environment:

  • No admin access — agents run as unprivileged users
  • Privilege escalation blocked — no way for code to gain elevated permissions
  • Minimal permissions — only what's needed during initial setup, then immediately restricted
  • Shared directories locked down — no hijacking temp files or shared paths
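These restrictions map naturally onto a standard Kubernetes securityContext. A minimal sketch, with illustrative names and values:

```yaml
# Illustrative pod-level hardening for a sandbox container.
apiVersion: v1
kind: Pod
metadata:
  name: sandbox-pod
spec:
  securityContext:
    runAsNonRoot: true     # no admin access inside the sandbox
    runAsUser: 1000
  containers:
    - name: sandbox
      image: sandbox-base:latest         # illustrative image name
      securityContext:
        allowPrivilegeEscalation: false  # block setuid/sudo escalation
        capabilities:
          drop: ["ALL"]                  # minimal kernel capabilities
```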
What We Added to OpenSandbox
01
Persistent Filesystem: Install packages once and they survive pause/resume. No more cold environments.
02
Tiered Pause & Resume: Pause idle sandboxes to save ~68%. Resume in ~5s warm, ~30s cold. Storage drops to ~$0.50/mo.
03
Network Security & Filtering: Each sandbox gets its own network rules. Block access to internal services, cloud credentials, and sensitive endpoints.
04
Pre-Built Environment: Programming languages, cloud tools, monitoring integrations, and developer utilities — all pre-installed.
05
MCP Servers Included: Monitoring, database, and infrastructure integrations ready to go out of the box.
06
Security Hardening: No admin access, privilege escalation blocked, minimal permissions, shared directories locked down.

OpenSandbox at Scale

Here's what OpenSandbox looks like running at scale on AWS:

Production Architecture

OpenSandbox on EKS

[Diagram: AWS account → VPC (private network) → auto-scaled EKS cluster running the API Server (lifecycle, SDK gateway) and Controller (reconcile, scale, cleanup). Network-isolated agent pods (Agent A: K8s debugging; Agent B: security audit; Agent C: log analysis), each with python3, node, aws, kubectl, git, curl, jq, a shell, and a persistent volume. Agents D and E paused, F and G archived. Active ~$5/mo · Paused ~$1.60/mo · Archived ~$0.50/mo. Backing services: EBS volumes (persistent disks), EBS snapshots (cold storage), ECR (sandbox images).]

Each AI agent gets its own isolated pod with persistent storage, pre-installed tools, and network filtering. Idle agents pause to save ~68%. Dormant agents archive to snapshots at ~$0.50/mo.

Every AI agent at CloudThinker runs inside its own isolated OpenSandbox environment:

What's Inside Each Sandbox (one OpenSandbox pod per agent, with a persistent filesystem):

  • Cloud CLIs: aws, gcloud, az
  • MCP Servers: grafana, elasticsearch, sonarqube, zabbix
  • Cluster & Git: kubectl, helm, git, gh
  • Dev Utilities: ripgrep, jq, curl, python3, node, bun

The typical lifecycle: an agent is assigned a task, creates or resumes a sandbox, works for minutes or hours, then goes idle. The system automatically pauses to warm storage. If the user comes back next week, the sandbox resumes from a snapshot in about 30 seconds — every tool still installed, every file still in place.

Here's a concrete example. One agent was debugging an intermittent connection timeout between a cloud service and a database. Day one: installed diagnostic tools, traced the issue to a networking bug. Day two: the engineer asked the agent to continue. It resumed in 4 seconds with everything intact — captured data, scripts, partial analysis. No reinstallation. It identified the fix within an hour.


What We Learned

Persistence changes everything

One of our agents had been working for 11 hours — it had installed diagnostic tools, cloned repositories, and was correlating logs across multiple services. At 3 AM, the spot instance it was running on got terminated.

The system moved it to a new server. Every file, every tool, every piece of progress — intact. The agent didn't even know it moved.

Without persistence, that's 11 hours of wasted compute. With it, it's a non-event. This is what convinced us persistence isn't optional — it's the foundation. Everything else we build assumes the sandbox survives failures.

Design for idle, not for active

Agents are event-driven — they activate when a PR opens or an issue is created, then sit idle. Without tiered storage, you pay full price for every sandbox whether it's working or not.

With tiered storage, idle sandboxes drop from ~$5/mo to ~$0.50/mo — about 90% cheaper. At 200 sandboxes with 80% idle in cold storage, that's 40 × $5 + 160 × $0.50 ≈ $280/mo instead of ~$1,000/mo.

The insight: most infrastructure is designed for peak load. Agent infrastructure should be designed for idle.

Why Open Source

When AI runs code against production, you need to see the isolation layer. If an agent executes kubectl delete namespace production, the difference between "that's fine, it was sandboxed" and a career-ending incident is the sandbox implementation. That implementation shouldn't be a black box behind an API.

The ecosystem also needs a shared standard. Every AI platform is building proprietary sandbox infrastructure — duplicated engineering that doesn't interoperate. We'd rather compete on what agents accomplish than on container plumbing.

And we need help. Specifically:

  • Firecracker / gVisor — stronger isolation using secure container runtimes
  • Local development — running sandboxes on your laptop without a full cloud cluster

If any of these problems interest you, the issue tracker has tagged good-first-issue items for each.


What's Next

  • Secure container runtimes — gVisor, Kata Containers, and Firecracker for deeper isolation at the operating system level
  • Tiered cold storage — automatic archival for sandboxes that haven't been accessed in weeks, driven by usage metrics
  • Snapshot pre-warming — predictive resume that restores sandboxes before the user asks, based on usage patterns

Get Involved

Building AI agents and fighting sandbox problems? Open an issue — we read every one.


Coming Up Next

Part 2: 1,000 AI Agents, One Cluster — The Architecture That Holds

The full technical deep dive: how OverlayFS keeps agent work alive across node failures, how tiered storage cuts costs by 80-90%, and why blocking one IP address prevents an entire class of privilege escalation attacks.

Coming soon — follow us on Facebook or LinkedIn to get notified.