Show HN: Sub-millisecond VM sandboxes using CoW memory forking

github.com

106 points by adammiribyan 16 hours ago

I wanted to see how fast an isolated code sandbox could start if I never had to boot a fresh VM.

So instead of launching a new microVM per execution, I boot Firecracker once with Python and numpy already loaded, then snapshot the full VM state. Every execution after that creates a new KVM VM backed by a `MAP_PRIVATE` mapping of the snapshot memory, so Linux gives me copy-on-write pages automatically.

That means each sandbox starts from an already-running Python process inside a real VM, runs the code, and exits.

These are real KVM VMs, not containers: separate guest kernel, separate guest memory, separate page tables. When a VM writes to memory, it gets a private copy of that page.
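For readers who have not used `MAP_PRIVATE`, here is a minimal sketch of the copy-on-write behavior described above. It is not the project's code: the snapshot is a small temporary file standing in for guest memory, and `make_snapshot`, `fork_view`, and `demo` are hypothetical names chosen for illustration. It assumes a Unix-like system, since `mmap.MAP_PRIVATE` is not available elsewhere.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def make_snapshot(path):
    # Hypothetical stand-in for a VM memory snapshot: two pages of b"A".
    with open(path, "wb") as f:
        f.write(b"A" * PAGE * 2)


def fork_view(path):
    # MAP_PRIVATE asks the kernel for copy-on-write pages: reads are served
    # from the file's pages, and the first write to a page gives this mapping
    # its own private copy. The file is opened read-only and never modified.
    fd = os.open(path, os.O_RDONLY)
    try:
        return mmap.mmap(
            fd,
            0,  # length 0 maps the whole file
            flags=mmap.MAP_PRIVATE,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        )
    finally:
        os.close(fd)  # mmap keeps its own duplicate of the descriptor


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "snapshot.mem")
        make_snapshot(path)
        a = fork_view(path)
        b = fork_view(path)
        a[0:4] = b"BBBB"  # only mapping `a` sees this write
        with open(path, "rb") as f:
            on_disk = f.read(4)
        result = (a[0:4], b[0:4], on_disk)
        a.close()
        b.close()
        return result
```

Calling `demo()` returns the first four bytes as seen by the writer, by the second mapping, and by the file: the writer sees its change, while the other mapping and the snapshot stay untouched.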

The hard part was not CoW itself. The hard part was resuming the snapshotted VM correctly.

Rust, Apache 2.0.

cperciva 4 hours ago

Don't forget about entropy! You've just created two identical copies of all of your random number generators, which could be very very bad for security.

The Firecracker team wrote a very good paper about addressing this when they added snapshot support.

  • Retr0id 2 hours ago

    I suppose it'd be easy enough to re-seed RNGs, but re-relocating ASLR sounds like a pain. (Although I suppose for Python that doesn't matter)

    • cperciva 14 minutes ago

      Re-seeding is easy. The hard parts are (a) finding everything which needs to be reseeded -- not just explicit RNGs but also things like keys used to pick outgoing port numbers in a pseudorandom order -- and (b) making sure that all the relevant code becomes aware that it was just forked -- not necessarily trivial given that there's no standard "you just got restarted from a snapshot" signal in UNIX.
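To make the duplicated-state problem concrete, here is a small sketch using Python's userspace `random.Random`. Restoring a saved state into several generators stands in for resuming several VMs from one snapshot; `clone_from_snapshot` and `reseed` are hypothetical helper names, not part of any snapshot tooling. It only covers one explicit RNG, which is exactly the easy part described above; it does nothing for other hidden state.

```python
import os
import random


def clone_from_snapshot(rng, count):
    # Simulates resuming `count` sandboxes from one snapshot: every clone
    # starts with a byte-for-byte identical generator state.
    state = rng.getstate()
    clones = []
    for _ in range(count):
        clone = random.Random()
        clone.setstate(state)
        clones.append(clone)
    return clones


def reseed(rng):
    # Mix in fresh entropy from the operating system after the "fork".
    # In a real guest this must run after resume, not before the snapshot.
    rng.seed(os.urandom(32))


def demo():
    parent = random.Random(1234)
    a, b = clone_from_snapshot(parent, 2)
    same_before = a.getrandbits(64) == b.getrandbits(64)
    reseed(a)
    reseed(b)
    same_after = a.getrandbits(128) == b.getrandbits(128)
    return same_before, same_after
```

Before reseeding, both clones produce the same value; after reseeding from `os.urandom`, a collision is astronomically unlikely. The sketch assumes the guest has some hook that tells it a resume just happened, which is the missing signal the comment points out.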

    • hinkley an hour ago

      Off the cuff, the first step for ASLR is to not publish your images and to rotate your snapshots regularly.

      The old FastCGI trick is to buffer the forking by idling half a dozen or ten copies of the process and initializing new instances in the background while the existing pool is servicing new requests. By my count we are reinventing FastCGI for at least the fourth time.
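A rough sketch of that pre-warmed pool idea, under stated assumptions: `WarmPool` is a hypothetical class, `factory` stands for whatever expensive initialization creates one instance, and the refill is an explicit call here, where a real server would run it on a background thread or task.

```python
from collections import deque


class WarmPool:
    def __init__(self, factory, target_size=6):
        self._factory = factory      # expensive "start one instance" callable
        self._target = target_size   # how many idle instances to keep ready
        self._idle = deque()
        self.refill()

    def idle_count(self):
        return len(self._idle)

    def refill(self):
        # Top the pool back up to the target size. In production this would
        # run in the background while take() keeps serving requests.
        while len(self._idle) < self._target:
            self._idle.append(self._factory())

    def take(self):
        # Fast path: hand out an instance that was initialized earlier.
        if self._idle:
            return self._idle.popleft()
        # Slow path: the pool is drained, so pay the startup cost inline.
        return self._factory()
```

A request handler would call `take()` and then schedule `refill()`, so the startup cost is paid off the request path as long as the pool is not drained.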

      Long-running tasks are less sensitive to startup delays: we care a lot about a 4-second task taking an extra five seconds, and much less about a 1-minute task taking 1:05. It amortizes out even in Little’s Law.
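The two cases in that comparison work out as follows. This is only the arithmetic from the numbers in the comment; `relative_overhead` is a hypothetical helper name.

```python
def relative_overhead(task_seconds, startup_seconds):
    # Startup delay expressed as a fraction of the useful work time.
    return startup_seconds / task_seconds


short_task = relative_overhead(4, 5)    # 4-second task, 5 extra seconds
long_task = relative_overhead(60, 5)    # 1-minute task, 5 extra seconds
```

The short task pays 125% overhead (5 / 4 = 1.25), while the one-minute task pays about 8% (5 / 60 ≈ 0.083).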

crawshaw 4 hours ago

Nice to see this work! I experimented with this for exe.dev before we launched. The VM itself worked really well, but there was a lot of setup to get the networking functioning. And in the end, our target is use cases that don't mind a ~1-second startup time, which meant doing a clean systemd start each time was easier.

That said, I have seen several use cases where people want a VM for something minimal, like a Python interpreter, and this is absolutely the sort of approach they should be using. A lot of promise here, excited to see how far you can push it!

  • indigodaddy 4 hours ago

    simonw seems like he's always wanting what you describe, maybe more for wasm though

    • edunteman 25 minutes ago

      I’ve been a big fan of “what’s the thinnest this could be” interpretations of sandboxes. This is a great example of that. On the other end of the spectrum there’s just-bash from the Vercel folks.

indigodaddy 4 hours ago

Does this need passthrough or might we be able to leverage PVM with it on a passthrough-less cloud VM/VPS?

vmg12 5 hours ago

Does it only work with that specific version of Firecracker and only with VMs with 1 vCPU?

More than the sub-ms startup time, the 258 KB of RAM per VM is huge.

latortuga 3 hours ago

Similar to sprites.dev?

buckle8017 4 hours ago

This is how Android processes work, but it's a security problem because it breaks some ASLR-type protections.

handfuloflight 5 hours ago

Can you run this in another sandbox? Not sure why you'd want to... but can you?

  • Teknoman117 5 hours ago

    Nested page tables / nested virtualization made it to consumer CPUs about a decade ago, so yes :)

  • wmf 5 hours ago

    It's pretty common to run VMs within containers so an attacker has to escape twice. You can probably disable 99% of system calls.