Experimental — Hotcell is under active development and should not be used in production.

Dual-Layer Isolation

Every execution runs in its own VM. If someone escapes the VM, they land in a jailed process with no capabilities, a minimal filesystem, and a syscall filter that kills on violation.

Layer 1: VM isolation via libkrun, Firecracker, or Cloud Hypervisor. Separate kernel, memory, process tree.
Layer 2: The VMM process itself is jailed — strict sandbox with syscall filtering, filesystem restrictions, resource limits, and full capability drop.
277+ tests including 37 adversarial tests that verify escape attempts are blocked.

VMM Backends

Hotcell supports three VMM (Virtual Machine Monitor) backends behind a common trait. Your code, the server, and the result protocol work identically regardless of which backend runs the VM.

Select per-request via the backend parameter. Each backend has different trade-offs for platform support, isolation strength, and filesystem sharing.

hotcell::backend::VmmBackend
/// Pluggable VMM backend trait.
/// Everything above the backend -- OCI pipeline, server,
/// result protocol -- is backend-agnostic.
#[async_trait]
pub trait VmmBackend: Send + Sync {
    /// Run a VM and return the result.
    async fn run(
        &self, config: &VmConfig, worker_bin: &Path,
    ) -> Result<VmResult, HotcellError>;

    /// Run a VM with streaming console output.
    async fn run_streaming(
        &self, config: &VmConfig, worker_bin: &Path,
        tx: broadcast::Sender<StreamEvent>,
    ) -> Result<VmResult, HotcellError>;

    /// Create a persistent VM that outlives a single request.
    async fn create_persistent(
        &self, config: &VmConfig, worker_bin: &Path,
    ) -> Result<Box<dyn PersistentVmHandle>, HotcellError>;

    /// Human-readable backend name.
    fn name(&self) -> &'static str;
}

// Three implementations in separate crates:
// - hotcell_libkrun::LibkrunBackend      (macOS + Linux)
// - hotcell_firecracker::FirecrackerBackend  (Linux only)
// - hotcell_ch::ChBackend                (Linux only)
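Since the request's backend parameter picks the implementation at runtime, dispatch over the common trait can be sketched as follows. This is a deliberately simplified stand-in: the trait here is synchronous, and the names `Backend` and `select_backend` are ours for illustration, not part of hotcell's real async API.

```rust
// Simplified, synchronous stand-in for VmmBackend dispatch.
// The real trait is async and returns VmResult/HotcellError.
trait Backend {
    fn name(&self) -> &'static str;
}

struct Libkrun;
struct Firecracker;
struct CloudHypervisor;

impl Backend for Libkrun {
    fn name(&self) -> &'static str { "libkrun" }
}
impl Backend for Firecracker {
    fn name(&self) -> &'static str { "firecracker" }
}
impl Backend for CloudHypervisor {
    fn name(&self) -> &'static str { "cloud-hypervisor" }
}

/// Resolve the per-request `backend` parameter to an implementation,
/// defaulting to libkrun when the parameter is absent.
fn select_backend(param: Option<&str>) -> Result<Box<dyn Backend>, String> {
    match param.unwrap_or("libkrun") {
        "libkrun" => Ok(Box::new(Libkrun)),
        "firecracker" => Ok(Box::new(Firecracker)),
        "cloud-hypervisor" => Ok(Box::new(CloudHypervisor)),
        other => Err(format!("unknown backend: {other}")),
    }
}

fn main() {
    let backend = select_backend(None).unwrap();
    println!("dispatching to {}", backend.name());
}
```

Everything above the dispatch point stays backend-agnostic, which is the point of the trait: callers hold a `Box<dyn Backend>` and never branch on which VMM is underneath.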

Backend Comparison

|                    | libkrun (default)                          | Firecracker                                     | Cloud Hypervisor                                         |
|--------------------|--------------------------------------------|--------------------------------------------------|----------------------------------------------------------|
| Platforms          | macOS + Linux                              | Linux only                                       | Linux only                                               |
| VMM process        | Embedded (FFI, takes over worker process)  | Separate binary (REST API over UDS)              | Separate binary (REST API over UDS)                      |
| Rootfs format      | Shared directory (virtio-fs)               | Disk image (ext4)                                | Shared directory (virtio-fs via virtiofsd)               |
| Networking         | Proxied through host (TSI) + port mapping  | Virtual network (TAP + NAT) + egress filtering   | Virtual network (TAP + NAT)                              |
| Shared directories | Host folders shared into guest (virtio-fs) | Not supported                                    | Host folders shared into guest (virtio-fs via virtiofsd) |
| Guest kernel       | Built into libkrunfw                       | Separate vmlinux image                           | Separate vmlinux image                                   |
| Host sandboxing    | hotcell-jailer (process isolation, syscall filtering, filesystem restrictions, resource limits) | Firecracker jailer (optional) | hotcell-ch-worker (process isolation, resource limits, filesystem pivot) + CH built-in syscall filtering |
| Result collection  | Read from shared directory                 | Read from disk image after VM exit               | Read from shared directory                               |
| Boot time          | ~130 ms                                    | ~125 ms                                          | ~200 ms                                                  |
| Best for           | Development, macOS, low latency            | Production Linux, multi-tenant, stronger isolation | Linux production with virtio-fs, no ext4 overhead      |

libkrun (default)

  • Embedded VMM via FFI — worker calls krun_start_enter()
  • virtio-fs for rootfs and shared directory access
  • TSI (Transparent Socket Impersonation) networking — verified on both platforms
  • hotcell-jailer sandboxes the VMM process on Linux

Firecracker (Linux only)

  • Separate VMM binary — configured via REST API over Unix socket
  • ext4 block device images created from OCI rootfs
  • Firecracker's own jailer — battle-tested in AWS Lambda
  • Serial console output streamed from file for real-time monitoring

Cloud Hypervisor (Linux only)

  • Separate VMM binary — configured via REST API over Unix socket
  • virtio-fs rootfs via virtiofsd — host directory mounted directly, no ext4 images
  • Manages virtiofsd daemon processes for each virtio-fs mount
  • Requires cloud-hypervisor and virtiofsd binaries + KVM

Layer 1: VM Isolation

Kernel Virtualization

Each execution runs inside its own virtual machine with a separate Linux kernel. With libkrun, the kernel is compiled into libkrunfw. With Firecracker and Cloud Hypervisor, a separate vmlinux image is used. Either way, this eliminates the shared-kernel attack surface found in traditional containerization.

Guest Properties

  • Own kernel, process table, and memory space
  • No access to the host filesystem — the guest sees only its own root filesystem and any explicitly shared directories
  • No network access by default — networking must be explicitly enabled per-execution, with optional egress filtering to restrict which hosts the guest can reach
  • Configurable resource limits — memory, CPU, and execution timeout per VM
[Diagram: the VM boundary encloses the guest OS kernel and the user workload; only explicitly shared directories cross it.]

Layer 2: VMM Process Jail

On Linux, the VMM process is sandboxed before it boots the VM. With the libkrun backend, hotcell-libkrun-worker is sandboxed by hotcell-jailer before it configures libkrun. With the Firecracker backend, Firecracker's own jailer (battle-tested in AWS Lambda) handles sandboxing. With the Cloud Hypervisor backend, Landlock-based sandboxing is planned. The jail steps below describe hotcell-jailer for the libkrun backend.

LINUX ONLY
DEFENSE-IN-DEPTH

The jail is built in 8 sequential steps. Each step removes a category of capability from the process: file descriptors, environment, resources, filesystem visibility, syscall access, and privileges. The steps are ordered so that each one assumes the previous steps might have been bypassed — defense-in-depth means no step is a single point of failure.

01

Close Inherited FDs

close_range(3, MAX, 0) prevents leaking host file descriptors into the jail.

02

Clear Environment

All environment variables are removed. Only LD_LIBRARY_PATH=/lib remains (required for the dynamic linker to find libkrun inside the jail).
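In std Rust, the spawn-side version of this scrub can be sketched with `Command::env_clear`. The jailer scrubs its own environment directly; this sketch only shows the equivalent pattern for a child process, and the `/worker` path is illustrative rather than a real jail-layout contract.

```rust
use std::process::Command;

/// Build the worker command with a scrubbed environment: everything
/// cleared, then only LD_LIBRARY_PATH restored so the dynamic linker
/// can find libkrun inside the jail. "/worker" is an illustrative path.
fn jailed_command() -> Command {
    let mut cmd = Command::new("/worker");
    cmd.env_clear().env("LD_LIBRARY_PATH", "/lib");
    cmd
}

fn main() {
    let cmd = jailed_command();
    // Only the explicitly restored variable remains configured.
    println!("explicitly set env vars: {}", cmd.get_envs().count());
}
```

`env_clear()` also prevents the child from inheriting anything set later in the parent, which is why clear-then-restore is safer than removing variables one by one.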

03

Join Cgroup

Dedicated cgroup with memory.max, pids.max (256), and cpu.max limits applied.

04

Namespace Isolation

unshare() creates mount, PID, IPC, UTS, and network namespaces (network only when TSI is disabled).

05

pivot_root

New root filesystem via pivot_root(), old root unmounted and removed. Host filesystem entirely invisible.

06

Landlock Restrictions

Mandatory access control via Landlock ABI v3 (filesystem) with optional v4 network restrictions (Linux 6.7+). The process can only access explicitly listed paths. This step is fatal — if Landlock is not enforced, the jail fails.

07

Drop Capabilities

Two-phase capability drop: bounding set cleared via PR_CAPBSET_DROP, then all remaining sets (ambient, effective, permitted, inheritable) cleared after setuid to nobody.

08

Seccomp BPF

Dual BPF filter with SECCOMP_FILTER_FLAG_TSYNC: an audit-log filter records violations, then a Kill-mode filter terminates on any syscall not in the allowlist. Both applied atomically across all threads.
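The eight steps above can be sketched as an ordered, fail-fast pipeline: any step that fails aborts jail construction, since a partially built jail is weaker than refusing to run. The `JailStep` type and placeholder step bodies below are ours for illustration, not the real jailer code.

```rust
/// One jail-construction step: a name plus a fallible setup action.
struct JailStep {
    name: &'static str,
    run: fn() -> Result<(), String>,
}

/// Run the steps strictly in order, aborting on the first failure.
fn build_jail(steps: &[JailStep]) -> Result<(), String> {
    for step in steps {
        (step.run)().map_err(|e| format!("jail step '{}' failed: {e}", step.name))?;
    }
    Ok(())
}

fn main() {
    // Placeholder actions standing in for the real syscalls.
    let ok: fn() -> Result<(), String> = || Ok(());
    let steps = [
        JailStep { name: "close inherited fds", run: ok },
        JailStep { name: "clear environment", run: ok },
        JailStep { name: "join cgroup", run: ok },
        JailStep { name: "namespace isolation", run: ok },
        JailStep { name: "pivot_root", run: ok },
        JailStep { name: "landlock restrictions", run: ok },
        JailStep { name: "drop capabilities", run: ok },
        JailStep { name: "seccomp bpf", run: ok },
    ];
    assert!(build_jail(&steps).is_ok());
    println!("jail built: {} steps", steps.len());
}
```

The ordering matters in the real jailer too: seccomp comes last because once the Kill-mode filter is live, the syscalls needed by the earlier steps are no longer available.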

Jail Filesystem

After pivot_root(), the worker's entire filesystem view is:

/
├── dev/
│   ├── kvm          # bind-mount, for VM creation
│   ├── urandom      # bind-mount, for randomness
│   └── null         # bind-mount
├── lib/             # bind-mount read-only: libkrun.so, libkrunfw.so, libc, ld-linux
├── proc/            # procfs, mounted after pivot_root
├── rootfs/          # bind-mount read-only: the OCI root filesystem
├── shares/          # bind-mount read-write: host shared directories
├── tmp/             # writable, world-writable with sticky bit
├── result/          # writable: result file directory
├── config.json      # read-only: VM configuration
├── console.log      # writable: console output
└── worker           # bind-mount read-only: hotcell-libkrun-worker binary

Networking Trade-off

libkrun's TSI (Transparent Socket Impersonation) proxies guest socket calls through the VMM process on the host via vsock. When TSI is enabled, the worker does not unshare the network namespace (CLONE_NEWNET is skipped), socket syscalls are added to the seccomp allowlist, and Landlock network restrictions are skipped. The remaining layers (namespace isolation, seccomp allowlist, capability drop) still constrain the process.

Technical Specs

Cgroup memory.max   guest + 256 MiB   (min 512 MiB)
Cgroup pids.max     256               (prevents fork bombs)
Seccomp Mode        Kill              (immediate termination)
Landlock ABI        v3 required       (Linux 6.2+)
Landlock Network    v4 optional       (Linux 6.7+)
Cgroup cpu.max      unlimited         (configurable per-execution)
Seccomp Filters     dual BPF          (audit log + kill, TSYNC)
Vsock Auth          HMAC-SHA256       (prevents cross-VM injection)
Hardening Items     22 total          (host + guest combined)
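Reading the memory spec as "guest memory plus 256 MiB of VMM headroom, floored at 512 MiB", the arithmetic is trivial but worth pinning down. The helper name is ours, not hotcell's:

```rust
/// Cgroup memory.max in MiB: guest memory + 256 MiB headroom for the
/// VMM process itself, never below the 512 MiB floor.
/// This interprets the spec table; the real jailer code is not shown.
fn cgroup_memory_max_mib(guest_mib: u64) -> u64 {
    (guest_mib + 256).max(512)
}

fn main() {
    assert_eq!(cgroup_memory_max_mib(128), 512);   // small guest: floored
    assert_eq!(cgroup_memory_max_mib(1024), 1280); // guest + headroom
    println!("ok");
}
```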

Security Hardening

Guest Seccomp Filter

Optional hotcell-seccomp binary installs a BPF filter inside the guest, blocking ptrace, mount, unshare, and other privilege-escalation syscalls before the user workload runs.

Vsock HMAC-SHA256 Auth

Each VM receives a unique HMAC-SHA256 token over vsock. Result payloads are authenticated before acceptance, preventing cross-VM result injection attacks.
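std Rust has no HMAC primitive (real deployments use a crypto crate), so here is only the acceptance-side detail that is easy to get wrong: comparing the received tag in constant time. A generic sketch, not hotcell's code:

```rust
/// Constant-time equality: XOR-accumulate every byte so the comparison
/// time does not leak how many leading bytes of a forged tag matched.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

fn main() {
    let expected = [0xde, 0xad, 0xbe, 0xef];
    assert!(ct_eq(&expected, &[0xde, 0xad, 0xbe, 0xef]));
    assert!(!ct_eq(&expected, &[0xde, 0xad, 0xbe, 0xee]));
    println!("ok");
}
```

An early-exit `==` on byte slices would let an attacker in a neighboring VM refine a forged tag one byte at a time by measuring rejection latency.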

Network Egress Filtering

--allow-host restricts guest network access to specific destinations via iptables rules. All other egress is dropped.
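One plausible shape for the generated rules is an allowlist followed by a default drop. The chain and interface names below are standard iptables vocabulary, but the exact rules hotcell emits are not documented here, so treat this as an illustrative sketch:

```rust
/// Render an allowlist egress policy for a guest's TAP interface:
/// accept traffic to each allowed destination, drop everything else.
/// Chain ("FORWARD") and interface names are hypothetical examples.
fn egress_rules(tap: &str, allow_hosts: &[&str]) -> Vec<String> {
    let mut rules: Vec<String> = allow_hosts
        .iter()
        .map(|h| format!("-A FORWARD -i {tap} -d {h} -j ACCEPT"))
        .collect();
    // Default-deny: anything not explicitly allowed is dropped.
    rules.push(format!("-A FORWARD -i {tap} -j DROP"));
    rules
}

fn main() {
    for rule in egress_rules("tap0", &["93.184.216.34"]) {
        println!("iptables {rule}");
    }
}
```

The ordering is the invariant that matters: ACCEPT rules must precede the terminal DROP, because iptables evaluates a chain first match wins.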

Rootfs Integrity

fs-verity and dm-verity provide cryptographic verification of rootfs contents, detecting any tampering of the filesystem image before or during execution.

AppArmor Profile

An AppArmor profile adds kernel-level mandatory access control on top of Landlock and seccomp.

virtiofsd Sandboxing

virtiofsd runs with --sandbox=chroot --seccomp=kill. The daemon is locked to a chroot with its own seccomp kill-mode filter.

Binary Hardening CI

Worker binaries are verified in CI via checksec: Full RELRO, PIE, Stack Canary, and NX are enforced on every build.

OCI Pipeline Security

Path Traversal Protection

Tar extraction rejects entries containing ../ to prevent directory escape attacks.
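A minimal version of that check rejects any entry path that is absolute or contains a `..` component. This is a sketch of the rule, not the actual extraction code:

```rust
use std::path::{Component, Path};

/// Reject tar entry paths that could escape the extraction root:
/// absolute paths, and any path containing a `..` component.
fn safe_entry_path(entry: &str) -> bool {
    let path = Path::new(entry);
    !path.is_absolute()
        && path
            .components()
            .all(|c| !matches!(c, Component::ParentDir | Component::RootDir))
}

fn main() {
    assert!(safe_entry_path("usr/bin/ls"));
    assert!(!safe_entry_path("../etc/passwd"));
    assert!(!safe_entry_path("usr/../../etc/passwd"));
    assert!(!safe_entry_path("/etc/passwd"));
    println!("ok");
}
```

Checking components rather than substrings avoids false positives on legitimate names like `a..b` while still catching `usr/../../etc/passwd`.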

Symlink Escape Guards

Symlinks are resolved within the rootfs boundary using guest semantics. Absolute symlinks are rebased into the rootfs, not followed on the host.
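The rebasing half of that guard can be sketched with std paths. This helper is ours, and it assumes relative targets are resolved separately within the rootfs boundary:

```rust
use std::path::{Path, PathBuf};

/// Rebase an absolute symlink target into the rootfs, so a link to
/// "/etc/passwd" resolves to "<rootfs>/etc/passwd" with guest
/// semantics instead of being followed on the host.
fn rebase_symlink(rootfs: &Path, target: &Path) -> PathBuf {
    if target.is_absolute() {
        // Strip the leading "/" and re-root the path under the rootfs.
        rootfs.join(target.strip_prefix("/").unwrap_or(target))
    } else {
        // Relative targets stay relative; the caller resolves them
        // against the link's own directory inside the rootfs.
        target.to_path_buf()
    }
}

fn main() {
    let out = rebase_symlink(Path::new("/jail/rootfs"), Path::new("/etc/passwd"));
    println!("{}", out.display());
}
```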

Shell Injection Prevention

guest_tag values are validated to contain only [a-zA-Z0-9_-].
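That allowlist is easy to express directly. A sketch (we also reject the empty string here; the real validator may additionally bound the length):

```rust
/// Accept only [a-zA-Z0-9_-]. Tag values can end up interpolated into
/// commands, so anything outside this set is rejected outright rather
/// than escaped.
fn valid_guest_tag(tag: &str) -> bool {
    !tag.is_empty()
        && tag
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-')
}

fn main() {
    assert!(valid_guest_tag("build-42_final"));
    assert!(!valid_guest_tag("tag; rm -rf /"));
    println!("ok");
}
```

Allowlisting beats escaping for shell safety: there is no quoting scheme to get wrong when metacharacters simply cannot appear.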

Digest Verification

Layer blob digest verification is delegated to the oci-client library, which checks SHA-256 digests during download.

Temp-File Downloads

Blobs download to random temp files before extraction. This shrinks the substitution window, though atomic rename isn't used yet — partial TOCTOU mitigation, not a full guarantee.

Test Coverage

277+ tests across unit tests, integration tests (real VMs), and adversarial security boundary tests. The jailer is verified working on Linux+KVM with seccomp in Kill mode. TSI networking is verified on both macOS and Linux.

18

Jailer Escape Tests

Each test simulates an attacker inside the jail attempting a known escape technique. Can they break out of the filesystem? Traverse /proc? Create new namespaces? Regain dropped capabilities? Every escape attempt must be blocked for the build to pass.
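One such probe checks that the effective capability set really is empty by reading /proc/self/status. The parsing half of that check can be sketched portably, with sample input inline; whether the real tests do it exactly this way is an assumption, since they run inside the jail against the live kernel:

```rust
/// Given the text of /proc/self/status, return true if CapEff is all
/// zeroes, i.e. every effective capability has been dropped.
fn caps_fully_dropped(status: &str) -> bool {
    status
        .lines()
        .find_map(|line| line.strip_prefix("CapEff:"))
        .map(|hex| hex.trim().chars().all(|c| c == '0'))
        .unwrap_or(false)
}

fn main() {
    // Sample lines in the format the kernel emits.
    let jailed = "Name:\tworker\nCapEff:\t0000000000000000\n";
    let root = "Name:\tbash\nCapEff:\t000001ffffffffff\n";
    assert!(caps_fully_dropped(jailed));
    assert!(!caps_fully_dropped(root));
    println!("ok");
}
```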

19

Guest Isolation Tests

Real VMs boot with adversarial probe scripts that attempt to observe or reach the host. Tests run across all three backends (libkrun, Firecracker, Cloud Hypervisor) and verify hostname, filesystem, process, and network isolation.

E2E

End-to-End Verified

Jailed VM boot validated on Linux+KVM with the strictest seccomp mode (Kill). Networking verified on both macOS and Linux through the full sandbox stack. The complete jail sequence runs end-to-end in production configuration.