Dual-Layer Isolation
Every execution runs in its own VM. If someone escapes the VM, they land in a jailed process with no capabilities, a minimal filesystem, and a syscall filter that kills on violation.
Layer 1: VM isolation via libkrun, Firecracker, or Cloud Hypervisor. Separate kernel, memory, process tree.
Layer 2: The VMM process itself is jailed — strict sandbox with syscall filtering, filesystem restrictions, resource limits, and full capability drop.
277+ tests including 37 adversarial tests that verify escape attempts are blocked.
VMM Backends
Hotcell supports three VMM (Virtual Machine Monitor) backends behind a common trait. Your code, the server, and the result protocol work identically regardless of which backend runs the VM.
Select per-request via the backend parameter. Each backend has different trade-offs for platform support, isolation strength, and filesystem sharing.
/// Pluggable VMM backend trait.
/// Everything above the backend -- OCI pipeline, server,
/// result protocol -- is backend-agnostic.
#[async_trait]
pub trait VmmBackend: Send + Sync {
    /// Run a VM and return the result.
    async fn run(
        &self, config: &VmConfig, worker_bin: &Path,
    ) -> Result<VmResult, HotcellError>;

    /// Run a VM with streaming console output.
    async fn run_streaming(
        &self, config: &VmConfig, worker_bin: &Path,
        tx: broadcast::Sender<StreamEvent>,
    ) -> Result<VmResult, HotcellError>;

    /// Create a persistent VM that outlives a single request.
    async fn create_persistent(
        &self, config: &VmConfig, worker_bin: &Path,
    ) -> Result<Box<dyn PersistentVmHandle>, HotcellError>;

    /// Human-readable backend name.
    fn name(&self) -> &'static str;
}
// Three implementations in separate crates:
// - hotcell_libkrun::LibkrunBackend (macOS + Linux)
// - hotcell_firecracker::FirecrackerBackend (Linux only)
// - hotcell_ch::ChBackend (Linux only)
Backend Comparison
libkrun
- Embedded VMM via FFI — worker calls krun_start_enter()
- virtiofs for rootfs and shared directory access
- TSI (Transparent Socket Impersonation) networking — verified on both platforms
- hotcell-jailer sandboxes the VMM process on Linux
Firecracker
- Separate VMM binary — configured via REST API over a Unix socket
- ext4 block device images created from the OCI rootfs
- Firecracker's own jailer — battle-tested in AWS Lambda
- Serial console output streamed from a file for real-time monitoring
Cloud Hypervisor
- Separate VMM binary — configured via REST API over a Unix socket
- virtio-fs rootfs via virtiofsd — host directory mounted directly, no ext4 images
- Manages a virtiofsd daemon process for each virtio-fs mount
- Requires the cloud-hypervisor and virtiofsd binaries plus KVM
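Per-request backend selection (mentioned above) can be sketched as a simple dispatch over the `backend` parameter. This is an illustration only: the accepted parameter values and the enum are assumptions, not Hotcell's actual API.

```rust
// Hypothetical sketch of per-request backend selection; the real
// parameter values and types in Hotcell may differ.
#[derive(Debug, PartialEq)]
enum Backend {
    Libkrun,
    Firecracker,
    CloudHypervisor,
}

/// Parse the `backend` request parameter into a backend choice.
/// Unknown values are rejected rather than falling back silently.
fn parse_backend(param: &str) -> Option<Backend> {
    match param {
        "libkrun" => Some(Backend::Libkrun),
        "firecracker" => Some(Backend::Firecracker),
        "cloud-hypervisor" | "ch" => Some(Backend::CloudHypervisor),
        _ => None,
    }
}
```

Because every implementation sits behind the `VmmBackend` trait, the rest of the pipeline never needs to know which variant was chosen.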
Layer 1: VM Isolation
Kernel Virtualization
Each execution runs inside its own virtual machine with a separate Linux kernel. With libkrun, the kernel is compiled into libkrunfw. With Firecracker and Cloud Hypervisor, a separate vmlinux image is used. Either way, this eliminates the shared-kernel attack surface found in traditional containerization.
Guest Properties
- Own kernel, process table, and memory space
- No access to the host filesystem — the guest sees only its own root filesystem and any explicitly shared directories
- No network access by default — networking must be explicitly enabled per execution, with optional egress filtering to restrict which hosts the guest can reach
- Configurable resource limits — memory, CPU, and execution timeout per VM
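The per-VM limits above suggest a configuration shape like the following. The field names and default values here are assumptions for illustration, not the actual `VmConfig` definition from the trait earlier in this page.

```rust
// Hypothetical shape of a per-VM configuration; field names and the
// default values (512 MiB, 1 vCPU, 60 s) are illustrative assumptions.
struct VmConfig {
    memory_mib: u64,            // guest memory cap
    vcpus: u8,                  // CPU count
    timeout_secs: u64,          // hard execution deadline
    enable_network: bool,       // networking is opt-in per execution
    allowed_hosts: Vec<String>, // optional egress allowlist
}

impl Default for VmConfig {
    fn default() -> Self {
        VmConfig {
            memory_mib: 512,
            vcpus: 1,
            timeout_secs: 60,
            enable_network: false, // no network unless explicitly requested
            allowed_hosts: Vec::new(),
        }
    }
}
```

The security-relevant property is the default: a freshly constructed config has networking off and an empty egress allowlist.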
Layer 2: VMM Process Jail
On Linux, the VMM process is sandboxed before it boots the VM. With the libkrun backend, hotcell-libkrun-worker is sandboxed by hotcell-jailer before it configures libkrun. With the Firecracker backend, Firecracker's own jailer (battle-tested in AWS Lambda) handles sandboxing. With the Cloud Hypervisor backend, Landlock-based sandboxing is planned. The jail steps below describe hotcell-jailer for the libkrun backend.
The jail is built in 8 sequential steps. Each step removes a category of capability from the process: file descriptors, environment, resources, filesystem visibility, syscall access, and privileges. The steps are ordered so that each one assumes the previous steps might have been bypassed — defense-in-depth means no single step is a single point of failure.
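The ordering described above can be captured as data. The step names below are paraphrased from this page, not the jailer's actual identifiers.

```rust
/// The eight jail steps in application order (names paraphrased from
/// the documentation, not hotcell-jailer's real symbols).
const JAIL_STEPS: [&str; 8] = [
    "close_inherited_fds", // 1. close_range(3, MAX, 0)
    "clear_environment",   // 2. keep only LD_LIBRARY_PATH=/lib
    "join_cgroup",         // 3. memory.max, pids.max, cpu.max
    "unshare_namespaces",  // 4. mount, PID, IPC, UTS, (network)
    "pivot_root",          // 5. new root, old root unmounted
    "apply_landlock",      // 6. mandatory path allowlist
    "drop_capabilities",   // 7. bounding set, then all remaining sets
    "install_seccomp",     // 8. audit filter, then kill-mode filter
];
```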
Close Inherited FDs
close_range(3, MAX, 0) prevents leaking host file descriptors into the jail.
Clear Environment
All environment variables are removed. Only LD_LIBRARY_PATH=/lib remains (required for the dynamic linker to find libkrun inside the jail).
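The resulting environment is tiny enough to sketch directly; this models the behavior described above rather than quoting the jailer's code.

```rust
use std::collections::BTreeMap;

/// Build the jailed process's environment: everything inherited is
/// dropped, and only the dynamic-linker search path survives so the
/// loader can find libkrun inside the jail. (Sketch, not actual code.)
fn jail_environment() -> BTreeMap<String, String> {
    let mut env = BTreeMap::new();
    env.insert("LD_LIBRARY_PATH".to_string(), "/lib".to_string());
    env
}
```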
Join Cgroup
Dedicated cgroup with memory.max, pids.max (256), and cpu.max limits applied.
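A cgroup v2 limit set is just a handful of file writes. The sketch below renders the file contents; `pids.max = 256` comes from this page, while the memory and CPU values are placeholder assumptions.

```rust
/// Render the cgroup v2 limit files the jailer would write into a
/// dedicated cgroup directory. Only pids.max=256 is documented; the
/// memory argument and the cpu.max quota are illustrative.
fn cgroup_limits(memory_max_bytes: u64) -> Vec<(&'static str, String)> {
    vec![
        ("memory.max", memory_max_bytes.to_string()),
        ("pids.max", "256".to_string()),
        // cpu.max is "<quota> <period>" in microseconds; this grants
        // 100% of one CPU (100ms of runtime per 100ms period).
        ("cpu.max", "100000 100000".to_string()),
    ]
}
```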
Namespace Isolation
unshare() creates mount, PID, IPC, UTS, and network namespaces (network only when TSI is disabled).
pivot_root
New root filesystem via pivot_root(), old root unmounted and removed. Host filesystem entirely invisible.
Landlock Restrictions
Mandatory access control via Landlock ABI v3 (filesystem) with optional v4 network restrictions (Linux 6.7+). The process can only access explicitly listed paths. This step is fatal — if Landlock is not enforced, the jail fails.
Drop Capabilities
Two-phase capability drop: bounding set cleared via PR_CAPBSET_DROP, then all remaining sets (ambient, effective, permitted, inheritable) cleared after setuid to nobody.
Seccomp BPF
Dual BPF filter with SECCOMP_FILTER_FLAG_TSYNC: an audit-log filter records violations, then a Kill-mode filter terminates on any syscall not in the allowlist. Both applied atomically across all threads.
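The decision the kill-mode filter encodes is all-or-nothing: a syscall either matches the allowlist or the process dies. The sketch below models that decision in Rust; the real filter is BPF bytecode installed via seccomp with SECCOMP_FILTER_FLAG_TSYNC, and the syscall names used in the test are illustrative.

```rust
/// Model of the kill-mode filter's verdict. Illustrative only: the
/// actual filter is a BPF program matching syscall numbers, not names.
#[derive(Debug, PartialEq)]
enum Verdict {
    Allow,
    KillProcess,
}

fn filter_verdict(allowlist: &[&str], syscall: &str) -> Verdict {
    if allowlist.contains(&syscall) {
        Verdict::Allow
    } else {
        // Default-deny: anything not explicitly listed kills the process.
        Verdict::KillProcess
    }
}
```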
Jail Filesystem
After pivot_root(), the worker's entire filesystem view is:
/
├── dev/
│   ├── kvm          # bind-mount, for VM creation
│   ├── urandom      # bind-mount, for randomness
│   └── null         # bind-mount
├── lib/             # bind-mount read-only: libkrun.so, libkrunfw.so, libc, ld-linux
├── proc/            # procfs, mounted after pivot_root
├── rootfs/          # bind-mount read-only: the OCI root filesystem
├── shares/          # bind-mount read-write: host shared directories
├── tmp/             # writable, world-writable with sticky bit
├── result/          # writable: result file directory
├── config.json      # read-only: VM configuration
├── console.log      # writable: console output
└── worker           # bind-mount read-only: hotcell-libkrun-worker binary
Networking Trade-off
libkrun's TSI (Transparent Socket Impersonation) proxies guest socket calls through the VMM process on the host via vsock. When TSI is enabled, the worker does not unshare the network namespace (CLONE_NEWNET is skipped), socket syscalls are added to the seccomp allowlist, and Landlock network restrictions are skipped. The remaining layers (namespace isolation, seccomp allowlist, capability drop) still constrain the process.
Technical Specs
Security Hardening
Guest Seccomp Filter
Optional hotcell-seccomp binary installs a BPF filter inside the guest, blocking ptrace, mount, unshare, and other privilege-escalation syscalls before the user workload runs.
Vsock HMAC-SHA256 Auth
Each VM receives a unique HMAC-SHA256 token over vsock. Result payloads are authenticated before acceptance, preventing cross-VM result injection attacks.
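Checking a received tag against the expected one must not leak timing information through early exit. A minimal constant-time comparison sketch follows; production code would normally rely on a vetted implementation (e.g. the `subtle` crate) rather than hand-rolling this.

```rust
/// Constant-time byte comparison, as used when checking a received
/// HMAC-SHA256 tag against the expected one. Sketch only: it accumulates
/// XOR differences instead of returning at the first mismatch.
fn tags_match(expected: &[u8], received: &[u8]) -> bool {
    if expected.len() != received.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (a, b) in expected.iter().zip(received.iter()) {
        diff |= a ^ b; // collect mismatched bits without branching
    }
    diff == 0
}
```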
Network Egress Filtering
--allow-host restricts guest network access to specific destinations via iptables rules. All other egress is dropped.
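Translating an `--allow-host` list into a default-deny rule set can be sketched as below. The chain name and rule layout are assumptions for illustration; Hotcell's actual iptables rules may differ.

```rust
/// Sketch: render an egress allowlist as iptables-style rules,
/// accepting listed destinations and dropping everything else.
/// The FORWARD chain and rule text are illustrative assumptions.
fn egress_rules(allowed: &[&str]) -> Vec<String> {
    let mut rules: Vec<String> = allowed
        .iter()
        .map(|host| format!("-A FORWARD -d {} -j ACCEPT", host))
        .collect();
    // Default-deny: the final rule drops all other egress.
    rules.push("-A FORWARD -j DROP".to_string());
    rules
}
```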
Rootfs Integrity
fs-verity and dm-verity provide cryptographic verification of rootfs contents, detecting any tampering of the filesystem image before or during execution.
AppArmor Profile
An AppArmor profile adds kernel-level mandatory access control on top of Landlock and seccomp.
virtiofsd Sandboxing
virtiofsd runs with --sandbox=chroot --seccomp=kill. The daemon is locked to a chroot with its own seccomp kill-mode filter.
Binary Hardening CI
Worker binaries are verified in CI via checksec: Full RELRO, PIE, Stack Canary, and NX are enforced on every build.
OCI Pipeline Security
Path Traversal Protection
Tar extraction rejects entries containing ../ to prevent directory escape attacks.
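The check reduces to inspecting each entry's path components before extraction. A sketch of that guard, under the assumption that absolute entry paths are rejected as well:

```rust
use std::path::{Component, Path};

/// Reject tar entries that could escape the extraction root:
/// absolute paths or any `..` component anywhere in the path.
fn entry_is_safe(entry: &str) -> bool {
    let path = Path::new(entry);
    !path.is_absolute()
        && path.components().all(|c| !matches!(c, Component::ParentDir))
}
```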
Symlink Escape Guards
Symlinks are resolved within the rootfs boundary using guest semantics. Absolute symlinks are rebased into the rootfs, not followed on the host.
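Rebasing an absolute target amounts to re-anchoring it under the rootfs so it resolves with guest semantics. A sketch of that guard (the function name is hypothetical):

```rust
use std::path::{Path, PathBuf};

/// Rebase an absolute symlink target into the rootfs so it can never
/// point at the host. Relative targets are returned unchanged and
/// resolved within the rootfs boundary by the caller. (Illustration.)
fn rebase_symlink(rootfs: &Path, target: &str) -> PathBuf {
    let t = Path::new(target);
    if t.is_absolute() {
        // Strip the leading `/` and anchor the target under the rootfs.
        rootfs.join(t.strip_prefix("/").unwrap_or(t))
    } else {
        t.to_path_buf()
    }
}
```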
Shell Injection Prevention
guest_tag values are validated to contain only [a-zA-Z0-9_-].
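The character allowlist makes the validation a one-liner; anything outside `[a-zA-Z0-9_-]`, including shell metacharacters, is rejected.

```rust
/// Validate a guest_tag against the documented character allowlist
/// [a-zA-Z0-9_-], so the value cannot smuggle shell metacharacters.
fn valid_guest_tag(tag: &str) -> bool {
    tag.chars()
        .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-')
}
```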
Digest Verification
Layer blob digest verification is delegated to the oci-client library, which checks SHA-256 digests during download.
Temp-File Downloads
Blobs download to random temp files before extraction. This shrinks the substitution window, though atomic rename isn't used yet — partial TOCTOU mitigation, not a full guarantee.
Test Coverage
277+ tests across unit tests, integration tests (real VMs), and adversarial security-boundary tests. The jailer is verified on Linux+KVM with seccomp in Kill mode; TSI networking is verified on both macOS and Linux.
Jailer Escape Tests
Each test simulates an attacker inside the jail attempting a known escape technique. Can they break out of the filesystem? Traverse /proc? Create new namespaces? Regain dropped capabilities? Every escape attempt must be blocked for the build to pass.
Guest Isolation Tests
Real VMs boot with adversarial probe scripts that attempt to observe or reach the host. Tests run across all three backends (libkrun, Firecracker, Cloud Hypervisor) and verify hostname, filesystem, process, and network isolation.
End-to-End Verified
Jailed VM boot validated on Linux+KVM with the strictest seccomp mode (Kill). Networking verified on both macOS and Linux through the full sandbox stack. The complete jail sequence runs end-to-end in production configuration.