An important class of applications, including programs that leverage third-party libraries, programs that use user-defined functions in databases, and serverless applications, benefits from isolating the execution of untrusted code at the granularity of individual functions or function invocations. However, existing isolation mechanisms were not designed for this use case; rather, they have been adapted to it. We introduce virtines, a new abstraction designed specifically for function granularity isolation, and describe how we build virtines from the ground up by pushing hardware virtualization to its limits. Virtines give developers fine-grained control in deciding which functions should run in isolated environments, and which should not. The virtine abstraction is a general one, and we demonstrate a prototype that adds extensions to the C language. We present a detailed analysis of the overheads of running individual functions in isolated VMs, and guided by those findings, we present Wasp, an embeddable hypervisor that allows programmers to easily use virtines. We describe several representative scenarios that employ individual function isolation, and demonstrate that virtines can be applied in these scenarios with only a few lines of changes to existing codebases and with acceptable slowdowns.
Guided by these examples, we introduce virtines, a new abstraction designed for isolating execution at function call granularity using hardware virtualization. Data touched by a virtine is automatically encapsulated in the virtine’s isolated execution environment. This environment implements an abstract machine model that is not constrained by the traditional x86 platform. Virtines can seamlessly interact with the host environment through a checked hypervisor interposition layer. With virtines, programmers annotate critical functions in their code using language extensions, with the semantics that a single virtine will run in its own, isolated virtual machine environment. While virtines require code changes, these changes are minimal and easy to understand. Our current language extensions are for C, but we believe they can be adapted to most languages.
The execution environments for virtines (including parts of the hypervisor) are tailored to the code inside the isolated functions; a virtine image contains only the software that a function needs. We present a detailed, ground-up analysis of the start-up costs for virtine execution environments, and apply our findings to construct small and efficient virtine images. Virtines can achieve isolated execution with microsecond-scale start-up latencies and limited slowdown relative to native execution. They are supported by a custom, user-space runtime system implemented using hardware virtualization called Wasp, which comprises a small, embeddable hypervisor that runs on both Linux and Windows. The Wasp runtime provides mechanisms to enforce strong virtine isolation by default, but isolation policies can be customized by users.
Our contributions in this paper are as follows:
We introduce virtines, programmer-guided abstractions that allow individual functions to run in light-weight, virtualized execution environments.
We present a prototype embeddable hypervisor framework, Wasp, that implements the virtine abstraction. Wasp runs as a Type-II micro-hypervisor on both Linux and Windows.
We provide language extensions for programming with virtines in C that are conceptually simple.
We evaluate Wasp’s performance using extensive microbenchmarking, and perform a detailed study of the costs of virtine execution environments.
A virtine provides an isolated execution environment using lightweight virtualization. Virtines consist of three components: a toolchain-generated binary to run in a virtual context, a hypervisor that facilitates the VM’s only external access (Wasp), and a host program which specifies virtine isolation policies and drives Wasp to create virtines. When invoked, virtines run synchronously from the caller’s perspective, leading them to appear and act like a regular function invocation. However, virtines could, given support in the hypervisor, behave like asynchronous functions or futures. As with most code written to execute in a different environment from the host, such as CUDA code or SGX enclaves, there are constraints on what virtine code can and cannot do. Due to their isolated nature, virtines have no direct access to the caller’s environment (global variables, heap, etc.). A virtine can, however, accept arguments and produce return values like any normal function. These arguments and return values are marshalled automatically by the virtine compiler when using our language extensions.
Unlike traditional hypervisors, a virtine hypervisor need not (and, we suspect, in most cases will not) emulate every part of the x86 platform, such as PCI, ACPI, interrupts, or legacy I/O. A virtine hypervisor therefore implements an abstract machine model designed for and restricted to the intentions of the virtine. Figure 1 outlines the architecture and data access capabilities (indicated by the arrows) of a virtine compared to a traditional process abstraction. A host program that uses (links against) the embeddable virtine hypervisor has some, but not necessarily all, of its functions run as virtines. We refer to such a host process as a virtine client. If the virtine context wishes to access any data or service outside of its isolated environment, it must first request access from the client via the hypervisor. Virtines exist in a default-deny environment, so the hypervisor must interpose on all such requests. While the hypervisor provides the interposition mechanism, the virtine client has the option to implement a hypercall policy, which determines whether or not an individual request will be serviced. The capabilities of a virtine are determined by (1) the hypervisor, (2) the runtime within a virtine image, and (3) policies determined by the virtine client.
Virtines are constructed from a subset of an application’s call graph. Currently, the decision of where to make the “cut” in the call graph is left to the programmer, but making this choice automatically in the compiler is possible. Since a virtine constitutes only a subset of the call graph, virtine images are typically small (∼16KB), and are statically compiled binaries containing all required software. Shared libraries violate our isolation requirements, as we will see in Section 3.1.
While the runtime environment that underlies a function running in virtine context can vary, we expect that in most cases this environment will comprise a limited, kernel-mode only, software layer. This may mean no scheduler, virtual memory, processes, threads, file systems, or any other high-level constructs that typically come with running a fully-featured VM. This is not, however, a requirement, and virtines can take advantage of hardware features like virtual memory, which can lead to interesting optimizations like those in Dune. Additional functionality must be provided by adding the functionality to the virtine environment or by borrowing functionality from the hypervisor. Adding this functionality should be done with care, as interactions with the hypervisor come with costs, both in terms of performance and isolation. In this paper, we provide two pre-built virtine execution environments (Section 5.4), but we envision a rich virtine ecosystem could develop from which an execution environment could be selected. These environments could also be synthesized automatically. Note that one possible execution environment for a virtine is a unikernel. However, unikernels are typically designed with a standard ABI in mind (e.g., binary compatible with Linux). Virtine execution environments are instead co-designed with the virtine client, and allow for a wide variety of virtual platforms which may support non-standard ABIs.
In this section, we describe our isolation and safety objectives in developing the virtine abstraction. We then discuss how to achieve these goals using hardware and software mechanisms.
Host code and data cannot be modified, and its control flow cannot be hijacked by a virtine running untrusted or adversarial code.
The private state of a virtine must not be affected by another virtine running untrusted or adversarial code. Thus, data secrecy must be maintained between virtines.
Host data secrecy must also be maintained, so virtines may not interact with any data or services outside of their own address space other than what is explicitly permitted by the virtine client’s policies.
Code that runs in virtine context can still suffer from software bugs such as buffer overflow vulnerabilities. We therefore assume an adversarial model, where attacks that arise from such bugs may occur, and where a virtine can behave maliciously. We assume the hypervisor (Wasp) and the host kernel are trusted, similar to prior work. In addition, we assume that the virtine client (in particular, its hypercall handlers) is trusted and implemented correctly. These handlers must take care to assume that inputs have not been properly sanitized, and may even be intentionally manipulated. Along with using best practices, we assume hypercall handlers are careful when accessing the resources mapped to a virtine, for example checking memory bounds before accessing virtine memory, validating potentially unsafe arguments, and correctly following the access model that the virtine requires. We assume that virtines do not share state with each other via shared mappings, and that they cannot directly access host memory. Additionally, we assume that microarchitectural and host kernel mitigations are sufficient to eliminate side channel attacks. Note that we do not expect end users to implement their own virtine clients. We instead assume that runtime experts will develop the virtine clients (and corresponding hypercall handlers). In this sense, our assumptions about the virtine client’s integrity are similar to those made in cloud platforms that employ a user-space device model (e.g., QEMU/KVM).
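The kind of argument validation this threat model expects from a hypercall handler can be illustrated with a small Python model. This is only a sketch under stated assumptions: guest memory is modeled as a `bytearray`, and the names `read_guest`, `handle_write_hypercall`, and `GuestMemoryError` are hypothetical and are not part of Wasp.

```python
class GuestMemoryError(Exception):
    """Raised when a virtine passes a pointer or length outside its memory."""


def read_guest(mem: bytearray, addr: int, length: int) -> bytes:
    """Copy `length` bytes out of modeled guest memory after validating the range."""
    # Treat every argument as attacker-controlled: check types first.
    if not isinstance(addr, int) or not isinstance(length, int):
        raise GuestMemoryError("address and length must be integers")
    # Reject negative values and ranges that run past the end of guest memory.
    if addr < 0 or length < 0 or addr + length > len(mem):
        raise GuestMemoryError(f"range [{addr}, {addr + length}) is out of bounds")
    # Return an immutable copy so later guest writes cannot alter validated data.
    return bytes(mem[addr:addr + length])


def handle_write_hypercall(mem: bytearray, buf_addr: int, buf_len: int,
                           max_len: int = 4096) -> bytes:
    """Hypothetical 'write' handler: validate the arguments, then return the payload."""
    if not isinstance(buf_len, int) or buf_len > max_len:
        raise GuestMemoryError("length is invalid or exceeds the handler's limit")
    return read_guest(mem, buf_addr, buf_len)
```

The handler copies data out of guest memory before using it, so a virtine cannot change the buffer between the bounds check and the use.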
Provided that both the hypervisor and the client-defined hypercall handlers (of which there are few) are carefully implemented using best practices of software development, an adversarial virtine cannot directly modify the state or code paths of the host. However, virtines do not guarantee that an attacker who is permitted access to certain hypercalls or secret data cannot use those hypercalls to exfiltrate sensitive data via side-channel mechanisms. This risk can be mitigated by dynamically disabling hypercalls when the runtime does not need them, further restricting the attack surface.
Requiring that no two virtines directly share memory without first receiving permission from the hypervisor (e.g., via the hypercall interface) ensures data secrecy within the virtine. Each virtine must have its own set of private data which must be disjoint from any other virtine’s set. Thus, a virtine that runs untrusted or malicious code cannot affect the integrity of other virtines.
Modeling virtine and host private state as disjoint sets disallows any shared state between virtines or with the host. The hardware’s use of nested paging (EPT in VT-x) prevents such access at a hardware level. Also, by assuming that hypercalls are carefully implemented, and that they only permit operations required by the application, we achieve isolation from states and services outside the virtine.
Before exploring the implementation of virtines, we first describe a series of experiments that guided their design. These experiments establish the creation costs of minimal virtual contexts and of the execution environments used within those contexts. Our goal is to establish what forms of overhead will be most significant when creating a virtine.
The majority of our Linux and KVM experiments were run on tinker, an AMD EPYC 7281 (Naples; 16 cores; 2.69 GHz) machine with 32 GB DDR4 running stock Linux kernel version 5.9.12. We disabled hyperthreading, turbo boost, and DVFS to mitigate measurement noise. We used a Dell XPS 9500 with an Intel i7 10750H (Comet Lake; 6 cores) for SGX measurements. This machine has 32 GB DDR4 and runs stock Ubuntu 20.04 (kernel version 5.13.0-28). We used gcc 10.2.1 to compile Wasp (C/C++), clang 10.0.1 for our C-based virtine language extensions, and NASM v2.14 for assembly-only virtines. Unless otherwise noted, we conduct experiments with 1000 trials. Note that our hypervisor implementation works on Linux and has a prototype implementation on Windows (through Hyper-V), but for brevity we only show KVM’s performance on Linux, as Hyper-V performance was similar for our experiments.
We probe the costs of virtual execution contexts and see how they
compare to other types of execution contexts. To establish baseline
creation costs, we measure how quickly various execution contexts can be
constructed on tinker, as shown in
Figure 2. We measure the time it takes to
create, enter, and exit from the context in a way that the hypervisor
can observe. In “KVM”, we observe the latency to construct a virtual
machine and call the
hlt instruction. “Linux pthread” is simply a
pthread_create call followed by
pthread_join. The “vmrun”
measurement is the cost of running a VM hosted on KVM without the cost
of creating its associated state, i.e., only the
Finally, “function” is the cost of calling and returning from a null
function. All measurements are obtained using the
The “vmrun” measurements represent the lowest latency we could achieve
to begin execution in a virtual context using KVM in Linux 5.9. This
latency includes the cost of the
ioctl system call, which in KVM is
handled with a series of sanity checks followed by execution of the
vmrun instruction. Several optimizations can be made to the hypervisor
to reduce the cost of spawning new contexts and lower the latency of a
virtine, which we outline in
These measurements tell us that while a virtine invocation will be unsurprisingly more expensive than a native function call, it can compete with thread creation and will far outstrip any start-up performance that processes (and by proxy, containers) will achieve in a standard Linux setting. We conclude that the baseline cost of creating a virtual context is relatively inexpensive compared to the cost of other abstractions.
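The shape of this baseline measurement (create, enter, and exit a context, repeated over many trials) can be sketched in Python for the two contexts that need no hypervisor. This is an illustrative stand-in, not the authors' harness: it uses `time.perf_counter_ns` rather than a cycle counter, the function names are hypothetical, and Python's interpreter overhead dominates the absolute numbers.

```python
import statistics
import threading
import time


def null_function() -> None:
    """Does nothing; stands in for the null function in the measurement."""


def time_call(trials: int = 1000) -> float:
    """Median nanoseconds to call and return from a null function."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter_ns()
        null_function()
        samples.append(time.perf_counter_ns() - start)
    return statistics.median(samples)


def time_thread(trials: int = 1000) -> float:
    """Median nanoseconds to create a thread and join it."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter_ns()
        worker = threading.Thread(target=null_function)
        worker.start()   # analogous to pthread_create
        worker.join()    # analogous to pthread_join
        samples.append(time.perf_counter_ns() - start)
    return statistics.median(samples)
```

Calling `time_call()` and `time_thread()` and comparing the medians gives a rough analogue of the "function" and "Linux pthread" measurements; only the relative ordering is meaningful.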
The boot sequences of fully-featured OSes are too costly to include on the critical path for low-latency function invocations. It takes hundreds of milliseconds to boot a standard Linux VM using QEMU/KVM. To understand why, we measured the time taken for components of a vanilla Linux kernel boot sequence and found that roughly 30% of the boot process is spent scanning ACPI tables, configuring ACPI, enumerating PCI devices, and populating the root file system. Most of these features, such as a fully-featured PCI interface, or a network stack, are unnecessary for short-lived, virtual execution environments, and are often omitted from optimized Linux guest images such as the Alpine Linux image used for Amazon’s Firecracker. Caching pre-booted environments can further mitigate this overhead, as we describe in §5.2.
|Paging identity mapping||28109|
|Long transition (
|Jump to 32-bit (
|Jump to 64-bit (
|Load 32-bit GDT (
Table 1: Boot time breakdown for our minimal runtime environment on KVM. These are minimum latencies observed per component, measured in cycles.
In light of the data gathered in Figure 2, we set out to measure the cost of creating a virtual context and configuring it with the fewest operations possible. To do this, we built a simple wrapper around the KVM interface that loads a binary image compiled from roughly 160 lines of assembly. This binary closely mirrors the boot sequence of a classic OS kernel: it configures protected mode, a GDT, paging, and finally jumps to 64-bit code. These operations are outlined in Table 1, which indicates the minimum latencies (cycles) for each component, ordered by cost.
The row labeled “Paging/ident. map” is by far the most expensive at
∼28K cycles. Here we are using 2MB large pages to identity map
the first 1GB of address space, which entails three levels of page
tables (i.e., 12KB of memory references), plus the actual installation
of the page tables, control register configuration, and construction of
an EPT inside KVM. The transition to protected mode takes the second
longest, at 3K cycles. This is a bit surprising, given that this only
entails the protected mode bit flip (PE, bit 0) in
cr0. The transition
to long mode (which takes several hundreds of cycles) is less
significant. The remaining components—loading a 32-bit GDT, the long
jumps to complete the mode transitions, and the initial interrupt
The more complex the mode of execution (16, 32, or 64 bits), the higher
the latency to get there. This is consistent with descriptions in the
hardware manuals. To further investigate this effect, we invoked a
small binary written in assembly that brings the virtual context up to a
particular x86 execution mode and executes a simple function (fib(20)
with a simple, recursive implementation).
Figure 3 shows our findings for the three
canonical modes of the x86 boot process using KVM: 16-bit (real) mode,
32-bit (protected) mode, and 64-bit (long) mode. Each mode includes the
necessary components from
Table 1 in the setup of the virtual
context. In this experiment, for each mode of execution, we measured the
latency in cycles from the time we initiated an entry on the host
(KVM_RUN), to the time it took to bring the machine up to that mode in
the guest (including the necessary components listed in
Table 1), run fib(20), and exit
back to the host. These measurements include entry, startup cost,
computation, and exit. Note that we saw several outliers in all cases,
likely due to host kernel scheduling events. To make the data more
interpretable, we removed these outliers.
While we expect much of the time to be dominated by entry/exit and the arithmetic, the benefits of real-mode-only execution for our hand-written version are clear. The difference between the 16-bit and 32-bit environments is not surprising: the most significant costs listed in Table 1 are not incurred when executing in 16-bit mode. Protected- and long-mode execution cost essentially the same, as both include those costs (paging and protected-mode setup). These results suggest that roughly 10K cycles may be saved, provided that the virtine is short-lived (on the order of microseconds) and can feasibly execute in real mode.
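To make the paging cost concrete, the following Python sketch builds the three tables needed to identity map the first 1GB with 2MB pages and checks the arithmetic quoted earlier (three 4KB tables, i.e., 12KB). The flag bits follow the x86-64 page-table entry layout; the table addresses (starting at `0x1000`) and the function name are hypothetical choices for illustration and do not come from the virtine boot code.

```python
from typing import Dict, List

PAGE_TABLE_BYTES = 4096          # each x86-64 paging structure occupies one 4KB page
ENTRIES_PER_TABLE = 512          # 4096 bytes / 8-byte entries
LARGE_PAGE = 2 * 1024 * 1024     # a 2MB page mapped directly by a page-directory entry

PRESENT = 1 << 0                 # P bit
WRITABLE = 1 << 1                # R/W bit
PAGE_SIZE = 1 << 7               # PS bit: this entry maps a large page


def build_identity_map(pml4_addr: int = 0x1000) -> Dict[int, List[int]]:
    """Return {table_address: entries} that identity maps the first 1GB."""
    pdpt_addr = pml4_addr + PAGE_TABLE_BYTES
    pd_addr = pdpt_addr + PAGE_TABLE_BYTES

    pml4 = [0] * ENTRIES_PER_TABLE
    pdpt = [0] * ENTRIES_PER_TABLE
    pml4[0] = pdpt_addr | PRESENT | WRITABLE   # level 4 -> level 3
    pdpt[0] = pd_addr | PRESENT | WRITABLE     # level 3 -> level 2
    # Level 2: each of the 512 entries maps one 2MB page to the same physical address.
    pd = [(i * LARGE_PAGE) | PRESENT | WRITABLE | PAGE_SIZE
          for i in range(ENTRIES_PER_TABLE)]
    return {pml4_addr: pml4, pdpt_addr: pdpt, pd_addr: pd}
```

Three tables of 4096 bytes give 12KB of page-table memory, and 512 entries of 2MB each cover exactly 1GB, which matches the figures in the discussion of Table 1.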
We have seen that a minimal long-mode boot sequence costs less than 30K
cycles (∼12 μs), but what does it take to do something useful?
To determine this, we implemented a simple HTTP echo server where each
request is handled in a new virtual context employing our minimal
environment. We built a simple micro-hypervisor in C++ and a runtime
environment that brings the machine up to C code and uses
hypercall-based I/O to echo HTTP requests back to the sender. The
runtime environment comprises 970 lines of C (a large portion of which
are string formatting routines) and 150 lines of x86 assembly. The
micro-hypervisor comprises 900 lines of C++. The hypercall-based I/O
(described more in Section 5.1) obviates the need to emulate network
devices in the micro-hypervisor and implement the associated drivers in
the virtual runtime environment, simplifying the development process.
Figure 4 shows the mean time measured in
cycles to pass important startup milestones during the bring-up of the
server context. The left-most point indicates the time taken to reach
the server context’s main entry point (C code); roughly 10K cycles. Note
that this example does not actually require 64-bit mode, so we omit
paging and leave the context in protected mode. The middle point shows
the time to receive a request (the return from
recv()), and the last
point shows the time to complete the response (
measurements are taken inside the virtual context.
The send and receive functions for this environment use hypercalls to defer to the hypervisor, which proxies them to the Linux host kernel using the appropriate system calls. Even when leveraging the underlying host OS, and when adding the from-scratch virtual context creation time from Figure 2, we can achieve sub-millisecond HTTP response latencies (<300 μs) without optimizations (§5.2). Thus, we can infer that despite the cost of creating a virtual context, having few host/virtine interactions can keep execution latencies in a virtual context within an acceptable range. Note, however, that the guest-to-host interactions in this test introduce variance from the host kernel’s network stack, indicated by the large standard deviations shown in Figure 4.
These results are promising, and they indicate that we can achieve low overheads and start-up latencies for functions that do not require a heavy-weight runtime environment. We use three key insights from this section to inform the design of our virtine framework in the next section: (1) creating hardware virtualized contexts can be cheap when the environment is small, (2) tailoring the execution environment (for example, the processor mode) can pay off, and (3) host interactions can be facilitated with hypercalls (rather than shared memory), but their number must be limited to keep costs low.
In this section, we present Wasp, a prototype hypervisor designed for the creation and management of virtine environments. We also cover a few of the optimizations designed to overcome the cost of creating virtual contexts using KVM.
Wasp is a specialized, embeddable micro-hypervisor runtime that deploys
virtines with an easy-to-use interface. Wasp runs on Linux and Windows.
At its core, Wasp is a fairly ordinary hypervisor, hosting many virtual
contexts on top of a host OS. However, like other minimal hypervisors
such as Firecracker, Unikernel monitors, and uhyve, Wasp does not aim
to emulate the entire x86 platform or device model. As shown in
Figure 5, Wasp is a userspace runtime system
built as a library that host programs (virtine clients) can link
against. Wasp mediates virtine interactions with the host via a
hypercall interface, which is checked by the hypervisor and the virtine
client. The figure shows one virtine that has no host interactions, one
virtine which makes a valid hypercall request, and another whose
hypercall request is denied by the client-specified security policy. By
using Wasp’s runtime API, a virtine client can leverage hardware
specific virtualization features without knowing their details. Several
types of applications (including dynamic compilers and other runtime
systems) can link with the Wasp runtime library to leverage virtines. On
Linux, each virtual context is represented by a device file which is
manipulated by Wasp using an
Wasp provides no libraries to the binary being run, meaning virtines have
no in-virtine runtime support by default. Wasp simply accepts a binary
image, loads it at guest virtual address
0x8000, and enters the VM
context. Any extra functionality must be achieved by interacting with
the hypervisor and virtine client. In Wasp, delegation to the client is
achieved with hypercalls using virtual I/O ports.
Hypercalls in Wasp are not meant to emulate low-level virtual devices,
but are instead designed to provide high-level hypervisor services with
as few exits as possible. For example, rather than performing file I/O
by ultimately interacting with a virtio device and parsing filesystem
structures, a virtine could use a hypercall that mirrors the
POSIX system call. Hypercalls vector to a co-designed handler either
provided by Wasp or implemented by the virtine client. Wasp provides the
mechanisms to create virtines, while the client can specify security
policies through handlers. These handlers could simply run a series of
checks and pass through certain host system calls while filtering others
out. While virtine clients can implement custom hypercall handlers, they
can also choose from a variety of general-purpose handlers that Wasp
provides out-of-the-box; these canned hypercalls are used by our
language extensions (§5.3). By default, Wasp provides no
externally observable behavior through hypercalls other than the ability
to exit the virtual context; all other external behavior must be
validated and expressly permitted by the custom (or canned) hypercall
handlers, which are implemented (or selected) by the virtine client.
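The split between mechanism (Wasp) and policy (client handlers) can be modeled with a small dispatcher in which only "exit" is available until the client registers a handler. This is a sketch under assumptions: the hypercall numbers, the `Dispatcher` class, and the `write_stdout_only` handler are hypothetical and do not reflect Wasp's actual interface.

```python
from typing import Callable, Dict, Tuple

HC_EXIT = 0    # hypothetical hypercall numbers, chosen for illustration
HC_WRITE = 1

Handler = Callable[[Tuple[int, ...]], int]


class HypercallDenied(Exception):
    """Raised when a request is not permitted by the client's policy."""


class Dispatcher:
    def __init__(self) -> None:
        # Default-deny: the only behavior available out of the box is exiting.
        self._handlers: Dict[int, Handler] = {HC_EXIT: lambda args: 0}

    def permit(self, number: int, handler: Handler) -> None:
        """The client opts in to a hypercall by supplying a checking handler."""
        self._handlers[number] = handler

    def dispatch(self, number: int, args: Tuple[int, ...] = ()) -> int:
        handler = self._handlers.get(number)
        if handler is None:
            raise HypercallDenied(f"hypercall {number} is not permitted")
        return handler(args)


def write_stdout_only(args: Tuple[int, ...]) -> int:
    """Example policy: pass through writes to fd 1 and filter out everything else."""
    fd, length = args
    if fd != 1:
        raise HypercallDenied("only stdout may be written")
    return length  # a real handler would forward the validated request to the host
```

A client would call `permit(HC_WRITE, write_stdout_only)` before running the virtine; any hypercall without a registered handler is refused.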
To reduce virtine start-up latencies, Wasp supports a pool of cached,
uninitialized virtines (shells) that can be reused. As depicted in
Figure 6, Wasp receives a request from a
virtine client (A), which will drive virtine creation. Such requests
can be generated in a variety of virtine client scenarios. For example,
network traffic hitting a web server that implements a virtine client
may generate virtine invocations. A database engine incorporating a
virtine client may run virtine-based UDFs in response to triggers.
Because we must use a new virtine for every request, a hardware virtual
context must be provisioned to handle each invocation. The context is
acquired by one of two methods: provisioning a clean virtual context
(C) or reusing a previously created context (D). When the system is
cold (no virtines have yet been created), we must ask the host kernel
for a new virtual context by using KVM’s
KVM_CREATE_VM interface. If
this route is taken, we pay a higher cost to construct a virtine due to
the host kernel’s internal allocation of the VM state (VMCS on
Intel/VMCB on AMD). However, once we do this, and the relevant virtine
returns, we can clear its context (E), preventing information leakage,
and cache it in a pool of “clean” virtines (C) so the host OS need not
pay the expensive cost of re-allocating virtual hardware contexts. These
virtine “shells” sit dormant waiting for new virtine creation requests
(B). The benefits of pooling virtines are apparent in
Figure 8 by comparing creation of a Wasp
virtine from scratch (the “Wasp” measurement) with reuse of a cached
virtine shell from the pool (“Wasp+C”). By recycling virtines, we can
reach latencies much lower than Linux thread creation and much closer to
the hardware limit, i.e., the
vmrun instruction. Note that here we
include Linux process creation latencies as well for scale. Included is
the “Wasp+CA” (cached, asynchronous) measurement, which does not measure
the cost of cleaning virtines and instead cleans them asynchronously in
the background. This can be implemented by either a background thread or
can be done when there are no incoming requests. This measurement shows
that the caching mechanism brings the cost of provisioning a virtine
shell to within 4% of a bare
We also measured these costs on a recent SGX-enabled Intel platform and observed similar behavior, as shown in the bottom half of Figure 8. The “SGX Create” measurement indicates the cost of creating a new enclave, and the ECALL measurement indicates the cost of entering an enclave, thus reusing the previously created context.
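The shell pool described in this section can be modeled in a few lines of Python. In this sketch a `Shell` stands in for a hardware virtual context; the class names and the `created` counter are hypothetical, and cleaning is done synchronously on release (the "Wasp+C" case) rather than in the background.

```python
from collections import deque
from typing import Deque


class Shell:
    """Stand-in for a hardware virtual context (memory plus register state)."""

    def __init__(self, mem_size: int = 4096) -> None:
        self.memory = bytearray(mem_size)
        self.registers = {"rip": 0, "rsp": 0}

    def clear(self) -> None:
        """Zero all state so nothing leaks to the next virtine."""
        self.memory[:] = bytes(len(self.memory))
        for name in self.registers:
            self.registers[name] = 0


class ShellPool:
    def __init__(self) -> None:
        self._clean: Deque[Shell] = deque()
        self.created = 0  # counts expensive from-scratch creations

    def acquire(self) -> Shell:
        if self._clean:
            return self._clean.popleft()  # fast path: reuse a cached shell
        self.created += 1                 # slow path: models creating a new context
        return Shell()

    def release(self, shell: Shell) -> None:
        shell.clear()                     # clean before caching
        self._clean.append(shell)
```

An asynchronous variant would move the `clear()` call to a background thread or to idle periods, which is the difference the "Wasp+CA" measurement captures.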
As was shown in Section 4, the initialization of a virtine’s execution state can lead to significant overheads compared to traditional function calls. This overhead is undesirable if the code that is executed in a virtine is not particularly long-lived (less than a few microseconds). Others have mitigated these start-up latencies in the serverless domain by “checkpointing” or “snapshotting” container runtime state after initialization. In a similar fashion, Wasp supports snapshotting by allowing a virtine to leverage the work done by previous executions of the same function. As outlined in Figure 7, the first execution of a virtine must still go through the initialization process by entering the desired mode and initializing any runtime libraries (in this case, libc). The virtine then takes a snapshot of its state, and continues executing. Subsequent executions of the same virtine can then begin execution at the snapshot point and skip the initialization process. This optimization significantly reduces virtine overheads, which we explore further in Section 5.3. Of course, by snapshotting a virtine’s private state, that state is exposed to all future virtines that are created using that “reset state.” Thus, care must be taken in describing what memory is saved in a snapshot in order to maintain the isolation objectives outlined in Section 3.3. We detail the costs involved in snapshotting in Section 6.2.
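The snapshot flow (initialize once, capture state, start later invocations from the captured state) can be modeled as follows. This is a simplified Python sketch, not Wasp's mechanism: memory is a `bytearray`, the `SnapshottingRunner` class and its `boots` counter are hypothetical, and the snapshot is taken before any per-invocation data is written so that one invocation's state is not exposed to the next.

```python
from typing import Callable, Optional


class SnapshottingRunner:
    def __init__(self, init: Callable[[bytearray], None],
                 body: Callable[[bytearray, int], int],
                 mem_size: int = 64) -> None:
        self._init = init          # models the boot sequence and libc initialization
        self._body = body          # models the virtine's function
        self._mem_size = mem_size
        self._snapshot: Optional[bytes] = None
        self.boots = 0             # how many times the slow path ran

    def invoke(self, arg: int) -> int:
        if self._snapshot is None:
            mem = bytearray(self._mem_size)
            self._init(mem)                  # slow path: full initialization
            self.boots += 1
            self._snapshot = bytes(mem)      # immutable copy of the "reset state"
        else:
            mem = bytearray(self._snapshot)  # fast path: private copy of the snapshot
        return self._body(mem, arg)
```

Because each invocation works on its own copy, writes made by the function body never reach the saved reset state.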
While Wasp significantly eases the development and deployment of
virtines, with only the runtime library, developers must still manage
virtine internals, namely the build process for the virtine’s internal
execution environment. Requiring developers to create kernel-style build
systems that package boot code, address space configurations, a minimal
libc, and a linker script per virtine creates an undue burden. To
alleviate this burden, we implemented a clang wrapper and LLVM compiler
pass. The purpose of the clang wrapper is to include our pass in the
invocation of the middle-end. The compiler pass detects C functions
annotated with the
virtine keyword, runs middle-end analysis at the IR
level, and automatically generates code that invokes a pre-compiled
virtine binary whenever the function is called. When this pass detects a
function annotation as shown in
Figure [fig:c-virtine-api], it generates
a call graph rooted at that function. The compiler automatically
packages a subset of the source program into the virtine context based
on what that virtine needs. Global variables accessed by the virtine are
currently initialized with a snapshot when the virtine is invoked.
Concurrent modifications (e.g., by different virtines, or by the client
and a virtine) will occur on distinct copies of the variable. Currently,
if a virtine calls another virtine-annotated function, a nested virtine
will not be created.
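The copy semantics for global variables described above can be illustrated with a short Python model: the callee receives a snapshot of the globals it uses, so its modifications land on a distinct copy. The helper name `invoke_isolated` and the dictionary representation of globals are assumptions made for this sketch.

```python
import copy
from typing import Any, Callable, Dict, Tuple


def invoke_isolated(func: Callable[..., Any], globals_snapshot: Dict[str, Any],
                    *args: Any) -> Tuple[Any, Dict[str, Any]]:
    """Run `func` against a private deep copy of the caller's globals."""
    private = copy.deepcopy(globals_snapshot)  # snapshot taken at invocation time
    result = func(private, *args)
    return result, private                     # the caller's globals are untouched


def bump(state: Dict[str, Any], amount: int) -> int:
    """Example callee that modifies what it believes is a global counter."""
    state["counter"] += amount
    return state["counter"]
```

After `invoke_isolated(bump, host_globals, 5)`, the returned value reflects the update while `host_globals["counter"]` keeps its original value.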
To further ease programming burden, compiler-supported virtines must
have access to some subset of the C standard library. Due to the nature
of their runtime environment, basic virtines do not include these
libraries. To remedy this, we created a virtine-specific port of
newlib, an embeddable C standard library that statically links and
maintains a relatively small virtine image size. Newlib allows
developers to provide their own system call implementations; we simply
forward them to the hypervisor as a hypercall. When the virtine
keyword is used, all hypercalls are restricted by default, following the
default-deny semantics of virtines previously mentioned. If, however,
the programmer (implementing the virtine client) would like to permit
hypercalls, they can use the
virtine_permissive keyword to allow all
hypercalls, or the
virtine_config(cfg) keyword to supply a configuration
structure that contains a bit mask of allowed hypercalls. If a hypercall
is permitted, the handler in the client must validate the arguments and
service it, for example by delegating to the host kernel’s system call
interface or by performing client-specific emulation.
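A bit mask of allowed hypercalls, as described above, might be checked along these lines. The sketch is in Python rather than C for brevity; the bit assignments and names (`HC_READ`, `ALLOW_NONE`, `is_allowed`, and so on) are hypothetical and do not reflect the layout of the actual configuration structure.

```python
# Hypothetical bit assignments, one bit per hypercall.
HC_READ = 1 << 0
HC_WRITE = 1 << 1
HC_OPEN = 1 << 2
HC_CLOSE = 1 << 3

ALLOW_NONE = 0                                        # models the default-deny case
ALLOW_ALL = HC_READ | HC_WRITE | HC_OPEN | HC_CLOSE   # models a fully permissive mask


def is_allowed(mask: int, hypercall_bit: int) -> bool:
    """True if the configuration mask permits the given hypercall."""
    return (mask & hypercall_bit) != 0


# A client that only needs to read and write already-open descriptors:
read_write_only = HC_READ | HC_WRITE
```

With `read_write_only`, `is_allowed(read_write_only, HC_OPEN)` is false, so a virtine using this mask could not open new files even if its code were compromised.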
This allows virtines to support standard library functionality without
drastically expanding the virtine runtime environments. Of course, by
using a fully fledged standard library, the user still opens themselves
up to common programming errors. For example, an errant
still result in undefined (or malicious) behavior, but this has no
consequences for the host or other virtines as outlined in
Section 3.3. All virtines created via our
language extensions use Wasp’s snapshot feature by default. This can be
disabled with the use of an environment variable.
Wasp provides two default execution environments for programmers to use, though others are possible. These default environments are shown in Figure 9. For the C extensions (A), the virtine is pre-packaged with a POSIX-like runtime environment, which stands between the “boot” process and the virtine’s function. If a programmer directly uses the Wasp C++ API (B), the virtine is not automatically packaged with a runtime, and it is up to the client to provide the virtine binary. Both environments can use snapshotting after the reset stage, allowing them to skip the costly boot sequence. We envision an environment management system that will allow programmers to treat these environments much like package dependencies.
In this section, we evaluate virtines and the Wasp runtime using microbenchmarks and case studies that are representative of function isolation in the wild. With these experiments, we seek to answer the following questions:
How significant are baseline virtine startup overheads with our language extensions, and how much computation is necessary to amortize them? (§6.1)
What is the impact of the virtine’s execution environment (image size) on start-up cost? (§6.2)
What is the performance penalty for host interactions? (§6.3)
How much effort is required to integrate virtines with off-the-shelf library code? (§6.4)
How difficult is it to apply virtines to managed language use cases and what are the costs? (§6.5)
We first study the start-up overheads of virtines using our language
extensions. We implemented the minimal
fib example shown in
Figure [fig:c-virtine-api] and scaled
the argument to
fib to increase the amount of computation per function
invocation, shown in
Figure 10. We compare virtines with and
without image snapshotting to native function invocations. fib(0)
essentially measures the inherent overhead of virtine creation, and as
n increases, the cost of creating the virtine is amortized. The
measurements include setup of a basic virtine image (which includes
libc), argument marshalling, and minimal machine state initialization.
The argument, n, is loaded into the virtine’s address space at address
0x0. In the case of the experiment labeled “virtine + snapshot,” a
snapshot of the virtine’s execution state is taken on the first
invocation of the
fib function. All subsequent invocations of that
function will use this snapshot, skipping the slow path boot sequence
(see Figure 7), producing an overall
speedup of 2.5× relative to virtines without snapshotting for
fib(0). Note that we are not measuring the steady state, so the
bars include the overhead for taking the initial snapshot. This is why
we see more variance for the snapshotting measurements.
At first, the relative slowdown between native function invocation and virtines with snapshotting is 6.6×. When the virtine is short-lived, the costs of provisioning a virtine shell and initializing it account for most of the execution time. However, with larger computational requirements, the slowdown drops to 1.03× for n = 25 and 1.01× for n = 30. This shows that as the function complexity increases, virtine start-up overheads become negligible, as expected. Here we can amortize start-up overheads with ∼100μs of work.
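The amortization argument can be sketched as a small timing harness. This is an illustrative sketch only: the empty `virtine` macro is a placeholder so that the code compiles with a stock compiler (with the custom clang/LLVM toolchain it is a real keyword), and `mean_call_ns` and `modeled_slowdown` are hypothetical helper names. The model simply treats virtine creation as a fixed per-invocation cost added to the useful work.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Placeholder so this compiles natively; the real keyword needs the
 * virtine-aware toolchain. */
#ifndef virtine
#define virtine
#endif

/* Recursive Fibonacci, annotated for isolation. */
virtine int fib(int n)
{
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

/* Mean wall-clock nanoseconds per fib(n) call over `iters` invocations. */
static double mean_call_ns(int n, int iters)
{
    if (iters <= 0)
        return 0.0;
    struct timespec t0, t1;
    volatile int sink = 0; /* keeps the calls from being optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        sink += fib(n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    int64_t ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
               + (int64_t)(t1.tv_nsec - t0.tv_nsec);
    return (double)ns / iters;
}

/* Slowdown when a fixed overhead is added to each unit of work. */
static double modeled_slowdown(double work_ns, double overhead_ns)
{
    return (work_ns + overhead_ns) / work_ns;
}
```

Under this model the slowdown approaches 1 as the work per invocation grows relative to the fixed overhead, which is the trend reported above for increasing n.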
Table 2: Comparing costs of crossing isolation boundaries.
|System|Latency|Boundary Cross Mechanism|
|---|---|---|
|Enclosures|0.9μs|Custom syscall interface|
|Virtines|5μs|Syscall interface + VMRUN|
We compare virtine start-up costs to the cost of crossing isolation boundaries in other published systems in Table 2. While the types of isolation these systems provide differ slightly, these numbers put the cost of the underlying mechanism into perspective. LwC and Enclosures switch between isolated contexts within the same kernel in a similar way to process-based isolation. SeCage and Hodor measure only the latency of the VMFUNC instruction without a VMEXIT event. Virtine latency is measured from userspace on the host, surrounding the KVM_RUN ioctl, thus incurring system call and ring-switch overheads.
To evaluate the impact of virtines’ execution environments on start-up
costs, we performed an experiment that artificially increases image
size, shown in
Figure 11. This figure shows increasing
virtine image size (up to 16MB) versus virtine execution latency for a
minimal virtine that simply halts on startup. We synthetically increase
image size by padding a minimal virtine image with zeroes. With a 16MB
image size, the start-up cost is 2.3ms. This amounts to roughly 6.8GB/s,
which is in line with our measurement of the
memcpy bandwidth on our
tinker machine, 6.7GB/s. This shows the minimal cost a virtine will
incur for start-up with a simple snapshotting strategy when the boot
sequence is eliminated. Using a copy-on-write approach, as is done in
SEUSS, we expect this cost could be reduced drastically.
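A copy-based snapshot restore of this kind can be approximated with a small harness that times memcpy over a zero-filled image. The function names are hypothetical and this is a sketch of the measurement idea, not the benchmark used for Figure 11.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Reset a virtine's memory image by copying the snapshot over it. */
static void restore_snapshot(unsigned char *image,
                             const unsigned char *snapshot, size_t size)
{
    memcpy(image, snapshot, size);
}

/* Mean seconds per restore of a `size`-byte image, or -1.0 on failure. */
static double seconds_per_restore(size_t size, int reps)
{
    if (size == 0 || reps <= 0)
        return -1.0;
    unsigned char *snapshot = calloc(size, 1); /* zero padding */
    unsigned char *image = malloc(size);
    if (!snapshot || !image) {
        free(snapshot);
        free(image);
        return -1.0;
    }
    memset(image, 0xff, size);
    restore_snapshot(image, snapshot, size); /* untimed warm-up */

    volatile unsigned char sink = 0; /* discourages eliding the copies */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        restore_snapshot(image, snapshot, size);
        sink ^= image[size - 1];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    free(snapshot);
    free(image);
    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return secs / reps;
}
```

Dividing the image size by the value returned from `seconds_per_restore` gives an effective copy bandwidth that can be compared against the machine's memcpy bandwidth.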
These results reflect what others have seen for unikernel boot times. Unikernels tend to have a larger image size than what would be needed for a virtine execution environment, and thus incur longer start-up times. Kuenzer et al. report the shortest we have seen, at tens to hundreds of μs for Unikraft, while other unikernels (MirageOS, OSv, Rump, HermiTux, and Lupine) take tens to hundreds of milliseconds to boot a trivial image. For example, we measured the no-op function evaluation time under OSv to be roughly 600 milliseconds on our testbed. A similar no-op function achieved roughly 12ms under MirageOS run with Solo5’s HVT tender, which directly interfaces with KVM and uses hypercalls in a similar way to virtines.
As outlined in Section 2, virtines must interact with the client for all actions that are not fulfilled by the environment within the virtine. For example, a virtine must use hypercalls to read files or access shared state. Here we attempt to determine how frequent client interactions (via hypercalls) affect performance for an easily understood example. To do so, we use our C extension to annotate a connection handling function in a simple, single-threaded HTTP server that serves static content. Each connection that the server receives is passed to this function, which automatically provisions a virtine environment.
We measured both the latency and throughput of HTTP requests with and
without virtines on tinker. The results are shown in
Figure [fig:http-perf]. Virtine performance
is shown with and without snapshotting (“virtine” and “snapshot”).
Requests are generated from
localhost using a custom request generator
(which always requests a single static file). Note that each virtine
invocation here involves seven host interactions (hypercalls): (1)
read() a request from host socket, (2)
stat() requested file, (3)
open() file, (4)
read() from file, (5)
write() response, (6)
close() file, (7)
exit(). Wasp handles these hypercalls by first
validating arguments and, if they are allowed through, re-creating the
calls on the host. For example, a validated
read() will turn into a
read() on the host filesystem. The exits generated by these hypercalls
are doubly expensive due to the ring transitions necessitated by KVM.
However, despite the cost of these host interactions, virtines with
snapshots incur only a 12% decrease in throughput relative to the
baseline. We expect that these costs would be reduced in a more
realistic HTTP server, as more work unrelated to I/O would be involved.
This effect has been observed by others employing connection
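The validate-then-forward handling of a read() hypercall described above can be sketched as follows. The `guest_mem` structure, the single allowed descriptor, and the error conventions are assumptions made for illustration and do not reflect Wasp's actual interface; only the host read() call is a real API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical view of a virtine's memory as mapped in the client. */
struct guest_mem {
    unsigned char *base;
    size_t size;
};

/* True if [addr, addr + len) lies inside guest memory, without overflow. */
static int guest_range_ok(const struct guest_mem *g, uint64_t addr, uint64_t len)
{
    return addr <= g->size && len <= g->size - addr;
}

/* Service read(fd, buf, len), where buf is a guest address.
 * Returns bytes read, or a negated errno value on rejection or failure. */
static ssize_t hc_read(const struct guest_mem *g, int allowed_fd,
                       int fd, uint64_t guest_buf, uint64_t len)
{
    if (fd != allowed_fd)
        return -EBADF;  /* only the connection's descriptor is permitted */
    if (!guest_range_ok(g, guest_buf, len))
        return -EFAULT; /* buffer would fall outside guest memory */
    ssize_t n = read(fd, g->base + guest_buf, (size_t)len);
    return n < 0 ? -errno : n;
}
```

The bounds check is written to avoid integer overflow in `addr + len`, since both values come from untrusted guest state.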
To investigate the difficulty of incorporating virtines into libraries, and more significant codebases, we modified off-the-shelf OpenSSL.⁴ OpenSSL is used as a library in many applications, such as the Apache web server, Lighttpd, and OpenVPN. We changed the library so that its 128-bit AES block cipher encryption is carried out in virtine context. We chose this function since it is a core component of many higher-level encryption features. While this would not be a good candidate for running in virtine context from a performance perspective, it gives us an idea of how difficult it is to use virtines to isolate a deeply buried, heavily optimized function in a large codebase.
Compiling OpenSSL using virtines was straightforward. From the
developer’s perspective, it simply involved annotating the block cipher
function with the
virtine keyword and integrating our custom
clang/LLVM toolchain with the OpenSSL build environment (i.e., swapping
the default compiler). The latter step was more work. In all, the change
took roughly one hour for an experienced developer.
Though our main goal here was not to evaluate end-to-end performance, we did measure the performance impact of integrating virtines using OpenSSL’s internal benchmarking tool. We ran the built-in speed benchmark⁵ to measure the throughput of the block cipher using virtines (with our snapshotting optimization) compared to the baseline (native execution). Note that since the block cipher is being invoked many thousands of times per second, virtine creation overheads amplify the invocation cost significantly. In a realistic scenario, the developer would likely include more functionality in virtine context, amortizing those overheads. That said, with snapshotting and a 16KB cipher block size, virtines incur only a 17× slowdown relative to native execution. The OpenSSL virtine image we use is roughly 21KB, which, following Figure 11, translates to 16μs for every virtine invocation. It follows, then, that virtine creation in this example is memory bound, since copying the snapshot comprises the dominant cost.
and small memory footprint. Our baseline implementation (no virtines)
is configured to allocate a Duktape context, populate several native
function bindings, execute a function that base64 encodes a buffer of
data, and returns the encoding to the caller after tearing down
(freeing) the JS engine. The virtine does the same thing, but uses the
Wasp runtime library directly (no language extensions). This allows the
engine to use only three hypercalls: snapshot(), get_data(), and
return_data(). The snapshot hypercall instructs the runtime to take a
snapshot after booting into long mode and allocating the Duktape
context. get_data() asks the hypervisor to fill a buffer of memory
with the data to be encoded, and once the virtine encodes the data, it
calls return_data() and the virtine exits. By co-designing the
hypervisor and the virtine, and by providing only a limited set of
hypercalls, we limit the attack surface available to an adversary. For
example, get_data() cannot be called more than once,
meaning that if an attacker were to gain remote code execution
capabilities, the only permitted hypercall would terminate the virtine.
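A one-shot hypercall policy of this kind can be sketched as a small state machine in the client. The type and function names are hypothetical, and the ordering rules for the snapshot hypercall (at most once, and only before data is delivered) are an assumption added for the sketch; the text only states that get_data cannot be called more than once.

```c
#include <assert.h>
#include <stdbool.h>

/* The three hypercalls exposed to the JavaScript virtine. */
enum duk_hypercall { DUK_HC_SNAPSHOT, DUK_HC_GET_DATA, DUK_HC_RETURN_DATA };

enum hc_verdict { HC_DENY, HC_ALLOW, HC_ALLOW_AND_EXIT };

/* Per-virtine policy state, reset for each invocation. */
struct hc_policy {
    bool snapshot_taken;
    bool data_delivered;
};

/* Decide whether a hypercall is permitted and update the policy state. */
static enum hc_verdict hc_check(struct hc_policy *p, enum duk_hypercall nr)
{
    switch (nr) {
    case DUK_HC_SNAPSHOT:
        /* Assumption: at most one snapshot, before any data arrives. */
        if (p->snapshot_taken || p->data_delivered)
            return HC_DENY;
        p->snapshot_taken = true;
        return HC_ALLOW;
    case DUK_HC_GET_DATA:
        if (p->data_delivered)
            return HC_DENY; /* one-shot: a second call is rejected */
        p->data_delivered = true;
        return HC_ALLOW;
    case DUK_HC_RETURN_DATA:
        /* Returning data always ends the virtine. */
        return p->data_delivered ? HC_ALLOW_AND_EXIT : HC_DENY;
    }
    return HC_DENY; /* unknown hypercall numbers are rejected */
}
```

Once `get_data` has been served, the only call this sketch still accepts is `return_data`, which terminates the virtine.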
Figure 12 shows the results of our Duktape
implementation. The virtine trial without snapshotting takes 125μs
longer to execute than the baseline. We attribute this to several
sources, including the required virtine provisioning and initialization
overhead and the overhead to allocate and later free the Duktape
context. By giving programmers direct control over more aspects of the
execution environment, several optimizations can be made. For example,
snapshotting can be used as shown in
Figure 7 by including the
snapshot hypercall. Doing so avoids many calls to
malloc and other expensive functions
while initializing. By taking advantage of snapshotting in the case of
the “Virtine + Snapshot” measurements, virtines can enjoy a significant
reduction in overhead, roughly 2×. Further, since all virtines are
cleared and reset after execution, paying the cost of tearing down the
JS engine is unnecessary. With these
optimizations, the virtine can almost entirely avoid the cost of
allocating and freeing the Duktape context by retaining it, something
that cannot be done when executing in the client environment. Both of
the trials, “Virtine NT” and “Virtine+Snapshot+NT”, are designed to take
advantage of this “No Teardown” (NT) optimization in
full. Note that the virtine is not executing code any faster than
native, but it is able to provide a significant reduction in overhead by
simply executing less code by applying optimizations. These
optimizations cause the overall latency to drop to 137μs, which
code. Similar optimizations are applied in SEUSS, which uses the more
In this section, we discuss how our results might translate to realistic scenarios and more complex applications, the limitations of our current approach, and other use cases we envision for virtines.
In Section 6.4, we demonstrated that it requires
little effort to incorporate virtines into existing codebases that use
sensitive or untrusted library functions. In our example we assumed
access to the library’s code (libopenssl in our case). While others
make the same assumption, this is not an inherent limitation. The
virtine runtime could apply a combination of link-time wrapping and
binary rewriting to migrate library code automatically to run in virtine
context. Others have applied such techniques for software fault
isolation (SFI), even in virtualized settings.
A similar model could be used to more strongly isolate UDFs from one another in database systems. Postgres, for example, uses V8 mechanisms to isolate individual UDFs from one another, but they still execute in the same address space. Because virtine address spaces are disjoint, they could help with this limitation. Furthermore, virtines would allow functions in unsafe languages (e.g., C, C++) to be safely used for UDFs.
While our current workloads represent components that could be used in real settings, we do not currently integrate with commodity serverless platforms or database engines. This integration is currently underway, with OpenWhisk and PostgreSQL, respectively.
Currently, our C extension lacks the ability to take advantage of functionality located in a different LLVM module (C source file) than the one that contains the C function. Build systems used in C applications produce intermediate object files that are linked into the final executable. This means that virtines created using the C extension are restricted to functionality in the same compilation unit. Solutions to this problem typically involve modifying the build system to produce LLVM bitcode and using whole program analysis to determine which functions are available to the virtine, and which are not.
Automatically generated virtines face an ABI challenge for argument passing. Because they do not share an address space with the host, argument marshalling is necessary. We leveraged LLVM to copy a compile-time generated structure containing the argument values into the virtine’s address space at a known offset. Marshalling does incur an overhead that varies with the argument types and sizes, as is typical with “copy-restore” semantics in RPC systems. This affects start-up latencies when launching virtines, as described in §6.4.
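The copy-restore marshalling described above can be sketched for a function with the signature `int fib(int n)`. The structure layout, helper names, and the `fake_guest_run` stand-in are hypothetical; the offset of 0x0 follows the experiment in which the argument is loaded at address 0x0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical generated layout for `int fib(int n)`. */
struct fib_args {
    int32_t n;   /* in: argument value */
    int32_t ret; /* out: return value written by the virtine */
};

#define ARGS_OFFSET ((size_t)0x0) /* known offset in the guest address space */

/* Copy the argument structure into guest memory; -1 if it does not fit. */
static int marshal_args(unsigned char *guest, size_t guest_size,
                        const struct fib_args *a)
{
    if (guest_size < ARGS_OFFSET + sizeof *a)
        return -1;
    memcpy(guest + ARGS_OFFSET, a, sizeof *a);
    return 0;
}

/* Copy the structure back out after the virtine exits (the "restore" step). */
static int unmarshal_args(const unsigned char *guest, size_t guest_size,
                          struct fib_args *a)
{
    if (guest_size < ARGS_OFFSET + sizeof *a)
        return -1;
    memcpy(a, guest + ARGS_OFFSET, sizeof *a);
    return 0;
}

/* Stand-in for running the virtine: reads the argument in guest memory and
 * writes a result back. A real virtine would execute in a separate VM. */
static void fake_guest_run(unsigned char *guest)
{
    struct fib_args a;
    memcpy(&a, guest + ARGS_OFFSET, sizeof a);
    a.ret = a.n + 1; /* placeholder computation, not fib */
    memcpy(guest + ARGS_OFFSET, &a, sizeof a);
}
```

The cost of both copies grows with the size of the structure, which is why marshalling overhead varies with argument types and sizes.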
Virtines do not currently support nesting, but this is not an inherent limitation. Virtines that dynamically allocate memory are possible with an execution environment that provides heap allocation, but that memory is currently limited to the virtine context. We believe secure channels to communicate data between the virtine and host could be implemented with appropriate hypercalls and library/language support. The virtine compiler could identify and transform such allocation sites (e.g., malloc) using escape analysis.
Wasp’s snapshotting mechanism currently uses
memcpy to populate a
virtine’s memory image with the snapshot. This copying, as shown in
Figure 11, constitutes a considerable cost
for a large virtine image. We expect this cost to drop when using
copy-on-write mechanisms to reset a virtine, as in SEUSS.
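One way to obtain copy-on-write semantics on a POSIX host is a private file-backed mapping, sketched below. This illustrates the general technique only and is not how Wasp or SEUSS implement snapshot restore; the helper names are hypothetical, and the snapshot is held in an unnamed temporary file purely for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Store a snapshot in an unnamed temporary file; returns its fd, or -1. */
static int snapshot_to_fd(const void *snapshot, size_t size)
{
    FILE *f = tmpfile();
    if (!f)
        return -1;
    if (fwrite(snapshot, 1, size, f) != size || fflush(f) != 0) {
        fclose(f);
        return -1;
    }
    int fd = dup(fileno(f)); /* keep the file alive after fclose */
    fclose(f);
    return fd;
}

/* Map a private view of the snapshot. Pages are shared until written;
 * a write copies only the touched page into this mapping. */
static void *cow_restore(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Each call to `cow_restore` yields a fresh image without copying the whole snapshot up front, so the reset cost scales with the number of pages a virtine actually modifies rather than with the image size.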
As we found in Section 4.2, KVM has performance penalties due to its need to perform several ring transitions for each exit, and for VM start-up. Some of these costs are unavoidable because they maintain userspace control over the VM. However, a Type-I VMM like Palacios or Xen can mitigate some software latencies incurred by virtines.
Our threat model makes assumptions that may not hold in the real world. For example, a hardware bug in VT-x or a microarchitectural side channel vulnerability (e.g., Meltdown) could feasibly be used to break our security guarantees.
In our examples, we used virtines to isolate certain annotated functions from the rest of the program. This use case is not the only possible one. Below, we outline several other potential use cases for virtines.
We believe that HLLs present an incremental path to using virtines, i.e., the language runtime might abstract away the use of virtines entirely, for example, to wrap function calls via the foreign function interface (Chisnall et al. employed special-purpose hardware for this purpose). Virtines might also be used to apply security-in-depth to JIT compilers and dynamic binary translators. For example, bugs that lead to vulnerabilities in built-in functions or the JIT’s slow-path handlers can be mitigated by running them in virtine context (NoJITsu achieves this with Intel’s Memory Protection Keys). Polyglot environments like GraalVM could more safely use native code by employing virtines.
Because virtines implement an abstract machine model, are packaged with their runtime environment, and employ similar semantics to RPC, they allow for location transparency. Virtines could therefore be migrated to execute on remote machines just like containers, e.g., for code offload. This could allow for implementing distributed services with virtines, and for service migration based on high load scenarios, especially when RPCs are fast, as in the datacenter. If virtines require host services or hardware not present in the local machine, they can be migrated to a machine that provides them.
There is significant prior work on isolation of software components. However, the received wisdom is that when using hardware virtualization, creating a new isolated context for every isolation boundary crossing is too expensive. With virtines, we have shown that, with sufficient optimization, these overheads can be significantly reduced. Virtines enjoy several unique properties: they have an easy-to-use programming model, they implement an abstract machine model that allows for customization of the execution environment and the hypervisor, and because they create new contexts on every invocation, we can apply snapshotting to optimize start-up costs. We now summarize key differences with prior work.
The closest work to virtines is Enclosures, which allow for programmer-guided isolation by splitting libraries into their own code, data, and configuration sections within the same binary. The security policy of Enclosures is defined in terms of packages, but with virtines, the security policy is defined and enforced at the level of individual functions. While, like Enclosures, virtines can be used to isolate library functions from their calling environment, they can also be used to selectively isolate functions from other users’ virtines in a multi-tenant cloud environment.
Hodor also provides library isolation, particularly for high-performance data-plane libraries. Gotee uses language-level isolation like virtines, but builds on SGX enclaves rather than hardware virtualization.
While TrustVisor employs hardware virtualization to isolate application components (and assumes a strong adversary model), virtines enjoy a simpler programming model. SeCage uses static and dynamic analysis to automatically isolate software components guided by the secrets those components access. Virtines give programmers more control over isolated components. Glamdring also automatically partitions applications based on annotations, but uses SGX enclaves, which have more limited execution environments than virtines.
With Wedge, execution contexts (sthreads) are given minimal permissions to resources (including memory) using default deny semantics. However, virtines are more flexible in that they need not use the same host ABI and they do not require a modified host kernel. Dune is an example of an unconventional use of a virtual execution environment that provides high performance and direct access to hardware devices within a Linux system. Unlike virtines, Dune’s virtualization is at process granularity. Similarly, SMV isolates multi-threaded applications.
Several systems that support isolated execution leverage Intel’s Memory Protection Keys for memory safety. For virtines, we chose not to use this mechanism since the number of protection domains (16) offered by the hardware was insufficient for multi-tenant scenarios (e.g., serverless). Even without this limitation, instructions that access the PKRU register would need to be validated/removed, e.g., with binary rewriting, as is done in ERIM. We leave the exploration of MPK and similar fine-grained memory protection mechanisms for future work.
Lightweight-Contexts (LwCs) are isolated execution contexts within a process. They share the same ABI as other contexts, but essentially act as isolated co-routines. Unlike LwCs, virtines can run an arbitrary software stack, and gain the strong isolation benefits of hardware virtualization. The Endokernel architecture enables intra-process isolation with virtual privilege rings, but still maps domains to the process abstraction, rather than functions.
Nooks, LXD, and Nested Kernel all implement isolation for kernel modules.
Software Fault Isolation (SFI) enforces isolation by instrumenting applications with enforcement checks at boundary crossings, and thus does not leverage hardware support.
Wasp is similar in architecture to other minimal hypervisors (implementing μVMs). Unlike Amazon’s Firecracker or Google’s Cloud Hypervisor, we do not intend to boot a full Linux (or Windows) kernel, even with a simplified I/O device model. Wasp bears more similarity to ukvm (especially the networking interface) and uhyve. Unlike those systems, we designed Wasp to use a set of pre-packaged runtime environments. We intend Wasp to be used as a pluggable back-end (for applications, libraries, serverless platforms, or language runtimes) rather than as a stand-alone VMM.
Jitsu allows Unikernels to be spawned on demand in response to network events, but does not allow programmers to invoke virtualized environments at the function call granularity. There is a rich history of combining language and OS research. Typified by MirageOS, writing kernel components in a high-level language gives the kernel developer more flexibility in moving away from legacy interfaces. It can also shift the burden of protection and isolation.
When using the virtine runtime library directly, developers must currently marshal arguments and return values manually, though we are currently developing an IDL to ease this process (like SGX’s EDL).
That is, using Tukey’s method, measurements not on the interval $[x_{25\%} - 1.5\,\mathrm{IQR},\ x_{75\%} + 1.5\,\mathrm{IQR}]$ are removed from the data.
OpenSSL version 3.0.0 alpha7.
openssl speed -elapsed -evp aes-128-cbc