Network Virtualization, Encapsulation, and Stateless Transport Tunneling (STT)
Posted: March 4, 2012
[This post was written with Bruce Davie, and Andrew Lambeth.]
Recently, Jesse Gross, Bruce Davie and a number of contributors submitted the STT draft to the IETF (link to the draft here). STT is an encapsulation format for network virtualization. Unlike other protocols in this space (namely VXLAN and NVGRE), it was designed to be used with soft switching within the server (generally in the vswitch in the hypervisor) while taking advantage of hardware acceleration at the NIC. The goal is to preserve the flexibility and development speed of software while still providing hardware forwarding speeds.
The quick list of differentiators are i) it takes advantage of TSO available in NICs today allowing tunneling from software at 10G while consuming relatively little cpu ii) there are more bits allocated to the virtual network meta data carried per packet, and those bits are unstructured allowing for arbitrary interpretation by software and iii) the control plane is decoupled from the actual encapsulation.
There are a number of other software-centric features like better byte alignment of the headers, but these are not architecturally significant.
Of course, the publication of the draft drew reasonable skepticism on whether the industry needed yet another encapsulation format. And that is the question we will focus on in this post.
But first, let us try to provide a useful decomposition of the virtual networking problem (as it pertains to Distributed Edge Overlays DEO).
Distributed Edge Overlays (DEO)
Distributed edge overlays have gained a lot of traction as a mechanism for network virtualization. A reasonable characterization of the problem space can be found in the IETF nvo3 problem statement draft. Two recent DEO related drafts submitted to the IETF in addition to STT are NVGRE, and VXLAN.
The basic idea is to use tunneling (generally L2 in L3) to create an overlay network from the edge that exposes a virtual view of one or more network to end hosts. The edge may be, for example, a vswitch in the hypervisor, or the first hop physical switch.
DEO solutions can be roughly decomposed into three independent components.
- Encapsulation format:The encapsulation format is what the packet looks like on the wire. The format has implications both on hardware compatibility, and the amount of information that is carried across the tunnel with the packet.As an example of encapsulation, with NVGRE the encapsulation format is GRE where the GRE key is used to store some additional information (the tenant network ID).
- Control plane:The control plane disseminates the state needed to figure out which tunnels to create, which packets go in which tunnels, and what state is associated with packets as they traverse the tunnels. Changes to both the physical and virtual views of the network often require state to be updated and/or moved around the network.There are many ways to implement a control plane, either using traditional protocols (for example, NVGRE and the first VXLAN draft abdicate a lot of control responsibility to multicast), or something more SDN-esque like a centralized datastore, or even a proper SDN controller.
- Logical view:The logical view is what the “virtual network” looks like from the perspective of an end host. In the case of VXLAN and NVGRE, they offer a basic L2 learning domain. However, you can imagine this being extended to L3 (to support very large virtual networks, for example), security policies, and even higher-level services.The logical view defines the network services available to the virtual machine. For example, if only L2 is available, it may not be practical to run workloads of thousands of machines within a single virtual network due to scaling limitations of broadcast. If the virtual network provided L3, it could potentially host such workloads and still provide the benefits of virtualization such as support for VM mobility without requiring IP renumbering, higher-level service interposition (like adding firewalls), and mobile policies.
Before we jump into a justification for STT, we would like to make the point that each of these components really are logically distinct, and a good design should keep them decoupled. Why? For modularity. For example, if a particular encapsulation format has broad hardware support, it would be a shame to restrict it to a single control plane design. It would also be a shame to restrict it to a particular logical network view.
VXLAN and NVGRE or both guilty of this to some extent. For example, the NVGRE and the original VXLAN draft specify multicast as the mechanism to use for part of the control plane (with other parts left unspecified). The latest VXLAN addresses this somewhat, which is a great improvement.
Also, both VXLAN and NVGRE fix the logical forwarding model to L2 going as far as to specify how the logical forwarding tables get populated. Again, this is an unnecessary restriction.
For protocols that are hardware centric (which both VXLAN and NVGRE appear to me), this makes some modicum of sense, lookup space is expensive, and decoupling may require an extra level of indirection. However, for software this is simply bad design.
STT on the other hand limits its focus to the encapsulation format, and does not constrain the other components within the specification.
[Note: The point of this post is not to denigrate VXLAN or NVGRE, but rather to point out that they are comparatively less suited for running within the vswitch. If the full encap/decap and lookup logic is resides fully within hardware, VXLAN and NVGRE are both well designed and reasonable options.]
OK, on to a more detailed justification for STT
To structure the discussion, we’ll step through each logical component of the DEO architecture and describe the design decisions made by STT and how they compare to similar proposals.
Logical view: It turns out that the more information you can tack on to a packet as it transits the network, the richer a logical view you can create. Both NVGRE and VXLAN not only limit the additional information to 32 bits, but they also specify that those bits must contain the logical network ID. This leaves no additional space for specifying other aspects of the logical view that might be interesting to the control plane.
STT differs from NVGRE and VXLAN in two ways. First, it allocates more space to the per-packet metadata. Second, it doesn’t specify how that field is interpreted. This allows the virtual network control plane to use it for state versioning (useful for consistency across multiple switches), additional logical network meta-data, tenant identification, etc.
Of course, having a structured field of limited size makes a lot of sense for NVGRE and VXLAN where it is assumed that encap/decap and interpretation of those bits are likely to be in switching hardware.
However, STT is optimizing for soft switching with hardware accelerating in the NIC. Having larger, unstructured fields provides more flexibility for the software to work with. And, as I’ll describe below, it doesn’t obviate the ability to use hardware acceleration in the NIC to get vastly better performance than a pure software approach.
Control Plane: The STT draft says nothing about the control plane that is used for managing the tunnels and the lookup state feeding into them. This means that securing the control channel, state dissemination, packet replication, etc. are outside of, and thus not constrained by, the spec.
Encapsulation format: This is were STT really shines. STT was designed to take advantage of TSO and LRO engines in existing NICs today. With STT, it is possible to tunnel at 10G from the guest while consuming only a fraction of a CPU core. We’ve seen speedups up to 10x over pure software tunneling.
In other words, STT was designed to let you retain all the high performance features of the NIC when you start tunneling from the edge, while still retaining the flexibility of software to perform the network virtualization functions.
Here is how it works.
When a guest VM sends a packet to the wire, the transitions between the guest and the hypervisor (this is a software domain crossing which requires flushing the TLB, and likely the loss of cache locality, etc.) and the hypervisor and the NIC are relatively expensive. This is why hypervisor vendors take pains to always support TSO all the way up to the guest
Without tunneling, vswitches can take advantage of TSO by exposing a TSO enabled NIC to the guest and then passing large TCP frames to the hardware NIC which performs the segmentation. However, when tunneling is involved, this isn’t possible unless the NIC supports segmentation of the TCP frame within the tunnel in hardware (which hopefully will happen as tunneling protocols get adopted).
With STT, the guests are also exposed to a TSO enabled NIC, however instead of passing the packets directly to the NIC, the vswitch inserts an additional header that looks like a TCP packet, and performs all of the additional network virtualization procedures.
As a result, with STT, the guest ends up sending and receiving massive frames to the hypervisor (up to 64k) which are then encapsulated in software, and ultimately segmented in hardware by the NIC. The result is that the number of domain crossings are reduced by a significant factor in the case of high-throughput TCP flows.
One alternative to going through all this trouble to amortize the guest/hypervisor transistions is to try eliminating them altogether by exposing the NIC HW to the guest, with a technique commonly referred to as passthrough. However, with passthrough software is unable to make any forwarding decisions on the packet before it is sent to the NIC. Passthrough creates a number of problems by exposing the physical NIC to the guest which obviates many of the advantages of NIC virtualization (we describe these shortcomings at length here).
For modern NICs that support TSO and LRO, the only additional overhead that STT provides over sending a raw L2 frame is the memcpy() of the header for encap/decap, and the transmission cost of those additional bytes.
It’s worth pointing out that even if reassembly on the receive side is done in software (which is the case with some NICs), the interrupt coalescing between the hypervisor and the guest is still a significant performance win.
How does this compare to other tunneling proposals? The most significant difference is that NICs don’t support the tunneling protocols today, so they have to be implemented in software which results in a relatively significant performance hit. Eventually NICs will support multiple tunneling protocols, and hopefully they will also support the same stateless (on the send side) TCP segmentation offloading. However, this is unlikely to happen with LOM for awhile.
As a final point, much of STT was designed for efficient processing in software. It contains redundant fields in the header for more efficient lookup and padding to improve byte-alignment on 32-bit boundaries.
So, What’s Not to Like?
STT in it’s current form is a practical hack (more or less). Admittedly, it follows more of a “systems” than a networking aesthetic. More focus was put on practicality, performance, and software processing, than being parsimonious with lookup bits in the header.
As a result, there are definitely some blemishes. For example, because it uses a valid TCP header, but doesn’t have an associated TCP state machine, middleboxes that don’t do full TCP termination are likely to get confused (although it is a little difficult for us to see this as a real shortcoming given all of the other problems passive middleboxes have correctly reconstructing end state). Also, there is currently no simple way to distinguish it from standard TCP traffic (again, a problem for middleboxes). Of course, the opacity of tunnels to middleboxes is nothing new, but these are generally fair criticisms.
In the end, our guess is that abusing existing TSO and LRO engines will not ingratiate STT with traditional networking wonks any time soon …🙂
However, we believe that outside of the contortions needed to be compatible with existing TSO/LRO engines, STT has a suitable design for software based tunneling with hardware offload. Because the protocol does not over-specify the broader system in which the tunnel will sit, as the hardware ecosystem evolves, it should be possible to also evolve the protocol fields themselves (like getting rid of using an actual TCP header and setting the outer IP protocol to 6) without having to rewrite the control plane logic too.
Ultimately, we think there is room for a tunneling protocol that provides the benefits of STT, namely the ability to do processing in software with minimal hardware offload for send and receive segmentation. As long as there is compatible hardware, the particulars or the protocol header are less important. After all, it’s only (mostly) software.