Network Virtualization, Encapsulation, and Stateless Transport Tunneling (STT)

[This post was written with Bruce Davie and Andrew Lambeth.]

Recently, Jesse Gross, Bruce Davie and a number of contributors submitted the STT draft to the IETF (link to the draft here). STT is an encapsulation format for network virtualization. Unlike other protocols in this space (namely VXLAN and NVGRE), it was designed to be used with soft switching within the server (generally in the vswitch in the hypervisor) while taking advantage of hardware acceleration at the NIC. The goal is to preserve the flexibility and development speed of software while still providing hardware forwarding speeds.

The quick list of differentiators is: i) it takes advantage of the TSO available in NICs today, allowing tunneling from software at 10G while consuming relatively little CPU; ii) more bits are allocated to the virtual network metadata carried per packet, and those bits are unstructured, allowing for arbitrary interpretation by software; and iii) the control plane is decoupled from the actual encapsulation.

There are a number of other software-centric features like better byte alignment of the headers, but these are not architecturally significant.

Of course, the publication of the draft drew reasonable skepticism on whether the industry needed yet another encapsulation format. And that is the question we will focus on in this post.

But first, let us try to provide a useful decomposition of the virtual networking problem (as it pertains to Distributed Edge Overlays, or DEO).

Distributed Edge Overlays (DEO)

Distributed edge overlays have gained a lot of traction as a mechanism for network virtualization. A reasonable characterization of the problem space can be found in the IETF nvo3 problem statement draft. Two recent DEO-related drafts submitted to the IETF in addition to STT are NVGRE and VXLAN.

The basic idea is to use tunneling (generally L2 in L3) to create an overlay network from the edge that exposes a virtual view of one or more networks to end hosts. The edge may be, for example, a vswitch in the hypervisor, or the first hop physical switch.

DEO solutions can be roughly decomposed into three independent components.

  • Encapsulation format: The encapsulation format is what the packet looks like on the wire. The format has implications both on hardware compatibility, and the amount of information that is carried across the tunnel with the packet. As an example of encapsulation, with NVGRE the encapsulation format is GRE where the GRE key is used to store some additional information (the tenant network ID).
  • Control plane: The control plane disseminates the state needed to figure out which tunnels to create, which packets go in which tunnels, and what state is associated with packets as they traverse the tunnels. Changes to both the physical and virtual views of the network often require state to be updated and/or moved around the network. There are many ways to implement a control plane, either using traditional protocols (for example, NVGRE and the first VXLAN draft abdicate a lot of control responsibility to multicast), or something more SDN-esque like a centralized datastore, or even a proper SDN controller.
  • Logical view: The logical view is what the “virtual network” looks like from the perspective of an end host. In the case of VXLAN and NVGRE, they offer a basic L2 learning domain. However, you can imagine this being extended to L3 (to support very large virtual networks, for example), security policies, and even higher-level services. The logical view defines the network services available to the virtual machine. For example, if only L2 is available, it may not be practical to run workloads of thousands of machines within a single virtual network due to scaling limitations of broadcast. If the virtual network provided L3, it could potentially host such workloads and still provide the benefits of virtualization such as support for VM mobility without requiring IP renumbering, higher-level service interposition (like adding firewalls), and mobile policies.
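To make the decomposition concrete, here is a minimal Python sketch of an edge that composes an encapsulation format and a control plane through separate interfaces. Every name in it (Encapsulation, ControlPlane, ToyEncap, StaticControlPlane, EdgeSwitch) is hypothetical and invented for illustration, and the toy wire format is not STT, VXLAN, or NVGRE. The logical view is reduced here to the integer context handed to the edge.

```python
import struct
from dataclasses import dataclass, field
from typing import Dict, Protocol, Tuple


class Encapsulation(Protocol):
    """Wire format only: how a frame and its metadata look inside the tunnel."""

    def encap(self, context: int, inner: bytes) -> bytes: ...

    def decap(self, wire: bytes) -> Tuple[int, bytes]: ...


class ControlPlane(Protocol):
    """State dissemination only: which remote endpoint a packet should go to."""

    def lookup(self, context: int, dst_mac: bytes) -> str: ...


class ToyEncap:
    """Hypothetical format: an 8-byte big-endian context, then the inner frame."""

    def encap(self, context: int, inner: bytes) -> bytes:
        return struct.pack("!Q", context) + inner

    def decap(self, wire: bytes) -> Tuple[int, bytes]:
        (context,) = struct.unpack("!Q", wire[:8])
        return context, wire[8:]


@dataclass
class StaticControlPlane:
    """Hypothetical control plane: a table pushed from a central datastore."""

    table: Dict[Tuple[int, bytes], str] = field(default_factory=dict)

    def lookup(self, context: int, dst_mac: bytes) -> str:
        return self.table[(context, dst_mac)]


@dataclass
class EdgeSwitch:
    """The edge composes the two parts; either can be swapped independently."""

    encap: Encapsulation
    control: ControlPlane

    def send(self, context: int, dst_mac: bytes, inner: bytes) -> Tuple[str, bytes]:
        # Ask the control plane where to send, and the format how to wrap.
        remote = self.control.lookup(context, dst_mac)
        return remote, self.encap.encap(context, inner)
```

Swapping StaticControlPlane for a multicast-learning implementation, or ToyEncap for a different header layout, would not require touching the other class, which is the modularity argument in miniature.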

Before we jump into a justification for STT, we would like to make the point that each of these components really is logically distinct, and a good design should keep them decoupled. Why? For modularity. For example, if a particular encapsulation format has broad hardware support, it would be a shame to restrict it to a single control plane design. It would also be a shame to restrict it to a particular logical network view.

VXLAN and NVGRE are both guilty of this to some extent. For example, NVGRE and the original VXLAN draft specify multicast as the mechanism to use for part of the control plane (with other parts left unspecified). The latest VXLAN draft addresses this somewhat, which is a great improvement.

Also, both VXLAN and NVGRE fix the logical forwarding model to L2, going as far as to specify how the logical forwarding tables get populated. Again, this is an unnecessary restriction.

For protocols that are hardware centric (which both VXLAN and NVGRE appear to be), this makes some modicum of sense: lookup space is expensive, and decoupling may require an extra level of indirection. However, for software this is simply bad design.

STT on the other hand limits its focus to the encapsulation format, and does not constrain the other components within the specification.

[Note: The point of this post is not to denigrate VXLAN or NVGRE, but rather to point out that they are comparatively less suited for running within the vswitch. If the full encap/decap and lookup logic resides fully within hardware, VXLAN and NVGRE are both well designed and reasonable options.]

OK, on to a more detailed justification for STT

To structure the discussion, we’ll step through each logical component of the DEO architecture and describe the design decisions made by STT and how they compare to similar proposals.

Logical view: It turns out that the more information you can tack on to a packet as it transits the network, the richer a logical view you can create. Both NVGRE and VXLAN not only limit the additional information to 32 bits, but they also specify that those bits must contain the logical network ID. This leaves no additional space for specifying other aspects of the logical view that might be interesting to the control plane.

STT differs from NVGRE and VXLAN in two ways. First, it allocates more space to the per-packet metadata. Second, it doesn’t specify how that field is interpreted. This allows the virtual network control plane to use it for state versioning (useful for consistency across multiple switches), additional logical network meta-data, tenant identification, etc.
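As an illustration of what "unstructured" buys the software, the sketch below packs several values into one 64-bit integer, the size of the Context ID field in the STT draft. The particular split (16 bits of state version, 16 bits of tenant, 32 bits of logical network) and the function names are assumptions made up for this example; the draft deliberately does not define any such layout.

```python
# Hypothetical field widths; STT itself leaves the 64 bits uninterpreted.
VERSION_BITS = 16
TENANT_BITS = 16
NETWORK_BITS = 32


def pack_context(version: int, tenant: int, network: int) -> int:
    """Pack three control-plane values into one 64-bit context value."""
    for name, value, bits in (
        ("version", version, VERSION_BITS),
        ("tenant", tenant, TENANT_BITS),
        ("network", network, NETWORK_BITS),
    ):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (version << (TENANT_BITS + NETWORK_BITS)) | (tenant << NETWORK_BITS) | network


def unpack_context(context: int) -> tuple:
    """Recover (version, tenant, network) from a packed context value."""
    version = (context >> (TENANT_BITS + NETWORK_BITS)) & ((1 << VERSION_BITS) - 1)
    tenant = (context >> NETWORK_BITS) & ((1 << TENANT_BITS) - 1)
    network = context & ((1 << NETWORK_BITS) - 1)
    return version, tenant, network
```

A different control plane could carve up the same bits in an entirely different way without any change to the encapsulation, which is the point of leaving the field unstructured.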

Of course, having a structured field of limited size makes a lot of sense for NVGRE and VXLAN where it is assumed that encap/decap and interpretation of those bits are likely to be in switching hardware.

However, STT is optimizing for soft switching with hardware acceleration in the NIC. Having larger, unstructured fields provides more flexibility for the software to work with. And, as we'll describe below, it doesn’t preclude using hardware acceleration in the NIC to get vastly better performance than a pure software approach.

Control Plane: The STT draft says nothing about the control plane that is used for managing the tunnels and the lookup state feeding into them. This means that securing the control channel, state dissemination, packet replication, etc. are outside of, and thus not constrained by, the spec.

Encapsulation format: This is where STT really shines. STT was designed to take advantage of TSO and LRO engines in existing NICs today. With STT, it is possible to tunnel at 10G from the guest while consuming only a fraction of a CPU core. We’ve seen speedups up to 10x over pure software tunneling.

(If you’re not familiar with TSO or LRO, you may want to check out the wikipedia pages here and here.)

In other words, STT was designed to let you retain all the high performance features of the NIC when you start tunneling from the edge, while still retaining the flexibility of software to perform the network virtualization functions.

Here is how it works.

When a guest VM sends a packet to the wire, the transitions between the guest and the hypervisor (a software domain crossing that requires flushing the TLB and likely costs cache locality) and between the hypervisor and the NIC are relatively expensive. This is why hypervisor vendors take pains to always support TSO all the way up to the guest.

Without tunneling, vswitches can take advantage of TSO by exposing a TSO enabled NIC to the guest and then passing large TCP frames to the hardware NIC which performs the segmentation. However, when tunneling is involved, this isn’t possible unless the NIC supports segmentation of the TCP frame within the tunnel in hardware (which hopefully will happen as tunneling protocols get adopted).

With STT, the guests are also exposed to a TSO-enabled NIC. However, instead of passing the packets directly to the NIC, the vswitch inserts an additional header that looks like a TCP header, and performs all of the additional network virtualization procedures.

As a result, with STT, the guest ends up sending and receiving massive frames to the hypervisor (up to 64k) which are then encapsulated in software, and ultimately segmented in hardware by the NIC. The result is that the number of domain crossings is reduced by a significant factor in the case of high-throughput TCP flows.
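A back-of-the-envelope calculation shows the size of that factor. The sketch below assumes, purely for illustration, one guest/hypervisor crossing per frame handed down, a 1460-byte payload per wire packet, and a 64 KiB large frame; the function names are ours and real numbers depend on the MTU, headers, and batching in a given stack.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def wire_packets(frame_bytes: int, mss: int) -> int:
    """Wire packets the NIC produces when it segments one large frame."""
    return ceil_div(frame_bytes, mss)


def crossings(total_bytes: int, frame_bytes: int) -> int:
    """Guest-to-hypervisor hand-offs, assuming one crossing per frame."""
    return ceil_div(total_bytes, frame_bytes)


MSS = 1460                 # assumed payload per wire packet
LARGE_FRAME = 64 * 1024    # "up to 64k" from the text
TRANSFER = 2 ** 30         # a 1 GiB bulk transfer

without_tso = crossings(TRANSFER, MSS)          # guest hands down wire-sized frames
with_tso = crossings(TRANSFER, LARGE_FRAME)     # guest hands down 64 KiB frames
print(wire_packets(LARGE_FRAME, MSS), without_tso, with_tso)
```

Under these assumptions each 64 KiB frame becomes 45 wire packets, and the 1 GiB transfer needs 16,384 crossings instead of 735,440, roughly a 45x reduction.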

One alternative to going through all this trouble to amortize the guest/hypervisor transitions is to try eliminating them altogether by exposing the NIC HW to the guest, with a technique commonly referred to as passthrough. However, with passthrough, software is unable to make any forwarding decisions on the packet before it is sent to the NIC. Passthrough creates a number of problems by exposing the physical NIC to the guest, which obviates many of the advantages of NIC virtualization (we describe these shortcomings at length here).

For modern NICs that support TSO and LRO, the only additional overhead that STT provides over sending a raw L2 frame is the memcpy() of the header for encap/decap, and the transmission cost of those additional bytes.

It’s worth pointing out that even if reassembly on the receive side is done in software (which is the case with some NICs), the interrupt coalescing between the hypervisor and the guest is still a significant performance win.
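For readers who want a feel for what software reassembly involves, here is a simplified model. It is not the STT wire format: each segment here is just a Python tuple carrying a frame identifier, the total frame length, and a byte offset, and all names are hypothetical. The property it illustrates is that the sender keeps no per-connection state and every segment carries enough information for the receiver to rebuild the large frame.

```python
from typing import Dict, List, Optional, Tuple

# (frame_id, total_len, offset, payload); an illustrative stand-in, not a real header.
Segment = Tuple[int, int, int, bytes]


def segment(frame_id: int, frame: bytes, mss: int) -> List[Segment]:
    """Split one large frame into self-describing segments of at most mss bytes."""
    return [
        (frame_id, len(frame), off, frame[off:off + mss])
        for off in range(0, len(frame), mss)
    ]


class Reassembler:
    """Receive-side state: partial frames keyed by frame identifier."""

    def __init__(self) -> None:
        self._partial: Dict[int, Dict[int, bytes]] = {}

    def receive(self, seg: Segment) -> Optional[bytes]:
        """Store a segment; return the whole frame once every byte has arrived."""
        frame_id, total_len, offset, payload = seg
        parts = self._partial.setdefault(frame_id, {})
        parts[offset] = payload  # a duplicate segment simply overwrites itself
        if sum(len(p) for p in parts.values()) < total_len:
            return None
        del self._partial[frame_id]
        return b"".join(parts[off] for off in sorted(parts))
```

A production implementation would also need timeouts to discard frames that never complete, and could deliver contiguous prefixes early, neither of which is modeled here.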

How does this compare to other tunneling proposals? The most significant difference is that NICs don’t support the tunneling protocols today, so they have to be implemented in software, which results in a relatively significant performance hit. Eventually NICs will support multiple tunneling protocols, and hopefully they will also support the same stateless (on the send side) TCP segmentation offloading. However, this is unlikely to happen with LOM for a while.

As a final point, much of STT was designed for efficient processing in software. It contains redundant fields in the header for more efficient lookup and padding to improve byte-alignment on 32-bit boundaries.

So, What’s Not to Like?

STT in its current form is a practical hack (more or less). Admittedly, it follows more of a “systems” than a networking aesthetic. More focus was put on practicality, performance, and software processing than on being parsimonious with lookup bits in the header.

As a result, there are definitely some blemishes.  For example, because it uses a valid TCP header, but doesn’t have an associated TCP state machine, middleboxes that don’t do full TCP termination are likely to get confused (although it is a little difficult for us to see this as a real shortcoming given all of the other problems passive middleboxes have correctly reconstructing end state). Also, there is currently no simple way to distinguish it from standard TCP traffic (again, a problem for middleboxes). Of course, the opacity of tunnels to middleboxes is nothing new, but these are generally fair criticisms.

In the end, our guess is that abusing existing TSO and LRO engines will not ingratiate STT with traditional networking wonks any time soon.

However, we believe that outside of the contortions needed to be compatible with existing TSO/LRO engines, STT has a suitable design for software based tunneling with hardware offload. Because the protocol does not over-specify the broader system in which the tunnel will sit, as the hardware ecosystem evolves, it should be possible to also evolve the protocol fields themselves (like getting rid of using an actual TCP header and setting the outer IP protocol to 6) without having to rewrite the control plane logic too.

Ultimately, we think there is room for a tunneling protocol that provides the benefits of STT, namely the ability to do processing in software with minimal hardware offload for send and receive segmentation. As long as there is compatible hardware, the particulars of the protocol header are less important. After all, it’s only (mostly) software.

12 Comments on “Network Virtualization, Encapsulation, and Stateless Transport Tunneling (STT)”

  1. DK says:

    As I concluded in my (admittedly simplified) analysis post on STT, it will be very interesting to see what kind of position the IETF will take toward this draft due to its creative use of a well-known protocol’s headers.

    I guess this will also largely determine the willingness of the middlebox vendors to implement the necessary “enhancements”.

    — Dmitri

    • Hey Dmitri,

      My guess is that middlebox vendors are more likely to be swayed by customers than the IETF. There already are a handful that are working on STT support.

      Regarding the IETF itself, I agree that it is unlikely it would endorse STT in its current form. And no one behind STT has the time or resources to throw enough warm bodies at the IETF to make a difference. However, hopefully the basic principles of STT (LSO/LRO, decoupled control plane, large unstructured metadata) will creep into whatever proposal does get accepted.

  2. JS says:

    It has a fundamental problem. It reinvents AAL5 and, together with this, all its problems. If a fragment is lost, the receiver will drop the whole segment. This can kill the network. Next you will need Partial Packet Discard or Early Packet Discard. See Floyd’s paper. If you run with TCP semantics you end up with TCP over TCP, which is not a good combination and has very unpredictable results. With 64K packets and 1500 byte packets, and a packet loss probability of 1.5% in the network, you will never see a single segment go through.

    • Thanks for the comment JS.

      STT has the same properties as TSO, which is very heavily used in practice today in the datacenter (and has been for years). Like TSO, an implementation of STT can batch the first N contiguous bytes and send them up to the guest if there are packet drops. Even more sophisticated implementations are possible (and are used for TSO in practice). Also like TSO, STT is meant to be used in the datacenter.

      However, you are certainly right that running a naive implementation over a congested network (esp. one with high RTT’s) would have issues (the same applies for TSO).

      For what it’s worth, STT has been used successfully in many live deployments and consistently performs far better than software encapsulation with software TSO, which is required since the guests are presented with TSO capabilities in the vNICs.

      • JS says:

        I am not sure it is the same semantics. In this case, the VM hands over an IP packet to the hypervisor, and that packet is segmented by the tunneling method. The receiving VM will expect a full IP packet. This packet, though, was encapsulated in multiple IP packets, and any single packet loss will prevent the receiving hypervisor from delivering the packet to the VM, unless there is a reliable mechanism that allows hypervisors to do retransmissions without the VM knowing about it. But then a whole slew of problems is introduced, since the TCP stack in the VM will not detect the congestion loss and will not back off, thus throwing in more traffic and causing more congestion and a network collapse. (See the wireless guys, where they introduced ARQ to recover from channel losses. Hari Balakrishnan has some good papers on the topic.)

        In TSO, there is only one TCP stack, and it can receive partial packets. It is a TCP segment that is handed over, not an IP packet.

        Data center or not, it doesn’t matter. Especially in current DC switches with very small buffers, the packet loss probability is high due to plain burstiness even if the network is not congested. (See MapReduce incast problems.)

        The fact that it works in some deployments doesn’t mean much. The Internet worked for 10 years before the first congestion collapse, where we started understanding about these interactions. We can’t ignore this knowledge now.

        • I think I see where the confusion is coming from.

          While LSO/LRO can be applied to any IP packet in principle, the STT implementation I’m familiar with does not change the MTU size of the interface, but does expose a TSO enabled vNIC to the guest. So really, the performance benefits only apply to high throughput TCP flows. And yes, the semantics are very similar in that received packets can be batched in consecutive sequences and passed to the guest as legitimate TCP frames (just like TSO today).

          However, with STT the outer frame is what is segmented, whereas with other tunneling protocols presumably it would be the inner TCP frame. There are clear trade-offs between the two approaches. With STT, if the first packet drops, then we’re hosed. On the other hand, segmenting the inner header (with L2) would likely require duplicating the TCP header in each packet, which would be less efficient byte-for-byte.

  3. Anyonymous says:

    Actually, several NICs are capable of doing TSO and checksum offloads even with encapsulation by supplying the proper offset to the NIC driver, and this has already been proven to work. Likewise, there are a few other ways to scale packet processing, but you need to study the drivers closely to see how to do it.

  4. Mark says:

    STT’s ability to take advantage of TSO and LRO in modern NICs seems very useful at first glance. However, I wonder what situations exist where some of those advantages are removed. For example: LRO is incompatible with Linux bridging/forwarding, so presumably STT wouldn’t be able to gain any advantage from it in the many situations where those technologies are used, correct? In some cases LRO must actually be removed from the NIC driver at compile time if bridging or forwarding are to be used. How common a use case is this? Are there similar situations where TSO might not be usable?

    For reference, here’s a warning from the README file in Intel’s latest 10GE NIC drivers for Linux:

    “WARNING: The ixgbe driver compiles by default with the LRO (Large Receive Offload) feature enabled. This option offers the lowest CPU utilization for receives, but is completely incompatible with *routing/ip forwarding* and *bridging*. If enabling ip forwarding or bridging is a requirement, it is necessary to disable LRO using compile time options as noted in the LRO section later in this document. The result of not disabling LRO when combined with ip forwarding or bridging can be low throughput or even a kernel panic.”

    To be clear, this issue is not specific to Intel NIC’s…refer also to:


    • Hi Mark, thanks for the comment.

      The problem with LRO is that it loses information about the packets before they were merged, which is fine if you’re the end consumer of the packet because you’re just going to merge them in a buffer anyways. However, if you are going to retransmit them again as with bridging or routing (using TSO to break them apart) then you don’t know the appropriate MSS size. Generally this means that you have to disable LRO when doing these operations.

      Perhaps what you’re missing though, is that STT packets themselves are not being forwarded but rather their contents. Since the NIC sees the STT headers and we’re terminating the tunnel locally it’s OK to use LRO.

      It’s also worth noting that even if LRO is not being used, STT is still a big performance win end to end.

  5. Vaughn Suazo says:

    Nice write-up; the comparison and drill-down on the motivation are very helpful. I see value for cloud services and “hosting” network services on x86 hardware. Given the companies who wrote and contributed to the draft, it is obvious the control plane is meant to be separate, i.e. SDN.

  6. […] GRE outer header it is not possible to take advantage of offload features on most existing NICs (we have discussed this problem in more detail before). However, this is a shortcoming of the NIC hardware in the near term. Next generation NICs will […]
