Tunneling for Network Virtualization

[This post was written by Bruce Davie and Martin Casado.]

With the growth of interest in network virtualization, there has been a tendency to focus on the encapuslations that are required to tunnel packets across the physical infrastructure, sometimes neglecting the fact that tunneling is just one (small) part of an overall architecture for network virtualization. Since this post is going to do just that – talk about tunnel encapsulations – we want to reiterate the point that a complete network virtualization solution is about much more than a tunnel encapsulation. It entails (at least) a control plane, a management plane, and a set of new abstractions for networking, all of which aim to change the operational model of networks from the traditional, physical model. We’ve written about these aspects of network virtualization before (e.g., here).

In this post, however, we do want to talk about tunneling encapsulations, for reasons that will probably be readily apparent. There is more than one viable encapsulation in the marketplace now, and that will be the case for some time to come. Does it make any difference which one is used? In our opinion, it does, but it’s not a simple beauty contest in which one encaps will be declared the winner. We will explore some of the tradeoffs in this post.

There are three main encapsulation formats that have been proposed for network virtualization: VXLAN, NVGRE, and STT. We’ll focus on VXLAN and STT here. Not only are they the two that VMware supports (now that Nicira is part of VMware) but they also represent two quite distinct points in the design space, each of which has its merits.

One of the salient advantages of VXLAN is that it’s gained traction with a solid number of vendors in a relatively short period. There were demonstrations of several vendors’ implementations at the recent VMworld event. It fills an important market need, by providing a straightforward way to encapsulate Layer 2 payloads such that the logical semantics of a LAN can be provided among virtual machines without concern for the limitations of physical layer 2 networks. For example, a VXLAN can provide logical L2 semantics among machines spread across a large data center network, without requiring the physical network to provide arbitrarily large L2 segments.

At the risk of stating the obvious, the fact that VXLAN has been implemented by multiple vendors makes it an ideal choice for multi-vendor deployments. But we should be clear what ”multi-vendor” means in this case. Network virtualization entails tunneling packets through the data center routers and switches, and those devices only forward based on the outer header of the tunnel – a plain old IP (or MAC header). So the entities that need to terminate tunnels for network virtualization are the ones that we are concerned about here.

In many virtualized data center deployments, most of the traffic flows from VM to VM (“east-west” traffic) in which case the tunnels are terminated in vswitches. It is very rare for those vswitches to be from different vendors, so in this case, one might not be so concerned about multi-vendor support for the tunnel encaps. Other issues, such as efficiency and ability to evolve quickly might be more important, as we’ll discuss below.

Of course, there are plenty of cases where traffic doesn’t just flow east-west. It might need to go out of the data center to the Internet (or some other WAN), i.e. “north-south”. It might also need to be sent to some sort of appliance such as a load balancer, firewall, intrusion detection system, etc. And there are also plenty of cases where a tunnel does need to be terminated on a switch or router, such as to connect non-virtualized workloads to the virtualized network. In all of these cases, we’re clearly likely to run into multi-vendor situations for tunnel termination. Hence the need for a common, stable, and straightfoward approach to tunneling among all those devices.

Now, getting back to server-server traffic, why wouldn’t you just use VXLAN? One clear reason is efficiency, as we’ve discussed here. Since tunneling between hypervisors is required for network virtualization, it’s essential that tunneling not impose too high an overhead in terms of CPU load and network throughput. STT was designed with those goals in mind and performs very well on those dimensions using today’s commodity NICs. Given the general lack of multi-vendor issues when tunneling between hypervisors, STT’s significant performance advantage makes it a better fit in this scenario.

The performance advantage of STT may be viewed as somewhat temporary - it’s a result of STT’s ability to leverage TCP segmentation offload (TSO) in today’s NICs. Given the rise in importance of tunneling, and the momentum behind VXLAN, it reasonable to expect that a new generation of NICs will emerge that allow other tunnel encapsulations to be used without disabling TSO. When that happens, performance differences between STT and VXLAN should (mostly) disappear, given appropriate software to leverage the new NICs.

Another factor that comes into play when tunneling traffic from server to server is that we may want to change the semantics of the encapsualution from time to time as new features and capabilities make their way into the network virtualization platform. Indeed, one of overall advantages of network virtualization is the ease with which the capabilities of the network can be upgraded over time, since they are all implemented in software that is completely independent of the underlying hardware. To make the most of this potential for new feature deployment, it’s helpful to have a tunnel encaps with fields that can be modified as required by new capabilities. An encaps that typically operates between the vswitches of a single vendor (like STT) can meet this goal, while an encaps designed to facilitate multi-vendor scenarios (like VXLAN) needs to have the meaning of every header field pretty well nailed down.

So, where does that leave us? In essence, with two good solutions for tunneling, each of which meets a subset of the total needs of the market, but which can be used side-by-side with no ill effect. Consequently, we believe that VXLAN will continue to be a good solution for the multi-vendor environments that often occur in data center deployments, while STT will, for at least a couple of years, be the best approach for hypervisor-to-hypervisor tunnels. A complete network virtualization solution will need to use both encapsulations. There’s nothing wrong with that – building tunnels of the correct encapsulation type can be handled by the controller, without the need for user involvement. And, of course, we need to remember that a complete solution is about much more than just the bits on the wire.


The Overhead of Software Tunneling

[This post was written with Jesse Gross, Ben Basler, Bruce Davie, and Andrew Lambeth]

Tunneling has earned a bad name over the years in networking circles.

Much of the problem is historical. When a new tunneling mode is introduced in a hardware device, it is often implemented in the slow path. And once it is pushed down to the fastpath, implementations are often encumbered by key or table limits, or sometimes throughput is halved due to additional lookups.

However, none of these problems are intrinsic to tunneling. At its most basic, a tunnel is a handful of additional bits that need to be slapped onto outgoing packets. Rarely, outside of encryption, is there significant per-packet computation required by a tunnel. The transmission delay of the tunnel header is insignificant, and the impact on throughput is – or should be – similarly minor.

In fact, our experience implementing multiple tunneling protocols within Open vSwitch is that it is possible to do tunneling in software with performance and overhead comparable to non encapsulated traffic, and to support hundreds of thousands of tunnel end points.

Given the growing importance of tunneling in virtual networking (as evidenced by the emergence of protocols such as STT, NVGRE, and VXLAN, it’s worth exploring its performance implications.

And that is the goal of this post: to start the discussion on the performance of tunneling in software from the network edge.

Background

An emerging method of network virtualization is to use tunneling from the edges to decoupled the virtual network address space from the physical address space. Often the tunneling is done in software in the hypervisor. Tunneling from within the server has a number of advantages: software tunneling can easily support hundreds of thousands of tunnels, it is not sensitive to key sizes, it can support complex lookup functions and header manipulations, it simplifies the server/switch interface and reduces demands on the in-network switching ASICs, and it naturally offers software flexibility and a rapid development cycle.

An idealized forwarding path is shown in the figure below. We assume that the tunnels are terminated within the hypervisor. The hypervisor is responsible for mapping packets from VIFs to tunnels, and from tunnels to VIFs. The hypervisor is also responsible for the forwarding decision on the outer header (mapping the encapsulated packet to the next physical hop).

Some Performance Numbers for Software Tunneling

The following tests show throughput and cpu overhead for tunneling within Open vSwitch.   Traffic was generated with netperf  attempting to emulate a high-bandwidth TCP flow. The MTU for the VM and the physical NICs are 1500bytes and the packet payload size is 32k. The test shows results using no tunneling (OVS bridge), GRE, and STT.

The results show aggregate bidirectional throughput, meaning that 20Gbps is a 10G NIC sending and receiving at line rate. All tests where done using Ubuntu 12.04 and KVM on an Intel Xeon 2.40GHz servers interconnected with a Dell 10G switch. We use standard 10G Broadcom NICs. CPU numbers reflect the percentage of a single core used for each of the processes tracked.

The following results show the performance of a single flow between two VMs on different hypervisors. We include the Linux bridge to show that performance is comparable. Note that the CPU only includes the CPU dedicated to switching in the hypervisor and not the overhead in the guest needed to push/consume traffic.

Throughput Recv side cpu Send side cpu
Linux Bridge: 9.3 Gbps 85% 75%
OVS Bridge: 9.4 Gbps 82% 70%
OVS-STT: 9.5 Gbps 70% 70%
OVS-GRE: 2.3 Gbps 75% 97%

This next table shows the aggregate throughput of two hypervisors with 4 VMs each. Since each side is doing both send and receive, we don’t differentiate between the two.

Throughput CPU
OVS Bridge: 18.4 Gbps 150%
OVS-STT: 18.5 Gbps 120%
OVS-GRE: 2.3 Gbps 150%

Interpreting the Results

Clearly these results (aside from GRE, discussed below) indicate that the overhead of software for tunneling is negligible. It’s easy enough to see why that is so. Tunneling requires copying the tunnel bits onto the header, an extra lookup (at least on receive), and the transmission delay of those extra bits when placing the packet on the wire. When compared to all of the other work that needs to be done during the domain crossing between the guest and the hypervisor, the overhead really is negligible.

In fact, with the right tunneling protocol, the performance is roughly equivalent to non-tunneling, and CPU overhead can even be lower.

STT’s lower CPU usage than non-tunneled traffic is not a statistical anomaly but is actually a property of the protocol. The primary reason is that STT allows for better coalescing on the received side in the common case (since we know how many packets are outstanding). However, the point of this post is not to argue that STT is better than other tunneling protocols, just that if implemented correctly, tunneling can have comparable performance to non-tunneled traffic. We’ll address performance specific aspects of STT relative to other protocols in a future post.

The reason that GRE numbers are so low is that with the GRE outer header it is not possible to take advantage of offload features on most existing NICs (we have discussed this problem in more detail before). However, this is a shortcoming of the NIC hardware in the near term. Next generation NICs will support better tunnel offloads, and in a couple of years, we’ll start to see them show up in LOM.

In the meantime, STT should work on any standard NIC with TSO today.

The Point

The point of this post is that at the edge, in software, tunneling overhead is comparable to raw forwarding, and under some conditions it is even beneficial. For virtualized workloads, the overhead of software forwarding is in the noise when compared to all of the other machinations performed by the hypervisor.

Technologies like passthrough are unlikely to have a significant impact on throughput, but they will save CPU cycles on the server. However, that savings comes at a fairly steep cost as we have explained before, and doesn’t play out in most deployment environments.


Network Virtualization

[This post was written with Bruce Davie]

Network virtualization has been around in some form or other for many years, but it seems of late to be getting more attention than ever. This is especially true in SDN circles, where we frequently hear of network virtualization as one of the dominant use cases of SDN. Unfortunately, as with much of SDN, the discussion has been muddled, and network virtualization is being both conflated with SDN and described as a direct result of it. However, SDN is definitely not network virtualization. And network virtualization does not require SDN.

No doubt, part of the problem is that there is no broad consensus on what network virtualization is. So this post is an attempt to construct a reasonable working definition of network virtualization. In particular, we want to distinguish network virtualization from some related technologies with which it is sometimes confused, and explain how it relates to SDN.

A good place to start is to take a step back and look at how virtualization has been defined in computing. Historically, virtualization of computational resources such as CPU and memory has allowed programmers (and applications) to be freed from the limitations of physical resources. Virtual memory, for example, allows an application to operate under the illusion that it has dedicated access to a vast amount of contiguous memory, even when the physical reality is that the memory is limited, partitioned over multiple banks, and shared with other applications. From the application’s perspective, the abstraction of virtual memory is almost indistinguishable from that provided by physical memory, supporting the same address structure and memory operations.

As another example, server virtualization presents the abstraction of a virtual machine, preserving all the details of a physical machine: CPU cycles, instruction set, I/O, etc.

A key point here is that virtualization of computing hardware preserves the abstractions that were presented by the resources being virtualized. Why is this important? Because changing abstractions generally means changing the programs that want to use the virtualized resources. Server virtualization was immediately useful because existing operating systems could be run on top of the hypervisor without modification. Memory virtualization was immediately useful because the programming model did not have to change.

Virtualization and the Power of New Abstractions

Virtualization should not change the basic abstractions exposed to workloads, however it nevertheless does introduce new abstractions. These new abstractions represent the logical enclosure of the entity being virtualized (for example a process, a logical volume, or a virtual machine). It is in these new abstractions that the real power of virtualization can be found.

So while the most immediate benefit of virtualization is the ability to multiplex hardware between multiple workloads (generally for the efficiency, fault containment or security), the longer term impact comes from the ability of the new abstractions to change the operational paradigm.

Server virtualization provides the most accessible example of this. The early value proposition of hypervisor products was simply server consolidation. However, the big disruption that followed server virtualization was not consolidation but the fundamental change to the operational model created by the introduction of the VM as a basic unit of operations.

This is a crucial point. When virtualizing some set of hardware resources, a new abstraction is introduced, and it will become a basic unit of operation. If that unit is too fine grained (e.g. just exposing logical CPUs) the impact on the operational model will be limited. Get it right, however, and the impact can be substantial.

As it turns out, the virtual machine was the right level of abstraction to dramatically impact data center operations. VMs embody a fairly complete target for the things operational staff want to do with servers: provisioning new workloads, moving workloads, snapshotting workloads, rolling workloads back in time, etc.

Quick Recap:

  • virtualization exposes a logical view of some resource decoupled from the physical substrate without changing the basic abstractions.
  • virtualization also introduces new abstractions – the logical container of virtualized resources.
  • it is the manipulation of these new abstractions that has the potential to change the operational paradigm.
  • the suitability of the new abstraction for simplifying operations is important.

Given this as background, let’s turn to network virtualization.

Network Virtualization, Then and Now

As noted above, network virtualization is an extremely broad and overloaded term that has been in use for decades. Overlays, MPLS, VPNs, VLANs, LISP, Virtual routers, VRFs can all be thought of as network virtualization of some form. An earlier blog post by Bruce Davie (here) touched on the relationship between these concepts and network virtualization as we’re defining it here. The key point of that post is that when employing one of the aforementioned network virtualization primitives, we’re virtualizing some aspect of the network (a LAN segment, an L3 path, an L3 forwarding table, etc.) but rarely a network in its entirety with all its properties.

For example, if you use VLANs to virtualize an L2 segment, you don’t get virtualized counters that stay in sync when a VM moves, or a virtual ACL that keeps working wherever the VM is located. For those sorts of capabilities, you need some other mechanisms.

To put it in the context of the previous discussion, traditional network virtualization mechanisms don’t provide the most suitable operational abstractions. For example, provisioning new workloads or moving workloads still requires operational overhead to update the network state, and this is generally a manual process.

Modern approaches to network virtualization try and address this disconnect. Rather than providing a bunch of virtualized components, network virtualization today tries to provide a suitable basic unit of operations. Unsurprisingly, the abstraction is of a “virtual network”.

To be complete, a virtual network should both support the basic abstractions provided by physical networks today (L2, L3, tagging, counters, ACLs, etc.) as well as introduce a logical abstraction that encompasses all of these to be used as the basis for operation.

And just like the compute analog, this logical abstraction should support all of the operational niceties we’ve come to expect from virtualization: dynamic creation, deletion, migration, configuration, snapshotting, and roll-back.

Cleaning up the Definition of Network Virtualization

Given the previous discussion, we would characterize network virtualization as follows:

  • Introduces the concept of a virtual network that is decoupled from the physical network.
  • The virtual networks don’t change any of the basic abstractions found in physical networks.
  • The virtual networks are exposed as a new logical abstraction that can form a basic unit of operation (creation, deletion, migration, dynamic service insertion, snapshotting, inspection, and so on).

Network Virtualization is not SDN

SDN is a mechanism, and network virtualization is a solution. It is quite possible to have network virtualization solution that doesn’t use SDN, and to use SDN to build a network that has no virtualized properties.

SDN provides network virtualization in about the same way Python does - it’s a tool (and not a mandatory one). That said, SDN does have something to offer as a mechanism for network virtualization.

A simple way to think about the problem of network virtualization is that the solution must map multiple logical abstractions onto the physical network, and keep those abstractions consistent as both the logical and physical worlds change. Since these logical abstractions may reside anywhere in the network, this becomes a fairly complicated state management problem that must be enforced network-wide.

However, managing large amounts of states with reasonable consistency guarantees is something that SDN is particularly good at. It is no coincidence that most of the network virtualization solutions out there (from a variety of vendors using a variety of approaches) have a logically centralized component of some form for state management.

Wrapping Up

The point of this post was simply to provide some scaffolding around the discussion of network virtualization. To summarize quickly, modern concepts of network virtualization both preserve traditional abstractions and provide a basic unit of operations which is a (complete) virtual network. And that new abstraction should support the same operational abstractions as its computational analog.

While SDN provides a useful approach to building a network virtualization solution, it isn’t the only way. And lets not confuse tools with solutions.

Over the next few years, we expect to see a variety of mechanisms for implementing virtual networking take hold. Some hardware-based, some software-based, some using tunnels, others using tags, some relying more on traditional distributed protocols, others relying on SDN.

In the end, the market will choose the winning mechanism(s). In the meantime, let’s make sure we clarify the dialog so that informed decisions are possible.


Network Virtualization, Encapsulation, and Stateless Transport Tunneling (STT)

[This post was written with Bruce Davie, and Andrew Lambeth.]

Recently, Jesse Gross, Bruce Davie and a number of contributors submitted the STT draft to the IETF (link to the draft here). STT is an encapsulation format for network virtualization. Unlike other protocols in this space (namely VXLAN and NVGRE), it was designed to be used with soft switching within the server (generally in the vswitch in the hypervisor) while taking advantage of hardware acceleration at the NIC. The goal is to preserve the flexibility and development speed of software while still providing hardware forwarding speeds.

The quick list of differentiators are i) it takes advantage of TSO available in NICs today allowing tunneling from software at 10G while consuming relatively little cpu ii) there are more bits allocated to the virtual network meta data carried per packet, and those bits are unstructured allowing for arbitrary interpretation by software and iii) the control plane is decoupled from the actual encapsulation.

There are a number of other software-centric features like better byte alignment of the headers, but these are not architecturally significant.

Of course, the publication of the draft drew reasonable skepticism on whether the industry needed yet another encapsulation format. And that is the question we will focus on in this post.

But first, let us try to provide a useful decomposition of the virtual networking problem (as it pertains to Distributed Edge Overlays DEO).

Distributed Edge Overlays (DEO)

Distributed edge overlays have gained a lot of traction as a mechanism for network virtualization. A reasonable characterization of the problem space can be found in the IETF nvo3 problem statement draft. Two recent DEO related drafts submitted to the IETF in addition to STT are NVGRE, and VXLAN.

The basic idea is to use tunneling (generally L2 in L3) to create an overlay network from the edge that exposes a virtual view of one or more network to end hosts. The edge may be, for example, a vswitch in the hypervisor, or the first hop physical switch.

DEO solutions can be roughly decomposed into three independent components.

  • Encapsulation format:The encapsulation format is what the packet looks like on the wire. The format has implications both on hardware compatibility, and the amount of information that is carried across the tunnel with the packet.As an example of encapsulation, with NVGRE the encapsulation format is GRE where the GRE key is used to store some additional information (the tenant network ID).
  • Control plane:The control plane disseminates the state needed to figure out which tunnels to create, which packets go in which tunnels, and what state is associated with packets as they traverse the tunnels. Changes to both the physical and virtual views of the network often require state to be updated and/or moved around the network.There are many ways to implement a control plane, either using traditional protocols (for example, NVGRE and the first VXLAN draft abdicate a lot of control responsibility to multicast), or something more SDN-esque like a centralized datastore, or even a proper SDN controller.
  • Logical view:The logical view is what the “virtual network” looks like from the perspective of an end host. In the case of VXLAN and NVGRE, they offer a basic L2 learning domain. However, you can imagine this being extended to L3 (to support very large virtual networks, for example), security policies, and even higher-level services.The logical view defines the network services available to the virtual machine. For example, if only L2 is available, it may not be practical to run workloads of thousands of machines within a single virtual network due to scaling limitations of broadcast. If the virtual network provided L3, it could potentially host such workloads and still provide the benefits of virtualization such as support for VM mobility without requiring IP renumbering, higher-level service interposition (like adding firewalls), and mobile policies.

Before we jump into a justification for STT, we would like to make the point that each of these components really are logically distinct, and a good design should keep them decoupled.  Why? For modularity. For example, if a particular encapsulation format has broad hardware support, it would be a shame to restrict it to a single control plane design. It would also be a shame to restrict it to a particular logical network view.

VXLAN and NVGRE or both guilty of this to some extent. For example, the NVGRE and the original VXLAN draft specify multicast as the mechanism to use for part of the control plane (with other parts left unspecified). The latest VXLAN addresses this somewhat, which is a great improvement.

Also, both VXLAN and NVGRE fix the logical forwarding model to L2 going as far as to specify how the logical forwarding tables get populated. Again, this is an unnecessary restriction.

For protocols that are hardware centric (which both VXLAN and NVGRE appear to me), this makes some modicum of sense, lookup space is expensive, and decoupling may require an extra level of indirection.  However, for software this is simply bad design.

STT on the other hand limits its focus to the encapsulation format, and does not constrain the other components within the specification.

[Note: The point of this post is not to denigrate VXLAN or NVGRE, but rather to point out that they are comparatively less suited for running within the vswitch. If the full encap/decap and lookup logic is resides fully within hardware, VXLAN and NVGRE are both well designed and reasonable options.]

OK, on to a more detailed justification for STT

To structure the discussion, we’ll step through each logical component of the DEO architecture and describe the design decisions made by STT and how they compare to similar proposals.

Logical view: It turns out that the more information you can tack on to a packet as it transits the network, the richer a logical view you can create. Both NVGRE and VXLAN not only limit the additional information to 32 bits, but they also specify that those bits must contain the logical network ID. This leaves no additional space for specifying other aspects of the logical view that might be interesting to the control plane.

STT differs from NVGRE and VXLAN in two ways. First, it allocates more space to the per-packet metadata. Second, it doesn’t specify how that field is interpreted. This allows the virtual network control plane to use it for state versioning (useful for consistency across multiple switches), additional logical network meta-data, tenant identification, etc.

Of course, having a structured field of limited size makes a lot of sense for NVGRE and VXLAN where it is assumed that encap/decap and interpretation of those bits are likely to be in switching hardware.

However, STT is optimizing for soft switching with hardware accelerating in the NIC. Having larger, unstructured fields provides more flexibility for the software to work with. And, as I’ll describe below, it doesn’t obviate the ability to use hardware acceleration in the NIC to get vastly better performance than a pure software approach.

Control Plane: The STT draft says nothing about the control plane that is used for managing the tunnels and the lookup state feeding into them. This means that securing the control channel, state dissemination, packet replication, etc. are outside of, and thus not constrained by, the spec.

Encapsulation format: This is were STT really shines. STT was designed to take advantage of TSO and LRO engines in existing NICs today. With STT, it is possible to tunnel at 10G from the guest while consuming only a fraction of a CPU core. We’ve seen speedups up to 10x over pure software tunneling.

(If you’re not familiar with TSO or LRO, you may want to check out the wikipedia pages here and here.)

In other words, STT was designed to let you retain all the high performance features of the NIC when you start tunneling from the edge, while still retaining the flexibility of software to perform the network virtualization functions.

Here is how it works.

When a guest VM sends a packet to the wire, the transitions between the guest and the hypervisor (this is a software domain crossing which requires flushing the TLB, and likely the loss of cache locality, etc.) and the hypervisor and the NIC are relatively expensive. This is why hypervisor vendors take pains to always support TSO all the way up to the guest

Without tunneling, vswitches can take advantage of TSO by exposing a TSO enabled NIC to the guest and then passing large TCP frames to the hardware NIC which performs the segmentation. However, when tunneling is involved, this isn’t possible unless the NIC supports segmentation of the TCP frame within the tunnel in hardware (which hopefully will happen as tunneling protocols get adopted).

With STT, the guests are also exposed to a TSO enabled NIC, however instead of passing the packets directly to the NIC, the vswitch inserts an additional header that looks like a TCP packet, and performs all of the additional network virtualization procedures.

As a result, with STT, the guest ends up sending and receiving massive frames to the hypervisor (up to 64k) which are then encapsulated in software, and ultimately segmented in hardware by the NIC. The result is that the number of domain crossings are reduced by a significant factor in the case of high-throughput TCP flows.

One alternative to going through all this trouble to amortize the guest/hypervisor transistions is to try eliminating them altogether by exposing the NIC HW to the guest, with a technique commonly referred to as passthrough. However, with passthrough software is unable to make any forwarding decisions on the packet before it is sent to the NIC. Passthrough creates a number of problems by exposing the physical NIC to the guest which obviates many of the advantages of NIC virtualization (we describe these shortcomings at length here).

For modern NICs that support TSO and LRO, the only additional overhead that STT provides over sending a raw L2 frame is the memcpy() of the header for encap/decap, and the transmission cost of those additional bytes.

It’s worth pointing out that even if reassembly on the receive side is done in software (which is the case with some NICs), the interrupt coalescing between the hypervisor and the guest is still a significant performance win.

How does this compare to other tunneling proposals? The most significant difference is that NICs don’t support the tunneling protocols today, so they have to be implemented in software which results in a relatively significant performance hit.  Eventually NICs will support multiple tunneling protocols, and hopefully they will also support the same stateless (on the send side) TCP segmentation offloading.  However, this is unlikely to happen with LOM for awhile.

As a final point, much of STT was designed for efficient processing in software. It contains redundant fields in the header for more efficient lookup and padding to improve byte-alignment on 32-bit boundaries.

So, What’s Not to Like?

STT in it’s current form is a practical hack (more or less). Admittedly, it follows more of a “systems” than a networking aesthetic. More focus was put on practicality, performance, and software processing, than being parsimonious with lookup bits in the header.

As a result, there are definitely some blemishes.  For example, because it uses a valid TCP header, but doesn’t have an associated TCP state machine, middleboxes that don’t do full TCP termination are likely to get confused (although it is a little difficult for us to see this as a real shortcoming given all of the other problems passive middleboxes have correctly reconstructing end state). Also, there is currently no simple way to distinguish it from standard TCP traffic (again, a problem for middleboxes). Of course, the opacity of tunnels to middleboxes is nothing new, but these are generally fair criticisms.

In the end, our guess is that abusing existing TSO and LRO engines will not ingratiate STT with traditional networking wonks any time soon … :)

However, we believe that outside of the contortions needed to be compatible with existing TSO/LRO engines, STT has a suitable design for software based tunneling with hardware offload. Because the protocol does not over-specify the broader system in which the tunnel will sit, as the hardware ecosystem evolves, it should be possible to also evolve the protocol fields themselves (like getting rid of using an actual TCP header and setting the outer IP protocol to 6) without having to rewrite the control plane logic too.

Ultimately, we think there is room for a tunneling protocol that provides the benefits of STT, namely the ability to do processing in software with minimal hardware offload for send and receive segmentation. As long as there is compatible hardware, the particulars or the protocol header are less important. After all, it’s only (mostly) software.


Networking Needs a VMware (Part 1: Address Virtualization)

[This post was written with Andrew Lambeth]

Our last post “Networking Doesn’t Need a VMware” made the point that drawing a simple analogy between server and network virtualization can steer the technical discourse on network virtualization in the wrong direction. The sentiment comes from the many partner, analyst, and media meetings we’ve been involved in that persistently focus on relatively uninteresting areas of the network virtualization space, specifically, details of encapsulation formats and lookup pipelines.

In this series of writeups, we take a deeper look and discuss some areas in which network
virtualization would do well to emulate server virtualization. This is a fairly broad topic so we’ll break it up across a couple of posts.

In this part, we’ll focus on address space virtualization.

Quick heads up that the length of this post got a little bit out of hand. For those who don’t have the time/patience/inclination/attention span, the synopsis is as follows:

One of the key strengths of a hypervisor lies in its insertion of a completely new address space below the guest OS’s view of what it believes to be the physical address space. And while there are several possible ways to interpose on network address space to achieve some form of virtualization, encapsulation provides the closest analog to the hierarchical memory virtualization used in compute. It does so by taking advantage of the hierarchy inherent in the physical topology, and allowing both the virtual and physical address spaces to support complete forwarding and addressing models. However, like memory virtualization’s page table, encapsulation requires maintenance of the address mappings (rules to tunnel mappings). The interface for doing so should be open, and a good candidate for that interface is OpenFlow.

Now, onto the detailed argument …

Virtual Memory in Compute

Virtual memory has been a core component of compute virtualization for decades. The basic concept is very simple, multiple (generally sparse) virtual address spaces are multiplexed to a single compact physical address using a page table that contains the between the two.

An OS virtualizes the address space for a process by populating a table with a single level of translations from virtual addresses (VA) to physical addresses (PA). All hypervisors support the ability for a guest OS to continue working in this mode by adding the notion of a third address space called machine addresses (MA) for the true physical addresses.

Since x86 hardware initially supported only a single level of mappings the hypervisor implemented a complete MMU in software to capture and maintain the guest’s VA to PA mappings, and then created a second set of mappings from guest PA to actual hardware MA. What was actually programmed into hardware was VA to MA mappings, in order to not incur overhead during the actual memory references by the guest, but the full heirarchy was maintained so at any time the hypervisor components could easily take an address from any of the three address spaces and map it back to any of the other address spaces.

Having a multi-level hierarchical mapping was so powerful that eventually CPU vendors added support for multi-level page tables in the hardware MMU (called Nested Page Tables on AMD and Extended Page Tables on Intel). Arguably this was the biggest architectural change to CPUs over the last decade.

Benefits of address virtualization in compute

Although this is pretty basic stuff, it is worth enumerating the benefits it provides to see how these can be applied to the networking world.

  • It allows the multiplexing of multiple large, sparse address spaces onto a smaller, compact physical address space.
  • It supports mobility of a process within a physical address space. This can be used to more efficiently allocated processes to memory, or take advantage of new memory as it is added.
  • VMs don’t have to coordinate with other VMs to select their address space. Or more generally, there are no constraints of the virtual address space that can be allocated to a guest VM.
  • A virtual memory subsystem provides a basic unit of isolation. A VM cannot mistakingly (or maliciously) address another VM unless another part of the system is busted or compromised.

Address Virtualization for Networking

The point of this post is to explore whether memory virtualization in compute can provide guidance on network virtualization. It would be nice, for example, to retain the same benefits that made memory virtualization so successful.

We’ll start with some common approaches to network virtualization and see how they compare.

Tagging : The basic idea behind tagging is to mark packets at the edge of the network with some bits (the tag) that contains the virtual context. This tag generally encodes a unique identifier for the virtual network and perhaps the virtual ingress port. As the packet traverses the network, the tag is used to segment the forwarding tables to only apply to rules associated with that tag.

A tag does not provide “virtualization” in the same way virtual memory of compute does. Rather it provides segmentation (which was also a phase compute memory went through decades ago). That is, it doesn’t introduce a new address space, but rather segments an existing address space. As a result, the same addresses are used for both physical and virtual purposes. The implications of this can be quite limiting. For example:

  • Since the addresses are used to address things in the virtual world *and* the physical world, they have to be exposed to the physical forwarding tables. Therefore the nice property of aggregation that comes with hierarchical address mapping cannot be exploited. Using VLANs as an example, every VM MAC has to be exposed to the hardware putting a lot of pressure on physical L2 tables. If virtual addresses where mapped to a smaller subset of physical addresses instead, the requirement for large L2 tables would go away.
  • Due to layering, tags generally only segment a single address space (e.g. L2). Again, because there is no new address space introduced, this means that all of the virtual contexts must all have identical addressing models (e.g. L2) *and* the virtual addresses space must be the same as the physical address space. This unnecessarily couples the virtual and physical worlds. Imagine a case in which the model used to address the virtual domain would not be suitable for physical forwarding. The classic example of this is L2. VMs are given Ethernet NICs and may want to talk using L2 only. However, L2 is generally not a great way of building large fabrics. Introducing an additional address space can provide the desired service model at the virtual realm, and the appropriate forwarding model in the physical realm.
  • Another shortcoming of tagging is that you cannot take advantage of mobility through address remapping. In the virtual address realm, you can arbitrarily map a virtual address to a physical address. If the process or VM moves, the mapping just needs to be updated.  A tag however does not provide address mobility.  The reason that VEPA, VNTAG, etc. support mobility within an L2 domain is by virtue of learned soft state in L2 (which exists whether or not a tag is in use).  For layers that don’t support address mobility (say vanilla IP), using a tag won’t somehow enable it.It’s worth pointing out that this is a classic problem within virtualized datacenters today. L2 supports mobility, and L3 often does not. So while both can be segmented using tags, often only the L2 portion allows mobility confining VMs to a given subnet. Most of the approaches to VM mobility across subnets introduce another layer of addressing, whether this is done with LISP, tunnels, etc. However, as soon as encapsulation is introduced, it begs the question of why use tags at all.

There are a number of other differences between tagging and address-level virtualization (like requiring all switches en route to understand the tag), but hopefully you’ve gotten the point. To be more like memory virtualization, a new addressing layer needs to be present, and segmenting the physical address space doesn’t provide the same properties.

Address Mapping : Another method of network virtualization is address mapping. Unlike tagging, a new address space is introduced and mapped onto the physical address space by one or more devices in the network. The most common example of address mapping is NAT, although it doesn’t necessarily have to be limited to L3.

To avoid changing the framing format of the packet, address mapping operates by updating the address in place, meaning that the same field is used for the updated address. However, multiplexing multiple larger address spaces onto a smaller physical address space (for example, a bunch of private IP subnets onto a smaller physical IP subnet) is a lossy operation and therefore requires additional bits to map back from the smaller space to the larger space.

Here in lies much of the complexity (and commensurate shortcoming) of address mapping. Generally, the additional bits are added to a smaller field in the packet, and the full mapping information is stored as state on the device performing translation. This is what NAT does, the original 5 tuple is stashed on the device, and the ephemeral transport port is used to launder a key that points back to that 5 tuple on the return path.

When compared to virtual memory, the model doesn’t hold up particularly well. Within a server, the entire page table is always accessible, so addresses can be mapped back and forth with some additional information to aid in the demultiplexing (generally the pid). With network address translation, the “page table” is created on demand, and stored in a single device. Therefore, in order to support component failover of asymmetric paths, the per-flow state has to be replicated to all other devices that could be on the alternate path.  Also, because state is set up during flow initiation from the virtual address, inbound flows cannot be forwarded unless the destination public IP address is effectively “pinned”  to a virtual address.  As a result, virtual to virtual communication in which the end points are behind different devices have to resort to OOB techniques like NAT punching in order to communicated.

This doesn’t mean that addressing mapping isn’t hugely useful in practice. Clearly for mapping from a virtual address space to a physical address space this is the correct approach. However, if communicating from a virtual address to a virtual address through a physical address space, it it is far more limited than its compute analog.

Tunneling : By tunneling we mean that the payload of a packet is another packet (headers and all).

Like address mapping (and virtual memory) tunneling introduces another address space. However, there are two differences. First, the virtual address space doesn’t have to look anything like the physical. For example, it is possible to have the virtual address space be IPv6 and the physical address space be IPv4. Second, the full address mapping is stored in the packet so that there is no need to create per-flow state within the network to manage the mappings.

Tunneling (or perhaps more broadly encapsulation) is probably the most popular method of doing full network virtualization today. TRILL uses L2 in L2, LISP uses encapsulation, VXLAN (which is LISP as far as I can tell) uses L2 in L3, VCDNI uses L2 in L2, NVGRE uses L2 in L3, etc.

The general approach maps to virtual memory pretty well. The outer header can be likened to a physical address. The inner header can be likened to a virtual address. And often a shim header is included just after the outer header that contains demultiplexing information (the equivalent of a PID). The “page table” consists of on-datapath table table entries which map packets to tunnels.

Take for example an overlay mesh (like NVGRE). The L2 table at the edge that points to the tunnels is effectively mapping from virtual addresses to physical addresses. And there is no reason to limit this lookup to L2, it could provide L2 and L3 in the virtual domain. If a VM moves, this “page table” is updated as it would be in a server of a process is moved.

Because the packets are encapsulated, switches in the “physical domain” only have to deal with the outer header greatly reducing the number of addresses they have to deal with. Switches which contain the “page table” however, have to do lookups in the virtual world (e.g. map from packet to tunnel), map to the physical world (throw a tunnel on the packet) and then forward the packet in the physical world (figure out which port to send the tunneled traffic out on). Note that these are effectively the same steps an MMU takes within a server.

Further, like memory virtualization, tunneling provides nice isolation properties. For example, a VM in the virtual address space cannot address the physical network unless the virtual network is somehow bridged into it.

Also, like hierarchical memory, it is possible for the logical networks to have totally different forwarding stacks. One could be L2-only, the other ipv4, and another ipv6, and all of these could differ from the physical substrate. A common setup has IPv6 run in the virtual domain (for end-to-end addressing), and IPv4 for the physical fabric (where a large address space isn’t needed).

Fortunately, we already have a a multi-level hierarchical topology in the network (in particularly the datacenter). This allows for either the addition of a new layer at the edge which provides a virtual address space without changing any other components, or just changing the components at the outer edge of the hierarchy.

So what are the shortcomings of this approach? The most immediate are performance issues with tunneling, additional overhead overhead in the packet, and the need to maintain the distributed “page table”.

Tunneling performance is no longer the problem it use to be. Even from software in the server, clever tunnel implementations are able to take advantage of LRO and the like and can achieve 10G performance without much CPU. Hardware tunneling on most switching chipsets performs at line rate as well. Header overhead marginally increases transmission delay, and will reduce total throughput if a link is saturated.

Keeping the “page tables” up to date on the other hand is still something the industry is grappling with. Most standards punt on describing how this is done, or even the interface to use to write to the “page table”. Some piggyback on L2 learning (like NVGRE) to construct some of the state but still rely on an out of band mechanism for other state (in the case of NVGRE the VIF to tenantID mapping).

As we’ve said many times in the past, this interface should be standardized if for no other reason than to provide modularity in the architecture. Not doing that would be like limiting a hardware MMU on a CPU to only work with a single Operating System.

Fortunately, Microsoft realizes this need and has opened the interface to their vswitch on Windows Server 8, and XenServer, KVM and other Linux-based hypervisors support Open vSwitch.

For hardware platforms it would also be nice to expose the ability to manage this state. Of course, we would argue that OpenFlow is a good candidate if for no other reason than it has been already used for this purpose successfully.

Wrapping Up ..

If the analogy of memory virtualization in compute holds then encapsulation is the correct way to go about network virtualization. It introduces a new address space that is unrestricted in forwarding model, it takes advantage of hierarchy inherent in the physical topology allowing for address aggregation further up in the hierarchy, and it provides the basic virtualization properties of isolation and transparent mobility.

In the next post of this series, we’ll explore how server virtualization suggests that the way that encapsulation is implemented today isn’t quite right, and some things we may want to do to fix it.


Networking Doesn’t Need a VMWare …

[This post was written with Andrew Lambeth.  Andrew has been virtualizing networking for long enough to have coined the term "vswitch", and led the vDS distributed switching project at VMware. ]

Or at least, it doesn’t need to solve the problem in the same way.

It’s commonly said that “networking needs a VMWare”. Hell, there have been occasions in which we’ve said something very similar. However, while the analogy has an obvious appeal (virtual, flexible, thin layer of indirection in software, commoditize, commoditize, commoditize!), a closer look suggests that it draws from a very superficial understanding of the technology, and in the limit, it doesn’t make much sense.

It’s no surprise that many are drawn to this line of thought. It probably stems from the realization that virtualizing the network rather than managing the physical components is the right direction for networks to evolve. On this point, it appears there is broad agreement. In order to bring networking up to the operational model of compute (and perhaps disrupt the existing supply chain a bit) virtualization is needed.

Beyond this gross comparison, however, the analogy breaks down. The reality is that the technical requirements for server virtualization and network virtualization are very, very different.

Server Virtualization vs. Network Virtualization

With server virtualization, virtualizing CPU, memory and device I/O is incredibly complex, and the events that need to be handled with translation or emulation happen at CPU cycle timescale. So the virtualization logic must be both highly sophisticated and highly performant on the “datapath” (the datapath for compute virtualization being the instruction stream and I/O events).

On the other hand, the datapath operations for network virtualization are almost trivially simple. All they involve is mapping one address/context space to another address/context space. This effectively reduces to an additional header on the packet (or tag), and one or two more lookups on the datapath. Somewhat revealing of this simplicity, there are multiple reasonable solutions that address the datapath component, NVGRE, and VXLAN being two recently publicized proposals.

If the datapath is so simple, it’s reasonable to ask why network virtualization isn’t already a solved problem.

The answer, is that there is a critical difference between network virtualization and server virtualization and that difference is where the bulk of complexity for network virtualization resides.

What is that difference?

Virtualized servers are effectively self contained in that they are only very loosely coupled to one another (there are a few exceptions to this rule, but even then, the groupings with direct relationships are small). As a result, the virtualization logic doesn’t need to deal with the complexity of state sharing between many entities.

A virtualized network solution, on the other hand, has to deal with all ports on the network, most of which can be assumed to have a direct relationship (the ability to communicate via some service model). Therefore, the virtual networking logic not only has to deal with N instances of N state (assuming every port wants to talk to every other port), but it has to ensure that state is consistent (or at least safely inconsistent) along all of the elements on the path of a packet. Inconsistent state can result in packet loss (not a huge deal) or much worse, delivery of the packet to the wrong location.

It’s important to remember that networking traditionally has only had to deal with eventual consistency. That is “after state change, the network will take some time to converge, and until that time, all bets are off”. Eventual consistency is fine for basic forwarding provided that loops are prevented using a TTL, or perhaps the algorithm ensures loop freedom while it is converging. However, eventual consistency doesn’t work so well with virtualization. During failure, for example, it would suck if packets from tenant A managed to leak over to tenant B’s network. It would also suck if ACLs configured in tenant A’s were not enforced correctly during convergence.

So simply, the difference between server virtualization, and network virtualization is that network virtualization is all about scale (dealing with the complexity of many interconnected entities which is generally a N2 problem), and it is all about distributed state consistency. Or more concretely, it is a distributed state management problem rather than a low level exercise in dealing with the complexities of various hardware devices.

Of course, depending on the layer of networking being virtualized, the amount of state that has to be managed varies.

All network virtualization solutions have to handle basic address mapping. That is, provide a virtual address space (generally addresses of the packets within the tunnel) and the physical address space (the external tunnel header), and a mapping between the two (virtual address X is at physical address Y). Any of the many tunnel overlays solutions, whether ad hoc, proprietary, or standardized provide this basic mapping service.

To then virtualize L2, requires almost no additional state management. The L2 forwarding tables are dynamically populated from passing traffic. And the size of a single broadcast domain has fairly limited scale, supporting hundreds or low thousands of active MACs. So the only additional state that has to be managed is the association of a port (virtual or physical) to a broadcast domain which is what virtual networking standards like NVGRE and VXLAN provide.

As an aside, it’s a shame that standards like NVGRE and VXLAN choose to dictate the wire format (important for hardware compatibility) and the method for managing the context mapping between address domains (multicast), but not the control interface to manage the rest of the state. Specifying the wire format is fine. However, requiring a specific mechanism (and a shaky one at that) for managing the virtual to physical address mappings severely limits the solution space. And not specifying the control interface for managing the rest of the state effectively guarantees that implementations will be vertically integrated and proprietary.

For L3, there is a lot more state to deal with, and the number of end points to which this state applies can be very large. There are a number of datacenters today who have, or plan to have, millions of VMs. Because of this, any control plane that hopes to offer a virtualized L3 solution needs to manage potentially millions of entries at hundreds of thousands of end points (assuming the first hop network logic is within the vswitch). Clearly, scale is a primary consideration.

As another aside, in our experience, there is a lot of confusion on what exactly L3 virtualization is. While a full discussion will have to wait for a future post, it is worth pointing out that running a router as a VM is *not* network virtualization, it is x86 virtualization. Network virtualzation involves mapping between network address contexts in a manner that does not effect the total available bandwidth of the physical fabric. Running a networking stack in a virtual machine, while it does provide the benefits of x86 virtualization, limits the cross-sectional bandwidth of the emulated network to the throughput of a virtual machine. Ouch.

For L4 and above, the amount of state that has to be shared and the rate that it changes increases again by orders of magnitude. Take, for example, WAN optimization. A virtualized WAN optimization solution should be enforced throughout the network (for example, each vswitch running a piece of it) yet this would incur a tremendous amount of control overhead to create a shared content cache.

So while server virtualization lives and dies by the ability to deal with the complexity of virtualizing complex hardware interfaces of many devices at speed, network virtualization’s primary technical challenge is scale. Any solution that doesn’t deal with this up front will probably run into a wall at L2, or with some luck, basic L3.

This is all interesting … but why do I care?

Full virtualization of the network address space and service model is still a relatively new area. However, rather than tackle the problem of network virtualization directly, it appears that a fair amount of energy in industry is being poured into point solutions. This reminds us of the situation 10 years ago when many people were trying to solve server sprawl problems with application containers, and the standard claim was that virtualization didn’t offer additional benefits to justify the overhead and complexity of fully virtualizing the platform. Had that mindset prevailed, today we’d have solutions doing minimal server consolidation for a small handful of applications on only one or possibly two OSs, instead of a set of solutions that solve this and many many more problems for any application and most any OS. That mindset, for example could never have produced vMotion, which was unimaginable at the outset of server virtualization.

At the same time, those who are advocating for network virtualization tend to draw technical comparisons with server virtualization. And while clearly there is a similarity at the macro level, this comparison belies the radically different technical challenges of the two problems. And it belies the radically different approaches needed to solve the two problems. Network virtualization is not the same as server virtualization any more than server virtualization is the same as storage virtualization. Saying “the network needs a VMware” in 2012 is a little like saying “the x86 needs an EMC” in 2002.

Perhaps the confusion is harmless, but it does seem to effect how the solution space is viewed, and that may be drawing the conversation away from what really is important, scale (lots of it) and distributed state consistency. Worrying about the datapath , is worrying about a trivial component of an otherwise enormously challenging problem.


Defining “Fabric” in the Era of Overlays

[This post was written with Andrew Lambeth.  A version of it has been posted on Search Networking, but I wasn't totally happy with how that turned out.  So here is a revised version. ]

There has been a lot of talk about fabrics lately.  An awful lot.  However, our experience has been that the message is somewhat muddled and there is confusion on what fabrics are, and what they can do for the industry.  It is our opinion that the move to fabric is one of the more significant events in datacenter networking.  And it is important to understand this movement not only for the direct impact it is having on the way we build datacenter networks, but the indirect implications it will have on the networking industry more broadly. (I’d like to point out that there is nothing really “new” in this writeup.  Ivan and Brad have been covering these issues for a long time.  However, I do think it is worth refining the discussion a bit, and perhaps providing an additional perspective.)

Lets first tackle the question of, “why fabric” and “why now”?  The short answer is that the traditional network architecture was not designed for modern datacenter workloads.

The longer answer is that datacenter design has evolved to treat all aspects of the infrastructure (compute, storage, and network) as generic pools of resources.  This means that any workload should be able to run anywhere.  However, traditional datacenter network design does not make this easy.  The classic three tier architecture (top of rack (ToR), aggregation, core) has non-uniform access to bandwidth and latency depending on the traffic matrix.  For example, hosts connected to the same top ToR switch will have more bandwidth (and lower latency) than hosts connected through an aggregation switch, which will again have access to more total bandwidth than hosts trying to communicate through the core router.  The net result? Deciding where to put a workload matters.  Meaning that allocating workloads to ports is a constant bin-packing problem, and in dynamic environments, the result is very likely to be suboptimal allocation of bandwidth to workloads, or suboptimal utilization of compute due to placement constraints.

Enter fabric.   In our vernacular (admittedly there is ample disagreement on what exactly a fabric is),  a fabric is a physical network which doesn’t constrain workload placement.  Basically, this means that communicating between any two ports should have the same latency, and the bandwidth between any disjoint subset of ports is non-oversubscribed.  Or more simply, the physical network operates much as a backplane does within a network chassis.

The big question is, in addition to highly available bandwidth, what should a fabric offer?  Let’s get the obvious out of the way.  In order to offer multicast, the fabric should support packet replication in hardware as well as a way to manage multicast groups.  Also, the fabric should probably offer some QoS support in which packet markings indicate the relative priority to aid packet triage during congestion.

But what else should the fabric support?  Most vendor fabrics on the market tout a wide array of additional capabilities. For example, isolation primitives (VLAN and otherwise), security primitives, support for end-host mobility,  and support for programmability, just to name a few.

Clearly these features add value in a classic enterprise or campus network.  However, the modern datacenter hosts very different types of workloads.  In particular, datacenter system design often employes overlays at the end hosts which duplicate most of these functions.  Take for example a large web-service, it isn’t uncommon for load balancing, mobility, failover, isolation and security to be implemented within the load balancer, or the back-end application logic.  Or a distributed compute platform.  Similar properties are often implemented within the distribution harness rather than relying on the fabric.  Even virtualized hosting environments (such as IaaS) are starting to use overlays to implement these features within the vswitch (see for example NVGRE or VXLAN).

There is good reason to implement these functions as overlays at the edge.  Minimally it allows compatibility with any fabric design.  But much more importantly, the edge has extremely rich semantics with regard to true end-to-end addressing, security contexts, sessions, mobility events, and so on.  And implementing at the edge allows the system builders to evolve these features without having to change the fabric.

In such environments, the primary purpose of the fabric is to provide raw bandwidth, and price/performance not features/performance is king.  This is probably why many of the datacenter networks we are familiar with (both in big data and hosting) are in fact IP fabrics.  Simple, cheap and effective.  That is also why many next generation fabric companies and projects are focused on providing low-cost IP fabrics.

If existing deployments of the most advanced datacenters in the world are any indication, edge software is going to consume a lot of functionality that has traditionally been in the network.  It is a non-disruptive disruption whose benefits are obvious and simple to articulate. Yet the implications it could have on the traditional network supply chain are profound.


NVGRE, VXLAN and what Microsoft is Doing Right

There has been more movement in the industry towards L2 in L3 tunneling from the edge as the approach for tackling issues with virtual networking.

Hot on the heals of the VMWare/Cisco-led VXLAN announcement, an Internet draft on NVGRE (authored by Microsoft, Intel, Dell, Broadcom, and Arista, but my guess is that Microsoft is the primary driver) showed up with relatively little fanfare. You can check it out here.

In this blog post, I’ll briefly introduce NVGRE . However, I’d like to spend more time providing broader context on where these technologies fit into the virtual networking solution space. Specifically, I’ll argue that the tunneling protocol is a minor aspect of a complete solution, and we need open standards around the control and configuration interface as much as we need to standardize the wire format.

We’ll get to that in a few paragraphs. But first, What is NVGRE?

NVGRE is very similar to VXLAN (my comments on VXLAN here). Basically, it uses GRE as a method to tunnel L2 packets across an IP fabric, and uses 24 bits of the GRE key as a logical network discriminator (which they call a tenant network ID or TNI). By logical network discriminator, I mean it indicates which logical network a particular packet is part of. Also like VXLAN, logical broadcast is achieved through physical multicast.

A day in the life of a packet is simple. I’ll sketch out the case of a packet being sent from a VM. The vswitch, on receiving a packet from a vNic, does two lookups (a) it uses the destination MAC address to determine which tunnel to send the packet to (b) it uses the ingress vNic to determine the tenant network ID. If the MAC in (a) is known, the vswitch will cram the packet into the associated point-to-point GRE tunnel, setting the GRE key to the tenant network ID. If it isn’t know, it will tunnel the packet to the multicast address associated with that tenant network ID. Easy peasy.

While architecturally NVGRE is very similar to VXLAN, there are some differences that have practical implications.

On positive side, NVGRE’s use of GRE eases the compatibility requirement for existing hardware and software stacks. Many of the switching chips I’m familiar with already support GRE, so porting it to these environments is likely much easier than a non-supported tunneling format.

As another example, on a quick read of the RFC, it appears that Open vSwitch can already support NVGRE — it supports GRE, allows for looking and setting the GRE key (including masking, so can be limited to 24 bits), and it supports learning and explicit population of the logical L2 table. In Open vSwitch’s case, all of this can be driven through OpenFlow and the Open vSwitch configuration protocol.

On the other hand, GRE does not take advantage of a standard transport protocol (TCP/UDP), so logical flow information cannot be reflected in the outer header ports like you can for VXLAN. This means that ECMP hashing in the fabric cannot provide flow-level granularity which is desirable to take advantage of all bandwidth. As the protocol catches on, this is simple to address in hardware and will likely happen. [Update:  Since writing this, I've been told that some hardware can use the GRE key in the ECMP hash, which improves the granularity of load balancing in the fabric.  However, per-flow load-balancing  (over the logical 5 tuple) is still not possible.]

OK, so there you have it. A very brief glimpse of NVGRE. Sure there is more to say, but I have a hard time getting excited about tunneling protocols. Why? That is what I’m going to talk about next.

Tunneling vs. Network Virtualization

So, while it is moderately interesting to explore NVGRE and VXLAN, it’s important to remember that the tunneling formats are really a very minor (and easily changed) component of a full network virtualization solution. That is, how a system tunnels, whether over UDP, or GRE, or CAPWAP or something else, doesn’t define what functions the system provides. Rather it specifies what the tunnel packets look like on the wire (an almost trivial consideration).

It’s also important to remember that these two proposals (VXLAN and NVGRE) in specific, while an excellent step in the right direction, are almost certainly going to see a lot of change going forward. As they stand now, they are fairly limited. For example, they only support L2 within the logical network. Also, they both abdicate a lot of responsibility to multicast. This limits scalability on many modern fabrics, requires the deployment of multicast which many large operators eschew, and has real shortcomings when it comes to speed of provisioning a new network, manageability, and security.

Clearly these issues are going to be addressed as the protocols mature. For example, it’s likely that there will be support for L2 and L3 in the logical network, as well as ACLs, QoS primitives, etc. Also, it is likely that there will be some primitives for support secure group joins, and perhaps even support for creating optimized multicast trees that don’t rely on the fabric (some existing solutions already do this).

So if we step back and look at the current situation. We have “standardized” the tunneling protocol and some basic mechanism for L2 in logical space, and not much else. However, we know that these protocols will need evolve going forward. And it’s very likely that this evolution will have non-trivial dependency on the control path.

Therefore, I would argue that the real issue is not “lets standardize the tunneling protocol” but rather, “lets standardize the control interface to configure the tunnels and associated state so system developers can use it address these and future challenges”. That is, if vswitches and pswitches of the future have souped up versions of NVGRE and VXLAN, something is still going to have to provide the orchestration of these primitives.  And that is where the real function comes from. Neither of these proposals have addressed this, for example, the interface to provide the mapping from a vNIC to a tenant network ID is unspecified.

So what might this look like? I’ll use Open vSwitch to provide a specific example. Open vSwitch is a great platform for NVGRE (and soon VXLAN). It supports tens of thousands of GRE tunnels without a problem. And it allows setting and doing lookups on the GRE key. But what makes it immediately useful is not that it supports these tunneling primitives, but that a third party can pick it up, and use OpenFlow to configure those keys, tunnels, and any other network state (ACLs, L3, etc.) to build a full solution.

And that is the crux of the issue. If you get a solution that supports VXLAN or NVGRE, even though they are standardized, you are only getting a piece of the solution. The second piece we need, and should all ask for, is an open standard for configuring this interface. We (Open vSwitch) use OpenFlow. But any standard will do.

What about Microsoft?

From what I can tell (and admittedly, I don’t know very much), this seems to be what Microsoft is getting right. Rather than just specify a tunneling protocol, it appears that they’re also opening up the vswitch interface to support implementations from multiple vendors. While this itself doesn’t provide an open interface to virtual networking configuration, it does allow someone to do that work. For example, NEC has announced they will be releasing an OpenFlow compatible vSwitch for Windows Server 8. My guess is that this is a port of Open vSwitch. (Hey NEC, if you are using Open vSwitch, care to share the changes with the rest of the community?)

In any case, NEC will probably charge you for it, but we will work hard to make a full Open vSwitch port for Windows Server 8 available. Of course it will be open source, so we encourage users to get it for free and innovate in the source as well as use the open management protocols.

We’ll also be demoing Open vSwitch support for NVGRE within OpenStack/Quantum, so stay tuned for that.

Anyway, kudos to Microsoft for supporting L2 in L3 and adding a contender to the pool. It will be interesting to see what finally pops out of the IETF standardization process. I presume it will be some agglomeration achieved through consensus.

And double kudos for opening the vswitch interface. I think we’re going to see a lot of cool innovation in networking around Windows Server 8. And that’s good for all of us.


The Rise of Soft Switching Part IV: Comments on the Hardware Supply Chain

[This post is written by Alex Bachmutsky and Martin Casado. Alex is a Distinguished Engineer at Ericsson in Silicon Valley, driving system architecture aspects of the company's next generation platform. He is the author of the book “Platform Design for Telecommunication Gateways,” and co-authored “WiMAX Evolution: Emerging Technologies and Applications”.]

This is yet another (unplanned) addendum to the soft switching series.

Our previous posts were arguing that for virtual edge switching, using the servers compute was the best point in the design space given the current hardware landscape. The basic argument was that it is possible to achieve 10G from the server in soft switching by dedicating a single core (about $60 – $120 of silicon depending on how you count). So, it is difficult to make the argument for passthrough plus specialized hardware for two reasons. First, given currently available choices, price performance will almost certainly be higher with x86. Second, you loose many of the benefits of virtualization that are retained when using soft switching (which we’ve explained here).

Alex submitted two very good (and very detailed) responses to our claims (you can see the summary here). In them he argued that looking at the basic components costs, a hardware offload solution should be both lower power, and provide better cost performance than doing forwarding in x86. He also argued that switching and NPU chipsets are flexible and support mature, high-level development environments, which may make them suitable for the virtual networking problem.

Alex is an expert in this area, he’s incredibly experienced, and, at least partially, he’s right. Specialized hardware should be able to hit better price/performance and lower power. And it is true that development environments have come a long way.

So what’s the explanation for the discrepancy in the view points?

That is what we’ve teamed up to discuss in this post. It turns out the differences in view are twofold. The first difference comes from the perspective of a large company (and commensurate purchasing power) versus that of a handful of customers. While today, “intelligent NICs” are a least twice (generally more like 3-5x) as expensive as a pure soft solution, this is a supply side issue and isn’t justifiable by the bill of materials (BoM). A company sufficiently large with sufficient investment could overcome this obstacle.

The second difference is distributing an appliance versus distributing software. Distributing an appliance allows ultimate control over hardware configuration, development environment, runtime environment etc. Distributing software, on the other hand, often requires dealing with multiple hardware configurations, and complex software interactions both with drivers and the operating system.

Lets start with a quick resync:

In our original post, we argued that for most workloads, the most flexible and cost effective method for doing networking at the virtual edge is to do soft switching on the server (which is what 99% of all virtual deployments have done over the last decade).

The other two options we considered are using a switching chip (either in the NIC or the ToR switch) or an NPU (most likely on the NIC due to port density issues).

We’ll discuss Switching Chips First:

In our experience, non-NPU chipsets in the NIC are not sufficiently flexible for networking at the virtualized edge. The limitations are manyfold, size of lookup of metadata between tables, number of stages in the lookup pipeline, and economically viable table sizes (sometimes table can be increased, for instance, by external TCAM, but it is an expensive option and hence ignored here for budget sensitive implementations we are focusing on).  A good concrete example is table space for tunnels.  Network virtualization solutions often use lots and lots of tunnels (N^2 in the number of servers is not unusual).  Soft switches have no problem supporting tens of thousands of  tunnels without performance degradation.  This is one or two orders of magnitude more than you’ll find on existing non-NPU NIC chipsets.

So this is simply a limitation in the supply chain, perhaps a future NIC will be sufficiently flexible for the virtual networking problem, until then, we’ll omit this as an option.

ToR:

We both agreed that standard silicon is getting very close to being useful for edge virtual switching. The next generation of 10G switches appear to be particularly compelling. So while still suffering the flexiliby limitations and the shortcomings of hairpinning on inter-VM traffic (like reducing edge link bandwidth and ToR switching capacity), and table space issues, they do appear to be a viable option for some set of the switching decision going forward. This is due to improved support for tagging and tunneling, increased sizes in ACL tables, and improved lookup generality between tables.

So we’re hopeful that next generation ToR switches have an open interface like OpenFlow so that the network virtualization layer can manage the forwarding state.

So what about NPUs?

To be clear, Open vSwitch has already been ported to multiple NPU-NIC platforms. In the cases I am familiar with, inter-VM traffic through the CPU is far slower (presumedly due to DMA overhead) than keeping it on the x86. However, off server traffic requires less CPU.

To explore this solution space in more detail, we’ll subdivide the NPU space into two groups: classical NPUs and multicore-based NPUs.

Many NPUs in the former group have the required level of the processing flexibility, some even have integrated general-purpose CPUs that can be used to implement, for instance, OpenFlow-based control plane. However, their data plane processing is usually based on proprietary microcode that may be a development hurdle in terms of available expertise and toolchain.

Any investment to develop on such an NPU is a sunk cost, both in training the developers to the internal details of the hardware featureset and limitations, and developing the code to work within the environment. It is very difficult (if not impossible), for example, to port the microcode written for one such NPU to the NPU of another vendor.

So while this route is not only viable one but very likely beneficial for creating a mass produced appliance, the investment for supporting ”yet another NIC” in a software distribution is likely unjustifiable. And without sufficient economies of scale for the NPU (which can be ensured by a large vendor) the cost to the consumer would likely stunt adoption.

The latter group is based on general-purpose multicore CPUs, but it integrates some NPU-like features, such as streaming I/O, HW packet pre-classification, HW packet re-ordering and atomic flow handling, some level of traffic management, special HW offload engines and more. Since they are based on general-purpose processors, they could be programmed using similar languages and tools as other soft switch solutions. These NPUs can be either used as a main processing engine, or alternatively offload some tasks from the main CPU when integrated on intelligent NIC cards.

If the supply side can be ironed out, both in terms of purchase model and price, this would seem to strike a reasonable balance between development overhead, portability, and price/power/performance. Below, we’ll discuss some of the other challenges that need to be overcome on the supply side to be competitive with x86.

Price:

Let’s start with the economics of switching at the edge. First, in our experience, workloads in the cloud are either mostly idle, or handling lots of TCP traffic, generally HTTP offsite (trading and HPC being the obvious exceptions). Thus 10G at MTU size packets is not only sufficient, it is usually overkill. That said, it is good to be prepared as data usage continues to grow.

Soft switching can be co-resident with the management domain. This has been the standard with Xen deployments in which the Linux bridge or Open vSwitch shares the CPUs allocated to the management domain. For 1G, allocating a single core to both is most likely sufficient. For 10G, allocating an extra core is necessary.

So if we assume worst case (requiring a full core for networking), given a fairly modern CPU, a core weighs in at $60-$100. Motherboard and packaging for that core is probably another $50 (no need for additional memory or harddrive space for soft switching).

On the high end, if you could get specialized hardware for less than (say) $100 – $150 over the price of a standard NIC per server, then there could be an argument for using them. And since often NICs are bonded for resilience, it should probably be half of that or at least include dual ports for the same price for fault resiliency.

Unfortunately, to our knowledge, such a NIC doesn’t exist at those price points. If we’re wrong please let us know (price, relative power and performance) and we’ll update this. To date, all NPU-based NICs we’ve looked are 2-5 times what they should be to pencil out competitively.

However, clearly it should be possible to make a NIC that is optimized for virtual edge switching as the raw components (when purchased at scale) do not justify these high price tags. We’ve found that usually NPU chip vendors do not manufacture their own NIC cards (or do that only for evaluation purposes), and 3rd parties charge a significant premium for their role in the supply chain. Therefore, we posit that this is largely a supply chain issue and perhaps due to the immaturity of the market. Regular NICs are a commodity, intelligent NICs are still a “luxury” without a proportional relationship to their BOM differences.

Of course, this argument is based on traditional cloud packet loads. If the environment instead was hosting an application with small-sized and/or latency sensitive traffic (e.g. voice), then system design criteria would be very different and a specialized hardware solution becomes more compelling.

Drivers:

Drivers are non-existant or poor for specialized NICs. If you look at the HCL (hardware compatibility lists) for VMWare, Citrix, or even the common Linux distributions, NPU-based NICs are rarely (if ever) supported. Generally software solutions rely on the underlying operating system to provide the appropriate drivers. Going with a nonstandard NIC requires getting the hypervisor vendors to support it, which is unlikely. Even multicore NPUs are not supported well, because they are based on non-x86 ISA (MIPS, ARM, PPC) with much more limited support. Practically, it becomes chicken-and-egg problem: major hypervisor vendors don’t spend enough efforts to support lower volume multicore NPUs, and system developers don’t select these NPUs because of lack of drivers causing low volume. Any break out of that vicious circle may change the equation.

Tool Chain:

While the development tool set for embedded processors has come a long way, it still doesn’t match that of standard x86/Linux environment. To be clear, this is only a very minor hurdle to a large development shop with in house expertise in embedded development.

However, from the perspective of a smaller company, dealing with embedded development often means finding and employing relatively specialized developers, more expensive tools (often), and a more complex debug and testing environment. My (Martin’s) experience with side-by-side projects working both on specialized hardware environments, and standard (non-embedded) x86 server’s is that the former is at least twice as slow.

Where does that leave us?

A quick summary of the discussion thus far is as follows.

Considering only the cost and performance properties of specialized hardware, a virtual switching solution using them should have better price performance, and lower power than an equivalent x86 solution. However, few virtualized workloads could actually take advantage of the additional hardware. And supply side issues (no such component exists today at a competitive price point), and complications in inserting specialized hardware into today’s server ecosystem, remain enormous hurdles to realizing this potential. Which is probably why 99% of all virtual deployments over the last decade (and certainly all of the largest virtual operations in the world) have relied on soft switching.

Alex is right to remind us that it is not always a technology limitation, and that there is a room for hardware to come in at the right price/performance if the supply chain and development support matures. Which it very well may be. We hope that intelligent NICs will become more affordable and their price will reflect the BOM. We also hope that NPU vendors will sponsor in some ways the development of hypervisor and middleware layers by major SW suppliers as opposed to some proprietary solutions available on the market today.

As I’ve mentioned previously, there are a number of production deployments which use Open vSwitch directly on hardware. In an upcoming blog post, we’ll dig into those use cases a little further.


VXLAN: Moving Towards Network Virtualization

Thanks, Steve Herrod.

If you missed it, VMware, Cisco, Broadcom and others announced support for VXLAN a new tunneling format. For the curious, here is the RFC draft.

So quickly, what is VXLAN?  It’s an L2 in L3 tunneling mechanism that supports L2 learning and tenancy information.

And what does it do for the world?  A lot really.  There currently is fair bit of effort in the virtualization space to address issues of L2 adjacency across subnets, VM mobility across L3 boundaries, overlapping IP spaces, service interposition, and a host of other sticky networking shortcomings.  Like VXLAN, these solutions are often built on L2 in L3 tunneling, however the approaches are often form-fitted and rarely are they compatible with another implementation.

Announcing support for a common approach is a very welcome move from the industry. It will, of course, facilitate interoperability between implementations, and it will pave the way for broad support in hardware (very happy to see Broadcom and Intel in the announcements).

Ivan (as usual, ahead of the game) has already commented on some of the broader implications. His post is worth a read.

I don’t have too much to add outside of a few immediate comments.

First, we need an open implementation of this available.  The Open vSwitch project has already started to dig in and will try and have something out soon — perhaps for the next OpenStack summit. More about this as the effort moves along.

Second, this is a *great* opportunity for the NIC vendors to support acceleration of VXLAN in hardware.  It would be particularly nice if LRO support worked in conjunction with the tunneling so that interrupt overhead from the VMs is minimized, and the hardware handles segmentation and coalescing

Finally, it’s great to see this sort of validation for the L3 fabric, which is an excellent way to build a datacenter network.  IGP + ECMP + a well understood topology (e.g. CLOS or Flattened butterfly) is the foundation of many large fabrics today, and I predict many more going forward.

Exciting times.


Follow

Get every new post delivered to your Inbox.

Join 233 other followers