Of Mice and Elephants

[This post has been written by Martin Casado and Justin Pettit with hugely useful input from Bruce Davie, Teemu Koponen, Brad Hedlund, Scott Lowe, and T. Sridhar]

Overview

This post introduces the topic of network optimization via large flow (elephant) detection and handling.  We decompose the problem into three parts: (i) why large (elephant) flows are an important consideration, (ii) smart things we can do with them in the network, and (iii) detecting elephant flows and signaling their presence.  For (i), we explain the basics of elephants and mice and why this matters for traffic optimization. For (ii), we present a number of approaches for handling elephant flows in the physical fabric, several of which we’re working on with hardware partners.  These include using separate queues for elephants and mice (small flows), using a dedicated network for elephants such as an optical fast path, doing intelligent routing for elephants within the physical network, and turning elephants into mice at the edge. For (iii), we show that elephant detection can be done relatively easily in the vSwitch.  In fact, Open vSwitch has supported per-flow tracking for years. We describe how easy it is to identify elephant flows at the vSwitch and in turn provide proper signaling to the physical network using standard mechanisms.  We also show that it’s quite possible to handle elephants using industry-standard hardware based on chips that exist today.

Finally, we argue that it is important that this interface remain standard and decoupled from the physical network because the accuracy of elephant detection can be greatly improved through edge semantics such as application awareness and a priori knowledge of the workloads being used.

The Problem with Elephants in a Field of Mice

Conventional wisdom (somewhat substantiated by research) is that the majority of flows within the datacenter are short (mice), yet the majority of packets belong to a few long-lived flows (elephants).  Mice are often associated with bursty, latency-sensitive apps whereas elephants tend to be large transfers in which throughput is far more important than latency.

Here’s why this is important.  Long-lived TCP flows tend to fill network buffers end-to-end, and this introduces non-trivial queuing delay to anything that shares those buffers.  In a network of elephants and mice, this means that the latency-sensitive mice suffer. A second-order problem is that mice are generally very bursty, so adaptive routing techniques aren’t effective with them.  Therefore routing in data centers often uses stateless, hash-based multipathing such as equal-cost multi-path routing (ECMP).  Even for very bursty traffic, it has been shown that this approach is within a factor of two of optimal, independent of the traffic matrix.  However, using the same approach for a very small number of elephants can cause suboptimal network usage, like hashing several elephants onto the same link when another link is free.  This is a direct consequence of the law of small numbers and the size of the elephants.
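To make the hashing problem concrete, here’s a toy sketch of our own (purely illustrative, with made-up flows): hash a handful of elephants onto a small number of equal-cost links and, by the pigeonhole principle, some link ends up carrying two elephants while another may sit idle.

```python
# Toy illustration (not a real ECMP implementation): five elephant flows
# hashed onto four equal-cost links.  At least one link must carry two
# elephants, even if a free link is available.
import zlib

NUM_LINKS = 4

# Hypothetical elephant flows, identified by their 5-tuples.
elephants = [("10.0.0.%d" % i, "10.0.1.%d" % i, 6, 49152 + i, 80) for i in range(5)]

links = {i: [] for i in range(NUM_LINKS)}
for flow in elephants:
    # Stateless hash-based multipathing: link = hash(5-tuple) mod N.
    link = zlib.crc32(repr(flow).encode()) % NUM_LINKS
    links[link].append(flow)

for link, flows in links.items():
    print("link %d carries %d elephant(s)" % (link, len(flows)))
```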

Treating Elephants Differently than Mice 

Most proposals for dealing with this problem involve identifying the elephants, and handling them differently than the mice.  Here are a few approaches that are either used today, or have been proposed:

  1. Throw mice and elephants into different queues.  This doesn’t solve the problem of hashing long-lived flows to the same link, but it does alleviate the queuing impact on mice by the elephants.  Fortunately, this can be done easily on standard hardware today with DSCP bits (a sketch of this marking appears after this list).
  2. Use different routing approaches for mice and elephants.  Even though mice are too bursty to do something smart with, elephants are by definition longer lived and are likely far less bursty.  Therefore, the physical fabric could adaptively route the elephants while still using standard hash-based multipathing for the mice.
  3. Turn elephants into mice.  The basic idea here is to split an elephant up into a bunch of mice (for example, by using more than one ephemeral source port for the flow) and let end-to-end mechanisms deal with possible re-ordering.  This approach has the nice property that the fabric remains simple and uses a single queuing and routing mechanism for all traffic.  Also, SACK in modern TCP stacks handles reordering much better than traditional stacks.  One way to implement this in an overlay network is to modify the ephemeral port of the outer header to create the entropy needed by the multipathing hardware.
  4. Send elephants along a separate physical network.  This is an extreme case of approach 2.  One method of implementing this is to have two spines in a leaf/spine architecture, and have the top-of-rack direct the flow to the appropriate spine.  Often an optical switch is proposed for the spine.  One method for doing this is to make a policy-based routing decision using a DSCP value that by convention denotes “elephant”.
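As promised in the first approach above, here is a minimal sketch of sender-side DSCP marking. The DSCP value and the point at which marking happens are assumptions for illustration; the fabric would be configured, by convention, to map that value to a separate queue (or, as in the fourth approach, to a policy route).

```python
# Hypothetical sketch: once a flow is identified as an elephant, the sender
# (or a vSwitch acting on its behalf) can set the DSCP bits so the fabric
# can treat the flow differently.  The DSCP value below is illustrative.
import socket

ELEPHANT_DSCP = 8                  # whatever value the fabric maps to the elephant queue
TOS_VALUE = ELEPHANT_DSCP << 2     # DSCP occupies the upper six bits of the TOS byte

def mark_as_elephant(sock: socket.socket) -> None:
    """Mark the flow so the physical network queues/routes it as an elephant."""
    # IP_TOS is available on Linux; other platforms may differ.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_VALUE)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mark_as_elephant(s)            # mark before starting the bulk transfer
```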

Elephant Detection

At this point it should be clear that handling elephants requires detection of elephants.  It should also be clear that we’ve danced around the question of what exactly characterizes an elephant.  Working backwards from the problem of introducing queuing delays on smaller, latency-sensitive flows, it’s fair to say that an elephant has high throughput for a sustained period.

Often elephants can be determined a priori without actually trying to infer them from network effects.  In a number of the networks we work with, the elephants are either related to cloning, backup, or VM migrations, all of which can be inferred from the edge or are known to the operators.  vSphere, for example, knows that a flow belongs to a migration.  And in Google’s published work on using OpenFlow, they identified beforehand the flows on which they use the TE engine (reference here).

Dynamic detection is a bit trickier.  Doing it from within the network is hard due to the difficulty of flow tracking in high-density switching ASICs.  A number of sampling methods have been proposed, such as sampling the buffers or using sFlow. However the accuracy of such approaches hasn’t been clear due to the sampling limitations at high speeds.

On the other hand, for virtualized environments (which is a primary concern of ours given that the authors work at VMware), it is relatively simple to do flow tracking within the vSwitch.  Open vSwitch, for example, has supported per-flow granularity for the past several releases, with each flow record containing the bytes and packets sent.  Given a specified threshold, it is trivial for the vSwitch to mark certain flows as elephants.
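As a rough sketch of what that thresholding could look like (the thresholds and record format here are hypothetical, not Open vSwitch code; OVS simply provides the per-flow byte and packet counters such logic would sit on top of):

```python
# Hedged sketch of per-flow accounting at the vSwitch edge.  Thresholds are
# made up for illustration; real deployments would tune them.
import time
from dataclasses import dataclass, field

BYTE_THRESHOLD = 10 * 1024 * 1024      # 10 MB transferred ...
DURATION_THRESHOLD = 1.0               # ... over at least one second

@dataclass
class FlowStats:
    first_seen: float = field(default_factory=time.time)
    byte_count: int = 0
    packet_count: int = 0

flows = {}                             # 5-tuple -> FlowStats

def account(five_tuple, pkt_len):
    """Update counters for a flow; return True once it qualifies as an elephant."""
    st = flows.setdefault(five_tuple, FlowStats())
    st.byte_count += pkt_len
    st.packet_count += 1
    return (st.byte_count >= BYTE_THRESHOLD and
            time.time() - st.first_seen >= DURATION_THRESHOLD)
```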

The More Vantage Points, the Better

It’s important to remember that there is no reason to limit elephant detection to a single approach.  If you know that a flow is large a priori, great.  If you can detect elephants in the network by sampling buffers, great.  If you can use the vSwitch to do per-packet flow tracking without requiring any sampling heuristics, great.  In the end, if multiple methods identify it as an elephant, it’s still an elephant.

For this reason we feel that it is very important that the identification of elephants should be decoupled from the physical hardware and signaled over a standard interface.  The user, the policy engine, the application, the hypervisor, a third-party network monitoring system, and the physical network should all be able to identify elephants.

Fortunately, this can be done relatively simply using standard interfaces.  For example, to affect per-packet handling of elephants, marking the DSCP bits is sufficient, and the physical infrastructure can be configured to respond appropriately.

Another approach we’re exploring takes a more global view.  The idea is for each vSwitch to expose its elephants along with throughput metrics and duration.  With that information, an SDN controller for the virtual edge can identify the heaviest hitters network wide, and then signal them to the physical network for special handling.  Currently, we’re looking at exposing this information within an OVSDB column.
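A minimal sketch of that controller-side aggregation might look like the following. The report format and the signaling step are assumptions for illustration, not an existing API.

```python
# Hypothetical controller-side view: each vSwitch reports its elephants
# (flow id, throughput, duration), and the controller picks the network-wide
# heaviest hitters to signal to the physical fabric.
import heapq

def heaviest_hitters(reports, n=10):
    """reports: iterable of (vswitch_id, flow_id, mbps, duration_s) tuples."""
    return heapq.nlargest(n, reports, key=lambda r: r[2])

reports = [
    ("vswitch-1", "flow-a", 900.0, 30.0),
    ("vswitch-2", "flow-b", 450.0, 120.0),
    ("vswitch-2", "flow-c", 9200.0, 15.0),
]
for vsw, flow, mbps, dur in heaviest_hitters(reports, n=2):
    # In practice this would translate into DSCP marks or fabric state.
    print("signal %s from %s (%.0f Mbps for %.0fs)" % (flow, vsw, mbps, dur))
```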

Are Elephants Obfuscated by the Overlay?

No.  For modern overlays, flow-level information and QoS markings are available in the outer header and are directly visible to the underlay physical fabric.  Elephant identification can exploit this characteristic.

Going Forward

This is a very exciting area for us.  We believe there is a lot of room to bring edge understanding of workloads, and the ability of software at the edge to do sophisticated trending analysis, to bear on the problem of elephant detection and handling.  It’s early days yet, but our initial forays, both with customers and hardware partners, have been very encouraging.

More to come.


The Overhead of Software Tunneling

[This post was written with Jesse Gross, Ben Basler, Bruce Davie, and Andrew Lambeth]

Tunneling has earned a bad name over the years in networking circles.

Much of the problem is historical. When a new tunneling mode is introduced in a hardware device, it is often implemented in the slow path. And once it is pushed down to the fast path, implementations are often encumbered by key or table limits, or sometimes throughput is halved due to additional lookups.

However, none of these problems are intrinsic to tunneling. At its most basic, a tunnel is a handful of additional bits that need to be slapped onto outgoing packets. Rarely, outside of encryption, is there significant per-packet computation required by a tunnel. The transmission delay of the tunnel header is insignificant, and the impact on throughput is – or should be – similarly minor.
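A quick back-of-the-envelope calculation (our own arithmetic, with an assumed header size rather than any particular protocol) shows why the transmission cost is so small:

```python
# Rough arithmetic, not a measurement: an assumed ~50-byte outer header on
# MTU-sized packets costs only a few percent of wire throughput.
MTU = 1500              # bytes per packet before encapsulation
TUNNEL_HEADER = 50      # assumed outer Ethernet + IP + encapsulation header

overhead = TUNNEL_HEADER / (MTU + TUNNEL_HEADER)
print("throughput cost of the tunnel header: %.1f%%" % (100 * overhead))  # ~3.2%
```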

In fact, our experience implementing multiple tunneling protocols within Open vSwitch is that it is possible to do tunneling in software with performance and overhead comparable to non-encapsulated traffic, and to support hundreds of thousands of tunnel endpoints.

Given the growing importance of tunneling in virtual networking (as evidenced by the emergence of protocols such as STT, NVGRE, and VXLAN), it’s worth exploring its performance implications.

And that is the goal of this post: to start the discussion on the performance of tunneling in software from the network edge.

Background

An emerging method of network virtualization is to use tunneling from the edges to decouple the virtual network address space from the physical address space. Often the tunneling is done in software in the hypervisor. Tunneling from within the server has a number of advantages: software tunneling can easily support hundreds of thousands of tunnels, it is not sensitive to key sizes, it can support complex lookup functions and header manipulations, it simplifies the server/switch interface and reduces demands on the in-network switching ASICs, and it naturally offers software flexibility and a rapid development cycle.

An idealized forwarding path is shown in the figure below. We assume that the tunnels are terminated within the hypervisor. The hypervisor is responsible for mapping packets from VIFs to tunnels, and from tunnels to VIFs. The hypervisor is also responsible for the forwarding decision on the outer header (mapping the encapsulated packet to the next physical hop).
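A minimal sketch of that path, with placeholder table names and a plain dictionary standing in for the encapsulation header (this is not an actual OVS datapath), looks roughly like this:

```python
# Sketch of the idealized path: the hypervisor maps VIFs to tunnels on
# transmit, tunnels back to VIFs on receive, and makes the forwarding
# decision on the outer header.
vif_to_tunnel = {"vif1": ("192.0.2.10", 0x10)}   # local VIF -> (remote endpoint, tunnel key)
tunnel_to_vif = {("192.0.2.10", 0x10): "vif1"}   # (remote endpoint, key) -> local VIF
outer_next_hop = {"192.0.2.10": "eth0"}          # outer-header forwarding table

def transmit(vif, inner_frame):
    remote, key = vif_to_tunnel[vif]
    outer = {"dst_ip": remote, "key": key, "payload": inner_frame}
    return outer_next_hop[remote], outer         # (physical port, encapsulated packet)

def receive(outer):
    # On receive, the outer header carries the remote endpoint as its source IP.
    vif = tunnel_to_vif[(outer["src_ip"], outer["key"])]
    return vif, outer["payload"]                 # deliver the inner frame to the VIF
```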

Some Performance Numbers for Software Tunneling

The following tests show throughput and CPU overhead for tunneling within Open vSwitch.  Traffic was generated with netperf, attempting to emulate a high-bandwidth TCP flow. The MTU for the VM and the physical NICs is 1500 bytes, and the packet payload size is 32K. The test shows results using no tunneling (OVS bridge), GRE, and STT.

The results show aggregate bidirectional throughput, meaning that 20 Gbps is a 10G NIC sending and receiving at line rate. All tests were done using Ubuntu 12.04 and KVM on Intel Xeon 2.40GHz servers interconnected with a Dell 10G switch. We used standard 10G Broadcom NICs. CPU numbers reflect the percentage of a single core used for each of the processes tracked.

The following results show the performance of a single flow between two VMs on different hypervisors. We include the Linux bridge to show that performance is comparable. Note that the CPU only includes the CPU dedicated to switching in the hypervisor and not the overhead in the guest needed to push/consume traffic.

                Throughput   Recv side CPU   Send side CPU
Linux Bridge:   9.3 Gbps     85%             75%
OVS Bridge:     9.4 Gbps     82%             70%
OVS-STT:        9.5 Gbps     70%             70%
OVS-GRE:        2.3 Gbps     75%             97%

This next table shows the aggregate throughput of two hypervisors with 4 VMs each. Since each side is doing both send and receive, we don’t differentiate between the two.

             Throughput   CPU
OVS Bridge:  18.4 Gbps    150%
OVS-STT:     18.5 Gbps    120%
OVS-GRE:     2.3 Gbps     150%

Interpreting the Results

Clearly these results (aside from GRE, discussed below) indicate that the overhead of tunneling in software is negligible. It’s easy enough to see why that is so. Tunneling requires copying the tunnel bits onto the header, an extra lookup (at least on receive), and the transmission delay of those extra bits when placing the packet on the wire. When compared to all of the other work that needs to be done during the domain crossing between the guest and the hypervisor, the overhead really is negligible.

In fact, with the right tunneling protocol, the performance is roughly equivalent to non-tunneling, and CPU overhead can even be lower.

STT’s lower CPU usage than non-tunneled traffic is not a statistical anomaly but is actually a property of the protocol. The primary reason is that STT allows for better coalescing on the receive side in the common case (since we know how many packets are outstanding). However, the point of this post is not to argue that STT is better than other tunneling protocols, just that if implemented correctly, tunneling can have comparable performance to non-tunneled traffic. We’ll address performance-specific aspects of STT relative to other protocols in a future post.

The reason that GRE numbers are so low is that with the GRE outer header it is not possible to take advantage of offload features on most existing NICs (we have discussed this problem in more detail before). However, this is a shortcoming of the NIC hardware in the near term. Next generation NICs will support better tunnel offloads, and in a couple of years, we’ll start to see them show up in LOM.

In the meantime, STT should work on any standard NIC with TSO today.

The Point

The point of this post is that at the edge, in software, tunneling overhead is comparable to raw forwarding, and under some conditions it is even beneficial. For virtualized workloads, the overhead of software forwarding is in the noise when compared to all of the other machinations performed by the hypervisor.

Technologies like passthrough are unlikely to have a significant impact on throughput, but they will save CPU cycles on the server. However, that savings comes at a fairly steep cost as we have explained before, and doesn’t play out in most deployment environments.


Networking Needs a VMware (Part 1: Address Virtualization)

[This post was written with Andrew Lambeth]

Our last post “Networking Doesn’t Need a VMware” made the point that drawing a simple analogy between server and network virtualization can steer the technical discourse on network virtualization in the wrong direction. The sentiment comes from the many partner, analyst, and media meetings we’ve been involved in that persistently focus on relatively uninteresting areas of the network virtualization space, specifically, details of encapsulation formats and lookup pipelines.

In this series of writeups, we take a deeper look and discuss some areas in which network virtualization would do well to emulate server virtualization. This is a fairly broad topic so we’ll break it up across a couple of posts.

In this part, we’ll focus on address space virtualization.

Quick heads up that the length of this post got a little bit out of hand. For those who don’t have the time/patience/inclination/attention span, the synopsis is as follows:

One of the key strengths of a hypervisor lies in its insertion of a completely new address space below the guest OS’s view of what it believes to be the physical address space. And while there are several possible ways to interpose on network address space to achieve some form of virtualization, encapsulation provides the closest analog to the hierarchical memory virtualization used in compute. It does so by taking advantage of the hierarchy inherent in the physical topology, and allowing both the virtual and physical address spaces to support complete forwarding and addressing models. However, like memory virtualization’s page table, encapsulation requires maintenance of the address mappings (rules to tunnel mappings). The interface for doing so should be open, and a good candidate for that interface is OpenFlow.

Now, onto the detailed argument …

Virtual Memory in Compute

Virtual memory has been a core component of compute virtualization for decades. The basic concept is very simple: multiple (generally sparse) virtual address spaces are multiplexed to a single compact physical address space using a page table that contains the mappings between the two.

An OS virtualizes the address space for a process by populating a table with a single level of translations from virtual addresses (VA) to physical addresses (PA). All hypervisors support the ability for a guest OS to continue working in this mode by adding the notion of a third address space called machine addresses (MA) for the true physical addresses.

Since x86 hardware initially supported only a single level of mappings, the hypervisor implemented a complete MMU in software to capture and maintain the guest’s VA to PA mappings, and then created a second set of mappings from guest PA to actual hardware MA. What was actually programmed into hardware was the VA to MA mappings, in order to not incur overhead during the actual memory references by the guest, but the full hierarchy was maintained so that at any time the hypervisor components could easily take an address from any of the three address spaces and map it back to any of the other address spaces.
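A toy illustration of that hierarchy (addresses are just page numbers; all real page-table detail is elided, and the numbers are made up):

```python
# The hypervisor keeps guest VA->PA and PA->MA mappings and installs the
# composed VA->MA mapping in hardware, so any address can be mapped between
# the three spaces.
va_to_pa = {0x1000: 0x2000, 0x1001: 0x2001}   # guest OS page table (VA -> PA)
pa_to_ma = {0x2000: 0x7F00, 0x2001: 0x7F01}   # hypervisor mapping (PA -> MA)

# What actually gets programmed into the MMU: the composition VA -> MA.
va_to_ma = {va: pa_to_ma[pa] for va, pa in va_to_pa.items()}
assert va_to_ma[0x1000] == 0x7F00

# Because both levels are retained, the hypervisor can also map backwards,
# e.g. from a machine address to the guest virtual address that uses it.
ma_to_va = {ma: va for va, ma in va_to_ma.items()}
assert ma_to_va[0x7F01] == 0x1001
```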

Having a multi-level hierarchical mapping was so powerful that eventually CPU vendors added support for multi-level page tables in the hardware MMU (called Nested Page Tables on AMD and Extended Page Tables on Intel). Arguably this was the biggest architectural change to CPUs over the last decade.

Benefits of address virtualization in compute

Although this is pretty basic stuff, it is worth enumerating the benefits it provides to see how these can be applied to the networking world.

  • It allows the multiplexing of multiple large, sparse address spaces onto a smaller, compact physical address space.
  • It supports mobility of a process within a physical address space. This can be used to allocate processes to memory more efficiently, or to take advantage of new memory as it is added.
  • VMs don’t have to coordinate with other VMs to select their address space. Or more generally, there are no constraints on the virtual address space that can be allocated to a guest VM.
  • A virtual memory subsystem provides a basic unit of isolation. A VM cannot mistakenly (or maliciously) address another VM unless another part of the system is busted or compromised.

Address Virtualization for Networking

The point of this post is to explore whether memory virtualization in compute can provide guidance on network virtualization. It would be nice, for example, to retain the same benefits that made memory virtualization so successful.

We’ll start with some common approaches to network virtualization and see how they compare.

Tagging : The basic idea behind tagging is to mark packets at the edge of the network with some bits (the tag) that contain the virtual context. This tag generally encodes a unique identifier for the virtual network and perhaps the virtual ingress port. As the packet traverses the network, the tag is used to segment the forwarding tables so that only rules associated with that tag apply.

A tag does not provide “virtualization” in the same way virtual memory of compute does. Rather it provides segmentation (which was also a phase compute memory went through decades ago). That is, it doesn’t introduce a new address space, but rather segments an existing address space. As a result, the same addresses are used for both physical and virtual purposes. The implications of this can be quite limiting. For example:

  • Since the addresses are used to address things in the virtual world *and* the physical world, they have to be exposed to the physical forwarding tables. Therefore the nice property of aggregation that comes with hierarchical address mapping cannot be exploited. Using VLANs as an example, every VM MAC has to be exposed to the hardware, putting a lot of pressure on physical L2 tables. If virtual addresses were instead mapped to a smaller subset of physical addresses, the requirement for large L2 tables would go away.
  • Due to layering, tags generally only segment a single address space (e.g. L2). Again, because there is no new address space introduced, this means that all of the virtual contexts must have identical addressing models (e.g. L2) *and* the virtual address space must be the same as the physical address space. This unnecessarily couples the virtual and physical worlds. Imagine a case in which the model used to address the virtual domain would not be suitable for physical forwarding. The classic example of this is L2. VMs are given Ethernet NICs and may want to talk using L2 only. However, L2 is generally not a great way of building large fabrics. Introducing an additional address space can provide the desired service model in the virtual realm, and the appropriate forwarding model in the physical realm.
  • Another shortcoming of tagging is that you cannot take advantage of mobility through address remapping. In the virtual address realm, you can arbitrarily map a virtual address to a physical address. If the process or VM moves, the mapping just needs to be updated.  A tag, however, does not provide address mobility.  The reason that VEPA, VNTAG, etc. support mobility within an L2 domain is by virtue of learned soft state in L2 (which exists whether or not a tag is in use).  For layers that don’t support address mobility (say vanilla IP), using a tag won’t somehow enable it.  It’s worth pointing out that this is a classic problem within virtualized datacenters today. L2 supports mobility, and L3 often does not. So while both can be segmented using tags, often only the L2 portion allows mobility, confining VMs to a given subnet. Most of the approaches to VM mobility across subnets introduce another layer of addressing, whether this is done with LISP, tunnels, etc. However, as soon as encapsulation is introduced, it begs the question of why use tags at all.

There are a number of other differences between tagging and address-level virtualization (like requiring all switches en route to understand the tag), but hopefully you’ve gotten the point. To be more like memory virtualization, a new addressing layer needs to be present, and segmenting the physical address space doesn’t provide the same properties.

Address Mapping : Another method of network virtualization is address mapping. Unlike tagging, a new address space is introduced and mapped onto the physical address space by one or more devices in the network. The most common example of address mapping is NAT, although it doesn’t necessarily have to be limited to L3.

To avoid changing the framing format of the packet, address mapping operates by updating the address in place, meaning that the same field is used for the updated address. However, multiplexing multiple larger address spaces onto a smaller physical address space (for example, a bunch of private IP subnets onto a smaller physical IP subnet) is a lossy operation and therefore requires additional bits to map back from the smaller space to the larger space.

Herein lies much of the complexity (and the commensurate shortcoming) of address mapping. Generally, the additional bits are added to a smaller field in the packet, and the full mapping information is stored as state on the device performing translation. This is what NAT does: the original 5-tuple is stashed on the device, and the ephemeral transport port is used to launder a key that points back to that 5-tuple on the return path.
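A rough sketch of that mechanism (with a deliberately simplistic port allocator, purely for illustration):

```python
# Sketch of NAT-style address mapping: the full 5-tuple is stashed on the
# translator, and the rewritten ephemeral port is the key that recovers it
# on the return path.
import itertools

_next_port = itertools.count(40000)
nat_table = {}   # public ephemeral port -> original (src_ip, src_port, dst_ip, dst_port, proto)

def translate_outbound(src_ip, src_port, dst_ip, dst_port, proto, public_ip):
    public_port = next(_next_port)
    nat_table[public_port] = (src_ip, src_port, dst_ip, dst_port, proto)
    return (public_ip, public_port, dst_ip, dst_port, proto)

def translate_inbound(dst_port):
    # The return packet arrives addressed to (public_ip, dst_port); the
    # stored state maps it back to the private endpoint.
    return nat_table[dst_port]
```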

When compared to virtual memory, the model doesn’t hold up particularly well. Within a server, the entire page table is always accessible, so addresses can be mapped back and forth with some additional information to aid in the demultiplexing (generally the pid). With network address translation, the “page table” is created on demand, and stored in a single device. Therefore, in order to support component failover or asymmetric paths, the per-flow state has to be replicated to all other devices that could be on the alternate path.  Also, because state is set up during flow initiation from the virtual address, inbound flows cannot be forwarded unless the destination public IP address is effectively “pinned” to a virtual address.  As a result, virtual-to-virtual communication in which the endpoints are behind different devices has to resort to OOB techniques like NAT punching in order to communicate.

This doesn’t mean that address mapping isn’t hugely useful in practice. Clearly for mapping from a virtual address space to a physical address space this is the correct approach. However, for communicating from a virtual address to a virtual address through a physical address space, it is far more limited than its compute analog.

Tunneling : By tunneling we mean that the payload of a packet is another packet (headers and all).

Like address mapping (and virtual memory) tunneling introduces another address space. However, there are two differences. First, the virtual address space doesn’t have to look anything like the physical. For example, it is possible to have the virtual address space be IPv6 and the physical address space be IPv4. Second, the full address mapping is stored in the packet so that there is no need to create per-flow state within the network to manage the mappings.

Tunneling (or perhaps more broadly encapsulation) is probably the most popular method of doing full network virtualization today. TRILL uses L2 in L2, LISP uses encapsulation, VXLAN (which is LISP as far as I can tell) uses L2 in L3, VCDNI uses L2 in L2, NVGRE uses L2 in L3, etc.

The general approach maps to virtual memory pretty well. The outer header can be likened to a physical address. The inner header can be likened to a virtual address. And often a shim header is included just after the outer header that contains demultiplexing information (the equivalent of a PID). The “page table” consists of on-datapath table entries which map packets to tunnels.

Take for example an overlay mesh (like NVGRE). The L2 table at the edge that points to the tunnels is effectively mapping from virtual addresses to physical addresses. And there is no reason to limit this lookup to L2; it could provide L2 and L3 in the virtual domain. If a VM moves, this “page table” is updated just as it would be in a server if a process is moved.
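A sketch of that edge “page table”, with illustrative tenant names and addresses (none of these field names come from any particular protocol):

```python
# (tenant, inner MAC) maps to the physical endpoint hosting that VM.
# A VM migration is just an update to this mapping; the virtual address
# never changes.
edge_table = {
    ("tenant-blue", "52:54:00:aa:bb:01"): "hypervisor-10.0.0.5",
    ("tenant-blue", "52:54:00:aa:bb:02"): "hypervisor-10.0.0.9",
}

def lookup_tunnel(tenant, dst_mac):
    return edge_table[(tenant, dst_mac)]

def vm_moved(tenant, mac, new_hypervisor):
    edge_table[(tenant, mac)] = new_hypervisor   # the "page table" update

vm_moved("tenant-blue", "52:54:00:aa:bb:02", "hypervisor-10.0.0.7")
assert lookup_tunnel("tenant-blue", "52:54:00:aa:bb:02") == "hypervisor-10.0.0.7"
```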

Because the packets are encapsulated, switches in the “physical domain” only have to deal with the outer header, greatly reducing the number of addresses they have to deal with. Switches which contain the “page table”, however, have to do lookups in the virtual world (e.g. map from packet to tunnel), map to the physical world (throw a tunnel on the packet) and then forward the packet in the physical world (figure out which port to send the tunneled traffic out on). Note that these are effectively the same steps an MMU takes within a server.

Further, like memory virtualization, tunneling provides nice isolation properties. For example, a VM in the virtual address space cannot address the physical network unless the virtual network is somehow bridged into it.

Also, like hierarchical memory, it is possible for the logical networks to have totally different forwarding stacks. One could be L2-only, another IPv4, and another IPv6, and all of these could differ from the physical substrate. A common setup has IPv6 run in the virtual domain (for end-to-end addressing), and IPv4 for the physical fabric (where a large address space isn’t needed).

Fortunately, we already have a multi-level hierarchical topology in the network (particularly in the datacenter). This allows for either the addition of a new layer at the edge which provides a virtual address space without changing any other components, or just changing the components at the outer edge of the hierarchy.

So what are the shortcomings of this approach? The most immediate are performance issues with tunneling, additional header overhead in the packet, and the need to maintain the distributed “page table”.

Tunneling performance is no longer the problem it used to be. Even from software in the server, clever tunnel implementations are able to take advantage of LRO and the like and can achieve 10G performance without much CPU. Hardware tunneling on most switching chipsets performs at line rate as well. Header overhead marginally increases transmission delay, and will reduce total throughput if a link is saturated.

Keeping the “page tables” up to date on the other hand is still something the industry is grappling with. Most standards punt on describing how this is done, or even the interface to use to write to the “page table”. Some piggyback on L2 learning (like NVGRE) to construct some of the state but still rely on an out of band mechanism for other state (in the case of NVGRE the VIF to tenantID mapping).

As we’ve said many times in the past, this interface should be standardized if for no other reason than to provide modularity in the architecture. Not doing that would be like limiting a hardware MMU on a CPU to only work with a single Operating System.

Fortunately, Microsoft realizes this need and has opened the interface to their vswitch on Windows Server 8, and XenServer, KVM and other Linux-based hypervisors support Open vSwitch.

For hardware platforms it would also be nice to expose the ability to manage this state. Of course, we would argue that OpenFlow is a good candidate if for no other reason than it has been already used for this purpose successfully.

Wrapping Up ..

If the analogy of memory virtualization in compute holds then encapsulation is the correct way to go about network virtualization. It introduces a new address space that is unrestricted in forwarding model, it takes advantage of hierarchy inherent in the physical topology allowing for address aggregation further up in the hierarchy, and it provides the basic virtualization properties of isolation and transparent mobility.

In the next post of this series, we’ll explore how server virtualization suggests that the way that encapsulation is implemented today isn’t quite right, and some things we may want to do to fix it.


NVGRE, VXLAN and what Microsoft is Doing Right

There has been more movement in the industry towards L2 in L3 tunneling from the edge as the approach for tackling issues with virtual networking.

Hot on the heels of the VMware/Cisco-led VXLAN announcement, an Internet draft on NVGRE (authored by Microsoft, Intel, Dell, Broadcom, and Arista, but my guess is that Microsoft is the primary driver) showed up with relatively little fanfare. You can check it out here.

In this blog post, I’ll briefly introduce NVGRE. However, I’d like to spend more time providing broader context on where these technologies fit into the virtual networking solution space. Specifically, I’ll argue that the tunneling protocol is a minor aspect of a complete solution, and we need open standards around the control and configuration interface as much as we need to standardize the wire format.

We’ll get to that in a few paragraphs. But first, What is NVGRE?

NVGRE is very similar to VXLAN (my comments on VXLAN here). Basically, it uses GRE as a method to tunnel L2 packets across an IP fabric, and uses 24 bits of the GRE key as a logical network discriminator (which they call a tenant network ID or TNI). By logical network discriminator, I mean it indicates which logical network a particular packet is part of. Also like VXLAN, logical broadcast is achieved through physical multicast.

A day in the life of a packet is simple. I’ll sketch out the case of a packet being sent from a VM. The vswitch, on receiving a packet from a vNic, does two lookups: (a) it uses the destination MAC address to determine which tunnel to send the packet to, and (b) it uses the ingress vNic to determine the tenant network ID. If the MAC in (a) is known, the vswitch will cram the packet into the associated point-to-point GRE tunnel, setting the GRE key to the tenant network ID. If it isn’t known, it will tunnel the packet to the multicast address associated with that tenant network ID. Easy peasy.
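Here’s a sketch of those two lookups, with made-up table contents (a real vswitch would of course build and age these tables itself):

```python
# Sketch of the NVGRE-style forwarding decision described above.
l2_table = {                                   # (TNI, dst MAC) -> remote tunnel endpoint
    (0x00BEEF, "52:54:00:11:22:33"): "10.1.1.20",
}
vnic_to_tni = {"vnic-7": 0x00BEEF}             # ingress vNic -> tenant network ID
tni_to_mcast = {0x00BEEF: "239.1.1.1"}         # TNI -> multicast group for unknown MACs

def forward_from_vm(ingress_vnic, dst_mac, frame):
    tni = vnic_to_tni[ingress_vnic]                      # lookup (b)
    remote = l2_table.get((tni, dst_mac))                # lookup (a)
    if remote is None:
        remote = tni_to_mcast[tni]                       # unknown MAC: use multicast
    # Encapsulate in GRE with the tenant network ID carried in the key.
    return {"outer_dst": remote, "gre_key": tni, "payload": frame}
```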

While architecturally NVGRE is very similar to VXLAN, there are some differences that have practical implications.

On the positive side, NVGRE’s use of GRE eases the compatibility requirement for existing hardware and software stacks. Many of the switching chips I’m familiar with already support GRE, so porting it to these environments is likely much easier than a non-supported tunneling format.

As another example, on a quick read of the RFC, it appears that Open vSwitch can already support NVGRE — it supports GRE, allows for looking up and setting the GRE key (including masking, so it can be limited to 24 bits), and it supports learning and explicit population of the logical L2 table. In Open vSwitch’s case, all of this can be driven through OpenFlow and the Open vSwitch configuration protocol.

On the other hand, GRE does not take advantage of a standard transport protocol (TCP/UDP), so logical flow information cannot be reflected in the outer header ports like it can for VXLAN. This means that ECMP hashing in the fabric cannot provide flow-level granularity, which is desirable to take advantage of all bandwidth. As the protocol catches on, this is simple to address in hardware and will likely happen. [Update: Since writing this, I've been told that some hardware can use the GRE key in the ECMP hash, which improves the granularity of load balancing in the fabric. However, per-flow load balancing (over the logical 5-tuple) is still not possible.]

OK, so there you have it. A very brief glimpse of NVGRE. Sure there is more to say, but I have a hard time getting excited about tunneling protocols. Why? That is what I’m going to talk about next.

Tunneling vs. Network Virtualization

So, while it is moderately interesting to explore NVGRE and VXLAN, it’s important to remember that the tunneling formats are really a very minor (and easily changed) component of a full network virtualization solution. That is, how a system tunnels, whether over UDP, or GRE, or CAPWAP or something else, doesn’t define what functions the system provides. Rather it specifies what the tunnel packets look like on the wire (an almost trivial consideration).

It’s also important to remember that these two proposals (VXLAN and NVGRE) in particular, while an excellent step in the right direction, are almost certainly going to see a lot of change going forward. As they stand now, they are fairly limited. For example, they only support L2 within the logical network. Also, they both abdicate a lot of responsibility to multicast. This limits scalability on many modern fabrics, requires the deployment of multicast which many large operators eschew, and has real shortcomings when it comes to speed of provisioning a new network, manageability, and security.

Clearly these issues are going to be addressed as the protocols mature. For example, it’s likely that there will be support for L2 and L3 in the logical network, as well as ACLs, QoS primitives, etc. Also, it is likely that there will be some primitives for supporting secure group joins, and perhaps even support for creating optimized multicast trees that don’t rely on the fabric (some existing solutions already do this).

So if we step back and look at the current situation, we have “standardized” the tunneling protocol and some basic mechanism for L2 in logical space, and not much else. However, we know that these protocols will need to evolve going forward. And it’s very likely that this evolution will have a non-trivial dependency on the control path.

Therefore, I would argue that the real issue is not “let’s standardize the tunneling protocol” but rather, “let’s standardize the control interface to configure the tunnels and associated state so system developers can use it to address these and future challenges”. That is, if vswitches and pswitches of the future have souped-up versions of NVGRE and VXLAN, something is still going to have to provide the orchestration of these primitives.  And that is where the real function comes from. Neither of these proposals has addressed this; for example, the interface to provide the mapping from a vNIC to a tenant network ID is unspecified.

So what might this look like? I’ll use Open vSwitch to provide a specific example. Open vSwitch is a great platform for NVGRE (and soon VXLAN). It supports tens of thousands of GRE tunnels without a problem. And it allows setting and doing lookups on the GRE key. But what makes it immediately useful is not that it supports these tunneling primitives, but that a third party can pick it up, and use OpenFlow to configure those keys, tunnels, and any other network state (ACLs, L3, etc.) to build a full solution.

And that is the crux of the issue. If you get a solution that supports VXLAN or NVGRE, even though they are standardized, you are only getting a piece of the solution. The second piece we need, and should all ask for, is an open standard for configuring this interface. We (Open vSwitch) use OpenFlow. But any standard will do.

What about Microsoft?

From what I can tell (and admittedly, I don’t know very much), this seems to be what Microsoft is getting right. Rather than just specify a tunneling protocol, it appears that they’re also opening up the vswitch interface to support implementations from multiple vendors. While this itself doesn’t provide an open interface to virtual networking configuration, it does allow someone to do that work. For example, NEC has announced they will be releasing an OpenFlow compatible vSwitch for Windows Server 8. My guess is that this is a port of Open vSwitch. (Hey NEC, if you are using Open vSwitch, care to share the changes with the rest of the community?)

In any case, NEC will probably charge you for it, but we will work hard to make a full Open vSwitch port for Windows Server 8 available. Of course it will be open source, so we encourage users to get it for free and innovate in the source as well as use the open management protocols.

We’ll also be demoing Open vSwitch support for NVGRE within OpenStack/Quantum, so stay tuned for that.

Anyway, kudos to Microsoft for supporting L2 in L3 and adding a contender to the pool. It will be interesting to see what finally pops out of the IETF standardization process. I presume it will be some agglomeration achieved through consensus.

And double kudos for opening the vswitch interface. I think we’re going to see a lot of cool innovation in networking around Windows Server 8. And that’s good for all of us.


The Rise of Soft Switching Part IV: Comments on the Hardware Supply Chain

[This post is written by Alex Bachmutsky and Martin Casado. Alex is a Distinguished Engineer at Ericsson in Silicon Valley, driving system architecture aspects of the company's next generation platform. He is the author of the book “Platform Design for Telecommunication Gateways,” and co-authored “WiMAX Evolution: Emerging Technologies and Applications”.]

This is yet another (unplanned) addendum to the soft switching series.

Our previous posts argued that for virtual edge switching, using the server’s compute was the best point in the design space given the current hardware landscape. The basic argument was that it is possible to achieve 10G from the server in soft switching by dedicating a single core (about $60 – $120 of silicon depending on how you count). So, it is difficult to make the argument for passthrough plus specialized hardware for two reasons. First, given currently available choices, price performance will almost certainly be higher with x86. Second, you lose many of the benefits of virtualization that are retained when using soft switching (which we’ve explained here).

Alex submitted two very good (and very detailed) responses to our claims (you can see the summary here). In them he argued that looking at the basic components costs, a hardware offload solution should be both lower power, and provide better cost performance than doing forwarding in x86. He also argued that switching and NPU chipsets are flexible and support mature, high-level development environments, which may make them suitable for the virtual networking problem.

Alex is an expert in this area, he’s incredibly experienced, and, at least partially, he’s right. Specialized hardware should be able to hit better price/performance and lower power. And it is true that development environments have come a long way.

So what’s the explanation for the discrepancy in the view points?

That is what we’ve teamed up to discuss in this post. It turns out the differences in view are twofold. The first difference comes from the perspective of a large company (and commensurate purchasing power) versus that of a handful of customers. While today “intelligent NICs” are at least twice (generally more like 3-5x) as expensive as a pure soft solution, this is a supply side issue and isn’t justifiable by the bill of materials (BoM). A sufficiently large company with sufficient investment could overcome this obstacle.

The second difference is distributing an appliance versus distributing software. Distributing an appliance allows ultimate control over hardware configuration, development environment, runtime environment etc. Distributing software, on the other hand, often requires dealing with multiple hardware configurations, and complex software interactions both with drivers and the operating system.

Let’s start with a quick resync:

In our original post, we argued that for most workloads, the most flexible and cost effective method for doing networking at the virtual edge is to do soft switching on the server (which is what 99% of all virtual deployments have done over the last decade).

The other two options we considered are using a switching chip (either in the NIC or the ToR switch) or an NPU (most likely on the NIC due to port density issues).

We’ll discuss Switching Chips First:

In our experience, non-NPU chipsets in the NIC are not sufficiently flexible for networking at the virtualized edge. The limitations are manifold: the size of lookup metadata passed between tables, the number of stages in the lookup pipeline, and economically viable table sizes (table sizes can sometimes be increased, for instance with an external TCAM, but that is an expensive option and hence ignored here for the budget-sensitive implementations we are focusing on).  A good concrete example is table space for tunnels.  Network virtualization solutions often use lots and lots of tunnels (N^2 in the number of servers is not unusual).  Soft switches have no problem supporting tens of thousands of tunnels without performance degradation.  This is one or two orders of magnitude more than you’ll find on existing non-NPU NIC chipsets.

So this is simply a limitation in the supply chain, perhaps a future NIC will be sufficiently flexible for the virtual networking problem, until then, we’ll omit this as an option.

ToR:

We both agreed that standard silicon is getting very close to being useful for edge virtual switching. The next generation of 10G switches appears to be particularly compelling. So while they still suffer from flexibility limitations, the shortcomings of hairpinning inter-VM traffic (like reducing edge link bandwidth and ToR switching capacity), and table space issues, they do appear to be a viable option for some set of the switching decisions going forward. This is due to improved support for tagging and tunneling, increased sizes of ACL tables, and improved lookup generality between tables.

So we’re hopeful that next-generation ToR switches will have an open interface like OpenFlow so that the network virtualization layer can manage the forwarding state.

So what about NPUs?

To be clear, Open vSwitch has already been ported to multiple NPU-NIC platforms. In the cases I am familiar with, pushing inter-VM traffic through the NPU is far slower (presumably due to DMA overhead) than keeping it on the x86. However, off-server traffic requires less CPU.

To explore this solution space in more detail, we’ll subdivide the NPU space into two groups: classical NPUs and multicore-based NPUs.

Many NPUs in the former group have the required level of processing flexibility, and some even have integrated general-purpose CPUs that can be used to implement, for instance, an OpenFlow-based control plane. However, their data plane processing is usually based on proprietary microcode that may be a development hurdle in terms of available expertise and toolchain.

Any investment to develop on such an NPU is a sunk cost, both in training the developers to the internal details of the hardware featureset and limitations, and developing the code to work within the environment. It is very difficult (if not impossible), for example, to port the microcode written for one such NPU to the NPU of another vendor.

So while this route is not only a viable one but very likely beneficial for creating a mass-produced appliance, the investment for supporting “yet another NIC” in a software distribution is likely unjustifiable. And without sufficient economies of scale for the NPU (which can be ensured by a large vendor), the cost to the consumer would likely stunt adoption.

The latter group is based on general-purpose multicore CPUs, but it integrates some NPU-like features, such as streaming I/O, HW packet pre-classification, HW packet re-ordering and atomic flow handling, some level of traffic management, special HW offload engines and more. Since they are based on general-purpose processors, they could be programmed using similar languages and tools as other soft switch solutions. These NPUs can be either used as a main processing engine, or alternatively offload some tasks from the main CPU when integrated on intelligent NIC cards.

If the supply side can be ironed out, both in terms of purchase model and price, this would seem to strike a reasonable balance between development overhead, portability, and price/power/performance. Below, we’ll discuss some of the other challenges that need to be overcome on the supply side to be competitive with x86.

Price:

Let’s start with the economics of switching at the edge. First, in our experience, workloads in the cloud are either mostly idle, or handling lots of TCP traffic, generally HTTP offsite (trading and HPC being the obvious exceptions). Thus 10G at MTU size packets is not only sufficient, it is usually overkill. That said, it is good to be prepared as data usage continues to grow.

Soft switching can be co-resident with the management domain. This has been the standard with Xen deployments in which the Linux bridge or Open vSwitch shares the CPUs allocated to the management domain. For 1G, allocating a single core to both is most likely sufficient. For 10G, allocating an extra core is necessary.

So if we assume the worst case (requiring a full core for networking), given a fairly modern CPU, a core weighs in at $60-$100. Motherboard and packaging for that core is probably another $50 (no need for additional memory or hard drive space for soft switching).

On the high end, if you could get specialized hardware for less than (say) $100 – $150 over the price of a standard NIC per server, then there could be an argument for using it. And since NICs are often bonded for resilience, it should probably be half of that, or at least include dual ports for the same price for fault resiliency.

Unfortunately, to our knowledge, such a NIC doesn’t exist at those price points. If we’re wrong please let us know (price, relative power and performance) and we’ll update this. To date, all NPU-based NICs we’ve looked at are 2-5 times what they should be to pencil out competitively.

However, clearly it should be possible to make a NIC that is optimized for virtual edge switching, as the raw components (when purchased at scale) do not justify these high price tags. We’ve found that NPU chip vendors usually do not manufacture their own NIC cards (or do so only for evaluation purposes), and 3rd parties charge a significant premium for their role in the supply chain. Therefore, we posit that this is largely a supply chain issue, perhaps due to the immaturity of the market. Regular NICs are a commodity; intelligent NICs are still a “luxury” without a proportional relationship to their BOM differences.

Of course, this argument is based on traditional cloud packet loads. If the environment instead was hosting an application with small-sized and/or latency sensitive traffic (e.g. voice), then system design criteria would be very different and a specialized hardware solution becomes more compelling.

Drivers:

Drivers are non-existent or poor for specialized NICs. If you look at the HCL (hardware compatibility list) for VMware, Citrix, or even the common Linux distributions, NPU-based NICs are rarely (if ever) supported. Generally software solutions rely on the underlying operating system to provide the appropriate drivers. Going with a nonstandard NIC requires getting the hypervisor vendors to support it, which is unlikely. Even multicore NPUs are not supported well, because they are based on non-x86 ISAs (MIPS, ARM, PPC) with much more limited support. Practically, it becomes a chicken-and-egg problem: major hypervisor vendors don’t spend enough effort to support lower volume multicore NPUs, and system developers don’t select these NPUs because of the lack of drivers, which keeps volumes low. Any break in that vicious circle may change the equation.

Tool Chain:

While the development tool set for embedded processors has come a long way, it still doesn’t match that of a standard x86/Linux environment. To be clear, this is only a very minor hurdle to a large development shop with in-house expertise in embedded development.

However, from the perspective of a smaller company, dealing with embedded development often means finding and employing relatively specialized developers, (often) more expensive tools, and a more complex debug and testing environment. My (Martin’s) experience with side-by-side projects working both on specialized hardware environments and standard (non-embedded) x86 servers is that the former is at least twice as slow.

Where does that leave us?

A quick summary of the discussion thus far is as follows.

Considering only the cost and performance properties of specialized hardware, a virtual switching solution using them should have better price performance, and lower power than an equivalent x86 solution. However, few virtualized workloads could actually take advantage of the additional hardware. And supply side issues (no such component exists today at a competitive price point), and complications in inserting specialized hardware into today’s server ecosystem, remain enormous hurdles to realizing this potential. Which is probably why 99% of all virtual deployments over the last decade (and certainly all of the largest virtual operations in the world) have relied on soft switching.

Alex is right to remind us that it is not always a technology limitation, and that there is a room for hardware to come in at the right price/performance if the supply chain and development support matures. Which it very well may be. We hope that intelligent NICs will become more affordable and their price will reflect the BOM. We also hope that NPU vendors will sponsor in some ways the development of hypervisor and middleware layers by major SW suppliers as opposed to some proprietary solutions available on the market today.

As I’ve mentioned previously, there are a number of production deployments which use Open vSwitch directly on hardware. In an upcoming blog post, we’ll dig into those use cases a little further.


The Rise of Soft Switching Part II: Soft Switching is Awesome ™

[This series is written by Jesse Gross, Andrew Lambeth, Ben Pfaff, and Martin Casado. Ben is an early and continuing contributor to the design and implementation of OpenFlow. He's also a primary developer of Open vSwitch. Jesse is also a lead developer of Open vSwitch and is responsible for the kernel work and datapath. Andrew has been virtualizing networking for long enough to have coined the term "vswitch", and led the vDS distributed switching project at VMware. All authors currently work at Nicira.]

This is the second post in our series on soft switching. In the first part (found here), we lightly covered a subset of the technical landscape around networking at the virtual edge. This included tagging and offloading decisions to the access switch, switching inter-VM traffic in the NIC, as well as NIC passthrough to the guest VM.

A brief synopsis of the discussion is as follows. Both tagging and passthrough are designed to save end-host CPU by punting the packet classification problem to specialized forwarding hardware either on the NIC or in the first-hop switch, and to avoid the overhead of switching out of the guest to the hypervisor to access the hardware. However, tagging adds inter-VM latency and reduces inter-VM bisection bandwidth. Passthrough also increases inter-VM latency, and effectively de-virtualizes the network, thereby greatly limiting the flexibility provided by the hypervisor. We also mentioned that switching in widely available NICs today is impractical due to severe limitations in the on-board switching chips.

For the purposes of the following discussion, we are going to reduce the previous discussion to the following: the performance arguments in favor of an approach like passthrough + tagging (with enforcement in the first-hop switch) are that latency is reduced to the wire (albeit marginally) and that packet classification from a proper switching chip will noticeably outperform x86.

The goal of this post is to explain why soft switching kicks ass. Initially, we’ll debunk some of the FUD around performance issues with it, and then try to quantify the resource/performance tradeoffs of soft switching vis-à-vis hardware-based approaches. As we’ll argue, the question is not “how fast” is soft switching (it is almost certainly fast enough), but rather, “how much CPU am I willing to burn”, or perhaps “should I heat the room with teal or black colored boxes”?

So with that …

Why soft switching is awesome:

So, what is soft switching? Exactly what it sounds like. Instead of passing the packet off to a special purpose hardware device, the packet transitions from the Guest VM into the hypervisor which performs the forwarding decision in software (read, x86). Note that while a soft switch can technically be used for tagging, for the purposes of this discussion we’ll assume that it’s doing all of the first-hop switching.

The benefits of this approach are obvious. You get the flexibility and upgrade cycle of software, and compared to passthrough, you keep all of the benefits of virtualization (memory overcommit, page sharing, etc.). Also, soft switching tends to be much better integrated with the virtual environment. There is a tremendous amount of context that can be gleaned by being co-resident with the VMs, such as which MAC and IP addresses are assigned locally, VM resource use and demands, or which multicast addresses are being listened to. This information can be used to pre-populate tables, optimize QoS rules, prune multicast trees, etc.

Another benefit is simple resource efficiency: you already bought the damn server, so if you have excess compute capacity why buy specialized hardware for something you can do on the end host?  Or put another way, after you provision some amount of hardware resources to handle the switching work, any of those resources that are left over are always available to do real work running an app instead of being wasted (which is usually a lot, since you have to provision for peaks).

Of course, nothing comes for free. And there is a perennial skepticism around the performance of software when compared to specialized hardware. So we’ll take some time to focus on that.

First, what are the latency costs of soft switching?

With soft switching, VM to VM communication effectively reduces to a memcpy() (you can also do page flipping which has the same order of overhead). This is as fast as one can expect to achieve on a modern architecture. Copying data between VMs through a shared L2 cache on a multicore CPU, or even if you are unlucky enough to have to go to main memory, is certainly faster than doing a DMA over the PCI bus. So for VM to VM communication, soft switching will have the lowest latency, presuming you can do the lookup function sufficiently quickly (more on that below).

Sending traffic from the guest to the wire is only marginally more expensive due to the overhead of a domain transfer (e.g. flushing the TLB) and copying security-sensitive information (such as headers). In Xen (for example), guest transmit (DomU-to-Dom0) operates by mapping pages that were allocated by the guest into the hypervisor, which are then DMA’d with no copy required (with the exception of headers, which are copied for security purposes so the guest can’t change them after the hypervisor has made a decision). In the other direction, the guest allocates pages and puts them in its rx ring, similar to real hardware. These then get shared with the hypervisor via remapping, and when receiving a packet the hypervisor copies the data into the guest buffers. (Note: VMware does almost the same thing, except there is no remapping because the vswitch runs in the vmkernel, all physical pages are already mapped, and the vmkernel has access to the guest MMU mappings.)

So while there is comparatively more overhead than a pure hardware approach (due to copying headers and the domain transfer), it is on the order of microseconds and dwarfed by other aspects of a virtualized system like memory overcommit. Or more to the point, this only matters in extremely latency-sensitive environments (otherwise the overhead is completely lost in the noise of other hypervisor overhead), and in those cases the only deployment approach that makes sense is to effectively pin compute to dedicated hardware, greatly diminishing the utility of using virtualization in the first place.

What about throughput?

Modern soft switches that don’t suck are able to saturate a 10G link from a guest to the wire with less than a core (assuming MTU size packets). They are also able to saturate a 1G link with less than 20% of a core. In the case of Open vSwitch, these numbers include full packet lookup over L2, L3 and L4 headers.

While these numbers are commonly seen in practice, throughput is, in theory, affected by the overhead of the forwarding decision: more complex lookups can take more time, reducing total throughput.

The forwarding decision involves taking the header fields of each packet and checking them against the forwarding rule set (L2, L3, ACLs, etc.) to determine how to handle the packet. This general class of problem is termed “packet classification” and is worth taking a closer look at.
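For readers who haven't bumped into the term, here's a deliberately naive sketch in Python of what packet classification means: check a packet's header fields against a priority-ordered rule list and return the action of the best match. The Rule structure and field names are ours, not Open vSwitch's; the only point is to frame the problem (and to show why a per-packet linear scan is exactly what you don't want).

```python
# A deliberately naive sketch of packet classification: check a packet's
# header fields against a priority-ordered rule list and return the action
# of the best match. Field names and the Rule structure are illustrative,
# not Open vSwitch's data model.
from dataclasses import dataclass

@dataclass
class Rule:
    priority: int
    match: dict      # field name -> required value; absent fields are wildcarded
    action: str      # e.g. "output:2", "flood", "drop"

def classify(rules, packet):
    """Linear scan: O(number of rules) per packet, which is exactly the cost
    a TCAM parallelizes away and software must avoid with better structures."""
    for rule in sorted(rules, key=lambda r: -r.priority):
        if all(packet.get(f) == v for f, v in rule.match.items()):
            return rule.action
    return "drop"    # default action, purely illustrative

rules = [
    Rule(100, {"nw_dst": "10.0.0.2", "tp_dst": 80}, "output:2"),
    Rule(10,  {"dl_dst": "ff:ff:ff:ff:ff:ff"},      "flood"),
]
pkt = {"dl_src": "52:54:00:aa:00:01", "nw_dst": "10.0.0.2", "tp_dst": 80}
print(classify(rules, pkt))   # -> output:2
```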

Soft Packet Classification:

One of the primary arguments in favor of offloading virtual edge switching to hardware is that a TCAM can do a lookup faster than x86. This is unequivocally true: TCAMs have lots and lots of gates (and are commensurately costly and power hungry) so that they can check many rules in parallel. A general-purpose CPU cannot match the lookup capacity of a TCAM in the degenerate case.

However, software packet classification has come a long way. Under realistic workloads and rule sets found in virtualized environments (e.g. multi-tenant isolation with a sane security policy), soft switching can handle lookups at line rate with the resource usage mentioned above (less than a core for 10G) and so does not add appreciable overhead.

How is this achieved? For Open vSwitch, which looks at many more headers than will fit in a standard TCAM, the common case lookup reduces to the overhead of a hash (due to extensive use of flow caching) and can achieve the same throughput as normal soft forwarding. We have run Open vSwitch with hundreds of thousands of forwarding rules and still achieved similar performance numbers to those described above.
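As a rough illustration of the flow-caching idea (reusing the classify() sketch from above; the structure is ours, not the actual Open vSwitch fast path, which lives in the kernel and is far richer), the common case reduces to a single hash lookup that returns a memoized answer:

```python
# A minimal sketch of flow caching: memoize the classifier's answer keyed
# on the full header tuple, so the common case is one dict lookup. Reuses
# classify(), rules, and pkt from the sketch above.
flow_cache = {}

def fast_path(rules, packet):
    key = tuple(sorted(packet.items()))      # exact-match key over all fields
    action = flow_cache.get(key)
    if action is None:                       # miss: run the full classifier once
        action = classify(rules, packet)
        flow_cache[key] = action             # install so later packets hit
    return action

print(fast_path(rules, pkt))   # first call: classifier runs, result is cached
print(fast_path(rules, pkt))   # second call: pure cache hit
```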

Flow setup, on the other hand, is marginally more expensive since it cannot benefit from the caching. Performance of the packet classifier in Open vSwitch relies on our observation that flow tables used in practice (in the environments we’re familiar with) tend to have only a handful of unique sets of wildcarded fields. Each observed wildcard set has its own hash table, hashed on the fields that are not wildcarded. Classifying a packet therefore requires an O(1) lookup in each hash table and selecting the highest-priority match. Lookup cost is thus linear in the number of unique wildcard sets in the flow table, and since this tends to be small, classifier overhead tends to be negligible.
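Here's a compact sketch of that structure, again reusing the Rule class and example rules from earlier. The details are illustrative and differ from the real Open vSwitch classifier, but the shape is the same: one hash table per unique wildcard set, one O(1) probe in each, highest priority wins.

```python
# A compact sketch of the classifier structure described above: one hash
# table per unique set of non-wildcarded fields, each probed once, with the
# highest-priority hit winning. Illustrative only.
from collections import defaultdict

class Classifier:
    def __init__(self):
        # frozenset of matched field names -> {hash key: Rule}
        self.subtables = defaultdict(dict)

    def insert(self, rule):
        fields = frozenset(rule.match)
        key = tuple(sorted(rule.match.items()))
        self.subtables[fields][key] = rule

    def lookup(self, packet):
        best = None
        # Cost is linear in the number of subtables (typically a handful),
        # not in the number of rules: each probe is one O(1) dict lookup.
        for fields, table in self.subtables.items():
            key = tuple(sorted((f, packet.get(f)) for f in fields))
            rule = table.get(key)
            if rule and (best is None or rule.priority > best.priority):
                best = rule
        return best

cls = Classifier()
for r in rules:                    # rules and pkt from the earlier sketch
    cls.insert(r)
print(cls.lookup(pkt).action)      # -> output:2
```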

We realize that this is all a bit hand-wavy and needs to be backed up with hard performance results. Because soft classification is such an important (and somewhat nuanced) issue, we will dedicate a future post to it.

“Yeah, this is all great. But when is soft switching not a good fit?”

While we would contend that soft switching is good for most deployment environments, there are instances in which passthrough or tagging is useful.

In our experience, the mainstay argument for passthrough is reduced latency to the wire. So while average latency is probably ok, specialized apps that have very small request/response type workloads can be impacted by the latency of soft switching.

Another common use case for passthrough is a local appliance VM that acts as an inline device between normal application VMs and the network. Such an appliance VM has no need for most of the mobility or other hypervisor-provided goodness that passthrough sacrifices, but it does need to process traffic with as little overhead as possible.

Passthrough is also useful for providing the guest with access to hardware that is not exposed by the emulated NIC (for example, some NICs have IPsec offload but that is not generally exposed).

Of course, if you do tagging to a physical switch you get access to all of the advanced features that have been developed over time, all exposed through a CLI that people are familiar with (this is clearly less true with Cisco’s Nexus 1k). In general, this line of argument has more to do with the immaturity of software switches than with any fundamental limitation, but it’s a reasonable use case from an operations perspective.

The final, and probably most widely used (albeit least talked about), use case for passthrough is drag racing. Hypervisor vendors need to make sure that they can all post the absolute highest, break-neck performance numbers for cross-hypervisor comparisons (yup, sleazy), regardless of how much fine print is required to qualify them. Why else would any sane vendor of the most valuable piece of real estate in the network (the last inch) cede it to a NIC vendor? And of course the NIC vendors lucky enough to be blessed by a hypervisor into passthrough Valhalla can drag race each other with all their OS-specific hacks again.

“What am I supposed to conclude from all of this?”

Hopefully we’ve made our opinion clear: soft switching kicks mucho ass. There is good reason that it is far and away the dominant technology used for switching at the virtual edge. To distill the argument even further, our calculus is simple …

Software flexibility + 1 core x86 + 10G networking + cheap gear + other cool shit >
       saving a core of x86 + costly specialized hardware + unlikely to be realized benefit of doing classification in the access switch.

More seriously, while we make the case for soft switching, we still believe there is ample room for hardware acceleration. However, rather than just shipping off packets to a hardware device, we believe that stateless offload in the NIC is a better approach. In the next post in this series, we will describe how we think the hardware ecosystem should evolve to aid the virtual networking problem at the edge.


The Rise of Soft Switching (Part I: Introduction and Background)

[This series is written by Jesse Gross, Andrew Lambeth, Ben Pfaff, and Martin Casado. Ben is an early and continuing contributor to the design and implementation of OpenFlow. He's also a primary developer of Open vSwitch. Jesse is also a lead developer of Open vSwitch and is responsible for the kernel work and datapath. Andrew has been virtualizing networking for long enough to have coined the term "vswitch", and led the vDS distributed switching project at VMware. All authors currently work at Nicira.]

How many times have you been involved in the following conversation?

NIC vendor: “You should handle VM networking at the NIC. It has direct access to server memory and can use a real switching chip with a TCAM to do fast lookups. Much faster than an x86. Clearly the NIC is the correct place to do inter-VM networking.”

Switch vendor: “Nonsense! You should handle VM networking at the switch. Just slap a tag on it, shunt it out, and let the real men (and women) do the switching and routing. Not only do we have bigger TCAMs, it’s what we do for a living. You trust the rest of your network to Crisco, so trust your inter-VM traffic too. The switch is the correct place to do inter-VM networking.”

Hypervisor vendor: “You guys are both wrong: networking should be done in software in the hypervisor. Software is flexible and efficient, and the hypervisor already has rich knowledge of VM properties such as addressing and motion events. The hypervisor is the correct place to do inter-VM networking.”

I doubt anyone familiar with these arguments is fooled into thinking they are anything other than what they are: poorly cloaked marketing pitches masquerading as technical discussion. This topic in particular is the focus of a lot of marketing noise. Why? Probably because of its strategic importance. The access layer of the network hasn’t opened up in over a decade, and with virtualization there is an opportunity to shift control of the network from the core to the edge. This has to be making the traditional hardware vendors pretty nervous.

Fortunately, while the market blather paints this as a nuanced issue, we believe the technical discussion is actually fairly straightforward.

And that is the goal of this series of posts: to explore the technical implications of doing virtual edge networking in software vs. various configurations of hardware (passthrough, tagging, switching in the NIC, etc.). The current plan is to split the posts into three parts. First, we’ll provide an overview of the proposed solution space. In the next post, we’ll focus on soft switching in particular, and finally we’ll describe how we would like to see the hardware ecosystem evolve to better support networking at the virtual edge.

Not to spoil the journey, but the high-order take-away of this series is that for almost any deployment environment, soft switching is the right approach. And by “right” we mean flexible, powerful, economic, and fast. Really fast. Modern vswitches (that don’t suck) can handle 1G at less than 20% of a core, and 10G at about a core. We’re going to discuss this at length in the next post.

Before we get started, it’s probably worth acknowledging the authors’ bias. We’re all software guys, so we have a natural preference for soft solutions. That said, we’re also all involved in Open vSwitch, which is built to support both hardware (NIC and first-hop switch) and software forwarding models. In fact, some of the more important deployments we’re involved in use Open vSwitch to drive switching hardware directly.

Alright, on to the focus of this post: background.

At its most basic, networking at the virtual edge simply refers to how forwarding and policy decisions are applied to VM traffic. Since VMs may be co-resident on a server, this means either doing the networking in software, or sending traffic out to hardware to make the decision (or a hybrid of the two) and then back.

We’ll cover some of the more commonly discussed proposals for doing this here:

Tagging + Hairpinning: Tagging with hairpinning is a method for getting inter-VM traffic off of a server so that the first-hop forwarding decisions can be enforced by the hardware access switch. Predictably, this approach is strongly backed (and used) by HP ProCurve and Cisco.

The idea is simple. When a VM sends a packet, it gets a tag unique to the sending VM, and then the packet is sent to the first hop switch. The switch does forwarding/policy lookup on the packet, and if the next hop is another VM resident on the same server, the packet is sent back to the server (hairpinned) to be delivered to the VM. The tagging can be done by the vswitch in the hypervisor, or by the NIC when using passthrough (discussed below).
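To make the mechanism concrete, here's a toy Python sketch of the edge side of this scheme. The tag values, VM names, and uplink stub are made up; the point is that the edge does no lookup at all, it just stamps a per-VM tag and sends everything (including traffic destined for a co-resident VM) to the access switch, which hairpins it back as needed.

```python
# A toy sketch of the edge side of tagging + hairpinning: stamp a per-VM tag
# on every frame and forward it to the access switch, which does the actual
# forwarding/policy lookup and hairpins VM-to-VM traffic back to the server.
# Tag values, VM names, and the uplink stub are made up for illustration.
VM_TAG = {"vm-web": 101, "vm-db": 102}      # one tag per sending VM

def send_to_uplink(frame):
    print("to access switch:", frame)       # stand-in for the physical NIC

def edge_transmit(src_vm, frame):
    tagged = dict(frame, vlan_vid=VM_TAG[src_vm])   # tag identifies the source VM
    send_to_uplink(tagged)    # no local lookup: even traffic destined for a
                              # co-resident VM goes out to the switch and back

edge_transmit("vm-web", {"dl_dst": "52:54:00:aa:00:02", "payload": "hello"})
```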

The rationale for this approach is that special-purpose switching hardware can perform packet classification (forwarding and policy decisions) faster than software. In the next post, we’ll discuss whether this is an appreciable win over software classification at the edge (hint: not really).

The very obvious downsides to hairpinning are twofold. First, the bisectional bandwidth for inter-VM communication is limited by the first-hop link. Second, it consumes bandwidth that could otherwise be used exclusively for communication between VMs and the outside world.

Perhaps less obvious is that paying DMA and transmission costs for inter-VM communication increases latency. Also, by shunting the switching decision off to a remote device, you’re throwing away a goldmine of rich contextual information about the state of the VMs doing the sending and receiving, as well as (potentially) the applications within those VMs.

Regarding the tags themselves: with VNTag, Cisco proposes a new field in the Ethernet header (thereby changing the Ethernet framing format, and thereby requiring new hardware, … ). HP, a primary driver behind VEPA, decided to use the VLAN tag (and/or source MAC), which any modern switching chipset can handle. In either case, a tag is just bits, so whether it is a new tag (requiring a new ASIC) or the use of an existing one (VLAN and MPLS are commonly suggested candidates), the function is the same.

Blind Trust MAC Authentication: Another approach for networking at the virtual edge is to rely on MAC addresses to identify VMs. It appears that some of the products that do this are repurposed NAC solutions being pushed into the virtualization market. Yuck.

In any case, assuming there is no point of presence on the hypervisor, there are a ton of problems with this approach. These products rarely provide any means to manage policy for inter-VM traffic, they can be easily fooled through source spoofing, and they often rely on hypervisor-specific implementation tricks to detect move events (like looking for VMware RARPs). Double yuck. This is the last we’ll mention of this approach, as it really isn’t a valid solution to the problem.

Switching in the NIC: Another common proposal is to do all inter-VM networking in the NIC. The rationale is twofold: first, if passthrough is being used (discussed further below), the hypervisor is bypassed, so something else needs to do the classification; and second, packet classification is faster in hardware than in software.

However, switching within the NIC hasn’t caught on (and we don’t believe it will to any significant degree). Using passthrough forfeits many advantages of virtualization (as we describe below), and DMA’ing to the NIC is not free. Further, the switching chipsets on NICs to date have not been nearly as powerful as those used in standard switching gear. The only clear advantage to doing switching in the NIC is the avoidance of hairpinning (and perhaps shared memory with the CPU via QPI).

[note: the original version of this article conflated SR-IOV and passthrough which is incorrect.  While it is often used in conjunction with passthrough, SR-IOV itself is not a passthrough technology]

Where does passthrough fit in all of this? Passthrough is a method of bypassing the hypervisor so that the packet is DMA’d from the guest directly to the NIC for forwarding (and vice versa). It can be used in conjunction with NIC-based switching as well as with tagging and enforcement in the access switch. Passthrough basically de-virtualizes the network by removing the layer of software indirection provided by the hypervisor.

While the argument for passthrough is one of saving CPU, doing so is anathema to many hypervisor developers (and their customers!) due to the loss of the software interposition layer. With passthrough, you generally lose the following: memory overcommit, page sharing, live migration, fault tolerance (live standby), live snapshots, the ability to interpose on the IO with the flexibility of software on x86, and device driver independence.

Regarding this last issue (hardware decoupling), with passthrough, if you buy new hardware, prepare to enjoy hours of upgrading/changing device drivers in hundreds of VMs. Or keep your old HW and enjoy updating the drivers anyway because NICs have hardware errata. Also enjoy lots of fun trying to restore from a disk snapshot or deploy from a template that has the old/wrong device driver in it.

To be fair, VMware and others have been investing large amounts of engineering resources into addressing this by performing unnatural acts like injecting device driver code into guests. But to date, the solutions appear to be intrusive, proprietary workarounds limited to a small subset of the available hardware. More general solutions, such as NPA, have no declared release vehicle and, judging from the Linux kernel mailing list, appear to have died (or perhaps been put on hold).

This all said, there actually are some reasonable (albeit fringe) use cases for passthrough, such as throughput-sensitive network appliances or single-packet-latency messaging applications. We’ll say more about these in a later post.

Soft Switching: Soft switching is where networking at the virtual edge is done in software within the hypervisor vswitch, without offloading the decision to special-purpose forwarding hardware. This is far and away the most popular approach in use today, and the singular focus of our next post in this series, so we’ll leave it at that for now.

Is that all?

No. This isn’t an exhaustive overview of all available approaches (far from it). Rather, it is an opinion-soaked survey of what has the most industry noise. In the next post, we will do a deep dive into soft switching, focusing in particular on the performance implications (latency, throughput, cpu overhead) in comparison to the approaches mentioned above.

Until then …

