The Rise of Soft Switching (Part I: Introduction and Background)

[This series is written by Jesse Gross, Andrew Lambeth, Ben Pfaff, and Martin Casado. Ben is an early and continuing contributor to the design and implementation of OpenFlow. He's also a primary developer of Open vSwitch. Jesse is also a lead developer of Open vSwitch and is responsible for the kernel work and datapath. Andrew has been virtualizing networking for long enough to have coined the term "vswitch", and led the vDS distributed switching project at VMware. All authors currently work at Nicira.]

How many times have you been involved in the following conversation?

NIC vendor: “You should handle VM networking at the NIC. It has direct access to server memory and can use a real switching chip with a TCAM to do fast lookups. Much faster than an x86. Clearly the NIC is the correct place to do inter-VM networking.

Switch vendor: “Nonsense! You should handle VM networking at the switch. Just slap a tag on it, shunt it out, and let the real men (and women) do the switching and routing. Not only do we have bigger TCAMs, it’s what we do for a living. You trust the rest of your network to Crisco, so trust your inter-VM traffic too. The switch is the correct place to do inter-VM networking.

Hypervisor vendor: “You guys are both wrong, networking should be done in software in the hypervisor. Software is flexible, efficient and the hypervisor already has rich knowledge of VM properties such as addressing and motion events. The hypervisor is the correct place to do inter-VM networking.

I doubt anyone familiar with these arguments is fooled into thinking they are anything other than what they are, poorly cloaked marketing pitches charading as a technical discussion. This topic in particular is a focus of a lot of marketing noise. Why? Probably because of its strategic importance. The access layer to the network hasn’t opened for over a decade and with virtualization, there is the opportunity to shift control of the network from the core into the edge. This has to be making the traditional hardware vendors pretty nervous.

Fortunately, while the market blather decries this as a nuanced issue, we believe the technical discussion is actually fairly straightforward.

And that is the goal of this series of posts, to explore the technical implications of doing virtual edge networking in software vs. various configurations of hardware (Passthrough, tagging, switching in the NIC, etc.) The current plan is to split the posts into three parts. First, we’ll provide an overview of the proposed solution space. In the next post, we’ll focus on soft switching in particular, and finally we’ll describe how we would like to see the hardware ecosystem evolve to better support networking at the virtual edge.

Not to spoil the journey, but the high-order take-away of this series is that for almost any deployment environment, soft switching is the right approach. And by “right” we mean flexible, powerful, economic, and fast. Really fast. Modern vswitches (that don’t suck) can handle 1G at less than 20% of a core, and 10G at about a core. We’re going to discuss this at length in the next post.

Before we get started, it’s probably worth talking to the bias of the authors. We’re all software guys, so we have a natural preference for soft solutions. That said, we’re also all involved in Open vSwitch which is built to support both hardware (NIC and first hop switch) and software forwarding models. In fact, some of the more important deployments we’re involved in use Open vSwitch to drive switching hardware directly.

Alright, on to the focus of this post, background.

At it’s most basic, networking at the virtual edge simply refers to how forwarding and policy decisions are applied to VM traffic. Since VMs may be co-resident on a server, this means either doing the networking in software, or sending it to hardware to make the decision (or a hybrid of the two) and then back.

We’ll cover some of the more commonly discussed proposals for doing this here:

Tagging + Hairpinning: Tagging with hairpinning is a method for getting inter-VM traffic off of a server so that the first hop forwarding decisions can be enforced by the access hardware switch. Predictably, this approach is strongly used/backed by HP ProCurve and Cisco.

The idea is simple. When a VM sends a packet, it gets a tag unique to the sending VM, and then the packet is sent to the first hop switch. The switch does forwarding/policy lookup on the packet, and if the next hop is another VM resident on the same server, the packet is sent back to the server (hairpinned) to be delivered to the VM. The tagging can be done by the vswitch in the hypervisor, or by the NIC when using passthrough (discussed below).

The rational for this approach is that special purpose switching hardware can perform packet classification (forwarding and policy decisions) faster than software. In the next post, we’ll discuss whether this is an appreciable win over software classification at the edge (hint, not really).

The very obvious downsides to hairpinning are twofold. First, the bisectional bandwidth for inter-VM communication is limited by the first hop link. And it consumes bandwidth which could otherwise be used exclusively for communication between VMs and the outside world.

Perhaps not so obvious is that paying DMA and transmission costs for inter-VM communication increases latency. Also, by shunting the switching decision off to a remote device, you’re throwing away a goldmine of rich contextual information about the state of the VM’s doing the sending and receiving, as well as (potentially) the applications within those VMs.

Regarding the tags themselves, with VNTag, Cisco proposes a new field in the Ethernet header (thereby changing the Ethernet framing format, and thereby requiring new hardware, … ). And HP, a primary driver behind VEPA, decided to use the VLAN tag (and or source MAC) which any modern switching chipset can handle. In either case, a tag is just bits, so whether it is a new tag (requiring a new ASIC), or the use of an existing tag (VLAN and MPLS are commonly suggested candidates), the function is the same.

Blind Trust MAC Authentication: Another approach for networking at the virtual edge is to rely on MAC addresses to identify VMs. It appears that some of the products that do this are repurposed NAC solutions being pushed into the virtualization market. Yuck.

In any case, assuming there is no point of presence on the hypervisor, there are a ton of problems with this approach. They rarely provide any means to manage policy of inter-VM traffic, they can be easily fooled through source spoofing, and often they rely on hypervisor specific implementation tricks to determine move events (like looking for VMWare RARPs). Double yuck. This is the last we’ll mention of this approach as it really isn’t a valid solution to the problem.

Switching in the NIC: Another common proposal is to do all inter-VM networking in the NIC. The rational is twofold, if passthrough is being used (discussed further below) the hypervisor is bypassed, so something needs to do the classification. And secondly, packet classification is faster in hardware than software.

However, switching within the NIC hasn’t caught on (and we don’t believe it will to any significant degree). Using passthrough obviates many advantages of virtualization (as we describe below), and DMA’ing to the NIC is not free. Further, switching chipsets on NICs to date have not been nearly as powerful as those used in standard switching gear. The only clear advantage to doing switching in the NIC is the avoidance of hair-pinning (and perhaps shared memory with the CPU via QPI).

[note: the original version of this article conflated SR-IOV and passthrough which is incorrect.  While it is often used in conjunction with passthrough, SR-IOV itself is not a passthrough technology]

Where does Passthrough fit in all of this? Passthrough is a method of bypassing the hypervisor so that the packet is DMA’d from the guest directly to the NIC for forwarding (and vice versa). This can be used in conjunction with NIC-based switching as well as tagging and enforcement in the access switch. Passthrough basicaly de-virtualizes the network by removing the layer of software indirection provided by the hypervisor.

While the argument for passthrough is one of saving CPU, doing so is a complete anathema to many hypervisor developers (and their customers!) due to the loss of the software interposition layer. With passthrough, you generally loose the following: memory overcommit, page sharing, live migration, fault tolerance (live standby), live snapshots, the ability to interpose on the IO with the flexibility of software on x86, and device driver independence.

Regarding this last issue (hardware decoupling), with passthrough, if you buy new hardware, prepare to enjoy hours of upgrading/changing device drivers in hundreds of VMs. Or keep your old HW and enjoy updating the drivers anyway because NICs have hardware errata. Also enjoy lots of fun trying to restore from a disk snapshot or deploy from a template that has the old/wrong device driver in it.

To be fair, VMware and others have been investing large amounts of engineering resources into address this by performing unnatural acts like injecting device driver code into guests. But to date, the solutions appear to be intrusive and proprietary workarounds limited to a small subset of the available hardware. More general solutions, such as NPA, have no declared release vehicle, and from the Linux kernel mailing list appear to have died (or perhaps put on hold).

This all said, there actually are some reasonable (albeit fringe) use cases for Passthrough, such as throughput sensitive network appliances or single packet latency messaging applications. We’ll talk to these more in a later post.

Soft Switching: Soft switching is where networking at the virtual edge is done in software withing the hypervisor vswitch, and without offloading the decision to special purpose forwarding hardware. This is far and away the most popular approach in use today, and the singular focus of our next post in this series, so we’ll leave it at that for now.

Is that all?

No. This isn’t an exhaustive overview of all available approaches (far from it). Rather, it is an opinion-soaked survey of what has the most industry noise. In the next post, we will do a deep dive into soft switching, focusing in particular on the performance implications (latency, throughput, cpu overhead) in comparison to the approaches mentioned above.

Until then …

8 Comments on “The Rise of Soft Switching (Part I: Introduction and Background)”

  1. Nice post Martin! Good summary. Just a couple of remarks:

    - VN-Tag exists not just to sell new ASICs. It exists so you don’t have to recycle the QinQ tag, which was developed for a purpose. It is the base of the Qbh standard, and I believe there is sufficient need for new ID in the market to develop a new standard for this. It also allows for nesting, which is nice for other reasons (extenders…).

    - I agree that soft-switching is a great way to follow, and what I find most interesting is the new service deployment model that this allows, where you redirect VM traffic to a service (load-balancer, firewall, IPS…) based on VM property policies, not on network addressing or VLAN. This is a huge change, that abstracts network design from service deployment, and I believe it can be a true game changer for the short-term future in Data Center networks.


    (Disclaimer: I am a Cisco engineer).

    • Heya Pablo, thanks for the comment. You definitely make a fair point regarding VN-Tag, perhaps my cynicism is unwarranted. Thanks for pointing that out. Secondly, I very much agree with your statement on service interposition. I’m guessing we’re going to see a lot more of that and similar approaches in the near future.

  2. [...] is the second post in our series on soft switching. In the first part (found here), we lightly covered a subset of the technical landscape around networking at the virtual edge. [...]

  3. Brad Hedlund says:

    Hi Martin,
    I agree, soft switching IS awesome :-) That said, your characterization of the Switch vendor’s technical argument is partially true, but’s it missing perhaps the primary argument (assuming we’re talking about Cisco). Which is that hardware switch vendors can also make a software switch (for all the reasons you state here, and more).

    Switch vendor: “You can handle VM switching in the server or the network, there are reasons to do each. You already trust Cisco with your physical network, so trust your inter-VM switching to Cisco as well, we can do both and provide you a consistent operational model in each case”
    e.g. Nexus 1000V (soft), VM-FEX (hard)


    • Heya Brad,

      I’m glad you brought this up. I really view Cisco as an exception, not only is CIsco a pioneer in networking at the virtual edge (with N1 K, for example), you guys play in all three spaces (soft switching, NICs, and switching). That said, there are many other switch vendors who aren’t nearly as broad.

      thanks for the comment.

  4. [...] software on in the hypervisor platform that performs all the frame forwarding. The folks over at Network Heresy have pumped out a number of self-serving and bombastic articles declaiming the value of software [...]

  5. [...] is our third (and final) post on soft switching. The previous two posts described various hardware edge switching technologies (tagging + hairpinning, offloading to [...]

  6. Technically speaking I believe VEPA does not require tagging (QinQ), only the multi-channel variant of VEPA does.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Get every new post delivered to your Inbox.

Join 374 other followers