The Rise of Soft Switching Part II: Soft Switching is Awesome ™
Posted: June 25, 2011
[This series is written by Jesse Gross, Andrew Lambeth, Ben Pfaff, and Martin Casado. Ben is an early and continuing contributor to the design and implementation of OpenFlow. He’s also a primary developer of Open vSwitch. Jesse is also a lead developer of Open vSwitch and is responsible for the kernel work and datapath. Andrew has been virtualizing networking for long enough to have coined the term “vswitch”, and led the vDS distributed switching project at VMware. All authors currently work at Nicira.]
This is the second post in our series on soft switching. In the first part (found here), we lightly covered a subset of the technical landscape around networking at the virtual edge. This included tagging and offloading decisions to the access switch, switching inter-VM traffic in the NIC, and NIC passthrough to the guest VM.
A brief synopsis of the discussion is as follows. Both tagging and passthrough are designed to save end-host CPU by punting the packet classification problem to specialized forwarding hardware, either on the NIC or in the first-hop switch, and to avoid the overhead of switching out of the guest to the hypervisor to access the hardware. However, tagging adds inter-VM latency and reduces inter-VM bisection bandwidth. Passthrough also increases inter-VM latency, and effectively de-virtualizes the network, thereby greatly limiting the flexibility provided by the hypervisor. We also mentioned that switching in widely available NICs today is impractical due to severe limitations in the on-board switching chips.
For the purposes of the following discussion, we are going to reduce the previous discussion to this: the performance argument in favor of an approach like passthrough + tagging (with enforcement in the first-hop switch) is that latency to the wire is reduced (albeit marginally) and that packet classification in a proper switching chip will noticeably outperform x86.
The goal of this post is to explain why soft switching kicks ass. First, we’ll debunk some of the FUD around its performance, and then try to quantify the resource/performance tradeoffs of soft switching vis-à-vis hardware-based approaches. As we’ll argue, the question is not “how fast is soft switching” (it is almost certainly fast enough), but rather, “how much CPU am I willing to burn”, or perhaps “should I heat the room with teal or black colored boxes”?
So with that …
Why soft switching is awesome:
So, what is soft switching? Exactly what it sounds like. Instead of the packet being passed off to a special-purpose hardware device, it transitions from the guest VM into the hypervisor, which makes the forwarding decision in software (read: x86). Note that while a soft switch can technically be used for tagging, for the purposes of this discussion we’ll assume that it’s doing all of the first-hop switching.
The benefits of this approach are obvious. You get the flexibility and upgrade cycle of software, and compared to passthrough, you keep all of the benefits of virtualization (memory overcommit, page sharing, etc.). Also, soft switching tends to be much better integrated with the virtual environment. There is a tremendous amount of context that can be gleaned by being co-resident with the VMs, such as which MAC and IP addresses are assigned locally, VM resource use and demands, or which multicast addresses are being listened to. This information can be used to pre-populate tables, optimize QoS rules, prune multicast trees, etc.
Another benefit is simple resource efficiency: you already bought the damn server, so if you have excess compute capacity, why buy specialized hardware for something you can do on the end host? Or put another way, after you provision some amount of hardware resources to handle the switching work, any of those resources that are left over are always available to do real work running an app instead of being wasted (which is usually a lot, since you have to provision for peaks).
Of course, nothing comes for free. And there is a perennial skepticism around the performance of software when compared to specialized hardware. So we’ll take some time to focus on that.
First, what are the latency costs of soft switching?
With soft switching, VM-to-VM communication effectively reduces to a memcpy() (you can also do page flipping, which has the same order of overhead). This is as fast as one can expect to achieve on a modern architecture. Copying data between VMs through a shared L2 cache on a multicore CPU, or even if you are unlucky enough to have to go to main memory, is certainly faster than doing a DMA over the PCI bus. So for VM-to-VM communication, soft switching will have the lowest latency, presuming the lookup can be done sufficiently quickly (more on that below).
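To make that concrete, here is a minimal sketch (hypothetical C, not actual hypervisor code) of what inter-VM “switching” amounts to once the hypervisor has both guests’ buffers mapped: a forwarding decision plus a memcpy().

```c
/* Hypothetical sketch: with both guests' buffers mapped into the
 * hypervisor, delivering an inter-VM frame is just a copy between
 * two regions of host memory. No DMA, no PCI bus transit. */
#include <stdint.h>
#include <string.h>

struct frame {
    uint32_t len;
    uint8_t  data[2048];            /* MTU-sized buffer */
};

/* tx points into the sending VM's shared ring; rx into the receiving
 * VM's posted receive buffer. Both are already host-mapped. */
static void
deliver_vm_to_vm(const struct frame *tx, struct frame *rx)
{
    rx->len = tx->len;
    memcpy(rx->data, tx->data, tx->len);    /* the entire "fabric" */
}
```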
Sending traffic from the guest to the wire is only marginally more expensive, due to the overhead of a domain transfer (e.g. flushing the TLB) and copying security-sensitive information (such as headers). In Xen (for example), guest transmit (DomU-to-Dom0) operates by mapping pages that were allocated by the guest into the hypervisor, which then get DMA’d with no copy required (with the exception of headers, which are copied for security purposes so the guest can’t change them after the hypervisor has made a decision). In the other direction, the guest allocates pages and puts them in its rx ring, similar to real hardware. These then get shared with the hypervisor via remapping, and when receiving a packet the hypervisor copies the data into the guest buffers. (Note: VMware does almost the same thing. However, there is no remapping, because the vswitch runs in the vmkernel, where all physical pages are already mapped and the guest MMU mappings are accessible.)
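A simplified sketch of that transmit path (the names here are invented for illustration; real Xen uses grant tables and the netfront/netback drivers) shows the key split: headers are copied so the guest can’t change them after the forwarding decision, while payload pages are merely mapped and DMA’d with no copy.

```c
/* Hypothetical sketch of guest transmit (DomU-to-Dom0). Headers are
 * copied out of guest memory for security; the payload page is mapped
 * (grant-table machinery elided) and DMA'd directly, with no copy. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HDR_LEN 128                 /* covers L2/L3/L4 headers */

struct tx_request {                 /* what the guest shares */
    void  *guest_page;
    size_t len;
};

struct host_pkt {                   /* what the hypervisor forwards */
    uint8_t  hdr[HDR_LEN];          /* private copy, safe to classify */
    void    *payload;               /* guest page mapped, not copied */
    size_t   len;
};

static void
hypervisor_tx(const struct tx_request *req, struct host_pkt *pkt)
{
    size_t n = req->len < HDR_LEN ? req->len : HDR_LEN;

    memcpy(pkt->hdr, req->guest_page, n);  /* copy only the headers */
    pkt->payload = req->guest_page;        /* map, don't copy, the rest */
    pkt->len = req->len;

    /* ... classify on pkt->hdr, then hand off to the NIC for DMA ... */
}
```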
So, while there is comparatively more overhead than a pure hardware approach (due to copying headers and the overhead of the domain transfer), it is on the order of microseconds and dwarfed by other aspects of a virtualized system like memory overcommit. Or more to the point, this overhead only matters in extremely latency-sensitive environments (everywhere else it is completely lost in the noise of other hypervisor overhead), in which case the only deployment approach that makes sense is to effectively pin compute to dedicated hardware, greatly diminishing the utility of using virtualization in the first place.
What about throughput?
Modern soft switches that don’t suck are able to saturate a 10G link from a guest to the wire with less than a core (assuming MTU size packets). They are also able to saturate a 1G link with less than 20% of a core. In the case of Open vSwitch, these numbers include full packet lookup over L2, L3 and L4 headers.
While these numbers are commonly seen in practice, throughput is, in theory, bounded by the overhead of the forwarding decision: more complex lookups take more time and thus reduce total throughput.
The forwarding decision involves taking the header fields of each packet and checking them against the forwarding rule set (L2, L3, ACLs, etc.) to determine how to handle the packet. This general class of problem is termed “packet classification” and is worth taking a closer look at.
Soft Packet Classification:
One of the primary arguments in favor of offloading virtual edge switching to hardware is that a TCAM can do a lookup faster than x86. This is unequivocally true: TCAMs have lots and lots of gates (and are commensurately costly, with high power demands) so that they can check many rules in parallel. A general-purpose CPU cannot match the lookup capacity of a TCAM in the degenerate case.
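For intuition, here is what that degenerate case looks like in software: a naive classifier over a toy five-field flow that checks every rule, value and mask, in sequence. A TCAM performs all of these checks in a single parallel step. (Hypothetical structures, not any production classifier.)

```c
/* Naive software classification over a toy five-field flow: scan all
 * rules, apply each rule's mask, keep the highest-priority match.
 * A TCAM does this entire loop in one parallel lookup. */
#include <stddef.h>
#include <stdint.h>

struct flow {                      /* fields extracted from a packet */
    uint32_t ip_src, ip_dst;
    uint16_t tp_src, tp_dst;
    uint8_t  ip_proto;
};

struct rule {
    struct flow value;             /* assumed pre-masked: value & mask == value */
    struct flow mask;              /* 1-bits significant, 0-bits wildcarded */
    int priority;
    int action;                    /* e.g. output port, drop */
};

static int
rule_matches(const struct flow *f, const struct rule *r)
{
    return (f->ip_src & r->mask.ip_src) == r->value.ip_src
        && (f->ip_dst & r->mask.ip_dst) == r->value.ip_dst
        && (f->tp_src & r->mask.tp_src) == r->value.tp_src
        && (f->tp_dst & r->mask.tp_dst) == r->value.tp_dst
        && (f->ip_proto & r->mask.ip_proto) == r->value.ip_proto;
}

static const struct rule *
classify_linear(const struct flow *f, const struct rule *rules, size_t n)
{
    const struct rule *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (rule_matches(f, &rules[i])
            && (!best || rules[i].priority > best->priority)) {
            best = &rules[i];
        }
    }
    return best;
}
```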
However, software packet classification has come a long way. Under realistic workloads and rule sets found in virtualized environments (e.g. multi-tenant isolation with a sane security policy), soft switching can handle lookups at line rates with the resource usage mentioned above (less than a core for 10G), and so does not add appreciable overhead.
How is this achieved? For Open vSwitch, which looks at many more headers than will fit in a standard TCAM, the common case lookup reduces to the overhead of a hash (due to extensive use of flow caching) and can achieve the same throughput as normal soft forwarding. We have run Open vSwitch with hundreds of thousands of forwarding rules and still achieved similar performance numbers to those described above.
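As a sketch of that fast path (hypothetical code, not the Open vSwitch datapath interface): the first packet of a flow runs through the full classifier, and its verdict is cached under an exact-match hash of the headers, so every subsequent packet of the flow costs roughly one hash probe.

```c
/* Hypothetical flow-cache fast path: exact-match hash of the headers,
 * falling back to the full classifier only on a miss. Flows are
 * zero-initialized so padding bytes hash and compare consistently. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct flow { uint32_t ip_src, ip_dst; uint16_t tp_src, tp_dst; uint8_t ip_proto; };

#define CACHE_SLOTS 65536          /* power of two; collision policy elided */

struct cache_entry {
    struct flow key;               /* exact headers, no wildcards */
    int action;
    int valid;
};

static struct cache_entry flow_cache[CACHE_SLOTS];

static uint32_t
flow_hash(const struct flow *f)    /* FNV-1a; any decent hash works */
{
    const uint8_t *p = (const uint8_t *) f;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof *f; i++) {
        h = (h ^ p[i]) * 16777619u;
    }
    return h;
}

static int
full_classify(const struct flow *f)   /* stand-in for the slow path */
{
    (void) f;
    return 0;
}

static int
fast_path(const struct flow *f)
{
    struct cache_entry *e = &flow_cache[flow_hash(f) % CACHE_SLOTS];

    if (e->valid && !memcmp(&e->key, f, sizeof *f)) {
        return e->action;          /* common case: one hash probe */
    }
    e->key = *f;                   /* miss: classify once, cache verdict */
    e->action = full_classify(f);
    e->valid = 1;
    return e->action;
}
```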
Flow setup, on the other hand, is marginally more expensive, since it cannot benefit from the caching. Performance of the packet classifier in Open vSwitch relies on our observation that flow tables used in practice (in the environments we’re familiar with) tend to have only a handful of unique sets of wildcarded fields. Each of these observed wildcard sets has its own hash table, hashed on the fields that are not wildcarded. Classifying a packet therefore requires an O(1) lookup in each hash table and selecting the highest-priority match, so lookup performance is linear in the number of unique wildcard sets in the flow table. Since this number tends to be small, classifier overhead tends to be negligible.
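A sketch of that lookup structure (again hypothetical, with collision handling elided): one hash table per observed wildcard set, each keyed on the non-wildcarded fields, with the classifier probing each table once and keeping the highest-priority hit.

```c
/* Hypothetical tuple-space-style classifier: each unique wildcard set
 * ("subtable") owns a hash table keyed on its masked fields. A lookup
 * is one probe per subtable; cost is linear in the number of subtables,
 * which in practice is a small handful. Collisions elided for brevity. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct flow { uint32_t ip_src, ip_dst; uint16_t tp_src, tp_dst; uint8_t ip_proto; };

struct rule {
    struct flow match;             /* pre-masked, zero-padded values */
    int priority;
    int action;
};

#define BUCKETS 1024

struct subtable {
    struct flow mask;                     /* wildcard pattern shared by rules */
    const struct rule *buckets[BUCKETS];  /* one-slot hash table */
};

static struct flow
flow_apply_mask(const struct flow *f, const struct flow *m)
{
    struct flow r;
    memset(&r, 0, sizeof r);       /* zero padding for hashing/compares */
    r.ip_src   = f->ip_src   & m->ip_src;
    r.ip_dst   = f->ip_dst   & m->ip_dst;
    r.tp_src   = f->tp_src   & m->tp_src;
    r.tp_dst   = f->tp_dst   & m->tp_dst;
    r.ip_proto = f->ip_proto & m->ip_proto;
    return r;
}

static uint32_t
flow_hash(const struct flow *f)    /* FNV-1a, as in the cache sketch */
{
    const uint8_t *p = (const uint8_t *) f;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof *f; i++) {
        h = (h ^ p[i]) * 16777619u;
    }
    return h;
}

static const struct rule *
classify_tss(const struct flow *f, const struct subtable *tables, size_t n)
{
    const struct rule *best = NULL;
    for (size_t i = 0; i < n; i++) {       /* one O(1) probe per subtable */
        struct flow key = flow_apply_mask(f, &tables[i].mask);
        const struct rule *r = tables[i].buckets[flow_hash(&key) % BUCKETS];
        if (r && !memcmp(&r->match, &key, sizeof key)
            && (!best || r->priority > best->priority)) {
            best = r;
        }
    }
    return best;
}
```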
We realize that this is all a bit hand-wavy and needs to be backed up with hard performance results. Because soft classification is such an important (and somewhat nuanced) issue, we will dedicate a future post to it.
“Yeah, this is all great. But when is soft switching not a good fit?”
While we would contend that soft switching is good for most deployment environments, there are instances in which passthrough or tagging is useful.
In our experience, the mainstay argument for passthrough is reduced latency to the wire. While average latency is probably OK, specialized apps with very small request/response workloads can be impacted by the added latency of soft switching.
Another common use case for passthrough is a local appliance VM that acts as an inline device between normal application VMs and the network. Such an appliance VM has no need for most of the mobility or other hypervisor-provided goodness that is sacrificed with passthrough, but it does need to process traffic with as little overhead as possible.
Passthrough is also useful for providing the guest with access to hardware that is not exposed by the emulated NIC (for example, some NICs have IPsec offload but that is not generally exposed).
Of course, if you do tagging to a physical switch, you get access to all of the advanced features that have been developed over time, all exposed through a CLI that people are familiar with (this is clearly less true with Cisco’s Nexus 1k). In general, this line of argument has more to do with the immaturity of software switches than with any real fundamental limitation. But it’s a reasonable use case from an operations perspective.
The final, and probably most widely used (albeit least talked about), use case for passthrough is drag racing. Hypervisor vendors need to make sure that they can all post the absolute highest, break-neck performance numbers for cross-hypervisor performance comparisons (yup, sleazy), regardless of how much fine print is required to qualify them. Why else would any sane vendor of the most valuable piece of real estate in the network (the last inch) cede it to a NIC vendor? And of course the NIC vendors that are lucky enough to be blessed by a hypervisor into passthrough Valhalla can drag race with each other with all their OS-specific hacks again.
“What am I supposed to conclude from all of this?”
Hopefully we’ve made our opinion clear: soft switching kicks mucho ass. There is good reason that it is far and away the dominant technology used for switching at the virtual edge. To distill the argument even further, our calculus is simple …
Software flexibility + 1 core x86 + 10G networking + cheap gear + other cool shit >
saving a core of x86 + costly specialized hardware + unlikely to be realized benefit of doing classification in the access switch.
More seriously, while we make the case for soft switching, we still believe there is ample room for hardware acceleration. However, rather than just shipping off packets to a hardware device, we believe that stateless offload in the NIC is a better approach. In the next post in this series, we will describe how we think the hardware ecosystem should evolve to aid the virtual networking problem at the edge.