
The Rise of Soft Switching Part II: Soft Switching is Awesome ™

[This series is written by Jesse Gross, Andrew Lambeth, Ben Pfaff, and Martin Casado. Ben is an early and continuing contributor to the design and implementation of OpenFlow. He’s also a primary developer of Open vSwitch. Jesse is also a lead developer of Open vSwitch and is responsible for the kernel work and datapath. Andrew has been virtualizing networking for long enough to have coined the term “vswitch”, and led the vDS distributed switching project at VMware. All authors currently work at Nicira.]

This is the second post in our series on soft switching. In the first part (found here), we lightly covered a subset of the technical landscape around networking at the virtual edge. This included tagging and offloading decisions to the access switch, switching inter-VM traffic in the NIC, as well as NIC passthrough to the guest VM.

A brief synopsis of the discussion is as follows. Both tagging and passthrough are designed to save end-host CPU by punting the packet classification problem to specialized forwarding hardware, either on the NIC or on the first-hop switch, and to avoid the overhead of switching out of the guest to the hypervisor to access the hardware. However, tagging adds inter-VM latency and reduces inter-VM bisection bandwidth. Passthrough also increases inter-VM latency, and effectively de-virtualizes the network, thereby greatly limiting the flexibility provided by the hypervisor. We also mentioned that switching in widely available NICs today is impractical due to severe limitations in the on-board switching chips.

For the purposes of the following discussion, we are going to reduce the previous discussion to this: the performance arguments in favor of an approach like passthrough + tagging (with enforcement in the first-hop switch) are that latency to the wire is reduced (albeit marginally) and that packet classification in a proper switching chip will noticeably outperform x86.

The goal of this post is to explain why soft switching kicks ass. Initially, we’ll debunk some of the FUD around performance issues with it, and then try to quantify the resource/performance tradeoffs of soft switching vis-à-vis hardware-based approaches. As we’ll argue, the question is not “how fast is soft switching” (it is almost certainly fast enough), but rather, “how much CPU am I willing to burn”, or perhaps “should I heat the room with teal or black colored boxes?”

So with that …

Why soft switching is awesome:

So, what is soft switching? Exactly what it sounds like. Instead of passing the packet off to a special purpose hardware device, the packet transitions from the Guest VM into the hypervisor which performs the forwarding decision in software (read, x86). Note that while a soft switch can technically be used for tagging, for the purposes of this discussion we’ll assume that it’s doing all of the first-hop switching.

The benefits of this approach are obvious. You get the flexibility and upgrade cycle of software, and compared to passthrough, you keep all of the benefits of virtualization (memory overcommit, page sharing, etc.). Also, soft switching tends to be much better integrated with the virtual environment. There is a tremendous amount of context that can be gleaned by being co-resident with the VMs, such as which MAC and IP addresses are assigned locally, VM resource use and demands, or which multicast addresses are being listened to. This information can be used to pre-populate tables, optimize QoS rules, prune multicast trees, etc.

Another benefit is simple resource efficiency: you already bought the damn server, so if you have excess compute capacity, why buy specialized hardware for something you can do on the end host? Put another way, after you provision some amount of hardware resources to handle the switching work, any resources left over are available to do real work running an app instead of being wasted (and the leftovers are usually substantial, since you have to provision for peaks).

Of course, nothing comes for free. And there is a perennial skepticism around the performance of software when compared to specialized hardware. So we’ll take some time to focus on that.

First, what are the latency costs of soft switching?

With soft switching, VM to VM communication effectively reduces to a memcpy() (you can also do page flipping which has the same order of overhead). This is as fast as one can expect to achieve on a modern architecture. Copying data between VMs through a shared L2 cache on a multicore CPU, or even if you are unlucky enough to have to go to main memory, is certainly faster than doing a DMA over the PCI bus. So for VM to VM communication, soft switching will have the lowest latency, presuming you can do the lookup function sufficiently quickly (more on that below).

Sending traffic from the guest to the wire is only marginally more expensive due to the overhead of a domain transfer (e.g. flushing the TLB) and copying security-sensitive information (such as headers). In Xen (for example), guest transmit (DomU-to-Dom0) operates by mapping pages that were allocated by the guest into the hypervisor, which then get DMA’d with no copy required (with the exception of headers, which are copied for security purposes so the guest can’t change them after the hypervisor has made a decision). In the other direction, the guest allocates pages and puts them in its rx ring, similar to real hardware. These then get shared with the hypervisor via remapping, and when receiving a packet the hypervisor copies the data into the guest buffers. (Note: VMware does almost the same thing, except there is no remapping: the vswitch runs in the vmkernel, all physical pages are already mapped, and the vmkernel has access to the guest MMU mappings.)

So, while there is comparatively more overhead than a pure hardware approach (due to copying headers and the overhead of the domain transfer), it is on the order of microseconds and dwarfed by other aspects of a virtualized system like memory overcommit. Or more to the point, this only matters in extremely latency-sensitive environments (everywhere else it is completely lost in the noise of other hypervisor overhead), in which case the only deployment approach that makes sense is to effectively pin compute to dedicated hardware, greatly diminishing the utility of using virtualization in the first place.

What about throughput?

Modern soft switches that don’t suck are able to saturate a 10G link from a guest to the wire with less than a core (assuming MTU size packets). They are also able to saturate a 1G link with less than 20% of a core. In the case of Open vSwitch, these numbers include full packet lookup over L2, L3 and L4 headers.

While these are numbers commonly seen in practice, theoretically, throughput is affected by the overhead of the forwarding decision — more complex lookups can take more time thus reducing total throughput.
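The “MTU size packets” caveat above matters because throughput limits are really per-packet limits. As a back-of-envelope illustration (plain Ethernet line-rate arithmetic, not a measurement of any particular switch), packet size drives the packet rate a switch must sustain:

```python
def max_packet_rate(link_gbps, frame_bytes):
    """Theoretical maximum frames/sec on an Ethernet link.

    Beyond the frame itself (which includes the 4-byte FCS), each
    frame occupies 20 extra bytes on the wire: 7B preamble,
    1B start-of-frame delimiter, and a 12B inter-frame gap.
    """
    wire_bits = (frame_bytes + 20) * 8
    return link_gbps * 1e9 / wire_bits

# Minimum-size (64B) frames on 10G: the classic ~14.88 Mpps.
print(round(max_packet_rate(10, 64)))      # 14880952
# MTU-size (1518B) frames on 10G: only ~813 kpps.
print(round(max_packet_rate(10, 1518)))    # 812744
```

A per-packet lookup budget that is comfortable at roughly 800 kpps is about 18x tighter at minimum-size line rate, which is why the forwarding decision cost is worth a closer look.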

The forwarding decision involves taking the header fields of each packet and checking them against the forwarding rule set (L2, L3, ACLs, etc.) to determine how to handle the packet. This general class of problem is termed “packet classification” and is worth taking a closer look at.

Soft Packet Classification:

One of the primary arguments in favor of offloading virtual edge switching to hardware is that a TCAM can do a lookup faster than x86. This is unequivocally true: TCAMs have lots and lots of gates (and are commensurately costly and have high power demands) so that they can do lookups of many rules in parallel. A general-purpose CPU cannot match the lookup capacity of a TCAM in a degenerate case.

However, software packet classification has come a long way. Under realistic workloads and rule sets that are found in virtualized environments (e.g. multi-tenant isolation with a sane security policy), soft switching can handle lookups at line rates with the resource usage mentioned above (less than a core for 10G) and so does not add appreciable overhead.

How is this achieved? For Open vSwitch, which looks at many more headers than will fit in a standard TCAM, the common case lookup reduces to the overhead of a hash (due to extensive use of flow caching) and can achieve the same throughput as normal soft forwarding. We have run Open vSwitch with hundreds of thousands of forwarding rules and still achieved similar performance numbers to those described above.

Flow setup, on the other hand, is marginally more expensive, since it cannot benefit from the caching. Performance of the packet classifier in Open vSwitch relies on our observation that flow tables used in practice (in the environments we’re familiar with) tend to have only a handful of unique sets of wildcarded fields. Each of these observed wildcard sets has its own hash table, hashed on the fields that are not wildcarded. Classifying a packet therefore requires an O(1) lookup in each hash table and selecting the highest-priority match, so lookup performance is linear in the number of unique wildcard sets in the flow table. Since this tends to be small, classifier overhead tends to be negligible.
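The scheme just described is essentially tuple space search. A minimal sketch (simplified, with hypothetical field names; not the Open vSwitch classifier itself):

```python
from collections import defaultdict

class TupleSpaceClassifier:
    """One hash table per unique set of non-wildcarded fields.
    Classifying a packet probes each table once and keeps the
    highest-priority hit, so cost grows with the number of distinct
    wildcard sets, not with the number of rules."""

    def __init__(self):
        # frozenset of matched fields -> {match values -> (priority, action)}
        self.tables = defaultdict(dict)

    def add_rule(self, match, priority, action):
        fields = frozenset(match)                 # rule's non-wildcarded fields
        key = tuple(sorted(match.items()))
        current = self.tables[fields].get(key)
        if current is None or priority > current[0]:
            self.tables[fields][key] = (priority, action)

    def classify(self, pkt):
        best = None
        for fields, table in self.tables.items():  # one probe per wildcard set
            key = tuple(sorted((f, pkt[f]) for f in fields))
            hit = table.get(key)
            if hit is not None and (best is None or hit[0] > best[0]):
                best = hit
        return best[1] if best is not None else "drop"  # table-miss default
```

Adding a rule that uses an already-seen combination of fields is free at lookup time; only a rule with a brand-new combination adds another hash table to probe.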

We realize that this is all a bit hand-wavy and needs to be backed up with hard performance results. Because soft classification is such an important (and somewhat nuanced) issue, we will dedicate a future post to it.

“Yeah, this is all great. But when is soft switching not a good fit?”

While we would contend that soft switching is good for most deployment environments, there are instances in which passthrough or tagging is useful.

In our experience, the mainstay argument for passthrough is reduced latency to the wire. So while average latency is probably ok, specialized apps that have very small request/response type workloads can be impacted by the latency of soft switching.

Another common use case for passthrough is a local appliance VM that acts as an inline device between normal application VMs and the network. Such an appliance VM has no need of most of the mobility or other hypervisor provided goodness that is sacrificed with passthrough but it does have a need to process traffic with as little overhead as possible.

Passthrough is also useful for providing the guest with access to hardware that is not exposed by the emulated NIC (for example, some NICs have IPsec offload but that is not generally exposed).

Of course, if you do tagging to a physical switch, you get access to all of the advanced features that have been developed over time, all exposed through a CLI that people are familiar with (this is clearly less true with Cisco’s Nexus 1k). In general, this line of argument has more to do with the immaturity of software switches than with any real fundamental limitation. But it’s a reasonable use case from an operations perspective.

The final, and probably most widely used (albeit least talked about) use case for passthrough is drag racing. Hypervisor vendors need to make sure that they can all post the absolute highest, break-neck performance numbers for cross-hypervisor performance comparisons (yup, sleazy), regardless of how much fine print is required to qualify them. Why else would any sane vendor of the most valuable piece of real estate in the network (the last inch) cede it to a NIC vendor? And of course the NIC vendors that are lucky enough to be blessed by a hypervisor into passthrough Valhalla can drag race with each other with all their OS-specific hacks again.

“What am I supposed to conclude from all of this?”

Hopefully we’ve made our opinion clear: soft switching kicks mucho ass. There is good reason that it is far and away the dominant technology used for switching at the virtual edge. To distill the argument even further, our calculus is simple …

Software flexibility + 1 core x86 + 10G networking + cheap gear + other cool shit >

       saving a core of x86 + costly specialized hardware + unlikely to be realized benefit of doing classification in the access switch.

More seriously, while we make the case for soft switching, we still believe there is ample room for hardware acceleration. However, rather than just shipping off packets to a hardware device, we believe that stateless offload in the NIC is a better approach. In the next post in this series, we will describe how we think the hardware ecosystem should evolve to aid the virtual networking problem at the edge.

10 Comments on “The Rise of Soft Switching Part II: Soft Switching is Awesome ™”

  1. Hi Martin, let me reply here with the same arguments I used commenting your post at Ivan’s site.

    Designing a virtualization-aware network in practice requires lots of design work, study of failover scenarios and high/low level design effort. While EVB solutions do the job very effectively (Nexus 1000v being a great example of that), VM-Fex eliminates that extra layer of design, troubleshooting, configuration and management. This is huge in real life production environments when you have to deal with complex virtual-machine environments (with SAN and NAS storage networking, several management domains, different security requirements and traffic separation policies).

    Virtual networking directly at the hardware layer paints a very simplified picture, with predictable behaviour and easy troubleshooting. I like that picture, and depending on the scenario I may prefer it to an embedded EVB or soft-switch.

    This said, I like how you clarify some “myths” on software-based switching for virtualized environments. I am missing a bit on the real-world challenges of configuring this kind of scenario for a quick and manageable deployment. I believe the beauty of passthrough solutions is in their simplicity and direct ways. I happen to have just finished the design and deployment of a pretty dense virtual network (more than 2000 virtual desktops and around the same figure for virtual servers), and there really is a design complexity toll you pay when introducing this layer. True, its benefits greatly overshadow this extra complexity, but every step we take towards simplification is very welcome, especially for large Enterprise customers.

    Congrats on the post though, referenced it at my blog as usual: great clarity and succinctness.



    (DISCLAIMER: I am a Cisco Systems Engineer).

  2. D.S. says:

    What about the overhead of TCP offload, IP checksums, etc., that cannot be delegated to the NIC any more?

    • [Note: the original reply was a little misleading and has been edited to clarify things a bit.]

      It’s important to remember that end-host protocol offload does not conflict with soft switching. The vSwitch can (and does) carry the context for various stateless offloads from the vNIC through to the pNIC. In the vNIC-to-vNIC case, it can either skip them entirely (e.g. VLAN tagging/stripping) or perform them in software. Checksum offload, for example, could technically be dropped for local traffic, but that makes tcpdump output look scary, so it is usually done in software, effectively for “free” inline with the copy that happens anyway.

      However, there is definitely room for simple offload of other functions within the NIC (without full abdication of control). And that is the subject of our final post which should be up in the next day or so.

  3. Alex Bachmutsky says:

    I would say that you’ve made a lot of assumptions and oversimplified the generic problem. You are right that for some applications a softswitch could be a viable solution. However, there are a number of arguments against it:

    1. When you need IPsec/MACsec/TCP/UDP offload, you still need an intelligent NIC. In many cases you need to load balance between multiple switching instances, so with the intelligent NIC it makes sense to place that function there, because it is coming “for free”. The same is true also for a basic switching.

    2. Switches today are not that dumb: they can handle very large tables, they can process L2-L4 headers and beyond (usually the first 128-256 bytes of the packet are parsed), and some of them are even programmable to support new protocols.

    3. You can use NPUs for switching, which can provide the required flexibility, performance and scalability.

    4. Even if a softswitch can achieve the same performance level, you pay a much heavier price in terms of power consumption. And we all know well how important power is today.

    5. You’ve assumed a large packet size in your calculations, which is not always the case. Instead of providing the bit rate as a performance characteristic, a much better metric is the packet rate. For instance, a 10Gbps-capable switch can usually handle 15Mpps. I do not think you can claim the same for a softswitch.

    6. If your application also requires traffic management capabilities, the softswitch overhead goes up very quickly with the number of queues, because a dual token bucket or leaky bucket implementation is expensive in terms of CPU cycles. Of course, you can say that TM can also be offloaded to the NIC, but TM requires some level of flow classification, and with flow classification already performed in the NIC it makes sense to do the switching in the NIC as well, per the first argument.

    Just to clarify my comment: I am not saying that softswitch is not a good solution, I am saying that you should check a particular application and its further evolving requirements before deciding on the softswitch-based implementation.

    And a final comment here is about the definition of a softswitch. Remember that above-mentioned intelligent NIC could be based on the latest generation of multi-core processors (Cavium, NetLogic, Tilera, etc.) with hybrid NPU-CPU capabilities and a lot of offload, but the switching itself is still performed by software either in Linux or bare metal environment. So from the system point of view the switching is performed by the NIC, but it is still a softswitch…

  4. […] the second article the Network Heresy folks attempts to defend software switching and excoriate SR-IOV in […]

  5. EtherealMind says:

    I have at least two issues that remain unaddressed:

    1) the additional complexity of a software layer is an operational failure point: network software has a poor reliability record, and there is no reason to believe that soft switching in the hypervisor will be any more or less successful.

    2) security is not possible with soft switching in standard multi-tenant and multipurpose designs. Security baselines require more than promises of “no bugs” or “no access, we promise”. Since VMware has started to inject device drivers into guest OSes, there is no security barrier between the hypervisor and guest.

    For these fundamental reasons, I find it highly unlikely that Soft Switching will cross the adoption gap.

    • Hey, thanks for the comment.

      However, I (this is Martin) think we might be talking about two different things. By soft switching, we’re simply referring to the first-hop decision being done in software rather than shunted directly to hardware via hairpinning or passthrough. This is how 99.9% of all virtualized workloads have worked over the last decade (whether using VMware’s vswitch, the Linux bridge, or Open vSwitch). All of the large multi-tenant virtualized clouds that we know of use soft switching. So (a) it seems the complexity is manageable and (b) it seems to have already crossed the adoption gap. Perhaps you’re referring to something different than we are?

      To your second point, security. Soft switching maintains the same security model as compute virtualization, which is trust consolidation in the hypervisor. Today, all virtualized deployments which support co-resident VMs entrust the hypervisor to enforce address space isolation by managing the physical page table. If this is compromised, all isolation is compromised.

      So, if you don’t trust the hypervisor (for whatever reason), you don’t trust virtualization (and you shouldn’t be running it). However, if you do trust virtualization, then soft switching doesn’t change the basic security assumptions. Which, again, is why soft switching has fared well over the last decade in security-sensitive environments such as the federal space and large multi-tenant datacenters.

  6. […] is our third (and final) post on soft switching. The previous two posts described various hardware edge switching technologies (tagging + hairpinning, offloading to the […]

  7. I would argue that soft switching “has fared well over the last decade” and “all of the large multi-tenant virtualized clouds that we know of use soft switching” because there really hasn’t been a real-world alternative solution until relatively recently.
