the rise of soft switching part 2 5 responses and clarifications
The Rise of Soft Switching Part 2.5: Responses and clarifications
Posted: July 10, 2011
[Note: This is an unplanned vignette for the soft switching series. We’ve received a lot of e-mail in response to our previous two posts. Alex submitted a great comment which outlines many of the popular issues that have come up. So rather than having the response buried in the comment forum, we’re upgrading it to a full post. So thanks to Alex for kicking this off. ]
Alex: I would say that you’ve made a lot of assumptions and oversimplified the generic problem. You are right that for some applications a softswitch could be a viable solution. However, there are a number of arguments against it:
It’s hard to write a somewhat concise blog post without some simplification. Oversimplified? Perhaps, and if so lets take the time to flush out the details. We always welcome constructive discussion. As a start, lets take a look at some of your points.
Alex: 1. When you need IPsec/MACsec/TCP/UDP offload, you still need an intelligent NIC. In many cases you need to load balance between multiple switching instances, so with the intelligent NIC it makes sense to place that function there, because it is coming “for free”. The same is true also for a basic switching.
Noone is arguing against intelligent NICs. However, this has nothing to do with soft switching. As mentioned in response to another comment, the vSwitch can (and does) carry the context for various stateless offloads from the vNIC through to the pNIC and in the vNIC<->vNIC case can either avoid doing them (ie VLAN tagging/stripping), or can do them in software optionally [checksum offload is one that can be dropped technically, but makes tcpdump output look scary so is usually done in software for “free” inline with the copy that happens anyway].
Also, as we’ve discussed at some length, doing inter-VM switching in hardware is far from “free”. It incurs overhead and it obviates the flexibility of switching in software.
Alex: 2. Switches today are not that dumb, they can handle very large tables, they can process L2-L4 headers and beyond (usually first 128B-256B of the packet are being parsed), some of them are even programmable to be able to add new protocols.
This is definitely worth a taking a closer look at. We’ve spent years working with high-end merchant silicon switching chips trying to emulate what others are able to do with soft switching. Note that we have very good relationship with the silicon vendors, so this isn’t just blind fumbling.
Lets looks at some recent hurdles we’ve run into in practice:
IP overwrite: Virtual networking environments sometime need address isolation similar to NAT. Very few commercially available switching chipsets support general IP overwrite. And even fewer (none?) support it at a useful position in the lookup pipeline for virtual networking. Clearly this is trivially doable (and often done) in software at the edge.
Note that IP overwrite is not just an edge function within a virtualized context since two co-located VMs in different logical contexts may wish to communicatin thereby neededing overwrite.
Additional lookup for virtual context: Often to preserve logical context, additional information needs to be stuffed into packets. To do so, you have one of two choices: (a) overload and existing field (ugly because you can no longer use that field within the logical context, and at some point you’ll run out of fields to overload) (b) create a new field. Lets look at (b):
For the mass produced switching chips that we’re familiar with, adding a new field means changing the protocol parser means spinning a new chip means an 18 month wait time. Further, if you want this field to be part of a useful lookup (which often you do) then you’re probably limited to the number of bits you’re going to get. We’ve quite litterally sat with ASIC designers arguing for 32 bits over 24 for such fields when we routinly use 64 or 128 bits in software.
L2 in L3 tunneling: Until very recently, L2 in L3 was not available in 10G and even now the support is extremely limited (although next generation chips will improve the situation — though again, this required over a year of bake time).
For the L2 in L3 tunneling that is available, rarely can you use the key for lookup, and in some implementations, decap only offers a small fraction of the full bisectional bandwidth of the switching chip making it impractcal for heavy use.
Again, this isn’t a theoretical musing on edge cases. All of these issues we’ve run into in practice. And there are many others we haven’t mentioned (tunnel and ECMP limits for example).
So yes, switching silicon is flexible and getting better all the time. But there are, and will continue to be, very real limitations when compared to switching in software. Switching silicon will never be replaced. However, it’s not clear that it is the right place to perform all packet lookup and manipulation.
A final point. You can certainly achieve all of these things if you cobble together enough hardware. However, we would wager that is a very clear looser on price performance, and you still don’t overcome the flexibility and upgrade cycle problems.
Alex: 3. You can use NPUs for switching, which can provide the required flexibility, performance and scalability.
NPUs don’t have the price/performance (nor high fanout capacity) of a standard switching chip (e.g. Broadcom). And they don’t have the flexibiliy, economies of scale, and tool chain support of x86.
There is probably a good economic reason that many of the most popular edge appliances today (wan opt, load balancing, etc.) are built on souped up x86 boxes, and not esoteric NPU platforms.
This would be an interesting point to follow up on. Our contention is that classic table-based switching is the best way to build a physical fabric. The price/performance for high-fan-out switching seems unbeatable. And that x86 is the best way to handle intelligent packet manipulation at the edge.
If you (Alex) are interested in working on a follow-on post. How about together we pencil out the price/performance of a (say) Broadcom/Intel solution vs. Broadcom + some NPU + Intel. I think it would be an interesting exercise.
Alex: 4. Even if softswitching can achieve the same performance level, you pay much heavier price from the power consumption point of view. And we all know well how important is the power today.
This is true of course in terms of byte/watt, but in terms of elasticity and utilization x86 remains attractive, since for the 90% of the day that the network is at 10% utilization all those x86 cores can do useful work.
It’s fair to point out that you may want to statically provision the core, meaning it will also remain underutilized. Given this, then you’re probably right that there is power differential. It would be interesting to compare this with having an additional NPU/TCAM/whatever. That’s something we don’t know the answer to off hand. But would love to explore further.
Alex: 5. You’ve assumed in your calculations a large packet size, which is not always the case. Instead of providing the bit rate as a performance characteristic, much better metric is the packet rate. For instance, the 10Gbps capable switch can usually handle 15Mpps. I do not think you can claim the same for softswitch.
Also true, but we’re talking about edge switching. If we enumerate the workloads we’re familiar with that do anywhere near line rate in the number of VMs that fit on one host with small message sizes. Our experience is that they are either a) netperf b) hpc c) trading apps.
We can safely ignore a. b and c are legitimate exceptions, but have many other issues with virtualization and also fewer requirements. Passthrough with on NIC switching or hairpinning to the switch is probably a good choice for them. There’s a reason myrinet/quadrics/infiniband (a) existed and (b) are not heavily used outside of HPC and other specialized environments.
Alex: 6. If your application requires also traffic management capabilities, the softswitch overhead goes up very quickly with a number of queues, because dual token or leaky bucket implementation is expensive in means of CPU cycles. Of course, you can say that TM can be also offloaded to the NIC, but TM requires some level of flow classification, and with flow classification already performed in the NIC it makes sense to do the switching in the NIC as per the 1st argument.
Here we totally agree, and this is an awesome segue to our next post.
Alex: Just to clarify my comment: I am not saying that softswitch is not a good solution, I am saying that you should check a particular application and its further evolving requirements before deciding on the softswitch-based implementation.
This is is true, but borders on tautological. Clearly you can contrive a situation in which soft switching doesn’t work. However, we’re interested in refining the discourse beyond the obvious.
I think at least some of the authors are in a pretty good position to claim having surveyed a large number of applications, deployments, and customer requirements for vswitches (edge switching within virtualized environments). In our experience, there are very few environments outside of HPC and trading in which soft switching isn’t a great fit.
Alex: And a final comment here is about the definition of a softswitch. Remember that above-mentioned intelligent NIC could be based on the latest generation of multi-core processors (Cavium, NetLogic, Tilera, etc.) with hybrid NPU-CPU capabilities and a lot of offload, but the switching itself is still performed by software either in Linux or bare metal environment. So from the system point of view the switching is performed by the NIC, but it is still a softswitch…
No argument here
[Thanks again to Alex for the comment, it does flush out more of the discussion. Clearly the position is more nuanced than catagorical, and clearly the discourse is ongoing. Perhaps you’d be interested in fleshing out some of the price/performance/power arguments for a future post?]