Posted: July 25, 2011
I am working on multiple upcoming posts, but they are taking some time to gel. In the meantime, for those who are interested, I’ve also been helping on a series of posts of openvswitch.org describing the OpenStack Quantum network service and how Open vSwitch fits into the picture. The first post was put up today at:
Posted: July 10, 2011
[This series is written by Jesse Gross, Andrew Lambeth, Ben Pfaff, and Martin Casado. Ben is an early and continuing contributor to the design and implementation of OpenFlow. He’s also a primary developer of Open vSwitch. Jesse is also a lead developer of Open vSwitch and is responsible for the kernel work and datapath. Andrew has been virtualizing networking for long enough to have coined the term “vswitch”, and led the vDS distributed switching project at VMware. All authors currently work at Nicira.]
This is our third (and final) post on soft switching. The previous two posts described various hardware edge switching technologies (tagging + hairpinning, offloading to the NIC, passthrough) as well as soft switching, and made the argument that in most cases, soft switching is the way to go.
We have received a bunch of e-mails (and a few comments) trying to make the case for passthrough as implemented by certain vendors. While there are a few defendable uses for it (which we try to outline in part 2), in the general case, we still maintain that passthrough is a net loss.
Of course, “net loss” is a fairly qualitative statement. So lets try this: if you are one of the handful of folks that have a special use case like HPC and trading and can live without many of the benefits of modern virtualized system, the passthrough is fine. Our preference is clearly to use mass produced NICs from multiple vendors (available in quantity today), and we prefer the benefits of software flexibility and innovation speeds.
However, while we’ve been arguing that doing full offload of switching into special purpose ASICs is “no good”, it’s only a partial representation of our position. We do feel that there is room for hardware acceleration in edge switching that also preserves the benefits and flexibility of software.
And that’s the topic for this final post. We will be discussing our view of how the hardware ecosystem should evolve to enable offload of virtual switching while still maintain many of the benefits of software.
The basic idea is to use NICs that contain targeted offloads that are performed under the direction of software, not a wholesale abdication to hardware. As a comparison point with another offload strategy, it should look more like TSO and less like TOE.
Regarding this final point (TSO vs. TOE), there may be a history lesson in the adoption and use of those approaches to offload. If the analogy holds, one could conclude that implementing stateful offloading and complex processing in I/O devices is less viable for adoption than more targeted and stateless approaches.
So with that, we’ll start by discussing how a more limited offload might look in NICs.
For soft switching, receive generally incurs the most overhead (due to a buffer copy and the difficulty in optimizing polling), and so is a good initial candidate for adding some specialized hardware muscle. Unfortunately however, receive is complicated to offload because you have no knowledge and context for the packet until you start to process it in the hypervisor, at which point it is too late.
Improved LRO Support: One place to start is simply by using LRO in more places. This doesn’t require putting any policy into the NIC, so by having the hardware manage coalescing, you can maintain complete control while reducing the number of packets (which overhead is generally proportional to). LRO has some limitations on the scenarios that you can use it in, which means that Linux based hypervisors never use it, even on NICs that have support. It would be fairly easy for NIC vendors to eliminate these issues.
Improved MultiQueue Support: The other direction that some NICs have been moving in is adding the ability to classify on various headers and use the result to direct packets to a particular queue. These rules can be managed by software in the hypervisor. Rules that can’t be matched either because the table isn’t sufficiently large or because the hardware can’t extract those fields can be sent to a default queue to be handled by software.
Unlike most physical switches, a server “management” CPU is powerful with a good connection to the NIC so there isn’t a huge performance hit by going to software. Packets delivered to a particular queue can actually be allocated from guest memory buffers and delivered on the CPU where the application in the VM consumes them, so there are no issues with cache line bouncing. The hypervisor is still involved in packet processing so the latency is not exactly as efficient as passthrough but you don’t tie the VM directly to the NIC. This means no need for hardware-specific drivers in the guest, a smooth fallback from hardware to software, live migration, etc.
One concrete enhancement to multiqueu support that would be useful is just being able to match on more fields. Right now MAC/vlan is pretty common, and Intel NICs also provide matching on a 5 tuple. Obviously once tunnel offloads come into the picture, you’ll also want to match on both the inner and outer headers for steering. Also, if you’re going to directly put packet data in guest memory buffers then you need to do security processing on the NIC and therefore want to match on headers that are necessary for that (within reason, we wouldn’t suggest doing stateful firewalling on the NIC for example; that’s where the software fallback is important).
For offloading transmit, a major feature that can be provided by the NICs are various forms of encapsulation which is used heavily in virtual networking solutions. Today, NICs provide VLAN tagging assist. Having this extend to other forms of tunneling (mpls, L2 in L3, etc.) would be an enormous win. However, to do so, segmentation/fragmentation optimizations (like TSO) would still have to be supported for the upper layer protocol.
For multique, support for QoS could be useful for transmit. If you’re trying to implement any policy more complex than round robin in software, you quickly get contention on the QoS data structures. Unfortunately, this gets pretty complicated quickly because QoS itself is complicated and there are a lot of possible algorithms so it’s hard to get consistent results. We’re not certain how practical a hardware solution is due to the complexity, but the upside is potential quite large if QoS is important.
Another area which NICs may help on transmit offload is packet replication. This is only important if two conditions are met: (a) the physical fabric does not support replication (or the operator is to much of a wuss to turn in on) (b) the application requires high performance broadcast/multicast to a reasonable fan-out.
What does this leave for physical switches?
For switches, this picture of the future — where most functionality is implemented in software with some stateless offload in the NIC — would mean two things.
First, if functionality is going to be pulled into the server, then the physical networking problem reduces to building a good fabric (without trying to overload the functionality to support virtualization primitives). Good fabrics generally mean no oversubscription (e.g. lots of multipathing) and quick convergence on failure. In our experience, standard L3 with ECMP can be used to build a fantastic fabric using fat-tree or spine topologies.
Regardless, whatever the approach (e.g. L3, TRILL, or something proprietary), the goal is no longer overloading the physical network with the need to maintain switching at the virtual layer, but rather providing a robust backplane for the vswitches to use.
Second, soft switching is only practical for for worklaods running on a hypervisor. However, many virtual deployments require integration with legacy bare metal workloads (like that old oracle server which you’re unlikely to ever virtualize).
In this case, it would be useful to have hardware switches that expose an interface (like OpenFlow) which would allow them to interoperate with soft switches running on lots and lots of hypervisors. It’s likely that this can be achieved through fairly basic support for encap/decap and perhaps some TCAM programmability for QoS and filtering.
Far from it. These musings are mostly a rough sketch built around an intuition of how the ecosystem should evolve to support the flexibility of virtualization and software and the switching speeds and cost/performance of specialized forwarding hardware.
The high-level points are that most virtual edge networking functions should be implemented in software with targeted, and (probably) stateless hardware offload that doesn’t obviate software flexibility or control. The physical network should focus on becoming an awesome, scalable fabric and providing some method of integrating legacy applications.
The exact convergence on feature set of each of these components is anyones guess, and can only be bred out of solid engineering, deployment, and experience.
Posted: July 10, 2011
[Note: This is an unplanned vignette for the soft switching series. We’ve received a lot of e-mail in response to our previous two posts. Alex submitted a great comment which outlines many of the popular issues that have come up. So rather than having the response buried in the comment forum, we’re upgrading it to a full post. So thanks to Alex for kicking this off. ]
Alex: I would say that you’ve made a lot of assumptions and oversimplified the generic problem. You are right that for some applications a softswitch could be a viable solution. However, there are a number of arguments against it:
It’s hard to write a somewhat concise blog post without some simplification. Oversimplified? Perhaps, and if so lets take the time to flush out the details. We always welcome constructive discussion. As a start, lets take a look at some of your points.
Alex: 1. When you need IPsec/MACsec/TCP/UDP offload, you still need an intelligent NIC. In many cases you need to load balance between multiple switching instances, so with the intelligent NIC it makes sense to place that function there, because it is coming “for free”. The same is true also for a basic switching.
Noone is arguing against intelligent NICs. However, this has nothing to do with soft switching. As mentioned in response to another comment, the vSwitch can (and does) carry the context for various stateless offloads from the vNIC through to the pNIC and in the vNIC<->vNIC case can either avoid doing them (ie VLAN tagging/stripping), or can do them in software optionally [checksum offload is one that can be dropped technically, but makes tcpdump output look scary so is usually done in software for “free” inline with the copy that happens anyway].
Also, as we’ve discussed at some length, doing inter-VM switching in hardware is far from “free”. It incurs overhead and it obviates the flexibility of switching in software.
Alex: 2. Switches today are not that dumb, they can handle very large tables, they can process L2-L4 headers and beyond (usually first 128B-256B of the packet are being parsed), some of them are even programmable to be able to add new protocols.
This is definitely worth a taking a closer look at. We’ve spent years working with high-end merchant silicon switching chips trying to emulate what others are able to do with soft switching. Note that we have very good relationship with the silicon vendors, so this isn’t just blind fumbling.
Lets looks at some recent hurdles we’ve run into in practice:
IP overwrite: Virtual networking environments sometime need address isolation similar to NAT. Very few commercially available switching chipsets support general IP overwrite. And even fewer (none?) support it at a useful position in the lookup pipeline for virtual networking. Clearly this is trivially doable (and often done) in software at the edge.
Note that IP overwrite is not just an edge function within a virtualized context since two co-located VMs in different logical contexts may wish to communicatin thereby neededing overwrite.
Additional lookup for virtual context: Often to preserve logical context, additional information needs to be stuffed into packets. To do so, you have one of two choices: (a) overload and existing field (ugly because you can no longer use that field within the logical context, and at some point you’ll run out of fields to overload) (b) create a new field. Lets look at (b):
For the mass produced switching chips that we’re familiar with, adding a new field means changing the protocol parser means spinning a new chip means an 18 month wait time. Further, if you want this field to be part of a useful lookup (which often you do) then you’re probably limited to the number of bits you’re going to get. We’ve quite litterally sat with ASIC designers arguing for 32 bits over 24 for such fields when we routinly use 64 or 128 bits in software.
L2 in L3 tunneling: Until very recently, L2 in L3 was not available in 10G and even now the support is extremely limited (although next generation chips will improve the situation — though again, this required over a year of bake time).
For the L2 in L3 tunneling that is available, rarely can you use the key for lookup, and in some implementations, decap only offers a small fraction of the full bisectional bandwidth of the switching chip making it impractcal for heavy use.
Again, this isn’t a theoretical musing on edge cases. All of these issues we’ve run into in practice. And there are many others we haven’t mentioned (tunnel and ECMP limits for example).
So yes, switching silicon is flexible and getting better all the time. But there are, and will continue to be, very real limitations when compared to switching in software. Switching silicon will never be replaced. However, it’s not clear that it is the right place to perform all packet lookup and manipulation.
A final point. You can certainly achieve all of these things if you cobble together enough hardware. However, we would wager that is a very clear looser on price performance, and you still don’t overcome the flexibility and upgrade cycle problems.
Alex: 3. You can use NPUs for switching, which can provide the required flexibility, performance and scalability.
NPUs don’t have the price/performance (nor high fanout capacity) of a standard switching chip (e.g. Broadcom). And they don’t have the flexibiliy, economies of scale, and tool chain support of x86.
There is probably a good economic reason that many of the most popular edge appliances today (wan opt, load balancing, etc.) are built on souped up x86 boxes, and not esoteric NPU platforms.
This would be an interesting point to follow up on. Our contention is that classic table-based switching is the best way to build a physical fabric. The price/performance for high-fan-out switching seems unbeatable. And that x86 is the best way to handle intelligent packet manipulation at the edge.
If you (Alex) are interested in working on a follow-on post. How about together we pencil out the price/performance of a (say) Broadcom/Intel solution vs. Broadcom + some NPU + Intel. I think it would be an interesting exercise.
Alex: 4. Even if softswitching can achieve the same performance level, you pay much heavier price from the power consumption point of view. And we all know well how important is the power today.
This is true of course in terms of byte/watt, but in terms of elasticity and utilization x86 remains attractive, since for the 90% of the day that the network is at 10% utilization all those x86 cores can do useful work.
It’s fair to point out that you may want to statically provision the core, meaning it will also remain underutilized. Given this, then you’re probably right that there is power differential. It would be interesting to compare this with having an additional NPU/TCAM/whatever. That’s something we don’t know the answer to off hand. But would love to explore further.
Alex: 5. You’ve assumed in your calculations a large packet size, which is not always the case. Instead of providing the bit rate as a performance characteristic, much better metric is the packet rate. For instance, the 10Gbps capable switch can usually handle 15Mpps. I do not think you can claim the same for softswitch.
Also true, but we’re talking about edge switching. If we enumerate the workloads we’re familiar with that do anywhere near line rate in the number of VMs that fit on one host with small message sizes. Our experience is that they are either a) netperf b) hpc c) trading apps.
We can safely ignore a. b and c are legitimate exceptions, but have many other issues with virtualization and also fewer requirements. Passthrough with on NIC switching or hairpinning to the switch is probably a good choice for them. There’s a reason myrinet/quadrics/infiniband (a) existed and (b) are not heavily used outside of HPC and other specialized environments.
Alex: 6. If your application requires also traffic management capabilities, the softswitch overhead goes up very quickly with a number of queues, because dual token or leaky bucket implementation is expensive in means of CPU cycles. Of course, you can say that TM can be also offloaded to the NIC, but TM requires some level of flow classification, and with flow classification already performed in the NIC it makes sense to do the switching in the NIC as per the 1st argument.
Here we totally agree, and this is an awesome segue to our next post.
Alex: Just to clarify my comment: I am not saying that softswitch is not a good solution, I am saying that you should check a particular application and its further evolving requirements before deciding on the softswitch-based implementation.
This is is true, but borders on tautological. Clearly you can contrive a situation in which soft switching doesn’t work. However, we’re interested in refining the discourse beyond the obvious.
I think at least some of the authors are in a pretty good position to claim having surveyed a large number of applications, deployments, and customer requirements for vswitches (edge switching within virtualized environments). In our experience, there are very few environments outside of HPC and trading in which soft switching isn’t a great fit.
Alex: And a final comment here is about the definition of a softswitch. Remember that above-mentioned intelligent NIC could be based on the latest generation of multi-core processors (Cavium, NetLogic, Tilera, etc.) with hybrid NPU-CPU capabilities and a lot of offload, but the switching itself is still performed by software either in Linux or bare metal environment. So from the system point of view the switching is performed by the NIC, but it is still a softswitch…
No argument here ?
[Thanks again to Alex for the comment, it does flush out more of the discussion. Clearly the position is more nuanced than catagorical, and clearly the discourse is ongoing. Perhaps you’d be interested in fleshing out some of the price/performance/power arguments for a future post?]