The Rise of Soft Switching Part III: A Perspective on Hardware Offload
Posted: July 10, 2011
[This series is written by Jesse Gross, Andrew Lambeth, Ben Pfaff, and Martin Casado. Ben is an early and continuing contributor to the design and implementation of OpenFlow. He's also a primary developer of Open vSwitch. Jesse is also a lead developer of Open vSwitch and is responsible for the kernel work and datapath. Andrew has been virtualizing networking for long enough to have coined the term "vswitch", and led the vDS distributed switching project at VMware. All authors currently work at Nicira.]
This is our third (and final) post on soft switching. The previous two posts described various hardware edge switching technologies (tagging + hairpinning, offloading to the NIC, passthrough) as well as soft switching, and made the argument that in most cases, soft switching is the way to go.
We have received a bunch of e-mails (and a few comments) trying to make the case for passthrough as implemented by certain vendors. While there are a few defensible uses for it (which we tried to outline in part 2), in the general case, we still maintain that passthrough is a net loss.
Of course, “net loss” is a fairly qualitative statement. So let’s try this: if you are one of the handful of folks who have a special use case like HPC or trading and can live without many of the benefits of a modern virtualized system, then passthrough is fine. Our preference is clearly to use mass-produced NICs from multiple vendors (available in quantity today), and we prefer the benefits of software flexibility and innovation speed.
However, while we’ve been arguing that doing full offload of switching into special purpose ASICs is “no good”, it’s only a partial representation of our position. We do feel that there is room for hardware acceleration in edge switching that also preserves the benefits and flexibility of software.
And that’s the topic for this final post. We will be discussing our view of how the hardware ecosystem should evolve to enable offload of virtual switching while still maintaining many of the benefits of software.
The basic idea is to use NICs that contain targeted offloads that are performed under the direction of software, not a wholesale abdication to hardware. As a comparison point with another offload strategy, it should look more like TSO and less like TOE.
Regarding this final point (TSO vs. TOE), there may be a history lesson in the adoption and use of those approaches to offload. If the analogy holds, one could conclude that implementing stateful offloading and complex processing in I/O devices is less viable for adoption than more targeted and stateless approaches.
So with that, we’ll start by discussing how a more limited offload might look in NICs.
For soft switching, receive generally incurs the most overhead (due to a buffer copy and the difficulty in optimizing polling), and so is a good initial candidate for adding some specialized hardware muscle. Unfortunately, receive is complicated to offload because you have no knowledge of or context for the packet until you start to process it in the hypervisor, at which point it is too late.
Improved LRO Support: One place to start is simply by using LRO in more places. This doesn’t require putting any policy into the NIC, so by having the hardware manage coalescing, you can maintain complete control while reducing the number of packets (to which overhead is generally proportional). LRO has some limitations on the scenarios in which you can use it, which means that Linux-based hypervisors never use it, even on NICs that have support. It would be fairly easy for NIC vendors to eliminate these issues.
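To make the coalescing idea concrete, here is a minimal sketch of what an LRO-style merge does: back-to-back, in-order segments of the same flow are folded into one larger packet before software sees them. The `Segment` type, the `coalesce` function, and the 65535-byte cap are our own illustrative assumptions and do not describe any particular NIC; a real implementation would also have to check TCP flags, options, and sequence number wraparound, which this toy ignores.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical flow key: (src_ip, dst_ip, src_port, dst_port)
FlowKey = Tuple[str, str, int, int]


@dataclass
class Segment:
    flow: FlowKey
    seq: int        # TCP sequence number of the first payload byte
    payload: bytes


def coalesce(segments: List[Segment], max_bytes: int = 65535) -> List[Segment]:
    """Merge back-to-back, in-order segments of the same flow."""
    merged: List[Segment] = []
    for seg in segments:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.flow == seg.flow
            and last.seq + len(last.payload) == seg.seq
            and len(last.payload) + len(seg.payload) <= max_bytes
        ):
            # Contiguous with the previous packet: extend it in place.
            last.payload += seg.payload
        else:
            # New flow, a gap, or the size cap was hit: start a new packet.
            merged.append(Segment(seg.flow, seg.seq, seg.payload))
    return merged
```

Because the merge only looks at the immediately preceding packet, interleaved flows coalesce poorly in this sketch; the point is simply that fewer, larger packets reach the hypervisor without any policy living in the NIC.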
Improved MultiQueue Support: The other direction that some NICs have been moving in is adding the ability to classify on various headers and use the result to direct packets to a particular queue. These rules can be managed by software in the hypervisor. Packets that can’t be matched, either because the table isn’t sufficiently large or because the hardware can’t extract those fields, can be sent to a default queue to be handled by software.
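As a rough illustration of that split between hardware rules and software fallback, the sketch below models a fixed-size exact-match table keyed on destination MAC and VLAN. The `SteeringTable` class, its method names, and the choice of queue 0 as the default are all hypothetical; real NICs expose this through their own driver interfaces.

```python
from typing import Dict, Tuple

DEFAULT_QUEUE = 0  # assumed: unmatched packets land here for software handling


class SteeringTable:
    """Toy model of a fixed-size exact-match table on (dst MAC, VLAN)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.rules: Dict[Tuple[str, int], int] = {}

    def add_rule(self, dst_mac: str, vlan: int, queue: int) -> bool:
        """Install a rule; return False if the table is already full."""
        key = (dst_mac, vlan)
        if key not in self.rules and len(self.rules) >= self.capacity:
            return False  # no room: this traffic stays on the software path
        self.rules[key] = queue
        return True

    def classify(self, dst_mac: str, vlan: int) -> int:
        """Return the receive queue for a packet, or the default queue."""
        return self.rules.get((dst_mac, vlan), DEFAULT_QUEUE)
```

The hypervisor decides which rules are worth installing, and a full table degrades to the software path rather than to dropped traffic.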
Unlike in most physical switches, a server “management” CPU is powerful and has a good connection to the NIC, so there isn’t a huge performance hit from going to software. Packets delivered to a particular queue can actually be allocated from guest memory buffers and delivered on the CPU where the application in the VM consumes them, so there are no issues with cache line bouncing. The hypervisor is still involved in packet processing, so latency is not quite as low as with passthrough, but you don’t tie the VM directly to the NIC. This means no need for hardware-specific drivers in the guest, a smooth fallback from hardware to software, live migration, etc.
One concrete enhancement to multiqueue support that would be useful is just being able to match on more fields. Right now MAC/VLAN is pretty common, and Intel NICs also provide matching on a 5-tuple. Obviously once tunnel offloads come into the picture, you’ll also want to match on both the inner and outer headers for steering. Also, if you’re going to directly put packet data in guest memory buffers then you need to do security processing on the NIC and therefore want to match on headers that are necessary for that (within reason, we wouldn’t suggest doing stateful firewalling on the NIC for example; that’s where the software fallback is important).
For offloading transmit, a major feature that NICs can provide is the various forms of encapsulation, which are used heavily in virtual networking solutions. Today, NICs provide VLAN tagging assist. Having this extend to other forms of tunneling (MPLS, L2 in L3, etc.) would be an enormous win. However, to do so, segmentation/fragmentation optimizations (like TSO) would still have to be supported for the upper layer protocol.
For multiqueue, support for QoS could be useful for transmit. If you’re trying to implement any policy more complex than round robin in software, you quickly get contention on the QoS data structures. Unfortunately, this gets pretty complicated quickly because QoS itself is complicated and there are a lot of possible algorithms so it’s hard to get consistent results. We’re not certain how practical a hardware solution is due to the complexity, but the upside is potentially quite large if QoS is important.
Another area in which NICs may help on transmit offload is packet replication. This is only important if two conditions are met: (a) the physical fabric does not support replication (or the operator is too much of a wuss to turn it on), and (b) the application requires high performance broadcast/multicast to a reasonable fan-out.
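To show why segmentation and encapsulation have to cooperate, here is a simplified sketch of what a tunnel-aware TSO engine would do: cut one large payload into MSS-sized chunks and stamp the same outer and inner headers onto each. The function name `segment_and_encap` is hypothetical, and headers are treated as opaque bytes; real hardware would also have to fix up lengths, checksums, and sequence numbers in every copy, which is omitted here.

```python
from typing import List


def segment_and_encap(
    outer_hdr: bytes, inner_hdr: bytes, payload: bytes, mss: int
) -> List[bytes]:
    """Split payload into MSS-sized chunks, each behind both header layers."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    frames: List[bytes] = []
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        # Every wire frame carries the tunnel (outer) header followed by
        # the guest's own (inner) header and one slice of the payload.
        frames.append(outer_hdr + inner_hdr + chunk)
    return frames
```

For example, a 3000-byte payload with an MSS of 1400 becomes three frames of 1400, 1400, and 200 payload bytes. If the NIC could only segment unencapsulated TCP, software would have to do this loop itself and the benefit of TSO would be lost for tunneled traffic.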
What does this leave for physical switches?
For switches, this picture of the future — where most functionality is implemented in software with some stateless offload in the NIC — would mean two things.
First, if functionality is going to be pulled into the server, then the physical networking problem reduces to building a good fabric (without trying to overload the functionality to support virtualization primitives). Good fabrics generally mean no oversubscription (e.g. lots of multipathing) and quick convergence on failure. In our experience, standard L3 with ECMP can be used to build a fantastic fabric using fat-tree or spine topologies.
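For readers unfamiliar with ECMP, the core mechanism is small enough to sketch: hash a flow's identifying fields and use the result to pick one of several equal-cost next hops, so that packets of a single flow stay on one path while different flows spread across the fabric. The `ecmp_next_hop` function and the use of CRC32 are our own assumptions for illustration; actual switches use vendor-specific hash functions and field selections.

```python
import zlib
from typing import Sequence, Tuple

# Hypothetical flow key: (src_ip, dst_ip, protocol, src_port, dst_port)
FiveTuple = Tuple[str, str, int, int, int]


def ecmp_next_hop(flow: FiveTuple, next_hops: Sequence[str]) -> str:
    """Pick a next hop deterministically from the flow's 5-tuple."""
    if not next_hops:
        raise ValueError("at least one next hop is required")
    key = "|".join(str(field) for field in flow).encode()
    # Same flow -> same hash -> same path, which avoids reordering
    # packets within a flow.
    return next_hops[zlib.crc32(key) % len(next_hops)]
```

Note that a plain modulo like this reshuffles many flows when the set of next hops changes; that is acceptable for a sketch but is one of the details real fabrics have to handle for quick convergence on failure.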
Regardless, whatever the approach (e.g. L3, TRILL, or something proprietary), the goal is no longer overloading the physical network with the need to maintain switching at the virtual layer, but rather providing a robust backplane for the vswitches to use.
Second, soft switching is only practical for workloads running on a hypervisor. However, many virtual deployments require integration with legacy bare metal workloads (like that old Oracle server which you’re unlikely to ever virtualize).
In this case, it would be useful to have hardware switches that expose an interface (like OpenFlow) which would allow them to interoperate with soft switches running on lots and lots of hypervisors. It’s likely that this can be achieved through fairly basic support for encap/decap and perhaps some TCAM programmability for QoS and filtering.
Far from it. These musings are mostly a rough sketch built around an intuition of how the ecosystem should evolve to support the flexibility of virtualization and software and the switching speeds and cost/performance of specialized forwarding hardware.
The high-level points are that most virtual edge networking functions should be implemented in software with targeted and (probably) stateless hardware offload that doesn’t obviate software flexibility or control. The physical network should focus on becoming an awesome, scalable fabric and providing some method of integrating legacy applications.
The exact convergence on the feature set of each of these components is anyone’s guess, and can only be bred out of solid engineering, deployment, and experience.