NVGRE, VXLAN and what Microsoft is Doing Right

There has been more movement in the industry towards L2 in L3 tunneling from the edge as the approach for tackling issues with virtual networking.

Hot on the heals of the VMWare/Cisco-led VXLAN announcement, an Internet draft on NVGRE (authored by Microsoft, Intel, Dell, Broadcom, and Arista, but my guess is that Microsoft is the primary driver) showed up with relatively little fanfare. You can check it out here.

In this blog post, I’ll briefly introduce NVGRE . However, I’d like to spend more time providing broader context on where these technologies fit into the virtual networking solution space. Specifically, I’ll argue that the tunneling protocol is a minor aspect of a complete solution, and we need open standards around the control and configuration interface as much as we need to standardize the wire format.

We’ll get to that in a few paragraphs. But first, What is NVGRE?

NVGRE is very similar to VXLAN (my comments on VXLAN here). Basically, it uses GRE as a method to tunnel L2 packets across an IP fabric, and uses 24 bits of the GRE key as a logical network discriminator (which they call a tenant network ID or TNI). By logical network discriminator, I mean it indicates which logical network a particular packet is part of. Also like VXLAN, logical broadcast is achieved through physical multicast.

A day in the life of a packet is simple. I’ll sketch out the case of a packet being sent from a VM. The vswitch, on receiving a packet from a vNic, does two lookups (a) it uses the destination MAC address to determine which tunnel to send the packet to (b) it uses the ingress vNic to determine the tenant network ID. If the MAC in (a) is known, the vswitch will cram the packet into the associated point-to-point GRE tunnel, setting the GRE key to the tenant network ID. If it isn’t know, it will tunnel the packet to the multicast address associated with that tenant network ID. Easy peasy.

While architecturally NVGRE is very similar to VXLAN, there are some differences that have practical implications.

On positive side, NVGRE’s use of GRE eases the compatibility requirement for existing hardware and software stacks. Many of the switching chips I’m familiar with already support GRE, so porting it to these environments is likely much easier than a non-supported tunneling format.

As another example, on a quick read of the RFC, it appears that Open vSwitch can already support NVGRE — it supports GRE, allows for looking and setting the GRE key (including masking, so can be limited to 24 bits), and it supports learning and explicit population of the logical L2 table. In Open vSwitch’s case, all of this can be driven through OpenFlow and the Open vSwitch configuration protocol.

On the other hand, GRE does not take advantage of a standard transport protocol (TCP/UDP), so logical flow information cannot be reflected in the outer header ports like you can for VXLAN. This means that ECMP hashing in the fabric cannot provide flow-level granularity which is desirable to take advantage of all bandwidth. As the protocol catches on, this is simple to address in hardware and will likely happen. [Update:  Since writing this, I've been told that some hardware can use the GRE key in the ECMP hash, which improves the granularity of load balancing in the fabric.  However, per-flow load-balancing  (over the logical 5 tuple) is still not possible.]

OK, so there you have it. A very brief glimpse of NVGRE. Sure there is more to say, but I have a hard time getting excited about tunneling protocols. Why? That is what I’m going to talk about next.

Tunneling vs. Network Virtualization

So, while it is moderately interesting to explore NVGRE and VXLAN, it’s important to remember that the tunneling formats are really a very minor (and easily changed) component of a full network virtualization solution. That is, how a system tunnels, whether over UDP, or GRE, or CAPWAP or something else, doesn’t define what functions the system provides. Rather it specifies what the tunnel packets look like on the wire (an almost trivial consideration).

It’s also important to remember that these two proposals (VXLAN and NVGRE) in specific, while an excellent step in the right direction, are almost certainly going to see a lot of change going forward. As they stand now, they are fairly limited. For example, they only support L2 within the logical network. Also, they both abdicate a lot of responsibility to multicast. This limits scalability on many modern fabrics, requires the deployment of multicast which many large operators eschew, and has real shortcomings when it comes to speed of provisioning a new network, manageability, and security.

Clearly these issues are going to be addressed as the protocols mature. For example, it’s likely that there will be support for L2 and L3 in the logical network, as well as ACLs, QoS primitives, etc. Also, it is likely that there will be some primitives for support secure group joins, and perhaps even support for creating optimized multicast trees that don’t rely on the fabric (some existing solutions already do this).

So if we step back and look at the current situation. We have “standardized” the tunneling protocol and some basic mechanism for L2 in logical space, and not much else. However, we know that these protocols will need evolve going forward. And it’s very likely that this evolution will have non-trivial dependency on the control path.

Therefore, I would argue that the real issue is not “lets standardize the tunneling protocol” but rather, “lets standardize the control interface to configure the tunnels and associated state so system developers can use it address these and future challenges”. That is, if vswitches and pswitches of the future have souped up versions of NVGRE and VXLAN, something is still going to have to provide the orchestration of these primitives.  And that is where the real function comes from. Neither of these proposals have addressed this, for example, the interface to provide the mapping from a vNIC to a tenant network ID is unspecified.

So what might this look like? I’ll use Open vSwitch to provide a specific example. Open vSwitch is a great platform for NVGRE (and soon VXLAN). It supports tens of thousands of GRE tunnels without a problem. And it allows setting and doing lookups on the GRE key. But what makes it immediately useful is not that it supports these tunneling primitives, but that a third party can pick it up, and use OpenFlow to configure those keys, tunnels, and any other network state (ACLs, L3, etc.) to build a full solution.

And that is the crux of the issue. If you get a solution that supports VXLAN or NVGRE, even though they are standardized, you are only getting a piece of the solution. The second piece we need, and should all ask for, is an open standard for configuring this interface. We (Open vSwitch) use OpenFlow. But any standard will do.

What about Microsoft?

From what I can tell (and admittedly, I don’t know very much), this seems to be what Microsoft is getting right. Rather than just specify a tunneling protocol, it appears that they’re also opening up the vswitch interface to support implementations from multiple vendors. While this itself doesn’t provide an open interface to virtual networking configuration, it does allow someone to do that work. For example, NEC has announced they will be releasing an OpenFlow compatible vSwitch for Windows Server 8. My guess is that this is a port of Open vSwitch. (Hey NEC, if you are using Open vSwitch, care to share the changes with the rest of the community?)

In any case, NEC will probably charge you for it, but we will work hard to make a full Open vSwitch port for Windows Server 8 available. Of course it will be open source, so we encourage users to get it for free and innovate in the source as well as use the open management protocols.

We’ll also be demoing Open vSwitch support for NVGRE within OpenStack/Quantum, so stay tuned for that.

Anyway, kudos to Microsoft for supporting L2 in L3 and adding a contender to the pool. It will be interesting to see what finally pops out of the IETF standardization process. I presume it will be some agglomeration achieved through consensus.

And double kudos for opening the vswitch interface. I think we’re going to see a lot of cool innovation in networking around Windows Server 8. And that’s good for all of us.


The Rise of Soft Switching Part IV: Comments on the Hardware Supply Chain

[This post is written by Alex Bachmutsky and Martin Casado. Alex is a Distinguished Engineer at Ericsson in Silicon Valley, driving system architecture aspects of the company's next generation platform. He is the author of the book “Platform Design for Telecommunication Gateways,” and co-authored “WiMAX Evolution: Emerging Technologies and Applications”.]

This is yet another (unplanned) addendum to the soft switching series.

Our previous posts were arguing that for virtual edge switching, using the servers compute was the best point in the design space given the current hardware landscape. The basic argument was that it is possible to achieve 10G from the server in soft switching by dedicating a single core (about $60 – $120 of silicon depending on how you count). So, it is difficult to make the argument for passthrough plus specialized hardware for two reasons. First, given currently available choices, price performance will almost certainly be higher with x86. Second, you loose many of the benefits of virtualization that are retained when using soft switching (which we’ve explained here).

Alex submitted two very good (and very detailed) responses to our claims (you can see the summary here). In them he argued that looking at the basic components costs, a hardware offload solution should be both lower power, and provide better cost performance than doing forwarding in x86. He also argued that switching and NPU chipsets are flexible and support mature, high-level development environments, which may make them suitable for the virtual networking problem.

Alex is an expert in this area, he’s incredibly experienced, and, at least partially, he’s right. Specialized hardware should be able to hit better price/performance and lower power. And it is true that development environments have come a long way.

So what’s the explanation for the discrepancy in the view points?

That is what we’ve teamed up to discuss in this post. It turns out the differences in view are twofold. The first difference comes from the perspective of a large company (and commensurate purchasing power) versus that of a handful of customers. While today, “intelligent NICs” are a least twice (generally more like 3-5x) as expensive as a pure soft solution, this is a supply side issue and isn’t justifiable by the bill of materials (BoM). A company sufficiently large with sufficient investment could overcome this obstacle.

The second difference is distributing an appliance versus distributing software. Distributing an appliance allows ultimate control over hardware configuration, development environment, runtime environment etc. Distributing software, on the other hand, often requires dealing with multiple hardware configurations, and complex software interactions both with drivers and the operating system.

Lets start with a quick resync:

In our original post, we argued that for most workloads, the most flexible and cost effective method for doing networking at the virtual edge is to do soft switching on the server (which is what 99% of all virtual deployments have done over the last decade).

The other two options we considered are using a switching chip (either in the NIC or the ToR switch) or an NPU (most likely on the NIC due to port density issues).

We’ll discuss Switching Chips First:

In our experience, non-NPU chipsets in the NIC are not sufficiently flexible for networking at the virtualized edge. The limitations are manyfold, size of lookup of metadata between tables, number of stages in the lookup pipeline, and economically viable table sizes (sometimes table can be increased, for instance, by external TCAM, but it is an expensive option and hence ignored here for budget sensitive implementations we are focusing on).  A good concrete example is table space for tunnels.  Network virtualization solutions often use lots and lots of tunnels (N^2 in the number of servers is not unusual).  Soft switches have no problem supporting tens of thousands of  tunnels without performance degradation.  This is one or two orders of magnitude more than you’ll find on existing non-NPU NIC chipsets.

So this is simply a limitation in the supply chain, perhaps a future NIC will be sufficiently flexible for the virtual networking problem, until then, we’ll omit this as an option.

ToR:

We both agreed that standard silicon is getting very close to being useful for edge virtual switching. The next generation of 10G switches appear to be particularly compelling. So while still suffering the flexiliby limitations and the shortcomings of hairpinning on inter-VM traffic (like reducing edge link bandwidth and ToR switching capacity), and table space issues, they do appear to be a viable option for some set of the switching decision going forward. This is due to improved support for tagging and tunneling, increased sizes in ACL tables, and improved lookup generality between tables.

So we’re hopeful that next generation ToR switches have an open interface like OpenFlow so that the network virtualization layer can manage the forwarding state.

So what about NPUs?

To be clear, Open vSwitch has already been ported to multiple NPU-NIC platforms. In the cases I am familiar with, inter-VM traffic through the CPU is far slower (presumedly due to DMA overhead) than keeping it on the x86. However, off server traffic requires less CPU.

To explore this solution space in more detail, we’ll subdivide the NPU space into two groups: classical NPUs and multicore-based NPUs.

Many NPUs in the former group have the required level of the processing flexibility, some even have integrated general-purpose CPUs that can be used to implement, for instance, OpenFlow-based control plane. However, their data plane processing is usually based on proprietary microcode that may be a development hurdle in terms of available expertise and toolchain.

Any investment to develop on such an NPU is a sunk cost, both in training the developers to the internal details of the hardware featureset and limitations, and developing the code to work within the environment. It is very difficult (if not impossible), for example, to port the microcode written for one such NPU to the NPU of another vendor.

So while this route is not only viable one but very likely beneficial for creating a mass produced appliance, the investment for supporting ”yet another NIC” in a software distribution is likely unjustifiable. And without sufficient economies of scale for the NPU (which can be ensured by a large vendor) the cost to the consumer would likely stunt adoption.

The latter group is based on general-purpose multicore CPUs, but it integrates some NPU-like features, such as streaming I/O, HW packet pre-classification, HW packet re-ordering and atomic flow handling, some level of traffic management, special HW offload engines and more. Since they are based on general-purpose processors, they could be programmed using similar languages and tools as other soft switch solutions. These NPUs can be either used as a main processing engine, or alternatively offload some tasks from the main CPU when integrated on intelligent NIC cards.

If the supply side can be ironed out, both in terms of purchase model and price, this would seem to strike a reasonable balance between development overhead, portability, and price/power/performance. Below, we’ll discuss some of the other challenges that need to be overcome on the supply side to be competitive with x86.

Price:

Let’s start with the economics of switching at the edge. First, in our experience, workloads in the cloud are either mostly idle, or handling lots of TCP traffic, generally HTTP offsite (trading and HPC being the obvious exceptions). Thus 10G at MTU size packets is not only sufficient, it is usually overkill. That said, it is good to be prepared as data usage continues to grow.

Soft switching can be co-resident with the management domain. This has been the standard with Xen deployments in which the Linux bridge or Open vSwitch shares the CPUs allocated to the management domain. For 1G, allocating a single core to both is most likely sufficient. For 10G, allocating an extra core is necessary.

So if we assume worst case (requiring a full core for networking), given a fairly modern CPU, a core weighs in at $60-$100. Motherboard and packaging for that core is probably another $50 (no need for additional memory or harddrive space for soft switching).

On the high end, if you could get specialized hardware for less than (say) $100 – $150 over the price of a standard NIC per server, then there could be an argument for using them. And since often NICs are bonded for resilience, it should probably be half of that or at least include dual ports for the same price for fault resiliency.

Unfortunately, to our knowledge, such a NIC doesn’t exist at those price points. If we’re wrong please let us know (price, relative power and performance) and we’ll update this. To date, all NPU-based NICs we’ve looked are 2-5 times what they should be to pencil out competitively.

However, clearly it should be possible to make a NIC that is optimized for virtual edge switching as the raw components (when purchased at scale) do not justify these high price tags. We’ve found that usually NPU chip vendors do not manufacture their own NIC cards (or do that only for evaluation purposes), and 3rd parties charge a significant premium for their role in the supply chain. Therefore, we posit that this is largely a supply chain issue and perhaps due to the immaturity of the market. Regular NICs are a commodity, intelligent NICs are still a “luxury” without a proportional relationship to their BOM differences.

Of course, this argument is based on traditional cloud packet loads. If the environment instead was hosting an application with small-sized and/or latency sensitive traffic (e.g. voice), then system design criteria would be very different and a specialized hardware solution becomes more compelling.

Drivers:

Drivers are non-existant or poor for specialized NICs. If you look at the HCL (hardware compatibility lists) for VMWare, Citrix, or even the common Linux distributions, NPU-based NICs are rarely (if ever) supported. Generally software solutions rely on the underlying operating system to provide the appropriate drivers. Going with a nonstandard NIC requires getting the hypervisor vendors to support it, which is unlikely. Even multicore NPUs are not supported well, because they are based on non-x86 ISA (MIPS, ARM, PPC) with much more limited support. Practically, it becomes chicken-and-egg problem: major hypervisor vendors don’t spend enough efforts to support lower volume multicore NPUs, and system developers don’t select these NPUs because of lack of drivers causing low volume. Any break out of that vicious circle may change the equation.

Tool Chain:

While the development tool set for embedded processors has come a long way, it still doesn’t match that of standard x86/Linux environment. To be clear, this is only a very minor hurdle to a large development shop with in house expertise in embedded development.

However, from the perspective of a smaller company, dealing with embedded development often means finding and employing relatively specialized developers, more expensive tools (often), and a more complex debug and testing environment. My (Martin’s) experience with side-by-side projects working both on specialized hardware environments, and standard (non-embedded) x86 server’s is that the former is at least twice as slow.

Where does that leave us?

A quick summary of the discussion thus far is as follows.

Considering only the cost and performance properties of specialized hardware, a virtual switching solution using them should have better price performance, and lower power than an equivalent x86 solution. However, few virtualized workloads could actually take advantage of the additional hardware. And supply side issues (no such component exists today at a competitive price point), and complications in inserting specialized hardware into today’s server ecosystem, remain enormous hurdles to realizing this potential. Which is probably why 99% of all virtual deployments over the last decade (and certainly all of the largest virtual operations in the world) have relied on soft switching.

Alex is right to remind us that it is not always a technology limitation, and that there is a room for hardware to come in at the right price/performance if the supply chain and development support matures. Which it very well may be. We hope that intelligent NICs will become more affordable and their price will reflect the BOM. We also hope that NPU vendors will sponsor in some ways the development of hypervisor and middleware layers by major SW suppliers as opposed to some proprietary solutions available on the market today.

As I’ve mentioned previously, there are a number of production deployments which use Open vSwitch directly on hardware. In an upcoming blog post, we’ll dig into those use cases a little further.


OpenFlow’s Beginnings

NetworkWorld recently posted a blog that includes an interview with me (Martin) on OpenFlow’s roots and relevance in the current ecosystem.

http://www.networkworld.com/community/node/78613

While it’s generous to label me as the “inventor” of OpenFlow, the accolade is somewhat misleading. So I wanted to jot off a quick post to clarify things a bit.

To begin with, there  is no sole inventor of OpenFlow.  It’s been the product of many tens of individuals and organizations, over multiple years, and really, it’s still in its infancy and can be expected to evolve drastically going forward.

Regarding the specifics of its origins.  I wrote the first, half-baked, unofficial draft in late 2007 with Nick McKeown (my advisor) as a follow on to research we were doing at Stanford with Scott Shenker, and Justin Pettit (among others). The first contributors to a “full” spec were Ben Pfaff, Justin, Nick, and myself. Ben and Justin did the lion’s share of the work evolving the protocol into something that could actually be used and implemented. And Justin was the primary editor and wrote the bulk of the initial spec text.

Justin, Ben, and I were at Nicira at the time and needed a protocol for remote switch management, so we coordinated with Nick to develop something that could be useful to a wider community.

Within a few months, we handed the spec, along with a working implementation (primarily written by Ben and Justin) to Stanford as there was growing interest in building a community effort around it. Since then, we’ve continued to play a limited role in its development.  However, there have been many very influential contributors since (Rajiv Ramanathan, Jean Tourhilles, Glen Gibb, Brandon Heller, and Ed Crabbe just to name a very few).

So there you have it. The largely uninteresting, and somewhat convoluted origins of OpenFlow.


VXLAN: Moving Towards Network Virtualization

Thanks, Steve Herrod.

If you missed it, VMware, Cisco, Broadcom and others announced support for VXLAN a new tunneling format. For the curious, here is the RFC draft.

So quickly, what is VXLAN?  It’s an L2 in L3 tunneling mechanism that supports L2 learning and tenancy information.

And what does it do for the world?  A lot really.  There currently is fair bit of effort in the virtualization space to address issues of L2 adjacency across subnets, VM mobility across L3 boundaries, overlapping IP spaces, service interposition, and a host of other sticky networking shortcomings.  Like VXLAN, these solutions are often built on L2 in L3 tunneling, however the approaches are often form-fitted and rarely are they compatible with another implementation.

Announcing support for a common approach is a very welcome move from the industry. It will, of course, facilitate interoperability between implementations, and it will pave the way for broad support in hardware (very happy to see Broadcom and Intel in the announcements).

Ivan (as usual, ahead of the game) has already commented on some of the broader implications. His post is worth a read.

I don’t have too much to add outside of a few immediate comments.

First, we need an open implementation of this available.  The Open vSwitch project has already started to dig in and will try and have something out soon — perhaps for the next OpenStack summit. More about this as the effort moves along.

Second, this is a *great* opportunity for the NIC vendors to support acceleration of VXLAN in hardware.  It would be particularly nice if LRO support worked in conjunction with the tunneling so that interrupt overhead from the VMs is minimized, and the hardware handles segmentation and coalescing

Finally, it’s great to see this sort of validation for the L3 fabric, which is an excellent way to build a datacenter network.  IGP + ECMP + a well understood topology (e.g. CLOS or Flattened butterfly) is the foundation of many large fabrics today, and I predict many more going forward.

Exciting times.


What Might an SDN Controller API Look Like? (and should we standardize it?)

[This post is written by Teemu Koponen, and Martin Casado. Teemu has architected and been involved in the implementation of multiple widely used OpenFlow controllers.]

Over the last year, there has been some discussion within the OpenFlow and broader SDN community about designing and standardizing a controller API in addition to the switch interface.

To be honest, standardizing a controller API sounds like a poor idea. The controller is software, not a protocol. To our knowledge there are 9 controllers in the wild today, and more than half of them are open source. If the community wants an open software platform, it should divert resources from arguing over standards to building open source software.

(yeah ok, this comic is only tangentially relevant … )

In addition to being a waste of time, premature standardization could be detrimental to the SDN effort. Minimally, it will unnecessarily constrain a fledgling software ecosystem. Software innovation should come through development and experience with deployment, not through consensus built on prophecies.

Further, it’s not clear how to design and standardize an interface to a (non-existent) large software system out of whole cloth. Even Unix which started in the early 70s did not get standardized until the mid/late 80s.

For SDN in particular, there is really very little experience building controllers systems in the wild today. We (the authors) have built four controllers, three which have seen production use, and two of which have multiple products built on them from different organizations. And still, we would certainly not argue for standardizing a particular interface. Hell, we’re not even clear if there is a one-size-fits-all interface for controllers targeted at drastically different deployment environments. In fact, our experience suggests the contrary.

In our opinion, if an interface is going to be standardized, it should follow the software model of standardization. Either (a) take a system that has demonstrated broad success and generality over many years and call it a standard, or (b) during the design phase of a particular software project, have the *developers* work to create a standard API with the appropriate community feedback (which is probably product and system developers).

But most importantly, and it really bears repeating, we all win if we let an unbridled software ecosystem flourish within the controller space.

OK, so while we think any discussion about standardizing a controller interface is extremely premature, the question of “what makes a good controller API” is absolutely crucial to SDN and very much worth discussing. However, there has been relatively little discourse on controller APIs. And that’s exactly what we’ll be focusing on in rest of this post. We’ll start by providing a brief look at common controller designs, and then take a closer look at Onix, the controller platform we work on.

Common Controller Types

If you survey the existing controller landscape there appears to be three broad classes of API.

Single Purpose Controllers:  These controllers are built for a specific function. So while they may use some common code (like an OpenFlow library) to communicate with the switch, they are otherwise not built with different uses in mind. Or put another way, they lack a control-logic agnostic interface for extension, so there really is little more to talk about.

OpenFlow Controllers:  Yeah, that is probably a confusing name, but bear with us. A number of controllers provide a high-level interface to OpenFlow (specifically) and some infrastructure for hosting one or more control ”applications”. These are general purpose platforms meant for extension. However, the interface is built around the OpenFlow protocol itself, meaning the interface is a thin wrapper around each message, without offering higher-level abstractions. Therefore, as the protocol changes (e.g., a new message is added), the API necessarily reflects this change. Most controllers (including one we’ve written) fall in this camp.

In our experience, the primary problem with this tight coupling is that it is challenging to evolve any aspect of the control application, switch protocol, or controller state distribution (assuming the controllers form a cluster) without refactoring all the layers of the software stack in tandem.

General SDN Controllers: We’re sort of making these names up as we go along, but a general SDN controller is one in which the control application is fully decoupled from both the underlying protocol(s) communicating with the switches, and the protocol(s) exchanging state between controller instances (assuming the controllers can run in a cluster).

The decoupling is achieved by turning the network control problem into a state management problem, and only exposing the state to be managed to the application, and not the mechanisms by which it arrived.

Applications operate over control traffic, flow table state, configuration state, port state, counters, etc. However, how this data is pulled into the controller, and how any changes are pushed back down to the switch is not the application’s business. It can just happily modify network state as if it were local and the platform will take care of the state dissemination.

The platform we’ve been working on over the last couple of years (Onix) is of this latter category. It supports controller clustering (distribution), multiple controller/switch protocols (including OpenFlow) and provides a number of API design concessions to allow it to scale to very large deployments (tens or hundreds of thousands of ports under control). Since Onix is the controller we’re most familiar with, we’ll focus on it.

So, what does the Onix API look like? It’s extremely simple. In brief, it presents the network to applications as an eventually consistent graph that is shared among the nodes in the controller cluster. In addition, it provides applications with distributed coordination primitives for coordinating nodes’ access to that graph.

Conceptually the interface is as straightforward as it sounds, however understanding its use in practice, especially in a distributed setting, bears more discussion.

So, digging a little further …

The Network as an Eventually Consistent Graph

In Onix, the physical network is represented as a graph for the control application. Elements in the graph may represent physical objects such as switches, or their subparts, such as physical ports, or lookup tables.  And it may contain logical entities such as logical ports, tunnels, BFD configuration, etc.

Control applications operate on the graph to find elements of interest and read and write to them. They can register for notifications about state changes in the elements, and they can even extend the existing elements to hold network state specific to a particular control logic.

We happen to call this graph “the NIB” (for Network Information Base).

The NIB abstracts away both the protocols for state synchronization between the controller and the switches and the protocols used for state synchronization between controllers (discussed further below).

Crummy diagram of Onix showing the major components ...

Regarding switch state, the control applications operate on the state directly, independent of how it is synchronized with the switch. For example, control logic may traverse the NIB looking for a particular switch, modify some forwarding table entries, and then place a trigger to receive notification on any future changes (such as a port status change event).

Doing this within a single controller node is pretty straightforward. However, for resilience and scale, most production deployments will want to run with multiple controllers. Hence, the platform needs to support clustering.

Within Onix, the NIB is the central mechanism for distributed state sharing. That is, the NIB is actually a shared datastructure among the nodes in the cluster. If a node updates the NIB, eventually that update will propagate to the other nodes interested in that portion of the graph (assuming no network failures).

Under this model, the control logic remains unaware of the specifics of the distribution mechanisms for controller-to-controller state. However, it is the responsibility of the application logic to coordinate which controller instance is responsible for which portion of the graph at any given time. To aid with this, Onix includes a number of standard distributed coordination primitives such as distributed locking, leader election, etc. Using them, control applications can safely partition the work amongst the cluster.

Because the control logic is decoupled from the distribution mechanisms, the Onix may be configured to use different distributed datastores to share different parts of the graph. For more dynamic state, it may use high-performance, memory-only, key/value store, whereas for more static information (such as admin configured parts of the graph) it may prefer storage that is strongly consistent and provides durability.

Of course, a design like this pushes a lot of complexity to the application. For example, the application is responsible for all distributed coordination (though tools for doing so are provided), and if the application chooses to use an eventually consistent storage back-end, it must tolerate eventual consistency in the NIB (including conflicts, data disappearing, etc.)

An alternative approach would have been to constrain the distribution model to something simple (like strongly consistent updates to all data) which would have greatly simplified the API. However, this would come at a huge cost to scalability.

Which begs the question …

Is This the Right Interface?

Perhaps for some environments.  Almost certainly not for others.  There have been multiple control applications built on Onix, and it is used in large production deployments in the data centers, as well as in the access and core networks. However, it is probably too heavyweight for smaller networks (the home or small enterprise), and it is certainly too complex to use as a basic research tool.

The NIB is also a very low-level interface. Clearly, there is a lot of scaffolding one can build on top to aid in application development. For example, routing libraries, packet classification engines, network discovery components, configuration management frameworks, and what not.

And that is the beauty of software. If you don’t like an API, you fix it, build on it, or you build your own platform. Sure, as a platform matures, binary compatibility and stability of APIs becomes crucial, but that is more a matter of versioning than a priori standardization and design.

So our vote is to keep standards away from the controller design, to support and promote multiple efforts, and to let the ecosystem play out naturally.


An Attempt to Motivate and Clarify Software-Defined Networking (SDN)

Scott’s keynote at Ericsson research on SDN.  I really encourage anyone who is interested in OpenFlow and/or SDN to view it.   It is, in my opinion, the cleanest justification for SDN, and appropriately articulates where OpenFlow fits in the broader context (… as a minor mechanism that is capable, but generally unimportant).


OpenStack / Quantum / Open vSwitch

I am working on multiple upcoming posts, but they are taking some time to gel.  In the meantime, for those who are interested, I’ve also been helping on a series of posts of openvswitch.org describing the OpenStack Quantum network service and how Open vSwitch fits into the picture.  The first post was put up today at:

http://openvswitch.org/openstack/2011/07/25/openstack-quantum-and-open-vswitch-part-1/


The Rise of Soft Switching Part III: A Perspective on Hardware Offload

[This series is written by Jesse Gross, Andrew Lambeth, Ben Pfaff, and Martin Casado. Ben is an early and continuing contributor to the design and implementation of OpenFlow. He's also a primary developer of Open vSwitch. Jesse is also a lead developer of Open vSwitch and is responsible for the kernel work and datapath. Andrew has been virtualizing networking for long enough to have coined the term "vswitch", and led the vDS distributed switching project at VMware. All authors currently work at Nicira.]

This is our third (and final) post on soft switching. The previous two posts described various hardware edge switching technologies (tagging + hairpinning, offloading to the NIC, passthrough) as well as soft switching, and made the argument that in most cases, soft switching is the way to go.

We have received a bunch of e-mails (and a few comments) trying to make the case for passthrough as implemented by certain vendors. While there are a few defendable uses for it (which we try to outline in part 2), in the general case, we still maintain that passthrough is a net loss.

Of course, “net loss” is a fairly qualitative statement. So lets try this: if you are one of the handful of folks that have a special use case like HPC and trading and can live without many of the benefits of modern virtualized system, the passthrough is fine. Our preference is clearly to use mass produced NICs from multiple vendors (available in quantity today), and we prefer the benefits of software flexibility and innovation speeds.

However, while we’ve been arguing that doing full offload of switching into special purpose ASICs is “no good”, it’s only a partial representation of our position. We do feel that there is room for hardware acceleration in edge switching that also preserves the benefits and flexibility of software.

And that’s the topic for this final post. We will be discussing our view of how the hardware ecosystem should evolve to enable offload of virtual switching while still maintain many of the benefits of software.

The basic idea is to use NICs that contain targeted offloads that are performed under the direction of software, not a wholesale abdication to hardware. As a comparison point with another offload strategy, it should look more like TSO and less like TOE.

Regarding this final point (TSO vs. TOE), there may be a history lesson in the adoption and use of those approaches to offload. If the analogy holds, one could conclude that implementing stateful offloading and complex processing in I/O devices is less viable for adoption than more targeted and stateless approaches.

So with that, we’ll start by discussing how a more limited offload might look in NICs.

Offloading Receive

For soft switching, receive generally incurs the most overhead (due to a buffer copy and the difficulty in optimizing polling), and so is a good initial candidate for adding some specialized hardware muscle. Unfortunately however, receive is complicated to offload because you have no knowledge and context for the packet until you start to process it in the hypervisor, at which point it is too late.

Improved LRO Support: One place to start is simply by using LRO in more places. This doesn’t require putting any policy into the NIC, so by having the hardware manage coalescing, you can maintain complete control while reducing the number of packets (which overhead is generally proportional to). LRO has some limitations on the scenarios that you can use it in, which means that Linux based hypervisors never use it, even on NICs that have support. It would be fairly easy for NIC vendors to eliminate these issues.

Improved MultiQueue Support:  The other direction that some NICs have been moving in is adding the ability to classify on various headers and use the result to direct packets to a particular queue. These rules can be managed by software in the hypervisor.   Rules that can’t be matched either because the table isn’t sufficiently large or because the hardware can’t extract those fields can be sent to a default queue to be handled by software.

Unlike most physical switches, a server “management” CPU is powerful with a good connection to the NIC so there isn’t a huge performance hit by going to software. Packets delivered to a particular queue can actually be allocated from guest memory buffers and delivered on the CPU where the application in the VM consumes them, so there are no issues with cache line bouncing. The hypervisor is still involved in packet processing so the latency is not exactly as efficient as passthrough but you don’t tie the VM directly to the NIC. This means no need for hardware-specific drivers in the guest, a smooth fallback from hardware to software, live migration, etc.

One concrete enhancement to multiqueu support that would be useful is just being able to match on more fields. Right now MAC/vlan is pretty common, and Intel NICs also provide matching on a 5 tuple. Obviously once tunnel offloads come into the picture, you’ll also want to match on both the inner and outer headers for steering. Also, if you’re going to directly put packet data in guest memory buffers then you need to do security processing on the NIC and therefore want to match on headers that are necessary for that (within reason, we wouldn’t suggest doing stateful firewalling on the NIC for example; that’s where the software fallback is important).

Offloading Transmit

For offloading transmit, a major feature that can be provided by the NICs are various forms of encapsulation which is used heavily in virtual networking solutions. Today, NICs provide VLAN tagging assist. Having this extend to other forms of tunneling (mpls, L2 in L3, etc.) would be an enormous win. However, to do so, segmentation/fragmentation optimizations (like TSO) would still have to be supported for the upper layer protocol.

For multique, support for QoS could be useful for transmit. If you’re trying to implement any policy more complex than round robin in software, you quickly get contention on the QoS data structures. Unfortunately, this gets pretty complicated quickly because QoS itself is complicated and there are a lot of possible algorithms so it’s hard to get consistent results. We’re not certain how practical a hardware solution is due to the complexity, but the upside is potential quite large if QoS is important.

Another area which NICs may help on transmit offload is packet replication. This is only important if two conditions are met: (a) the physical fabric does not support replication (or the operator is to much of a wuss to turn in on) (b) the application requires high performance broadcast/multicast to a reasonable fan-out.

What does this leave for physical switches?

For switches, this picture of the future — where most functionality is implemented in software with some stateless offload in the NIC — would mean two things.

First, if functionality is going to be pulled into the server, then the physical networking problem reduces to building a good fabric (without trying to overload the functionality to support virtualization primitives). Good fabrics generally mean no oversubscription (e.g. lots of multipathing) and quick convergence on failure. In our experience, standard L3 with ECMP can be used to build a fantastic fabric using fat-tree or spine topologies.

Regardless, whatever the approach (e.g. L3, TRILL, or something proprietary), the goal is no longer overloading the physical network with the need to maintain switching at the virtual layer, but rather providing a robust backplane for the vswitches to use.

Second, soft switching is only practical for for worklaods running on a hypervisor. However, many virtual deployments require integration with legacy bare metal workloads (like that old oracle server which you’re unlikely to ever virtualize).

In this case, it would be useful to have hardware switches that expose an interface (like OpenFlow) which would allow them to interoperate with soft switches running on lots and lots of hypervisors. It’s likely that this can be achieved through fairly basic support for encap/decap and perhaps some TCAM programmability for QoS and filtering.

Fait accompli?

Far from it. These musings are mostly a rough sketch built around an intuition of how the ecosystem should evolve to support the flexibility of virtualization and software and the switching speeds and cost/performance of specialized forwarding hardware.

The high-level points are that most virtual edge networking functions should be implemented in software with targeted, and (probably) stateless hardware offload that doesn’t obviate software flexibility or control. The physical network should focus on becoming an awesome, scalable fabric and providing some method of integrating legacy applications.

The exact convergence on feature set of each of these components is anyones guess, and can only be bred out of solid engineering, deployment, and experience.


The Rise of Soft Switching Part 2.5: Responses and clarifications

[Note: This is an unplanned vignette for the soft switching series. We've received a lot of e-mail in response to our previous two posts. Alex submitted a great comment which outlines many of the popular issues that have come up. So rather than having the response buried in the comment forum, we're upgrading it to a full post.  So thanks to Alex for kicking this off. ]

Alex: I would say that you’ve made a lot of assumptions and oversimplified the generic problem. You are right that for some applications a softswitch could be a viable solution. However, there are a number of arguments against it:

It’s hard to write a somewhat concise blog post without some simplification. Oversimplified? Perhaps, and if so lets take the time to flush out the details.  We always welcome constructive discussion.  As a start, lets take a look at some of your points.

Alex: 1. When you need IPsec/MACsec/TCP/UDP offload, you still need an intelligent NIC. In many cases you need to load balance between multiple switching instances, so with the intelligent NIC it makes sense to place that function there, because it is coming “for free”. The same is true also for a basic switching.

Noone is arguing against intelligent NICs. However, this has nothing to do with soft switching. As mentioned in response to another comment, the vSwitch can (and does) carry the context for various stateless offloads from the vNIC through to the pNIC and in the vNIC<->vNIC case can either avoid doing them (ie VLAN tagging/stripping), or can do them in software optionally [checksum offload is one that can be dropped technically, but makes tcpdump output look scary so is usually done in software for "free" inline with the copy that happens anyway].

Also, as we’ve discussed at some length, doing inter-VM switching in hardware is far from “free”. It incurs overhead and it obviates the flexibility of switching in software.

Alex: 2. Switches today are not that dumb, they can handle very large tables, they can process L2-L4 headers and beyond (usually first 128B-256B of the packet are being parsed), some of them are even programmable to be able to add new protocols.

This is definitely worth a taking a closer look at. We’ve spent years working with high-end merchant silicon switching chips trying to emulate what others are able to do with soft switching. Note that we have very good relationship with the silicon vendors, so this isn’t just blind fumbling.

Lets looks at some recent hurdles we’ve run into in practice:

IP overwrite: Virtual networking environments sometime need address isolation similar to NAT. Very few commercially available switching chipsets support general IP overwrite. And even fewer (none?) support it at a useful position in the lookup pipeline for virtual networking. Clearly this is trivially doable (and often done) in software at the edge.

Note that IP overwrite is not just an edge function within a virtualized context since two co-located VMs in different logical contexts may wish to communicatin thereby neededing overwrite.

Additional lookup for virtual context: Often to preserve logical context, additional information needs to be stuffed into packets. To do so, you have one of two choices: (a) overload and existing field (ugly because you can no longer use that field within the logical context, and at some point you’ll run out of fields to overload) (b) create a new field. Lets look at (b):

For the mass produced switching chips that we’re familiar with, adding a new field means changing the protocol parser means spinning a new chip means an 18 month wait time.  Further, if you want this field to be part of a useful lookup (which often you do) then you’re probably limited to the number of bits you’re going to get.  We’ve quite litterally sat with ASIC designers arguing for 32 bits over 24 for such fields when we routinly use 64 or 128 bits in software.

L2 in L3 tunneling: Until very recently, L2 in L3 was not available in 10G and even now the support is extremely limited (although next generation chips will improve the situation — though again, this required over a year of bake time).

For the L2 in L3 tunneling that is available, rarely can you use the key for lookup, and in some implementations, decap only offers a small fraction of the full bisectional bandwidth of the switching chip making it impractcal for heavy use.

Again, this isn’t a theoretical musing on edge cases. All of these issues we’ve run into in practice. And there are many others we haven’t mentioned (tunnel and ECMP limits for example).

So yes, switching silicon is flexible and getting better all the time. But there are, and will continue to be, very real limitations when compared to switching in software.  Switching silicon will never be replaced.  However, it’s not clear that it is the right place to perform all packet lookup and manipulation.

A final point. You can certainly achieve all of these things if you cobble together enough hardware. However, we would wager that is a very clear looser on price performance, and you still don’t overcome the flexibility and upgrade cycle problems.

Alex: 3. You can use NPUs for switching, which can provide the required flexibility, performance and scalability.

NPUs don’t have the price/performance (nor high fanout capacity) of a standard switching chip (e.g. Broadcom). And they don’t have the flexibiliy, economies of scale, and tool chain support of x86.

There is probably a good economic reason that many of the most popular edge appliances today (wan opt, load balancing, etc.) are built on souped up x86 boxes, and not esoteric NPU platforms.

This would be an interesting point to follow up on. Our contention is that classic table-based switching is the best way to build a physical fabric. The price/performance for high-fan-out switching seems unbeatable. And that x86 is the best way to handle intelligent packet manipulation at the edge.

If you (Alex) are interested in working on a follow-on post. How about together we pencil out the price/performance of a (say) Broadcom/Intel solution vs. Broadcom + some NPU + Intel. I think it would be an interesting exercise.

Alex: 4. Even if softswitching can achieve the same performance level, you pay much heavier price from the power consumption point of view. And we all know well how important is the power today.

This is true of course in terms of byte/watt, but in terms of elasticity and utilization x86 remains attractive, since for the 90% of the day that the network is at 10% utilization all those x86 cores can do useful work.

It’s fair to point out that you may want to statically provision the core, meaning it will also remain underutilized. Given this, then you’re probably right that there is power differential. It would be interesting to compare this with having an additional NPU/TCAM/whatever. That’s something we don’t know the answer to off hand.  But would love to explore further.

Alex: 5. You’ve assumed in your calculations a large packet size, which is not always the case. Instead of providing the bit rate as a performance characteristic, much better metric is the packet rate. For instance, the 10Gbps capable switch can usually handle 15Mpps. I do not think you can claim the same for softswitch.

Also true, but we’re talking about edge switching. If we enumerate the workloads we’re familiar with that do anywhere near line rate in the number of VMs that fit on one host with small message sizes. Our experience is that they are either a) netperf b) hpc c) trading apps.

We can safely ignore a. b and c are legitimate exceptions, but have many other issues with virtualization and also fewer requirements. Passthrough with on NIC switching or hairpinning to the switch is probably a good choice for them. There’s a reason myrinet/quadrics/infiniband (a) existed and (b) are not heavily used outside of HPC and other specialized environments.

Alex: 6. If your application requires also traffic management capabilities, the softswitch overhead goes up very quickly with a number of queues, because dual token or leaky bucket implementation is expensive in means of CPU cycles. Of course, you can say that TM can be also offloaded to the NIC, but TM requires some level of flow classification, and with flow classification already performed in the NIC it makes sense to do the switching in the NIC as per the 1st argument.

Here we totally agree, and this is an awesome segue to our next post.

Alex: Just to clarify my comment: I am not saying that softswitch is not a good solution, I am saying that you should check a particular application and its further evolving requirements before deciding on the softswitch-based implementation.

This is is true, but borders on tautological. Clearly you can contrive a situation in which soft switching doesn’t work. However, we’re interested in refining the discourse beyond the obvious.

I think at least some of the authors are in a pretty good position to claim having surveyed a large number of applications, deployments, and customer requirements for vswitches (edge switching within virtualized environments). In our experience, there are very few environments outside of HPC and trading in which soft switching isn’t a great fit.

Alex: And a final comment here is about the definition of a softswitch. Remember that above-mentioned intelligent NIC could be based on the latest generation of multi-core processors (Cavium, NetLogic, Tilera, etc.) with hybrid NPU-CPU capabilities and a lot of offload, but the switching itself is still performed by software either in Linux or bare metal environment. So from the system point of view the switching is performed by the NIC, but it is still a softswitch…

No argument here :)

[Thanks again to Alex for the comment, it does flush out more of the discussion. Clearly the position is more nuanced than catagorical, and clearly the discourse is ongoing. Perhaps you'd be interested in fleshing out some of the price/performance/power arguments for a future post?]


The Rise of Soft Switching Part II: Soft Switching is Awesome ™

[This series is written by Jesse Gross, Andrew Lambeth, Ben Pfaff, and Martin Casado. Ben is an early and continuing contributor to the design and implementation of OpenFlow. He's also a primary developer of Open vSwitch. Jesse is also a lead developer of Open vSwitch and is responsible for the kernel work and datapath. Andrew has been virtualizing networking for long enough to have coined the term "vswitch", and led the vDS distributed switching project at VMware. All authors currently work at Nicira.]

This is the second post in our series on soft switching. In the first part (found here), we lightly covered a subset of the technical landscape around networking at the virtual edge. This included tagging and offloading decisions to the access switch, switching inter-VM traffic in the NIC, as well as NIC passthrough to the guest VM.

A brief synopsis of the discussion is as follows. Both tagging and passthrough are designed to save end-host CPU by punting the packet classification problem to specialized forwarding hardware either on the NIC or first hop switch, and to avoid the overhead of switching out of the guest to the hypervisor to access the hardware. However, tagging adds inter-VM latency and reduces inter-VM bisectional bandwidth. Passthrough also increases inter-VM latency, and effectively de-virtualizes the network thereby greatly limiting the flexibility provided by the hypervisor. We also mentioned that switching in widely available NICs today is impractical due to severe limitations in the on-board switching chips.

For the purposes of the following discussion, we are going to reduce the previous discussion to the following: The performance arguments in favor of an approach like passthrough + tagging (with enforcement in the first hope switch) is that latency is reduced to the wire (albeit marginally) and packet classification from a proper switching chip will noticeably outperform x86.

The goal of this post is to explain why soft switching kicks ass. Initially, we’ll debunk some of the FUD around performance issues with it, and then try and quantify the resource/performance tradeoffs of soft switching vis a vis hardware-based approaches. As we’ll argue, the question is not “how fast” is soft switching (it is almost certainly fast enough), but rather, “how much cpu am I willing to burn”, or perhaps “should I heat the room with teal or black colored boxes”?

So with that …

Why soft switching is awesome:

So, what is soft switching? Exactly what it sounds like. Instead of passing the packet off to a special purpose hardware device, the packet transitions from the Guest VM into the hypervisor which performs the forwarding decision in software (read, x86). Note that while a soft switch can technically be used for tagging, for the purposes of this discussion we’ll assume that it’s doing all of the first-hop switching.

The benefits of this approach are obvious. You get the flexibility and upgrade cycle of software, and compared to passthrough, you keep all of the benefits of virtualization (memory overcommit, page sharing, etc.). Also, soft switching tends to be much better integrated with the virtual environment. There is a tremendous amount of context that can be gleaned by being co-resident with the VMs, such as which MAC and IP addresses are assigned locally, VM resource use and demands, or which multicast addresses are being listened to. This information can be used to pre-populate tables, optimize QoS rules, prune multicast trees, etc.

Another benefit is simple resource efficiency, you already bought the damn server, so if you have excess compute capacity why buy specialized hardware for something you can do on the end host?  Or put another way, after you provision some amount of hardware resources to handle the switching work, any of those resources that are left over are always available to do real work running an app instead of being wasted(which is usually a lot since you have to provision for peaks).

Of course, nothing comes for free. And there is a perennial skepticism around the performance of software when compared to specialized hardware. So we’ll take some time to focus on that.

First, what are the latency costs of soft switching?

With soft switching, VM to VM communication effectively reduces to a memcpy() (you can also do page flipping which has the same order of overhead). This is as fast as one can expect to achieve on a modern architecture. Copying data between VMs through a shared L2 cache on a multicore CPU, or even if you are unlucky enough to have to go to main memory, is certainly faster than doing a DMA over the PCI bus. So for VM to VM communication, soft switching will have the lowest latency, presuming you can do the lookup function sufficiently quickly (more on that below).

Sending traffic from the guest to the wire is only marginally more expensive due to the overhead of a domain transfer (e.g. flushing the TLB) and copying security sensitive information (such as headers). In Xen (for example) guest transmit (DomU-to-Dom0) operates by mapping pages that were allocated by the guest into the hypervisor, which then get DMA’d with no copy required (with the exception of headers, which are copied for security purposes so the guest can’t change them after the hypervisor has made a decision). In the other direction, the guest allocates pages and puts them in its rx ring, similar to real hardware. These then get shared with the hypervisor via remapping. When receiving a packet the hypervisor copies the data into the guest buffers. (note: VMware does almost the same thing. However, there is no remapping because the vswitch runs in the vmkernel and all physical pages are already mapped and the vmkernel has access to the guest MMU mappings.)

So, while there is comparatively more overhead than a pure hardware approach (due to copying headers and the overhead of the domain transfer), it is in the order of microseconds and dwarfed by other aspects of a virtualized system like memory overcommit. Or more to the point, only in extreme latency sensitive environments does this matter (the overhead is completely lost in the noise of other hypervisor overhead), in which case the only deployment approach that makes sense is to effectively pin compute to dedicated hardware greatly diminishing the utility of using virtualization in the first place.

What about throughput?

Modern soft switches that don’t suck are able to saturate a 10G link from a guest to the wire with less than a core (assuming MTU size packets). They are also able to saturate a 1G link with less than 20% of a core. In the case of Open vSwitch, these numbers include full packet lookup over L2, L3 and L4 headers.

While these are numbers commonly seen in practice, theoretically, throughput is affected by the overhead of the forwarding decision – more complex lookups can take more time thus reducing total throughput.

The forwarding decision involves taking the header fields of each packet and checking them against the forwarding rule set (L2, L3, ACLs, etc.) to determine how to handle the packet. This general class of problem is termed “packet classification” and is worth taking a closer look at.

Soft Packet Classification:

One of the primary arguments in favor of offloading virtual edge switching to hardware is that a TCAM can do a lookup faster than x86. This is unequivocally true, TCAMs have lots and lots of gates (and are commensurately costly and have high power demands) so that they can do lookups of many rules in parallel. A general CPU cannot match the lookup capacity of a TCAM in a degenerate case.

However, software packet classification has come a long way. Under realistic workloads and rule sets that are found in virtualized environments (e.g. mult-tenant isolation with a sane security policy), soft switching can handle lookups at line rates with the resource usage mentioned above (less than a core for 10G) and so does not add appreciable overhead.

How is this achieved? For Open vSwitch, which looks at many more headers than will fit in a standard TCAM, the common case lookup reduces to the overhead of a hash (due to extensive use of flow caching) and can achieve the same throughput as normal soft forwarding. We have run Open vSwitch with hundreds of thousands of forwarding rules and still achieved similar performance numbers to those described above.

Flow setup, on the other hand, is marginally more expensive since it cannot benefit from the caching. Performance of the packet classifier in Open vSwitch relies on our observation that flow tables used in practice (in the environments we’re familiar with) tend to have only a handful of unique sets of wildcarded fields. Each of these observed wildcard sets has its own hash table, hashed on the basis of the fields that are not wildcarded. Therefore classifying a packet requires a O(1) lookup in each hash table and selecting the highest-priority match. Lookup performance is therefore linear in the number of unique wildcard sets in the flow table. Since this tends to be small, classifier overhead tends to be negligible.

We realize that this is all a bit hand-wavy and needs to be backed up with hard performance results. Because soft classification is such an important (and somewhat nuanced) issue, we will dedicate a future post to it.

“Yeah, this is all great. But when is soft switching not a good fit?”

While we would contend that soft switching is good for most deployment environments, there are instances in which passthrough or tagging is useful.

In our experience, the mainstay argument for passthrough is reduced latency to the wire. So while average latency is probably ok, specialized apps that have very small request/response type workloads can be impacted by the latency of soft switching.

Another common use case for passthrough is a local appliance VM that acts as an inline device between normal application VMs and the network. Such an appliance VM has no need of most of the mobility or other hypervisor provided goodness that is sacrificed with passthrough but it does have a need to process traffic with as little overhead as possible.

Passthrough is also useful for providing the guest with access to hardware that is not exposed by the emulated NIC (for example, some NICs have IPsec offload but that is not generally exposed).

Of course, if you do tagging to a physical switch you get access to the all of the advanced features that have been developed over time, all exposed through a CLI that people are familiar with (this is clearly less true with Cisco’s Nexus 1k). In general, this line of argument has more to do with the immaturity of software switches than any real fundamental limitation. But it’s a reasonable use case from an operations perspective.

The final, and probably most widely used (albeit least talked about) use case for passthrough is drag racing. Hypervisor vendors need to make sure that they can all post the absolute highest, break-neck performance numbers for cross-hypervisor performance comparisons (yup, sleazy), regardless of how much fine print is required to qualify them. Why else would any sane vendor of the most valuable piece of real estate in the network (the last inch) cede it to a NIC vendor? And of course the NIC vendors that are lucky enough to be blessed by a hypervisor into passthrough Valhalla can drag race with each other with all their os-specific hacks again.

“What am I supposed to conclude from all of this?”

Hopefully we’ve made our opinion clear: soft switching kicks mucho ass. There is good reason that it is far and away the dominant technology used for switching at the virtual edge. To distill the argument even further, our calculus is simple …

Software flexibility + 1 core x86 + 10G networking + cheap gear + other cool shit >
       saving a core of x86 + costly specialized hardware + unlikely to be realized benefit of doing classification in the access switch.

More seriously, while we make the case for soft switching, we still believe there is ample room for hardware acceleration. However, rather than just shipping off packets to a hardware device, we believe that stateless offload in the NIC is a better approach. In the next post in this series, we will describe how we think the hardware ecosystem should evolve to aid the virtual networking problem at the edge.


Follow

Get every new post delivered to your Inbox.

Join 234 other followers