The Rise of Soft Switching Part II: Soft Switching is Awesome ™

[This series is written by Jesse Gross, Andrew Lambeth, Ben Pfaff, and Martin Casado. Ben is an early and continuing contributor to the design and implementation of OpenFlow. He's also a primary developer of Open vSwitch. Jesse is also a lead developer of Open vSwitch and is responsible for the kernel work and datapath. Andrew has been virtualizing networking for long enough to have coined the term "vswitch", and led the vDS distributed switching project at VMware. All authors currently work at Nicira.]

This is the second post in our series on soft switching. In the first part (found here), we lightly covered a subset of the technical landscape around networking at the virtual edge. This included tagging and offloading decisions to the access switch, switching inter-VM traffic in the NIC, as well as NIC passthrough to the guest VM.

A brief synopsis of the discussion is as follows. Both tagging and passthrough are designed to save end-host CPU by punting the packet classification problem to specialized forwarding hardware either on the NIC or first hop switch, and to avoid the overhead of switching out of the guest to the hypervisor to access the hardware. However, tagging adds inter-VM latency and reduces inter-VM bisectional bandwidth. Passthrough also increases inter-VM latency, and effectively de-virtualizes the network thereby greatly limiting the flexibility provided by the hypervisor. We also mentioned that switching in widely available NICs today is impractical due to severe limitations in the on-board switching chips.

For the purposes of the following discussion, we are going to reduce the previous discussion to the following: The performance arguments in favor of an approach like passthrough + tagging (with enforcement in the first hope switch) is that latency is reduced to the wire (albeit marginally) and packet classification from a proper switching chip will noticeably outperform x86.

The goal of this post is to explain why soft switching kicks ass. Initially, we’ll debunk some of the FUD around performance issues with it, and then try and quantify the resource/performance tradeoffs of soft switching vis a vis hardware-based approaches. As we’ll argue, the question is not “how fast” is soft switching (it is almost certainly fast enough), but rather, “how much cpu am I willing to burn”, or perhaps “should I heat the room with teal or black colored boxes”?

So with that …

Why soft switching is awesome:

So, what is soft switching? Exactly what it sounds like. Instead of passing the packet off to a special purpose hardware device, the packet transitions from the Guest VM into the hypervisor which performs the forwarding decision in software (read, x86). Note that while a soft switch can technically be used for tagging, for the purposes of this discussion we’ll assume that it’s doing all of the first-hop switching.

The benefits of this approach are obvious. You get the flexibility and upgrade cycle of software, and compared to passthrough, you keep all of the benefits of virtualization (memory overcommit, page sharing, etc.). Also, soft switching tends to be much better integrated with the virtual environment. There is a tremendous amount of context that can be gleaned by being co-resident with the VMs, such as which MAC and IP addresses are assigned locally, VM resource use and demands, or which multicast addresses are being listened to. This information can be used to pre-populate tables, optimize QoS rules, prune multicast trees, etc.

Another benefit is simple resource efficiency, you already bought the damn server, so if you have excess compute capacity why buy specialized hardware for something you can do on the end host?  Or put another way, after you provision some amount of hardware resources to handle the switching work, any of those resources that are left over are always available to do real work running an app instead of being wasted(which is usually a lot since you have to provision for peaks).

Of course, nothing comes for free. And there is a perennial skepticism around the performance of software when compared to specialized hardware. So we’ll take some time to focus on that.

First, what are the latency costs of soft switching?

With soft switching, VM to VM communication effectively reduces to a memcpy() (you can also do page flipping which has the same order of overhead). This is as fast as one can expect to achieve on a modern architecture. Copying data between VMs through a shared L2 cache on a multicore CPU, or even if you are unlucky enough to have to go to main memory, is certainly faster than doing a DMA over the PCI bus. So for VM to VM communication, soft switching will have the lowest latency, presuming you can do the lookup function sufficiently quickly (more on that below).

Sending traffic from the guest to the wire is only marginally more expensive due to the overhead of a domain transfer (e.g. flushing the TLB) and copying security sensitive information (such as headers). In Xen (for example) guest transmit (DomU-to-Dom0) operates by mapping pages that were allocated by the guest into the hypervisor, which then get DMA’d with no copy required (with the exception of headers, which are copied for security purposes so the guest can’t change them after the hypervisor has made a decision). In the other direction, the guest allocates pages and puts them in its rx ring, similar to real hardware. These then get shared with the hypervisor via remapping. When receiving a packet the hypervisor copies the data into the guest buffers. (note: VMware does almost the same thing. However, there is no remapping because the vswitch runs in the vmkernel and all physical pages are already mapped and the vmkernel has access to the guest MMU mappings.)

So, while there is comparatively more overhead than a pure hardware approach (due to copying headers and the overhead of the domain transfer), it is in the order of microseconds and dwarfed by other aspects of a virtualized system like memory overcommit. Or more to the point, only in extreme latency sensitive environments does this matter (the overhead is completely lost in the noise of other hypervisor overhead), in which case the only deployment approach that makes sense is to effectively pin compute to dedicated hardware greatly diminishing the utility of using virtualization in the first place.

What about throughput?

Modern soft switches that don’t suck are able to saturate a 10G link from a guest to the wire with less than a core (assuming MTU size packets). They are also able to saturate a 1G link with less than 20% of a core. In the case of Open vSwitch, these numbers include full packet lookup over L2, L3 and L4 headers.

While these are numbers commonly seen in practice, theoretically, throughput is affected by the overhead of the forwarding decision – more complex lookups can take more time thus reducing total throughput.

The forwarding decision involves taking the header fields of each packet and checking them against the forwarding rule set (L2, L3, ACLs, etc.) to determine how to handle the packet. This general class of problem is termed “packet classification” and is worth taking a closer look at.

Soft Packet Classification:

One of the primary arguments in favor of offloading virtual edge switching to hardware is that a TCAM can do a lookup faster than x86. This is unequivocally true, TCAMs have lots and lots of gates (and are commensurately costly and have high power demands) so that they can do lookups of many rules in parallel. A general CPU cannot match the lookup capacity of a TCAM in a degenerate case.

However, software packet classification has come a long way. Under realistic workloads and rule sets that are found in virtualized environments (e.g. mult-tenant isolation with a sane security policy), soft switching can handle lookups at line rates with the resource usage mentioned above (less than a core for 10G) and so does not add appreciable overhead.

How is this achieved? For Open vSwitch, which looks at many more headers than will fit in a standard TCAM, the common case lookup reduces to the overhead of a hash (due to extensive use of flow caching) and can achieve the same throughput as normal soft forwarding. We have run Open vSwitch with hundreds of thousands of forwarding rules and still achieved similar performance numbers to those described above.

Flow setup, on the other hand, is marginally more expensive since it cannot benefit from the caching. Performance of the packet classifier in Open vSwitch relies on our observation that flow tables used in practice (in the environments we’re familiar with) tend to have only a handful of unique sets of wildcarded fields. Each of these observed wildcard sets has its own hash table, hashed on the basis of the fields that are not wildcarded. Therefore classifying a packet requires a O(1) lookup in each hash table and selecting the highest-priority match. Lookup performance is therefore linear in the number of unique wildcard sets in the flow table. Since this tends to be small, classifier overhead tends to be negligible.

We realize that this is all a bit hand-wavy and needs to be backed up with hard performance results. Because soft classification is such an important (and somewhat nuanced) issue, we will dedicate a future post to it.

“Yeah, this is all great. But when is soft switching not a good fit?”

While we would contend that soft switching is good for most deployment environments, there are instances in which passthrough or tagging is useful.

In our experience, the mainstay argument for passthrough is reduced latency to the wire. So while average latency is probably ok, specialized apps that have very small request/response type workloads can be impacted by the latency of soft switching.

Another common use case for passthrough is a local appliance VM that acts as an inline device between normal application VMs and the network. Such an appliance VM has no need of most of the mobility or other hypervisor provided goodness that is sacrificed with passthrough but it does have a need to process traffic with as little overhead as possible.

Passthrough is also useful for providing the guest with access to hardware that is not exposed by the emulated NIC (for example, some NICs have IPsec offload but that is not generally exposed).

Of course, if you do tagging to a physical switch you get access to the all of the advanced features that have been developed over time, all exposed through a CLI that people are familiar with (this is clearly less true with Cisco’s Nexus 1k). In general, this line of argument has more to do with the immaturity of software switches than any real fundamental limitation. But it’s a reasonable use case from an operations perspective.

The final, and probably most widely used (albeit least talked about) use case for passthrough is drag racing. Hypervisor vendors need to make sure that they can all post the absolute highest, break-neck performance numbers for cross-hypervisor performance comparisons (yup, sleazy), regardless of how much fine print is required to qualify them. Why else would any sane vendor of the most valuable piece of real estate in the network (the last inch) cede it to a NIC vendor? And of course the NIC vendors that are lucky enough to be blessed by a hypervisor into passthrough Valhalla can drag race with each other with all their os-specific hacks again.

“What am I supposed to conclude from all of this?”

Hopefully we’ve made our opinion clear: soft switching kicks mucho ass. There is good reason that it is far and away the dominant technology used for switching at the virtual edge. To distill the argument even further, our calculus is simple …

Software flexibility + 1 core x86 + 10G networking + cheap gear + other cool shit >
       saving a core of x86 + costly specialized hardware + unlikely to be realized benefit of doing classification in the access switch.

More seriously, while we make the case for soft switching, we still believe there is ample room for hardware acceleration. However, rather than just shipping off packets to a hardware device, we believe that stateless offload in the NIC is a better approach. In the next post in this series, we will describe how we think the hardware ecosystem should evolve to aid the virtual networking problem at the edge.

SR-IOV is not Passthrough (Response to “The Rise of Soft Switching: Part 1″)

[Note: I've been working with networking at the virtual edge for over 3 years now, and every time SR-IOV has come up, it has been in the context of passthrough (bypassing the virtual switch).  However, Sean Varley of Intel wrote to point out that the two should not be conflated.  He further points out the potential benefits that SR-IOV can provide.  Rather than try and rehash an already great note, I figured I would just include it here in its entirety.   So first, sorry for screwing that up.  Mea culpa.  Second, thanks to Sean for providing the clarification.  And finally, passthrough still sucks ;)]

I’d like to clarify a technological fact that you may have overlooked in this recent blog post that I think has done a disservice to the SR-IOV technology and partly isn’t even relevant to the point you are trying make with this post.
The fact is essentially this – SR-IOV is NOT a kernel or soft switch bypass technology.  I agree that there is a great misperception in the industry that it is only relevant under these circumstances, however, as a specification it merely allows for multiple child functions to be instantiated under a parent function in a standard manner under the PCIe interface.  This has much more to do with identifying, associating, and managing NIC partitions than anything else.  By erroneously applying this basic functionality to a single specific use case of soft switch bypass, you are pigeon holing it to a small space and perpetuating a significant myth in the industry.
I have attached a single diagram that shows the wide variety of uses for SR-IOV in emulated, local kernel, and even non virtualized environments where SR-IOV is a significant enabling technology.  The essence of NIC partitioning is very simple:  I want a single big pipe to look like a bunch of smaller pipes.  SR-IOV doesn’t do this but it allows a standard interface to be applied to a partition that enables every OS in the industry to recognize and enumerate that partition as an interface to bind whatever service it likes to according to a standard.  All other partitioning technologies that exist today (Flex10, vNIC, etc.) are proprietary technologies that actually clutter the market place with vertical, inherently incompatible (amongst vendors) solutions to NIC partitioning.

Openflow is the answer! (now .. what was the question again?)

Wow, over the last couple of weeks there has been an escalation in the confusion around Openflow. The problem is on both sides of the metaphorical isle.   Those behind the hype engine are spinning out of control with fanciful tales about bending the laws of physics (“Openflow will keep the icecream in your refrigerator cold during a power outage!”). And those opposed to OpenFlow (for whatever reason) are using the hype to build non-existant strawmen to attack (“Openflow cannot bend the laws of physics, therefore it isn’t useful”).

So for this post, I figured I’d go through some of the dubious claims and bogus strawmen and try and beat some sense into the discussion.  Each of the claims I talk to below were pulled from some article/blog/interview I ran into over the last few weeks.

Claim: “Openflow provides a more powerful interface for managing the network than exists today”

Patently false. APIs expose functionality, not create it. Openflow is a subset of what switching chipsets offer today. If you want more power, sign an NDA with Broadcom and write to their SDK directly.

What Openflow does attempt to do is provide a standard interface above the box. If you believe that SDN is a powerful concept (and I do), then you do need some interface that is sufficiently expressive and hopefully widely supported.

This is almost not worth saying, but clearly Openflow is more expressive than a CLI. CLIs are fine for manual configuration, but they totally blow for building automated systems. The shortcomings are blindingly obvious, for example traditional CLIs have no clear data schema or state semantics and they change all the time because the designers are trying to solve an HCI not an API versioning problem. However, I don’t think there is any real disagreement on this point.

Claim: “Openflow will make hardware simpler”

I highly doubt this. Even with Openflow, a practical network deployment needs a bunch of stuff. It needs to do lookups over common headers (minimally L2/L3/L4), it may need hash-based multi-pathing, it may need port groups, it may need tunneling, it may need fine-grained link timers. If you grab a chip from Broadcom, I’m not exactly sure what you’d throw away if using Openflow.

What Openflow may discourage is stupid shit like creating new tags as an excuse for hardware stickyness (“you want new feature X? It’s implemented using the Suxen tag(tm) and only our new chip supports it.”). This is because an Openflow-like interface can  effectively flatten layers. For example, I don’t have to use the MPLs tag for MPLS per se. I can use it for network virtualization (for instance) or for identifying aggregates that correspond to the same security group. However, that doesn’t mean hardware is simpler. Just that the design isn’t redundant to fulfill a business need. (Post facto note:  There are some great points regarding this issue in the comments.  We’ll work to spin this out into a separate post.)

Or more to the point. There are an awful lot of fields and bits in a header. And an awful lot of lookup capacity in a switch. If you shuffle around what fields mean what, you can almost certainly do what you want without having to change how the hardware parses the packet.

Claim: “Openflow will commoditize the hardware layer”

This is total Kaiser Soze-esque nonsense (yup, that’s the picture at the top of the post). To begin with, networking hardware is already on its way to horizontal integration. The reason that Arista is successful has nothing to do with Openflow (they strictly don’t support it and never have), it has nothing to do with SDN, it’s because Ken Duda, and Andy Bechtolsheim are total bad-asses and have built a great product. Oh, and because merchant silicon has come of age, meaning you don’t need need an ASIC team to be successful in the market.

The difference between OpenFlow and what Arista supports is that with Openflow you choose to build to an industry standard interface, and not a proprietary one. Whether or not you care is, of course, totally up to you.

So Openflow does provide another layer of horizontal integration and once the ecosystem develops that it is, in my opinion, a very good thing. But the ecosystem is still embryonic, so it will take some time before the benefits can be realized.

I think the power of merchant silicon and the rise of the “commodity” fabric are a far greater threat to the crusty scale-up network model. Oddly, Openflow has become a relatively significant distraction. That said, as this area matures, creating a horizontal layer “above the box” will grow in significance.

Claim: “Openflow does not map efficiently to existing hardware”

True! Older versions of Openflow did not map well to existing silicon chips. This is a *major* focus of the existing design effort; to make Openflow more flexible. As with any design effort, the trade-off is between having a future proof roadmap, and having a practical design that can have tangible benefits now. So we’re trying to thread the needle practically but with some foresight. It will take a little patience to see how this evolves.

Claim: “Openflow reduces complexity”

This is a meaningless statement. Openflow is like USB, it doesn’t “do” anything new. A case can be made for SDN to reduce complexity, but it does this by constraining the distribution model and providing a development platform that doesn’t suck. No magic there.

Claim: “Openflow will obviate the need for traditional distributed routing protocols”

I hear this a lot, but I just don’t buy it. There are certainly those within the Openflow community who disagree with me, so please take this for what it’s worth (very little …)

In my opinion, traditional distributed protocols are very good at what they do, scaling, routing packets in complex graphs, converging, etc. If I were to build a fabric, I’d sure as hell do it with an distributed routing protocol (again, not everyone agrees on this point). What traditional protocols suck at is distributing all the other state in the network. What state is that? If you look at the state in a switching chip, a lot of it isn’t traditionally populated through distributed protocols. For example, the ACL tables, the tunnels, many types of tags (e.g. VLAN), etc.

Going forward, it may be the case the distributed routing protocols get pulled into the controller. Why? Because controllers have much higher cpu density and therefore can run the computation for multiple switches. A multicore server will kick the crap out of the standard switch management CPU. Like I described in a previous post, this is no different than the evolution from distance vector to link state, it’s just taking it a step further. However, the protocol certainly is still distributed.

Claim: “Openflow cannot scale because of X”

I’ve addressed this at length in a previous post.  Still, I’m going to have to call Doug Gourlay out.  And I quote …

“I honestly don’t think it’s going to work,” Arista’s Gourlay said. “The flow setup rates on Stanford’s largest production OpenFlow network are 500 flows per second. We’ve had to deal with networks with a million flows being set up in seconds. I’m not sure it’s going to scale up to that.”

Other than being a basic logical fallacy (Stanford’s largest production Openflow network has absolutely nothing to do with the scaling properties of the Openflow or SDN architecture),  there appears to be an implicit assumption that flow-setups are in some way a limiting resource.  This clearly isn’t the case if they don’t leave the datapath which is a valid (and popular) deployment model for Openflow.  Doug’s a great guy, at a great company, but this is a careless statement, and it doesn’t help further the dialog.

Claim: “Openflow helps virtualize the network”

Again, this is almost a meaningless statement. A compiler helps virtualize the network if it is being used to write code to that effect. The fact is, the pioneers of network virtualization, such as VMWare, Amazon, and Cisco, don’t use Openflow.

So yes, you can use Openflow for virtualization (that’s a help, I guess … ). And open standards are a good thing. But no, you certainly don’t need to.

OK, that’s enough for now.

To be very clear, I’m a huge fan of Openflow. And I’m a huge fan of SDN.  Yet, neither is a panacea, and neither is a system or even a solution. One is an effort to provide a standardized interface for building awesome systems (Openflow). The other is a philosophical model for how to build those systems (SDN). It will be up to the systems themselves to validate the approach.

The Rise of Soft Switching (Part I: Introduction and Background)

[This series is written by Jesse Gross, Andrew Lambeth, Ben Pfaff, and Martin Casado. Ben is an early and continuing contributor to the design and implementation of OpenFlow. He's also a primary developer of Open vSwitch. Jesse is also a lead developer of Open vSwitch and is responsible for the kernel work and datapath. Andrew has been virtualizing networking for long enough to have coined the term "vswitch", and led the vDS distributed switching project at VMware. All authors currently work at Nicira.]

How many times have you been involved in the following conversation?

NIC vendor: “You should handle VM networking at the NIC. It has direct access to server memory and can use a real switching chip with a TCAM to do fast lookups. Much faster than an x86. Clearly the NIC is the correct place to do inter-VM networking.

Switch vendor: “Nonsense! You should handle VM networking at the switch. Just slap a tag on it, shunt it out, and let the real men (and women) do the switching and routing. Not only do we have bigger TCAMs, it’s what we do for a living. You trust the rest of your network to Crisco, so trust your inter-VM traffic too. The switch is the correct place to do inter-VM networking.

Hypervisor vendor: “You guys are both wrong, networking should be done in software in the hypervisor. Software is flexible, efficient and the hypervisor already has rich knowledge of VM properties such as addressing and motion events. The hypervisor is the correct place to do inter-VM networking.

I doubt anyone familiar with these arguments is fooled into thinking they are anything other than what they are, poorly cloaked marketing pitches charading as a technical discussion. This topic in particular is a focus of a lot of marketing noise. Why? Probably because of its strategic importance. The access layer to the network hasn’t opened for over a decade and with virtualization, there is the opportunity to shift control of the network from the core into the edge. This has to be making the traditional hardware vendors pretty nervous.

Fortunately, while the market blather decries this as a nuanced issue, we believe the technical discussion is actually fairly straightforward.

And that is the goal of this series of posts, to explore the technical implications of doing virtual edge networking in software vs. various configurations of hardware (Passthrough, tagging, switching in the NIC, etc.) The current plan is to split the posts into three parts. First, we’ll provide an overview of the proposed solution space. In the next post, we’ll focus on soft switching in particular, and finally we’ll describe how we would like to see the hardware ecosystem evolve to better support networking at the virtual edge.

Not to spoil the journey, but the high-order take-away of this series is that for almost any deployment environment, soft switching is the right approach. And by “right” we mean flexible, powerful, economic, and fast. Really fast. Modern vswitches (that don’t suck) can handle 1G at less than 20% of a core, and 10G at about a core. We’re going to discuss this at length in the next post.

Before we get started, it’s probably worth talking to the bias of the authors. We’re all software guys, so we have a natural preference for soft solutions. That said, we’re also all involved in Open vSwitch which is built to support both hardware (NIC and first hop switch) and software forwarding models. In fact, some of the more important deployments we’re involved in use Open vSwitch to drive switching hardware directly.

Alright, on to the focus of this post, background.

At it’s most basic, networking at the virtual edge simply refers to how forwarding and policy decisions are applied to VM traffic. Since VMs may be co-resident on a server, this means either doing the networking in software, or sending it to hardware to make the decision (or a hybrid of the two) and then back.

We’ll cover some of the more commonly discussed proposals for doing this here:

Tagging + Hairpinning: Tagging with hairpinning is a method for getting inter-VM traffic off of a server so that the first hop forwarding decisions can be enforced by the access hardware switch. Predictably, this approach is strongly used/backed by HP ProCurve and Cisco.

The idea is simple. When a VM sends a packet, it gets a tag unique to the sending VM, and then the packet is sent to the first hop switch. The switch does forwarding/policy lookup on the packet, and if the next hop is another VM resident on the same server, the packet is sent back to the server (hairpinned) to be delivered to the VM. The tagging can be done by the vswitch in the hypervisor, or by the NIC when using passthrough (discussed below).

The rational for this approach is that special purpose switching hardware can perform packet classification (forwarding and policy decisions) faster than software. In the next post, we’ll discuss whether this is an appreciable win over software classification at the edge (hint, not really).

The very obvious downsides to hairpinning are twofold. First, the bisectional bandwidth for inter-VM communication is limited by the first hop link. And it consumes bandwidth which could otherwise be used exclusively for communication between VMs and the outside world.

Perhaps not so obvious is that paying DMA and transmission costs for inter-VM communication increases latency. Also, by shunting the switching decision off to a remote device, you’re throwing away a goldmine of rich contextual information about the state of the VM’s doing the sending and receiving, as well as (potentially) the applications within those VMs.

Regarding the tags themselves, with VNTag, Cisco proposes a new field in the Ethernet header (thereby changing the Ethernet framing format, and thereby requiring new hardware, … ). And HP, a primary driver behind VEPA, decided to use the VLAN tag (and or source MAC) which any modern switching chipset can handle. In either case, a tag is just bits, so whether it is a new tag (requiring a new ASIC), or the use of an existing tag (VLAN and MPLS are commonly suggested candidates), the function is the same.

Blind Trust MAC Authentication: Another approach for networking at the virtual edge is to rely on MAC addresses to identify VMs. It appears that some of the products that do this are repurposed NAC solutions being pushed into the virtualization market. Yuck.

In any case, assuming there is no point of presence on the hypervisor, there are a ton of problems with this approach. They rarely provide any means to manage policy of inter-VM traffic, they can be easily fooled through source spoofing, and often they rely on hypervisor specific implementation tricks to determine move events (like looking for VMWare RARPs). Double yuck. This is the last we’ll mention of this approach as it really isn’t a valid solution to the problem.

Switching in the NIC: Another common proposal is to do all inter-VM networking in the NIC. The rational is twofold, if passthrough is being used (discussed further below) the hypervisor is bypassed, so something needs to do the classification. And secondly, packet classification is faster in hardware than software.

However, switching within the NIC hasn’t caught on (and we don’t believe it will to any significant degree). Using passthrough obviates many advantages of virtualization (as we describe below), and DMA’ing to the NIC is not free. Further, switching chipsets on NICs to date have not been nearly as powerful as those used in standard switching gear. The only clear advantage to doing switching in the NIC is the avoidance of hair-pinning (and perhaps shared memory with the CPU via QPI).

[note: the original version of this article conflated SR-IOV and passthrough which is incorrect.  While it is often used in conjunction with passthrough, SR-IOV itself is not a passthrough technology]

Where does Passthrough fit in all of this? Passthrough is a method of bypassing the hypervisor so that the packet is DMA’d from the guest directly to the NIC for forwarding (and vice versa). This can be used in conjunction with NIC-based switching as well as tagging and enforcement in the access switch. Passthrough basicaly de-virtualizes the network by removing the layer of software indirection provided by the hypervisor.

While the argument for passthrough is one of saving CPU, doing so is a complete anathema to many hypervisor developers (and their customers!) due to the loss of the software interposition layer. With passthrough, you generally loose the following: memory overcommit, page sharing, live migration, fault tolerance (live standby), live snapshots, the ability to interpose on the IO with the flexibility of software on x86, and device driver independence.

Regarding this last issue (hardware decoupling), with passthrough, if you buy new hardware, prepare to enjoy hours of upgrading/changing device drivers in hundreds of VMs. Or keep your old HW and enjoy updating the drivers anyway because NICs have hardware errata. Also enjoy lots of fun trying to restore from a disk snapshot or deploy from a template that has the old/wrong device driver in it.

To be fair, VMware and others have been investing large amounts of engineering resources into address this by performing unnatural acts like injecting device driver code into guests. But to date, the solutions appear to be intrusive and proprietary workarounds limited to a small subset of the available hardware. More general solutions, such as NPA, have no declared release vehicle, and from the Linux kernel mailing list appear to have died (or perhaps put on hold).

This all said, there actually are some reasonable (albeit fringe) use cases for Passthrough, such as throughput sensitive network appliances or single packet latency messaging applications. We’ll talk to these more in a later post.

Soft Switching: Soft switching is where networking at the virtual edge is done in software withing the hypervisor vswitch, and without offloading the decision to special purpose forwarding hardware. This is far and away the most popular approach in use today, and the singular focus of our next post in this series, so we’ll leave it at that for now.

Is that all?

No. This isn’t an exhaustive overview of all available approaches (far from it). Rather, it is an opinion-soaked survey of what has the most industry noise. In the next post, we will do a deep dive into soft switching, focusing in particular on the performance implications (latency, throughput, cpu overhead) in comparison to the approaches mentioned above.

Until then …

The Scaling Implications of SDN

In my experience, it’s rare to have a discussion about SDN without someone getting their panties in a bunch over scaling. Nevermind that SDN networks have already been implemented that scale to tens of thousands of ports. And nevermind that distributed systems are built today that manage hundreds of thousands of entities and petabytes of data. It remains a perennial point of contention.

So here is a hand-wavy – probably not totally convincing – but hopefully offers some intuitive understanding – attempt at an explanation of the scaling limits of SDN.

But first! Some history: Unfortunately, I think I am partially to blame for the sorry state of affairs. In some of the earliest writeups we would describe an SDN controller as “logically centralized”. The intended meaning of this was something along the lines of “it has a centralized programmatic model, but was really distributed”.

Now, what does “logically centralized” actually mean? Nothing. It’s a nonsense term and the result of sloppy thinking. Either you’re centralized, or you’re distributed (thanks to Teemu Koponen for pointing that out). So by logically centralized, I guess we really meant “distributed”, and that is what we should have said. Whoops!

Clearly, a centralized controller cannot scale. Perhaps it can handle a really large network. Hell, I’ve heard claims of single node SDN networks that scale to thousands of switches and tens of thousands of ports. Whether or not that’s true, at some point either CPU or memory will run out.

However, controllers can (and should!) be distributed. And in that case, I would argue that there are no inherent bottlenecks. Trivially, you can distribute controllers such that the total amount of available CPU and memory for the control path is equivalent to the sum total of management cpu’s on the switches themselves. Controllers can distribute state amongst themselves like network nodes do today.

Still not convinced? Let’s deconstruct this …

Latency: How might SDN affect latency? That depends on how it is used. In some implementations, datapath traffic never leaves the hardware. In which case, datapath latency should be identical to traditional networking. Other implementations will forward the first packet of a flow up to the controller and then cache the forward decisions on the datapath. In this case, the flow setup will have to pay an RTT to the controller. The following paragraph will discuss how much that might be.

Whether or not data traffic is sent to the controller, control traffic often is (for example BGP from a neighboring network, or the signalling of a port status change). In this case the additional overhead is the RTT to the controller + the cost of the computation to determine what to do next. How much that is will vary wildlyby deployment. It’s possible to keep this sub millisecond (200-300us is not unusual). I’ve seen as little as 70us claimed, but I doubt that could be maintained under any real load.

So you may not want to pay this latency per flow (though in many cases it probably doesn’t matter). However, there isn’t any appreciable overhead for responding to network events. And as I’ll describe later, it probably will reduce the total time needed to disseminate control traffic.

Datapath Scaling: The datapath is where the original OpenFlow design was broken. This is because OpenFlow used the abstraction of a switch datapath as a single TCAM. Clearly, this can be problematic for some forwarding rulesets. Assume, for example, that you are building an SDN application which would integrate with BGP, and map prefixes to tags with QoS policies. Implemented this within a single flow table would require (RIB * QoS rules) entries. Ouch.

However, later versions of OpenFlow support multiple tables which take care of this Cartesian explosion nightmare. Rather than having to multiply all the rules together, they can be placed in separate tables limiting the maximum size requirements of a single table (by a lot!).

Still, many modern implementations of SDN dispense with the abstraction of a flow altogether and program the switch hardware tables directly. I think we’ll see a lot more of this going forward.

Convergence on failure:  Alright, what happens on link (or node) failure? Traditionally, when a link fails, the information of the failed link is propagated through flooding. Flooding, of course, scales linearly with the distance of the longest loop free path. If you compare this with SDN, instead of being flooded the information goes to one of the controllers, which sends the updated routing tables to the affected switches.

A few things to note:

  • With SDN, the controller should only be updating the switches whose tables have actually changed.
  • With SDN, the total cost is link propagation to the controller + cost of computation + time to update to affected switches.
  • In a distributed SDN case, an implementation may distributed the link update among the controller clusters which then do the computation for their piece of the network.

Unfortunately for SDN, when in-band control is in play, things are a lot uglier. Inband control basically means that the datapath is used both for SDN control traffic and data. So a failure on the network may affect connectivity to the controller which must be patched up somehow before the controller can fix the problem for the rest of the network.

Patching up the control channel, in this case, is generally done with an IGP, so we’re strictly worse off than if we just used an IGP to begin with. Suck.

So bottom line, if convergence time is important, out of band control should be used. It’s simple to build a cheap, reliable out of band control network using traditional gear (the amount of traffic is minimal).

Computation and Memory: It’s easy to see that the latest multi-core server available from your friendly  server vendor can beat the pants off of whatever crap embedded cpu you’ll find in most networking gear (800mhz-1Ghz is common).

I think Nick McKeown said it best.  If we look historically, early routing protocols used distance vector routing algorithms (like RIP) which have very low computational requirements, but suck in a bunch of other ways (convergence times, split infinity etc.). As cpu’s became more powerful, it was possible for each node to compute single source shortest path to all destinations on each failure, which is the standard approach used today.

Calculating more routes (e.g. all pairs shortest path, or multiple source to all destinations) on stronger CPUs really  follows as a natural evolution. The clear “next step”. And a beefy server with lots of memory can compute *a lot* of routes, and quickly. Further, we already know how to distribute the algorithms.

Some parting notes: The goal of SDN is not to scale a simple network fabric. This is something that distributed routing algorithms do just fine. The goal of SDN is to provide a design paradigm which allows the creation of more sophisticated control paradigms: TE, security, virtualization, service interposition, etc. Perhaps that means just manipulating state at the edge of the network while traditional protocols are used in the core. Or perhaps that means using SDN throughout. In eithercase, SDN architecturally should be able to handle the scale. Remember, it’s the same amount of state that’s being passed around.

I think the right question to ask is not “does it scale” but “is it worth the hassle of building networks this way”? Building an SDN network requires some thought. In addition to the physical network nodes, a control network (probably), and controller servers need to be thrown into the mix.

So, is it worth it?

That, is up to you to decide.   Ultimately the answer will rely on what gets built using SDN and who wants to run it.  I think it’s still too early in the development cycle to derive a meaningful prediction.

Presentation on SDN and High-Level Network Abstractions

If you haven’t heard the talk, or seen the slides, these are a must.  Professor Shenker, one of the creators of SDN, put together this talk about why programming to high-level abstractions is better than protocol design, and why SDN helps achieve it.

An Extremely Brief Conceptual Introduction to Open vSwitch


Open vSwitch is one of my favorite open source projects.   For those of you who aren’t familiar with it, it’s a switch stack which can by run both as a soft switch (vswitch) within a virtualized environment)and as the control stack for hardware switches. Good stuff.

However, the real kung-fu is that Open vSwitch is built for programmatic state distribution. It does this through two interfaces. One being OpenFlow (with a ton of extensions) for managing the forwarding behaviour of the fast path, and the second being a JSON-RPC based config protocol used for less time critical configuration (tunnels, QoS, NetFlow, etc.). You can view the schema for the config protocol here.


So, what might you do with something like Open vSwitch? Well, the idea is to enable the creation of automated network infrastructure for virtualized environments.

For example, you could use a centralized control system to manage network policies policies that migrate with VMs. This has already been done within the XenServer environment with the Citrix Distributed vSwitch Controller.

Open vSwitch also supports programmatic control of address remapping (L2 and L3 a la OpenFlow), and programmatic control of tunnels as well as multiple tunnel types (e.g. GRE, IPsec, and CAPWAP). These come in handy for various network virtualization functions such as supporting mobility or an L2 service model across subnets.

Open vSwitch is already used in a bunch of production environments. It is most commonly used as a vswitch in large cloud deployments (many thousands of servers) for automated VLAN, policy, and tunnel management. However, I know of a number of deployments which use it as a simple OpenFlow switch, or a more sophisticated programmatic switch to control hardware environments.


Open vSwitch is fast. Damn fast. Some performance tests have shown it to be faster than the native Linux bridge. Open vSwitch uses flow-caching (when running in software), so even under complex configurations the common case should be blazingly fast. Open vSwitch also has highly optimized tunneling implementations.


Open vSwitch is primarily developed and deployed in Linux (however there are ports to other OSes, particularly those used in embedded environments). It is commonly used with both Xen and KVM (there are production environments using both). Further, it has been integrated into a number of cloud management systems including Xen Cloud Platform, OpenQRM, and OpenNebula (a long with a bunch of proprietary CMSes).  It’s currently being integrated into Open Stack.  You can track the progress here.

Why Do I Care?

Mostly because with Open vSwitch, as with other distributed switch solutions, it’s possible to build really sophisticated networks with not-so-sophisticated hardware.  For example, an L3 network from your neighborhood OEM or other other low-cost hardware vendor (check out, for example, Pronto), plus Open vSwitch and a bit of programming can equate to a cheap, bad-ass network for virtual deployments.  But that, my friends, is a topic for another post.

What OpenFlow is (and more importantly, what it’s not)

There has been a lot of chatter on OpenFlow lately. Unsurprisingly, the vast majority is unguided hype and clueless naysaying. However, there have also been some very thoughtful blogs which, in my opinion, focus on some of the important issues without elevating it to something it isn’t.

Here are some of my favorites:

I would suggesting reading all three if you haven’t already.

In any case, the goal of this post is to provide some background on OpenFlow which seems to be missing outside of the crowd in which it was created and has been used.

Quickly, what is OpenFlow? OpenFlow is a fairly simple protocol for remote communication with a switch datapath. We created it because existing protocols were either too complex or too difficult to use for efficiently managing remote state. The initial draft of OpenFlow had support for reading and writing to the switch forwarding tables, soft forwarding state, asynchronous events (new packets arriving, flow timeouts), and access to switch counters. Later versions added barriers to allow synchronous operations, cookies for more robust state management, support for multiple forwarding tables, and a bunch of other stuff I can’t remember off hand.

Initially, OpenFlow attempted to define what a switch datpath should look like (first as one large TCAM, and then multiple in succession). However, in practice this is often ignored as hardware platforms tend to be irreconcilably different (there isn’t an obvious canonical forwarding pipeline common to the chipsets I’m familiar with).

And just as quickly, what isn’t OpenFlow? It isn’t new or novel, there have been many similar protocols proposed. It isn’t particularly well designed (the initial designers and implementors were more interested in building systems than designing a protocol), although it is continually being improved. And it isn’t a replacement for SNMP or NetConf, as it is myopically focussed on the forwarding path.

So, it’s reasonable to ask, why then is OpenFlow needed? The short answer is, it isn’t. In fact, focussing on OpenFlow misses the point of the rational behind its creation. What *is* needed are standardized and supported APIs for controlling switch forwarding state.

Why is this?

Consider system design in traditional networking versus modern distributed software development.

Traditional Networking (Protocol design):

Adding functionality to a network generally reduces to creating a new protocol. This is a fairly low-level endeavor often consisting of worrying about the wire format and state distribution algorithm (e.g. flooding). In almost all cases, protocols assume a purely distributed model which means it should handle high rates of churn (nodes coming and going), heterogenous node resources (e.g. cpu), and almost certainly means the state across nodes is eventually consistent (often the bane of correctness).

In addition, to follow the socially acceptable route, somewhere during this process you bring your new protocol to a standards body stuffed by the vendors and spend years (likely) arguing with a bunch of lifers on exactly what the spec should look like. But I’ll ignore that for this discussion.

So, what sucks about this? Many things. First, the distribution model is fixed. It’s *much* easier to build a system where you can dictate the distribution model (e.g. a tightly coupled cluster of servers). It involves a low-level fetish with on-the-wire data formats, and there is no standard way to implement your protocol on a given vendors platform (and even if you could, it isn’t clear how you would handle conflicts with the other mechanisms managing switch state on the box).

Developing Distributed Software:

Now, lets consider how this compares to building a modern distributed  system. Say, for example, you wanted to build a system which abstracts a large datacenter consisting of hundreds of switches and tens of thousands of access ports as a single logical switch to the administrator. The administrator should be able to do everything they
can do on a single switch, allocate subnets, configure ACLs and QoS policy, etc. Finally, assume all of the switches supported something like OpenFlow.

The problem then reduces to building a software system that manages the global network state. While it may be possible to implement everything on a single server, for reliability (and probably scale) it’s more likely you’d want to implement this as a tightly coupled cluster of servers. No problem, we know how to do that, there are tons of great tools available for building scale-out distributed systems.

So why is building a tightly coupled distributed system preferable to designing a new protocol (or more likely, set of protocols)?

  • You have fine grained control of the distribution mechanism (you don’t have to do something crude like send all state to all nodes)
  • You have fine grained control of the consistency model. You can use existing (high-level) software for state distribution and coordination (for example, ZooKeeper and Cassandra)
  • You don’t have to worry about low-level data formatting issues. Distributed data stores and packages like thrift and protocol buffers handle that for you.
  • You (hopefully) reduce the debugging problem to a tightly coupled cluster of nodes.

And what might you lose?

  • Scale? Seeing that systems of similar design are used today to orchestrate hundreds of thousands of servers and petabytes of data, if built right, scale should not be an issue.  There are OpenFlow-like solutions deployed today that have been tested to tens of thousands of ports.
  • Multi-vendor support? If the switches support a standard like OpenFlow (and can forward Ethernet) then you should be able to use any vendors’ switch.
  • However, you most likely will not have interoperability at the controller level (unless a standardized software platform was introduced).

It is important to note that in both approaches (traditional networking and building a modern distributed system), the state being managed is the same. It is the forwarding and configuration state of the switching chips network wide. It’s only the method of it’s management that has changed.

So where does OpenFlow fit in all this? It’s really a very minor, unimportant, easy to replace, mechanism in a broader development paradigm. Whether it is OpenFlow, or a simple RPC schema on protocol buffers/thrift/json/etc. just doesn’t matter. What does matter is that the protocol has very clear data consistency semantics so that it can be used to build a robust, remote state management system.

Do I believe that building networks in this manner is a net win? I really do. It may not be suitable for all deployment environments, but for those building networks as systems, it clearly has value.

I realize that the following is a tired example. But it is nevertheless true. Companies like Google, Yahoo, Amazon, and Facebook have changed how we think about infrastructure through kick-ass innovation in compute, storage and databases. The algorithm is simple: bad-ass developers + infrastructure = competitive advantage + general awesomeness.

However, lacking a consistent API across vendors and devices, the network has remained, in contrast, largely untouched. This is what will change.

Blah, blah, blah, Is it actually real? There are a number of production networks outside of research and academia running OpenFlow (or similar mechanism) today. Generally these are built using whitebox hardware since the vendor supported OpenFlow supply chain is extremely immature. Clearly it is very early days and these deployments are limited to the secretive, thrill-seeking fringe. However, the black art is escaping the cloister and I suspect we’ll hear much more about these and new deployments in the near future.

What does this mean to the hardware ecosystem? Certainly less than the tech pundits would like to claim.

Minimally it will enable innovation for those who care, namely those who take pride in infrastructure innovation, and the many startups looking to offer products built on this model.

Maximally? Who the hell knows. Is there going to be an industry cataclysm in which the last companies standing are Broadcom, an ODM (Quanta), an OEM (Dell) and a software company to string them all together? No, of course not. But that doesn’t mean we won’t see some serious shifts in hegemony in the market over the next 5-10 years. And my guess is that today’s incumbents are going to have a hell of a time holding on. Why? for the following two reasons:

  1. The organizational white blood cells are too near sighted (or too stupid) to allow internal efforts build the correct systems. For example, I’m sure “overlay” is a four letter word in Cisco because it robs hardware of value. This addiction to hardware development cycle and margins is a critical hindrance.So while these companies potter around with their 4-7 year ASIC design cycles, others are going to maximize what goes in software and start innovating like Lady Gaga with a sewing machine.
  2. The traditional vendors, to my knowledge, just don’t have the teams to build this type of software. At least not today. This can be bought or built, but it will take time. And it’s not at all clear that the whiteblood cells will let that happen.

So there you have it. OpenFlow is a small piece to a broader puzzle. It has warts, but with community involvement is getting improved, and much more importantly, it is being used.


Get every new post delivered to your Inbox.

Join 374 other followers