[This post is written by Alex Bachmutsky and Martin Casado. Alex is a Distinguished Engineer at Ericsson in Silicon Valley, driving system architecture aspects of the company's next generation platform. He is the author of the book “Platform Design for Telecommunication Gateways,” and co-authored “WiMAX Evolution: Emerging Technologies and Applications”.]
This is yet another (unplanned) addendum to the soft switching series.
Our previous posts argued that for virtual edge switching, using the server’s compute was the best point in the design space given the current hardware landscape. The basic argument was that it is possible to achieve 10G from the server in soft switching by dedicating a single core (about $60 – $120 of silicon depending on how you count). So, it is difficult to make the argument for passthrough plus specialized hardware for two reasons. First, given currently available choices, price performance will almost certainly be higher with x86. Second, you lose many of the benefits of virtualization that are retained when using soft switching (which we’ve explained here).
Alex submitted two very good (and very detailed) responses to our claims (you can see the summary here). In them he argued that, looking at the basic component costs, a hardware offload solution should both draw less power and provide better cost performance than doing forwarding in x86. He also argued that switching and NPU chipsets are flexible and support mature, high-level development environments, which may make them suitable for the virtual networking problem.
Alex is an expert in this area, he’s incredibly experienced, and, at least partially, he’s right. Specialized hardware should be able to hit better price/performance and lower power. And it is true that development environments have come a long way.
So what’s the explanation for the discrepancy in the viewpoints?
That is what we’ve teamed up to discuss in this post. It turns out the differences in view are twofold. The first difference comes from the perspective of a large company (and commensurate purchasing power) versus that of a handful of customers. While today, “intelligent NICs” are at least twice (generally more like 3-5x) as expensive as a pure soft solution, this is a supply side issue and isn’t justifiable by the bill of materials (BoM). A company sufficiently large with sufficient investment could overcome this obstacle.
The second difference is distributing an appliance versus distributing software. Distributing an appliance allows ultimate control over hardware configuration, development environment, runtime environment etc. Distributing software, on the other hand, often requires dealing with multiple hardware configurations, and complex software interactions both with drivers and the operating system.
Let’s start with a quick resync:
In our original post, we argued that for most workloads, the most flexible and cost effective method for doing networking at the virtual edge is to do soft switching on the server (which is what 99% of all virtual deployments have done over the last decade).
The other two options we considered are using a switching chip (either in the NIC or the ToR switch) or an NPU (most likely on the NIC due to port density issues).
We’ll discuss switching chips first:
In our experience, non-NPU chipsets in the NIC are not sufficiently flexible for networking at the virtualized edge. The limitations are manifold: the size of the metadata passed between tables during lookup, the number of stages in the lookup pipeline, and economically viable table sizes (table size can sometimes be increased, for instance with external TCAM, but that is an expensive option and hence ignored here, since we are focusing on budget-sensitive implementations). A good concrete example is table space for tunnels. Network virtualization solutions often use lots and lots of tunnels (N^2 in the number of servers is not unusual). Soft switches have no problem supporting tens of thousands of tunnels without performance degradation. This is one or two orders of magnitude more than you’ll find on existing non-NPU NIC chipsets.
So this is simply a limitation in the supply chain. Perhaps a future NIC will be sufficiently flexible for the virtual networking problem; until then, we’ll omit this as an option.
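To make the tunnel scaling concrete, here is a minimal sketch of the arithmetic. It assumes a full mesh in which every server keeps one tunnel to every other server. The function name and the server counts are ours, chosen only for illustration, and it says nothing about the table sizes of any particular chipset.

```python
def full_mesh_tunnels(num_servers: int) -> tuple[int, int]:
    """Return (tunnels terminated per server, unidirectional tunnels overall).

    Assumes a full mesh: every server tunnels to every other server.
    """
    if num_servers < 1:
        raise ValueError("num_servers must be at least 1")
    per_server = num_servers - 1        # one tunnel to each other server
    total = num_servers * per_server    # N * (N - 1), i.e. on the order of N^2
    return per_server, total


# Illustrative (hypothetical) deployment sizes.
for n in (100, 1_000, 10_000):
    per_server, total = full_mesh_tunnels(n)
    print(f"{n:>6} servers: {per_server:>6} tunnels per server, {total:>12,} overall")
```

The per-server figure is what an edge switch (soft or hardware) has to hold in its own tables, and it grows linearly with the size of the deployment even though the overall count grows quadratically.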
We both agreed that standard silicon is getting very close to being useful for edge virtual switching. The next generation of 10G switches appears to be particularly compelling. So while they still suffer from flexibility limitations, the shortcomings of hairpinning inter-VM traffic (like reduced edge link bandwidth and ToR switching capacity), and table space issues, they do appear to be a viable option for some set of the switching decisions going forward. This is due to improved support for tagging and tunneling, increased sizes in ACL tables, and improved lookup generality between tables.
So we’re hopeful that next generation ToR switches have an open interface like OpenFlow so that the network virtualization layer can manage the forwarding state.
So what about NPUs?
To be clear, Open vSwitch has already been ported to multiple NPU-NIC platforms. In the cases I am familiar with, inter-VM traffic through the CPU is far slower (presumably due to DMA overhead) than keeping it on the x86. However, off-server traffic requires less CPU.
To explore this solution space in more detail, we’ll subdivide the NPU space into two groups: classical NPUs and multicore-based NPUs.
Many NPUs in the former group have the required level of processing flexibility, and some even have integrated general-purpose CPUs that can be used to implement, for instance, an OpenFlow-based control plane. However, their data plane processing is usually based on proprietary microcode that may be a development hurdle in terms of available expertise and toolchain.
Any investment to develop on such an NPU is a sunk cost, both in training the developers in the internal details of the hardware feature set and its limitations, and in developing the code to work within the environment. It is very difficult (if not impossible), for example, to port the microcode written for one such NPU to the NPU of another vendor.
So while this route is not only a viable one but very likely beneficial for creating a mass produced appliance, the investment for supporting “yet another NIC” in a software distribution is likely unjustifiable. And without sufficient economies of scale for the NPU (which can be ensured by a large vendor) the cost to the consumer would likely stunt adoption.
The latter group is based on general-purpose multicore CPUs, but it integrates some NPU-like features, such as streaming I/O, HW packet pre-classification, HW packet re-ordering and atomic flow handling, some level of traffic management, special HW offload engines and more. Since they are based on general-purpose processors, they could be programmed using similar languages and tools as other soft switch solutions. These NPUs can be either used as a main processing engine, or alternatively offload some tasks from the main CPU when integrated on intelligent NIC cards.
If the supply side can be ironed out, both in terms of purchase model and price, this would seem to strike a reasonable balance between development overhead, portability, and price/power/performance. Below, we’ll discuss some of the other challenges that need to be overcome on the supply side to be competitive with x86.
Let’s start with the economics of switching at the edge. First, in our experience, workloads in the cloud are either mostly idle, or handling lots of TCP traffic, generally HTTP offsite (trading and HPC being the obvious exceptions). Thus 10G at MTU size packets is not only sufficient, it is usually overkill. That said, it is good to be prepared as data usage continues to grow.
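As a rough illustration of what “10G at MTU size packets” means in packets per second, here is a small sketch. It assumes a standard 1500-byte MTU, untagged Ethernet framing, and a fully saturated link. The constant and function names are ours, and the 46-byte case (the minimum Ethernet payload) is included only for contrast.

```python
# Per-frame bytes on the wire beyond the IP packet itself:
# Ethernet header (14) + frame check sequence (4)
# + preamble and start delimiter (8) + inter-frame gap (12).
ETHERNET_OVERHEAD_BYTES = 14 + 4 + 8 + 12


def packets_per_second(link_bps: float, ip_packet_bytes: int) -> float:
    """Packets per second on a saturated link for a given IP packet size."""
    wire_bits = (ip_packet_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / wire_bits


TEN_GIG = 10e9  # 10 Gb/s line rate

for size in (1500, 46):
    rate = packets_per_second(TEN_GIG, size)
    print(f"{size:>5}-byte packets: {rate:,.0f} packets/s")
```

Under these assumptions, MTU-sized packets at line rate come to roughly 0.8 million packets per second, while minimum-sized packets come to roughly 14.9 million, which is why packet size matters so much to the amount of per-packet work a core has to do.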
Soft switching can be co-resident with the management domain. This has been the standard with Xen deployments in which the Linux bridge or Open vSwitch shares the CPUs allocated to the management domain. For 1G, allocating a single core to both is most likely sufficient. For 10G, allocating an extra core is necessary.
So if we assume worst case (requiring a full core for networking), given a fairly modern CPU, a core weighs in at $60-$100. Motherboard and packaging for that core is probably another $50 (no need for additional memory or hard drive space for soft switching).
On the high end, if you could get specialized hardware for less than (say) $100 – $150 over the price of a standard NIC per server, then there could be an argument for using them. And since often NICs are bonded for resilience, it should probably be half of that or at least include dual ports for the same price for fault resiliency.
Unfortunately, to our knowledge, such a NIC doesn’t exist at those price points. If we’re wrong please let us know (price, relative power and performance) and we’ll update this. To date, all NPU-based NICs we’ve looked at are 2-5 times what they should be to pencil out competitively.
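The back-of-the-envelope comparison above can be written down as a short sketch. The core and motherboard figures are the ones quoted in this post. The function names and the NIC premiums in the usage lines are hypothetical, picked only to show how the comparison works, and the bonded case simply assumes two NICs must fit in the same budget.

```python
CORE_COST = (60, 100)   # dollars per core (low, high), as quoted above
PLATFORM_COST = 50      # motherboard and packaging share for that core


def soft_switch_cost_range() -> tuple[int, int]:
    """Worst-case cost of dedicating one full core to soft switching."""
    low, high = CORE_COST
    return low + PLATFORM_COST, high + PLATFORM_COST


def nic_pencils_out(nic_premium: float, bonded: bool = False) -> bool:
    """True if the premium over a standard NIC fits the soft-switch budget.

    With bonded NICs, two cards have to fit in the same per-server budget.
    """
    _, budget = soft_switch_cost_range()
    nics = 2 if bonded else 1
    return nic_premium * nics <= budget


print(soft_switch_cost_range())
# Hypothetical premiums over a standard NIC, for illustration only.
for premium in (75, 150, 300):
    print(premium, nic_pencils_out(premium), nic_pencils_out(premium, bonded=True))
```

The point of the sketch is only that the budget is small: once a full core plus its share of the motherboard is the alternative, a specialized NIC has very little headroom before soft switching wins on price.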
However, clearly it should be possible to make a NIC that is optimized for virtual edge switching as the raw components (when purchased at scale) do not justify these high price tags. We’ve found that usually NPU chip vendors do not manufacture their own NIC cards (or do that only for evaluation purposes), and 3rd parties charge a significant premium for their role in the supply chain. Therefore, we posit that this is largely a supply chain issue and perhaps due to the immaturity of the market. Regular NICs are a commodity; intelligent NICs are still a “luxury” whose price bears no proportional relationship to the BoM difference.
Of course, this argument is based on traditional cloud packet loads. If the environment were instead hosting an application with small-sized and/or latency-sensitive traffic (e.g. voice), then system design criteria would be very different and a specialized hardware solution would become more compelling.
Drivers are nonexistent or poor for specialized NICs. If you look at the HCLs (hardware compatibility lists) for VMware, Citrix, or even the common Linux distributions, NPU-based NICs are rarely (if ever) supported. Generally software solutions rely on the underlying operating system to provide the appropriate drivers. Going with a nonstandard NIC requires getting the hypervisor vendors to support it, which is unlikely. Even multicore NPUs are not supported well, because they are based on non-x86 ISAs (MIPS, ARM, PPC) with much more limited support. Practically, it becomes a chicken-and-egg problem: major hypervisor vendors don’t spend enough effort to support lower volume multicore NPUs, and system developers don’t select these NPUs because of the lack of drivers, which keeps volume low. Any break out of that vicious circle may change the equation.
While the development tool set for embedded processors has come a long way, it still doesn’t match that of a standard x86/Linux environment. To be clear, this is only a very minor hurdle to a large development shop with in-house expertise in embedded development.
However, from the perspective of a smaller company, dealing with embedded development often means finding and employing relatively specialized developers, more expensive tools (often), and a more complex debug and testing environment. My (Martin’s) experience with side-by-side projects working both on specialized hardware environments and on standard (non-embedded) x86 servers is that the former is at least twice as slow.
Where does that leave us?
A quick summary of the discussion thus far is as follows.
Considering only the cost and performance properties of specialized hardware, a virtual switching solution using them should have better price performance, and lower power than an equivalent x86 solution. However, few virtualized workloads could actually take advantage of the additional hardware. And supply side issues (no such component exists today at a competitive price point), and complications in inserting specialized hardware into today’s server ecosystem, remain enormous hurdles to realizing this potential. Which is probably why 99% of all virtual deployments over the last decade (and certainly all of the largest virtual operations in the world) have relied on soft switching.
Alex is right to remind us that it is not always a technology limitation, and that there is room for hardware to come in at the right price/performance if the supply chain and development support mature, which they very well may. We hope that intelligent NICs will become more affordable and their price will reflect the BoM. We also hope that NPU vendors will in some way sponsor the development of hypervisor and middleware layers by major SW suppliers, as opposed to the proprietary solutions available on the market today.
As I’ve mentioned previously, there are a number of production deployments which use Open vSwitch directly on hardware. In an upcoming blog post, we’ll dig into those use cases a little further.
NetworkWorld recently posted a blog that includes an interview with me (Martin) on OpenFlow’s roots and relevance in the current ecosystem.
While it’s generous to label me as the “inventor” of OpenFlow, the accolade is somewhat misleading. So I wanted to jot off a quick post to clarify things a bit.
To begin with, there is no sole inventor of OpenFlow. It’s been the product of many tens of individuals and organizations, over multiple years, and really, it’s still in its infancy and can be expected to evolve drastically going forward.
Regarding the specifics of its origins: I wrote the first, half-baked, unofficial draft in late 2007 with Nick McKeown (my advisor) as a follow-on to research we were doing at Stanford with Scott Shenker and Justin Pettit (among others). The first contributors to a “full” spec were Ben Pfaff, Justin, Nick, and myself. Ben and Justin did the lion’s share of the work evolving the protocol into something that could actually be used and implemented. And Justin was the primary editor and wrote the bulk of the initial spec text.
Justin, Ben, and I were at Nicira at the time and needed a protocol for remote switch management, so we coordinated with Nick to develop something that could be useful to a wider community.
Within a few months, we handed the spec, along with a working implementation (primarily written by Ben and Justin), to Stanford, as there was growing interest in building a community effort around it. Since then, we’ve continued to play a limited role in its development. However, there have been many very influential contributors since (Rajiv Ramanathan, Jean Tourrilhes, Glen Gibb, Brandon Heller, and Ed Crabbe, just to name a very few).
So there you have it. The largely uninteresting, and somewhat convoluted origins of OpenFlow.