[This post was written with Andrew Lambeth. Andrew has been virtualizing networking for long enough to have coined the term "vswitch", and led the vDS distributed switching project at VMware. ]
Or at least, it doesn’t need to solve the problem in the same way.
It’s commonly said that “networking needs a VMware”. Hell, there have been occasions in which we’ve said something very similar. However, while the analogy has an obvious appeal (virtual, flexible, thin layer of indirection in software, commoditize, commoditize, commoditize!), a closer look suggests that it draws from a very superficial understanding of the technology, and in the limit, it doesn’t make much sense.
It’s no surprise that many are drawn to this line of thought. It probably stems from the realization that virtualizing the network rather than managing the physical components is the right direction for networks to evolve. On this point, it appears there is broad agreement. In order to bring networking up to the operational model of compute (and perhaps disrupt the existing supply chain a bit) virtualization is needed.
Beyond this gross comparison, however, the analogy breaks down. The reality is that the technical requirements for server virtualization and network virtualization are very, very different.
Server Virtualization vs. Network Virtualization
With server virtualization, virtualizing CPU, memory and device I/O is incredibly complex, and the events that need to be handled with translation or emulation happen at CPU cycle timescale. So the virtualization logic must be both highly sophisticated and highly performant on the “datapath” (the datapath for compute virtualization being the instruction stream and I/O events).
On the other hand, the datapath operations for network virtualization are almost trivially simple. All they involve is mapping one address/context space to another address/context space. This effectively reduces to an additional header (or tag) on the packet, and one or two more lookups on the datapath. Somewhat revealing of this simplicity, there are multiple reasonable solutions that address the datapath component; NVGRE and VXLAN are two recently publicized proposals.
If the datapath is so simple, it’s reasonable to ask why network virtualization isn’t already a solved problem.
The answer is that there is a critical difference between network virtualization and server virtualization, and that difference is where the bulk of the complexity for network virtualization resides.
What is that difference?
Virtualized servers are effectively self-contained in that they are only very loosely coupled to one another (there are a few exceptions to this rule, but even then, the groupings with direct relationships are small). As a result, the virtualization logic doesn’t need to deal with the complexity of state sharing between many entities.
A virtualized network solution, on the other hand, has to deal with all ports on the network, most of which can be assumed to have a direct relationship (the ability to communicate via some service model). Therefore, the virtual networking logic not only has to deal with state that grows as N² (assuming every port wants to talk to every other port), but it has to ensure that state is consistent (or at least safely inconsistent) across all of the elements on the path of a packet. Inconsistent state can result in packet loss (not a huge deal) or, much worse, delivery of the packet to the wrong location.
It’s important to remember that networking traditionally has only had to deal with eventual consistency. That is, “after a state change, the network will take some time to converge, and until that time, all bets are off”. Eventual consistency is fine for basic forwarding provided that loops are prevented using a TTL, or perhaps the algorithm ensures loop freedom while it is converging. However, eventual consistency doesn’t work so well with virtualization. During failure, for example, it would suck if packets from tenant A managed to leak over to tenant B’s network. It would also suck if ACLs configured in tenant A’s network were not enforced correctly during convergence.
Put simply, the difference between server virtualization and network virtualization is that network virtualization is all about scale (dealing with the complexity of many interconnected entities, which is generally an N² problem), and all about distributed state consistency. Or more concretely, it is a distributed state management problem rather than a low-level exercise in dealing with the complexities of various hardware devices.
Of course, depending on the layer of networking being virtualized, the amount of state that has to be managed varies.
All network virtualization solutions have to handle basic address mapping. That is, they provide a virtual address space (generally the addresses of the packets within the tunnel), a physical address space (the external tunnel header), and a mapping between the two (virtual address X is at physical address Y). Any of the many tunnel overlay solutions, whether ad hoc, proprietary, or standardized, provides this basic mapping service.
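To make that mapping service concrete, here’s a minimal sketch in Python. The names and the dictionary-based table are invented for illustration; real systems differ mainly in how these entries are populated, distributed, and kept consistent (which, as argued above, is where the hard part lives).

```python
# A minimal sketch of the basic address-mapping service described above.
# All names here are illustrative, not from any particular implementation.

class MappingService:
    def __init__(self):
        # virtual address (e.g., a tenant VM's MAC) -> physical address
        # (the IP of the hypervisor/tunnel endpoint currently hosting it)
        self.virt_to_phys = {}

    def publish(self, virtual_addr, physical_addr):
        """Record that virtual address X is at physical address Y."""
        self.virt_to_phys[virtual_addr] = physical_addr

    def resolve(self, virtual_addr):
        """Datapath lookup: which tunnel endpoint gets this packet?"""
        return self.virt_to_phys.get(virtual_addr)

service = MappingService()
service.publish("52:54:00:aa:bb:cc", "10.0.1.7")  # VM MAC -> hypervisor IP
assert service.resolve("52:54:00:aa:bb:cc") == "10.0.1.7"
```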
Virtualizing L2, then, requires almost no additional state management. The L2 forwarding tables are dynamically populated from passing traffic. And a single broadcast domain has fairly limited scale, supporting hundreds or low thousands of active MACs. So the only additional state that has to be managed is the association of a port (virtual or physical) to a broadcast domain, which is what virtual networking standards like NVGRE and VXLAN provide.
As an aside, it’s a shame that standards like NVGRE and VXLAN choose to dictate the wire format (important for hardware compatibility) and the method for managing the context mapping between address domains (multicast), but not the control interface to manage the rest of the state. Specifying the wire format is fine. However, requiring a specific mechanism (and a shaky one at that) for managing the virtual to physical address mappings severely limits the solution space. And not specifying the control interface for managing the rest of the state effectively guarantees that implementations will be vertically integrated and proprietary.
For L3, there is a lot more state to deal with, and the number of end points to which this state applies can be very large. There are a number of datacenters today that have, or plan to have, millions of VMs. Because of this, any control plane that hopes to offer a virtualized L3 solution needs to manage potentially millions of entries at hundreds of thousands of end points (assuming the first-hop network logic is within the vswitch). Clearly, scale is a primary consideration.
As another aside, in our experience there is a lot of confusion about what exactly L3 virtualization is. While a full discussion will have to wait for a future post, it is worth pointing out that running a router as a VM is *not* network virtualization; it is x86 virtualization. Network virtualization involves mapping between network address contexts in a manner that does not affect the total available bandwidth of the physical fabric. Running a networking stack in a virtual machine, while it does provide the benefits of x86 virtualization, limits the cross-sectional bandwidth of the emulated network to the throughput of a virtual machine. Ouch.
For L4 and above, the amount of state that has to be shared and the rate that it changes increases again by orders of magnitude. Take, for example, WAN optimization. A virtualized WAN optimization solution should be enforced throughout the network (for example, each vswitch running a piece of it) yet this would incur a tremendous amount of control overhead to create a shared content cache.
So while server virtualization lives and dies by the ability to deal with the complexity of virtualizing complex hardware interfaces of many devices at speed, network virtualization’s primary technical challenge is scale. Any solution that doesn’t deal with this up front will probably run into a wall at L2, or with some luck, basic L3.
This is all interesting … but why do I care?
Full virtualization of the network address space and service model is still a relatively new area. However, rather than tackle the problem of network virtualization directly, it appears that a fair amount of energy in industry is being poured into point solutions. This reminds us of the situation 10 years ago, when many people were trying to solve server sprawl problems with application containers, and the standard claim was that virtualization didn’t offer enough additional benefit to justify the overhead and complexity of fully virtualizing the platform. Had that mindset prevailed, today we’d have solutions doing minimal server consolidation for a small handful of applications on only one or possibly two OSs, instead of a set of solutions that solve this and many, many more problems for any application and most any OS. That mindset, for example, could never have produced vMotion, which was unimaginable at the outset of server virtualization.
At the same time, those who are advocating for network virtualization tend to draw technical comparisons with server virtualization. And while clearly there is a similarity at the macro level, this comparison belies the radically different technical challenges of the two problems. And it belies the radically different approaches needed to solve the two problems. Network virtualization is not the same as server virtualization any more than server virtualization is the same as storage virtualization. Saying “the network needs a VMware” in 2012 is a little like saying “the x86 needs an EMC” in 2002.
Perhaps the confusion is harmless, but it does seem to affect how the solution space is viewed, and that may be drawing the conversation away from what really is important: scale (lots of it) and distributed state consistency. Worrying about the datapath is worrying about a trivial component of an otherwise enormously challenging problem.
[This post was written with Andrew Lambeth. A version of it has been posted on Search Networking, but I wasn't totally happy with how that turned out. So here is a revised version. ]
[A lot of the content of this post was drawn from conversations with Juan Lage, Rajiv Ramanathan, and Mohammad Attar]
Some of the more aggressive buzz around OpenFlow has lauded it as a mechanism for re-implementing networking in total. While in some (fairly fanciful) reality that could be the case, categorical statements of this nature tend to hide practical design trade-offs that exist between any set of technologies within a design continuum.
What do I mean by that? Just that there are things that OpenFlow and more broadly SDN are better suited for than others.
In fact, the original work that led to OpenFlow was not meant to re-implement all of networking. That’s not to say that this isn’t a worthwhile (if quixotic) goal. Yet our focus was to explore new methods for managing datapath state that is difficult to compute using the traditional approach of full distribution with eventual consistency.
And what sort of state might that be? Clearly destination-based, shortest path forwarding state can be calculated in a distributed fashion. But there is a lot more state in the datapath beyond that used for standard forwarding (filters, tagging, policy routing, QoS policy, etc.). And there are a lot more desired uses for networks than vanilla destination-based forwarding.
Of course, much of this “other” state is not computed algorithmically today. Rather, it is updated manually, or through scripts whose function more closely resembles macro replacement than computation.
Still, our goal was (and still is) to compute this state programmatically.
So the question really boils down to whether the algorithm needed to compute the datapath state is easily distributed. For example, if the state management algorithm has any of the following properties, it probably isn’t, and it is therefore a potential candidate for SDN.
- It is not amenable to being split up into many smaller pieces (as opposed to fewer, larger instances). The limiting property in these cases is often excessive communication overhead between the control nodes.
- It is not amenable to running on heterogeneous compute environments, for example those with varying processor speeds and available memory.
- It is not amenable to relatively long RTTs for communication between distributed instances. In the purely distributed case, the upper bound for communicating between any two nodes scales linearly with the longest loop-free path.
- The algorithm requires sophisticated distributed coordination between instances (for example, distributed locking or leader election).
There are many examples of algorithms with these properties which can be (and are) used in networking. The one I generally use as an example is a runtime policy compiler. One of our earliest SDN implementations was effectively a Datalog compiler that would take a topologically independent network policy and compile it into flows (in the form of ACL and policy routing rules).
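As a toy illustration of the idea (and only that: the real system was a Datalog compiler, and everything below is invented for exposition), compiling a topology-independent policy into concrete rules might look like this:

```python
# Toy policy compiler: expand group-level, topology-independent policy
# into concrete ACL entries (first match wins). The policy language and
# rule format are invented for this sketch.

policy = [
    ("web", "db", "allow"),   # (source group, destination group, action)
    ("*",   "db", "deny"),
]

groups = {  # group membership; discovered at runtime in a real system
    "web": ["10.0.1.0/24"],
    "db":  ["10.0.2.0/24"],
}

def expand(name):
    return ["0.0.0.0/0"] if name == "*" else groups[name]

def compile_policy(policy):
    rules = []
    for src, dst, action in policy:
        for s in expand(src):
            for d in expand(dst):
                rules.append({"nw_src": s, "nw_dst": d, "action": action})
    return rules

for rule in compile_policy(policy):
    print(rule)
```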
Other examples include implementing global solvers to optimize routes for power, cost, security policy, etc., and managing distributed virtual network contexts.
The distribution properties of most of these algorithms are well understood (and have been for decades). And so it is fairly straightforward to put together an argument which demonstrates that SDN offers advantages over traditional approaches in these environments. In fact, outside of OpenFlow, it isn’t uncommon to see elements of the control plane decoupled from the dataplane in security, management, and virtualization products.
However, networking’s raison d’être, its killer app, is forwarding. Plain ol’ vanilla forwarding. And as we all know, the networking community long ago developed algorithms to do that which distribute wonderfully across many heterogeneous compute nodes.
So, that begs the question. Does SDN provide any value to the simple problem of forwarding? That is, if the sole purpose of my network is to move traffic between two end-points, should I use a trusty distributed algorithm (like an L3 stack) that is well understood and has matured for the last couple of decades? Or is there some compelling reason to use an SDN approach?
This is the question we’d like to explore in this post. The punchline (as always, for the impatient) is that I find it very difficult to argue that SDN has value when it comes to providing simple connectivity. That’s simply not the point in the design space that OpenFlow was created for. And simple distributed approaches, like L3 + ECMP, tend to work very well. On the other hand, in environments where transport is expensive along some dimension, and global optimization provides value, SDN starts to become attractive.
First, let’s take a look at the problem of forwarding:
Forwarding, to me, simply means finding a path between a source and a destination in a network topology. Of course, you don’t want the path to suck, meaning that the algorithm should efficiently use available bandwidth and not choose horribly suboptimal hop counts.
For the purposes of this discussion, I’m going to assume two scenarios in which we want to do forwarding: (a) bandwidth and connectivity are cheap and plentiful, and any path between two points is roughly as good as another; (b) none of these properties hold.
We’ll start with the latter case.
In many networks, not all paths are equal. Some may be more expensive than others due to costs of third party transit, some may be more congested than others leading to queuing delays and loss, different paths may support different latencies or maximum bandwidth limits, and so on.
In some deployments, the forwarding problem can be further complicated by security constraints (all flows of type X must go through middleboxes) and other policy requirements.
Further, some of these properties change in real time. Take, for example, the cost of transit. The price of a link could increase dramatically after some threshold of use has been hit. Or consider how a varying traffic matrix may affect the queuing delay and available bandwidth of network paths.
Under such conditions, one can start to make an argument for SDN over traditional distributed routing protocols.
Why? For two reasons. First, the complexity of the computation increases with the number of properties being optimized over. It also increases with the complexity of the policy model; for example, a policy that operates over source, protocol, and destination is going to be more difficult than one that only considers the destination.
Second, as the frequency with which these properties change increases, the amount of information that needs to be disseminated increases. An SDN approach can greatly limit the total amount of information that needs to hit the wire by reducing the distribution of the control nodes. Fully distributed protocols generally flood this information, as it isn’t clear which node needs to know about the updates. There have been many proposals in the literature to address these problems, but to my knowledge they’ve seen little or no adoption.
I presume these costs are why a lot of TE engines operate offline, where you can throw a lot of compute and memory at the problem.
So, if optimality is really important (due to the cost of getting it wrong), and the inputs change a lot, and there is a lot of stuff being optimized over, SDN can help.
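For a feel of what such a global computation looks like, here’s a heavily simplified sketch using the networkx graph library: centralized path selection over a cost that mixes transit price and congestion. The topology, weights, and objective are all made up; real TE formulations are far richer.

```python
# Sketch of the kind of global computation a centralized TE engine can
# run: pick paths over a combined cost mixing transit price and current
# congestion. Graph and weights are invented for illustration.
import networkx as nx

G = nx.DiGraph()
# edge attributes: dollars per Gb of transit, and current utilization
G.add_edge("A", "B", transit=1.0, util=0.2)
G.add_edge("B", "D", transit=1.0, util=0.9)   # cheap but congested
G.add_edge("A", "C", transit=3.0, util=0.1)
G.add_edge("C", "D", transit=3.0, util=0.1)   # pricey but idle

def cost(u, v, attrs):
    # toy objective: transit price, inflated sharply as links saturate
    return attrs["transit"] / max(1e-6, 1.0 - attrs["util"])

print(nx.shortest_path(G, "A", "D", weight=cost))  # -> ['A', 'C', 'D']
```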
Now, what about the case in which bandwidth is abundant, relatively cheap and any sensible path between two points will do?
The canonical example for this is the datacenter fabric. In order to accommodate workload placement anywhere, datacenter physical networks are often built using non-oversubscribed topologies. And the cost of equipment to build these is plummeting. If you are comfortable with an extremely raw supply channel, it’s possible to get 48 ports of 10G today for under $5k. That’s pretty damn cheap.
So, say you’re building a datacenter fabric, you’ve purchased piles of cheap 10G gear and you’ve wired up a fat tree of some sort. Do you then configure OSPF with ECMP (or some other multipathing approach) and be done with it? Or does it make sense to attempt to use an SDN approach to compute the forwarding paths?
It turns out that efficiently calculating forwarding paths in a highly connected graph, and converging those paths on failure is something distributed protocols do really, really well. So well, in fact, that it’s hard to find areas to improve.
For example, the common approach of using multipathing approximates Valiant load balancing which effectively means the following: if you send a packet to an arbitrary point in the network, and that point forwards it to the destination, then for a regular topology, and any traffic matrix, you’ll be pretty close to fully using the fabric bandwidth (within a factor of two given some assumptions on flow arrival rates).
That’s a pretty stunning statement. What’s more stunning is that it can be accomplished with local decisions, and without requiring per-flow state, or any additional control overhead in the network. It also obviates the need for a control loop to monitor and respond to changes in the network. This latter property is particularly nice as the care and feeding of control loops to prevent oscillations or divergence can add a ton of complexity to the system.
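For intuition, here’s a minimal sketch of the core Valiant trick, under the (big) assumption of a regular two-stage topology. The point is simply that the bounce decision is purely local and stateless:

```python
# Minimal sketch of Valiant load balancing's core idea: route via a
# uniformly random intermediate node, then on to the destination. The
# topology is a toy; note the "bounce" needs no per-flow state and no
# control loop anywhere in the fabric.
import random

SPINES = ["s1", "s2", "s3", "s4"]  # intermediate stage of a regular topology

def vlb_route(src_leaf, dst_leaf):
    bounce = random.choice(SPINES)        # phase 1: random intermediate
    return [src_leaf, bounce, dst_leaf]   # phase 2: shortest path onward

print(vlb_route("leaf3", "leaf9"))  # e.g. ['leaf3', 's2', 'leaf9']
```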
On the other hand, trying to scale a solution using classic OpenFlow almost certainly won’t work. To begin with, the use of n-tuples (say, per-flow, or even per host/destination pair) will most likely result in table-space exhaustion. Even with very large tables (hundreds of thousands of entries), the solution is unlikely to be suitable for even moderately sized datacenters.
Further, to efficiently use the fabric, multipathing should be done per flow. It’s highly unlikely (in my experience) that having the controller participate in flow setup will have the desired performance and scale characteristics. Therefore the multipathing should be done in the fabric (which is possible in a later version of OpenFlow like 1.1 or the upcoming 1.2).
Given these constraints, an SDN approach would probably look a lot like a traditional routing protocol. That is, the resulting state would most likely be destination IP prefixes (so we can take advantage of aggregation and reduce the table requirements by a factor of N over source-destination pairs). Further, multipathing, and link failure detection would have to be done on the switch.
Another complication of using SDN is establishing connectivity to the controller. This clearly requires each switch to run something to bootstrap communication, like a traditional protocol. In most SDN implementations I know of, L3 is used for this purpose. So now, not only are we effectively mimicking L3 in the controller to manage datapath state, we haven’t been able to get rid of the distributed approach, potentially doubling the control complexity network-wide.
So, does SDN provide value for forwarding in these environments? Given the previous discussion it is difficult to argue in favor of SDN.
Does SDN provide more functionality? Unlikely. Being limited to manipulating destination prefixes, with multipathing carried out by the switches, robs SDN of much of its value. The controller cannot make decisions on anything but the destination. And since the controller doesn’t know a priori which path a flow will take, it will be difficult to add additional table rules without replicating state all over the network (clearly a scalability issue).
How about operational simplicity? One might argue that in order to scale an IGP one would have to manually configure areas, which presumably would be done automatically with SDN. However, it is difficult to argue that building a separate control network, or configuring in-band control, is any less complex than a simple ID-to-switch mapping.
What about scale? I’ll leave the details of that discussion to another post. Juan Lage, Rajiv Ramanathan, and I have had an on-again, off-again e-mail discussion comparing the scaling properties of SDN to those of L3 for building a fabric. The upshot is that there are nice proof points on both sides, but given today’s switching chipsets, even with SDN, you basically end up populating the same tables that L3 does. So any scale argument revolves around reducing the RTT to the controller through a single-function control network, and reducing the need to flood. Neither of these tricks is likely to produce a significant scale advantage for SDN, but they do seem to produce some.
So, what’s the take-away?
It’s worth remembering that SDN, like any technology, is actually a point in the design space, and is not necessarily the best option for all deployment environments.
My mental model for SDN starts with looking at the forwarding state and asking the question, “what algorithm needs to run to compute that state?” The second question I ask is, “can that algorithm be easily distributed amongst many nodes with different compute and memory resources?” In many use cases, the answer to this latter question is “not really”. The examples I provide generally reduce to global solvers used for finding optimal solutions with many (often changing) variables. And in such cases, SDN is a shoo-in.
However, in the case of building out networking fabrics in which bandwidth is plentiful and shortest path (or something similar) is sufficient, there are a number of great, well tested distributed algorithms for the job. In such environments, it’s difficult to argue the merits of building out a separate control network, adding additional control nodes, and running vastly less mature software.
But perhaps, that’s just me. Thoughts?
For those of you who are interested, my keynote at last week’s Open Networking Summit provides some background on OpenFlow and SDN in the form of a historical narrative. However, the main point of the talk (which doesn’t come across as well as I would have liked) is that it really is the community, and not necessarily the technology, that makes the SDN movement so special. I believe the technology will work itself out. However, building a diverse community with strong representation across the networking ecosystem (from ODMs, to customers, and everyone in between) is a very difficult undertaking. And now that we have such a community let’s be sure to acknowledge its importance and focus on continuing to cultivate and grow it.
And for some only tangentially related trivia, I’ve been tracking down the origins of the term ‘SDN’. From what I’ve been able to dig up, it was coined by Kate Greene (who had covered software defined radio) while putting together this article. So, thanks Kate.
This is roughly 6 weeks after the draft was made public. Of course, the standardization process will probably change a few things, but it’s great to be able to have something tangible now. And, as I’ve mentioned before, Open vSwitch should already support NVGRE.
There has been more movement in the industry towards L2 in L3 tunneling from the edge as the approach for tackling issues with virtual networking.
Hot on the heels of the VMware/Cisco-led VXLAN announcement, an Internet draft on NVGRE (authored by Microsoft, Intel, Dell, Broadcom, and Arista, though my guess is that Microsoft is the primary driver) showed up with relatively little fanfare. You can check it out here.
In this blog post, I’ll briefly introduce NVGRE. However, I’d like to spend more time providing broader context on where these technologies fit into the virtual networking solution space. Specifically, I’ll argue that the tunneling protocol is a minor aspect of a complete solution, and that we need open standards around the control and configuration interface as much as we need to standardize the wire format.
We’ll get to that in a few paragraphs. But first, what is NVGRE?
NVGRE is very similar to VXLAN (my comments on VXLAN here). Basically, it uses GRE as a method to tunnel L2 packets across an IP fabric, and uses 24 bits of the GRE key as a logical network discriminator (which they call a tenant network ID or TNI). By logical network discriminator, I mean it indicates which logical network a particular packet is part of. Also like VXLAN, logical broadcast is achieved through physical multicast.
A day in the life of a packet is simple. I’ll sketch out the case of a packet being sent from a VM. The vswitch, on receiving a packet from a vNic, does two lookups: (a) it uses the destination MAC address to determine which tunnel to send the packet to; (b) it uses the ingress vNic to determine the tenant network ID. If the MAC in (a) is known, the vswitch will cram the packet into the associated point-to-point GRE tunnel, setting the GRE key to the tenant network ID. If it isn’t known, it will tunnel the packet to the multicast address associated with that tenant network ID. Easy peasy.
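In hedged pseudo-Python, that egress path looks something like the following. The tables and names are invented for illustration; this is not code from any NVGRE implementation.

```python
# Sketch of the vswitch egress path just described: two lookups, then
# GRE encapsulation with the key set to the tenant network ID (TNI).
# All tables and names are illustrative; this is not Open vSwitch code.

mac_to_tunnel = {"52:54:00:aa:bb:cc": "10.0.1.7"}  # learned L2 table
vnic_to_tni = {"vnic1": 0x00ABCD}                  # admin-configured
tni_to_mcast = {0x00ABCD: "239.1.1.5"}             # flood group per TNI

def egress(vnic, dst_mac, frame):
    tni = vnic_to_tni[vnic]              # lookup (b): which tenant network?
    remote = mac_to_tunnel.get(dst_mac)  # lookup (a): which tunnel endpoint?
    if remote is None:
        remote = tni_to_mcast[tni]       # unknown MAC: logical broadcast
    return {"outer_dst_ip": remote, "gre_key": tni, "payload": frame}

print(egress("vnic1", "52:54:00:aa:bb:cc", b"...inner ethernet frame..."))
```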
While architecturally NVGRE is very similar to VXLAN, there are some differences that have practical implications.
On the positive side, NVGRE’s use of GRE eases the compatibility requirement for existing hardware and software stacks. Many of the switching chips I’m familiar with already support GRE, so porting it to these environments is likely much easier than a non-supported tunneling format.
As another example, on a quick read of the RFC, it appears that Open vSwitch can already support NVGRE — it supports GRE, allows for looking up and setting the GRE key (including masking, so it can be limited to 24 bits), and it supports learning and explicit population of the logical L2 table. In Open vSwitch’s case, all of this can be driven through OpenFlow and the Open vSwitch configuration protocol.
On the other hand, GRE does not ride on a standard transport protocol (TCP/UDP), so logical flow information cannot be reflected in the outer header ports as it can with VXLAN. This means that ECMP hashing in the fabric cannot provide flow-level granularity, which is desirable to take advantage of all available bandwidth. As the protocol catches on, this is simple to address in hardware and will likely happen. [Update: Since writing this, I've been told that some hardware can use the GRE key in the ECMP hash, which improves the granularity of load balancing in the fabric. However, per-flow load balancing (over the logical 5-tuple) is still not possible.]
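For those unfamiliar with the trick VXLAN gets from its UDP encapsulation, the sketch below shows the general idea; the hash choice and port range are mine, not from the spec.

```python
# The flow-entropy trick UDP encapsulation enables: hash the inner
# 5-tuple into the *outer* UDP source port, so fabric ECMP (which hashes
# outer headers) spreads distinct inner flows across paths.
import zlib

def outer_udp_src_port(inner_five_tuple):
    h = zlib.crc32(repr(inner_five_tuple).encode())
    return 49152 + (h % 16384)  # one ephemeral-range port per inner flow

flow = ("10.0.1.5", "10.0.2.9", 6, 55123, 443)  # src, dst, proto, sport, dport
print(outer_udp_src_port(flow))
```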
OK, so there you have it. A very brief glimpse of NVGRE. Sure there is more to say, but I have a hard time getting excited about tunneling protocols. Why? That is what I’m going to talk about next.
Tunneling vs. Network Virtualization
So, while it is moderately interesting to explore NVGRE and VXLAN, it’s important to remember that the tunneling formats are really a very minor (and easily changed) component of a full network virtualization solution. That is, how a system tunnels, whether over UDP, or GRE, or CAPWAP or something else, doesn’t define what functions the system provides. Rather it specifies what the tunnel packets look like on the wire (an almost trivial consideration).
It’s also important to remember that these two proposals in particular (VXLAN and NVGRE), while an excellent step in the right direction, are almost certainly going to see a lot of change going forward. As they stand now, they are fairly limited. For example, they only support L2 within the logical network. Also, they both abdicate a lot of responsibility to multicast. This limits scalability on many modern fabrics, requires the deployment of multicast (which many large operators eschew), and has real shortcomings when it comes to speed of provisioning a new network, manageability, and security.
Clearly these issues are going to be addressed as the protocols mature. For example, it’s likely that there will be support for both L2 and L3 in the logical network, as well as ACLs, QoS primitives, etc. Also, it is likely that there will be some primitives for supporting secure group joins, and perhaps even support for creating optimized multicast trees that don’t rely on the fabric (some existing solutions already do this).
So let’s step back and look at the current situation: we have “standardized” the tunneling protocol and some basic mechanisms for L2 in logical space, and not much else. However, we know that these protocols will need to evolve going forward. And it’s very likely that this evolution will have non-trivial dependencies on the control path.
Therefore, I would argue that the real issue is not “let’s standardize the tunneling protocol” but rather “let’s standardize the control interface for configuring the tunnels and associated state, so system developers can use it to address these and future challenges”. That is, if the vswitches and pswitches of the future have souped-up versions of NVGRE and VXLAN, something is still going to have to provide the orchestration of these primitives. And that is where the real function comes from. Neither of these proposals has addressed this; for example, the interface for providing the mapping from a vNIC to a tenant network ID is unspecified.
So what might this look like? I’ll use Open vSwitch to provide a specific example. Open vSwitch is a great platform for NVGRE (and soon VXLAN). It supports tens of thousands of GRE tunnels without a problem. And it allows setting and doing lookups on the GRE key. But what makes it immediately useful is not that it supports these tunneling primitives, but that a third party can pick it up, and use OpenFlow to configure those keys, tunnels, and any other network state (ACLs, L3, etc.) to build a full solution.
And that is the crux of the issue. If you get a solution that supports VXLAN or NVGRE, even though they are standardized, you are only getting a piece of the solution. The second piece we need, and should all ask for, is an open standard for configuring this interface. We (Open vSwitch) use OpenFlow. But any standard will do.
What about Microsoft?
From what I can tell (and admittedly, I don’t know very much), this seems to be what Microsoft is getting right. Rather than just specify a tunneling protocol, it appears that they’re also opening up the vswitch interface to support implementations from multiple vendors. While this itself doesn’t provide an open interface to virtual networking configuration, it does allow someone to do that work. For example, NEC has announced they will be releasing an OpenFlow compatible vSwitch for Windows Server 8. My guess is that this is a port of Open vSwitch. (Hey NEC, if you are using Open vSwitch, care to share the changes with the rest of the community?)
In any case, NEC will probably charge you for it, but we will work hard to make a full Open vSwitch port for Windows Server 8 available. Of course it will be open source, so we encourage users to get it for free and innovate in the source as well as use the open management protocols.
We’ll also be demoing Open vSwitch support for NVGRE within OpenStack/Quantum, so stay tuned for that.
Anyway, kudos to Microsoft for supporting L2 in L3 and adding a contender to the pool. It will be interesting to see what finally pops out of the IETF standardization process. I presume it will be some agglomeration achieved through consensus.
And double kudos for opening the vswitch interface. I think we’re going to see a lot of cool innovation in networking around Windows Server 8. And that’s good for all of us.
[This post is written by Alex Bachmutsky and Martin Casado. Alex is a Distinguished Engineer at Ericsson in Silicon Valley, driving system architecture aspects of the company's next generation platform. He is the author of the book “Platform Design for Telecommunication Gateways,” and co-authored “WiMAX Evolution: Emerging Technologies and Applications”.]
This is yet another (unplanned) addendum to the soft switching series.
Our previous posts argued that for virtual edge switching, using the server’s compute was the best point in the design space given the current hardware landscape. The basic argument was that it is possible to achieve 10G from the server in soft switching by dedicating a single core (about $60 – $120 of silicon, depending on how you count). So, it is difficult to make the argument for passthrough plus specialized hardware for two reasons. First, given currently available choices, price performance will almost certainly be higher with x86. Second, you lose many of the benefits of virtualization that are retained when using soft switching (which we’ve explained here).
Alex submitted two very good (and very detailed) responses to our claims (you can see the summary here). In them he argued that, looking at basic component costs, a hardware offload solution should both consume less power and provide better cost performance than doing the forwarding in x86. He also argued that switching and NPU chipsets are flexible and support mature, high-level development environments, which may make them suitable for the virtual networking problem.
Alex is an expert in this area, he’s incredibly experienced, and, at least partially, he’s right. Specialized hardware should be able to hit better price/performance and lower power. And it is true that development environments have come a long way.
So what’s the explanation for the discrepancy in the viewpoints?
That is what we’ve teamed up to discuss in this post. It turns out the differences in view are twofold. The first difference comes from the perspective of a large company (and commensurate purchasing power) versus that of a handful of customers. While today “intelligent NICs” are at least twice (generally more like 3-5x) as expensive as a pure soft solution, this is a supply-side issue and isn’t justifiable by the bill of materials (BoM). A sufficiently large company with sufficient investment could overcome this obstacle.
The second difference is distributing an appliance versus distributing software. Distributing an appliance allows ultimate control over hardware configuration, development environment, runtime environment etc. Distributing software, on the other hand, often requires dealing with multiple hardware configurations, and complex software interactions both with drivers and the operating system.
Let’s start with a quick resync:
In our original post, we argued that for most workloads, the most flexible and cost effective method for doing networking at the virtual edge is to do soft switching on the server (which is what 99% of all virtual deployments have done over the last decade).
The other two options we considered are using a switching chip (either in the NIC or the ToR switch) or an NPU (most likely on the NIC due to port density issues).
We’ll discuss switching chips first:
In our experience, non-NPU chipsets in the NIC are not sufficiently flexible for networking at the virtualized edge. The limitations are manifold: the size of metadata lookups between tables, the number of stages in the lookup pipeline, and economically viable table sizes (tables can sometimes be increased, for instance with external TCAM, but that is an expensive option and hence ignored here, given the budget-sensitive implementations we are focusing on). A good concrete example is table space for tunnels. Network virtualization solutions often use lots and lots of tunnels (N² in the number of servers is not unusual). Soft switches have no problem supporting tens of thousands of tunnels without performance degradation. This is one or two orders of magnitude more than you’ll find on existing non-NPU NIC chipsets.
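The quick arithmetic behind that concern (a full mesh of tunnels grows quadratically with server count):

```python
# Quick arithmetic behind the tunnel table-space concern: a full mesh of
# point-to-point tunnels grows quadratically with the number of servers.
for n in (100, 500, 1000):
    per_server = n - 1          # tunnels each vswitch terminates
    total = n * (n - 1) // 2    # distinct tunnels fabric-wide
    print(f"{n} servers: {per_server} tunnels/server, {total} tunnels total")
```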
This is simply a limitation of the current supply chain; perhaps a future NIC will be sufficiently flexible for the virtual networking problem, but until then, we’ll omit this as an option.
We both agreed that standard silicon is getting very close to being useful for edge virtual switching. The next generation of 10G switches appears to be particularly compelling. So while they still suffer from the flexibility limitations and table-space issues above, and the shortcomings of hairpinning inter-VM traffic (like reducing edge link bandwidth and ToR switching capacity), they do appear to be a viable option for some subset of the switching decisions going forward. This is due to improved support for tagging and tunneling, increased ACL table sizes, and improved lookup generality between tables.
So we’re hopeful that next-generation ToR switches will have an open interface like OpenFlow so that the network virtualization layer can manage the forwarding state.
So what about NPUs?
To be clear, Open vSwitch has already been ported to multiple NPU-NIC platforms. In the cases I am familiar with, hairpinning inter-VM traffic through the NIC is far slower (presumably due to DMA overhead) than keeping it on the x86. However, off-server traffic requires less CPU.
To explore this solution space in more detail, we’ll subdivide the NPU space into two groups: classical NPUs and multicore-based NPUs.
Many NPUs in the former group have the required level of processing flexibility, and some even have integrated general-purpose CPUs that can be used to implement, for instance, an OpenFlow-based control plane. However, their data plane processing is usually based on proprietary microcode, which may be a development hurdle in terms of available expertise and toolchain.
Any investment to develop on such an NPU is a sunk cost, both in training the developers in the internal details of the hardware feature set and its limitations, and in developing the code to work within the environment. It is very difficult (if not impossible), for example, to port the microcode written for one such NPU to the NPU of another vendor.
So while this route is not only viable but very likely beneficial for creating a mass-produced appliance, the investment for supporting “yet another NIC” in a software distribution is likely unjustifiable. And without sufficient economies of scale for the NPU (which can be ensured by a large vendor), the cost to the consumer would likely stunt adoption.
The latter group is based on general-purpose multicore CPUs, but integrates some NPU-like features, such as streaming I/O, HW packet pre-classification, HW packet re-ordering and atomic flow handling, some level of traffic management, special HW offload engines, and more. Since they are based on general-purpose processors, they can be programmed using similar languages and tools as other soft switch solutions. These NPUs can either be used as a main processing engine, or alternatively offload some tasks from the main CPU when integrated on intelligent NIC cards.
If the supply side can be ironed out, both in terms of purchase model and price, this would seem to strike a reasonable balance between development overhead, portability, and price/power/performance. Below, we’ll discuss some of the other challenges that need to be overcome on the supply side to be competitive with x86.
Let’s start with the economics of switching at the edge. First, in our experience, workloads in the cloud are either mostly idle or handling lots of TCP traffic, generally HTTP offsite (trading and HPC being the obvious exceptions). Thus 10G at MTU-size packets is not only sufficient, it is usually overkill. That said, it is good to be prepared as data usage continues to grow.
Soft switching can be co-resident with the management domain. This has been the standard with Xen deployments in which the Linux bridge or Open vSwitch shares the CPUs allocated to the management domain. For 1G, allocating a single core to both is most likely sufficient. For 10G, allocating an extra core is necessary.
So if we assume the worst case (requiring a full core for networking), given a fairly modern CPU, a core weighs in at $60-$100. Motherboard and packaging for that core is probably another $50 (no need for additional memory or hard drive space for soft switching).
On the high end, if you could get specialized hardware for less than (say) $100 – $150 over the price of a standard NIC per server, then there could be an argument for using it. And since NICs are often bonded for resilience, the premium should probably be half of that, or the NIC should at least include dual ports at the same price for fault resiliency.
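Spelling out that back-of-envelope math, using the rough figures above (estimates, not quotes):

```python
# Back-of-envelope comparison using the rough figures from the text.
core_cost = (60, 100)   # dollars of CPU silicon for the switching core
platform_share = 50     # motherboard/packaging share for that core
low, high = (c + platform_share for c in core_cost)
print(f"soft switching burden: ${low}-${high} per server")  # $110-$150
# With bonded (dual) NICs, a specialized NIC's premium must roughly
# halve per port to break even:
print(f"competitive NIC premium: ${low // 2}-${high // 2} per port")
```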
Unfortunately, to our knowledge, such a NIC doesn’t exist at those price points. If we’re wrong, please let us know (price, relative power, and performance) and we’ll update this. To date, all NPU-based NICs we’ve looked at are 2-5x what they should be to pencil out competitively.
However, it should clearly be possible to make a NIC that is optimized for virtual edge switching, as the raw components (when purchased at scale) do not justify these high price tags. We’ve found that NPU chip vendors usually do not manufacture their own NIC cards (or do so only for evaluation purposes), and third parties charge a significant premium for their role in the supply chain. Therefore, we posit that this is largely a supply chain issue, perhaps due to the immaturity of the market. Regular NICs are a commodity; intelligent NICs are still a “luxury” priced without a proportional relationship to their BoM differences.
Of course, this argument is based on traditional cloud packet loads. If the environment instead was hosting an application with small-sized and/or latency sensitive traffic (e.g. voice), then system design criteria would be very different and a specialized hardware solution becomes more compelling.
Drivers for specialized NICs are nonexistent or poor. If you look at the HCLs (hardware compatibility lists) for VMware, Citrix, or even the common Linux distributions, NPU-based NICs are rarely (if ever) supported. Generally, software solutions rely on the underlying operating system to provide the appropriate drivers. Going with a nonstandard NIC requires getting the hypervisor vendors to support it, which is unlikely. Even multicore NPUs are not supported well, because they are based on non-x86 ISAs (MIPS, ARM, PPC) with much more limited support. Practically, it becomes a chicken-and-egg problem: major hypervisor vendors don’t spend enough effort supporting lower-volume multicore NPUs, and system developers don’t select these NPUs because the lack of drivers causes low volume. Any break in that vicious circle may change the equation.
While the development tool set for embedded processors has come a long way, it still doesn’t match that of a standard x86/Linux environment. To be clear, this is only a very minor hurdle for a large development shop with in-house expertise in embedded development.
However, from the perspective of a smaller company, dealing with embedded development often means finding and employing relatively specialized developers, (often) more expensive tools, and a more complex debug and testing environment. My (Martin’s) experience with side-by-side projects working both on specialized hardware environments and on standard (non-embedded) x86 servers is that the former is at least twice as slow.
Where does that leave us?
A quick summary of the discussion thus far is as follows.
Considering only the cost and performance properties of specialized hardware, a virtual switching solution using it should have better price performance and lower power than an equivalent x86 solution. However, few virtualized workloads could actually take advantage of the additional hardware. And supply-side issues (no such component exists today at a competitive price point), along with the complications of inserting specialized hardware into today’s server ecosystem, remain enormous hurdles to realizing this potential. Which is probably why 99% of all virtual deployments over the last decade (and certainly all of the largest virtual operations in the world) have relied on soft switching.
Alex is right to remind us that it is not always a technology limitation, and that there is room for hardware to come in at the right price/performance if the supply chain and development support mature. Which they very well may. We hope that intelligent NICs will become more affordable and that their price will reflect the BoM. We also hope that NPU vendors will sponsor, in some way, the development of hypervisor and middleware layers by major SW suppliers, as opposed to the proprietary solutions available on the market today.
As I’ve mentioned previously, there are a number of production deployments which use Open vSwitch directly on hardware. In an upcoming blog post, we’ll dig into those use cases a little further.
NetworkWorld recently posted a blog that includes an interview with me (Martin) on OpenFlow’s roots and relevance in the current ecosystem.
While it’s generous to label me as the “inventor” of OpenFlow, the accolade is somewhat misleading. So I wanted to jot off a quick post to clarify things a bit.
To begin with, there is no sole inventor of OpenFlow. It’s been the product of many tens of individuals and organizations, over multiple years, and really, it’s still in its infancy and can be expected to evolve drastically going forward.
Regarding the specifics of its origins: I wrote the first, half-baked, unofficial draft in late 2007 with Nick McKeown (my advisor) as a follow-on to research we were doing at Stanford with Scott Shenker and Justin Pettit (among others). The first contributors to a “full” spec were Ben Pfaff, Justin, Nick, and myself. Ben and Justin did the lion’s share of the work evolving the protocol into something that could actually be used and implemented. And Justin was the primary editor and wrote the bulk of the initial spec text.
Justin, Ben, and I were at Nicira at the time and needed a protocol for remote switch management, so we coordinated with Nick to develop something that could be useful to a wider community.
Within a few months, we handed the spec, along with a working implementation (primarily written by Ben and Justin), to Stanford, as there was growing interest in building a community effort around it. Since then, we’ve continued to play a limited role in its development. However, there have been many very influential contributors since (Rajiv Ramanathan, Jean Tourrilhes, Glen Gibb, Brandon Heller, and Ed Crabbe, just to name a very few).
So there you have it. The largely uninteresting, and somewhat convoluted origins of OpenFlow.
Thanks, Steve Herrod.
If you missed it, VMware, Cisco, Broadcom, and others announced support for VXLAN, a new tunneling format. For the curious, here is the RFC draft.
So quickly, what is VXLAN? It’s an L2 in L3 tunneling mechanism that supports L2 learning and tenancy information.
And what does it do for the world? A lot, really. There currently is a fair bit of effort in the virtualization space to address issues of L2 adjacency across subnets, VM mobility across L3 boundaries, overlapping IP spaces, service interposition, and a host of other sticky networking shortcomings. Like VXLAN, these solutions are often built on L2 in L3 tunneling; however, the approaches are often form-fitted, and rarely are they compatible with one another.
Announcing support for a common approach is a very welcome move from the industry. It will, of course, facilitate interoperability between implementations, and it will pave the way for broad support in hardware (very happy to see Broadcom and Intel in the announcements).
Ivan (as usual, ahead of the game) has already commented on some of the broader implications. His post is worth a read.
I don’t have too much to add outside of a few immediate comments.
First, we need an open implementation of this available. The Open vSwitch project has already started to dig in and will try to have something out soon — perhaps for the next OpenStack summit. More about this as the effort moves along.
Second, this is a *great* opportunity for the NIC vendors to support acceleration of VXLAN in hardware. It would be particularly nice if LRO support worked in conjunction with the tunneling so that interrupt overhead for the VMs is minimized, and the hardware handles segmentation and coalescing.
Finally, it’s great to see this sort of validation for the L3 fabric, which is an excellent way to build a datacenter network. IGP + ECMP + a well-understood topology (e.g., Clos or flattened butterfly) is the foundation of many large fabrics today, and I predict many more going forward.
[This post is written by Teemu Koponen, and Martin Casado. Teemu has architected and been involved in the implementation of multiple widely used OpenFlow controllers.]
Over the last year, there has been some discussion within the OpenFlow and broader SDN community about designing and standardizing a controller API in addition to the switch interface.
To be honest, standardizing a controller API sounds like a poor idea. The controller is software, not a protocol. To our knowledge there are 9 controllers in the wild today, and more than half of them are open source. If the community wants an open software platform, it should divert resources from arguing over standards to building open source software.
(yeah ok, this comic is only tangentially relevant … )
In addition to being a waste of time, premature standardization could be detrimental to the SDN effort. Minimally, it will unnecessarily constrain a fledgling software ecosystem. Software innovation should come through development and experience with deployment, not through consensus built on prophecies.
Further, it’s not clear how to design and standardize an interface to a (non-existent) large software system out of whole cloth. Even Unix, which started in the early ’70s, did not get standardized until the mid-to-late ’80s.
For SDN in particular, there is really very little experience building controller systems in the wild today. We (the authors) have built four controllers, three of which have seen production use, and two of which have multiple products built on them by different organizations. And still, we would certainly not argue for standardizing a particular interface. Hell, we’re not even sure there is a one-size-fits-all interface for controllers targeted at drastically different deployment environments. In fact, our experience suggests the contrary.
In our opinion, if an interface is going to be standardized, it should follow the software model of standardization. Either (a) take a system that has demonstrated broad success and generality over many years and call it a standard, or (b) during the design phase of a particular software project, have the *developers* work to create a standard API with the appropriate community feedback (which is probably product and system developers).
But most importantly, and it really bears repeating, we all win if we let an unbridled software ecosystem flourish within the controller space.
OK, so while we think any discussion about standardizing a controller interface is extremely premature, the question of “what makes a good controller API” is absolutely crucial to SDN and very much worth discussing. However, there has been relatively little discourse on controller APIs. And that’s exactly what we’ll be focusing on in the rest of this post. We’ll start by providing a brief look at common controller designs, and then take a closer look at Onix, the controller platform we work on.
Common Controller Types
If you survey the existing controller landscape, there appear to be three broad classes of API.
Single Purpose Controllers: These controllers are built for a specific function. So while they may use some common code (like an OpenFlow library) to communicate with the switch, they are otherwise not built with different uses in mind. Or put another way, they lack a control-logic agnostic interface for extension, so there really is little more to talk about.
OpenFlow Controllers: Yeah, that is probably a confusing name, but bear with us. A number of controllers provide a high-level interface to OpenFlow (specifically) and some infrastructure for hosting one or more control “applications”. These are general purpose platforms meant for extension. However, the interface is built around the OpenFlow protocol itself, meaning the interface is a thin wrapper around each message, without offering higher-level abstractions. Therefore, as the protocol changes (e.g., a new message is added), the API necessarily reflects this change. Most controllers (including one we’ve written) fall in this camp.
In our experience, the primary problem with this tight coupling is that it is challenging to evolve any aspect of the control application, switch protocol, or controller state distribution (assuming the controllers form a cluster) without refactoring all the layers of the software stack in tandem.
General SDN Controllers: We’re sort of making these names up as we go along, but a general SDN controller is one in which the control application is fully decoupled from both the underlying protocol(s) communicating with the switches, and the protocol(s) exchanging state between controller instances (assuming the controllers can run in a cluster).
The decoupling is achieved by turning the network control problem into a state management problem, and only exposing the state to be managed to the application, and not the mechanisms by which it arrived.
Applications operate over control traffic, flow table state, configuration state, port state, counters, etc. However, how this data is pulled into the controller, and how any changes are pushed back down to the switch is not the application’s business. It can just happily modify network state as if it were local and the platform will take care of the state dissemination.
The platform we’ve been working on over the last couple of years (Onix) is of this latter category. It supports controller clustering (distribution), multiple controller/switch protocols (including OpenFlow) and provides a number of API design concessions to allow it to scale to very large deployments (tens or hundreds of thousands of ports under control). Since Onix is the controller we’re most familiar with, we’ll focus on it.
So, what does the Onix API look like? It’s extremely simple. In brief, it presents the network to applications as an eventually consistent graph that is shared among the nodes in the controller cluster. In addition, it provides applications with distributed coordination primitives for coordinating nodes’ access to that graph.
Conceptually the interface is as straightforward as it sounds; however, understanding its use in practice, especially in a distributed setting, bears more discussion.
So, digging a little further …
The Network as an Eventually Consistent Graph
In Onix, the physical network is represented as a graph for the control application. Elements in the graph may represent physical objects such as switches, or their subparts, such as physical ports, or lookup tables. And it may contain logical entities such as logical ports, tunnels, BFD configuration, etc.
Control applications operate on the graph to find elements of interest and read and write to them. They can register for notifications about state changes in the elements, and they can even extend the existing elements to hold network state specific to a particular control logic.
We happen to call this graph “the NIB” (for Network Information Base).
The NIB abstracts away both the protocols for state synchronization between the controller and the switches and the protocols used for state synchronization between controllers (discussed further below).
Regarding switch state, the control applications operate on the state directly, independent of how it is synchronized with the switch. For example, control logic may traverse the NIB looking for a particular switch, modify some forwarding table entries, and then place a trigger to receive notification on any future changes (such as a port status change event).
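To give a flavor of that pattern, here’s a runnable, illustrative-only sketch; every class and method name below is ours, invented for exposition, not the actual Onix API.

```python
# Illustrative-only sketch of the control-logic pattern just described:
# find an element in the graph, write to it, register a trigger. Every
# name here is invented for exposition; this is not the Onix API.

class Port:
    def __init__(self, name):
        self.name, self._triggers = name, []
    def register_trigger(self, fn):
        self._triggers.append(fn)   # platform fires these on state change

class Switch:
    def __init__(self, dpid, ports):
        self.dpid, self.ports, self.forwarding_table = dpid, ports, []

class NIB:
    def __init__(self, switches):
        self._switches = {s.dpid: s for s in switches}
    def find_switch(self, dpid):
        return self._switches[dpid]

def control_logic(nib, dpid):
    sw = nib.find_switch(dpid)               # traverse the graph
    sw.forwarding_table.append(              # write state as if local
        {"match": "nw_dst=10.0.0.0/24", "action": "output:2"})
    for port in sw.ports:                    # ask for future notifications
        port.register_trigger(lambda p: print("port changed:", p.name))
    # How these writes reach the switch (and the other controllers in
    # the cluster) is the platform's business, not the application's.

control_logic(NIB([Switch(1, [Port("eth0")])]), dpid=1)
```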
Doing this within a single controller node is pretty straightforward. However, for resilience and scale, most production deployments will want to run with multiple controllers. Hence, the platform needs to support clustering.
Within Onix, the NIB is the central mechanism for distributed state sharing. That is, the NIB is actually a shared data structure among the nodes in the cluster. If a node updates the NIB, eventually that update will propagate to the other nodes interested in that portion of the graph (assuming no network failures).
Under this model, the control logic remains unaware of the specifics of the distribution mechanisms for controller-to-controller state. However, it is the responsibility of the application logic to coordinate which controller instance is responsible for which portion of the graph at any given time. To aid with this, Onix includes a number of standard distributed coordination primitives such as distributed locking, leader election, etc. Using them, control applications can safely partition the work amongst the cluster.
Because the control logic is decoupled from the distribution mechanisms, Onix may be configured to use different distributed datastores to share different parts of the graph. For more dynamic state, it may use a high-performance, memory-only key/value store, whereas for more static information (such as admin-configured parts of the graph) it may prefer storage that is strongly consistent and provides durability.
Of course, a design like this pushes a lot of complexity to the application. For example, the application is responsible for all distributed coordination (though tools for doing so are provided), and if the application chooses to use an eventually consistent storage back-end, it must tolerate eventual consistency in the NIB (including conflicts, data disappearing, etc.).
An alternative approach would have been to constrain the distribution model to something simple (like strongly consistent updates to all data) which would have greatly simplified the API. However, this would come at a huge cost to scalability.
Which begs the question …
Is This the Right Interface?
Perhaps for some environments. Almost certainly not for others. There have been multiple control applications built on Onix, and it is used in large production deployments in datacenters, as well as in access and core networks. However, it is probably too heavyweight for smaller networks (the home or small enterprise), and it is certainly too complex to use as a basic research tool.
The NIB is also a very low-level interface. Clearly, there is a lot of scaffolding one can build on top to aid in application development: for example, routing libraries, packet classification engines, network discovery components, configuration management frameworks, and whatnot.
And that is the beauty of software. If you don’t like an API, you fix it, build on it, or you build your own platform. Sure, as a platform matures, binary compatibility and stability of APIs becomes crucial, but that is more a matter of versioning than a priori standardization and design.
So our vote is to keep standards away from the controller design, to support and promote multiple efforts, and to let the ecosystem play out naturally.