[This post was written with Bruce Davie]
Network virtualization has been around in some form or other for many years, but it seems of late to be getting more attention than ever. This is especially true in SDN circles, where we frequently hear of network virtualization as one of the dominant use cases of SDN. Unfortunately, as with much of SDN, the discussion has been muddled, and network virtualization is being both conflated with SDN and described as a direct result of it. However, SDN is definitely not network virtualization. And network virtualization does not require SDN.
No doubt, part of the problem is that there is no broad consensus on what network virtualization is. So this post is an attempt to construct a reasonable working definition of network virtualization. In particular, we want to distinguish network virtualization from some related technologies with which it is sometimes confused, and explain how it relates to SDN.
A good place to start is to take a step back and look at how virtualization has been defined in computing. Historically, virtualization of computational resources such as CPU and memory has allowed programmers (and applications) to be freed from the limitations of physical resources. Virtual memory, for example, allows an application to operate under the illusion that it has dedicated access to a vast amount of contiguous memory, even when the physical reality is that the memory is limited, partitioned over multiple banks, and shared with other applications. From the application’s perspective, the abstraction of virtual memory is almost indistinguishable from that provided by physical memory, supporting the same address structure and memory operations.
As another example, server virtualization presents the abstraction of a virtual machine, preserving all the details of a physical machine: CPU cycles, instruction set, I/O, etc.
A key point here is that virtualization of computing hardware preserves the abstractions that were presented by the resources being virtualized. Why is this important? Because changing abstractions generally means changing the programs that want to use the virtualized resources. Server virtualization was immediately useful because existing operating systems could be run on top of the hypervisor without modification. Memory virtualization was immediately useful because the programming model did not have to change.
Virtualization and the Power of New Abstractions
Virtualization should not change the basic abstractions exposed to workloads, however it nevertheless does introduce new abstractions. These new abstractions represent the logical enclosure of the entity being virtualized (for example a process, a logical volume, or a virtual machine). It is in these new abstractions that the real power of virtualization can be found.
So while the most immediate benefit of virtualization is the ability to multiplex hardware between multiple workloads (generally for the efficiency, fault containment or security), the longer term impact comes from the ability of the new abstractions to change the operational paradigm.
Server virtualization provides the most accessible example of this. The early value proposition of hypervisor products was simply server consolidation. However, the big disruption that followed server virtualization was not consolidation but the fundamental change to the operational model created by the introduction of the VM as a basic unit of operations.
This is a crucial point. When virtualizing some set of hardware resources, a new abstraction is introduced, and it will become a basic unit of operation. If that unit is too fine grained (e.g. just exposing logical CPUs) the impact on the operational model will be limited. Get it right, however, and the impact can be substantial.
As it turns out, the virtual machine was the right level of abstraction to dramatically impact data center operations. VMs embody a fairly complete target for the things operational staff want to do with servers: provisioning new workloads, moving workloads, snapshotting workloads, rolling workloads back in time, etc.
- virtualization exposes a logical view of some resource decoupled from the physical substrate without changing the basic abstractions.
- virtualization also introduces new abstractions – the logical container of virtualized resources.
- it is the manipulation of these new abstractions that has the potential to change the operational paradigm.
- the suitability of the new abstraction for simplifying operations is important.
Given this as background, let’s turn to network virtualization.
Network Virtualization, Then and Now
As noted above, network virtualization is an extremely broad and overloaded term that has been in use for decades. Overlays, MPLS, VPNs, VLANs, LISP, Virtual routers, VRFs can all be thought of as network virtualization of some form. An earlier blog post by Bruce Davie (here) touched on the relationship between these concepts and network virtualization as we’re defining it here. The key point of that post is that when employing one of the aforementioned network virtualization primitives, we’re virtualizing some aspect of the network (a LAN segment, an L3 path, an L3 forwarding table, etc.) but rarely a network in its entirety with all its properties.
For example, if you use VLANs to virtualize an L2 segment, you don’t get virtualized counters that stay in sync when a VM moves, or a virtual ACL that keeps working wherever the VM is located. For those sorts of capabilities, you need some other mechanisms.
To put it in the context of the previous discussion, traditional network virtualization mechanisms don’t provide the most suitable operational abstractions. For example, provisioning new workloads or moving workloads still requires operational overhead to update the network state, and this is generally a manual process.
Modern approaches to network virtualization try and address this disconnect. Rather than providing a bunch of virtualized components, network virtualization today tries to provide a suitable basic unit of operations. Unsurprisingly, the abstraction is of a “virtual network”.
To be complete, a virtual network should both support the basic abstractions provided by physical networks today (L2, L3, tagging, counters, ACLs, etc.) as well as introduce a logical abstraction that encompasses all of these to be used as the basis for operation.
And just like the compute analog, this logical abstraction should support all of the operational niceties we’ve come to expect from virtualization: dynamic creation, deletion, migration, configuration, snapshotting, and roll-back.
Cleaning up the Definition of Network Virtualization
Given the previous discussion, we would characterize network virtualization as follows:
- Introduces the concept of a virtual network that is decoupled from the physical network.
- The virtual networks don’t change any of the basic abstractions found in physical networks.
- The virtual networks are exposed as a new logical abstraction that can form a basic unit of operation (creation, deletion, migration, dynamic service insertion, snapshotting, inspection, and so on).
Network Virtualization is not SDN
SDN is a mechanism, and network virtualization is a solution. It is quite possible to have network virtualization solution that doesn’t use SDN, and to use SDN to build a network that has no virtualized properties.
SDN provides network virtualization in about the same way Python does - it’s a tool (and not a mandatory one). That said, SDN does have something to offer as a mechanism for network virtualization.
A simple way to think about the problem of network virtualization is that the solution must map multiple logical abstractions onto the physical network, and keep those abstractions consistent as both the logical and physical worlds change. Since these logical abstractions may reside anywhere in the network, this becomes a fairly complicated state management problem that must be enforced network-wide.
However, managing large amounts of states with reasonable consistency guarantees is something that SDN is particularly good at. It is no coincidence that most of the network virtualization solutions out there (from a variety of vendors using a variety of approaches) have a logically centralized component of some form for state management.
The point of this post was simply to provide some scaffolding around the discussion of network virtualization. To summarize quickly, modern concepts of network virtualization both preserve traditional abstractions and provide a basic unit of operations which is a (complete) virtual network. And that new abstraction should support the same operational abstractions as its computational analog.
While SDN provides a useful approach to building a network virtualization solution, it isn’t the only way. And lets not confuse tools with solutions.
Over the next few years, we expect to see a variety of mechanisms for implementing virtual networking take hold. Some hardware-based, some software-based, some using tunnels, others using tags, some relying more on traditional distributed protocols, others relying on SDN.
In the end, the market will choose the winning mechanism(s). In the meantime, let’s make sure we clarify the dialog so that informed decisions are possible.
On a lark, I compiled a list of “open” OpenFlow software projects that I knew about off-hand, or could find with minimal effort searching online.
You can find the list here.
Unsurprisingly, most of the projects either come directly from or originated at academia or industry research. As I’ve argued before, with respect to standardization, the more design and insight that comes from real code and plugging real holes, the better. And it is still very, very early days in the SDN engineering cycle. So, it is nice to see the diversity in projects and I hope the ecosystem continues to broaden with more controllers and associated projects entering the fray.
If you know of a project that I’ve missed (I’m only listing those that have code or bits available for free online — with the exception of Pica8 which will send you code on request) please mention it in the comments or e-mail me and I’ll add it to the list.
[This post was written with Bruce Davie, and Andrew Lambeth.]
Recently, Jesse Gross, Bruce Davie and a number of contributors submitted the STT draft to the IETF (link to the draft here). STT is an encapsulation format for network virtualization. Unlike other protocols in this space (namely VXLAN and NVGRE), it was designed to be used with soft switching within the server (generally in the vswitch in the hypervisor) while taking advantage of hardware acceleration at the NIC. The goal is to preserve the flexibility and development speed of software while still providing hardware forwarding speeds.
The quick list of differentiators are i) it takes advantage of TSO available in NICs today allowing tunneling from software at 10G while consuming relatively little cpu ii) there are more bits allocated to the virtual network meta data carried per packet, and those bits are unstructured allowing for arbitrary interpretation by software and iii) the control plane is decoupled from the actual encapsulation.
There are a number of other software-centric features like better byte alignment of the headers, but these are not architecturally significant.
Of course, the publication of the draft drew reasonable skepticism on whether the industry needed yet another encapsulation format. And that is the question we will focus on in this post.
But first, let us try to provide a useful decomposition of the virtual networking problem (as it pertains to Distributed Edge Overlays DEO).
Distributed Edge Overlays (DEO)
Distributed edge overlays have gained a lot of traction as a mechanism for network virtualization. A reasonable characterization of the problem space can be found in the IETF nvo3 problem statement draft. Two recent DEO related drafts submitted to the IETF in addition to STT are NVGRE, and VXLAN.
The basic idea is to use tunneling (generally L2 in L3) to create an overlay network from the edge that exposes a virtual view of one or more network to end hosts. The edge may be, for example, a vswitch in the hypervisor, or the first hop physical switch.
DEO solutions can be roughly decomposed into three independent components.
- Encapsulation format:The encapsulation format is what the packet looks like on the wire. The format has implications both on hardware compatibility, and the amount of information that is carried across the tunnel with the packet.As an example of encapsulation, with NVGRE the encapsulation format is GRE where the GRE key is used to store some additional information (the tenant network ID).
- Control plane:The control plane disseminates the state needed to figure out which tunnels to create, which packets go in which tunnels, and what state is associated with packets as they traverse the tunnels. Changes to both the physical and virtual views of the network often require state to be updated and/or moved around the network.There are many ways to implement a control plane, either using traditional protocols (for example, NVGRE and the first VXLAN draft abdicate a lot of control responsibility to multicast), or something more SDN-esque like a centralized datastore, or even a proper SDN controller.
- Logical view:The logical view is what the “virtual network” looks like from the perspective of an end host. In the case of VXLAN and NVGRE, they offer a basic L2 learning domain. However, you can imagine this being extended to L3 (to support very large virtual networks, for example), security policies, and even higher-level services.The logical view defines the network services available to the virtual machine. For example, if only L2 is available, it may not be practical to run workloads of thousands of machines within a single virtual network due to scaling limitations of broadcast. If the virtual network provided L3, it could potentially host such workloads and still provide the benefits of virtualization such as support for VM mobility without requiring IP renumbering, higher-level service interposition (like adding firewalls), and mobile policies.
Before we jump into a justification for STT, we would like to make the point that each of these components really are logically distinct, and a good design should keep them decoupled. Why? For modularity. For example, if a particular encapsulation format has broad hardware support, it would be a shame to restrict it to a single control plane design. It would also be a shame to restrict it to a particular logical network view.
VXLAN and NVGRE or both guilty of this to some extent. For example, the NVGRE and the original VXLAN draft specify multicast as the mechanism to use for part of the control plane (with other parts left unspecified). The latest VXLAN addresses this somewhat, which is a great improvement.
Also, both VXLAN and NVGRE fix the logical forwarding model to L2 going as far as to specify how the logical forwarding tables get populated. Again, this is an unnecessary restriction.
For protocols that are hardware centric (which both VXLAN and NVGRE appear to me), this makes some modicum of sense, lookup space is expensive, and decoupling may require an extra level of indirection. However, for software this is simply bad design.
STT on the other hand limits its focus to the encapsulation format, and does not constrain the other components within the specification.
[Note: The point of this post is not to denigrate VXLAN or NVGRE, but rather to point out that they are comparatively less suited for running within the vswitch. If the full encap/decap and lookup logic is resides fully within hardware, VXLAN and NVGRE are both well designed and reasonable options.]
OK, on to a more detailed justification for STT
To structure the discussion, we’ll step through each logical component of the DEO architecture and describe the design decisions made by STT and how they compare to similar proposals.
Logical view: It turns out that the more information you can tack on to a packet as it transits the network, the richer a logical view you can create. Both NVGRE and VXLAN not only limit the additional information to 32 bits, but they also specify that those bits must contain the logical network ID. This leaves no additional space for specifying other aspects of the logical view that might be interesting to the control plane.
STT differs from NVGRE and VXLAN in two ways. First, it allocates more space to the per-packet metadata. Second, it doesn’t specify how that field is interpreted. This allows the virtual network control plane to use it for state versioning (useful for consistency across multiple switches), additional logical network meta-data, tenant identification, etc.
Of course, having a structured field of limited size makes a lot of sense for NVGRE and VXLAN where it is assumed that encap/decap and interpretation of those bits are likely to be in switching hardware.
However, STT is optimizing for soft switching with hardware accelerating in the NIC. Having larger, unstructured fields provides more flexibility for the software to work with. And, as I’ll describe below, it doesn’t obviate the ability to use hardware acceleration in the NIC to get vastly better performance than a pure software approach.
Control Plane: The STT draft says nothing about the control plane that is used for managing the tunnels and the lookup state feeding into them. This means that securing the control channel, state dissemination, packet replication, etc. are outside of, and thus not constrained by, the spec.
Encapsulation format: This is were STT really shines. STT was designed to take advantage of TSO and LRO engines in existing NICs today. With STT, it is possible to tunnel at 10G from the guest while consuming only a fraction of a CPU core. We’ve seen speedups up to 10x over pure software tunneling.
In other words, STT was designed to let you retain all the high performance features of the NIC when you start tunneling from the edge, while still retaining the flexibility of software to perform the network virtualization functions.
Here is how it works.
When a guest VM sends a packet to the wire, the transitions between the guest and the hypervisor (this is a software domain crossing which requires flushing the TLB, and likely the loss of cache locality, etc.) and the hypervisor and the NIC are relatively expensive. This is why hypervisor vendors take pains to always support TSO all the way up to the guest
Without tunneling, vswitches can take advantage of TSO by exposing a TSO enabled NIC to the guest and then passing large TCP frames to the hardware NIC which performs the segmentation. However, when tunneling is involved, this isn’t possible unless the NIC supports segmentation of the TCP frame within the tunnel in hardware (which hopefully will happen as tunneling protocols get adopted).
With STT, the guests are also exposed to a TSO enabled NIC, however instead of passing the packets directly to the NIC, the vswitch inserts an additional header that looks like a TCP packet, and performs all of the additional network virtualization procedures.
As a result, with STT, the guest ends up sending and receiving massive frames to the hypervisor (up to 64k) which are then encapsulated in software, and ultimately segmented in hardware by the NIC. The result is that the number of domain crossings are reduced by a significant factor in the case of high-throughput TCP flows.
One alternative to going through all this trouble to amortize the guest/hypervisor transistions is to try eliminating them altogether by exposing the NIC HW to the guest, with a technique commonly referred to as passthrough. However, with passthrough software is unable to make any forwarding decisions on the packet before it is sent to the NIC. Passthrough creates a number of problems by exposing the physical NIC to the guest which obviates many of the advantages of NIC virtualization (we describe these shortcomings at length here).
For modern NICs that support TSO and LRO, the only additional overhead that STT provides over sending a raw L2 frame is the memcpy() of the header for encap/decap, and the transmission cost of those additional bytes.
It’s worth pointing out that even if reassembly on the receive side is done in software (which is the case with some NICs), the interrupt coalescing between the hypervisor and the guest is still a significant performance win.
How does this compare to other tunneling proposals? The most significant difference is that NICs don’t support the tunneling protocols today, so they have to be implemented in software which results in a relatively significant performance hit. Eventually NICs will support multiple tunneling protocols, and hopefully they will also support the same stateless (on the send side) TCP segmentation offloading. However, this is unlikely to happen with LOM for awhile.
As a final point, much of STT was designed for efficient processing in software. It contains redundant fields in the header for more efficient lookup and padding to improve byte-alignment on 32-bit boundaries.
So, What’s Not to Like?
STT in it’s current form is a practical hack (more or less). Admittedly, it follows more of a “systems” than a networking aesthetic. More focus was put on practicality, performance, and software processing, than being parsimonious with lookup bits in the header.
As a result, there are definitely some blemishes. For example, because it uses a valid TCP header, but doesn’t have an associated TCP state machine, middleboxes that don’t do full TCP termination are likely to get confused (although it is a little difficult for us to see this as a real shortcoming given all of the other problems passive middleboxes have correctly reconstructing end state). Also, there is currently no simple way to distinguish it from standard TCP traffic (again, a problem for middleboxes). Of course, the opacity of tunnels to middleboxes is nothing new, but these are generally fair criticisms.
In the end, our guess is that abusing existing TSO and LRO engines will not ingratiate STT with traditional networking wonks any time soon …
However, we believe that outside of the contortions needed to be compatible with existing TSO/LRO engines, STT has a suitable design for software based tunneling with hardware offload. Because the protocol does not over-specify the broader system in which the tunnel will sit, as the hardware ecosystem evolves, it should be possible to also evolve the protocol fields themselves (like getting rid of using an actual TCP header and setting the outer IP protocol to 6) without having to rewrite the control plane logic too.
Ultimately, we think there is room for a tunneling protocol that provides the benefits of STT, namely the ability to do processing in software with minimal hardware offload for send and receive segmentation. As long as there is compatible hardware, the particulars or the protocol header are less important. After all, it’s only (mostly) software.
[A lot of the content of this post was drawn from conversations with Juan Lage, Rajiv Ramanathan, and Mohammad Attar]
Some of the more aggressive buzz around OpenFlow has lauded it as a mechanism for re-implementing networking in total. While in some (fairly fanciful) reality, that could be the case, categorical statements of this nature tend hide practical design trade-offs that exist between any set of technologies within a design continuum.
What do I mean by that? Just that there are things that OpenFlow and more broadly SDN are better suited for than others.
In fact, the original work that lead to OpenFlow was not meant to re-implement all of networking. That’s not to say that this isn’t a worthwhile (if not quixotic) goal. Yet our focus was to explore new methods for datapath state whose management was difficult to do with using the traditional approach of full distribution with eventual consistency.
And what sort of state might that be? Clearly destination-based, shortest path forwarding state can be calculated in a distributed fashion. But there is a lot more state in the datapath beyond that used for standard forwarding (filters, tagging, policy routing, QoS policy, etc.). And there are a lot more desired uses for networks than vanilla destination-based forwarding.
Of course, much of this “other” state is not computed algorithmically today. Rather, it is updated manually, or through scripts whose function more closely resembles macro replacement than computation.
Still, our goal was (and still is) to compute this state programmatically.
So the question really boils down to whether the algorithm needed to compute the datapath state is easily distributed. For example, if the state management algorithm has any of the following properties it probably isn’t, and is therefore a potential candidate for SDN.
- It is not amenable to being split up into many smaller pieces (as opposed to fewer, larger instances). The limiting property in these cases is often excessive communication overhead between the control nodes.
- It is not amenable to running on heterogeneous compute environments. For example those with varying processor speeds and available memory.
- It is not amenable to relatively long RTT’s for communication between distributed instances. In the purely distributed case, the upper bound for communicating between any two nodes scales linearly with the longest loop free path.
- The algorithm requires sophisticated distributed coordination between instances (for example distributed locking or leader election)
There are many examples of algorithms that have these properties which can (and are) used in networking. The one I generally use as an example is a runtime policy compiler. One of our earliest SDN implementations was effectively a Datalog compiler that would take a topologically independent network policy and compile it into flows (in the form of ACL and policy routing rules).
Other examples include implementing global solvers to optimize routes for power, cost, security policy, etc., and managing distributed virtual network contexts.
The distribution properties of most of these algorithms are well understood (and have been for decades). And so it is fairly straightforward to put together an argument which demonstrates that SDN offers advantages over traditional approaches in these environments. In fact, outside of OpenFlow, it isn’t uncommon to see elements of the control plane decoupled from the dataplane in security, management, and virtualization products.
However, networking’s raison d’être, it’s killer app, is forwarding. Plain ol’ vanilla forwarding. And as we all know, the networking community long ago developed algorithms to do that which distribute wonderfully across many heterogeneous compute nodes.
So, that begs the question. Does SDN provide any value to the simple problem of forwarding? That is, if the sole purpose of my network is to move traffic between two end-points, should I use a trusty distributed algorithm (like an L3 stack) that is well understood and has matured for the last couple of decades? Or is there some compelling reason to use an SDN approach?
This is the question we’d like to explore in this post. The punchline (as always, for the impatient) is that I find it very difficult to argue that SDN has value when it comes to providing simple connectivity. That’s simply not the point in the design space that OpenFlow was created for. And simple distributed approaches, like L3 + ECMP, tend to work very well. On the other hand, in environments where transport is expensive along some dimension, and global optimization provides value, SDN starts to become attractive.
First, lets take a look at the problem of forwarding:
Forwarding to me simply means find a path between a source and a destination in a network topology. Of course, you don’t want the path to suck, meaning that the algorithm should efficiently use available bandwidth and not choose horribly suboptimal hop counts.
For the purposes of this discussion, I’m going to assume two scenarios in which we want to do forwarding: (a) let’s assume that bandwidth and connectivity are cheap and plentiful and that any path to get between two points are roughly the same (b) lets assume none of these properties.
We’ll start with the latter case.
In many networks, not all paths are equal. Some may be more expensive than others due to costs of third party transit, some may be more congested than others leading to queuing delays and loss, different paths may support different latencies or maximum bandwidth limits, and so on.
In some deployments, the forwarding problem can be further complicated by security constraints (all flows of type X must go through middleboxes) and other policy requirements.
Further, some of these properties change in real time. Take for example the cost of transit. The price of a link could increases dramatically after some threshold of use has been hit. Or consider how a varying traffic matrix may affect the queuing delay and available bandwidth of network paths.
Under such conditions, one can start to make an argument for SDN over traditional distributed routing protocols.
Why? For two reasons. First, the complexity of the computation increases with the number of properties being optimized over. It also increases with the complexity in the policy model, for example a policy that operates over source, protocol and destination is going to be more difficult than one that only considers destination.
Second, as the frequency in which these properties change increases, the amount of information the needs to be disseminated increases. An SDN approach can greatly limit the total amount of information that needs to hit the wire by reducing the distribution of the control nodes. Fully distributed protocols generally flood this information as it isn’t clear which node needs to know about the updates. There have been many proposals in the literature to address these problems, but to my knowledge they’ve seen little or no adoption.
I presume these costs are why a lot of TE engines operate offline where you can throw a lot of compute and memory at the problem.
So, if optimaility is really important (due to the cost of getting it wrong), and the inputs change a lot, and there is a lot of stuff being optimized over, SDN can help.
Now, what about the case in which bandwidth is abundant, relatively cheap and any sensible path between two points will do?
The canonical example for this is the datacenter fabric. In order to accommodate workload placement anywhere, datacenter physical networks are often built using non-oversubscribed topologies. And the cost of equipment to build these is plummeting. If you are comfortable with an extremely raw supply channel, it’s possible to get 48 ports of 10G today for under $5k. That’s pretty damn cheap.
So, say you’re building a datacenter fabric, you’ve purchased piles of cheap 10G gear and you’ve wired up a fat tree of some sort. Do you then configure OSPF with ECMP (or some other multipathing approach) and be done with it? Or does it make sense to attempt to use an SDN approach to compute the forwarding paths?
It turns out that efficiently calculating forwarding paths in a highly connected graph, and converging those paths on failure is something distributed protocols do really, really well. So well, in fact, that it’s hard to find areas to improve.
For example, the common approach of using multipathing approximates Valiant load balancing which effectively means the following: if you send a packet to an arbitrary point in the network, and that point forwards it to the destination, then for a regular topology, and any traffic matrix, you’ll be pretty close to fully using the fabric bandwidth (within a factor of two given some assumptions on flow arrival rates).
That’s a pretty stunning statement. What’s more stunning is that it can be accomplished with local decisions, and without requiring per-flow state, or any additional control overhead in the network. It also obviates the need for a control loop to monitor and respond to changes in the network. This latter property is particularly nice as the care and feeding of control loops to prevent oscillations or divergence can add a ton of complexity to the system.
On the other hand, trying to scale a solution using classic OpenFlow almost certainly won’t work. To begin with the use of n-tuples (say per-flow, or even per host/destination pair) will most likely result in table space exhaustion. Even with very large tables (hundreds of thousands) the solution is unlikely to be suitable for even moderately sized datacenters.
Further, to efficiently use the fabric, multipathing should be done per flow. It’s highly unlikely (in my experience) that having the controller participate in flow setup will have the desired performance and scale characteristics. Therefore the multipathing should be done in the fabric (which is possible in a later version of OpenFlow like 1.1 or the upcoming 1.2).
Given these constraints, an SDN approach would probably look a lot like a traditional routing protocol. That is, the resulting state would most likely be destination IP prefixes (so we can take advantage of aggregation and reduce the table requirements by a factor of N over source-destination pairs). Further, multipathing, and link failure detection would have to be done on the switch.
Another complication of using SDN is establishing connectivity to the controller. This clearly requires each switch to run something to bootstrap communication, like a traditional protocol. In most SDN implementations I know of, L3 is used for this purpose. So now, not only are we effectively mimicking L3 in the controller to manage datapath state, we haven’t been able to get rid of the distributed approach potentially doubling the control complexity network wide.
So, does SDN provide value for forwarding in these environments? Given the previous discussion it is difficult to argue in favor of SDN.
Does SDN provide more functionality? Unlikely, being limited to manipulating destination prefixes with multipathing being carried out by switches robs SDN of much of its value. The controller cannot make decisions on anything but the destination. And since the controller doesn’t know a priori which path a flow will take, it will be difficult to add additional table rules without replicating state all over the network (clearly a scalability issue).
How about operational simplicity? One might argue that in order to scale an IGP one would have to manually configure areas which presumably would be done automatically with SDN. However, it is difficult to argue that building a separate control network, or in-band-configuring is any less complex than a simple ID to switch mapping.
What about scale? I’ll leave the details of that discussion to another post. Juan Lage, Rajiv Ramanathan, and I have had a on-again, off-again e-mail discussion comparing that scaling properties of SDN to that of L3 for building a fabric. The upshot is that there are nice proof-points on both sides, but given today’s switching chipsets, even with SDN, you basically end up populating the same tables that L3 does. So any scale argument revolves around reducing the RTT to the controller through a single-function control network, and reducing the need to flood. Neither of these tricks are likely to produce a significant scale advantage for SDN, but they seem to produce some.
So, what’s the take-away?
It’s worth remembering that SDN, like any technology, is actually a point in the design space, and is not necessarily the best option for all deployment environments.
My mental model for SDN starts with looking at the forwarding state, and asking the question “what algorithm needs to run to compute that state”. The second question I ask is, “can that algorithm be easily distributed amongst many nodes with different compute and memory resources”. In many use cases, the answer to this latter question is “not-really”. The examples I provide general reduce to global solvers used for finding optimal solutions with many (often changing) variables. And in such cases, SDN is a shoo in.
However, in the case of building out networking fabrics in which bandwidth is plentiful and shortest path (or something similar) is sufficient, there are a number of great, well tested distributed algorithms for the job. In such environments, it’s difficult to argue the merits of building out a separate control network, adding additional control nodes, and running vastly less mature software.
But perhaps, that’s just me. Thoughts?
For those of you who are interested, my keynote at last week’s Open Networking Summit provides some background on OpenFlow and SDN in the form of a historical narrative. However, the main point of the talk (which doesn’t come across as well as I would have liked) is that it really is the community, and not necessarily the technology, that makes the SDN movement so special. I believe the technology will work itself out. However, building a diverse community with strong representation across the networking ecosystem (from ODMs, to customers, and everyone in between) is a very difficult undertaking. And now that we have such a community let’s be sure to acknowledge its importance and focus on continuing to cultivate and grow it.
And for some only tangentially related trivia, I’ve been tracking down the origins of the term ‘SDN’. From what I’ve been able to dig up, it was coined by Kate Greene (who had covered software defined radio) while putting together this article. So, thanks Kate.
[This post is written by Teemu Koponen, and Martin Casado. Teemu has architected and been involved in the implementation of multiple widely used OpenFlow controllers.]
Over the last year, there has been some discussion within the OpenFlow and broader SDN community about designing and standardizing a controller API in addition to the switch interface.
To be honest, standardizing a controller API sounds like a poor idea. The controller is software, not a protocol. To our knowledge there are 9 controllers in the wild today, and more than half of them are open source. If the community wants an open software platform, it should divert resources from arguing over standards to building open source software.
(yeah ok, this comic is only tangentially relevant … )
In addition to being a waste of time, premature standardization could be detrimental to the SDN effort. Minimally, it will unnecessarily constrain a fledgling software ecosystem. Software innovation should come through development and experience with deployment, not through consensus built on prophecies.
Further, it’s not clear how to design and standardize an interface to a (non-existent) large software system out of whole cloth. Even Unix which started in the early 70s did not get standardized until the mid/late 80s.
For SDN in particular, there is really very little experience building controllers systems in the wild today. We (the authors) have built four controllers, three which have seen production use, and two of which have multiple products built on them from different organizations. And still, we would certainly not argue for standardizing a particular interface. Hell, we’re not even clear if there is a one-size-fits-all interface for controllers targeted at drastically different deployment environments. In fact, our experience suggests the contrary.
In our opinion, if an interface is going to be standardized, it should follow the software model of standardization. Either (a) take a system that has demonstrated broad success and generality over many years and call it a standard, or (b) during the design phase of a particular software project, have the *developers* work to create a standard API with the appropriate community feedback (which is probably product and system developers).
But most importantly, and it really bears repeating, we all win if we let an unbridled software ecosystem flourish within the controller space.
OK, so while we think any discussion about standardizing a controller interface is extremely premature, the question of “what makes a good controller API” is absolutely crucial to SDN and very much worth discussing. However, there has been relatively little discourse on controller APIs. And that’s exactly what we’ll be focusing on in rest of this post. We’ll start by providing a brief look at common controller designs, and then take a closer look at Onix, the controller platform we work on.
Common Controller Types
If you survey the existing controller landscape there appears to be three broad classes of API.
Single Purpose Controllers: These controllers are built for a specific function. So while they may use some common code (like an OpenFlow library) to communicate with the switch, they are otherwise not built with different uses in mind. Or put another way, they lack a control-logic agnostic interface for extension, so there really is little more to talk about.
OpenFlow Controllers: Yeah, that is probably a confusing name, but bear with us. A number of controllers provide a high-level interface to OpenFlow (specifically) and some infrastructure for hosting one or more control ”applications”. These are general purpose platforms meant for extension. However, the interface is built around the OpenFlow protocol itself, meaning the interface is a thin wrapper around each message, without offering higher-level abstractions. Therefore, as the protocol changes (e.g., a new message is added), the API necessarily reflects this change. Most controllers (including one we’ve written) fall in this camp.
In our experience, the primary problem with this tight coupling is that it is challenging to evolve any aspect of the control application, switch protocol, or controller state distribution (assuming the controllers form a cluster) without refactoring all the layers of the software stack in tandem.
General SDN Controllers: We’re sort of making these names up as we go along, but a general SDN controller is one in which the control application is fully decoupled from both the underlying protocol(s) communicating with the switches, and the protocol(s) exchanging state between controller instances (assuming the controllers can run in a cluster).
The decoupling is achieved by turning the network control problem into a state management problem, and only exposing the state to be managed to the application, and not the mechanisms by which it arrived.
Applications operate over control traffic, flow table state, configuration state, port state, counters, etc. However, how this data is pulled into the controller, and how any changes are pushed back down to the switch is not the application’s business. It can just happily modify network state as if it were local and the platform will take care of the state dissemination.
The platform we’ve been working on over the last couple of years (Onix) is of this latter category. It supports controller clustering (distribution), multiple controller/switch protocols (including OpenFlow) and provides a number of API design concessions to allow it to scale to very large deployments (tens or hundreds of thousands of ports under control). Since Onix is the controller we’re most familiar with, we’ll focus on it.
So, what does the Onix API look like? It’s extremely simple. In brief, it presents the network to applications as an eventually consistent graph that is shared among the nodes in the controller cluster. In addition, it provides applications with distributed coordination primitives for coordinating nodes’ access to that graph.
Conceptually the interface is as straightforward as it sounds, however understanding its use in practice, especially in a distributed setting, bears more discussion.
So, digging a little further …
The Network as an Eventually Consistent Graph
In Onix, the physical network is represented as a graph for the control application. Elements in the graph may represent physical objects such as switches, or their subparts, such as physical ports, or lookup tables. And it may contain logical entities such as logical ports, tunnels, BFD configuration, etc.
Control applications operate on the graph to find elements of interest and read and write to them. They can register for notifications about state changes in the elements, and they can even extend the existing elements to hold network state specific to a particular control logic.
We happen to call this graph “the NIB” (for Network Information Base).
The NIB abstracts away both the protocols for state synchronization between the controller and the switches and the protocols used for state synchronization between controllers (discussed further below).
Regarding switch state, the control applications operate on the state directly, independent of how it is synchronized with the switch. For example, control logic may traverse the NIB looking for a particular switch, modify some forwarding table entries, and then place a trigger to receive notification on any future changes (such as a port status change event).
Doing this within a single controller node is pretty straightforward. However, for resilience and scale, most production deployments will want to run with multiple controllers. Hence, the platform needs to support clustering.
Within Onix, the NIB is the central mechanism for distributed state sharing. That is, the NIB is actually a shared datastructure among the nodes in the cluster. If a node updates the NIB, eventually that update will propagate to the other nodes interested in that portion of the graph (assuming no network failures).
Under this model, the control logic remains unaware of the specifics of the distribution mechanisms for controller-to-controller state. However, it is the responsibility of the application logic to coordinate which controller instance is responsible for which portion of the graph at any given time. To aid with this, Onix includes a number of standard distributed coordination primitives such as distributed locking, leader election, etc. Using them, control applications can safely partition the work amongst the cluster.
Because the control logic is decoupled from the distribution mechanisms, the Onix may be configured to use different distributed datastores to share different parts of the graph. For more dynamic state, it may use high-performance, memory-only, key/value store, whereas for more static information (such as admin configured parts of the graph) it may prefer storage that is strongly consistent and provides durability.
Of course, a design like this pushes a lot of complexity to the application. For example, the application is responsible for all distributed coordination (though tools for doing so are provided), and if the application chooses to use an eventually consistent storage back-end, it must tolerate eventual consistency in the NIB (including conflicts, data disappearing, etc.)
An alternative approach would have been to constrain the distribution model to something simple (like strongly consistent updates to all data) which would have greatly simplified the API. However, this would come at a huge cost to scalability.
Which begs the question …
Is This the Right Interface?
Perhaps for some environments. Almost certainly not for others. There have been multiple control applications built on Onix, and it is used in large production deployments in the data centers, as well as in the access and core networks. However, it is probably too heavyweight for smaller networks (the home or small enterprise), and it is certainly too complex to use as a basic research tool.
The NIB is also a very low-level interface. Clearly, there is a lot of scaffolding one can build on top to aid in application development. For example, routing libraries, packet classification engines, network discovery components, configuration management frameworks, and what not.
And that is the beauty of software. If you don’t like an API, you fix it, build on it, or you build your own platform. Sure, as a platform matures, binary compatibility and stability of APIs becomes crucial, but that is more a matter of versioning than a priori standardization and design.
So our vote is to keep standards away from the controller design, to support and promote multiple efforts, and to let the ecosystem play out naturally.
Scott’s keynote at Ericsson research on SDN. I really encourage anyone who is interested in OpenFlow and/or SDN to view it. It is, in my opinion, the cleanest justification for SDN, and appropriately articulates where OpenFlow fits in the broader context (… as a minor mechanism that is capable, but generally unimportant).
Wow, over the last couple of weeks there has been an escalation in the confusion around Openflow. The problem is on both sides of the metaphorical isle. Those behind the hype engine are spinning out of control with fanciful tales about bending the laws of physics (“Openflow will keep the icecream in your refrigerator cold during a power outage!”). And those opposed to OpenFlow (for whatever reason) are using the hype to build non-existant strawmen to attack (“Openflow cannot bend the laws of physics, therefore it isn’t useful”).
So for this post, I figured I’d go through some of the dubious claims and bogus strawmen and try and beat some sense into the discussion. Each of the claims I talk to below were pulled from some article/blog/interview I ran into over the last few weeks.
Claim: “Openflow provides a more powerful interface for managing the network than exists today”
Patently false. APIs expose functionality, not create it. Openflow is a subset of what switching chipsets offer today. If you want more power, sign an NDA with Broadcom and write to their SDK directly.
What Openflow does attempt to do is provide a standard interface above the box. If you believe that SDN is a powerful concept (and I do), then you do need some interface that is sufficiently expressive and hopefully widely supported.
This is almost not worth saying, but clearly Openflow is more expressive than a CLI. CLIs are fine for manual configuration, but they totally blow for building automated systems. The shortcomings are blindingly obvious, for example traditional CLIs have no clear data schema or state semantics and they change all the time because the designers are trying to solve an HCI not an API versioning problem. However, I don’t think there is any real disagreement on this point.
Claim: “Openflow will make hardware simpler”
I highly doubt this. Even with Openflow, a practical network deployment needs a bunch of stuff. It needs to do lookups over common headers (minimally L2/L3/L4), it may need hash-based multi-pathing, it may need port groups, it may need tunneling, it may need fine-grained link timers. If you grab a chip from Broadcom, I’m not exactly sure what you’d throw away if using Openflow.
What Openflow may discourage is stupid shit like creating new tags as an excuse for hardware stickyness (“you want new feature X? It’s implemented using the Suxen tag(tm) and only our new chip supports it.”). This is because an Openflow-like interface can effectively flatten layers. For example, I don’t have to use the MPLs tag for MPLS per se. I can use it for network virtualization (for instance) or for identifying aggregates that correspond to the same security group. However, that doesn’t mean hardware is simpler. Just that the design isn’t redundant to fulfill a business need. (Post facto note: There are some great points regarding this issue in the comments. We’ll work to spin this out into a separate post.)
Or more to the point. There are an awful lot of fields and bits in a header. And an awful lot of lookup capacity in a switch. If you shuffle around what fields mean what, you can almost certainly do what you want without having to change how the hardware parses the packet.
Claim: “Openflow will commoditize the hardware layer”
This is total Kaiser Soze-esque nonsense (yup, that’s the picture at the top of the post). To begin with, networking hardware is already on its way to horizontal integration. The reason that Arista is successful has nothing to do with Openflow (they strictly don’t support it and never have), it has nothing to do with SDN, it’s because Ken Duda, and Andy Bechtolsheim are total bad-asses and have built a great product. Oh, and because merchant silicon has come of age, meaning you don’t need need an ASIC team to be successful in the market.
The difference between OpenFlow and what Arista supports is that with Openflow you choose to build to an industry standard interface, and not a proprietary one. Whether or not you care is, of course, totally up to you.
So Openflow does provide another layer of horizontal integration and once the ecosystem develops that it is, in my opinion, a very good thing. But the ecosystem is still embryonic, so it will take some time before the benefits can be realized.
I think the power of merchant silicon and the rise of the “commodity” fabric are a far greater threat to the crusty scale-up network model. Oddly, Openflow has become a relatively significant distraction. That said, as this area matures, creating a horizontal layer “above the box” will grow in significance.
Claim: “Openflow does not map efficiently to existing hardware”
True! Older versions of Openflow did not map well to existing silicon chips. This is a *major* focus of the existing design effort; to make Openflow more flexible. As with any design effort, the trade-off is between having a future proof roadmap, and having a practical design that can have tangible benefits now. So we’re trying to thread the needle practically but with some foresight. It will take a little patience to see how this evolves.
Claim: “Openflow reduces complexity”
This is a meaningless statement. Openflow is like USB, it doesn’t “do” anything new. A case can be made for SDN to reduce complexity, but it does this by constraining the distribution model and providing a development platform that doesn’t suck. No magic there.
Claim: “Openflow will obviate the need for traditional distributed routing protocols”
I hear this a lot, but I just don’t buy it. There are certainly those within the Openflow community who disagree with me, so please take this for what it’s worth (very little …)
In my opinion, traditional distributed protocols are very good at what they do, scaling, routing packets in complex graphs, converging, etc. If I were to build a fabric, I’d sure as hell do it with an distributed routing protocol (again, not everyone agrees on this point). What traditional protocols suck at is distributing all the other state in the network. What state is that? If you look at the state in a switching chip, a lot of it isn’t traditionally populated through distributed protocols. For example, the ACL tables, the tunnels, many types of tags (e.g. VLAN), etc.
Going forward, it may be the case the distributed routing protocols get pulled into the controller. Why? Because controllers have much higher cpu density and therefore can run the computation for multiple switches. A multicore server will kick the crap out of the standard switch management CPU. Like I described in a previous post, this is no different than the evolution from distance vector to link state, it’s just taking it a step further. However, the protocol certainly is still distributed.
Claim: “Openflow cannot scale because of X”
I’ve addressed this at length in a previous post. Still, I’m going to have to call Doug Gourlay out. And I quote …
“I honestly don’t think it’s going to work,” Arista’s Gourlay said. “The flow setup rates on Stanford’s largest production OpenFlow network are 500 flows per second. We’ve had to deal with networks with a million flows being set up in seconds. I’m not sure it’s going to scale up to that.”
Other than being a basic logical fallacy (Stanford’s largest production Openflow network has absolutely nothing to do with the scaling properties of the Openflow or SDN architecture), there appears to be an implicit assumption that flow-setups are in some way a limiting resource. This clearly isn’t the case if they don’t leave the datapath which is a valid (and popular) deployment model for Openflow. Doug’s a great guy, at a great company, but this is a careless statement, and it doesn’t help further the dialog.
Claim: “Openflow helps virtualize the network”
Again, this is almost a meaningless statement. A compiler helps virtualize the network if it is being used to write code to that effect. The fact is, the pioneers of network virtualization, such as VMWare, Amazon, and Cisco, don’t use Openflow.
So yes, you can use Openflow for virtualization (that’s a help, I guess … ). And open standards are a good thing. But no, you certainly don’t need to.
OK, that’s enough for now.
To be very clear, I’m a huge fan of Openflow. And I’m a huge fan of SDN. Yet, neither is a panacea, and neither is a system or even a solution. One is an effort to provide a standardized interface for building awesome systems (Openflow). The other is a philosophical model for how to build those systems (SDN). It will be up to the systems themselves to validate the approach.
In my experience, it’s rare to have a discussion about SDN without someone getting their panties in a bunch over scaling. Nevermind that SDN networks have already been implemented that scale to tens of thousands of ports. And nevermind that distributed systems are built today that manage hundreds of thousands of entities and petabytes of data. It remains a perennial point of contention.
So here is a hand-wavy – probably not totally convincing – but hopefully offers some intuitive understanding – attempt at an explanation of the scaling limits of SDN.
But first! Some history: Unfortunately, I think I am partially to blame for the sorry state of affairs. In some of the earliest writeups we would describe an SDN controller as “logically centralized”. The intended meaning of this was something along the lines of “it has a centralized programmatic model, but was really distributed”.
Now, what does “logically centralized” actually mean? Nothing. It’s a nonsense term and the result of sloppy thinking. Either you’re centralized, or you’re distributed (thanks to Teemu Koponen for pointing that out). So by logically centralized, I guess we really meant “distributed”, and that is what we should have said. Whoops!
Clearly, a centralized controller cannot scale. Perhaps it can handle a really large network. Hell, I’ve heard claims of single node SDN networks that scale to thousands of switches and tens of thousands of ports. Whether or not that’s true, at some point either CPU or memory will run out.
However, controllers can (and should!) be distributed. And in that case, I would argue that there are no inherent bottlenecks. Trivially, you can distribute controllers such that the total amount of available CPU and memory for the control path is equivalent to the sum total of management cpu’s on the switches themselves. Controllers can distribute state amongst themselves like network nodes do today.
Still not convinced? Let’s deconstruct this …
Latency: How might SDN affect latency? That depends on how it is used. In some implementations, datapath traffic never leaves the hardware. In which case, datapath latency should be identical to traditional networking. Other implementations will forward the first packet of a flow up to the controller and then cache the forward decisions on the datapath. In this case, the flow setup will have to pay an RTT to the controller. The following paragraph will discuss how much that might be.
Whether or not data traffic is sent to the controller, control traffic often is (for example BGP from a neighboring network, or the signalling of a port status change). In this case the additional overhead is the RTT to the controller + the cost of the computation to determine what to do next. How much that is will vary wildlyby deployment. It’s possible to keep this sub millisecond (200-300us is not unusual). I’ve seen as little as 70us claimed, but I doubt that could be maintained under any real load.
So you may not want to pay this latency per flow (though in many cases it probably doesn’t matter). However, there isn’t any appreciable overhead for responding to network events. And as I’ll describe later, it probably will reduce the total time needed to disseminate control traffic.
Datapath Scaling: The datapath is where the original OpenFlow design was broken. This is because OpenFlow used the abstraction of a switch datapath as a single TCAM. Clearly, this can be problematic for some forwarding rulesets. Assume, for example, that you are building an SDN application which would integrate with BGP, and map prefixes to tags with QoS policies. Implemented this within a single flow table would require (RIB * QoS rules) entries. Ouch.
However, later versions of OpenFlow support multiple tables which take care of this Cartesian explosion nightmare. Rather than having to multiply all the rules together, they can be placed in separate tables limiting the maximum size requirements of a single table (by a lot!).
Still, many modern implementations of SDN dispense with the abstraction of a flow altogether and program the switch hardware tables directly. I think we’ll see a lot more of this going forward.
Convergence on failure: Alright, what happens on link (or node) failure? Traditionally, when a link fails, the information of the failed link is propagated through flooding. Flooding, of course, scales linearly with the distance of the longest loop free path. If you compare this with SDN, instead of being flooded the information goes to one of the controllers, which sends the updated routing tables to the affected switches.
A few things to note:
- With SDN, the controller should only be updating the switches whose tables have actually changed.
- With SDN, the total cost is link propagation to the controller + cost of computation + time to update to affected switches.
- In a distributed SDN case, an implementation may distributed the link update among the controller clusters which then do the computation for their piece of the network.
Unfortunately for SDN, when in-band control is in play, things are a lot uglier. Inband control basically means that the datapath is used both for SDN control traffic and data. So a failure on the network may affect connectivity to the controller which must be patched up somehow before the controller can fix the problem for the rest of the network.
Patching up the control channel, in this case, is generally done with an IGP, so we’re strictly worse off than if we just used an IGP to begin with. Suck.
So bottom line, if convergence time is important, out of band control should be used. It’s simple to build a cheap, reliable out of band control network using traditional gear (the amount of traffic is minimal).
Computation and Memory: It’s easy to see that the latest multi-core server available from your friendly server vendor can beat the pants off of whatever crap embedded cpu you’ll find in most networking gear (800mhz-1Ghz is common).
I think Nick McKeown said it best. If we look historically, early routing protocols used distance vector routing algorithms (like RIP) which have very low computational requirements, but suck in a bunch of other ways (convergence times, split infinity etc.). As cpu’s became more powerful, it was possible for each node to compute single source shortest path to all destinations on each failure, which is the standard approach used today.
Calculating more routes (e.g. all pairs shortest path, or multiple source to all destinations) on stronger CPUs really follows as a natural evolution. The clear “next step”. And a beefy server with lots of memory can compute *a lot* of routes, and quickly. Further, we already know how to distribute the algorithms.
Some parting notes: The goal of SDN is not to scale a simple network fabric. This is something that distributed routing algorithms do just fine. The goal of SDN is to provide a design paradigm which allows the creation of more sophisticated control paradigms: TE, security, virtualization, service interposition, etc. Perhaps that means just manipulating state at the edge of the network while traditional protocols are used in the core. Or perhaps that means using SDN throughout. In eithercase, SDN architecturally should be able to handle the scale. Remember, it’s the same amount of state that’s being passed around.
I think the right question to ask is not “does it scale” but “is it worth the hassle of building networks this way”? Building an SDN network requires some thought. In addition to the physical network nodes, a control network (probably), and controller servers need to be thrown into the mix.
So, is it worth it?
That, is up to you to decide. Ultimately the answer will rely on what gets built using SDN and who wants to run it. I think it’s still too early in the development cycle to derive a meaningful prediction.
If you haven’t heard the talk, or seen the slides, these are a must. Professor Shenker, one of the creators of SDN, put together this talk about why programming to high-level abstractions is better than protocol design, and why SDN helps achieve it.