The Scaling Implications of SDN

In my experience, it’s rare to have a discussion about SDN without someone getting their panties in a bunch over scaling. Never mind that SDN networks have already been implemented that scale to tens of thousands of ports. And never mind that distributed systems are built today that manage hundreds of thousands of entities and petabytes of data. It remains a perennial point of contention.

So here is a hand-wavy, probably not totally convincing, but hopefully intuition-building attempt at an explanation of the scaling limits of SDN.

But first! Some history: Unfortunately, I think I am partially to blame for the sorry state of affairs. In some of the earliest write-ups we would describe an SDN controller as “logically centralized”. The intended meaning was something along the lines of “it has a centralized programmatic model, but is really distributed”.

Now, what does “logically centralized” actually mean? Nothing. It’s a nonsense term and the result of sloppy thinking. Either you’re centralized, or you’re distributed (thanks to Teemu Koponen for pointing that out). So by logically centralized, I guess we really meant “distributed”, and that is what we should have said. Whoops!

Clearly, a centralized controller cannot scale. Perhaps it can handle a really large network. Hell, I’ve heard claims of single node SDN networks that scale to thousands of switches and tens of thousands of ports. Whether or not that’s true, at some point either CPU or memory will run out.

However, controllers can (and should!) be distributed. And in that case, I would argue that there are no inherent bottlenecks. Trivially, you can distribute controllers such that the total amount of available CPU and memory for the control path is equivalent to the sum total of the management CPUs on the switches themselves. Controllers can distribute state amongst themselves just like network nodes do today.
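To make that concrete, here is a minimal sketch (in Python, with made-up names) of one way to shard switches across a controller cluster so that the aggregate control-plane CPU and memory grows with the cluster; real systems would more likely shard by topology region or use proper consistent hashing.

```python
# Illustrative only: shard switches across a controller cluster so the
# aggregate control-plane resources grow with the number of controllers.
import hashlib

controllers = ["ctl-0", "ctl-1", "ctl-2"]   # hypothetical cluster members

def owner(switch_id: str) -> str:
    """Pick the controller responsible for this switch's control state."""
    digest = int(hashlib.sha1(switch_id.encode()).hexdigest(), 16)
    return controllers[digest % len(controllers)]

for sw in (f"sw-{i}" for i in range(6)):
    print(sw, "->", owner(sw))
```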

Still not convinced? Let’s deconstruct this …

Latency: How might SDN affect latency? That depends on how it is used. In some implementations, datapath traffic never leaves the hardware, in which case datapath latency should be identical to traditional networking. Other implementations will forward the first packet of a flow up to the controller and then cache the forwarding decisions on the datapath. In this case, the flow setup will have to pay an RTT to the controller. The following paragraphs discuss how much that might be.
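Before getting to the numbers, here is a minimal sketch of what that reactive model looks like on the controller side. The helper functions are stand-ins for a real southbound API (e.g. OpenFlow flow_mod/packet_out messages), not any particular platform’s interface, and the policy lookup is a placeholder.

```python
# Sketch of reactive flow setup: the first packet of a flow is punted to the
# controller, the decision is cached on the datapath, and subsequent packets
# never leave hardware. The "southbound" helpers below just print.

def install_flow_entry(switch_id, match, action, idle_timeout):
    print(f"[{switch_id}] flow_mod match={match} action={action} idle={idle_timeout}s")

def send_packet_out(switch_id, packet, action):
    print(f"[{switch_id}] packet_out {len(packet)} bytes via {action}")

flow_cache = {}  # (switch_id, flow_key) -> action; controller-side mirror of datapath state

def handle_packet_in(switch_id, flow_key, packet):
    """First packet of an unknown flow arrives at the controller."""
    action = flow_cache.get((switch_id, flow_key))
    if action is None:
        action = "output:2"   # placeholder for a real policy/route lookup
        install_flow_entry(switch_id, flow_key, action, idle_timeout=30)
        flow_cache[(switch_id, flow_key)] = action
    # Forward the punted packet itself so it isn't dropped while the entry installs.
    send_packet_out(switch_id, packet, action)

handle_packet_in("sw-1", ("10.0.0.1", "10.0.0.2", 80), b"\x00" * 64)
```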

Whether or not data traffic is sent to the controller, control traffic often is (for example BGP from a neighboring network, or the signalling of a port status change). In this case the additional overhead is the RTT to the controller + the cost of the computation to determine what to do next. How much that is will vary wildly by deployment. It’s possible to keep this sub-millisecond (200-300us is not unusual). I’ve seen as little as 70us claimed, but I doubt that could be maintained under any real load.
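A quick back-of-the-envelope version of that overhead (the figures are illustrative, not measurements):

```python
# Illustrative control-path overhead; numbers are made up for the example.
rtt_to_controller_us = 200   # round trip over the control network
computation_us = 100         # deciding what to do next on the controller
total_us = rtt_to_controller_us + computation_us
print(f"per-event overhead: ~{total_us} us")   # ~300 us, comfortably sub-millisecond
```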

So you may not want to pay this latency per flow (though in many cases it probably doesn’t matter). However, there isn’t any appreciable overhead for responding to network events. And as I’ll describe later, it probably will reduce the total time needed to disseminate control traffic.

Datapath Scaling: The datapath is where the original OpenFlow design was broken. This is because OpenFlow used the abstraction of a switch datapath as a single TCAM. Clearly, this can be problematic for some forwarding rulesets. Assume, for example, that you are building an SDN application which integrates with BGP and maps prefixes to tags with QoS policies. Implementing this within a single flow table would require (RIB * QoS rules) entries. Ouch.

However, later versions of OpenFlow support multiple tables which take care of this Cartesian explosion nightmare. Rather than having to multiply all the rules together, they can be placed in separate tables limiting the maximum size requirements of a single table (by a lot!).
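To put rough (made-up) numbers on the difference:

```python
# Single-table cross product vs. separate tables; numbers are illustrative.
rib_prefixes = 500_000            # e.g. a full BGP table
qos_rules = 8                     # QoS classes applied to those prefixes

single_table = rib_prefixes * qos_rules    # 4,000,000 entries -- ouch
multi_table = rib_prefixes + qos_rules     # 500,008 entries across two tables

print(single_table, multi_table)
```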

Still, many modern implementations of SDN dispense with the abstraction of a flow altogether and program the switch hardware tables directly. I think we’ll see a lot more of this going forward.

Convergence on failure: Alright, what happens on link (or node) failure? Traditionally, when a link fails, information about the failed link is propagated through flooding. Flooding, of course, scales linearly with the distance of the longest loop-free path. With SDN, by contrast, the information goes to one of the controllers instead of being flooded, and the controller sends the updated forwarding tables to the affected switches.

A few things to note:

  • With SDN, the controller should only be updating the switches whose tables have actually changed (a minimal sketch of this follows the list).
  • With SDN, the total cost is link propagation to the controller + the cost of computation + the time to update the affected switches.
  • In the distributed SDN case, an implementation may distribute the link update among the controller cluster, with each controller then doing the computation for its piece of the network.
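Here is the sketch promised above, for the single-controller case. compute_fibs and push_table are hypothetical stand-ins for the actual route computation and the southbound channel:

```python
# Sketch: on a link failure, recompute forwarding tables and push updates only
# to the switches whose tables actually changed.

def push_table(switch, fib):
    """Stand-in for sending a new forwarding table down to a switch."""
    print(f"update {switch}: {fib}")

def handle_link_down(topology, link, old_fibs, compute_fibs):
    """topology: a set of links; compute_fibs: topology -> {switch: table}."""
    topology.discard(link)                 # apply the failure to the controller's view
    new_fibs = compute_fibs(topology)      # recompute routes on the new topology
    for switch, fib in new_fibs.items():
        if fib != old_fibs.get(switch):    # only push where something changed
            push_table(switch, fib)
    return new_fibs
```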

Unfortunately for SDN, when in-band control is in play, things are a lot uglier. In-band control basically means that the datapath is used for both SDN control traffic and data. So a failure on the network may affect connectivity to the controller, which must be patched up somehow before the controller can fix the problem for the rest of the network.

Patching up the control channel, in this case, is generally done with an IGP, so we’re strictly worse off than if we just used an IGP to begin with. Suck.

So bottom line, if convergence time is important, out-of-band control should be used. It’s simple to build a cheap, reliable out-of-band control network using traditional gear (the amount of traffic is minimal).

Computation and Memory: It’s easy to see that the latest multi-core server available from your friendly server vendor can beat the pants off of whatever crap embedded CPU you’ll find in most networking gear (800 MHz-1 GHz is common).

I think Nick McKeown said it best. If we look historically, early routing protocols used distance vector algorithms (like RIP) which have very low computational requirements, but suck in a bunch of other ways (convergence times, counting to infinity, etc.). As CPUs became more powerful, it became possible for each node to compute single-source shortest paths to all destinations on each failure, which is the standard approach used today.

Calculating more routes (e.g. all-pairs shortest paths, or multiple sources to all destinations) on stronger CPUs follows as a natural evolution. The clear “next step”. And a beefy server with lots of memory can compute *a lot* of routes, and quickly. Further, we already know how to distribute the algorithms.
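For the curious, a small sketch of that evolution; the use of networkx and the toy topology are my own choices here, and a hand-rolled Dijkstra would do just as well:

```python
# Single-source vs. all-pairs shortest paths on a toy four-switch topology.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("s1", "s2", 1), ("s2", "s3", 1), ("s1", "s3", 5), ("s3", "s4", 1),
])

# Traditional IGP behavior: each node computes single-source shortest paths.
spf_from_s1 = nx.single_source_dijkstra_path(G, "s1")

# A controller with spare CPU and memory can compute all-pairs paths and hand
# each switch only the entries it needs.
all_pairs = dict(nx.all_pairs_dijkstra_path(G))
print(all_pairs["s1"]["s4"])   # ['s1', 's2', 's3', 's4']
```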

Some parting notes: The goal of SDN is not to scale a simple network fabric. This is something that distributed routing algorithms do just fine. The goal of SDN is to provide a design paradigm which allows the creation of more sophisticated control schemes: TE, security, virtualization, service interposition, etc. Perhaps that means just manipulating state at the edge of the network while traditional protocols are used in the core. Or perhaps that means using SDN throughout. In either case, SDN architecturally should be able to handle the scale. Remember, it’s the same amount of state that’s being passed around.

I think the right question to ask is not “does it scale” but “is it worth the hassle of building networks this way”? Building an SDN network requires some thought. In addition to the physical network nodes, a control network (probably), and controller servers need to be thrown into the mix.

So, is it worth it?

That is up to you to decide. Ultimately, the answer will depend on what gets built using SDN and who wants to run it. I think it’s still too early in the development cycle to derive a meaningful prediction.


Presentation on SDN and High-Level Network Abstractions

If you haven’t heard the talk, or seen the slides, these are a must.  Professor Shenker, one of the creators of SDN, put together this talk about why programming to high-level abstractions is better than protocol design, and why SDN helps achieve it.


What OpenFlow is (and more importantly, what it’s not)

There has been a lot of chatter on OpenFlow lately. Unsurprisingly, the vast majority is unguided hype and clueless naysaying. However, there have also been some very thoughtful blogs which, in my opinion, focus on some of the important issues without elevating it to something it isn’t.

Here are some of my favorites:

I would suggest reading all three if you haven’t already.

In any case, the goal of this post is to provide some background on OpenFlow which seems to be missing outside of the crowd in which it was created and has been used.

Quickly, what is OpenFlow? OpenFlow is a fairly simple protocol for remote communication with a switch datapath. We created it because existing protocols were either too complex or too difficult to use for efficiently managing remote state. The initial draft of OpenFlow had support for reading and writing to the switch forwarding tables, soft forwarding state, asynchronous events (new packets arriving, flow timeouts), and access to switch counters. Later versions added barriers to allow synchronous operations, cookies for more robust state management, support for multiple forwarding tables, and a bunch of other stuff I can’t remember off hand.
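As a rough, purely hypothetical illustration of the kinds of operations described above (this is not the OpenFlow wire format or any real library’s API):

```python
# Hypothetical pseudo-API: a flow_mod with a cookie, soft state via an idle
# timeout, a barrier for synchronization, and reading per-entry counters.
def program_switch(conn):
    conn.flow_mod(table_id=0,
                  match={"eth_dst": "aa:bb:cc:dd:ee:ff"},
                  actions=["output:2"],
                  cookie=0x1234,        # lets the controller find/delete this state later
                  idle_timeout=60)      # soft state: the entry expires if unused
    conn.barrier()                      # returns only once the switch has applied the write
    return conn.read_flow_stats(cookie=0x1234)   # packet/byte counters for the entry
```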

Initially, OpenFlow attempted to define what a switch datapath should look like (first as one large TCAM, and then as multiple in succession). However, in practice this is often ignored as hardware platforms tend to be irreconcilably different (there isn’t an obvious canonical forwarding pipeline common to the chipsets I’m familiar with).

And just as quickly, what isn’t OpenFlow? It isn’t new or novel, there have been many similar protocols proposed. It isn’t particularly well designed (the initial designers and implementors were more interested in building systems than designing a protocol), although it is continually being improved. And it isn’t a replacement for SNMP or NetConf, as it is myopically focussed on the forwarding path.

So, it’s reasonable to ask, why then is OpenFlow needed? The short answer is, it isn’t. In fact, focussing on OpenFlow misses the point of the rationale behind its creation. What *is* needed are standardized and supported APIs for controlling switch forwarding state.

Why is this?

Consider system design in traditional networking versus modern distributed software development.

Traditional Networking (Protocol design):

Adding functionality to a network generally reduces to creating a new protocol. This is a fairly low-level endeavor often consisting of worrying about the wire format and state distribution algorithm (e.g. flooding). In almost all cases, protocols assume a purely distributed model, which means they must handle high rates of churn (nodes coming and going) and heterogeneous node resources (e.g. CPU), and it almost certainly means the state across nodes is eventually consistent (often the bane of correctness).

In addition, to follow the socially acceptable route, somewhere during this process you bring your new protocol to a standards body stuffed with vendors and spend years (likely) arguing with a bunch of lifers on exactly what the spec should look like. But I’ll ignore that for this discussion.

So, what sucks about this? Many things. First, the distribution model is fixed; it’s *much* easier to build a system where you can dictate the distribution model (e.g. a tightly coupled cluster of servers). Second, it involves a low-level fetish with on-the-wire data formats. And third, there is no standard way to implement your protocol on a given vendor’s platform (and even if you could, it isn’t clear how you would handle conflicts with the other mechanisms managing switch state on the box).

Developing Distributed Software:

Now, let’s consider how this compares to building a modern distributed system. Say, for example, you wanted to build a system which abstracts a large datacenter consisting of hundreds of switches and tens of thousands of access ports as a single logical switch to the administrator. The administrator should be able to do everything they can do on a single switch: allocate subnets, configure ACLs and QoS policy, etc. Finally, assume all of the switches supported something like OpenFlow.

The problem then reduces to building a software system that manages the global network state. While it may be possible to implement everything on a single server, for reliability (and probably scale) it’s more likely you’d want to implement this as a tightly coupled cluster of servers. No problem, we know how to do that, there are tons of great tools available for building scale-out distributed systems.

So why is building a tightly coupled distributed system preferable to designing a new protocol (or more likely, set of protocols)?

  • You have fine-grained control of the distribution mechanism (you don’t have to do something crude like send all state to all nodes)
  • You have fine-grained control of the consistency model
  • You can use existing (high-level) software for state distribution and coordination, for example ZooKeeper and Cassandra (a small sketch follows this list)
  • You don’t have to worry about low-level data formatting issues. Distributed data stores and packages like thrift and protocol buffers handle that for you.
  • You (hopefully) reduce the debugging problem to a tightly coupled cluster of nodes.
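As one example of the coordination point above, here is a small sketch using the kazoo ZooKeeper client; the hosts, paths, and payloads are made up for illustration.

```python
# Sketch: controllers advertise themselves and record switch ownership in
# ZooKeeper via the kazoo client. Addresses and paths are illustrative.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Ephemeral node: if this controller dies, the node vanishes and the rest of
# the cluster can notice and re-shard its switches.
zk.ensure_path("/controllers")
zk.create("/controllers/ctl-0", b"10.0.0.10:6633", ephemeral=True)

# Record which controller owns which switch.
zk.ensure_path("/switch-owner")
zk.create("/switch-owner/sw-7", b"ctl-0")
```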

And what might you lose?

  • Scale? Seeing that systems of similar design are used today to orchestrate hundreds of thousands of servers and petabytes of data, if built right, scale should not be an issue.  There are OpenFlow-like solutions deployed today that have been tested to tens of thousands of ports.
  • Multi-vendor support? If the switches support a standard like OpenFlow (and can forward Ethernet) then you should be able to use any vendor’s switch.
  • However, you most likely will not have interoperability at the controller level (unless a standardized software platform was introduced).

It is important to note that in both approaches (traditional networking and building a modern distributed system), the state being managed is the same: the forwarding and configuration state of the switching chips, network-wide. It’s only the method of its management that has changed.

So where does OpenFlow fit in all this? It’s really a very minor, unimportant, easy-to-replace mechanism in a broader development paradigm. Whether it is OpenFlow, or a simple RPC schema on protocol buffers/thrift/json/etc., just doesn’t matter. What does matter is that the protocol has very clear data consistency semantics so that it can be used to build a robust, remote state management system.
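One way to picture “very clear data consistency semantics” is a declarative, versioned sync rather than a stream of order-sensitive commands. The sketch below is purely illustrative and the rpc methods are hypothetical:

```python
# Sketch: the controller declares the desired table plus a generation number,
# and the switch applies it atomically. rpc.get_generation / rpc.replace_table
# stand in for whatever transport (protocol buffers, thrift, JSON) is used.
def sync_switch(rpc, switch_id, desired_table, generation):
    if rpc.get_generation(switch_id) >= generation:
        return False                     # switch already has this state (or newer)
    rpc.replace_table(switch_id, desired_table, generation)   # all-or-nothing swap
    return True
```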

Do I believe that building networks in this manner is a net win? I really do. It may not be suitable for all deployment environments, but for those building networks as systems, it clearly has value.

I realize that the following is a tired example. But it is nevertheless true. Companies like Google, Yahoo, Amazon, and Facebook have changed how we think about infrastructure through kick-ass innovation in compute, storage and databases. The algorithm is simple: bad-ass developers + infrastructure = competitive advantage + general awesomeness.

However, lacking a consistent API across vendors and devices, the network has remained, in contrast, largely untouched. This is what will change.

Blah, blah, blah. Is it actually real? There are a number of production networks outside of research and academia running OpenFlow (or a similar mechanism) today. Generally these are built using whitebox hardware since the vendor-supported OpenFlow supply chain is extremely immature. Clearly it is very early days and these deployments are limited to the secretive, thrill-seeking fringe. However, the black art is escaping the cloister and I suspect we’ll hear much more about these and new deployments in the near future.

What does this mean to the hardware ecosystem? Certainly less than the tech pundits would like to claim.

Minimally it will enable innovation for those who care, namely those who take pride in infrastructure innovation, and the many startups looking to offer products built on this model.

Maximally? Who the hell knows. Is there going to be an industry cataclysm in which the last companies standing are Broadcom, an ODM (Quanta), an OEM (Dell) and a software company to string them all together? No, of course not. But that doesn’t mean we won’t see some serious shifts in hegemony in the market over the next 5-10 years. And my guess is that today’s incumbents are going to have a hell of a time holding on. Why? For the following two reasons:

  1. The organizational white blood cells are too near-sighted (or too stupid) to allow internal efforts to build the correct systems. For example, I’m sure “overlay” is a four-letter word in Cisco because it robs hardware of value. This addiction to hardware development cycles and margins is a critical hindrance. So while these companies potter around with their 4-7 year ASIC design cycles, others are going to maximize what goes in software and start innovating like Lady Gaga with a sewing machine.
  2. The traditional vendors, to my knowledge, just don’t have the teams to build this type of software. At least not today. This can be bought or built, but it will take time. And it’s not at all clear that the white blood cells will let that happen.

So there you have it. OpenFlow is a small piece of a broader puzzle. It has warts, but with community involvement it is being improved, and much more importantly, it is being used.