The Scaling Implications of SDN
Posted: June 8, 2011
In my experience, it’s rare to have a discussion about SDN without someone getting their panties in a bunch over scaling. Never mind that SDN networks have already been implemented that scale to tens of thousands of ports. And never mind that distributed systems are built today that manage hundreds of thousands of entities and petabytes of data. It remains a perennial point of contention.
So here is a hand-wavy, probably not totally convincing, but hopefully intuitive attempt at an explanation of the scaling limits of SDN.
But first! Some history: Unfortunately, I think I am partially to blame for the sorry state of affairs. In some of the earliest writeups we would describe an SDN controller as “logically centralized”. The intended meaning of this was something along the lines of “it has a centralized programmatic model, but is really distributed”.
Now, what does “logically centralized” actually mean? Nothing. It’s a nonsense term and the result of sloppy thinking. Either you’re centralized, or you’re distributed (thanks to Teemu Koponen for pointing that out). So by logically centralized, I guess we really meant “distributed”, and that is what we should have said. Whoops!
Clearly, a centralized controller cannot scale. Perhaps it can handle a really large network. Hell, I’ve heard claims of single node SDN networks that scale to thousands of switches and tens of thousands of ports. Whether or not that’s true, at some point either CPU or memory will run out.
However, controllers can (and should!) be distributed. And in that case, I would argue that there are no inherent bottlenecks. Trivially, you can distribute controllers such that the total amount of CPU and memory available for the control path is equivalent to the sum total of the management CPUs on the switches themselves. Controllers can distribute state amongst themselves like network nodes do today.
Still not convinced? Let’s deconstruct this …
Latency: How might SDN affect latency? That depends on how it is used. In some implementations, datapath traffic never leaves the hardware, in which case datapath latency should be identical to traditional networking. Other implementations will forward the first packet of a flow up to the controller and then cache the forwarding decision on the datapath. In this case, the flow setup will have to pay an RTT to the controller. The following paragraph will discuss how much that might be.
Whether or not data traffic is sent to the controller, control traffic often is (for example BGP from a neighboring network, or the signalling of a port status change). In this case the additional overhead is the RTT to the controller + the cost of the computation to determine what to do next. How much that is will vary wildly by deployment. It’s possible to keep this sub-millisecond (200-300us is not unusual). I’ve seen as little as 70us claimed, but I doubt that could be maintained under any real load.
So you may not want to pay this latency per flow (though in many cases it probably doesn’t matter). However, there isn’t any appreciable overhead for responding to network events. And as I’ll describe later, it probably will reduce the total time needed to disseminate control traffic.
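To make the "first packet pays an RTT, the rest hit the cache" pattern concrete, here is a minimal sketch. The class, the flow key format and the toy controller policy are all made up for illustration; they are not any real controller's API.

```python
from typing import Callable, Dict, Tuple

FlowKey = Tuple[str, str]  # (source address, destination address), an assumption


class ReactiveDatapath:
    """Toy datapath that consults the controller only on a flow-cache miss."""

    def __init__(self, controller: Callable[[FlowKey], str]) -> None:
        self.controller = controller            # hypothetical decision function
        self.flow_cache: Dict[FlowKey, str] = {}
        self.controller_round_trips = 0         # how many RTTs we have paid

    def forward(self, flow_key: FlowKey) -> str:
        action = self.flow_cache.get(flow_key)
        if action is None:
            # Miss: the first packet of the flow pays one RTT to the controller.
            self.controller_round_trips += 1
            action = self.controller(flow_key)
            self.flow_cache[flow_key] = action  # later packets stay on the datapath
        return action


def toy_controller(flow_key: FlowKey) -> str:
    # Hypothetical policy: 10.x destinations leave on port 1, the rest on port 2.
    _, dst = flow_key
    return "output:1" if dst.startswith("10.") else "output:2"


dp = ReactiveDatapath(toy_controller)
for _ in range(3):
    dp.forward(("10.0.0.1", "10.0.0.2"))
print(dp.controller_round_trips)  # one round trip for three packets of one flow
```

Implementations that never punt data traffic to the controller skip the miss path entirely and pay nothing per flow.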
Datapath Scaling: The datapath is where the original OpenFlow design was broken. This is because OpenFlow used the abstraction of a switch datapath as a single TCAM. Clearly, this can be problematic for some forwarding rulesets. Assume, for example, that you are building an SDN application which would integrate with BGP, and map prefixes to tags with QoS policies. Implementing this within a single flow table would require (RIB * QoS rules) entries. Ouch.
However, later versions of OpenFlow support multiple tables which take care of this Cartesian explosion nightmare. Rather than having to multiply all the rules together, they can be placed in separate tables limiting the maximum size requirements of a single table (by a lot!).
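A quick back-of-the-envelope sketch of the difference. The table sizes below are illustrative numbers I picked, not measurements, and the multi-table case assumes the simplest possible split: one table keyed on prefix, one keyed on QoS rule.

```python
def single_table_entries(rib_prefixes: int, qos_rules: int) -> int:
    # One flat table needs an entry for every (prefix, QoS rule) combination.
    return rib_prefixes * qos_rules


def multi_table_entries(rib_prefixes: int, qos_rules: int) -> int:
    # Assumed split: a prefix table followed by a QoS table, so sizes add.
    return rib_prefixes + qos_rules


rib, qos = 400_000, 50  # illustrative sizes only
print(single_table_entries(rib, qos))  # 20000000
print(multi_table_entries(rib, qos))   # 400050
```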
Still, many modern implementations of SDN dispense with the abstraction of a flow altogether and program the switch hardware tables directly. I think we’ll see a lot more of this going forward.
Convergence on failure: Alright, what happens on link (or node) failure? Traditionally, when a link fails, the information about the failed link is propagated through flooding. Flooding, of course, scales linearly with the length of the longest loop-free path. Compare this with SDN: instead of being flooded, the information goes to one of the controllers, which sends the updated routing tables to the affected switches.
A few things to note:
- With SDN, the controller should only be updating the switches whose tables have actually changed.
- With SDN, the total cost is link propagation to the controller + cost of computation + time to update to affected switches.
- In a distributed SDN case, an implementation may distribute the link update among the controller clusters which then do the computation for their piece of the network.
Unfortunately for SDN, when in-band control is in play, things are a lot uglier. In-band control basically means that the datapath is used both for SDN control traffic and data. So a failure on the network may affect connectivity to the controller, which must be patched up somehow before the controller can fix the problem for the rest of the network.
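The notes above can be sketched as a toy model. Everything here is an assumption for illustration: the table representation, the function names, and the idea of a fixed per-hop delay for flooding (which also ignores the recomputation each node does after it hears about the failure).

```python
from typing import Dict, List

Table = Dict[str, str]  # prefix -> next hop, a simplified routing table


def changed_switches(old: Dict[str, Table], new: Dict[str, Table]) -> List[str]:
    # The controller only needs to push updates to switches whose table differs.
    return sorted(sw for sw in new if old.get(sw) != new[sw])


def sdn_convergence_us(to_controller_us: int, compute_us: int, update_us: int) -> int:
    # Link propagation to the controller + computation + update of affected switches.
    return to_controller_us + compute_us + update_us


def flooding_propagation_us(longest_path_hops: int, per_hop_us: int) -> int:
    # Flooding scales linearly with the longest loop-free path (propagation only).
    return longest_path_hops * per_hop_us


old = {
    "s1": {"10.0.0.0/8": "s2"},
    "s2": {"10.0.0.0/8": "s3"},
    "s3": {"10.0.0.0/8": "local"},
}
new = {
    "s1": {"10.0.0.0/8": "s4"},  # only s1 reroutes around the failed link
    "s2": {"10.0.0.0/8": "s3"},
    "s3": {"10.0.0.0/8": "local"},
}
print(changed_switches(old, new))  # ['s1']
```

Plug in your own numbers for the two cost functions; which side wins depends entirely on the deployment.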
Patching up the control channel, in this case, is generally done with an IGP, so we’re strictly worse off than if we just used an IGP to begin with. Suck.
So bottom line, if convergence time is important, out of band control should be used. It’s simple to build a cheap, reliable out of band control network using traditional gear (the amount of traffic is minimal).
Computation and Memory: It’s easy to see that the latest multi-core server available from your friendly server vendor can beat the pants off of whatever crap embedded CPU you’ll find in most networking gear (800 MHz to 1 GHz is common).
I think Nick McKeown said it best. If we look historically, early routing protocols used distance vector routing algorithms (like RIP) which have very low computational requirements, but suck in a bunch of other ways (convergence times, count to infinity, etc.). As CPUs became more powerful, it was possible for each node to compute single source shortest path to all destinations on each failure, which is the standard approach used today.
Calculating more routes (e.g. all pairs shortest path, or multiple source to all destinations) on stronger CPUs really follows as a natural evolution. The clear “next step”. And a beefy server with lots of memory can compute *a lot* of routes, and quickly. Further, we already know how to distribute the algorithms.
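As a rough sketch of that progression: a single-source shortest-path computation (Dijkstra here, my choice for the example) is what each node runs for itself, and a controller can simply run it from every source. The three-node topology and link costs are made up.

```python
import heapq
from typing import Dict

Graph = Dict[str, Dict[str, int]]  # node -> {neighbor: link cost}


def shortest_paths(graph: Graph, source: str) -> Dict[str, int]:
    """Dijkstra: cost from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def all_pairs(graph: Graph) -> Dict[str, Dict[str, int]]:
    # What a controller might do: one single-source run per node.
    return {src: shortest_paths(graph, src) for src in graph}


# Made-up topology: a-b costs 1, b-c costs 2, a-c costs 5.
topology: Graph = {
    "a": {"b": 1, "c": 5},
    "b": {"a": 1, "c": 2},
    "c": {"a": 5, "b": 2},
}
print(all_pairs(topology)["a"]["c"])  # 3, via b
```

Each per-source run is independent of the others, so spreading the work across cores or controller nodes is straightforward.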
Some parting notes: The goal of SDN is not to scale a simple network fabric. This is something that distributed routing algorithms do just fine. The goal of SDN is to provide a design paradigm which allows the creation of more sophisticated control functions: TE, security, virtualization, service interposition, etc. Perhaps that means just manipulating state at the edge of the network while traditional protocols are used in the core. Or perhaps that means using SDN throughout. In either case, SDN architecturally should be able to handle the scale. Remember, it’s the same amount of state that’s being passed around.
I think the right question to ask is not “does it scale” but “is it worth the hassle of building networks this way”? Building an SDN network requires some thought. In addition to the physical network nodes, a control network (probably), and controller servers need to be thrown into the mix.
So, is it worth it?
That is up to you to decide. Ultimately the answer will depend on what gets built using SDN and who wants to run it. I think it’s still too early in the development cycle to make a meaningful prediction.