Scale, SDN, and Network Virtualization

[This post was put together by Teemu Koponen, Andrew Lambeth, Rajiv Ramanathan, and Martin Casado]

Scale has been an active (and often contentious) topic in the discourse around SDN (and by SDN we refer to the traditional definition) since long before the term was coined. Critics of the work that led to SDN argued that moving the control plane away from full distribution would lead to scalability challenges. Later arguments reasoned that SDN results in *more* scalable network designs because there is no longer a need to flood the entire network state in order to create a global view at each switch. Finally, there is the common concern that using traditional SDN (a la OpenFlow) to control physical switches cannot scale due to forwarding-table limits.

However, while there has been a lot of talk, there have been relatively few real-world examples to back up the rhetoric. Most arguments appeal to reason, argue (sometimes convincingly) from first principles, or point to related but ultimately different systems.

The goal of this post is to add to the discourse by presenting some scaling data, taken over a two-year period, from a production network virtualization solution that uses an SDN approach. Up front, we don’t want to overstate the significance of this post as it only covers a tiny sliver of the design space. However, it does provide insight into a real system, and that’s always an interesting centerpiece around which to hold a conversation.

Of course, under the broadest of terms, an SDN approach can have the same scaling properties as traditional networking. For example, there is no reason that controllers can’t run traditional routing protocols between them. However, a more useful line of inquiry concerns the scaling properties of a system built using an SDN approach that actually benefits from the architecture, and how those properties differ from the traditional approach. We briefly touch on both of these topics in the discussion below.

The System

The system we’ll be describing underlies the network virtualization platform described here. The core system has been in development for 4-5 years, has been in production for over 2 years, and has been deployed in many different environments.

A Scale Graph

By scale, we’re simply referring to the number of elements (nodes, links, end points, etc.) that a system can handle without negatively impacting runtime operations (e.g., reacting to a change in the topology, a controller failure, or a software upgrade). In the context of network virtualization, the elements under control that we focus on are virtual switches, physical ports, and virtual ports. Below is a graph of the scale numbers for virtual ports and hypervisors under control that we’ve validated over the last two years for one particular use case.

[Figure: scale graph of logical ports and hypervisors under control over time]

The Y axis on the left is the number of logical ports (ports on logical switches); the Y axis on the right is the number of hypervisors (and therefore virtual switches) under control. We assume that the average number of logical ports per logical switch is constant (in this case 4), although varying that is clearly another interesting metric worth tracking. Of course, these results are in no way exhaustive, as they only reflect one use case that we commonly see in the field. Different configurations will likely have different numbers.

Some additional information about the graph:

  • For comparison, the physical analog of this would be 100,000 servers (end hosts), 5,000 ToR switches, 25,000 VLANs, and all the fabric ports that connect these ToR switches together (see the arithmetic sketch just after this list).
  • The gains in scale from Jan 2011 to Jan 2013 were achieved entirely by improving the scaling properties of a single node. That is, rather than adding resources in the form of additional controller nodes, the engineering team continued to optimize the existing implementation (data structures, algorithms, language-specific overhead, etc.). However, the controllers were being run as a cluster during that time, so they were still incurring the full overhead of consensus and data replication.
  • The gains shown for the last two data points come solely from distribution (adding resources), without any changes to the core scaling properties of a single node; in this case, moving from 3 to 4 and finally 5 nodes.
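
To connect the graph’s virtual numbers to the physical analog in the first bullet, the back-of-the-envelope arithmetic is shown below. The counts are taken from that bullet (and the assumed 4 ports per logical switch), not read off the graph itself, so treat them as illustrative.

```python
# Illustrative arithmetic relating the virtualized metrics to the physical analog.
logical_ports = 100_000           # analog: servers / end hosts
hypervisors = 5_000               # analog: ToR switches
ports_per_logical_switch = 4      # average assumed constant for this use case

logical_switches = logical_ports // ports_per_logical_switch   # 25,000 (analog: VLANs)
endpoints_per_hypervisor = logical_ports // hypervisors        # 20 VMs per host
print(logical_switches, endpoints_per_hypervisor)              # 25000 20
```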

Discussion

Raw scaling numbers are rarely interesting on their own, as they vary with the use case and the underlying server hardware running the controllers. What we do find interesting, though, is the relative increase in performance over time: scale grows significantly both as the implementation is tuned and improved and as more nodes are added to the cluster.

It’s also interesting to note what the scaling bottlenecks are. While most of the discussion around SDN has focused on fundamental limits of the architecture, we have not found this to be a significant contributor either way. That is, at this point we’ve not run into any architectural scaling limitations; rather, what we’ve seen are implementation shortcomings (e.g. inefficient code, inefficient scheduling, bugs) and difficulty in verifying very large networks. In fact, we believe there is significant architectural headroom to continue scaling at a similar pace.

SDN vs. Traditional Protocols

One benefit of SDN that we’ve not seen widely discussed is its ability to enable rapid evolution of solutions to address network scaling issues, especially in comparison to slow-moving standards bodies and multi-year ASIC design/development cycles. This has allowed us to continually modify our system to improve scale while still providing strong consistency guarantees, which are very important for our particular application space.

It’s easy to point out examples in traditional networking where this would be beneficial but isn’t practical in short time periods. For example, consider traditional link-state routing. Generally, the topology is assumed to be unknown; for every link change event, the topology database is flooded throughout the network. However, in most environments the topology is fixed or slow-changing and easily discoverable via some other mechanism. In such environments, the static topology can be distributed to all nodes once, and during link change events only the link-change data needs to be passed around, rather than megabytes of link-state database. Changing this behavior would likely require a change to the RFC. Changes to the RFC, though, require common agreement amongst all parties, and traditionally this results in years of work in a very political standardization process.
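
To make the contrast concrete, here is a minimal sketch in Python. The names are hypothetical and this is not how any routing implementation is actually written; it only illustrates how much data has to move per link event once the static topology has been pre-distributed.

```python
# A minimal, hypothetical sketch contrasting the two distribution models
# described above. It assumes the static topology has already been pushed
# to every node by some out-of-band mechanism.

from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Link:
    src: str
    dst: str
    up: bool


class Neighbor:
    """Stand-in for an adjacent node that receives flooded updates."""
    def receive(self, messages: List[Link]) -> None:
        pass


def flood_full_lsdb(lsdb: Dict[Tuple[str, str], Link],
                    neighbors: Iterable[Neighbor]) -> None:
    # Traditional behavior: any change triggers flooding of the entire
    # link-state database -- O(|links|) of data per event, per neighbor.
    for n in neighbors:
        n.receive(list(lsdb.values()))


def flood_link_delta(changed: Link, neighbors: Iterable[Neighbor]) -> None:
    # With the topology known in advance, only the delta travels --
    # O(1) of data per event, per neighbor.
    for n in neighbors:
        n.receive([changed])
```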

For our system, however, as our understanding of the problem grows we’re able to evolve not only the algorithms and data structures used, but also the distribution model (which is reflected by the last two points in the graph) and the amount of shared information.

Of course, the tradeoff for this flexibility is that the protocol used between the controllers is no longer standard. Indeed, the cluster looks much more like a tightly coupled distributed software system than a loose collection of nodes. This is why standard interfaces around SDN applications are so important. For network virtualization this would be the northbound side (e.g. Quantum), the southbound side (e.g. ovsdb-conf), and federation between controller clusters.
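
As a concrete illustration of the northbound side, here is a hedged sketch of creating a logical network through the Quantum v2.0 REST API. The controller URL and token are placeholders, error handling is omitted, and the call shown is the standard API rather than anything specific to our system.

```python
# A hedged sketch of a northbound call: creating a logical network through the
# Quantum v2.0 REST API. URL and token below are placeholders.

import json
import urllib.request

QUANTUM_URL = "http://controller:9696/v2.0"   # placeholder endpoint
TOKEN = "<keystone-token>"                    # placeholder auth token


def create_logical_network(name: str) -> str:
    """POST /v2.0/networks and return the new network's UUID."""
    body = json.dumps({"network": {"name": name, "admin_state_up": True}}).encode()
    req = urllib.request.Request(
        f"{QUANTUM_URL}/networks",
        data=body,
        headers={"Content-Type": "application/json", "X-Auth-Token": TOKEN},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["network"]["id"]
```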

Parting Thoughts

This is only a snapshot of a very complex topic. The point of this post is not to highlight the scale of a particular system—clearly a moving target—but rather to provide some insight into the scaling properties of a real application built using an SDN approach.


11 Comments on “Scale, SDN, and Network Virtualization”

  1. Nice summary. I have two questions.

    First, you say “in most environments, the topology is fixed or is slow changing”. Are you referring to the physical topology, which is usually highly regular and slow changing, or the logical (virtual, overlay) topology, which can change quite rapidly and dramatically in many situations (such as a dev/test IaaS environment)?

    And second, the scale and lifetime of the system are such that you should have experienced a statistically significant number of failures (whether hardware or software). Can you comment on the robustness of the system in operational use?

    • Hi Geoff.

      1) Great questions. This was referring to the physical topology in datacenter networks. The point was simply that OSPF doesn’t actually have to send around the link-state database each time there is a link change. However, getting a standardized implementation that had this (relatively minor) optimization would be difficult given the standards process. Certainly I agree that in a virtual networking context the virtual topology is highly dynamic and needs to be handled in a different way.

      2) Robustness is most relevant in the context of the product built around the controller (which includes the vswitches, management, integration APIs, etc.), which were not the focus of this post. Perhaps this is a good discussion to defer to a post that focuses on the robustness of SDN approaches. In fact, Brad McConnell of Rackspace and I were recently talking about writing such a post. I think we can probably better address this issue at that time. But certainly an important (!!) topic area.

  2. guruparulkar says:

    Insightful and valuable post…

  3. Paul Gleichauf says:

    It is good to see a sample configuration’s evolution in performance over time, even given the carefully stated caveats.

    To amplify Geoff’s comments about dynamism and robustness, it would be very interesting to see profiles of configuration change timescales as well as profiles of flow lifetimes and which protocols are invoked. Although the reference to NVP as the basis of this network example is helpful, I was not able to glean what kind and volume of network monitoring traffic is making its way back up to the controllers for making decisions on resource allocation across the network. On the surface it would seem that monitoring traffic will increase dramatically with scale out. This network example is a single tier of control (and implicitly monitoring); is it correct to guess that it does not yet require scaling in depth?

    • Paul, I agree on all points. This only provides a glimpse, and not a very deep one. The goal of the post was to focus on gains over time, not on the raw scaling implications of SDN for a particular application. I agree that a much deeper treatment is needed, but that is probably better done as a peer-reviewed technical paper rather than a blog post. Teemu and I have been talking about submitting something on this topic for ages. Perhaps now it is time.

      • Tracy Corbo says:

        This reminds me of the transition to client/server computing and the need for dare I say it “middleware” to glue it all together. But I won’t venture down that rathole. In any case, yes the networks must scale, but just as important they must be manageable. It would be great to see network monitoring and troubleshooting baked into the design right from the beginning and not as an afterthought. No one has addressed monitoring, performance management, or troubleshooting and we all know who gets the blame when things break. All fingers point to the network.

  4. Alex Bachmutsky says:

    Very good post. A few comments:
    1. I would add “… in datacenters” to the title, because the study was done in a DC, and some of your assumptions about physical network stability could be incorrect for access, aggregation, or even core networks of telco service providers. There is a very significant difference between telecom and datacom today.
    2. Scalability is not only about rule programming. There is also a need for notification scalability, and standard SDN today lacks capabilities similar to the alarm aggregation and correlation known from traditional management solutions. Maybe this is not as critical for very stable networks with a low number of notifications.
    3. You’ve assumed no high availability for the controller. I don’t know about DCs, but in telecom there is a need for redundancy (nobody will deploy non-redundant control for 25K switches). High availability will significantly affect scalability and performance, because of the main HA rule: the standby at any point in time should know at least as much as the active node.
    4. You probably assumed a fully or mostly pre-provisioned scenario. Scalability will be significantly affected by a high number of a priori unknown flows that the switch has to send to the controller for analysis and decision making/provisioning. For some applications, this could be very large. It would also require a much more scalable dataplane in the controller, not just a control plane. Even in DCs, VM migration will grow (and SDN hopefully should be one of the reasons that happens), and when we have applications following data or data following applications, the requirement for controller scalability will grow correspondingly. It will be even more significant when VMs frequently move across DCs.
    5. You’ve assumed a single controller. In reality there are many cases where multiple controllers are deployed (ownership, redundancy, L1/L2/L3/network administration boundaries, etc.). Having multiple controllers that need to cooperate, align policies, perform cross-controller validity checks, and carry out many other similar functions will affect the scalability numbers.
    6. I do not know your assumptions on the number of switches to be programmed for each transaction. This is a well-known problem in existing Path Computation Engines (PCEs), where some implementations inject a single event into the network that propagates across multiple network elements, exactly for scalability reasons. I also don’t know your assumptions about negative use cases, consistency checks across multiple network elements, rollbacks, etc. Again, maybe all these problems are more relevant for telecom…
    7. In a multi-tenant environment there is a need to partition a single physical table into a number of logical tables, especially with HW-based switches. This works well when the logical tables are sparse. When a shared table becomes loaded, there is a need to repartition the table, use it as a cache for the most active flows, and employ other tricks to achieve the required scalability. All these operations will put a lot of load on the controller. Have you considered that?

    Of course, as a person pushing for SDN in telecom, I would like to have a better visibility into the telecom SDN scalability. At the same time, the last thing I want is a claim that the attempted SDN deployment is not working as advertised ;-)

    • Hi Alex. As always, great comments. A few quick responses:

      1. You’re right, thanks for pointing that out.
      2. Totally agree. We’ve had to do a lot of work in this area.
      3. Actually, the controller is fully clustered and all data points include the overhead of replication, etc. In fact, the last couple of data points show gains achieved solely via distribution (adding more controllers).
      4. Yes and no. We tend to do incremental computation when new VMs show up, go away, or move. Our deployments are incredibly dynamic in general.
      5. Again, this is not the case. This is a fully clustered (active-active with distributed computation and replication for HA) solution.
      6. Very important issue that we didn’t touch on in this post.
      7. Our end points are primarily in software, so we don’t have to deal with hardware table limits. However, this will certainly be an issue when dealing with hardware VTEPs.

  5. zanetworker says:

    I enjoyed reading as always, but there is something that keeps me confused. In this article you mention ‘traditional SDN’, which uses OpenFlow as a southbound protocol. In another post you also mentioned that network virtualization can be achieved without SDN, yet it can use SDN. So my question is: when you say you are following an SDN approach, does that mean that using the Quantum API on the northbound side and OVSDB on the southbound side is considered an SDN approach? In other words, is using only a management protocol like OVSDB enough to be considered an SDN approach?

    • Yeah, I can see how that is confusing. By traditional SDN we simply meant that the control plane has a different distribution model than the dataplane. In this case, it is a tightly coupled distributed system. SDN as portrayed by analysts and some companies is focused more on open APIs, which is not in keeping with the original intent. So under that definition, a controller using OVSDB would be SDN. However, in our case we use both OpenFlow and OVSDB.

  6. Jeroen van Bemmel says:

    Interesting post, and still very relevant today. I just read your recent white paper on various aspects of NVP, including convergence timing: http://download3.vmware.com/software/vmw-tools/technical_reports/network_virt_in_multi_tenant_dc.pdf

    Could you comment on the relationship between controller scalability and convergence time?

    Regards,
    Jeroen van Bemmel, Nuage Networks

