What OpenFlow is (and more importantly, what it’s not)

There has been a lot of chatter on OpenFlow lately. Unsurprisingly, the vast majority is unguided hype and clueless naysaying. However, there have also been some very thoughtful blogs which, in my opinion, focus on some of the important issues without elevating OpenFlow to something it isn’t.

Here are some of my favorites:

I would suggest reading all three if you haven’t already.

In any case, the goal of this post is to provide some background on OpenFlow that seems to be missing outside of the crowd in which it was created and used.

Quickly, what is OpenFlow? OpenFlow is a fairly simple protocol for remote communication with a switch datapath. We created it because existing protocols were either too complex or too difficult to use for efficiently managing remote state. The initial draft of OpenFlow had support for reading and writing to the switch forwarding tables, soft forwarding state, asynchronous events (new packets arriving, flow timeouts), and access to switch counters. Later versions added barriers to allow synchronous operations, cookies for more robust state management, support for multiple forwarding tables, and a bunch of other stuff I can’t remember off hand.
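
To make that a bit more concrete, here is a rough sketch, in Python, of the kind of state the protocol exposes. These are simplified, illustrative data structures of my own; they are not OpenFlow’s actual message formats or any controller library’s API.

```python
# Illustrative only: simplified stand-ins for the state an OpenFlow-style
# protocol exposes, not the real wire format.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FlowEntry:
    match: Dict[str, str]      # e.g. {"in_port": "1", "eth_dst": "aa:bb:cc:dd:ee:ff"}
    actions: List[str]         # e.g. ["output:2"]
    priority: int = 0
    idle_timeout: int = 0      # soft state: entry expires if no packets hit it
    hard_timeout: int = 0      # soft state: entry expires regardless of use
    cookie: int = 0            # opaque tag for controller-side bookkeeping
    packet_count: int = 0      # counters the controller can read back
    byte_count: int = 0

@dataclass
class ForwardingTable:
    table_id: int = 0          # later protocol versions allow multiple tables
    entries: List[FlowEntry] = field(default_factory=list)

# Asynchronous events travel the other direction, switch to controller:
# a "packet_in" when a packet matches no entry, and a "flow_removed" when
# an entry times out, each carrying enough context for the controller to react.
```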

Initially, OpenFlow attempted to define what a switch datapath should look like (first as one large TCAM, and then multiple in succession). However, in practice this is often ignored, as hardware platforms tend to be irreconcilably different (there isn’t an obvious canonical forwarding pipeline common to the chipsets I’m familiar with).

And just as quickly, what isn’t OpenFlow? It isn’t new or novel; many similar protocols have been proposed. It isn’t particularly well designed (the initial designers and implementors were more interested in building systems than in designing a protocol), although it is continually being improved. And it isn’t a replacement for SNMP or NetConf, as it is myopically focussed on the forwarding path.

So, it’s reasonable to ask, why then is OpenFlow needed? The short answer is, it isn’t. In fact, focussing on OpenFlow misses the rationale behind its creation. What *is* needed are standardized and supported APIs for controlling switch forwarding state.

Why is this?

Consider system design in traditional networking versus modern distributed software development.

Traditional Networking (Protocol design):

Adding functionality to a network generally reduces to creating a new protocol. This is a fairly low-level endeavor, often consisting of worrying about the wire format and the state distribution algorithm (e.g. flooding). In almost all cases, protocols assume a purely distributed model, which means they must handle high rates of churn (nodes coming and going) and heterogeneous node resources (e.g. CPU), and almost certainly means the state across nodes is only eventually consistent (often the bane of correctness).
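
As a toy illustration of what that low-level machinery looks like, here is a sketch of flooding-based state distribution; the Node class and its fields are invented for this example, not taken from any real protocol. It also shows why nodes end up only eventually consistent: between the first and last receive(), different nodes hold different versions of the state.

```python
# Toy flooding-based state distribution; purely illustrative.
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []        # other Node objects
        self.state = {}            # origin -> (sequence number, payload)

    def originate(self, payload, seq):
        # A node injects a new version of its own state into the network.
        self.receive(self.name, seq, payload)

    def receive(self, origin, seq, payload):
        known_seq = self.state.get(origin, (-1, None))[0]
        if seq <= known_seq:
            return                 # stale or duplicate update, drop it
        self.state[origin] = (seq, payload)
        for n in self.neighbors:   # re-flood to every neighbor
            n.receive(origin, seq, payload)

# Wire up a tiny triangle and originate an update from one node.
a, b, c = Node("a"), Node("b"), Node("c")
a.neighbors, b.neighbors, c.neighbors = [b, c], [a, c], [a, b]
a.originate({"prefix": "10.0.0.0/24"}, seq=1)
# Every node eventually holds the same state, but in between, nodes disagree.
```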

In addition, to follow the socially acceptable route, somewhere during this process you bring your new protocol to a standards body staffed by the vendors and spend years (likely) arguing with a bunch of lifers over exactly what the spec should look like. But I’ll ignore that for this discussion.

So, what sucks about this? Many things. First, the distribution model is fixed. It’s *much* easier to build a system where you can dictate the distribution model (e.g. a tightly coupled cluster of servers). Second, it involves a low-level fetish with on-the-wire data formats. And third, there is no standard way to implement your protocol on a given vendor’s platform (and even if you could, it isn’t clear how you would handle conflicts with the other mechanisms managing switch state on the box).

Developing Distributed Software:

Now, let’s consider how this compares to building a modern distributed system. Say, for example, you wanted to build a system which abstracts a large datacenter, consisting of hundreds of switches and tens of thousands of access ports, as a single logical switch to the administrator. The administrator should be able to do everything they can do on a single switch: allocate subnets, configure ACLs and QoS policy, etc. Finally, assume all of the switches support something like OpenFlow.

The problem then reduces to building a software system that manages the global network state. While it may be possible to implement everything on a single server, for reliability (and probably scale) it’s more likely you’d want to implement this as a tightly coupled cluster of servers. No problem; we know how to do that, and there are tons of great tools available for building scale-out distributed systems.

So why is building a tightly coupled distributed system preferable to designing a new protocol (or more likely, set of protocols)?

  • You have fine-grained control of the distribution mechanism (you don’t have to do something crude like send all state to all nodes)
  • You have fine-grained control of the consistency model
  • You can use existing (high-level) software for state distribution and coordination, for example ZooKeeper and Cassandra (see the sketch after this list)
  • You don’t have to worry about low-level data formatting issues. Distributed data stores and packages like Thrift and Protocol Buffers handle that for you.
  • You (hopefully) reduce the debugging problem to a tightly coupled cluster of nodes.
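
As a concrete (and entirely illustrative) example of the coordination and formatting points above, here is a minimal sketch that stores desired flow state in ZooKeeper via the kazoo client; the znode layout, the JSON payload, and the switch names are assumptions made up for this post, not any shipping controller’s design.

```python
# A sketch of "use existing high-level software for state distribution":
# desired flow state is written into ZooKeeper, and any controller replica
# can watch it. The znode layout, JSON payload, and switch names are invented.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # assumes a reachable ZooKeeper ensemble
zk.start()

def publish_flow(switch_id, flow_name, match, actions):
    """Record the desired flow entry for one switch as a znode."""
    parent = f"/fabric/{switch_id}/flows"
    path = f"{parent}/{flow_name}"
    payload = json.dumps({"match": match, "actions": actions}).encode()
    zk.ensure_path(parent)
    if zk.exists(path):
        zk.set(path, payload)
    else:
        zk.create(path, payload)

# A controller replica watches a switch's flow directory and pushes entries
# down to the switch (over OpenFlow or anything else) whenever they change.
zk.ensure_path("/fabric/switch-1/flows")

@zk.ChildrenWatch("/fabric/switch-1/flows")
def on_flows_changed(children):
    print("switch-1 should now have flows:", children)

publish_flow("switch-1", "web-out", {"tcp_dst": "80"}, ["output:2"])
```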

And what might you lose?

  • Scale? Seeing that systems of similar design are used today to orchestrate hundreds of thousands of servers and petabytes of data, if built right, scale should not be an issue. There are OpenFlow-like solutions deployed today that have been tested to tens of thousands of ports.
  • Multi-vendor support? If the switches support a standard like OpenFlow (and can forward Ethernet), then you should be able to use any vendor’s switch.
  • However, you most likely will not have interoperability at the controller level (unless a standardized software platform were introduced).

It is important to note that in both approaches (traditional networking and building a modern distributed system), the state being managed is the same: the forwarding and configuration state of the switching chips, network wide. It’s only the method of its management that has changed.

So where does OpenFlow fit in all this? It’s really a very minor, unimportant, easy-to-replace mechanism in a broader development paradigm. Whether it is OpenFlow or a simple RPC schema over Protocol Buffers/Thrift/JSON/etc. just doesn’t matter. What does matter is that the protocol has very clear data consistency semantics so that it can be used to build a robust remote state management system.
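
To illustrate what “very clear data consistency semantics” means in practice, here is a hypothetical sketch: the SwitchConnection class, the JSON message names, and the port are all invented for this example (they are not OpenFlow’s or any library’s), but the important part is the explicit barrier that tells you when earlier writes have actually taken effect on the switch.

```python
# Hypothetical JSON-over-TCP flow programming interface, for illustration only.
import itertools
import json
import socket

class SwitchConnection:
    def __init__(self, host, port):
        self.sock = socket.create_connection((host, port))
        self._xids = itertools.count(1)       # transaction ids to pair replies with requests

    def _send(self, msg_type, **fields):
        xid = next(self._xids)
        line = json.dumps({"type": msg_type, "xid": xid, **fields}) + "\n"
        self.sock.sendall(line.encode())
        return xid

    def add_flow(self, match, actions):
        return self._send("flow_add", match=match, actions=actions)

    def barrier(self):
        """Block until the switch acknowledges everything sent before this point."""
        xid = self._send("barrier_request")
        for raw in self.sock.makefile():
            reply = json.loads(raw)
            if reply.get("type") == "barrier_reply" and reply.get("xid") == xid:
                return            # all prior writes are now committed on the switch

# Usage (against a hypothetical switch agent):
# conn = SwitchConnection("10.0.0.5", 9191)
# conn.add_flow({"eth_dst": "aa:bb:cc:dd:ee:ff"}, ["output:3"])
# conn.barrier()   # only now is it safe to assume the entry is installed
```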

Do I believe that building networks in this manner is a net win? I really do. It may not be suitable for all deployment environments, but for those building networks as systems, it clearly has value.

I realize that the following is a tired example. But it is nevertheless true. Companies like Google, Yahoo, Amazon, and Facebook have changed how we think about infrastructure through kick-ass innovation in compute, storage and databases. The algorithm is simple: bad-ass developers + infrastructure = competitive advantage + general awesomeness.

However, lacking a consistent API across vendors and devices, the network has remained, in contrast, largely untouched. This is what will change.

Blah, blah, blah. Is it actually real? There are a number of production networks outside of research and academia running OpenFlow (or a similar mechanism) today. Generally these are built using whitebox hardware, since the vendor-supported OpenFlow supply chain is extremely immature. Clearly it is very early days, and these deployments are limited to the secretive, thrill-seeking fringe. However, the black art is escaping the cloister, and I suspect we’ll hear much more about these and new deployments in the near future.

What does this mean to the hardware ecosystem? Certainly less than the tech pundits would like to claim.

Minimally, it will enable innovation for those who care, namely those who take pride in infrastructure innovation and the many startups looking to offer products built on this model.

Maximally? Who the hell knows. Is there going to be an industry cataclysm in which the last companies standing are Broadcom, an ODM (Quanta), an OEM (Dell), and a software company to string them all together? No, of course not. But that doesn’t mean we won’t see some serious shifts in hegemony in the market over the next 5-10 years. And my guess is that today’s incumbents are going to have a hell of a time holding on. Why? For the following two reasons:

  1. The organizational white blood cells are too near-sighted (or too stupid) to allow internal efforts to build the correct systems. For example, I’m sure “overlay” is a four-letter word at Cisco because it robs hardware of value. This addiction to the hardware development cycle and its margins is a critical hindrance. So while these companies potter around with their 4-7 year ASIC design cycles, others are going to maximize what goes in software and start innovating like Lady Gaga with a sewing machine.
  2. The traditional vendors, to my knowledge, just don’t have the teams to build this type of software. At least not today. This can be bought or built, but it will take time. And it’s not at all clear that the white blood cells will let that happen.

So there you have it. OpenFlow is a small piece of a broader puzzle. It has warts, but with community involvement it is improving, and, much more importantly, it is being used.