Network Virtualization and the End-to-End Principle

[This post was written by Dinesh Dutt with help from Martin Casado.  Dinesh is Chief Scientist at Cumulus Networks. Before that, he was a Cisco Fellow, working on various data center technologies from ASICs to protocols to RFCs. He’s a primary co-author on the TRILL RFC and the VxLAN draft at the IETF.  Sudeep Goswami, Shrijeet Mukherjee, Teemu Koponen, Dmitri Kalintsev, and T. Sridhar provided useful feedback along the way.]

In light of the seismic shifts introduced by server and network virtualization, many questions pertaining to the role of end hosts and the networking subsystem have come to the fore. Of the many questions raised by network virtualization, a prominent one is this: what function does the physical network provide in network virtualization? This post considers this question through the lens of the end-to-end argument.

Networking and Modern Data Center Applications

There are a few primary lessons learnt from the large scale data centers run by companies such as Amazon, Google, Facebook and Microsoft. The first such lesson is that a physical network built on L3 with equal-cost multipathing (ECMP) is a good fit for the modern data center. These networks provide predictable latency, scale well, converge quickly when nodes or links change, and provide a very fine-grained failure domain.

Second, historically, throwing bandwidth at the problem leads to far simpler networking compared to using complex features to overcome bandwidth limitations. The cost of building such high capacity networks has dropped dramatically in the last few years. By making networks follow the KISS principle, the networks are more robust and can be built out of simple building blocks.

Finally, there is value in moving functions from the network to the edge where there are better semantics, a richer compute model, and lower performance demands. This is evidenced by the applications that are driving data center. Over time, they have subsumed many of the functions that prior generation applications relied on the network for. For example, Hadoop has its own discovery mechanism instead of assuming that all peers are on the same broadcast medium (L2 network). Failure, security and other such characteristics are often built into the application, the compute harness, or the PaaS layer.

There is no debate about the role of networking for such applications. Yes, networks can attempt to do better load spreading or such, but vendors don’t design Hadoop protocol into networking equipment and debate about the performance benefits of doing so.

The story is much different when discussing virtual datacenters (for brevity, we’ll refer to these virtualized datacenters as “clouds” while acknowledging it is a misuse of the term) that host traditional workloads. Here there is active debate as to where functionality should lie.

Network virtualization is a key cloud-enabling technology. Network virtualization does to networking what server virtualization did to servers. It takes the fundamental pieces that constitute networking – addresses and connectivity(including policies that determine connectivity) – and virtualizes them such that many virtual networks can be multiplexed onto a single physical network.

Unlike software layers within modern data center applications that provide similar functionality (although with different abstractions) there is an ongoing discussion on where the virtual network should be implemented. In what follows, we view this discussion in light of the end-to-end argument.

Network Virtualization and the End-to-End Principle

The end-to-end principle ( is a fundamental principle defining the design and functioning of the largest network of them all, the Internet. Over the years since its first inception, in 1984, the principle has been revisited and revised, many times, by the authors themselves and by others. But a fundamental idea it postulated remains as relevant today as when it was first formulated.

With regard to the question of where to place a function, in an application or in the communication subsystem, this is what the original paper says (this oft-quoted section comes at the end of a discussion where the application being discussed is reliable file transfer and the function is reliability): “The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the end points of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)” [Emphasis is by the authors of this post].

Consider the application of this statement to virtual networking. One of the primary pieces of information required in network virtualization is the virtual network ID (or VNI). Let us consider who can provide the information correctly and completely.

In the current world of server virtualization, the network is completely unaware of when a VM is enabled or disabled and therefore joins or leaves (or creates or destroys) a virtual network. Furthermore, since the VM itself is unaware of the fact that it is running virtualized and that the NIC it sees is really a virtual NIC, there is no information in the packet itself that can help a networking device such as a first hop router or switch identify the virtual network solely on the basis of an incoming frame. The hypervisor on the virtualized server is the only one that is aware of this detail and so it is the only one that can correctly implement this function of associating a packet to a virtual network.

Some solutions to network virtualization concede this point that the hypervisor has to be involved in the decision making of which virtual network a packet belongs to. But they’d like to consider a solution in which the hypervisor signals to the first hop switch the desire for a new virtual network and the first hop switch returns back a tag such as a VLAN for the hypervisor to tag the frame with. The first hop switch then uses this tag to be act as the edge of the network virtualization overlay. Let us consider what this entails to the overall system design. As a co-inventor of VxLAN while at Cisco, I’ve grappled with these consequences during the design of such approaches.

The robustness of a system is determined partly by how many touch points are involved when a function has to be performed. In the case of the network virtualization overlay, the only touch points involved are the ones that are directly involved in the communication: the sending and receiving hypervisors. The state about the bringup and teardown of a virtual network and the connectivity matrix of that virtual network do not involve the physical network. Since fewer touchpoints are involved in the functioning of a virtual network, it is easier to troubleshoot and diagnose problems with the virtual network (decomposing it as discussed in an earlier blog post).

Another data point for this argument comes from James Hamilton’s famous cry of “the network is in my way”. His frustration arose partly from the then in-vogue model of virtualizing networks. VLANs which were the primary construct in virtualizing a network involved touching multiple physical nodes to enable a new VLAN to come up. Since a new VLAN coming up could destabilize the existing network by becoming the straw that broke the camel’s back, a cumbersome, manual and lengthy process was required to add a new VLAN. This constrained the agility with which virtual networks could be spun up and spun down. Furthermore, to scale the solution even mildly, required the reinvention of how the primary L2 control protocol, spanning tree, worked (think MSTP).

Besides the technical merits of the end-to-end principle, another of its key consequences is the effect on innovation. It has been eloquently argued many times that services such as Netflix and VoIP are possible largely because the Internet design has the end-to-end principle as a fundamental rubric. Similarly, by looking at network virtualization as an application best implemented by end stations instead of a tight integration with the communication subsystem, it becomes clear that user choice and innovation become possible with this loose coupling. For example, you can choose between various network virtualization solutions when you separate the physical network from the virtual network overlay. And you can evolve the functions at software time scales. Also, you can use the same physical infrastructure for PaaS and IaaS applications instead of designing different networks for each kind. Lack of choice and control has been a driving factor in the revolution underway in networking today. So, this consequence of the end-to-end principle is not an academic point.

The final issue is performance. The end-to-end principal clearly allows functions to be pushed into the network as a performance optimization. This topic deserves a post in itself (we’ve addressed it in pieces before, but not in its entirety), so we’ll just tee up the basic arguments. Of course, if the virtual network overlay provides sufficient performance, there is nothing additional to do. If not, then the question remains of where to put the functionality to improve performance.

Clearly some functions should be in the physical network, such as packet replication, and enforcing QoS priorities. However, in general, we would argue that it is better to extend the end-host programming model (additional hardware instructions, more efficient memory models, etc.) where all end host applications can take advantage of it, than push a subset of the functions into the network and require an additional mechanism for state synchronization to relay edge semantics. But again, we’ll address these issues in a more focused post later.

Wrapping Up

At an architectural level, a virtual network overlay is not much different than functionality that already exists in a big data application. In both cases, functionality that applications have traditionally relied on the network for – discovery, handling failures, visibility and security – have been subsumed into the application.

Ivan Pepelnjak said it first, and said it best when comparing network virtualization to Skype. Network virtualization is better compared to the network layer that has organically evolved within applications than to traditional networking functionality found in switches and routers.

If the word “network” was not used in “virtual network overlay” or if it’s origins hadn’t been so intermixed with the discontentment with VLANs, we wonder if the debate would exist at all.

Of Mice and Elephants

[This post has been written by Martin Casado and Justin Pettit with hugely useful input from Bruce Davie, Teemu Koponen, Brad Hedlund, Scott Lowe, and T. Sridhar]


This post introduces the topic of network optimization via large flow (elephant) detection and handling.  We decompose the problem into three parts, (i) why large (elephant) flows are an important consideration, (ii) smart things we can do with them in the network, and (iii) detecting elephant flows and signaling their presence.  For (i), we explain the basis of elephant and mice and why this matters for traffic optimization. For (ii) we present a number of approaches for handling the elephant flows in the physical fabric, several of which we’re working on with hardware partners.  These include using separate queues for elephants and mice (small flows), using a dedicated network for elephants such as an optical fast path, doing intelligent routing for elephants within the physical network, and turning elephants into mice at the edge. For (iii), we show that elephant detection can be done relatively easily in the vSwitch.  In fact, Open vSwitch has supported per-flow tracking for years. We describe how it’s easy to identify elephant flows at the vSwitch and in turn provide proper signaling to the physical network using standard mechanisms.  We also show that it’s quite possible to handle elephants using industry standard hardware based on chips that exist today.

Finally, we argue that it is important that this interface remain standard and decoupled from the physical network because the accuracy of elephant detection can be greatly improved through edge semantics such as application awareness and a priori knowledge of the workloads being used.

The Problem with Elephants in a Field of Mice

Conventional wisdom (somewhat substantiated by research) is that the majority of flows within the datacenter are short (mice), yet the majority of packets belong to a few long-lived flows (elephants).  Mice are often associated with bursty, latency-sensitive apps whereas elephants tend to be large transfers in which throughput is far more important than latency.

Here’s why this is important.  Long-lived TCP flows tend to fill network buffers end-to-end, and this introduces non-trivial queuing delay to anything that shares these buffers.  In a network of elephants and mice, this means that the more latency-sensitive mice are being affected. A second-order problem is that mice are generally very bursty, so adaptive routing techniques aren’t effective with them.  Therefore routing in data centers often uses stateless, hash-based multipathing such as Equal-cost multi-path routing (ECMP).  Even for very bursty traffic, it has been shown that this approach is within a factor of two of optimal, independent of the traffic matrix.  However, using the same approach for very few elephants can cause suboptimal network usage, like hashing several elephants on to the same link when another link is free.  This is a direct consequence of the law of small numbers and the size of the elephants.

Treating Elephants Differently than Mice 

Most proposals for dealing with this problem involve identifying the elephants, and handling them differently than the mice.  Here are a few approaches that are either used today, or have been proposed:

  1.  Throw mice and elephants into different queues.  This doesn’t solve the problem of hashing long-lived flows to the same link, but it does alleviate the queuing impact on mice by the elephants.  Fortunately, this can be done easily on standard hardware today with DSCP bits.
  2. Use different routing approaches for mice and elephants.  Even though mice are too bursty to do something smart with, elephants are by definition longer lived and are likely far less bursty.  Therefore, the physical fabric could adaptively route the elephants while still using standard hash-based multipathing for the mice.
  3. Turn elephants into mice.  The basic idea here is to split an elephant up into a bunch of mice (for example, by using more than one ephemeral source port for the flow) and letting end-to-end mechanisms deal with possible re-ordering.  This approach has the nice property that the fabric remains simple and uses a single queuing and routing mechanism for all traffic.  Also, SACK in modern TCP stacks handles reordering much better than traditional stacks.  One way to implement this in an overlay network is to modify the ephemeral port of the outer header to create the necessary entropy needed by the multipathing hardware.
  4. Send elephants along a separate physical network.  This is an extreme case of 2.  One method of implementing this is to have two spines in a leaf/spine architecture, and having the top-of-rack direct the flow to the appropriate spine.  Often an optical switch is proposed for the spine.  One method for doing this is to do a policy-based routing decision using  a DSCP value that by convention denotes “elephant”.

Elephant Detection

 At this point it should be clear that handling elephants requires detection of elephants.  It should also be clear that we’ve danced around the question of what exactly characterizes an elephant.  Working backwards from the problem of introducing queuing delays on smaller, latency-sensitive flows, it’s fair to say that an elephant has high throughput for a sustained period.

Often elephants can be determined a priori without actually trying to infer them from network effects.  In a number of the networks we work with, the elephants are either related to cloning, backup, or VM migrations, all of which can be inferred from the edge or are known to the operators.  vSphere, for example, knows that a flow belongs to a migration.  And in Google’s published work on using OpenFlow, they had identified the flows on which they use the TE engine beforehand (reference here).

Dynamic detection is a bit trickier.  Doing it from within the network is hard due to the difficulty of flow tracking in high-density switching ASICs.  A number of sampling methods have been proposed, such as sampling the buffers or using sFlow. However the accuracy of such approaches hasn’t been clear due to the sampling limitations at high speeds.

On the other hand, for virtualized environments (which is a primary concern of ours given that the authors work at VMware), it is relatively simple to do flow tracking within the vSwitch.  Open vSwitch, for example, has supported per-flow granularity for the past several releases now with each flow record containing the bytes and packets sent.  Given a specified threshold, it is trivial for the vSwitch to mark certain flows as elephants.

The More Vantage Points, the Better

It’s important to remember that there is no reason to limit elephant detection to a single approach.  If you know that a flow is large a priori, great.  If you can detect elephants in the network by sampling buffers, great.  If you can use the vSwitch to do per-packet flow tracking without requiring any sampling heuristics, great.  In the end, if multiple methods identify it as an elephant, it’s still an elephant.

For this reason we feel that it is very important that the identification of elephants should be decoupled from the physical hardware and signaled over a standard interface.  The user, the policy engine, the application, the hypervisor, a third party network monitoring system, and the physical network should all be able identify elephants.

Fortunately, this can easily be done relatively simply using standard interfaces.  For example, to affect per-packet handling of elephants, marking the DSCP bits is sufficient, and the physical infrastructure can be configured to respond appropriately.

Another approach we’re exploring takes a more global view.  The idea is for each vSwitch to expose its elephants along with throughput metrics and duration.  With that information, an SDN controller for the virtual edge can identify the heaviest hitters network wide, and then signal them to the physical network for special handling.  Currently, we’re looking at exposing this information within an OVSDB column.

Are Elephants Obfuscated by the Overlay?

No.  For modern overlays, flow-level information, and QoS marking are all available in the outer header and are directly visible to the underlay physical fabric.  Elephant identification can exploit this characteristic.

Going Forward

This is a very exciting area for us.  We believe there is a lot of room to bring to bear edge understanding of workloads, and the ability for  software at the edge to do sophisticated trending analysis to the problem of elephant detection and handling.  It’s early days yet, but our initial forays both with customers and hardware partners, has been very encouraging.

More to come.