Network Virtualization and the End-to-End Principle

[This post was written by Dinesh Dutt with help from Martin Casado.  Dinesh is Chief Scientist at Cumulus Networks. Before that, he was a Cisco Fellow, working on various data center technologies from ASICs to protocols to RFCs. He’s a primary co-author on the TRILL RFC and the VxLAN draft at the IETF.  Sudeep Goswami, Shrijeet Mukherjee, Teemu Koponen, Dmitri Kalintsev, and T. Sridhar provided useful feedback along the way.]

In light of the seismic shifts introduced by server and network virtualization, many questions pertaining to the role of end hosts and the networking subsystem have come to the fore. Of the many questions raised by network virtualization, a prominent one is this: what function does the physical network provide in network virtualization? This post considers this question through the lens of the end-to-end argument.

Networking and Modern Data Center Applications

There are a few primary lessons learnt from the large-scale data centers run by companies such as Amazon, Google, Facebook, and Microsoft. The first is that a physical network built on L3 with equal-cost multipathing (ECMP) is a good fit for the modern data center. These networks provide predictable latency, scale well, converge quickly when nodes or links change, and provide a very fine-grained failure domain.
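To make the ECMP point concrete, here is a minimal sketch of how an L3 fabric pins each flow to one of several equal-cost next hops. The hash choice and names below are illustrative assumptions; real switches use hardware hash functions, not SHA-256.

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Pick a next hop by hashing the flow's five-tuple.

    Hashing per flow (not per packet) spreads load across equal-cost
    paths while keeping each flow's packets in order on one path.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return next_hops[digest % len(next_hops)]

# A leaf switch choosing among four spine uplinks (hypothetical names).
spines = ["spine1", "spine2", "spine3", "spine4"]
path = ecmp_next_hop("10.1.1.2", "10.2.2.9", 49152, 443, 6, spines)
```

Because only the five-tuple is consulted, a link or node failure reshuffles a bounded set of flows, which is part of why the failure domain stays fine-grained.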

Second, historically, throwing bandwidth at the problem has led to far simpler networking than using complex features to overcome bandwidth limitations. The cost of building such high-capacity networks has dropped dramatically in the last few years. Networks that follow the KISS principle are more robust and can be built out of simple building blocks.

Finally, there is value in moving functions from the network to the edge, where there are better semantics, a richer compute model, and lower performance demands. This is evidenced by the applications that are driving the modern data center. Over time, they have subsumed many of the functions that prior-generation applications relied on the network for. For example, Hadoop has its own discovery mechanism instead of assuming that all peers are on the same broadcast medium (an L2 network). Failure handling, security, and other such characteristics are often built into the application, the compute harness, or the PaaS layer.

There is no debate about the role of networking for such applications. Yes, networks can attempt to do better load spreading and the like, but vendors don't build the Hadoop protocol into networking equipment and then debate the performance benefits of doing so.

The story is much different when discussing virtual datacenters (for brevity, we’ll refer to these virtualized datacenters as “clouds” while acknowledging it is a misuse of the term) that host traditional workloads. Here there is active debate as to where functionality should lie.

Network virtualization is a key cloud-enabling technology. Network virtualization does to networking what server virtualization did to servers. It takes the fundamental pieces that constitute networking – addresses and connectivity (including policies that determine connectivity) – and virtualizes them such that many virtual networks can be multiplexed onto a single physical network.
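As a hedged illustration of that multiplexing, consider how a virtual network ID (VNI) lets tenants reuse overlapping address space on one physical network. The table and names below are hypothetical, not any product's schema:

```python
# Two tenants both use 10.0.0.5, yet their traffic never mixes because
# every lookup is scoped by the virtual network ID (VNI).
forwarding_table = {
    (5001, "10.0.0.5"): "hypervisor-a",  # tenant A's 10.0.0.5
    (5002, "10.0.0.5"): "hypervisor-b",  # tenant B's 10.0.0.5
}

def locate(vni: int, ip: str) -> str:
    """Return the physical endpoint hosting this (VNI, address) pair."""
    return forwarding_table[(vni, ip)]
```

The physical network only ever forwards between hypervisors; the (VNI, address) namespace exists purely at the edge.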

Unlike the software layers within modern data center applications that provide similar functionality (although with different abstractions), the virtual network is the subject of an ongoing discussion about where it should be implemented. In what follows, we view this discussion in light of the end-to-end argument.

Network Virtualization and the End-to-End Principle

The end-to-end principle is a fundamental principle defining the design and functioning of the largest network of them all, the Internet. In the years since its formulation in 1984, the principle has been revisited and revised many times, both by the authors themselves and by others. But the fundamental idea it postulated remains as relevant today as when it was first articulated.

With regard to the question of where to place a function, in an application or in the communication subsystem, this is what the original paper says (this oft-quoted section comes at the end of a discussion where the application being discussed is reliable file transfer and the function is reliability): “The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the end points of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)” [Emphasis is by the authors of this post].

Consider the application of this statement to virtual networking. One of the primary pieces of information required in network virtualization is the virtual network ID (or VNI). Let us consider who can provide the information correctly and completely.

In the current world of server virtualization, the network is completely unaware of when a VM is enabled or disabled and therefore joins or leaves (or creates or destroys) a virtual network. Furthermore, since the VM itself is unaware of the fact that it is running virtualized and that the NIC it sees is really a virtual NIC, there is no information in the packet itself that can help a networking device such as a first hop router or switch identify the virtual network solely on the basis of an incoming frame. The hypervisor on the virtualized server is the only one that is aware of this detail and so it is the only one that can correctly implement this function of associating a packet to a virtual network.
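A minimal sketch of that edge function follows, assuming a hypothetical vNIC-to-VNI table maintained by the hypervisor's virtual switch. The header layout follows the VxLAN spec (RFC 7348); everything else is illustrative:

```python
import struct

# Only the hypervisor knows which virtual network each vNIC belongs to;
# nothing in the guest's frame carries this information.
VNIC_TO_VNI = {"vm1-eth0": 5001, "vm2-eth0": 5002}

def vxlan_encap(vnic: str, inner_frame: bytes) -> bytes:
    """Prepend a VXLAN header carrying the 24-bit VNI for this vNIC."""
    vni = VNIC_TO_VNI[vnic]
    # First 32-bit word: flags byte 0x08 (I bit set, meaning VNI is valid).
    # Second 32-bit word: VNI in the upper 24 bits, low byte reserved.
    header = struct.pack("!II", 0x08 << 24, vni << 8)
    return header + inner_frame  # then carried in UDP/IP toward the peer VTEP

packet = vxlan_encap("vm1-eth0", b"\x00" * 64)
```

The lookup in `VNIC_TO_VNI` is exactly the step no physical switch can perform on its own: the mapping exists only in the hypervisor.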

Some solutions to network virtualization concede that the hypervisor has to be involved in deciding which virtual network a packet belongs to. But they'd like a solution in which the hypervisor signals to the first-hop switch the desire for a new virtual network, and the first-hop switch returns a tag, such as a VLAN, for the hypervisor to mark the frame with. The first-hop switch then uses this tag to act as the edge of the network virtualization overlay. Let us consider what this entails for the overall system design. As a co-inventor of VxLAN while at Cisco, I’ve grappled with these consequences during the design of such approaches.

The robustness of a system is determined partly by how many touch points are involved when a function has to be performed. In the case of the network virtualization overlay, the only touch points involved are the ones directly involved in the communication: the sending and receiving hypervisors. The state about the bring-up and teardown of a virtual network, and the connectivity matrix of that virtual network, does not involve the physical network. Since fewer touch points are involved in the functioning of a virtual network, it is easier to troubleshoot and diagnose problems with it (decomposing it as discussed in an earlier blog post).

Another data point for this argument comes from James Hamilton’s famous cry of “the network is in my way”. His frustration arose partly from the then-in-vogue model of virtualizing networks. VLANs, which were the primary construct for virtualizing a network, involved touching multiple physical nodes to bring up a new VLAN. Since a new VLAN coming up could destabilize the existing network, becoming the straw that broke the camel’s back, a cumbersome, manual, and lengthy process was required to add one. This constrained the agility with which virtual networks could be spun up and spun down. Furthermore, scaling the solution even mildly required reinventing how the primary L2 control protocol, spanning tree, worked (think MSTP).

Besides the technical merits of the end-to-end principle, another of its key consequences is its effect on innovation. It has been eloquently argued many times that services such as Netflix and VoIP are possible largely because the Internet’s design has the end-to-end principle as a fundamental tenet. Similarly, by looking at network virtualization as an application best implemented by end stations instead of as a tight integration with the communication subsystem, it becomes clear that user choice and innovation become possible with this loose coupling. For example, you can choose between various network virtualization solutions when you separate the physical network from the virtual network overlay. And you can evolve the functions at software time scales. Also, you can use the same physical infrastructure for PaaS and IaaS applications instead of designing different networks for each kind. Lack of choice and control has been a driving factor in the revolution underway in networking today. So this consequence of the end-to-end principle is not an academic point.

The final issue is performance. The end-to-end principle clearly allows functions to be pushed into the network as a performance optimization. This topic deserves a post in itself (we’ve addressed it in pieces before, but not in its entirety), so we’ll just tee up the basic arguments. Of course, if the virtual network overlay provides sufficient performance, there is nothing additional to do. If not, then the question remains of where to put the functionality to improve performance.

Clearly some functions should be in the physical network, such as packet replication and enforcing QoS priorities. However, in general, we would argue that it is better to extend the end-host programming model (additional hardware instructions, more efficient memory models, etc.), where all end host applications can take advantage of it, than to push a subset of the functions into the network and require an additional mechanism for state synchronization to relay edge semantics. But again, we’ll address these issues in a more focused post later.

Wrapping Up

At an architectural level, a virtual network overlay is not much different from functionality that already exists in a big data application. In both cases, functionality that applications have traditionally relied on the network for – discovery, handling failures, visibility, and security – has been subsumed into the application.

Ivan Pepelnjak said it first, and said it best when comparing network virtualization to Skype. Network virtualization is better compared to the network layer that has organically evolved within applications than to traditional networking functionality found in switches and routers.

If the word “network” were not used in “virtual network overlay”, or if its origins hadn’t been so intermixed with the discontent with VLANs, we wonder whether the debate would exist at all.

19 Comments on “Network Virtualization and the End-to-End Principle”

  1. Great post. One of the things the post reinforced for me is the need to give the application developer as much control of the virtual network as reasonably possible. If the application developer has the freedom to bring up and tear down networks on demand, while assigning new and interesting layer 2 protocols, then we get new and different applications that are less likely to adversely affect the physical network. With these next-generation applications we may actually have greater troubleshooting capability, as we can limit the troubleshooting to a single virtual overlay.

    Further, with these software-based overlays you not only eliminate the reliance on a single vendor but, like the Internet, you eliminate the limitations of physical borders if the underlying bandwidth is available. This again makes possible applications that couldn’t otherwise be created.

    Even though the technical depth of these posts may be a little beyond my pay grade I enjoy the practical applications of the deep dives.

    Great stuff guys.

    • casadomartin says:

      That’s right Keith, pushing this functionality to the edge provides a much stronger coupling with the application. I particularly liked that you pointed out the impact on troubleshooting which I think could be huge. Your comments are always appreciated.

  2. Hi Dinesh

    I agree with most of what you have said here but I have a few questions/ comments if we continue down the same line of thinking.

    1) Do we know whether fully non-blocking Clos networks will be efficiently utilized in most practical scenarios? Wouldn’t one want, for example, to group all VMs of a given tenant on the same server (or nearby servers) in order to optimize network utilization as well as performance? In that case a network designed for a fully non-blocking any-to-any traffic pattern will be over-designed. So while I agree with the notion of throwing bandwidth at the problem, it seems the jury is still out on whether all or most network fabrics need to be designed as fully non-blocking Clos fabrics.

    2) Even if we do start with a fully non-blocking Clos, we have to deal with the network upgrade problem. Is the carrier/network admin going to upgrade all elements of the Clos fabric at the same time, or will it be more practical and economical to upgrade it in pieces? If the latter, we can again have a network that does not look like a non-blocking fabric. Other practicalities (including support for existing/legacy networks) can also cause the network to not be a seamless non-blocking fabric.

    3) It would seem that proposals for centralized “controllers” for the overlays are at odds with the distributed end-to-end principle. Why have any function be centralized and become the bottleneck for innovation as well as performance/scale?

    Would appreciate your thoughts/ feedback.

    Sanjeev Rampal

    • casadomartin says:

      Hey Sanjeev,

      Two comments regarding #3

      a) by my reading, the end-to-end principle is more concerned with where to put functionality (at the edge) than the mechanism for distributing control state. I’m not sure it pertains to the core SDN architectural proposal. For example, it discusses encryption, yet I don’t think a central key server would be considered a violation of the principle.

      b) that said, all of the production overlay solutions I am aware of use a distributed control plane, not centralized.

    • Dinesh Dutt says:

      HI Sanjeev,

      Thanks for making the time to read the posting and ask the questions.

      1) Who said anything about fully non-blocking Clos? And as an example of subsuming the idea of placement into an application, consider Hadoop, which has a “rack affinity” concept to ensure placement of storage nodes so that a single failure doesn’t cause data loss.

      2) People are building networks slightly differently, it seems from the data I have. The Clos topology works well for upgrading: people can add or upgrade a pod/cluster at a time. People are also heading in the direction of building and upgrading data center networks using the server model: with new processor architectures, the economics favor upgrading to the new rather than working with the old or adding complex features to make the old stuff last a little longer. The payoff is much larger at scale. Cost as a reason to avoid upgrades is becoming less interesting. People don’t design Netflix to work at modem speeds; they just refuse to deal with it.

      3) I agree with Martin’s response. If you look at various revisions or refactoring of the E2E argument, the point you make is precisely addressed by them.

      All that said, by building a loosely coupled model between the underlay and overlay, overlays can decide how to best take advantage of the information that underlays provide. With a tightly integrated system between the two, that becomes difficult, if not impossible to achieve.


      • Hi Martin, Dinesh

        Thanks for taking the time, and for the good thoughts. I am sure the debates won’t end here, but I would just summarize as follows.

        1) The E2E principle from Clark et al. makes perfect sense and we agree on that. What we can debate is how much of it applies to network overlays. The original E2E principle was directed at non-routing functionality such as TCP, and it made sense to push that to the ends (and specifically into the end host network stacks). With network overlays, we are not adding any new transport functionality (as TCP did) but simply adding a second network layer (IP over IP, really). So it could be a stretch to say that the E2E principle suggests moving the network layer into hosts when all we are doing is an IP network routing function, not modifying the actual end host networking stacks as TCP did.

        2) Wrt the controllers topic, if we are saying that the controller is also primarily doing a network routing function (tenant discovery or VPN discovery), that is not so compelling imo, since there are existing non-centralized and standardized ways of doing that (BGP VPNs, LISP, etc.). Some alternate controller proposals are looking to add value at higher layers.

        Thanks again for the good discussion.

        • Dinesh Dutt says:

          Hi Sanjeev,

          A strategy I find useful when I encounter a seemingly strange viewpoint is to try and make a case for that argument first. Not a strawman, but a decent one. That usually helps me work through my own objections.

          Cheers and thanks for making the time to engage,


        • Simon Leinen says:

          About your claim that “The original E2E principle was directed at non-routing functionality such as TCP”:
          Well, the seminal paper used reliable transmission as a prime example, and that is indeed a function of TCP. But the paper presents the end-to-end argument as a general one. And the authors of the paper had in fact done prior work where they applied similar thinking to routing: Look at “Source Routing for Campus-Wide Internet Transport” (1980). However, they stopped working on routing, because at that time, routing in general was considered a concern of The Phone Company, and it was unnecessary for some random CS researchers to get into that. Apparently that was also a reason for the “campus-wide” qualification in the title of that paper.
          Note that I wasn’t there and am only relating what I learned from Dave Reed’s postings to the e2e-interest mailing list, I guess. Applying e2e to routing is something that I find hugely interesting and that should be revived. That’s why I need to react when someone says “oh, but everybody knows that principle shouldn’t be applied to routing”.
          Both reliable transmission and flow control were once considered as essential functions of the network layer, and now everybody accepts that they are mostly done at the end hosts (using TCP). I can easily imagine how in a different universe, the same would be true for most of routing – at least path selection.
          (Anyway, ignore end-to-end arguments in any domain at your own risk 🙂)

  3. gcx says:

    Great post. KISS principle definitely works. The point about throwing more bandwidth at the problem to keep things simple is relevant. This has worked well in service provider networks where complexity is pushed to the edge and the core is a simple packet forwarding engine. You can have multiple services such as customer VPNs, Internet edge connections, VoIP/SIP trunking, circuit emulation services etc. running over this core. Trying fancy QoS / traffic engineering mechanisms in the core will only add to complexity and may not yield the desired results.

  4. Paul Gleichauf says:


    I enjoyed reading your arguments for network simplification through over-provisioning and embrace of the end-to-end principle. One of the good things about this blog is that it delves into the challenges, and not just the advantages, of new ways of thinking about networks. In a theoretically unconstrained network design I agree with over-provisioning and the end-to-end principle as (limited) guides to designing networks. Unfortunately, numerous practical considerations force us to make compromises, some of which are fundamentally unavoidable and force a reconsideration of what the first S in KISS really means in context.

    Let me make some (inflammatory?) counter statements to challenge your blog entry and encourage debate.

    Historically, over-provisioning of networks has frequently been proposed as a panacea to simplify the operation of communications networks across a variety of applications and service types. It does not get any easier to predict how much “over” to provision a highly dynamic virtual network with not only a mixture of the normal traffic fluctuations within subnetworks, but also a constantly changing set of (new) applications and subnetworks.

    Over-provisioning bandwidth is an easier solution given some sets of constraints on a network design, but not others. Time-sensitive applications (for example, latency- or jitter-constrained ones) require some form of resource reservation among and along virtual subnetworks competing for the same resources to assure sufficient capacity. The network has both the visibility into, and the ability to arbitrate among, competing demands, and can correlate them to interpret active connections. This type of provisioning is not “over”, nor is it KISS-simple in a dynamic world seeking to optimize business costs and user experience.

    Various types of middle boxes have been preferentially deployed to limit over-provisioning. The examples are numerous and cross service types, though they are closer to the edge than the core. They straddle the functionality of server and network (and storage) nodes.

    An over-provisioned network is more likely to suffer from low statistical utilization by definition. A large category of customers will not rip-and-replace any time soon because of perceived physical and economic constraints. Looking back into the history of server virtualization, one of the most compelling arguments for its spread was the observation that it used to be that data center servers suffered from 15% statistical utilization, and that virtualization of that same server with a mix of applications could be run at north of 75% utilization. Those benefits were not fully reaped until the hardware had virtualization support, but the efficiency argument accelerated server replacements. A similar figure of merit for network virtualization is needed to overcome economic constraints arguments for application use cases such as those proposed by the SP-centric ETSI NFV working group.

    A contrary form of the end-to-end principle could be called the in-the-middle principle through a couple of simple substitutions:

    The function in question can completely and correctly be implemented only with the knowledge and help of the application standing in the middle of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is necessary. (Sometimes an incomplete version of the function provided by the application may be useful as a performance enhancement.)

    It should no longer be heresy to say that the end-to-end principle and the in-the-middle principle are extremes of a more complete theory of how to construct next-generation networks that can be adapted to a broad spectrum of constraints, including how to balance distributed processing versus centralized control. Whether the solutions such a theory produces obey KISS is likely in the eyes of the customer when they total up their running bills for both capital and operational expenses.

    With best regards as always,


    • Dinesh Dutt says:

      Hi Paul,

      You make inflammatory remarks? Na, never 🙂

      The main thrust of this article was an attempt to apply the e2e principle to the debate around network virtualization. The point about throwing bandwidth at the problem, rather than designing complex features into the network, was there to describe what the large data center providers have chosen to do. We didn’t suggest that they over-provision their networks; AFAIK, they use oversubscription ratios that are not 1:1.

      Can the physical network provide QoS based on looking at the virtual network ID (VNI)? Sure; that is what we meant when we said, “Clearly some functions should be in the physical network, such as packet replication and enforcing QoS priorities”. But the underlay does not inherently have a way of identifying the VNI by looking at packets that haven’t already been appropriately tagged by the hypervisor. Our point is that trying to pull that function into the underlay is circuitous and unnecessary. And on that road lie the problems that come from violating the e2e principle.

      Another way of looking at the problem is to ask why underlay folks are not trying to pull Hadoop’s internal protocols into the underlay. We claim that network virtualization is really an application, like Hadoop. That is what frees it to provide the agility (spin-up/spin-down, numbers, new features, etc.) that people want in cloud deployments.

      Does this clarify things?

      Thanks for taking the time to read and comment,


      • Paul Gleichauf says:


        Seeking to add light to the embers, not trying to burn the place down…

        I cut quite a bit out of my original reply in the interests of focusing on points that I did not completely agree with, even if I was sympathetic to the goals, rather than adding praise to points with which I agree. My comments were intended to be extensible to outside the data center, in part because I was unclear about some of the motivation (beyond L3 multipath) in citing Amazon, Google, Facebook, and Microsoft since most of them don’t universally deploy virtualization even on their data center servers today.

        I would love to see an analysis of contemporary over-provisioning patterns across data centers and even into the WAN. And I like, and hopefully reinforced, the idea that mechanisms should be placed where the information is available to take full advantage, whether that is at the edge on your examples (VIDs), or in the network as in mine. Note that I think that whatever is placed in the network should be shaved with a finely honed Occam’s razor.

        My arguments attempted to highlight that network designs are shaped by real and sometimes unavoidable constraints. Depending on point of view the starting point may be “application centric” and legitimately come from the edge inward, or it might be optimized from a “network centric” perspective and worked out toward the edge. As near as I can tell the pendulum is swinging to favor starting from the former.

        When the network, physical and virtual, is regarded as a dynamical system I am skeptical that it is a good idea to make the entire infrastructure reflect the agility of applications at the edge. I am sure we will have interesting continuing discussions about whether network virtualization should be and can be treated as if it were an application.


  6. Bhargav says:

    Nice post.

    1) From what I understand, there seem to be two schools of thought WRT bandwidth:
    A) At one level, we are talking about throwing more bandwidth at the problem to reduce complexity (Clos networks).
    B) At another level, I have heard people talk about better utilization of the links in their DCs. Today I think it is about 20-30%.

    Shouldn’t there be a fine balance between A and B?

    2) Hadoop and Skype are great examples of virtualization, but both assume that there is unlimited underlay bandwidth. I am not sure that assumption is valid. There should be some kind of feedback between physical and virtual. For example, identifying elephant flows at the edge and propagating those to the physical; similarly, the physical can provide feedback to the virtual in certain ways.

    3) It looks like network virtualization would bring great benefits like the ones you have pointed out, but isn’t manageability an issue from a network functions perspective? Now I have N software systems to manage instead of just one HW system.


  7. Mark Smith says:

    I considered e2e quite a lot when I was preparing a presentation to Ausnog last year on what the implementation of MPTCP and encryption on smartphones and tablets (which I generically called “Mobile Multihomed Hosts”) might mean to the network.+
    I came to the conclusion that I think it is really just a restatement of the old truism of “if you want something done properly, you need to do it yourself.”
    In real life, we have to outsource functions, because it is hard for us to gain the expertise necessary to perform functions we are not expert in. The scenario I imagine to demonstrate this is engaging a lawyer to defend you in court – ideally for the best representation (ignoring the influence of your emotion), you would defend yourself, because your evidence and experience would all be first hand, be most accurate and you have the greatest amount of interest in the outcome. By engaging a lawyer you are sacrificing a level of accuracy and ownership as a trade-off to get their expertise.
    Fortunately, unlike us humans, computing devices just need a software update to gain expertise and to be able to take full ownership of functions that matter to them the most.

    + “The Rapid Rise of the Mobile Multihomed Host, and What It Might Mean to the Network”

  8. garegin says:

    The end-to-end principle? Well, we haven’t had that since the advent of NAT. All those middleboxes on the Internet seem to violate that rule as well.

  9. […] players talking about in SDN like Martin Cassado, Joe Onisick, Brad Hedlund and others, plus the many players in the vendor space who are mentioned […]
