[This post was written by Dinesh Dutt with help from Martin Casado. Dinesh is Chief Scientist at Cumulus Networks. Before that, he was a Cisco Fellow, working on various data center technologies from ASICs to protocols to RFCs. He’s a primary co-author on the TRILL RFC and the VxLAN draft at the IETF. Sudeep Goswami, Shrijeet Mukherjee, Teemu Koponen, Dmitri Kalintsev, and T. Sridhar provided useful feedback along the way.]
In light of the seismic shifts introduced by server and network virtualization, many questions pertaining to the role of end hosts and the networking subsystem have come to the fore. Of the many questions raised by network virtualization, a prominent one is this: what function does the physical network provide in network virtualization? This post considers this question through the lens of the end-to-end argument.
Networking and Modern Data Center Applications
There are a few primary lessons learnt from the large scale data centers run by companies such as Amazon, Google, Facebook and Microsoft. The first such lesson is that a physical network built on L3 with equal-cost multipathing (ECMP) is a good fit for the modern data center. These networks provide predictable latency, scale well, converge quickly when nodes or links change, and provide a very fine-grained failure domain.
Second, historically, throwing bandwidth at the problem leads to far simpler networking compared to using complex features to overcome bandwidth limitations. The cost of building such high capacity networks has dropped dramatically in the last few years. By making networks follow the KISS principle, the networks are more robust and can be built out of simple building blocks.
Finally, there is value in moving functions from the network to the edge where there are better semantics, a richer compute model, and lower performance demands. This is evidenced by the applications that are driving data center. Over time, they have subsumed many of the functions that prior generation applications relied on the network for. For example, Hadoop has its own discovery mechanism instead of assuming that all peers are on the same broadcast medium (L2 network). Failure, security and other such characteristics are often built into the application, the compute harness, or the PaaS layer.
There is no debate about the role of networking for such applications. Yes, networks can attempt to do better load spreading or such, but vendors don’t design Hadoop protocol into networking equipment and debate about the performance benefits of doing so.
The story is much different when discussing virtual datacenters (for brevity, we’ll refer to these virtualized datacenters as “clouds” while acknowledging it is a misuse of the term) that host traditional workloads. Here there is active debate as to where functionality should lie.
Network virtualization is a key cloud-enabling technology. Network virtualization does to networking what server virtualization did to servers. It takes the fundamental pieces that constitute networking – addresses and connectivity(including policies that determine connectivity) – and virtualizes them such that many virtual networks can be multiplexed onto a single physical network.
Unlike software layers within modern data center applications that provide similar functionality (although with different abstractions) there is an ongoing discussion on where the virtual network should be implemented. In what follows, we view this discussion in light of the end-to-end argument.
Network Virtualization and the End-to-End Principle
The end-to-end principle (http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf) is a fundamental principle defining the design and functioning of the largest network of them all, the Internet. Over the years since its first inception, in 1984, the principle has been revisited and revised, many times, by the authors themselves and by others. But a fundamental idea it postulated remains as relevant today as when it was first formulated.
With regard to the question of where to place a function, in an application or in the communication subsystem, this is what the original paper says (this oft-quoted section comes at the end of a discussion where the application being discussed is reliable file transfer and the function is reliability): “The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the end points of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)” [Emphasis is by the authors of this post].
Consider the application of this statement to virtual networking. One of the primary pieces of information required in network virtualization is the virtual network ID (or VNI). Let us consider who can provide the information correctly and completely.
In the current world of server virtualization, the network is completely unaware of when a VM is enabled or disabled and therefore joins or leaves (or creates or destroys) a virtual network. Furthermore, since the VM itself is unaware of the fact that it is running virtualized and that the NIC it sees is really a virtual NIC, there is no information in the packet itself that can help a networking device such as a first hop router or switch identify the virtual network solely on the basis of an incoming frame. The hypervisor on the virtualized server is the only one that is aware of this detail and so it is the only one that can correctly implement this function of associating a packet to a virtual network.
Some solutions to network virtualization concede this point that the hypervisor has to be involved in the decision making of which virtual network a packet belongs to. But they’d like to consider a solution in which the hypervisor signals to the first hop switch the desire for a new virtual network and the first hop switch returns back a tag such as a VLAN for the hypervisor to tag the frame with. The first hop switch then uses this tag to be act as the edge of the network virtualization overlay. Let us consider what this entails to the overall system design. As a co-inventor of VxLAN while at Cisco, I’ve grappled with these consequences during the design of such approaches.
The robustness of a system is determined partly by how many touch points are involved when a function has to be performed. In the case of the network virtualization overlay, the only touch points involved are the ones that are directly involved in the communication: the sending and receiving hypervisors. The state about the bringup and teardown of a virtual network and the connectivity matrix of that virtual network do not involve the physical network. Since fewer touchpoints are involved in the functioning of a virtual network, it is easier to troubleshoot and diagnose problems with the virtual network (decomposing it as discussed in an earlier blog post).
Another data point for this argument comes from James Hamilton’s famous cry of “the network is in my way”. His frustration arose partly from the then in-vogue model of virtualizing networks. VLANs which were the primary construct in virtualizing a network involved touching multiple physical nodes to enable a new VLAN to come up. Since a new VLAN coming up could destabilize the existing network by becoming the straw that broke the camel’s back, a cumbersome, manual and lengthy process was required to add a new VLAN. This constrained the agility with which virtual networks could be spun up and spun down. Furthermore, to scale the solution even mildly, required the reinvention of how the primary L2 control protocol, spanning tree, worked (think MSTP).
Besides the technical merits of the end-to-end principle, another of its key consequences is the effect on innovation. It has been eloquently argued many times that services such as Netflix and VoIP are possible largely because the Internet design has the end-to-end principle as a fundamental rubric. Similarly, by looking at network virtualization as an application best implemented by end stations instead of a tight integration with the communication subsystem, it becomes clear that user choice and innovation become possible with this loose coupling. For example, you can choose between various network virtualization solutions when you separate the physical network from the virtual network overlay. And you can evolve the functions at software time scales. Also, you can use the same physical infrastructure for PaaS and IaaS applications instead of designing different networks for each kind. Lack of choice and control has been a driving factor in the revolution underway in networking today. So, this consequence of the end-to-end principle is not an academic point.
The final issue is performance. The end-to-end principal clearly allows functions to be pushed into the network as a performance optimization. This topic deserves a post in itself (we’ve addressed it in pieces before, but not in its entirety), so we’ll just tee up the basic arguments. Of course, if the virtual network overlay provides sufficient performance, there is nothing additional to do. If not, then the question remains of where to put the functionality to improve performance.
Clearly some functions should be in the physical network, such as packet replication, and enforcing QoS priorities. However, in general, we would argue that it is better to extend the end-host programming model (additional hardware instructions, more efficient memory models, etc.) where all end host applications can take advantage of it, than push a subset of the functions into the network and require an additional mechanism for state synchronization to relay edge semantics. But again, we’ll address these issues in a more focused post later.
At an architectural level, a virtual network overlay is not much different than functionality that already exists in a big data application. In both cases, functionality that applications have traditionally relied on the network for - discovery, handling failures, visibility and security – have been subsumed into the application.
Ivan Pepelnjak said it first, and said it best when comparing network virtualization to Skype. Network virtualization is better compared to the network layer that has organically evolved within applications than to traditional networking functionality found in switches and routers.
If the word “network” was not used in “virtual network overlay” or if it’s origins hadn’t been so intermixed with the discontentment with VLANs, we wonder if the debate would exist at all.
[This post has been written by Martin Casado and Justin Pettit with hugely useful input from Bruce Davie, Teemu Koponen, Brad Hedlund, Scott Lowe, and T. Sridhar]
This post introduces the topic of network optimization via large flow (elephant) detection and handling. We decompose the problem into three parts, (i) why large (elephant) flows are an important consideration, (ii) smart things we can do with them in the network, and (iii) detecting elephant flows and signaling their presence. For (i), we explain the basis of elephant and mice and why this matters for traffic optimization. For (ii) we present a number of approaches for handling the elephant flows in the physical fabric, several of which we’re working on with hardware partners. These include using separate queues for elephants and mice (small flows), using a dedicated network for elephants such as an optical fast path, doing intelligent routing for elephants within the physical network, and turning elephants into mice at the edge. For (iii), we show that elephant detection can be done relatively easily in the vSwitch. In fact, Open vSwitch has supported per-flow tracking for years. We describe how it’s easy to identify elephant flows at the vSwitch and in turn provide proper signaling to the physical network using standard mechanisms. We also show that it’s quite possible to handle elephants using industry standard hardware based on chips that exist today.
Finally, we argue that it is important that this interface remain standard and decoupled from the physical network because the accuracy of elephant detection can be greatly improved through edge semantics such as application awareness and a priori knowledge of the workloads being used.
The Problem with Elephants in a Field of Mice
Conventional wisdom (somewhat substantiated by research) is that the majority of flows within the datacenter are short (mice), yet the majority of packets belong to a few long-lived flows (elephants). Mice are often associated with bursty, latency-sensitive apps whereas elephants tend to be large transfers in which throughput is far more important than latency.
Here’s why this is important. Long-lived TCP flows tend to fill network buffers end-to-end, and this introduces non-trivial queuing delay to anything that shares these buffers. In a network of elephants and mice, this means that the more latency-sensitive mice are being affected. A second-order problem is that mice are generally very bursty, so adaptive routing techniques aren’t effective with them. Therefore routing in data centers often uses stateless, hash-based multipathing such as Equal-cost multi-path routing (ECMP). Even for very bursty traffic, it has been shown that this approach is within a factor of two of optimal, independent of the traffic matrix. However, using the same approach for very few elephants can cause suboptimal network usage, like hashing several elephants on to the same link when another link is free. This is a direct consequence of the law of small numbers and the size of the elephants.
Treating Elephants Differently than Mice
Most proposals for dealing with this problem involve identifying the elephants, and handling them differently than the mice. Here are a few approaches that are either used today, or have been proposed:
- Throw mice and elephants into different queues. This doesn’t solve the problem of hashing long-lived flows to the same link, but it does alleviate the queuing impact on mice by the elephants. Fortunately, this can be done easily on standard hardware today with DSCP bits.
- Use different routing approaches for mice and elephants. Even though mice are too bursty to do something smart with, elephants are by definition longer lived and are likely far less bursty. Therefore, the physical fabric could adaptively route the elephants while still using standard hash-based multipathing for the mice.
- Turn elephants into mice. The basic idea here is to split an elephant up into a bunch of mice (for example, by using more than one ephemeral source port for the flow) and letting end-to-end mechanisms deal with possible re-ordering. This approach has the nice property that the fabric remains simple and uses a single queuing and routing mechanism for all traffic. Also, SACK in modern TCP stacks handles reordering much better than traditional stacks. One way to implement this in an overlay network is to modify the ephemeral port of the outer header to create the necessary entropy needed by the multipathing hardware.
- Send elephants along a separate physical network. This is an extreme case of 2. One method of implementing this is to have two spines in a leaf/spine architecture, and having the top-of-rack direct the flow to the appropriate spine. Often an optical switch is proposed for the spine. One method for doing this is to do a policy-based routing decision using a DSCP value that by convention denotes “elephant”.
At this point it should be clear that handling elephants requires detection of elephants. It should also be clear that we’ve danced around the question of what exactly characterizes an elephant. Working backwards from the problem of introducing queuing delays on smaller, latency-sensitive flows, it’s fair to say that an elephant has high throughput for a sustained period.
Often elephants can be determined a priori without actually trying to infer them from network effects. In a number of the networks we work with, the elephants are either related to cloning, backup, or VM migrations, all of which can be inferred from the edge or are known to the operators. vSphere, for example, knows that a flow belongs to a migration. And in Google’s published work on using OpenFlow, they had identified the flows on which they use the TE engine beforehand (reference here).
Dynamic detection is a bit trickier. Doing it from within the network is hard due to the difficulty of flow tracking in high-density switching ASICs. A number of sampling methods have been proposed, such as sampling the buffers or using sFlow. However the accuracy of such approaches hasn’t been clear due to the sampling limitations at high speeds.
On the other hand, for virtualized environments (which is a primary concern of ours given that the authors work at VMware), it is relatively simple to do flow tracking within the vSwitch. Open vSwitch, for example, has supported per-flow granularity for the past several releases now with each flow record containing the bytes and packets sent. Given a specified threshold, it is trivial for the vSwitch to mark certain flows as elephants.
The More Vantage Points, the Better
It’s important to remember that there is no reason to limit elephant detection to a single approach. If you know that a flow is large a priori, great. If you can detect elephants in the network by sampling buffers, great. If you can use the vSwitch to do per-packet flow tracking without requiring any sampling heuristics, great. In the end, if multiple methods identify it as an elephant, it’s still an elephant.
For this reason we feel that it is very important that the identification of elephants should be decoupled from the physical hardware and signaled over a standard interface. The user, the policy engine, the application, the hypervisor, a third party network monitoring system, and the physical network should all be able identify elephants.
Fortunately, this can easily be done relatively simply using standard interfaces. For example, to affect per-packet handling of elephants, marking the DSCP bits is sufficient, and the physical infrastructure can be configured to respond appropriately.
Another approach we’re exploring takes a more global view. The idea is for each vSwitch to expose its elephants along with throughput metrics and duration. With that information, an SDN controller for the virtual edge can identify the heaviest hitters network wide, and then signal them to the physical network for special handling. Currently, we’re looking at exposing this information within an OVSDB column.
Are Elephants Obfuscated by the Overlay?
No. For modern overlays, flow-level information, and QoS marking are all available in the outer header and are directly visible to the underlay physical fabric. Elephant identification can exploit this characteristic.
This is a very exciting area for us. We believe there is a lot of room to bring to bear edge understanding of workloads, and the ability for software at the edge to do sophisticated trending analysis to the problem of elephant detection and handling. It’s early days yet, but our initial forays both with customers and hardware partners, has been very encouraging.
More to come.
[This post was written by Martin Casado and Amar Padmanahban, with helpful input from Scott Lowe, Bruce Davie, and T. Sridhar]
This is the first in a multi-part discussion on visibility and debugging in networks that provide network virtualization, and specifically in the case where virtualization is implemented using edge overlays.
In this post, we’re primarily going to cover some background, including current challenges to visibility and debugging in virtual data centers, and how the abstractions provided by virtual networking provide a foundation for addressing them.
The macro point is that much of the difficulty in visibility and troubleshooting in today’s environments is due to the lack of consistent abstractions that both provide an aggregate view of distributed state and hide unnecessary complexity. And that network virtualization not only provides virtual abstractions that can be used to directly address many of the most pressing issues, but also provides a global view that can greatly aid in troubleshooting and debugging the physical network as well.
A Messy State of Affairs
While it’s common to blame server virtualization for complicating network visibility and troubleshooting, this isn’t entirely accurate. It is quite possible to build a static virtual datacenter and, assuming the vSwitch provides sufficient visibility and control (which they have for years), the properties are very similar to physical networks. Even if VM mobility is allowed, simple distributed switching will keep counters and ACLs consistent.
A more defensible position is that server virtualization encourages behavior that greatly complicates visibility and debugging of networks. This is primarily seen as server virtualization gives way to full datacenter virtualization and, as a result, various forms of network virtualization are creeping in. However, this is often implemented as a collection of disparate (and distributed) mechanisms, without exposing simplified, unified abstractions for monitoring and debugging. And the result of introducing a new layer of indirection without the proper abstractions is, as one would expect, chaos. Our point here is not that network virtualization creates this chaos – as we’ll show below, the reverse can be true, provided one pays attention to the creation of useful abstractions as part of the network virtualization solution.
Let’s consider some of the common visibility issues that can arise. Network virtualization is generally implemented with a tag (for segmentation) or tunneling (introducing a new address space), and this can hide traffic, confuse analysis on end-to-end reachability, and cause double counting (or undercounting) of bytes or even packets. Further, the edge understanding of the tag may change over time, and any network traces collected would therefore become stale unless also updated. Often logically grouped VMs, like those of a single application or tenant, are scattered throughout the datacenter (not necessarily on the same VLAN), and there isn’t any network-visible identifier that signifies the grouping. For example, it can be hard to say something like “mirror all traffic associated with tenant A”, or “how many bytes has tenant A sent”. Similarly, ACLs and other state affecting reachability is distributed across multiple locations (source, destination, vswitches, pswitches, etc.) and can be difficult to analyze in aggregate. Overlapping address spaces, and dynamically assigned IP addresses, preclude any simplistic IP-based monitoring schemes. And of course, dynamic provisioning, random VM placement, and VM mobility can all make matters worse.
Yes, there are solutions to many of these issues, but in aggregate, they can present a real hurdle to smooth operations, billing and troubleshooting in public and private data centers. Fortunately, it doesn’t have to be this way.
Life Becomes Easy When the Abstractions are Right
So much of computer science falls into place when the right abstractions are used. Servers provide a good example of this. Compute virtualization has been around in pieces since the introducing of the operating system. Memory, IO, and the instruction sets have long been virtualized and provide the basis of modern multi-process systems. However, until the popularization of the virtual machine abstraction, these virtualization primitives did not greatly impact the operations of servers themselves. This is because there was no inclusive abstraction that represented a full server (a basic unit of operations in an IT shop). With virtual machines, all state associated with a workload is represented holistically, allowing us to create, destroy, save, introspect, track, clone, modify, limit, etc. Visibility and monitoring in multi-user environments arguably became easier as well. Independent of which applications and operating systems are installed, it’s possible to know exactly how much memory, I/O and CPU a virtual machine is consuming, and that is generally attributed back to a user.
So is it with network virtualization – the virtual network abstraction can provide many of the same benefits as the virtual machine abstraction. However, it also provides an additional benefit that isn’t so commonly enjoyed with server virtualization: network virtualization provides an aggregated view of distributed state. With manual distributed state management being one of the most pressing operational challenges in today’s data centers, this is a significant win.
To illustrate this, we’ll provide a quick primer on network virtualization and then go through an example of visibility and debugging in a network virtualization environment.
Network Virtualization as it Pertains to Visibility and Monitoring
Network virtualization, like server virtualization, exposes a a virtual network that looks like a physical network, but has the operational model of a virtual machine. Virtual networks (if implemented completely) support L2-L7 network functions, complex virtual topologies, counters, and management interfaces. The particular implementation of network virtualization we’ll be discussing is edge overlays, in which the mechanism used to introduce the address space for the virtual domain is an L2 over L3 tunnel mesh terminated at the edge (likely the vswitch). However, the point of this particular post is not to focus on the how the network virtualization is implemented, but rather, how decoupling the logical view from the physical transport affects visibility and troubleshooting.
A virtual network (in most modern implementations, at least) is a logically centralized entity. Consequently, it can be monitored and managed much like a single physical switch. Rx/Tx counters can be monitored to determine usage. ACL counters can be read to determine if something is being dropped due to policy configuration. Mirroring of a virtual switch can siphon off traffic from an entire virtual network independent of where the VMs are or what physical network is being used in the datacenter. And of course, all of this is kept consistent independent of VM mobility or even changes to the physical network.
The introduction of a logically centralized virtual network abstraction addresses many of the problems found in todays virtualized data centers. The virtualization of counters can be used for billing and accounting without worrying about VM movements, the hiding or double counting of traffic, the distribution of VMs and network services across the datacenter. The virtualization of security configuration (e.g. ACLs) and their counters turns a messy distributed state problem into a familiar central rule set. In fact, in the next post, we’ll describe how we use this aggregate view to perform header space analysis to answer sophisticated reachability questions over state which would traditionally be littered throughout the datacenter. The virtualization of management interfaces natively provides accurate, multi-tenant support of existing tool chains (e.g. NetFlow, SNMP, sFlow, etc.), and also resolves the problem of state gathering when VMs are dispersed across a datacenter.
Impact On Workflow
However, as with virtual machines, while some level of sanity has been restored to the environment, the number of entities to monitor has increased. Instead of having to divine what is going on in a single, distributed, dynamic, complex network, there are now multiple, much simpler (relatively static) networks that must be monitored. These network are (a) the physical network, which now only needs to be concerned with packet transport (and thus has become vastly simpler) and (b) the various logical networks implemented on top of it.
In essence, visibility and trouble shooting now much take into account the new logical layer. Fortunately, because virtualization doesn’t change the basic abstractions, existing tools can be used. However, as with the introduction of any virtual layer, there will be times when the mapping of physical resources to virtual ones becomes essential.
We’ll use troubleshooting as an example. Let’s assume that VM A can’t talk to VM B. The steps one takes to determine what goes on are as follows:
- Existing tools are pointed to the effected virtual network and rx/tx counters are inspected as well as any ACLs and forwarding rules. If something in the virtual abstraction is dropping the packets (like an ACL), we know what the problem is, and we’re done.
- If it looks like nothing in the virtual space is dropping the packet, it becomes a physical network troubleshooting problem. The virtual network can now reveal the relevant physical network and paths to monitor. In fact, often this process can be fully automated (as we’ll describe in the next post). In the system we work on, for example, often you can detect which links in the physical network packets are being dropped on (or where some amount of packet loss is occurring) solely from the edge.
A number of network visibility, management, and root cause detection tools are already undergoing the changes needed to make this a one step process form the operators view. However, it is important to understand what’s going on, under the covers.
Wrapping Up for Now
This post was really aimed at teeing up the topic on visibility and debugging in a virtual network environment. In the next point, we’ll go through a specific example of an edge overlay network virtualization solution, and how it provides visibility of the virtual networks, and advanced troubleshooting of the physical network. In future posts, we’ll also cover tool chains that are already being adapted to take advantage of the visibility and troubleshooting gains possible with network virtualization.
[This post was put together by Teemu Koponen, Andrew Lambeth, Rajiv Ramanathan, and Martin Casado]
Scale has been an active (and often contentious) topic in the discourse around SDN (and by SDN we refer to the traditional definition) long before the term was coined. Criticism of the work that lead to SDN argued that changing the model of the control plane from anything but full distribution would lead to scalability challenges. Later arguments reasoned that SDN results in *more* scalable network designs because there is no longer the need to flood the entire network state in order to create a global view at each switch. Finally, there is the common concern that calls into question the scalability of using traditional SDN (a la OpenFlow) to control physical switches due to forwarding table limits.
However, while there has been a lot of talk, there have been relatively few real-world examples to back up the rhetoric. Most arguments appeal to reason, argue (sometimes convincingly) from first principles, or point to related but ultimately different systems.
The goal of this post is to add to the discourse by presenting some scaling data, taken over a two-year period, from a production network virtualization solution that uses an SDN approach. Up front, we don’t want to overstate the significance of this post as it only covers a tiny sliver of the design space. However, it does provide insight into a real system, and that’s always an interesting centerpiece around which to hold a conversation.
Of course, under the broadest of terms, an SDN approach can have the same scaling properties as traditional networking. For example, there is no reason that controllers can’t run traditional routing protocols between them. However, a more useful line of inquiry is around the scaling properties of a system built using an SDN approach that actually benefits from the architecture, and scaling properties of an SDN system that differs from the traditional approach. We briefly touch both of these topics in the discussion below.
The system we’ll be describing underlies the network virtualization platform described here. The core system has been in development for 4-5 years, has been in production for over 2 years, and has been deployed in many different environments.
A Scale Graph
By scale, we’re simply referring to the number of elements (nodes, links, end points, etc.) that a system can handle without negatively impacting runtime (e.g. change in the topology, controller failure, software upgrade, etc.). In the context of network virtualization, the elements under control that we focus on are virtual switches, physical ports, and virtual ports. Below is a graph of the scale numbers for virtual ports and hypervisors under control that we’ve validated over the last two years for one particular use case.
The Y axis to the left is the number of logical ports (ports on logical switches), the Y axis on the right is the number of hypervisors (and therefore virtual switches) under control. We assume that the average number of logical ports per logical switch is constant (in this case 4), although varying that is clearly another interesting metric worth tracking. Of course, these results are in no way exhaustive, as they only reflect one use case that we commonly see in the field. Different configurations will likely have different numbers.
Some additional information about the graph:
- For comparison, the physical analog of this would be 100,000 servers (end hosts), 5,000 ToR switches, 25,000 VLANs and all the fabric ports that connect these ToR switches together.
- The gains in scale from Jan 2011 to Jan 2013 were all done with by improving the scaling properties of a single node. That is, rather than adding more resources by adding controller nodes, the engineering team continued to optimize the existing implementation (data structures, algorithms, language specific overhead, etc,.). However, the controllers were being run as a cluster during that time so they were still incurring the full overhead of consensus and data replication.
- The gains shown for the last two datapoints were only from distribution (adding resources), without any changes to the core scaling properties of a single node. In this case, moving from 3 to 4 and finally 5 nodes.
Raw scaling numbers are rarely interesting as they vary by use case, and the underlying server hardware running the controllers. What we do find interesting, though, is the relative increase in performance over time. In both cases, the increase in scale grows significantly as more nodes are added to the cluster, and as the implementation is tuned and improved.
It’s also interesting to note what the scaling bottlenecks are. While most of the discussion around SDN has focused on fundamental limits of the architecture, we have not found this be a significant contributor either way. That is, at this point we’ve not run into any architectural scaling limitations; rather, what we’ve seen are implementation shortcomings (e.g. inefficient code, inefficient scheduling, bugs) and difficulty in verification of very large networks. In fact, we believe there is significant architectural headroom to continue scaling at a similar pace.
SDN vs. Traditional Protocols
One benefit of SDN that we’ve not seen widely discussed is its ability to enable rapid evolution of solutions to address network scaling issues, especially in comparison to slow-moving standards bodies and multi-year ASIC design/development cycles. This has allowed us to continually modify our system to improve scale while still providing strong consistency guarantees, which are very important for our particular application space.
It’s easy to point out examples in traditional networking where this would be beneficial but isn’t practical in short time periods. For example, consider traditional link state routing. Generally, the topology is assumed to be unknown; for every link change event, the topology database is flooded throughout the network. However, in most environments, the topology is fixed or is slow changing and easily discoverable via some other mechanism. In such environments, the static topology can be distributed to all nodes, and then during link change events only link change data needs to be passed around rather than passing around megs of link state database. Changing this behavior would likely require a change to the RFC. Changes to the RFC, though, would require common agreement amongst all parties, and traditionally results in years of work by a very political standardization process.
For our system, however, as our understanding for the problem grows we’re able to evolve not only the algorithms and data structures used, but the distribution model (which is reflected by the last two points in the graph) and the amount of shared information.
Of course, the tradeoff for this flexibility is that the protocol used between the controllers is no longer standard. Indeed, the cluster looks much more like a tightly coupled distributed software system than a loose collection of nodes. This is why standard interfaces around SDN applications are so important. For network virtualization this would be the northbound side (e.g. Quantum), the southbound side (e.g. ovsdb-conf), and federation between controller clusters.
This is only a snapshot of a very complex topic. The point of this post is not to highlight the scale of a particular system—clearly a moving target—but rather to provide some insight into the scaling properties of a real application built using an SDN approach.
[This post was written with Jesse Gross, Ben Basler, Bruce Davie, and Andrew Lambeth]
Tunneling has earned a bad name over the years in networking circles.
Much of the problem is historical. When a new tunneling mode is introduced in a hardware device, it is often implemented in the slow path. And once it is pushed down to the fastpath, implementations are often encumbered by key or table limits, or sometimes throughput is halved due to additional lookups.
However, none of these problems are intrinsic to tunneling. At its most basic, a tunnel is a handful of additional bits that need to be slapped onto outgoing packets. Rarely, outside of encryption, is there significant per-packet computation required by a tunnel. The transmission delay of the tunnel header is insignificant, and the impact on throughput is – or should be – similarly minor.
In fact, our experience implementing multiple tunneling protocols within Open vSwitch is that it is possible to do tunneling in software with performance and overhead comparable to non encapsulated traffic, and to support hundreds of thousands of tunnel end points.
And that is the goal of this post: to start the discussion on the performance of tunneling in software from the network edge.
An emerging method of network virtualization is to use tunneling from the edges to decoupled the virtual network address space from the physical address space. Often the tunneling is done in software in the hypervisor. Tunneling from within the server has a number of advantages: software tunneling can easily support hundreds of thousands of tunnels, it is not sensitive to key sizes, it can support complex lookup functions and header manipulations, it simplifies the server/switch interface and reduces demands on the in-network switching ASICs, and it naturally offers software flexibility and a rapid development cycle.
An idealized forwarding path is shown in the figure below. We assume that the tunnels are terminated within the hypervisor. The hypervisor is responsible for mapping packets from VIFs to tunnels, and from tunnels to VIFs. The hypervisor is also responsible for the forwarding decision on the outer header (mapping the encapsulated packet to the next physical hop).
Some Performance Numbers for Software Tunneling
The following tests show throughput and cpu overhead for tunneling within Open vSwitch. Traffic was generated with netperf attempting to emulate a high-bandwidth TCP flow. The MTU for the VM and the physical NICs are 1500bytes and the packet payload size is 32k. The test shows results using no tunneling (OVS bridge), GRE, and STT.
The results show aggregate bidirectional throughput, meaning that 20Gbps is a 10G NIC sending and receiving at line rate. All tests where done using Ubuntu 12.04 and KVM on an Intel Xeon 2.40GHz servers interconnected with a Dell 10G switch. We use standard 10G Broadcom NICs. CPU numbers reflect the percentage of a single core used for each of the processes tracked.
The following results show the performance of a single flow between two VMs on different hypervisors. We include the Linux bridge to show that performance is comparable. Note that the CPU only includes the CPU dedicated to switching in the hypervisor and not the overhead in the guest needed to push/consume traffic.
|Throughput||Recv side cpu||Send side cpu|
|Linux Bridge:||9.3 Gbps||85%||75%|
|OVS Bridge:||9.4 Gbps||82%||70%|
This next table shows the aggregate throughput of two hypervisors with 4 VMs each. Since each side is doing both send and receive, we don’t differentiate between the two.
|OVS Bridge:||18.4 Gbps||150%|
Interpreting the Results
Clearly these results (aside from GRE, discussed below) indicate that the overhead of software for tunneling is negligible. It’s easy enough to see why that is so. Tunneling requires copying the tunnel bits onto the header, an extra lookup (at least on receive), and the transmission delay of those extra bits when placing the packet on the wire. When compared to all of the other work that needs to be done during the domain crossing between the guest and the hypervisor, the overhead really is negligible.
In fact, with the right tunneling protocol, the performance is roughly equivalent to non-tunneling, and CPU overhead can even be lower.
STT’s lower CPU usage than non-tunneled traffic is not a statistical anomaly but is actually a property of the protocol. The primary reason is that STT allows for better coalescing on the received side in the common case (since we know how many packets are outstanding). However, the point of this post is not to argue that STT is better than other tunneling protocols, just that if implemented correctly, tunneling can have comparable performance to non-tunneled traffic. We’ll address performance specific aspects of STT relative to other protocols in a future post.
The reason that GRE numbers are so low is that with the GRE outer header it is not possible to take advantage of offload features on most existing NICs (we have discussed this problem in more detail before). However, this is a shortcoming of the NIC hardware in the near term. Next generation NICs will support better tunnel offloads, and in a couple of years, we’ll start to see them show up in LOM.
In the meantime, STT should work on any standard NIC with TSO today.
The point of this post is that at the edge, in software, tunneling overhead is comparable to raw forwarding, and under some conditions it is even beneficial. For virtualized workloads, the overhead of software forwarding is in the noise when compared to all of the other machinations performed by the hypervisor.
Technologies like passthrough are unlikely to have a significant impact on throughput, but they will save CPU cycles on the server. However, that savings comes at a fairly steep cost as we have explained before, and doesn’t play out in most deployment environments.
[This post was written with Bruce Davie]
Network virtualization has been around in some form or other for many years, but it seems of late to be getting more attention than ever. This is especially true in SDN circles, where we frequently hear of network virtualization as one of the dominant use cases of SDN. Unfortunately, as with much of SDN, the discussion has been muddled, and network virtualization is being both conflated with SDN and described as a direct result of it. However, SDN is definitely not network virtualization. And network virtualization does not require SDN.
No doubt, part of the problem is that there is no broad consensus on what network virtualization is. So this post is an attempt to construct a reasonable working definition of network virtualization. In particular, we want to distinguish network virtualization from some related technologies with which it is sometimes confused, and explain how it relates to SDN.
A good place to start is to take a step back and look at how virtualization has been defined in computing. Historically, virtualization of computational resources such as CPU and memory has allowed programmers (and applications) to be freed from the limitations of physical resources. Virtual memory, for example, allows an application to operate under the illusion that it has dedicated access to a vast amount of contiguous memory, even when the physical reality is that the memory is limited, partitioned over multiple banks, and shared with other applications. From the application’s perspective, the abstraction of virtual memory is almost indistinguishable from that provided by physical memory, supporting the same address structure and memory operations.
As another example, server virtualization presents the abstraction of a virtual machine, preserving all the details of a physical machine: CPU cycles, instruction set, I/O, etc.
A key point here is that virtualization of computing hardware preserves the abstractions that were presented by the resources being virtualized. Why is this important? Because changing abstractions generally means changing the programs that want to use the virtualized resources. Server virtualization was immediately useful because existing operating systems could be run on top of the hypervisor without modification. Memory virtualization was immediately useful because the programming model did not have to change.
Virtualization and the Power of New Abstractions
Virtualization should not change the basic abstractions exposed to workloads, however it nevertheless does introduce new abstractions. These new abstractions represent the logical enclosure of the entity being virtualized (for example a process, a logical volume, or a virtual machine). It is in these new abstractions that the real power of virtualization can be found.
So while the most immediate benefit of virtualization is the ability to multiplex hardware between multiple workloads (generally for the efficiency, fault containment or security), the longer term impact comes from the ability of the new abstractions to change the operational paradigm.
Server virtualization provides the most accessible example of this. The early value proposition of hypervisor products was simply server consolidation. However, the big disruption that followed server virtualization was not consolidation but the fundamental change to the operational model created by the introduction of the VM as a basic unit of operations.
This is a crucial point. When virtualizing some set of hardware resources, a new abstraction is introduced, and it will become a basic unit of operation. If that unit is too fine grained (e.g. just exposing logical CPUs) the impact on the operational model will be limited. Get it right, however, and the impact can be substantial.
As it turns out, the virtual machine was the right level of abstraction to dramatically impact data center operations. VMs embody a fairly complete target for the things operational staff want to do with servers: provisioning new workloads, moving workloads, snapshotting workloads, rolling workloads back in time, etc.
- virtualization exposes a logical view of some resource decoupled from the physical substrate without changing the basic abstractions.
- virtualization also introduces new abstractions – the logical container of virtualized resources.
- it is the manipulation of these new abstractions that has the potential to change the operational paradigm.
- the suitability of the new abstraction for simplifying operations is important.
Given this as background, let’s turn to network virtualization.
Network Virtualization, Then and Now
As noted above, network virtualization is an extremely broad and overloaded term that has been in use for decades. Overlays, MPLS, VPNs, VLANs, LISP, Virtual routers, VRFs can all be thought of as network virtualization of some form. An earlier blog post by Bruce Davie (here) touched on the relationship between these concepts and network virtualization as we’re defining it here. The key point of that post is that when employing one of the aforementioned network virtualization primitives, we’re virtualizing some aspect of the network (a LAN segment, an L3 path, an L3 forwarding table, etc.) but rarely a network in its entirety with all its properties.
For example, if you use VLANs to virtualize an L2 segment, you don’t get virtualized counters that stay in sync when a VM moves, or a virtual ACL that keeps working wherever the VM is located. For those sorts of capabilities, you need some other mechanisms.
To put it in the context of the previous discussion, traditional network virtualization mechanisms don’t provide the most suitable operational abstractions. For example, provisioning new workloads or moving workloads still requires operational overhead to update the network state, and this is generally a manual process.
Modern approaches to network virtualization try and address this disconnect. Rather than providing a bunch of virtualized components, network virtualization today tries to provide a suitable basic unit of operations. Unsurprisingly, the abstraction is of a “virtual network”.
To be complete, a virtual network should both support the basic abstractions provided by physical networks today (L2, L3, tagging, counters, ACLs, etc.) as well as introduce a logical abstraction that encompasses all of these to be used as the basis for operation.
And just like the compute analog, this logical abstraction should support all of the operational niceties we’ve come to expect from virtualization: dynamic creation, deletion, migration, configuration, snapshotting, and roll-back.
Cleaning up the Definition of Network Virtualization
Given the previous discussion, we would characterize network virtualization as follows:
- Introduces the concept of a virtual network that is decoupled from the physical network.
- The virtual networks don’t change any of the basic abstractions found in physical networks.
- The virtual networks are exposed as a new logical abstraction that can form a basic unit of operation (creation, deletion, migration, dynamic service insertion, snapshotting, inspection, and so on).
Network Virtualization is not SDN
SDN is a mechanism, and network virtualization is a solution. It is quite possible to have network virtualization solution that doesn’t use SDN, and to use SDN to build a network that has no virtualized properties.
SDN provides network virtualization in about the same way Python does - it’s a tool (and not a mandatory one). That said, SDN does have something to offer as a mechanism for network virtualization.
A simple way to think about the problem of network virtualization is that the solution must map multiple logical abstractions onto the physical network, and keep those abstractions consistent as both the logical and physical worlds change. Since these logical abstractions may reside anywhere in the network, this becomes a fairly complicated state management problem that must be enforced network-wide.
However, managing large amounts of states with reasonable consistency guarantees is something that SDN is particularly good at. It is no coincidence that most of the network virtualization solutions out there (from a variety of vendors using a variety of approaches) have a logically centralized component of some form for state management.
The point of this post was simply to provide some scaffolding around the discussion of network virtualization. To summarize quickly, modern concepts of network virtualization both preserve traditional abstractions and provide a basic unit of operations which is a (complete) virtual network. And that new abstraction should support the same operational abstractions as its computational analog.
While SDN provides a useful approach to building a network virtualization solution, it isn’t the only way. And lets not confuse tools with solutions.
Over the next few years, we expect to see a variety of mechanisms for implementing virtual networking take hold. Some hardware-based, some software-based, some using tunnels, others using tags, some relying more on traditional distributed protocols, others relying on SDN.
In the end, the market will choose the winning mechanism(s). In the meantime, let’s make sure we clarify the dialog so that informed decisions are possible.
On a lark, I compiled a list of “open” OpenFlow software projects that I knew about off-hand, or could find with minimal effort searching online.
You can find the list here.
Unsurprisingly, most of the projects either come directly from or originated at academia or industry research. As I’ve argued before, with respect to standardization, the more design and insight that comes from real code and plugging real holes, the better. And it is still very, very early days in the SDN engineering cycle. So, it is nice to see the diversity in projects and I hope the ecosystem continues to broaden with more controllers and associated projects entering the fray.
If you know of a project that I’ve missed (I’m only listing those that have code or bits available for free online — with the exception of Pica8 which will send you code on request) please mention it in the comments or e-mail me and I’ll add it to the list.
[This post was written with Bruce Davie, and Andrew Lambeth.]
Recently, Jesse Gross, Bruce Davie and a number of contributors submitted the STT draft to the IETF (link to the draft here). STT is an encapsulation format for network virtualization. Unlike other protocols in this space (namely VXLAN and NVGRE), it was designed to be used with soft switching within the server (generally in the vswitch in the hypervisor) while taking advantage of hardware acceleration at the NIC. The goal is to preserve the flexibility and development speed of software while still providing hardware forwarding speeds.
The quick list of differentiators are i) it takes advantage of TSO available in NICs today allowing tunneling from software at 10G while consuming relatively little cpu ii) there are more bits allocated to the virtual network meta data carried per packet, and those bits are unstructured allowing for arbitrary interpretation by software and iii) the control plane is decoupled from the actual encapsulation.
There are a number of other software-centric features like better byte alignment of the headers, but these are not architecturally significant.
Of course, the publication of the draft drew reasonable skepticism on whether the industry needed yet another encapsulation format. And that is the question we will focus on in this post.
But first, let us try to provide a useful decomposition of the virtual networking problem (as it pertains to Distributed Edge Overlays DEO).
Distributed Edge Overlays (DEO)
Distributed edge overlays have gained a lot of traction as a mechanism for network virtualization. A reasonable characterization of the problem space can be found in the IETF nvo3 problem statement draft. Two recent DEO related drafts submitted to the IETF in addition to STT are NVGRE, and VXLAN.
The basic idea is to use tunneling (generally L2 in L3) to create an overlay network from the edge that exposes a virtual view of one or more network to end hosts. The edge may be, for example, a vswitch in the hypervisor, or the first hop physical switch.
DEO solutions can be roughly decomposed into three independent components.
- Encapsulation format:The encapsulation format is what the packet looks like on the wire. The format has implications both on hardware compatibility, and the amount of information that is carried across the tunnel with the packet.As an example of encapsulation, with NVGRE the encapsulation format is GRE where the GRE key is used to store some additional information (the tenant network ID).
- Control plane:The control plane disseminates the state needed to figure out which tunnels to create, which packets go in which tunnels, and what state is associated with packets as they traverse the tunnels. Changes to both the physical and virtual views of the network often require state to be updated and/or moved around the network.There are many ways to implement a control plane, either using traditional protocols (for example, NVGRE and the first VXLAN draft abdicate a lot of control responsibility to multicast), or something more SDN-esque like a centralized datastore, or even a proper SDN controller.
- Logical view:The logical view is what the “virtual network” looks like from the perspective of an end host. In the case of VXLAN and NVGRE, they offer a basic L2 learning domain. However, you can imagine this being extended to L3 (to support very large virtual networks, for example), security policies, and even higher-level services.The logical view defines the network services available to the virtual machine. For example, if only L2 is available, it may not be practical to run workloads of thousands of machines within a single virtual network due to scaling limitations of broadcast. If the virtual network provided L3, it could potentially host such workloads and still provide the benefits of virtualization such as support for VM mobility without requiring IP renumbering, higher-level service interposition (like adding firewalls), and mobile policies.
Before we jump into a justification for STT, we would like to make the point that each of these components really are logically distinct, and a good design should keep them decoupled. Why? For modularity. For example, if a particular encapsulation format has broad hardware support, it would be a shame to restrict it to a single control plane design. It would also be a shame to restrict it to a particular logical network view.
VXLAN and NVGRE or both guilty of this to some extent. For example, the NVGRE and the original VXLAN draft specify multicast as the mechanism to use for part of the control plane (with other parts left unspecified). The latest VXLAN addresses this somewhat, which is a great improvement.
Also, both VXLAN and NVGRE fix the logical forwarding model to L2 going as far as to specify how the logical forwarding tables get populated. Again, this is an unnecessary restriction.
For protocols that are hardware centric (which both VXLAN and NVGRE appear to me), this makes some modicum of sense, lookup space is expensive, and decoupling may require an extra level of indirection. However, for software this is simply bad design.
STT on the other hand limits its focus to the encapsulation format, and does not constrain the other components within the specification.
[Note: The point of this post is not to denigrate VXLAN or NVGRE, but rather to point out that they are comparatively less suited for running within the vswitch. If the full encap/decap and lookup logic is resides fully within hardware, VXLAN and NVGRE are both well designed and reasonable options.]
OK, on to a more detailed justification for STT
To structure the discussion, we’ll step through each logical component of the DEO architecture and describe the design decisions made by STT and how they compare to similar proposals.
Logical view: It turns out that the more information you can tack on to a packet as it transits the network, the richer a logical view you can create. Both NVGRE and VXLAN not only limit the additional information to 32 bits, but they also specify that those bits must contain the logical network ID. This leaves no additional space for specifying other aspects of the logical view that might be interesting to the control plane.
STT differs from NVGRE and VXLAN in two ways. First, it allocates more space to the per-packet metadata. Second, it doesn’t specify how that field is interpreted. This allows the virtual network control plane to use it for state versioning (useful for consistency across multiple switches), additional logical network meta-data, tenant identification, etc.
Of course, having a structured field of limited size makes a lot of sense for NVGRE and VXLAN where it is assumed that encap/decap and interpretation of those bits are likely to be in switching hardware.
However, STT is optimizing for soft switching with hardware accelerating in the NIC. Having larger, unstructured fields provides more flexibility for the software to work with. And, as I’ll describe below, it doesn’t obviate the ability to use hardware acceleration in the NIC to get vastly better performance than a pure software approach.
Control Plane: The STT draft says nothing about the control plane that is used for managing the tunnels and the lookup state feeding into them. This means that securing the control channel, state dissemination, packet replication, etc. are outside of, and thus not constrained by, the spec.
Encapsulation format: This is were STT really shines. STT was designed to take advantage of TSO and LRO engines in existing NICs today. With STT, it is possible to tunnel at 10G from the guest while consuming only a fraction of a CPU core. We’ve seen speedups up to 10x over pure software tunneling.
In other words, STT was designed to let you retain all the high performance features of the NIC when you start tunneling from the edge, while still retaining the flexibility of software to perform the network virtualization functions.
Here is how it works.
When a guest VM sends a packet to the wire, the transitions between the guest and the hypervisor (this is a software domain crossing which requires flushing the TLB, and likely the loss of cache locality, etc.) and the hypervisor and the NIC are relatively expensive. This is why hypervisor vendors take pains to always support TSO all the way up to the guest
Without tunneling, vswitches can take advantage of TSO by exposing a TSO enabled NIC to the guest and then passing large TCP frames to the hardware NIC which performs the segmentation. However, when tunneling is involved, this isn’t possible unless the NIC supports segmentation of the TCP frame within the tunnel in hardware (which hopefully will happen as tunneling protocols get adopted).
With STT, the guests are also exposed to a TSO enabled NIC, however instead of passing the packets directly to the NIC, the vswitch inserts an additional header that looks like a TCP packet, and performs all of the additional network virtualization procedures.
As a result, with STT, the guest ends up sending and receiving massive frames to the hypervisor (up to 64k) which are then encapsulated in software, and ultimately segmented in hardware by the NIC. The result is that the number of domain crossings are reduced by a significant factor in the case of high-throughput TCP flows.
One alternative to going through all this trouble to amortize the guest/hypervisor transistions is to try eliminating them altogether by exposing the NIC HW to the guest, with a technique commonly referred to as passthrough. However, with passthrough software is unable to make any forwarding decisions on the packet before it is sent to the NIC. Passthrough creates a number of problems by exposing the physical NIC to the guest which obviates many of the advantages of NIC virtualization (we describe these shortcomings at length here).
For modern NICs that support TSO and LRO, the only additional overhead that STT provides over sending a raw L2 frame is the memcpy() of the header for encap/decap, and the transmission cost of those additional bytes.
It’s worth pointing out that even if reassembly on the receive side is done in software (which is the case with some NICs), the interrupt coalescing between the hypervisor and the guest is still a significant performance win.
How does this compare to other tunneling proposals? The most significant difference is that NICs don’t support the tunneling protocols today, so they have to be implemented in software which results in a relatively significant performance hit. Eventually NICs will support multiple tunneling protocols, and hopefully they will also support the same stateless (on the send side) TCP segmentation offloading. However, this is unlikely to happen with LOM for awhile.
As a final point, much of STT was designed for efficient processing in software. It contains redundant fields in the header for more efficient lookup and padding to improve byte-alignment on 32-bit boundaries.
So, What’s Not to Like?
STT in it’s current form is a practical hack (more or less). Admittedly, it follows more of a “systems” than a networking aesthetic. More focus was put on practicality, performance, and software processing, than being parsimonious with lookup bits in the header.
As a result, there are definitely some blemishes. For example, because it uses a valid TCP header, but doesn’t have an associated TCP state machine, middleboxes that don’t do full TCP termination are likely to get confused (although it is a little difficult for us to see this as a real shortcoming given all of the other problems passive middleboxes have correctly reconstructing end state). Also, there is currently no simple way to distinguish it from standard TCP traffic (again, a problem for middleboxes). Of course, the opacity of tunnels to middleboxes is nothing new, but these are generally fair criticisms.
In the end, our guess is that abusing existing TSO and LRO engines will not ingratiate STT with traditional networking wonks any time soon …
However, we believe that outside of the contortions needed to be compatible with existing TSO/LRO engines, STT has a suitable design for software based tunneling with hardware offload. Because the protocol does not over-specify the broader system in which the tunnel will sit, as the hardware ecosystem evolves, it should be possible to also evolve the protocol fields themselves (like getting rid of using an actual TCP header and setting the outer IP protocol to 6) without having to rewrite the control plane logic too.
Ultimately, we think there is room for a tunneling protocol that provides the benefits of STT, namely the ability to do processing in software with minimal hardware offload for send and receive segmentation. As long as there is compatible hardware, the particulars or the protocol header are less important. After all, it’s only (mostly) software.
[This post was written with Andrew Lambeth]
Our last post “Networking Doesn’t Need a VMware” made the point that drawing a simple analogy between server and network virtualization can steer the technical discourse on network virtualization in the wrong direction. The sentiment comes from the many partner, analyst, and media meetings we’ve been involved in that persistently focus on relatively uninteresting areas of the network virtualization space, specifically, details of encapsulation formats and lookup pipelines.
In this series of writeups, we take a deeper look and discuss some areas in which network
virtualization would do well to emulate server virtualization. This is a fairly broad topic so we’ll break it up across a couple of posts.
In this part, we’ll focus on address space virtualization.
Quick heads up that the length of this post got a little bit out of hand. For those who don’t have the time/patience/inclination/attention span, the synopsis is as follows:
One of the key strengths of a hypervisor lies in its insertion of a completely new address space below the guest OS’s view of what it believes to be the physical address space. And while there are several possible ways to interpose on network address space to achieve some form of virtualization, encapsulation provides the closest analog to the hierarchical memory virtualization used in compute. It does so by taking advantage of the hierarchy inherent in the physical topology, and allowing both the virtual and physical address spaces to support complete forwarding and addressing models. However, like memory virtualization’s page table, encapsulation requires maintenance of the address mappings (rules to tunnel mappings). The interface for doing so should be open, and a good candidate for that interface is OpenFlow.
Now, onto the detailed argument …
Virtual Memory in Compute
Virtual memory has been a core component of compute virtualization for decades. The basic concept is very simple, multiple (generally sparse) virtual address spaces are multiplexed to a single compact physical address using a page table that contains the between the two.
An OS virtualizes the address space for a process by populating a table with a single level of translations from virtual addresses (VA) to physical addresses (PA). All hypervisors support the ability for a guest OS to continue working in this mode by adding the notion of a third address space called machine addresses (MA) for the true physical addresses.
Since x86 hardware initially supported only a single level of mappings the hypervisor implemented a complete MMU in software to capture and maintain the guest’s VA to PA mappings, and then created a second set of mappings from guest PA to actual hardware MA. What was actually programmed into hardware was VA to MA mappings, in order to not incur overhead during the actual memory references by the guest, but the full heirarchy was maintained so at any time the hypervisor components could easily take an address from any of the three address spaces and map it back to any of the other address spaces.
Having a multi-level hierarchical mapping was so powerful that eventually CPU vendors added support for multi-level page tables in the hardware MMU (called Nested Page Tables on AMD and Extended Page Tables on Intel). Arguably this was the biggest architectural change to CPUs over the last decade.
Benefits of address virtualization in compute
Although this is pretty basic stuff, it is worth enumerating the benefits it provides to see how these can be applied to the networking world.
- It allows the multiplexing of multiple large, sparse address spaces onto a smaller, compact physical address space.
- It supports mobility of a process within a physical address space. This can be used to more efficiently allocated processes to memory, or take advantage of new memory as it is added.
- VMs don’t have to coordinate with other VMs to select their address space. Or more generally, there are no constraints of the virtual address space that can be allocated to a guest VM.
- A virtual memory subsystem provides a basic unit of isolation. A VM cannot mistakingly (or maliciously) address another VM unless another part of the system is busted or compromised.
Address Virtualization for Networking
The point of this post is to explore whether memory virtualization in compute can provide guidance on network virtualization. It would be nice, for example, to retain the same benefits that made memory virtualization so successful.
We’ll start with some common approaches to network virtualization and see how they compare.
Tagging : The basic idea behind tagging is to mark packets at the edge of the network with some bits (the tag) that contains the virtual context. This tag generally encodes a unique identifier for the virtual network and perhaps the virtual ingress port. As the packet traverses the network, the tag is used to segment the forwarding tables to only apply to rules associated with that tag.
A tag does not provide “virtualization” in the same way virtual memory of compute does. Rather it provides segmentation (which was also a phase compute memory went through decades ago). That is, it doesn’t introduce a new address space, but rather segments an existing address space. As a result, the same addresses are used for both physical and virtual purposes. The implications of this can be quite limiting. For example:
- Since the addresses are used to address things in the virtual world *and* the physical world, they have to be exposed to the physical forwarding tables. Therefore the nice property of aggregation that comes with hierarchical address mapping cannot be exploited. Using VLANs as an example, every VM MAC has to be exposed to the hardware putting a lot of pressure on physical L2 tables. If virtual addresses where mapped to a smaller subset of physical addresses instead, the requirement for large L2 tables would go away.
- Due to layering, tags generally only segment a single address space (e.g. L2). Again, because there is no new address space introduced, this means that all of the virtual contexts must all have identical addressing models (e.g. L2) *and* the virtual addresses space must be the same as the physical address space. This unnecessarily couples the virtual and physical worlds. Imagine a case in which the model used to address the virtual domain would not be suitable for physical forwarding. The classic example of this is L2. VMs are given Ethernet NICs and may want to talk using L2 only. However, L2 is generally not a great way of building large fabrics. Introducing an additional address space can provide the desired service model at the virtual realm, and the appropriate forwarding model in the physical realm.
- Another shortcoming of tagging is that you cannot take advantage of mobility through address remapping. In the virtual address realm, you can arbitrarily map a virtual address to a physical address. If the process or VM moves, the mapping just needs to be updated. A tag however does not provide address mobility. The reason that VEPA, VNTAG, etc. support mobility within an L2 domain is by virtue of learned soft state in L2 (which exists whether or not a tag is in use). For layers that don’t support address mobility (say vanilla IP), using a tag won’t somehow enable it.It’s worth pointing out that this is a classic problem within virtualized datacenters today. L2 supports mobility, and L3 often does not. So while both can be segmented using tags, often only the L2 portion allows mobility confining VMs to a given subnet. Most of the approaches to VM mobility across subnets introduce another layer of addressing, whether this is done with LISP, tunnels, etc. However, as soon as encapsulation is introduced, it begs the question of why use tags at all.
There are a number of other differences between tagging and address-level virtualization (like requiring all switches en route to understand the tag), but hopefully you’ve gotten the point. To be more like memory virtualization, a new addressing layer needs to be present, and segmenting the physical address space doesn’t provide the same properties.
Address Mapping : Another method of network virtualization is address mapping. Unlike tagging, a new address space is introduced and mapped onto the physical address space by one or more devices in the network. The most common example of address mapping is NAT, although it doesn’t necessarily have to be limited to L3.
To avoid changing the framing format of the packet, address mapping operates by updating the address in place, meaning that the same field is used for the updated address. However, multiplexing multiple larger address spaces onto a smaller physical address space (for example, a bunch of private IP subnets onto a smaller physical IP subnet) is a lossy operation and therefore requires additional bits to map back from the smaller space to the larger space.
Here in lies much of the complexity (and commensurate shortcoming) of address mapping. Generally, the additional bits are added to a smaller field in the packet, and the full mapping information is stored as state on the device performing translation. This is what NAT does, the original 5 tuple is stashed on the device, and the ephemeral transport port is used to launder a key that points back to that 5 tuple on the return path.
When compared to virtual memory, the model doesn’t hold up particularly well. Within a server, the entire page table is always accessible, so addresses can be mapped back and forth with some additional information to aid in the demultiplexing (generally the pid). With network address translation, the “page table” is created on demand, and stored in a single device. Therefore, in order to support component failover of asymmetric paths, the per-flow state has to be replicated to all other devices that could be on the alternate path. Also, because state is set up during flow initiation from the virtual address, inbound flows cannot be forwarded unless the destination public IP address is effectively “pinned” to a virtual address. As a result, virtual to virtual communication in which the end points are behind different devices have to resort to OOB techniques like NAT punching in order to communicated.
This doesn’t mean that addressing mapping isn’t hugely useful in practice. Clearly for mapping from a virtual address space to a physical address space this is the correct approach. However, if communicating from a virtual address to a virtual address through a physical address space, it it is far more limited than its compute analog.
Tunneling : By tunneling we mean that the payload of a packet is another packet (headers and all).
Like address mapping (and virtual memory) tunneling introduces another address space. However, there are two differences. First, the virtual address space doesn’t have to look anything like the physical. For example, it is possible to have the virtual address space be IPv6 and the physical address space be IPv4. Second, the full address mapping is stored in the packet so that there is no need to create per-flow state within the network to manage the mappings.
Tunneling (or perhaps more broadly encapsulation) is probably the most popular method of doing full network virtualization today. TRILL uses L2 in L2, LISP uses encapsulation, VXLAN (which is LISP as far as I can tell) uses L2 in L3, VCDNI uses L2 in L2, NVGRE uses L2 in L3, etc.
The general approach maps to virtual memory pretty well. The outer header can be likened to a physical address. The inner header can be likened to a virtual address. And often a shim header is included just after the outer header that contains demultiplexing information (the equivalent of a PID). The “page table” consists of on-datapath table table entries which map packets to tunnels.
Take for example an overlay mesh (like NVGRE). The L2 table at the edge that points to the tunnels is effectively mapping from virtual addresses to physical addresses. And there is no reason to limit this lookup to L2, it could provide L2 and L3 in the virtual domain. If a VM moves, this “page table” is updated as it would be in a server of a process is moved.
Because the packets are encapsulated, switches in the “physical domain” only have to deal with the outer header greatly reducing the number of addresses they have to deal with. Switches which contain the “page table” however, have to do lookups in the virtual world (e.g. map from packet to tunnel), map to the physical world (throw a tunnel on the packet) and then forward the packet in the physical world (figure out which port to send the tunneled traffic out on). Note that these are effectively the same steps an MMU takes within a server.
Further, like memory virtualization, tunneling provides nice isolation properties. For example, a VM in the virtual address space cannot address the physical network unless the virtual network is somehow bridged into it.
Also, like hierarchical memory, it is possible for the logical networks to have totally different forwarding stacks. One could be L2-only, the other ipv4, and another ipv6, and all of these could differ from the physical substrate. A common setup has IPv6 run in the virtual domain (for end-to-end addressing), and IPv4 for the physical fabric (where a large address space isn’t needed).
Fortunately, we already have a a multi-level hierarchical topology in the network (in particularly the datacenter). This allows for either the addition of a new layer at the edge which provides a virtual address space without changing any other components, or just changing the components at the outer edge of the hierarchy.
So what are the shortcomings of this approach? The most immediate are performance issues with tunneling, additional overhead overhead in the packet, and the need to maintain the distributed “page table”.
Tunneling performance is no longer the problem it use to be. Even from software in the server, clever tunnel implementations are able to take advantage of LRO and the like and can achieve 10G performance without much CPU. Hardware tunneling on most switching chipsets performs at line rate as well. Header overhead marginally increases transmission delay, and will reduce total throughput if a link is saturated.
Keeping the “page tables” up to date on the other hand is still something the industry is grappling with. Most standards punt on describing how this is done, or even the interface to use to write to the “page table”. Some piggyback on L2 learning (like NVGRE) to construct some of the state but still rely on an out of band mechanism for other state (in the case of NVGRE the VIF to tenantID mapping).
As we’ve said many times in the past, this interface should be standardized if for no other reason than to provide modularity in the architecture. Not doing that would be like limiting a hardware MMU on a CPU to only work with a single Operating System.
Fortunately, Microsoft realizes this need and has opened the interface to their vswitch on Windows Server 8, and XenServer, KVM and other Linux-based hypervisors support Open vSwitch.
For hardware platforms it would also be nice to expose the ability to manage this state. Of course, we would argue that OpenFlow is a good candidate if for no other reason than it has been already used for this purpose successfully.
Wrapping Up ..
If the analogy of memory virtualization in compute holds then encapsulation is the correct way to go about network virtualization. It introduces a new address space that is unrestricted in forwarding model, it takes advantage of hierarchy inherent in the physical topology allowing for address aggregation further up in the hierarchy, and it provides the basic virtualization properties of isolation and transparent mobility.
In the next post of this series, we’ll explore how server virtualization suggests that the way that encapsulation is implemented today isn’t quite right, and some things we may want to do to fix it.
[This post was written with Andrew Lambeth. Andrew has been virtualizing networking for long enough to have coined the term "vswitch", and led the vDS distributed switching project at VMware. ]
Or at least, it doesn’t need to solve the problem in the same way.
It’s commonly said that “networking needs a VMWare”. Hell, there have been occasions in which we’ve said something very similar. However, while the analogy has an obvious appeal (virtual, flexible, thin layer of indirection in software, commoditize, commoditize, commoditize!), a closer look suggests that it draws from a very superficial understanding of the technology, and in the limit, it doesn’t make much sense.
It’s no surprise that many are drawn to this line of thought. It probably stems from the realization that virtualizing the network rather than managing the physical components is the right direction for networks to evolve. On this point, it appears there is broad agreement. In order to bring networking up to the operational model of compute (and perhaps disrupt the existing supply chain a bit) virtualization is needed.
Beyond this gross comparison, however, the analogy breaks down. The reality is that the technical requirements for server virtualization and network virtualization are very, very different.
Server Virtualization vs. Network Virtualization
With server virtualization, virtualizing CPU, memory and device I/O is incredibly complex, and the events that need to be handled with translation or emulation happen at CPU cycle timescale. So the virtualization logic must be both highly sophisticated and highly performant on the “datapath” (the datapath for compute virtualization being the instruction stream and I/O events).
On the other hand, the datapath operations for network virtualization are almost trivially simple. All they involve is mapping one address/context space to another address/context space. This effectively reduces to an additional header on the packet (or tag), and one or two more lookups on the datapath. Somewhat revealing of this simplicity, there are multiple reasonable solutions that address the datapath component, NVGRE, and VXLAN being two recently publicized proposals.
If the datapath is so simple, it’s reasonable to ask why network virtualization isn’t already a solved problem.
The answer, is that there is a critical difference between network virtualization and server virtualization and that difference is where the bulk of complexity for network virtualization resides.
What is that difference?
Virtualized servers are effectively self contained in that they are only very loosely coupled to one another (there are a few exceptions to this rule, but even then, the groupings with direct relationships are small). As a result, the virtualization logic doesn’t need to deal with the complexity of state sharing between many entities.
A virtualized network solution, on the other hand, has to deal with all ports on the network, most of which can be assumed to have a direct relationship (the ability to communicate via some service model). Therefore, the virtual networking logic not only has to deal with N instances of N state (assuming every port wants to talk to every other port), but it has to ensure that state is consistent (or at least safely inconsistent) along all of the elements on the path of a packet. Inconsistent state can result in packet loss (not a huge deal) or much worse, delivery of the packet to the wrong location.
It’s important to remember that networking traditionally has only had to deal with eventual consistency. That is “after state change, the network will take some time to converge, and until that time, all bets are off”. Eventual consistency is fine for basic forwarding provided that loops are prevented using a TTL, or perhaps the algorithm ensures loop freedom while it is converging. However, eventual consistency doesn’t work so well with virtualization. During failure, for example, it would suck if packets from tenant A managed to leak over to tenant B’s network. It would also suck if ACLs configured in tenant A’s were not enforced correctly during convergence.
So simply, the difference between server virtualization, and network virtualization is that network virtualization is all about scale (dealing with the complexity of many interconnected entities which is generally a N2 problem), and it is all about distributed state consistency. Or more concretely, it is a distributed state management problem rather than a low level exercise in dealing with the complexities of various hardware devices.
Of course, depending on the layer of networking being virtualized, the amount of state that has to be managed varies.
All network virtualization solutions have to handle basic address mapping. That is, provide a virtual address space (generally addresses of the packets within the tunnel) and the physical address space (the external tunnel header), and a mapping between the two (virtual address X is at physical address Y). Any of the many tunnel overlays solutions, whether ad hoc, proprietary, or standardized provide this basic mapping service.
To then virtualize L2, requires almost no additional state management. The L2 forwarding tables are dynamically populated from passing traffic. And the size of a single broadcast domain has fairly limited scale, supporting hundreds or low thousands of active MACs. So the only additional state that has to be managed is the association of a port (virtual or physical) to a broadcast domain which is what virtual networking standards like NVGRE and VXLAN provide.
As an aside, it’s a shame that standards like NVGRE and VXLAN choose to dictate the wire format (important for hardware compatibility) and the method for managing the context mapping between address domains (multicast), but not the control interface to manage the rest of the state. Specifying the wire format is fine. However, requiring a specific mechanism (and a shaky one at that) for managing the virtual to physical address mappings severely limits the solution space. And not specifying the control interface for managing the rest of the state effectively guarantees that implementations will be vertically integrated and proprietary.
For L3, there is a lot more state to deal with, and the number of end points to which this state applies can be very large. There are a number of datacenters today who have, or plan to have, millions of VMs. Because of this, any control plane that hopes to offer a virtualized L3 solution needs to manage potentially millions of entries at hundreds of thousands of end points (assuming the first hop network logic is within the vswitch). Clearly, scale is a primary consideration.
As another aside, in our experience, there is a lot of confusion on what exactly L3 virtualization is. While a full discussion will have to wait for a future post, it is worth pointing out that running a router as a VM is *not* network virtualization, it is x86 virtualization. Network virtualzation involves mapping between network address contexts in a manner that does not effect the total available bandwidth of the physical fabric. Running a networking stack in a virtual machine, while it does provide the benefits of x86 virtualization, limits the cross-sectional bandwidth of the emulated network to the throughput of a virtual machine. Ouch.
For L4 and above, the amount of state that has to be shared and the rate that it changes increases again by orders of magnitude. Take, for example, WAN optimization. A virtualized WAN optimization solution should be enforced throughout the network (for example, each vswitch running a piece of it) yet this would incur a tremendous amount of control overhead to create a shared content cache.
So while server virtualization lives and dies by the ability to deal with the complexity of virtualizing complex hardware interfaces of many devices at speed, network virtualization’s primary technical challenge is scale. Any solution that doesn’t deal with this up front will probably run into a wall at L2, or with some luck, basic L3.
This is all interesting … but why do I care?
Full virtualization of the network address space and service model is still a relatively new area. However, rather than tackle the problem of network virtualization directly, it appears that a fair amount of energy in industry is being poured into point solutions. This reminds us of the situation 10 years ago when many people were trying to solve server sprawl problems with application containers, and the standard claim was that virtualization didn’t offer additional benefits to justify the overhead and complexity of fully virtualizing the platform. Had that mindset prevailed, today we’d have solutions doing minimal server consolidation for a small handful of applications on only one or possibly two OSs, instead of a set of solutions that solve this and many many more problems for any application and most any OS. That mindset, for example could never have produced vMotion, which was unimaginable at the outset of server virtualization.
At the same time, those who are advocating for network virtualization tend to draw technical comparisons with server virtualization. And while clearly there is a similarity at the macro level, this comparison belies the radically different technical challenges of the two problems. And it belies the radically different approaches needed to solve the two problems. Network virtualization is not the same as server virtualization any more than server virtualization is the same as storage virtualization. Saying “the network needs a VMware” in 2012 is a little like saying “the x86 needs an EMC” in 2002.
Perhaps the confusion is harmless, but it does seem to effect how the solution space is viewed, and that may be drawing the conversation away from what really is important, scale (lots of it) and distributed state consistency. Worrying about the datapath , is worrying about a trivial component of an otherwise enormously challenging problem.