[This post was written by Martin Casado and Amar Padmanahban, with helpful input from Scott Lowe, Bruce Davie, and T. Sridhar]
This is the first in a multi-part discussion on visibility and debugging in networks that provide network virtualization, and specifically in the case where virtualization is implemented using edge overlays.
In this post, we’re primarily going to cover some background, including current challenges to visibility and debugging in virtual data centers, and how the abstractions provided by virtual networking provide a foundation for addressing them.
The macro point is that much of the difficulty in visibility and troubleshooting in today’s environments is due to the lack of consistent abstractions that both provide an aggregate view of distributed state and hide unnecessary complexity. And that network virtualization not only provides virtual abstractions that can be used to directly address many of the most pressing issues, but also provides a global view that can greatly aid in troubleshooting and debugging the physical network as well.
A Messy State of Affairs
While it’s common to blame server virtualization for complicating network visibility and troubleshooting, this isn’t entirely accurate. It is quite possible to build a static virtual datacenter and, assuming the vSwitch provides sufficient visibility and control (which they have for years), the properties are very similar to physical networks. Even if VM mobility is allowed, simple distributed switching will keep counters and ACLs consistent.
A more defensible position is that server virtualization encourages behavior that greatly complicates visibility and debugging of networks. This is primarily seen as server virtualization gives way to full datacenter virtualization and, as a result, various forms of network virtualization are creeping in. However, this is often implemented as a collection of disparate (and distributed) mechanisms, without exposing simplified, unified abstractions for monitoring and debugging. And the result of introducing a new layer of indirection without the proper abstractions is, as one would expect, chaos. Our point here is not that network virtualization creates this chaos – as we’ll show below, the reverse can be true, provided one pays attention to the creation of useful abstractions as part of the network virtualization solution.
Let’s consider some of the common visibility issues that can arise. Network virtualization is generally implemented with a tag (for segmentation) or tunneling (introducing a new address space), and this can hide traffic, confuse analysis on end-to-end reachability, and cause double counting (or undercounting) of bytes or even packets. Further, the edge understanding of the tag may change over time, and any network traces collected would therefore become stale unless also updated. Often logically grouped VMs, like those of a single application or tenant, are scattered throughout the datacenter (not necessarily on the same VLAN), and there isn’t any network-visible identifier that signifies the grouping. For example, it can be hard to say something like “mirror all traffic associated with tenant A”, or “how many bytes has tenant A sent”. Similarly, ACLs and other state affecting reachability is distributed across multiple locations (source, destination, vswitches, pswitches, etc.) and can be difficult to analyze in aggregate. Overlapping address spaces, and dynamically assigned IP addresses, preclude any simplistic IP-based monitoring schemes. And of course, dynamic provisioning, random VM placement, and VM mobility can all make matters worse.
Yes, there are solutions to many of these issues, but in aggregate, they can present a real hurdle to smooth operations, billing and troubleshooting in public and private data centers. Fortunately, it doesn’t have to be this way.
Life Becomes Easy When the Abstractions are Right
So much of computer science falls into place when the right abstractions are used. Servers provide a good example of this. Compute virtualization has been around in pieces since the introducing of the operating system. Memory, IO, and the instruction sets have long been virtualized and provide the basis of modern multi-process systems. However, until the popularization of the virtual machine abstraction, these virtualization primitives did not greatly impact the operations of servers themselves. This is because there was no inclusive abstraction that represented a full server (a basic unit of operations in an IT shop). With virtual machines, all state associated with a workload is represented holistically, allowing us to create, destroy, save, introspect, track, clone, modify, limit, etc. Visibility and monitoring in multi-user environments arguably became easier as well. Independent of which applications and operating systems are installed, it’s possible to know exactly how much memory, I/O and CPU a virtual machine is consuming, and that is generally attributed back to a user.
So is it with network virtualization – the virtual network abstraction can provide many of the same benefits as the virtual machine abstraction. However, it also provides an additional benefit that isn’t so commonly enjoyed with server virtualization: network virtualization provides an aggregated view of distributed state. With manual distributed state management being one of the most pressing operational challenges in today’s data centers, this is a significant win.
To illustrate this, we’ll provide a quick primer on network virtualization and then go through an example of visibility and debugging in a network virtualization environment.
Network Virtualization as it Pertains to Visibility and Monitoring
Network virtualization, like server virtualization, exposes a a virtual network that looks like a physical network, but has the operational model of a virtual machine. Virtual networks (if implemented completely) support L2-L7 network functions, complex virtual topologies, counters, and management interfaces. The particular implementation of network virtualization we’ll be discussing is edge overlays, in which the mechanism used to introduce the address space for the virtual domain is an L2 over L3 tunnel mesh terminated at the edge (likely the vswitch). However, the point of this particular post is not to focus on the how the network virtualization is implemented, but rather, how decoupling the logical view from the physical transport affects visibility and troubleshooting.
A virtual network (in most modern implementations, at least) is a logically centralized entity. Consequently, it can be monitored and managed much like a single physical switch. Rx/Tx counters can be monitored to determine usage. ACL counters can be read to determine if something is being dropped due to policy configuration. Mirroring of a virtual switch can siphon off traffic from an entire virtual network independent of where the VMs are or what physical network is being used in the datacenter. And of course, all of this is kept consistent independent of VM mobility or even changes to the physical network.
The introduction of a logically centralized virtual network abstraction addresses many of the problems found in todays virtualized data centers. The virtualization of counters can be used for billing and accounting without worrying about VM movements, the hiding or double counting of traffic, the distribution of VMs and network services across the datacenter. The virtualization of security configuration (e.g. ACLs) and their counters turns a messy distributed state problem into a familiar central rule set. In fact, in the next post, we’ll describe how we use this aggregate view to perform header space analysis to answer sophisticated reachability questions over state which would traditionally be littered throughout the datacenter. The virtualization of management interfaces natively provides accurate, multi-tenant support of existing tool chains (e.g. NetFlow, SNMP, sFlow, etc.), and also resolves the problem of state gathering when VMs are dispersed across a datacenter.
Impact On Workflow
However, as with virtual machines, while some level of sanity has been restored to the environment, the number of entities to monitor has increased. Instead of having to divine what is going on in a single, distributed, dynamic, complex network, there are now multiple, much simpler (relatively static) networks that must be monitored. These network are (a) the physical network, which now only needs to be concerned with packet transport (and thus has become vastly simpler) and (b) the various logical networks implemented on top of it.
In essence, visibility and trouble shooting now much take into account the new logical layer. Fortunately, because virtualization doesn’t change the basic abstractions, existing tools can be used. However, as with the introduction of any virtual layer, there will be times when the mapping of physical resources to virtual ones becomes essential.
We’ll use troubleshooting as an example. Let’s assume that VM A can’t talk to VM B. The steps one takes to determine what goes on are as follows:
- Existing tools are pointed to the effected virtual network and rx/tx counters are inspected as well as any ACLs and forwarding rules. If something in the virtual abstraction is dropping the packets (like an ACL), we know what the problem is, and we’re done.
- If it looks like nothing in the virtual space is dropping the packet, it becomes a physical network troubleshooting problem. The virtual network can now reveal the relevant physical network and paths to monitor. In fact, often this process can be fully automated (as we’ll describe in the next post). In the system we work on, for example, often you can detect which links in the physical network packets are being dropped on (or where some amount of packet loss is occurring) solely from the edge.
A number of network visibility, management, and root cause detection tools are already undergoing the changes needed to make this a one step process form the operators view. However, it is important to understand what’s going on, under the covers.
Wrapping Up for Now
This post was really aimed at teeing up the topic on visibility and debugging in a virtual network environment. In the next point, we’ll go through a specific example of an edge overlay network virtualization solution, and how it provides visibility of the virtual networks, and advanced troubleshooting of the physical network. In future posts, we’ll also cover tool chains that are already being adapted to take advantage of the visibility and troubleshooting gains possible with network virtualization.
[This post was put together by Teemu Koponen, Andrew Lambeth, Rajiv Ramanathan, and Martin Casado]
Scale has been an active (and often contentious) topic in the discourse around SDN (and by SDN we refer to the traditional definition) long before the term was coined. Criticism of the work that lead to SDN argued that changing the model of the control plane from anything but full distribution would lead to scalability challenges. Later arguments reasoned that SDN results in *more* scalable network designs because there is no longer the need to flood the entire network state in order to create a global view at each switch. Finally, there is the common concern that calls into question the scalability of using traditional SDN (a la OpenFlow) to control physical switches due to forwarding table limits.
However, while there has been a lot of talk, there have been relatively few real-world examples to back up the rhetoric. Most arguments appeal to reason, argue (sometimes convincingly) from first principles, or point to related but ultimately different systems.
The goal of this post is to add to the discourse by presenting some scaling data, taken over a two-year period, from a production network virtualization solution that uses an SDN approach. Up front, we don’t want to overstate the significance of this post as it only covers a tiny sliver of the design space. However, it does provide insight into a real system, and that’s always an interesting centerpiece around which to hold a conversation.
Of course, under the broadest of terms, an SDN approach can have the same scaling properties as traditional networking. For example, there is no reason that controllers can’t run traditional routing protocols between them. However, a more useful line of inquiry is around the scaling properties of a system built using an SDN approach that actually benefits from the architecture, and scaling properties of an SDN system that differs from the traditional approach. We briefly touch both of these topics in the discussion below.
The system we’ll be describing underlies the network virtualization platform described here. The core system has been in development for 4-5 years, has been in production for over 2 years, and has been deployed in many different environments.
A Scale Graph
By scale, we’re simply referring to the number of elements (nodes, links, end points, etc.) that a system can handle without negatively impacting runtime (e.g. change in the topology, controller failure, software upgrade, etc.). In the context of network virtualization, the elements under control that we focus on are virtual switches, physical ports, and virtual ports. Below is a graph of the scale numbers for virtual ports and hypervisors under control that we’ve validated over the last two years for one particular use case.
The Y axis to the left is the number of logical ports (ports on logical switches), the Y axis on the right is the number of hypervisors (and therefore virtual switches) under control. We assume that the average number of logical ports per logical switch is constant (in this case 4), although varying that is clearly another interesting metric worth tracking. Of course, these results are in no way exhaustive, as they only reflect one use case that we commonly see in the field. Different configurations will likely have different numbers.
Some additional information about the graph:
- For comparison, the physical analog of this would be 100,000 servers (end hosts), 5,000 ToR switches, 25,000 VLANs and all the fabric ports that connect these ToR switches together.
- The gains in scale from Jan 2011 to Jan 2013 were all done with by improving the scaling properties of a single node. That is, rather than adding more resources by adding controller nodes, the engineering team continued to optimize the existing implementation (data structures, algorithms, language specific overhead, etc,.). However, the controllers were being run as a cluster during that time so they were still incurring the full overhead of consensus and data replication.
- The gains shown for the last two datapoints were only from distribution (adding resources), without any changes to the core scaling properties of a single node. In this case, moving from 3 to 4 and finally 5 nodes.
Raw scaling numbers are rarely interesting as they vary by use case, and the underlying server hardware running the controllers. What we do find interesting, though, is the relative increase in performance over time. In both cases, the increase in scale grows significantly as more nodes are added to the cluster, and as the implementation is tuned and improved.
It’s also interesting to note what the scaling bottlenecks are. While most of the discussion around SDN has focused on fundamental limits of the architecture, we have not found this be a significant contributor either way. That is, at this point we’ve not run into any architectural scaling limitations; rather, what we’ve seen are implementation shortcomings (e.g. inefficient code, inefficient scheduling, bugs) and difficulty in verification of very large networks. In fact, we believe there is significant architectural headroom to continue scaling at a similar pace.
SDN vs. Traditional Protocols
One benefit of SDN that we’ve not seen widely discussed is its ability to enable rapid evolution of solutions to address network scaling issues, especially in comparison to slow-moving standards bodies and multi-year ASIC design/development cycles. This has allowed us to continually modify our system to improve scale while still providing strong consistency guarantees, which are very important for our particular application space.
It’s easy to point out examples in traditional networking where this would be beneficial but isn’t practical in short time periods. For example, consider traditional link state routing. Generally, the topology is assumed to be unknown; for every link change event, the topology database is flooded throughout the network. However, in most environments, the topology is fixed or is slow changing and easily discoverable via some other mechanism. In such environments, the static topology can be distributed to all nodes, and then during link change events only link change data needs to be passed around rather than passing around megs of link state database. Changing this behavior would likely require a change to the RFC. Changes to the RFC, though, would require common agreement amongst all parties, and traditionally results in years of work by a very political standardization process.
For our system, however, as our understanding for the problem grows we’re able to evolve not only the algorithms and data structures used, but the distribution model (which is reflected by the last two points in the graph) and the amount of shared information.
Of course, the tradeoff for this flexibility is that the protocol used between the controllers is no longer standard. Indeed, the cluster looks much more like a tightly coupled distributed software system than a loose collection of nodes. This is why standard interfaces around SDN applications are so important. For network virtualization this would be the northbound side (e.g. Quantum), the southbound side (e.g. ovsdb-conf), and federation between controller clusters.
This is only a snapshot of a very complex topic. The point of this post is not to highlight the scale of a particular system—clearly a moving target—but rather to provide some insight into the scaling properties of a real application built using an SDN approach.
Two weeks ago I gave a short presentation at the Open Networking Summit. With only 15 minutes allocated per speaker, I wasn’t sure I’d be able to make much of an impact. However, there has been a lot of reaction to the talk – much of it positive – so I’m posting the slides here and including them below. A video of the presentation is also available in the ONS video archive (free registration required).
[This post was written by JR Rivers, Bruce Davie, and Martin Casado]
One of the important characteristics of network virtualization is the decoupling of network services from the underlying physical network. That decoupling is fundamental to the definition of network virtualization: it’s the delivery of network services independent of a physical network that makes those services virtual. Furthermore, many of the benefits of virtualization – such as the ability to move network services along with the workloads that need those services, without touching hardware – follow directly from this decoupling.
In spite of all the benefits that flow from decoupling virtual networks from the underlying physical network, we occasionally hear the concern that something has been lost by not having more direct interaction with the physical network. Indeed, we’ve come across a common intuition that applications would somehow be better off if they could directly control what the physical network is doing. The goal of this post is to explain why we disagree with this view.
It’s worth noting that this idea of getting networks to do something special for certain applications is hardly a novel idea. Consider the history of Voice over IP as an example. It wasn’t that long ago when using Ethernet for phone calls was a research project. Advances in the capacity of both the end points as well as the underlying physical network changed all of that and today VOIP is broadly utilized by consumers and enterprises around the world. Let’s break down the architecture that enabled VOIP.
A call starts with end-points (VOIP phones and computers) interacting with a controller that provisions the connection between them. In this case, provisioning involves authenticating end-points, finding other end-points, and ringing the other end. This process creates a logical connection between the end-points that overlays the physical network(s) that connect them. From there, communication occurs directly between the end-points. The breakthroughs that allowed Voice Over IP were a) ubiquitous end-points with the capacity to encode voice and communicate via IP and b) physical networks with enough capacity to connect the end-points while still carrying their normal workload.
Now, does VOIP need anything special from the network itself? Back in the 1990s, many people believed that to enable VOIP it would be necessary to signal into the network to request bandwidth for each call. Both ATM signalling and RSVP (the Resource Reservation Protocol) were proposed to address this problem. But by the time VOIP really started to gain traction, network bandwidth was becoming so abundant that these explicit communication methods between the endpoints and the network proved un-necessary. Some simple marking of VOIP packets to ensure that they didn’t encounter long queues on bottleneck links was all that was needed in the QoS department. Intelligent behavior at the end-points (such as adaptive bit-rate codecs) made the solution even more robust. Today, of course, you can make a VOIP call between continents without any knowledge of the underlying network.
These same principles have been applied to more interactive use cases such as web-based video conferencing, interactive gaming, tweeting, you name it. The majority of the ways that people interact electronically are based on two fundamental premises: a logical connection between two or more end-points and a high capacity IP network fabric.
Returning to the context of network virtualization, IP fabrics allow system architects to build highly scalable physical networks; the summarization properties of IP and its routing protocols allow the connection of thousands of endpoints without imposing the knowledge of each one on the core of the network. This both reduces the complexity (and cost) of the networking elements, and improves their ability to heal in the event that something goes wrong. IP networks readily support large sets of equal cost paths between end-points, allowing administrators to simultaneously add capacity and redundancy. Path selection can be based on a variety of techniques such as statistical selection (hashing of headers), Valiant Load Balancing, and automated identification of “elephant” flows.
Is anything lost if applications don’t interact directly with the network forwarding elements? In theory, perhaps, an application might be able to get a path that is more well-suited to its precise bandwidth needs if it could talk to the network. In practice, a well-provisioned IP network with rich multipath capabilities is robust, effective, and simple. Indeed, it’s been proven that multipath load-balancing can get very close to optimal utilization, even when the traffic matrix is unknown (which is the normal case). So it’s hard to argue that the additional complexity of providing explicit communication mechanisms for applications to signal their needs to the physical network are worth the cost. In fact, we’ll argue in a future post that trying to carefully engineer traffic is counter-productive in data centers because the traffic patterns are so unpredictable. Combine this with the benefits of decoupling the network services from the physical fabric, and it’s clear that a virtualization overlay on top of a well-provisioned IP network is a great fit for the modern data center.
[This post was written by Bruce Davie and Martin Casado.]
With the growth of interest in network virtualization, there has been a tendency to focus on the encapuslations that are required to tunnel packets across the physical infrastructure, sometimes neglecting the fact that tunneling is just one (small) part of an overall architecture for network virtualization. Since this post is going to do just that – talk about tunnel encapsulations – we want to reiterate the point that a complete network virtualization solution is about much more than a tunnel encapsulation. It entails (at least) a control plane, a management plane, and a set of new abstractions for networking, all of which aim to change the operational model of networks from the traditional, physical model. We’ve written about these aspects of network virtualization before (e.g., here).
In this post, however, we do want to talk about tunneling encapsulations, for reasons that will probably be readily apparent. There is more than one viable encapsulation in the marketplace now, and that will be the case for some time to come. Does it make any difference which one is used? In our opinion, it does, but it’s not a simple beauty contest in which one encaps will be declared the winner. We will explore some of the tradeoffs in this post.
There are three main encapsulation formats that have been proposed for network virtualization: VXLAN, NVGRE, and STT. We’ll focus on VXLAN and STT here. Not only are they the two that VMware supports (now that Nicira is part of VMware) but they also represent two quite distinct points in the design space, each of which has its merits.
One of the salient advantages of VXLAN is that it’s gained traction with a solid number of vendors in a relatively short period. There were demonstrations of several vendors’ implementations at the recent VMworld event. It fills an important market need, by providing a straightforward way to encapsulate Layer 2 payloads such that the logical semantics of a LAN can be provided among virtual machines without concern for the limitations of physical layer 2 networks. For example, a VXLAN can provide logical L2 semantics among machines spread across a large data center network, without requiring the physical network to provide arbitrarily large L2 segments.
At the risk of stating the obvious, the fact that VXLAN has been implemented by multiple vendors makes it an ideal choice for multi-vendor deployments. But we should be clear what “multi-vendor” means in this case. Network virtualization entails tunneling packets through the data center routers and switches, and those devices only forward based on the outer header of the tunnel – a plain old IP (or MAC header). So the entities that need to terminate tunnels for network virtualization are the ones that we are concerned about here.
In many virtualized data center deployments, most of the traffic flows from VM to VM (“east-west” traffic) in which case the tunnels are terminated in vswitches. It is very rare for those vswitches to be from different vendors, so in this case, one might not be so concerned about multi-vendor support for the tunnel encaps. Other issues, such as efficiency and ability to evolve quickly might be more important, as we’ll discuss below.
Of course, there are plenty of cases where traffic doesn’t just flow east-west. It might need to go out of the data center to the Internet (or some other WAN), i.e. “north-south”. It might also need to be sent to some sort of appliance such as a load balancer, firewall, intrusion detection system, etc. And there are also plenty of cases where a tunnel does need to be terminated on a switch or router, such as to connect non-virtualized workloads to the virtualized network. In all of these cases, we’re clearly likely to run into multi-vendor situations for tunnel termination. Hence the need for a common, stable, and straightfoward approach to tunneling among all those devices.
Now, getting back to server-server traffic, why wouldn’t you just use VXLAN? One clear reason is efficiency, as we’ve discussed here. Since tunneling between hypervisors is required for network virtualization, it’s essential that tunneling not impose too high an overhead in terms of CPU load and network throughput. STT was designed with those goals in mind and performs very well on those dimensions using today’s commodity NICs. Given the general lack of multi-vendor issues when tunneling between hypervisors, STT’s significant performance advantage makes it a better fit in this scenario.
The performance advantage of STT may be viewed as somewhat temporary – it’s a result of STT’s ability to leverage TCP segmentation offload (TSO) in today’s NICs. Given the rise in importance of tunneling, and the momentum behind VXLAN, it reasonable to expect that a new generation of NICs will emerge that allow other tunnel encapsulations to be used without disabling TSO. When that happens, performance differences between STT and VXLAN should (mostly) disappear, given appropriate software to leverage the new NICs.
Another factor that comes into play when tunneling traffic from server to server is that we may want to change the semantics of the encapsualution from time to time as new features and capabilities make their way into the network virtualization platform. Indeed, one of overall advantages of network virtualization is the ease with which the capabilities of the network can be upgraded over time, since they are all implemented in software that is completely independent of the underlying hardware. To make the most of this potential for new feature deployment, it’s helpful to have a tunnel encaps with fields that can be modified as required by new capabilities. An encaps that typically operates between the vswitches of a single vendor (like STT) can meet this goal, while an encaps designed to facilitate multi-vendor scenarios (like VXLAN) needs to have the meaning of every header field pretty well nailed down.
So, where does that leave us? In essence, with two good solutions for tunneling, each of which meets a subset of the total needs of the market, but which can be used side-by-side with no ill effect. Consequently, we believe that VXLAN will continue to be a good solution for the multi-vendor environments that often occur in data center deployments, while STT will, for at least a couple of years, be the best approach for hypervisor-to-hypervisor tunnels. A complete network virtualization solution will need to use both encapsulations. There’s nothing wrong with that – building tunnels of the correct encapsulation type can be handled by the controller, without the need for user involvement. And, of course, we need to remember that a complete solution is about much more than just the bits on the wire.
[This post was written with input from Martin Casado, Ben Pfaff, Justin Pettit and Ben Basler.]
The Open vSwitch (OVS) project is obviously dear to our hearts at Nicira (and now VMware). OVS is a fairly standard open source project, with many dozens of people from companies around the world contributing patches and reviewing them. However, there is more to openness than just open source software; open protocols (with freely accessible specs) are also important. Of course, Open vSwitch is well known as an implementation of the OpenFlow protocol, for which the specs are freely available. But there is another protocol, arguably just as important as OpenFlow, which is used to manage Open vSwitch instances. That protocol is known as the Open vSwitch Data Base management protocol or OVSDB protocol. While the specification of that protocol can be found within the Open vSwitch sources, it’s a bit of an effort to figure out exactly how it works. With that in mind, as well as a desire to be as open as possible, we decided to publish a spec for the OVSDB protocol in a more familiar and accessible format – an Internet draft.
To be clear, anyone can publish an Internet draft, and that does not make something into a standard. There’s a possibility that this Internet draft may be suitable for publication as an informational RFC. That wouldn’t make it a standard either, but it would at least provide an archival publication mechanism for a protocol that is quite widely used. Whether that happens or not depends on the “Independent Stream Editor”, part of the rather complex organization that handles RFC publication. (See http://www.rfc-editor.org/RFCeditor.html.)
So, what is this OVSDB protocol? Obviously, you could just go and read our new Internet draft, but here is the quick summary. While OpenFlow establishes flow state in a switch, there’s a lot more to Open vSwitch – indeed there is more to networking – than just setting up flow (or forwarding) table entries. In Open vSwitch, you can create many virtual switch instances, attach interfaces to those switches, set QOS policies on interfaces, and so on. None of these configuration tasks can be done with OpenFlow, so you need a management/configuration protocol to do them.
The OVSDB protocol has been part of the Open vSwitch implementation for many years. It is essentially a general purpose protocol for interacting with a database, and in Open vSwitch the database in question is a set of tables representing switch configuration data. (Some readers may be familiar with of-config – the OpenFlow config protocol – a more recent effort; we believe that protocol could actually be implemented on top of OVSDB.)
To step back for a moment, networking folks often think of any network device as having a control plane and a data plane. Sadly, the management plane has been all too often neglected. OVSDB is a protocol that was created to address that important but neglected aspect of networking. We think that making networks dramatically easier to manage is in fact one of the major benefits of network virtualization. That’s a topic we’ve discussed elsewhere; for now, I’ll just urge readers of this blog to go take a look at our current approach to managing and configuring Open vSwitch instances.
[This post was written with Jesse Gross, Ben Basler, Bruce Davie, and Andrew Lambeth]
Tunneling has earned a bad name over the years in networking circles.
Much of the problem is historical. When a new tunneling mode is introduced in a hardware device, it is often implemented in the slow path. And once it is pushed down to the fastpath, implementations are often encumbered by key or table limits, or sometimes throughput is halved due to additional lookups.
However, none of these problems are intrinsic to tunneling. At its most basic, a tunnel is a handful of additional bits that need to be slapped onto outgoing packets. Rarely, outside of encryption, is there significant per-packet computation required by a tunnel. The transmission delay of the tunnel header is insignificant, and the impact on throughput is – or should be – similarly minor.
In fact, our experience implementing multiple tunneling protocols within Open vSwitch is that it is possible to do tunneling in software with performance and overhead comparable to non encapsulated traffic, and to support hundreds of thousands of tunnel end points.
And that is the goal of this post: to start the discussion on the performance of tunneling in software from the network edge.
An emerging method of network virtualization is to use tunneling from the edges to decoupled the virtual network address space from the physical address space. Often the tunneling is done in software in the hypervisor. Tunneling from within the server has a number of advantages: software tunneling can easily support hundreds of thousands of tunnels, it is not sensitive to key sizes, it can support complex lookup functions and header manipulations, it simplifies the server/switch interface and reduces demands on the in-network switching ASICs, and it naturally offers software flexibility and a rapid development cycle.
An idealized forwarding path is shown in the figure below. We assume that the tunnels are terminated within the hypervisor. The hypervisor is responsible for mapping packets from VIFs to tunnels, and from tunnels to VIFs. The hypervisor is also responsible for the forwarding decision on the outer header (mapping the encapsulated packet to the next physical hop).
Some Performance Numbers for Software Tunneling
The following tests show throughput and cpu overhead for tunneling within Open vSwitch. Traffic was generated with netperf attempting to emulate a high-bandwidth TCP flow. The MTU for the VM and the physical NICs are 1500bytes and the packet payload size is 32k. The test shows results using no tunneling (OVS bridge), GRE, and STT.
The results show aggregate bidirectional throughput, meaning that 20Gbps is a 10G NIC sending and receiving at line rate. All tests where done using Ubuntu 12.04 and KVM on an Intel Xeon 2.40GHz servers interconnected with a Dell 10G switch. We use standard 10G Broadcom NICs. CPU numbers reflect the percentage of a single core used for each of the processes tracked.
The following results show the performance of a single flow between two VMs on different hypervisors. We include the Linux bridge to show that performance is comparable. Note that the CPU only includes the CPU dedicated to switching in the hypervisor and not the overhead in the guest needed to push/consume traffic.
|Throughput||Recv side cpu||Send side cpu|
|Linux Bridge:||9.3 Gbps||85%||75%|
|OVS Bridge:||9.4 Gbps||82%||70%|
This next table shows the aggregate throughput of two hypervisors with 4 VMs each. Since each side is doing both send and receive, we don’t differentiate between the two.
|OVS Bridge:||18.4 Gbps||150%|
Interpreting the Results
Clearly these results (aside from GRE, discussed below) indicate that the overhead of software for tunneling is negligible. It’s easy enough to see why that is so. Tunneling requires copying the tunnel bits onto the header, an extra lookup (at least on receive), and the transmission delay of those extra bits when placing the packet on the wire. When compared to all of the other work that needs to be done during the domain crossing between the guest and the hypervisor, the overhead really is negligible.
In fact, with the right tunneling protocol, the performance is roughly equivalent to non-tunneling, and CPU overhead can even be lower.
STT’s lower CPU usage than non-tunneled traffic is not a statistical anomaly but is actually a property of the protocol. The primary reason is that STT allows for better coalescing on the received side in the common case (since we know how many packets are outstanding). However, the point of this post is not to argue that STT is better than other tunneling protocols, just that if implemented correctly, tunneling can have comparable performance to non-tunneled traffic. We’ll address performance specific aspects of STT relative to other protocols in a future post.
The reason that GRE numbers are so low is that with the GRE outer header it is not possible to take advantage of offload features on most existing NICs (we have discussed this problem in more detail before). However, this is a shortcoming of the NIC hardware in the near term. Next generation NICs will support better tunnel offloads, and in a couple of years, we’ll start to see them show up in LOM.
In the meantime, STT should work on any standard NIC with TSO today.
The point of this post is that at the edge, in software, tunneling overhead is comparable to raw forwarding, and under some conditions it is even beneficial. For virtualized workloads, the overhead of software forwarding is in the noise when compared to all of the other machinations performed by the hypervisor.
Technologies like passthrough are unlikely to have a significant impact on throughput, but they will save CPU cycles on the server. However, that savings comes at a fairly steep cost as we have explained before, and doesn’t play out in most deployment environments.