The Overhead of Software Tunneling

[This post was written with Jesse Gross, Ben Basler, Bruce Davie, and Andrew Lambeth]

Tunneling has earned a bad name over the years in networking circles.

Much of the problem is historical. When a new tunneling mode is introduced in a hardware device, it is often implemented in the slow path. And once it is pushed down to the fast path, implementations are often encumbered by key or table limits, or throughput is sometimes halved by the additional lookups.

However, none of these problems are intrinsic to tunneling. At its most basic, a tunnel is a handful of additional bits that need to be slapped onto outgoing packets. Rarely, outside of encryption, is there significant per-packet computation required by a tunnel. The transmission delay of the tunnel header is insignificant, and the impact on throughput is – or should be – similarly minor.
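
To make that concrete, here is a minimal sketch in Python (the function name is ours; real implementations live in the kernel datapath) of what basic GRE encapsulation amounts to: prepending a few bytes to each outgoing packet.

```python
import struct

def gre_encapsulate(inner_frame: bytes, outer_ip_header: bytes) -> bytes:
    """Prepend a minimal GRE header (RFC 2784) plus a precomputed outer IP
    header to an inner Ethernet frame. With no checksum, key, or sequence
    number, the GRE header is just four bytes."""
    flags_and_version = 0x0000   # all optional fields absent, version 0
    protocol_type = 0x6558       # Transparent Ethernet Bridging (L2 payload)
    gre_header = struct.pack("!HH", flags_and_version, protocol_type)
    return outer_ip_header + gre_header + inner_frame
```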

In fact, our experience implementing multiple tunneling protocols within Open vSwitch is that it is possible to do tunneling in software with performance and overhead comparable to non-encapsulated traffic, and to support hundreds of thousands of tunnel endpoints.

Given the growing importance of tunneling in virtual networking (as evidenced by the emergence of protocols such as STT, NVGRE, and VXLAN), it’s worth exploring its performance implications.

And that is the goal of this post: to start the discussion on the performance of tunneling in software from the network edge.

Background

An emerging method of network virtualization is to use tunneling from the edges to decouple the virtual network address space from the physical address space. Often the tunneling is done in software in the hypervisor. Tunneling from within the server has a number of advantages: software tunneling can easily support hundreds of thousands of tunnels; it is not sensitive to key sizes; it can support complex lookup functions and header manipulations; it simplifies the server/switch interface and reduces demands on the in-network switching ASICs; and it naturally offers software flexibility and a rapid development cycle.

An idealized forwarding path is shown in the figure below. We assume that the tunnels are terminated within the hypervisor. The hypervisor is responsible for mapping packets from VIFs to tunnels, and from tunnels to VIFs. The hypervisor is also responsible for the forwarding decision on the outer header (mapping the encapsulated packet to the next physical hop).
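
As a hedged illustration of that forwarding path (the table and function names below are ours, not Open vSwitch internals), the per-packet work is essentially a map lookup in each direction:

```python
# VIF -> (remote hypervisor IP, tunnel key), and the reverse mapping.
# The entries here are illustrative placeholders.
vif_to_tunnel = {"vif1": ("192.0.2.10", 0x10)}
key_to_vif = {0x10: "vif1"}

def on_transmit(vif: str, frame: bytes):
    """Map an outgoing frame from a VIF to its tunnel; the outer header
    (next physical hop) is derived from the remote hypervisor's IP."""
    remote_ip, key = vif_to_tunnel[vif]
    return remote_ip, key, frame

def on_receive(key: int, frame: bytes):
    """Map a decapsulated frame from its tunnel key back to the local VIF."""
    return key_to_vif[key], frame
```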

Some Performance Numbers for Software Tunneling

The following tests show throughput and CPU overhead for tunneling within Open vSwitch. Traffic was generated with netperf, attempting to emulate a high-bandwidth TCP flow. The MTU for the VM and the physical NICs is 1500 bytes, and the packet payload size is 32 KB. The tests show results using no tunneling (OVS bridge), GRE, and STT.
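
For readers trying to reproduce the setup: the post does not give the exact command line, but a netperf TCP_STREAM run with a 32 KB send size along these lines should approximate it (the target IP is a placeholder):

```python
import subprocess

# Approximate reconstruction of the traffic generator: a single
# high-bandwidth TCP stream with 32 KB writes. These are standard netperf
# options; the exact invocation used in the tests is not given in the post.
subprocess.run(
    ["netperf", "-H", "192.0.2.20", "-t", "TCP_STREAM", "--", "-m", "32768"],
    check=True,
)
```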

The results show aggregate bidirectional throughput, meaning that 20 Gbps is a 10G NIC sending and receiving at line rate. All tests were done using Ubuntu 12.04 and KVM on Intel Xeon 2.40 GHz servers interconnected with a Dell 10G switch. We used standard 10G Broadcom NICs. CPU numbers reflect the percentage of a single core used for each of the processes tracked.

The following results show the performance of a single flow between two VMs on different hypervisors. We include the Linux bridge to show that performance is comparable. Note that the CPU numbers only include the CPU dedicated to switching in the hypervisor, not the overhead in the guest needed to push/consume traffic.

                Throughput   Recv side CPU   Send side CPU
Linux Bridge:     9.3 Gbps        85%             75%
OVS Bridge:       9.4 Gbps        82%             70%
OVS-STT:          9.5 Gbps        70%             70%
OVS-GRE:          2.3 Gbps        75%             97%

This next table shows the aggregate throughput of two hypervisors with 4 VMs each. Since each side is doing both send and receive, we don’t differentiate between the two.

              Throughput    CPU
OVS Bridge:   18.4 Gbps    150%
OVS-STT:      18.5 Gbps    120%
OVS-GRE:       2.3 Gbps    150%

Interpreting the Results

Clearly these results (aside from GRE, discussed below) indicate that the overhead of tunneling in software is negligible. It’s easy enough to see why. Tunneling requires writing the tunnel header onto the packet, an extra lookup (at least on receive), and the transmission delay of those extra bits when placing the packet on the wire. Compared to all of the other work that needs to be done during the domain crossing between the guest and the hypervisor, this overhead really is negligible.
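
As a back-of-envelope check (our arithmetic, not a number from the tests): an IP-in-GRE tunnel adds an outer IPv4 header plus four bytes of GRE to each frame, which on a 1500-byte packet is under two percent of the bits on the wire.

```python
mtu = 1500        # inner packet size in bytes
outer_ipv4 = 20   # outer IPv4 header
gre = 4           # minimal GRE header (no key/checksum/sequence)

extra = outer_ipv4 + gre
overhead = extra / (mtu + extra)
print(f"extra bytes per packet: {extra}, wire overhead: {overhead:.1%}")  # ~1.6%
```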

In fact, with the right tunneling protocol, the performance is roughly equivalent to non-tunneling, and CPU overhead can even be lower.

STT’s lower CPU usage relative to non-tunneled traffic is not a statistical anomaly but a property of the protocol. The primary reason is that STT allows for better coalescing on the receive side in the common case (since we know how many packets are outstanding). However, the point of this post is not to argue that STT is better than other tunneling protocols, just that if implemented correctly, tunneling can have performance comparable to non-tunneled traffic. We’ll address the performance-specific aspects of STT relative to other protocols in a future post.
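
To sketch why knowing how many packets are outstanding helps (the field and function names below are illustrative, not the STT wire format): the receiver learns the total length of the original large frame up front, so it can hand the coalesced frame to the guest the moment the last segment arrives, rather than waiting on a batching timer.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    declared_total_len: int  # total reassembled length, carried in the header
    payload: bytes

def coalesce(segments: list[Segment]) -> bytes | None:
    """Accumulate segments; deliver as soon as the declared total length
    has arrived, with no timeout needed."""
    buf = bytearray()
    for seg in segments:
        buf += seg.payload
        if len(buf) >= seg.declared_total_len:
            return bytes(buf)  # last segment seen: send up to the guest now
    return None  # still waiting for more segments
```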

The reason the GRE numbers are so low is that with the GRE outer header it is not possible to take advantage of offload features on most existing NICs (we have discussed this problem in more detail before). However, this is a near-term shortcoming of the NIC hardware. Next-generation NICs will support better tunnel offloads, and in a couple of years we’ll start to see them show up in LOM.
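
The cost shows up as software segmentation: without tunnel-aware offload, the host must do in the CPU what TSO would otherwise do in the NIC, roughly along these lines (an illustrative sketch, not kernel code):

```python
def software_segment(payload: bytes, mss: int, make_headers) -> list[bytes]:
    """Slice a large send into MSS-sized pieces and stamp headers on every
    piece -- per-packet CPU work that TSO-friendly encapsulations leave
    to the NIC."""
    packets = []
    for off in range(0, len(payload), mss):
        chunk = payload[off:off + mss]
        packets.append(make_headers(len(chunk)) + chunk)
    return packets

# Example: a 32 KB write becomes ~23 packets, each needing header work.
frames = software_segment(b"x" * 32768, 1448, lambda n: b"\x00" * 42)
print(len(frames))
```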

In the meantime, STT should work on any standard NIC with TSO today.

The Point

The point of this post is that at the edge, in software, tunneling overhead is comparable to raw forwarding, and under some conditions it is even beneficial. For virtualized workloads, the overhead of software forwarding is in the noise when compared to all of the other machinations performed by the hypervisor.

Technologies like passthrough are unlikely to have a significant impact on throughput, but they will save CPU cycles on the server. However, those savings come at a fairly steep cost, as we have explained before, and the trade-off doesn’t pay off in most deployment environments.


16 Comments on “The Overhead of Software Tunneling”

  1. sshafi says:

    Martin,

    Would like to hear your thoughts on the Cisco ONE initiative.

  2. Great data, Martin. I had been waiting around for a VXLAN patch for the past couple of months, not even realizing STT was there. Doh! Offloading to the NIC is truly an elegant solution, especially with the upcoming LOM support you pointed out. Interestingly, the numbers reflect the offload and come in lower than bridging. I still need to wrap my brain around getting better performance than Linux bridging, e.g. “since we know how many packets are outstanding”. This is good stuff to take back to the lab!

    • Heya Brent, and thanks. The short answer on the performance is that since the receive side knows how many packets are outstanding, it doesn’t have to wait around for some timeout period when batching packets to send to the guest. Rather, once the last packet arrives, the coalesced frame can immediately be sent up to the guest. We’re going to put together another post that focuses just on STT performance, as it is independently an interesting (and not always intuitive :) topic.

    • I would not say it is such an elegant solution to create a fake TCP…
      It is certainly efficient for overcoming performance issues thanks to the NICs, but what about those STT frames that will face firewall and IPS systems? I can promise that they will be dropped outright unless your security guys change their equipment rules…
      I would rather rely on advanced packet processing companies boosting GRE performance with software only, along with minimal CPU usage, like… us :-)

  3. Thomas M. says:

    “The MTU for the VM and the physical NICs are 1500bytes and the packet payload size is 32k. The test shows results using no tunneling (OVS bridge), GRE, and STT.”

    With a GRE or STT encap, I see how you can fragment a 32k payload.
    But can you expand on what it means to have a packet payload of 32k pushed onto a bridge?

  4. Jerry says:

    I also want to run such a test, and I know OVS has provided a VXLAN patch, but I cannot find any STT patch yet. Do you know where it is?

  5. Sherry Wei says:

    When a vSwitch is used, are packets copied from guest memory to hypervisor memory and then copied back into another guest’s memory? I would imagine that is a lot of overhead compared with, say, SR-IOV, where packets are DMAed into guest memory directly from the NIC. It would be more interesting to compare software vSwitch performance against inter-VM traffic using SR-IOV. It should be obvious that tunneling hardly adds any overhead, as the bulk of the work is the software copying packets between memory. The performance comparisons in the tables above appear to all be done with the hypervisor involved, doing the packet movement of getting packets into the hypervisor, then either sending them back to another VM directly or sending them to an external switch and back. The results are expected, but not really interesting. With SR-IOV, it seems the debate about using a software switch is really only about the trade-offs between software flexibility and performance. If the flexibility is an enumeration of a finite number of different schemes, a hardware solution may still be better. Would love to hear counterarguments and corrections.

    • It isn’t quite as simple as “SR-IOV delivers higher performance than software-virtualized IO”. A more accurate statement would be “SR-IOV can deliver lower-latency IO for some workloads, at the cost of a reduced feature set”. For almost all applications, the additional latency incurred by using software-based IO virtualization is acceptable, if even noticeable. This is obvious if you look at the range of applications that are successfully deployed in such a manner today. Most enterprise datacenters are moving to a “VM first” policy, meaning that unless you can prove your app cannot be effectively virtualized, you don’t get bare metal assigned to it. Today, that means software IO virtualization, because there are so few deployed SR-IOV configurations. For the applications that cannot tolerate the additional latency (e.g. high-frequency trading), there are a number of other issues that need to be addressed to virtualize them effectively. Over time this will happen, but it will always be the outlier applications that can justify the expense and reduced feature set of direct access to hardware, whether running fully on bare metal or just using SR-IOV to expose the IO device to the guest.

      • Sherry Wei says:

        Thanks for the reply. I think that in addition to the latency impact, SR-IOV can produce higher throughput. In the test results shown above, a single dedicated CPU for the vSwitch produces about 10 Gbps of throughput with CPU utilization pretty much maxed out (75% – 85%). So with a 40G NIC card or multiple 10G NIC cards, SR-IOV would have no problem producing line-rate IO throughput, but the vSwitch would cap out at 10 Gbps unless more CPUs are allocated, in which case it takes away guest VM performance. I think comparing SR-IOV performance with a software vSwitch is not even interesting, as the results would be as expected. The question is: once a packet is DMAed via SR-IOV into the NIC or TOR switch, does the hardware switch have sufficient features to do what the vSwitch can do? (Perhaps this is your comment about the reduced feature set?) If vSwitch features keep evolving, hardware could have a hard time, but if all it does is encapsulation xyz, hardware can be made flexible enough to cover all the schemes. Isn’t that where SDN comes in handy? I agree SR-IOV is something that has not been deployed much, as it is so new (didn’t VMware just release its support in ESXi?), and it probably requires a VM device driver upgrade. I also agree that the old PCI pass-through is not a good solution, as a VM monopolizes a NIC and maybe some hypervisor features are lost.

        • This definitely deserves a longer discussion and perhaps a full post.

          Ultimately, the x86 is the source of all traffic. From the socket API on down, a tremendous amount of network processing already happens on the x86 at the edge as data traverses the network stack. Doing switching at the edge adds relatively minor overhead, which is in the noise compared to other overheads in the hypervisor. In the common case, virtual switching adds a few additional hash lookups, and these can be parallelized across cores in line with the applications.
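
          A hedged sketch of that point (the names are ours, not the OVS datapath): with a flow cache keyed on packet headers, the steady-state per-packet cost of virtual switching is roughly one hash probe, and independent flows can be handled on different cores.

```python
flow_cache: dict = {}  # (src_ip, dst_ip, proto, src_port, dst_port) -> actions

def slow_path_lookup(five_tuple):
    # Stand-in for the full flow-table lookup, taken only on a cache miss
    # (i.e., the first packet of a flow).
    return ["forward"]

def classify(five_tuple):
    actions = flow_cache.get(five_tuple)
    if actions is None:
        actions = slow_path_lookup(five_tuple)
        flow_cache[five_tuple] = actions
    return actions
```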

          Using SR-IOV provides two benefits: (a) it removes the context switch between the guest and the hypervisor (essentially devirtualizing the networking stack by getting the hypervisor out of the way, at a significant cost in features and flexibility), and (b) it moves the switching functionality to hardware.

          (a) is where the real gains are realized, but this can arguably be done with much more modest support in the hardware, without fully abdicating switching intelligence to the NIC, thus preserving the flexibility of x86-based virtualization. (b) only really matters if there are forwarding functions that benefit from specialized hardware. However, in the wild there are very few (and none we’ve run into) where the cost equation of using specialized hardware pencils out.

          We’d like to do a focused post on this, and would love your participation. If you’re interested, drop me (Martin) an e-mail and we can put something together.

          • Sherry Wei says:

            OK, thanks. I would be interested in understanding where you are going with this.
            One thing I should add regarding the PCI pass-through feature: it does have a good use case. It allows any PCI-e device that is not emulated by the hypervisor to be supported in a VM transparently. Hardware platforms with such devices are not uncommon outside of the standard x86-based server architecture. This feature allows these platforms to support virtualization; sometimes it is the only sensible way. We don’t like to see VMware kill it :0

          • Hi Martin,

            I would love to show you our benchmarks, where OVS performance is pushed up by inserting an additional packet processing framework as a (transparent) replacement for the current OVS datapath.
            I am convinced that OVS can serve as much more than simple L2 (virtual) switching, and that ultimately x86-based systems running OVS will be the edges of network overlays handling multi-tenant architectures that require terminating tunnels, applying QoS/ACLs per service context, etc.
            Also, with the lack of wildcard support in the OpenFlow specifications, many packets/flows will still have to be analysed in userspace, adding performance bottlenecks. With the packet processing framework add-on, you benefit from that environment to handle those packets/flows while managing them at high performance.
            Do you have an email per your advice above to discuss further?

            Thanks!

  6. [...] why wouldn’t you just use VXLAN? One clear reason is efficiency, as we’ve discussed here. Since tunneling between hypervisors is required for network virtualization, it’s essential [...]

  7. Hey! I have been trying to replicate this in my setup. Obviously, I won’t be able to get the same numbers, since my switch and network interface cards are different. Could you please tell me whether you ran parallel instances of netperf or just a single netperf session? So far I have tried with a single netperf session, and my performance has been around 3.5 Gbps (OVS bridge) and 2.3 Gbps (GRE). I don’t think I am fully utilizing the CPU here. I have a ten-core machine. I am using OVS 1.7.1.

  8. Steve Zhang says:

    Thanks for sharing the OVS performance numbers; I’ve been looking around for a while. Besides the large-packet (1500 bytes) performance, has any testing been done on small packets? Two reasons for this: first, I think certain applications may generate a good portion of small packets in their overall traffic, and second, even in hardware switch design, handling small packets poses a greater challenge than large ones. On the soft-switch side, I would imagine a higher packet rate will incur higher costs for context switches, table lookups, etc. So it would be interesting to see some measurements.

