Accelerating Open vSwitch to “Ludicrous Speed”
Posted: November 13, 2014
[This post was written by OVS core contributors Justin Pettit, Ben Pfaff, and Ethan Jackson.]
The overhead associated with vSwitches has been a hotly debated topic in the networking community. In this blog post, we show how recent changes to OVS have elevated its performance to be on par with the native Linux bridge. Furthermore, CPU utilization of OVS in realistic scenarios can be up to 8x below that of the Linux bridge. This is the first of a two-part series. In the next post, we take a peek at the design and performance of the forthcoming port to DPDK, which bypasses the kernel entirely to gain impressive performance.
Open vSwitch is the most popular network back-end for OpenStack deployments and widely accepted as the de facto standard OpenFlow implementation. Open vSwitch development initially had a narrow focus — supporting novel features necessary for advanced applications such as network virtualization. However, as we gained experience with production deployments, it became clear these initial goals were not sufficient. For Open vSwitch to be successful, it not only must be highly programmable and general, it must also be blazingly fast. For the past several years, our development efforts have focused on precisely this tension — building a software switch that does not compromise on either generality or speed.
To understand how we achieved this feat, one must first understand the OVS architecture. The forwarding plane consists of two parts: a “slow-path” userspace daemon called ovs-vswitchd and a “fast-path” kernel module. Most of the complexity of OVS is in ovs-vswitchd; all of the forwarding decisions and network protocol processing are handled there. The kernel module’s only responsibilities are tunnel termination and caching traffic handling.
When a packet is received by the kernel module, its cache of flows is consulted. If a relevant entry is found, then the associated actions (e.g., modify headers or forward the packet) are executed on the packet. If there is no entry, the packet is passed to ovs-vswitchd to decide the packet’s fate. ovs-vswitchd then executes the OpenFlow pipeline on the packet to compute actions, passes it back to the fast path for forwarding, and installs a flow cache entry so similar packets will not need to take these expensive steps.
Until OVS 1.11, this fast-path cache contained exact-match “microflows”. Each cache entry specified every field of the packet header, and was therefore limited to matching packets with this exact header. While this approach works well for most common traffic patterns, unusual applications, such as port scans or some peer-to-peer rendezvous servers, would have very low cache hit rates. In this case, many packets would need to traverse the slow path, severely limiting performance.
OVS 1.11 introduced megaflows, enabling the single biggest performance improvement to date. Instead of a simple exact-match cache, the kernel cache now supports arbitrary bitwise wildcarding. Therefore, it is now possible to specify only those fields that actually affect forwarding.. For example, if OVS is configured simply to be a learning switch, then only the ingress port and L2 fields are relevant and all other fields can be wildcarded. In previous releases, a port scan would have required a separate cache entry for, e.g., each half of a TCP connection, even though the L3 and L4 fields were not important.
The introduction of megaflows allowed OVS to drastically reduce the number of packets that traversed the slow path. This represents a major improvement, but ovs-vswitchd still had a number of responsibilities, which became the new bottleneck. These include activities like managing the datapath flow table, running switch protocols (LACP, BFD, STP, etc), and other general accounting and management tasks.
While the kernel datapath has always been multi-threaded, ovs-vswitchd was a single-threaded process until OVS 2.0. This architecture was pleasantly simple, but it suffered from several drawbacks. Most obviously, it could use at most one CPU core. This was sufficient for hypervisors, but we began to see Open vSwitch used more frequently as a network appliance, in which it is important to fully use machine resources.
Less obviously, it becomes quite difficult to support ovs-vswitchd’s real-time requirements in a single-threaded architecture. Fast-path misses from the kernel must be processed by ovs-vswitchd as promptly as possible. In the old single-threaded architecture, miss handling often blocked behind the single thread’s other tasks. Large OVSDB changes, OpenFlow table changes, disk writes for logs, and other routine tasks could delay handling misses and degrade forwarding performance. In a multi-threaded architecture, soft real-time tasks can be separated into their own threads, protected from delay by unrelated maintenance tasks.
Since the introduction of multithreading, we’ve continued to fine-tune the number of threads and their responsibilities. In addition to order of magnitudes improvements in miss handling performance, this architecture shift has allowed us to increase the size of the kernel cache from 1000 flows in early versions of OVS to roughly 200,000 in the most recent version.
The introduction of megaflows and support for a larger cache enabled by multithreading reduced the need for packets to be processed in userspace. This is important because a packet lookup in userspace can be quite expensive. For example, a network virtualization application might define an OpenFlow pipeline with dozens of tables that hold hundreds of thousands of rules. The final optimization we will discuss is improvements in the classifier in ovs-vswitchd, which is responsible for determining which OpenFlow rules apply when processing a packet.
Data structures to allow a flow table to be quickly searched are an active area of research. OVS uses a tuple space search classifier, which consists of one hash table (tuple) for each kind of match actually in use. For example, if some flows match on the source IP, that’s represented as one tuple, and if others match on source IP and destination TCP port, that’s a second tuple. Searching a tuple space search classifier requires searching each tuple, then taking the highest priority match. In the last few releases, we have introduced a number of novel improvements to the basic tuple search algorithm:
- Priority Sorting – Our simplest optimization is to sort the tuples in order by the maximum priority of any flow in the tuple. Then a search that finds a matching flow with priority P can terminate as soon as it arrives at a tuple whose maximum priority is P or less.
- Staged Lookup – In many network applications, a policy may only apply to a subset of the headers. For example, a firewall policy that requires looking at TCP ports may only apply to a couple of VMs’ interfaces. With the basic tuple search algorithm, if any rule looks at the TCP ports, then any generated megaflow would look all the way up through the L4 headers. With staged lookup, we scope lookups from metadata (e.g., ingress port) up to layers further up the stack on an as-needed basis.
- Prefix Tracking – When processing L3 traffic, a longest prefix match is required for routing. The tuple-space algorithm works poorly in this case, since it degenerates into a tuple that matches as many bits as the longest match. This means that if one rule matches on 10.x.x.x and another on 192.168.0.x, the 10.x.x.x rule will also require matching 24 bits instead of 8, which requires keeping more megaflows in the kernel. With prefix tracking, we consult a trie that allows us to only look at tuples with the high order bits sufficient to differentiate between rules.
These classifier improvements have been shown with practical rule sets to reduce the number of megaflows needed in the kernel from over a million to only dozens.
The preceding changes improve many aspects of performance. For this post, we’ll just evaluate the performance gains in flow setup, which was the area of greatest concern. To measure setup performance, we used netperf’s TCP_CRR test, which measures the number of short-lived transactions per second (tps) that can be established between two hosts. We compared OVS to the Linux bridge, a fixed-function Ethernet switch implemented entirely inside the Linux kernel.
In the simplest configuration, the two switches achieved identical throughput (18.8 Gbps) and similar TCP_CRR connection rates (696,000 tps for OVS, 688,000 for the Linux bridge), although OVS used more CPU (161% vs. 48%). However, when we added one flow to OVS to drop STP BPDU packets and a similar ebtable rule to the Linux bridge, OVS performance and CPU usage remained constant whereas the Linux bridge connection rate dropped to 512,000 tps and its CPU usage increased over 26-fold to 1,279%. This is because the built-in kernel functions have per-packet overhead, whereas OVS’s overhead is generally fixed per-megaflow. We expect that enabling other features, such as routing and a firewall, would similarly add CPU load.
|Linux Bridge||Pass BPDUs||688,000||48%|
|Open vSwitch 1.10||12,000||100%|
|Open vSwitch 2.1||Megaflows Off||23,000||754%|
While these performance numbers are useful for benchmarking, they are synthetic. At the OpenStack Summit last week in Paris, Rackspace engineers described the performance gains they have seen over the past few releases in their production deployment. They begin in the “Dark Ages” (versions prior to 1.11) and proceed to “Ludicrous Speed” (versions since 2.1).
Now that OVS’s performance is similar to those of a fixed-function switch while maintaining the flexibility demanded by new networking applications, we’re looking forward to broadening our focus. While we continue to make performance improvements, the next few releases will begin adding new features such as stateful services and support for new platforms such as DPDK and Hyper-V.
If you want to hear more about this live or talk to us in person, we will all be at the Open vSwitch Fall 2014 Conference next week.