Of Mice and Elephants
[This post has been written by Martin Casado and Justin Pettit with hugely useful input from Bruce Davie, Teemu Koponen, Brad Hedlund, Scott Lowe, and T. Sridhar]
This post introduces the topic of network optimization via large flow (elephant) detection and handling. We decompose the problem into three parts, (i) why large (elephant) flows are an important consideration, (ii) smart things we can do with them in the network, and (iii) detecting elephant flows and signaling their presence. For (i), we explain the basis of elephant and mice and why this matters for traffic optimization. For (ii) we present a number of approaches for handling the elephant flows in the physical fabric, several of which we’re working on with hardware partners. These include using separate queues for elephants and mice (small flows), using a dedicated network for elephants such as an optical fast path, doing intelligent routing for elephants within the physical network, and turning elephants into mice at the edge. For (iii), we show that elephant detection can be done relatively easily in the vSwitch. In fact, Open vSwitch has supported per-flow tracking for years. We describe how it’s easy to identify elephant flows at the vSwitch and in turn provide proper signaling to the physical network using standard mechanisms. We also show that it’s quite possible to handle elephants using industry standard hardware based on chips that exist today.
Finally, we argue that it is important that this interface remain standard and decoupled from the physical network because the accuracy of elephant detection can be greatly improved through edge semantics such as application awareness and a priori knowledge of the workloads being used.
The Problem with Elephants in a Field of Mice
Conventional wisdom (somewhat substantiated by research) is that the majority of flows within the datacenter are short (mice), yet the majority of packets belong to a few long-lived flows (elephants). Mice are often associated with bursty, latency-sensitive apps whereas elephants tend to be large transfers in which throughput is far more important than latency.
Here’s why this is important. Long-lived TCP flows tend to fill network buffers end-to-end, and this introduces non-trivial queuing delay to anything that shares these buffers. In a network of elephants and mice, this means that the more latency-sensitive mice are being affected. A second-order problem is that mice are generally very bursty, so adaptive routing techniques aren’t effective with them. Therefore routing in data centers often uses stateless, hash-based multipathing such as Equal-cost multi-path routing (ECMP). Even for very bursty traffic, it has been shown that this approach is within a factor of two of optimal, independent of the traffic matrix. However, using the same approach for very few elephants can cause suboptimal network usage, like hashing several elephants on to the same link when another link is free. This is a direct consequence of the law of small numbers and the size of the elephants.
Treating Elephants Differently than Mice
Most proposals for dealing with this problem involve identifying the elephants, and handling them differently than the mice. Here are a few approaches that are either used today, or have been proposed:
- Throw mice and elephants into different queues. This doesn’t solve the problem of hashing long-lived flows to the same link, but it does alleviate the queuing impact on mice by the elephants. Fortunately, this can be done easily on standard hardware today with DSCP bits.
- Use different routing approaches for mice and elephants. Even though mice are too bursty to do something smart with, elephants are by definition longer lived and are likely far less bursty. Therefore, the physical fabric could adaptively route the elephants while still using standard hash-based multipathing for the mice.
- Turn elephants into mice. The basic idea here is to split an elephant up into a bunch of mice (for example, by using more than one ephemeral source port for the flow) and letting end-to-end mechanisms deal with possible re-ordering. This approach has the nice property that the fabric remains simple and uses a single queuing and routing mechanism for all traffic. Also, SACK in modern TCP stacks handles reordering much better than traditional stacks. One way to implement this in an overlay network is to modify the ephemeral port of the outer header to create the necessary entropy needed by the multipathing hardware.
- Send elephants along a separate physical network. This is an extreme case of 2. One method of implementing this is to have two spines in a leaf/spine architecture, and having the top-of-rack direct the flow to the appropriate spine. Often an optical switch is proposed for the spine. One method for doing this is to do a policy-based routing decision using a DSCP value that by convention denotes “elephant”.
At this point it should be clear that handling elephants requires detection of elephants. It should also be clear that we’ve danced around the question of what exactly characterizes an elephant. Working backwards from the problem of introducing queuing delays on smaller, latency-sensitive flows, it’s fair to say that an elephant has high throughput for a sustained period.
Often elephants can be determined a priori without actually trying to infer them from network effects. In a number of the networks we work with, the elephants are either related to cloning, backup, or VM migrations, all of which can be inferred from the edge or are known to the operators. vSphere, for example, knows that a flow belongs to a migration. And in Google’s published work on using OpenFlow, they had identified the flows on which they use the TE engine beforehand (reference here).
Dynamic detection is a bit trickier. Doing it from within the network is hard due to the difficulty of flow tracking in high-density switching ASICs. A number of sampling methods have been proposed, such as sampling the buffers or using sFlow. However the accuracy of such approaches hasn’t been clear due to the sampling limitations at high speeds.
On the other hand, for virtualized environments (which is a primary concern of ours given that the authors work at VMware), it is relatively simple to do flow tracking within the vSwitch. Open vSwitch, for example, has supported per-flow granularity for the past several releases now with each flow record containing the bytes and packets sent. Given a specified threshold, it is trivial for the vSwitch to mark certain flows as elephants.
The More Vantage Points, the Better
It’s important to remember that there is no reason to limit elephant detection to a single approach. If you know that a flow is large a priori, great. If you can detect elephants in the network by sampling buffers, great. If you can use the vSwitch to do per-packet flow tracking without requiring any sampling heuristics, great. In the end, if multiple methods identify it as an elephant, it’s still an elephant.
For this reason we feel that it is very important that the identification of elephants should be decoupled from the physical hardware and signaled over a standard interface. The user, the policy engine, the application, the hypervisor, a third party network monitoring system, and the physical network should all be able identify elephants.
Fortunately, this can easily be done relatively simply using standard interfaces. For example, to affect per-packet handling of elephants, marking the DSCP bits is sufficient, and the physical infrastructure can be configured to respond appropriately.
Another approach we’re exploring takes a more global view. The idea is for each vSwitch to expose its elephants along with throughput metrics and duration. With that information, an SDN controller for the virtual edge can identify the heaviest hitters network wide, and then signal them to the physical network for special handling. Currently, we’re looking at exposing this information within an OVSDB column.
Are Elephants Obfuscated by the Overlay?
No. For modern overlays, flow-level information, and QoS marking are all available in the outer header and are directly visible to the underlay physical fabric. Elephant identification can exploit this characteristic.
This is a very exciting area for us. We believe there is a lot of room to bring to bear edge understanding of workloads, and the ability for software at the edge to do sophisticated trending analysis to the problem of elephant detection and handling. It’s early days yet, but our initial forays both with customers and hardware partners, has been very encouraging.
More to come.