Of Mice and Elephants

[This post has been written by Martin Casado and Justin Pettit with hugely useful input from Bruce Davie, Teemu Koponen, Brad Hedlund, Scott Lowe, and T. Sridhar]

Overview

This post introduces the topic of network optimization via large flow (elephant) detection and handling.  We decompose the problem into three parts: (i) why large (elephant) flows are an important consideration, (ii) smart things we can do with them in the network, and (iii) detecting elephant flows and signaling their presence.  For (i), we explain the distinction between elephants and mice (small flows) and why it matters for traffic optimization.  For (ii), we present a number of approaches for handling elephant flows in the physical fabric, several of which we’re working on with hardware partners.  These include using separate queues for elephants and mice, using a dedicated network for elephants such as an optical fast path, doing intelligent routing for elephants within the physical network, and turning elephants into mice at the edge.  For (iii), we show that elephant detection can be done relatively easily in the vSwitch.  In fact, Open vSwitch has supported per-flow tracking for years.  We describe how to identify elephant flows at the vSwitch and in turn provide proper signaling to the physical network using standard mechanisms.  We also show that it’s quite possible to handle elephants using industry-standard hardware based on chips that exist today.

Finally, we argue that it is important that this interface remain standard and decoupled from the physical network because the accuracy of elephant detection can be greatly improved through edge semantics such as application awareness and a priori knowledge of the workloads being used.

The Problem with Elephants in a Field of Mice

Conventional wisdom (somewhat substantiated by research) is that the majority of flows within the datacenter are short (mice), yet the majority of packets belong to a few long-lived flows (elephants).  Mice are often associated with bursty, latency-sensitive apps whereas elephants tend to be large transfers in which throughput is far more important than latency.

Here’s why this is important.  Long-lived TCP flows tend to fill network buffers end-to-end, and this introduces non-trivial queuing delay to anything that shares these buffers.  In a network of elephants and mice, this means that the more latency-sensitive mice are the ones that suffer.  A second-order problem is that mice are generally very bursty, so adaptive routing techniques aren’t effective with them.  Therefore routing in data centers often uses stateless, hash-based multipathing such as equal-cost multi-path routing (ECMP).  Even for very bursty traffic, it has been shown that this approach is within a factor of two of optimal, independent of the traffic matrix.  However, using the same approach for a small number of elephants can cause suboptimal network usage, such as hashing several elephants onto the same link while another link sits free.  This is a direct consequence of the law of small numbers and the size of the elephants: with only a handful of large flows, a hash cannot be relied on to spread them evenly, and each collision is expensive.
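
To put a rough number on the small-numbers effect, suppose each elephant is hashed independently and uniformly onto one of several equal-cost links. This is an idealized assumption; real ECMP hash functions and topologies will differ. Under that assumption, the chance that at least two elephants land on the same link is the familiar birthday-problem probability. The following is a minimal Python sketch, and the function name is ours, purely for illustration.

```python
def collision_probability(num_elephants: int, num_links: int) -> float:
    """Chance that at least two elephants share a link.

    Assumes every elephant is hashed independently and uniformly
    across the links, which is an idealization of ECMP.
    """
    if num_links <= 0:
        raise ValueError("num_links must be positive")
    if num_elephants > num_links:
        return 1.0  # pigeonhole: some link must carry two elephants
    p_all_distinct = 1.0
    for already_placed in range(num_elephants):
        # The next elephant must avoid every link already taken.
        p_all_distinct *= (num_links - already_placed) / num_links
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, "elephants on 8 links:", round(collision_probability(n, 8), 3))
```

Under these assumptions, four elephants on eight links collide with probability of about 0.59, even though there are twice as many links as elephants.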

Treating Elephants Differently than Mice 

Most proposals for dealing with this problem involve identifying the elephants, and handling them differently than the mice.  Here are a few approaches that are either used today, or have been proposed:

  1. Throw mice and elephants into different queues.  This doesn’t solve the problem of hashing long-lived flows to the same link, but it does alleviate the queuing impact of the elephants on the mice.  Fortunately, this can be done easily on standard hardware today with DSCP bits.
  2. Use different routing approaches for mice and elephants.  Even though mice are too bursty to do something smart with, elephants are by definition longer lived and are likely far less bursty.  Therefore, the physical fabric could adaptively route the elephants while still using standard hash-based multipathing for the mice.
  3. Turn elephants into mice.  The basic idea here is to split an elephant up into a bunch of mice (for example, by using more than one ephemeral source port for the flow) and let end-to-end mechanisms deal with possible reordering.  This approach has the nice property that the fabric remains simple and uses a single queuing and routing mechanism for all traffic.  Also, modern TCP stacks with SACK handle reordering much better than traditional stacks.  One way to implement this in an overlay network is to vary the ephemeral source port of the outer header to create the entropy needed by the multipathing hardware.
  4. Send elephants along a separate physical network.  This is an extreme case of 2.  One way to implement this is to have two spines in a leaf/spine architecture and have the top-of-rack switch direct each flow to the appropriate spine.  Often an optical switch is proposed for the spine.  The top-of-rack switch can make this choice with a policy-based routing decision keyed on a DSCP value that by convention denotes “elephant”.
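
As an illustration of approach 3, here is a minimal Python sketch of how an overlay edge might pick the outer-header source port so that pieces of one elephant hash to different paths. Everything named here is hypothetical: the function, the idea of a "subflow index", and the choice of CRC32 as the hash are our assumptions for the example, not a description of any particular encapsulation or vSwitch. How an elephant is cut into subflows is left out, and reordering is left to the end hosts, as described above.

```python
import zlib

EPHEMERAL_LOW = 49152   # start of the IANA dynamic (ephemeral) port range
EPHEMERAL_HIGH = 65535


def outer_source_port(inner_flow: tuple, subflow_index: int) -> int:
    """Pick an outer source port for one piece of an inner flow.

    inner_flow is a hypothetical 5-tuple such as
    (src_ip, dst_ip, protocol, src_port, dst_port).
    The inner-flow hash spreads different flows apart; adding the
    subflow index moves each piece of the same flow to a new port.
    """
    span = EPHEMERAL_HIGH - EPHEMERAL_LOW + 1
    base = zlib.crc32(repr(inner_flow).encode("utf-8"))
    return EPHEMERAL_LOW + (base + subflow_index) % span


if __name__ == "__main__":
    flow = ("10.0.0.1", "10.0.0.2", "tcp", 40000, 443)
    print([outer_source_port(flow, i) for i in range(4)])
```

With a subflow index of zero for every packet, this degenerates to the usual one-port-per-inner-flow behavior, so mice can be left untouched.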

Elephant Detection

At this point it should be clear that handling elephants requires detection of elephants.  It should also be clear that we’ve danced around the question of what exactly characterizes an elephant.  Working backwards from the problem of introducing queuing delays on smaller, latency-sensitive flows, it’s fair to say that an elephant has high throughput for a sustained period.

Often elephants can be determined a priori without actually trying to infer them from network effects.  In a number of the networks we work with, the elephants are related to cloning, backup, or VM migration, all of which can be inferred from the edge or are known to the operators.  vSphere, for example, knows that a flow belongs to a migration.  And in Google’s published work on using OpenFlow, they had identified the flows on which they use the TE engine beforehand (reference here).

Dynamic detection is a bit trickier.  Doing it from within the network is hard due to the difficulty of flow tracking in high-density switching ASICs.  A number of sampling methods have been proposed, such as sampling the buffers or using sFlow.  However, the accuracy of such approaches hasn’t been clear due to the sampling limitations at high speeds.

On the other hand, for virtualized environments (which is a primary concern of ours given that the authors work at VMware), it is relatively simple to do flow tracking within the vSwitch.  Open vSwitch, for example, has supported per-flow granularity for the past several releases, with each flow record containing the bytes and packets sent.  Given a specified threshold, it is trivial for the vSwitch to mark certain flows as elephants.
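
A minimal Python sketch of such a threshold check follows. It assumes a flow record that carries a byte count and a duration, and it uses the working definition from above: high throughput for a sustained period. The record layout, the function name, and both default thresholds (10 Mbit/s and 5 seconds) are hypothetical values chosen for the example; they are not Open vSwitch settings.

```python
from dataclasses import dataclass


@dataclass
class FlowRecord:
    flow_id: str        # hypothetical identifier for the flow
    byte_count: int     # bytes seen so far
    duration_s: float   # seconds since the flow was first seen


def is_elephant(record: FlowRecord,
                min_rate_bps: float = 10e6,
                min_duration_s: float = 5.0) -> bool:
    """True if the flow has sustained a high average rate.

    Both thresholds are illustrative defaults, not recommendations.
    """
    if record.duration_s <= 0 or record.duration_s < min_duration_s:
        return False  # too young to call, however fast it is
    average_rate_bps = record.byte_count * 8 / record.duration_s
    return average_rate_bps >= min_rate_bps


if __name__ == "__main__":
    backup = FlowRecord("backup", 100_000_000, 10.0)   # 80 Mbit/s for 10 s
    rpc = FlowRecord("rpc", 20_000, 0.2)               # short burst
    print(is_elephant(backup), is_elephant(rpc))
```

Requiring a minimum duration keeps a short, fast burst from being mislabeled, at the cost of reacting a few seconds late to a real elephant.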

The More Vantage Points, the Better

It’s important to remember that there is no reason to limit elephant detection to a single approach.  If you know that a flow is large a priori, great.  If you can detect elephants in the network by sampling buffers, great.  If you can use the vSwitch to do per-packet flow tracking without requiring any sampling heuristics, great.  In the end, if multiple methods identify it as an elephant, it’s still an elephant.

For this reason we feel that it is very important that the identification of elephants be decoupled from the physical hardware and signaled over a standard interface.  The user, the policy engine, the application, the hypervisor, a third party network monitoring system, and the physical network should all be able to identify elephants.

Fortunately, this can be done relatively simply using standard interfaces.  For example, to affect per-packet handling of elephants, marking the DSCP bits is sufficient, and the physical infrastructure can be configured to respond appropriately.
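
As a small illustration of DSCP marking from one of those vantage points, the application, here is a Python sketch that sets the DSCP bits on a socket. The DSCP occupies the upper six bits of the former IPv4 TOS byte, so the value passed to the socket option is the codepoint shifted left by two. The codepoint chosen for "elephant" (10, which is AF11) is purely an assumption for this example; the post does not prescribe one. The IP_TOS socket option is available on Linux and most Unix-like systems, and behavior on other platforms may differ. This is not how a vSwitch would mark packets; it only shows the bit layout and one edge-side way to set it.

```python
import socket

ELEPHANT_DSCP = 10  # hypothetical convention for this example (AF11)


def dscp_to_tos(dscp: int) -> int:
    """Place a 6-bit DSCP codepoint in the upper bits of the TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in six bits (0-63)")
    return dscp << 2  # the low two bits are left clear for ECN


def mark_as_elephant(sock: socket.socket, dscp: int = ELEPHANT_DSCP) -> None:
    """Ask the OS to stamp outgoing IPv4 packets on this socket."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))


if __name__ == "__main__":
    print(dscp_to_tos(ELEPHANT_DSCP))
```

The physical network then only needs a matching classification rule for that codepoint, whichever party set it.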

Another approach we’re exploring takes a more global view.  The idea is for each vSwitch to expose its elephants along with throughput metrics and duration.  With that information, an SDN controller for the virtual edge can identify the heaviest hitters network wide, and then signal them to the physical network for special handling.  Currently, we’re looking at exposing this information within an OVSDB column.
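
The controller-side step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the report structure, field names, and function name are hypothetical, and nothing here reads from OVSDB. It only shows merging per-vSwitch elephant reports and keeping the k flows with the highest throughput.

```python
import heapq
from typing import Dict, List, NamedTuple


class ElephantReport(NamedTuple):
    vswitch: str            # which vSwitch reported the flow
    flow_id: str            # hypothetical flow identifier
    throughput_bps: float   # reported throughput
    duration_s: float       # how long the flow has been active


def heaviest_hitters(reports_by_vswitch: Dict[str, List[ElephantReport]],
                     k: int) -> List[ElephantReport]:
    """Merge every vSwitch's reports and return the top k by throughput."""
    all_reports = [report
                   for reports in reports_by_vswitch.values()
                   for report in reports]
    return heapq.nlargest(k, all_reports, key=lambda r: r.throughput_bps)


if __name__ == "__main__":
    reports = {
        "vs1": [ElephantReport("vs1", "migration", 9e9, 42.0),
                ElephantReport("vs1", "clone", 2e9, 15.0)],
        "vs2": [ElephantReport("vs2", "backup", 5e9, 300.0)],
    }
    for report in heaviest_hitters(reports, 2):
        print(report.vswitch, report.flow_id, report.throughput_bps)
```

Ranking on throughput alone is a design choice made for brevity; a real policy might also weigh duration, which is why it is carried in the report.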

Are Elephants Obfuscated by the Overlay?

No.  For modern overlays, flow-level information and QoS markings are both available in the outer header and are directly visible to the underlay physical fabric.  Elephant identification can exploit this characteristic.

Going Forward

This is a very exciting area for us.  We believe there is a lot of room to bring edge understanding of workloads, and the ability of software at the edge to do sophisticated trending analysis, to bear on the problem of elephant detection and handling.  It’s early days yet, but our initial forays with both customers and hardware partners have been very encouraging.

More to come.


16 Comments on “Of Mice and Elephants”

  1. What are the views of the authors regarding the L3 handling of “elephant” transformation into “mice”? I mean, suppose that you just make the physical/L2 path aware of the need of distributing the load through the network fabric. Won’t a router (even an edge router) end up using EGP metrics/hot-potato routing and maybe send the flow back to the shortest path exit?

  2. What is the authors’ view on L3 forwarding for the “elephant dispersion” case? Wouldn’t different edge routers receiving the partial flow end up sending it back to the shortest AS-PATH exit? Or do you assume that the L3 is as “controllable” as in the B4 case?
    Disclaimer: I’m a PhD student interested in this area, so I’m asking this while looking for an interesting area to work on ;-)

  3. manfred says:

    Thanks for this well-written article highlighting the issues that elephant flows can cause in a field of mice, and how the application/operator, vSwitch and physical networks can work together to identify the heaviest hitters to help optimize the overall system in a complementary fashion.

  4. Silvester says:

    Very interesting topic. But, approaches like Kandoo might be cleaner for these kinds of problems as they can scale while still being decoupled from (virtual/physical) switches. What are your thoughts on that?

  5. Peter Phaal says:

    Great article. The nice thing about elephant flows is that they are extremely easy to detect using packet sampling. The following article describes how sFlow can be used to rapidly detect large flows on tens of thousands of links and at speeds all the way up to 100G.

    http://blog.sflow.com/2013/06/large-flow-detection.html

    Note: The article describes how Mininet and Open vSwitch can be used to experiment with large flow detection and control. It’s a simple test bed that anyone can set up in a virtual machine on their laptop.

  6. […] on RedHat Fedora 19. There are some new OVS tables included in the latest builds that include some neat concepts. OVS is also the generally accepted SDN reference data plane implementation in the industry. I tend […]

  7. […] Open vSwitch v2.0 introduces some really important features, at the top of the list is multi-threaded support in vswitchd. This will increase flow instantiation rates significantly into the upstream kernel module. A rough guess would be from less then 10k per/sec to 10x that w/ multithreaded support. I would imagine there will be a single threaded implementation for a single trace, serial hardware switch needs. This release also includes experimental support for OpenFlow v1.1, v1.2 and v1.3 along with some new OVSDB tables that have some cool potential such as IPFIX. […]

  8. […] elephants and mice post at Network Heresy has sparked some discussion across the “blogosphere” about how […]

  9. Lennie says:

    My solution would be:

    Add VLAN-aliasing support to the server/hypervisor/virtualswitch.

    Configure the switches in the network with multiple the same VLAN. Then configure each of these VLANs with a different Class Of Service. Pretty much every switch and vendor supports it.

    So what do you get then ? Then you can mark the outgoing packets on the server with one of the VLAN-tags and the switches will give them different priorities. You’ll only need a few VLANs, because the traffic inside these VLANs will only be overlay-, management- and backup traffic.

    To make it complete, when a server receives packets from another server from any of these VLANs, because of aliasing all that VLAN traffic will be treated the same way.

    PS Whatever happened to STT? STT could do offloading of the overlay to the .

  10. […] recent Network Heresy post “Of Mice and Elephants” discussed the impact long-lived flows (elephants) have on their short-lived peers (mice).  A […]

  11. Thanks for the great write up. Another way to handle the elephant flows is to flip them out of the packet-network onto a pure optical circuit switched fabric that has unlimited capacity, almost zero latency, and 5x to 10x lower cost at 40G. This has been described in papers like “Helios” and “C-Through”. Our company builds 3D MEMS optical circuit switches specifically for applications like these. I won’t spam you with details but please contact me or come visit our site if you would like to learn more.

  12. salaheddine says:

    Hey, thanks for the great work. I have seen a system from HP called MAHOUT: they add a shim layer in the networking stack of the VM, so detection is easy because it happens at the VM level, and the hypervisor then sends the information to the next switch. Using this approach with OpenFlow may be interesting since it can activate a fast path across the whole fabric.
    Well, I have one question, or rather a proposition, and I’d like to see if someone here can give me feedback.
    What I’d like to do as a project is to use an NPU connected via PCI Express; the NPU will handle a specific type of traffic. The main question is: in Open vSwitch, does a rule like (in_port=1, output_port=2) have the same performance as classical L2 switching?

  13. […] latter topic is something we’ve addressed in some other recent posts (here, here and here) — in this blog we’ll focus more on how we deal with physical devices […]

  14. […] latter topic is something we’ve addressed in some other recent posts (here, here and here) — in this blog we’ll focus more on how we deal with physical devices at the […]

  15. […] discussed in other blog posts and presentations, long-lived, high-bandwidth flows (elephants) can negatively affect short-lived […]

