The Rise of Soft Switching Part 2.5: Responses and clarifications

[Note: This is an unplanned vignette for the soft switching series. We've received a lot of e-mail in response to our previous two posts, and Alex submitted a great comment that outlines many of the popular issues that have come up. Rather than leaving the response buried in the comment thread, we're upgrading it to a full post. Thanks to Alex for kicking this off.]

Alex: I would say that you’ve made a lot of assumptions and oversimplified the generic problem. You are right that for some applications a softswitch could be a viable solution. However, there are a number of arguments against it:

It's hard to write a somewhat concise blog post without some simplification. Oversimplified? Perhaps, and if so, let's take the time to flesh out the details. We always welcome constructive discussion. As a start, let's take a look at some of your points.

Alex: 1. When you need IPsec/MACsec/TCP/UDP offload, you still need an intelligent NIC. In many cases you need to load balance between multiple switching instances, so with the intelligent NIC it makes sense to place that function there, because it is coming "for free". The same is true also for basic switching.

No one is arguing against intelligent NICs. However, this has nothing to do with soft switching. As mentioned in response to another comment, the vSwitch can (and does) carry the context for various stateless offloads from the vNIC through to the pNIC, and in the vNIC-to-vNIC case it can either skip them entirely (e.g., VLAN tagging/stripping) or optionally do them in software [checksum offload can technically be dropped, but that makes tcpdump output look scary, so it is usually done in software for "free" inline with the copy that happens anyway].
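To make "carrying the context" a bit more concrete, here is a rough sketch of the kind of per-packet offload metadata a soft switch can hand along with the frame. The structure and field names are purely illustrative; they are not Open vSwitch's (or any other vSwitch's) actual data structures.

```c
/* Hypothetical per-packet offload context carried by a soft switch from the
 * vNIC toward the pNIC. Real vSwitches have their own structures; this is
 * only a sketch of the idea described above. */
#include <stdbool.h>
#include <stdint.h>

struct pkt_offload_ctx {
    bool     csum_needed;    /* L4 checksum still to be computed            */
    uint16_t csum_start;     /* offset where checksumming should begin      */
    uint16_t csum_offset;    /* where to store the resulting checksum       */
    bool     tso_requested;  /* TCP segmentation offload requested          */
    uint16_t tso_mss;        /* MSS to use if segmentation is deferred      */
    bool     vlan_present;   /* VLAN tag to insert on transmit?             */
    uint16_t vlan_tci;       /* the tag itself, if so                       */
};

/* On a vNIC->pNIC path this context is handed to the physical NIC and the
 * work happens in hardware; on a vNIC->vNIC path the soft switch can either
 * skip the work entirely (VLAN tag/strip) or do it during the copy. */
```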

Also, as we've discussed at some length, doing inter-VM switching in hardware is far from "free". It incurs overhead and it sacrifices the flexibility of switching in software.

Alex: 2. Switches today are not that dumb: they can handle very large tables, they can process L2-L4 headers and beyond (usually the first 128B-256B of the packet are parsed), and some of them are even programmable so that new protocols can be added.

This is definitely worth taking a closer look at. We've spent years working with high-end merchant silicon switching chips, trying to emulate what others are able to do with soft switching. Note that we have a very good relationship with the silicon vendors, so this isn't just blind fumbling.

Let's look at some recent hurdles we've run into in practice:

IP overwrite: Virtual networking environments sometimes need address isolation similar to NAT. Very few commercially available switching chipsets support general IP overwrite, and even fewer (none?) support it at a useful position in the lookup pipeline for virtual networking. Clearly this is trivially doable (and often done) in software at the edge.

Note that within a virtualized context, IP overwrite is not just an edge function, since two co-located VMs in different logical contexts may wish to communicate, thereby needing overwrite.
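To back up the "trivially doable in software" claim, here is a minimal sketch of an IPv4 source-address overwrite with an incremental header-checksum fix (per RFC 1624). The function and variable names are our own, options and fragments are ignored, and a full implementation would also patch the TCP/UDP pseudo-header checksum the same way.

```c
#include <stdint.h>
#include <string.h>

/* RFC 1624 incremental update: HC' = ~(~HC + ~m + m'). One's-complement
 * sums are byte-order independent, so we work on the 16-bit words exactly
 * as they sit in the packet. */
static uint16_t csum_replace16(uint16_t csum, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum = (uint16_t)~csum;
    sum += (uint16_t)~old_word;
    sum += new_word;
    while (sum >> 16)                        /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* 'ip' points at the IPv4 header; 'new_src' is the replacement source
 * address, already in network byte order. */
static void ip_src_overwrite(uint8_t *ip, const uint8_t new_src[4])
{
    uint16_t csum, old_w, new_w;

    memcpy(&csum, ip + 10, 2);               /* header checksum field      */
    for (int i = 0; i < 2; i++) {
        memcpy(&old_w, ip + 12 + 2 * i, 2);  /* old source addr, 16b words */
        memcpy(&new_w, new_src + 2 * i, 2);
        csum = csum_replace16(csum, old_w, new_w);
    }
    memcpy(ip + 12, new_src, 4);             /* write the new address      */
    memcpy(ip + 10, &csum, 2);               /* and the patched checksum   */
}
```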

Additional lookup for virtual context: Often, to preserve logical context, additional information needs to be stuffed into packets. To do so, you have one of two choices: (a) overload an existing field (ugly, because you can no longer use that field within the logical context, and at some point you'll run out of fields to overload) or (b) create a new field. Let's look at (b):

For the mass-produced switching chips that we're familiar with, adding a new field means changing the protocol parser, which means spinning a new chip, which means an 18-month wait. Further, if you want this field to be part of a useful lookup (which you often do), then you're limited in the number of bits you're going to get. We've quite literally sat with ASIC designers arguing for 32 bits over 24 for such fields, when we routinely use 64 or 128 bits in software.
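To illustrate the contrast: in software the lookup key is just a declaration, and making the logical context identifier 64 or 128 bits wide is a recompile rather than a chip spin. A hypothetical flow key (the names and layout here are ours, not any particular vSwitch's):

```c
/* Hypothetical soft-switch flow key. The logical context identifier is
 * 128 bits simply because software allows it; in a fixed-pipeline ASIC
 * every one of these bits has to be argued for before tape-out. */
#include <stdint.h>

struct flow_key {
    uint8_t  logical_ctx[16];   /* 128-bit virtual network / tenant id */
    uint8_t  eth_dst[6];
    uint8_t  eth_src[6];
    uint16_t eth_type;
    uint32_t ip_src;
    uint32_t ip_dst;
    uint8_t  ip_proto;
    uint16_t tp_src;
    uint16_t tp_dst;
} __attribute__((packed));

/* Widening logical_ctx is a recompile, not an 18-month wait. */
```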

L2-in-L3 tunneling: Until very recently, L2-in-L3 was not available at 10G, and even now the support is extremely limited (next-generation chips will improve the situation, though again, this required over a year of bake time).

For the L2-in-L3 tunneling that is available, you can rarely use the key for lookup, and in some implementations decap only offers a small fraction of the full bisection bandwidth of the switching chip, making it impractical for heavy use.
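For reference, software L2-in-L3 encapsulation is little more than prepending an outer header and copying. The sketch below uses a made-up header layout (real schemes such as GRE differ in the details); the point is that in software the tunnel key can be as wide as we like and can feed directly into the flow lookup on decap.

```c
/* Sketch of generic L2-in-L3 encapsulation in software: prepend an outer
 * IPv4 + UDP header and a tunnel key in front of the inner Ethernet frame.
 * The layout is our own illustration, not a specific standard. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct tunnel_hdr {
    uint8_t  outer_ip[20];   /* outer IPv4 header, filled in by the sender   */
    uint8_t  outer_udp[8];   /* outer UDP header                             */
    uint64_t tunnel_key;     /* logical-network key; 64 bits because we can  */
} __attribute__((packed));

/* 'out' must have room for sizeof(struct tunnel_hdr) + len bytes. On decap,
 * the 64-bit key can be fed straight into the flow lookup, which is exactly
 * what most switching silicon would not let us do. */
static size_t l2_in_l3_encap(const struct tunnel_hdr *hdr,
                             const uint8_t *frame, size_t len, uint8_t *out)
{
    memcpy(out, hdr, sizeof(*hdr));              /* outer headers           */
    memcpy(out + sizeof(*hdr), frame, len);      /* inner frame, verbatim   */
    return sizeof(*hdr) + len;
}
```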

Again, this isn't theoretical musing on edge cases. We've run into all of these issues in practice, and there are many others we haven't mentioned (tunnel and ECMP limits, for example).

So yes, switching silicon is flexible and getting better all the time. But there are, and will continue to be, very real limitations when compared to switching in software. Switching silicon isn't going away. However, it's not clear that it is the right place to perform all packet lookup and manipulation.

A final point: you can certainly achieve all of these things if you cobble together enough hardware. However, we would wager that such a design is a very clear loser on price/performance, and you still don't overcome the flexibility and upgrade-cycle problems.

Alex: 3. You can use NPUs for switching, which can provide the required flexibility, performance and scalability.

NPUs don't have the price/performance (or the high-fanout capacity) of a standard switching chip (e.g. Broadcom). And they don't have the flexibility, economies of scale, and tool-chain support of x86.

There is probably a good economic reason that many of the most popular edge appliances today (WAN optimization, load balancing, etc.) are built on souped-up x86 boxes, and not on esoteric NPU platforms.

This would be an interesting point to follow up on. Our contention is that classic table-based switching is the best way to build a physical fabric (the price/performance for high-fanout switching seems unbeatable), and that x86 is the best way to handle intelligent packet manipulation at the edge.

If you (Alex) are interested in working on a follow-on post, how about we pencil out together the price/performance of (say) a Broadcom/Intel solution vs. Broadcom + some NPU + Intel? I think it would be an interesting exercise.

Alex: 4. Even if softswitching can achieve the same performance level, you pay a much heavier price from the power consumption point of view. And we all know how important power is today.

This is true, of course, in terms of bytes per watt, but in terms of elasticity and utilization x86 remains attractive: for the 90% of the day that the network is at 10% utilization, all those x86 cores can do useful work.

It's fair to point out that you may want to statically provision the cores, meaning they will also remain underutilized. Given that, you're probably right that there is a power differential. It would be interesting to compare this against adding an NPU/TCAM/whatever. That's something we don't know the answer to offhand, but we would love to explore it further.

Alex: 5. You've assumed a large packet size in your calculations, which is not always the case. Instead of providing the bit rate as a performance characteristic, a much better metric is the packet rate. For instance, a 10Gbps-capable switch can usually handle 15Mpps. I do not think you can claim the same for a softswitch.

Also true, but we're talking about edge switching. If we enumerate the workloads we're familiar with that push anywhere near line rate with small message sizes, across the number of VMs that fit on one host, our experience is that they are either (a) netperf, (b) HPC, or (c) trading apps.

We can safely ignore (a). (b) and (c) are legitimate exceptions, but they have many other issues with virtualization and also fewer requirements; passthrough with on-NIC switching, or hairpinning to the switch, is probably a good choice for them. There's a reason Myrinet/Quadrics/InfiniBand exist and yet are not heavily used outside of HPC and other specialized environments.
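As an aside, the bit-rate vs. packet-rate arithmetic behind Alex's ~15 Mpps figure is easy to pencil out. A quick sketch, assuming standard Ethernet framing overhead (8 bytes of preamble/SFD plus a 12-byte inter-frame gap per frame):

```c
/* Back-of-the-envelope packet rate for a 10GbE port. Each frame occupies
 * its own bytes on the wire plus 20 bytes of preamble/SFD and inter-frame
 * gap. 64-byte frames work out to ~14.88 Mpps; 1518-byte frames to ~0.81
 * Mpps: roughly an 18x difference in per-packet work hidden behind the
 * same "10 Gbps" number. */
#include <stdio.h>

int main(void)
{
    const double line_rate_bps = 10e9;                 /* 10GbE             */
    const int wire_overhead = 8 + 12;                  /* preamble/SFD + IFG */
    const int frame_sizes[] = { 64, 512, 1518 };       /* on-wire, incl. FCS */

    for (int i = 0; i < 3; i++) {
        double pps = line_rate_bps / ((frame_sizes[i] + wire_overhead) * 8.0);
        printf("%4d-byte frames: %6.2f Mpps\n", frame_sizes[i], pps / 1e6);
    }
    return 0;   /* prints ~14.88, ~2.35, and ~0.81 Mpps respectively */
}
```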

Alex: 6. If your application also requires traffic management capabilities, the softswitch overhead goes up very quickly with the number of queues, because a dual token bucket or leaky bucket implementation is expensive in terms of CPU cycles. Of course, you can say that TM can also be offloaded to the NIC, but TM requires some level of flow classification, and with flow classification already performed in the NIC it makes sense to do the switching in the NIC as per the 1st argument.

Here we totally agree, and this is an awesome segue to our next post.
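In the meantime, to give a feel for the per-packet cost Alex is pointing at, here is a minimal single-rate token bucket policer in software. The naming is our own and this is only a sketch (Alex's dual-rate variant roughly doubles the work, and it is not the design we'll describe in the next post); run one of these per queue, per packet, and the cycles add up quickly.

```c
/* Minimal single-rate token bucket policer, sketching the per-packet work
 * that traffic management adds in software. */
#include <stdbool.h>
#include <stdint.h>

struct token_bucket {
    uint64_t tokens;         /* current depth, in bytes          */
    uint64_t burst;          /* maximum depth, in bytes          */
    uint64_t rate_bytes_s;   /* fill rate, in bytes per second   */
    uint64_t last_ns;        /* timestamp of the previous refill */
};

/* Returns true if the packet conforms (may be sent now), false if it must
 * be queued or dropped. 'now_ns' is a monotonic clock in nanoseconds.
 * A production version would also guard the refill multiply against
 * overflow after long idle periods. */
static bool tb_conform(struct token_bucket *tb, uint64_t now_ns, uint32_t pkt_len)
{
    uint64_t elapsed_ns = now_ns - tb->last_ns;

    /* Refill proportionally to elapsed time, capped at the burst size. */
    tb->tokens += (elapsed_ns * tb->rate_bytes_s) / 1000000000ull;
    if (tb->tokens > tb->burst)
        tb->tokens = tb->burst;
    tb->last_ns = now_ns;

    if (tb->tokens < pkt_len)
        return false;
    tb->tokens -= pkt_len;
    return true;
}
```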

Alex: Just to clarify my comment: I am not saying that a softswitch is not a good solution. I am saying that you should check a particular application and its evolving requirements before deciding on a softswitch-based implementation.

This is true, but it borders on tautological. Clearly you can contrive a situation in which soft switching doesn't work. However, we're interested in refining the discourse beyond the obvious.

At least some of the authors are in a pretty good position to claim to have surveyed a large number of applications, deployments, and customer requirements for vSwitches (edge switching within virtualized environments). In our experience, there are very few environments outside of HPC and trading in which soft switching isn't a great fit.

Alex: And a final comment here about the definition of a softswitch. Remember that the above-mentioned intelligent NIC could be based on the latest generation of multi-core processors (Cavium, NetLogic, Tilera, etc.) with hybrid NPU-CPU capabilities and a lot of offload, but the switching itself is still performed by software in either a Linux or bare-metal environment. So from the system point of view the switching is performed by the NIC, but it is still a softswitch…

No argument here :)

[Thanks again to Alex for the comment; it does flesh out more of the discussion. Clearly the position is more nuanced than categorical, and clearly the discourse is ongoing. Perhaps you'd be interested in fleshing out some of the price/performance/power arguments for a future post?]


2 Comments on “The Rise of Soft Switching Part 2.5: Responses and clarifications”

  1. Alex Bachmutsky says:

    Let me give you some responses:
    1. We already came to the conclusion that the intelligent NIC is there. My argument is that adding switching performed by the same NIC could be lower cost compared to the entire processing that you add in the softswitch.
    2. Believe me that I am well aware of the high-end switches, NPUs and multicore chips available on the market; I wrote a book about them. For your points:
    a) NAT and NAT-PT can be done by some high-end switch devices, practically all NPUs and all multicore CPU/NPU hybrids; remember that all three can be used as an engine running on the intelligent NIC.
    b) extra lookup. First of all, extra virtual context could be signaled by MAC-in-MAC, Q-in-Q, tunneling, etc. All of those can be processed by the majority of high-end switches already today. You could also use some proprietary EtherType; I would agree that it is ugly, but it is sometimes used for internal connections. If for any reason you need to add a non-standard virtual instance identifier, some switches from at least 2 vendors (as I already mentioned in my original post) have programmable microcode-based engines, and you can just add that extra field without the need to re-spin the switch. Similarly to the previous comment, all NPUs and all multicore CPU-NPU hybrids will be able to do that without any problem, because they are by definition highly programmable.
    c) The same comment as above is true for an L2-in-L3 tunnel (or any other tunnel, for that matter).
    d) "However, we would wager that such a design is a very clear loser on price/performance, and you still don't overcome the flexibility and upgrade-cycle problems." With this one I absolutely disagree. Price/performance for switching applications will always be better in a switch/NPU/multicore, just because they were designed for that purpose. Only when your processing requires switching based on DPI can you try to argue your case, but even in this case some multicore CPUs today have integrated HW RegEx engines that can make many such applications very efficient. A software-based approach competes with HW when there is a need to terminate complex L4-7 protocols. Even here some solutions can fully terminate (not only offload checksum calculation for) IP and TCP.
    3. Regarding NPU price/performance efficiency for switching applications, I disagree with you. In my previous company we did a lot of measurements, and in data plane processing (and switching belongs to the data plane) the answer is very clear and against your arguments. The issue is more complex for the control plane (routing protocols, for instance), but even there the new generation of hybrid NPU/CPU multicore devices competes well. CPUs like x86 or SPARC win, however, for many management plane and service plane tasks. Also, remember that today there are NPUs capable of processing 100Gbps of traffic at any packet size, so the efficiency depends on the level of traffic aggregation you plan for the target product based on either an NPU or a softswitch. Unfortunately, I cannot publish anything, because all such measurements are under NDA. This is also the reason why I don't mention company names. I need to think about how to do that. I could probably request approval of publication from a number of vendors, but it will not be easy. The problem is that everything is OK for a single vendor, but then the second vendor's numbers could be better, making a negative point for the first vendor. And so on. There are no independent comparisons, unfortunately. You can find some DMIPS or similar scores, but they are totally useless even to compare different CPUs, and definitely not practical for comparing a CPU and an NPU. This is one of the reasons why I could not include performance comparisons in my last book.
    4. If you can reuse your GP-CPU for other tasks when switching is underloaded, you are right. The problem is that this is rarely the case today. Your argument becomes more relevant if and when we move to cloud-based switching. However, today most clouds are good for processing but not very good for high-speed networking; it would be very expensive to perform softswitching on the Amazon cloud right now, but that might change in the future. At that time your elasticity argument will be very relevant.
    5. First of all, it depends which edge you are talking about. For instance, the mobile edge has a much smaller packet size on average. At the same time, the average packet size on the Internet today is still probably around 400B, still much lower than your numbers. If you build a device for a tightly controlled enterprise or consumer market, you can use that 400B as your target number. However, if you want to sell the device to the telco market, you will have a hard time arguing with operators about the average packet size; they simply require 64B performance even if in the field they don't really need it. From their point of view, it gives them extra cushion just in case.

    One thing should be very clear: general-purpose software-based development (I don't include microcode here) is much simpler when there is a need to develop significant code. Switching code I would not call "significant"; it can still be driven by microcode. More complex applications benefit more from a GP-CPU. Obviously, nobody will even imagine (at least today) writing 100K lines of code on an NPU; a few thousand lines of code is probably the limit there (instructions there are more complex, so a few thousand NPU lines is probably equivalent to something on the order of 10K C-code instructions).

    A very interesting case for the discussion is the multicore NPU/CPU hybrid devices, because they run Linux, are programmed in C/C++ and can definitely be competitors to softswitching based on x86. If it is a generic discussion without any comparison, we can discuss those in more detail.

    Thanks,
    Alex

  2. At this point, we're going in circles. Many of your comments simply don't make sense in the context of the discussion. For example:

    “Your argument becomes more relevant if and when we move to the cloud-based switching”.

    The vast majority of all virtualized workloads over the last decade have used soft switching, and that includes every large hosting environment (or cloud) we've worked with. Generally they use the Linux bridge, Open vSwitch or the VMware vSwitch. Perhaps by "soft switching at the virtual edge" you think we're referring to something other than doing first-hop switching within the hypervisor? Because that is the single focus of this entire series of posts.

    More fundamentally, we have conflicting experience, intuition, and analysis.

    At this point, let's take the discussion offline, flesh out the details, and report the results together. Really, there are two points being made. For the purposes of virtual edge networking of canonical VM workloads:

    1. Is existing high-fanout switching silicon sufficiently flexible to offload the full virtual switching decision?
    2. Can an NPU or more traditional table-based switching chip be used cost effectively in the NIC to offload the virtual edge switching decision?

    We shouldn't have to guess or resort to gross generalities in trying to answer these; we have all the data we need. As I mentioned, soft switching has been used for over a decade in nearly all virtualized deployments, so we know the lookup requirements and the traffic demands. All we have to do is determine which hardware can support it, how much it costs, and do a side-by-side comparison against the classic soft switch design. Easy peasy.

    Updates to follow as we make headway …

