[This post was co-authored by Bruce Davie and Ken Duda]
Almost a year ago, we wrote a first post about our efforts to build virtual networks that span both virtual and physical resources. As we’ve moved beyond the first proofs of concept to customer trials for our combined solution, this post serves to provide an update on where we see the interaction between virtual and physical worlds heading.
Our overall approach to connecting physical and virtual resources can be viewed in two main categories:
- terminating the overlay on physical devices, such as top-of-rack switches, routers, appliances, etc.
- managing interactions between the overlay and the physical devices that provide the underlay.
We first started working on a control plane to terminate network virtualization overlays on physical devices in 2012. We began with the information model, defining what information needed to be exchanged between a physical device and a network virtualization controller such as NSX. To bound the problem space, we focused on a specific use case: mapping the ports and VLANs of a physical switch into virtual layer 2 networks implemented as VXLAN-based overlays. (See our posts on the issues around encapsulation choice here and here.) At the same time, we knew that over time we’d have to support many more use cases than just L2 bridging between physical and virtual worlds, so we picked a completely extensible protocol to carry the necessary information: OVSDB. After all, one of the tenets of network virtualization is that a virtual network should faithfully reproduce all of the networking stack, from L2 to L7, just as server virtualization faithfully reproduces a complete computing environment (as described in more detail here).
So, the first thing we added to the solution space once we got L2 working was L3. By the time we published the VTEP schema as part of Open vSwitch in late 2013, distributed logical routing was included. (We use the term VTEP – VXLAN tunnel end-point – as a general term for the devices that terminate the overlay.) Let’s take a look at how logical routing works.
Distributed logical routing is an example of a more general capability in network virtualization, the distribution of services. Brad Hedlund wrote some time ago about the value of distributing services among hypervisors. The same basic arguments apply when there are VTEPs in the picture — you want to distribute functions, like logical routing, so that packets always follow the shortest path without hair-pinning, and so that the capacity to perform that function scales out as you add more devices, be they hypervisors or physical switches.
So, suppose a VM (VM1) is placed in logical subnet A, and a physical server (PS1) that is in subnet B is located behind a ToR switch acting as a VTEP (see picture). Say we want to create a logical network that interconnects these two subnets. The logical topology is created by API requests to the network virtualization controller, which in turn programs the vswitches and the ToR to instantiate the desired topology. Part of this process entails mapping physical ports to the logical topology via API requests. Everything the ToR needs to know to participate in the logical topology is provided to it via OVSDB.
Suppose VM1 needs to send a packet to PS1. The VM will send an ARP request towards its default gateway, which is implemented in a distributed manner. (We assume the VM learned its default gateway via some prior step; for example, DHCP may be used.) The ARP request will be intercepted by the local vswitch on the hypervisor. Acting as the logical router, the vswitch will reply to the ARP, so that the VM can now send the packet towards the router. All of this happens without any packet leaving the hypervisor.
The logical router now needs to ARP for the destination, PS1 (assuming an ARP cache miss for the first packet). It will broadcast the ARP request by sending it over a VXLAN tunnel to the VTEP (and potentially to other VTEPs as well, if more VTEPs are involved in logical subnet B). When the ARP packet reaches the ToR, it is sent out on one or more physical interfaces: the set of interfaces that were previously mapped to this logical subnet via API requests. The ARP will reach PS1, which replies; the ToR forwards the reply over a VXLAN tunnel to the vswitch that issued the request. The vswitch is now able to forward the data traffic to the ToR, which decapsulates the packet and delivers it to PS1.
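To make the walkthrough above concrete, here is a minimal Python sketch of what one local instance of the distributed logical router might do. It is a toy model, not NSX or Open vSwitch code: the class name, method names, and all addresses are hypothetical, and real implementations do this work in the datapath rather than in Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalRouterInstance:
    """One local instance of a distributed logical router (hypothetical model)."""
    gateway_mac: str
    gateway_ips: set     # default-gateway IPs this instance answers ARP for
    subnet_vteps: dict   # logical subnet -> tunnel IPs of VTEPs in that subnet
    arp_cache: dict = field(default_factory=dict)  # dest IP -> (MAC, VTEP IP)

    def answer_gateway_arp(self, target_ip):
        """Reply locally if the ARP targets a logical gateway IP, else None."""
        return self.gateway_mac if target_ip in self.gateway_ips else None

    def next_hop(self, dest_ip, dest_subnet):
        """Return ('forward', mac, vtep) on a cache hit, else ('arp', vteps)."""
        if dest_ip in self.arp_cache:
            mac, vtep = self.arp_cache[dest_ip]
            return ("forward", mac, vtep)
        # Cache miss: the ARP request goes over tunnels to the subnet's VTEPs.
        return ("arp", sorted(self.subnet_vteps[dest_subnet]))

    def learn(self, ip, mac, vtep):
        """Record an ARP reply that arrived over the tunnel from `vtep`."""
        self.arp_cache[ip] = (mac, vtep)


router = LogicalRouterInstance(
    gateway_mac="02:00:00:00:00:01",
    gateway_ips={"10.0.1.1", "10.0.2.1"},
    subnet_vteps={"B": {"192.0.2.10"}},
)
first = router.next_hop("10.0.2.50", "B")    # first packet: cache miss
router.learn("10.0.2.50", "02:00:00:00:00:50", "192.0.2.10")
second = router.next_hop("10.0.2.50", "B")   # later packets: cache hit
```

The first lookup asks for an ARP to be sent to the VTEPs in subnet B; once the reply has been learned, later packets are forwarded straight to the ToR's tunnel address.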
For traffic flowing the other way, the role of logical router would be played by the physical VTEP rather than the vswitch. This is the nature of distributed routing — there is no single box performing all the work for a single logical router, but rather a collection of devices. And in this case, the work is distributed among both hardware VTEPs and vswitches.
We’ve glossed over a couple of details here, but one detail that’s worth noting is that, for traffic heading in the physical-to-virtual direction, the hardware device needs to perform an L3 lookup followed by VXLAN encapsulation. There has been some uncertainty regarding the capabilities of various switching chips to perform this operation (see this post, for example, which tries to determine the capabilities of Trident 2 based on switch vendor information). We’ve actually connected VMware’s NSX controller to ToR switches using at least four different classes of switching silicon (two merchant vendors, two custom ASIC-based designs). This includes both Arista’s 7150 series and 7050X switches. All of these are capable of performing the necessary L3+VXLAN operations. We’ll let the switch vendors speak for themselves regarding product specifics, but we’re essentially viewing this as a non-issue.
OK, that’s L3. What next? Overall, our approach has been to try to provide the same capabilities to virtual ports and physical ports, as much as that is possible. Of course, there is some inherent conflict here: hardware-based end-points tend to excel at throughput and density, while software-based end-points give us the greatest flexibility to deliver new features. Also, given our rich partner ecosystem with many hardware players, it’s not always going to be feasible to expose the unique features of a specific hardware product through the northbound API of NSX. But we certainly see value in being able to do more on physical ports: for example, we should be able to create access control lists on physical ports under API control. Similarly, we’d like to be able to control QoS policy at the physical ingress, e.g. remarking DSCP bits or trusting the value received and copying to the outer VXLAN header. More stateful services, such as firewalling or load-balancing, may not make sense in a ToR-class device but could be implemented in specific appliances suited for those tasks, and could still be integrated into virtualized networks using the same principles that we’ve applied to L2 and L3 functions.
In summary, we see the physical edges of virtual networks as a critical part of the overall network virtualization story. And of course we consider it important that we have a range of vendors whose devices can be integrated into the virtual overlay. It’s been great to see the ecosystem develop around this ability to tie the physical and the virtual together, and we see a lot of opportunity to build on the foundation we’ve established.
Network virtualization, as others have noted, is now well past the hype stage and in serious production deployments. One factor that has facilitated the adoption of network virtualization is the ease with which it can be incrementally deployed. In a typical data center, the necessary infrastructure is already in place. Servers are interconnected by a physical network that already meets the basic requirements for network virtualization: providing IP connectivity between the physical servers. And the servers are themselves virtualized, providing the ideal insertion point for network virtualization: the vswitch (virtual switch). Because the vswitch is the first hop in the data path for every packet that enters or leaves a VM, it’s the natural place to implement the data plane for network virtualization. This is the approach taken by VMware (and by Nicira before we were part of VMware) to enable network virtualization, and it forms the basis for our current deployments.
In typical data centers, however, not every machine is virtualized. “Bare metal” servers — that is, unvirtualized, or physical machines — are a fact of life in most real data centers. Sometimes they are present because they run software that is not easily virtualized, or because of performance concerns (e.g. highly latency-sensitive applications), or just because there are users of the data center who haven’t felt the need to virtualize. How do we accommodate these bare-metal workloads in virtualized networks if there is no vswitch sitting inside the machine?
Our solution to this issue was to develop gateway capabilities that allow physical devices to be connected to virtual networks. One class of gateway that we’ve been using for a while is a software appliance. It runs on standard x86 hardware and contains an instance of Open vSwitch. Under the control of the NSX controller, it maps physical ports, and VLANs on those ports, to logical networks, so that any physical device can participate in a given logical network, communicating with the VMs that are also connected to that logical network. This is illustrated below.
As an aside, these gateways also address another common use case: traffic that enters and leaves the data center, or “north-south” traffic. The basic functionality is similar enough: the gateway maps traffic from a physical port (in this case, a port connected to a WAN router rather than a server) to logical networks, and vice versa.
Software gateways are a great solution for moderate amounts of physical-to-virtual traffic, but there are inevitably some scenarios where the volume of traffic is too high for a single x86-based appliance, or even a handful of them. Say you had a rack (or more) full of bare-metal database servers and you wanted to connect them to logical networks containing VMs running application and web tiers for multi-tier applications. Ideally you’d like a high-density and high-throughput device that could bridge the traffic to and from the physical servers into the logical networks. This is where hardware gateways enter the picture.
Leveraging VXLAN-capable Switches
Fortunately, there is an emerging class of hardware switch that is readily adaptable to this gateway use case. Switches from several vendors are now becoming available with the ability to terminate VXLAN tunnels. (We’ll call these switches VTEPs — VXLAN Tunnel End Points or, more precisely, hardware VTEPs.) VXLAN tunnel termination addresses the data plane aspects of mapping traffic from the physical world to the virtual. However, there is also a need for a control plane mechanism by which the NSX controller can tell the VTEP everything it needs to know to connect its physical ports to virtual networks. Broadly speaking, this means:
- providing the VTEP with information about the VXLAN tunnels that instantiate a particular logical network (such as the Virtual Network Identifier and destination IP addresses of the tunnels);
- providing mappings between the MAC addresses of VMs and specific VXLAN tunnels (so the VTEP knows how to forward packets to a given VM);
- instructing the VTEP as to which physical ports should be connected to which logical networks.
In return, the VTEP needs to tell the NSX controller what it knows about the physical world — specifically, the physical MAC addresses of devices that it can reach on its physical ports.
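The exchange described above can be pictured as a small set of tables. The following Python sketch is only an illustration of that idea: the class, field, and method names are hypothetical and are not taken from the actual database schema, and the addresses are made up.

```python
from dataclasses import dataclass, field


@dataclass
class VtepDatabase:
    """Toy view of the state shared between a controller and a hardware VTEP."""
    vni_by_switch: dict = field(default_factory=dict)  # logical switch -> VNI
    remote_macs: dict = field(default_factory=dict)    # (switch, VM MAC) -> tunnel IP
    port_bindings: dict = field(default_factory=dict)  # (port, VLAN) -> logical switch
    local_macs: dict = field(default_factory=dict)     # (switch, MAC) -> physical port

    # Written by the controller, read by the VTEP.
    def add_logical_switch(self, switch, vni):
        self.vni_by_switch[switch] = vni

    def add_remote_mac(self, switch, mac, tunnel_ip):
        self.remote_macs[(switch, mac)] = tunnel_ip

    def bind_port(self, port, vlan, switch):
        self.port_bindings[(port, vlan)] = switch

    # Written by the VTEP, read by the controller.
    def report_local_mac(self, port, vlan, mac):
        switch = self.port_bindings[(port, vlan)]
        self.local_macs[(switch, mac)] = port
        return switch


db = VtepDatabase()
db.add_logical_switch("web-tier", vni=5001)
db.add_remote_mac("web-tier", "02:00:00:00:00:0a", "192.0.2.21")
db.bind_port("eth1", 100, "web-tier")
learned_on = db.report_local_mac("eth1", 100, "02:00:00:00:00:50")
```

The first three calls correspond to the three bullets above (controller to VTEP); the last corresponds to the VTEP reporting a physical MAC address it has seen on one of its ports.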
There may be other information to be exchanged between the controller and the VTEP to offer more capabilities, but this covers the basics. This information exchange can be viewed as the synchronization of two copies of a database, one of which resides in the controller and one of which is in the VTEP. The NSX controller already implements a database access protocol, OVSDB, for the purposes of configuring and monitoring Open vSwitch instances. We decided to leverage this existing protocol for control of third party VTEPs as well. We designed a new database schema to convey the information outlined above; the OVSDB protocol and the database code are unchanged. That choice has proven very helpful to our hardware partners, as they have been able to leverage the open source implementation of the OVSDB server and client libraries.
The upshot of this work is that we can now build virtual networks that connect relatively large numbers of physical ports to virtual ports, using essentially the same operational model for any type of port, virtual or physical. The NSX controller exposes a northbound API by which physical ports can be attached to logical switches. Virtual ports of VMs are attached in much the same way to build logical networks that span the physical and virtual worlds. The figure below illustrates the approach.
It’s worth noting that this approach has no need for IP multicast in the physical network, and limits the use of flooding within the overlay network. This contrasts with some early VXLAN implementations (and the original VXLAN Internet Draft, which didn’t entirely decouple the data plane from the control plane). The reason we are able to avoid flooding in many cases is that the NSX controller knows the location of all the VMs that it has attached to logical networks — this information is provided by the vswitches to the controller. And the controller shares its knowledge with the hardware VTEPs via OVSDB. Hence, any traffic destined for a VM can be placed on the right VXLAN tunnel from the outset.
In the virtual-to-physical direction, it’s only necessary to flood a packet if there is more than one hardware VTEP. (If there is only one hardware VTEP, we can assume that any unknown destination must be a physical device attached to the VTEP, since we know where all the VMs are). In this case, we use the NSX Service Node to replicate the packet to all the hardware VTEPs (but not to any of the vswitches). Furthermore, once a given hardware VTEP learns about a physical MAC address on one of its ports, it writes that information to the database, so there will be no need to flood subsequent packets. The net result is that the amount of flooded traffic is quite limited.
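The flooding rules above boil down to a short decision function. This is a hedged sketch rather than actual NSX logic: the function name and the dictionaries passed to it are hypothetical, and the "replicate" result simply stands for handing the packet to a replication node such as the NSX Service Node.

```python
def select_tunnels(dest_mac, vm_locations, physical_macs, hardware_vteps):
    """Decide where a packet goes in the virtual-to-physical direction.

    vm_locations:   VM MAC -> hypervisor tunnel IP (known to the controller)
    physical_macs:  physical MAC -> hardware VTEP tunnel IP (learned by VTEPs)
    hardware_vteps: tunnel IPs of all hardware VTEPs on the logical network
    """
    if dest_mac in vm_locations:
        # The controller knows where every VM is, so no flooding is needed.
        return ("unicast", [vm_locations[dest_mac]])
    if dest_mac in physical_macs:
        # A hardware VTEP has already reported this MAC.
        return ("unicast", [physical_macs[dest_mac]])
    if len(hardware_vteps) == 1:
        # Unknown MAC, single hardware VTEP: it must be behind that VTEP.
        return ("unicast", list(hardware_vteps))
    # Unknown MAC, several hardware VTEPs: replicate to them, not to vswitches.
    return ("replicate", sorted(hardware_vteps))


vms = {"02:00:00:00:00:0a": "192.0.2.21"}
learned = {"02:00:00:00:00:50": "192.0.2.10"}
vteps = {"192.0.2.10", "192.0.2.11"}
unknown = select_tunnels("02:00:00:00:00:99", vms, learned, vteps)
```

Only the final branch causes any replication, which is why the amount of flooded traffic stays small once the VTEPs have reported the MAC addresses on their ports.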
For a more detailed discussion of the role of VXLAN in the broader landscape of network virtualization, take a look at our earlier post on the topic.
Hardware or Software?
Of course, there are trade-offs between hardware and software gateways. In the NSX software gateway, there is quite a rich set of features that can be enabled (port security, port mirroring, QoS features, etc.) and the set will continue to grow over time. Similarly, the feature set that the hardware VTEPs support, and which NSX can control, will evolve over time. One of the challenges here is that hardware cycles are relatively long. On top of that, we’d like to provide a consistent model across many different hardware platforms, but there are inevitably differences across the various VTEP implementations. For example, we’re currently working with a number of different top-of-rack switch vendors. Some partners use their own custom ASICs for forwarding, while others use merchant silicon from the leading switch silicon vendors. Hence, there is quite a bit of diversity in the features that the hardware can support.
Our expectation is that software gateways will provide the greatest flexibility, feature-richness, and speed of evolution. Hardware VTEPs will be the preferred choice for deployments requiring greater total throughput and density.
A final note: one of the important benefits of network virtualization is that it decouples networking functions from the physical networking hardware. At first glance, it might seem that the use of hardware VTEPs is a step backwards, since we’re depending on specific hardware capabilities to interconnect the physical and virtual worlds. But by developing an approach that works across a wide variety of hardware devices and vendors, we’ve actually managed to achieve a good level of decoupling between the functionality that NSX provides and the hardware that underpins it. And as long as there are significant numbers of physical workloads that need to be managed in the data center, it will be attractive to have a range of options for connecting those workloads to virtual networks.
Two weeks ago I gave a short presentation at the Open Networking Summit. With only 15 minutes allocated per speaker, I wasn’t sure I’d be able to make much of an impact. However, there has been a lot of reaction to the talk – much of it positive – so I’m posting the slides here and including them below. A video of the presentation is also available in the ONS video archive (free registration required).
[This post was written by JR Rivers, Bruce Davie, and Martin Casado]
One of the important characteristics of network virtualization is the decoupling of network services from the underlying physical network. That decoupling is fundamental to the definition of network virtualization: it’s the delivery of network services independent of a physical network that makes those services virtual. Furthermore, many of the benefits of virtualization – such as the ability to move network services along with the workloads that need those services, without touching hardware – follow directly from this decoupling.
In spite of all the benefits that flow from decoupling virtual networks from the underlying physical network, we occasionally hear the concern that something has been lost by not having more direct interaction with the physical network. Indeed, we’ve come across a common intuition that applications would somehow be better off if they could directly control what the physical network is doing. The goal of this post is to explain why we disagree with this view.
It’s worth noting that getting networks to do something special for certain applications is hardly a novel idea. Consider the history of Voice over IP as an example. It wasn’t that long ago that using Ethernet for phone calls was a research project. Advances in the capacity of both the end-points and the underlying physical network changed all of that, and today VOIP is broadly used by consumers and enterprises around the world. Let’s break down the architecture that enabled VOIP.
A call starts with end-points (VOIP phones and computers) interacting with a controller that provisions the connection between them. In this case, provisioning involves authenticating end-points, finding other end-points, and ringing the other end. This process creates a logical connection between the end-points that overlays the physical network(s) that connect them. From there, communication occurs directly between the end-points. The breakthroughs that allowed Voice Over IP were a) ubiquitous end-points with the capacity to encode voice and communicate via IP and b) physical networks with enough capacity to connect the end-points while still carrying their normal workload.
Now, does VOIP need anything special from the network itself? Back in the 1990s, many people believed that to enable VOIP it would be necessary to signal into the network to request bandwidth for each call. Both ATM signalling and RSVP (the Resource Reservation Protocol) were proposed to address this problem. But by the time VOIP really started to gain traction, network bandwidth was becoming so abundant that these explicit communication methods between the endpoints and the network proved unnecessary. Some simple marking of VOIP packets to ensure that they didn’t encounter long queues on bottleneck links was all that was needed in the QoS department. Intelligent behavior at the end-points (such as adaptive bit-rate codecs) made the solution even more robust. Today, of course, you can make a VOIP call between continents without any knowledge of the underlying network.
These same principles have been applied to more interactive use cases such as web-based video conferencing, interactive gaming, tweeting, you name it. The majority of the ways that people interact electronically are based on two fundamental premises: a logical connection between two or more end-points and a high capacity IP network fabric.
Returning to the context of network virtualization, IP fabrics allow system architects to build highly scalable physical networks; the summarization properties of IP and its routing protocols allow the connection of thousands of endpoints without imposing the knowledge of each one on the core of the network. This both reduces the complexity (and cost) of the networking elements, and improves their ability to heal in the event that something goes wrong. IP networks readily support large sets of equal cost paths between end-points, allowing administrators to simultaneously add capacity and redundancy. Path selection can be based on a variety of techniques such as statistical selection (hashing of headers), Valiant Load Balancing, and automated identification of “elephant” flows.
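As an illustration of the statistical (hash-based) path selection mentioned above, here is a small Python sketch. It is an assumption-laden toy: real switches compute the hash in hardware with their own functions, CRC32 is used here only as a convenient stand-in, and the path names and flow values are invented.

```python
import zlib


def pick_path(flow, paths):
    """Pick one of several equal-cost paths by hashing the flow's header fields.

    `flow` is a 5-tuple (src IP, dst IP, protocol, src port, dst port).
    Every packet of a flow hashes to the same path, so packets within a flow
    are not reordered, while different flows spread across the paths.
    """
    key = "|".join(str(part) for part in flow).encode()
    return paths[zlib.crc32(key) % len(paths)]


paths = ["spine-1", "spine-2", "spine-3", "spine-4"]
flow = ("10.0.1.5", "10.0.2.9", 6, 49152, 443)
chosen = pick_path(flow, paths)
```

Adding a spine to `paths` adds both capacity and redundancy, at the cost of remapping some existing flows; that trade-off is one reason production hashing schemes are more elaborate than this.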
Is anything lost if applications don’t interact directly with the network forwarding elements? In theory, perhaps, an application might be able to get a path that is better suited to its precise bandwidth needs if it could talk to the network. In practice, a well-provisioned IP network with rich multipath capabilities is robust, effective, and simple. Indeed, it’s been proven that multipath load-balancing can get very close to optimal utilization, even when the traffic matrix is unknown (which is the normal case). So it’s hard to argue that the additional complexity of providing explicit communication mechanisms for applications to signal their needs to the physical network is worth the cost. In fact, we’ll argue in a future post that trying to carefully engineer traffic is counter-productive in data centers because the traffic patterns are so unpredictable. Combine this with the benefits of decoupling the network services from the physical fabric, and it’s clear that a virtualization overlay on top of a well-provisioned IP network is a great fit for the modern data center.
[This post was written by Bruce Davie and Martin Casado.]
With the growth of interest in network virtualization, there has been a tendency to focus on the encapsulations that are required to tunnel packets across the physical infrastructure, sometimes neglecting the fact that tunneling is just one (small) part of an overall architecture for network virtualization. Since this post is going to do just that – talk about tunnel encapsulations – we want to reiterate the point that a complete network virtualization solution is about much more than a tunnel encapsulation. It entails (at least) a control plane, a management plane, and a set of new abstractions for networking, all of which aim to change the operational model of networks from the traditional, physical model. We’ve written about these aspects of network virtualization before (e.g., here).
In this post, however, we do want to talk about tunneling encapsulations, for reasons that will probably be readily apparent. There is more than one viable encapsulation in the marketplace now, and that will be the case for some time to come. Does it make any difference which one is used? In our opinion, it does, but it’s not a simple beauty contest in which one encaps will be declared the winner. We will explore some of the tradeoffs in this post.
There are three main encapsulation formats that have been proposed for network virtualization: VXLAN, NVGRE, and STT. We’ll focus on VXLAN and STT here. Not only are they the two that VMware supports (now that Nicira is part of VMware) but they also represent two quite distinct points in the design space, each of which has its merits.
One of the salient advantages of VXLAN is that it’s gained traction with a solid number of vendors in a relatively short period. There were demonstrations of several vendors’ implementations at the recent VMworld event. It fills an important market need, by providing a straightforward way to encapsulate Layer 2 payloads such that the logical semantics of a LAN can be provided among virtual machines without concern for the limitations of physical layer 2 networks. For example, a VXLAN can provide logical L2 semantics among machines spread across a large data center network, without requiring the physical network to provide arbitrarily large L2 segments.
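For readers who want to see what the encapsulation looks like on the wire, here is a short Python sketch that packs and parses the 8-byte VXLAN header as laid out in the published VXLAN specification: a flags byte with the "VNI valid" bit, 24 reserved bits, the 24-bit Virtual Network Identifier, and 8 more reserved bits. The function names are our own, and the outer Ethernet, IP, and UDP headers that precede this header in a real packet are deliberately left out.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag in the first byte of the header


def pack_vxlan_header(vni):
    """Build the 8-byte VXLAN header for a 24-bit Virtual Network Identifier."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags byte followed by 24 reserved bits.
    # Word 2: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header):
    """Extract the VNI from a VXLAN header, checking the VNI-valid flag."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return vni_word >> 8


header = pack_vxlan_header(5000)
```

The original Layer 2 frame follows this header unchanged, which is what lets VXLAN carry LAN semantics across a routed physical network.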
At the risk of stating the obvious, the fact that VXLAN has been implemented by multiple vendors makes it an ideal choice for multi-vendor deployments. But we should be clear what “multi-vendor” means in this case. Network virtualization entails tunneling packets through the data center routers and switches, and those devices only forward based on the outer header of the tunnel – a plain old IP (or MAC) header. So the entities that need to terminate tunnels for network virtualization are the ones that we are concerned about here.
In many virtualized data center deployments, most of the traffic flows from VM to VM (“east-west” traffic) in which case the tunnels are terminated in vswitches. It is very rare for those vswitches to be from different vendors, so in this case, one might not be so concerned about multi-vendor support for the tunnel encaps. Other issues, such as efficiency and ability to evolve quickly might be more important, as we’ll discuss below.
Of course, there are plenty of cases where traffic doesn’t just flow east-west. It might need to go out of the data center to the Internet (or some other WAN), i.e. “north-south”. It might also need to be sent to some sort of appliance such as a load balancer, firewall, intrusion detection system, etc. And there are also plenty of cases where a tunnel does need to be terminated on a switch or router, such as to connect non-virtualized workloads to the virtualized network. In all of these cases, we’re clearly likely to run into multi-vendor situations for tunnel termination. Hence the need for a common, stable, and straightforward approach to tunneling among all those devices.
Now, getting back to server-server traffic, why wouldn’t you just use VXLAN? One clear reason is efficiency, as we’ve discussed here. Since tunneling between hypervisors is required for network virtualization, it’s essential that tunneling not impose too high an overhead in terms of CPU load and network throughput. STT was designed with those goals in mind and performs very well on those dimensions using today’s commodity NICs. Given the general lack of multi-vendor issues when tunneling between hypervisors, STT’s significant performance advantage makes it a better fit in this scenario.
The performance advantage of STT may be viewed as somewhat temporary: it’s a result of STT’s ability to leverage TCP segmentation offload (TSO) in today’s NICs. Given the rise in importance of tunneling, and the momentum behind VXLAN, it is reasonable to expect that a new generation of NICs will emerge that allow other tunnel encapsulations to be used without disabling TSO. When that happens, performance differences between STT and VXLAN should (mostly) disappear, given appropriate software to leverage the new NICs.
Another factor that comes into play when tunneling traffic from server to server is that we may want to change the semantics of the encapsulation from time to time as new features and capabilities make their way into the network virtualization platform. Indeed, one of the overall advantages of network virtualization is the ease with which the capabilities of the network can be upgraded over time, since they are all implemented in software that is completely independent of the underlying hardware. To make the most of this potential for new feature deployment, it’s helpful to have a tunnel encaps with fields that can be modified as required by new capabilities. An encaps that typically operates between the vswitches of a single vendor (like STT) can meet this goal, while an encaps designed to facilitate multi-vendor scenarios (like VXLAN) needs to have the meaning of every header field pretty well nailed down.
So, where does that leave us? In essence, with two good solutions for tunneling, each of which meets a subset of the total needs of the market, but which can be used side-by-side with no ill effect. Consequently, we believe that VXLAN will continue to be a good solution for the multi-vendor environments that often occur in data center deployments, while STT will, for at least a couple of years, be the best approach for hypervisor-to-hypervisor tunnels. A complete network virtualization solution will need to use both encapsulations. There’s nothing wrong with that – building tunnels of the correct encapsulation type can be handled by the controller, without the need for user involvement. And, of course, we need to remember that a complete solution is about much more than just the bits on the wire.
[This post was written with input from Martin Casado, Ben Pfaff, Justin Pettit and Ben Basler.]
The Open vSwitch (OVS) project is obviously dear to our hearts at Nicira (and now VMware). OVS is a fairly standard open source project, with many dozens of people from companies around the world contributing patches and reviewing them. However, there is more to openness than just open source software; open protocols (with freely accessible specs) are also important. Of course, Open vSwitch is well known as an implementation of the OpenFlow protocol, for which the specs are freely available. But there is another protocol, arguably just as important as OpenFlow, which is used to manage Open vSwitch instances. That protocol is known as the Open vSwitch Database management protocol or OVSDB protocol. While the specification of that protocol can be found within the Open vSwitch sources, it’s a bit of an effort to figure out exactly how it works. With that in mind, as well as a desire to be as open as possible, we decided to publish a spec for the OVSDB protocol in a more familiar and accessible format – an Internet draft.
To be clear, anyone can publish an Internet draft, and that does not make something into a standard. There’s a possibility that this Internet draft may be suitable for publication as an informational RFC. That wouldn’t make it a standard either, but it would at least provide an archival publication mechanism for a protocol that is quite widely used. Whether that happens or not depends on the “Independent Stream Editor”, part of the rather complex organization that handles RFC publication. (See http://www.rfc-editor.org/RFCeditor.html.)
So, what is this OVSDB protocol? Obviously, you could just go and read our new Internet draft, but here is the quick summary. While OpenFlow establishes flow state in a switch, there’s a lot more to Open vSwitch – indeed there is more to networking – than just setting up flow (or forwarding) table entries. In Open vSwitch, you can create many virtual switch instances, attach interfaces to those switches, set QoS policies on interfaces, and so on. None of these configuration tasks can be done with OpenFlow, so you need a management/configuration protocol to do them.
The OVSDB protocol has been part of the Open vSwitch implementation for many years. It is essentially a general purpose protocol for interacting with a database, and in Open vSwitch the database in question is a set of tables representing switch configuration data. (Some readers may be familiar with of-config – the OpenFlow config protocol – a more recent effort; we believe that protocol could actually be implemented on top of OVSDB.)
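To give a flavor of what "interacting with a database" means here, the sketch below builds two OVSDB requests in Python. It relies on details that are not spelled out in this post but are in the published spec as we understand it: messages are JSON-RPC, `list_dbs` and `transact` are protocol methods, and a `select` operation names a table, a `where` clause, and the columns to return. The helper function is our own, and nothing is actually sent to a server.

```python
import json


def ovsdb_request(method, params, request_id):
    """Serialize a JSON-RPC request of the kind the OVSDB protocol uses."""
    return json.dumps({"method": method, "params": params, "id": request_id})


# Ask the server which databases it holds.
list_dbs = ovsdb_request("list_dbs", [], 0)

# Read the name of every bridge (virtual switch instance) from the
# Open_vSwitch database. An empty "where" clause matches all rows.
select_bridges = ovsdb_request(
    "transact",
    [
        "Open_vSwitch",
        {"op": "select", "table": "Bridge", "where": [], "columns": ["name"]},
    ],
    1,
)
```

Because the protocol itself only knows about databases, tables, rows, and columns, the same messages work against any schema; what changes between use cases is the set of tables, not the wire format.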
To step back for a moment, networking folks often think of any network device as having a control plane and a data plane. Sadly, the management plane has been all too often neglected. OVSDB is a protocol that was created to address that important but neglected aspect of networking. We think that making networks dramatically easier to manage is in fact one of the major benefits of network virtualization. That’s a topic we’ve discussed elsewhere; for now, I’ll just urge readers of this blog to go take a look at our current approach to managing and configuring Open vSwitch instances.