Posted: January 15, 2012
[This post was written with Andrew Lambeth]
Our last post “Networking Doesn’t Need a VMware” made the point that drawing a simple analogy between server and network virtualization can steer the technical discourse on network virtualization in the wrong direction. The sentiment comes from the many partner, analyst, and media meetings we’ve been involved in that persistently focus on relatively uninteresting areas of the network virtualization space, specifically, details of encapsulation formats and lookup pipelines.
In this series of writeups, we take a deeper look and discuss some areas in which network
virtualization would do well to emulate server virtualization. This is a fairly broad topic so we’ll break it up across a couple of posts.
In this part, we’ll focus on address space virtualization.
Quick heads up that the length of this post got a little bit out of hand. For those who don’t have the time/patience/inclination/attention span, the synopsis is as follows:
One of the key strengths of a hypervisor lies in its insertion of a completely new address space below the guest OS’s view of what it believes to be the physical address space. And while there are several possible ways to interpose on network address space to achieve some form of virtualization, encapsulation provides the closest analog to the hierarchical memory virtualization used in compute. It does so by taking advantage of the hierarchy inherent in the physical topology, and allowing both the virtual and physical address spaces to support complete forwarding and addressing models. However, like memory virtualization’s page table, encapsulation requires maintenance of the address mappings (rules to tunnel mappings). The interface for doing so should be open, and a good candidate for that interface is OpenFlow.
Now, onto the detailed argument …
Virtual Memory in Compute
Virtual memory has been a core component of compute virtualization for decades. The basic concept is very simple, multiple (generally sparse) virtual address spaces are multiplexed to a single compact physical address using a page table that contains the between the two.
An OS virtualizes the address space for a process by populating a table with a single level of translations from virtual addresses (VA) to physical addresses (PA). All hypervisors support the ability for a guest OS to continue working in this mode by adding the notion of a third address space called machine addresses (MA) for the true physical addresses.
Since x86 hardware initially supported only a single level of mappings the hypervisor implemented a complete MMU in software to capture and maintain the guest’s VA to PA mappings, and then created a second set of mappings from guest PA to actual hardware MA. What was actually programmed into hardware was VA to MA mappings, in order to not incur overhead during the actual memory references by the guest, but the full heirarchy was maintained so at any time the hypervisor components could easily take an address from any of the three address spaces and map it back to any of the other address spaces.
Having a multi-level hierarchical mapping was so powerful that eventually CPU vendors added support for multi-level page tables in the hardware MMU (called Nested Page Tables on AMD and Extended Page Tables on Intel). Arguably this was the biggest architectural change to CPUs over the last decade.
Benefits of address virtualization in compute
Although this is pretty basic stuff, it is worth enumerating the benefits it provides to see how these can be applied to the networking world.
- It allows the multiplexing of multiple large, sparse address spaces onto a smaller, compact physical address space.
- It supports mobility of a process within a physical address space. This can be used to more efficiently allocated processes to memory, or take advantage of new memory as it is added.
- VMs don’t have to coordinate with other VMs to select their address space. Or more generally, there are no constraints of the virtual address space that can be allocated to a guest VM.
- A virtual memory subsystem provides a basic unit of isolation. A VM cannot mistakingly (or maliciously) address another VM unless another part of the system is busted or compromised.
Address Virtualization for Networking
The point of this post is to explore whether memory virtualization in compute can provide guidance on network virtualization. It would be nice, for example, to retain the same benefits that made memory virtualization so successful.
We’ll start with some common approaches to network virtualization and see how they compare.
Tagging : The basic idea behind tagging is to mark packets at the edge of the network with some bits (the tag) that contains the virtual context. This tag generally encodes a unique identifier for the virtual network and perhaps the virtual ingress port. As the packet traverses the network, the tag is used to segment the forwarding tables to only apply to rules associated with that tag.
A tag does not provide “virtualization” in the same way virtual memory of compute does. Rather it provides segmentation (which was also a phase compute memory went through decades ago). That is, it doesn’t introduce a new address space, but rather segments an existing address space. As a result, the same addresses are used for both physical and virtual purposes. The implications of this can be quite limiting. For example:
- Since the addresses are used to address things in the virtual world *and* the physical world, they have to be exposed to the physical forwarding tables. Therefore the nice property of aggregation that comes with hierarchical address mapping cannot be exploited. Using VLANs as an example, every VM MAC has to be exposed to the hardware putting a lot of pressure on physical L2 tables. If virtual addresses where mapped to a smaller subset of physical addresses instead, the requirement for large L2 tables would go away.
- Due to layering, tags generally only segment a single address space (e.g. L2). Again, because there is no new address space introduced, this means that all of the virtual contexts must all have identical addressing models (e.g. L2) *and* the virtual addresses space must be the same as the physical address space. This unnecessarily couples the virtual and physical worlds. Imagine a case in which the model used to address the virtual domain would not be suitable for physical forwarding. The classic example of this is L2. VMs are given Ethernet NICs and may want to talk using L2 only. However, L2 is generally not a great way of building large fabrics. Introducing an additional address space can provide the desired service model at the virtual realm, and the appropriate forwarding model in the physical realm.
- Another shortcoming of tagging is that you cannot take advantage of mobility through address remapping. In the virtual address realm, you can arbitrarily map a virtual address to a physical address. If the process or VM moves, the mapping just needs to be updated. A tag however does not provide address mobility. The reason that VEPA, VNTAG, etc. support mobility within an L2 domain is by virtue of learned soft state in L2 (which exists whether or not a tag is in use). For layers that don’t support address mobility (say vanilla IP), using a tag won’t somehow enable it.It’s worth pointing out that this is a classic problem within virtualized datacenters today. L2 supports mobility, and L3 often does not. So while both can be segmented using tags, often only the L2 portion allows mobility confining VMs to a given subnet. Most of the approaches to VM mobility across subnets introduce another layer of addressing, whether this is done with LISP, tunnels, etc. However, as soon as encapsulation is introduced, it begs the question of why use tags at all.
There are a number of other differences between tagging and address-level virtualization (like requiring all switches en route to understand the tag), but hopefully you’ve gotten the point. To be more like memory virtualization, a new addressing layer needs to be present, and segmenting the physical address space doesn’t provide the same properties.
Address Mapping : Another method of network virtualization is address mapping. Unlike tagging, a new address space is introduced and mapped onto the physical address space by one or more devices in the network. The most common example of address mapping is NAT, although it doesn’t necessarily have to be limited to L3.
To avoid changing the framing format of the packet, address mapping operates by updating the address in place, meaning that the same field is used for the updated address. However, multiplexing multiple larger address spaces onto a smaller physical address space (for example, a bunch of private IP subnets onto a smaller physical IP subnet) is a lossy operation and therefore requires additional bits to map back from the smaller space to the larger space.
Here in lies much of the complexity (and commensurate shortcoming) of address mapping. Generally, the additional bits are added to a smaller field in the packet, and the full mapping information is stored as state on the device performing translation. This is what NAT does, the original 5 tuple is stashed on the device, and the ephemeral transport port is used to launder a key that points back to that 5 tuple on the return path.
When compared to virtual memory, the model doesn’t hold up particularly well. Within a server, the entire page table is always accessible, so addresses can be mapped back and forth with some additional information to aid in the demultiplexing (generally the pid). With network address translation, the “page table” is created on demand, and stored in a single device. Therefore, in order to support component failover of asymmetric paths, the per-flow state has to be replicated to all other devices that could be on the alternate path. Also, because state is set up during flow initiation from the virtual address, inbound flows cannot be forwarded unless the destination public IP address is effectively “pinned” to a virtual address. As a result, virtual to virtual communication in which the end points are behind different devices have to resort to OOB techniques like NAT punching in order to communicated.
This doesn’t mean that addressing mapping isn’t hugely useful in practice. Clearly for mapping from a virtual address space to a physical address space this is the correct approach. However, if communicating from a virtual address to a virtual address through a physical address space, it it is far more limited than its compute analog.
Tunneling : By tunneling we mean that the payload of a packet is another packet (headers and all).
Like address mapping (and virtual memory) tunneling introduces another address space. However, there are two differences. First, the virtual address space doesn’t have to look anything like the physical. For example, it is possible to have the virtual address space be IPv6 and the physical address space be IPv4. Second, the full address mapping is stored in the packet so that there is no need to create per-flow state within the network to manage the mappings.
Tunneling (or perhaps more broadly encapsulation) is probably the most popular method of doing full network virtualization today. TRILL uses L2 in L2, LISP uses encapsulation, VXLAN (which is LISP as far as I can tell) uses L2 in L3, VCDNI uses L2 in L2, NVGRE uses L2 in L3, etc.
The general approach maps to virtual memory pretty well. The outer header can be likened to a physical address. The inner header can be likened to a virtual address. And often a shim header is included just after the outer header that contains demultiplexing information (the equivalent of a PID). The “page table” consists of on-datapath table table entries which map packets to tunnels.
Take for example an overlay mesh (like NVGRE). The L2 table at the edge that points to the tunnels is effectively mapping from virtual addresses to physical addresses. And there is no reason to limit this lookup to L2, it could provide L2 and L3 in the virtual domain. If a VM moves, this “page table” is updated as it would be in a server of a process is moved.
Because the packets are encapsulated, switches in the “physical domain” only have to deal with the outer header greatly reducing the number of addresses they have to deal with. Switches which contain the “page table” however, have to do lookups in the virtual world (e.g. map from packet to tunnel), map to the physical world (throw a tunnel on the packet) and then forward the packet in the physical world (figure out which port to send the tunneled traffic out on). Note that these are effectively the same steps an MMU takes within a server.
Further, like memory virtualization, tunneling provides nice isolation properties. For example, a VM in the virtual address space cannot address the physical network unless the virtual network is somehow bridged into it.
Also, like hierarchical memory, it is possible for the logical networks to have totally different forwarding stacks. One could be L2-only, the other ipv4, and another ipv6, and all of these could differ from the physical substrate. A common setup has IPv6 run in the virtual domain (for end-to-end addressing), and IPv4 for the physical fabric (where a large address space isn’t needed).
Fortunately, we already have a a multi-level hierarchical topology in the network (in particularly the datacenter). This allows for either the addition of a new layer at the edge which provides a virtual address space without changing any other components, or just changing the components at the outer edge of the hierarchy.
So what are the shortcomings of this approach? The most immediate are performance issues with tunneling, additional overhead overhead in the packet, and the need to maintain the distributed “page table”.
Tunneling performance is no longer the problem it use to be. Even from software in the server, clever tunnel implementations are able to take advantage of LRO and the like and can achieve 10G performance without much CPU. Hardware tunneling on most switching chipsets performs at line rate as well. Header overhead marginally increases transmission delay, and will reduce total throughput if a link is saturated.
Keeping the “page tables” up to date on the other hand is still something the industry is grappling with. Most standards punt on describing how this is done, or even the interface to use to write to the “page table”. Some piggyback on L2 learning (like NVGRE) to construct some of the state but still rely on an out of band mechanism for other state (in the case of NVGRE the VIF to tenantID mapping).
As we’ve said many times in the past, this interface should be standardized if for no other reason than to provide modularity in the architecture. Not doing that would be like limiting a hardware MMU on a CPU to only work with a single Operating System.
Fortunately, Microsoft realizes this need and has opened the interface to their vswitch on Windows Server 8, and XenServer, KVM and other Linux-based hypervisors support Open vSwitch.
For hardware platforms it would also be nice to expose the ability to manage this state. Of course, we would argue that OpenFlow is a good candidate if for no other reason than it has been already used for this purpose successfully.
Wrapping Up ..
If the analogy of memory virtualization in compute holds then encapsulation is the correct way to go about network virtualization. It introduces a new address space that is unrestricted in forwarding model, it takes advantage of hierarchy inherent in the physical topology allowing for address aggregation further up in the hierarchy, and it provides the basic virtualization properties of isolation and transparent mobility.
In the next post of this series, we’ll explore how server virtualization suggests that the way that encapsulation is implemented today isn’t quite right, and some things we may want to do to fix it.
Posted: January 9, 2012
[This post was written with Andrew Lambeth. Andrew has been virtualizing networking for long enough to have coined the term “vswitch”, and led the vDS distributed switching project at VMware. ]
Or at least, it doesn’t need to solve the problem in the same way.
It’s commonly said that “networking needs a VMWare”. Hell, there have been occasions in which we’ve said something very similar. However, while the analogy has an obvious appeal (virtual, flexible, thin layer of indirection in software, commoditize, commoditize, commoditize!), a closer look suggests that it draws from a very superficial understanding of the technology, and in the limit, it doesn’t make much sense.
It’s no surprise that many are drawn to this line of thought. It probably stems from the realization that virtualizing the network rather than managing the physical components is the right direction for networks to evolve. On this point, it appears there is broad agreement. In order to bring networking up to the operational model of compute (and perhaps disrupt the existing supply chain a bit) virtualization is needed.
Beyond this gross comparison, however, the analogy breaks down. The reality is that the technical requirements for server virtualization and network virtualization are very, very different.
Server Virtualization vs. Network Virtualization
With server virtualization, virtualizing CPU, memory and device I/O is incredibly complex, and the events that need to be handled with translation or emulation happen at CPU cycle timescale. So the virtualization logic must be both highly sophisticated and highly performant on the “datapath” (the datapath for compute virtualization being the instruction stream and I/O events).
On the other hand, the datapath operations for network virtualization are almost trivially simple. All they involve is mapping one address/context space to another address/context space. This effectively reduces to an additional header on the packet (or tag), and one or two more lookups on the datapath. Somewhat revealing of this simplicity, there are multiple reasonable solutions that address the datapath component, NVGRE, and VXLAN being two recently publicized proposals.
If the datapath is so simple, it’s reasonable to ask why network virtualization isn’t already a solved problem.
The answer, is that there is a critical difference between network virtualization and server virtualization and that difference is where the bulk of complexity for network virtualization resides.
What is that difference?
Virtualized servers are effectively self contained in that they are only very loosely coupled to one another (there are a few exceptions to this rule, but even then, the groupings with direct relationships are small). As a result, the virtualization logic doesn’t need to deal with the complexity of state sharing between many entities.
A virtualized network solution, on the other hand, has to deal with all ports on the network, most of which can be assumed to have a direct relationship (the ability to communicate via some service model). Therefore, the virtual networking logic not only has to deal with N instances of N state (assuming every port wants to talk to every other port), but it has to ensure that state is consistent (or at least safely inconsistent) along all of the elements on the path of a packet. Inconsistent state can result in packet loss (not a huge deal) or much worse, delivery of the packet to the wrong location.
It’s important to remember that networking traditionally has only had to deal with eventual consistency. That is “after state change, the network will take some time to converge, and until that time, all bets are off”. Eventual consistency is fine for basic forwarding provided that loops are prevented using a TTL, or perhaps the algorithm ensures loop freedom while it is converging. However, eventual consistency doesn’t work so well with virtualization. During failure, for example, it would suck if packets from tenant A managed to leak over to tenant B’s network. It would also suck if ACLs configured in tenant A’s were not enforced correctly during convergence.
So simply, the difference between server virtualization, and network virtualization is that network virtualization is all about scale (dealing with the complexity of many interconnected entities which is generally a N2 problem), and it is all about distributed state consistency. Or more concretely, it is a distributed state management problem rather than a low level exercise in dealing with the complexities of various hardware devices.
Of course, depending on the layer of networking being virtualized, the amount of state that has to be managed varies.
All network virtualization solutions have to handle basic address mapping. That is, provide a virtual address space (generally addresses of the packets within the tunnel) and the physical address space (the external tunnel header), and a mapping between the two (virtual address X is at physical address Y). Any of the many tunnel overlays solutions, whether ad hoc, proprietary, or standardized provide this basic mapping service.
To then virtualize L2, requires almost no additional state management. The L2 forwarding tables are dynamically populated from passing traffic. And the size of a single broadcast domain has fairly limited scale, supporting hundreds or low thousands of active MACs. So the only additional state that has to be managed is the association of a port (virtual or physical) to a broadcast domain which is what virtual networking standards like NVGRE and VXLAN provide.
As an aside, it’s a shame that standards like NVGRE and VXLAN choose to dictate the wire format (important for hardware compatibility) and the method for managing the context mapping between address domains (multicast), but not the control interface to manage the rest of the state. Specifying the wire format is fine. However, requiring a specific mechanism (and a shaky one at that) for managing the virtual to physical address mappings severely limits the solution space. And not specifying the control interface for managing the rest of the state effectively guarantees that implementations will be vertically integrated and proprietary.
For L3, there is a lot more state to deal with, and the number of end points to which this state applies can be very large. There are a number of datacenters today who have, or plan to have, millions of VMs. Because of this, any control plane that hopes to offer a virtualized L3 solution needs to manage potentially millions of entries at hundreds of thousands of end points (assuming the first hop network logic is within the vswitch). Clearly, scale is a primary consideration.
As another aside, in our experience, there is a lot of confusion on what exactly L3 virtualization is. While a full discussion will have to wait for a future post, it is worth pointing out that running a router as a VM is *not* network virtualization, it is x86 virtualization. Network virtualzation involves mapping between network address contexts in a manner that does not effect the total available bandwidth of the physical fabric. Running a networking stack in a virtual machine, while it does provide the benefits of x86 virtualization, limits the cross-sectional bandwidth of the emulated network to the throughput of a virtual machine. Ouch.
For L4 and above, the amount of state that has to be shared and the rate that it changes increases again by orders of magnitude. Take, for example, WAN optimization. A virtualized WAN optimization solution should be enforced throughout the network (for example, each vswitch running a piece of it) yet this would incur a tremendous amount of control overhead to create a shared content cache.
So while server virtualization lives and dies by the ability to deal with the complexity of virtualizing complex hardware interfaces of many devices at speed, network virtualization’s primary technical challenge is scale. Any solution that doesn’t deal with this up front will probably run into a wall at L2, or with some luck, basic L3.
This is all interesting … but why do I care?
Full virtualization of the network address space and service model is still a relatively new area. However, rather than tackle the problem of network virtualization directly, it appears that a fair amount of energy in industry is being poured into point solutions. This reminds us of the situation 10 years ago when many people were trying to solve server sprawl problems with application containers, and the standard claim was that virtualization didn’t offer additional benefits to justify the overhead and complexity of fully virtualizing the platform. Had that mindset prevailed, today we’d have solutions doing minimal server consolidation for a small handful of applications on only one or possibly two OSs, instead of a set of solutions that solve this and many many more problems for any application and most any OS. That mindset, for example could never have produced vMotion, which was unimaginable at the outset of server virtualization.
At the same time, those who are advocating for network virtualization tend to draw technical comparisons with server virtualization. And while clearly there is a similarity at the macro level, this comparison belies the radically different technical challenges of the two problems. And it belies the radically different approaches needed to solve the two problems. Network virtualization is not the same as server virtualization any more than server virtualization is the same as storage virtualization. Saying “the network needs a VMware” in 2012 is a little like saying “the x86 needs an EMC” in 2002.
Perhaps the confusion is harmless, but it does seem to effect how the solution space is viewed, and that may be drawing the conversation away from what really is important, scale (lots of it) and distributed state consistency. Worrying about the datapath , is worrying about a trivial component of an otherwise enormously challenging problem.