[This post was co-authored by Bruce Davie and Ken Duda]
Almost a year ago, we wrote a first post about our efforts to build virtual networks that span both virtual and physical resources. As we’ve moved beyond the first proofs of concept to customer trials for our combined solution, this post serves to provide an update on where we see the interaction between virtual and physical worlds heading.
Our overall approach to connecting physical and virtual resources can be viewed in two main categories:
- terminating the overlay on physical devices, such as top-of-rack switches, routers, appliances, etc.
- managing interactions between the overlay and the physical devices that provide the underlay.
We first started working to design a control plane to terminate network virtualization overlays on physical devices in 2012. We started by looking at the information model, defining what information needed to be exchanged between a physical device and a network virtualization controller such as NSX. To bound the problem space, we focused on a specific use case: mapping the ports and VLANs of a physical switch into virtual layer 2 networks implemented as VXLAN-based overlays. (See our posts on the issues around encapsulation choice here and here). At the same time, we knew there were a lot more use cases to be addressed, so we picked a completely extensible protocol to carry the necessary information: OVSDB. This was important: we knew that over time we’d have to support a lot more use cases than just L2 bridging between physical and virtual worlds. After all, one of the tenets of network virtualization is that a virtual network should faithfully reproduce all of the networking stack, from L2-L7, just as server virtualization faithfully reproduces a complete computing environment (as described in more detail here.)
So, the first thing we added to the solution space once we got L2 working was L3. By the time we published the VTEP schema as part of Open vSwitch in late 2013, distributed logical routing was included. (We use the term VTEP – VXLAN tunnel end-point – as a general term for the devices that terminate the overlay.) Let’s take a look at how logical routing works.
Distributed logical routing is an example of a more general capability in network virtualization, the distribution of services. Brad Hedlund wrote some time ago about the value of distributing services among hypervisors. The same basic arguments apply when there are VTEPs in the picture — you want to distribute functions, like logical routing, so that packets always follow the shortest path without hair-pinning, and so that the capacity to perform that function scales out as you add more devices, be they hypervisors or physical switches.
So, suppose a VM (VM1) is placed in logical subnet A, and a physical server (PS1) that is in subnet B is located behind a ToR switch acting as a VTEP (see picture). Say we want to create a logical network that interconnects these two subnets. The logical topology is created by API requests to the network virtualization controller, which in turn programs the vswitches and the ToR to instantiate the desired topology. Part of this process entails mapping physical ports to the logical topology via API requests. Everything the ToR needs to know to participate in the logical topology is provided to it via OVSDB.
Suppose VM1 needs to send a packet to PS1. The VM will send an ARP request towards its default gateway, which is implemented in a distributed manner. (We assume the VM learned its default gateway via some prior step; for example, DHCP may be used.) The ARP request will be intercepted by the local vswitch on the hypervisor. Acting as the logical router, the vswitch will reply to the ARP, so that the VM can now send the packet towards the router. All of this happens without any packet leaving the hypervisor.
The logical router now needs to ARP for the destination — PS1 (assuming an ARP cache miss for the first packet). It will broadcast the ARP request by sending it over a VXLAN tunnel to the VTEP (and potentially to other VTEPs as well, if there are more VTEPs that are involved in logical subnet B). When the ARP packet reaches the ToR, it is sent out on one or more physical interfaces — the set of interfaces that were previously mapped to this logical subnet via API requests. The ARP will reach PS1, which replies; the ToR forwards the reply over a VXLAN tunnel to the vswitch that issued the request, and now it’s able to forward the data traffic to the ToR which decapsulates the packet and delivers it to PS1.
For traffic flowing the other way, the role of logical router would be played by the physical VTEP rather than the vswitch. This is the nature of distributed routing — there is no single box performing all the work for a single logical router, but rather a collection of devices. And in this case, the work is distributed among both hardware VTEPs and vswitches.
We’ve glossed over a couple of details here, but one detail that’s worth noting is that, for traffic heading in the physical-to-virtual direction, the hardware device needs to perform an L3 lookup followed by VXLAN encapsulation. There has been some uncertainty regarding the capabilities of various switching chips to perform this operation (see this post, for example, which tries to determine the capabilities of Trident 2 based on switch vendor information). We’ve actually connected VMware’s NSX controller to ToR switches using at least four different classes of switching silicon (two merchant vendors, two custom ASIC-based designs). This includes both Arista’s 7150 series and 7050X switches. All of these are capable of performing the necessary L3+VXLAN operations. We’ll let the switch vendors speak for themselves regarding product specifics, but we’re essentially viewing this as a non-issue.
OK, that’s L3. What next? Overall, our approach has been to try to provide the same capabilities to virtual ports and physical ports, as much as that is possible. Of course, there is some inherent conflict here: hardware-based end-points tend to excel at throughput and density, while software-based end-points give us the greatest flexibility to deliver new features. Also, given our rich partner ecosystem with many hardware players, it’s not always going to be feasible to expose the unique features of a specific hardware product through the northbound API of NSX. But we certainly see value in being able to do more on physical ports: for example, we should be able to create access control lists on physical ports under API control. Similarly, we’d like to be able to control QoS policy at the physical ingress, e.g. remarking DSCP bits or trusting the value received and copying to the outer VXLAN header. More stateful services, such as firewalling or load-balancing, may not make sense in a ToR-class device but could be implemented in specific appliances suited for those tasks, and could still be integrated into virtualized networks using the same principles that we’ve applied to L2 and L3 functions.
In summary, we see the physical edges of virtual networks as a critical part of the overall network virtualization story. And of course we consider it important that we have a range of vendors whose devices can be integrated into the virtual overlay. It’s been great to see the ecosystem develop around this ability to tie the physical and the virtual together, and we see a lot of opportunity to build on the foundation we’ve established.
(This post was written by Tim Hinrichs and Scott Lowe with contributions from Martin Casado, Peter Balland, Pierre Ettori, and Dennis Moreau.)
In the first part of this series we described the policy problem: ensuring that the data center obeys the real-world rules and regulations that are pertinent to that data center. In this post, we look at the range of possible solutions by identifying some the key features that are important for any solution to the policy problem. Those key features correspond to the following four questions, which we use to structure our discussion.
- What are the policy sources a policy system must accommodate?
- How do those sources express the desired policy to the system?
- How does the policy system interact with data center services?
- What can the policy system do once it has the policy?
Let’s take a look at each of these questions one at a time.
Policy Sources: The origins of policy
Let’s start by digging deeper into an idea we touched on in the first post when describing the challenge of policy compliance: the sources of policy. While we sometimes talk about there being a single policy for a data center, the reality is that there are really many different policies that govern a data center. Each of these policies may have a different source or origin. Here are some examples of different policy sources:
- Application developers may write a separate policy for each application describing what that application expects from the infrastructure (such as high availability, elasticity/auto-scaling, connectivity, or a specific deployment location).
- The cloud operator may have a policy that describes how applications relate to one another. This policy might specify that applications from different business units must be deployed on different production networks, for example.
- Security and compliance teams might have policies that dictate specific firewall rules for web servers, encryption for PCI-compliant workloads, or OS patching guidelines.
- Different policies may be focused on different functionality within the data center. There might be a deployment policy, a billing policy, a security policy, a backup policy, and a decommissioning policy.
- Policies may be written at different levels of abstraction, e.g. applications versus virtual machines (VMs), networks versus routers, storage versus disks.
- Different policies might exist for different policy operations (monitoring, enforcing, and auditing are three examples that we will discuss later in this post).
The idea of multiple sources of policy naturally leads us to the presence of multiple policies. This is an interesting idea, because these multiple policies can interact with each other in many different ways. A policy describing where an application is to be deployed might also implicitly describe where VMs are to be deployed. A cloud operator’s policy might require an application to be deployed on network A or B, and an application policy requiring high availability might mean it must be deployed on network B or C; taken together, this means the application can only be deployed on network B. An auditing policy that requires knowing provenance for data when applied to an application that supports a high transaction rate might require solid state storage to meet performance requirements.
Taking this a step further, it may be unclear how these policies should interact. If the backup policy says to have 3 copies of data, and an auditing policy requires keeping track of where the data originated, do we need 3 copies of that provenance information? Conflicts are another example. If the application’s policy implies networks A or B, and the cloud operator’s policy implies networks C or D, then there is no way to deploy that application so that both policies are satisfied simultaneously.
There are a couple key takeaways from this discussion. First, a policy system must deal with multiple policy sources. Second, a policy system must deal with the presence of multiple policies, and how those policies can or should interact with one another.
Expressing Policy: Policy languages
Any discussion of policy systems has to deal with the subject of policy languages. An intuitive, easy-to-use syntax is critically important for eventual adoption, but here we focus on more semantic issues: how domain-specific is the language? How expressive is the language? What general-purpose policy features belong to the language?
A language is a domain specific language (DSL) if it includes primitives useful for policy in one domain but not another. For example, a policy language for networking might include primitives for the source and destination IP addresses of a network packet. A DSL for compute might include primitives for the amount of memory or disk space on a server. An application-oriented DSL might include elasticity primitives so that how different parts of the application grow and shrink can be the subject of policy.
Different parts of a policy language can be domain-specific:
- Namespace: The objects over which we declare policy can be domain-specific. For example, a networking DSL might define policy about packets, ports, switches, and routers.
- Condition: Policy languages typically have if-then constructs, and the “if” part of those constructs can include domain-specific tests, such as the source and destination IP addresses on a network packet.
- Consequent: The “then” component of if-then constructs can also be domain-specific. For networking, this might include allowing/dropping a packet, sending a packet through a firewall and then a load balancer, or ensuring quality of service (QoS).
Independent of domain-specific constructs, a language has a fundamental limitation on its expressiveness (its “raw expressiveness”). Language A is more expressive than language B if every policy for B can be translated into a policy for A but not vice versa. For example, if language A supports the logical connectives AND/OR/NOT, and language B is the same except it only supports AND/OR, then A is more expressive than B. However, it can be the case that language A supports AND/OR/NOT, language B supports AND/NOT, and yet the two languages are equally expressive (because OR can be simulated with AND/NOT).
It may seem that more expressiveness is necessarily better because a more expressive language makes it easier for users to write policies. Unfortunately, the more expressive a language, the harder it is to implement. By “harder to implement” we don’t mean it’s harder to get a 30% speed improvement through careful engineering; rather, we mean that it is provably impossible to make the implementation of sufficiently expressive languages run in less than exponential time. In short, every policy language chooses how to balance the policy writers’ need for expressiveness and the policy system’s need for implementability.
On top of domain-specificity and raw expressiveness, different policy languages support different features. For example, can we say that some policy statements are “hard” (can never be violated) while other statements are “soft” (can be violated if the only way to not violate is to violate a hard constraint). More generally, can we assign priorities to policy statements? Is there native support for exceptions to policy rules (maybe a cloud owner wants to manually make an exception for a violation so that auditing reflects why that violation was less severe than it may have seemed). Does the language have policy modules and enable people to describe how to combine those modules to produce a new policy? While such features might not impact domain-specificity or raw expressiveness, they have a large impact on how easy the policy language—and therefore the system using that language—is to use.
The key takeaway here is that the policy language has a significant impact on the policy system, so the choice of policy language is a critical one.
Policy Interaction: Integrating with data center services
A policy system by itself is useless; to have value, the policy system must interact and integrate with other data center or cloud services. By “data center service” or “cloud service” we mean basically anything that has an API, e.g. OpenStack components like Neutron, routers, servers, processes, files, databases, antivirus, intrusion detection, inventory management. Read-only API calls enable a policy system to see what is happening; read/write API calls enable a policy system to control what is happening.
Since a policy system’s ability to do something useful with a policy (like prevent violations, correct violations, or monitor for violations) is directly related to what the service can see and do in the data center, it’s crucial to understand how well a policy system works with the services in a data center. If two policy systems are equivalent except that one works with a broader range of data center services than the other, the one with a broader selection of data center services has the ability to see and do more; thus, it’s better able to see and do things to help the data center obey policy.
One type of data center service is especially noteworthy: the policy-aware service. Such services understand policy natively. They may have an API that includes “insert policy statement” and “delete policy statement”. Such services are especially useful in that they can potentially help the data center obey certain kinds of sub-policies. Distributing the work makes a policy system more robust, more reliable, and better performing.
The key point to remember here is that a policy system’s “power” (knowledge of and control over what’s happening in a data center or cloud environment) is driven by the nature of its interaction with the services running in that data center.
Policy Action: Taking action based on policy
Having looked at three key aspects of a policy system—supporting multiple sources of policies and multiple policies, using a policy language that balances expressiveness with implementability, and providing the appropriate depth and breadth of integration with necessary data center services—we now come to a discussion of what the policy system does (or can do) once it knows what policy (or group of policies) is pertinent to the data center. It’s compelling to think about the utility of policy in terms of the future, the present, and the past. We want the data center to obey the policy at all points in time, and there are different mechanisms for doing that.
- Auditing: We cannot change the past but we can record how the data center behaved, what the policy was, and therefore what the violations were.
- Monitoring: The present is equally impossible to change (by the time we act, that moment in time will have become the past), but we can identify violations, help people understand them, and gather information about how to reduce violations in the future.
- Enforcement: We can change the future behavior of the data center by taking action. Enforcement can attempt to prevent violations before they occur (“proactive enforcement”) or correct violations after the fact (“reactive enforcement”). This is the most challenging of the three because it requires choosing and executing actions that affect the natural state of the data center.
The potential for any policy system to carry out these three functions depends crucially on two things: the policy itself (a function of how well the system supports multiple policies as well as the system’s choice of policy language) and the controls the policy system has over the data center (driven by the policy system’s interaction with and integration into the surrounding data center services). The combination of these two things impose hard limits on how well any policy system is able to audit, monitor, and enforce policy.
While we would rather prevent violations than correct them, it’s sometimes impossible to do so. For example, we cannot prevent violations in a policy that requires server operating system (OS) instances to always have the latest patches. Why? As soon as Microsoft, Apple, or Red Hat releases a new security patch, the policy is immediately violated. The point of this kind of policy is that the policy system recognizes the violation and applies the patch to the vulnerable systems (to correct the violation). The key takeaway from this example is that preventing violations requires the policy system is on the critical path for any action that might violate policy. Violations can only be prevented if such enforcement points are available.
Similarly, it’s not always possible to correct violations. Consider a policy that says the load on a particular web server should never exceed 10,000 requests per second. If the requests to that server become high enough (even with load balancing), there may be no way reduce the load once it reaches 10,001 requests per second. The data center cannot control what web sites people in the real world access through their browsers. In this case, the key takeaway is that correcting violations requires there be actions available to the policy system to counteract the causes of those violations.
Even policy monitoring has limitations. A policy dictating application deployment to particular data centers based on the users of that application assumes readily available information about where applications are deployed and the users of those applications. A web application that does not expose information about its users ensures even monitoring this policy is impossible. The key takeaway here is that monitoring a policy requires that the appropriate information is available to the policy system. Further, if we cannot monitor a policy, we also cannot audit that policy.
In short, every policy system has limitations. These limitations might be on what the policy system knows about the data center, or these limitations might be on what control it has over the data center. These limitations influence whether any given policy can be audited, monitored, and enforced. Moreover, these limitations can change as the data center changes. As new services (hardware or software) are installed or existing services are upgraded in the data center, new functionality becomes available, and a policy system may have additional power (fewer limitations). When old services are removed, the policy system may have less power (more limitations).
These limitations give us ceilings on how successful any policy system might be in terms of auditing, monitoring, and enforcing policy. It is therefore useful to compare policy system designs in terms of how close to those ceilings they can get. Of course, the true test is in terms of actual implementation, not design, and a comparison of systems in terms of what policies they can audit, monitor, and enforce at scale is incredibly valuable. However, we must be careful not to condemn systems for not solving unsolvable problems.
This blog post has focused on laying out the range of possible solutions to the policy problem in the data center. In summary, here are some key points:
Policies come from many different sources and interact in many different ways. Ideally the data center would obey all those policies simultaneously, but in practice we expect the policies to conflict. A solution to the policy problem must address the issues surrounding multiple policies.
A policy language can be categorized in terms of its domain-specificity, its raw expressiveness, and the features it supports. Every solution must balance these three to meet the need for the policy writer to express policy and the need of the policy system to audit, monitor, and enforce policy.
Every solution must interact with the ecosystem of data center services within the data center. The richer the ecosystem a solution can leverage, the more successful it can be.
Once a policy system has a policy, it can audit, monitor, and enforce that policy. A solution to the policy problem is more or less successful at these functions depending on the policy and the data center.
In the next blog post, we will look at proposed policy systems like the OpenStack Group-Based Policy and the Congress project, and explain how they fit into this solution space.
As we’ve discussed previously, the vSwitch is a great position to detect elephant, or heavy-hitter flows because it has proximity to the guest OS and can use that position to gather additional context. This context may include the TSO send buffer, or even the guest TCP send buffer. Once an elephant is detected, it can be signaled to the underlay using standard interfaces such as DSCP. The following slide deck provides and overview of a working version of this, showing how such a setup can be used to both dynamically detect elephants and isolate mice from queuing delays they cause. We’ll write about this in more detail in a later post, but for now check out the slides (and in particular the graphs showing the latency of mice with and without detection and handling).
(This post was written by Tim Hinrichs and Scott Lowe with contributions from Martin Casado, Mike Dvorkin, Peter Balland, Pierre Ettori, and Dennis Moreau.)
Fully automated IT provisioning and management is considered by many to be the ultimate nirvana— people log into a self-service portal, ask for resources (compute, networking, storage, and others), and within minutes those resources are up and running. No longer are the people who use resources waiting on the people who are responsible for allocating and maintaining them. And, according to the accepted definitions of cloud computing (for example, the NIST definition in SP800-145), self-service provisioning is a key tenet of cloud computing.
However, fully automated IT management is a double-edged sword. While having people on the critical path for IT management was time-consuming, it provided an opportunity to ensure that those resources were managed sensibly and in a way that was consistent with how the business said they ought to be managed. In other words, having people on the critical path enabled IT resources to be managed according to business policy. We cannot simply remove those people without also adding a way of ensuring that IT resources obey business policy—without introducing a way of ensuring that IT resources retain the same level of policy compliance.
While many tools today (e.g. application lifecycle-management, templates, blueprints) claim they bring the data center into compliance with a given policy, there are often hidden assumptions underlying those claims. Before we can hope to understand how effective those tools are, we need to understand the problem of policy and policy compliance, which is the focus of this post. Future posts will begin laying out the space of policy compliance solutions and evaluating how existing policy efforts fit into that space.
The Challenge of Policy Compliance
Policy compliance is challenging for several reasons. A business policy is often a combination of policies from many different sources. It’s something in the heads of IT operators (“This link is flakey, so route network traffic around it”). It’s written in English as organizational mandates (“We never put applications from different business units on the same production network”). It’s something expressed in contracts between organizations (“Our preferred partners are always given solid-state storage”). It’s something found in governmental legislation (“Data from Singapore’s citizens can never leave the geographic borders of Singapore”).
Despite its complexity, policy compliance is already being addressed today. People take high-level laws, rules, and regulations, and translate them into checklists that describe how the IT infrastructure must be architected, how the software applications running on that infrastructure must be written, and how software applications must be deployed on that infrastructure. Another group of people translate those checklists into configuration parameters for individual components of the infrastructure (e.g. servers, networks, storage) and into functional or non-functional software requirements for the applications deployed on that infrastructure. Network policy is implemented as a myriad of switch, router, and firewall configurations. Server policy is implemented by configuration management systems like Puppet/Chef/CFEngine, keeping systems up to date with the latest security patches, and ensuring that known vulnerabilities are mitigated with other mechanisms such as firewalls. Application developers write special-purpose code to address policy.
Hopefully it is clear that policy compliance is a hard, hard problem. Some of the difficulties are unavoidable. People will always need to deal with the ambiguity and contradictions of laws, rules, and regulations. People will always need to translate those laws into the language of infrastructure and applications. People will always have the job of auditing to ensure a business is in compliance with the law.
Automating Policy Compliance
In this post, we focus on the aspects of the policy compliance problem that we believe are amenable to automation. We use the term IT policy from this point forward to refer to the high-level laws, rules, and regulations after they have been translated into the language of infrastructure and applications. This aspect of policy compliance is important because we believe we can automate much of it, and because there are numerous problems with how it is addressed today. For example:
An IT policy is often written using fairly high-level concepts, e.g. the information applications manipulate, how applications are allowed to communicate, what their performance must be, and how they must be secured. The onus for translating these concepts into infrastructure configuration and application code is left to infrastructure operators and application developers. That translation process is error-prone.
The results are often brittle. Moving a server or application from one network to another could cause a policy violation. A new application with different infrastructure requirements could force a network operator to remember why the configuration was the way it was, and to change it in a way that satisfies the IT policy for both the original and the new applications simultaneously.
When an auditor comes along to assess policy compliance, she must analyze the plethora of configurations, application code, architecture choices, and deployment options in the current system in order to understand what the IT policy that is being enforced actually is. Because this is so difficult, auditors typically use checklists that give an approximate understanding of compliance.
Whenever the IT policy changes (e.g. because new legislation is passed), people must identify the infrastructure components and applications causing (what are suddenly) policy violations and rework them.
Key Components of Policy Compliance Automation
We believe that the same tools and technologies that helped automate IT management bring the promise of better, easier, and faster IT policy compliance through policy-aware systems. A policy-aware system has the potential to detect policy violations in the cloud and reconfigure automatically. A policy-aware system has the potential to identify situations where no matter what we do, there is no way to make a proposed change without violating policy. A policy-aware system has the potential to be told about policy updates and react accordingly. A policy-aware system has the potential to maintain enough state about the cloud to identify existing and potential policy violations and either prevent them from occurring or take corrective action.
To enable policy-aware systems, we need at least two things.
We must communicate the IT policy to the system in a language it can understand. English won’t work; it is ambiguous and thus would require the computer to do the jobs of lawyers. Router and firewall configurations won’t work because they are too low level—the true intent of the policy is lost in the details. We need a language that can unambiguously describe both operational and infrastructure policies using concepts similar to those we would use if writing the policy in English.
We need a policy engine that understands that language, can act to bring IT resources into compliance, can interoperate with the ecosystem of software services in a cloud or across multiple clouds, and can help administrators and auditors understand where policy violations are and how to correct them.
In a future post, we’ll examine these key components in a bit more detail and discuss potential solutions for filling these roles. At that time, we’ll also discuss how recent developments like Group Policy for OpenDaylight and OpenStack Neutron, and OpenStack Congress fit into this landscape.
One More Thing: Openness
There is one more thing that is important for automated policy compliance: openness. As the “highest” layer in a fully-automated IT environment, policy represents the final layer of potential control or “lock-in.” As such, we believe it is critically important that an automated policy compliance solution not fall under the control of any one vendor. An automated policy compliance solution that offers cloud interoperability needs to be developed out in the open, with open governance, open collaboration, and open code access. By leveraging highly successful open source methodologies and approaches, this becomes possible.
Wrapping It Up
As we wrap this up and prepare for the next installation in this series, here are some key points to remember:
- IT policy compliance in cloud environments is critical. IT resource management cannot be fully and safely automated without policy compliance.
- Manually enforcing IT policy isn’t sustainable in today’s fast-moving and highly fluid IT environments. An automated IT policy compliance solution is needed.
- An automated IT policy compliance solution needs the right policy language—one that can express unambiguous concepts in a way that is consumable by both humans and software alike—as well as a policy engine that can understandthis policy language and interact with an ecosystem of cloud services to enforce policy compliance.
- Any policy compliance solution needs to be developed using an open source model, with open governance, open collaboration, and open code access.
In the next part of this series on automated policy compliance in cloud environments, we’ll take a deeper dive into looking at policy languages and policy engines that meet the needs described here.
[This post was written by Dinesh Dutt with help from Martin Casado. Dinesh is Chief Scientist at Cumulus Networks. Before that, he was a Cisco Fellow, working on various data center technologies from ASICs to protocols to RFCs. He’s a primary co-author on the TRILL RFC and the VxLAN draft at the IETF. Sudeep Goswami, Shrijeet Mukherjee, Teemu Koponen, Dmitri Kalintsev, and T. Sridhar provided useful feedback along the way.]
In light of the seismic shifts introduced by server and network virtualization, many questions pertaining to the role of end hosts and the networking subsystem have come to the fore. Of the many questions raised by network virtualization, a prominent one is this: what function does the physical network provide in network virtualization? This post considers this question through the lens of the end-to-end argument.
Networking and Modern Data Center Applications
There are a few primary lessons learnt from the large scale data centers run by companies such as Amazon, Google, Facebook and Microsoft. The first such lesson is that a physical network built on L3 with equal-cost multipathing (ECMP) is a good fit for the modern data center. These networks provide predictable latency, scale well, converge quickly when nodes or links change, and provide a very fine-grained failure domain.
Second, historically, throwing bandwidth at the problem leads to far simpler networking compared to using complex features to overcome bandwidth limitations. The cost of building such high capacity networks has dropped dramatically in the last few years. By making networks follow the KISS principle, the networks are more robust and can be built out of simple building blocks.
Finally, there is value in moving functions from the network to the edge where there are better semantics, a richer compute model, and lower performance demands. This is evidenced by the applications that are driving data center. Over time, they have subsumed many of the functions that prior generation applications relied on the network for. For example, Hadoop has its own discovery mechanism instead of assuming that all peers are on the same broadcast medium (L2 network). Failure, security and other such characteristics are often built into the application, the compute harness, or the PaaS layer.
There is no debate about the role of networking for such applications. Yes, networks can attempt to do better load spreading or such, but vendors don’t design Hadoop protocol into networking equipment and debate about the performance benefits of doing so.
The story is much different when discussing virtual datacenters (for brevity, we’ll refer to these virtualized datacenters as “clouds” while acknowledging it is a misuse of the term) that host traditional workloads. Here there is active debate as to where functionality should lie.
Network virtualization is a key cloud-enabling technology. Network virtualization does to networking what server virtualization did to servers. It takes the fundamental pieces that constitute networking – addresses and connectivity(including policies that determine connectivity) – and virtualizes them such that many virtual networks can be multiplexed onto a single physical network.
Unlike software layers within modern data center applications that provide similar functionality (although with different abstractions) there is an ongoing discussion on where the virtual network should be implemented. In what follows, we view this discussion in light of the end-to-end argument.
Network Virtualization and the End-to-End Principle
The end-to-end principle (http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf) is a fundamental principle defining the design and functioning of the largest network of them all, the Internet. Over the years since its first inception, in 1984, the principle has been revisited and revised, many times, by the authors themselves and by others. But a fundamental idea it postulated remains as relevant today as when it was first formulated.
With regard to the question of where to place a function, in an application or in the communication subsystem, this is what the original paper says (this oft-quoted section comes at the end of a discussion where the application being discussed is reliable file transfer and the function is reliability): “The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the end points of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)” [Emphasis is by the authors of this post].
Consider the application of this statement to virtual networking. One of the primary pieces of information required in network virtualization is the virtual network ID (or VNI). Let us consider who can provide the information correctly and completely.
In the current world of server virtualization, the network is completely unaware of when a VM is enabled or disabled and therefore joins or leaves (or creates or destroys) a virtual network. Furthermore, since the VM itself is unaware of the fact that it is running virtualized and that the NIC it sees is really a virtual NIC, there is no information in the packet itself that can help a networking device such as a first hop router or switch identify the virtual network solely on the basis of an incoming frame. The hypervisor on the virtualized server is the only one that is aware of this detail and so it is the only one that can correctly implement this function of associating a packet to a virtual network.
Some solutions to network virtualization concede this point that the hypervisor has to be involved in the decision making of which virtual network a packet belongs to. But they’d like to consider a solution in which the hypervisor signals to the first hop switch the desire for a new virtual network and the first hop switch returns back a tag such as a VLAN for the hypervisor to tag the frame with. The first hop switch then uses this tag to be act as the edge of the network virtualization overlay. Let us consider what this entails to the overall system design. As a co-inventor of VxLAN while at Cisco, I’ve grappled with these consequences during the design of such approaches.
The robustness of a system is determined partly by how many touch points are involved when a function has to be performed. In the case of the network virtualization overlay, the only touch points involved are the ones that are directly involved in the communication: the sending and receiving hypervisors. The state about the bringup and teardown of a virtual network and the connectivity matrix of that virtual network do not involve the physical network. Since fewer touchpoints are involved in the functioning of a virtual network, it is easier to troubleshoot and diagnose problems with the virtual network (decomposing it as discussed in an earlier blog post).
Another data point for this argument comes from James Hamilton’s famous cry of “the network is in my way”. His frustration arose partly from the then in-vogue model of virtualizing networks. VLANs which were the primary construct in virtualizing a network involved touching multiple physical nodes to enable a new VLAN to come up. Since a new VLAN coming up could destabilize the existing network by becoming the straw that broke the camel’s back, a cumbersome, manual and lengthy process was required to add a new VLAN. This constrained the agility with which virtual networks could be spun up and spun down. Furthermore, to scale the solution even mildly, required the reinvention of how the primary L2 control protocol, spanning tree, worked (think MSTP).
Besides the technical merits of the end-to-end principle, another of its key consequences is the effect on innovation. It has been eloquently argued many times that services such as Netflix and VoIP are possible largely because the Internet design has the end-to-end principle as a fundamental rubric. Similarly, by looking at network virtualization as an application best implemented by end stations instead of a tight integration with the communication subsystem, it becomes clear that user choice and innovation become possible with this loose coupling. For example, you can choose between various network virtualization solutions when you separate the physical network from the virtual network overlay. And you can evolve the functions at software time scales. Also, you can use the same physical infrastructure for PaaS and IaaS applications instead of designing different networks for each kind. Lack of choice and control has been a driving factor in the revolution underway in networking today. So, this consequence of the end-to-end principle is not an academic point.
The final issue is performance. The end-to-end principal clearly allows functions to be pushed into the network as a performance optimization. This topic deserves a post in itself (we’ve addressed it in pieces before, but not in its entirety), so we’ll just tee up the basic arguments. Of course, if the virtual network overlay provides sufficient performance, there is nothing additional to do. If not, then the question remains of where to put the functionality to improve performance.
Clearly some functions should be in the physical network, such as packet replication, and enforcing QoS priorities. However, in general, we would argue that it is better to extend the end-host programming model (additional hardware instructions, more efficient memory models, etc.) where all end host applications can take advantage of it, than push a subset of the functions into the network and require an additional mechanism for state synchronization to relay edge semantics. But again, we’ll address these issues in a more focused post later.
At an architectural level, a virtual network overlay is not much different than functionality that already exists in a big data application. In both cases, functionality that applications have traditionally relied on the network for – discovery, handling failures, visibility and security – have been subsumed into the application.
Ivan Pepelnjak said it first, and said it best when comparing network virtualization to Skype. Network virtualization is better compared to the network layer that has organically evolved within applications than to traditional networking functionality found in switches and routers.
If the word “network” was not used in “virtual network overlay” or if it’s origins hadn’t been so intermixed with the discontentment with VLANs, we wonder if the debate would exist at all.
[This post has been written by Martin Casado and Justin Pettit with hugely useful input from Bruce Davie, Teemu Koponen, Brad Hedlund, Scott Lowe, and T. Sridhar]
This post introduces the topic of network optimization via large flow (elephant) detection and handling. We decompose the problem into three parts, (i) why large (elephant) flows are an important consideration, (ii) smart things we can do with them in the network, and (iii) detecting elephant flows and signaling their presence. For (i), we explain the basis of elephant and mice and why this matters for traffic optimization. For (ii) we present a number of approaches for handling the elephant flows in the physical fabric, several of which we’re working on with hardware partners. These include using separate queues for elephants and mice (small flows), using a dedicated network for elephants such as an optical fast path, doing intelligent routing for elephants within the physical network, and turning elephants into mice at the edge. For (iii), we show that elephant detection can be done relatively easily in the vSwitch. In fact, Open vSwitch has supported per-flow tracking for years. We describe how it’s easy to identify elephant flows at the vSwitch and in turn provide proper signaling to the physical network using standard mechanisms. We also show that it’s quite possible to handle elephants using industry standard hardware based on chips that exist today.
Finally, we argue that it is important that this interface remain standard and decoupled from the physical network because the accuracy of elephant detection can be greatly improved through edge semantics such as application awareness and a priori knowledge of the workloads being used.
The Problem with Elephants in a Field of Mice
Conventional wisdom (somewhat substantiated by research) is that the majority of flows within the datacenter are short (mice), yet the majority of packets belong to a few long-lived flows (elephants). Mice are often associated with bursty, latency-sensitive apps whereas elephants tend to be large transfers in which throughput is far more important than latency.
Here’s why this is important. Long-lived TCP flows tend to fill network buffers end-to-end, and this introduces non-trivial queuing delay to anything that shares these buffers. In a network of elephants and mice, this means that the more latency-sensitive mice are being affected. A second-order problem is that mice are generally very bursty, so adaptive routing techniques aren’t effective with them. Therefore routing in data centers often uses stateless, hash-based multipathing such as Equal-cost multi-path routing (ECMP). Even for very bursty traffic, it has been shown that this approach is within a factor of two of optimal, independent of the traffic matrix. However, using the same approach for very few elephants can cause suboptimal network usage, like hashing several elephants on to the same link when another link is free. This is a direct consequence of the law of small numbers and the size of the elephants.
Treating Elephants Differently than Mice
Most proposals for dealing with this problem involve identifying the elephants, and handling them differently than the mice. Here are a few approaches that are either used today, or have been proposed:
- Throw mice and elephants into different queues. This doesn’t solve the problem of hashing long-lived flows to the same link, but it does alleviate the queuing impact on mice by the elephants. Fortunately, this can be done easily on standard hardware today with DSCP bits.
- Use different routing approaches for mice and elephants. Even though mice are too bursty to do something smart with, elephants are by definition longer lived and are likely far less bursty. Therefore, the physical fabric could adaptively route the elephants while still using standard hash-based multipathing for the mice.
- Turn elephants into mice. The basic idea here is to split an elephant up into a bunch of mice (for example, by using more than one ephemeral source port for the flow) and letting end-to-end mechanisms deal with possible re-ordering. This approach has the nice property that the fabric remains simple and uses a single queuing and routing mechanism for all traffic. Also, SACK in modern TCP stacks handles reordering much better than traditional stacks. One way to implement this in an overlay network is to modify the ephemeral port of the outer header to create the necessary entropy needed by the multipathing hardware.
- Send elephants along a separate physical network. This is an extreme case of 2. One method of implementing this is to have two spines in a leaf/spine architecture, and having the top-of-rack direct the flow to the appropriate spine. Often an optical switch is proposed for the spine. One method for doing this is to do a policy-based routing decision using a DSCP value that by convention denotes “elephant”.
At this point it should be clear that handling elephants requires detection of elephants. It should also be clear that we’ve danced around the question of what exactly characterizes an elephant. Working backwards from the problem of introducing queuing delays on smaller, latency-sensitive flows, it’s fair to say that an elephant has high throughput for a sustained period.
Often elephants can be determined a priori without actually trying to infer them from network effects. In a number of the networks we work with, the elephants are either related to cloning, backup, or VM migrations, all of which can be inferred from the edge or are known to the operators. vSphere, for example, knows that a flow belongs to a migration. And in Google’s published work on using OpenFlow, they had identified the flows on which they use the TE engine beforehand (reference here).
Dynamic detection is a bit trickier. Doing it from within the network is hard due to the difficulty of flow tracking in high-density switching ASICs. A number of sampling methods have been proposed, such as sampling the buffers or using sFlow. However the accuracy of such approaches hasn’t been clear due to the sampling limitations at high speeds.
On the other hand, for virtualized environments (which is a primary concern of ours given that the authors work at VMware), it is relatively simple to do flow tracking within the vSwitch. Open vSwitch, for example, has supported per-flow granularity for the past several releases now with each flow record containing the bytes and packets sent. Given a specified threshold, it is trivial for the vSwitch to mark certain flows as elephants.
The More Vantage Points, the Better
It’s important to remember that there is no reason to limit elephant detection to a single approach. If you know that a flow is large a priori, great. If you can detect elephants in the network by sampling buffers, great. If you can use the vSwitch to do per-packet flow tracking without requiring any sampling heuristics, great. In the end, if multiple methods identify it as an elephant, it’s still an elephant.
For this reason we feel that it is very important that the identification of elephants should be decoupled from the physical hardware and signaled over a standard interface. The user, the policy engine, the application, the hypervisor, a third party network monitoring system, and the physical network should all be able identify elephants.
Fortunately, this can easily be done relatively simply using standard interfaces. For example, to affect per-packet handling of elephants, marking the DSCP bits is sufficient, and the physical infrastructure can be configured to respond appropriately.
Another approach we’re exploring takes a more global view. The idea is for each vSwitch to expose its elephants along with throughput metrics and duration. With that information, an SDN controller for the virtual edge can identify the heaviest hitters network wide, and then signal them to the physical network for special handling. Currently, we’re looking at exposing this information within an OVSDB column.
Are Elephants Obfuscated by the Overlay?
No. For modern overlays, flow-level information, and QoS marking are all available in the outer header and are directly visible to the underlay physical fabric. Elephant identification can exploit this characteristic.
This is a very exciting area for us. We believe there is a lot of room to bring to bear edge understanding of workloads, and the ability for software at the edge to do sophisticated trending analysis to the problem of elephant detection and handling. It’s early days yet, but our initial forays both with customers and hardware partners, has been very encouraging.
More to come.
Network virtualization, as others have noted, is now well past the hype stage and in serious production deployments. One factor that has facilitated the adoption of network virtualization is the ease with which it can be incrementally deployed. In a typical data center, the necessary infrastructure is already in place. Servers are interconnected by a physical network that already meets the basic requirements for network virtualization: providing IP connectivity between the physical servers. And the servers are themselves virtualized, providing the ideal insertion point for network virtualization: the vswitch (virtual switch). Because the vswitch is the first hop in the data path for every packet that enters or leaves a VM, it’s the natural place to implement the data plane for network virtualization. This is the approach taken by VMware (and by Nicira before we were part of VMware) to enable network virtualization, and it forms the basis for our current deployments.
In typical data centers, however, not every machine is virtualized. “Bare metal” servers — that is, unvirtualized, or physical machines — are a fact of life in most real data centers. Sometimes they are present because they run software that is not easily virtualized, or because of performance concerns (e.g. highly latency-sensitive applications), or just because there are users of the data center who haven’t felt the need to virtualize. How do we accommodate these bare-metal workloads in virtualized networks if there is no vswitch sitting inside the machine?
Our solution to this issue was to develop gateway capabilities that allow physical devices to be connected to virtual networks. One class of gateway that we’ve been using for a while is a software appliance. It runs on standard x86 hardware and contains an instance of Open vSwitch. Under the control of the NSX controller, it maps physical ports, and VLANs on those ports, to logical networks, so that any physical device can participate in a given logical network, communicating with the VMs that are also connected to that logical network. This is illustrated below.
As an aside, these gateways also address another common use case: traffic that enters and leaves the data center, or “north-south” traffic. The basic functionality is similar enough: the gateway maps traffic from a physical port (in this case, a port connected to a WAN router rather than a server) to logical networks, and vice versa.
Software gateways are a great solution for moderate amounts of physical-to-virtual traffic, but there are inevitably some scenarios where the volume of traffic is too high for a single x86-based appliance, or even a handful of them. Say you had a rack (or more) full of bare-metal database servers and you wanted to connect them to logical networks containing VMs running application and web tiers for multi-tier applications. Ideally you’d like a high-density and high-throughput device that could bridge the traffic to and from the physical servers into the logical networks. This is where hardware gateways enter the picture.
Leveraging VXLAN-capable Switches
Fortunately, there is an emerging class of hardware switch that is readily adaptable to this gateway use case. Switches from several vendors are now becoming available with the ability to terminate VXLAN tunnels. (We’ll call these switches VTEPs — VXLAN Tunnel End Points or, more precisely, hardware VTEPs.) VXLAN tunnel termination addresses the data plane aspects of mapping traffic from the physical world to the virtual. However, there is also a need for a control plane mechanism by which the NSX controller can tell the VTEP everything it needs to know to connect its physical ports to virtual networks. Broadly speaking, this means:
- providing the VTEP with information about the VXLAN tunnels that instantiate a particular logical network (such as the Virtual Network Identifier and destination IP addresses of the tunnels);
- providing mappings between the MAC addresses of VMs and specific VXLAN tunnels (so the VTEP knows how to forward packets to a given VM);
- instructing the VTEP as to which physical ports should be connected to which logical networks.
In return, the VTEP needs to tell the NSX controller what it knows about the physical world — specifically, the physical MAC addresses of devices that it can reach on its physical ports.
There may be other information to be exchanged between the controller and the VTEP to offer more capabilities, but this covers the basics. This information exchange can be viewed as the synchronization of two copies of a database, one of which resides in the controller and one of which is in the VTEP. The NSX controller already implements a database access protocol, OVSDB, for the purposes of configuring and monitoring Open vSwitch instances. We decided to leverage this existing protocol for control of third party VTEPs as well. We designed a new database schema to convey the information outlined above; the OVSDB protocol and the database code are unchanged. That choice has proven very helpful to our hardware partners, as they have been able to leverage the open source implementation of the OVSDB server and client libraries.
The upshot of this work is that we can now build virtual networks that connect relatively large numbers of physical ports to virtual ports, using essentially the same operational model for any type of port, virtual or physical. The NSX controller exposes a northbound API by which physical ports can be attached to logical switches. Virtual ports of VMs are attached in much the same way to build logical networks that span the physical and virtual worlds. The figure below illustrates the approach.
It’s worth noting that this approach has no need for IP multicast in the physical network, and limits the use of flooding within the overlay network. This contrasts with some early VXLAN implementations (and the original VXLAN Internet Draft, which didn’t entirely decouple the data plane from the control plane). The reason we are able to avoid flooding in many cases is that the NSX controller knows the location of all the VMs that it has attached to logical networks — this information is provided by the vswitches to the controller. And the controller shares its knowledge with the hardware VTEPs via OVSDB. Hence, any traffic destined for a VM can be placed on the right VXLAN tunnel from the outset.
In the virtual-to-physical direction, it’s only necessary to flood a packet if there is more than one hardware VTEP. (If there is only one hardware VTEP, we can assume that any unknown destination must be a physical device attached to the VTEP, since we know where all the VMs are). In this case, we use the NSX Service Node to replicate the packet to all the hardware VTEPs (but not to any of the vswitches). Furthermore, once a given hardware VTEP learns about a physical MAC address on one of its ports, it writes that information to the database, so there will be no need to flood subsequent packets. The net result is that the amount of flooded traffic is quite limited.
For a more detailed discussion of the role of VXLAN in the broader landscape of network virtualization, take a look at our earlier post on the topic.
Hardware or Software?
Of course, there are trade-offs between hardware and software gateways. In the NSX software gateway, there is quite a rich set of features that can be enabled (port security, port mirroring, QoS features, etc.) and the set will continue to grow over time. Similarly, the feature set that the hardware VTEPs support, and which NSX can control, will evolve over time. One of the challenges here is that hardware cycles are relatively long. On top of that, we’d like to provide a consistent model across many different hardware platforms, but there are inevitably differences across the various VTEP implementations. For example, we’re currently working with a number of different top-of-rack switch vendors. Some partners use their own custom ASICs for forwarding, while others use merchant silicon from the leading switch silicon vendors. Hence, there is quite a bit of diversity in the features that the hardware can support.
Our expectation is that software gateways will provide the greatest flexibility, feature-richness, and speed of evolution. Hardware VTEPs will be the preferred choice for deployments requiring greater total throughput and density.
A final note: one of the important benefits of network virtualization is that it decouples networking functions from the physical networking hardware. At first glance, it might seem that the use of hardware VTEPs is a step backwards, since we’re depending on specific hardware capabilities to interconnect the physical and virtual worlds. But by developing an approach that works across a wide variety of hardware devices and vendors, we’ve actually managed to achieve a good level of decoupling between the functionality that NSX provides and the hardware that underpins it. And as long as there are significant numbers of physical workloads that need to be managed in the data center, it will be attractive to have a range of options for connecting those workloads to virtual networks.