By Tim Hinrichs and Scott Lowe with contributions from Alex Yip, Dmitri Kalintsev, and Peter Balland
(Note: this post is also cross-published at RuleYourCloud.com, a new site focused on policy.)
In the first few parts of this series, we discussed the policy problem, we outlined dimensions of the solution space, and we gave a brief overview of the existing OpenStack policy efforts. In this post we do a deep dive into one of the (not yet incubated) OpenStack policy efforts: Congress.
Remember that to solve the policy problem, people take ideas in their head about how the data center ought to behave (“policy”) and codify them in a language the computer system can understand. That is, the policy problem is really a programming languages problem. Not surprisingly Congress is, at its core, a policy language plus an implementation of that language.
Congress is a standard cloud service; you install it on a server, give it some inputs, and interact with it via a RESTful API. Congress has two kinds of inputs:
- The other cloud services you’d like it to manage (for example, a compute manager like OpenStack Nova and a network manager like OpenStack Neutron)
- A policy that describes how those services ought to behave
For example, we might decide to use Congress to manage OpenStack’s Nova (compute), Neutron (networking), Swift (object storage), Cinder (block storage), and Heat (applications). We might write a geo-location policy:
“Every application collecting personal information about its users from Japan must have all of its compute, networking, and storage elements deployed at a data center that resides within the geographic borders of Japan.”
A cloud service gives Congress the ability to see and change the data center’s behavior. The more services hooked up to Congress, the more powerful Congress becomes. Congress was designed to manage any collection of cloud services, regardless of their origin or locality (private or public). It does not matter if the service is provided by OpenStack, CloudStack, AWS, GCE, Cisco, HP, IBM, Red Hat, VMware, etc. It does not matter if the service manages compute, networking, storage, applications, databases, anti-virus, inventory, people, or groups. Congress is vendor and domain agnostic.
Congress provides a unified abstraction for all services, which insulates the policy writer from understanding differing data formats, authentication protocols, APIs, and the like. Congress does NOT require any special code to be running on the services it manages; rather, it includes a light-weight adapter for each service that implements the unified interface using the service’s native API.
From the policy writer’s point of view, each service is simply a collection of tables. A table is a collection of rows; each row is a collection of columns; each row-column entry stores simple data like strings and integers. When Congress wants to see what is happening in the data center, it reads from those tables. When Congress wants to change what is happening in the data center, it writes to those tables.
For example, the Nova compute service is represented as several tables like the servers table below.
+---------+----------+--------+-----------+----------+-----------+ | id | host_id | status | tenant_id | image_id | flavor_id | +---------+----------+--------+-----------+----------+-----------+ | <UUID1> | <UUID2> | ACTIVE | alice | <UUID3> | <UUID4> | | <UUID5> | <UUID6> | ACTIVE | bob | <UUID7> | <UUID8> | | <UUID9> | <UUID10> | DOWN | bo | <UUID7> | <UUID8> | +---------+----------+--------+-----------+----------+-----------+
At the time of writing, there are adapters (which we call “datasource drivers”) for each of the following services, all but one of which are OpenStack.
- Plexxi controller
Each adapter is a few hundred lines of code that (i) makes API calls into the service to get information about that service’s behavior; and (ii) translates that information into tables. Just recently we added a domain-specific language (DSL) that automates the translation of that information into tables, given a description of the structure of the information.
For more information about connecting cloud services, see the Congress cloud services documentation.
A policy describes how a collection of cloud services ought to behave. Every organization’s policy is unique because every organization has different services in its data center. Every organization has different business advantages they are trying to gain via the cloud. Every organization has different regulations that govern it. Every organization is full of people with different ideas about the right way to run the cloud.
Congress aims to provide a single policy language that every organization can use to express their high- and low-level policies. Instead of providing a long list of micro-policies that the user can mix-and-match, Congress provides a general purpose policy language for expressing policy: the well-known declarative language Datalog.
Datalog is domain-agnostic. It is just as easy to write policy about compute as it is to write policy about networking. It is just as easy to write policy about how compute, networking, storage, group membership, and applications interact with each other. Moreover, Datalog enables policy writers to define abstractions to bridge the gap between low-level infrastructure policy and high-level business policy.
Suppose our policy says that all servers should on average have a CPU utilization of at least 20% over a 2 day span. In Datalog we would write the a policy that leverages Nova for compute, Ceilometer for CPU utilization data, and some built-in tables that treat strings as if they were dates.
First, we declare the conditions under which there is a policy violation. We do that by writing a rule that says a VM is an error (policy violation) if the conditions shown below are true.
error(vm, email_address) :- nova:servers(id=myid, tenant_id=owner), // myid is a server owned by owner two_days_previous(start_date, end_date), // start_date is 2 days before end_date; end_date is today ceilometer:statistics(id=myid, start=start_date, end=end_date, meter="cpu-util", avg=value), arithmetic:less_than(value, 20), // value < 20% keystone:user(id=owner, email=email_address)
We also define a helper table that computes the start and end dates for 2 days before today.
two_days_previous(start_date, end_date) :- datetime:now(end_date), datetime:minus(end_date, "2 days", start_date)
Helper tables like
two_days_previous are useful because they allow the policy writer to create higher-level concepts that may not exist natively in the cloud services. For example, we can create a helper table that tells us which servers are connected to the Internet—something that requires information from several different places in OpenStack. Or the compute, networking, and storage admins could create the higher-level concept “is-secure” and enable a higher-level manager to write a policy that describes when resources ought to be secured.
For more information about writing policy, see the Congress policy documentation.
Once we have connected services to Congress and written policy over those services, we’ve given Congress the inputs it needs carry out its core capabilities, which the user is free to mix and match.
Monitoring: Congress watches how the other cloud services are behaving, compares that to policy, and flags mismatches (policy violations).
Enforcement: Congress acts as a policy authority. A service can propose a change to Congress, and Congress will tell the service whether the change complies with policy or not, thus preventing policy violations before they happen. Congress can also correct some violations after the fact.
Auditing: Congress gives users the ability to record the history of policy, policy violations, and remediations.
Delegation: Congress can offload the burden of monitoring/enforcing/auditing to other policy-aware systems.
When it comes to enforcement, a common question is why Congress would support both proactive and reactive enforcement. The implied question being, “Isn’t proactive always preferred?” The answer is that proactive is not always possible. Consider the simple policy “ensure all operating systems have the latest security patch.” As soon as Microsoft/Apple/RedHat releases a new security patch, the policy is immediately violated; the whole purpose of writing the policy is to enable Congress to identify the violation and take action to correct it.
The tip of master includes monitoring and a mechanism for proactive enforcement. In the Kilo release of OpenStack we plan to have a form of reactive enforcement available as well.
In this post, we’ve talked about some of the key takeaways regarding Congress:
- Congress was designed to solve the policy problem and work with any cloud service and any policy.
- It is currently capable of monitoring and proactive enforcement. Reactive enforcement and delegation are currently underway.
- Congress is not yet incubated in OpenStack, but has contributions from half a dozen organizations and nearly two dozen people.
[This post was written by OVS core contributors Justin Pettit, Ben Pfaff, and Ethan Jackson.]
The overhead associated with vSwitches has been a hotly debated topic in the networking community. In this blog post, we show how recent changes to OVS have elevated its performance to be on par with the native Linux bridge. Furthermore, CPU utilization of OVS in realistic scenarios can be up to 8x below that of the Linux bridge. This is the first of a two-part series. In the next post, we take a peek at the design and performance of the forthcoming port to DPDK, which bypasses the kernel entirely to gain impressive performance.
Open vSwitch is the most popular network back-end for OpenStack deployments and widely accepted as the de facto standard OpenFlow implementation. Open vSwitch development initially had a narrow focus — supporting novel features necessary for advanced applications such as network virtualization. However, as we gained experience with production deployments, it became clear these initial goals were not sufficient. For Open vSwitch to be successful, it not only must be highly programmable and general, it must also be blazingly fast. For the past several years, our development efforts have focused on precisely this tension — building a software switch that does not compromise on either generality or speed.
To understand how we achieved this feat, one must first understand the OVS architecture. The forwarding plane consists of two parts: a “slow-path” userspace daemon called ovs-vswitchd and a “fast-path” kernel module. Most of the complexity of OVS is in ovs-vswitchd; all of the forwarding decisions and network protocol processing are handled there. The kernel module’s only responsibilities are tunnel termination and caching traffic handling.
When a packet is received by the kernel module, its cache of flows is consulted. If a relevant entry is found, then the associated actions (e.g., modify headers or forward the packet) are executed on the packet. If there is no entry, the packet is passed to ovs-vswitchd to decide the packet’s fate. ovs-vswitchd then executes the OpenFlow pipeline on the packet to compute actions, passes it back to the fast path for forwarding, and installs a flow cache entry so similar packets will not need to take these expensive steps.
Until OVS 1.11, this fast-path cache contained exact-match “microflows”. Each cache entry specified every field of the packet header, and was therefore limited to matching packets with this exact header. While this approach works well for most common traffic patterns, unusual applications, such as port scans or some peer-to-peer rendezvous servers, would have very low cache hit rates. In this case, many packets would need to traverse the slow path, severely limiting performance.
OVS 1.11 introduced megaflows, enabling the single biggest performance improvement to date. Instead of a simple exact-match cache, the kernel cache now supports arbitrary bitwise wildcarding. Therefore, it is now possible to specify only those fields that actually affect forwarding.. For example, if OVS is configured simply to be a learning switch, then only the ingress port and L2 fields are relevant and all other fields can be wildcarded. In previous releases, a port scan would have required a separate cache entry for, e.g., each half of a TCP connection, even though the L3 and L4 fields were not important.
The introduction of megaflows allowed OVS to drastically reduce the number of packets that traversed the slow path. This represents a major improvement, but ovs-vswitchd still had a number of responsibilities, which became the new bottleneck. These include activities like managing the datapath flow table, running switch protocols (LACP, BFD, STP, etc), and other general accounting and management tasks.
While the kernel datapath has always been multi-threaded, ovs-vswitchd was a single-threaded process until OVS 2.0. This architecture was pleasantly simple, but it suffered from several drawbacks. Most obviously, it could use at most one CPU core. This was sufficient for hypervisors, but we began to see Open vSwitch used more frequently as a network appliance, in which it is important to fully use machine resources.
Less obviously, it becomes quite difficult to support ovs-vswitchd’s real-time requirements in a single-threaded architecture. Fast-path misses from the kernel must be processed by ovs-vswitchd as promptly as possible. In the old single-threaded architecture, miss handling often blocked behind the single thread’s other tasks. Large OVSDB changes, OpenFlow table changes, disk writes for logs, and other routine tasks could delay handling misses and degrade forwarding performance. In a multi-threaded architecture, soft real-time tasks can be separated into their own threads, protected from delay by unrelated maintenance tasks.
Since the introduction of multithreading, we’ve continued to fine-tune the number of threads and their responsibilities. In addition to order of magnitudes improvements in miss handling performance, this architecture shift has allowed us to increase the size of the kernel cache from 1000 flows in early versions of OVS to roughly 200,000 in the most recent version.
The introduction of megaflows and support for a larger cache enabled by multithreading reduced the need for packets to be processed in userspace. This is important because a packet lookup in userspace can be quite expensive. For example, a network virtualization application might define an OpenFlow pipeline with dozens of tables that hold hundreds of thousands of rules. The final optimization we will discuss is improvements in the classifier in ovs-vswitchd, which is responsible for determining which OpenFlow rules apply when processing a packet.
Data structures to allow a flow table to be quickly searched are an active area of research. OVS uses a tuple space search classifier, which consists of one hash table (tuple) for each kind of match actually in use. For example, if some flows match on the source IP, that’s represented as one tuple, and if others match on source IP and destination TCP port, that’s a second tuple. Searching a tuple space search classifier requires searching each tuple, then taking the highest priority match. In the last few releases, we have introduced a number of novel improvements to the basic tuple search algorithm:
- Priority Sorting – Our simplest optimization is to sort the tuples in order by the maximum priority of any flow in the tuple. Then a search that finds a matching flow with priority P can terminate as soon as it arrives at a tuple whose maximum priority is P or less.
- Staged Lookup – In many network applications, a policy may only apply to a subset of the headers. For example, a firewall policy that requires looking at TCP ports may only apply to a couple of VMs’ interfaces. With the basic tuple search algorithm, if any rule looks at the TCP ports, then any generated megaflow would look all the way up through the L4 headers. With staged lookup, we scope lookups from metadata (e.g., ingress port) up to layers further up the stack on an as-needed basis.
- Prefix Tracking – When processing L3 traffic, a longest prefix match is required for routing. The tuple-space algorithm works poorly in this case, since it degenerates into a tuple that matches as many bits as the longest match. This means that if one rule matches on 10.x.x.x and another on 192.168.0.x, the 10.x.x.x rule will also require matching 24 bits instead of 8, which requires keeping more megaflows in the kernel. With prefix tracking, we consult a trie that allows us to only look at tuples with the high order bits sufficient to differentiate between rules.
These classifier improvements have been shown with practical rule sets to reduce the number of megaflows needed in the kernel from over a million to only dozens.
The preceding changes improve many aspects of performance. For this post, we’ll just evaluate the performance gains in flow setup, which was the area of greatest concern. To measure setup performance, we used netperf’s TCP_CRR test, which measures the number of short-lived transactions per second (tps) that can be established between two hosts. We compared OVS to the Linux bridge, a fixed-function Ethernet switch implemented entirely inside the Linux kernel.
In the simplest configuration, the two switches achieved identical throughput (18.8 Gbps) and similar TCP_CRR connection rates (696,000 tps for OVS, 688,000 for the Linux bridge), although OVS used more CPU (161% vs. 48%). However, when we added one flow to OVS to drop STP BPDU packets and a similar ebtable rule to the Linux bridge, OVS performance and CPU usage remained constant whereas the Linux bridge connection rate dropped to 512,000 tps and its CPU usage increased over 26-fold to 1,279%. This is because the built-in kernel functions have per-packet overhead, whereas OVS’s overhead is generally fixed per-megaflow. We expect that enabling other features, such as routing and a firewall, would similarly add CPU load.
|Linux Bridge||Pass BPDUs||688,000||48%|
|Open vSwitch 1.10||12,000||100%|
|Open vSwitch 2.1||Megaflows Off||23,000||754%|
While these performance numbers are useful for benchmarking, they are synthetic. At the OpenStack Summit last week in Paris, Rackspace engineers described the performance gains they have seen over the past few releases in their production deployment. They begin in the “Dark Ages” (versions prior to 1.11) and proceed to “Ludicrous Speed” (versions since 2.1).
Now that OVS’s performance is similar to those of a fixed-function switch while maintaining the flexibility demanded by new networking applications, we’re looking forward to broadening our focus. While we continue to make performance improvements, the next few releases will begin adding new features such as stateful services and support for new platforms such as DPDK and Hyper-V.
If you want to hear more about this live or talk to us in person, we will all be at the Open vSwitch Fall 2014 Conference next week.
(This post was written by Tim Hinrichs and Scott Lowe, with contributions from Pierre Ettori, Aaron Rosen, and Peter Balland.)
In the first two parts of this blog series we discussed the problem of policy in the data center and the features that differentiate solutions to that problem. In this post, we give a high-level overview of several policy efforts within OpenStack.
Remember that a policy is a description of how (some part of) the data center ought to behave, a service is any component in the data center that has an API, and a policy system is designed to manage some combination of past, present, and future policy violations (auditing, monitoring, and enforcement, respectively).
The overview of OpenStack policy efforts talks about the features we identified in part 2 of this blog series. To recap, those features are:
- Policy language: how expressive is the language, is the language restricted to certain domains, what features (e.g. exceptions) does it support?
- Policy sources: what are the sources of policy, how do different sources of policy interact, how are conflicts dealt with?
- Services: which other data center services can be leveraged and how?
- Actions: what does the system do once it is given a policy: monitor (identify violations), enforce (prevent or correct violations), audit (analyze past violations)?
The one thing you’ll notice is that there are many different policy efforts within OpenStack. Perhaps surprisingly there is actually little redundancy because each effort addresses a different part of the overall policy problem: enabling users to describe their desires in a way that an OpenStack cloud can act on them. Additionally, as we will point out again later in the post, domain independent and domain specific policy efforts are highly complementary.
We begin with Congress, our own policy effort within OpenStack. Congress is a system purpose-built for managing policy in the data center. A Congress policy describes the desired behavior of the data center by dictating how all the services running in that data center are supposed to behave both individually and in tandem. In the current release Congress accepts a single policy for the entire data center, the idea being that the cloud administrators are jointly responsible for writing and maintaining that policy.
A Congress policy is domain independent and can describe the behavior of any collection of data center services. The cloud administrator can write a policy about networking, a policy about compute, or a policy that about networking, compute, storage, antivirus, organizational charts, inventory management systems, ActiveDirectory, and so on.
The recent alpha release of Congress supports monitoring violations in policy: comparing how the data center is actually behaving to how policy says the data center ought to behave and flagging mismatches. In the future, Congress will also support enforcement by having Congress itself execute API calls to change the behavior of the data center and/or pushing policy to other policy-aware services better positioned to enforce policy.
Neutron Group-Based Policy (GBP)
Neutron Group-Based Policy (GBP), which is similar to the policy effort in OpenDaylight, utilizes policy to manage networking. A policy describes how the network packets in the data center are supposed to behave. Each policy (“contract” in GBP terminology) describes which actions (such as allow, drop, reroute, or apply QoS) should be applied to which network packets based on packet header properties like port and protocol. Entities on the network (called “endpoints”) are grouped and each group is assigned one or more policies. Groups are maintained outside the policy language by people or automated systems.
In GBP, policies can come from any number of people or agents. Conflicts can arise within a single policy or across several policies and are eliminated by a mechanism built into GBP (which is out of scope for this blog post).
The goal of GBP is to enforce policy directly. (Both monitoring and auditing are challenging in the networking domain because there are so many packets moving so quickly throughout the data center.) To do enforcement, GBP compiles policies down into existing Neutron primitives and creates logical networks, switches, and routers. When new policy statements are inserted, GBP does an incremental compilation: changing the Neutron primitives in such a way as to implement the new policy while minimally disrupting existing primitives.
Swift Storage Policy
Swift is OpenStack’s object storage service. As of version 2.0, released July 2014, Swift supports storage policies. Each storage policy is attached to a virtual storage system, which is where Swift stores objects. Each policy assigns values to a number of built-in features of a storage system. At the time of writing, each policy dictates how many partitions the storage system has, how many replicas of each object it should maintain, and the minimum amount of time before a partition can be moved to a different physical location since the last time it was moved.
A user can create any number of virtual storage systems—and so can write any number of policies—but there are no conflicts between policies. If we put an object into a container with 2 replicas and the same object into another container with 3 replicas, it just means we are storing that object in two different virtual storage systems, which all told means we have 5 replicas.
Policy is enforced directly by Swift. Every time an object is written, Swift ensures the right number of replicas are created. Swift ensures not to move a partition before policy allows that partition to be moved.
The Smart Scheduler/SolverScheduler effort aims to provides an interface for using different constraint solvers to solve optimization problems for other projects, Nova in particular. One specific use case is for Network Functions Virtualization (see here and here) For example, Nova might ask where to place a new virtual machine to minimize the average number of VMs on each server. This effort utilizes domain-independent solvers (such as linear programming/arithmetic solvers) but applies them to solve domain-specific problems. The intention is to focus on enforcement.
Nova Policy-Based Scheduling Module
The Nova policy-based scheduling module aims to schedule Nova resources per client, per cluster of resources, and per context (e.g. overload, time, etc.). A proof of concept was presented at the Demo Theater at OpenStack Juno Summit.
Gantt aims to provide scheduling as a service for other OpenStack components (see here and here). Previously, it was a subgroup of Nova and focused on scheduling virtual machines based on resource utilization. It includes plugin framework for making arbitrary metrics available to the scheduler.
Heat Convergence engine
The Heat Convergence engine represents a shift toward a model for Heat where applications are deployed and managed by comparing the current state of the application to the desired state of the application and taking action to reduce the differences. Each desired state amounts to a policy describing a single application. Those policies do not interact, logically, and can draw upon any service in the data center. Heat policies are concerned mainly with corrective enforcement, though monitoring is also useful (“how far along is my application’s deployment?”).
The key takeaway is that OpenStack has a growing ecosystem of policy-aware services. Most of them are domain-specific, meaning they are systems tailored to enforcing a particular kind of policy for a particular service, but a few are domain-independent, meaning that they will work for any kind of policy.
As we mentioned earlier, domain-independent and domain-specific policy systems are highly complementary. The strength of a domain-specific policy system is enforcing policies within its domain, but its weakness is that policies outside the domain are not expressible in the language. The strength of a domain-independent policy system is expressing policies for any and every domain, but its weakness is that monitoring/enforcing/auditing those policies can be challenging.
For policy to live up to its expectations, we need a rich ecosystem of policy-aware services that interoperate with one another. Networking policies should be handled by Neutron; compute policies should be handled by Nova; storage policies should be handled by Swift and Cinder; application policies should be handled by Heat; cross-cutting policies should be handled by a combination of Congress, Gantt, and SolverScheduler. We believe it will be incredibly valuable to give users a single touch point to understand how all the policies throughout the data center interact and interoperate—to provide a dashboard where users ask questions about the current state of the data center, investigate the impact of proposed changes, enact and automate enforcement decisions, and audit the data center’s policy from the past.
To help coordinate the interaction and development of policy-aware services and policy-related efforts in OpenStack, the OpenStack Mid-Cycle Policy Summit intends to bring representatives from many different policy-minded companies and projects together. The aim of the summit is to discuss the current state of policy within OpenStack and begin discussing the roadmap for how policy will evolve in the future. The summit will start with some presentations by (and about) the various policy-related efforts and their approach to policy; it will wrap up with a workshop focused on how the different efforts might interoperate both today and in the future. Following this summit, which takes place September 18–19, 2014, we’ll post another blog entry describing the experience and lessons learned.
[This post was authored by T. Sridhar and Jesse Gross.]
Earlier this year, VMware, Microsoft, Red Hat and Intel published an IETF draft on Generic Network Virtualization Encapsulation (Geneve). This draft (first published on Valentine’s Day no less) includes authors from the each of the first generation encapsulation protocols — VXLAN, NVGRE, and STT. However, beyond the obvious appeal of unification across hypervisor platforms, the salient feature of Geneve is that it was designed from the ground up to be flexible. Nobody wants an endless cycle of new encapsulation formats as network virtualization designs and controllers mature, certainly not the vendors that have to support the ever growing list of acronyms and RFCs.
Of course press releases, standards bodies and predictions about the future mean little without actual implementations, which is why it is important to consider the “ecosystem” from the beginning of the process. This includes software and silicon implementations in both commercial and open source varieties. This always takes time but since Geneve was designed to accommodate a wide variety of use cases it has seen a relatively quick uptake. Unsurprisingly, the first implementations that landed were open source software — including switches such as Open vSwitch and networking troubleshooting tools like Wireshark. Today the first hardware implementation has arrived, in the form of the 40 Gbps Intel XL710 NIC, previously known as Fortville.
Demo of Geneve hardware acceleration at Intel Developer Forum.
Why is hardware support important? Performance. Everyone likes flexibility, of course, but most of the time that comes with a cost. In the case of a NIC, hardware acceleration enables us to have our cake and eat it too by offloading expensive operations while retaining software control in the CPU. These NICs add encapsulation awareness for classic operations like checksum and TCP segmentation offload to bring Geneve tunnels to performance parity with traditional traffic. For good measure, they also add in support for a few additional Geneve-specific features as well.
Of course, this is just the beginning — it is still only six months after publication of the Geneve specification and much more is still to come. Expect to see further announcements coming soon for both NIC and switch silicon and of course new software to take advantage of the advanced capabilities. Until then, a discussion session as well as a live demo will be at Intel Developer Forum this week to provide a first glimpse of Geneve in action.
[This post was co-authored by Bruce Davie and Ken Duda]
Almost a year ago, we wrote a first post about our efforts to build virtual networks that span both virtual and physical resources. As we’ve moved beyond the first proofs of concept to customer trials for our combined solution, this post serves to provide an update on where we see the interaction between virtual and physical worlds heading.
Our overall approach to connecting physical and virtual resources can be viewed in two main categories:
- terminating the overlay on physical devices, such as top-of-rack switches, routers, appliances, etc.
- managing interactions between the overlay and the physical devices that provide the underlay.
We first started working to design a control plane to terminate network virtualization overlays on physical devices in 2012. We started by looking at the information model, defining what information needed to be exchanged between a physical device and a network virtualization controller such as NSX. To bound the problem space, we focused on a specific use case: mapping the ports and VLANs of a physical switch into virtual layer 2 networks implemented as VXLAN-based overlays. (See our posts on the issues around encapsulation choice here and here). At the same time, we knew there were a lot more use cases to be addressed, so we picked a completely extensible protocol to carry the necessary information: OVSDB. This was important: we knew that over time we’d have to support a lot more use cases than just L2 bridging between physical and virtual worlds. After all, one of the tenets of network virtualization is that a virtual network should faithfully reproduce all of the networking stack, from L2-L7, just as server virtualization faithfully reproduces a complete computing environment (as described in more detail here.)
So, the first thing we added to the solution space once we got L2 working was L3. By the time we published the VTEP schema as part of Open vSwitch in late 2013, distributed logical routing was included. (We use the term VTEP – VXLAN tunnel end-point – as a general term for the devices that terminate the overlay.) Let’s take a look at how logical routing works.
Distributed logical routing is an example of a more general capability in network virtualization, the distribution of services. Brad Hedlund wrote some time ago about the value of distributing services among hypervisors. The same basic arguments apply when there are VTEPs in the picture — you want to distribute functions, like logical routing, so that packets always follow the shortest path without hair-pinning, and so that the capacity to perform that function scales out as you add more devices, be they hypervisors or physical switches.
So, suppose a VM (VM1) is placed in logical subnet A, and a physical server (PS1) that is in subnet B is located behind a ToR switch acting as a VTEP (see picture). Say we want to create a logical network that interconnects these two subnets. The logical topology is created by API requests to the network virtualization controller, which in turn programs the vswitches and the ToR to instantiate the desired topology. Part of this process entails mapping physical ports to the logical topology via API requests. Everything the ToR needs to know to participate in the logical topology is provided to it via OVSDB.
Suppose VM1 needs to send a packet to PS1. The VM will send an ARP request towards its default gateway, which is implemented in a distributed manner. (We assume the VM learned its default gateway via some prior step; for example, DHCP may be used.) The ARP request will be intercepted by the local vswitch on the hypervisor. Acting as the logical router, the vswitch will reply to the ARP, so that the VM can now send the packet towards the router. All of this happens without any packet leaving the hypervisor.
The logical router now needs to ARP for the destination — PS1 (assuming an ARP cache miss for the first packet). It will broadcast the ARP request by sending it over a VXLAN tunnel to the VTEP (and potentially to other VTEPs as well, if there are more VTEPs that are involved in logical subnet B). When the ARP packet reaches the ToR, it is sent out on one or more physical interfaces — the set of interfaces that were previously mapped to this logical subnet via API requests. The ARP will reach PS1, which replies; the ToR forwards the reply over a VXLAN tunnel to the vswitch that issued the request, and now it’s able to forward the data traffic to the ToR which decapsulates the packet and delivers it to PS1.
For traffic flowing the other way, the role of logical router would be played by the physical VTEP rather than the vswitch. This is the nature of distributed routing — there is no single box performing all the work for a single logical router, but rather a collection of devices. And in this case, the work is distributed among both hardware VTEPs and vswitches.
We’ve glossed over a couple of details here, but one detail that’s worth noting is that, for traffic heading in the physical-to-virtual direction, the hardware device needs to perform an L3 lookup followed by VXLAN encapsulation. There has been some uncertainty regarding the capabilities of various switching chips to perform this operation (see this post, for example, which tries to determine the capabilities of Trident 2 based on switch vendor information). We’ve actually connected VMware’s NSX controller to ToR switches using at least four different classes of switching silicon (two merchant vendors, two custom ASIC-based designs). This includes both Arista’s 7150 series and 7050X switches. All of these are capable of performing the necessary L3+VXLAN operations. We’ll let the switch vendors speak for themselves regarding product specifics, but we’re essentially viewing this as a non-issue.
OK, that’s L3. What next? Overall, our approach has been to try to provide the same capabilities to virtual ports and physical ports, as much as that is possible. Of course, there is some inherent conflict here: hardware-based end-points tend to excel at throughput and density, while software-based end-points give us the greatest flexibility to deliver new features. Also, given our rich partner ecosystem with many hardware players, it’s not always going to be feasible to expose the unique features of a specific hardware product through the northbound API of NSX. But we certainly see value in being able to do more on physical ports: for example, we should be able to create access control lists on physical ports under API control. Similarly, we’d like to be able to control QoS policy at the physical ingress, e.g. remarking DSCP bits or trusting the value received and copying to the outer VXLAN header. More stateful services, such as firewalling or load-balancing, may not make sense in a ToR-class device but could be implemented in specific appliances suited for those tasks, and could still be integrated into virtualized networks using the same principles that we’ve applied to L2 and L3 functions.
In summary, we see the physical edges of virtual networks as a critical part of the overall network virtualization story. And of course we consider it important that we have a range of vendors whose devices can be integrated into the virtual overlay. It’s been great to see the ecosystem develop around this ability to tie the physical and the virtual together, and we see a lot of opportunity to build on the foundation we’ve established.
(This post was written by Tim Hinrichs and Scott Lowe with contributions from Martin Casado, Peter Balland, Pierre Ettori, and Dennis Moreau.)
In the first part of this series we described the policy problem: ensuring that the data center obeys the real-world rules and regulations that are pertinent to that data center. In this post, we look at the range of possible solutions by identifying some the key features that are important for any solution to the policy problem. Those key features correspond to the following four questions, which we use to structure our discussion.
- What are the policy sources a policy system must accommodate?
- How do those sources express the desired policy to the system?
- How does the policy system interact with data center services?
- What can the policy system do once it has the policy?
Let’s take a look at each of these questions one at a time.
Policy Sources: The origins of policy
Let’s start by digging deeper into an idea we touched on in the first post when describing the challenge of policy compliance: the sources of policy. While we sometimes talk about there being a single policy for a data center, the reality is that there are really many different policies that govern a data center. Each of these policies may have a different source or origin. Here are some examples of different policy sources:
- Application developers may write a separate policy for each application describing what that application expects from the infrastructure (such as high availability, elasticity/auto-scaling, connectivity, or a specific deployment location).
- The cloud operator may have a policy that describes how applications relate to one another. This policy might specify that applications from different business units must be deployed on different production networks, for example.
- Security and compliance teams might have policies that dictate specific firewall rules for web servers, encryption for PCI-compliant workloads, or OS patching guidelines.
- Different policies may be focused on different functionality within the data center. There might be a deployment policy, a billing policy, a security policy, a backup policy, and a decommissioning policy.
- Policies may be written at different levels of abstraction, e.g. applications versus virtual machines (VMs), networks versus routers, storage versus disks.
- Different policies might exist for different policy operations (monitoring, enforcing, and auditing are three examples that we will discuss later in this post).
The idea of multiple sources of policy naturally leads us to the presence of multiple policies. This is an interesting idea, because these multiple policies can interact with each other in many different ways. A policy describing where an application is to be deployed might also implicitly describe where VMs are to be deployed. A cloud operator’s policy might require an application to be deployed on network A or B, and an application policy requiring high availability might mean it must be deployed on network B or C; taken together, this means the application can only be deployed on network B. An auditing policy that requires knowing provenance for data when applied to an application that supports a high transaction rate might require solid state storage to meet performance requirements.
Taking this a step further, it may be unclear how these policies should interact. If the backup policy says to have 3 copies of data, and an auditing policy requires keeping track of where the data originated, do we need 3 copies of that provenance information? Conflicts are another example. If the application’s policy implies networks A or B, and the cloud operator’s policy implies networks C or D, then there is no way to deploy that application so that both policies are satisfied simultaneously.
There are a couple key takeaways from this discussion. First, a policy system must deal with multiple policy sources. Second, a policy system must deal with the presence of multiple policies, and how those policies can or should interact with one another.
Expressing Policy: Policy languages
Any discussion of policy systems has to deal with the subject of policy languages. An intuitive, easy-to-use syntax is critically important for eventual adoption, but here we focus on more semantic issues: how domain-specific is the language? How expressive is the language? What general-purpose policy features belong to the language?
A language is a domain specific language (DSL) if it includes primitives useful for policy in one domain but not another. For example, a policy language for networking might include primitives for the source and destination IP addresses of a network packet. A DSL for compute might include primitives for the amount of memory or disk space on a server. An application-oriented DSL might include elasticity primitives so that how different parts of the application grow and shrink can be the subject of policy.
Different parts of a policy language can be domain-specific:
- Namespace: The objects over which we declare policy can be domain-specific. For example, a networking DSL might define policy about packets, ports, switches, and routers.
- Condition: Policy languages typically have if-then constructs, and the “if” part of those constructs can include domain-specific tests, such as the source and destination IP addresses on a network packet.
- Consequent: The “then” component of if-then constructs can also be domain-specific. For networking, this might include allowing/dropping a packet, sending a packet through a firewall and then a load balancer, or ensuring quality of service (QoS).
Independent of domain-specific constructs, a language has a fundamental limitation on its expressiveness (its “raw expressiveness”). Language A is more expressive than language B if every policy for B can be translated into a policy for A but not vice versa. For example, if language A supports the logical connectives AND/OR/NOT, and language B is the same except it only supports AND/OR, then A is more expressive than B. However, it can be the case that language A supports AND/OR/NOT, language B supports AND/NOT, and yet the two languages are equally expressive (because OR can be simulated with AND/NOT).
It may seem that more expressiveness is necessarily better because a more expressive language makes it easier for users to write policies. Unfortunately, the more expressive a language, the harder it is to implement. By “harder to implement” we don’t mean it’s harder to get a 30% speed improvement through careful engineering; rather, we mean that it is provably impossible to make the implementation of sufficiently expressive languages run in less than exponential time. In short, every policy language chooses how to balance the policy writers’ need for expressiveness and the policy system’s need for implementability.
On top of domain-specificity and raw expressiveness, different policy languages support different features. For example, can we say that some policy statements are “hard” (can never be violated) while other statements are “soft” (can be violated if the only way to not violate is to violate a hard constraint). More generally, can we assign priorities to policy statements? Is there native support for exceptions to policy rules (maybe a cloud owner wants to manually make an exception for a violation so that auditing reflects why that violation was less severe than it may have seemed). Does the language have policy modules and enable people to describe how to combine those modules to produce a new policy? While such features might not impact domain-specificity or raw expressiveness, they have a large impact on how easy the policy language—and therefore the system using that language—is to use.
The key takeaway here is that the policy language has a significant impact on the policy system, so the choice of policy language is a critical one.
Policy Interaction: Integrating with data center services
A policy system by itself is useless; to have value, the policy system must interact and integrate with other data center or cloud services. By “data center service” or “cloud service” we mean basically anything that has an API, e.g. OpenStack components like Neutron, routers, servers, processes, files, databases, antivirus, intrusion detection, inventory management. Read-only API calls enable a policy system to see what is happening; read/write API calls enable a policy system to control what is happening.
Since a policy system’s ability to do something useful with a policy (like prevent violations, correct violations, or monitor for violations) is directly related to what the service can see and do in the data center, it’s crucial to understand how well a policy system works with the services in a data center. If two policy systems are equivalent except that one works with a broader range of data center services than the other, the one with a broader selection of data center services has the ability to see and do more; thus, it’s better able to see and do things to help the data center obey policy.
One type of data center service is especially noteworthy: the policy-aware service. Such services understand policy natively. They may have an API that includes “insert policy statement” and “delete policy statement”. Such services are especially useful in that they can potentially help the data center obey certain kinds of sub-policies. Distributing the work makes a policy system more robust, more reliable, and better performing.
The key point to remember here is that a policy system’s “power” (knowledge of and control over what’s happening in a data center or cloud environment) is driven by the nature of its interaction with the services running in that data center.
Policy Action: Taking action based on policy
Having looked at three key aspects of a policy system—supporting multiple sources of policies and multiple policies, using a policy language that balances expressiveness with implementability, and providing the appropriate depth and breadth of integration with necessary data center services—we now come to a discussion of what the policy system does (or can do) once it knows what policy (or group of policies) is pertinent to the data center. It’s compelling to think about the utility of policy in terms of the future, the present, and the past. We want the data center to obey the policy at all points in time, and there are different mechanisms for doing that.
- Auditing: We cannot change the past but we can record how the data center behaved, what the policy was, and therefore what the violations were.
- Monitoring: The present is equally impossible to change (by the time we act, that moment in time will have become the past), but we can identify violations, help people understand them, and gather information about how to reduce violations in the future.
- Enforcement: We can change the future behavior of the data center by taking action. Enforcement can attempt to prevent violations before they occur (“proactive enforcement”) or correct violations after the fact (“reactive enforcement”). This is the most challenging of the three because it requires choosing and executing actions that affect the natural state of the data center.
The potential for any policy system to carry out these three functions depends crucially on two things: the policy itself (a function of how well the system supports multiple policies as well as the system’s choice of policy language) and the controls the policy system has over the data center (driven by the policy system’s interaction with and integration into the surrounding data center services). The combination of these two things impose hard limits on how well any policy system is able to audit, monitor, and enforce policy.
While we would rather prevent violations than correct them, it’s sometimes impossible to do so. For example, we cannot prevent violations in a policy that requires server operating system (OS) instances to always have the latest patches. Why? As soon as Microsoft, Apple, or Red Hat releases a new security patch, the policy is immediately violated. The point of this kind of policy is that the policy system recognizes the violation and applies the patch to the vulnerable systems (to correct the violation). The key takeaway from this example is that preventing violations requires the policy system is on the critical path for any action that might violate policy. Violations can only be prevented if such enforcement points are available.
Similarly, it’s not always possible to correct violations. Consider a policy that says the load on a particular web server should never exceed 10,000 requests per second. If the requests to that server become high enough (even with load balancing), there may be no way reduce the load once it reaches 10,001 requests per second. The data center cannot control what web sites people in the real world access through their browsers. In this case, the key takeaway is that correcting violations requires there be actions available to the policy system to counteract the causes of those violations.
Even policy monitoring has limitations. A policy dictating application deployment to particular data centers based on the users of that application assumes readily available information about where applications are deployed and the users of those applications. A web application that does not expose information about its users ensures even monitoring this policy is impossible. The key takeaway here is that monitoring a policy requires that the appropriate information is available to the policy system. Further, if we cannot monitor a policy, we also cannot audit that policy.
In short, every policy system has limitations. These limitations might be on what the policy system knows about the data center, or these limitations might be on what control it has over the data center. These limitations influence whether any given policy can be audited, monitored, and enforced. Moreover, these limitations can change as the data center changes. As new services (hardware or software) are installed or existing services are upgraded in the data center, new functionality becomes available, and a policy system may have additional power (fewer limitations). When old services are removed, the policy system may have less power (more limitations).
These limitations give us ceilings on how successful any policy system might be in terms of auditing, monitoring, and enforcing policy. It is therefore useful to compare policy system designs in terms of how close to those ceilings they can get. Of course, the true test is in terms of actual implementation, not design, and a comparison of systems in terms of what policies they can audit, monitor, and enforce at scale is incredibly valuable. However, we must be careful not to condemn systems for not solving unsolvable problems.
This blog post has focused on laying out the range of possible solutions to the policy problem in the data center. In summary, here are some key points:
Policies come from many different sources and interact in many different ways. Ideally the data center would obey all those policies simultaneously, but in practice we expect the policies to conflict. A solution to the policy problem must address the issues surrounding multiple policies.
A policy language can be categorized in terms of its domain-specificity, its raw expressiveness, and the features it supports. Every solution must balance these three to meet the need for the policy writer to express policy and the need of the policy system to audit, monitor, and enforce policy.
Every solution must interact with the ecosystem of data center services within the data center. The richer the ecosystem a solution can leverage, the more successful it can be.
Once a policy system has a policy, it can audit, monitor, and enforce that policy. A solution to the policy problem is more or less successful at these functions depending on the policy and the data center.
In the next blog post, we will look at proposed policy systems like the OpenStack Group-Based Policy and the Congress project, and explain how they fit into this solution space.
As we’ve discussed previously, the vSwitch is a great position to detect elephant, or heavy-hitter flows because it has proximity to the guest OS and can use that position to gather additional context. This context may include the TSO send buffer, or even the guest TCP send buffer. Once an elephant is detected, it can be signaled to the underlay using standard interfaces such as DSCP. The following slide deck provides and overview of a working version of this, showing how such a setup can be used to both dynamically detect elephants and isolate mice from queuing delays they cause. We’ll write about this in more detail in a later post, but for now check out the slides (and in particular the graphs showing the latency of mice with and without detection and handling).