Scaling Congress

(This post was written by Tim Hinrichs, Shawn Hargan, and Alex Yip.)

Policy is a topic that we’ve touched on before here at Network Heresy. In fact, policy was the focus of a series of blog posts: first describing the policy problem and why policy is so important, then describing the range of potential solutions, followed by a comparison of policy efforts within OpenStack, and finally culminating in a detailed description of Congress: a project aimed at providing “policy as a service” to OpenStack clouds. (Check out the OpenStack wiki page on Congress for more details on the Congress project itself.)

Like other OpenStack projects, Congress is moving very quickly. Recently, one of the lead developers of Congress summarized some of the performance improvements that have been realized in recent builds of Congress. These improvements include things like much faster query performance, much faster data imports, and significant reductions in memory overhead.

If you’re interested in the full details on the performance improvements that the Congress team is seeing, go read the full post on scaling the performance of Congress. (You can also subscribe to the RSS feed to get automatic updates when new posts are published.)

OVN, Bringing Native Virtual Networking to OVS

By Justin Pettit, Ben Pfaff, Chris Wright, and Madhu Venugopal

Today we are excited to announce Open Virtual Network (OVN), a new project that brings virtual networking to the OVS user community. OVN complements the existing capabilities of OVS to add native support for virtual network abstractions, such as virtual L2 and L3 overlays and security groups. Just like OVS, our design goal is to have a production quality implementation that can operate at significant scale.

Why are we doing this? The primary goal in developing Open vSwitch has always been to provide a production-ready low-level networking component for hypervisors that could support a diverse range of network environments.  As one example of the success of this approach, Open vSwitch is the most popular choice of virtual switch in OpenStack deployments. To make OVS more effective in these environments, we believe the logical next step is to augment the low-level switching capabilities with a lightweight control plane that provides native support for common virtual networking abstractions.

To achieve these goals, OVN’s design is narrowly focused on providing L2/L3 virtual networking. This distinguishes OVN from general-purpose SDN controllers or platforms.

OVN is a new project from the Open vSwitch team to support virtual network abstraction. OVN will put users in control over cloud network resources, by allowing users to connect groups of VMs or containers into private L2 and L3 networks, quickly, programmatically, and without the need to provision VLANs or other physical network resources.  OVN will include logical switches and routers, security groups, and L2/L3/L4 ACLs, implemented on top of a tunnel-based (VXLAN, NVGRE, Geneve, STT, IPsec) overlay network.

OVN aims to be sufficiently robust and scalable to support large production deployments. OVN will support the same virtual machine environments as Open vSwitch, including KVM, Xen, and the emerging port to Hyper-V.  Container systems such as Docker are growing in importance but pose new challenges in scale and responsiveness, so we will work with the container community to ensure quality native support.  For physical-logical network integration, OVN will implement software gateways, as well as support hardware gateways from vendors that implement the “vtep” schema that ships with OVS.

Northbound, we will work with the OpenStack community to integrate OVN via a new plugin.  The OVN architecture will simplify the current Open vSwitch integration within Neutron by providing a virtual networking abstraction.  OVN will provide Neutron with improved dataplane performance through shortcut, distributed logical L3 processing and in-kernel based security groups, without running special OpenStack agents on hypervisors. Lastly, it will provide a scale-out and highly available gateway solution responsible for bridging from logical into physical space.

The Open vSwitch team will build and maintain OVN under the same open source license terms as Open vSwitch, and is open to contributions from all.  The outline of OVN’s design is guided by our experience developing Open vSwitch, OpenStack, and Nicira/VMware’s networking solutions.  We will evolve the design and implementation in the Open vSwitch mailing lists, using the same open process used for Open vSwitch.

OVN will not require a special build of OVS or OVN-specific changes to ovs-vswitchd or ovsdb-server.  OVN components will be part of the Open vSwitch source and binary distributions.  OVN will integrate with existing Open vSwitch components, using established protocols such as OpenFlow and OVSDB, using an OVN agent that connects to ovs-vswitchd and ovsdb-server.  (The VTEP emulator already in OVS’s “vtep” directory uses a similar architecture.)

OVN is not a generic platform or SDN controller on which applications are built.  Rather, OVN will be purpose built to provide virtual networking.  Because OVN will use the same interfaces as any other controller, OVS will remain as flexible and unspecialized as it is today.  It will still provide the same primitives that it always has and continue to be the best software switch for any controller.

The design and implementation will occur on the ovs-dev mailing list.  In fact, a high-level description of the architecture was posted this morning.  If you’d like to join the effort, please check out the mailing list.

Happy switching!

On Policy in the Data Center: Congress

By Tim Hinrichs and Scott Lowe with contributions from Alex Yip, Dmitri Kalintsev, and Peter Balland

(Note: this post is also cross-published at a new site focused on policy.)

In the first few parts of this series, we discussed the policy problem, we outlined dimensions of the solution space, and we gave a brief overview of the existing OpenStack policy efforts. In this post we do a deep dive into one of the (not yet incubated) OpenStack policy efforts: Congress.


Remember that to solve the policy problem, people take ideas in their head about how the data center ought to behave (“policy”) and codify them in a language the computer system can understand. That is, the policy problem is really a programming languages problem. Not surprisingly Congress is, at its core, a policy language plus an implementation of that language.

Congress is a standard cloud service; you install it on a server, give it some inputs, and interact with it via a RESTful API. Congress has two kinds of inputs:

  • The other cloud services you’d like it to manage (for example, a compute manager like OpenStack Nova and a network manager like OpenStack Neutron)
  • A policy that describes how those services ought to behave

For example, we might decide to use Congress to manage OpenStack’s Nova (compute), Neutron (networking), Swift (object storage), Cinder (block storage), and Heat (applications). We might write a geo-location policy:

“Every application collecting personal information about its users from Japan must have all of its compute, networking, and storage elements deployed at a data center that resides within the geographic borders of Japan.”

Any Service

A cloud service gives Congress the ability to see and change the data center’s behavior. The more services hooked up to Congress, the more powerful Congress becomes. Congress was designed to manage any collection of cloud services, regardless of their origin or locality (private or public). It does not matter if the service is provided by OpenStack, CloudStack, AWS, GCE, Cisco, HP, IBM, Red Hat, VMware, etc. It does not matter if the service manages compute, networking, storage, applications, databases, anti-virus, inventory, people, or groups. Congress is vendor and domain agnostic.

Congress provides a unified abstraction for all services, which insulates the policy writer from understanding differing data formats, authentication protocols, APIs, and the like. Congress does NOT require any special code to be running on the services it manages; rather, it includes a light-weight adapter for each service that implements the unified interface using the service’s native API.

From the policy writer’s point of view, each service is simply a collection of tables. A table is a collection of rows; each row is a collection of columns; each row-column entry stores simple data like strings and integers. When Congress wants to see what is happening in the data center, it reads from those tables. When Congress wants to change what is happening in the data center, it writes to those tables.

For example, the Nova compute service is represented as several tables like the servers table below.

| id      | host_id  | status | tenant_id | image_id | flavor_id |  
| <UUID1> | <UUID2>  | ACTIVE | alice     | <UUID3>  | <UUID4>   |  
| <UUID5> | <UUID6>  | ACTIVE | bob       | <UUID7>  | <UUID8>   |  
| <UUID9> | <UUID10> | DOWN   | bo        | <UUID7>  | <UUID8>   |  
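
To make the table abstraction concrete, here is a minimal Python sketch of how a table like nova:servers might be held and queried in memory. It is illustrative only; the names are ours and this is not Congress's internal API.

# Illustrative only: a toy in-memory model of the table abstraction,
# not Congress's actual internal representation.
from collections import namedtuple

Server = namedtuple("Server", ["id", "host_id", "status", "tenant_id",
                               "image_id", "flavor_id"])

# The nova:servers table as a list of rows; every row holds simple values.
nova_servers = [
    Server("UUID1", "UUID2", "ACTIVE", "alice", "UUID3", "UUID4"),
    Server("UUID5", "UUID6", "ACTIVE", "bob", "UUID7", "UUID8"),
    Server("UUID9", "UUID10", "DOWN", "bo", "UUID7", "UUID8"),
]

# "Seeing what is happening" is just reading rows from the table.
down_servers = [row.id for row in nova_servers if row.status == "DOWN"]
print(down_servers)  # ['UUID9']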

At the time of writing, there are adapters (which we call “datasource drivers”) for each of the following services, all but one of which are OpenStack services.

  • Nova
  • Neutron
  • Cinder
  • Swift
  • Keystone
  • Ceilometer
  • Glance
  • Plexxi controller

Each adapter is a few hundred lines of code that (i) makes API calls into the service to get information about that service’s behavior; and (ii) translates that information into tables. Just recently we added a domain-specific language (DSL) that automates the translation of that information into tables, given a description of the structure of the information.
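
As a rough illustration of what such an adapter does, here is a hedged Python sketch of a toy datasource driver. The class name and structure are ours; Congress's real driver base class and the translation DSL are more involved. The only assumption about the service client is that it exposes servers.list(), as python-novaclient does.

# A toy "datasource driver": poll a service's API and flatten the results
# into rows of simple values. Purely illustrative; not Congress's real
# driver interface.
class ToyNovaDriver:
    TABLE = "servers"
    COLUMNS = ("id", "host_id", "status", "tenant_id", "image_id", "flavor_id")

    def __init__(self, nova_client):
        # nova_client is assumed to behave like python-novaclient's client.
        self.nova = nova_client

    def poll(self):
        """Call the service API and translate the objects into table rows."""
        rows = []
        for s in self.nova.servers.list():
            rows.append((s.id, getattr(s, "hostId", ""), s.status,
                         s.tenant_id, s.image["id"], s.flavor["id"]))
        return {self.TABLE: rows}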

For more information about connecting cloud services, see the Congress cloud services documentation.

Any Policy

A policy describes how a collection of cloud services ought to behave. Every organization’s policy is unique because every organization has different services in its data center. Every organization has different business advantages they are trying to gain via the cloud. Every organization has different regulations that govern it. Every organization is full of people with different ideas about the right way to run the cloud.

Congress aims to provide a single policy language that every organization can use to express their high- and low-level policies. Instead of providing a long list of micro-policies that the user can mix-and-match, Congress provides a general purpose policy language for expressing policy: the well-known declarative language Datalog.

Datalog is domain-agnostic. It is just as easy to write policy about compute as it is to write policy about networking. It is just as easy to write policy about how compute, networking, storage, group membership, and applications interact with each other. Moreover, Datalog enables policy writers to define abstractions to bridge the gap between low-level infrastructure policy and high-level business policy.

Suppose our policy says that all servers should have an average CPU utilization of at least 20% over a 2-day span. In Datalog we would write a policy that leverages Nova for compute, Ceilometer for CPU utilization data, and some built-in tables that treat strings as if they were dates.

First, we declare the conditions under which there is a policy violation. We do that by writing a rule that says a VM is an error (policy violation) if the conditions shown below are true.

error(vm, email_address) :-
    nova:servers(id=vm, tenant_id=owner), // vm is a server owned by owner
    two_days_previous(start_date, end_date), // start_date is 2 days before end_date; end_date is today
    ceilometer:statistics(id=vm, start=start_date, end=end_date, meter="cpu_util", avg=value),
    arithmetic:less_than(value, 20), // value < 20%
    keystone:user(id=owner, email=email_address)

We also define a helper table that computes the start and end dates for 2 days before today.

two_days_previous(start_date, end_date) :-
    datetime:minus(end_date, "2 days", start_date)

Helper tables like two_days_previous are useful because they allow the policy writer to create higher-level concepts that may not exist natively in the cloud services. For example, we can create a helper table that tells us which servers are connected to the Internet—something that requires information from several different places in OpenStack. Or the compute, networking, and storage admins could create the higher-level concept “is-secure” and enable a higher-level manager to write a policy that describes when resources ought to be secured.

For more information about writing policy, see the Congress policy documentation.


Once we have connected services to Congress and written policy over those services, we’ve given Congress the inputs it needs to carry out its core capabilities, which the user is free to mix and match.

  • Monitoring: Congress watches how the other cloud services are behaving, compares that to policy, and flags mismatches (policy violations).

  • Enforcement: Congress acts as a policy authority. A service can propose a change to Congress, and Congress will tell the service whether the change complies with policy or not, thus preventing policy violations before they happen. Congress can also correct some violations after the fact.

  • Auditing: Congress gives users the ability to record the history of policy, policy violations, and remediations.

  • Delegation: Congress can offload the burden of monitoring/enforcing/auditing to other policy-aware systems.

When it comes to enforcement, a common question is why Congress would support both proactive and reactive enforcement. The implied question is, “Isn’t proactive enforcement always preferable?” The answer is that proactive enforcement is not always possible. Consider the simple policy “ensure all operating systems have the latest security patch.” As soon as Microsoft, Apple, or Red Hat releases a new security patch, the policy is immediately violated; the whole purpose of writing the policy is to enable Congress to identify the violation and take action to correct it.
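
To make the distinction concrete, here is a hedged Python sketch of the two enforcement styles. The function names are invented for illustration and do not reflect Congress's API; "policy" stands for anything that can compute the set of violations for a given state (a dict of tables, as in the earlier sketches).

import copy

# Illustrative only: proactive enforcement checks a proposed change before it
# happens; reactive enforcement finds existing violations and corrects them.
def proactive_check(policy, current_state, proposed_change):
    """Simulate a proposed change and allow it only if no violations result."""
    simulated = proposed_change(copy.deepcopy(current_state))
    return len(policy(simulated)) == 0      # True -> the change complies

def reactive_enforce(policy, current_state, remediations):
    """Find violations that already exist and run a remediation for each."""
    for kind, details in policy(current_state):
        action = remediations.get(kind)
        if action is not None:
            action(details)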

The tip of master includes monitoring and a mechanism for proactive enforcement. In the Kilo release of OpenStack we plan to have a form of reactive enforcement available as well.


In this post, we’ve talked about some of the key takeaways regarding Congress:

  • Congress was designed to solve the policy problem and work with any cloud service and any policy.
  • It is currently capable of monitoring and proactive enforcement. Work on reactive enforcement and delegation is currently underway.
  • Congress is not yet incubated in OpenStack, but has contributions from half a dozen organizations and nearly two dozen people.

Please feel free to join our weekly IRC meeting, check out the wiki, and download and install the code.

Accelerating Open vSwitch to “Ludicrous Speed”

[This post was written by OVS core contributors Justin Pettit, Ben Pfaff, and Ethan Jackson.]

The overhead associated with vSwitches has been a hotly debated topic in the networking community. In this blog post, we show how recent changes to OVS have elevated its performance to be on par with the native Linux bridge. Furthermore, CPU utilization of OVS in realistic scenarios can be up to 8x below that of the Linux bridge.  This is the first of a two-part series.  In the next post, we take a peek at the design and performance of the forthcoming port to DPDK, which bypasses the kernel entirely to gain impressive performance.

Open vSwitch is the most popular network back-end for OpenStack deployments and widely accepted as the de facto standard OpenFlow implementation.  Open vSwitch development initially had a narrow focus — supporting novel features necessary for advanced applications such as network virtualization.  However, as we gained experience with production deployments, it became clear these initial goals were not sufficient.  For Open vSwitch to be successful, it not only must be highly programmable and general, it must also be blazingly fast.  For the past several years, our development efforts have focused on precisely this tension — building a software switch that does not compromise on either generality or speed.

To understand how we achieved this feat, one must first understand the OVS architecture.  The forwarding plane consists of two parts: a “slow-path” userspace daemon called ovs-vswitchd and a “fast-path” kernel module.  Most of the complexity of OVS is in ovs-vswitchd; all of the forwarding decisions and network protocol processing are handled there.  The kernel module’s responsibilities are limited to tunnel termination and handling traffic that hits its flow cache.


When a packet is received by the kernel module, its cache of flows is consulted.  If a relevant entry is found, then the associated actions (e.g., modify headers or forward the packet) are executed on the packet.  If there is no entry, the packet is passed to ovs-vswitchd to decide the packet’s fate. ovs-vswitchd then executes the OpenFlow pipeline on the packet to compute actions, passes it back to the fast path for forwarding, and installs a flow cache entry so similar packets will not need to take these expensive steps.


The life of a new flow through the OVS forwarding elements. The first packet misses the kernel module’s flow cache and is sent to userspace for further processing. Subsequent packets are processed entirely in the kernel.
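
Here is a hedged Python sketch of that fast-path/slow-path split. The real kernel module keys its cache on parsed packet headers and does much more, but the control flow is the same idea; the class and names below are ours.

# Toy model of the OVS fast path/slow path split: an exact-match cache in
# front of a slow, general lookup. Purely illustrative.
class ToyDatapath:
    def __init__(self, slow_path_lookup):
        self.cache = {}                      # header tuple -> actions
        self.slow_path_lookup = slow_path_lookup

    def receive(self, headers):
        actions = self.cache.get(headers)
        if actions is None:
            # Cache miss: ask the "userspace" slow path for a decision, then
            # install an entry so similar packets stay in the fast path.
            actions = self.slow_path_lookup(headers)
            self.cache[headers] = actions
        return actions

dp = ToyDatapath(slow_path_lookup=lambda h: ["output:2"])
flow = ("eth0", "10.0.0.1", "10.0.0.2", 6, 1234, 80)
dp.receive(flow)   # miss: goes to the slow path, entry installed
dp.receive(flow)   # hit: handled entirely by the "fast path"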

Until OVS 1.11, this fast-path cache contained exact-match “microflows”.  Each cache entry specified every field of the packet header, and was therefore limited to matching packets with this exact header.  While this approach works well for most common traffic patterns,  unusual applications, such as port scans or some peer-to-peer rendezvous servers, would have very low cache hit rates.  In this case, many packets would need to traverse the slow path, severely limiting performance.

OVS 1.11 introduced megaflows, enabling the single biggest performance improvement to date.  Instead of a simple exact-match cache, the kernel cache now supports arbitrary bitwise wildcarding.  Therefore, it is now possible to specify only those fields that actually affect forwarding.  For example, if OVS is configured simply to be a learning switch, then only the ingress port and L2 fields are relevant and all other fields can be wildcarded.  In previous releases, a port scan would have required a separate cache entry for, e.g., each half of a TCP connection, even though the L3 and L4 fields were not important.
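
Extending the sketch above, a megaflow entry records which fields are relevant, so a single entry covers every packet that agrees on just those fields. Again, this is only an illustration; the real kernel cache supports arbitrary bitwise wildcarding and is far more sophisticated.

# Toy megaflow cache: each entry lists the relevant fields and their values;
# all other fields are wildcarded. Headers are modeled as dicts for brevity.
class ToyMegaflowCache:
    def __init__(self):
        self.entries = []          # list of (relevant_fields, values, actions)

    def insert(self, relevant_fields, headers, actions):
        values = tuple(headers[f] for f in relevant_fields)
        self.entries.append((relevant_fields, values, actions))

    def lookup(self, headers):
        for fields, values, actions in self.entries:
            if tuple(headers.get(f) for f in fields) == values:
                return actions
        return None                # miss: the packet would go to the slow path

cache = ToyMegaflowCache()
# A learning-switch style entry: only ingress port and destination MAC matter,
# so packets to any TCP port hit the same single entry.
cache.insert(("in_port", "dl_dst"), {"in_port": 1, "dl_dst": "aa:bb"}, ["output:2"])
cache.lookup({"in_port": 1, "dl_dst": "aa:bb", "tp_dst": 80})    # ['output:2']
cache.lookup({"in_port": 1, "dl_dst": "aa:bb", "tp_dst": 443})   # ['output:2']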


The introduction of megaflows allowed OVS to drastically reduce the number of packets that traversed the slow path.  This was a major improvement, but ovs-vswitchd still had a number of other responsibilities, which became the new bottleneck.  These include activities like managing the datapath flow table, running switch protocols (LACP, BFD, STP, etc.), and other general accounting and management tasks.

While the kernel datapath has always been multi-threaded, ovs-vswitchd was a single-threaded process until OVS 2.0.  This architecture was pleasantly simple, but it suffered from several drawbacks.  Most obviously, it could use at most one CPU core.  This was sufficient for hypervisors, but we began to see Open vSwitch used more frequently as a network appliance, in which it is important to fully use machine resources.

Less obviously, it becomes quite difficult to support ovs-vswitchd’s real-time requirements in a single-threaded architecture.  Fast-path misses from the kernel must be processed by ovs-vswitchd as promptly as possible.  In the old single-threaded architecture, miss handling often blocked behind the single thread’s other tasks.  Large OVSDB changes,  OpenFlow table changes, disk writes for logs, and other routine tasks could delay handling misses and degrade forwarding performance.  In a multi-threaded architecture, soft real-time tasks can be separated into their own threads, protected from delay by unrelated maintenance tasks.

Since the introduction of multithreading, we’ve continued to fine-tune the number of threads and their responsibilities.  In addition to order-of-magnitude improvements in miss-handling performance, this architecture shift has allowed us to increase the size of the kernel cache from 1,000 flows in early versions of OVS to roughly 200,000 in the most recent version.

Classifier Improvements

The introduction of megaflows and the support for a larger cache enabled by multithreading reduced the need for packets to be processed in userspace.  This is important because a packet lookup in userspace can be quite expensive.  For example, a network virtualization application might define an OpenFlow pipeline with dozens of tables that hold hundreds of thousands of rules.  The final optimization we will discuss is a set of improvements to the classifier in ovs-vswitchd, which is responsible for determining which OpenFlow rules apply when processing a packet.

Data structures to allow a flow table to be quickly searched are an active area of research.  OVS uses a tuple space search classifier, which consists of one hash table (tuple) for each kind of match actually in use.  For example, if some flows match on the source IP, that’s represented as one tuple, and if others match on source IP and destination TCP port, that’s a second tuple.  Searching a tuple space search classifier requires searching each tuple, then taking the highest priority match.  In the last few releases, we have introduced a number of novel improvements to the basic tuple search algorithm:

  • Priority Sorting – Our simplest optimization is to sort the tuples in order by the maximum priority of any flow in the tuple.  Then a search that finds a matching flow with priority P can terminate as soon as it arrives at a tuple whose maximum priority is P or less.
  • Staged Lookup – In many network applications, a policy may only apply to a subset of the headers.  For example, a firewall policy that requires looking at TCP ports may only apply to a couple of VMs’ interfaces.  With the basic tuple search algorithm, if any rule looks at the TCP ports, then any generated megaflow would look all the way up through the L4 headers.  With staged lookup, we scope lookups from metadata (e.g., ingress port) up to layers further up the stack on an as-needed basis.
  • Prefix Tracking – When processing L3 traffic, a longest prefix match is required for routing.  The tuple-space algorithm works poorly in this case, since it degenerates into a tuple that matches as many bits as the longest match.  This means that if one rule matches on 10.x.x.x and another on 192.168.0.x, the 10.x.x.x rule will also require matching 24 bits instead of 8, which requires keeping more megaflows in the kernel.  With prefix tracking, we consult a trie that allows us to only look at tuples with the high order bits sufficient to differentiate between rules.
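
To ground the description above, here is a hedged Python sketch of a basic tuple space search with the priority-sorting shortcut. It is a toy: the real classifier in ovs-vswitchd also implements staged lookup, prefix tracking, and much else.

# Toy tuple space search with priority sorting; illustrative only.
class ToyClassifier:
    def __init__(self):
        # Each "tuple" is (max_priority, field_names, {field values: (prio, rule)}).
        self.tuples = []

    def add_rule(self, match, priority, rule):
        fields = tuple(sorted(match))
        key = tuple(match[f] for f in fields)
        for i, (maxp, f, table) in enumerate(self.tuples):
            if f == fields:
                table[key] = (priority, rule)
                self.tuples[i] = (max(maxp, priority), f, table)
                break
        else:
            self.tuples.append((priority, fields, {key: (priority, rule)}))
        # Priority sorting: keep tuples ordered by their maximum priority.
        self.tuples.sort(key=lambda t: t[0], reverse=True)

    def lookup(self, headers):
        best = None
        for maxp, fields, table in self.tuples:
            if best is not None and maxp <= best[0]:
                break          # no later tuple can hold a higher-priority match
            hit = table.get(tuple(headers.get(f) for f in fields))
            if hit is not None and (best is None or hit[0] > best[0]):
                best = hit
        return None if best is None else best[1]

cls = ToyClassifier()
cls.add_rule({"in_port": 1}, 10, "normal switching")
cls.add_rule({"in_port": 1, "tp_dst": 80}, 50, "steer web traffic")
cls.lookup({"in_port": 1, "tp_dst": 80})   # "steer web traffic"
cls.lookup({"in_port": 1, "tp_dst": 22})   # "normal switching"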

These classifier improvements have been shown with practical rule sets to reduce the number of megaflows needed in the kernel from over a million to only dozens.


The preceding changes improve many aspects of performance.  For this post, we’ll just evaluate the performance gains in flow setup, which was the area of greatest concern.  To measure setup performance, we used netperf’s TCP_CRR test, which measures the number of short-lived transactions per second (tps) that can be established between two hosts.  We compared OVS to the Linux bridge, a fixed-function Ethernet switch implemented entirely inside the Linux kernel.

In the simplest configuration, the two switches achieved identical throughput (18.8 Gbps) and similar TCP_CRR connection rates (696,000 tps for OVS, 688,000 for the Linux bridge), although OVS used more CPU (161% vs. 48%). However, when we added one flow to OVS to drop STP BPDU packets and a similar ebtable rule to the Linux bridge, OVS performance and CPU usage remained constant whereas the Linux bridge connection rate dropped to 512,000 tps and its CPU usage increased over 26-fold to 1,279%. This is because the built-in kernel functions have per-packet overhead, whereas OVS’s overhead is generally fixed per-megaflow. We expect that enabling other features, such as routing and a firewall, would similarly add CPU load.

| Switch            | Configuration | tps     | CPU    |
| Linux Bridge      | Pass BPDUs    | 688,000 | 48%    |
| Linux Bridge      | Drop BPDUs    | 512,000 | 1,279% |
| Open vSwitch 1.10 |               | 12,000  | 100%   |
| Open vSwitch 2.1  | Megaflows Off | 23,000  | 754%   |
| Open vSwitch 2.1  | Megaflows On  | 696,000 | 161%   |

While these performance numbers are useful for benchmarking, they are synthetic.  At the OpenStack Summit last week in Paris, Rackspace engineers described the performance gains they have seen over the past few releases in their production deployment.  They begin in the “Dark Ages” (versions prior to 1.11) and proceed to “Ludicrous Speed” (versions since 2.1).


Andy Hill and Joel Preas from Rackspace discuss OVS performance improvements at the OpenStack Paris Summit.


Now that OVS’s performance is similar to that of a fixed-function switch while maintaining the flexibility demanded by new networking applications, we’re looking forward to broadening our focus.  While we continue to make performance improvements, the next few releases will begin adding new features such as stateful services and support for new platforms such as DPDK and Hyper-V.

If you want to hear more about this live or talk to us in person, we will all be at the Open vSwitch Fall 2014 Conference next week.


On Policy in the Data Center: Comparing OpenStack policy efforts

(This post was written by Tim Hinrichs and Scott Lowe, with contributions from Pierre Ettori, Aaron Rosen, and Peter Balland.)

In the first two parts of this blog series we discussed the problem of policy in the data center and the features that differentiate solutions to that problem. In this post, we give a high-level overview of several policy efforts within OpenStack.

Remember that a policy is a description of how (some part of) the data center ought to behave, a service is any component in the data center that has an API, and a policy system is designed to manage some combination of past, present, and future policy violations (auditing, monitoring, and enforcement, respectively).

The overview of OpenStack policy efforts talks about the features we identified in part 2 of this blog series. To recap, those features are:

  • Policy language: how expressive is the language, is the language restricted to certain domains, what features (e.g. exceptions) does it support?
  • Policy sources: what are the sources of policy, how do different sources of policy interact, how are conflicts dealt with?
  • Services: which other data center services can be leveraged and how?
  • Actions: what does the system do once it is given a policy: monitor (identify violations), enforce (prevent or correct violations), audit (analyze past violations)?

One thing you’ll notice is that there are many different policy efforts within OpenStack. Perhaps surprisingly, there is actually little redundancy, because each effort addresses a different part of the overall policy problem: enabling users to describe their desires in a way that an OpenStack cloud can act on them. Additionally, as we will point out again later in the post, domain-independent and domain-specific policy efforts are highly complementary.


We begin with Congress, our own policy effort within OpenStack. Congress is a system purpose-built for managing policy in the data center. A Congress policy describes the desired behavior of the data center by dictating how all the services running in that data center are supposed to behave both individually and in tandem. In the current release Congress accepts a single policy for the entire data center, the idea being that the cloud administrators are jointly responsible for writing and maintaining that policy.

A Congress policy is domain independent and can describe the behavior of any collection of data center services. The cloud administrator can write a policy about networking, a policy about compute, or a policy that spans networking, compute, storage, antivirus, organizational charts, inventory management systems, Active Directory, and so on.

The recent alpha release of Congress supports monitoring violations in policy: comparing how the data center is actually behaving to how policy says the data center ought to behave and flagging mismatches. In the future, Congress will also support enforcement by having Congress itself execute API calls to change the behavior of the data center and/or pushing policy to other policy-aware services better positioned to enforce policy.

Neutron Group-Based Policy (GBP)

Neutron Group-Based Policy (GBP), which is similar to the policy effort in OpenDaylight, utilizes policy to manage networking. A policy describes how the network packets in the data center are supposed to behave. Each policy (“contract” in GBP terminology) describes which actions (such as allow, drop, reroute, or apply QoS) should be applied to which network packets based on packet header properties like port and protocol. Entities on the network (called “endpoints”) are grouped and each group is assigned one or more policies. Groups are maintained outside the policy language by people or automated systems.
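
A hedged sketch of that data model may help; the class and field names below are invented for illustration and are not the actual Neutron GBP API.

# Toy model of the group/contract structure described above; illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Rule:
    protocol: str          # e.g. "tcp"
    port: int              # e.g. 443
    action: str            # e.g. "allow", "drop", "redirect"

@dataclass
class Contract:
    name: str
    rules: List[Rule]

@dataclass
class EndpointGroup:
    name: str
    endpoints: List[str] = field(default_factory=list)      # VM ports, etc.
    contracts: List[Contract] = field(default_factory=list)

web_tier = EndpointGroup("web-tier", ["port-1", "port-2"],
                         [Contract("allow-https", [Rule("tcp", 443, "allow")])])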

In GBP, policies can come from any number of people or agents. Conflicts can arise within a single policy or across several policies and are eliminated by a mechanism built into GBP (which is out of scope for this blog post).

The goal of GBP is to enforce policy directly. (Both monitoring and auditing are challenging in the networking domain because there are so many packets moving so quickly throughout the data center.) To do enforcement, GBP compiles policies down into existing Neutron primitives and creates logical networks, switches, and routers. When new policy statements are inserted, GBP does an incremental compilation: changing the Neutron primitives in such a way as to implement the new policy while minimally disrupting existing primitives.

Swift Storage Policy

Swift is OpenStack’s object storage service. As of version 2.0, released July 2014, Swift supports storage policies. Each storage policy is attached to a virtual storage system, which is where Swift stores objects. Each policy assigns values to a number of built-in features of a storage system. At the time of writing, each policy dictates how many partitions the storage system has, how many replicas of each object it should maintain, and the minimum amount of time that must pass after a partition is moved before it can be moved again.

A user can create any number of virtual storage systems—and so can write any number of policies—but there are no conflicts between policies. If we put an object into a container with 2 replicas and the same object into another container with 3 replicas, it just means we are storing that object in two different virtual storage systems, which all told means we have 5 replicas.

Policy is enforced directly by Swift. Every time an object is written, Swift ensures the right number of replicas are created. Swift ensures not to move a partition before policy allows that partition to be moved.

Smart Scheduler/SolverScheduler

The Smart Scheduler/SolverScheduler effort aims to provide an interface for using different constraint solvers to solve optimization problems for other projects, Nova in particular. One specific use case is Network Functions Virtualization (see here and here). For example, Nova might ask where to place a new virtual machine to minimize the average number of VMs on each server. This effort utilizes domain-independent solvers (such as linear programming/arithmetic solvers) but applies them to solve domain-specific problems. The intention is to focus on enforcement.
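
For a sense of the kind of problem being handed to a solver, here is a hedged toy placement routine. A real SolverScheduler formulates this as a linear-programming or constraint problem; the greedy stand-in and names below are ours.

# Toy placement: pick the host that keeps load most balanced, subject to a
# capacity constraint. A stand-in for a real LP/constraint formulation.
def place_vm(hosts, capacity):
    """hosts: {host: current VM count}; returns the chosen host or None."""
    candidates = [h for h, n in hosts.items() if n + 1 <= capacity[h]]
    if not candidates:
        return None                       # no feasible placement
    return min(candidates, key=lambda h: hosts[h])

hosts = {"compute-1": 4, "compute-2": 2, "compute-3": 7}
capacity = {"compute-1": 10, "compute-2": 10, "compute-3": 8}
place_vm(hosts, capacity)                 # "compute-2"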

Nova Policy-Based Scheduling Module

The Nova policy-based scheduling module aims to schedule Nova resources per client, per cluster of resources, and per context (e.g. overload, time, etc.). A proof of concept was presented at the Demo Theater at OpenStack Juno Summit.


Gantt

Gantt aims to provide scheduling as a service for other OpenStack components (see here and here). Previously, it was a subgroup of Nova and focused on scheduling virtual machines based on resource utilization. It includes a plugin framework for making arbitrary metrics available to the scheduler.

Heat Convergence engine

The Heat Convergence engine represents a shift toward a model for Heat where applications are deployed and managed by comparing the current state of the application to the desired state of the application and taking action to reduce the differences. Each desired state amounts to a policy describing a single application. Those policies do not interact, logically, and can draw upon any service in the data center. Heat policies are concerned mainly with corrective enforcement, though monitoring is also useful (“how far along is my application’s deployment?”).
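
The core of such a model is a reconciliation loop. Here is a hedged Python sketch of the idea, not Heat's actual engine; the resource representation is ours.

# Toy convergence loop: diff the desired state against the observed state and
# emit the actions needed to close the gap. Illustrative only.
def converge(desired, observed):
    """desired/observed: {resource name: properties}; returns a list of actions."""
    actions = []
    for name, props in desired.items():
        if name not in observed:
            actions.append(("create", name, props))
        elif observed[name] != props:
            actions.append(("update", name, props))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name, None))
    return actions

converge({"web": {"flavor": "m1.small"}, "db": {"flavor": "m1.large"}},
         {"web": {"flavor": "m1.tiny"}})
# [('update', 'web', {'flavor': 'm1.small'}), ('create', 'db', {'flavor': 'm1.large'})]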


The key takeaway is that OpenStack has a growing ecosystem of policy-aware services. Most of them are domain-specific, meaning they are systems tailored to enforcing a particular kind of policy for a particular service, but a few are domain-independent, meaning that they will work for any kind of policy.

As we mentioned earlier, domain-independent and domain-specific policy systems are highly complementary. The strength of a domain-specific policy system is enforcing policies within its domain, but its weakness is that policies outside the domain are not expressible in the language. The strength of a domain-independent policy system is expressing policies for any and every domain, but its weakness is that monitoring/enforcing/auditing those policies can be challenging.

For policy to live up to its expectations, we need a rich ecosystem of policy-aware services that interoperate with one another. Networking policies should be handled by Neutron; compute policies should be handled by Nova; storage policies should be handled by Swift and Cinder; application policies should be handled by Heat; cross-cutting policies should be handled by a combination of Congress, Gantt, and SolverScheduler. We believe it will be incredibly valuable to give users a single touch point to understand how all the policies throughout the data center interact and interoperate—to provide a dashboard where users ask questions about the current state of the data center, investigate the impact of proposed changes, enact and automate enforcement decisions, and audit the data center’s policy from the past.

Next Steps

To help coordinate the interaction and development of policy-aware services and policy-related efforts in OpenStack, the OpenStack Mid-Cycle Policy Summit intends to bring representatives from many different policy-minded companies and projects together. The aim of the summit is to discuss the current state of policy within OpenStack and begin discussing the roadmap for how policy will evolve in the future. The summit will start with some presentations by (and about) the various policy-related efforts and their approach to policy; it will wrap up with a workshop focused on how the different efforts might interoperate both today and in the future. Following this summit, which takes place September 18–19, 2014, we’ll post another blog entry describing the experience and lessons learned.

Geneve – Ecosystem Support Has Arrived

[This post was authored by T. Sridhar and Jesse Gross.]

Earlier this year, VMware, Microsoft, Red Hat and Intel published an IETF draft on Generic Network Virtualization Encapsulation (Geneve). This draft (first published on Valentine’s Day no less) includes authors from each of the first-generation encapsulation protocols — VXLAN, NVGRE, and STT. However, beyond the obvious appeal of unification across hypervisor platforms, the salient feature of Geneve is that it was designed from the ground up to be flexible. Nobody wants an endless cycle of new encapsulation formats as network virtualization designs and controllers mature, certainly not the vendors that have to support the ever growing list of acronyms and RFCs.

Of course press releases, standards bodies and predictions about the future mean little without actual implementations, which is why it is important to consider the “ecosystem” from the beginning of the process. This includes software and silicon implementations in both commercial and open source varieties. This always takes time but since Geneve was designed to accommodate a wide variety of use cases it has seen a relatively quick uptake. Unsurprisingly, the first implementations that landed were open source software — including switches such as Open vSwitch and networking troubleshooting tools like Wireshark. Today the first hardware implementation has arrived, in the form of the 40 Gbps Intel XL710 NIC, previously known as Fortville.

Demo of Geneve hardware acceleration at Intel Developer Forum.

Why is hardware support important? Performance. Everyone likes flexibility, of course, but most of the time that comes with a cost. In the case of a NIC, hardware acceleration enables us to have our cake and eat it too by offloading expensive operations while retaining software control in the CPU. These NICs add encapsulation awareness for classic operations like checksum and TCP segmentation offload to bring Geneve tunnels to performance parity with traditional traffic. For good measure, they also add in support for a few additional Geneve-specific features as well.

Of course, this is just the beginning — it is still only six months after publication of the Geneve specification and much more is still to come. Expect to see further announcements coming soon for both NIC and switch silicon and of course new software to take advantage of the advanced capabilities. Until then, a discussion session as well as a live demo will be at Intel Developer Forum this week to provide a first glimpse of Geneve in action.

Physical Networks in the Virtualized Networking World

[This post was co-authored by Bruce Davie and Ken Duda]

Almost a year ago, we wrote a first post about our efforts to build virtual networks that span both virtual and physical resources. As we’ve moved beyond the first proofs of concept to customer trials for our combined solution, this post serves to provide an update on where we see the interaction between virtual and physical worlds heading.

Our overall approach to connecting physical and virtual resources can be viewed in two main categories:

  • terminating the overlay on physical devices, such as top-of-rack switches, routers, appliances, etc.
  • managing interactions between the overlay and the physical devices that provide the underlay.

The latter topic is something we’ve addressed in some other recent posts (here, here, and here) — in this blog we’ll focus more on how we deal with physical devices at the edge of the overlay.

We first started working to design a control plane to terminate network virtualization overlays on physical devices in 2012. We started by looking at the information model, defining what information needed to be exchanged between a physical device and a network virtualization controller such as NSX. To bound the problem space, we focused on a specific use case: mapping the ports and VLANs of a physical switch into virtual layer 2 networks implemented as VXLAN-based overlays. (See our posts on the issues around encapsulation choice here and here). At the same time, we knew there were a lot more use cases to be addressed, so we picked a completely extensible protocol to carry the necessary information: OVSDB. This was important: we knew that over time we’d have to support a lot more use cases than just L2 bridging between physical and virtual worlds. After all, one of the tenets of network virtualization is that a virtual network should faithfully reproduce all of the networking stack, from L2-L7, just as server virtualization faithfully reproduces a complete computing environment (as described in more detail here.)

So, the first thing we added to the solution space once we got L2 working was L3. By the time we published the VTEP schema as part of Open vSwitch in late 2013, distributed logical routing was included. (We use the term VTEP – VXLAN tunnel end-point – as a general term for the devices that terminate the overlay.) Let’s take a look at how logical routing works.


Distributed logical routing is an example of a more general capability in network virtualization, the distribution of services. Brad Hedlund wrote some time ago about the value of distributing services among hypervisors. The same basic arguments apply when there are VTEPs in the picture — you want to distribute functions, like logical routing, so that packets always follow the shortest path without hair-pinning, and so that the capacity to perform that function scales out as you add more devices, be they hypervisors or physical switches.

So, suppose a VM (VM1) is placed in logical subnet A, and a physical server (PS1) that is in subnet B is located behind a ToR switch acting as a VTEP (see picture). Say we want to create a logical network that interconnects these two subnets. The logical topology is created by API requests to the network virtualization controller, which in turn programs the vswitches and the ToR to instantiate the desired topology. Part of this process entails mapping physical ports to the logical topology via API requests. Everything the ToR needs to know to participate in the logical topology is provided to it via OVSDB.
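
Conceptually, the state handed to the ToR boils down to a set of (physical port, VLAN) to logical-switch bindings plus, for each logical switch, the tunnels and MAC addresses behind them. Here is a hedged Python sketch of that mapping; it is our illustration, not the actual “vtep” OVSDB schema, and the names and addresses are made up.

# Toy model of the binding state a hardware VTEP needs: which (port, VLAN)
# pairs belong to which logical switch, and where remote MACs live.
vlan_bindings = {
    ("eth7", 100): "logical-switch-B",   # PS1's port/VLAN joins logical subnet B
}

remote_macs = {
    # logical switch -> {MAC: VXLAN tunnel endpoint that hosts it}
    "logical-switch-B": {"52:54:00:aa:bb:01": "192.0.2.10"},
}

def forward_from_physical(port, vlan, dst_mac):
    ls = vlan_bindings.get((port, vlan))
    if ls is None:
        return "drop"                    # port/VLAN not mapped to any logical network
    tunnel = remote_macs.get(ls, {}).get(dst_mac)
    return "encapsulate to " + tunnel if tunnel else "flood logical switch"

forward_from_physical("eth7", 100, "52:54:00:aa:bb:01")   # "encapsulate to 192.0.2.10"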

Suppose VM1 needs to send a packet to PS1. The VM will send an ARP request towards its default gateway, which is implemented in a distributed manner. (We assume the VM learned its default gateway via some prior step; for example, DHCP may be used.) The ARP request will be intercepted by the local vswitch on the hypervisor. Acting as the logical router, the vswitch will reply to the ARP, so that the VM can now send the packet towards the router. All of this happens without any packet leaving the hypervisor.

The logical router now needs to ARP for the destination — PS1 (assuming an ARP cache miss for the first packet). It will broadcast the ARP request by sending it over a VXLAN tunnel to the VTEP (and potentially to other VTEPs as well, if there are more VTEPs that are involved in logical subnet B). When the ARP packet reaches the ToR, it is sent out on one or more physical interfaces — the set of interfaces that were previously mapped to this logical subnet via API requests. The ARP will reach PS1, which replies; the ToR forwards the reply over a VXLAN tunnel to the vswitch that issued the request, and the vswitch can now forward the data traffic to the ToR, which decapsulates the packet and delivers it to PS1.

For traffic flowing the other way, the role of logical router would be played by the physical VTEP rather than the vswitch. This is the nature of distributed routing — there is no single box performing all the work for a single logical router, but rather a collection of devices. And in this case, the work is distributed among both hardware VTEPs and vswitches.

We’ve glossed over a couple of details here, but one detail that’s worth noting is that, for traffic heading in the physical-to-virtual direction, the hardware device needs to perform an L3 lookup followed by VXLAN encapsulation. There has been some uncertainty regarding the capabilities of various switching chips to perform this operation (see this post, for example, which tries to determine the capabilities of Trident 2 based on switch vendor information). We’ve actually connected VMware’s NSX controller to ToR switches using at least four different classes of switching silicon (two merchant vendors, two custom ASIC-based designs). This includes both Arista’s 7150 series and 7050X switches. All of these are capable of performing the necessary L3+VXLAN operations. We’ll let the switch vendors speak for themselves regarding product specifics, but we’re essentially viewing this as a non-issue.

OK, that’s L3. What next? Overall, our approach has been to try to provide the same capabilities to virtual ports and physical ports, as much as that is possible. Of course, there is some inherent conflict here: hardware-based end-points tend to excel at throughput and density, while software-based end-points give us the greatest flexibility to deliver new features. Also, given our rich partner ecosystem with many hardware players, it’s not always going to be feasible to expose the unique features of a specific hardware product through the northbound API of NSX. But we certainly see value in being able to do more on physical ports: for example, we should be able to create access control lists on physical ports under API control. Similarly, we’d like to be able to control QoS policy at the physical ingress, e.g. remarking DSCP bits or trusting the value received and copying to the outer VXLAN header. More stateful services, such as firewalling or load-balancing, may not make sense in a ToR-class device but could be implemented in specific appliances suited for those tasks, and could still be integrated into virtualized networks using the same principles that we’ve applied to L2 and L3 functions.

In summary, we see the physical edges of virtual networks as a critical part of the overall network virtualization story. And of course we consider it important that we have a range of vendors whose devices can be integrated into the virtual overlay. It’s been great to see the ecosystem develop around this ability to tie the physical and the virtual together, and we see a lot of opportunity to build on the foundation we’ve established.

