
Demystification of TC

At Criteo we operate containers with Mesos. Containers share the resources of their host, and applications compete for access to them. One of the great benefits of containers is that they let us tackle this resource contention issue.

Network resources are somewhat special compared to other ones such as CPU and memory, for several reasons. First of all, the network is the most sensitive to contention, simply because of its (relative) lack of speed compared to CPU and RAM. A network-bound application will notice any latency added to its flows, which may even be visible at human scale. It is easy for an application to impact the network performance of a co-hosted one: this is the noisy neighbor issue.

Then, network optimization problems are much more complex than, for example, CPU sharing, because the overall performance of the system depends on external, unpredictable, and rapidly changing factors.

Finally, the way you share your network resources is political, and as such there is no one-size-fits-all policy addressing all use cases.

Since we did not find any off-the-shelf solution matching our use case of isolating the network usage of Mesos containers, we decided to develop our own, based on the standard Linux tool TC (short for Traffic Control). Sadly, the information available on the Internet on the subject is very limited, scattered, and often outdated. We will try to address this issue here.

To understand TC and how to integrate it with Mesos, we first need to understand a few principles of traffic shaping and define some terms.
Then, we will give an overview of TC and its object model, and demonstrate how we can isolate containers with a specific policy for each one.

Finally, you'll find some dirty details about the solution's inner workings and how to address common operational issues.

Elements of Traffic Control

Overview

Traffic Control (TC) is a set of algorithms used to optimize network flows for throughput, latency, or jitter. With these algorithms you specify a policy per network flow. A policy can be something like "I want to allocate at least 1Gbps of bandwidth to HTTP traffic", "cap BitTorrent traffic to a maximum of 100Mbps", or "give DNS traffic priority over everything else".

In our case we want to describe a policy per Mesos container, setting the maximum amount of bandwidth available on a per-container basis, each container having its own user-supplied bandwidth resource.

Let's review the main concepts behind TC, as they directly describe the intent of each kind of algorithm and how they work together. But please keep in mind that TC models these concepts in a slightly different way, so don't try to map them directly to TC objects!

The different types of algorithms are applied in the following sequence:

When a packet is sent, following the egress path, it goes through three phases in the following order: (1) Classification, (2) Scheduling, and (3) Shaping. A packet does not have to go through all these steps; we could use only a scheduler, for example.

On the incoming path, also called ingress, we can only use a Policer on Linux.

Let's take a closer look at each of these algorithms, focusing first on the outgoing / egress path.

Classification

When an application sends a packet, the first thing to do is to find out which policy needs to be applied to this packet.

It works by identifying network flows based on the metadata and/or the headers of the packet, and accordingly dispatching it to the relevant policy:

There is a variety of approaches to implement the classification, such as leveraging pure TC capabilities, conntrack, eBPF, etc.

In our case, we want to track the container which is sending the packet, which is trivial as it is available in the metadata of the packet.

The complexity arises when packets come back: then we need to leverage the conntrack system. We'll dig into this when we get to the actual implementation.

Scheduling

Scheduling algorithms are responsible for managing queues of packets, which essentially means two things: (1) deciding which packet to send next, and (2) deciding what to do when the queues are full.

Let’s take the simplest possible scheduling algorithm: a FIFO queue.

(1) Which packet to send next? It just sends packets in their arrival order.

(2) What to do when packet queues are full? It gets a bit more interesting! You basically have two choices: either drop the newly arrived packet that needs to be enqueued (this is a "tail drop" policy), or drop older packets already in the queue ("head drop").

Tail drop will heavily penalize TCP connections by forcefully triggering mechanisms such as RTO (retransmission timeout). Head drop is smoother, as it usually allows fast retransmits to happen.

http://linux-ip.net/articles/Traffic-Control-HOWTO/classless-qdiscs.html

Generally speaking this subject is also known as Active Queue Management (AQM) in the literature.

Smarter algorithms include SFQ, the RED family, CoDel, etc.

Shaping

Our packets have been through classification and scheduling; we now have a set of queues nicely sorted per policy, container, etc.

We are now able to release them at a disciplined pace to enforce policies such as rate limits. This is the concern of shaper algorithms.

Depending on the actual algorithm and parameters used, this can have the effect of smoothing out spiky traffic.

https://wiki.mikrotik.com/wiki/Manual:HTB-Token_Bucket_Algorithm

Shaping algorithms are typically variations of the well-known token bucket algorithm. The algorithm works by defining a bucket into which tokens are added at a fixed frequency. To be able to send data you need to take the equivalent amount of tokens from the bucket. So the frequency at which you refill the bucket defines your target output data rate, and its size (the maximum number of tokens the bucket can hold) defines the maximum burst.

Bursting refers to the maximum amount of data you can send at once (before the next guy in line), while still respecting your rate limits.

Typical algorithms: DRR, HFSC, HTB (which we will discuss in greater detail later), or, even more dandy: CAKE.
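To get a concrete feel for the token bucket in TC, here is what a standalone TBF (Token Bucket Filter) qdisc could look like; a minimal sketch where eth0, the rate, and the burst size are arbitrary values:

    # cap all egress traffic of eth0 to roughly 1Gbps, allowing 128KB bursts
    tc qdisc add dev eth0 root tbf rate 1gbit burst 128k latency 50ms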

Policing

Now, on the ingress path (incoming data) we have access to a single type of algorithm: Policers.

Policers have capabilities similar to shapers: they can enforce a rate limit on the incoming flow of packets. But the key difference with a shaper is that a policer can only take one corrective action: dropping the received packet. This kind of algorithm basically enforces bandwidth and burst limits by performing some accounting and dropping non-compliant packets.
More sophisticated algorithms allow defining tolerated peak values, creating a warning band between the nominal rate we want to enforce and the peak rate.

Technically, the Linux policer is a three-color policer: arriving packets are classified as green if they fully comply with the requested guarantees, yellow if they are between the minimum guarantees and the tolerated peak rates, or red if they do not comply at all. Red packets are dropped before yellow ones in case of contention. Advanced usage of TC allows handling the packets in these three buckets differently.

But in any case, with policing we are incapable of delaying packets, as we don't have access to any queue. As a consequence, unlike with shaping, we cannot select the packets to be dropped in a smart way: we are essentially dropping packets randomly when our policy is violated. This will eventually trigger bad TCP behavior such as retransmission timeouts and, as a consequence, sporadic latency issues.
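For reference, attaching a basic policer on the ingress path looks roughly like this (a sketch; eth0, the rate, and the burst are arbitrary):

    # the ingress entry point only accepts filters, so policing is done in a filter
    tc qdisc add dev eth0 handle ffff: ingress
    tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 match u32 0 0 \
        police rate 100mbit burst 200k drop flowid :1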

We’ll see later how to get around this limitation.


TC Object Model

Overview

TC objects have been designed as a way to model your policies as a decision graph.
Filters and Actions essentially let you handle the classification problem, while the actual policies are defined with QDisc and Class objects.

QDisc and Classes

QDisc stands for queuing discipline. It is the high-level object for shaping and scheduling algorithms (both activities are collectively referred to as scheduling in Linux kernel terminology).

For example, the simplest qdisc would be a FIFO; as a matter of fact, the default qdisc in Linux is pfifo_fast.

Every network device (physical or virtual!) has two entry points, one for each data direction:

  • one for outgoing packets (egress path),
  • and one for incoming packets (ingress path).

Each of these entry points must have a qdisc attached to it; the kernel always provides a default one if none is explicitly set up.

On the egress entry point, a pfifo_fast qdisc is set up by default.

The ingress qdisc is a bit special: it does not accept any child qdisc, but it does accept filters. We cannot shape ingress traffic, but with filters we can police it.

Here is the default setup of your machine if you don’t change anything explicitly:
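You can check it on your own machine with tc; on a stock setup you would typically see something like the following (the exact default may differ, e.g. fq_codel on distributions that change net.core.default_qdisc):

    tc qdisc show dev eth0
    # qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1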

There are two types of qdiscs: classful and classless.

You can think of classless qdiscs simply as queues, sometimes with fancy algorithms ruling their basic operations such as enqueue, dequeue, etc. All the packets in a classless qdisc are managed on the same basis, without any need (or capability) for classification. pfifo_fast is the perfect example of a classless qdisc: while it does fancy things such as honoring traffic priority, we cannot define policies for different network flows.

Classful qdiscs allow a different policy for each class of traffic.

For example, HTB is a traffic shaping classful qdisc. Using it, we could define three classes: one for HTTP traffic, one for BitTorrent traffic, and a third catch all one, assigning them respectively rates of 500Mbps, 200Mbps, and 300Mbps.

Classful qdiscs also have the capability to define policies as a hierarchy. In the previous example we could further refine the HTTP policy and split its 500Mbps into sub-classes, allocating 200Mbps to Youtube traffic and 300Mbps to all other HTTP traffic.

This is possible because a Class is a wrapper of a qdisc, which can be classless or classful.
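As a sketch of the HTTP/BitTorrent example above, an HTB hierarchy could be built like this (the device name, handles, and classids are made up for illustration, and the filters classifying traffic into these classes are omitted):

    tc qdisc add dev eth0 root handle 1: htb default 30
    tc class add dev eth0 parent 1: classid 1:1 htb rate 1gbit
    tc class add dev eth0 parent 1:1 classid 1:10 htb rate 500mbit   # HTTP
    tc class add dev eth0 parent 1:1 classid 1:20 htb rate 200mbit   # BitTorrent
    tc class add dev eth0 parent 1:1 classid 1:30 htb rate 300mbit   # everything else
    # refine the HTTP policy into sub-classes
    tc class add dev eth0 parent 1:10 classid 1:101 htb rate 200mbit # Youtube
    tc class add dev eth0 parent 1:10 classid 1:102 htb rate 300mbit # other HTTP traffic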

To summarize the relationship between qdisc and class objects:

  • Classless qdiscs are not allowed to own any Class or Filter
  • Classful qdiscs contain one or several Class(es) and Filter(s)
  • Classes wrap qdiscs (of any type)

Here is a schematic view of the relationship between qdiscs and classes:

The whole point of classes is to add an identifier (the classid) to qdiscs so they can be targeted by Filters. This is what effectively allows classifying traffic flows.

There is an important constraint with qdisc hierarchies: the leaves have to be classless qdiscs. This is enforced by the kernel, which terminates any newly added class with a pfifo_fast qdisc.

Here is an example of a more complex, but more realistic, setup:

A policy tree like the one above could describe something like:

  • Let's reserve 10Gbps of bandwidth for the system (classid=1)
  • and 1Gbps for the Mesos containers, classid=2 (you probably don't want to use these exact numbers…)
  • … then, among the Mesos containers, we further divide the traffic into 3 priority classes
  • all the leaves here use the default pfifo_fast scheduling (but we could use another one such as fq_codel)

Notice that, from the network card's perspective, it only talks to the root qdisc, which polls packets from its children using its own algorithm, and so on recursively until the leaf nodes are polled.

Handles and Classids

Handles and classids have the same underlying type: a 32-bit unsigned integer, split into two 16-bit parts, <major>:<minor>.

By convention, qdiscs have a handle with the minor always set to 0 (or simply omitted, in which case we use the notation <major>:). The major number is arbitrary (it can be chosen by the end user or randomly assigned by the system).

Classes, as you already know, have a classid. There is one constraint: the major number of the classid must match the handle of the parent qdisc. The minor number of the classid is the class instance number.

There are several special and reserved handle numbers; the most important one is FFFF:, which is the handle given to the ingress qdisc attached to the device.
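To illustrate the notation (the handles and classids below are arbitrary):

    # "1:" is the qdisc handle: major=1, minor omitted (i.e. 0)
    tc qdisc add dev eth0 root handle 1: htb
    # "1:10" is a class of that qdisc: its major (1) must match the parent qdisc handle
    tc class add dev eth0 parent 1: classid 1:10 htb rate 100mbit
    # the ingress qdisc always gets the reserved handle ffff:
    tc qdisc add dev eth0 handle ffff: ingress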

Practically speaking…

Let’s start to build some isolation for our Mesos containers!

We are going to use:

  • for traffic shaping: HTB, a classful qdisc
  • for fairness (within each container's flows): FQ_CODEL, a classless qdisc

HTB will also give us fairness among the containers.

Let’s focus on the egress (outgoing) path for now:
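A sketch of that egress setup; the interface, rates, and classids below are made up for illustration, the real values coming from the host configuration and the container policies:

    # HTB at the root of the physical interface, with a catch-all default class
    tc qdisc add dev eth0 root handle 1: htb default 1
    tc class add dev eth0 parent 1: classid 1:1 htb rate 10gbit               # host / default traffic
    # one class per Mesos container, with its user-supplied rate
    tc class add dev eth0 parent 1: classid 1:42 htb rate 500mbit ceil 1gbit
    # fq_codel inside the container class, for fairness among the container's flows
    tc qdisc add dev eth0 parent 1:42 fq_codel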

Filters & Actions

The filters and actions subsystem of TC is extremely powerful and flexible. It can be viewed as a DSL to program packet processing.

Detailing its superpowers is outside the scope of this article; if you want to dig into the subject, I'd recommend the following starting points:

That being said, let's have a look at what we need from filters to assign packets to the relevant queuing discipline.

Filters are attached to classful qdiscs (it is not possible to attach filters to classless qdiscs). They look a little bit like iptables rules: one part matches packets, and another part, called an action, describes what to do with them.

The most common action pushes the packet to a qdisc class:
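For example, a filter sending HTTP traffic to class 1:10 could look like this (a sketch using the u32 classifier; the values are illustrative):

    tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
        match ip dport 80 0xffff flowid 1:10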

The priority defines the order in which your filters are evaluated: the lower the priority value, the earlier in the chain the filter is evaluated.

Our main goal is to apply a different policy per container, so we need to identify which container a packet belongs to. For outgoing packets it is straightforward: the kernel gives us the information. But for incoming packets, we chose to leverage netfilter's connection tracking system to identify which container they are going to.

Tracking connections means that we can tie all packets back to their application-level conversation. With CONNMARK, packets belonging to the same conversation are tagged with an identifier of our choosing. This is called "marking" packets.

As a mark we are going to set a number uniquely identifying the associated container.

This identifier is technically the net_cls cgroup classid. In our specific case of Mesos, it is allocated for us by the net_cls isolator.

Once we have this marking in place, tc filters can use this mark to apply the policy corresponding to the container.

Here is the setup we want:

Except that it would not work! Remember, we cannot shape on ingress!

To make it possible, we need to use an IFB device.

Flow redirection to IFB

IFB is a virtual network device whose egress and ingress paths are inverted, which allows shaping incoming traffic by attaching any type of qdisc to it.

First, we need to create the device by loading a kernel module.

Then, using TC, we redirect the incoming traffic of our physical NIC to the newly created IFB device. While doing so, we also use CONNMARK capabilities to restore the connection mark.

At this point, packets going out of the IFB device can be shaped before eventually reaching the application.

Indeed, TC filters have the capability to "steal" packets from an interface and push them to another one.
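Here is a sketch of this redirection, assuming the physical interface is eth0, the IFB device is ifb0, and a reasonably recent kernel providing the connmark action:

    # create and bring up the IFB device
    modprobe ifb numifbs=1
    ip link set dev ifb0 up
    # attach the ingress qdisc to the physical NIC
    tc qdisc add dev eth0 handle ffff: ingress
    # match everything, restore the connection mark into the packet mark,
    # then redirect the packet to ifb0
    tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 match u32 0 0 \
        action connmark \
        action mirred egress redirect dev ifb0
    # shaping (e.g. the HTB setup shown earlier) can now be attached to ifb0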

So the ingress data path now looks like the following:

Now, let’s see how we can set this up!

CONNMARK setup

Nothing super fancy here, just using documented boilerplate:
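A possible version of this boilerplate with iptables; the classid value 0x100001 is just a placeholder for a container's net_cls classid:

    # egress: mark packets from the container's net_cls cgroup and save the mark on the connection
    iptables -t mangle -A POSTROUTING -m cgroup --cgroup 0x100001 -j MARK --set-mark 0x100001
    iptables -t mangle -A POSTROUTING -j CONNMARK --save-mark
    # ingress: the mark is restored by the connmark action during the redirection to the IFB device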

That’s it, we are done! Now, we are able to shape our traffic going in and out with a different rate per Mesos container.

Going further down the rabbit hole

Now that we have a working system, let’s get a bit more familiar with it to see how we can maintain and troubleshoot it.

Where am I?

Let's have a look at where all these TC and netfilter things happen:

Note that this is a simplified 10,000-foot view, especially for netfilter, which has been grossly oversimplified here; if you want to zoom into the subject, please refer to this map on Wikipedia.

As you can see, at the lowest level of the stack, besides the driver code, you have the hooks used by tools such as tcpdump. Then come the queuing disciplines, and above that all the netfilter logic mixed into the rest of the IP stack.

Queuing disciplines are very low in the stack, which also further explains why handling incoming traffic is difficult since we have little context on the packets.

Let's review the practical implications of this layering.

Packet capture

The traffic you capture with tools such as tcpdump is captured very low in the stack, so:

  • for incoming packets, you see packets which may later be dropped or redirected by TC or iptables
  • for outgoing packets, you don't see packets sent by the application but dropped by TC or iptables
  • various optimizations happen in the driver (GRO, GSO, TSO, etc.), so you don't see the real packets, which makes network analysis complex, especially when combined with TC shaping

Network traffic captured via the NFLOG facility suffers from the same problems as regular captures, but mirrored:

  • for incoming packets, you don’t see packets dropped by TC
  • for outgoing packets, you see packets sent by the application but that will never be sent on the wire because of TC

Network analysis can be done by capturing data with both regular taps and the NFLOG facility and comparing the captures… You can look for tools that diff packet capture files, or export them as text and use your favorite diff tool!

NFLOG example:
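For instance, with an arbitrary NFLOG group number:

    # copy outgoing packets to NFLOG group 5 without affecting them
    iptables -t mangle -A POSTROUTING -j NFLOG --nflog-group 5
    # capture that mirrored traffic
    tcpdump -i nflog:5 -w nflog.pcap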

There is yet another way to capture traffic, from within TC using the mirred action:
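A sketch of such a capture, assuming a spare ifb1 device to mirror into and the HTB root qdisc 1: from earlier:

    ip link set dev ifb1 up
    # mirror (not redirect) all egress traffic of eth0 to ifb1
    tc filter add dev eth0 parent 1: protocol ip prio 99 u32 match u32 0 0 \
        action mirred egress mirror dev ifb1
    tcpdump -i ifb1 -w mirrored.pcap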

Troubleshooting TC

OK, you have put your qdiscs, classes, and filters in place and… it does not work! What now?

Well, that's a tough one; here are two tools that can help you.

To debug your filters, there's an action called 'simple' which, as the name indicates, logs messages and exposes statistics. It can be very helpful to check whether your matching expressions actually match, whether the rules are evaluated, etc.
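For example (a sketch; the match and the message are illustrative):

    # log a message and count hits every time this filter matches
    tc filter add dev eth0 parent 1: protocol ip prio 10 u32 match ip dport 80 0xffff \
        action simple sdata "matched HTTP"
    # the hits show up in the kernel log and in the filter statistics:
    tc -s filter show dev eth0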

Another approach is to rely on counters from TC and potentially from iptables:
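The -s flag of tc displays those counters:

    tc -s qdisc show dev eth0
    tc -s class show dev eth0
    tc -s filter show dev eth0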

To set up iptables-based counters:
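One way to do it (the cgroup classid is again a placeholder) is to add a rule with no target, which does nothing but count matching packets:

    iptables -t mangle -A POSTROUTING -m cgroup --cgroup 0x100001
    # then watch the packet/byte counters
    iptables -t mangle -L POSTROUTING -v -n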

Conntrack for fun and packet loss

Our setup uses conntrack; this is not neutral and needs to be managed in larger systems.

Conntrack basically manages a connection table, which is bounded in its number of entries. If the table is full when new connections arrive, packets get dropped. You can see whether this is happening in dmesg.
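When it happens, the kernel logs messages along the lines of "nf_conntrack: table full, dropping packet", which you can look for with:

    dmesg | grep -i "table full, dropping packet"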

Tuning conntrack is a subject by itself, but you especially have to pay attention to:

  • the memory taken by each entry, which varies with your architecture (on my server each entry is 376 bytes)
  • UDP traffic, which can really blow up your table size
  • the different timeout values (see table below)

source: http://people.netfilter.org/jimit/Netfilter_Framework.odp

Conntrack table size can be changed dynamically via sysctl:
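For example:

    # current limit and current usage
    sysctl net.netfilter.nf_conntrack_max
    sysctl net.netfilter.nf_conntrack_count
    # raise the limit (the value is only an example)
    sysctl -w net.netfilter.nf_conntrack_max=1048576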

Conclusion

We have seen how we can isolate the network usage of Mesos containers from each other. To get there, we went through networking notions such as packet scheduling, traffic shaping, packet marking, connection tracking, and flow classification, and integrated them all with Mesos.

The proposed solution has many advantages over existing systems (e.g. the port-mapping isolator): it provides granular policies per container while remaining very agnostic to the underlying network configuration. The setup is also a lot simpler, as it relies only on cgroups and no namespaces.

It's not without limitations: for example, traffic on the loopback interface is not taken into account, and UDP traffic is globally not very well handled.

We have demonstrated everything with tc shell commands; a real-world Mesos isolator would probably use libnl instead.

As usual with network-related matters, the hardest part of putting such a system in place will be helping your users, explaining the system to them, and troubleshooting it.

Hopefully this article will help you with these matters.

If you want to know more on how we actually implemented network isolation with our Mesos deployment, you can check out our talk at MesosCon18:

https://mesoscon18.sched.com/event/HYTq/moving-toward-network-isolated-containers-in-mesos

https://www.slideshare.net/FredericBoismenu/mesos-nnetwork-isolation-at-criteo

And stay tuned for a second post diving deeper into how we integrated this into Mesos and let our users define their network bandwidth requirements!


Thanks to Clement Michaud and Pierre Cheynier!