
Docker: Taming the Beast - Part I


Docker is one of those things that, one day, started popping up in forums, in IRC discussions and on Hacker News, generating buzz, but I never took the time to find out what it was really about.
Of course I don’t live in a cave: I had heard of it, and whenever I read something about Docker, ideas like “software containers”, “sandboxes” and “virtual machines” came to mind, but that was about it.
And then one day I wanted to learn Docker, and I immediately faced a problem: everybody “knew” about Docker, everybody talked about Docker, but nobody was actually explaining Docker anymore. It was as if it were so obvious that the only remaining topics were Docker implementations, people raging that Docker did not bring anything new, that containers had been around for decades already, etc.
Furthermore, the Docker landing page itself was not much more verbose:

An open platform for distributed applications for developers and sysadmins.

This is very abstract, and for somebody who “missed the train”, it was not that easy to find some useful explanations. The only thing I knew was that Docker dealt with “containers”, which, in my mind, I equated with a sandbox (in particular, I come from the Haskell world, so the closest analogy I had was cabal sandboxes).
But that sounded like magic to me: running software in a completely enclosed “environment” (whatever that means), without affecting the computer.
The idea seemed very attractive to me, even more so because I had recently switched to NixOS, a Linux distribution built around the nix package manager, where this concept of isolation is paramount.

In other words, my goal in this first post is to spare you the trouble of fishing for information about Docker to understand it, as well as sharing the love about Docker, of course! Five days a week I’m hanging out on the #docker IRC channel (I’m nschoe, by the way, if you want to come and say hello :-)) and too many times I see newcomers asking questions that clearly indicate they are missing some very important base notions: if that’s your case, I hope to help you with this series of articles. Let’s fix that together!

I just want to clarify that Docker (and more generally software containers) is a complex topic that goes all the way down to the kernel. I’m not a Docker developer and I’m not an expert in the field. I just like to understand, and I work better when I really get what’s going on under the hood.
I’ll do my best to transmit what I understand of all of this, but if you spot any mistakes, please don’t hesitate to let me know, and I’ll fix them.
Likewise, if you’re a newcomer and think I missed a point or left a gray area, tell me so I can fill in the gap!

In this first post, we won’t play with Docker directly, we won’t issue any docker commands, but we will understand the basics, the fundamentals: what’s going on under the hood. But don’t worry, we will start playing with Docker right from part 2!
Trust me: Docker is not an easy thing, and you will benefit from reading some theory about it.

Now let’s go!

What is Docker?

Docker in a Sentence

It’s always better to have an idea, however vague it may be, when you want to learn something. So let me give you some intuition about Docker, right now, so that we have the rest of the post to actually explain it.
But before I give you a sentence (which would not make much more sense than the previous quote), let me give you some facts, in no particular order:

  • Docker is free and open source (which is awesome), if you missed it before, the website is here
  • Once you have an idea of what Docker is, its documentation is fairly well written and understandable, it’s available here
  • “Docker” can mean many things, if you check the documentation’s home page, you will see
    • Docker Engine
    • Docker Swarm
    • Docker Compose
    • Docker Hub
    • Docker Trusted Registry
    • Docker Cloud
    • Docker Machine
    • Docker Toolbox
      That’s a lot! Don’t worry, we’ll deal with each of them in our articles.
      What you are interested in, and what “Docker” should mean to you right now is “Docker Engine”. Unless stated otherwise (or plainly obvious!), from now on, when we talk about “Docker”, we mean “the Docker Engine”.
  • Docker is a piece of software: it’s not a library, it’s not a web app. It’s a command-line tool.
  • Docker allows you to create containers (the divine word!); as unsatisfactory as it sounds, at the moment, think about containers as software sandboxes: an isolated (we will see how) environment (i.e. set of tools, files, environment variables, etc.) in which your software lives.
    This will be the only time when I will tell you this, but for now (and only for now), you can think of containers as a sort of virtual machine. But this analogy is dangerous because people tend to stick to it, and Docker is not a virtual machine, at all. So keep this analogy fresh, and don’t store it in your long-term memory.

Ok, so now for the sentence, even if that doesn’t clarify much, we can say that

Docker is a software that allows developers to create isolated and possibly interdependent environments for controlled software execution.

That’s it for now, let’s keep reading to make sense of all of this.

Docker Use Cases

Docker is a complex tool, and it can be useful in several scenarios. Let’s quickly examine some of them (I don’t pretend to list them all) and see if you can find some similarities with your situation.

Easy Portability

This is easily one of the most common use cases I can find for Docker.
Docker allows you to setup (we’ll even say build) your environment as you wish, and this makes it very portable. To save testing and development time, people can focus on a specific environment, and distribute their binaries with the Docker image (we’ll see what this is a bit later, for the moment, think of an image as the “environment” we keep talking about).
This way, the target customer needs only to install Docker and the image will run smoothly (whether the customer runs the same Linux distribution as you, Windows or OS X).
Keep in mind that Docker runs on the Linux kernel, so it’s a simple matter of installing Docker if your target runs Linux, but it’s currently a bit more complex under Windows and OS X (namely, the customer installs Docker inside a virtual machine running Linux).

Anyway, that’s a common use case for Docker.

Testing Your Software Under Several Environments

Another popular Docker use case is testing.
This is very popular because Docker images can be very easily instantiated and disposed of. When developing a software, it’s a very delicate task to ensure that it will work on your customer’s system configuration. Does he use the same library versions? Does he have the same tools as you? Will your Web App still work with PHP4 and PHP5? Does your Haskell project compile with older GHC versions? Etc.

Docker is very useful here because you can create an environment with PHP 4 or GHC 6 installed and test your installation inside it without messing with your system.

I know.
I anticipate some advanced users that may be thinking right now that I missed an important point about Docker. The use case I just described is somehow controversial: some people think that Docker is not the best tool for this.
There are other solutions for this, such as the nix package manager, on which the NixOS Linux distribution is based.
I know. It just happens that there are a lot of people using Docker for this, and despite being a huge fan of nix and NixOS, I still think it’s worth mentioning.

Continuous Integration and Continuous Delivery

I know that C.I. and C.D. are not the same thing, but they can be grouped together for this section’s sake. If you plan on setting up a C.I. development environment, Docker can be very easily combined with Jenkins (see here for quick access).
You can set up git hooks that will start your Jenkins pipeline inside a Docker container and push your code if the tests pass.

Save Money on Infrastructure Without Compromising Security

If you’re in the SaaS (Software as a Service) business, you have several clients running the same service you offer. And if you’re any good, you care (a lot) about security and, in particular, about making sure your clients’ instances are well separated and as independent as possible. For instance (this is just an example), you might spawn several instances of your custom app, and if one crashes or gets compromised, you’d like to reduce the risk of your other clients being compromised too.

Full Separation of Services

Typically, your service is based on several pieces of software: you may have nginx running as an HTTP server to serve your Web App files, you may have a WebSockets server, you may have a PostgreSQL database storing your clients’ information. All of these are separate entities that may crash or need to be updated separately.
Docker allows you to containerize each of these services, making them isolated and independent, while controlling precisely the way they communicate (understand: choose and control which ports and/or UNIX sockets they use to talk with one another).

Scaling Up

Docker, through the use of Docker Swarm (this is for another post), can quite easily help you set up a High Availability server/database to prevent a crashed node from ruining your entire setup.
Or you can quickly and easily set up a load balancer between several nginx instances to relieve your over-crowded server and/or smoothly take down a server while replacing it with another, possibly updated, one.
Or you can setup a computing cluster and use Docker (Swarm) to easily spread your computing power.

As you see, Docker has many, many possible use cases, and surely I am still missing some very interesting applications. Anyway, before we dive deeper into Docker, it’s important to keep our head cool and not mistake Docker for what it’s not.

What Docker is Not

Now that we have seen some cool Docker features, let’s give some warning words and see what Docker is not.

  • Docker is not a virtual machine.
    This one may be the most difficult concept for newcomers to grasp: Docker-is-not-a-virtual-machine. It doesn’t do virtualization, it doesn’t do emulation.
    The problem is that every article that talks about Docker also talks about virtual machines and makes analogies. In this article, I will try to avoid talking about virtual machines as much as I can, and make you forget about that parallel.
    A bit later in the article, we’ll see some differences between the two.
  • Docker is not magic.
    Too often I see newcomers come to the #docker IRC channel asking for quick instructions to achieve X, Y and Z. Docker makes it relatively easy to use very complex notions such as overlay filesystems and control groups, but that doesn’t mean it’s trivial.
    One particular point on which I insist is that Docker does not replace system comprehension. If you find something difficult to understand, do not think that Docker is the solution. More often than not, Docker will actually add a layer of complexity, and it might not be clear whether the problem comes from the application or from the fact that it’s dockerized.
    Hardly two days pass without someone coming to the #docker IRC channel asking why their PostgreSQL data was lost when they recreated the container, or why the Docker image they are building takes half their disk space.
    In a word, you need to understand the underlying concepts before using Docker. I have spent hours in the Docker documentation, and I still have doubts every time I use Docker for something new: am I really good enough in XYZ to replace the common method with Docker containers?

If that was not concrete enough: don’t try to containerize a PostgreSQL database if you’re not familiar with PostgreSQL (can you do common administration tasks with psql? Do you know how to create, list, alter and delete tables? Do you know how to save and back up your PostgreSQL tables?).
Do not try to set up a containerized front-end nginx if you don’t know the basics of nginx configuration. I’ve paid this particular price myself, and I was lucky enough to find someone on #nginx to help me with that.
Don’t try to isolate your compilation chain if you can’t write basic Makefiles.

Well, you get the idea :-)

How does it Work?

While this part is technically not needed to start using Docker, I will talk about some fundamental concepts about Docker and how it works, so it’s really worth it.
I will talk about what’s under the hood of Docker and how it makes the magic happen.

How does it Not Work?

In order to get rid of false ideas and our intuition (which, in this case, is most likely playing tricks on us), we’ll talk about how Docker does not work.

As I said before, Docker is not a virtual machine; so let’s see, roughly, how one works.
A virtual machine, as its name implies, is like a real machine, only it’s “virtual”, as in “inside another machine”. It behaves very much like a real, fully-fledged machine (computer).

When using a virtual machine, you generally create a virtual hard disk, which is a big file on the host’s filesystem, and you allocate some of the host’s RAM and video memory to the virtual machine by making them accessible through special, low-level routines.
On the virtual hard drive, you install a complete operating system (Windows, Linux, etc.) from scratch. From the point of view of the installer inside the virtual machine, it’s really writing to a disk (only it’s a file on the host, but it -the installer- doesn’t know that). But that doesn’t change much: you still partition it, you still create filesystems on the partitions (FAT, ext3, ext4, etc.).
Then you typically write the MBR in the first few sectors of this file (or nowadays you write your UEFI loader in the correct partition), and when you “start” your virtual machine, it reads the host’s file as its hard drive, reading the partitions, the OS boot sector, etc.

Since your virtual machine is “only” reading the hard drive’s data and executing its instructions, it can basically run “anything”, and in particular, it doesn’t matter what OS you install and run. This is the big strength of virtual machines. And with CPU emulation, you can even run different processor architectures.

Let’s come back quickly to the virtual hard drive: as I said before, the virtual machine uses a (big) file on the host’s filesystem. When the OS inside the virtual machine writes data to disk (e.g. creates a file), it calls low-level kernel routines (drivers) to write the file. In a real machine, these kernel routines call the hard drive’s driver and actually write the data to the physical disk.
In a virtual machine, this process is sort of “hijacked”: when the OS calls the low-level routines to write to disk, the virtual machine software will “catch” these calls, write the data to the host’s (big) file serving as the virtual hard drive and send back the appropriate answer to the virtualized OS.

All of this is very, very simplified and I’m sure some specialists are hating me right now, but I think it’s enough to understand the concept. And even more importantly, to understand the differences from Docker containers.

So everything we just saw is roughly how a virtual machine works, and it’s really not how Docker works. This is important, because you have to “forget” this, or at least remember that Docker is different.

The Kernel and User lands

Now we are getting serious: Docker relies on several features to make the magic happen, and most of them have to do with the kernel.
To summarize quickly, the kernel is the “core” of Linux (and of any OS, for that matter, but Docker runs on the Linux kernel, so from now on I’ll mostly talk about Linux. If you are on OS X or Windows, don’t be sad and keep reading: all of this still applies. We’ll see how Docker works on these OSes).
By “core” I mean that the very low-level routines are implemented in the kernel: drivers to communicate with external peripherals, the process scheduler, filesystems, etc. Everything that makes the OS work is in the kernel.
As you see, this is an important, crucial and heavy part of an OS.

In a running OS, there are generally two lands or spaces: the kernel space (a.k.a. “kernel land”) and the user space (a.k.a. “user land”). You can see the spaces as two levels -or rings- of privileges.
Only a small amount of code runs in privileged mode (i.e. in kernel space): the device drivers and the kernel routines behind the “system calls” -stat, splice, read, close, exit, mmap, etc. This code has complete, unrestricted access to everything: this is why only very little of it is allowed to run there.

All other programs -the ones you use: web browsers, terminal emulators, mail clients, etc.- run in user land, with unprivileged and restricted access. These are the “user programs”, and this is why it is called the “user land”.
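To make this boundary concrete, here is a minimal Python sketch (the file path is a hypothetical scratch file) showing the kernel refusing a privileged operation requested from user land:

```python
import os

# User-land code runs with restricted privileges: when it asks the
# kernel for an operation reserved for privileged users (here, giving
# a file to root with chown), the kernel answers EPERM instead of
# letting the program do whatever it wants.
path = "/tmp/userland-demo.txt"   # hypothetical scratch file
open(path, "w").close()
try:
    os.chown(path, 0, 0)          # only root may give a file to root
    print("no refusal: we are running as root")
except PermissionError as exc:
    print("kernel refused:", exc)
finally:
    os.remove(path)
```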

Part of the magic behind Docker relies on this notion of user lands. To put it simply, each Docker container creates its own user land, that is separated from the host’s and the other containers’. A bit as if you booted your computer a second time on the same kernel.
We’ll explore how it is achieved in the following sections.


The Namespaces

So what are these namespaces? We keep hearing about them, but what are they?
To answer this question, we need to have a little understanding of how the Linux kernel works. Especially about processes.

Processes are instances of a program: when you run a program, it creates a process. Now, some programs create several processes, and you can think of that as the program itself launching other small programs. Note that I am not talking about threads; threads are another beast. A single process can spawn several threads to make use of multi-threading, to parallelize operations. Here, I am talking about processes, i.e. programs.

So, at any given time, there are a lot of processes running on your computer. To get an idea, try running ps -A | wc -l. It will return a number close to the number of processes currently running. Right now, I have 149 processes running.
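As a side note, ps gets this list by reading the /proc virtual filesystem, where every running process has a directory named after its PID. A minimal Python sketch of the same count:

```python
import os

# `ps -A` reads the /proc virtual filesystem under the hood: every
# running process shows up there as a directory named after its PID.
pids = [entry for entry in os.listdir("/proc") if entry.isdigit()]
print(f"{len(pids)} processes are currently running")
```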

It’s important to have an understanding of how these processes interact with each other.

Let’s play a little bit. Launch your favorite terminal, it should present you with a shell. For the vast majority of people, this will be bash; I’m using zsh myself, but the principle will be exactly the same.
Now that you are in your shell, run touch test to create a file named test. Then run tail -f test. What this does is launch the program tail in “follow” mode, which means that it keeps watching the content of test (currently empty) for new output. We don’t really care about that, all that we care about is that tail won’t terminate: it will keep running.

Now open another terminal; we will try to see what’s happening. As you probably already know, ps is what we can use to see the running processes. We will format its output a bit so that it is more readable. Run ps -Ao ppid,pid,user,fname. This launches ps to print a snapshot of the currently running processes and formats the output to display, in order: the “parent PID”, the PID, the user who executed the process and the process name.
It should return a pretty long list, but toward the end, you should see something like this:

1067  7379 nschoe   zsh
7379  8308 nschoe   tail
2189  8862 nschoe   ps

Remember that the left-most column is the parent PID, and the second is the PID. Here we see something interesting: the zsh process has PID 7379 and the tail process has parent PID 7379. The numbers will be different from mine, but you will still have those two numbers equal.
This is a very basic and very important notion about processes: a process can have child processes, and those children have a parent process. This is pretty easy to understand: when we launched tail from our bash/zsh shell, it created a new, child process.
That’s one important concept. Let’s immediately see a new one: go back to your first terminal, the one from which you ran tail, and hit CTRL + C. It should stop tail.
Now launch this command: tail -f test &. The & sign that we appended means that the command we launched, tail, runs in the background: indeed, you can see that your terminal is available again, even though tail is still running.

Let’s check that: ps -Ao ppid,pid,user,fname:

1067 12242 nschoe   zsh
12242 13267 nschoe   tail
2189 13346 nschoe   ps

Now, from that terminal, hit CTRL + D. It is possible that it answers with something along the lines of

zsh: you have running jobs

In which case, hit CTRL + D again. It should quit the terminal. Now that it has exited, let’s run our ps command again (in a new terminal): ps -Ao ppid,pid,user,fname:

2189 15131 nschoe   ps

There will be plenty of output of course, but toward the end, where we usually saw tail and bash/zsh, we don’t see them anymore. Let’s grep the result to be sure: ps -Ao ppid,pid,user,fname | grep tail should output nothing.

This is a second, very important concept of processes: when you kill the parent (which we did by killing bash/zsh), the child process is generally killed too. How does this work, and how is it possible?
To create the child process, the parent process generally calls fork, which creates the child process (this is a basic summary, but it will be enough to understand Docker). fork returns the PID of the child process.
When we kill the parent process by sending it a SIGNAL, the parent can (and should) forward that SIGNAL to its child(ren): this is how the parent process can kill its child processes.
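The fork mechanism can be sketched in a few lines of Python (os.fork is a thin wrapper around the fork system call):

```python
import os

# Sketch of the parent/child relationship: fork() returns the child's
# PID in the parent and 0 in the child; the child sees its parent's
# PID through getppid(). A pipe lets the child report back.
read_end, write_end = os.pipe()
pid = os.fork()
if pid == 0:                          # we are in the child
    os.close(read_end)
    os.write(write_end, str(os.getppid()).encode())
    os._exit(0)
else:                                 # we are in the parent
    os.close(write_end)
    reported = int(os.read(read_end, 32))
    os.waitpid(pid, 0)                # reap the child
    print(reported == os.getpid())    # → True
```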

When a child process dies (either because its parent forwarded it a SIGNAL or because it received one itself), what really happens is that the child’s state is set to a special value, called EXIT_ZOMBIE (this is the real, actual, official name!). At that point, the process still technically exists (it takes a slot in the maximum number of processes, etc.), and a signal called SIGCHLD (for SIGnal CHiLD) is sent to the parent. This signal basically tells the parent process that its child just died, and that it should do something about it. The parent then must reap the dead child process. Then, and only then, does the child process cease to exist.
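We can even watch this EXIT_ZOMBIE state: here is a small Python sketch that forks a child which exits immediately, and reads its state from /proc before reaping it:

```python
import os
import time

# Create a zombie on purpose: the child exits immediately, but the
# parent waits before reaping it with waitpid(). In the meantime, the
# child shows up with state 'Z' (EXIT_ZOMBIE) in /proc/<pid>/stat.
pid = os.fork()
if pid == 0:
    os._exit(0)                       # the child dies right away
time.sleep(0.2)                       # give it time to actually exit
with open(f"/proc/{pid}/stat") as stat_file:
    # Format: "<pid> (<comm>) <state> ..."; take the field after ')'.
    state = stat_file.read().rsplit(")", 1)[1].split()[0]
print(state)                          # → Z
os.waitpid(pid, 0)                    # reap it: the zombie disappears
```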

But what if the parent never gets a chance to reap the child process? Well, we will emulate this behavior: open a terminal and run this command: nohup tail -f test &. As before, the & makes tail run in the background. The nohup directive makes tail immune to the SIGHUP signal that would otherwise reach it when its parent (bash/zsh) terminates.

Let’s check: ps -Ao ppid,pid,user,fname:

1067 16127 nschoe   zsh
16127 16950 nschoe   tail
2189 16959 nschoe   ps

We’re getting used to it: we can see zsh has PID 16127 and tail has PID 16950, with parent PID 16127, which is zsh. Classic.

Now hit CTRL + D. Your terminal will most likely complain with something like zsh: you have running jobs. Ignore that and hit CTRL + D again; it should work this time, and your terminal should exit, as before.

Now, let’s see what happened to tail: ps -Ao ppid,pid,user,fname:

1 16950 nschoe   tail
2189 17003 nschoe   ps

First, we see that zsh doesn’t appear anymore, which is normal, because we killed it. But interestingly, we can see that tail still exists! It was not killed. We know it’s the same tail process because it has the same PID (PID numbers can be re-used, but in this case, it really is the same process!). Even more interestingly, we can see that tail’s parent PID is now 1. And this time, you should have 1 too.

This is another key concept of how Linux processes work: there really is one process to control them all. In Linux, there always is a top-most, parent-most process, called “the init process”. It used to really be called “init”, but it’s very likely yours is called “systemd” now. You can see it with ps -o ppid,pid,user,fname -p 1 (be sure to remove the A). It should return something like:

0     1 root     systemd

The first three columns should be identical: we displayed the process with PID 1. This process is launched by root and has no parent (hence 0 as parent PID). What’s susceptible to change, is the name of the process. Most likely you should have systemd, but it’s still possible that you have init. Anyway, the very first process in a Linux system, is always the init process. And there can be only one.

This process is the first one launched when the system boots, and the last one killed when the system shuts down. This is its first role.
Its second role is precisely what we’ve just seen: to become the parent of processes that do not have a parent anymore. This just happened to tail: we killed bash/zsh (and prevented the forwarding of the SIGNAL), so tail became orphaned (this is also the correct, official term!). Then systemd/init adopted it and became its parent.

You can always “go back” to the first process, the init process: run ps -Ao ppid,pid,user,fname again and pick a PID, whichever you want; for me, 5318 6194 nschoe site, so PID 6194. It has parent PID 5318. Now I’ll display information about the parent process: ps -o ppid,pid,user,fname -p 5318 (replace with your parent PID).
It shows 1067 5318 nschoe zsh. So the parent was zsh. This zsh has parent PID 1067; let’s print its information: ps -o ppid,pid,user,fname -p 1067, which returns: 1 1067 nschoe .urxvtd-.
And here we are! The parent process is urxvtd (this is my terminal emulator, yours might be gnome-terminal or konsole for instance), and this time, the parent PID is 1, init.

Every process has systemd/init as a distant parent, be it the direct parent, the grandparent, the great-grandparent, etc.
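We can automate this walk up the process tree with a short Python sketch that reads the PPid field from /proc (the PIDs in the comment are illustrative; yours will differ):

```python
import os

def parent_of(pid):
    # Read the PPid field from /proc/<pid>/status.
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("PPid:"):
                return int(line.split()[1])

# Climb from the current process up to the init process (PID 1).
pid = os.getpid()
chain = [pid]
while pid > 1:
    pid = parent_of(pid)
    chain.append(pid)
print(chain)   # e.g. [6194, 5318, 1067, 1] -- your PIDs will differ
```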

What Does it Have to do With Namespaces?

You thought I had forgotten?
We haven’t been avoiding namespaces; actually, we have been laying the bricks for understanding them. Keep in mind everything we have seen about child and parent processes and the init process, as it will be useful in a minute.

Now, there is something you and I have been doing for some time now which will be crucial to understanding Docker containers and isolation. We have launched several terminals and several programs (like tail, for instance). Then we have run ps, which allowed us to observe (I should say “spy on”, really) other processes. And with kill we have, well… killed other processes.

And believe it or not, this is the key to understanding all of this: we have made processes interact with each other. This is fabulous, because it allowed us to do everything, but it is also a disaster, because it means that if we have some process we want to keep running, other processes could kill it, or inspect it. And this is the opposite of isolation!

Well, all of this is possible because all of these processes run in the same namespace. To put it simply, we can consider that a namespace is an init tree. Here, we have one init process, which is the (more or less distant) parent of every other process running: it defines one namespace. One key concept of Docker, and of process containerization in general, is to create separate namespaces.

A typical namespace, like you have right now on your computer looks like this:

1 init
    |-- 6728 zsh
    |   |
    |   |-- 6729 tail
    |-- 7839 firefox
    |   |
    |   |-- 7840 firefox (tab 1)
    |   |-- 7841 firefox (tab 2)
    |   |-- (...)

The top-most process has PID 1 and is what we call the init process (most likely called systemd on your machine). Then this init process has direct children, here we can see two: zsh with PID 6728 and firefox with PID 7839.
Both zsh and firefox have children of their own, as you can see. The figure above forms a tree.

Now what happens with containerization and Docker? Well, if you want to run isolated processes, the first thing you need is for these processes not to be able to do what we have been doing up until now, i.e. spy on other processes and interact with them. You need to completely isolate them.
The way this is done is by creating a second init process tree, i.e. a second namespace.
Let’s say we want to containerize nginx, a web server. Nginx is started from a shell, bash for instance. We’d like bash and nginx to be isolated from the rest of the system, so we have to “make them believe” they are in their own namespace. So based on what we’ve seen so far, they need their own PID 1 init process. In this case, bash can be the PID 1 init process, and it will be nginx’s parent process.

But of course, we actually have only one machine (the host) and one operating system (our Linux distribution), because we are not running a virtual machine, so whatever programs we launch (and that includes bash and nginx), they will be child processes of the “real” PID 1 init process, the one running on our system, i.e. systemd. Here is what the process tree will look like:

1 init
    |-- 6728 zsh
    |   |
    |   |-- 6729 tail
    |-- 7839 firefox
    |   |
    |   |-- 7840 firefox (tab 1)
    |   |-- 7841 firefox (tab 2)
    |   |-- (...)
    |    ___________________________________________
    |   |            isolated process tree          |
    |   |                                           |
    |   |-- 8937 (1) bash                           |
    |   |   |                                       |
    |   |   |-- 8938 (4539) nginx                   |
    |   |       |                                   |
    |   |       |-- 8939 (4540) nginx (worker)      |
    |   |___________________________________________|
You recognize the first items of the tree: we have our machine’s PID 1 process, init. It started zsh and firefox as it did previously, and they have started child processes themselves.
Now the new part: we have our isolated process tree, or namespace, which I have artistically ASCII-art-decorated. In this isolated tree, bash has PID 1 (the number enclosed in parentheses). This bash started another process, nginx, which has PID 4539. Nginx usually comprises a core process, which reads its configuration and creates children as needed to handle requests; here it created a child -called a “worker”- whose PID is 4540.

When we are more experienced with Docker, we’ll come back to this and actually see it for ourselves, but for now, believe me when I say that if we “logged into” this isolated environment and ran ps, we would only see this.

But the truth is, all these bash and nginx processes (everything that is part of the isolated process tree) actually run on the host Linux system, right? They are “classical” processes, and they must have PIDs there. This is the number I wrote before the parentheses. This extremely important and useful feature, which allows a process to have several PIDs, was introduced in version 2.6.24 of the Linux kernel, in 2008!

So this is what we are talking about when we mention namespaces: nested sets of process trees in which processes can’t “get out”. From inside the isolated process tree, you cannot observe processes outside of it by running ps and you definitely can’t kill them with kill. This is the first step of program isolation which Docker uses.
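We can actually observe a process’s PID namespace: the kernel exposes it as a symlink under /proc/<pid>/ns/. Here is a small Python sketch comparing a parent and its forked child (the inode number in the comment is illustrative):

```python
import os

# Each process's PID namespace is exposed as a symlink in /proc.
# A forked child stays in its parent's namespace, so both resolve
# /proc/self/ns/pid to the same identifier (e.g. 'pid:[4026531836]').
read_end, write_end = os.pipe()
pid = os.fork()
if pid == 0:                          # child: report its namespace id
    os.close(read_end)
    os.write(write_end, os.readlink("/proc/self/ns/pid").encode())
    os._exit(0)
os.close(write_end)
child_ns = os.read(read_end, 64).decode()
os.waitpid(pid, 0)                    # reap the child
print(os.readlink("/proc/self/ns/pid") == child_ns)   # → True
```

A containerized process would show a different identifier here, which is exactly the separation we have been describing.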

Why “the first step”? Why isn’t it enough? Well, there are still plenty of ways these isolated processes can interact with the host system: we haven’t protected the filesystem, so they can read/write the host’s files, and they can run very expensive computing operations and take all the CPU and RAM, etc. For now, we have isolated the processes from seeing and directly interacting with each other, but Docker goes even further.

Let’s take a little break and enjoy the fact that we finally can put something concrete on the notions of “namespace” and “isolation”.

The Control Groups

Control Groups (a.k.a. “cgroups”) are another feature of the Linux kernel that Docker uses to provide isolation. They solve the problem we just introduced about computer resources.
Basically, cgroups are a tool to allocate and monitor the resources that a group of processes (for instance, our isolated namespace) can use.

If you are running several isolated process trees, you’d like to control their resource usage: for instance, the first group of processes may be running a critical software stack, so you might want to allocate it 75% of the CPU and 80% of the RAM; while your second process tree might be expendable, so you limit it to 10% of these resources.

All of this is possible thanks to cgroups.
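To make this a bit more concrete, here is a hedged sketch of what limiting a group of processes by hand looks like with cgroups v2. It assumes root privileges and a kernel with the cgroup v2 hierarchy mounted at /sys/fs/cgroup (the group name "expendable" is just an example); Docker does the equivalent for you under the hood:

```shell
# Create a cgroup, cap it at 10% of one CPU and 64MB of RAM,
# then move the current shell (and its future children) into it.
mkdir /sys/fs/cgroup/expendable
echo "10000 100000" > /sys/fs/cgroup/expendable/cpu.max    # 10ms of CPU time every 100ms
echo $((64 * 1024 * 1024)) > /sys/fs/cgroup/expendable/memory.max
echo $$ > /sys/fs/cgroup/expendable/cgroup.procs
```

The files cpu.max, memory.max and cgroup.procs are plain virtual files: cgroups are configured entirely by reading and writing them, which is also how monitoring tools collect their numbers.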

We won’t go into much detail about cgroups because they are less essential than namespaces to understanding Docker. Usually when using Docker you create containers and care about how to make them communicate, etc. Only when you begin running serious stacks of containers do you care about controlling the resources.
But it was still important to mention cgroups, because the docker stats command we’ll use later relies on them.

It’s all About Layers

Layers are the other “magic” component of Docker, and they solve the other problem we talked about: processes can still read and write the filesystem, somewhat breaking the isolation.

Talking about layers in Docker without talking about images and containers would be a challenge, and a pretty useless one in my opinion.
However, I’d like to avoid saying too much about Docker images and containers in this post, because the second article will be about them.
So try to focus on the meaning and not on the details for this part. I hate articles that say “trust me” and obscure things, but in this case, I don’t really have a choice, otherwise this article would grow too long.
I promise I’ll cover images and containers in detail in the next article, and I’ll even clarify this notion of layers, but for the moment, focus on the idea behind it.

If we want to present a truly isolated environment to our processes, we must hide the host filesystem from them and expose a clean, predictable one.

What does this mean exactly?

It means that for our containerized process, the filesystem should look like a fresh install: there must be no /home/nschoe, no files under /tmp used by the host’s applications, etc. Conversely, it means that when the containerized process writes a file to /etc/something.conf, it should not appear in the host’s filesystem.
The containerized process should even be able to run rm -rf /* without impacting the host’s filesystem at all.
How is this magic possible?

Docker makes use of union filesystems. What’s that?

A union filesystem is a filesystem that makes use of union mounts.

Wait wait wait! Put down that weapon, I’m just about to explain that!

You know what a filesystem is? Examples of filesystems are ext2/3/4, FAT, NTFS, ReiserFS, etc. A filesystem is a way to manage data on the hard drive. It handles the notions of files, directories, symbolic/hard links, etc. This might seem trivial, but it’s really not.
When you have a big file on your computer, say 10GB big, how do you store it? Do you find a slot on your hard drive big enough to fit the whole 10GB, or do you break the file up into ten smaller 1GB parts? But then, how do you keep track of the location of each of these parts? When your OS asks for the size of your file, do you have it stored as metadata, or do you compute it by iterating through the blocks each time? Etc.

All of this is handled by the filesystem. If you’re running a classic Linux installation, you are very, very likely to have either ext3 or ext4 as your filesystem, unless you manually specified another, in which case you probably know what you’re doing.

Let’s get back to our topic. A union filesystem is not a filesystem in the same sense as the ones I cited previously. Rather, it relies on one of those, and then implements union mounts. Union mounts are conceptually simple yet very useful: they take two or more directories and present a unified view of them at a specified mount point.
Let’s take a simple example, suppose we have two directories dir1/ and dir2/, each containing files, as such:

dir1/           dir2/
|               |
|-- file1.txt   |-- file4.mp3
|-- file2.ods   |-- file5.txt
|-- file3.iso   |-- file2.ods
                |-- file6.jpg

Well, a union mount of dir1/ and dir2/ at mount point /path/to/mnt/ would give:

|-- file1.txt
|-- file2.ods
|-- file3.iso
|-- file4.mp3
|-- file5.txt
|-- file6.jpg

So you would transparently see the contents of both dir1/ and dir2/ at location /path/to/mnt/.
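We can recreate this example on any Linux box. The actual union mount (shown here with OverlayFS, one of the union filesystems Docker can use) needs root, so it is left as a comment; but even without mounting anything, the set of names that would become visible is simply the deduplicated union of the two directories’ entries:

```shell
# Recreate dir1/ and dir2/ from the example above.
mkdir -p dir1 dir2 merged work
touch dir1/file1.txt dir1/file2.ods dir1/file3.iso
touch dir2/file4.mp3 dir2/file5.txt dir2/file2.ods dir2/file6.jpg

# With root, OverlayFS would present the unified view at merged/
# (dir2 as the "upper" directory, dir1 as the "lower" one):
#   mount -t overlay overlay \
#         -o lowerdir=dir1,upperdir=dir2,workdir=work merged

# The names visible at the mount point: the union, without duplicates.
{ ls dir1; ls dir2; } | sort -u
```

The listing contains six names, not seven: file2.ods exists in both directories but appears only once in the unified view, which is exactly the shadowing question addressed next.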
Union mount brings two good features to the table:

  • transparently making several directories’ content available at a single, unified mount point (directory)
  • shadowing files

This second notion probably answers the question on your lips right now: “what about file2.ods?!”
If you look closely at the example above, you can see that both dir1/ and dir2/ have a file named file2.ods. So what happens?
Well, it depends on the union filesystem, but most of the time there is a notion of precedence. In OverlayFS, for instance, when unifying two directories you have an “upper” and a “lower” directory, and the “upper” directory takes precedence.
So when two files have the same name, the one in the “upper” directory wins.

Without going into more detail, this precedence solves one problem (“what do we do when two files have the same name?”) but raises another, more subtle one. What happens if I want to delete file2.ods? Simply removing it won’t work: that only removes it from the “upper” directory, and since there is then no name conflict anymore, file2.ods will still be visible, except this time it will be the file2.ods from the “lower” directory. To solve this problem, union filesystems generally use a “whiteout” file (implementations vary): when you delete a file, rather than physically deleting it, the filesystem adds a special entry that takes precedence and “masks” the file to delete.

Anyway, this is a detail of implementation that we are not yet ready for.

Why Did You Talk About Layers in the First Place?

Because it’s very, very important to Docker, and is arguably one of its core features (along with process isolation). When you create a Docker container, as I said before, it has to run in its own “place”, with its own set of processes (we’ve already covered that) and its own filesystem.

And union filesystems (along with union mounts) are how it’s done: when you create a new container, Docker makes use of union mounts to mount all essential files and directories (this is why/how you get a classic Linux filesystem hierarchy inside your containers: /, /etc, /var, /usr, /home, etc.), and by making extensive use of shadowing, it can effectively “delete” or “mask” everything that’s related to the host. This is why you do get an /etc directory in your container, but you don’t get your specific, host-related /etc/file.conf.
In the same way, this is how it allows you to write files in your container without “polluting” your host environment and other containers.

Actually, union filesystems are used extensively in another part of Docker: containers and images. But this is a topic of its own, one which is very often confused, so I’d like to take some proper time to explain it. In another post.


Depending on your personality and expectations, you might be frustrated after reading this first article, because I did not use a single docker command.
Too often I see people coming on the #docker IRC chan and asking questions that show they are approaching Docker as if it were a new text editor. Docker is not simple. The tools on which Docker relies are not trivial. Docker does a wonderful job of packaging complex, low-level tools behind a simple, beautiful API (and it has very well-written documentation). But the overall system remains a beautifully complex one.
It is relatively easy to get a few containers running with Docker, because it does such a beautiful job of abstracting the complexity away, but if you don’t take the time to analyze, document and understand what is going on underneath, you will quickly run into walls: you won’t understand the notion of images vs. containers (topic of the next post!), you won’t be able to share your environment, you will have your disk space used up by unnecessary redundancy, etc.

On the contrary, you don’t need to be an expert in every one of the details we saw: I am not myself. This is why I did not go into too much detail. Feel free to read up more on the topics you want, and if there are topics you’d like me to talk about more in depth, contact me.

I hope I was clear enough and that I shed some light on concepts that were obscure; if you still have gray areas, feel free to contact me or poke me (nschoe) on IRC (#docker).
In the next article, I will relieve some of that frustration and we will begin playing with “real” docker commands. In the meantime, you don’t need to do anything specific; I’ll begin the next article with the installation instructions.

Part II is available to read here!

May 26, 2016