When a process running in a docker container exits, what happens to its child processes?
In my version of docker, at least, the processes running inside the container have different PID numbers than they do when viewed from the outside. In particular, the container’s root process is number 1 inside the container, and its parent is 0.
When a child’s parent (not the root) is killed, the child is reparented to the container’s root process. At least that’s what it looks like inside the container.
Although Docker appears to reparent orphans to the container’s root process, it can’t guarantee that the root process will reap dead children correctly.
A related problem is that docker stop
will send a SIGTERM
signal to
the container’s root process, but that process might not propagate the
signal to its own children. They’ll be orphaned, and probably get
reparented, but they won’t know they’re supposed to go away so they may
stick around.
These problems are described by the folks at Phusion in Docker and the PID 1 reaping problem. They offer a smarter replacement for init that solves the problems, along with some others like syslog and cron. (Personally, I lean toward making services be more “docker-native” – just log to stdout, so appliances like logspout can handle log aggregation for you.)
An alternative solution is tini, a minimal init that is intended to be used as the root process in a Docker container.
That’s the summary. If you want to experiment yourself, you might find the information below to be useful.
- Find out if Docker is using the Linux sub-reaper feature to allow the container’s root process to adopt orphaned processes.
- How does docker use cgroups? What can leak from this abstraction, and how does it handle these cases?
- Many of the reports of problems, such as Container cannot connect to upstart, are from the early days. Have these problems been better addressed in later Docker releases?
In normal Unix-land, outside of any container, what happens is that the
orphaned child processes are adopted by the init process. The init
process takes on the reponsibility of issuing a wait(2)
call (or
equivalent) when the child process exits (init will get a SIGCHLD signal
that tells it so). This action is known as “reaping” the dead child.
If no process ever reaps the dead child, it will stay around forever as
a “zombie” process.
Linux kernels after 3.4 offer a way that another process besides init can take on the responsibility of being notified about orphaned child processes, and waiting for them. See subreaper prctl operation for details.
When we start a container (docker run
) we specify a command that it
will use to create the top-level, or root, process in that container.
When that process exits, the container stops.
That root process may start others, which will see it as its parent.
To probe the filial behavior of processes, we use a simple C utility. It creates a child process, which creates a grandchild process, and they all print their pid and that of their parent. It’s easy to get confused over “child” and “parent” here, so we’ll use these names:
root | the top-level process that is running the utility |
child | the process started by the root |
grandchild | the process started by the child |
We’re interested in what happens to the grandchild process when the
child goes away. We use the rather ham-fisted technique of putting
calls to sleep(3)
in strategic places to force the grandchild to
sample its parent pid after the child has exited.
Let’s establish what our utility will see when we run it in the host environment. It will create a child and a grandchild, the child will exit, and then the grandchild will ask for its parent pid.
By the time the grandchild gets around to doing a getppid
, it will
already have been adopted by some reaper process.
./a.out
which | PID | parent |
Child | 26525 | 26524 |
Root | 26524 | 26523 |
Grandchild | 26526 | 2713 |
We see that the grandchild has been reparented to process 2713. That turns out to be a subreaper run by the window manager on my box.
If this were run on a server, it would likely be pid 1 (init) instead.
Inside the container, we see only our own processes, with PID numbers starting at 1.
Because we told Docker to start a.out
as the container’s top-level root process,
it appears as PID 1.
docker run --rm -v `pwd`:/src ubuntu /src/a.out
which | PID | parent |
Child | 5 | 1 |
Root | 1 | 0 |
Grandchild | 6 | 1 |
Outside the container, we can see those same processes, but they have different pids.
(I wonder what happened to processes 2, 3, and 4?)
tini minimal init
subreaper prctl operation enables “sub-init”. The new PR_SET_CHILD_SUBREAPER prctl() operation allows userspace service managers/supervisors mark itself as a sort of ‘sub-init’, able to stay as the parent for all orphaned processes created by the started services. All SIGCHLD signals will be delivered to the service manager.
Docker and the PID 1 reaping problem
Your docker image may be wrong