A detailed explanation of Docker's process model and Cgroup concepts

Process organization and relationships in a container

Process No. 0: containerd-shim

Role:

containerd-shim is the parent of the container's process 1. It manages the container's life cycle and handles the commands that are executed in the container.
It creates the container by calling runc.

Features:

Container dependency:
- containerd-shim is the ancestor process of the container. If it dies, the entire container exits.
Process management:
- If the container's process 1 ends, the remaining processes in the container's PID namespace are killed and containerd-shim reaps them.
- If process 1 keeps running but one of its child processes ends, the child becomes a zombie process until process 1 reaps it.
Signal handling:
- docker stop sends a SIGTERM (-15) signal to process 1 of the container and, if the container is still running after the grace period (10 seconds by default), a SIGKILL (-9).
  - If process 1 forwards signals, it passes the SIGTERM (-15) on to the other processes in the container so that they can shut down cleanly.
  - If process 1 does not forward signals, the other processes never see the SIGTERM; they are killed with SIGKILL (-9) when process 1 exits.

Process 1: The first process in the container

Role:

Process 1 is the first process started in the container, and its lifetime defines the life cycle of the container.
It is usually a user-specified application process.

Features:

Life cycle:
- When process 1 ends, the container ends as well.
- Process 1 must run in the foreground; otherwise the container exits.
Functional limitations (differences from process 1 of an operating system):
- Process 1 in the container is not necessarily the ancestor of all user processes (for example, processes started with docker exec are not its children).
- If process 1 becomes an orphan process, it will be adopted by the init process (process 0).
- A user-written process 1 may lack the ability to reap zombie processes and to forward signals.

Capabilities process 1 should have:

Reaping zombie processes: call wait or waitpid (typically when SIGCHLD arrives) to reap child processes that have exited.
Signal forwarding: forward received signals (such as SIGTERM) to the child processes.
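
The two capabilities above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Docker or any particular init program is implemented: the function name run_as_init is made up for this example, it forwards only SIGTERM and SIGINT, and it needs Python 3.9+ for os.waitstatus_to_exitcode.

```python
import os
import signal


def run_as_init(argv):
    """Run argv as a child, forward SIGTERM/SIGINT to it, and reap children.

    Returns the child's exit code (negative signal number if it was killed).
    """
    state = {"child": None}

    def forward(signum, _frame):
        # Signal forwarding: pass the signal on to the main child, if any.
        if state["child"] is not None:
            try:
                os.kill(state["child"], signum)
            except ProcessLookupError:
                pass  # the child has already exited

    # Install the handlers before forking so no signal is missed.
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, forward)

    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)  # replace the child with the real program
        finally:
            os._exit(127)  # only reached if exec failed
    state["child"] = pid

    # Zombie reaping: wait for any child (including adopted orphans)
    # until the main child has exited.
    exit_code = None
    while exit_code is None:
        reaped, status = os.wait()
        if reaped == pid:
            exit_code = os.waitstatus_to_exitcode(status)

    # Reap whatever else has already exited, without blocking.
    while True:
        try:
            reaped, _ = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left
        if reaped == 0:
            break  # children exist but none has exited yet
    return exit_code
```

A wrapper script could call run_as_init(sys.argv[1:]) as the container's entry point. In practice a ready-made init is usually preferable; docker run --init, for instance, inserts one for you.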

The relationship can be checked on the host (text is the container name):

~# docker container inspect text | grep -i pid
pid:41404    (process 1 in the container)
~# ps -elf | grep 41404
pid:41404 ppid:41382    (the parent, 41382, is process 0 of the container)

Three ways a process can respond to a signal

A signal is a notification that the operating system sends to a process to inform it of an event. Signals can be used for inter-process communication or to control process behavior.

Ignore:
- The process takes no action when the signal arrives.
- Example: a process that ignores SIGTERM is not terminated by it.
Catch:
- The process registers a custom handler for the signal.
- When the signal arrives, the handler is executed.
- Example: trap 'echo "signal received"' SIGTERM
Default:
- Each signal has a default action defined by the operating system.
- Example:
  - The default action of SIGTERM is to terminate the process.
  - The default action of SIGKILL is to terminate the process immediately.
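
The three responses can be observed side by side with a small Python sketch that starts one short shell script per case. The script texts and the helper name run_case are made up for this illustration; each script sends SIGTERM to itself and the helper reports how the shell ended.

```python
import subprocess

# One tiny shell script per response; each sends SIGTERM to itself ($$).
SCRIPTS = {
    "ignore": "trap '' TERM; kill -TERM $$; echo survived",
    "catch": "trap 'echo caught' TERM; kill -TERM $$; echo continued",
    "default": "kill -TERM $$; echo unreachable",
}


def run_case(name):
    """Run one script and return (return code, words printed on stdout)."""
    result = subprocess.run(
        ["sh", "-c", SCRIPTS[name]], capture_output=True, text=True
    )
    # subprocess reports death by signal N as return code -N.
    return result.returncode, result.stdout.split()
```

With the ignore and catch scripts the shell keeps running after the signal, so it exits normally. With the default script it is terminated by SIGTERM before the echo runs, and the return code is the negative signal number.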

Two privileged signals

SIGKILL (-9)

Function: Forcibly terminates the process.
Features:
- Cannot be ignored.
- Cannot be caught.
Usage scenario: Forcibly terminating a process that is unresponsive.

SIGSTOP (-19)

Function: Pauses the process.
Features:
- Cannot be ignored.
- Cannot be caught.
Resuming: Send the SIGCONT (-18) signal.

By contrast, SIGTERM (-15) can be ignored or caught by a process.

Executing kill commands in a container

kill -9 1

Cannot kill process 1 when run inside the container.
Cause: Process 1 in the container is flagged SIGNAL_UNKILLABLE by the kernel.

kill -19 1

Cannot pause process 1 when run inside the container.
Cause: Process 1 in the container is flagged SIGNAL_UNKILLABLE by the kernel.

kill -15 1

May terminate process 1 in the container, depending on whether it has registered a SIGTERM handler.
If process 1 has not registered a handler for SIGTERM, the signal is ignored.
If process 1 has registered a handler for SIGTERM, the handler is executed.
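
The three cases above follow one rule, which can be written down as a small Python model. This is a simplified sketch of the behavior described in the pid_namespaces(7) manual page as I understand it, not kernel code; the function name and its parameters are invented for the illustration.

```python
import signal

# Signals for which no handler can ever be installed.
UNCATCHABLE = {signal.SIGKILL, signal.SIGSTOP}


def reaches_container_init(sig, has_handler, sent_from_inside):
    """Model whether a signal sent to the container's process 1 takes effect.

    sig              -- the signal being sent
    has_handler      -- True if process 1 registered a handler for sig
    sent_from_inside -- True if the sender is inside the container,
                        False if it is on the host (an ancestor namespace)
    """
    if sig in UNCATCHABLE:
        # SIGKILL/SIGSTOP are dropped when sent from inside the container,
        # but are forcibly delivered when sent from the host.
        return not sent_from_inside
    # Any other signal is delivered only if process 1 has a handler for it.
    return has_handler
```

This also explains why docker stop can always end a container: its final SIGKILL comes from the host, not from inside the container.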

Cgroup Introduction

Cgroup (control group) is a Linux kernel mechanism for limiting, controlling, and monitoring the resource usage of groups of processes. It lets system administrators manage, at a fine granularity, how much CPU, memory, disk I/O, and other resources a set of processes may use. Cgroups are the basis of resource isolation and management in container technologies such as Docker and Kubernetes.

Why use Cgroups? Their main purpose is to restrict how much of the host's resources a container or process group can use, so that one container or process cannot take up so many resources that it affects the others. With Cgroups, multiple containers or processes on a host can share system resources fairly and stably.

On Linux systems that use cgroup v1, you can view and manage Cgroups with the following commands:

# View the Cgroup subsystems (controllers) known to the kernel
cat /proc/cgroups
# Create a new Cgroup
mkdir /sys/fs/cgroup/cpu/my_cgroup
# Add a process to the Cgroup (replace PID with a real process ID)
echo "PID" > /sys/fs/cgroup/cpu/my_cgroup/tasks
# Set a CPU limit: 50000 μs of CPU time per period (50% of one CPU with the default 100000 μs period)
echo "50000" > /sys/fs/cgroup/cpu/my_cgroup/cpu.cfs_quota_us

CFS-related parameters in CPU Cgroup

CFS (Completely Fair Scheduler) is the scheduling algorithm the Linux kernel uses to allocate CPU time fairly among processes. The CFS-related parameters of a Cgroup determine how much CPU its process group can use.

cpu.cfs_period_us: the length of one scheduling period, in microseconds (μs). For example, a value of 100,000 μs means that one period lasts 100 ms.
cpu.cfs_quota_us: the maximum CPU time that the processes in the control group can use during one period. For example, a value of 50,000 μs means that the processes can use up to 50 ms of CPU time in each 100 ms period. The CPU limit is then 50 ms / 100 ms = 0.5, that is, 50% of one CPU.
cpu.shares: controls how CPU resources are divided among control groups at the same level. It takes effect only when the host is short of CPU resources, and then determines the proportion of CPU each control group receives. The larger the value, the more CPU the group is allocated.

Summary:

cpu.cfs_quota_us and cpu.cfs_period_us together determine the maximum amount of CPU that all processes in a control group can use.
cpu.shares determines the relative share of CPU among the control groups under the CPU Cgroup subsystem. However, this ratio only takes effect among the control groups when the CPUs on the system are fully utilized.
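
The arithmetic behind these parameters is simple enough to sketch in Python. The function names below are invented for this example, and the second function assumes the CPU is fully contended and that every group wants as much CPU as it can get.

```python
def cpu_limit(quota_us, period_us=100_000):
    """Return the hard CPU limit in CPUs, or None if there is no limit.

    In cgroup v1, a quota of -1 means the group is not limited.
    """
    if quota_us < 0:
        return None
    return quota_us / period_us


def share_of_cpu(shares):
    """Split a fully loaded CPU among sibling groups by their cpu.shares.

    shares maps a (hypothetical) group name to its cpu.shares value.
    """
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}
```

For example, cpu_limit(50_000, 100_000) gives 0.5, matching the 50% example above, while share_of_cpu({"a": 1024, "b": 3072}) gives group a one quarter and group b three quarters of the contended CPU.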

Resource Management in Kubernetes

In Kubernetes, a Pod's resource requests and limits are set through requests and limits.

requests: the minimum resources the Pod requires. Kubernetes schedules Pods based on requests, so that the chosen node has enough resources. For CPU, requests corresponds to cpu.shares in the Cgroup and expresses the amount of CPU the Pod initially asks for. Actual usage can exceed requests, and under contention the Pod is still entitled to its requested share.
limits: the upper bound on the Pod's resource usage. Kubernetes enforces it through Cgroups so that usage does not exceed the value set by limits. For CPU, limits corresponds to cpu.cfs_quota_us and cpu.cfs_period_us in the Cgroup, which together form a hard upper limit on CPU usage.

Note, however, that whether a Pod can actually reach the upper limit set by limits depends on the resources available on the host. If host resources are insufficient, the Pod may not reach it.
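
To make the mapping concrete, here is a hedged Python sketch of how CPU values in millicores (for example 500m) might be converted into the cgroup v1 parameters. The constants and rounding are modelled on the conversion commonly attributed to the kubelet and should be treated as assumptions; the function names are invented for this example.

```python
SHARES_PER_CPU = 1024   # assumed: cpu.shares value that represents one full CPU
MIN_SHARES = 2          # assumed: smallest cpu.shares value that is used
MILLI_PER_CPU = 1000    # 1000 millicores = 1 CPU
MIN_QUOTA_US = 1000     # assumed: smallest quota that is used (1 ms)


def request_to_shares(milli_cpu):
    """Convert a CPU request in millicores to a cpu.shares value."""
    shares = milli_cpu * SHARES_PER_CPU // MILLI_PER_CPU
    return max(MIN_SHARES, shares)


def limit_to_quota(milli_cpu, period_us=100_000):
    """Convert a CPU limit in millicores to a cpu.cfs_quota_us value."""
    quota = milli_cpu * period_us // MILLI_PER_CPU
    return max(MIN_QUOTA_US, quota)
```

Under these assumptions, a request of 500m becomes cpu.shares = 512, and a limit of 500m becomes cpu.cfs_quota_us = 50000 with the default 100000 μs period, which is the 50% case described earlier.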

memory cgroup

Each container has a corresponding memory Cgroup control group, located at /sys/fs/cgroup/memory//docker-xxx, which is used to manage the container's memory.

Main parameters:

memory.limit_in_bytes: the limit on the physical memory that all processes in the container can use. A child group's limit can be at most the value of its parent group.
memory.oom_control: defaults to 0, which means the OOM killer is enabled; writing 1 disables it, for example with echo 1 > memory.oom_control.
memory.usage_in_bytes: read-only; shows the total physical memory used by all processes in the container, including page cache.

Example: When the container starts, the RSS is 100 MB and the page cache is 899 MB, so the total memory usage is 999 MB. As the process runs and requests more memory, the RSS grows to 200 MB and the page cache shrinks to 799 MB. The total memory usage is still 999 MB, but the physical memory actually occupied by the process has increased.
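
Because the total alone cannot tell these two situations apart, it helps to look at RSS and page cache separately. In cgroup v1 both appear in the group's memory.stat file as the rss and cache lines. The Python sketch below parses text in that format; the helper names and the sample numbers are invented for the illustration.

```python
MIB = 1024 * 1024


def parse_memory_stat(text):
    """Parse 'name value' lines, as found in memory.stat, into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def split_usage(stats):
    """Return RSS and page cache in MiB from parsed memory.stat values."""
    return {
        "rss_mib": stats["rss"] // MIB,
        "cache_mib": stats["cache"] // MIB,
    }


# Hypothetical memory.stat excerpt for the second state in the example above.
SAMPLE = f"cache {799 * MIB}\nrss {200 * MIB}\n"
```

Calling split_usage(parse_memory_stat(SAMPLE)) separates the 200 MiB that the processes really hold from the 799 MiB of page cache that the kernel can reclaim under memory pressure.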

Disk quotas for containers

By default, the disk space available to a container is not limited.

File system in container = lowerdir + upperdir

Write operation

When something is written to the container's file system (in a directory where no external storage volume is mounted):
- The data is written to the upperdir layer, that is, to the host's disk.
- Therefore, without any restriction, a container can easily fill up the host's disk space.

How to solve the problem? There are two ways:

- Set a quota on the disk space available to the container.
- Mount a dedicated external storage volume on the directories the container writes to (recommended).
