fork: extend clone3() to support setting a PID
The main motivation to add set_tid to clone3() is CRIU. To restore a process with the same PID/TID CRIU currently uses /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to ns_last_pid and then (quickly) does a clone(). This works most of the time, but it is racy. It is also slow as it requires multiple syscalls. Extending clone3() to support *set_tid makes it possible restore a process using CRIU without accessing /proc/sys/kernel/ns_last_pid and race free (as long as the desired PID/TID is available). This clone3() extension places the same restrictions (CAP_SYS_ADMIN) on clone3() with *set_tid as they are currently in place for ns_last_pid. The original version of this change was using a single value for set_tid. At the 2019 LPC, after presenting set_tid, it was, however, decided to change set_tid to an array to enable setting the PID of a process in multiple PID namespaces at the same time. If a process is created in a PID namespace it is possible to influence the PID inside and outside of the PID namespace. Details also in the corresponding selftest. To create a process with the following PIDs: PID NS level Requested PID 0 (host) 31496 1 42 2 1 For that example the two newly introduced parameters to struct clone_args (set_tid and set_tid_size) would need to be: set_tid[0] = 1; set_tid[1] = 42; set_tid[2] = 31496; set_tid_size = 3; If only the PIDs of the two innermost nested PID namespaces should be defined it would look like this: set_tid[0] = 1; set_tid[1] = 42; set_tid_size = 2; The PID of the newly created process would then be the next available free PID in the PID namespace level 0 (host) and 42 in the PID namespace at level 1 and the PID of the process in the innermost PID namespace would be 1. The set_tid array is used to specify the PID of a process starting from the innermost nested PID namespaces up to set_tid_size PID namespaces. set_tid_size cannot be larger then the current PID namespace level. Signed-off-by: Adrian Reber <areber@redhat.com> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Acked-by: Andrei Vagin <avagin@gmail.com> Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This commit is contained in:

committed by
Christian Brauner

parent
17a810699c
commit
49cb2fc42c
@@ -39,24 +39,38 @@
|
||||
#ifndef __ASSEMBLY__
|
||||
/**
|
||||
* struct clone_args - arguments for the clone3 syscall
|
||||
* @flags: Flags for the new process as listed above.
|
||||
* All flags are valid except for CSIGNAL and
|
||||
* CLONE_DETACHED.
|
||||
* @pidfd: If CLONE_PIDFD is set, a pidfd will be
|
||||
* returned in this argument.
|
||||
* @child_tid: If CLONE_CHILD_SETTID is set, the TID of the
|
||||
* child process will be returned in the child's
|
||||
* memory.
|
||||
* @parent_tid: If CLONE_PARENT_SETTID is set, the TID of
|
||||
* the child process will be returned in the
|
||||
* parent's memory.
|
||||
* @exit_signal: The exit_signal the parent process will be
|
||||
* sent when the child exits.
|
||||
* @stack: Specify the location of the stack for the
|
||||
* child process.
|
||||
* @stack_size: The size of the stack for the child process.
|
||||
* @tls: If CLONE_SETTLS is set, the tls descriptor
|
||||
* is set to tls.
|
||||
* @flags: Flags for the new process as listed above.
|
||||
* All flags are valid except for CSIGNAL and
|
||||
* CLONE_DETACHED.
|
||||
* @pidfd: If CLONE_PIDFD is set, a pidfd will be
|
||||
* returned in this argument.
|
||||
* @child_tid: If CLONE_CHILD_SETTID is set, the TID of the
|
||||
* child process will be returned in the child's
|
||||
* memory.
|
||||
* @parent_tid: If CLONE_PARENT_SETTID is set, the TID of
|
||||
* the child process will be returned in the
|
||||
* parent's memory.
|
||||
* @exit_signal: The exit_signal the parent process will be
|
||||
* sent when the child exits.
|
||||
* @stack: Specify the location of the stack for the
|
||||
* child process.
|
||||
* @stack_size: The size of the stack for the child process.
|
||||
* @tls: If CLONE_SETTLS is set, the tls descriptor
|
||||
* is set to tls.
|
||||
* @set_tid: Pointer to an array of type *pid_t. The size
|
||||
* of the array is defined using @set_tid_size.
|
||||
* This array is used to select PIDs/TIDs for
|
||||
* newly created processes. The first element in
|
||||
* this defines the PID in the most nested PID
|
||||
* namespace. Each additional element in the array
|
||||
* defines the PID in the parent PID namespace of
|
||||
* the original PID namespace. If the array has
|
||||
* less entries than the number of currently
|
||||
* nested PID namespaces only the PIDs in the
|
||||
* corresponding namespaces are set.
|
||||
* @set_tid_size: This defines the size of the array referenced
|
||||
* in @set_tid. This cannot be larger than the
|
||||
* kernel's limit of nested PID namespaces.
|
||||
*
|
||||
* The structure is versioned by size and thus extensible.
|
||||
* New struct members must go at the end of the struct and
|
||||
@@ -71,10 +85,13 @@ struct clone_args {
|
||||
__aligned_u64 stack;
|
||||
__aligned_u64 stack_size;
|
||||
__aligned_u64 tls;
|
||||
__aligned_u64 set_tid;
|
||||
__aligned_u64 set_tid_size;
|
||||
};
|
||||
#endif
|
||||
|
||||
#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
|
||||
#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
|
||||
|
||||
/*
|
||||
* Scheduling policies
|
||||
|
Reference in New Issue
Block a user