[Devel,RHEL7,COMMIT] ms/pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()

Submitted by Konstantin Khorenko on May 16, 2017, 4:23 p.m.

Details

Message ID 201705161623.v4GGN4CB021508@finist_cl7.x64_64.work.ct
State New
Series "pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()"

Commit Message

The commit is pushed to "branch-rh7-3.10.0-514.16.1.vz7.32.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.16.1.vz7.32.3
------>
commit 21c4940330d1b414cae840280b4c6bd8b00d4b89
Author: Kirill Tkhai <ktkhai@virtuozzo.com>
Date:   Tue May 16 20:23:04 2017 +0400

    ms/pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()
    
    This will go to mainstream:
    https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git/commit/?h=for-linus&id=3fd37226216620c1a468afa999739d5016fbc349
    
    Imagine we have a pid namespace and a task from its parent pid_ns
    that has done setns() into the pid namespace. The task is doing fork()
    while the pid namespace's child reaper is dying. The two race with
    each other:
    
    Task from parent pid_ns             Child reaper
    copy_process()                      ..
      alloc_pid()                       ..
      ..                                zap_pid_ns_processes()
      ..                                  disable_pid_allocation()
      ..                                  read_lock(&tasklist_lock)
      ..                                  iterate over pids in pid_ns
      ..                                    kill tasks linked to pids
      ..                                  read_unlock(&tasklist_lock)
      write_lock_irq(&tasklist_lock);   ..
      attach_pid(p, PIDTYPE_PID);       ..
      ..                                ..
    
    So the just-created task p won't receive SIGKILL,
    and the pid namespace will be left in a contradictory state.
    Only a manual kill will help there, but does userspace
    care about this? I suppose most users just inject
    a task into a pid namespace and wait for a SIGCHLD from it.
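
The interleaving in the diagram can be replayed in a small single-threaded userspace model. This is only a sketch: every `model_*` name is hypothetical and stands in for the kernel function named in its comment, and the locks are not modelled at all, because the point is just the order of the steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel API. */
#define MODEL_MAX_TASKS 8

struct model_task {
    bool attached;   /* linked to a pid in the namespace */
    bool killed;     /* has received the model's SIGKILL */
};

struct model_pid_ns {
    bool adding;                               /* stands in for PIDNS_HASH_ADDING */
    struct model_task tasks[MODEL_MAX_TASKS];
    size_t nr_tasks;                           /* slots handed out so far */
};

/* Models alloc_pid(): reserves a slot, but the task is not attached yet. */
static struct model_task *model_alloc_pid(struct model_pid_ns *ns)
{
    if (!ns->adding || ns->nr_tasks == MODEL_MAX_TASKS)
        return NULL;
    return &ns->tasks[ns->nr_tasks++];
}

/* Models zap_pid_ns_processes(): stop allocation, kill what is attached. */
static void model_zap(struct model_pid_ns *ns)
{
    ns->adding = false;
    for (size_t i = 0; i < ns->nr_tasks; i++)
        if (ns->tasks[i].attached)
            ns->tasks[i].killed = true;
}

/* Models the pre-patch attach step: the flag is not re-checked. */
static void model_attach_unchecked(struct model_task *t)
{
    t->attached = true;
}

/* Replays the diagram and returns how many attached tasks were never killed. */
static size_t model_race_survivors(void)
{
    struct model_pid_ns ns = { .adding = true };
    struct model_task *p = model_alloc_pid(&ns);   /* left column  */
    size_t alive = 0;

    assert(p != NULL);
    model_zap(&ns);                                /* right column */
    model_attach_unchecked(p);                     /* left column  */

    for (size_t i = 0; i < ns.nr_tasks; i++)
        if (ns.tasks[i].attached && !ns.tasks[i].killed)
            alive++;
    return alive;
}
```

In this model `model_race_survivors()` reports one task that is attached but was never killed, which corresponds to the contradictory state described above.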
    
    The patch fixes the problem. It simply checks for
    (pid_ns->nr_hashed & PIDNS_HASH_ADDING) in copy_process().
    We do it under the tasklist_lock, so we can't miss a cleared
    PIDNS_HASH_ADDING, as noted by Oleg:
    
    "zap_pid_ns_processes() does disable_pid_allocation()
    and then takes tasklist_lock to kill the whole namespace.
    Given that copy_process() checks PIDNS_HASH_ADDING
    under write_lock(tasklist) they can't race;
    if copy_process() takes this lock first, the new child will
    be killed, otherwise copy_process() can't miss
    the change in ->nr_hashed."
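
Oleg's argument can be checked against a minimal userspace model of the two possible lock orderings. This is a sketch under stated assumptions: the `fix_model_*` names are hypothetical, and each function body is assumed to run atomically, standing in for the section the kernel executes with tasklist_lock held.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; the names are illustrative, not kernel API. */
struct fix_model_ns {
    bool adding;     /* stands in for PIDNS_HASH_ADDING in ->nr_hashed */
    int attached;    /* tasks linked into the namespace */
    int killed;      /* attached tasks that were sent the model's SIGKILL */
};

/*
 * Models the patched tail of copy_process(). The body is assumed to run
 * with tasklist_lock held for writing, so it cannot interleave with the
 * locked part of fix_model_zap().
 */
static int fix_model_fork(struct fix_model_ns *ns)
{
    if (!ns->adding)
        return -ENOMEM;      /* the reaper disabled allocation first */
    ns->attached++;          /* attach_pid() */
    return 0;
}

/*
 * Models zap_pid_ns_processes(): the flag is cleared first, then every
 * attached task is killed under the lock.
 */
static void fix_model_zap(struct fix_model_ns *ns)
{
    ns->adding = false;          /* disable_pid_allocation() */
    ns->killed = ns->attached;   /* read_lock(); kill all; read_unlock() */
}

/* Tasks that are attached but were never killed. */
static int fix_model_survivors(const struct fix_model_ns *ns)
{
    assert(ns->killed <= ns->attached);
    return ns->attached - ns->killed;
}

/* Ordering 1: copy_process() takes the lock first; returns the survivors. */
static int fix_model_fork_first(void)
{
    struct fix_model_ns ns = { .adding = true };
    int ret = fix_model_fork(&ns);

    fix_model_zap(&ns);
    return ret == 0 ? fix_model_survivors(&ns) : ret;
}

/* Ordering 2: the child reaper goes first; returns the fork result. */
static int fix_model_zap_first(void)
{
    struct fix_model_ns ns = { .adding = true };

    fix_model_zap(&ns);
    return fix_model_fork(&ns);
}
```

In the first ordering the new child is attached and then killed, so no task survives. In the second the fork fails with -ENOMEM and nothing is attached.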
    
    If allocation is disabled, we just return -ENOMEM,
    as is done for such cases in alloc_pid().
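
From userspace, this shows up as fork() failing with errno set to ENOMEM. The sketch below shows how a caller that injects a task might surface that error. `fork_and_wait()` is a hypothetical helper, not part of the patch, and it assumes the caller has already done setns() into the target pid namespace; that step is left out so the sketch runs unprivileged.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not from the patch. Forks a child that exits with
 * `code` and waits for it. Returns the child's exit status, 128 plus the
 * signal number if the child was killed by a signal, or a negative errno
 * if fork() or waitpid() failed.
 */
static int fork_and_wait(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        /*
         * -ENOMEM here may mean that the namespace's child reaper is
         * dying and pid allocation is disabled, not that memory is
         * exhausted.
         */
        return -errno;
    }
    if (pid == 0)
        _exit(code);          /* child: nothing else to do in this sketch */

    /* Parent: wait for the child, retrying if a signal interrupts us. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -errno;
    }

    /* With options == 0, waitpid() only reports terminated children. */
    assert(WIFEXITED(status) || WIFSIGNALED(status));
    return WIFEXITED(status) ? WEXITSTATUS(status) : 128 + WTERMSIG(status);
}
```

A caller that sees `-ENOMEM` from this helper right after setns() can treat it as "the namespace is going away" rather than retrying.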
    
    v2: Do not move disable_pid_allocation(), do not
    introduce a new variable in copy_process(), and simplify
    the patch as suggested by Oleg Nesterov.
    Account for the double irq enabling problem
    found by Eric W. Biederman.
    
    Fixes: c876ad768215 ("pidns: Stop pid allocation when init dies")
    Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
    
    CC: Andrew Morton <akpm@linux-foundation.org>
    CC: Ingo Molnar <mingo@kernel.org>
    CC: Peter Zijlstra <peterz@infradead.org>
    CC: Oleg Nesterov <oleg@redhat.com>
    CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
    CC: Michal Hocko <mhocko@suse.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: Andrei Vagin <avagin@openvz.org>
    CC: Cyrill Gorcunov <gorcunov@openvz.org>
    CC: Serge Hallyn <serge@hallyn.com>
    Cc: stable@vger.kernel.org
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    
    Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 kernel/fork.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)


diff --git a/kernel/fork.c b/kernel/fork.c
index 24e178f..0509a83 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1601,11 +1601,13 @@  static struct task_struct *copy_process(unsigned long clone_flags,
 	*/
 	recalc_sigpending();
 	if (signal_pending(current)) {
-		spin_unlock(&current->sighand->siglock);
-		write_unlock_irq(&tasklist_lock);
 		retval = -ERESTARTNOINTR;
 		goto bad_fork_cancel_cgroup;
 	}
+	if (unlikely(!(ns_of_pid(pid)->nr_hashed & PIDNS_HASH_ADDING))) {
+		retval = -ENOMEM;
+		goto bad_fork_cancel_cgroup;
+	}
 
 	if (likely(p->pid)) {
 		ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);
@@ -1655,6 +1657,8 @@  static struct task_struct *copy_process(unsigned long clone_flags,
 	return p;
 
 bad_fork_cancel_cgroup:
+	spin_unlock(&current->sighand->siglock);
+	write_unlock_irq(&tasklist_lock);
 	cgroup_cancel_fork(p, cgrp_ss_priv);
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)