restore: Serialize access to last_pid

Submitted by Cyrill Gorcunov on Nov. 15, 2019, 2:33 p.m.

Details

Message ID 20191115143334.16155-1-gorcunov@gmail.com
State New
Series "restore: Serialize access to last_pid"

Commit Message

Cyrill Gorcunov Nov. 15, 2019, 2:33 p.m.
When we clone threads at a later stage of
the restore procedure, they may race with
helpers which call clone_noasan themselves.

In particular, when restoring mount points
we use try_remount_writable, which is called
via call_helper_process and produces its own
fork.

Thus, to eliminate this race, let's allocate
the clone_noasan context in shared memory on
restore; then, once the processes are forked,
we pass @last_pid_mutex to the clone_noasan
engine, and all subsequent calls to
clone_noasan are serialized with thread
creation.

To be honest, I have no idea right now how
to provide a test for this. Our QA team
reported the issue as
 | (07.398000) pie: 20768: Error (criu/pie/restorer.c:1823): Unable to create a thread 20770: 20775
 | (07.398013) pie: 20768: Error (criu/pie/restorer.c:1828): Reread last_pid 20775

where we wrote 20769 into last_pid inside the
pie thread restore, it failed, and re-reading
last_pid returned 20775 instead of the expected
value, which means other forks happened in
between without taking the last_pid lock.

Also, I have big doubts about this architectural
solution and don't like it much; it was rather a
quick fix which addresses the problem. So any
other ideas are welcome.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 criu/clone-noasan.c         | 63 ++++++++++++++++++++++++++++++++++++-
 criu/cr-restore.c           |  6 ++++
 criu/include/clone-noasan.h |  5 +++
 3 files changed, 73 insertions(+), 1 deletion(-)


diff --git a/criu/clone-noasan.c b/criu/clone-noasan.c
index 5ca280eb8386..627d924d5346 100644
--- a/criu/clone-noasan.c
+++ b/criu/clone-noasan.c
@@ -1,8 +1,65 @@ 
 #include <sched.h>
+#include <sys/mman.h>
 #include "common/compiler.h"
+#include "clone-noasan.h"
 #include "log.h"
 #include "common/bug.h"
 
+#undef LOG_PREFIX
+#define LOG_PREFIX "clone_noasan: "
+
+static struct {
+	mutex_t		op_mutex;
+	mutex_t		*clone_mutex;
+} *context;
+
+int clone_noasan_init(void)
+{
+	context = mmap(NULL, sizeof(*context), PROT_READ | PROT_WRITE,
+		       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+	if (context == MAP_FAILED) {
+		pr_perror("Can't allocate context");
+		return -1;
+	}
+
+	mutex_init(&context->op_mutex);
+	return 0;
+}
+
+void clone_noasan_fini(void)
+{
+	munmap(context, sizeof(*context));
+	context = NULL;
+}
+
+static inline void context_lock(void)
+{
+	if (context && context->clone_mutex)
+		mutex_lock(context->clone_mutex);
+}
+
+static inline void context_unlock(void)
+{
+	if (context && context->clone_mutex)
+		mutex_unlock(context->clone_mutex);
+}
+
+int clone_noasan_set_mutex(mutex_t *clone_mutex)
+{
+	if (!context) {
+		pr_err_once("Context is missing\n");
+		return -ENOENT;
+	}
+
+	mutex_lock(&context->op_mutex);
+	if (context->clone_mutex)
+		return -EBUSY;
+	context->clone_mutex = clone_mutex;
+	mutex_unlock(&context->op_mutex);
+
+	return 0;
+}
+
 /*
  * ASan doesn't play nicely with clone if we use current stack for
  * child task. ASan puts local variables on the fake stack
@@ -23,9 +80,13 @@  int clone_noasan(int (*fn)(void *), int flags, void *arg)
 {
 	void *stack_ptr = (void *)round_down((unsigned long)&stack_ptr - 1024, 16);
 	BUG_ON((flags & CLONE_VM) && !(flags & CLONE_VFORK));
+	int ret;
 	/*
 	 * Reserve some bytes for clone() internal needs
 	 * and use as stack the address above this area.
 	 */
-	return clone(fn, stack_ptr, flags, arg);
+	context_lock();
+	ret = clone(fn, stack_ptr, flags, arg);
+	context_unlock();
+	return ret;
 }
diff --git a/criu/cr-restore.c b/criu/cr-restore.c
index de0b2cb407dd..17bff7f3eb42 100644
--- a/criu/cr-restore.c
+++ b/criu/cr-restore.c
@@ -1814,6 +1814,7 @@  static int restore_task_with_children(void *_arg)
 		if (restore_wait_other_tasks())
 			goto err;
 		fini_restore_mntns();
+		clone_noasan_set_mutex(&task_entries->last_pid_mutex);
 		__restore_switch_stage(CR_STATE_RESTORE);
 	} else {
 		if (restore_finish_stage(task_entries, CR_STATE_FORKING) < 0)
@@ -2301,6 +2302,7 @@  static int restore_root_task(struct pstree_item *init)
 	if (!opts.restore_detach && !opts.exec_cmd)
 		wait(NULL);
 
+	clone_noasan_fini();
 	return 0;
 
 out_kill:
@@ -2331,6 +2333,7 @@  static int restore_root_task(struct pstree_item *init)
 	depopulate_roots_yard(mnt_ns_fd, true);
 	stop_usernsd();
 	__restore_switch_stage(CR_STATE_FAIL);
+	clone_noasan_fini();
 	pr_err("Restoring FAILED.\n");
 	return -1;
 }
@@ -2391,6 +2394,9 @@  int cr_restore_tasks(void)
 	if (cpu_init() < 0)
 		goto err;
 
+	if (clone_noasan_init() < 0)
+		goto err;
+
 	if (vdso_init_restore())
 		goto err;
 
diff --git a/criu/include/clone-noasan.h b/criu/include/clone-noasan.h
index 8ef75fa73675..90d94a939a7c 100644
--- a/criu/include/clone-noasan.h
+++ b/criu/include/clone-noasan.h
@@ -1,6 +1,11 @@ 
 #ifndef __CR_CLONE_NOASAN_H__
 #define __CR_CLONE_NOASAN_H__
 
+#include "common/lock.h"
+
 int clone_noasan(int (*fn)(void *), int flags, void *arg);
+int clone_noasan_init(void);
+void clone_noasan_fini(void);
+int clone_noasan_set_mutex(mutex_t *clone_mutex);
 
 #endif /* __CR_CLONE_NOASAN_H__ */

Comments

Dmitry Safonov Nov. 15, 2019, 11:52 p.m.
On Fri, 15 Nov 2019 at 14:35, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>  When we do clone threads in a later stage of
>  restore procedure it may race with helpers
>  which do call clone_noasan by self.
>
>  In particular when restoring mount points
>  we do use try_remount_writable which is called
>  via call_helper_process and produce own fork.

try_remount_writable() is called from prepare_remaps() which is called only
in root_prepare_shared(). Could you produce a graph for the race?

===========================================
            THREAD 1                                THREAD 2
------------------------------------     -------------------------------------
                   ?                                                 ?
   try_remount_writable()                  clone_noasan()

Hmm?

[..]
> To be honest I have no clue right now how to
> provide some kind of test for this stuff.

Fault-injection?

[..]
>  criu/clone-noasan.c         | 63 ++++++++++++++++++++++++++++++++++++-

Please, keep only one function there - the file is compiled without sanitizer
instrumentation, so it's better to avoid inflating the file with new code.

[..]
> @@ -23,9 +80,13 @@ int clone_noasan(int (*fn)(void *), int flags, void *arg)
>  {
>         void *stack_ptr = (void *)round_down((unsigned long)&stack_ptr - 1024, 16);
>         BUG_ON((flags & CLONE_VM) && !(flags & CLONE_VFORK));
> +       int ret;
>         /*
>          * Reserve some bytes for clone() internal needs
>          * and use as stack the address above this area.
>          */
> -       return clone(fn, stack_ptr, flags, arg);
> +       context_lock();
> +       ret = clone(fn, stack_ptr, flags, arg);
> +       context_unlock();
> +       return ret;
>  }

I don't understand - you need to make an atomic section between
write(last_pid_fd, ...) and clone(), not just during clone().

--
             Dmitry
Cyrill Gorcunov Nov. 16, 2019, 7:09 a.m.
On Fri, Nov 15, 2019 at 11:52:50PM +0000, Dmitry Safonov wrote:
> On Fri, 15 Nov 2019 at 14:35, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> >  When we do clone threads in a later stage of
> >  restore procedure it may race with helpers
> >  which do call clone_noasan by self.
> >
> >  In particular when restoring mount points
> >  we do use try_remount_writable which is called
> >  via call_helper_process and produce own fork.
> 
> try_remount_writable() is called from prepare_remaps() which is called only
> in root_prepare_shared(). Could you produce a graph for the race?

No, it is called from open_reg_by_id -> restore_one_task.

> 
> ===========================================
>             THREAD 1                                THREAD 2
> ------------------------------------     -------------------------------------
>                    ?                                                 ?
>    try_remount_writable()                  clone_noasan()
> 
> Hmm?
> 
> [..]
> > To be honest I have no clue right now how to
> > provide some kind of test for this stuff.
> 
> Fault-injection?

Nope. One needs some additional sleeping/notification functions to reach the
moment where both the thread clone and clone_noasan are called.

> 
> [..]
> >  criu/clone-noasan.c         | 63 ++++++++++++++++++++++++++++++++++++-
> 
> Please, keep only one function there - the file is compiled without sanitizer
> instrumentation, so it's better to avoid inflating the file with new code.

Well, I could surely move this code to another place.

> 
> [..]
> > @@ -23,9 +80,13 @@ int clone_noasan(int (*fn)(void *), int flags, void *arg)
> >  {
> >         void *stack_ptr = (void *)round_down((unsigned long)&stack_ptr - 1024, 16);
> >         BUG_ON((flags & CLONE_VM) && !(flags & CLONE_VFORK));
> > +       int ret;
> >         /*
> >          * Reserve some bytes for clone() internal needs
> >          * and use as stack the address above this area.
> >          */
> > -       return clone(fn, stack_ptr, flags, arg);
> > +       context_lock();
> > +       ret = clone(fn, stack_ptr, flags, arg);
> > +       context_unlock();
> > +       return ret;
> >  }
> 
> I don't understand - you need to make an atomic section between
> write(last_pid_fd, ...) and clone(), not just during clone().

The helpers' clone calls do not use last_pid; instead we pass the last_pid
mutex into the noasan context, and then we have this:

	clone_noasan 			pie-code-thread
	  mutex_lock(last_pid) 		  mutex_lock(last_pid)
	  clone 			  RUN_CLONE
	  mutex_unlock(last_pid) 	  mutex_unlock(last_pid)
Andrei Vagin Nov. 16, 2019, 6:10 p.m.
On Fri, Nov 15, 2019 at 05:33:34PM +0300, Cyrill Gorcunov wrote:
> When we do clone threads in a later stage of
> restore procedure it may race with helpers
> which do call clone_noasan by self.
>
> In particular when restoring mount points
> we do use try_remount_writable which is called
> via call_helper_process and produce own fork.
>
> Thus to eliminate the race which happens that
> later lets allocate clone_noasan context via
> shared memory on restore and then once
> processes are forked we pass @last_pid_mutex
> to the clone_noasan engine and all subsequent
> calls to clone_noasan will be serialized with
> threads creation.
>
> To be honest I have no clue right now how to
> provide some kind of test for this stuff. Our
> QA team reported the issue as
>
>  | (07.398000) pie: 20768: Error (criu/pie/restorer.c:1823): Unable to create a thread 20770: 20775
>  | (07.398013) pie: 20768: Error (criu/pie/restorer.c:1828): Reread last_pid 20775
>
> where we wrote 20769 into last_pid inside pie thread
> restore , it failed and re-reading the last_pid returned
> 20775 instead of expecting value, which means there is
> other forks happen inbetween without taking last_pid
> lock.
>
> Also I'm in big doubt this architectural solution, don't
> like it much, it was rather a fast fix which address
> the problem. So any other ideas are welcome.
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
>  criu/clone-noasan.c         | 63 ++++++++++++++++++++++++++++++++++++-
>  criu/cr-restore.c           |  6 ++++
>  criu/include/clone-noasan.h |  5 +++
>  3 files changed, 73 insertions(+), 1 deletion(-)
>
> diff --git a/criu/clone-noasan.c b/criu/clone-noasan.c
> index 5ca280eb8386..627d924d5346 100644
> --- a/criu/clone-noasan.c
> +++ b/criu/clone-noasan.c
> @@ -1,8 +1,65 @@
>  #include <sched.h>
> +#include <sys/mman.h>
>  #include "common/compiler.h"
> +#include "clone-noasan.h"
>  #include "log.h"
>  #include "common/bug.h"
>
> +#undef LOG_PREFIX
> +#define LOG_PREFIX "clone_noasan: "
> +
> +static struct {
> +	mutex_t		op_mutex;
> +	mutex_t		*clone_mutex;
> +} *context;
> +
> +int clone_noasan_init(void)
> +{
> +	context = mmap(NULL, sizeof(*context), PROT_READ | PROT_WRITE,
> +		       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
> +	if (context == MAP_FAILED) {
> +		pr_perror("Can't allocate context");
> +		return -1;
> +	}
> +
> +	mutex_init(&context->op_mutex);
> +	return 0;
> +}
> +
> +void clone_noasan_fini(void)
> +{
> +	munmap(context, sizeof(*context));
> +	context = NULL;
> +}
> +
> +static inline void context_lock(void)
> +{
> +	if (context && context->clone_mutex)
> +		mutex_lock(context->clone_mutex);
> +}
> +
> +static inline void context_unlock(void)
> +{

context_lock has to return whether a mutex has been locked or not:

locked = context_lock();
....
context_unlock(locked);

otherwise you can call mutex_unlock for the unlocked mutex or unlock a
mutex which has been locked by someone else.

Actually, all this looks like overdesign...

I think we don't need context and another shared vma.

static mutex_t *last_pid_mutex;

void clone_noasan_init(mutex_t *last_pid_mu) {
	last_pid_mutex = last_pid_mu;
}


and then call clone_noasan_init before forking the root task. Hmm?
Cyrill Gorcunov Nov. 16, 2019, 6:17 p.m.
On Sat, Nov 16, 2019 at 10:10:14AM -0800, Andrey Vagin wrote:
> 
> context_lock has to return whether a mutex has been locked or not:
> 
> locked = context_lock();
> ....
> context_unlock(locked)l

No, it should not. Similar to spin_locks, these are void helpers;
the only difference is that we keep the lock instance hidden
in the context and available to all processes.

> otherwise you can call mutex_unlock for the unlocked mutex or unlock a
> mutex which has been locked by someone else.

We don't have lockdep at all, which means two mutex_unlock
calls have the same effect. I see what you mean, but the
lvalue doesn't help anyhow.

> Actually, all this looks like overdesign...
> 
> I think we don't need context and another shared vma.
> 
> static struct *last_pid_mutex;
> 
> void clone_noasan_init(struct *last_pid_mu) {
> 	last_pid_mutex = last_pid_mu;
> }
> 
> and then call clone_noasan_init before forking the root task. Hmm?

You know, I would prefer to have it as a separate subsystem instead.
And actually would love to not use last_pid mutex at all but instead
call context_lock from pie code.

Still I won't insist, if you prefer last_pid_mutex with clone_noasan_init
I will make it so.
Andrei Vagin Nov. 17, 2019, 9:13 a.m.
On Fri, Nov 15, 2019 at 05:33:34PM +0300, Cyrill Gorcunov wrote:
> @@ -23,9 +80,13 @@ int clone_noasan(int (*fn)(void *), int flags, void *arg)
>  {
>  	void *stack_ptr = (void *)round_down((unsigned long)&stack_ptr - 1024, 16);
>  	BUG_ON((flags & CLONE_VM) && !(flags & CLONE_VFORK));
> +	int ret;
>  	/*
>  	 * Reserve some bytes for clone() internal needs
>  	 * and use as stack the address above this area.
>  	 */
> -	return clone(fn, stack_ptr, flags, arg);
> +	context_lock();
> +	ret = clone(fn, stack_ptr, flags, arg);

Here is one more problem: clone can fork a process with a pid which is
going to be used by one of the restored threads.

> +	context_unlock();
> +	return ret;
>  }
Cyrill Gorcunov Nov. 17, 2019, 9:33 a.m.
On Sun, Nov 17, 2019 at 01:13:58AM -0800, Andrey Vagin wrote:
> > -	return clone(fn, stack_ptr, flags, arg);
> > +	context_lock();
> > +	ret = clone(fn, stack_ptr, flags, arg);
> 
> Here is one more problem. clone can fork a process with a pid which is
> going to be used be one of restored threads.

Well, these helpers run in serialized mode and they finish
before we yield a new thread. So no, this won't be a problem. The main
problem is that all of this looks like a broken architecture design.

I think we need (well, I need) to think more; maybe I can figure out
some explicit way to handle this.
Andrei Vagin Nov. 18, 2019, 1:11 a.m.
On Sun, Nov 17, 2019 at 12:33:37PM +0300, Cyrill Gorcunov wrote:
> On Sun, Nov 17, 2019 at 01:13:58AM -0800, Andrey Vagin wrote:
> > > -	return clone(fn, stack_ptr, flags, arg);
> > > +	context_lock();
> > > +	ret = clone(fn, stack_ptr, flags, arg);
> > 
> > Here is one more problem. clone can fork a process with a pid which is
> > going to be used be one of restored threads.
> 
> Well, this helpers will run in serialized mode and they are finished
> before we yield new thread. So no, this won't be a problem. The main
> problem is that all this looks like a broken archtecture design.

I don't understand what "serialized mode" means.

You wrote: "When we do clone threads in a later stage of
restore procedure it may race with helpers
which do call clone_noasan by self.".

This means that clone_noasan can be called while we are restoring threads,
and clone_noasan can fork a process with the pid of one of the restored
threads.

> 
> I think we need (well, I need) to think more, maybe a I figure out
> some explicit way to handle this.

+1 ;)
Cyrill Gorcunov Nov. 18, 2019, 9:43 a.m.
On Sun, Nov 17, 2019 at 05:11:21PM -0800, Andrey Vagin wrote:
> > 
> > I think we need (well, I need) to think more, maybe a I figure out
> > some explicit way to handle this.
> 
> +1 ;)

As discussed F2F, we still have a race even with the patch.
Need to rework and resend.