[4/4] restore: Fix hang if root task is waiting on zombie

Submitted by Cyrill Gorcunov on Dec. 7, 2018, 11:57 a.m.

Details

Message ID 20181207115712.24130-5-gorcunov@gmail.com
State New
Series "restore: Fix potential hung on restore"
Headers show

Commit Message

Cyrill Gorcunov Dec. 7, 2018, 11:57 a.m.
When we're waiting for zombie we might be the last task
and the others already altered nr_in_progress futex. Thus
in this situation we should rather use nanosleep or similar
to relieve syscall pressure. Otherwise we can simply stuck
at restore procedure waiting for @nr_in_progress forever.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 criu/pie/restorer.c | 36 +++++++++++++++++++++++++++++++-----
 1 file changed, 31 insertions(+), 5 deletions(-)

Patch hide | download patch | download mbox

diff --git a/criu/pie/restorer.c b/criu/pie/restorer.c
index d3b459c6c8b1..b4c0eecd44cc 100644
--- a/criu/pie/restorer.c
+++ b/criu/pie/restorer.c
@@ -1162,6 +1162,13 @@  static int wait_helpers(struct task_restore_args *task_args)
 
 static int wait_zombies(struct task_restore_args *task_args)
 {
+	static const uint32_t step_ms = 100;
+	static const struct timespec req = {
+		.tv_nsec        = step_ms * 1000000,
+		.tv_sec         = 0,
+	};
+	struct timespec rem;
+	uint32_t nr_prints = 10;
 	int i;
 
 	for (i = 0; i < task_args->zombies_n; i++) {
@@ -1171,12 +1178,31 @@  static int wait_zombies(struct task_restore_args *task_args)
 
 		ret = sys_waitid(P_PID, task_args->zombies[i], NULL, WNOWAIT | WEXITED, NULL);
 		if (ret == -ECHILD) {
-			/* A process isn't reparented to this task yet.
-			 * Let's wait when someone complete this stage
-			 * and try again.
+			/*
+			 * A zombie is not reparented to us yet. So we need to
+			 * wait for it. If there some tasks left then we can
+			 * use @nr_in_progress to not calling waitid too often.
+			 * But in case if we're the root process, the @nr_in_progress
+			 * won't be altered and we use nanosleep with @step_ms
+			 * to relieve syscall pressure.
 			 */
-			futex_wait_while_eq(&task_entries_local->nr_in_progress,
-								nr_in_progress);
+			if (nr_in_progress > 1) {
+				futex_wait_while_eq(&task_entries_local->nr_in_progress,
+						    nr_in_progress);
+			} else {
+				if (nr_prints)
+					pr_debug("wait_zombies %d (nanosleep %ld %ld)\n",
+						 task_args->zombies[i], req.tv_sec, req.tv_nsec);
+				ret = sys_nanosleep((struct timespec *)&req, &rem);
+				if (ret == -EINTR) {
+					if (nr_prints)
+						pr_debug("\twait_zombies %d (nanosleep %ld %ld)\n",
+							 task_args->zombies[i], rem.tv_sec, rem.tv_nsec);
+					sys_nanosleep((struct timespec *)&rem, NULL);
+				}
+				if (nr_prints)
+					nr_prints--;
+			}
 			i--;
 			continue;
 		}

Comments

Andrei Vagin Dec. 11, 2018, 6:51 a.m.
On Fri, Dec 07, 2018 at 02:57:12PM +0300, Cyrill Gorcunov wrote:
> When we're waiting for zombie we might be the last task
> and the others already altered nr_in_progress futex. Thus
> in this situation we should rather use nanosleep or similar
> to relieve syscall pressure. Otherwise we can simply stuck
> at restore procedure waiting for @nr_in_progress forever.

I don't understand this description. Pls provide a sequence of actions
when we can stuck at restore.

Thanks,
Andrei

> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
>  criu/pie/restorer.c | 36 +++++++++++++++++++++++++++++++-----
>  1 file changed, 31 insertions(+), 5 deletions(-)
> 
> diff --git a/criu/pie/restorer.c b/criu/pie/restorer.c
> index d3b459c6c8b1..b4c0eecd44cc 100644
> --- a/criu/pie/restorer.c
> +++ b/criu/pie/restorer.c
> @@ -1162,6 +1162,13 @@ static int wait_helpers(struct task_restore_args *task_args)
>  
>  static int wait_zombies(struct task_restore_args *task_args)
>  {
> +	static const uint32_t step_ms = 100;
> +	static const struct timespec req = {
> +		.tv_nsec        = step_ms * 1000000,
> +		.tv_sec         = 0,
> +	};
> +	struct timespec rem;
> +	uint32_t nr_prints = 10;
>  	int i;
>  
>  	for (i = 0; i < task_args->zombies_n; i++) {
> @@ -1171,12 +1178,31 @@ static int wait_zombies(struct task_restore_args *task_args)
>  
>  		ret = sys_waitid(P_PID, task_args->zombies[i], NULL, WNOWAIT | WEXITED, NULL);
>  		if (ret == -ECHILD) {
> -			/* A process isn't reparented to this task yet.
> -			 * Let's wait when someone complete this stage
> -			 * and try again.
> +			/*
> +			 * A zombie is not reparented to us yet. So we need to
> +			 * wait for it. If there some tasks left then we can
> +			 * use @nr_in_progress to not calling waitid too often.
> +			 * But in case if we're the root process, the @nr_in_progress
> +			 * won't be altered and we use nanosleep with @step_ms
> +			 * to relieve syscall pressure.
>  			 */
> -			futex_wait_while_eq(&task_entries_local->nr_in_progress,
> -								nr_in_progress);
> +			if (nr_in_progress > 1) {
> +				futex_wait_while_eq(&task_entries_local->nr_in_progress,
> +						    nr_in_progress);
> +			} else {
> +				if (nr_prints)
> +					pr_debug("wait_zombies %d (nanosleep %ld %ld)\n",
> +						 task_args->zombies[i], req.tv_sec, req.tv_nsec);
> +				ret = sys_nanosleep((struct timespec *)&req, &rem);
> +				if (ret == -EINTR) {
> +					if (nr_prints)
> +						pr_debug("\twait_zombies %d (nanosleep %ld %ld)\n",
> +							 task_args->zombies[i], rem.tv_sec, rem.tv_nsec);
> +					sys_nanosleep((struct timespec *)&rem, NULL);
> +				}
> +				if (nr_prints)
> +					nr_prints--;
> +			}
>  			i--;
>  			continue;
>  		}
> -- 
> 2.17.2
>
Andrei Vagin Dec. 11, 2018, 7:07 a.m.
On Fri, Dec 07, 2018 at 02:57:12PM +0300, Cyrill Gorcunov wrote:
> When we're waiting for zombie we might be the last task
> and the others already altered nr_in_progress futex. Thus
> in this situation we should rather use nanosleep or similar
> to relieve syscall pressure. Otherwise we can simply stuck
> at restore procedure waiting for @nr_in_progress forever.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
>  criu/pie/restorer.c | 36 +++++++++++++++++++++++++++++++-----
>  1 file changed, 31 insertions(+), 5 deletions(-)
> 
> diff --git a/criu/pie/restorer.c b/criu/pie/restorer.c
> index d3b459c6c8b1..b4c0eecd44cc 100644
> --- a/criu/pie/restorer.c
> +++ b/criu/pie/restorer.c
> @@ -1162,6 +1162,13 @@ static int wait_helpers(struct task_restore_args *task_args)
>  
>  static int wait_zombies(struct task_restore_args *task_args)
>  {
> +	static const uint32_t step_ms = 100;
> +	static const struct timespec req = {
> +		.tv_nsec        = step_ms * 1000000,
> +		.tv_sec         = 0,
> +	};
> +	struct timespec rem;
> +	uint32_t nr_prints = 10;
>  	int i;
>  
>  	for (i = 0; i < task_args->zombies_n; i++) {
> @@ -1171,12 +1178,31 @@ static int wait_zombies(struct task_restore_args *task_args)
>  
>  		ret = sys_waitid(P_PID, task_args->zombies[i], NULL, WNOWAIT | WEXITED, NULL);
>  		if (ret == -ECHILD) {
> -			/* A process isn't reparented to this task yet.
> -			 * Let's wait when someone complete this stage
> -			 * and try again.
> +			/*
> +			 * A zombie is not reparented to us yet. So we need to
> +			 * wait for it. If there some tasks left then we can
> +			 * use @nr_in_progress to not calling waitid too often.
> +			 * But in case if we're the root process, the @nr_in_progress
> +			 * won't be altered and we use nanosleep with @step_ms
> +			 * to relieve syscall pressure.
>  			 */
> -			futex_wait_while_eq(&task_entries_local->nr_in_progress,
> -								nr_in_progress);
> +			if (nr_in_progress > 1) {
> +				futex_wait_while_eq(&task_entries_local->nr_in_progress,
> +						    nr_in_progress);
> +			} else {

> A zombie is not reparented to us yet.

This means that we are waiting when a process (helper) will die and its
children (zombies) will be reparanted to the init process.

If nr_in_progress is 1, this means that there is no processes which are
going to die, doesn't it? It the answer is yes, what are we waiting
here?

> +				if (nr_prints)
> +					pr_debug("wait_zombies %d (nanosleep %ld %ld)\n",
> +						 task_args->zombies[i], req.tv_sec, req.tv_nsec);
> +				ret = sys_nanosleep((struct timespec *)&req, &rem);
> +				if (ret == -EINTR) {
> +					if (nr_prints)
> +						pr_debug("\twait_zombies %d (nanosleep %ld %ld)\n",
> +							 task_args->zombies[i], rem.tv_sec, rem.tv_nsec);
> +					sys_nanosleep((struct timespec *)&rem, NULL);
> +				}
> +				if (nr_prints)
> +					nr_prints--;
> +			}
>  			i--;
>  			continue;
>  		}
> -- 
> 2.17.2
>
Cyrill Gorcunov Dec. 11, 2018, 9:36 a.m.
On Mon, Dec 10, 2018 at 11:07:09PM -0800, Andrey Vagin wrote:
> > A zombie is not reparented to us yet.
> 
> This means that we are waiting when a process (helper) will die and its
> children (zombies) will be reparanted to the init process.
> 
> If nr_in_progress is 1, this means that there is no processes which are
> going to die, doesn't it? It the answer is yes, what are we waiting
> here?

If nr_in_progress is 1 it means the zombie processes already decremented
nr_in_progress but may not be yet reparented to us, because we do decrement
nr_in_progress first and only then call kill(self) in zombie restore code.

	for (i = 0; i < task_args->zombies_n; i++) {
		int ret, nr_in_progress;

		nr_in_progress = futex_get(&task_entries_local->nr_in_progress);

		ret = sys_waitid(P_PID, task_args->zombies[i], NULL, WNOWAIT | WEXITED, NULL);
		if (ret == -ECHILD) {
			/* A process isn't reparented to this task yet.
			 * Let's wait when someone complete this stage
			 * and try again.
			 */
-->			futex_wait_while_eq(&task_entries_local->nr_in_progress,
								nr_in_progress);
			i--;
			continue;
		}

The nr_in_progress is fetched earlier and equal to 1. Then we call for waitid
and got -ECHILD which resulted that we're sitting in futex wait forever. That
is what I've been notified about when investigating a bug. Unfortunately I
lost the container failing and was unable to recreate a local test case.
I suspect there is a race window between signal delivery and nr_in_progress
updating.

Still since I've no local reproduction maybe it worth simply let this patch
flying around and if it triggers someday we will back to the question.