[PATCHv3,2/3] restore: Fix deadlock when helper's child dies

Submitted by Dmitry Safonov on July 19, 2017, 2:02 p.m.

Details

Message ID 20170719140254.32414-3-dsafonov@virtuozzo.com
State New
Series "Fix TASK_HELPER deadlock on futex"
Headers show

Commit Message

Dmitry Safonov July 19, 2017, 2:02 p.m.
Since commit ced9c529f687 ("restore: fix race with helpers' kids dying
too early"), we block SIGCHLD in helper tasks before CR_STATE_RESTORE.
This way we avoided default criu sighandler as it doesn't expect that
childs may die.

This is very racy as we wait on futex for another stage to be started,
but the next stage may start only when all the tasks complete previous
stage. If some children of helper dies, the helper may already have
blocked SIGCHLD and have started sleeping on the futex. Then the next
stage never comes and no one reads a pending SIGCHLD for helper.

A customer met this situation on the node, where the following
(non-related) problem has occured:
Unable to send a fin packet: libnet_write_raw_ipv6(): -1 bytes written (Network is unreachable)
Then child criu of the helper has exited with error-code and the
lockup has happened.

While we could fix it by aborting futex in the end of
restore_task_with_children() for each (non-root also) tasks,
that would be not completely correct:
1. All futex-waiting tasks will wake up after that and they
   may not expect that some tasks are on the previous stage,
   so they will spam into logs with unrelated errors and may
   also die painfully.
2. Child may die and miss aborting of the futex due to:
   o segfault
   o OOM killer
   o User-sended SIGKILL
   o Other error-path we forgot to cover with abort futex

To fix this deadlock in TASK_HELPER, as suggested-by Kirill,
let's check if there are children deaths expected - if there
isn't any, don't block SIGCHLD, otherwise wait() and check if
death was on expected stage of restore (not CR_STATE_RESTORE).

Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
---
 criu/cr-restore.c | 75 +++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 22 deletions(-)

Patch hide | download patch | download mbox

diff --git a/criu/cr-restore.c b/criu/cr-restore.c
index 9143cd310fed..21c5fbf3eacc 100644
--- a/criu/cr-restore.c
+++ b/criu/cr-restore.c
@@ -1207,6 +1207,58 @@  out:
 	return ret;
 }
 
+static bool child_death_expected(void)
+{
+	struct pstree_item *pi;
+
+	list_for_each_entry(pi, &current->children, sibling) {
+		switch (pi->pid->state) {
+		case TASK_DEAD:
+		case TASK_HELPER:
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static int restore_one_helper(void)
+{
+	sigset_t oldmask;
+	siginfo_t info;
+
+	if (prepare_fds(current))
+		return -1;
+
+	if (!child_death_expected()) {
+		restore_finish_stage(task_entries, CR_STATE_RESTORE);
+		return 0;
+	}
+
+	if (block_sigmask(&oldmask, SIGCHLD))
+		return -1;
+
+	/* Finish CR_STATE_RESTORE, but do not wait for the next stage. */
+	futex_dec_and_wake(&task_entries->nr_in_progress);
+
+	if (waitid(P_ALL, 0, &info, WEXITED | WNOWAIT)) {
+		pr_perror("Failed to wait\n");
+		return -1;
+	}
+
+	if (futex_get(&task_entries->start) == CR_STATE_RESTORE) {
+		pr_err("Child %d died too early\n", info.si_pid);
+		return -1;
+	}
+
+	if (wait_on_helpers_zombies()) {
+		pr_err("Failed to wait on helpers and zombies\n");
+		return -1;
+	}
+
+	return 0;
+}
+
 static int restore_one_task(int pid, CoreEntry *core)
 {
 	int i, ret;
@@ -1218,32 +1270,11 @@  static int restore_one_task(int pid, CoreEntry *core)
 	else if (current->pid->state == TASK_DEAD)
 		ret = restore_one_zombie(core);
 	else if (current->pid->state == TASK_HELPER) {
-		sigset_t blockmask, oldmask;
-
-		if (prepare_fds(current))
-			return -1;
-
-		sigemptyset(&blockmask);
-		sigaddset(&blockmask, SIGCHLD);
-
-		if (sigprocmask(SIG_BLOCK, &blockmask, &oldmask) == -1) {
-			pr_perror("Can not set mask of blocked signals");
-			return -1;
-		}
-
-		restore_finish_stage(task_entries, CR_STATE_RESTORE);
-
+		ret = restore_one_helper();
 		close_image_dir();
 		close_proc();
 		for (i = SERVICE_FD_MIN + 1; i < SERVICE_FD_MAX; i++)
 			close_service_fd(i);
-
-		if (wait_on_helpers_zombies()) {
-			pr_err("failed to wait on helpers and zombies\n");
-			ret = -1;
-		} else {
-			ret = 0;
-		}
 	} else {
 		pr_err("Unknown state in code %d\n", (int)core->tc->task_state);
 		ret = -1;

Comments

Andrei Vagin July 20, 2017, 7:39 a.m.
On Wed, Jul 19, 2017 at 05:02:53PM +0300, Dmitry Safonov wrote:
> Since commit ced9c529f687 ("restore: fix race with helpers' kids dying
> too early"), we block SIGCHLD in helper tasks before CR_STATE_RESTORE.
> This way we avoided default criu sighandler as it doesn't expect that
> childs may die.
> 
> This is very racy as we wait on futex for another stage to be started,
> but the next stage may start only when all the tasks complete previous
> stage. If some children of helper dies, the helper may already have
> blocked SIGCHLD and have started sleeping on the futex. Then the next
> stage never comes and no one reads a pending SIGCHLD for helper.
> 
> A customer met this situation on the node, where the following
> (non-related) problem has occured:
> Unable to send a fin packet: libnet_write_raw_ipv6(): -1 bytes written (Network is unreachable)
> Then child criu of the helper has exited with error-code and the
> lockup has happened.
> 
> While we could fix it by aborting futex in the end of
> restore_task_with_children() for each (non-root also) tasks,
> that would be not completely correct:
> 1. All futex-waiting tasks will wake up after that and they
>    may not expect that some tasks are on the previous stage,
>    so they will spam into logs with unrelated errors and may
>    also die painfully.
> 2. Child may die and miss aborting of the futex due to:
>    o segfault
>    o OOM killer
>    o User-sended SIGKILL
>    o Other error-path we forgot to cover with abort futex
> 
> To fix this deadlock in TASK_HELPER, as suggested-by Kirill,
> let's check if there are children deaths expected - if there
> isn't any, don't block SIGCHLD, otherwise wait() and check if
> death was on expected stage of restore (not CR_STATE_RESTORE).
> 
> Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
> ---
>  criu/cr-restore.c | 75 +++++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 53 insertions(+), 22 deletions(-)
> 
> diff --git a/criu/cr-restore.c b/criu/cr-restore.c
> index 9143cd310fed..21c5fbf3eacc 100644
> --- a/criu/cr-restore.c
> +++ b/criu/cr-restore.c
> @@ -1207,6 +1207,58 @@ out:
>  	return ret;
>  }
>  
> +static bool child_death_expected(void)
> +{
> +	struct pstree_item *pi;
> +
> +	list_for_each_entry(pi, &current->children, sibling) {
> +		switch (pi->pid->state) {
> +		case TASK_DEAD:
> +		case TASK_HELPER:
> +			return true;
> +		}
> +	}
> +
> +	return false;
> +}
> +

Could you write a comment for this function?

> +static int restore_one_helper(void)
> +{
> +	sigset_t oldmask;
> +	siginfo_t info;
> +
> +	if (prepare_fds(current))
> +		return -1;
> +
> +	if (!child_death_expected()) {
> +		restore_finish_stage(task_entries, CR_STATE_RESTORE);
> +		return 0;
> +	}
> +
> +	if (block_sigmask(&oldmask, SIGCHLD))

block_sigmask(NULL, SIGCHLD) has to work too, oldmask isn't used
> +		return -1;
> +
> +	/* Finish CR_STATE_RESTORE, but do not wait for the next stage. */
> +	futex_dec_and_wake(&task_entries->nr_in_progress);
> +
> +	if (waitid(P_ALL, 0, &info, WEXITED | WNOWAIT)) {
> +		pr_perror("Failed to wait\n");
> +		return -1;
> +	}
> +
> +	if (futex_get(&task_entries->start) == CR_STATE_RESTORE) {
> +		pr_err("Child %d died too early\n", info.si_pid);
> +		return -1;
> +	}
> +
> +	if (wait_on_helpers_zombies()) {
> +		pr_err("Failed to wait on helpers and zombies\n");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
>  static int restore_one_task(int pid, CoreEntry *core)
>  {
>  	int i, ret;
> @@ -1218,32 +1270,11 @@ static int restore_one_task(int pid, CoreEntry *core)
>  	else if (current->pid->state == TASK_DEAD)
>  		ret = restore_one_zombie(core);
>  	else if (current->pid->state == TASK_HELPER) {
> -		sigset_t blockmask, oldmask;
> -
> -		if (prepare_fds(current))
> -			return -1;
> -
> -		sigemptyset(&blockmask);
> -		sigaddset(&blockmask, SIGCHLD);
> -
> -		if (sigprocmask(SIG_BLOCK, &blockmask, &oldmask) == -1) {
> -			pr_perror("Can not set mask of blocked signals");
> -			return -1;
> -		}
> -
> -		restore_finish_stage(task_entries, CR_STATE_RESTORE);
> -
> +		ret = restore_one_helper();
>  		close_image_dir();
>  		close_proc();
>  		for (i = SERVICE_FD_MIN + 1; i < SERVICE_FD_MAX; i++)
>  			close_service_fd(i);
> -
> -		if (wait_on_helpers_zombies()) {
> -			pr_err("failed to wait on helpers and zombies\n");
> -			ret = -1;
> -		} else {
> -			ret = 0;
> -		}
>  	} else {
>  		pr_err("Unknown state in code %d\n", (int)core->tc->task_state);
>  		ret = -1;
> -- 
> 2.13.1
> 
> _______________________________________________
> CRIU mailing list
> CRIU@openvz.org
> https://lists.openvz.org/mailman/listinfo/criu
Dmitry Safonov July 20, 2017, 11:21 a.m.
On 07/20/2017 10:39 AM, Andrei Vagin wrote:
> On Wed, Jul 19, 2017 at 05:02:53PM +0300, Dmitry Safonov wrote:
>> Since commit ced9c529f687 ("restore: fix race with helpers' kids dying
>> too early"), we block SIGCHLD in helper tasks before CR_STATE_RESTORE.
>> This way we avoided default criu sighandler as it doesn't expect that
>> childs may die.
>>
>> This is very racy as we wait on futex for another stage to be started,
>> but the next stage may start only when all the tasks complete previous
>> stage. If some children of helper dies, the helper may already have
>> blocked SIGCHLD and have started sleeping on the futex. Then the next
>> stage never comes and no one reads a pending SIGCHLD for helper.
>>
>> A customer met this situation on the node, where the following
>> (non-related) problem has occured:
>> Unable to send a fin packet: libnet_write_raw_ipv6(): -1 bytes written (Network is unreachable)
>> Then child criu of the helper has exited with error-code and the
>> lockup has happened.
>>
>> While we could fix it by aborting futex in the end of
>> restore_task_with_children() for each (non-root also) tasks,
>> that would be not completely correct:
>> 1. All futex-waiting tasks will wake up after that and they
>>     may not expect that some tasks are on the previous stage,
>>     so they will spam into logs with unrelated errors and may
>>     also die painfully.
>> 2. Child may die and miss aborting of the futex due to:
>>     o segfault
>>     o OOM killer
>>     o User-sended SIGKILL
>>     o Other error-path we forgot to cover with abort futex
>>
>> To fix this deadlock in TASK_HELPER, as suggested-by Kirill,
>> let's check if there are children deaths expected - if there
>> isn't any, don't block SIGCHLD, otherwise wait() and check if
>> death was on expected stage of restore (not CR_STATE_RESTORE).
>>
>> Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
>> ---
>>   criu/cr-restore.c | 75 +++++++++++++++++++++++++++++++++++++++----------------
>>   1 file changed, 53 insertions(+), 22 deletions(-)
>>
>> diff --git a/criu/cr-restore.c b/criu/cr-restore.c
>> index 9143cd310fed..21c5fbf3eacc 100644
>> --- a/criu/cr-restore.c
>> +++ b/criu/cr-restore.c
>> @@ -1207,6 +1207,58 @@ out:
>>   	return ret;
>>   }
>>   
>> +static bool child_death_expected(void)
>> +{
>> +	struct pstree_item *pi;
>> +
>> +	list_for_each_entry(pi, &current->children, sibling) {
>> +		switch (pi->pid->state) {
>> +		case TASK_DEAD:
>> +		case TASK_HELPER:
>> +			return true;
>> +		}
>> +	}
>> +
>> +	return false;
>> +}
>> +
> 
> Could you write a comment for this function?

Ok

> 
>> +static int restore_one_helper(void)
>> +{
>> +	sigset_t oldmask;
>> +	siginfo_t info;
>> +
>> +	if (prepare_fds(current))
>> +		return -1;
>> +
>> +	if (!child_death_expected()) {
>> +		restore_finish_stage(task_entries, CR_STATE_RESTORE);
>> +		return 0;
>> +	}
>> +
>> +	if (block_sigmask(&oldmask, SIGCHLD))
> 
> block_sigmask(NULL, SIGCHLD) has to work too, oldmask isn't used

Yep, it was over-caution, as I was not sure if on all archs the
parameter of sigprocmask can be NULL.
But man says it can be, so will do.

>> +		return -1;
>> +
>> +	/* Finish CR_STATE_RESTORE, but do not wait for the next stage. */
>> +	futex_dec_and_wake(&task_entries->nr_in_progress);
>> +
>> +	if (waitid(P_ALL, 0, &info, WEXITED | WNOWAIT)) {
>> +		pr_perror("Failed to wait\n");
>> +		return -1;
>> +	}
>> +
>> +	if (futex_get(&task_entries->start) == CR_STATE_RESTORE) {
>> +		pr_err("Child %d died too early\n", info.si_pid);
>> +		return -1;
>> +	}
>> +
>> +	if (wait_on_helpers_zombies()) {
>> +		pr_err("Failed to wait on helpers and zombies\n");
>> +		return -1;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>   static int restore_one_task(int pid, CoreEntry *core)
>>   {
>>   	int i, ret;
>> @@ -1218,32 +1270,11 @@ static int restore_one_task(int pid, CoreEntry *core)
>>   	else if (current->pid->state == TASK_DEAD)
>>   		ret = restore_one_zombie(core);
>>   	else if (current->pid->state == TASK_HELPER) {
>> -		sigset_t blockmask, oldmask;
>> -
>> -		if (prepare_fds(current))
>> -			return -1;
>> -
>> -		sigemptyset(&blockmask);
>> -		sigaddset(&blockmask, SIGCHLD);
>> -
>> -		if (sigprocmask(SIG_BLOCK, &blockmask, &oldmask) == -1) {
>> -			pr_perror("Can not set mask of blocked signals");
>> -			return -1;
>> -		}
>> -
>> -		restore_finish_stage(task_entries, CR_STATE_RESTORE);
>> -
>> +		ret = restore_one_helper();
>>   		close_image_dir();
>>   		close_proc();
>>   		for (i = SERVICE_FD_MIN + 1; i < SERVICE_FD_MAX; i++)
>>   			close_service_fd(i);
>> -
>> -		if (wait_on_helpers_zombies()) {
>> -			pr_err("failed to wait on helpers and zombies\n");
>> -			ret = -1;
>> -		} else {
>> -			ret = 0;
>> -		}
>>   	} else {
>>   		pr_err("Unknown state in code %d\n", (int)core->tc->task_state);
>>   		ret = -1;
>> -- 
>> 2.13.1
>>
>> _______________________________________________
>> CRIU mailing list
>> CRIU@openvz.org
>> https://lists.openvz.org/mailman/listinfo/criu