__synccall: deadlock and reliance on racy /proc/self/task

Submitted by Szabolcs Nagy on Feb. 9, 2019, 9:40 p.m.

Details

Message ID 20190209214045.GO21289@port70.net
State New
Series "__synccall: deadlock and reliance on racy /proc/self/task"

Commit Message

Szabolcs Nagy Feb. 9, 2019, 9:40 p.m.
* Alexey Izbyshev <izbyshev@ispras.ru> [2019-02-09 21:33:32 +0300]:
> On 2019-02-09 19:21, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@libc.org> [2019-02-08 13:33:57 -0500]:
> > > On Fri, Feb 08, 2019 at 09:14:48PM +0300, Alexey Izbyshev wrote:
> > > > On 2/7/19 9:36 PM, Rich Felker wrote:
> > > > >Does it work if we force two iterations of the readdir loop with no
> > > > >tasks missed, rather than just one, to catch the case of missed
> > > > >concurrent additions? I'm not sure. But all this makes me really
> > > > >uncomfortable with the current approach.
> > > >
> > > > I've tested with 0, 1, 2 and 3 retries of the main loop if miss_cnt
> > > > == 0. The test eventually failed in all cases, with 0 retries
> > > > requiring only a handful of iterations, 1 -- on the order of 100, 2
> > > > -- on the order of 10000 and 3 -- on the order of 100000.
> > > 
> > > Do you have a theory on the mechanism of failure here? I'm guessing
> > > it's something like this: there's a thread that goes unseen in the
> > > first round, and during the second round, it creates a new thread and
> > > exits itself. The exit gets seen (again, it doesn't show up in the
> > > dirents) but the new thread it created still doesn't. Is that right?
> > > 
> > > In any case, it looks like the whole mechanism we're using is
> > > unreliable, so something needs to be done. My leaning is to go with
> > > the global thread list and atomicity of list-unlock with exit.
> > 
> > yes that sounds possible, i added some instrumentation to musl
> > and the trace shows situations like that before the deadlock,
> > exiting threads can even cause old (previously seen) entries to
> > disappear from the dir.
> > 
> Thanks for the thorough instrumentation! Your traces confirm both my theory
> about the deadlock and unreliability of /proc/self/task.
> 
> I'd also done a very light instrumentation just before I got your email, but
> it took me a while to understand the output I got (see below).

the attached patch fixes the issue on my machine.
i don't know if this is just luck.

the assumption is that if /proc/self/task is read twice such that
all tids in it seem to be active and caught, then all the active
threads of the process are caught (no new threads that are already
started but not visible there yet)

> Now, about the strange output I mentioned. Consider one of the above
> fragments:
> --iter: 4
> exit 15977
> retry 0
> tid 15977
> tid 15978
> exit 15978
> retry 1
> tid 15978
> tgkill: ESRCH
> mismatch: tid 15979: 0 != 23517
> 
> Note that "tid 15978" is printed two times. Recall that it's printed only if
> we haven't seen it in the chain. But how it's possible that we haven't seen
> it *two* times? Since we waited on the futex the first time and we got the
> lock, the signal handler must have unlocked it. There is even a comment
> before futex() call:
> 
> /* Obtaining the lock means the thread responded. ESRCH
>  * means the target thread exited, which is okay too. */
> 
> If the signal handler reached the futex unlock code, it must have updated the
> chain, and we must see the tid in the chain on the next retry and not
> print it.
> 
> Apparently, there is another reason for futex(FUTEX_LOCK_PI) success: the
> owner is exiting concurrently (which is also indicated by the subsequent
> failure of tgkill with ESRCH). So obtaining the lock doesn't necessarily
> mean that the owner responded: it may also mean that the owner is (about to
> be?) dead.

so tgkill succeeds but the target exits before handling the signal.
i'd expect ESRCH then, not success, from the futex.
interesting.

anyway i had to retry until there are no exiting threads in dir to
reliably fix the deadlock.

Patch

From ed101ece64b645865779293eb48109cad03e9c35 Mon Sep 17 00:00:00 2001
From: Szabolcs Nagy <nsz@port70.net>
Date: Sat, 9 Feb 2019 21:13:35 +0000
Subject: [PATCH] more robust synccall

---
 src/thread/synccall.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/src/thread/synccall.c b/src/thread/synccall.c
index cc66bd24..7f275114 100644
--- a/src/thread/synccall.c
+++ b/src/thread/synccall.c
@@ -102,6 +102,7 @@  void __synccall(void (*func)(void *), void *ctx)
 
 	/* Loop scanning the kernel-provided thread list until it shows no
 	 * threads that have not already replied to the signal. */
+	int all_threads_caught = 0;
 	for (;;) {
 		int miss_cnt = 0;
 		while ((de = readdir(&dir))) {
@@ -120,6 +121,7 @@  void __synccall(void (*func)(void *), void *ctx)
 			for (cp = head; cp && cp->tid != tid; cp=cp->next);
 			if (cp) continue;
 
+			miss_cnt++;
 			r = -__syscall(SYS_tgkill, pid, tid, SIGSYNCCALL);
 
 			/* Target thread exit is a success condition. */
@@ -142,10 +144,16 @@  void __synccall(void (*func)(void *), void *ctx)
 			/* Obtaining the lock means the thread responded. ESRCH
 			 * means the target thread exited, which is okay too. */
 			if (!r || r == ESRCH) continue;
-
-			miss_cnt++;
 		}
-		if (!miss_cnt) break;
+		if (miss_cnt)
+			all_threads_caught = 0;
+		else
+			all_threads_caught++;
+		/* when all visible threads are stopped there may be newly
+		 * created threads that are not in dir yet, so only assume
+		 * we are done when we see no running threads twice. */
+		if (all_threads_caught > 1)
+			break;
 		rewinddir(&dir);
 	}
 	close(dir.fd);
-- 
2.19.1


Comments

Alexey Izbyshev Feb. 9, 2019, 10:29 p.m.
On 2019-02-10 00:40, Szabolcs Nagy wrote:
> the attached patch fixes the issue on my machine.
> i don't know if this is just luck.
> 
> the assumption is that if /proc/self/task is read twice such that
> all tids in it seem to be active and caught, then all the active
> threads of the process are caught (no new threads that are already
> started but not visible there yet)
> 
> anyway i had to retry until there are no exiting threads in dir to
> reliably fix the deadlock.

Unfortunately, on a 4.15.x kernel, I've got both the deadlock (~23000 
iterations) and the mismatch (after I removed the kill() loop; ~19000 
iterations).

On 4.4.x, it took ~30 million iterations to get the mismatch (on a 
deadlock-free version):

--iter: 30198000
--iter: 30199000
mismatch: tid 539: 1000 != 0

Alexey
Rich Felker Feb. 10, 2019, 12:52 a.m.
On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> * Alexey Izbyshev <izbyshev@ispras.ru> [2019-02-09 21:33:32 +0300]:
> > On 2019-02-09 19:21, Szabolcs Nagy wrote:
> > > * Rich Felker <dalias@libc.org> [2019-02-08 13:33:57 -0500]:
> > > > On Fri, Feb 08, 2019 at 09:14:48PM +0300, Alexey Izbyshev wrote:
> > > > > On 2/7/19 9:36 PM, Rich Felker wrote:
> > > > > >Does it work if we force two iterations of the readdir loop with no
> > > > > >tasks missed, rather than just one, to catch the case of missed
> > > > > >concurrent additions? I'm not sure. But all this makes me really
> > > > > >uncomfortable with the current approach.
> > > > >
> > > > > I've tested with 0, 1, 2 and 3 retries of the main loop if miss_cnt
> > > > > == 0. The test eventually failed in all cases, with 0 retries
> > > > > requiring only a handful of iterations, 1 -- on the order of 100, 2
> > > > > -- on the order of 10000 and 3 -- on the order of 100000.
> > > > 
> > > > Do you have a theory on the mechanism of failure here? I'm guessing
> > > > it's something like this: there's a thread that goes unseen in the
> > > > first round, and during the second round, it creates a new thread and
> > > > exits itself. The exit gets seen (again, it doesn't show up in the
> > > > dirents) but the new thread it created still doesn't. Is that right?
> > > > 
> > > > In any case, it looks like the whole mechanism we're using is
> > > > unreliable, so something needs to be done. My leaning is to go with
> > > > the global thread list and atomicity of list-unlock with exit.
> > > 
> > > yes that sounds possible, i added some instrumentation to musl
> > > and the trace shows situations like that before the deadlock,
> > > exiting threads can even cause old (previously seen) entries to
> > > disappear from the dir.
> > > 
> > Thanks for the thorough instrumentation! Your traces confirm both my theory
> > about the deadlock and unreliability of /proc/self/task.
> > 
> > I'd also done a very light instrumentation just before I got your email, but
> > it took me a while to understand the output I got (see below).
> 
> the attached patch fixes the issue on my machine.
> i don't know if this is just luck.
> 
> the assumption is that if /proc/self/task is read twice such that
> all tids in it seem to be active and caught, then all the active
> threads of the process are caught (no new threads that are already
> started but not visible there yet)

I'm skeptical of whether this should work in principle. If the first
scan of /proc/self/task misses tid J, and during the next scan, tid J
creates tid K then exits, it seems like we could see the same set of
tids on both scans.

Maybe it's salvagable though. Since __block_new_threads is true, in
order for this to happen, tid J must have been between the
__block_new_threads check in pthread_create and the clone syscall at
the time __synccall started. The number of threads in such a state
seems to be bounded by some small constant (like 2) times
libc.threads_minus_1+1, computed at any point after
__block_new_threads is set to true, so sufficiently heavy presignaling
(heavier than we have now) might suffice to guarantee that all are
captured. 

Rich
Szabolcs Nagy Feb. 10, 2019, 1:16 a.m.
* Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
> On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> > the assumption is that if /proc/self/task is read twice such that
> > all tids in it seem to be active and caught, then all the active
> > threads of the process are caught (no new threads that are already
> > started but not visible there yet)
> 
> I'm skeptical of whether this should work in principle. If the first
> scan of /proc/self/task misses tid J, and during the next scan, tid J
> creates tid K then exits, it seems like we could see the same set of
> tids on both scans.
> 
> Maybe it's salvagable though. Since __block_new_threads is true, in
> order for this to happen, tid J must have been between the
> __block_new_threads check in pthread_create and the clone syscall at
> the time __synccall started. The number of threads in such a state
> seems to be bounded by some small constant (like 2) times
> libc.threads_minus_1+1, computed at any point after
> __block_new_threads is set to true, so sufficiently heavy presignaling
> (heavier than we have now) might suffice to guarantee that all are
> captured. 

heavier presignaling may catch more threads, but we don't
know how long should we wait until all signal handlers are
invoked (to ensure that all tasks are enqueued on the call
serializer chain before we start walking that list)
Rich Felker Feb. 10, 2019, 1:20 a.m.
On Sun, Feb 10, 2019 at 02:16:23AM +0100, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
> > On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> > > the assumption is that if /proc/self/task is read twice such that
> > > all tids in it seem to be active and caught, then all the active
> > > threads of the process are caught (no new threads that are already
> > > started but not visible there yet)
> > 
> > I'm skeptical of whether this should work in principle. If the first
> > scan of /proc/self/task misses tid J, and during the next scan, tid J
> > creates tid K then exits, it seems like we could see the same set of
> > tids on both scans.
> > 
> > Maybe it's salvagable though. Since __block_new_threads is true, in
> > order for this to happen, tid J must have been between the
> > __block_new_threads check in pthread_create and the clone syscall at
> > the time __synccall started. The number of threads in such a state
> > seems to be bounded by some small constant (like 2) times
> > libc.threads_minus_1+1, computed at any point after
> > __block_new_threads is set to true, so sufficiently heavy presignaling
> > (heavier than we have now) might suffice to guarantee that all are
> > captured. 
> 
> heavier presignaling may catch more threads, but we don't
> know how long should we wait until all signal handlers are
> invoked (to ensure that all tasks are enqueued on the call
> serializer chain before we start walking that list)

That's why reading /proc/self/task is still necessary. However, it
seems useful to be able to prove you've queued enough signals that at
least as many threads as could possibly exist are already in a state
where they cannot return from a syscall with signals unblocked without
entering the signal handler. In that case you would know there's no
more racing going on to create new threads, so reading /proc/self/task
is purely to get the list of threads you're waiting to enqueue
themselves on the chain, not to find new threads you need to signal.

Rich
Rich Felker Feb. 10, 2019, 4:01 a.m.
On Sat, Feb 09, 2019 at 08:20:32PM -0500, Rich Felker wrote:
> On Sun, Feb 10, 2019 at 02:16:23AM +0100, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
> > > On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> > > > the assumption is that if /proc/self/task is read twice such that
> > > > all tids in it seem to be active and caught, then all the active
> > > > threads of the process are caught (no new threads that are already
> > > > started but not visible there yet)
> > > 
> > > I'm skeptical of whether this should work in principle. If the first
> > > scan of /proc/self/task misses tid J, and during the next scan, tid J
> > > creates tid K then exits, it seems like we could see the same set of
> > > tids on both scans.
> > > 
> > > Maybe it's salvagable though. Since __block_new_threads is true, in
> > > order for this to happen, tid J must have been between the
> > > __block_new_threads check in pthread_create and the clone syscall at
> > > the time __synccall started. The number of threads in such a state
> > > seems to be bounded by some small constant (like 2) times
> > > libc.threads_minus_1+1, computed at any point after
> > > __block_new_threads is set to true, so sufficiently heavy presignaling
> > > (heavier than we have now) might suffice to guarantee that all are
> > > captured. 
> > 
> > heavier presignaling may catch more threads, but we don't
> > know how long should we wait until all signal handlers are
> > invoked (to ensure that all tasks are enqueued on the call
> > serializer chain before we start walking that list)
> 
> That's why reading /proc/self/task is still necessary. However, it
> seems useful to be able to prove you've queued enough signals that at
> least as many threads as could possibly exist are already in a state
> where they cannot return from a syscall with signals unblocked without
> entering the signal handler. In that case you would know there's no
> more racing going on to create new threads, so reading /proc/self/task
> is purely to get the list of threads you're waiting to enqueue
> themselves on the chain, not to find new threads you need to signal.

One thing to note: SYS_kill is not required to queue an unlimited
number of signals, and might not report failure to do so. We should
probably be using SYS_rt_sigqueue, counting the number of signals
successfully queued, and continue sending them during the loop that
monitors progress building the chain until the necessary number have
been successfully sent, if we're going to rely on the above properties
to guarantee that we've caught every thread.

Rich
Alexey Izbyshev Feb. 10, 2019, 12:15 p.m.
On 2019-02-10 04:20, Rich Felker wrote:
> On Sun, Feb 10, 2019 at 02:16:23AM +0100, Szabolcs Nagy wrote:
>> * Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
>> > Maybe it's salvagable though. Since __block_new_threads is true, in
>> > order for this to happen, tid J must have been between the
>> > __block_new_threads check in pthread_create and the clone syscall at
>> > the time __synccall started. The number of threads in such a state
>> > seems to be bounded by some small constant (like 2) times
>> > libc.threads_minus_1+1, computed at any point after
>> > __block_new_threads is set to true, so sufficiently heavy presignaling
>> > (heavier than we have now) might suffice to guarantee that all are
>> > captured.
>> 
>> heavier presignaling may catch more threads, but we don't
>> know how long should we wait until all signal handlers are
>> invoked (to ensure that all tasks are enqueued on the call
>> serializer chain before we start walking that list)
> 
> That's why reading /proc/self/task is still necessary. However, it
> seems useful to be able to prove you've queued enough signals that at
> least as many threads as could possibly exist are already in a state
> where they cannot return from a syscall with signals unblocked without
> entering the signal handler. In that case you would know there's no
> more racing going on to create new threads, so reading /proc/self/task
> is purely to get the list of threads you're waiting to enqueue
> themselves on the chain, not to find new threads you need to signal.

Similar to Szabolcs, I fail to see how heavier presignaling would help. 
Even if we're sure that we'll *eventually* catch all threads (including 
their future children) that were between __block_new_threads check in 
pthread_create and the clone syscall at the time we set 
__block_new_threads to 1, we still have no means to know whether we 
reached a stable state. In other words, we don't know when we should 
stop spinning in /proc/self/task loop because we may miss threads that 
are currently being created.

Also, note that __pthread_exit() blocks all signals and decrements 
libc.threads_minus_1 before exiting, so an arbitrary number of threads 
may be exiting while we're in /proc/self/task loop, and we know that 
concurrently exiting threads are related to misses.

Alexey
Szabolcs Nagy Feb. 10, 2019, 12:32 p.m.
* Rich Felker <dalias@libc.org> [2019-02-09 23:01:50 -0500]:
> On Sat, Feb 09, 2019 at 08:20:32PM -0500, Rich Felker wrote:
> > On Sun, Feb 10, 2019 at 02:16:23AM +0100, Szabolcs Nagy wrote:
> > > * Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
> > > > On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> > > > > the assumption is that if /proc/self/task is read twice such that
> > > > > all tids in it seem to be active and caught, then all the active
> > > > > threads of the process are caught (no new threads that are already
> > > > > started but not visible there yet)

it seems if the main thread exits, it is still listed in /proc/self/task
and has zombie status for the lifetime of the process so futex lock always
fails with ESRCH.

so my logic waiting for all exiting threads to exit does not work (at
least the main thread needs to be special cased).

> > > > 
> > > > I'm skeptical of whether this should work in principle. If the first
> > > > scan of /proc/self/task misses tid J, and during the next scan, tid J
> > > > creates tid K then exits, it seems like we could see the same set of
> > > > tids on both scans.
> > > > 
> > > > Maybe it's salvagable though. Since __block_new_threads is true, in
> > > > order for this to happen, tid J must have been between the
> > > > __block_new_threads check in pthread_create and the clone syscall at
> > > > the time __synccall started. The number of threads in such a state
> > > > seems to be bounded by some small constant (like 2) times
> > > > libc.threads_minus_1+1, computed at any point after
> > > > __block_new_threads is set to true, so sufficiently heavy presignaling
> > > > (heavier than we have now) might suffice to guarantee that all are
> > > > captured. 
> > > 
> > > heavier presignaling may catch more threads, but we don't
> > > know how long should we wait until all signal handlers are
> > > invoked (to ensure that all tasks are enqueued on the call
> > > serializer chain before we start walking that list)
> > 
> > That's why reading /proc/self/task is still necessary. However, it
> > seems useful to be able to prove you've queued enough signals that at
> > least as many threads as could possibly exist are already in a state
> > where they cannot return from a syscall with signals unblocked without
> > entering the signal handler. In that case you would know there's no
> > more racing going on to create new threads, so reading /proc/self/task
> > is purely to get the list of threads you're waiting to enqueue
> > themselves on the chain, not to find new threads you need to signal.
> 
> One thing to note: SYS_kill is not required to queue an unlimited
> number of signals, and might not report failure to do so. We should
> probably be using SYS_rt_sigqueue, counting the number of signals
> successfully queued, and continue sending them during the loop that
> monitors progress building the chain until the necessary number have
> been successfully sent, if we're going to rely on the above properties
> to guarantee that we've caught every thread.

yes, but even if we sent enough signals that cannot be dropped,
and see all tasks in /proc/self/task to be caught in the handler,
there might be tasks that haven't reached the handler yet and
not visible in /proc/self/task yet. if they add themselves to the
chain after we start processing it then they will wait forever.

as a duct-tape solution we could sleep a bit after all visible tasks
are stopped to give a chance to the not yet visible ones to run
(or to show up in /proc/self/task).

but ideally we would handle non-libc created threads too, so using
libc.threads_minus_1 and __block_new_threads is already suboptimal,
a mechanism like ptrace or SIGSTOP is needed that affects all tasks.
Rich Felker Feb. 10, 2019, 2:57 p.m.
On Sun, Feb 10, 2019 at 03:15:55PM +0300, Alexey Izbyshev wrote:
> On 2019-02-10 04:20, Rich Felker wrote:
> >On Sun, Feb 10, 2019 at 02:16:23AM +0100, Szabolcs Nagy wrote:
> >>* Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
> >>> Maybe it's salvagable though. Since __block_new_threads is true, in
> >>> order for this to happen, tid J must have been between the
> >>> __block_new_threads check in pthread_create and the clone syscall at
> >>> the time __synccall started. The number of threads in such a state
> >>> seems to be bounded by some small constant (like 2) times
> >>> libc.threads_minus_1+1, computed at any point after
> >>> __block_new_threads is set to true, so sufficiently heavy presignaling
> >>> (heavier than we have now) might suffice to guarantee that all are
> >>> captured.
> >>
> >>heavier presignaling may catch more threads, but we don't
> >>know how long should we wait until all signal handlers are
> >>invoked (to ensure that all tasks are enqueued on the call
> >>serializer chain before we start walking that list)
> >
> >That's why reading /proc/self/task is still necessary. However, it
> >seems useful to be able to prove you've queued enough signals that at
> >least as many threads as could possibly exist are already in a state
> >where they cannot return from a syscall with signals unblocked without
> >entering the signal handler. In that case you would know there's no
> >more racing going on to create new threads, so reading /proc/self/task
> >is purely to get the list of threads you're waiting to enqueue
> >themselves on the chain, not to find new threads you need to signal.
> 
> Similar to Szabolcs, I fail to see how heavier presignaling would
> help. Even if we're sure that we'll *eventually* catch all threads
> (including their future children) that were between
> __block_new_threads check in pthread_create and the clone syscall at
> the time we set __block_new_threads to 1, we still have no means to
> know whether we reached a stable state. In other words, we don't
> know when we should stop spinning in /proc/self/task loop because we
> may miss threads that are currently being created.

This seems correct.

> Also, note that __pthread_exit() blocks all signals and decrements
> libc.threads_minus_1 before exiting, so an arbitrary number of
> threads may be exiting while we're in /proc/self/task loop, and we
> know that concurrently exiting threads are related to misses.

This too -- there could in theory be unboundedly many threads that
have already decremented threads_minus_1 but haven't yet exited, and
this approach has no way to ensure that we wait for them to exit
before returning from __synccall.

I'm thinking that the problems here are unrecoverable and that we need
the thread list.

Rich
Rich Felker Feb. 10, 2019, 3:05 p.m.
On Sun, Feb 10, 2019 at 01:32:14PM +0100, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2019-02-09 23:01:50 -0500]:
> > On Sat, Feb 09, 2019 at 08:20:32PM -0500, Rich Felker wrote:
> > > On Sun, Feb 10, 2019 at 02:16:23AM +0100, Szabolcs Nagy wrote:
> > > > * Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
> > > > > On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> > > > > > the assumption is that if /proc/self/task is read twice such that
> > > > > > all tids in it seem to be active and caught, then all the active
> > > > > > threads of the process are caught (no new threads that are already
> > > > > > started but not visible there yet)
> 
> it seems if the main thread exits, it is still listed in /proc/self/task
> and has zombie status for the lifetime of the process so futex lock always
> fails with ESRCH.
> 
> so my logic waiting for all exiting threads to exit does not work (at
> least the main thread needs to be special cased).
> 
> > > > > 
> > > > > I'm skeptical of whether this should work in principle. If the first
> > > > > scan of /proc/self/task misses tid J, and during the next scan, tid J
> > > > > creates tid K then exits, it seems like we could see the same set of
> > > > > tids on both scans.
> > > > > 
> > > > > Maybe it's salvagable though. Since __block_new_threads is true, in
> > > > > order for this to happen, tid J must have been between the
> > > > > __block_new_threads check in pthread_create and the clone syscall at
> > > > > the time __synccall started. The number of threads in such a state
> > > > > seems to be bounded by some small constant (like 2) times
> > > > > libc.threads_minus_1+1, computed at any point after
> > > > > __block_new_threads is set to true, so sufficiently heavy presignaling
> > > > > (heavier than we have now) might suffice to guarantee that all are
> > > > > captured. 
> > > > 
> > > > heavier presignaling may catch more threads, but we don't
> > > > know how long should we wait until all signal handlers are
> > > > invoked (to ensure that all tasks are enqueued on the call
> > > > serializer chain before we start walking that list)
> > > 
> > > That's why reading /proc/self/task is still necessary. However, it
> > > seems useful to be able to prove you've queued enough signals that at
> > > least as many threads as could possibly exist are already in a state
> > > where they cannot return from a syscall with signals unblocked without
> > > entering the signal handler. In that case you would know there's no
> > > more racing going on to create new threads, so reading /proc/self/task
> > > is purely to get the list of threads you're waiting to enqueue
> > > themselves on the chain, not to find new threads you need to signal.
> > 
> > One thing to note: SYS_kill is not required to queue an unlimited
> > number of signals, and might not report failure to do so. We should
> > probably be using SYS_rt_sigqueue, counting the number of signals
> > successfully queued, and continue sending them during the loop that
> > monitors progress building the chain until the necessary number have
> > been successfully sent, if we're going to rely on the above properties
> > to guarantee that we've caught every thread.
> 
> yes, but even if we sent enough signals that cannot be dropped,
> and see all tasks in /proc/self/task to be caught in the handler,
> there might be tasks that haven't reached the handler yet and
> not visible in /proc/self/task yet. if they add themselves to the
> chain after we start processing it then they will wait forever.
> 
> as a duct-tape solution we could sleep a bit after all visible tasks
> are stopped to give a chance to the not yet visible ones to run
> (or to show up in /proc/self/task).

This is not going to help on a box that's swapping to hell where one
of the threads takes 30 seconds to run again (which is a real
possibility!)

> but ideally we would handle non-libc created threads too, so using
> libc.threads_minus_1 and __block_new_threads is already suboptimal,

Non-libc-created threads just can't be supported; they break in all
sorts of ways and have to just be considered totally undefined. The
synccall signal handler couldn't even perform any of the operations it
does, since libc functions all (by contract, if not in practice) rely
on having a valid thread pointer. We bend this rule slightly (and very
carefully) in posix_spawn to make syscalls with a context shared with
the thread in the parent process, but allowing it to be broken in
arbitrary ways by application code is just not practical.

> a mechanism like ptrace or SIGSTOP is needed that affects all tasks.

Yes, that would work, but is incompatible with running in an
already-traced task as far as I know.

Rich