[patch] test/leaderkill fix

Wed Jul 11 08:40:48 UTC 2007

The behavior I see from that test case is different from what you describe.
I have never seen a wrong exit status get to a real parent's waitpid, and
I'm not entirely sure how that could happen.  What I do see is that after
PTRACE_KILL, the leader is a zombie but the other thread is still alive,
and so the wait4 in detach called on the leader in handle_group_exit
blocks.  In the -p test scenario, this just makes strace block so you can't
interrupt it.  This failure mode makes sense to me.

I'm not really sure any more why it was necessary to explicitly detach the
leader there.  I think some old kernel must not have behaved like a current
one would in this case.  Once the group exit is allowed to commence (by
detaching the thread causing the exit, i.e. TCP in handle_group_exit), then
all threads including the leader will die and report their deaths
appropriately.  But then, on current kernels the TCB_SUSPENDED logic and
the like is not required at all either.

What seems like it ought to be fine even on a very old kernel is to detach
the instigating thread first.  It will definitely die and report soon, so
its detach will not block for a long time.  Then detaching the leader will
work, at least in this test case.  The PTRACE_KILL should not be required,
because the synchronous waiting in detach() ensures that the instigating
thread finished and so has posted the death signal to all threads,
precluding the leader running any more user code on detach.
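
To make the ordering argument concrete, here is a toy model in C.  It is
not strace code and makes no ptrace calls.  The `group` struct, the state
names, and all of the functions are hypothetical.  The model simply assumes
that a wait on a zombie leader completes only once every other thread in
the group has been reaped, and that detaching the instigating thread reaps
it and delivers the death signal to the rest.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread states for the model. */
enum state { ALIVE, ZOMBIE, REAPED };

/* Hypothetical thread group; index 0 is the leader, 2 <= n <= 8. */
struct group {
    enum state st[8];
    size_t n;
};

void group_init(struct group *g, size_t n)
{
    g->n = n;
    for (size_t i = 0; i < n; i++)
        g->st[i] = ALIVE;
}

/* Detach the thread causing the group exit.  Assumption: it dies and is
   reaped by the synchronous wait, and every other live thread is killed. */
bool detach_instigator(struct group *g, size_t who)
{
    g->st[who] = REAPED;
    for (size_t i = 0; i < g->n; i++)
        if (g->st[i] == ALIVE)
            g->st[i] = ZOMBIE;
    return true;
}

/* Detach the leader.  Returns false where the synchronous wait would
   block, i.e. while any other thread is still unreaped. */
bool detach_leader(struct group *g)
{
    if (g->st[0] == ALIVE)
        g->st[0] = ZOMBIE;        /* stands in for the old PTRACE_KILL */
    for (size_t i = 1; i < g->n; i++)
        if (g->st[i] != REAPED)
            return false;         /* wait4 on the leader would block */
    g->st[0] = REAPED;
    return true;
}

/* Old order: leader first. */
bool leader_first_completes(size_t n)
{
    struct group g;
    group_init(&g, n);
    return detach_leader(&g);
}

/* New order: instigating thread (here the last one) first, then leader. */
bool instigator_first_completes(size_t n)
{
    struct group g;
    group_init(&g, n);
    detach_instigator(&g, n - 1);
    return detach_leader(&g);
}
```

With two threads, leader_first_completes(2) is false in this model and
instigator_first_completes(2) is true, which matches what I see with
test/leaderkill.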

I've committed that change.  It fixes the test/leaderkill case.  I am still
concerned about other cases where there are more threads.  I think that the
synchronous wait in detach will bite again on the leader because the other
threads still exist.  They should be killed by the group exit, but they
will still stick around as zombies until we see them with wait because they
are ptraced.  I think that is enough to prevent the zombie leader from
being reported to wait.  So it would be good to investigate some more
cases.  If I'm right about that case, then I think the right solution is
simply to punt the detach call on the leader in handle_group_exit.  It
should be seen shortly along with all the other threads.  But I may be
overlooking something, some reason that detach was there other than ancient
kernels.
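
As a toy illustration of why dropping the leader detach should be harmless,
here is a small C model of an ordinary wait loop.  It is only a sketch and
not strace code: the state names and both functions are hypothetical, and
it assumes that a non-leader zombie can be reported to wait right away
while a zombie leader is reported only after all the other threads have
been reaped.

```c
#include <stddef.h>

/* Hypothetical per-thread states; index 0 of the array is the leader. */
enum state { ALIVE, ZOMBIE, REAPED };

/* Return the index of a thread this model would report to wait, or -1
   if a wait would block. */
int next_reportable(const enum state *st, size_t n)
{
    size_t others_left = 0;

    for (size_t i = 1; i < n; i++) {
        if (st[i] == ZOMBIE)
            return (int)i;        /* non-leader zombies report at once */
        if (st[i] != REAPED)
            others_left++;        /* still alive, so the leader must wait */
    }
    if (st[0] == ZOMBIE && others_left == 0)
        return 0;                 /* leader reports only when alone */
    return -1;
}

/* Reap whatever is reportable until nothing is left to report.  Records
   the reap order in ORDER and returns how many threads were reaped. */
size_t reap_all(enum state *st, size_t n, int *order)
{
    size_t k = 0;
    int i;

    while ((i = next_reportable(st, n)) >= 0) {
        st[i] = REAPED;
        order[k++] = i;
    }
    return k;
}
```

Starting from a group where every thread is already a zombie, reap_all
collects all of them without ever waiting on the leader specifically, and
the leader comes out last.  If one of the other threads were still alive
the loop would stop early, so the model says nothing about that case.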

Thanks,
Roland