[PATCH] Collect processes in batches

Wed Jun 2 01:50:32 UTC 2010

This has a woeful lack of introduction or explanation for readers of the
posting, and of appropriate comments in the code about the whole plan here.

Firstly, I'll explain the background to this for the benefit of other
reviewers and mailing list readers.

The Linux kernel implementation of wait4 works by iterating over a list
of all children and tracees, checking each for being dead or stopped or
whatever so as it's interesting to report it.  In Linux kernels prior to
2.6.25, when this logic found a stopped child to report (i.e. when
yielding a WIFSTOPPED status), it would move that child to the end of
the list.  Hence, the next wait4 call would check each other child
first, before the one returned last time.

This reordering had a useful effect in some scenarios.  A simple-minded
tracer will use wait4 to see a tracee stop, then use ptrace to resume
that tracee, and then loop to call wait4 for the next tracee to stop.
This is exactly what strace does.  When tracing multiple threads, if the
first thread to be examined is busy-waiting, or just doing very quick
system calls with little computation between them, that thread might
already have stopped again for its next event by the time strace calls
wait4 again.  When that happens, without the reordering behavior, strace
will report and resume that some one thread over and over again, leaving
the other threads still stopped and never getting around to reporting
and resuming the.  This is called a "starvation" scenario.  But, because
of the old reordering behavior, reporting would always be round-robin
among stopped candidates, and hence so would strace's resuming.  So
before 2.6.25, the "starvation" scenario did not show up in practice.

Since version 2.6.25, Linux no longer does this reordering.  The list
stays in its original order throughout (roughly, the order of child
creation, followed by the order of tracee attach).  The kernel
maintainers decided that there was never any guarantee of any sort of
about the order in which available children or tracees are seen in wait4
calls--it's always been subject to timing, and userland always should
have been sophisticated enough to drain all the pending reports before
resuming anybody.  (In fact, I was the kernel maintainer who decided
that.  It definitely is the right thing for the kernel to do, I can say
with my kernel hat on.  With my strace hat on, I can fret that the
kernel's change has made it more likely that cases where strace's lack
of sophistication in this regard matters will arise in practice.  But
the kernel change has already happened and it won't go back.)

Now, to this patch.  There are magic global variables with no
explanation why--something like that always just obviously needs more
scrutiny.  There are also nits like wrong comment formatting that
everybody here should know well for this code base by now.  

It is a bit hard to follow the changes to the control flow being made,
in the absence of any comments about it all.  I think it is doing it
sort of inside out of how I would approach the problem.

Here is what I think is the way to do this that is fairly clean and the
least invasive to strace's existing code and control flow.

In trace(), replace the three places PTRACE_SYSCALL is used with a
common subroutine that doesn't call ptrace immediately.  Instead, it
stores the resumption signal and adds the tcb to the tail of a linked
list.  At the top of the loop, where we call wait4, check that list.  If
it's nonempty, then use WNOHANG.  When wait4 returns 0, then consume the
linked list by doing each PTRACE_SYSCALL in the order they were reported,
and short-circuit back to the top of the loop.  There the list will be
empty, so you'll call wait4 without WNOHANG.

For the !WIFSTOPPED (death) cases, you need to be sure not to get to
droptcb while the linked list still points to the tcb.  (You can get a
death report without having resuming it yet, for various sudden-death
cases.)  Probably the easiest thing is to use a new flag TCB_RESUMING to
keep track of being on the list (or you might just use the list pointer
field in a way where you can check it, i.e. tail element's pointer is
not to NULL so it looks like an unlisted tcb).  Check that in droptcb
and have it just clear ->pid but leave ->flags TCB_INUSE|TCB_RESUMING.
Then the resumption loop can notice these ->pid==0 tcb's and just clear
their ->flags (droptcb already having done the rest of the work).

Does that make sense?

Thanks,
Roland