[PATCH] Collect processes in batches

Wed Jun 9 10:46:40 UTC 2010

On Tue, 2010-06-01 at 18:50 -0700, Roland McGrath wrote:
> This has a woeful lack of introduction or explanation for readers of the
> posting, and of appropriate comments in the code about the whole plan here.
> 
> Firstly, I'll explain the background to this for the benefit of other
> reviewers and mailing list readers.
> 
> The Linux kernel implementation of wait4 works by iterating over a list
> of all children and tracees, checking each for being dead or stopped or
> whatever so as it's interesting to report it.  In Linux kernels prior to
> 2.6.25, when this logic found a stopped child to report (i.e. when
> yielding a WIFSTOPPED status), it would move that child to the end of
> the list.  Hence, the next wait4 call would check each other child
> first, before the one returned last time.
> 
> This reordering had a useful effect in some scenarios.  A simple-minded
> tracer will use wait4 to see a tracee stop, then use ptrace to resume
> that tracee, and then loop to call wait4 for the next tracee to stop.
> This is exactly what strace does.  When tracing multiple threads, if the
> first thread to be examined is busy-waiting, or just doing very quick
> system calls with little computation between them, that thread might
> already have stopped again for its next event by the time strace calls
> wait4 again.  When that happens, without the reordering behavior, strace
> will report and resume that some one thread over and over again, leaving
> the other threads still stopped and never getting around to reporting
> and resuming the.  This is called a "starvation" scenario.  But, because
> of the old reordering behavior, reporting would always be round-robin
> among stopped candidates, and hence so would strace's resuming.  So
> before 2.6.25, the "starvation" scenario did not show up in practice.
> 
> Since version 2.6.25, Linux no longer does this reordering.  The list
> stays in its original order throughout (roughly, the order of child
> creation, followed by the order of tracee attach).  The kernel
> maintainers decided that there was never any guarantee of any sort of
> about the order in which available children or tracees are seen in wait4
> calls--it's always been subject to timing, and userland always should
> have been sophisticated enough to drain all the pending reports before
> resuming anybody.  (In fact, I was the kernel maintainer who decided
> that.  It definitely is the right thing for the kernel to do, I can say
> with my kernel hat on.  With my strace hat on, I can fret that the
> kernel's change has made it more likely that cases where strace's lack
> of sophistication in this regard matters will arise in practice.  But
> the kernel change has already happened and it won't go back.)
> 
> 
> Now, to this patch.  There are magic global variables with no
> explanation why--something like that always just obviously needs more
> scrutiny.

These variables?

+#ifdef LINUX
+static int remembered_pid;
+static int remembered_status;
+#endif

The explanation is at the point where they are used:

+               /* If we waited and got a stopped task notification,
+                * subsequent wait may return the same pid again, for example,
+                * with SIGKILL notification. SIGKILL kills even stopped tasks.
+                * We must not add it to the list
+                * (one task can't be inserted twice in the list).
+                */
+               {
+                       struct tcb *f = found_tcps;
+                       while (f) {
+                               if (f == tcp) {
+                                       remembered_pid = pid;
+                                       remembered_status = status;
+                                       return found_tcps;
+                               }
+                               f = f->next_need_service;
+                       }
+               }

> It is a bit hard to follow the changes to the control flow being made,
> in the absence of any comments about it all.  I think it is doing it
> sort of inside out of how I would approach the problem.

The patch works like this: collect all waitable tracees, store them
in a linked list, then traverse it and act on each tracee.
Rinse, repeat.

> Here is what I think is the way to do this that is fairly clean and the
> least invasive to strace's existing code and control flow.
> 
> In trace(), replace the three places PTRACE_SYSCALL is used with a
> common subroutine that doesn't call ptrace immediately.  Instead, it
> stores the resumption signal and adds the tcb to the tail of a linked
> list.  At the top of the loop, where we call wait4, check that list.  If
> it's nonempty, then use WNOHANG.  When wait4 returns 0, then consume the
> linked list by doing each PTRACE_SYSCALL in the order they were reported,
> and short-circuit back to the top of the loop.  There the list will be
> empty, so you'll call wait4 without WNOHANG.

This might work too. You are proposing to collect a waitable tracee,
decide what to do with it, remember the action you want to do on it
and store the tracee in a linked list, go back and collect next waitable
tracee. When there is no more tracees, perform remembered actions.

Looks more complex because you need to store a bit of data (action you
want to perform) with each remembered tracee, but it might work.

> For the !WIFSTOPPED (death) cases, you need to be sure not to get to
> droptcb while the linked list still points to the tcb.

In my algorithm, I never ever try to delete a tcb which is in the middle
in the middle of the list of remembered tracees because I always process
the list from the head. Deletion at the head is trivial.

> (You can get a
> death report without having resuming it yet, for various sudden-death
> cases.)

Exactly! This is the case which I actually have seen in testing:
I see the same tracee reported by waitpid even though I already saw it,
remembered it, but did not do anything to it yet. And yet, it is
appearing again. For one, SIGKILL can do that.

I decided to deal with this problem simply by stashing out the pid
and wait status in remembered_pid and remembered_status variables,
process and flush the list I accumulated so far, then process
another "list" of tracees which consists of only one tracee: the one
whose status I saved in (remembered_pid, remembered_status) tuple.

Here is the relevant code which notices that we do have a "remembered"
tracee and constructs/returns an one-element list:

+static struct tcb *
+collect_stopped_tcbs(void)
 {
        int pid;
        int wait_errno;
        int status;
        struct tcb *tcp;
+       struct tcb *found_tcps;
 #ifdef LINUX
+       struct tcb **nextp;
        struct rusage ru;
+       int wnohang = 0;
 #ifdef __WALL
        static int wait4_options = __WALL;
 #endif
+
+       if (remembered_pid) {
+               pid = remembered_pid;
+               remembered_pid = 0;
+               if (debug)
+                       fprintf(stderr, " [remembered wait(%#x) = %u]\n",
+                                               remembered_status, pid);
+               tcp = pid2tcb(pid); /* can't be NULL */
+               tcp->wait_status = remembered_status;
+               tcp->next_need_service = NULL;
+               return tcp;
+       }

So, I guess, my explanation should just be pasted in into this code
as a comment in if(remembered_pid) body...

>   Probably the easiest thing is to use a new flag TCB_RESUMING to
> keep track of being on the list (or you might just use the list pointer
> field in a way where you can check it, i.e. tail element's pointer is
> not to NULL so it looks like an unlisted tcb).  Check that in droptcb
> and have it just clear ->pid but leave ->flags TCB_INUSE|TCB_RESUMING.
> Then the resumption loop can notice these ->pid==0 tcb's and just clear
> their ->flags (droptcb already having done the rest of the work).
> 
> Does that make sense?

In my code this is not necessary since it always processes tracees
at the head of the collected list, thus any deletion will also happen
at the head.

But it can be done your way too.

One thing I would want to see changed is a "next_need_service" member
name. For one, it's not even a correct English, should be
"next_needs_service" - but still it is ugly. My fault... Please think
of a better name...

-- 
vda