strace cleanup() problems

Fri Jan 22 04:56:04 UTC 2010

[Sorry, first version sent unfinished.  Ignore the earlier reply.]

> We found some problems with strace's cleanup() logic during the testing,
> see below. Please bear in mind I am not familiar with strace's code ;)

Unfortunately, nobody really is at this point.  strace had already been
around for a very long time before I started working on it.  I added
some of the code we're discussing now, but that was several years ago
now.  It was all done to work with the quirks of ptrace on the Linux
kernels of the day at that time.  This was all before I started fixing
ptrace in Linux.  As you well know, the Linux kernel's behavior in many
corner cases has changed a great deal since.

In the long history of strace, it has always gone to some pains to
continue working with a variety of old kernels.  Even acknowledging that
the behavior of a given old kernel is wrong and/or inconsistent, if
there was in practice a particular usage scenario with strace that
worked on that kernel before, we don't want to break it now.  This is
what's behind my habitual resistance to big changes to the ways strace
uses the kernel facilities.

Most of the sensitive strace code has not changed much in several years
and many kernel versions now.  The hairy parts, even ones that I
debugged in the past and parts I wrote, are dim memories at best.
Dmitry Levin knows strace in general as well as anyone does, but I don't
know that he has been in the weeds of the ptrace interactions much.
Andreas now has the primary burden of working on this code, but these
are his first occasions to get exposed to all the corners relevant here.

So we'll all just have to help each other through it.  Andreas and I
will be gleaning it all just by reading the current code and reasoning
through the wrinkles, the same as you.  I hope that you can hold up one
end of the discussion, representing what permutations the kernel can
present, and that Andreas can formulate what strace's code should do to
cope better.  I'll stick to trying to help elucidate how the code works
now and perhaps sometimes why it got that way.

> Afaics, it is not right to do cleanup() if clearbpt() fails. Yes, the
> "Pretty fatal" comment is correct, but we can have other processes to
> continue tracing.

I think this code dates from the original (now #ifndef LINUX) method of
setbpt/clearbpt.  That did actual text breakpoint insertion in the
parent before it executes the (fork) syscall.  This means that if the
breakpoint is not cleared in the child, then resuming the child will get
stuck on more unexpected SIGTRAPs, or will execute bad instructions (as
on x86, if a one-byte breakpoint instruction is written over a longer
instruction).  I think that sort of scenario is what drives the "Pretty
fatal" thought expressed in the comment there.  In contrast, the worst
outcome of the current (#ifdef LINUX) method is an argument register
left unrestored, and chances are that register was dead in the userland
data flow anyway.  (That's all tangential to your real point, and just
says that "Pretty fatal" might have made one kind of sense in the past
but, even on that kind of thinking alone, it is rather weaker today.)

The real point seems to be that this comment, and the behavior that goes
with it, were written without really having -f in mind.  If you have a
problem tracing one process, then that is reasonably an instant
fully-fatal error when you only have one process to trace.

> IOW, suppose "strace -c" traces the process P which fork's child C.

I take it you mean "strace -f -c".

> Now droptcb(C) can fail just because C is SIGKILLED and therefore
> PTRACE_GETREGS/PTRACE_SETREGS fails. strace shouldn't "panic" in
> this case imho, we should ignore ESRCH like ptrace_restart() does.

I see two questions here.  One is handling ESRCH from ptrace, in
particular.  The other is the general question of what to do when
something "very unexpected" happens.

First, on ESRCH.  There are two ways to interpret an ESRCH failure from
ptrace.  One is that the process/thread has died while sitting at the
proper ptrace stop where strace is examining it, so things are normal
and you should just indicate that the syscall was unfinished and handle
the exit report--i.e., just what you do when the race is a little slower
and ptrace completes before the thing gets killed.  The other is that
the thread is not actually stopped or attached or something, meaning
that strace has become horribly confused and perhaps should just panic.
The strace code we inherited certainly skews toward the latter.

So, one place we can start is trying to distinguish those two cases
concretely.  That is, given any ESRCH error (including ptrace_restart,
but not just there), we can ask, "Is this the normal case?"  In the
normal case, the PID of interest is now a zombie, or is gone entirely
(and in theory the PID might have been reused--though in practice that
would require a whole lot of racing or a system in a strange condition).

It's gone entirely in the case of exec by another thread in its process,
where the kernel eats the thread zombies regardless of ptrace.
Otherwise, you have a zombie waiting for you to see it with wait (i.e.,
back at the top trace() loop).  So an approach available on non-ancient
Linux kernels is to use:
	if (waitid(P_PID, tcp->pid, &info, WNOHANG|WNOWAIT) < 0 ?
	    errno == ESRCH : 
	    (info.si_pid == tcp->pid && !WIFSTOPPED(info.si_status)))
		// it's just dead, nothing fishy going on
		tcp->ptrace_errno = ESRCH;
	else
		// "very unexpected"
(For more ancient kernels, you don't get WNOWAIT, so you'd have to do a
real waitpid() that consumes the status and do bookkeeping to feed that
back into the main processing loop.)

Next, there is the general question of what we should be doing when we
declare the circumstances "very unexpected".  That is, all the places
where we call cleanup() and exit (once we've eliminated any paths to
that now that we don't think really qualify as "very unexpected").

Arguably, it now punts to panic too often even assuming the situation at
hand means that strace has become horribly confused about at least one
process.  Is it any worse for strace to stumble along tracing what it
can when it has one of these issues?  I don't know.

> Actually, I don't really understand what cleanup() buys us if it is
> called when something goes wrong. Can't strace just exit and rely on
> implicit detach in kernel? Except cleanup() should probably send
> CONT/TERM before exiting. Yes, we can leave the tracee in the "bad"
> state, but I do not see how cleanup() can help. For example, it does
> not try to clearbpt() if TCB_BPTSET.

detach() does call clearbpt().  So it only has that particular
shortcoming for the direct child of "strace /bin/command args".
Arguably it should do that same check when not calling detach(), that
seems right to me.  As I mentioned above, it's probably the practical
fact for some time that missing clearbpt() when it's "needed" didn't
really hurt in noticeable ways.

I think this code comes from a time when you very much could not rely on
the implicit detach by the kernel.  That is, people observed practical
problems with using strace and winding up with processes left stopped
and that sort of thing.  There may even have been situations with old
kernels where you could not fully recover after the fact, i.e. SIGCONT
could not restore all permutations of weirdness that could be left by
the kernel after ptrace was used.

Nowadays we have gone to quite some pains (mostly you have!) to ensure
that even abrupt implicit ptrace detaches don't leave threads stopped
when they should not be.  But even so, when we've just done
PTRACE_ATTACH to induce a SIGSTOP, they "should be" as far as the kernel
can tell.

> In any case, cleanup() itself have problems which can lead to the
> hang, please see below.

Right.  I think that probably if we clean up the "is this normal or very
unexpected" discernment, then we won't actually get into the situations
of "very unexpected, should panic" at all in the practical cases people
are seeing today.  After that, if we clean up the response to hitting a
"very unexpected" situation, then we won't be going to cleanup() even in
all of whatever scenarios remain.  But after that, there will still be
some ways to get to cleanup() and we'll want it to work robustly.
(Notably, ^C to an "strace -p PID" gets there without there being any
kind of unexpected error at all.)

> detach() assumes that if TCB_STARTUP is set, the tracee must have
> the pending SIGSTOP queued by attach and thus my_tgkill(SIGSTOP) is
> not needed.

Right.  If it's already pending, sending another one doesn't hurt.  But
if it's already dequeued the attach-time SIGSTOP and is heading to (or
in) the ptrace stop for that, then the second SIGSTOP would be left
pending after we detach.  Hence the current logic tries to determine if
it thinks there should be a SIGSTOP in flight or not.

> If cleanup() finds a TCB_ATTACHED sub-thread before TCB_STARTUP
> sub-thread X it sends SIGCONT to the thread group. This steals the
> pending SIGSTOP and wakes up the TCB_STARTUP tracee X. After that
> detach()->wait4(X) can hang "forever" until X exits or stops. For
> example, X can sleep in sys_pause() and strace hangs in wait4(X).

I see.

> Probably TCB_ATTACHED should be per-process, not per thread. 

That interacts with some hairy logic and I don't know off hand
what the ramifications of the change would be.

> Or perhaps cleanup() should be changed to detach() sub-threads first,
> and only then the main thread which has TCB_ATTACHED.

Yes, that sounds right.

> There is another reason why I think cleanup() should detach sub
> threads first. The main thread can be zombie, in this case
> my_tgkill(SIGSTOP) can't help and wait4() can't return unless
> we detach and reap all sub-threads.

Indeed, a good point.  In strace, a zombie initial thread has its
"struct tcb" kept alive with TCB_EXITING set (droptcb) until all its
other threads die too.  The entries for each other thread have
tcp->parent pointing to this.

Note that this zombie "struct tcb" will still be "live" in tcbtab, so
cleanup() will see ~TCB_ATTACHED and do the kill path for this case just
the same as when the initial thread is still alive.

So, one fairly minimal change would be to do:

	if (tcp->parent && !(tcp->parent->flags & TCB_ATTACHED))
		tcp->flags &= ~TCB_STARTUP;

Then we don't expect or worry about a pending SIGSTOP in detach().
Now we'd just need to make sure that the logic sending SIGCONT to
the process runs after the detach() on one of its threads, since
if detach() uses SIGSTOP that can wipe out a previous SIGCONT.

Of course on current kernels we know that we can just skip detach()
entirely for !(tcp->parent->flags & TCB_ATTACHED).  But I'm not at all
sure that this won't regress on some old kernels to where we can leave
some threads stopped in a way that SIGCONT and the SIGTERM delivery
doesn't resume and kill the whole process as they should.

> I can't understand why detach()->wait4() can't hang if TCB_STARTUP
> is set and the previous ptrace(PTRACE_DETACH) succeeded.

If PTRACE_DETACH succeeds, then it's not our ptracee any more, 
and so wait4 gets ECHILD.

> Finally, the exact scenario which we had during the testing.
> 
> To simplify, the tracee does:
> 
> 	fork();
> 
> 	// both parent and child run he code below
> 
> 	for (0 .. NUM_THREADS)
> 		pthread_create();
> 
> 	raise(SIGFPE);
> 
> it runs under "strace -f". We have 2 thread groups, G1 and G2.

Ok.

> G2's main thread does raise(SIGFPE), this kills (the kernel sends
> SIGKILL to all threads) a thread T in G2 and clearbpt(T) fails.
> strace calls cleanup().
> 
> cleanup() tries to detach() all traced theads, from G1 and G2.
> It detaches some of them, then it sees G1's main thread with
> TCB_ATTACHED set and sends SIGCONT to G1.
	       ^clear

Right?

> This wakes up a TCB_STARTUP thread X from G1, X resumes and sleeps
> in do_futex().

Arguably the intended behavior should be that when X resumes it sees the
SIGTERM sent to G1 before doing any user work (blocking or otherwise).
Of course, cleanup() should send SIGTERM before SIGCONT for that, and
it does SIGCONT first now.

> strace does detach(X) and hangs in wait4(X) (remember, detach()
> trusts TCB_STARTUP and doesn't send SIGSTOP).
> 
> Now, wait4(X) hangs forever because X waits for FUTEX_WAKE from
> another thread in this thread group, and that thread sleep in
> TASK_TRACED.

Understood.  I hope the other strace hackers will chime in to say
whether all this explanation makes sense to them too, or ask their own
questions about the details.

Thanks,
Roland