strace cleanup() problems

Fri Jan 22 03:39:32 UTC 2010

> We found some problems with strace's cleanup() logic during the testing,
> see below. Please bear in mind I am not familiar with strace's code ;)

Unfortunately, nobody really is at this point.  strace had already been
around for a very long time before I started working on it.  I added some
of the code we're discussing now, but that was several years ago now.  It
was all done to work with the quirks of ptrace on the Linux kernels of the
day at that time.  This was all before I started fixing ptrace in Linux.
As you well know, the Linux kernel's behavior in many corner cases has
changed a great deal since.

In the long history of strace, it has always gone to some pains to continue
working with a variety of old kernels.  Even acknowledging that the
behavior of a given old kernel is wrong and/or inconsistent, if there was
in practice a particular usage scenario with strace that worked on that
kernel before, we don't want to break it now.  This is what's behind my
habitual resistance to big changes to the ways strace uses the kernel
facilities.

Most of the sensitive strace code has not changed much in several years and
many kernel versions now.  The hairy parts, even ones that I debugged in
the past and parts I wrote, are dim memories at best.  Dmitry Levin knows
strace in general as well as anyone does, but I don't know that he has been
in the weeds of the ptrace interactions much.  Andreas now has the primary
burden of working on this code, but these are his first occasions to get
exposed to all the corners relevant here.

So we'll all just have to help each other through it.  Andreas and I will
be gleaning it all just by reading the current code and reasoning through
the wrinkles, the same as you.  I hope that you can hold up one end of the
discussion, representing what permutations the kernel can present, and that
Andreas can formulate what strace's code should do to cope better.  I'll
stick to trying to help elucidate how the code works now and perhaps
sometimes why it got that way.

> Afaics, it is not right to do cleanup() if clearbpt() fails. Yes, the
> "Pretty fatal" comment is correct, but we can have other processes to
> continue tracing.

I think this code dates from the original (now #ifndef LINUX) method of
setbpt/clearbpt.  That did actual text breakpoint insertion in the parent
before it executes the (fork) syscall.  This means that if the breakpoint
is not cleared in the child, then resuming the child will get stuck on more
unexpected SIGTRAPs, or will execute bad instructions (as on x86, if a
one-byte breakpoint instruction is written over a longer instruction).  I
think that sort of scenario is what drives the "Pretty fatal" thought
expressed in the comment there.  In contrast, the worst outcome of the
current (#ifdef LINUX) method is an argument register left unrestored, and
chances are that register was dead in the userland data flow anyway.
(That's all tangential to your real point, and just says that "Pretty
fatal" might have made one kind of sense in the past but, even on that kind
of thinking alone, it is rather weaker today.)

The real point seems to be that this comment, and the behavior that goes
with it, were written without really having -f in mind.  If you have a
problem tracing one process, then that is reasonably an instant fully-fatal
error when you only have one process to trace.

> IOW, suppose "strace -c" traces the process P which fork's child C.

I take it you mean "strace -f -c".

> Now droptcb(C) can fail just because C is SIGKILLED and therefore
> PTRACE_GETREGS/PTRACE_SETREGS fails. strace shouldn't "panic" in
> this case imho, we should ignore ESRCH like ptrace_restart() does.

I see two questions here.  One is handling ESRCH from ptrace, in
particular.  The other is the general question of what to do when
something "very unexpected" happens.

> Actually, I don't really understand what cleanup() buys us if it is
> called when something goes wrong. Can't strace just exit and rely on
> implicit detach in kernel? Except cleanup() should probably send
> CONT/TERM before exiting. Yes, we can leave the tracee in the "bad"
> state, but I do not see how cleanup() can help. For example, it does
> not try to clearbpt() if TCB_BPTSET.
> 
> In any case, cleanup() itself have problems which can lead to the
> hang, please see below.
> 
> ------------------------------------------------------------------------------
> detach() assumes that if TCB_STARTUP is set, the tracee must have
> the pending SIGSTOP queued by attach and thus my_tgkill(SIGSTOP) is
> not needed.
> 
> This is mostly correct, but if cleanup() is the caller we can't trust
> TCB_STARTUP.
> 
> 	cleanup:
> 
> 			if (tcp->flags & TCB_ATTACHED)
> 				detach(tcp, 0);
> 			else {
> 				kill(tcp->pid, SIGCONT);
> 				kill(tcp->pid, SIGTERM);
> 			}
> 
> 
> If cleanup() finds a TCB_ATTACHED sub-thread before TCB_STARTUP
> sub-thread X it sends SIGCONT to the thread group. This steals the
> pending SIGSTOP and wakes up the TCB_STARTUP tracee X. After that
> detach()->wait4(X) can hang "forever" until X exits or stops. For
> example, X can sleep in sys_pause() and strace hangs in wait4(X).
> 
> Probably TCB_ATTACHED should be per-process, not per thread. Or
> perhaps cleanup() should be changed to detach() sub-threads first,
> and only then the main thread which has TCB_ATTACHED.
> 
> ------------------------------------------------------------------------------
> There is another reason why I think cleanup() should detach sub
> threads first. The main thread can be zombie, in this case
> my_tgkill(SIGSTOP) can't help and wait4() can't return unless
> we detach and reap all sub-threads.
> 
> ------------------------------------------------------------------------------
> I can't understand why detach()->wait4() can't hang if TCB_STARTUP
> is set and the previous ptrace(PTRACE_DETACH) succeeded.
> 
> ------------------------------------------------------------------------------
> Finally, the exact scenario which we had during the testing.
> 
> To simplify, the tracee does:
> 
> 	fork();
> 
> 	// both parent and child run he code below
> 
> 	for (0 .. NUM_THREADS)
> 		pthread_create();
> 
> 	raise(SIGFPE);
> 
> it runs under "strace -f". We have 2 thread groups, G1 and G2.
> 
> G2's main thread does raise(SIGFPE), this kills (the kernel sends
> SIGKILL to all threads) a thread T in G2 and clearbpt(T) fails.
> strace calls cleanup().
> 
> cleanup() tries to detach() all traced theads, from G1 and G2.
> It detaches some of them, then it sees G1's main thread with
> TCB_ATTACHED set and sends SIGCONT to G1.
> 
> This wakes up a TCB_STARTUP thread X from G1, X resumes and sleeps
> in do_futex().
> 
> strace does detach(X) and hangs in wait4(X) (remember, detach()
> trusts TCB_STARTUP and doesn't send SIGSTOP).
> 
> Now, wait4(X) hangs forever because X waits for FUTEX_WAKE from
> another thread in this thread group, and that thread sleep in
> TASK_TRACED.
> 
> Oleg.
>