[RFC] gradually moving strace from ptrace to perf

Tue Sep 24 12:51:41 UTC 2013

The main performance issue with strace is that it is slow.
Straced process is ~30 times slower than untraced one.

This is a design flaw of ptrace interface, it can't be
fixed easily.

Currently, all other strace performance improvements
are bottlenecked at it: even if everything strace
does (apart from waitpid'ing and ptrace_cont'ing)
would take zero nanoseconds, straced process would
still run 30 times slower and untraced one.

To be precise, the problem that we use PTRACE_SYSCALL,
and therefore we stop traced process twice per syscall.
Then strace needs to wake up. Then it all needs to happen
in the opposite direction. Twice. Per syscall.

Over the years there were several attempts to produce
an alternative (e.g. three years of utrace effort
on kernel side, came to naught). It proved to be *hard*.

The situation started to look better recently, with
the evolution of kernel perf infrastructure from
"CPU performance counters" thingy to a more generic
tracing infrastructure.

The appealing feature of perf (for us) is that it
collects data *without stopping* traced processes.

During last week I took a look at the current state
and experimented a bit. Here are the results.

Kernel has "tracepoints" now, predefined locations
in kernel code where it is possible to attach and
collect information, and even call a custom function,
when this location is reached.

(It is possible to create new tracepoints on the fly
using kprobes, but we won't need that.).

There are two tracepoints which are of interest to us:
sys_enter amd sys_exit.

To attach to them, we need to create a file descriptor:

	fd = perf_event_open(attr, pid, ...);

where attr is a structure which describes that we want
to get notified when "sys_enter" tracepoint is reached
by task with PID.

Then we need to mmap a buffer on this fd, and kernel
will write tracepoint data every time it is reached.
For  "sys_enter", the information is:

	pid, syscall#, args[6]

The traced process will not stop and won't send any waitpid
notifications, it will send a POLLIN notification to the fd
instead.

The same can be done on "sys_exit". Dumped information is:

	pid, syscall#, retval

How we can use it?

We can get rid of those many PTRACE_SYSCALL stops.
Unfortunately, not all of them, but many.

A simple change which makes strace run traced process
with PTRACE_CONT instead of PTRACE_SYSCALL, and fetch
syscall parameters and results using perf buffer produces
this output:

brk(0)                                  = 0x6e0000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7a9047c000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
       ^^^^^^^^^^^^^^^^^^^^ managed to fetch filename
open(0x7f7a90279478, O_RDONLY|O_CLOEXEC) = 3
     ^^^^^^^^^^^^^^ but here it didn't work
fstat(3, {...})                         = 0
         ^^^^^ and here too didn't work
mmap(NULL, 78894, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7a90468000
close(3)                                = 0
...
...
close(3)                                = 0
fstat(1, {...})                         = 0
fstat(0, {...})                         = 0
fadvise64(0, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(0, "", 65536)                      = 0
close(0)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

What happens here is that traced process exited by the time
strace saw open() syscall data. Fetching of filename didn't
succeed, strace got ESCRH trying to do that:

process_vm_readv(15043, 0x7fff1825fed0, 1, 0x7fff1825fee0, 1, 0) = -1 ESRCH (No such process)

That's it: pid 15043 was no more.

However, note how many syscalls are still shown with
full information: close, mmap, fadvise64...

The situation with decoding can be fixed if we don't eliminate
all stops (yet), but still stop after those syscalls
which have parameters or results *referenced by registers*,
not only registers themselves.
(I also expect that there will be a few syscalls
where we will need to stop on "sys_enter").

This needs some cooperation on kernel side:
registration of tracepoint events should be extended
to allow selective stopping of the process when event triggers.

I have a proof-of-concept kernel patch which stops
in every "sys_exit" tracepoint (if the process in question
is ptraced). With this patch, I managed to make modified
strace print the correct output, with all parameters decoded.
Compare with the above:

...
...
close(3)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
fstat(0, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
fadvise64(0, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(0, "", 65536)                      = 0
close(0)                                = 0
close(1)                                = 0
close(2)                                = 0
+++ exited with 0 +++

This can be easily improved further: there should be a bit mask
of syscalls which we want stopped.

The catch is, of course, that we need to talk kernel people into
accepting the patch.

It may be not that easy: perf developers very much like the fact that
they never needed any pesky stops. From their POV, the feature
I propose, while technically realizable, is "conceptually wrong".

Also, kernel tree has a "perf" command line tool which is pretty impressive,
and recently it grew a strace-like subcommand, "perf trace PROG [ARGS]".
However, it is exactly at the same stage where my toy non-stopping hack
was: it shows only sycall argument values per se, not data they reference.

There are arguments which I plan to use to convince them:

* The feature is optional: if the call

	fd = perf_event_open(attr, pid, ...);

  does not request stopping in the attr structure, then no stopping
  ever happens. Their tools (perf, systemtap et al.) won't need to do
  anything to accommodate it.

* The dramatically faster strace will be available *soon*, not after
  many more years of development necessary in "perf trace" to provide
  complete strace-like functionality.
  (I don't think that "perf trace" can't eventually achieve the
  level of details strace shows today.
  but it will take **A LOT** of time - I'd say several years, likely
  five years or more).

* We (strace developers) will work on using this feature *less*,
  not *more* - for one, we do want to make strace faster, not slower!
  Eventually we may manage to get rid of stopping entirely.

The kernel patch is attached.

Remember, it is a proof-of-concept, *don't propose it*
for kernel inclusion as-is! The final version we will want
to be included is quite likely to be significantly different.

How it works:
struct perf_event_attr, among its many other fields,
has auxiliary fields config1 and config2 (currently used, IIRC,
only with some Inter CPU performance counters).
For tracepoints, these fields were not used before my patch.

With the patch:
if config1 field is set to magic value 42 in perf_event_open(attr) call,
and if the task where tracepoint is triggered is ptraced,
then task will enter ptrace-stop with PTRACE_EVENT_STOP + 1
(we can name this new ptrace-event-stop PTRACE_EVENT_STOP_TRACEPOINT).

My modified strace code which uses this feature is currently
too ugly in several ways. I can post it after cleaning up,
if there's interest.

Do you guys have ideas how to make this proposal more palatable
for kernel guys?

-- 
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 3clean.diff
Type: text/x-patch
Size: 1909 bytes
Desc: not available
URL: <http://lists.strace.io/pipermail/strace-devel/attachments/20130924/05f463ce/attachment.bin>