Experience from writing trace_inputs.py

Thu Mar 20 20:50:50 UTC 2014

Hi,

I was pointed out to this thread. I wrote trace_inputs.py. It was used to
isolate chromium tests and figure out which files are touched by each test
case. It doesn't care about anything except files being opened so strace
log parsing was kept to a minimum. Note that I've been mostly using the
stock ubuntu strace version, which is quite old. Some of the issues below
may have been fixed already. For background context, note that I wrote the
tool in python because 1) I'm lazy and 2) it was easier to use a single
tool to deploy quickly on linux, OSX and Windows and that the goal of this
tool is to hide as much as possible the underlying differences between each
OS.

In practice, I stumbled upon a lot of issues while reading the logs;
- Processes being killed arbitrarily, causing all sorts of incomplete logs.
The chromium unit tests tends to open a lot of processes quickly and kill
them just for fun.
- Using a single log file resulted in severely corrupted log file, looks
like multithreading is not properly interlocked in the output file.
- Even using one log file per process with -ff still cause occasional
corrupted log file on heavily multithreaded processes. Note that using a
workstation will a lots of core may probably help exhibit this problem more.
- No clear separation of call / return in the logs, causing a funny search
to figure out which call completed. There's an heuristic in
Strace.Content.Process._pending_calls.
- Encoding, since linux doesn't prescribe utf-8, I have to hope for the
best. Note: I should use -x, I don't recall why I had commented it out.
Maybe just laziness.
- Encoding, I had to write a state machine to read the logs properly, see
the >110 lines of strace_process_quoted_arguments(). That's independent of
-x.
- No clear parent-child relationship between the logs when using -ff,
causing the reader code to play a game to figure out which process started
which processes.
- Only after the process hierarchy is figured out I could figure out the
starting current working directory of each process, causing a lot of "late
bound path". See RelativePath class. It definitely slows down reading the
logs and adds potential corruption of the paths in the trace reader if it
incorrectly guessed the starting cwd.
- Generally inconsistent output, see all the RE_* in Strace.Content. No
regexp is guaranteed to be able to read a log line.
- No header to list the hooked functions I should expect to appear, and the
number of arguments in each. Since it changes from OS version to another, I
made the code assert on unknown function so I can add them manually.
- Variable number of arguments to a function coupled with split log lines
and interesting quoting caused some headaches IIRC. (Or was it OSX doing
this? Sorry I don't recall)

Of all the problems, the log corruption with heavily multithreaded process
on a multi-core system has been the most problematic. Everything else is
solvable but when the log is corrupted, there isn't much to do. Last time I
tried, which is a long long time ago, I was still able to reproduce
occasional corruptions with our test named "browser_tests".

A separate issue I got was that I was not able to trace native client
sandboxed unit test processes unless strace was running under sudo but to
do that I need to attach to the already started process to be able to not
run the unit test itself as root. It's been a pain to implement. Note
strictly strace's fault, just FYI.

What I'd recommend;
- json is good for smallish document but is not great for unbounded
documents. Logs can get >2gb. IMHO, I'm not sure it's a good fit.
Especially with regards to 64 ints and json assumes unicode, which  may not
be correct on some systems. That said, I don't have a good recommendation
either. Something custom would be fine with me. Just expose a reference
code to read it back in 1 language, 2-3 popular languages if possible.
- Optionally listing at the start the current process info; pid, ppid, cwd,
list of hooked functions and the number of args for each, etc. Would save a
*lot* of processing when reading the logs of a process tree.
- Always generate 2 entries per call, one enter, one exit. Don't try to be
smart and merge when possible. Be consistent instead! The exit call should
be extremely easy to match with the corresponding enter. In particular in
case of reentrancy. It's less messy if the thread gets killed in the middle
of execution.
- python is ridiculously slow at parsing, ensure whatever you do doesn't
make reading the log *too* slow. In particular, keep support for streaming
the log instead of forcing to read all the log at once, like a big json
file.

It may look like I was dismayed at strace log parsing but you haven't see
Windows' tracerpt.exe output. See LogmanTrace.Tracer.CsvReader.next() to a
contest of which log format is the worst to read. Whoever wrote this junk
shouldn't be allowed to code ever again.

I hope my comments are useful.

Thanks,

M-A
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.strace.io/pipermail/strace-devel/attachments/20140320/9a82f11c/attachment.html>