[GSOC 2014] structured output of strace

Marc-Antoine Ruel maruel at chromium.org
Fri Mar 21 13:53:02 UTC 2014


Expanding specifically on the JSON streaming idea;

(Sorry if I'm rehashing ideas already stated; I just subscribed
yesterday and have only glanced at the recent archives.)

2014-03-21 6:50 GMT-04:00 Zev Weiss <zev at bewilderbeest.net>:
> (Though w.r.t another aspect of Marc-Antoine's comment -- JSON doesn't necessarily have to be un-streamable, does it?  Couldn't you just leave the top-level structure of the output file as the concatenation of a bunch of discrete JSON objects, without wrapping them up in an array or similar?)

Having a stream of individually JSON-encoded items would probably be
fine, but then decoding must be simple. For example, each line could be
a JSON-encoded value, separated by a plain \n, e.g.

-- CUT HERE --
{"version":"0.1","pid":82323,"ppid":3342,"cwd":"/home/blank","uid":123","functions":[...],...}
[1395408175.21312,0,"open",["/path/to/file",0700]]
[1395408175.56843,1,3]
-- CUT HERE --

The file itself is *not* a valid JSON file; it's a \n-joined list of
JSON-encoded packets.

Where the hypothetical format is:
- First item: a dict describing the format and the global state at the
start of the log. It's fine for this line to be verbose, since it
occurs only once in the log. It tells the reader how to read the rest
of the file. File format versioning FTW.
- Every following item: a single list with a common part and a
variable part.
  Common:
    [timestamp, returnid,
  If returnid == 0, it's a call, else it's a return.
  For a call:
    ..., function_name, [args] ]
  For a return:
    ..., returnvalue ]
  The returnid permits a strict match of call->return lines. Its value
is the index of the log entry where the call was logged; the call line
does not carry its own index, for brevity. I had initially put it in
the log line, but I think keeping the lines as dense as possible has
value.
For compactness, the function name could be a numeric function id instead.

So the actual log lines are relatively dense even if text/ASCII encoded.
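To make that concrete, here is a minimal writer sketch in Python (purely
illustrative; the values are copied from the example above, and the
functions list is left empty since its contents are elided there):

-- CUT HERE --
import json
import sys

def emit(packet):
    # One JSON-encoded packet per line; the file as a whole is not valid JSON.
    # separators=(",", ":") drops the default spaces to keep the lines dense.
    sys.stdout.write(json.dumps(packet, separators=(",", ":")) + "\n")

# Header packet: format version and global state at the start of the log.
emit({"version": "0.1", "pid": 82323, "ppid": 3342, "cwd": "/home/blank",
      "uid": 123, "functions": []})  # functions list elided in the example above
# A call (returnid == 0), then its matching return (returnid == 1).
emit([1395408175.21312, 0, "open", ["/path/to/file", 0o700]])  # 0o700 -> 448
emit([1395408175.56843, 1, 3])
-- CUT HERE --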

I'm not describing other things like signal and process events, since
this is really just an example design, but the general idea of a common
part + a variable part would remain. I think a reader implementation
would be relatively easy to write.

That said, two problems remain with the encoding itself (a small sketch
follows the list):
- JSON assumes doubles for its numbers, and by default they are encoded
as base-10 text. So using something like hex encoded in a string would
be more efficient.
- JSON strings assume Unicode. This could mean using custom escaping
for byte streams. base64 is a valid option in that case, but it
increases the size by roughly 33%.
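As an illustration of both points, this is what a return carrying a raw
byte payload (say, data from a read()) could look like; the exact field
layout here is made up for the example:

-- CUT HERE --
import base64
import json

# Raw, possibly non-UTF-8 bytes cannot go into a JSON string as-is, so
# they are base64-encoded (at a size cost); a pointer is carried as a
# hex string to sidestep the double-precision limit of JSON numbers.
payload = b"\x7fELF\x02\x01\x01\x00"
record = [1395408175.56843, 1, len(payload),
          base64.b64encode(payload).decode("ascii"), "0x7ffe1a2b3c4d"]
print(json.dumps(record, separators=(",", ":")))
-- CUT HERE --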

There are two completely separate questions:
- How each packet itself is encoded with "something"; here I picked
JSON as the something.
- How each packet is properly defined; I gave an example above with 3
packet descriptions.

For the encoding itself, the big question is: do you want the output
to be ASCII or binary? That influences what you are going to select.

One potentially interesting side effect of JSON for a subset of users
is that it's trivial to read from Python, because JSON is included in
its stdlib. Using something like BSON or MessagePack means the user
will have to install these third-party packages first. No big deal, but
still one more step to do. It could be annoying when a sysadmin wants
to log in to a server and quickly diagnose something. To state the
obvious, using an encoding that has widespread support in many
languages (I'd say at least Perl, Python, C++) would be better. Just
sayin', I'm not invested in this choice.

M-A



