[GSOC 2014] structured output of strace

Fri Mar 21 13:25:03 UTC 2014

On Fri, Mar 21, 2014 at 11:50 AM, Zev Weiss <zev at bewilderbeest.net> wrote:
> On Mar 20, 2014, at 12:54 PM, yangmin zhu <zym0017d at gmail.com> wrote:
>> Hi,
>>  I'm yangmin zhu. I'm a master's student at the University of Chinese Academy of Sciences, and I'm now participating in the Google Summer of Code 2014.
>>  I'm working on structured output for the strace project. You can find more information in [1] and [2]. I found your work through [3] and [4].
>>  I think it would be great to contact the authors of strace output parsers to collect their actual needs. I'm trying to modify strace to support output in JSON format, but I'm not very clear on the exact format you want.
>>   For example,
>> 1) should all the values in the JSON output be strings, or should some values be numbers?
>> 2) which of the following styles for a syscall's arguments do you prefer?
>>   "args" : ["arg1", "arg2", "arg3" ]
>> or
>>   "args" : { "arg1_name" : "arg1_value", "arg2_name" : "arg2_value" }
>> ANY suggestions are welcome.
>> Thank you.
>> yangmin zhu
>
> Hi Yangmin,
> Firstly, thanks for getting in touch!
> On the specifics you mentioned:
> 1) I think using "real" types (e.g. actual integers instead of string-encoded ones) wherever possible would be highly preferable in order to simplify parsing by downstream structured-output consumers.

JSON may have issues with numbers (they can end up as doubles, as you
noted below), so "real" types may not be practical for this style, and
plain strings will work all right.
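
To make the concern concrete, here is a minimal Python sketch (not strace
code). It assumes a consumer that stores every JSON number as an IEEE 754
double, as JavaScript does; the "ret" field name is only an illustration.

```python
import json

# A value just above 2**53, standing in for a 64-bit quantity such as a
# file offset or an address. The number itself is arbitrary.
value = 2**53 + 1

# A consumer that keeps every JSON number as a double cannot hold this
# value exactly: converting to float and back changes it.
as_double = float(value)
lost_precision = int(as_double) != value

# Encoding the value as a string preserves every digit across a round trip.
# "ret" is a hypothetical field name, not an agreed strace format.
record = json.dumps({"ret": str(value)})
round_tripped = int(json.loads(record)["ret"])

print("precision lost as a double:", lost_precision)
print("string round trip intact:", round_tripped == value)
```

The cost of the string approach is that every consumer has to convert
values back to integers itself.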

> 2) I guess I don't have any real strong opinions at this point on whether syscall arguments should be named in a map/dictionary style collection or a simple ordered list/array.  I could see the map keys being potentially useful in certain situations, but looked at over an entire trace it seems like it would result in a great deal of redundancy (e.g. duplicating "domain", "type", and "protocol" for every instance of a socket(2) call); also I'd guess that many if not most potential consumers of structured output would need (or already have) some awareness of syscall parameter lists built into them anyway, so I guess I'd probably lean toward a plain unlabeled array.

You saw the previous discussions on the mailing list: at the moment
the consensus is IMHO leaning towards being explicit rather than
implicit, even if this means being more verbose.
As a principle, I think being correct and explicit first works best.
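
As a rough illustration of the trade-off, here is a small Python sketch
comparing the two shapes for the socket(2) example quoted above. The
"name" and "args" keys and the argument values are assumptions made for
the example, not a proposed format.

```python
import json

# Positional style: compact, but the consumer must already know the
# parameter order of socket(2).
positional = {
    "name": "socket",
    "args": ["AF_INET", "SOCK_STREAM", "IPPROTO_TCP"],
}

# Named style: each argument is labeled, at the cost of repeating the
# labels for every socket(2) call in the trace.
named = {
    "name": "socket",
    "args": {
        "domain": "AF_INET",
        "type": "SOCK_STREAM",
        "protocol": "IPPROTO_TCP",
    },
}

positional_size = len(json.dumps(positional))
named_size = len(json.dumps(named))

# Both styles carry the same information; only the lookup differs.
same_domain = positional["args"][0] == named["args"]["domain"]

print("positional bytes:", positional_size)
print("named bytes:", named_size)
print("same domain value:", same_domain)
```

The named form is larger per record, but a consumer can read it without
a built-in table of syscall signatures.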

> Also, while I mentioned previously on the list that I'd probably be in favor of JSON-structured output, that was based on a fairly cursory knowledge of the format, basically just from having seen examples of it in lots of places.  It has since been pointed out though that it might not be such a great candidate -- for instance, with regard to point #1 above, JSON has the major disadvantage here (as mentioned by Elliott Hughes) of inheriting JavaScript's unfortunate "all numbers are doubles" brain-damage.  Also (as noted by Marc-Antoine Ruel), while JSON's inherent verbosity is certainly much less than, say, XML's, it's still perhaps a bit "larger" than would be desirable.  (Though w.r.t. another aspect of Marc-Antoine's comment -- JSON doesn't necessarily have to be un-streamable, does it?  Couldn't you just leave the top-level structure of the output file as the concatenation of a bunch of discrete JSON objects, without wrapping them up in an array or similar?)
>
> So I think it might be worth considering some possible alternatives to JSON...a few I'm vaguely aware of and/or have just done some brief research on now:
>
> XML: ugly, bloated and verbose, unpopular with lots of people (myself included), just mentioning "because it's there", though I'd vote against it.
>
> MessagePack (http://msgpack.org/):
>  - more compact than JSON
>  - binary, not text -- obviously less human-readable, but presumably for structured output we care more about ease of consumption by programs, not humans (and for programmatic use a binary format is significantly simpler than text, I'd say).  If human-readability is desired we'll still have the current output format available; I see no reason to try to optimize one output format for both purposes.
>  - type system seems much better-suited for strace's purposes (has 64-bit ints, for one thing), and offers application-specific extensibility if needed.
>  - not nearly as ubiquitous as JSON, but already has existing serdes implementations for lots of languages (https://github.com/msgpack)
>
> BSON (http://bsonspec.org/):
>  - similar to MessagePack in a lot of ways, I think, but has the property that in order to be well-formed and spec-compliant, a top-level document must be prefixed with a total-length descriptor, which seems like it would be a deal-breaker for strace (we'd have to be able to start streaming out the trace before we know how long it is).  That said, I suppose there's no reason strace couldn't just output a concatenation of smaller discrete BSON documents (as mentioned above with JSON).
>  - type system: certainly a better fit for strace than JSON (has 64-bit ints), but seems generally a bit cruftier than MessagePack, with a lot of oddball bits and pieces thrown in (regexes, JS code, MD5s...none of which strace would need to use, but just seem like weird things to have).  Despite being a fairly young format, already has a bunch of features marked "old" or "deprecated", which to me (at least superficially) gives it the appearance of maybe being not all that well-designed.
>
> So, given all that, I think MessagePack is actually looking fairly appealing, personally.

For now, my take is that we should start with JSON.
JSON has enough features and quasi-universal support, and it is simple
enough for that. It is also plain text.
We need to start somewhere, and I think that simpler is better for starters.
To your point, we could certainly support other serialization formats later.
What do you think?
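
On the streaming question raised above, here is a minimal Python sketch
of how a consumer could read a concatenation of discrete JSON objects
with no enclosing array. The helper name iter_json_objects and the
sample records are made up for the example; it uses only the standard
json module.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value found in a concatenated string."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode() does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode() returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical trace output: one object per syscall, simply concatenated.
sample = (
    '{"name": "open", "ret": "3"}\n'
    '{"name": "close", "ret": "0"}\n'
)

records = list(iter_json_objects(sample))
names = [record["name"] for record in records]
print(names)
```

A real consumer would read from a pipe or file in chunks rather than
from one string, but the idea is the same: each record can be parsed as
soon as it is complete.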

Cordially
-- 
Philippe Ombredanne