[GSOC 2014] structured output of strace

Fri Mar 21 18:48:13 UTC 2014

On Mar 21, 2014, at 8:25 AM, Philippe Ombredanne <pombredanne at nexb.com> wrote:

> On Fri, Mar 21, 2014 at 11:50 AM, Zev Weiss <zev at bewilderbeest.net> wrote:
>> On Mar 20, 2014, at 12:54 PM, yangmin zhu <zym0017d at gmail.com> wrote:
>>> Hi,
>>> I'm Yangmin Zhu, a master's student at the University of Chinese Academy of Sciences, and I'm participating in the Google Summer of Code 2014.
>>> I'm working on structured output for the strace project. You can find more information in [1] and [2], and I found your work through [3] and [4].
>>> I think it would be great to contact the authors of strace output parsers to collect their actual needs. I'm trying to modify strace to support output in JSON format, but I'm not clear on exactly what format you want.
>>>  For example,
>>> 1) Should all values in the JSON output be strings, or should some values be numbers?
>>> 2) Which of the following styles for a syscall's arguments do you prefer?
>>>  "args" : ["arg1", "arg2", "arg3" ]
>>> or
>>>  "args" : { "arg1_name" : "arg1_value", "arg2_name" : "arg2_value" }
>>> Any suggestions are welcome.
>>> Thank you.
>>> yangmin zhu
>> 
>> Hi Yangmin,
>> Firstly, thanks for getting in touch!
>> On the specifics you mentioned:
>> 1) I think using "real" types (e.g. actual integers instead of string-encoded ones) wherever possible would be highly preferable in order to simplify parsing by downstream structured-output consumers.
> 
> JSON may have issues with floats as you noted below, so "real" types
> may not be practical for this style and plain strings will work
> alright.
> 

Right, hence my later suggestion that we perhaps consider alternate formats with more useful type systems.
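
A minimal Python sketch of the type-system concern, assuming a consumer whose JSON parser stores every number as an IEEE 754 double (as JavaScript does). The "offset" field and its value are made up for illustration and are not anything strace emits; forcing integers through float via parse_int is only a stand-in for such a parser.

```python
import json

# Illustrative 64-bit quantity (not taken from a real trace).
value = 2**53 + 1  # 9007199254740993
text = json.dumps({"offset": value})

# Python's json module keeps integers exact by default.
exact = json.loads(text)["offset"]

# Simulate a double-only parser by forcing integer literals through float.
as_double = json.loads(text, parse_int=float)["offset"]

print(exact == value)           # True
print(int(as_double) == value)  # False: the nearest double is 2**53
```

So whether "real" numbers survive depends on the consumer's parser rather than on the JSON text itself, which is why string-encoding large values is the cautious choice.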

>> 2) I don't have any strong opinions at this point on whether syscall arguments should be named in a map/dictionary-style collection or placed in a simple ordered list/array.  I could see the map keys being useful in certain situations, but over an entire trace they would result in a great deal of redundancy (e.g. duplicating "domain", "type", and "protocol" for every instance of a socket(2) call).  I'd also guess that many if not most potential consumers of structured output would need (or already have) some awareness of syscall parameter lists built in anyway, so I'd probably lean toward a plain unlabeled array.
> 
> You saw the previous discussions in the mailing list: at the moment
> the consensus is IMHO leaning towards being explicit rather than
> implicit, even if this means being more verbose.
> I think as a principle being correct and explicit first works best.
> 

Sure, that'd be fine with me too -- as I said it's not a strongly-held opinion.
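
For a rough sense of the trade-off, here is a small Python sketch comparing the two shapes for a single socket(2) call. The record layout and key names ("syscall", "args") are hypothetical, not a proposed strace format.

```python
import json

# Hypothetical record shapes for one socket(2) call.
as_list = {
    "syscall": "socket",
    "args": ["AF_INET", "SOCK_STREAM", "IPPROTO_TCP"],
}
as_map = {
    "syscall": "socket",
    "args": {
        "domain": "AF_INET",
        "type": "SOCK_STREAM",
        "protocol": "IPPROTO_TCP",
    },
}

list_size = len(json.dumps(as_list))
map_size = len(json.dumps(as_map))

# The labeled form repeats the parameter names in every record.
print(list_size, map_size, map_size - list_size)
```

The per-record overhead is just the repeated key names, so the explicit form costs size but not parsing complexity.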

>> Also, while I mentioned previously on the list that I'd probably be in favor of JSON-structured output, that was based on a fairly cursory knowledge of the format, basically just from having seen examples of it in lots of places.  It has since been pointed out though that it might not be such a great candidate -- for instance, with regard to point #1 above, JSON has the major disadvantage here (as mentioned by Elliott Hughes) of inheriting JavaScript's unfortunate "all numbers are doubles" brain-damage.  Also (as noted by Marc-Antoine Ruel), while JSON's inherent verbosity is certainly much less than, say, XML's, it's still perhaps a bit "larger" than would be desirable.  (Though w.r.t. another aspect of Marc-Antoine's comment -- JSON doesn't necessarily have to be un-streamable, does it?  Couldn't you just leave the top-level structure of the output file as the concatenation of a bunch of discrete JSON objects, without wrapping them up in an array or similar?)
>> 
>> So I think it might be worth considering some possible alternatives to JSON...a few I'm vaguely aware of and/or have just done some brief research on now:
>> 
>> XML: ugly, bloated and verbose, unpopular with lots of people (myself included), just mentioning "because it's there", though I'd vote against it.
>> 
>> MessagePack (http://msgpack.org/):
>> - more compact than JSON
>> - binary, not text -- obviously less human-readable, but presumably for structured output we care more about ease of consumption by programs, not humans (and for programmatic use a binary format is significantly simpler than text, I'd say).  If human-readability is desired we'll still have the current output format available; I see no reason to try to optimize one output format for both purposes.
>> - type system seems much better-suited for strace's purposes (has 64-bit ints, for one thing), and offers application-specific extensibility if needed.
>> - not nearly as ubiquitous as JSON, but already has existing serdes implementations for lots of languages (https://github.com/msgpack)
>> 
>> BSON (http://bsonspec.org/):
>> - similar to MessagePack in a lot of ways, I think, but has the property that in order to be well-formed and spec-compliant, a top-level document must be prefixed with a total-length descriptor, which seems like it would be a deal-breaker for strace (we'd have to be able to start streaming out the trace before we know how long it is).  That said, I suppose there's no reason strace couldn't just output a concatenation of smaller discrete BSON documents (as mentioned above with JSON).
>> - type system: certainly a better fit for strace than JSON (has 64-bit ints), but seems generally a bit cruftier than MessagePack, with a lot of oddball bits and pieces thrown in (regexes, JS code, MD5s...none of which strace would need to use, but just seem like weird things to have).  Despite being a fairly young format, already has a bunch of features marked "old" or "deprecated", which to me (at least superficially) gives it the appearance of maybe being not all that well-designed.
>> 
>> So, given all that, I think MessagePack is actually looking fairly appealing, personally.
> 
> For now, my take is that we should start first with JSON.
> JSON has enough attributes, quasi-universal support and is simple
> enough for that. It is also plain text.

Well, part of my view on the matter was that being plain text seems to be of questionable value for structured output -- the intent is for machine consumption rather than human readers, no?  In general I think a binary format would be advantageous for machine use, though you're right that JSON's "quasi-universality" is a plus that few other formats could match.
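
To illustrate what "binary" buys for a single value, here is a hand-rolled Python sketch of MessagePack's "uint 64" encoding (a 0xcf type byte followed by eight big-endian bytes) next to a string-encoded JSON value. It deliberately does not use any MessagePack library, the helper names are made up, and the address is arbitrary; a real encoder would normally pick the smallest integer form that fits.

```python
import json
import struct

def pack_uint64(n: int) -> bytes:
    """Encode n as MessagePack 'uint 64': 0xcf + 8 big-endian bytes."""
    return b"\xcf" + struct.pack(">Q", n)

def unpack_uint64(buf: bytes) -> int:
    """Decode a buffer produced by pack_uint64."""
    if len(buf) != 9 or buf[0] != 0xCF:
        raise ValueError("not a MessagePack uint 64")
    return struct.unpack(">Q", buf[1:])[0]

addr = 0x7FFFF7DD1000  # arbitrary example value

binary = pack_uint64(addr)
text = json.dumps(hex(addr)).encode("ascii")  # string-encoded, as in JSON

# The binary form decodes straight to an integer; the text form needs
# a JSON parse followed by a separate string-to-integer conversion.
print(len(binary), unpack_uint64(binary) == addr)
print(len(text), int(json.loads(text), 16) == addr)
```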

> We need to start somewhere and I think that simpler is better for starters.
> To your point, we could later support other serialization formats for sure.
> What do you think?
> 

Well, I suppose if the design & implementation of structured output are done properly, adding alternate output formats later should be a relatively simple matter.  That said, if the first-pass implementation outputs JSON, I'm not sure what the odds of other output formats ever getting actually added would be -- if everything starts using JSON because it's what's there, the inertia of a "good enough" (if only just barely) format might mean we effectively get stuck with a sub-par format instead of one that actually matches our data well.
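
As a sketch of what "done properly" could mean, assume the tracer builds one format-neutral record per syscall and hands it to a pluggable emitter. Everything here (the Event shape, the emitter names, the "classic" approximation of today's output) is hypothetical and written in Python for brevity; strace itself is C. The JSON emitter writes one complete object per line, so the output stays streamable.

```python
import io
import json
from typing import Any, Callable, Dict, Iterable, TextIO

Event = Dict[str, Any]  # hypothetical format-neutral syscall record

def emit_json(event: Event, out: TextIO) -> None:
    # One complete JSON object per line, so output can be streamed.
    out.write(json.dumps(event) + "\n")

def emit_classic(event: Event, out: TextIO) -> None:
    # Rough approximation of the familiar human-readable form.
    args = ", ".join(str(a) for a in event["args"])
    out.write(f"{event['name']}({args}) = {event['ret']}\n")

# Adding a format later means adding one entry here.
EMITTERS: Dict[str, Callable[[Event, TextIO], None]] = {
    "json": emit_json,
    "classic": emit_classic,
}

def write_trace(fmt: str, events: Iterable[Event], out: TextIO) -> None:
    emit = EMITTERS[fmt]
    for event in events:
        emit(event, out)

events = [{"name": "close", "args": [3], "ret": 0}]
buf = io.StringIO()
write_trace("classic", events, buf)
write_trace("json", events, buf)
print(buf.getvalue(), end="")
```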

As a potential example case, for my own strace-consumer, JSON would certainly be a much nicer format than the current unstructured output, so if JSON output became available I'd be likely to adapt my project to use it, even if it still required a bunch of manual atoi()/strtol() calls and so forth because the limitations of JSON mean everything has to come through as strings.  Having done that, if strace then later added MessagePack (or what-have-you) output that didn't require string-encoding everything, my motivation to then adapt my project *again* to consume that format would be substantially less, since its advantages over JSON would be smaller than those of JSON over unstructured output. I do think those incremental advantages are still appreciable and may be worth considering for the initial implementation, however.
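
The kind of manual conversion meant here might look like the following Python sketch. The record layout is hypothetical, and parse_c_int is a made-up helper that only approximates strtol(s, NULL, 0) for well-formed input (hex with a 0x prefix, octal with a leading 0, decimal otherwise).

```python
import json

def parse_c_int(text: str) -> int:
    """Rough analogue of strtol(text, NULL, 0) for well-formed input."""
    s = text.strip()
    sign = -1 if s.startswith("-") else 1
    digits = s.lstrip("+-")
    if digits.lower().startswith("0x"):
        return sign * int(digits, 16)
    if len(digits) > 1 and digits.startswith("0"):
        return sign * int(digits, 8)  # C-style octal, e.g. file modes
    return sign * int(digits, 10)

# Hypothetical string-encoded record, one JSON object per syscall.
line = '{"name": "lseek", "args": ["3", "0x1000", "SEEK_SET"], "ret": "4096"}'
record = json.loads(line)

fd = parse_c_int(record["args"][0])
offset = parse_c_int(record["args"][1])
ret = parse_c_int(record["ret"])

print(fd, offset, ret)  # 3 4096 4096
```

The consumer still has to know which fields are numeric and in which base, which is exactly the work a typed format would remove.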

Zev