[GSOC2015] JSON Formatting

Fri Mar 6 15:21:07 UTC 2015

> A better way (which requires much more work of course) would be to
> come with smaller incremental patches and clearly separate the parts
> that are JSON-specific for later, focusing first on an incremental
> refactoring and abstraction of the printf scattered code.

I think adding abstraction would be hard to do in incremental patches, but I
agree on the need for them and a potential CI system.

> So IMHO a good proposal would go at this in small steps:
> - find a best approach abstract printing in abstract of any JSON
> output and submit these as small incremental and consistent patches.
> This should introduce no new feature.

Here are the questions that I think should be asked independently:

- what do we want the code/printing API to look like?
- what do we want the json output to look like?
- do we want the json output to be human-readable?

Currently, the output code is scattered across the code paths. Put together,
the different parts look like this:

tprintf("%s(", tcp->s_ent->sys_name);
printfd(tcp, tcp->u_args[0]);
tprints(", ");
printfd(tcp, tcp->u_args[1]);
tprints(") ");
tprintf("= %#lx", tcp->u_rval);
tprints("\n");

I feel like the code writing the commas, parentheses and so on could be
abstracted. Indeed, these delimiters are specific to the classical output,
and json would require something else.

I am thinking about an `output machine` that would keep a state.  This state
would determine several things: the next delimiter (one of "(", ")", "," or
"=" in the case of classical output) and whether to flush the output or not
(yes after outputting all arguments and while waiting for the syscall to
return, no in between arguments...). We would probably end up with something
like:

out_update_state(&om, OSTATE_SYSNAME);
out(&om, F_STR, tcp->s_ent->sys_name);
out_update_state(&om, OSTATE_ARGS);
out(&om, F_FD, tcp->u_args[0]);
out(&om, F_FD, tcp->u_args[1]);
out_update_state(&om, OSTATE_RET);
out(&om, F_FD, tcp->u_rval);
out_update_state(&om, OSTATE_DONE);
...

This is just an idea in the works and I don't know to what extent we could
shorten this with an implied state change after printing the syscall name,
printing of multiple arguments in a single call, and the like.
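
To make the idea a bit more concrete, here is a minimal, self-contained sketch
of such a machine for the classical output only. Everything in it is an
assumption made for illustration: the struct layout, the F_* format tags, the
fixed-size buffer, and the fact that out() takes an int tag followed by one
value. None of this exists in strace today.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical states and format tags, named after the calls above. */
enum ostate { OSTATE_SYSNAME, OSTATE_ARGS, OSTATE_RET, OSTATE_DONE };
enum ofmt { F_STR, F_FD };

struct out_machine {
	enum ostate state;
	int nargs;		/* values emitted since the last state change */
	char buf[256];		/* pending text for the current syscall */
	size_t len;
};

/* Append a string to the pending buffer, truncating if it is full. */
static void om_append(struct out_machine *om, const char *s)
{
	size_t n = strlen(s);

	if (om->len + n >= sizeof(om->buf))
		n = sizeof(om->buf) - 1 - om->len;
	memcpy(om->buf + om->len, s, n);
	om->len += n;
	om->buf[om->len] = '\0';
}

/* The state change, not the caller, decides which delimiter is printed. */
static void out_update_state(struct out_machine *om, enum ostate next)
{
	switch (next) {
	case OSTATE_SYSNAME:	/* new syscall: start from an empty buffer */
		om->len = 0;
		om->buf[0] = '\0';
		break;
	case OSTATE_ARGS:
		om_append(om, "(");
		break;
	case OSTATE_RET:
		om_append(om, ") = ");
		break;
	case OSTATE_DONE:	/* a real version would flush here */
		om_append(om, "\n");
		break;
	}
	om->state = next;
	om->nargs = 0;
}

/* Emit one value; F_STR expects a const char *, F_FD expects a long. */
static void out(struct out_machine *om, int fmt, ...)
{
	char tmp[32];
	va_list ap;

	/* Inside the argument list, separate values with ", ". */
	if (om->state == OSTATE_ARGS && om->nargs++ > 0)
		om_append(om, ", ");

	va_start(ap, fmt);
	if (fmt == F_STR) {
		om_append(om, va_arg(ap, const char *));
	} else {
		snprintf(tmp, sizeof(tmp), "%ld", va_arg(ap, long));
		om_append(om, tmp);
	}
	va_end(ap);
}
```

A json back end would keep the same calls and only change what
out_update_state() and out() append, which is the point of the abstraction.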

> - once and if that can be completed, implement JSON support, if possible

Considering json has no hexadecimal or octal number literals, whether we
should output a string containing a hexadecimal number or just convert that
number to decimal depends on the third question asked at the beginning of this
e-mail: do we care for a human-readable json output? Cases such as the -y
option fall into the same category up to a point, although they would be
easier to handle with nested json.

{'syscall': 'dup2', 'args': [0, 1], 'ret': 2}

With -y argument, for example:

{'syscall': 'dup2', 'args': [{'fd': 0, 'path': '/dev/pts/5'}, {'fd': 1,
'path': '/dev/pts/5'}], 'ret': {'fd': 2, 'path': '/dev/pts/5'}}
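
Coming back to the hexadecimal question, the two choices could be sketched as
follows. The function name and the as_hex_string flag are hypothetical; the
only thing taken for granted is that json numbers must be decimal, so a
hexadecimal rendering has to travel as a quoted string.

```c
#include <stdio.h>

/*
 * Hypothetical helper: encode a value either as a json number (decimal) or
 * as a json string holding the hexadecimal form a human would expect.
 * Returns what snprintf() returns.
 */
static int json_put_ulong(char *dst, size_t size, unsigned long v,
			  int as_hex_string)
{
	if (as_hex_string)
		return snprintf(dst, size, "\"%#lx\"", v);
	return snprintf(dst, size, "%lu", v);
}
```

A consumer gets a value it can use directly in the decimal case, and has to
parse the string itself in the hexadecimal case.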

I strongly believe the json output is not meant to be human-readable, and
should therefore contain as much information as possible (all of it, why not).
For example, why not always output what the -y option provides? Since no human
should read the json output, there is no 'output pollution' per se. We could
therefore incorporate timings, syscall counts, syscall timestamps, ... This
decision would also allow us not to abbreviate the argument lists. Discarding
information would be left to the discretion of the user.

> BTW, the line-by-line JSON approach has names and even specs now!
> See [3] , [4] and [5]

With line-delimited json, I am imagining this kind of output:

---- start of the output
{'syscall': 'dup2'}
{'timestamp': '15:27:02'}
{'eip': 139901979798028}
{'args': [{'fd': 0, 'path': '/dev/pts/5'}, {'fd': 1, 'path': '/dev/pts/5'}]}
---- potential hang on the syscall
{'ret': {'fd': 2, 'path': '/dev/pts/5'}}
{'time': 0.000010}
---- delimiter of some sort
{'syscall': 'close'}
{'timestamp': '15:27:02'}
{'eip': 139901978813632}
{'args': [{'fd': 2, 'path': '/dev/pts/5'}]}
---- potential hang on the syscall
{'ret': -1, 'errno': 13, 'error': 'EACCES', 'message': 'Permission denied'}
{'time': 0.000010}
---- end of the output

Please correct me if my understanding of the json output we are expecting is
not at all the same, but this feels right to me.
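
As a rough sketch of how such line-by-line output could be produced, the code
below writes one small json object per line and flushes after each one, so a
reader sees the syscall name and arguments even while the syscall hangs. The
jsonl_* names are made up for this example, values are assumed to be already
encoded, and it uses the double quotes that strict json requires.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write one single-key json object per line and flush
 * right away. `value` must already be encoded json (a number, a quoted
 * string, a nested object...); no escaping is done here.
 */
static void jsonl_emit(FILE *fp, const char *key, const char *value)
{
	fprintf(fp, "{\"%s\": %s}\n", key, value);
	fflush(fp);
}

/* Entering a syscall: everything that is known before it returns. */
static void jsonl_syscall_enter(FILE *fp, const char *name, const char *args)
{
	char quoted[64];

	snprintf(quoted, sizeof(quoted), "\"%s\"", name);
	jsonl_emit(fp, "syscall", quoted);
	jsonl_emit(fp, "args", args);
}

/* Leaving a syscall: the return value, possibly emitted much later. */
static void jsonl_syscall_exit(FILE *fp, long ret)
{
	char num[32];

	snprintf(num, sizeof(num), "%ld", ret);
	jsonl_emit(fp, "ret", num);
}
```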

The -p option is a bit of a problem: each pid given uses a different tcb
structure. Creating a separate output machine for each tcb structure would
work in my opinion. Each could simply output on a different fd, or we could
use some multiplexing logic to manage multiple `output machines` on the same
file descriptor. It feels obvious to me that outputting json on stderr, mixed
with potential program output, would make no sense and should not be handled.
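
One possible shape for the multiplexing variant, purely as an assumption: each
tcb keeps its own pending record, and a record only reaches the shared file
descriptor as one complete line tagged with its pid, so lines from different
processes never interleave. The struct and function names below are
hypothetical.

```c
#include <stdio.h>

/* Hypothetical per-tcb output state: one instance per traced pid. */
struct tcb_out {
	int pid;
	char pending[128];	/* json fields of the record being built */
};

/* Store the fields of the record this tcb is currently building. */
static void tcb_out_set(struct tcb_out *to, const char *fields)
{
	snprintf(to->pending, sizeof(to->pending), "%s", fields);
}

/*
 * Render the pending record as one complete line tagged with the pid.
 * Returns what snprintf() returns.
 */
static int mux_render(char *dst, size_t size, const struct tcb_out *to)
{
	return snprintf(dst, size, "{\"pid\": %d, %s}\n", to->pid, to->pending);
}
```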

The main issue I have not addressed is notification messages, the unfinished
and resumed markers, and the like.

There are still a lot of questions to be asked and answers to be given, but
I'd first like to know your opinion on these few ideas. I also don't believe
it is possible to both refactor the printing code and write the json output in
the same GSOC, but as Philippe said, one thing at a time.

Cheers,

-- 
Louis 'manny' Feuvrier
LSE - EPITA 2016