[GSOC2015] JSON Formatting

Sat Mar 7 02:38:40 UTC 2015

On Fri, Mar 06, 2015 at 04:21:07PM +0100, Louis Feuvrier wrote:
> > A better way (which requires much more work of course) would be to
> > come with smaller incremental patches and clearly separate the parts
> > that are JSON-specific for later, focusing first on an incremental
> > refactoring and abstraction of the printf scattered code.
> 
> I think adding abstraction would be hard to do in incremental patches, but I
> agree on the need for them and a potential CI system.

There are hundreds of raw tprintf and tprints calls.  I wonder how you
could introduce an output state machine in incremental patches.

> > So IMHO a good proposal would go at this in small steps:
> > - find the best approach to abstract printing, independently of any JSON
> > output, and submit it as small, incremental and consistent patches.
> > This should introduce no new feature.
> 
> Here are the questions that I think should be asked independently:
> 
> - what do we want the code/printing API to look like?
> - what do we want the json output to look like?
> - do we want the json output to be human-readable?
> 
> Currently, the outputting code is scattered across the code paths. When put
> together, all different parts look like this:
> 
> tprintf("%s(", tcp->s_ent->sys_name);
> printfd(tcp, tcp->u_args[0]);
> tprints(", ");
> printfd(tcp, tcp->u_args[1]);
> tprints(") ");
> tprintf("= %#lx", tcp->u_rval);
> tprints("\n");
> 
> I feel like the code writing out the commas, parentheses and everything else
> could be abstracted. Indeed, these delimiters are specific to the classical
> output, and json would require something else.

Yes.

> I am thinking about an `output machine` that would keep a state.  This state
> would determine multiple things: the next delimiter (),= in the case of
> classical output, whether to flush the output or not (yes after outputting all
> arguments and waiting for the return of the syscall, no in-between
> arguments...). We probably would end up with something like:
> 
> out_update_state(&om, OSTATE_SYSNAME);
> out(&om, F_STR, tcp->s_ent->sys_name);
> out_update_state(&om, OSTATE_ARGS);
> out(&om, F_FD, tcp->u_args[0]);
> out(&om, F_FD, tcp->u_args[1]);
> out_update_state(&om, OSTATE_RET);
> out(&om, F_FD, tcp->u_rval);
> out_update_state(&om, OSTATE_DONE);
> ...
> 
> This is just an idea in the works and I don't know to what extent we could
> shorten this with implied state changes after the printing of the syscall name,
> printing of multiple arguments in a single call, and the like.

This machine is going to be a bit more complex: it would have to support
output of nested objects like structures containing arrays of structures
(e.g. struct msghdr), but in general I think this is the right approach.
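
To illustrate the nesting requirement, here is a minimal sketch of a
machine that keeps a per-level item count on a small stack, so that
separators come out right inside structures containing arrays of
structures.  All names (struct out_machine, out_begin, out_end, out_int,
demo_nested) are hypothetical and are not existing strace code; it only
shows the stack idea, not a proposed API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names throughout; none of this is existing strace API. */
enum { OUT_MAX_DEPTH = 16 };

struct out_machine {
	char buf[256];			/* accumulated output */
	size_t len;			/* bytes used in buf */
	int depth;			/* current nesting depth */
	int count[OUT_MAX_DEPTH];	/* items emitted at each depth */
};

/* Append a raw string to the buffer. */
static void out_raw(struct out_machine *om, const char *s)
{
	size_t n = strlen(s);

	assert(om->len + n < sizeof(om->buf));
	memcpy(om->buf + om->len, s, n + 1);
	om->len += n;
}

/* Emit a separator unless this is the first item at the current depth. */
static void out_sep(struct out_machine *om)
{
	if (om->count[om->depth]++ > 0)
		out_raw(om, ", ");
}

/* Open a nested object: push a new level onto the stack. */
static void out_begin(struct out_machine *om, const char *open)
{
	out_sep(om);
	out_raw(om, open);
	assert(om->depth + 1 < OUT_MAX_DEPTH);
	om->count[++om->depth] = 0;
}

/* Close the innermost nested object: pop one level. */
static void out_end(struct out_machine *om, const char *close)
{
	assert(om->depth > 0);
	om->depth--;
	out_raw(om, close);
}

/* Emit one integer item at the current depth. */
static void out_int(struct out_machine *om, long v)
{
	char tmp[32];

	snprintf(tmp, sizeof(tmp), "%ld", v);
	out_sep(om);
	out_raw(om, tmp);
}

/* A structure holding an integer and an array of two structures. */
static const char *demo_nested(struct out_machine *om)
{
	memset(om, 0, sizeof(*om));
	out_begin(om, "{");
	out_int(om, 1);
	out_begin(om, "[");
	out_begin(om, "{");
	out_int(om, 2);
	out_int(om, 3);
	out_end(om, "}");
	out_begin(om, "{");
	out_int(om, 4);
	out_end(om, "}");
	out_end(om, "]");
	out_end(om, "}");
	return om->buf;
}
```

demo_nested() produces "{1, [{2, 3}, {4}]}".  The callers never print a
comma themselves; a JSON backend and a classical backend would differ
only in what out_begin, out_end and out_sep write.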

> > - once and if that can be completed, implement JSON support, if possible
> 
> Considering json doesn't handle hexadecimal/octal, whether we should output a
> string containing a hexadecimal number or just convert that number to decimal
> depends on the third question asked at the beginning of this e-mail: do we care
> for a human-readable json output? Cases such as the -y option fall into the
> same category up to a point, although they would be easier to handle with
> nested json.
> 
> {'syscall': 'dup2', 'args': [0, 1], 'ret': 2}
> 
> With -y argument, for example:
> 
> {'syscall': 'dup2', 'args': [{'fd': 0, 'path': '/dev/pts/5'}, {'fd': 1,
> 'path': '/dev/pts/5'}], 'ret': {'fd': 2, 'path': '/dev/pts/5'}}
> 
> I strongly believe the json output is not to be human readable, and should
> therefore contain as much information as possible (all of it, why not). For
> example, why not always output the -y option? Considering no human should read
> the json output, there is no 'output polluting' per se. We could therefore
> incorporate timings, syscall count, syscall timestamps, ... This decision would
> allow us to also not abbreviate the argument lists. Discarding information
> would be left to the discretion of the user.

I agree that all available information should be included.  Whether
a particular piece of information is actually available or not is another
question.  For example, some information is readily available (e.g. syscall
name and number), some costs a syscall to obtain (e.g. timestamp, -y,
and -i on some architectures), some is quite expensive (e.g. -yy).
In each case the user decides how much information needs to be obtained.

> > BTW, the line-by-line JSON approach has names and even specs now!
> > See [3] , [4] and [5]
> 
> With line-delimited json, I am imagining this kind of output:
> 
> ---- start of the output
> {'syscall': 'dup2'}
> {'timestamp': '15:27:02'}
> {'eip': 139901979798028}
> {'args': [{'fd': 0, 'path': '/dev/pts/5'}, {'fd': 1, 'path': '/dev/pts/5'}]}
> ---- potential hang on the syscall
> {'ret': {'fd': 2, 'path': '/dev/pts/5'}}
> {'time': 0.000010}
> ---- delimiter of some sort
> {'syscall': 'close'}
> {'timestamp': '15:27:02'}
> {'eip': 139901978813632}
> {'args': [{'fd': 2, 'path': '/dev/pts/5'}]}
> ---- potential hang on the syscall
> {'ret': -1, 'errno': 13, 'error': 'EACCES', 'message': 'Permission denied'}
> {'time': 0.000010}
> ---- end of the output
> 
> Please correct me if my understanding of the json output we are expecting is
> not at all the same, but this feels right to me.

Yes, but please keep in mind that not all syscalls are that simple.  For
example, many syscalls have some of their arguments decoded on exiting
syscall, and some syscall arguments are decoded both on entering and
exiting syscall, e.g. _IOWR ioctls.
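
One possible way to represent this in line-delimited output, purely as a
sketch: tag each record with the phase in which the argument was decoded,
so an in/out argument yields one record on entering and another on
exiting.  The enum phase type, the emit_arg helper and the "phase", "arg"
and "value" keys are assumptions made for illustration; nothing here is
an agreed format or existing strace code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical phase tag, not an existing strace type. */
enum phase { PHASE_ENTER, PHASE_EXIT };

/*
 * Append one line-delimited JSON record describing a single argument
 * as decoded in the given phase.  buf must hold a NUL-terminated string.
 * Returns 0 on success, -1 if the record does not fit.
 */
static int emit_arg(char *buf, size_t size, enum phase ph,
		    const char *syscall, int argno, long value)
{
	size_t used = strlen(buf);
	int n;

	assert(used < size);
	n = snprintf(buf + used, size - used,
		     "{\"syscall\": \"%s\", \"phase\": \"%s\", "
		     "\"arg\": %d, \"value\": %ld}\n",
		     syscall, ph == PHASE_ENTER ? "enter" : "exit",
		     argno, value);
	if (n < 0 || (size_t) n >= size - used) {
		buf[used] = '\0';	/* drop the truncated record */
		return -1;
	}
	return 0;
}
```

Calling emit_arg twice for the same argument, once with PHASE_ENTER and
once with PHASE_EXIT, gives a consumer both the value passed in and the
value written back by the kernel.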

> The -p option is a bit of a problem: each pid given uses a different tcb
> structure. Creating a different output machine for each tcb
> structure would work in my opinion.

Exactly.

> It could simply output on different fds, or
> maybe use a multiplexing logic for managing multiple `output machines` on the
> same file descriptor.

This would follow the current practice: there is a multiplexing logic for
the regular mode, and in -ff mode each tcb has its own output descriptor.

> It feels obvious to me that outputting json on stderr with
> potential program output would make no sense and should not be handled.

There is an option (-o) to control this behaviour.

> The main issue I have not addressed is notification messages, unfinished,
> resume stuff and the like.
> 
> There are still a lot of questions to be asked and answers to be given but I'd
> like to know first your opinion on these few ideas.

I think a per-tcb output state machine with its own stack (remember about
nested objects) is the right approach.

-- 
ldv