[GSOC2015] JSON Formatting

Sat Mar 7 17:33:45 UTC 2015

On Sat, Mar 07, 2015 at 05:38:40AM +0300, Dmitry V. Levin wrote:
> > I think adding abstraction would be hard to do in incremental patches, but I
> > agree on the need for them and a potential CI system.
> 
> There are hundreds of raw tprintf and tprints calls.  I wonder how could
> you introduce an output state machine in incremental patches.

Yes, that is indeed what I was thinking.

> > out_update_state(&om, OSTATE_SYSNAME);
> > out(&om, F_STR, tcp->s_ent->sys_name);
> > out_update_state(&om, OSTATE_ARGS);
> > out(&om, F_FD, tcp->u_args[0]);
> > out(&om, F_FD, tcp->u_args[1]);
> > out_update_state(&om, OSTATE_RET);
> > out(&om, F_FD, tcp->u_rval);
> > out_update_state(&om, OSTATE_DONE);
> > ...

I have discussed this with Gabriel Laskar over lunch, and a more
descriptive API seems preferable, something like:

output_sysname(&om, tcp->s_ent->sys_name);
output_arg(&om, F_FD, tcp->u_args[0]);
output_arg(&om, F_FD, tcp->u_args[1]);
output_ret(&om, F_FD, tcp->u_rval);

with output_arg() being a simple wrapper, for single-value arguments, around:

output_begin_arg(&om);
/* printing of a structure for example */
output(...);
output_end_arg(&om);

This is the result of a quick brainstorming and might change once again.
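
To make the idea a bit more concrete, here is a minimal sketch of how
such an API could sit on top of a small output machine. Everything in it
is an assumption made for illustration: struct out_machine, out_printf(),
output_value() and the F_INT format are hypothetical, the record is
accumulated in a fixed buffer instead of being written out, and the fd
path is left out.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

enum out_fmt { F_INT, F_FD };	/* F_INT is a hypothetical addition */

/* Hypothetical output machine: accumulates one JSON record in a buffer. */
struct out_machine {
	char buf[512];
	size_t len;
	int nargs;	/* arguments completed so far */
	int in_args;	/* nonzero while the "args" array is open */
};

/* Append formatted text; silently truncates if the buffer is full. */
static void
out_printf(struct out_machine *om, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(om->buf + om->len, sizeof(om->buf) - om->len, fmt, ap);
	va_end(ap);
	if (n > 0) {
		size_t room = sizeof(om->buf) - om->len - 1;
		om->len += (size_t) n < room ? (size_t) n : room;
	}
}

/* Start a record. Syscall names are plain identifiers, so no escaping. */
static void
output_sysname(struct out_machine *om, const char *name)
{
	om->len = 0;
	om->nargs = 0;
	om->in_args = 1;
	out_printf(om, "{\"syscall\": \"%s\", \"args\": [", name);
}

static void
output_begin_arg(struct out_machine *om)
{
	assert(om->in_args);
	if (om->nargs > 0)
		out_printf(om, ", ");
}

static void
output_end_arg(struct out_machine *om)
{
	assert(om->in_args);
	om->nargs++;
}

static void
output_value(struct out_machine *om, enum out_fmt fmt, long val)
{
	if (fmt == F_FD)
		out_printf(om, "{\"fd\": %ld}", val);
	else
		out_printf(om, "%ld", val);
}

/* Simple wrapper for arguments that consist of a single value. */
static void
output_arg(struct out_machine *om, enum out_fmt fmt, long val)
{
	output_begin_arg(om);
	output_value(om, fmt, val);
	output_end_arg(om);
}

/* Close the argument array, print the return value, end the record. */
static void
output_ret(struct out_machine *om, enum out_fmt fmt, long val)
{
	assert(om->in_args);
	om->in_args = 0;
	out_printf(om, "], \"ret\": ");
	output_value(om, fmt, val);
	out_printf(om, "}");
}
```

In this sketch the state changes are implied by which function is
called, so the decoder never touches the state directly; a structure
printer would call output_begin_arg() and output_end_arg() itself and do
its own printing in between.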

> > This is just an idea in the works and I don't know up to what point we could
> > shorten this with implied state change after the printing of the syscall name,
> > printing of multiple arguments in a single call, and the like.
> 
> This machine is going to be a bit more complex: it would have to support
> output of nested objects like structures containing arrays of structures
> (e.g. struct msghdr), but in general I think this is the right approach.

I have been looking for a smart way to print out structures in C, but
with the lack of reflection in the language it is not a simple task, and
it might require something like two compilation rounds.

Another solution might be to have a printing subroutine for each type of
structure, but this feels like overkill, and rewriting everything would
be very error-prone in my opinion. Writing unit tests on the fly would
take a very long time and would certainly end up with the project not
being finished by the deadline.

The above-mentioned idea of just using output_{begin,end}_arg()
functions in the main code path looks like a softer break from the
current state of the codebase, even though the abstraction would not be
as clean as in the ideas stated above.

I do not have any idea other than that, and feel a bit stuck here.

> > {'syscall': 'dup2', 'args': [{'fd': 0, 'path': '/dev/pts/5'}, {'fd': 1,
> > 'path': '/dev/pts/5'}], 'ret': {'fd': 2, 'path': '/dev/pts/5'}}
> > 
> > I strongly believe the json output is not to be human readable, and should
> > therefore contain as much information as possible (all of it, why not). For
> > example, why not always output the -y option? Considering no human should read
> > the json output, there is no 'output polluting' per se. We could therefore
> > incorporate timings, syscall count, syscall timestamps, ... This decision would
> > allow us to also not abbreviate the arguments lists. Discarding information
> > would be left to the discretion of the user.
> 
> I agree that all available information should be included.  Whether
> a particular piece of information is actually available or not is another
> question.  For example, some information is readily available (e.g syscall
> name and number), some costs a syscall to obtain (e.g. timestamp, -y,
> and -i on some architectures), some is quite expensive (e.g. -yy).
> In each case user decides how much information needs to be obtained.

At first I didn't think of the need for additional syscalls or costly
logic when outputting what the -i/-y/-yy options provide. I thought they
could be enabled by default when using json, but I see now how that
logic is flawed.
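
A minimal sketch of what I understand from this, assuming the optional
fields are gated by a bitmask filled in from the command line. The
json_field flags and json_format_fd() are hypothetical names, and the
path is passed in by the caller because resolving it is precisely the
costly part.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical bitmask of optional fields; a bit is set only when the
 * user passed the corresponding option (-t, -i, -y).
 */
enum json_field {
	JF_TIMESTAMP = 1 << 0,
	JF_INSTR_PTR = 1 << 1,
	JF_FD_PATH   = 1 << 2
};

/*
 * Format a file descriptor argument. The path is included only when the
 * user asked for it and the caller managed to resolve it (path != NULL).
 * The path is not escaped here; a real implementation would have to.
 * Returns the snprintf() result.
 */
static int
json_format_fd(char *buf, size_t size, unsigned int fields,
	       int fd, const char *path)
{
	assert(buf != NULL && size > 0);
	if ((fields & JF_FD_PATH) && path != NULL)
		return snprintf(buf, size,
				"{\"fd\": %d, \"path\": \"%s\"}", fd, path);
	return snprintf(buf, size, "{\"fd\": %d}", fd);
}
```

With this shape the json output carries every piece of information the
user has paid for, and nothing forces the extra syscalls on users who
did not ask for them.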

> > With line-delimited json, I am imagining this kind of output:
> > 
> > ---- start of the output
> > {'syscall': 'dup2'}
> > {'timestamp': '15:27:02'}
> > {'eip': 139901979798028}
> > {'args': [{'fd': 0, 'path': '/dev/pts/5'}, {'fd': 1, 'path': '/dev/pts/5'}]}
> > ---- potential hang on the syscall
> > {'ret': {'fd': 2, 'path': '/dev/pts/5'}}
> > {'time': 0.000010}
> > ---- delimiter of some sort
> > {'syscall': 'close'}
> > {'timestamp': '15:27:02'}
> > {'eip': 139901978813632}
> > {'args': [{'fd': 2, 'path': '/dev/pts/5'}]}
> > ---- potential hang on the syscall
> > {'ret': -1, 'errno': 13, 'error': 'EACCES', 'message': 'Permission denied'}
> > {'time': 0.000010}
> > ---- end of the output
> > 
> > Please correct me if my understanding of the json output we are expecting is
> > not at all the same, but this feels right to me.
> 
> Yes, but please keep in mind that not all syscalls are that simple.  For
> example, many syscalls have some of their arguments decoded on exiting
> syscall, and some syscall arguments are decoded both on entering and
> exiting syscall, e.g. _IOWR ioctls.

I fully understand that this is only the basic case. However, I don't
understand something regarding the arguments being printed after the end
of the syscall. Are they printed out twice? I looked a bit at the code
path for _IOWR ioctls and I don't see where this is done. The sys_ioctl
function looks like it is called at the same point as all the other sys_
functions.
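
For the line-delimited format, this is roughly how I picture handling
arguments that are only known on exit. It is a minimal sketch which
assumes the decoder is invoked once on entering and once on exiting the
syscall; that is my reading of the remark above, not something I have
confirmed in the code. struct fake_tcb, enum phase and the 'args_in' and
'args_out' keys are all made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum phase { PHASE_ENTERING, PHASE_EXITING };

/* Hypothetical stand-in for the few tcb fields this sketch needs. */
struct fake_tcb {
	const char *sys_name;
	long arg_in;	/* value that can be decoded on entry */
	long arg_out;	/* value written by the kernel, known on exit only */
	long rval;
};

/*
 * Write one line-delimited JSON record for the given phase into buf.
 * Returns the snprintf() result.
 */
static int
emit_record(char *buf, size_t size, const struct fake_tcb *tcp,
	    enum phase ph)
{
	assert(buf != NULL && tcp != NULL);
	if (ph == PHASE_ENTERING)
		return snprintf(buf, size,
				"{\"syscall\": \"%s\", \"args_in\": [%ld]}\n",
				tcp->sys_name, tcp->arg_in);
	return snprintf(buf, size,
			"{\"args_out\": [%ld], \"ret\": %ld}\n",
			tcp->arg_out, tcp->rval);
}
```

The point of the sketch is only that each phase produces its own
complete record, so a hang between the two never leaves a half-written
json object on the output.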

> 
> > The -p option is a bit of a problem: each pid given uses a different tcb
> > structure. Creating a different output machine for each tcb
> > structure would work in my opinion.
> 
> Exactly.
> 
> > It could simply output on different fds, or
> > maybe use a multiplexing logic for managing multiple `output machines` on the
> > same file descriptor.
> 
> This would follow the current practice: there is a multiplexing logic for
> the regular mode, and in -ff mode each tcb has its own output descriptor.

Given that json would be used in special cases and probably
read by a machine, I am entertaining the idea of making it
depend on -ff. This would remove the need for multiplexing logic.
What do you think?
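
As a minimal sketch of what that would mean in practice, assuming the
json files follow the same "<prefix>.<pid>" naming that -ff already uses
together with -o (json_outfname() is a hypothetical helper):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the per-process output file name "<prefix>.<pid>".
 * Returns 0 on success, -1 if the name does not fit in buf.
 */
static int
json_outfname(char *buf, size_t size, const char *prefix, int pid)
{
	int n;

	assert(buf != NULL && prefix != NULL);
	n = snprintf(buf, size, "%s.%d", prefix, pid);
	return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

Each tcb would then own one output machine and one file, and records
from different processes could never interleave.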

> > It feels obvious to me that outputting json on stderr with
> > potential program output would make no sense and should not be handled.
> 
> There is an option (-o) to control this behaviour.

Yes, but I meant simply disallowing json output to anything other than
a file given with -o. After all, mixing the output of the program and
json just doesn't make sense.

> > The main issue I have not addressed is notification messages, unfinished,
> > resume stuff and the like.

I am still uncertain about the way to 

> > There are still a lot of questions to be asked and answers to be given but I'd
> > like to know first your opinion on these few ideas.
> 
> I think a per-tcb output state machine with its own stack (remember about
> nested objects) is the right approach.

A recursive approach to nested objects would imply a subroutine for each
type of argument, am I right? Would this be the right approach? I feel
like it is too great a change and shifts a great deal of logic into
the output module.
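
To check my understanding of the stack idea, here is a minimal sketch of
a state that tracks open containers so that separators and closing
brackets come out right at any nesting depth. All names (struct
out_state, out_open(), out_close(), out_key(), out_int(),
out_msghdr_sketch()) are hypothetical, the output goes to a fixed buffer,
and only integers are handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define OUT_MAX_DEPTH 16

/* Hypothetical per-tcb output state: a buffer plus a stack of open containers. */
struct out_state {
	char buf[512];
	size_t len;
	int depth;
	int after_key;		/* next value follows a key: no separator */
	struct {
		char closer;	/* '}' or ']' */
		int nitems;	/* items already emitted in this container */
	} stack[OUT_MAX_DEPTH];
};

/* Append a string, truncating silently if the buffer is full. */
static void out_raw(struct out_state *os, const char *s)
{
	size_t n = strlen(s);
	size_t room = sizeof(os->buf) - os->len - 1;
	if (n > room)
		n = room;
	memcpy(os->buf + os->len, s, n);
	os->len += n;
	os->buf[os->len] = '\0';
}

/* Emit ", " when the innermost open container already holds an item. */
static void out_sep(struct out_state *os)
{
	if (os->after_key)
		os->after_key = 0;
	else if (os->depth > 0 && os->stack[os->depth - 1].nitems++ > 0)
		out_raw(os, ", ");
}

/* Push a container; opener is '{' or '['. */
static void out_open(struct out_state *os, char opener)
{
	char s[2] = { opener, '\0' };
	assert(os->depth < OUT_MAX_DEPTH);
	out_sep(os);
	os->stack[os->depth].closer = (opener == '{') ? '}' : ']';
	os->stack[os->depth].nitems = 0;
	os->depth++;
	out_raw(os, s);
}

/* Pop the innermost container and emit its matching closer. */
static void out_close(struct out_state *os)
{
	char s[2] = { '\0', '\0' };
	assert(os->depth > 0);
	s[0] = os->stack[--os->depth].closer;
	out_raw(os, s);
}

static void out_key(struct out_state *os, const char *name)
{
	out_sep(os);
	out_raw(os, "\"");
	out_raw(os, name);
	out_raw(os, "\": ");
	os->after_key = 1;
}

static void out_int(struct out_state *os, long v)
{
	char tmp[32];
	snprintf(tmp, sizeof(tmp), "%ld", v);
	out_sep(os);
	out_raw(os, tmp);
}

/* Example decoder for a msghdr-like object holding an array of structures. */
static void out_msghdr_sketch(struct out_state *os, const long *iov_len,
			      int iovcnt, long flags)
{
	int i;
	out_open(os, '{');
	out_key(os, "msg_iov");
	out_open(os, '[');
	for (i = 0; i < iovcnt; i++) {
		out_open(os, '{');
		out_key(os, "iov_len");
		out_int(os, iov_len[i]);
		out_close(os);
	}
	out_close(os);
	out_key(os, "msg_flags");
	out_int(os, flags);
	out_close(os);
}
```

The state has to be zeroed (for example with memset()) before the first
call. The decoder still needs one subroutine per structure type, which
is the part that worries me, but the nesting bookkeeping itself stays in
one place.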

-- 
Louis 'manny' Feuvrier
LSE - EPITA 2016