Proliferation of hundreds of tiny files

Sun Oct 25 02:12:59 UTC 2015

On Sat, Oct 24, 2015 at 07:49:43PM +0200, Denys Vlasenko wrote:
> On 10/24/2015 07:01 AM, Dmitry V. Levin wrote:
> > On Fri, Oct 23, 2015 at 09:53:02AM +0200, Denys Vlasenko wrote:
> >> On 10/23/2015 08:32 AM, Dmitry V. Levin wrote:
> >>> Hi,
> >>>
> >>> On Thu, Oct 22, 2015 at 05:27:24PM +0200, Denys Vlasenko wrote:
> >>>> Hello Dmitry,
> >>>>
> >>>> Two years ago, at the end of 2013, strace source had 29 *.c files
> >>>> in its main source directory (meaning: this does not count any tests).
> >>>> linux/* contained 148 files in total (not only *.c files).
> >>>>
> >>>> Today, there are 120 *.c files in main source directory,
> >>>> and 419 files in linux/*
> >>>
> >>> Ideally, each parser should have been in a separate file from the
> >>> beginning, that would make maintenance easier.
> >>> I'm working in this direction.
> >>
> >> It's easier in what way?
> > 
> > You must be joking.  Smaller translation unit has less context to care
> > about.
> 
> Translation unit is not a human, it's a piece of code.
> It does not feel hurt when it, say, has a bunch extra includes
> it does not need, or a function is being compiled after
> 100 other functions were preceding it. Even if they were unrelated
> to it. What's the difference?

Translation units are maintained by humans, and humans usually do better
with smaller ones.  Every extra include takes extra attention, too.
I certainly prefer chdir.c over old file.c

> >  For example, there used to be unwanted side effects when headers
> > included for one parser affected other parsers.
> 
> Can you be more specific? What side effects?
> 
> If two headers can't live together, then there is a bug in headers.
> It should be reported, and effects worked around while it is
> fixed upstream. Been there, done that.
> 
> All headers together (for example, all libc headers) define the API
> of the corresponding package (in my example, glibc). So it is legal,
> and should not lead to any bugs, to include *every* glibc header
> file at the beginning of a C program.

Unlike regular C programs, strace utilizes linux headers a lot, and libc
headers do not live well together with linux headers.  Situation is slowly
changing for the better, but we still support systems with rhel5-ish
headers, and will have to support them for quite some time.

> We don't do that not because it's wrong. (It is not).
> We don't do that because it slows down compilation.
> There is no need to include <syslog.h> if the program is not
> planning to use syslog().
> 
> But logically speaking, syslog() API *exists* regardless
> of whether you include its header or not.
> 
> Non-C languages such as Ada and Pascal do that all the time.
> They always have entire API visible to them.
> 
> 
> >> Let's get back to linux/x86_64/get_scno.c
> >> Here's the entire beginning of the file, I'm skipping nothing:
> >>
> >>
> >> #ifndef __X32_SYSCALL_BIT
> >> # define __X32_SYSCALL_BIT      0x40000000
> >> #endif
> >>
> >> unsigned int currpers;
> >>
> >> #if 1
> >> /* ... */
> >> if (x86_io.iov_len == sizeof(i386_regs)) {
> >>         scno = i386_regs.orig_eax;
> >>         currpers = 1;
> >> } else {
> >>
> >>
> >> So. What is "scno"? Is is a global variable? A local one?
> >> A function parameter?
> >>
> >> Is it a "long"? Or "unsigned long long"?
> >>
> >> To learn answers to these questions, I need to go
> >> to *another file* and look there!
> >>
> >> Usually, in C language people don't have to do that
> >> to find out such things.
> > 
> > Do you mean that an arch_get_scno function called by generic get_scno
> > would be easier to read than current #include "get_scno.c"?
> > Yes, I think it would.
> > 
> >> This is not easier. This is more obscure. IMHO, of course.
> >> You probably have logical reasons why I'm not right in my opinion.
> >> Can you voice these reasons?
> > 
> > Mixing numerous arch specific ifdefs with generic code is much harder to
> > read
> 
> It's harder to read an ifdef than to look at half a dozen files to
> see one function?

It is quite hard to follow the code with ~20 arch ifdefs per function.
Why do you need to look at half a dozen files to see one function is
beyond my understanding.

> >>>> Another example:
> >>>>
> >>>> commit 9b2f674adbd5c44fe892b31cf95703eeceb21c40
> >>>> Author: Dmitry V. Levin <ldv at altlinux.org>
> >>>> Date:   Sat Dec 6 03:53:16 2014 +0000
> >>>>
> >>>>     file.c: move chdir parser to a separate file
> >>>>
> >>>> I have a question: Why it's better to have it in a separate file?
> >>>> I mean, the
> >>>>
> >>>> #include "defs.h"
> >>>>
> >>>> SYS_FUNC(chdir)
> >>>> {
> >>>>         printpath(tcp, tcp->u_arg[0]);
> >>>>
> >>>>         return RVAL_DECODED;
> >>>> }
> >>>>
> >>>> needs a separate file why?
> >>>
> >>> file.c was a huge mess, I cleaned it up as much as I could by moving all
> >>> non-stat parsers to separate files.  The remaining part of file.c is still
> >>> a mess, but now it's in a more maintainable state.
> 
> For example, busybox/shell/ash.c is 299163 bytes long and it contains
> all kinds of unrelated code: from shell syntax parser to random number
> generator for $RANDOM to job control to signal handling:
> 
> 	http://git.busybox.net/busybox/tree/shell/ash.c
> 
> Looks like it also would be a "huge mess" in you eyes.

I'm so lucky I don't have to maintain that.

> Where's the problem? What would I gain if I split it up into 20 files?
> Will code become smaller? Faster? Will it compile faster?
> I suspect all three are answered with "no".
> Maybe whatever bugs and non-elegant bits it has now will get fixed by
> a split? Again, no.

Speaking of myself, splitting the strace code made my life as a maintainer
easier.  Without that, I wouldn't be able to contribute as much as I did.

btw, I'm nursing an idea of putting all syscall parsers into a libstrace.a
file so that, from one side, newer architectures won't get linked code
they don't use, and from another side, this optimization will happen
automatically without any extra ifdefs.

> >> Sorry. You are saying that it's cleaner this way, but you don't explain
> >> *why* it's cleaner this way. I don't see any particular benefits
> >> when a larger *.c file gets split into three dozen tiny ones.
> > 
> > Sorry, I think it's obvious.  The larger the source file, the more
> > unrelated context is there and more chances of unwanted side effects.
> 
> Please give me an example of this "unwanted side effect", I am not sure
> what you mean by that.
> 
> > At
> > the same time, I see no benefits from unrelated parsers being crammed into
> > large source files.
> 
> I can give you one example off the top of my head:
> 
> Due to file.c split, you had to make *printsigmask* family of functions
> non-static.

No, that was due to signal.c split (commits
74219ea36f86d934abfb962964047f05e611baba and
a3c483545a7fb3a075f885a01a3c58b2f84db8fa).

> On i386, gcc optimizes static functions to use more efficient "regparm"
> calling convention instead of ABI-mandated stack-based one.
> It can do this because it knows no one outside of the module
> can see its statics, so if nothing takes address of this static function,
> then compiler is not bound by ABI.

The ptrace call that is needed to fetch the data for sprintsigmask_n makes
this optimization negligible anyway.

P.S.  I'm going offline for a week and during that time I won't be able to
contribute to this thread.

-- 
ldv
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 181 bytes
Desc: not available
URL: <http://lists.strace.io/pipermail/strace-devel/attachments/20151025/4aa8e927/attachment.bin>