Proliferation of hundreds of tiny files

Sat Oct 24 17:49:43 UTC 2015

On 10/24/2015 07:01 AM, Dmitry V. Levin wrote:
> On Fri, Oct 23, 2015 at 09:53:02AM +0200, Denys Vlasenko wrote:
>> On 10/23/2015 08:32 AM, Dmitry V. Levin wrote:
>>> Hi,
>>>
>>> On Thu, Oct 22, 2015 at 05:27:24PM +0200, Denys Vlasenko wrote:
>>>> Hello Dmitry,
>>>>
>>>> Two years ago, at the end of 2013, strace source had 29 *.c files
>>>> in its main source directory (meaning: this does not count any tests).
>>>> linux/* contained 148 files in total (not only *.c files).
>>>>
>>>> Today, there are 120 *.c files in main source directory,
>>>> and 419 files in linux/*
>>>
>>> Ideally, each parser should have been in a separate file from the
>>> beginning, that would make maintenance easier.
>>> I'm working in this direction.
>>
>> It's easier in what way?
> 
> You must be joking.  Smaller translation unit has less context to care
> about.

Translation unit is not a human, it's a piece of code.
It does not feel hurt when it, say, has a bunch extra includes
it does not need, or a function is being compiled after
100 other functions were preceding it. Even if they were unrelated
to it. What's the difference?

>  For example, there used to be unwanted side effects when headers
> included for one parser affected other parsers.

Can you be more specific? What side effects?

If two headers can't live together, then there is a bug in headers.
It should be reported, and effects worked around while it is
fixed upstream. Been there, done that.

All headers together (for example, all libc headers) define the API
of the corresponding package (in my example, glibc). So it is legal,
and should not lead to any bugs, to include *every* glibc header
file at the beginning of a C program.

We don't do that not because it's wrong. (It is not).
We don't do that because it slows down compilation.
There is no need to include <syslog.h> if the program is not
planning to use syslog().

But logically speaking, syslog() API *exists* regardless
of whether you include its header or not.

Non-C languages such as Ada and Pascal do that all the time.
They always have entire API visible to them.

>> Let's get back to linux/x86_64/get_scno.c
>> Here's the entire beginning of the file, I'm skipping nothing:
>>
>>
>> #ifndef __X32_SYSCALL_BIT
>> # define __X32_SYSCALL_BIT      0x40000000
>> #endif
>>
>> unsigned int currpers;
>>
>> #if 1
>> /* ... */
>> if (x86_io.iov_len == sizeof(i386_regs)) {
>>         scno = i386_regs.orig_eax;
>>         currpers = 1;
>> } else {
>>
>>
>> So. What is "scno"? Is is a global variable? A local one?
>> A function parameter?
>>
>> Is it a "long"? Or "unsigned long long"?
>>
>> To learn answers to these questions, I need to go
>> to *another file* and look there!
>>
>> Usually, in C language people don't have to do that
>> to find out such things.
> 
> Do you mean that an arch_get_scno function called by generic get_scno
> would be easier to read than current #include "get_scno.c"?
> Yes, I think it would.
> 
>> This is not easier. This is more obscure. IMHO, of course.
>> You probably have logical reasons why I'm not right in my opinion.
>> Can you voice these reasons?
> 
> Mixing numerous arch specific ifdefs with generic code is much harder to
> read

It's harder to read an ifdef than to look at half a dozen files to
see one function?

>>>> Another example:
>>>>
>>>> commit 9b2f674adbd5c44fe892b31cf95703eeceb21c40
>>>> Author: Dmitry V. Levin <ldv at altlinux.org>
>>>> Date:   Sat Dec 6 03:53:16 2014 +0000
>>>>
>>>>     file.c: move chdir parser to a separate file
>>>>
>>>> I have a question: Why it's better to have it in a separate file?
>>>> I mean, the
>>>>
>>>> #include "defs.h"
>>>>
>>>> SYS_FUNC(chdir)
>>>> {
>>>>         printpath(tcp, tcp->u_arg[0]);
>>>>
>>>>         return RVAL_DECODED;
>>>> }
>>>>
>>>> needs a separate file why?
>>>
>>> file.c was a huge mess, I cleaned it up as much as I could by moving all
>>> non-stat parsers to separate files.  The remaining part of file.c is still
>>> a mess, but now it's in a more maintainable state.

For example, busybox/shell/ash.c is 299163 bytes long and it contains
all kinds of unrelated code: from shell syntax parser to random number
generator for $RANDOM to job control to signal handling:

	http://git.busybox.net/busybox/tree/shell/ash.c

Looks like it also would be a "huge mess" in you eyes.

Where's the problem? What would I gain if I split it up into 20 files?
Will code become smaller? Faster? Will it compile faster?
I suspect all three are answered with "no".
Maybe whatever bugs and non-elegant bits it has now will get fixed by
a split? Again, no.

>> Sorry. You are saying that it's cleaner this way, but you don't explain
>> *why* it's cleaner this way. I don't see any particular benefits
>> when a larger *.c file gets split into three dozen tiny ones.
> 
> Sorry, I think it's obvious.  The larger the source file, the more
> unrelated context is there and more chances of unwanted side effects.

Please give me an example of this "unwanted side effect", I am not sure
what you mean by that.

> At
> the same time, I see no benefits from unrelated parsers being crammed into
> large source files.

I can give you one example off the top of my head:

Due to file.c split, you had to make *printsigmask* family of functions
non-static.

On i386, gcc optimizes static functions to use more efficient "regparm"
calling convention instead of ABI-mandated stack-based one.
It can do this because it knows no one outside of the module
can see its statics, so if nothing takes address of this static function,
then compiler is not bound by ABI.