[PATCH RFC v2 1/3] Introduce seccomp-assisted syscall filtering

Mon Aug 12 12:28:00 UTC 2019

On Wed, Aug 07, 2019 at 06:03:38PM +0200, Paul Chaignon wrote:
> Thanks a lot for the review!  I've addressed most comments (I still need a
> add a few more comments in the code) and have a couple answers and
> questions below.

My apologies for a belated reply.

> On Mon, Aug 05, 2019 at 07:59:25PM +0200, Eugene Syromiatnikov wrote:
> > On Wed, Jul 31, 2019 at 05:35:56PM +0200, Paul Chaignon wrote:
> > [...]
> >
> > I think this license boilerplate may be replaced with
> > "SPDX-License-Identifier: LGPL-2.1-or-later", as all new contributions
> > to strace are under LGPL 2.1+, but I'm not sure about details of
> > re-licensing of a someone else's patch.
> 
> Would that really be considered re-licensing?  In any case, I'll ask for
> Chen Jingpiao's okay.

I don't think that BSD -> LGPL transition requires explicit author's permission
(at least, per my understanding of these licenses), but at the same time I'm not
a lawyer, so I'd avoid relying on my opinion in this regard.

> > > +static bool
> > > +traced_by_seccomp(unsigned int scno, unsigned int p)
> > > +{
> > > +   if (!sysent_vec[p][scno].sys_func
> > > +       || is_number_in_set_array(scno, trace_set, p)
> > > +       || sysent_vec[p][scno].sys_flags & TRACE_INDIRECT_SUBCALL)
> > > +           return true;
> >
> > > +   for (unsigned int i = 0; i < ARRAY_SIZE(traced_by_default); ++i)
> > > +           if (sysent_vec[p][scno].sen == traced_by_default[i])
> > > +                   return true;
> >
> > I wonder if it would be better if it would be rewritten into (yet another)
> > syscall entry flag check (like, TRACE_SECCOMP_DEFAULT).
> 
> It seems a bit odd to add it as a new syscall entry flag.  Other such
> flags characterize their associated syscalls, whereas in this case, it's
> mostly a list of syscalls we don't filter inside the kernel (e.g.,
> socketcall) at the moment.  The list may change as the BPF program
> supports more cases.  I'm not sure that difference matters though (?)

There are, for example, SYSCALL_NEVER_FAILS, MEMORY_MAPPING_CHANGE,
STACKTRACE_CAPTURE_ON_ENTER (and, to some extent, TRACE_PURE) that used
for some internal purposes, I see no problem with adding yet another one.

Even if BPF capabilities will change in future kernel versions, we still
have to support older kernels that lack them (so we likely still want
to filter them).

> > There's already some code which does nearly the same job,
> > bpf_filter.c:print_bpf_filter_block(), I wonder if it can be re-used.
> 
> We could even reuse print_bpf_fprog() I think.  I'm not sure how to do
> that though as we'll need to output to stderr instead of current_tcp.  We
> could create a fake current_tcp, but that sounds very hacky...  Any other
> idea?

The common approach is print to string in the intermal functions and
just have different top-level wrappers.

> > Some compile-time check (i.e. static_assert) that corresponds to the statement
> > in the comment is likely needed here, in order to ensure that if at any point
> > there appears another architecture that has 3 personalities (line MIPS or Aarch64),
> > the code will be updated if needed (MIPS n32 and Aarch64ilp32 are indeed different
> > in that regard in comparison to x32).
> 
> What makes you think architectures with 3 personalities other than x86
> would cause issues here?  MIPS n32 already has the appropriate
> AUDIT_ARCH_XXX constants (e.g., AUDIT_ARCH_MIPS64N32) and one should also
> be added for aarch64 ilp32 [1] before it's merged in the kernel.
> 
> 1 - https://lkml.org/lkml/2019/1/7/1272

IIUC, checks for syscall mask are made only in case there are multiple
personalities, which is somewhat x32-specific.

> > > +#if SUPPORTED_PERSONALITIES > 1
> > > +           if (audit_arch_vec[p].mask) {
> >
> > Again, it is not clear why audit_arch_vec[p].mask checking code is
> > conditioned with "SUPPORTED_PERSONALITIES > 1"; some static check that
> > ensures that there's no non-zero audit_arch_vec[p].mask present on any
> > of the supported personalities would be nice.
> 
> Why "again"?  This is the only audit_arch_vec[p].mask checking code (with
> its simplified version in check_bpf_instruction_number()).  Did I miss a
> comment?
> 
> It's conditioned on SUPPORTED_PERSONALITIES value because the following
> three BPF instructions are only needed if the architecture supports
> several personalities and a mask is used to disambiguate two of them
> (e.g., x86-64 and x32 ABIs).
> 
> audit_arch_vec[p].mask can be null.  Maybe you meant
> audit_arch_vec[p].arch?  audit_arch_vec[p].arch will be null for all
> architectures with a single personality since we don't define
> PERSONALITY{0,1,2}_AUDIT_ARCH constants for those.

Yep, that was the gist of my concern: if mask is to be used for
distinguishing between personalities, it is ought to be conditioned
more explicitly.

> > "mask" is an odd choice for a field name, it's more like a "flag" or "offset",
> > judging by the way it is used.
> 
> That's because it's weirdly used :-)  I kept it from the original patch, but

> it would probably be best to switch those "+" into "|".

Yes, please.

> It's also called a mask in the seccomp(2) manpage (see __X32_SYSCALL_BIT).

This is odd as well.

> > There's powepc64le architecture now added (as it doesn't have compat),
> > so the relevant definition will likely be needed there as well
> > (something like "PERSONALITY0_AUDIT_ARCH { AUDIT_ARCH_PPC64LE, 0 }").
> 
> But powerpc64le has a single personality, right?  If that's the case, we
> don't need to define any PERSONALITY{0,1,2}_AUDIT_ARCH.

Yes, that's correct.

> > For IA-64, there's also SYSCALLENT_BASE_NR, which probably has to be
> > accounted for.
> 
> Ok, then we'll need
> "PERSONALITY0_AUDIT_ARCH { AUDIT_ARCH_IA64, SYSCALLENT_BASE_NR }".
> 
> This slightly changes how audit_arch_vec[p].mask is used; we now should
> *add* it to the syscall number instead of *bitwise or'ing* it.  So we now
> have two documented uses for audit_arch_vec[p].mask:
> - for IA64, "+" is required, "|" works only has long as we have less than
>   SYSCALLENT_BASE_NR IA64 syscalls;
> - for x86, "|" is required, "+" works because we have less than
>   __X32_SYSCALL_BIT x86 syscalls.
> 
> I'm tempted to use "nr = i | audit_arch_vec[p].mask" and add a static
> check on the number of syscalls (< SYSCALLENT_BASE_NR).  The other way
> around also works.

Since on IA-64 the base is a power of 2 (luckily for us), I'd stick with
treating it as a flag, using "|" and having a static check.

And then there's MIPS and its 1000-based bases, but it doesn't matter
here (strace doesn't store MIPS syscall entries with offsets).