[PATCH RFC v2 1/3] Introduce seccomp-assisted syscall filtering

Wed Aug 7 16:03:38 UTC 2019

Thanks a lot for the review!  I've addressed most comments (I still need a
add a few more comments in the code) and have a couple answers and
questions below.

On Mon, Aug 05, 2019 at 07:59:25PM +0200, Eugene Syromiatnikov wrote:
> On Wed, Jul 31, 2019 at 05:35:56PM +0200, Paul Chaignon wrote:
> > From: Chen Jingpiao <chenjingpiao at gmail.com>

[...]

> > + * Redistribution and use in source and binary forms, with or without
> > + * modification, are permitted provided that the following conditions
> > + * are met:
> > + * 1. Redistributions of source code must retain the above copyright
> > + *    notice, this list of conditions and the following disclaimer.
> > + * 2. Redistributions in binary form must reproduce the above copyright
> > + *    notice, this list of conditions and the following disclaimer in the
> > + *    documentation and/or other materials provided with the distribution.
> > + * 3. The name of the author may not be used to endorse or promote products
> > + *    derived from this software without specific prior written permission.
> > + *
> > + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
> > + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
> > + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
> > + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
> > + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
> > + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
> > + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>
> I think this license boilerplate may be replaced with
> "SPDX-License-Identifier: LGPL-2.1-or-later", as all new contributions
> to strace are under LGPL 2.1+, but I'm not sure about details of
> re-licensing of a someone else's patch.

Would that really be considered re-licensing?  In any case, I'll ask for
Chen Jingpiao's okay.

[...]

> > +static bool
> > +traced_by_seccomp(unsigned int scno, unsigned int p)
> > +{
> > +   if (!sysent_vec[p][scno].sys_func
> > +       || is_number_in_set_array(scno, trace_set, p)
> > +       || sysent_vec[p][scno].sys_flags & TRACE_INDIRECT_SUBCALL)
> > +           return true;
>
> > +   for (unsigned int i = 0; i < ARRAY_SIZE(traced_by_default); ++i)
> > +           if (sysent_vec[p][scno].sen == traced_by_default[i])
> > +                   return true;
>
> I wonder if it would be better if it would be rewritten into (yet another)
> syscall entry flag check (like, TRACE_SECCOMP_DEFAULT).

It seems a bit odd to add it as a new syscall entry flag.  Other such
flags characterize their associated syscalls, whereas in this case, it's
mostly a list of syscalls we don't filter inside the kernel (e.g.,
socketcall) at the moment.  The list may change as the BPF program
supports more cases.  I'm not sure that difference matters though (?)

[...]

> > +static void
> > +dump_seccomp_bpf(const struct sock_filter *filter, unsigned short len)
> > +{
> > +   for (unsigned int i = 0; i < len; ++i) {
> > +           switch (filter[i].code) {
> > +           case BPF_LD + BPF_W + BPF_ABS:
> > +                   switch (filter[i].k) {
> > +                   case offsetof(struct seccomp_data, arch):
> > +                           error_msg("STMT(BPF_LDWABS, data->arch)");
> > +                           break;
> > +                   case offsetof(struct seccomp_data, nr):
> > +                           error_msg("STMT(BPF_LDWABS, data->nr)");
> > +                           break;
> > +                   default:
> > +                           error_msg("STMT(BPF_LDWABS, 0x%x)",
> > +                                     filter[i].k);
> > +                   }
> > +                   break;
> > +           case BPF_RET + BPF_K:
> > +                   switch (filter[i].k) {
> > +                   case SECCOMP_RET_TRACE:
> > +                           error_msg("STMT(BPF_RET, SECCOMP_RET_TRACE)");
> > +                           break;
> > +                   case SECCOMP_RET_ALLOW:
> > +                           error_msg("STMT(BPF_RET, SECCOMP_RET_ALLOW)");
> > +                           break;
> > +                   default:
> > +                           error_msg("STMT(BPF_RET, 0x%x)", filter[i].k);
> > +                   }
> > +                   break;
> > +           case BPF_JMP + BPF_JEQ + BPF_K:
> > +                   error_msg("JUMP(BPF_JEQ, %u, %u, %u)",
> > +                             filter[i].jt, filter[i].jf,
> > +                             filter[i].k);
> > +                   break;
> > +           case BPF_JMP + BPF_JGE + BPF_K:
> > +                   error_msg("JUMP(BPF_JGE, %u, %u, %u)",
> > +                             filter[i].jt, filter[i].jf,
> > +                             filter[i].k);
> > +                   break;
> > +           case BPF_JMP + BPF_JA:
> > +                   error_msg("JUMP(BPF_JA, %u)", filter[i].k);
> > +                   break;
> > +           default:
> > +                   error_msg("STMT(0x%x, %u, %u, 0x%x)", filter[i].code,
> > +                             filter[i].jt, filter[i].jf, filter[i].k);
> > +           }
> > +   }
> > +}
>
> There's already some code which does nearly the same job,
> bpf_filter.c:print_bpf_filter_block(), I wonder if it can be re-used.

We could even reuse print_bpf_fprog() I think.  I'm not sure how to do
that though as we'll need to output to stderr instead of current_tcp.  We
could create a fake current_tcp, but that sounds very hacky...  Any other
idea?

[...]

> > +   /* Personnalities are iterated in reverse-order in the BPF program so that
> > +    * the x86 case is naturally handled.  In x86, the first and third
> > +    * personnalities have the same arch identifier.  The third can be
> > +    * distinguished based on its associated bit mask, so we check it first.
> > +    * The only drawback here is that the first personnality is more common,
> > +    * which may make the BPF program slower to match syscalls on average. */

[...]

> Some compile-time check (i.e. static_assert) that corresponds to the statement
> in the comment is likely needed here, in order to ensure that if at any point
> there appears another architecture that has 3 personalities (line MIPS or Aarch64),
> the code will be updated if needed (MIPS n32 and Aarch64ilp32 are indeed different
> in that regard in comparison to x32).

What makes you think architectures with 3 personalities other than x86
would cause issues here?  MIPS n32 already has the appropriate
AUDIT_ARCH_XXX constants (e.g., AUDIT_ARCH_MIPS64N32) and one should also
be added for aarch64 ilp32 [1] before it's merged in the kernel.

1 - https://lkml.org/lkml/2019/1/7/1272

[...]

> > +#if SUPPORTED_PERSONALITIES > 1
> > +           if (audit_arch_vec[p].mask) {
>
> Again, it is not clear why audit_arch_vec[p].mask checking code is
> conditioned with "SUPPORTED_PERSONALITIES > 1"; some static check that
> ensures that there's no non-zero audit_arch_vec[p].mask present on any
> of the supported personalities would be nice.

Why "again"?  This is the only audit_arch_vec[p].mask checking code (with
its simplified version in check_bpf_instruction_number()).  Did I miss a
comment?

It's conditioned on SUPPORTED_PERSONALITIES value because the following
three BPF instructions are only needed if the architecture supports
several personalities and a mask is used to disambiguate two of them
(e.g., x86-64 and x32 ABIs).

audit_arch_vec[p].mask can be null.  Maybe you meant
audit_arch_vec[p].arch?  audit_arch_vec[p].arch will be null for all
architectures with a single personality since we don't define
PERSONALITY{0,1,2}_AUDIT_ARCH constants for those.

[...]

> > +           for (unsigned int i = 0; i < nsyscall_vec[p]; ++i) {
> > +                   if (traced_by_seccomp(i, p)) {
> > +                           if (lower == UINT_MAX)
> > +                                   lower = i;
> > +                           continue;
> > +                   }
> > +                   if (lower == UINT_MAX)
> > +                           continue;
> > +                   pos += bpf_add_syscalls(filter + pos,
> > +                                           lower + audit_arch_vec[p].mask,
> > +                                           i + audit_arch_vec[p].mask);
> > +                   lower = UINT_MAX;
> > +           }
> > +           if (lower != UINT_MAX)
> > +                   pos += bpf_add_syscalls(filter + pos,
> > +                                           lower + audit_arch_vec[p].mask,
>
> > +                                           nsyscall_vec[p] + audit_arch_vec[p].mask);
>
> An overly long line.
>
> "mask" is an odd choice for a field name, it's more like a "flag" or "offset",
> judging by the way it is used.

That's because it's weirdly used :-)  I kept it from the original patch, but
it would probably be best to switch those "+" into "|".  It's also called
a mask in the seccomp(2) manpage (see __X32_SYSCALL_BIT).

[...]

> > diff --git a/filter_seccomp.h b/filter_seccomp.h
> > new file mode 100644
> > index 00000000..6942e05b
> > --- /dev/null
> > +++ b/filter_seccomp.h
> > @@ -0,0 +1,49 @@

[...]

>
> > + */
> > +
> > +#ifndef STRACE_SECCOMP_FILTER_H
> > +#define STRACE_SECCOMP_FILTER_H
> > +
> > +#include "defs.h"
> > +
>
> > +#ifdef HAVE_LINUX_SECCOMP_H
> > +# include <linux/seccomp.h>
> > +#endif
>
> I wonder what happens if it's not available. Or BPF_MAXINSNS is the only
> definition this include is needed?

I removed both; BPF_MAXINSNS is not needed in the header file anyway.

>
> > +
> > +#ifndef BPF_MAXINSNS
> > +# define BPF_MAXINSNS 4096
> > +#endif

[...]

> > diff --git a/linux/powerpc64/arch_defs_.h b/linux/powerpc64/arch_defs_.h
> > index 871f4109..a4ac007e 100644
> > --- a/linux/powerpc64/arch_defs_.h
> > +++ b/linux/powerpc64/arch_defs_.h
> > @@ -8,3 +8,5 @@
> >  #define HAVE_ARCH_OLD_SELECT 1
> >  #define SUPPORTED_PERSONALITIES 2
> >  #define HAVE_ARCH_DEDICATED_ERR_REG 1
> > +#define PERSONALITY0_AUDIT_ARCH { AUDIT_ARCH_PPC64, 0 }
> > +#define PERSONALITY1_AUDIT_ARCH { AUDIT_ARCH_PPC,   0 }
>
> There's powepc64le architecture now added (as it doesn't have compat),
> so the relevant definition will likely be needed there as well
> (something like "PERSONALITY0_AUDIT_ARCH { AUDIT_ARCH_PPC64LE, 0 }").

But powerpc64le has a single personality, right?  If that's the case, we
don't need to define any PERSONALITY{0,1,2}_AUDIT_ARCH.

[...]

> > diff --git a/linux/x32/arch_defs_.h b/linux/x32/arch_defs_.h
> > index 1055de12..9f48d313 100644
> > --- a/linux/x32/arch_defs_.h
> > +++ b/linux/x32/arch_defs_.h
> > @@ -11,3 +11,5 @@
> >  #define HAVE_ARCH_UID16_SYSCALLS 1
> >  #define HAVE_ARCH_OLD_TIME64_SYSCALLS 1
> >  #define SUPPORTED_PERSONALITIES 2
> > +#define PERSONALITY0_AUDIT_ARCH { AUDIT_ARCH_X86_64, __X32_SYSCALL_BIT }
> > +#define PERSONALITY1_AUDIT_ARCH { AUDIT_ARCH_I386,   0 }
> > diff --git a/linux/x86_64/arch_defs_.h b/linux/x86_64/arch_defs_.h
> > index a8c1d991..c2924ac2 100644
> > --- a/linux/x86_64/arch_defs_.h
> > +++ b/linux/x86_64/arch_defs_.h
> > @@ -9,3 +9,6 @@
> >  #define HAVE_ARCH_OLD_SELECT 1
> >  #define HAVE_ARCH_UID16_SYSCALLS 1
> >  #define SUPPORTED_PERSONALITIES 3
> > +#define PERSONALITY0_AUDIT_ARCH { AUDIT_ARCH_X86_64, 0 }
> > +#define PERSONALITY1_AUDIT_ARCH { AUDIT_ARCH_I386,   0 }
> > +#define PERSONALITY2_AUDIT_ARCH { AUDIT_ARCH_X86_64, __X32_SYSCALL_BIT }
>
> For IA-64, there's also SYSCALLENT_BASE_NR, which probably has to be
> accounted for.

Ok, then we'll need
"PERSONALITY0_AUDIT_ARCH { AUDIT_ARCH_IA64, SYSCALLENT_BASE_NR }".

This slightly changes how audit_arch_vec[p].mask is used; we now should
*add* it to the syscall number instead of *bitwise or'ing* it.  So we now
have two documented uses for audit_arch_vec[p].mask:
- for IA64, "+" is required, "|" works only has long as we have less than
  SYSCALLENT_BASE_NR IA64 syscalls;
- for x86, "|" is required, "+" works because we have less than
  __X32_SYSCALL_BIT x86 syscalls.

I'm tempted to use "nr = i | audit_arch_vec[p].mask" and add a static
check on the number of syscalls (< SYSCALLENT_BASE_NR).  The other way
around also works.

Paul