[PATCH RFC v2 1/3] Introduce seccomp-assisted syscall filtering

Mon Aug 12 15:18:05 UTC 2019

On Mon, Aug 12, 2019 at 02:28:00PM +0200, Eugene Syromiatnikov wrote:
> On Wed, Aug 07, 2019 at 06:03:38PM +0200, Paul Chaignon wrote:
> > Thanks a lot for the review!  I've addressed most comments (I still need a
> > add a few more comments in the code) and have a couple answers and
> > questions below.
>
> My apologies for a belated reply.

No worries, I was occupied with filter generation improvements.

[...]

> > On Mon, Aug 05, 2019 at 07:59:25PM +0200, Eugene Syromiatnikov wrote:
> > > On Wed, Jul 31, 2019 at 05:35:56PM +0200, Paul Chaignon wrote:
> > > [...]
> > >
> > > I think this license boilerplate may be replaced with
> > > "SPDX-License-Identifier: LGPL-2.1-or-later", as all new contributions
> > > to strace are under LGPL 2.1+, but I'm not sure about details of
> > > re-licensing of a someone else's patch.
> >
> > Would that really be considered re-licensing?  In any case, I'll ask for
> > Chen Jingpiao's okay.
>
> I don't think that BSD -> LGPL transition requires explicit author's permission
> (at least, per my understanding of these licenses), but at the same time I'm not
> a lawyer, so I'd avoid relying on my opinion in this regard.

I just got Jingpiao's okay.

>
> > > > +static bool
> > > > +traced_by_seccomp(unsigned int scno, unsigned int p)
> > > > +{
> > > > +   if (!sysent_vec[p][scno].sys_func
> > > > +       || is_number_in_set_array(scno, trace_set, p)
> > > > +       || sysent_vec[p][scno].sys_flags & TRACE_INDIRECT_SUBCALL)
> > > > +           return true;
> > >
> > > > +   for (unsigned int i = 0; i < ARRAY_SIZE(traced_by_default); ++i)
> > > > +           if (sysent_vec[p][scno].sen == traced_by_default[i])
> > > > +                   return true;
> > >
> > > I wonder if it would be better if it would be rewritten into (yet another)
> > > syscall entry flag check (like, TRACE_SECCOMP_DEFAULT).
> >
> > It seems a bit odd to add it as a new syscall entry flag.  Other such
> > flags characterize their associated syscalls, whereas in this case, it's
> > mostly a list of syscalls we don't filter inside the kernel (e.g.,
> > socketcall) at the moment.  The list may change as the BPF program
> > supports more cases.  I'm not sure that difference matters though (?)
>
> There are, for example, SYSCALL_NEVER_FAILS, MEMORY_MAPPING_CHANGE,
> STACKTRACE_CAPTURE_ON_ENTER (and, to some extent, TRACE_PURE) that used
> for some internal purposes, I see no problem with adding yet another one.
>
> Even if BPF capabilities will change in future kernel versions, we still
> have to support older kernels that lack them (so we likely still want
> to filter them).

I don't think I explained this very well.  I don't expect the list of
traced syscalls to evolve because of kernel changes, but simply because we
will improve the BPF program to support more cases.  There's nothing that
prevents that today; I just want to keep this first patchset simple.

BPF capabilities in seccomp-bpf are also unlikely to improve in the near
future; the security folks are quite against using eBPF in seccomp.

>
> > > There's already some code which does nearly the same job,
> > > bpf_filter.c:print_bpf_filter_block(), I wonder if it can be re-used.
> >
> > We could even reuse print_bpf_fprog() I think.  I'm not sure how to do
> > that though as we'll need to output to stderr instead of current_tcp.  We
> > could create a fake current_tcp, but that sounds very hacky...  Any other
> > idea?
>
> The common approach is print to string in the intermal functions and
> just have different top-level wrappers.

I thought that would be too heavy a change.  I'll do that.

>
> > > Some compile-time check (i.e. static_assert) that corresponds to the statement
> > > in the comment is likely needed here, in order to ensure that if at any point
> > > there appears another architecture that has 3 personalities (line MIPS or Aarch64),
> > > the code will be updated if needed (MIPS n32 and Aarch64ilp32 are indeed different
> > > in that regard in comparison to x32).
> >
> > What makes you think architectures with 3 personalities other than x86
> > would cause issues here?  MIPS n32 already has the appropriate
> > AUDIT_ARCH_XXX constants (e.g., AUDIT_ARCH_MIPS64N32) and one should also
> > be added for aarch64 ilp32 [1] before it's merged in the kernel.
> >
> > 1 - https://lkml.org/lkml/2019/1/7/1272
>
> IIUC, checks for syscall mask are made only in case there are multiple
> personalities, which is somewhat x32-specific.

Yes, it's very much x32-specific.  Such checks are only needed if there
are multiple personalities and at least two use the same AUDIT_ARCH_XXX
constant.  It's the case for x32 and x86-64, but I don't think the kernel
folks will make the same mistake again :)

>
> > > > +#if SUPPORTED_PERSONALITIES > 1
> > > > +           if (audit_arch_vec[p].mask) {
> > >
> > > Again, it is not clear why audit_arch_vec[p].mask checking code is
> > > conditioned with "SUPPORTED_PERSONALITIES > 1"; some static check that
> > > ensures that there's no non-zero audit_arch_vec[p].mask present on any
> > > of the supported personalities would be nice.
> >
> > Why "again"?  This is the only audit_arch_vec[p].mask checking code (with
> > its simplified version in check_bpf_instruction_number()).  Did I miss a
> > comment?
> >
> > It's conditioned on SUPPORTED_PERSONALITIES value because the following
> > three BPF instructions are only needed if the architecture supports
> > several personalities and a mask is used to disambiguate two of them
> > (e.g., x86-64 and x32 ABIs).
> >
> > audit_arch_vec[p].mask can be null.  Maybe you meant
> > audit_arch_vec[p].arch?  audit_arch_vec[p].arch will be null for all
> > architectures with a single personality since we don't define
> > PERSONALITY{0,1,2}_AUDIT_ARCH constants for those.
>
> Yep, that was the gist of my concern: if mask is to be used for
> distinguishing between personalities, it is ought to be conditioned
> more explicitly.

I'm not sure I understand.  Do you mean you'd like us to verify .mask is
non-zero when it's supposed to be non-zero?

I don't see how to do that, because there's nothing that says that .mask
should be non-zero.  It's possible (and common) to have several
personalities and have a zero .mask.

>
> > > "mask" is an odd choice for a field name, it's more like a "flag" or "offset",
> > > judging by the way it is used.
> >
> > That's because it's weirdly used :-)  I kept it from the original patch, but
>
> > it would probably be best to switch those "+" into "|".
>
> Yes, please.
>
> > It's also called a mask in the seccomp(2) manpage (see __X32_SYSCALL_BIT).
>
> This is odd as well.

Ah, I see what you mean.  I'll rename to .flag.  I'm guessing it's called
mask elsewhere because of how it's used in the kernel (with bitwise
AND/OR).

[...]

> And then there's MIPS and its 1000-based bases, but it doesn't matter
> here (strace doesn't store MIPS syscall entries with offsets).

Thanks for pointing that out!  It doesn't matter here, but it does for the
binary match filter generation, for the following patchset.

Paul