[PATCH RFC 2/2] seccomp: implement SECCOMP_FILTER_FLAG_NO_INHERITANCE

Mon Nov 18 18:32:45 UTC 2019

On Mon, Nov 18, 2019 at 04:27:25AM +0300, Dmitry V. Levin wrote:
> On Sun, Nov 17, 2019 at 06:52:04PM +0100, Paul Chaignon wrote:
> > On Fri, Nov 15, 2019 at 05:56:50PM +0100, Paul Chaignon wrote:
> > > On Fri, Nov 15, 2019 at 06:55:55PM +0300, Dmitry V. Levin wrote:
> > > > On Fri, Nov 15, 2019 at 04:46:14PM +0100, Paul Chaignon wrote:
> > > > > On Thu, Nov 14, 2019 at 07:06:20PM +0100, Paul Chaignon wrote:
> > > > > > On Mon, Nov 11, 2019 at 11:44:50PM +0300, Dmitry V. Levin wrote:
> > > > > > > On Mon, Nov 11, 2019 at 04:20:20PM +0100, Paul Chaignon wrote:
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > > >  /*
> > > > > > > >   * All BPF programs must return a 32-bit value.
> > > > > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > > > > index 55af6931c6ec..1df5b058b067 100644
> > > > > > > > --- a/kernel/fork.c
> > > > > > > > +++ b/kernel/fork.c
> > > > > > > > @@ -1605,6 +1605,11 @@ static void copy_seccomp(struct task_struct *p)
> > > > > > > >  	 */
> > > > > > > >  	assert_spin_locked(&current->sighand->siglock);
> > > > > > > >  
> > > > > > > > +	if (!inherited_seccomp_filter(current)) {
> > > > > > > > +		clear_tsk_thread_flag(p, TIF_SECCOMP);
> > > > > > > > +		return;
> > > > > > > > +	}
> > > > > > > 
> > > > > > > Since the task can have many seccomp filters (chained by seccomp_filter.prev)
> > > > > > > and SECCOMP_FILTER_FLAG_NO_INHERIT flag affects only those filters that were
> > > > > > > created using this flag, this code should copy only those filters that are
> > > > > > > excluded from the fork inheritance.
> > > > > > 
> > > > > > Hm, I had the semantics all wrong.  I'll send a v2 with a new patch and
> > > > > > test.
> > > > > 
> > > > > Actually, I think we should just mandate that all seccomp filters in the
> > > > > list have the same value for flag SECCOMP_FILTER_FLAG_NO_FORK_INHERIT.
> > > > > This should be as simple as checking that the flag has the same value for
> > > > > a new filter and its parent.  In case of TSYNC we should be fine since all
> > > > > threads need to have the same ancestors as the calling thread.
> > > > > 
> > > > > If we don't do that, we'll have to inherit partial lists, which prevents
> > > > > us from using references since we'll need to modify field .prev of some
> > > > > filters.  I'm guessing making actual copies of filters will not be
> > > > > acceptable upstream...
> > > > > 
> > > > > One big issue with this approach is if we want to strace an application
> > > > > that itself uses seccomp-bpf.  In that case, the application will likely
> > > > > fail to set up its own filter because flag NO_FORK_INHERIT differ.
> > > > 
> > > > Seccomp filters are ubiquitous nowadays, software like systemd-nspawn
> > > > generates and applies seccomp filters by default.  If the proposed feature
> > > > does not support this workflow, does it have a chance to be accepted
> > > > into the kernel?
> > > 
> > > Yep, that was my concern too...  I'll have a second look after the weekend
> > > in case I missed something, but I don't see any other clear solution at
> > > the moment :(
> > 
> > Actually, we could relax the constraint: when adding a new filter with
> > flag NO_FORK_INHERIT, we can simply mandate that all *previous* filters
> > (again, we only need to check the previous one) have NO_FORK_INHERIT as
> > well.  Then, forks inherit a sublist of filters starting with the first
> > filter without NO_FORK_INHERIT set.
> 
> Would it be possible to add a traditional seccomp filter after
> a NO_FORK_INHERIT one?  If yes, how is it going to work?
> If no, is there any security implications?

Scratch that, it won't work.  I always forget that new filters are added
at the beginning of the list, not at the end.

What if we had the PID inside the BPF program?  We could compare that
against a hardcoded value and shortcut to RET_ALLOW if they match.

The new issue is how to retrieve the PID though.  Adding to seccomp_data
might add too much overhead and adding a new instruction might be too big
a change.  Plus, a new call instruction sounds a bit too much like we're
reimplementing eBPF...

> Anyway, ...
> 
> > Since strace will typically install the first seccomp filter, we should be
> > fine, right?  What do you think?
> 
> ... this means that strace --seccomp-bpf would imply -f or not depending
> on circumstances, e.g. whether it's invoked in a container or not.

Yep, that would have been a mess too :/

Paul