strace of io_uring events?

Thu Jul 16 15:12:35 UTC 2020

On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
> On Wed, Jul 15, 2020 at 04:07:00PM -0700, Kees Cook wrote:
> [...]
> 
> > Speaking to Stefano's proposal[1]:
> > 
> > - There appear to be three classes of desired restrictions:
> >   - opcodes for io_uring_register() (which can be enforced entirely with
> >     seccomp right now).
> >   - opcodes from SQEs (this _could_ be intercepted by seccomp, but is
> >     not currently written)
> >   - opcodes of the types of restrictions to restrict... for making sure
> >     things can't be changed after being set? seccomp already enforces
> >     that kind of "can only be made stricter"
> 
> In addition we want to limit the SQEs to use only the registered fd and buffers.

Hmm, good point. Yeah, since it's an "extra" mapping (io_uring file number
vs fd number) this doesn't really map well to seccomp. (And frankly,
there's some difficulty here mapping many of the io_uring operations to
seccomp because it's happening "deeper" than the syscall layer, i.e.
some of the arguments have already been resolved into kernel object
pointers, etc.)

> Do you think it's better to have everything in seccomp instead of adding
> the restrictions in io_uring (the patch isn't very big)?

I'm still trying to understand how io_uring will be used, and it seems
odd to me that it's effectively a seccomp bypass. (Though from what I
can tell it is not an LSM bypass, which is good. I'm worried there might
be some embedded assumptions in LSMs about creds vs current, and LSMs
may try to reason (or report) on actions with the kthread in mind, but
afaict everything important is checked against creds.)

> With seccomp, would it be possible to have different restrictions for two
> instances of io_uring in the same process?

For me, this is the most compelling reason to have the restrictions NOT
implemented via seccomp. Trying to make a "which instance" choice in
seccomp would be extremely clumsy.

So at this point, I think it makes sense for the restriction series to
carry on -- it is io_uring-specific and solves some problems that
seccomp is not in a good position to reason about.

All this said, I'd still like a way to apply seccomp to io_uring
because it's a rather giant syscall filter bypass mechanism, and gaining
access (IIUC) is possible without actually calling any of the io_uring
syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
access to the SQ and CQ, and off it goes? (The only glitch I see is
waking up the worker thread?)

What appears to be the worst bit about adding seccomp to io_uring is the
almost complete disassociation of process hierarchy from syscall action.
Only a cred is used for io_uring, and seccomp filters are associated with
task structs. I'm not sure if there is a way to solve this disconnect
without a major internal refactoring of seccomp to attach to creds and
then make every filter attachment create a new cred... *head explody*

-- 
Kees Cook