strace of io_uring events?

Wed Jul 15 23:07:00 UTC 2020

Earlier Andy Lutomirski wrote:
> Let’s add some seccomp folks. We probably also want to be able to run
> seccomp-like filters on io_uring requests. So maybe io_uring should call into
> seccomp-and-tracing code for each action.

Okay, I'm finally able to spend time looking at this. And thank you to
the many people that CCed me into this and earlier discussions (at least
Jann, Christian, and Andy).

It *seems* like there is a really clean mapping of SQE OPs to syscalls.
To that end, yes, it should be trivial to add ptrace and seccomp support
(sort of). The trouble comes with doing _interception_, which is what
both ptrace and seccomp are designed around.

In the basic case of seccomp, various syscalls are just being checked
for accept/reject. It seems like that would be easy to wire up. For the
more ptrace-y things (SECCOMP_RET_TRAP, SECCOMP_RET_USER_NOTIF, etc),
I think any such results would need to be "upgraded" to "reject". Things
are a bit complex in that seccomp's form of "reject" can be "return
errno" (easy) or it can be "kill thread (or thread_group)" which ...
becomes less clear. (More on this later.)

In the basic case of "I want to run strace", this is really just a
creative use of ptrace in that interception is being used only for
reporting. Does ptrace need to grow a way to create/attach an io_uring
eventfd? Or should there be an entirely different tool for
administrative analysis of io_uring events (kind of how disk IO can be
monitored)?

For io_uring generally, I have a few comments/questions:

- Why did a new syscall get added that couldn't be extended? All new
  syscalls should be using Extended Arguments. :(

- Why aren't the io_uring syscalls in the man-pages git? (It seems like
  they're in liburing, but that should document the _library_, not the
  syscalls, yes?)
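
For context on the first point: assuming "Extended Arguments" refers to
the size-versioned struct convention used by clone3() and openat2(), the
kernel side of that convention is handled by copy_struct_from_user(). The
sketch below is a userspace model of its size check only; the name
copy_struct_compat is hypothetical and not a kernel function.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace model of the size handling done by the kernel's
 * copy_struct_from_user(). dst is the kernel's view of the struct (ksize
 * bytes), src is what the caller passed (usize bytes).
 */
static int copy_struct_compat(void *dst, size_t ksize,
                              const void *src, size_t usize)
{
	size_t n = usize < ksize ? usize : ksize;
	const unsigned char *s = src;
	size_t i;

	/* Newer caller, older kernel: unknown trailing fields must be zero. */
	for (i = ksize; i < usize; i++)
		if (s[i] != 0)
			return -E2BIG;

	/* Older caller, newer kernel: fields the caller lacks read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, s, n);
	return 0;
}
```

With this shape, a syscall can grow new fields at the end of its argument
struct without needing a new syscall number.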

Speaking to Stefano's proposal[1]:

- There appear to be three classes of desired restrictions:
  - opcodes for io_uring_register() (which can be enforced entirely with
    seccomp right now).
  - opcodes from SQEs (this _could_ be intercepted by seccomp, but is
    not currently written)
  - opcodes for the restriction types themselves, presumably to make
    sure restrictions can't be changed after being set? seccomp already
    enforces that kind of "can only be made stricter" rule.

- Credentials vs no_new_privs needs examination (more on this later)
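
To illustrate the first class above (io_uring_register() opcodes), here
is a minimal sketch of a classic-BPF seccomp filter that allows only the
buffer register/unregister opcodes and fails everything else with EPERM.
The allowlist, the OP_* names, and restrict_io_uring_register() are
illustrative choices, not anything from Stefano's series. It assumes a
little-endian machine (so the low 32 bits of args[1] sit at its base
offset) and omits the seccomp_data.arch check that a real filter needs.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register 427	/* x86-64, arm64 and most others */
#endif

/* Local copies of two opcode values from <linux/io_uring.h>. */
#define OP_REGISTER_BUFFERS	0	/* IORING_REGISTER_BUFFERS */
#define OP_UNREGISTER_BUFFERS	1	/* IORING_UNREGISTER_BUFFERS */

static struct sock_filter uring_register_filter[] = {
	/* [0] load the syscall number */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* [1] io_uring_register()? yes: skip to [3], no: fall to [2] */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_io_uring_register, 1, 0),
	/* [2] every other syscall is allowed */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	/* [3] load the opcode (second argument, low 32 bits) */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, args[1])),
	/* [4], [5] allowlisted opcodes jump to the final ALLOW at [7] */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, OP_REGISTER_BUFFERS, 2, 0),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, OP_UNREGISTER_BUFFERS, 1, 0),
	/* [6] any other opcode fails with EPERM */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
	/* [7] */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Returns 0 on success or a negative errno value. */
static int restrict_io_uring_register(void)
{
	struct sock_fprog prog = {
		.len = sizeof(uring_register_filter) /
		       sizeof(uring_register_filter[0]),
		.filter = uring_register_filter,
	};

	/* Required for unprivileged filter installation. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -errno;
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
		return -errno;
	return 0;
}
```

A process would call restrict_io_uring_register() once, before handing
the ring to less trusted code. Nothing comparable exists today for the
opcodes carried inside SQEs, which is the second class above.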

So, I think, at least for restrictions, seccomp should absolutely be
the place to get this work done. It already covers 2 of the 3 points in
the proposal.

Solving the mapping of seccomp interception types into CQEs (or anything
more severe) will likely inform what it would mean to map ptrace events
to CQEs. So, I think they're related, and we should get seccomp hooked
up right away, and that might help us see how (if) ptrace should be
attached.

Specifically for seccomp, I see at least the following design questions:

- How does no_new_privs play a role in the existing io_uring credential
  management? Using _any_ kind of syscall-effective filtering, whether
  it's seccomp or Stefano's existing proposal, needs to address the
  potential inheritable restrictions across privilege boundaries (which is
  what no_new_privs tries to eliminate). In regular syscall land, this is
  an issue when a filter follows a process through a setuid execve():
  the process gains privileges, and the filter-creator can now trick it
  into doing weird stuff. io_uring has a concept of alternative credentials,
  so I have to ask about it. (I don't *think* there would be a path to
  install a filter before gaining privilege, but I likely just
  need to do my homework on the io_uring internals.) Regardless,
  use of seccomp by io_uring would need to have this issue "solved"
  in the sense that it must be "safe" to filter io_uring OPs, from a
  privilege-boundary-crossing perspective.

- From which task perspective should filters be applied? It seems like it
  needs to follow the io_uring personalities, as that contains the
  credentials. (This email is a brain-dump so far -- I haven't gone to
  look to see if that means io_uring is literally getting a reference to
  struct cred; I assume so.) Seccomp filters are attached to task_struct.
  However, for v5.9, seccomp will gain a more generalized get/put system
  for having filters attached to the SECCOMP_RET_USER_NOTIF fd. Adding
  more get/put-ers for some part of the io_uring context shouldn't
  be hard.

- How should seccomp return values be applied? Three seem okay:
	SECCOMP_RET_ALLOW: do SQE action normally
	SECCOMP_RET_LOG: do SQE action, log via seccomp
	SECCOMP_RET_ERRNO: skip actions in SQE and pass errno to CQE
  The rest not so much:
	SECCOMP_RET_TRAP: can't send SIGSYS anywhere sane?
	SECCOMP_RET_TRACE: no tracer, can't send SIGSYS?
	SECCOMP_RET_USER_NOTIF: can't do user_notif rewrites?
	SECCOMP_RET_KILL_THREAD: kill which thread?
	SECCOMP_RET_KILL_PROCESS: kill which thread group?
  If TRAP, TRACE, and USER_NOTIF need to be "upgraded" to KILL_THREAD,
  what does KILL_THREAD mean? Does it really mean "shut down the entire
  SQ?" Does it mean kill the worker thread? Does KILL_PROCESS mean kill
  all the tasks with an open mapping for the SQ?
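
The return-value mapping in the last item can be written down as a small
userspace sketch. It covers only the three "okay" actions and leaves the
rest explicitly undecided, since what they should mean is the open
question. All names here (sqe_verdict, sqe_verdict_from_seccomp,
SKETCH_MAX_ERRNO) are hypothetical; none of this is existing kernel code.

```c
#include <stdint.h>
#include <linux/seccomp.h>

#define SKETCH_MAX_ERRNO 4095	/* mirrors the kernel's MAX_ERRNO clamp */

enum sqe_disposition {
	SQE_RUN,		/* perform the SQE action normally */
	SQE_RUN_AND_LOG,	/* perform it and log via seccomp */
	SQE_FAIL,		/* skip the action, report errno in the CQE */
	SQE_UNRESOLVED,		/* TRAP, TRACE, USER_NOTIF, KILL_*: undecided */
};

struct sqe_verdict {
	enum sqe_disposition disp;
	int cqe_res;		/* value for cqe->res when disp == SQE_FAIL */
};

static struct sqe_verdict sqe_verdict_from_seccomp(uint32_t ret)
{
	uint32_t data = ret & SECCOMP_RET_DATA;

	switch (ret & SECCOMP_RET_ACTION_FULL) {
	case SECCOMP_RET_ALLOW:
		return (struct sqe_verdict){ SQE_RUN, 0 };
	case SECCOMP_RET_LOG:
		return (struct sqe_verdict){ SQE_RUN_AND_LOG, 0 };
	case SECCOMP_RET_ERRNO:
		/* The low 16 bits carry the errno; CQEs report it negated. */
		if (data > SKETCH_MAX_ERRNO)
			data = SKETCH_MAX_ERRNO;
		return (struct sqe_verdict){ SQE_FAIL, -(int)data };
	default:
		return (struct sqe_verdict){ SQE_UNRESOLVED, 0 };
	}
}
```

Whatever policy is eventually chosen for the unresolved actions would
replace the default branch.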

Anyway, I'd love to hear what folks think, but given the very direct
mapping from SQE OPs to syscalls, I really think seccomp needs to be
inserted in here somewhere to maintain any kind of sensible reasoning
about syscall filtering.

-Kees

[1] https://lore.kernel.org/lkml/20200710141945.129329-3-sgarzare@redhat.com/

-- 
Kees Cook