[GSoC] Proposal: Support the decode of BPF map manipulation command by using tracee loaded BTF info

Wed Apr 5 00:34:30 UTC 2023

Hi,
    Below is my proposal for GSoC 2023. Please take a look and feel
free to make any suggestions on it.
-------
# Support the decode of BPF map manipulation command by using tracee loaded BTF info

## Abstract
The current strace cannot parse the arguments of map-related `bpf()`
syscalls very well. To improve this, we can retrieve the BTF info
from the tracee program and use it to decode the arguments of `bpf()`
better. For this GSoC 2023 session, I'd like to start with a small
subset of the bpf map manipulation commands (for example,
`BPF_MAP_LOOKUP_ELEM` and `BPF_MAP_DELETE_ELEM`). If these two work
well, I can expand the support to more bpf commands in the rest of
the GSoC timeline.

## Details
The current strace cannot parse the arguments of map-related `bpf()`
syscalls very well; more specifically, it can only print the syscall
args as plain values. For example, this is how strace handles the
`BPF_MAP_LOOKUP_ELEM` command of the bpf syscall:
```
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=4, key=0x7fffedfe3ed0,
value=0x7fffedfe3ed8, flags=BPF_ANY}, 32)
```
Printing only the address values is not useful for people who want to
find out what is behind those pointers. We need a parser that reads
the memory block a given pointer refers to and prints the data in
that memory correctly. We can parse the BTF info loaded by the tracee
program and use the key/value type info to fetch and decode what
those pointers point to.
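
To make the goal concrete, here is a minimal sketch of the last step
of such decoding, assuming the key or value has already been copied
from the tracee and its BTF type turned out to be a plain integer
(`BTF_KIND_INT`). The helper name `format_btf_int()` is hypothetical;
the `BTF_INT_*` macros come from the UAPI header `<linux/btf.h>`.
Bitfields and structs are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <linux/btf.h>

/*
 * Hypothetical helper: format an integer fetched from the tracee
 * according to the 32-bit info word that follows a BTF_KIND_INT record.
 * Only byte-aligned 8/16/32/64-bit integers are handled in this sketch.
 * Returns 0 on success, -1 for layouts it does not support.
 */
static int
format_btf_int(uint32_t int_info, const void *data, char *out, size_t out_len)
{
	uint32_t bits = BTF_INT_BITS(int_info);
	int is_signed = (BTF_INT_ENCODING(int_info) & BTF_INT_SIGNED) != 0;
	union {
		int8_t s8; int16_t s16; int32_t s32; int64_t s64;
		uint8_t u8; uint16_t u16; uint32_t u32; uint64_t u64;
	} v;
	long long s;
	unsigned long long u;

	/* A non-zero bit offset means a bitfield; not covered here. */
	if (BTF_INT_OFFSET(int_info) != 0)
		return -1;

	/* Copy exactly the integer's width, then read the matching member. */
	switch (bits) {
	case 8:  memcpy(&v, data, 1); s = v.s8;  u = v.u8;  break;
	case 16: memcpy(&v, data, 2); s = v.s16; u = v.u16; break;
	case 32: memcpy(&v, data, 4); s = v.s32; u = v.u32; break;
	case 64: memcpy(&v, data, 8); s = v.s64; u = v.u64; break;
	default: return -1;
	}

	if (is_signed)
		snprintf(out, out_len, "%lld", s);
	else
		snprintf(out, out_len, "%llu", u);
	return 0;
}
```

With a helper like this, a 4-byte signed key could be printed by its
value rather than by its address.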

The first problem is that strace cannot access the tracee's memory by
directly dereferencing those addresses. My approach is to use the
`process_vm_readv()` syscall. With this syscall, we only need to know
the pid of the target process and the address; then we can do a
`readv()`-style read on the target process's memory, which is pretty
convenient. Although
[ucopy.c](https://github.com/strace/strace/blob/master/src/ucopy.c)
does provide `vm_read_mem()`, built on the `process_read_mem()`
helper, calling `process_vm_readv()` directly lets us pass several
`struct iovec` entries at once, so the latency of reading multiple
chunks of BTF information may be lower than issuing one
`vm_read_mem()` call per chunk.
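
As a rough illustration of the batching idea, the sketch below
fetches both the key and the value buffer of a map command with one
`process_vm_readv()` call. The function name `read_key_and_value()`
and its parameters are assumptions made for this example, not
existing strace code, and the key/value sizes are assumed to be
already known from the map's type info.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the key and value buffers of a bpf() map
 * command out of the tracee with a single process_vm_readv() call.
 * Returns 0 on success, -1 on error with errno set.
 */
static int
read_key_and_value(pid_t pid,
		   uintptr_t key_addr, void *key_buf, size_t key_size,
		   uintptr_t value_addr, void *value_buf, size_t value_size)
{
	/* Destination buffers in the tracer. */
	struct iovec local[2] = {
		{ .iov_base = key_buf, .iov_len = key_size },
		{ .iov_base = value_buf, .iov_len = value_size },
	};
	/* Source ranges in the tracee's address space. */
	struct iovec remote[2] = {
		{ .iov_base = (void *) key_addr, .iov_len = key_size },
		{ .iov_base = (void *) value_addr, .iov_len = value_size },
	};

	ssize_t n = process_vm_readv(pid, local, 2, remote, 2, 0);
	if (n < 0)
		return -1;		/* errno was set by the syscall */
	if ((size_t) n != key_size + value_size) {
		errno = EFAULT;		/* partial transfer: treat as a fault */
		return -1;
	}
	return 0;
}
```

A real decoder would still need a fallback for kernels or
configurations where `process_vm_readv()` is unavailable.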

The second problem is that bpf map manipulation commands usually take
a map file descriptor as a parameter, but we cannot uniquely identify
a map's type info by its fd (an fd can be reused after the tracee
closes the current map and creates a new one). BPF does have a unique
reference, `btf_id`, from each map object to its type info; we can
obtain this `btf_id`, as well as the type ids of the map's key and
value, when the tracee creates the map with `BPF_MAP_CREATE`. To
maintain a mapping from a plain fd to its corresponding btf_id while
the tracee runs, I have 2 approaches:

### <b>1. Dynamically maintaining a lookup table to find the correct btf_id with given fd</b>
This approach is quite intuitive: we build a table that maps an fd to
its btf_id. We add a new entry to this lookup table when we encounter
`BPF_MAP_CREATE`. Every time we encounter a command like
`BPF_MAP_LOOKUP_ELEM` or `BPF_MAP_GET_NEXT_KEY`, we look up the
corresponding btf_id using the fd in the syscall argument list. We
also have to remove (or mark as invalid) the corresponding entry when
the tracee calls `close(fd)` on an fd that is present in our lookup
table. (I plan to start with a plain table, then switch to a hash
table as an optimization.)
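
A minimal sketch of the plain-table variant could look like the
following. All names (`map_table_add()`, `map_table_find()`,
`map_table_remove()`, `MAP_TABLE_SIZE`) are hypothetical, and the
sketch assumes a single tracee; a real implementation would need one
table per traced process and would have to deal with `dup()` and
`fork()` as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_TABLE_SIZE 64	/* arbitrary capacity for this sketch */

struct map_btf_entry {
	bool in_use;
	int map_fd;
	uint32_t btf_id;
	uint32_t key_type_id;
	uint32_t value_type_id;
};

static struct map_btf_entry map_table[MAP_TABLE_SIZE];

/* Look up the entry for a map fd seen in a map manipulation command. */
static struct map_btf_entry *
map_table_find(int map_fd)
{
	for (size_t i = 0; i < MAP_TABLE_SIZE; i++)
		if (map_table[i].in_use && map_table[i].map_fd == map_fd)
			return &map_table[i];
	return NULL;
}

/* Record a map when BPF_MAP_CREATE returns a new fd; -1 if table is full. */
static int
map_table_add(int map_fd, uint32_t btf_id,
	      uint32_t key_type_id, uint32_t value_type_id)
{
	struct map_btf_entry *e = map_table_find(map_fd);

	/* Reuse a stale entry for the same fd, otherwise take a free slot. */
	for (size_t i = 0; !e && i < MAP_TABLE_SIZE; i++)
		if (!map_table[i].in_use)
			e = &map_table[i];
	if (!e)
		return -1;

	*e = (struct map_btf_entry) {
		.in_use = true,
		.map_fd = map_fd,
		.btf_id = btf_id,
		.key_type_id = key_type_id,
		.value_type_id = value_type_id,
	};
	return 0;
}

/* Invalidate the entry when the tracee closes the fd. */
static void
map_table_remove(int fd)
{
	struct map_btf_entry *e = map_table_find(fd);

	if (e)
		e->in_use = false;
}
```

The linear scan keeps the first version simple; replacing
`map_table_find()` with a hash lookup later would not change the
callers.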

### <b>2. Use the ptrace() syscall to inject BPF_OBJ_GET_INFO_BY_FD and retrieve btf_id from the tracee</b>
`bpf()` provides the `BPF_OBJ_GET_INFO_BY_FD` command, which lets the
user get the btf_id for a given fd, so we only need a way to inject
such a syscall into the tracee. `ptrace()` can handle this: we use
`PTRACE_GETREGS` to obtain the current register state (including the
IP), then inject shellcode (a `bpf()` syscall with
`BPF_OBJ_GET_INFO_BY_FD` as the command field). After that, another
`PTRACE_GETREGS` gets the result of the syscall from the register
file (`bpf()` takes 3 arguments, and under the x86 Linux syscall
convention the return value is placed in %eax/%rax). To make sure the
program resumes execution normally, we also need to restore the
register values saved by the first `PTRACE_GETREGS` into the tracee
context (with `PTRACE_SETREGS`) after we "hijack" the tracee syscall.
This method may perform better than a lookup table when the tracee
creates a lot of maps. The drawback is that `struct pt_regs` is
architecture-specific, so it may take a lot of time to develop and
debug on different architectures.

Regarding approach #2, Andrei Vagin submitted a
[patch](https://lore.kernel.org/linux-api/20210414055217.543246-1-avagin@gmail.com/)
proposing a new syscall, `process_vm_exec()`. Although this patch
does not appear to have been accepted into Linux so far, it does
report that using `ptrace()` to inject a syscall into the tracee
address space takes 1446 ns to complete. I wonder whether this
latency is tolerable for tracing purposes.

With the `btf_id` and the argument data block we get from
`process_vm_readv()`, we can start to match the argument data block
against the correct BTF entry. We need a BTF parser to do this job.
The Linux kernel already has an example of a BTF parser
([/kernel/bpf/btf.c:btf_parse](https://elixir.bootlin.com/linux/latest/source/kernel/bpf/btf.c#L5396)),
so it should not be hard to write one using the kernel code as a
reference.
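
As a starting point, a userspace parser would first validate the BTF
header and locate the type and string sections. The sketch below
shows only that first step, under the assumption that the raw BTF
blob has already been copied out of the tracee; `struct btf_sections`
and `btf_locate_sections()` are names invented for this example,
while `struct btf_header`, `BTF_MAGIC` and `BTF_VERSION` come from
`<linux/btf.h>`. It assumes the blob uses the host's byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/btf.h>

/* Hypothetical result type: where the type and string sections live. */
struct btf_sections {
	const uint8_t *types;
	uint32_t types_len;
	const char *strings;
	uint32_t strings_len;
};

/* Returns 0 on success, -1 if the blob is not a well-formed BTF image. */
static int
btf_locate_sections(const void *blob, size_t blob_len,
		    struct btf_sections *out)
{
	struct btf_header hdr;

	if (blob_len < sizeof(hdr))
		return -1;
	memcpy(&hdr, blob, sizeof(hdr));	/* avoid unaligned access */

	if (hdr.magic != BTF_MAGIC || hdr.version != BTF_VERSION)
		return -1;
	if (hdr.hdr_len < sizeof(hdr) || hdr.hdr_len > blob_len)
		return -1;

	/* Section offsets are relative to the end of the header. */
	size_t body_len = blob_len - hdr.hdr_len;
	if (hdr.type_off > body_len || hdr.type_len > body_len - hdr.type_off)
		return -1;
	if (hdr.str_off > body_len || hdr.str_len > body_len - hdr.str_off)
		return -1;

	const uint8_t *body = (const uint8_t *) blob + hdr.hdr_len;
	out->types = body + hdr.type_off;
	out->types_len = hdr.type_len;
	out->strings = (const char *) body + hdr.str_off;
	out->strings_len = hdr.str_len;
	return 0;
}
```

Walking the individual `struct btf_type` records in the type section
to resolve a key or value type id would build on top of this.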

Above is my plan for enhancing the decoding of BPF map manipulation
commands. As stated, I will start with a small subset of map
manipulation commands, and once that works well, I can expand and
optimize this idea.

# Basic Information
Name: Boming Kong

Major: Computer Science

School: University of California, Santa Barbara

Github: bigjr-mkkong

Email: michaelkongboming at gmail.com

## Previous work
I have been doing BPF-related work since my freshman year, and I'm
now working on using UBPF (a userspace BPF virtual machine) to handle
kernel-user shared memory; here is the
[link](https://github.com/bigjr-mkkong/ubpf_with_sbpf). This project
is still experimental and mostly set up on my local machine, so there
are many patches/scripts that haven't been synced with the GitHub
version. Users can implement a small BPF program and pass it into a
UBPF virtual machine. Once verification succeeds, this bpf program
can be used to handle reads/writes on a kernel-user shared memory
region. Using shared memory helps avoid much of the overhead caused
by context switches, and since we're using bpf bytecode and a
kernel-side verifier, safety can be guaranteed. It is more flexible
than the existing io_uring since users can write a bpf program and do
more things on the shared memory instead of calling fixed functions
provided by the kernel.

## Academic studies
I learned basic C/C++ and system programming in high school, and I
made a small Linux-like
[kernel](https://github.com/bigjr-mkkong/OStest) before entering
college. I've taken data structures/algorithms and computer
architecture classes in college, and I also joined the UCSB computer
architecture lab, doing research related to BPF technology.

## Mini-project
I've submitted my mini-project to the mailing list. However, I
haven't received any response to it. Since the proposal deadline is
approaching, I can only post the link to my patch in the strace-devel
archive:
[Link to my patch](https://lists.strace.io/pipermail/strace-devel/2023-March/011261.html).

## Work time in SoC
I have some classes & exams before June 17 and another workshop from
June 17-21; after that I can work half-time.