[GSoC] Proposal Draft RFC: Implementing PCAP dump and decoding library interface

Tue Apr 4 08:12:09 UTC 2023

Hi,

I have finished my proposal draft, and I think it is ready for submission.

Here is an online version, with source code and a rendered PDF. I'll put a
pandoc-converted markdown version below.

https://www.overleaf.com/read/gpymywypsscf

As for the MicroProject, I need more time on big-endian environments.
Emulators run slowly, so the tests need hours to finish. I'll submit a
new patchset once the tests pass on the three big-endian platforms where
the test failures were confirmed.

And here's my proposal:

=== Proposal: Implementing PCAP dump and decoding library interface ===

Author: Zerui Li

# Abstract

I propose to create a new PCAP link type that captures system calls, by
treating syscalls as network connections, and to implement
a PCAP dump option and a decoding library interface for strace. This
will enable the capturing and decoding of system calls in external tools
like Wireshark, similar to how network packets are captured.

----------

This will be a huge project. I will definitely not be able to implement
all of these features in a single GSoC period. I'll try my best during
GSoC and continue to improve this afterwards. It may be more appropriate
to consider this a general proposal rather than a GSoC one. In the
timeline section, I'll clarify what I'll do during GSoC. I'm not yet
familiar enough with strace, so there will be errors. Feel free to
comment on Overleaf or the mailing list.

# The purpose

It may be convenient to consider syscalls as "network traffic" between
userspace processes and the kernel, and to use mature networking
toolchains such as the PCAP file format and Wireshark to do the dirty
work. This allows:

-   Live capturing and processing
-   Space efficient binary dump file
-   Re-using Wireshark's user interface
-   Uniform output format that retains as much information as possible
-   Filtering syscalls without re-running the tracee each time

But a binary format basically needs another "strace project" to print
it in a user-friendly way. So it's also essential to allow strace to act
as a library that pretty-prints syscalls on demand rather than in
real time.

Strace currently captures system calls and presents them in a readable
format in real-time, with the capturing and formatting functions being
tightly integrated. Decoders retrieve information about the tracee
as the decoding proceeds. Decoupling these two parts would give strace
more flexibility and room for customization.

If all of this is implemented, and the Wireshark side (mostly
dissectors) is done, one can launch Wireshark, choose the strace
external capture, start strace in PCAP dump mode, and capture syscalls
as if capturing network packets, with each field and flag decoded by the
strace library.

It will also be easier to develop pipe-based "plugins" that transform
the PCAP output into other formats, for example GitHub#34 (note the
difference between ordinary network packets and the syscall link type
we are discussing).

# Current state

The strace project is based on the assumption that capturing and
decoding happen at the same time and on the same machine as the
syscalls themselves.

That is, we need to decouple the frontend (human-friendly output) from
the backend (ptrace and vm_read) of strace, and store as much
information as the frontend needs.

Also, strace is configured and built for the same architecture that it
will trace. But PCAP is (mostly, and needs to be) architecture
independent. So the build scripts need to be modified to support
decoding non-native architectures.

There were several previous proposals and attempts at JSON-based
structured output. I don't think implementing another structured output
option duplicates them. Those JSON outputs are decoded and human
readable. They're for applications that don't care about the binary
representation and architecture-dependent details. In contrast, the PCAP
dump does not decode anything. It just snapshots the current state of
`struct tcb` and other relevant information for future use.

# Technical Details

I'm still investigating rather than implementing. If problems are
encountered, THIS SECTION MAY BE SUBJECT TO CHANGE.

In the "packet" we'll define, any information strace could possibly use
in decoding the syscall needs to be saved. We can find out what is
actually needed by running the decoders at capture time and acting like
a man in the middle: capture what the decoders request, then replay it
to them later.

The most basic information is the architecture and endianness of the
tracee. The `struct tcb` contains most stateful syscall information, of
course. But more information is retrieved by decoders on demand. For
example, they call `umoven` and `umovstr` to copy a small part of
tracee's memory into strace's memory space. Besides, there are also
procfs accesses.

To simplify the PCAP decoder, `size_t` and `kernel_long_t` are always
stored as 8-byte integers, but in the target's endianness.
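As a rough illustration of this widening rule, the sketch below stores every long-sized value as 8 bytes in the tracee's byte order and reads it back on any host. The helper names `put_u64`, `get_u64` and `widen_long32` are hypothetical and not part of strace; the `big_endian` flag is assumed to be derived from the captured architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store v as 8 bytes in the tracee's byte order. */
void put_u64(uint8_t out[8], uint64_t v, int big_endian)
{
    assert(out != NULL);
    for (size_t i = 0; i < 8; i++) {
        size_t shift = big_endian ? (7 - i) * 8 : i * 8;
        out[i] = (uint8_t)(v >> shift);
    }
}

/* Hypothetical helper: read the 8 bytes back, independent of host order. */
uint64_t get_u64(const uint8_t in[8], int big_endian)
{
    assert(in != NULL);
    uint64_t v = 0;
    for (size_t i = 0; i < 8; i++) {
        size_t shift = big_endian ? (7 - i) * 8 : i * 8;
        v |= (uint64_t)in[i] << shift;
    }
    return v;
}

/* Widen a signed long taken from a 32-bit tracee, preserving its sign. */
uint64_t widen_long32(uint32_t raw)
{
    return (uint64_t)(int64_t)(int32_t)raw;
}
```

Working byte by byte keeps the dump writer and the decoder free of host-endianness assumptions, at the cost of a few extra shifts per field.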

## The "packet" format

The packet endianness should be specified in the PCAP file header, just
as with netlink and other platform-dependent protocols. The packet size
should be read from the PCAP packet header, along with the timestamp.
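For reference, here is a minimal sketch of the classic pcap container that such a dump could sit in: a 24-byte file header followed by a 16-byte record header in front of each packet. The headers are written in host byte order, and a reader detects that order from whether the magic number reads as 0xa1b2c3d4 or byte-swapped. The function names are hypothetical, the snapshot length is arbitrary, and the link type is only the private-use placeholder LINKTYPE_USER0 (147), since no link type has been assigned for syscalls.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PCAP_MAGIC_USEC 0xa1b2c3d4u /* microsecond-resolution pcap */
#define LINKTYPE_USER0  147u        /* placeholder, private-use value */

struct pcap_file_hdr {
    uint32_t magic;
    uint16_t version_major;
    uint16_t version_minor;
    int32_t  thiszone;
    uint32_t sigfigs;
    uint32_t snaplen;
    uint32_t linktype;
};

struct pcap_rec_hdr {
    uint32_t ts_sec;
    uint32_t ts_usec;
    uint32_t incl_len; /* bytes stored in the file */
    uint32_t orig_len; /* original packet length */
};

/* Write the 24-byte file header; returns 0 on success, -1 on error. */
int write_file_hdr(FILE *fp)
{
    assert(fp != NULL);
    const struct pcap_file_hdr h = {
        .magic = PCAP_MAGIC_USEC,
        .version_major = 2,
        .version_minor = 4,
        .thiszone = 0,
        .sigfigs = 0,
        .snaplen = 262144, /* arbitrary upper bound for this sketch */
        .linktype = LINKTYPE_USER0,
    };
    return fwrite(&h, sizeof(h), 1, fp) == 1 ? 0 : -1;
}

/* Write one record header plus its payload; returns 0 or -1. */
int write_record(FILE *fp, uint32_t sec, uint32_t usec,
                 const void *pkt, uint32_t len)
{
    assert(fp != NULL && pkt != NULL);
    const struct pcap_rec_hdr r = { sec, usec, len, len };
    if (fwrite(&r, sizeof(r), 1, fp) != 1)
        return -1;
    return fwrite(pkt, 1, len, fp) == len ? 0 : -1;
}
```

The syscall "packet" described below would be the payload passed to `write_record`, so its size and timestamp never need to be repeated inside the packet itself.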

        struct packet_hdr {
            uint16_t arch;
            uint16_t karch;
            uint16_t flags;
            uint16_t reserved;
        };

`struct packet_hdr` above identifies the strace build needed to decode
this packet, as there are arch and karch variables in the autotools
scripts. The flags field records the features enabled in the strace that
captured this packet; the decoder may not support a superset of these
flags. The last 2 bytes are reserved.

The packet body consists of `struct tcb`'s most important fields and an
auxiliary section count.

        struct packet_body {
            int64_t tcb_flags;
            int32_t pid;
            enum trace_event te;
            kernel_ulong_t scno;
            kernel_ulong_t u_error;
            kernel_long_t u_rval;
            kernel_ulong_t u_arg[MAX_ARGS];
            uint32_t currpers;
            uint32_t auxcnt;
            /* MAX_ARGS is known from (k)arch fields in pkt_hdr */
        };

The "`kernel_long_t`" fields are considered 8 bytes long regardless of
the actual architecture.

The `te` or "trace event" field above is the reason why this packet is
captured.

Note that we omitted "`real_scno`" as it only differs from "`scno`"
in a multi-personality environment. The "`real_scno`" is rarely
used, only when printing the numeric syscall number, and can be
calculated if we know all the personality-related information.

"`currpers`" is a global variable in strace, but we need to snapshot it
too, to allow the decoder to run in the right personality mode.

There may be other global variables I've missed. I'll keep an eye out
for them while inspecting strace's code.

The additional information retrieved by decoders will be placed in
auxiliary sections.

        struct aux_hdr {
            uint16_t typ;
            uint16_t flg;
            uint32_t len;
        };

Every auxiliary section begins with this "`struct aux_hdr`". The types
are defined below, the flags vary between types, and the length does not
include the header itself.
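A minimal sketch of how a reader might walk these sections is shown below. It assumes the sections follow each other with no padding and that the dump was written in the host's byte order; both are simplifications, and `aux_walk` and `aux_cb` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct aux_hdr {
    uint16_t typ;
    uint16_t flg;
    uint32_t len; /* payload length, header excluded */
};

/* Hypothetical callback: return non-zero to stop the walk early. */
typedef int (*aux_cb)(const struct aux_hdr *h, const uint8_t *payload,
                      void *arg);

/*
 * Visit up to auxcnt sections stored back to back in buf.
 * Returns the number of sections fully visited, or -1 if the
 * buffer ends before a header or payload is complete.
 */
int aux_walk(const uint8_t *buf, size_t size, uint32_t auxcnt,
             aux_cb cb, void *arg)
{
    assert(buf != NULL);
    size_t off = 0;
    uint32_t i;

    for (i = 0; i < auxcnt; i++) {
        struct aux_hdr h;

        if (size - off < sizeof(h))
            return -1;
        memcpy(&h, buf + off, sizeof(h)); /* avoids unaligned access */
        off += sizeof(h);
        if (size - off < h.len)
            return -1;
        if (cb != NULL && cb(&h, buf + off, arg) != 0)
            break;
        off += h.len;
    }
    return (int)i;
}
```

Because each header carries its own length, a decoder can skip section types it does not understand, which helps older builds read dumps from newer ones.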

### Memory dump

This section begins with a "`kernel_ulong_t addr`" followed by
"`len - sizeof(kernel_ulong_t)`" bytes of memory dump. It is usually
triggered by the "umoven" and "umovstr" functions.

### File dump

This section begins with a "`struct filedmp_hdr`" header defined below,
followed by `fn_len` bytes of filename (without the `'\0'` terminator)
and `fc_len` bytes of file content. Note that strace does not seek in
files, so the offset is omitted. Setting the `FILEDUMP_READLINK` flag
indicates that the file content is actually the symlink destination,
which is quite reasonable with respect to how symlinks are implemented.

        struct filedmp_hdr {
            uint32_t fn_len;
            uint32_t fc_len;
        };
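The layout of a file dump section can be illustrated with a small bounds-checked parser. This is only a sketch: it assumes the filename is immediately followed by the file content, that the payload is in host byte order, and `filedmp_view` and `filedmp_parse` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct filedmp_hdr {
    uint32_t fn_len;
    uint32_t fc_len;
};

/* Hypothetical view into a section payload; nothing is copied. */
struct filedmp_view {
    const uint8_t *name;    /* fn_len bytes, not NUL-terminated */
    uint32_t fn_len;
    const uint8_t *content; /* fc_len bytes */
    uint32_t fc_len;
};

/* Returns 0 on success, -1 if the payload is too short. */
int filedmp_parse(const uint8_t *payload, size_t len,
                  struct filedmp_view *out)
{
    struct filedmp_hdr h;

    assert(payload != NULL && out != NULL);
    if (len < sizeof(h))
        return -1;
    memcpy(&h, payload, sizeof(h));

    size_t rest = len - sizeof(h);
    if (h.fn_len > rest || h.fc_len > rest - h.fn_len)
        return -1;

    out->name = payload + sizeof(h);
    out->fn_len = h.fn_len;
    out->content = out->name + h.fn_len;
    out->fc_len = h.fc_len;
    return 0;
}
```

Since the filename carries no terminator, callers need to compare it with `memcmp` and the stored length rather than with string functions.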

### Other syscalls

`ioctl`s, `getxattr`s and other non-file operations may be used by
strace, so we dump their args and results into this section. The syscall
is indicated by the flags, and the content structure depends on which
syscall the decoder used. The document will be updated if this is to be
implemented.

### Signals

A `siginfo_t` dump if the event is triggered by a signal.

## Library interface

Before using the library to parse a packet, you should check the
(k)arch.

The library will be built with a `struct tcb` with fewer fields,
balancing the burden on the caller, the refactoring needed in strace,
and the situation I'll face while implementing this.

`SYS_FUNC`s and xlats will make up most of the library. They're put
directly into the user's linking namespace, so conflicts may occur.

The "umoven" family of functions, readlink / file read operations and
other syscalls will be turned into auxiliary section lookups. If such an
operation cannot be found in these sections, then we should tell the
decoder to give up on it. This maximizes compatibility between strace
versions and build flavors.
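The lookup idea can be sketched as follows for memory reads. The function `replay_umoven` and `struct mem_region` are hypothetical: the regions are assumed to have been collected from the memory dump sections of the current packet, and the 0 / -1 return convention is chosen here so that a decoder can treat a missing region like a failed read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory index of one recorded memory dump section. */
struct mem_region {
    uint64_t addr;       /* tracee address of the first byte */
    const uint8_t *data; /* recorded bytes */
    size_t len;          /* number of recorded bytes */
};

/*
 * Copy len bytes at tracee address addr into laddr, using only the
 * recorded regions. Returns 0 on success, or -1 when no single region
 * fully covers the request, so the decoder can give up on this field.
 */
int replay_umoven(const struct mem_region *regions, size_t n,
                  uint64_t addr, size_t len, void *laddr)
{
    assert(laddr != NULL);
    assert(regions != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        const struct mem_region *r = &regions[i];

        if (addr < r->addr)
            continue;
        uint64_t off = addr - r->addr;
        if (off > r->len || len > r->len - off)
            continue;
        memcpy(laddr, r->data + off, len);
        return 0;
    }
    return -1;
}
```

This sketch does not stitch together a request that spans two adjacent regions; whether that is needed depends on how the capture side records overlapping reads.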

The caller will pass in a packet as described above, and query the whole
log string, or specific fields or data related to the syscall, on
demand.

We must reduce global variable usage, both to avoid namespace pollution
and to allow multi-threading.

A possible library interface could be:

        struct strace_ctx {
            // some fields here..
            /* may be exposed and stabilized for uses like getting
             * scno or rval
             */
        };

        void libstrace_init(struct strace_ctx*);
        int libstrace_packet(struct strace_ctx*, void *packet,
                             size_t packet_len);

        const char* libstrace_log(struct strace_ctx*);

Other APIs for retrieving information should also be provided.
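To show the intended call order (init, feed a packet, query the log), here is a stub of the three functions. Only the prototypes come from the proposal; the `log` field and the stub bodies are placeholders, and nothing here performs real decoding. The minimum length of 8 bytes is the size of `struct packet_hdr`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct strace_ctx {
    char log[128]; /* placeholder field, not part of the proposal */
};

void libstrace_init(struct strace_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->log[0] = '\0';
}

/* Stub: a real implementation would run the strace decoders here. */
int libstrace_packet(struct strace_ctx *ctx, void *packet,
                     size_t packet_len)
{
    uint16_t arch;

    assert(ctx != NULL);
    if (packet == NULL || packet_len < 8) /* shorter than packet_hdr */
        return -1;
    memcpy(&arch, packet, sizeof(arch));
    snprintf(ctx->log, sizeof(ctx->log),
             "<undecoded packet, arch=%u, %zu bytes>",
             (unsigned)arch, packet_len);
    return 0;
}

/* Returns the log string produced for the last accepted packet. */
const char *libstrace_log(struct strace_ctx *ctx)
{
    assert(ctx != NULL);
    return ctx->log;
}
```

Keeping all state inside `struct strace_ctx` is what would let several contexts decode packets independently, in line with the goal of reducing global variables.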

# The Timeline

### From 05-29 to 06-04
Implement a basic PCAP file dump option, without support for
personalities, auxiliary sections, or signals.

### From 06-05 to 06-18
(two weeks, several exams here) Add auxiliary sections dump and signal
support.

### From 06-19 to 06-25
Tests on the dump.

### From 06-26 to 07-01
Personality support. Update docs and news, test on other architectures.

### From 07-02 to 07-08
Compile some simple decoders into library.

### From 07-09 to 07-15
Implement umoven / umovstr and file read interceptors.

### From 07-16 to 07-29
(two weeks) Implement a PCAP dump replayer that uses the library.

### From 07-30 to 08-05
Write a test script that automatically runs a PCAP dump, replays it,
and diffs the result.

### From 08-06 to 08-12
Tests on other architectures and multiple personalities, bugfixes.

### From 08-13 to 08-20
Bugfixes if needed. Wireshark dissectors if time permits.

### Final week
Prepare the report and final patchset.

If I encounter something unexpected, the timeline may be extended.

If big issues come up that need a huge refactor beyond my ability to
fix, some of the targets above may be given up.

# About me

My name is Zerui Li, and I use leedagee as my nickname.

I'm a freshman at Huazhong University of Science and Technology,
majoring in computer science.

I don't have any previous internship or academic research experience. I
used to compete in algorithm contests, but soon became an open-source
enthusiast. I can read and write C without problems. I have done some
packaging work on AUR and openSUSE's OBS, but this is the first time I
am contributing to a well-known public open source project. I joined a
technical student club last semester and finished some newcomer tasks,
such as a shell lab; the code is available on my GitHub.

I did a MicroProject on signalfd decoding, which is available on the
mailing list. Through that, I learned how to properly use mailing
lists and git send-email to contribute to the strace project.
I already have an overview of the project structure.

My English may not be good enough, so there may be some wording
issues or grammar errors; apologies in advance.

Currently, I'm a freshman and there are some courses and exams before
July. I expect to be available full-time after June 29.