[RFC PATCH] ioctl: decode VFIO_IOMMU_MAP_DMA
Masatake YAMATO
yamato at redhat.com
Thu Aug 26 20:22:49 UTC 2021
I wrote:
> Last week, I wanted to submit this change to LKML or VFIO specific list.
> However, the kernel applying the batch didn't boot on my pc. So the
> submission is pending...I will report again when if there is any progress.
Finally, the kernel with my patch booted.
I integrated your patch on my patches related to "vfio-driver" field in fdinfo.
https://github.com/masatake/strace/commits/vfio-draft
31d25289d Apply Alyssa Ross's work
72f6d8812 Introduce vfio_cookie
91221b617 Bundle linux/include/uapi/linux/vfio.h
d91159f18 Introduce get_fdinfo(), generic getter for /proc/$pid/fdinfo/$fd
128bcd434 Attach the misc device name to the fd_cookie private data
d03de5c19 Attach the major device name to the fd_cookie private data
2e94a3c72 Make the type of the xlat created by dyxlat_alloc customizable
19bbc9dce Introduce fd_cookie for passing properties of device files
d564e4ccf Extend tcb to take multiple private datum
bf6f39c25 Add functions for disabling tprint* temporary
I got following output with our patches:
ioctl(129, VFIO_IOMMU_MAP_DMA, {argsz=32, flags=VFIO_DMA_MAP_FLAG_READ|VFIO_DMA_MAP_FLAG_WRITE, vaddr=0x7ef967e00000, iova=NULL, size=655360}) = 0
ioctl(129, VFIO_IOMMU_MAP_DMA, {argsz=32, flags=VFIO_DMA_MAP_FLAG_READ, vaddr=0x7ef967ec0000, iova=0xc0000, size=53248}) = 0
ioctl(129, VFIO_IOMMU_MAP_DMA, {argsz=32, flags=VFIO_DMA_MAP_FLAG_READ|VFIO_DMA_MAP_FLAG_WRITE, vaddr=0x7ef967ecd000, iova=0xcd000, size=12288}) = 0
The "VFIO_DEVICE_GET_PCI_HOT_RESET_INFO or VFIO_IOMMU_MAP_DMA" is gone!
The decoded output is really helpful to understand what happens whe attaching
a PCI device directly to a KVM instance.
As you know, there are more mysterious ioctl cmd in vfio:
VFIO_DEVICE_SET_IRQS, VFIO_GROUP_SET_CONTAINER, etc. I hope you don't
lose your interest for decoding them:)
After revising I will submit the change for Linux to kvm at vger.kernel.org.
I will report again when if there is any progress.
Masatake YAMATO
On Wed, 25 Aug 2021 07:29:52 +0900 (JST), Masatake YAMATO <yamato at redhat.com> wrote:
> On Thu, 6 May 2021 17:07:43 +0000, Alyssa Ross <hi at alyssa.is> wrote:
>> ---
>> The argument decoding itself is pretty solid, I think (apart from a
>> couple of small things I'll get to in a bit).
>>
>> But the main reason I'm submitting this as an RFC is that I'm not sure
>> how to handle the 'request' argument of the ioctl() call
>> (i.e. VFIO_IOMMU_MAP_DMA). Some of the VFIO ioctl request numbers are
>> reused. VFIO_IOMMU_MAP_DMA == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO. To
>> tell them apart, we have to check the device type of the file the
>> ioctl was sent to. It's easy enough to do that in vfio_ioctl(), as
>> I've done here, but that means that the request argument will still
>> show up in strace as "VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA".
>> Ideally, we'd do this check earlier, before printing the request
>> argument, so we could figure out which of the two candidate names to
>> print. But this doesn't seem to be a problem strace has solved before,
>> and it's not at all obvious to me how ioctl.c could be modified to
>> accomodate this, in a way where the disambiguated name could be passed
>> through ioctl_decode() to vfio_ioctl(). I'd appreciate any direction I
>> could be offered on this.
>
> I have just find your mail today.
> I have explored the way to decode VFIO_IOMMU_MAP_DMA independently to study kvm.
>
>
> As far as my studying, there is no hint for choosing
> VFIO_IOMMU_MAP_DMA or VFIO_DEVICE_GET_PCI_HOT_RESET_INFO in user space.
>
> Let's see an example code in linux/Documentation/driver-api/vfio.rst:
>
> /* Create a new container */
> container = open("/dev/vfio/vfio", O_RDWR);
> ...
> /* Enable the IOMMU model we want */
> ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
> ...
> ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
> ...
>
> As IOMMU, there are at least two choices in kernel:
>
> #define VFIO_TYPE1_IOMMU 1
> #define VFIO_SPAPR_TCE_IOMMU 2
>
> If VFIO_TYPE1_IOMMU was passed when calling VFIO_SET_IOMMU, we
> can expect VFIO_IOMMU_MAP_DMA is called instead of
> VFIO_DEVICE_PCI_HOT_RESET.
>
>
> So the question is "can strace know VFIO_TYPE1_IOMMU is passed or not?".
> As far as reading kernel-code, I cannot find the way to know it from
> user space. e.g. the major number and minor number solved from the
> file descriptor doesn't help to know the choice.
>
> It means, we have to add a peep hole to the kernel:
>
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 02cc51ce6891..40e1ad0471f6 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1195,6 +1195,15 @@ static int vfio_fops_mmap(struct file *filep, struct vm_area_struct *vma)
> return ret;
> }
>
> +static void vfio_fops_show_fdinfo(struct seq_file *m, struct file *filep)
> +{
> + struct vfio_container *container = filep->private_data;
> + struct vfio_iommu_driver *driver = container->iommu_driver;
> +
> + if (driver && driver->ops->name)
> + seq_printf(m, "vfio-driver:\t%s\n", driver->ops->name);
> +}
> +
> static const struct file_operations vfio_fops = {
> .owner = THIS_MODULE,
> .open = vfio_fops_open,
> @@ -1204,6 +1213,7 @@ static const struct file_operations vfio_fops = {
> .unlocked_ioctl = vfio_fops_unl_ioctl,
> .compat_ioctl = compat_ptr_ioctl,
> .mmap = vfio_fops_mmap,
> + .show_fdinfo = vfio_fops_show_fdinfo,
> };
>
> /**
>
> With this change, we may be able to know the IOMMU via
> /proc/$PID/fdinfo/$fd as "vfio-iommu-type1".
> Here "vfio-iommu-type1" comes from linux/drivers/vfio/vfio_iommu_type1.c
>
> static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
> .name = "vfio-iommu-type1",
> .owner = THIS_MODULE,
>
> Last week, I wanted to submit this change to LKML or VFIO specific list.
> However, the kernel applying the batch didn't boot on my pc. So the
> submission is pending...I will report again when if there is any progress.
>
> Masatake YAMATO
>
>> Another, easier question: should I use a bundled header for this? I
>> did that in my initial implementation, just because it felt easiest to
>> me at the time. (I've ommitted the content of the header in this
>> patch, since that should make reviewing easier, and I don't expect
>> this patch to be applied as-is anyway.) I don't really understand
>> when it's preferable to let xlat detect values, when fallbacks should
>> be given, and when bundled headers should be used.
>>
>> Finally, a note about the test: we have to open /dev/vfio/vfio for
>> real to test this, since the decoding code will check which device the
>> ioctl is sent to (so we can't just do ioctl(-1, ...)). On most Linux
>> systems today, only root will be able to do that, but this will change
>> in the next version of systemd[1], so going forward any user will be
>> able to run this test without it being skipped due to failing to open
>> /dev/vfio/vfio.
>>
>> [1]: systemd commit f8eb410 ("udev: make /dev/vfio/vfio 0666")
>>
>> bundled/Makefile.am | 1 +
>> * bundled/linux/include/uapi/linux/vfio.h | 1362 +++++++++++++++++++++++
>> src/Makefile.am | 1 +
>> src/defs.h | 6 +
>> src/ioctl.c | 2 +
>> src/mmap_cache.c | 1 -
>> src/pidns.c | 1 -
>> src/strace.c | 1 -
>> src/util.c | 9 +-
>> src/vfio.c | 93 ++
>> src/xlat/vfio_iommu_map_flags.in | 4 +
>> tests/.gitignore | 1 +
>> tests/Makefile.am | 1 +
>> tests/gen_tests.in | 1 +
>> tests/ioctl_vfio.c | 66 ++
>> 15 files changed, 1546 insertions(+), 4 deletions(-)
>> create mode 100644 bundled/linux/include/uapi/linux/vfio.h
>> create mode 100644 src/vfio.c
>> create mode 100644 src/xlat/vfio_iommu_map_flags.in
>> create mode 100644 tests/ioctl_vfio.c
>>
>> * content ommitted from this patch; see above.
>>
>> diff --git a/bundled/Makefile.am b/bundled/Makefile.am
>> index fe2b1ce4..398ab85b 100644
>> --- a/bundled/Makefile.am
>> +++ b/bundled/Makefile.am
>> @@ -99,6 +99,7 @@ EXTRA_DIST = \
>> linux/include/uapi/linux/v4l2-common.h \
>> linux/include/uapi/linux/v4l2-controls.h \
>> linux/include/uapi/linux/videodev2.h \
>> + linux/include/uapi/linux/vfio.h \
>> linux/include/uapi/mtd/mtd-abi.h \
>> linux/include/uapi/mtd/ubi-user.h \
>> linux/include/uapi/rdma/ib_user_verbs.h \
>> diff --git a/src/Makefile.am b/src/Makefile.am
>> index de19b8bf..6e149534 100644
>> --- a/src/Makefile.am
>> +++ b/src/Makefile.am
>> @@ -368,6 +368,7 @@ libstrace_a_SOURCES = \
>> utime.c \
>> utimes.c \
>> v4l2.c \
>> + vfio.c \
>> wait.c \
>> wait.h \
>> watchdog_ioctl.c \
>> diff --git a/src/defs.h b/src/defs.h
>> index 5c369ece..b913b4f2 100644
>> --- a/src/defs.h
>> +++ b/src/defs.h
>> @@ -37,6 +37,7 @@
>> # include "error_prints.h"
>> # include "gcc_compat.h"
>> # include "kernel_types.h"
>> +# include "largefile_wrappers.h"
>> # include "list.h"
>> # include "macros.h"
>> # include "mpers_type.h"
>> @@ -1077,6 +1078,10 @@ extern int get_proc_pid(struct tcb *);
>> extern int translate_pid(struct tcb *, int dest_id, enum pid_type type,
>> int *proc_pid_ptr);
>>
>> +strace_stat_t;
>> +extern int
>> +stat_fd_in_proc(pid_t pid, int fd, strace_stat_t *statbuf);
>> +
>> extern void
>> dumpiov_in_msghdr(struct tcb *, kernel_ulong_t addr, kernel_ulong_t data_size);
>>
>> @@ -1274,6 +1279,7 @@ DECL_IOCTL(tee);
>> DECL_IOCTL(term);
>> DECL_IOCTL(ubi);
>> DECL_IOCTL(uffdio);
>> +DECL_IOCTL(vfio);
>> DECL_IOCTL(watchdog);
>> # undef DECL_IOCTL
>>
>> diff --git a/src/ioctl.c b/src/ioctl.c
>> index 12029028..7d257148 100644
>> --- a/src/ioctl.c
>> +++ b/src/ioctl.c
>> @@ -352,6 +352,8 @@ ioctl_decode(struct tcb *tcp)
>> return scsi_ioctl(tcp, code, arg);
>> case '$': /* 0x24 */
>> return perf_ioctl(tcp, code, arg);
>> + case ';': /* 0x3b */
>> + return vfio_ioctl(tcp, code, arg);
>> case '=': /* 0x3d */
>> return ptp_ioctl(tcp, code, arg);
>> case 'E':
>> diff --git a/src/mmap_cache.c b/src/mmap_cache.c
>> index f113f1e2..53e81c1e 100644
>> --- a/src/mmap_cache.c
>> +++ b/src/mmap_cache.c
>> @@ -8,7 +8,6 @@
>> #include "defs.h"
>> #include <limits.h>
>>
>> -#include "largefile_wrappers.h"
>> #include "mmap_cache.h"
>> #include "mmap_notify.h"
>> #include "xstring.h"
>> diff --git a/src/pidns.c b/src/pidns.c
>> index 17cb7793..05ab5842 100644
>> --- a/src/pidns.c
>> +++ b/src/pidns.c
>> @@ -22,7 +22,6 @@
>> #include <sys/stat.h>
>>
>> #include <linux/nsfs.h>
>> -#include "largefile_wrappers.h"
>> #include "trie.h"
>> #include "xmalloc.h"
>> #include "xstring.h"
>> diff --git a/src/strace.c b/src/strace.c
>> index 61a598cb..da7a57f6 100644
>> --- a/src/strace.c
>> +++ b/src/strace.c
>> @@ -30,7 +30,6 @@
>>
>> #include "kill_save_errno.h"
>> #include "filter_seccomp.h"
>> -#include "largefile_wrappers.h"
>> #include "mmap_cache.h"
>> #include "number_set.h"
>> #include "ptrace_syscall_info.h"
>> diff --git a/src/util.c b/src/util.c
>> index e87ac94b..f611128f 100644
>> --- a/src/util.c
>> +++ b/src/util.c
>> @@ -23,7 +23,6 @@
>> #endif
>> #include <sys/uio.h>
>>
>> -#include "largefile_wrappers.h"
>> #include "number_set.h"
>> #include "print_utils.h"
>> #include "secontext.h"
>> @@ -684,6 +683,14 @@ printfd_pid_tracee_ns(struct tcb *tcp, pid_t pid, int fd)
>> printfd_pid(tcp, strace_pid, fd);
>> }
>>
>> +int
>> +stat_fd_in_proc(pid_t pid, int fd, strace_stat_t *statbuf)
>> +{
>> + char path[33];
>> + xsprintf(path, "/proc/%d/fd/%d", pid, fd);
>> + return stat_file(path, statbuf);
>> +}
>> +
>> /*
>> * Quote string `instr' of length `size'
>> * Write up to (3 + `size' * 4) bytes to `outstr' buffer.
>> diff --git a/src/vfio.c b/src/vfio.c
>> new file mode 100644
>> index 00000000..b4b96b6a
>> --- /dev/null
>> +++ b/src/vfio.c
>> @@ -0,0 +1,93 @@
>> +/*
>> + * Copyright (c) 2021 Alyssa Ross <hi at alyssa.is>
>> + * Copyright (c) 2021 The strace developers.
>> + * All rights reserved.
>> + *
>> + * SPDX-License-Identifier: LGPL-2.1-or-later
>> + */
>> +
>> +#include "defs.h"
>> +#include <sys/stat.h>
>> +#include <sys/sysmacros.h>
>> +#include <linux/vfio.h>
>> +#include "xlat/vfio_iommu_map_flags.h"
>> +
>> +static int
>> +vfio_device_ioctl(struct tcb *const tcp, const unsigned int code,
>> + const kernel_ulong_t arg)
>> +{
>> + return RVAL_DECODED;
>> +}
>> +
>> +static int
>> +vfio_iommu_ioctl(struct tcb *const tcp, const unsigned int code,
>> + const kernel_ulong_t arg)
>> +{
>> + switch (code) {
>> + case VFIO_IOMMU_MAP_DMA: {
>> + struct vfio_iommu_type1_dma_map map;
>> + size_t minsz = offsetofend(typeof(map), size);
>> +
>> + tprint_arg_next();
>> + if (umove_or_printaddr(tcp, arg, &map))
>> + break;
>> +
>> + if (map.argsz < minsz) {
>> + printaddr(arg);
>> + break;
>> + }
>> +
>> + tprint_struct_begin();
>> +
>> + PRINT_FIELD_U(map, argsz);
>> +
>> + tprint_struct_next();
>> + PRINT_FIELD_FLAGS(map, flags, vfio_iommu_map_flags,
>> + "VFIO_IOMMU_MAP_FLAG_???");
>> +
>> + tprint_struct_next();
>> + PRINT_FIELD_PTR(map, vaddr);
>> +
>> + tprint_struct_next();
>> + PRINT_FIELD_PTR(map, iova);
>> +
>> + tprint_struct_next();
>> + PRINT_FIELD_U64(map, size);
>> +
>> + if (map.argsz > sizeof map)
>> + print_nonzero_bytes(tcp, tprint_struct_next,
>> + arg, sizeof map,
>> + MIN(map.argsz, get_pagesize()),
>> + QUOTE_FORCE_HEX);
>> +
>> + tprint_struct_end();
>> + break;
>> + }
>> +
>> + default:
>> + return RVAL_DECODED;
>> + }
>> +
>> + return RVAL_IOCTL_DECODED;
>> +}
>> +
>> +int
>> +vfio_ioctl(struct tcb *const tcp, const unsigned int code,
>> + const kernel_ulong_t arg)
>> +{
>> + if (!verbose(tcp))
>> + return RVAL_DECODED;
>> +
>> + strace_stat_t stat;
>> + if (code >= VFIO_IOMMU_GET_INFO) {
>> + if (stat_fd_in_proc(tcp->pid, tcp->u_arg[0], &stat) == -1)
>> + return RVAL_DECODED;
>> +
>> + if (stat.st_rdev == makedev(0xa, 0xc4))
>> + return vfio_iommu_ioctl(tcp, code, arg);
>> +
>> + return vfio_device_ioctl(tcp, code, arg);
>> + }
>> +
>> + return RVAL_IOCTL_DECODED;
>> +}
>> diff --git a/src/xlat/vfio_iommu_map_flags.in b/src/xlat/vfio_iommu_map_flags.in
>> new file mode 100644
>> index 00000000..ac575d19
>> --- /dev/null
>> +++ b/src/xlat/vfio_iommu_map_flags.in
>> @@ -0,0 +1,4 @@
>> +#unconditional
>> +VFIO_DMA_MAP_FLAG_READ
>> +VFIO_DMA_MAP_FLAG_WRITE
>> +VFIO_DMA_MAP_FLAG_VADDR
>> diff --git a/tests/.gitignore b/tests/.gitignore
>> index 8f2472b6..32380ee8 100644
>> --- a/tests/.gitignore
>> +++ b/tests/.gitignore
>> @@ -337,6 +337,7 @@ ioctl_v4l2-v
>> ioctl_v4l2-v-Xabbrev
>> ioctl_v4l2-v-Xraw
>> ioctl_v4l2-v-Xverbose
>> +ioctl_vfio
>> ioctl_watchdog
>> ioperm
>> iopl
>> diff --git a/tests/Makefile.am b/tests/Makefile.am
>> index 126e8436..42c8bdf9 100644
>> --- a/tests/Makefile.am
>> +++ b/tests/Makefile.am
>> @@ -199,6 +199,7 @@ check_PROGRAMS = $(PURE_EXECUTABLES) \
>> ioctl_v4l2-success-v-Xabbrev \
>> ioctl_v4l2-success-v-Xraw \
>> ioctl_v4l2-success-v-Xverbose \
>> + ioctl_vfio \
>> ioprio--pidns-translation \
>> is_linux_mips_n64 \
>> kcmp-y--pidns-translation \
>> diff --git a/tests/gen_tests.in b/tests/gen_tests.in
>> index 97b47b98..ae9ae2fe 100644
>> --- a/tests/gen_tests.in
>> +++ b/tests/gen_tests.in
>> @@ -328,6 +328,7 @@ ioctl_v4l2-v +ioctl.test -v
>> ioctl_v4l2-v-Xabbrev +ioctl.test -v -Xabbrev
>> ioctl_v4l2-v-Xraw +ioctl.test -v -Xraw
>> ioctl_v4l2-v-Xverbose +ioctl.test -v -Xverbose
>> +ioctl_vfio +ioctl.test
>> ioctl_watchdog +ioctl.test
>> ioperm -a27
>> iopl -a8
>> diff --git a/tests/ioctl_vfio.c b/tests/ioctl_vfio.c
>> new file mode 100644
>> index 00000000..ca8c504b
>> --- /dev/null
>> +++ b/tests/ioctl_vfio.c
>> @@ -0,0 +1,66 @@
>> +/*
>> + * Copyright (c) 2021 Alyssa Ross <hi at alyssa.is>
>> + * Copyright (c) 2021 The strace developers.
>> + * All rights reserved.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#include "tests.h"
>> +
>> +#include <fcntl.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <sys/ioctl.h>
>> +#include <linux/vfio.h>
>> +
>> +void test_iommu_map_dma(int fd)
>> +{
>> + struct vfio_iommu_type1_dma_map *map = calloc(1, sizeof *map + 1);
>> + if (!map)
>> + perror_msg_and_fail("calloc");
>> +
>> + map->argsz = offsetofend(struct vfio_iommu_type1_dma_map, size) - 1;
>> + ioctl(fd, VFIO_IOMMU_MAP_DMA, map);
>> + printf("ioctl(%d, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, %p) "
>> + "= -1 EINVAL (%m)\n", fd, map);
>> +
>> + map->argsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>> + ioctl(fd, VFIO_IOMMU_MAP_DMA, map);
>> + printf("ioctl(%d, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, "
>> + "{argsz=32, flags=0, vaddr=NULL, iova=NULL, size=0}) = -1 EINVAL (%m)\n",
>> + fd);
>> +
>> + map->argsz = offsetofend(struct vfio_iommu_type1_dma_map, size) + 1;
>> + ((unsigned char *)map)[sizeof *map] = 0xff;
>> + ioctl(fd, VFIO_IOMMU_MAP_DMA, map);
>> + printf("ioctl(%d, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, "
>> + "{argsz=33, flags=0, vaddr=NULL, iova=NULL, size=0, "
>> + "/* bytes 32..32 */ \"\\xff\"}) = -1 EINVAL (%m)\n", fd);
>> +
>> + map->argsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>> + map->flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE
>> + | VFIO_DMA_MAP_FLAG_VADDR;
>> + map->vaddr = 0xfedcba9876543210;
>> + map->iova = 0x0123456789abcdef;
>> + map->size = -1U;
>> + ioctl(fd, VFIO_IOMMU_MAP_DMA, map);
>> + printf("ioctl(%d, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, "
>> + "{argsz=32, flags=VFIO_DMA_MAP_FLAG_READ|VFIO_DMA_MAP_FLAG_WRITE"
>> + "|VFIO_DMA_MAP_FLAG_VADDR, vaddr=0xfedcba9876543210, "
>> + "iova=0x123456789abcdef, size=%u}) = -1 EINVAL (%m)\n", fd, -1U);
>> +
>> + free(map);
>> +}
>> +
>> +int
>> +main(void)
>> +{
>> + int fd = open("/dev/vfio/vfio", O_RDONLY);
>> + if (fd == -1)
>> + return 77;
>> +
>> + test_iommu_map_dma(fd);
>> +
>> + printf("+++ exited with 0 +++\n");
>> +}
>> --
>> 2.31.0
>>
>> --
>> Strace-devel mailing list
>> Strace-devel at lists.strace.io
>> https://lists.strace.io/mailman/listinfo/strace-devel
>>
More information about the Strace-devel
mailing list