Gabriel Krisman Bertazi
November 21, 2018
Reading time:
When Brendan Gregg gave his Performance Analysis superpowers with Linux BPF talk during the Open Source Summit in Los Angeles last year, he wasn't messing around. Using the new eBPF tracing tools really feels like you gained some x-ray vision powers since, suddenly, opening the program's hood is no longer necessary to see in details how it is behaving internally.
I had the chance of applying these new gained powers last month, when I was asked to develop a small tool to trace when and what parts of each file was being accessed for the first-time during a program initialization. This request came with a few constraints, like the information had to be available independently of the kind of buffered I/O method being used (synchronous, aio, memory-mapping). It also should be trivial to correlate the block data with which files were being accessed and, most importantly, the tracing code should not result in a large performance impact in the observed system.
I started by investigating where exactly we'd want to place our probe:
A tracer at the Block Layer would examine requests in terms of disk blocks, which don't directly correlate to filesystem files. A trace at this level gives insight on which areas of the disk are being read or write, and whether they are organized physically in a contiguous fashion. But it doesn't give you a higher level view of the system in a file basis. Other tools already exist to trace block-level accesses, such as the EBF script biosnoop and the traditional blktrace.
A tracer at the filesystem level exposes data in terms of file blocks, which can resolve to one or more blocks of data in the disk. In an example scenario, an Ext4 filesystem with 4Kib file block sizes, one system page likely corresponds to 1 physical block (in 4K disks) or 4 physical blocks in disks with 512 sector size. The tracing at the filesystem level allows us to look at the file in terms of offsets, such that we ignore disk fragmentation. Different installations might fragment the disk differently, and from an application level perspective, we shouldn't really be interested in the disk layout as much as what file blocks of data we need, in case we want to optimize by prefetching them.
The page cache is a structure that sits in between the VFS/memory-mapping system and the filesystem layer, and is responsible for managing memory sections already read from the disk. By tracing the inclusion and removal pages in this cache we could gain the "First-access" behavior for free. When a new page is first accessed it is brought to the cache, and further accesses won't need to go to the disk. If the page is eventually dropped from the cache because it is no longer needed, a new use will need to reach the disk, and a new access entry will be logged. In our Performance investigation scenario, was the exactly functionality we were looking for.
The probe we implemented traces Page Cache Misses inside the Kernel Page Cache handling functions to identify the first time a block is requested, before the request is even submitted to the disk. Further requests to the same memory area (as long as the data is still mapped) will return a hit in the cache, which we don't care about, nor trace. This prevents our code from interfering with further accesses, severely diminishing the impact on performance our probe could have.
By tracing the page cache, we are also capable of differentiating blocks that were directly requested by the user program from blocks requested by the Read Ahead logic inside the kernel. Knowing which blocks were read ahead is very interesting information for application developers and system administrators, since it allows them to tune their systems or applications to prefetch the blocks they want in a sane manner.
As is common with any eBPF application, the code is very straightforward. If we ignore some of the boilerplate stuff required to make eBPF compile, the probe comes down to the following function:
int fblktrace_read_pages(struct pt_regs *ctx, struct address_space *mapping, struct list_head *pages, struct page *page, unsigned nr_pages, bool is_readahead) { u64 index; unsigned blkbits = mapping->host->i_blkbits; unsigned long ino = mapping->host->i_ino; u64 block_in_file; for (int i = 0; i < 32 && nr_pages--; i++) { if (pages) { pages = pages->prev; page = container_of(pages, struct page, lru); } index = page->index; block_in_file = (unsigned long) index << (12 - blkbits); bpf_trace_printk("=> inode: %ld: FSBLK=%lu BSIZ=%lu %s\\n", ino, index, 1 << blkbits, is_readahead?"[RA]":""); } return 0; }
The function above is installed as a tracer around the =ext4_mpage_readpages=
function, by the following snippet:
b.attach_kprobe(event="ext4_mpage_readpages", fn_name="fblktrace_read_pages");
The probe runs every time the in-kernel Page Cache asks the filesystem to fetch some page from the disk for the first time. Which area should be read is indirectly identified by the index and offset of the page in the address space of the process. We use that information to calculate the offset of the file to be loaded, in file blocks, and pass that information, alongside the inode number that identifies the file to the print function.
For demonstration purposes we wrote a small program, called touchblk, which reads a file in two ways: with the synchronous read/write system calls, and using the mmap feature. On both cases, we read two arbitrarily chosen regions of the file, the block 34 followed by the block 100.
To run the probe, one needs to install the eBPF compiler available in the package BCC tools. Apart from running this example, the BCC package, already available in many Linux distros, includes a large set of eBPF-based probe examples that you can use to learn how to use this tool and write utilities specific to your needs.
The bcc compiler is provided by the iovisor project at:
https://github.com/iovisor/bcc
Now, let's look at the probe in action.
[remote:root@fblktrace ~]$ ./fblktrace printing... touchblk-2143 [002] d... 91137.791064: 0: => Open inode 14: fname = test.img touchblk-2143 [002] .N.. 91137.811093: 0: => inode: 14: FSBLK=34 BSIZ=4096 [RA] touchblk-2143 [002] .... 91137.828293: 0: => inode: 14: FSBLK=100 BSIZ=4096 [RA]
The output above shows the exact behavior of the test application I described before. Since read/write don't necessarily trigger a read ahead, only two entries are show, for the exact blocks we were looking for. A timestamp and inode number are also given. To improve the output, a second probe was installed to map inode numbers to file names, but this is obviously not necessary. It is used only to simplify user's life.
Below is the output of the memory-mapped version. It is quite long...
[remote:root@fblktrace ~]$ ./fblktrace printing... touchblk-2147 [003] d... 91258.462486: 0: => Open inode 14: fname = image touchblk-2147 [003] .... 91258.480927: 0: => inode: 14: FSBLK=18 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480940: 0: => inode: 14: FSBLK=19 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480942: 0: => inode: 14: FSBLK=20 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480943: 0: => inode: 14: FSBLK=21 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480944: 0: => inode: 14: FSBLK=22 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480945: 0: => inode: 14: FSBLK=23 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480946: 0: => inode: 14: FSBLK=24 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480947: 0: => inode: 14: FSBLK=25 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480948: 0: => inode: 14: FSBLK=26 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480949: 0: => inode: 14: FSBLK=27 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480952: 0: => inode: 14: FSBLK=28 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480952: 0: => inode: 14: FSBLK=29 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480954: 0: => inode: 14: FSBLK=30 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480955: 0: => inode: 14: FSBLK=31 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480955: 0: => inode: 14: FSBLK=32 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480956: 0: => inode: 14: FSBLK=33 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480957: 0: => inode: 14: FSBLK=34 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480958: 0: => inode: 14: FSBLK=35 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480959: 0: => inode: 14: FSBLK=36 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480960: 0: => inode: 14: FSBLK=37 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480961: 0: => inode: 14: FSBLK=38 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480962: 0: => inode: 14: FSBLK=39 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480963: 0: => inode: 14: FSBLK=40 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480966: 0: => inode: 14: FSBLK=41 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480967: 0: => inode: 14: FSBLK=42 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480968: 0: => inode: 14: FSBLK=43 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480969: 0: => inode: 14: FSBLK=44 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480970: 0: => inode: 14: FSBLK=45 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480971: 0: => inode: 14: FSBLK=46 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480972: 0: => inode: 14: FSBLK=47 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480973: 0: => inode: 14: FSBLK=48 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.480974: 0: => inode: 14: FSBLK=49 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498554: 0: => inode: 14: FSBLK=84 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498565: 0: => inode: 14: FSBLK=85 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498566: 0: => inode: 14: FSBLK=86 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498567: 0: => inode: 14: FSBLK=87 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498568: 0: => inode: 14: FSBLK=88 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498569: 0: => inode: 14: FSBLK=89 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498570: 0: => inode: 14: FSBLK=90 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498573: 0: => inode: 14: FSBLK=91 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498574: 0: => inode: 14: FSBLK=92 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498575: 0: => inode: 14: FSBLK=93 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498576: 0: => inode: 14: FSBLK=94 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498577: 0: => inode: 14: FSBLK=95 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498578: 0: => inode: 14: FSBLK=96 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498579: 0: => inode: 14: FSBLK=97 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498580: 0: => inode: 14: FSBLK=98 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498581: 0: => inode: 14: FSBLK=99 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498582: 0: => inode: 14: FSBLK=100 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498583: 0: => inode: 14: FSBLK=101 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498584: 0: => inode: 14: FSBLK=102 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498585: 0: => inode: 14: FSBLK=103 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498586: 0: => inode: 14: FSBLK=104 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498587: 0: => inode: 14: FSBLK=105 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498588: 0: => inode: 14: FSBLK=106 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498589: 0: => inode: 14: FSBLK=107 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498591: 0: => inode: 14: FSBLK=108 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498592: 0: => inode: 14: FSBLK=109 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498593: 0: => inode: 14: FSBLK=110 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498594: 0: => inode: 14: FSBLK=111 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498595: 0: => inode: 14: FSBLK=112 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498596: 0: => inode: 14: FSBLK=113 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498597: 0: => inode: 14: FSBLK=114 BSIZ=4096 [RA] touchblk-2147 [003] .... 91258.498598: 0: => inode: 14: FSBLK=115 BSIZ=4096 [RA]
Why is it so much longer than the read()/write() syscall example? Because the kernel, in an attempt to optimize expensive I/O operations, assumes that when a specific address is accessed in a non-sequential read file, nearby regions are likely to be needed soon, such that it performs read ahead (RA) I/Os.
The kernel can't assume the next region needed will be the one that immediately follows the block accessed, so it tries to access neighbors both before and after the target block. The number of neighbors that are looked ahead is defined by a filesystem specific parameter in sysfs.
[krisman@dilma sda2]$ cat /sys/fs/ext4/sda2/inode_readahead_blks 32
This parameter tells the kernel to load 32 blocks around the target block during a read ahead. If you go back to the output of the second version of the example code trace and count the blocks that were read for each of the two accesses, you'll observe that there were exactly 32 blocks read for each access, 15 immediately before the target block, and 16 immediately after. This gives a very interesting insight on how the read-ahead mechanism works.
This method tries to capture I/O accesses as they go through the page cache, such that other non-buffered mechanisms like Direct I/O will not be traced. This example is also limited to ext4, but it can trivially be expanded to any other Linux filesystem as well.
As always, the full code is available in the only way we know how: under a Free Software license in a public repository. Enjoy!
https://gitlab.collabora.com/krisman/bcc/blob/master/tools/fblktrace.py
Tracing with eBPF is a very powerful and surgical tool for performance analysis that is within the reach of anyone that takes the time to learn them. And let me tell you, it is definitely worth your time!
Visit Gabriel's blog.
03/12/2024
this is a test post
08/10/2024
Having multiple developers work on pre-merge testing distributes the process and ensures that every contribution is rigorously tested before…
15/08/2024
After rigorous debugging, a new unit testing framework was added to the backend compiler for NVK. This is a walkthrough of the steps taken…
01/08/2024
We're reflecting on the steps taken as we continually seek to improve Linux kernel integration. This will include more detail about the…
27/06/2024
With each board running a mainline-first Linux software stack and tested in a CI loop with the LAVA test framework, the Farm showcased Collabora's…
26/06/2024
WirePlumber 0.5 arrived recently with many new and essential features including the Smart Filter Policy, enabling audio filters to automatically…
Comments (0)
Add a Comment