Adrian Ratiu
April 26, 2019
In part 1 and part 2 of this series, we took a condensed in-depth look at the eBPF VM. Reading those parts is not mandatory for understanding this third part, though having a good grasp of the low-level basics does help understand the higher-level tools better. To understand how these tools work, let's define the high-level components of an eBPF program:
In the sock_example.c studied in parts 1 and 2, all the components are squashed in a single C source file and all actions are done by a single user process:
eBPF programs can be much more complex: multiple backends can be loaded by a single loader process (or by several separate ones!), writing to multiple data structures which then get read by multiple frontend processes. All of this can happen in a single big user eBPF application spanning multiple processes.
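To make the division of labour concrete, here is a toy model of the four roles in plain Python. It is only an illustration under loud assumptions: the function names, the dictionary standing in for an eBPF map and the hard-coded packet list are all hypothetical, and in a real system the kernel (not the loader) invokes the backend on each event.

```python
from collections import defaultdict


def backend(packet, data_map):
    """Backend: runs once per event and only updates the shared data structure."""
    data_map[packet["protocol"]] += 1


def loader(event_source, program, data_map):
    """Loader: wires a backend to an event source (here just a list of fake packets)."""
    for event in event_source:
        program(event, data_map)


def frontend(data_map):
    """Frontend: reads the data structure and presents it to the user."""
    return ", ".join(f"{proto}={count}" for proto, count in sorted(data_map.items()))


# Hypothetical sample data standing in for packets seen on an interface.
data_map = defaultdict(int)
packets = [{"protocol": "TCP"}, {"protocol": "UDP"}, {"protocol": "TCP"}]

loader(packets, backend, data_map)
print(frontend(data_map))
```

The point of the split is that the backend never talks to the user and the frontend never sees an event: the data structure is the only thing they share.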
We saw in the previous article how writing raw eBPF bytecode on top of the kernel is hard and unproductive, very much like writing in a processor's assembly language, so naturally a module capable of compiling the LLVM intermediate representation to eBPF was developed and released starting with LLVM v3.7 in 2015 (GCC still doesn't support eBPF as of this writing). This allows subsets of multiple higher-level languages like C, Go or Rust to be compiled to eBPF. The most developed and popular is based on C, as the kernel is also written in C, making it easier to reuse existing kernel headers.
LLVM compiles a "restricted C" language (remember, no unbounded loops, max 4096 instructions and so on from part 1) to ELF object files containing special sections which get loaded in the kernel using libraries like libbpf, built on top of the bpf() syscall. This design effectively splits the backend definition from the loader and frontend because the eBPF bytecode lives in its own ELF file.
The kernel also provides examples using this pattern under samples/bpf/: the *_kern.c files are compiled to *_kern.o (this is the backend code) which get loaded by *_user.c (the loader and frontend).
Converting the sock_example.c raw bytecode from parts 1 and 2 of this series to "restricted C" yields sockex1_kern.c, which is much easier to understand and modify than raw bytecode:
#include <uapi/linux/bpf.h>
#include <uapi/linux/if_ether.h>
#include <uapi/linux/if_packet.h>
#include <uapi/linux/ip.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") my_map = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(u32),
	.value_size = sizeof(long),
	.max_entries = 256,
};

SEC("socket1")
int bpf_prog1(struct __sk_buff *skb)
{
	int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
	long *value;

	value = bpf_map_lookup_elem(&my_map, &index);
	if (value)
		__sync_fetch_and_add(value, skb->len);

	return 0;
}

char _license[] SEC("license") = "GPL";
The produced eBPF ELF object sockex1_kern.o now contains both the separated backend and data structure definitions. The loader and frontend, sockex1_user.c, parses the ELF file, creates the required map, loads the bytecode function bpf_prog1() in the kernel and then proceeds to run the frontend as before.
The trade-off made by introducing this "restricted C" abstraction layer is all about making the eBPF backend code easier to write in a higher-level language at the expense of increased complexity in the loader (which now needs to parse ELF objects), while the frontend is mostly unaffected.
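To give a feel for the extra work the loader takes on, here is a minimal sketch that lists the section names of an ELF object using only Python's struct module. It assumes a 64-bit little-endian ELF file and skips all the validation a real loader such as libbpf performs; the names ELF_HEADER, SECTION_HEADER and section_names are hypothetical helpers, not part of any library.

```python
import struct

# Field layouts of the ELF64 file header and section header (little-endian).
ELF_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")
SECTION_HEADER = struct.Struct("<IIQQQQIIQQ")


def section_names(elf):
    """Return the names of all sections in a 64-bit little-endian ELF image."""
    ident = elf[:16]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("expected a 64-bit little-endian ELF object")

    fields = ELF_HEADER.unpack_from(elf, 0)
    e_shoff, e_shentsize, e_shnum, e_shstrndx = (
        fields[6], fields[11], fields[12], fields[13])

    # Read every section header, then locate the section-name string table.
    headers = [SECTION_HEADER.unpack_from(elf, e_shoff + i * e_shentsize)
               for i in range(e_shnum)]
    strtab_offset, strtab_size = headers[e_shstrndx][4], headers[e_shstrndx][5]
    strtab = elf[strtab_offset:strtab_offset + strtab_size]

    names = []
    for header in headers:
        start = header[0]                 # sh_name: offset into the string table
        end = strtab.index(b"\0", start)  # names are NUL-terminated
        names.append(strtab[start:end].decode())
    return names
```

Fed the bytes of sockex1_kern.o, a function like this would be expected to report sections such as "maps", "socket1" and "license", matching the SEC() annotations in the source above; the loader uses those names to decide what is a map definition, what is program bytecode and what is the license string.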
Not everyone has kernel sources at hand, especially in production, and it's also a bad idea in general to tie eBPF-based tools to a specific kernel source revision. Designing and implementing the interactions between an eBPF program's backends, frontends, loaders and data structures can be very complex, error-prone and time-consuming, especially in C, which is considered a dangerous low-level language. In addition to these risks, developers are also in constant danger of re-inventing the wheel for common problems, with endless design variations and implementations. The BCC project exists to alleviate all these pains: it provides an easy-to-use framework for writing, loading and running eBPF programs by writing simple Python or Lua scripts in addition to the "restricted C" exemplified above.
The BCC project has two parts:
The BCC install footprint is big: it depends on LLVM/clang to compile "restricted C" to eBPF and on Python/Lua, and it also contains library implementations like libbcc (written in C++), libbpf and so on. Parts of the kernel tree are also copied into the BCC source so it doesn't require building against a full kernel source (only headers). It can easily take hundreds of MB of space, which is not very good for small embedded devices that could also benefit from eBPF's powers. Finding solutions to this embedded device size constraint problem will be our focus in part 4.
BCC arranges eBPF program components like this:
Because the main purpose of BCC is to simplify eBPF program writing, it standardizes and automates as much as possible: compiling the "restricted C" backend via LLVM is completely automated in the background, resulting in a standard ELF object format, which allows the loader to be implemented just once for all BCC programs and reduces it to a minimal API (2 lines of Python). It also standardizes the data structure APIs for easy access from the frontend. In a nutshell, it focuses developer attention on writing frontends without having to worry about lower-level details.
To best illustrate how it works let's look at a simple concrete example, a full re-implementation from scratch of the sock_example.c from our previous articles. The program counts how many TCP, UDP and ICMP packets are received on the loopback interface:
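A minimal sketch of what such a BCC script could look like follows. It is an approximation rather than a definitive listing: it assumes BCC is installed and the script runs as root, the names SOCKET_FILTER_TEXT, PROTOCOLS, format_counts and run are hypothetical, and the backend adds 1 per packet (sockex1_kern.c above adds skb->len instead, which counts bytes). The bcc import is deferred into run() so the pure-Python parts can be read and checked without BCC present.

```python
import ctypes as ct
import socket
import time

# Backend: "restricted C" compiled by BCC in the background.
SOCKET_FILTER_TEXT = r"""
#include <uapi/linux/if_ether.h>
#include <uapi/linux/ip.h>
#include <bcc/proto.h>

BPF_ARRAY(my_map, long, 256);

int bpf_prog1(struct __sk_buff *skb)
{
    int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
    long *value = my_map.lookup(&index);

    if (value)
        __sync_fetch_and_add(value, 1);   /* count packets, not bytes */
    return 0;
}
"""

# IP protocol numbers used as keys into the array map.
PROTOCOLS = {
    "TCP": socket.IPPROTO_TCP,
    "UDP": socket.IPPROTO_UDP,
    "ICMP": socket.IPPROTO_ICMP,
}


def format_counts(counts):
    """Frontend helper: render one line of per-protocol packet counts."""
    return " ".join(f"{name} {counts[name]}" for name in PROTOCOLS)


def run(interface="lo", interval=1.0):
    """Compile, load, attach and poll; needs root and an installed BCC."""
    from bcc import BPF  # deferred so the module imports without BCC

    bpf = BPF(text=SOCKET_FILTER_TEXT)
    # Loader: these two calls are the whole loading step.
    function = bpf.load_func("bpf_prog1", BPF.SOCKET_FILTER)
    BPF.attach_raw_socket(function, interface)

    my_map = bpf["my_map"]                # data structure shared with the kernel
    while True:
        counts = {name: my_map[ct.c_int(proto)].value
                  for name, proto in PROTOCOLS.items()}
        print(format_counts(counts))
        time.sleep(interval)
```

Calling run() as root and generating some traffic on the loopback interface (for example with ping) should make the counters move; stop it with Ctrl-C.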
Some advantages of implementing the above with BCC as opposed to writing directly in C as we did previously:
In the above example we used a BPF.SOCKET_FILTER program type, which resulted in our hooked C function getting a network packet buffer as its context argument. We can also use the BPF.KPROBE type to peek into arbitrary kernel functions. Let's do it, but instead of using the same interface as above, we'll use a special kprobe__* function name prefix to illustrate an even higher-level BCC API:
This example was taken from bcc/examples/tracing/bitehist.py. It prints a histogram of the block I/O sizes by hooking the blk_account_io_completion() kernel function.
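The following is a sketch modelled on bcc/examples/tracing/bitehist.py, not a verbatim copy. It assumes BCC is installed, the script runs as root, and the running kernel still exposes blk_account_io_completion() to kprobes and declares struct request in linux/blkdev.h (newer kernels may differ on both counts). The names BITEHIST_TEXT and run are hypothetical.

```python
import time

# Backend: the kprobe__ prefix tells BCC which kernel function to hook.
BITEHIST_TEXT = r"""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>   /* assumption: struct request is declared here */

BPF_HISTOGRAM(dist);

int kprobe__blk_account_io_completion(struct pt_regs *ctx, struct request *req)
{
    /* Bucket the request size (in kbytes) on a log2 scale. */
    dist.increment(bpf_log2l(req->__data_len / 1024));
    return 0;
}
"""


def run():
    """Trace block I/O until Ctrl-C, then print the size histogram."""
    from bcc import BPF  # deferred so the module imports without BCC

    # No explicit load/attach calls: BCC does both from the function name.
    b = BPF(text=BITEHIST_TEXT)

    print("Tracing block I/O... Hit Ctrl-C to end.")
    try:
        time.sleep(99999999)
    except KeyboardInterrupt:
        print()

    # Frontend: one call renders the histogram map.
    b["dist"].print_log2_hist("kbytes")
```

Compared with the socket filter version, the only Python left is constructing the BPF object and printing the map.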
Notice how the eBPF loading happens automatically (the loader is implicit) based on the kprobe__blk_account_io_completion() function name! We have come quite far since writing and loading bytecode in C with libbpf.
In some use cases BCC is still too low-level, for example when inspecting a system during incident response, where time is of the essence, decisions need to be made fast and writing Python / "restricted C" can take too long. For such cases BPFtrace was built on top of BCC, providing an even higher abstraction level via a domain-specific language inspired by AWK and C. The language is similar to the one provided by DTrace, according to the announcement post, which calls it DTrace 2.0 and provides a good introduction and examples.
What BPFtrace does by abstracting so much logic in a powerful and safe (but still limited compared to BCC) language is quite amazing. This shell one-liner counts how many syscalls each user process does (visit the built-in vars, map functions, and the count() documentation for more info):
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[pid, comm] = count(); }'
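As a rough mental model, the @[pid, comm] = count() action behaves like a counter keyed by a (pid, comm) tuple that is bumped on every sys_enter hit. The plain-Python sketch below mimics only that aggregation; the function name count_syscalls and the sample events are made up for illustration, and nothing here talks to the kernel.

```python
from collections import Counter


def count_syscalls(events):
    """Aggregate (pid, comm) events the way @[pid, comm] = count() does."""
    counts = Counter()
    for pid, comm in events:
        counts[(pid, comm)] += 1
    return counts


# Hypothetical sample: one tuple per syscall entry observed.
events = [(101, "bash"), (202, "sshd"), (101, "bash")]

for (pid, comm), hits in count_syscalls(events).items():
    print(f"@[{pid}, {comm}]: {hits}")
```

In BPFtrace the per-event work happens in the kernel and only the aggregated map is copied to userspace when the script exits, which is what keeps the one-liner cheap even with a very high syscall rate.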
BPFtrace is still a work in progress in some areas. For example, there is no easy way at this point in time to define and run a socket filter to implement tools like our previously examined sock_example. It could probably be done in BPFtrace with a kprobe:netif_receive_skb hook, but BCC is still a better tool for socket filtering. In any case, even in its current state, BPFtrace is still very useful for quick analysis/debugging before dropping down to the full power of BCC.
IOVisor is a Linux Foundation collaborative project built around the eBPF VM and tools presented in this article series. It uses some very high-level buzzword-heavy concepts like "Universal Input/Output" focused on marketing the eBPF technology to Cloud / Data Center developers and users:
Considering that the original name, extended Berkeley Packet Filter, doesn't mean much, maybe all this renaming is welcome and valuable, especially if it enables more industries to tap into eBPF's powers.
The IOVisor project created the Hover framework, also called the "IO Modules Manager", which is a userspace daemon for managing eBPF programs (or IO Modules), capable of pushing and pulling IO modules to the cloud, similar to how the Docker daemon publishes/fetches images. It provides a CLI and a web REST interface, and also has a fancy web UI. Significant parts of Hover are written in Go so, in addition to the normal BCC dependencies, it also depends on a Go installation, making it big and unsuitable for the small embedded devices we eventually want to target in part 4.
In this part we have examined the userspace ecosystem built on top of the eBPF VM to increase developer productivity and ease deployment of eBPF programs. These tools make it so easy to work with eBPF that a user can just "apt-get install bpftrace" and run one-liners, or use the Hover daemon to deploy an eBPF program (IO module) to 1000 machines. For all the power they give developers and users, however, all these tools have significant disk footprints or may not even run on 32-bit ARM systems, making them not very suitable for small embedded devices. This is why in part 4 we'll explore other projects trying to ease running eBPF programs targeting the embedded device ecosystem.
Continue reading (An eBPF overview, part 4: Working with embedded systems)…
Comments (1)
glenn wang:
Sep 29, 2020 at 07:46 AM
Is that possible that make eBPF VM running on the specified cpu core ?
Put it with another way, how can I make xdp program(the Backend_process) running on specified cpu core ?