Deborah Brouwer
July 24, 2024
Co-authored by Emanuele Aina.
OpenQA is a tool for functional, end-to-end testing of operating system distributions. It takes an image to test, installs it into a virtual machine, and interacts with the installation in a “human-like” way. It tests that the operating system boots and shows a graphical desktop. It moves the mouse, clicks buttons, resizes file systems, creates a username, logs in and out, and compares screenshots of the installation against expected images called “needles.” It checks that essential packages install without issue and that certain core applications function as expected.
This year we undertook a project, sponsored by Meta, to reproduce Fedora’s openQA deployment as a proof of concept in the AWS cloud. Working on openQA has been especially interesting for us because Collabora has strong CI and testing expertise, largely focused on testing a wide variety of serial-attached physical boards and devices with LAVA on projects like KernelCI, Mesa CI, and Apertis. By focusing on GUI testing on QEMU, openQA complements what LAVA provides.
We started with the goal of containerizing each component of the openQA deployment. We were curious about the extent to which the components could be isolated from each other and particularly whether they could be deployed on different cloud instances and different networks. Luckily, openSUSE had already done some of this hard work and offers openSUSE containers upstream and through its build service.
If you want to start a containerized openQA deployment very easily, this rootless podman command from the upstream documentation is the place to begin:
podman run --device /dev/kvm -p 1080:80 --rm -it registry.opensuse.org/devel/openqa/containers/openqa-single-instance
After running this command, the web user interface should appear on localhost:1080.
But the ease of setting up the single-instance container conceals quite a lot of complexity underneath. Fundamentally, the openQA architecture can be divided into two main components: (1) the web server and (2) its workers. The web server is responsible for the graphical user interface, authentication and authorization, storing test jobs and resources in a PostgreSQL database, and, most importantly, orchestrating the workers to carry out these tests. The workers, in turn, use an isotovideo "engine" to start a new virtual machine for each test and report live results along with a final report to the web server.
The natural first point for containerization is between the web server and workers. OpenSUSE has upstream containers for each of the web server and the workers. We did the same, making changes as needed for Fedora, namely, we:
We deployed these containers to the AWS cloud and had a small containerized Fedora openQA deployment:
Although the workers must publish unprivileged ports to allow Developer Mode and VNC access, one nice benefit of the cloud deployment is the ability to create AWS security groups to restrict access to the EC2 instances. We used special openQA security groups to restrict incoming access to only known servers or trusted developer machines.
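The effect of such a security group can be pictured as an allowlist check on the source address and destination port of each incoming connection. The sketch below is only a simplified model of that idea, not the AWS API: the function name, the rule format, the port number and the addresses (taken from the reserved documentation ranges) are all assumptions made for illustration.

```python
from ipaddress import ip_address, ip_network

# Each rule is (allowed source CIDR, first port, last port).
# All values here are hypothetical placeholders.
Rule = tuple[str, int, int]


def is_allowed(source_ip: str, port: int, rules: list[Rule]) -> bool:
    """Return True if any rule permits this source address and port."""
    addr = ip_address(source_ip)
    return any(
        addr in ip_network(cidr) and low <= port <= high
        for cidr, low, high in rules
    )


# Hypothetical rules: the web server may reach one worker port,
# and a trusted developer network may reach the same port.
worker_rules: list[Rule] = [
    ("198.51.100.7/32", 5991, 5991),
    ("203.0.113.0/24", 5991, 5991),
]
```

With these rules, a connection from 203.0.113.25 to port 5991 would be accepted, while the same connection from an unlisted address would be dropped.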
Another new feature of cloud deployment is the ability to use the upstream asset caching service instead of NFS mounts between EC2 instances. OpenSUSE provided the blueprint for this approach in the blog post Cloud-based workers for openQA. The caching service runs background processes on each worker. When a worker receives a test for a particular operating system image, the caching service first checks if the worker has a copy of the image locally and, if not, makes an HTTP request to the web server to download the image. The caching service monitors the size of the worker's cache and periodically removes stale assets to prevent the worker's disk from filling up. The caching service has performed very reliably in Fedora's cloud deployment.
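The behavior described above (use the local copy if present, download on a miss, evict old assets when the cache grows too large) can be sketched in a few lines. This is a toy model under stated assumptions, not the upstream caching service: the class name, the `fetch` callable standing in for the HTTP download, and the least-recently-used eviction policy are all illustrative choices, and pruning happens on every request here rather than periodically in the background.

```python
from pathlib import Path
from typing import Callable


class AssetCache:
    """Toy worker-side asset cache with a size limit (hypothetical)."""

    def __init__(self, cache_dir: Path, limit_bytes: int,
                 fetch: Callable[[str], bytes]) -> None:
        self.cache_dir = cache_dir
        self.limit_bytes = limit_bytes
        self.fetch = fetch            # stands in for the download from the web server
        self._clock = 0               # logical clock used to order "last used"
        self._last_used: dict[str, int] = {}
        cache_dir.mkdir(parents=True, exist_ok=True)

    def get(self, name: str) -> Path:
        """Return a local path for the asset, downloading it on a cache miss."""
        path = self.cache_dir / name
        if not path.exists():
            path.write_bytes(self.fetch(name))
        self._clock += 1
        self._last_used[name] = self._clock
        self._prune(keep=name)
        return path

    def _total_size(self) -> int:
        return sum(p.stat().st_size for p in self.cache_dir.iterdir())

    def _prune(self, keep: str) -> None:
        """Delete least recently used assets until the cache fits the limit."""
        for name in sorted(self._last_used, key=self._last_used.get):
            if self._total_size() <= self.limit_bytes:
                break
            if name == keep:
                continue  # never evict the asset that was just requested
            (self.cache_dir / name).unlink()
            del self._last_used[name]
```

A second request for the same image is served from disk without calling `fetch` again, which is the property that lets the workers do without NFS mounts.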
After setting up our initial proof-of-concept, we expanded the deployment to more closely reproduce Fedora's openQA production instance:
The updated cloud deployment looked essentially like this with the number of workers scaling up as necessary:
The production-level deployment required the implementation of automatic test scheduling. A unique feature of Fedora's openQA deployment is how it fetches new updates and composes to test. Unlike a CI system triggered by developer commits, openQA tests are triggered by messages published by release engineers to Fedora's public message broker. The command-line tool fedora_openqa allows for manual test scheduling, but it also acts as a message consumer with a callback to automatically schedule tests with the openQA web server. We placed fedora_openqa into a container and ran it alongside the web server so that scheduled tests began to appear on the cloud instance just like on the production deployment.
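The consumer-with-callback pattern can be illustrated with a minimal sketch. It does not use the real Fedora messaging library or the actual fedora_openqa code; the `Message` and `Scheduler` types, the topic strings and the body keys are all hypothetical stand-ins chosen to show the shape of the idea: the broker hands each message to a callback, and the callback decides whether to schedule tests.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    """Hypothetical broker message: a topic plus a JSON-like body."""
    topic: str
    body: dict


@dataclass
class Scheduler:
    """Stands in for the client that posts jobs to the openQA web server."""
    scheduled: list = field(default_factory=list)

    def schedule(self, kind: str, identifier: str) -> None:
        self.scheduled.append((kind, identifier))


def make_callback(scheduler: Scheduler) -> Callable[[Message], None]:
    """Build a callback that schedules tests for the topics it recognizes."""
    # Topic names and body keys are placeholders, not real Fedora topics.
    handlers = {
        "example.compose.finished":
            lambda body: scheduler.schedule("compose", body["compose_id"]),
        "example.update.ready":
            lambda body: scheduler.schedule("update", body["update_id"]),
    }

    def callback(message: Message) -> None:
        handler = handlers.get(message.topic)
        if handler is not None:   # silently ignore unrelated messages
            handler(message.body)

    return callback
```

Because the callback only reacts to the topics it knows, the same consumer can listen to a busy public broker and ignore everything unrelated to composes and updates.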
The production-level deployment required some adjustments to allow the full suite of Fedora tests to run. An advanced feature of openQA allows workers to communicate with each other during live testing, which is particularly useful for testing client-server programs. Regular tests have a worker class that corresponds to the architecture of the image under test, e.g. qemu_x86_64, but if a test requires worker communication, then, for Fedora, workers must also belong to an additional class, either "tap" or "tap2" (the numbering has no significance other than to divide the tests into two groups). For these kinds of tests, the virtual machines communicate using TAP devices (software-emulated Ethernet devices) connected by an Open vSwitch bridge. Upstream openQA provides an excellent service, os-autoinst-openvswitch, that isolates TAP devices and routes traffic to virtual machines. Isolating TAP devices is essential to avoiding IP address conflicts because several instances of the same test are usually running at the same time with the same IP addresses but for different builds. Unfortunately, os-autoinst-openvswitch also relies on the workers to make calls over the host's D-Bus. Since the worker containers were designed specifically to avoid sharing the host's network, the containerized workers had trouble making D-Bus calls and using the os-autoinst-openvswitch service.
To avoid privileged operations on the host, we experimented with creating Virtual Distributed Ethernet (VDE) switches and binding them into worker containers. We created an application to manage the dynamic creation and destruction of VDE switches and workers for each new Fedora build to avoid IP address conflicts. However, the complexity of this approach outweighed the benefits of containerizing the workers, so we opted to keep the workers that need TAP devices running directly on the host, without containers, using the existing os-autoinst-openvswitch service.
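The role of worker classes in routing tests can be shown with a small matching sketch. It is a simplified model and not the openQA scheduler: the function names and worker names are hypothetical, and it assumes that classes are given as comma-separated strings and that a worker is eligible only if it has every class the job asks for.

```python
def parse_classes(value: str) -> set[str]:
    """Split a comma-separated class string into a set of class names."""
    return {c.strip() for c in value.split(",") if c.strip()}


def eligible_workers(job_worker_class: str, workers: dict[str, str]) -> list[str]:
    """Return the names of workers that have every class the job requires."""
    required = parse_classes(job_worker_class)
    return sorted(
        name for name, classes in workers.items()
        if required <= parse_classes(classes)
    )


# Hypothetical pool: one plain worker and one worker in each tap group.
pool = {
    "worker1": "qemu_x86_64",
    "worker2": "qemu_x86_64,tap",
    "worker3": "qemu_x86_64,tap2",
}
```

In this model a regular qemu_x86_64 job can run on any of the three workers, while a job that also asks for "tap" can only land on worker2, which keeps networked tests on hosts where TAP devices are available.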
OpenQA is a powerful tool for functional testing, and although its initial setup can be daunting, its reliability and robustness make it worth the effort. It has been our pleasure to work with the Fedora Project. You can read more about the cloud deployment project in Fedora Magazine.