---
# Project Bluefin is Helping Prove That Dark Factories Work for Operating Systems

**URL:** https://crunchtools.com/project-bluefin-is-helping-prove-that-dark-factories-work-for-operating-systems/
Date: 2026-06-03
Author: fatherlinux
Post Type: post
Summary: Six years ago, I wrote about the good, better, best approach to Linux quality when evaluating container images. The same framework applies to desktop Linux distributions – maybe even more so, because desktops have a GUI that’s notoriously hard to test automatically. Here’s what I wrote: Good: Use a bug tracker and collect problems as…
Categories: Articles
Tags: AI/ML, Container Images, Linux, Open Source Software, Systems Administration
Featured Image: https://crunchtools.com/wp-content/uploads/2026/06/gemini_gen_20260603_115333_2830426c.png
---

Six years ago, I wrote about the [good, better, best approach to Linux quality](https://crunchtools.com/comparison-linux-container-images/#Performance) when evaluating container images. The same framework applies to desktop Linux distributions - maybe even more so, because desktops have a GUI that's notoriously hard to test automatically. Here's what I wrote:

> - **Good:** Use a bug tracker and collect problems as contributors and users report them. Fix them as they are tracked. Almost all Linux distributions do this. I call this the police and fire method, wait until people dial 911.
> - **Better:** Use the bug tracker and proactively build tests so that these bugs don't creep back into the Linux distribution, and hence back into the container images.
> - **Best:** Have a team of engineers proactively build out complete test cases, and publish the results. Feed all of the lessons learned back into the Linux distribution.

Almost every Linux desktop distribution is still closer to "Good" than "Best." Some — Fedora, GNOME, and openSUSE — have invested in automated desktop testing through [openQA](https://open.qa/), which boots VMs and validates the GUI by matching screenshots against reference images. That's real progress. But it requires dedicated infrastructure, and the tests are inherently fragile — a theme change, a resolution tweak, or a shifted pixel can break a needle match. Jorge Castro just took Bluefin somewhere different.

## What Happened

Jorge did a [4-day sprint](https://github.com/ublue-os/bluefin/discussions/4724) to rebuild Bluefin's entire build-and-release pipeline using AI coding agents. Jorge's discussion post covers a lot of ground — the vision, the lore, the community conversation. I want to zoom in on the technical substance, because what actually got built deserves its own telling.

The headline isn't the AI. It's what the AI helped build.

## The Problem With Screenshot-Based Testing

A handful of Linux distributions do automated desktop testing — Fedora and GNOME both run openQA instances that boot VMs and match screenshots against reference images. That's a real investment, and it catches real bugs. But screenshot-based testing is brittle. Change a theme, bump a font size, shift a widget by a few pixels, and your reference images break. The tests need constant manual maintenance to keep up with the UI they're testing. And openQA requires dedicated infrastructure — it's not something you bolt onto a standard GitHub Actions runner.

So for most projects, desktop QA is still a manual process. Real humans click through test plans on real machines. For a volunteer project like Bluefin, it mostly just didn't happen.

## What Jorge Built

The new Bluefin pipeline boots the actual OS image in a VM on a standard GitHub Actions runner, starts a real GNOME session, and runs 255 automated test scenarios against it. Not a container — a real VM, with a real kernel, real systemd, and a real GNOME Wayland session, running on the same free CI runner everyone else uses for `npm test`. GitHub Actions runners support nested KVM, so QEMU gets hardware acceleration without any special infrastructure. Tests like "does the lock screen work," "can you launch Firefox," "do the GNOME Shell extensions crash on load," "is the Homebrew PATH set correctly."
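To make the nested KVM point concrete, here is a minimal sketch of how a test harness might decide whether QEMU can use hardware acceleration. It is not taken from the Bluefin pipeline: the function name `qemu_accel_args` is hypothetical, and the only assumption about the runner is that KVM shows up as the usual `/dev/kvm` device node.

```python
import os


def qemu_accel_args(kvm_device: str = "/dev/kvm") -> list[str]:
    """Return QEMU acceleration flags depending on whether KVM is usable.

    Hypothetical helper for illustration; not part of the Bluefin test suite.
    """
    # KVM is usable when the device node exists and the current user can
    # open it for both reading and writing.
    if os.access(kvm_device, os.R_OK | os.W_OK):
        return ["-accel", "kvm"]
    # Otherwise fall back to QEMU's software emulator (TCG), which is far slower.
    return ["-accel", "tcg"]


if __name__ == "__main__":
    print(" ".join(qemu_accel_args()))
```

On a runner with nested virtualization this would select `-accel kvm`. Anywhere else it degrades to software emulation instead of failing outright.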

The tests interact with the desktop the same way a screen reader does — through the accessibility tree. Every button, menu, and toggle in GNOME exposes itself with a name and a role. The test framework finds widgets semantically ("find the toggle button named Activities") and invokes their actions. This is the key difference from openQA's approach. There's no screenshot comparison, no pixel matching, no reference images to maintain. Change your theme or resolution and the tests still pass — because they're testing the semantic structure of the UI, not its appearance.
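The idea is easier to see with a toy model. The sketch below is not dogtail or AT-SPI; `Node`, `find`, and `build_shell` are made-up names that only mimic the shape of an accessibility tree. It shows why a lookup by name and role keeps working when theme and position change.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    """One widget in a toy accessibility tree (not the real AT-SPI API)."""
    name: str
    role: str
    # Appearance data that a screenshot comparison would be sensitive to.
    x: int = 0
    y: int = 0
    theme: str = "light"
    children: list["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        """Yield this node and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def find(root: Node, name: str, role: str) -> Optional[Node]:
    """Return the first node matching both name and role, or None."""
    for node in root.walk():
        if node.name == name and node.role == role:
            return node
    return None


def build_shell(theme: str, offset: int) -> Node:
    """Build the same logical UI with a different look and layout."""
    return Node("gnome-shell", "application", theme=theme, children=[
        Node("Top Bar", "panel", theme=theme, children=[
            Node("Activities", "toggle button", x=offset, y=offset, theme=theme),
        ]),
    ])


if __name__ == "__main__":
    # Two visually different desktops, one semantic query.
    for shell in (build_shell("light", 0), build_shell("dark", 12)):
        button = find(shell, "Activities", "toggle button")
        print(button.name, button.role, button.theme, button.x)
```

Both lookups succeed even though every pixel differs between the two trees. That property is what the real tools get from the accessibility tree.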

The test scenarios are written in plain English:
```
Scenario: Screen lock engages without crashing GNOME Shell
  * GNOME Shell is accessible via AT-SPI
  * Lock screen via Shell.Eval
  * Session is locked
  * Unlock screen via Shell.Eval

Scenario: GNOME Shell extensions do not crash shell on load
  * GNOME Shell is accessible via AT-SPI
  * No journal entries at priority "err..emerg" contain "gnome-shell"
```
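Behind each plain-English step sits ordinary code. I haven't reproduced the suite's actual step definitions here; the following is a hedged sketch of how the journal check could work, assuming the journal has been exported as one JSON object per line (the format `journalctl -o json` produces). The names `parse_range` and `offending_entries` are hypothetical, and matching the text against both `MESSAGE` and `SYSLOG_IDENTIFIER` is my assumption.

```python
import json

# syslog priority names in order of severity; the index is the numeric level.
PRIORITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def parse_range(spec: str) -> tuple[int, int]:
    """Turn "err..emerg" into (0, 3); a single name means that level and worse."""
    levels = [PRIORITIES.index(name) for name in spec.split("..")]
    if len(levels) == 1:
        return 0, levels[0]
    return min(levels), max(levels)


def offending_entries(journal_json_lines: str, priority: str, needle: str) -> list[dict]:
    """Return journal entries inside the priority range that mention needle."""
    low, high = parse_range(priority)
    hits = []
    for line in journal_json_lines.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        # Assumption: an entry without PRIORITY is treated as "info" (6).
        level = int(entry.get("PRIORITY", 6))
        # MESSAGE can be a byte array for non-UTF-8 data; only match strings.
        text = " ".join(
            value
            for value in (entry.get("MESSAGE"), entry.get("SYSLOG_IDENTIFIER"))
            if isinstance(value, str)
        )
        if low <= level <= high and needle in text:
            hits.append(entry)
    return hits


# Made-up sample data, shaped like journalctl's JSON output.
SAMPLE = "\n".join([
    json.dumps({"PRIORITY": "6", "SYSLOG_IDENTIFIER": "gnome-shell", "MESSAGE": "Extension loaded"}),
    json.dumps({"PRIORITY": "3", "SYSLOG_IDENTIFIER": "gnome-shell", "MESSAGE": "JS ERROR: TypeError"}),
    json.dumps({"PRIORITY": "3", "SYSLOG_IDENTIFIER": "kernel", "MESSAGE": "usb disconnect"}),
])

if __name__ == "__main__":
    for entry in offending_entries(SAMPLE, "err..emerg", "gnome-shell"):
        print(entry["MESSAGE"])
```

A step like the one in the scenario would then simply assert that the returned list is empty.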

The tooling that makes this possible — [qecore](https://pypi.org/project/qecore/), [dogtail](https://gitlab.com/dogtail/dogtail), [gnome-ponytail-daemon](https://gitlab.gnome.org/ofourdan/gnome-ponytail-daemon) — comes from Red Hat's desktop QE team and a handful of GNOME developers who solved the Wayland input problem. Jorge's team wired it all together into a reusable GitHub Actions workflow that any GNOME-based OS image can call with three lines of YAML.

Every successful build on main triggers a smoke test. Every Tuesday, a promotion workflow locks the commit SHA, verifies the tests passed, runs extended suites, and requires human approval via GitHub's production Environment before retagging the image to stable. Untested code cannot reach users. Period.
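The gate logic itself is simple enough to write down. This is a sketch of the decision as described above, not the actual workflow: in the real pipeline these checks live in GitHub Actions, and the `Candidate` fields and the `promotion_decision` function are names I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """What the gate knows about one build (all field names hypothetical)."""
    sha: str                # commit SHA locked at the start of promotion
    smoke_passed_sha: str   # SHA that the last green smoke run actually tested
    extended_passed: bool   # result of the extended suites
    human_approved: bool    # approval from the protected environment


def promotion_decision(c: Candidate) -> tuple[bool, str]:
    """Return (promote?, reason). Every check must pass; the first failure wins."""
    if c.smoke_passed_sha != c.sha:
        return False, "smoke tests did not pass for the locked commit"
    if not c.extended_passed:
        return False, "extended suites failed"
    if not c.human_approved:
        return False, "waiting for human approval"
    return True, "retag image to stable"


if __name__ == "__main__":
    build = Candidate("abc123", "abc123", extended_passed=True, human_approved=False)
    print(promotion_decision(build))
```

The design choice worth noticing is that the SHA comparison comes first. A green test run only counts if it tested exactly the commit being promoted.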

## Where AI Actually Mattered

The testing framework, the Wayland bridge, the accessibility automation — none of that is new. Red Hat QE has used these tools internally. What's new is that someone took all of it and built a complete, public, reusable CI/CD pipeline around it in four days.

That's where the AI coding agents earned their keep. Not by inventing new testing concepts, but by compressing months of CI/CD plumbing into four days. 29 test feature files. 25 workflow definitions. 10 shared GitHub Actions. A weekly promotion gate that refuses to ship untested code. This is exactly the kind of work that's tough to get done in any project: important but unglamorous, and always deprioritized in favor of the next feature or the latest fire.

## Beyond "Best": Agents That Write the Tests

Going back to my old framework, "Best" was having a team of engineers proactively build out complete test cases, publish the results, and feed the lessons back into the distribution. Bluefin has that now. But what happens next is what pushes past that tier.

When a user files a bug — say, GNOME Shell extensions crashing on load ([bluefin#4612](https://github.com/ublue-os/bluefin/issues/4612)) — an agent can write a regression test for it. You can see it right in the test suite: `@regression @bluefin_4612`. That test now runs on every build, forever. The bug can never come back without being caught before it reaches a user.
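Those tags also make the test suite queryable. As a rough sketch, here is one way to build an index from issue numbers to the scenarios that guard against them. I'm assuming the tags sit on the line(s) directly above a `Scenario:` line, as is conventional in Gherkin, and that issue tags follow the `@<repo>_<number>` shape of the one example above. The feature text and the `regression_index` helper are illustrative, not copied from the repo.

```python
import re

# Matches a whole tag such as "@bluefin_4612". The "<repo>_<issue>" shape is
# inferred from the single example in the article.
ISSUE_TAG = re.compile(r"@([A-Za-z][\w-]*?)_(\d+)")


def regression_index(feature_text: str) -> dict[str, list[str]]:
    """Map "repo#issue" to the names of scenarios tagged as its regression test."""
    index: dict[str, list[str]] = {}
    tags: list[str] = []  # tags collected from the lines above a scenario
    for raw in feature_text.splitlines():
        line = raw.strip()
        if line.startswith("@"):
            tags.extend(line.split())
        elif line.startswith("Scenario:"):
            name = line[len("Scenario:"):].strip()
            if "@regression" in tags:
                for tag in tags:
                    match = ISSUE_TAG.fullmatch(tag)
                    if match:
                        ref = f"{match.group(1)}#{match.group(2)}"
                        index.setdefault(ref, []).append(name)
            tags = []
        elif line:
            # Any other non-empty line (Feature:, steps) ends a run of tags.
            tags = []
    return index


# Illustrative feature text, assembled from the scenarios quoted earlier.
FEATURE = """\
Feature: GNOME Shell stability

  @regression @bluefin_4612
  Scenario: GNOME Shell extensions do not crash shell on load
    * GNOME Shell is accessible via AT-SPI

  @smoke
  Scenario: Screen lock engages without crashing GNOME Shell
    * Lock screen via Shell.Eval
"""

if __name__ == "__main__":
    print(regression_index(FEATURE))
```

With an index like this, "which bugs are covered by a test?" becomes a lookup instead of something you ask around about.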

Traditionally, this cycle runs at clock speed. An engineer triages the bug, reproduces it, understands the root cause, writes a fix, writes a test, and gets it reviewed. Each step takes human time. Multiply that by every bug, every regression, every release — and the project slows down as it matures. More code means more surface area means more bugs means more time spent on maintenance instead of new work. This is the story of every long-lived software project.

Agents break that curve. They can triage, draft a fix, and write the regression test. That doesn't replace human judgment; it compresses the time between "bug filed" and "fix shipped with a test that prevents recurrence." The engineers are still making the decisions, but the yak shaving happens in the background. That means the humans can focus on where it actually matters: how the software feels to use. The fit and finish. The edge of the product where it meets real people.

This is the butterfly effect that compounds over years. Instead of the project getting slower as it grows, you should see more meaningful features shipped with the same number of engineers. Five years of accumulated project knowledge (what breaks, what to test for, what the failure modes are) is being encoded in test scenarios and CI workflows instead of living on as tribal knowledge. It's not police and fire anymore. It's building codes, inspections, and a fire suppression system that installs itself.

## Try It Yourself

The test suite is a reusable GitHub Actions workflow. If you build a bootc/ostree GNOME image, you can point it at your image today:
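```
# Caller workflow sketch: the trigger and the job id are placeholders.
on: push
jobs:
  desktop-tests:
    uses: projectbluefin/testsuite/.github/workflows/e2e.yml@main
    with:
      image: ghcr.io/yourorg/yourimage:latest
      suites: smoke
```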

No self-hosted runners. No special hardware. Standard `ubuntu-latest`. The test framework is built to be reusable by any GNOME-based distribution — not just Bluefin. If more projects adopt accessibility-tree-based testing alongside existing openQA coverage, the Linux desktop gets a testing approach that's both deeper and less fragile than screenshot matching alone.

- [The Pattern: technical comparison](https://github.com/projectbluefin/bluefin/blob/main/THEPATTERN.md)
- [Test suite repo](https://github.com/projectbluefin/testsuite)
- [Discussion thread](https://github.com/ublue-os/bluefin/discussions/4724)
- [Original "police and fire" framework](https://crunchtools.com/comparison-linux-container-images/#Performance)
