Ever since we started working on “task-oriented programming” (aka vibe coding) in 2023, our group at GitHub Next has been throwing around ideas for “continuous” tasks in software repositories: Continuous Code Cleanup, Continuous Documentation and so on. These ideas finally bubbled up as the Continuous AI project, which situates them within the tradition of Continuous Integration/Continuous Deployment (CI/CD).
One of the most powerful ways to implement Continuous AI workloads is “agentically”, through natural-language programming such as Agentic Workflows. This is a new demonstrator technology in which AI-powered workflows running in GitHub Actions perform complex, multi-step tasks described in natural language, with that natural language interpreted by systems such as Claude Code or Codex.
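To give a feel for the shape of such a workflow, here is a rough sketch: a markdown file whose frontmatter configures when the workflow runs and what it may touch, and whose body is the natural-language task for the agent. The file name, frontmatter fields and wording below are illustrative assumptions, not the actual Agentic Workflows syntax – see the project’s documentation for the real format.

```markdown
---
# Hypothetical frontmatter: when to run, and what the agent may do.
on:
  schedule:
    - cron: "0 2 * * *"   # run nightly
permissions:
  contents: write
  pull-requests: write
---

# Daily Test Coverage Improver

Run the project's coverage tooling, identify the least-tested modules,
add meaningful unit tests for them, re-run coverage to confirm the
improvement, and open a pull request describing the new tests.
```

The essential idea is that the “program” is the prose body; the interpreting agent decides how to satisfy it within the permissions granted by the frontmatter.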
Recently we’ve added a new demonstrator workflow to our sample pack called the 🧪 Daily Test Coverage Improver – “Improve test coverage daily by adding meaningful tests to under-tested areas”.
I’m intrigued by Continuous Test Improvement as an application area for AI in software development. It is not a new concept – companies such as DiffBlue offer powerful language-specific toolchains for improving coverage – but the ease with which agentic AI, combined with automated coverage reports, can build out a staggering range of tests across very diverse projects and languages is quite incredible. Given the goal of “improve coverage” – and a way of measuring coverage – coding agents chase that goal with obsessive fervour, exploring a myriad of ways to achieve it.
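The measurable signal is the crucial ingredient here. As a toy illustration of what that signal looks like – using only Python’s standard-library `trace` module rather than any of the tooling named above, with a made-up function and deliberately incomplete test suite – line coverage can be computed like this:

```python
import dis
import trace

def clamp(x, lo, hi):
    """Toy function under test: restrict x to the interval [lo, hi]."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def incomplete_suite():
    # Deliberately incomplete: the x > hi branch is never exercised.
    assert clamp(-5, 0, 10) == 0
    assert clamp(5, 0, 10) == 5

# Record which lines execute while the suite runs (count=1 keeps hit counts).
tracer = trace.Trace(count=1, trace=0)
tracer.runfunc(incomplete_suite)
counts = tracer.results().counts  # maps (filename, lineno) -> hit count

# Line numbers belonging to clamp, taken straight from its code object.
clamp_lines = {ln for _, ln in dis.findlinestarts(clamp.__code__)
               if ln is not None}
hit = {ln for (fname, ln) in counts
       if fname == clamp.__code__.co_filename and ln in clamp_lines}
missed = clamp_lines - hit

coverage = len(hit) / len(clamp_lines)
print(f"clamp coverage: {coverage:.0%}, missed lines: {sorted(missed)}")
```

An agent given “make `missed` empty” has an unambiguous target: add a test that exercises the untaken branch and re-measure. Real coverage tools report the same kind of hit/missed data, just at repository scale.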
But there is another reason I’m intrigued: better testing means better software. That is, by embracing Continuous Test Improvement, we can rapidly and significantly improve the quality of software through LLMs and Agentic AI. That is a fundamental shift from the “AI means more codegen” mindset to “AI means better software”.
An Example – Improving the Testing of Open Source Libraries
Let’s take an example: yesterday I spent an hour trialling the Daily Test Coverage Improver on three of the more popular libraries on the planet:
- .NET’s NewtonSoft.Json (my fork here), a package with 6 billion downloads
- Rust’s tokio (my fork here), a package with 350 million downloads
- Python’s dateutil (my fork here), a very mature, widely used Python package
I’ve never contributed to any of these libraries before. Given the importance of these libraries you’d hope for near 100% test coverage.
To give a brief summary:
- Within ~1h of routine steps I had multiple potential PRs ready (here and here and here), each improving test coverage. (There are also separate perf improvements there, which I will discuss in later blog entries.)
- Model cost was about $50 for the first 10 PRs.
There are plenty of ways to improve the flow and experience of installing and using these workflows, which are just demonstrators. However, results like these mean I am incredibly excited about the potential for Continuous AI-driven software improvement across the software industry in general. With modest spend and some human guidance, we can automatically improve the quality of the most popular and important libraries in existence, and continue to do so day by day (or at whatever cadence you choose). I can also see a need for enterprises to apply the same techniques to their most critical internal components and OSS dependencies.
On Technical Debt
I will add a personal angle to this: redeeming the guilt of technical debt. It is hard to over-emphasise the guilt I feel for the insufficiently tested software I have left behind in the world. Untested software is poison to those who have to deal with it, maintain it and use it over the years. And yes, over my career I have both written and suffered from vast quantities of software that needs better testing. A primary way we as software engineers can be more responsible in our technical jobs is to ensure our software has adequate testing. This can now be effectively near-automated, and we no longer have any excuses. Nor do CTOs. Test coverage matters, and today’s companies should be measured on it.
Summary
I strongly believe that together we can use Continuous AI – and Continuous Test Improvement in particular – to make software better: higher quality and more thoroughly tested. I think this modality will become fundamental to the working life of almost every software developer, team and company. Companies will finally be able to cover their legacy codebases with tests near-automatically, and much technical testing debt can at last be paid off. If CTOs and industry leaders invest wisely, for the long term, the tech industry’s guilt over 50 years and billions of lines of untested critical code may finally start to be addressed.
NOTE: GitHub Agentic Workflows are research demonstrators only. See caveats. All opinions and thoughts are my own.