Google has sprinkled the magic of artificial intelligence into its open source fuzz testing infrastructure, and the results suggest large language models (LLMs) will radically alter the bug-hunting space.
Google added generative AI technology to its OSS-Fuzz project (a free service that runs fuzzers for open source projects and privately alerts developers to the bugs detected) and discovered a massive improvement in code coverage when LLMs are used to create new fuzz targets.
“By using LLMs, we’re able to increase the code coverage for critical projects using our OSS-Fuzz service without manually writing additional code. Using LLMs is a promising new way to scale security improvements across the over 1,000 projects currently fuzzed by OSS-Fuzz and to remove barriers to future projects adopting fuzzing,” the company said in a note with results from a months-long experiment.
Fuzz testers, or fuzzers, are used in vulnerability research to uncover security flaws by sending random or unexpected input to an application. If the program contains a vulnerability that leads to an exception, crash or server error, researchers can analyze the test results to pinpoint the cause.
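To make the idea concrete, here is a minimal sketch of a random-input fuzzer, written in Python for brevity. Everything in it is hypothetical and for illustration only: `parse_record` is a toy parser with a deliberately planted bug, and `fuzz` is a bare loop, not how OSS-Fuzz or any production fuzzing engine works.

```python
import random


def parse_record(data: bytes) -> bytes:
    """Hypothetical toy parser: the first byte declares the payload length."""
    if not data:
        return b""
    declared = data[0]
    payload = data[1:]
    out = bytearray()
    # Planted bug: the declared length is trusted without checking it
    # against the number of payload bytes actually present.
    for i in range(declared):
        out.append(payload[i])
    return bytes(out)


def fuzz(target, iterations=1000, max_len=8, seed=0):
    """Feed random byte strings to `target` and record inputs that raise."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    crashes = []
    for _ in range(iterations):
        length = rng.randrange(max_len + 1)
        data = bytes(rng.randrange(256) for _ in range(length))
        try:
            target(data)
        except Exception as exc:  # a "crash" in this toy setting
            crashes.append((data, exc))
    return crashes


if __name__ == "__main__":
    found = fuzz(parse_record)
    print(f"{len(found)} crashing inputs found")
    if found:
        data, exc = found[0]
        print(f"example: {data!r} -> {type(exc).__name__}")
```

Each recorded input reproduces the failure on demand, which is what lets a researcher trace the crash back to the unchecked length.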
However, fuzzing depends heavily on manual effort to write fuzz targets, the functions that feed fuzzer input into specific sections of code, which led Google software engineers to test whether LLMs could be used to boost the effectiveness of the six-year-old OSS-Fuzz service.
The company said the OSS-Fuzz project has helped find and verify fixes for more than 10,000 security bugs in open source software, but researchers believed the tool could find even more with increased code coverage.
“The fuzzing service covers only around 30% of an open source project’s code on average, meaning that a large portion of our users’ code remains untouched by fuzzing,” Google said.
To test whether an LLM could successfully write new fuzz targets, Google’s software engineers built an evaluation framework that connects OSS-Fuzz to its LLM to pinpoint under-fuzzed, high-potential portions of the sample project’s code for evaluation.
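Google has not described how the framework ranks code for evaluation, so the following is only a sketch of one plausible heuristic: keep functions whose line coverage falls below a threshold, then prefer those with the most uncovered lines. The `FunctionCoverage` type, the 30% threshold and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionCoverage:
    """Hypothetical per-function entry from a coverage report."""
    name: str
    covered_lines: int
    total_lines: int

    @property
    def ratio(self) -> float:
        # Treat empty functions as fully covered so they are never selected.
        return self.covered_lines / self.total_lines if self.total_lines else 1.0


def pick_candidates(report, max_ratio=0.3, top_n=2):
    """Return under-fuzzed functions, largest uncovered line count first."""
    under = [f for f in report if f.ratio < max_ratio]
    under.sort(key=lambda f: f.total_lines - f.covered_lines, reverse=True)
    return under[:top_n]


# Made-up report entries, not taken from any real project.
report = [
    FunctionCoverage("parse_document", 10, 200),
    FunctionCoverage("load_file", 90, 100),
    FunctionCoverage("print_tree", 0, 50),
    FunctionCoverage("tiny_helper", 0, 5),
]

if __name__ == "__main__":
    for f in pick_candidates(report):
        print(f.name, f"{f.ratio:.0%} covered")
```

Sorting by uncovered lines rather than by ratio alone keeps trivially small functions from crowding out large, untested ones.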
The company explained that the evaluation framework, which sits between OSS-Fuzz and the LLM, then creates a prompt that the LLM uses to write the new fuzz target. “At first, the code generated from our prompts wouldn’t compile, however after several rounds of prompt engineering and trying out the new fuzz targets, we saw projects gain between 1.5% and 31% code coverage,” the company said.
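The prompt-then-compile loop described above could be sketched roughly as follows. This is not Google's framework: `build_prompt`, `generate_target`, the prompt wording and the idea of feeding compiler errors back into the next prompt are all assumptions, and the LLM and compiler are passed in as plain callables so the sketch stays self-contained. The one real name is `LLVMFuzzerTestOneInput`, the entry point libFuzzer-style fuzz targets define.

```python
from typing import Callable, Optional, Tuple


def build_prompt(project: str, signature: str) -> str:
    """Assemble a hypothetical prompt asking for a fuzz target."""
    return (
        f"You are writing a fuzz target for the open source project {project}.\n"
        "Write a C++ function `LLVMFuzzerTestOneInput` that exercises:\n"
        f"{signature}\n"
        "Return only code that compiles."
    )


def generate_target(
    prompt: str,
    ask_llm: Callable[[str], str],
    compiles: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for a target, retrying with compiler feedback until it builds."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = ask_llm(prompt + feedback)
        ok, errors = compiles(candidate)
        if ok:
            return candidate
        # Assumption: append the build errors to the next prompt.
        feedback = f"\nThe previous attempt failed to compile:\n{errors}\nFix it."
    return None  # give up after max_attempts
```

In a real setup, `ask_llm` would call a model API and `compiles` would invoke the project's build, after which the surviving target would be run to measure any coverage gain.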
In one sample project, tinyxml2, Google said code coverage improved from 38% to 69% without any human intervention.
“The case of tinyxml2 taught us: when LLM-generated fuzz targets are added, tinyxml2 has the majority of its code covered,” the engineers said. “To replicate tinyxml2’s results manually would have required at least a day’s worth of work — which would mean several years of work to manually cover all OSS-Fuzz projects.”
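The figures quoted in the article can be checked with some back-of-the-envelope arithmetic. The sketch below takes the tinyxml2 coverage numbers, the one-day-per-project estimate and the count of roughly 1,000 projects from the article; the 250 working days per year is an assumption added here to convert person-days into years.

```python
def coverage_gain_points(before_pct: float, after_pct: float) -> float:
    """Absolute coverage gain in percentage points."""
    return after_pct - before_pct


def manual_effort_years(
    projects: int,
    days_per_project: float = 1.0,
    working_days_per_year: int = 250,  # assumption, not from the article
) -> float:
    """Person-years needed to hand-write targets for every project."""
    return projects * days_per_project / working_days_per_year


if __name__ == "__main__":
    print(coverage_gain_points(38, 69))  # 31
    print(manual_effort_years(1000))     # 4.0
```

The 31-point gain for tinyxml2 matches the top of the 1.5% to 31% range Google reported, and roughly four person-years is consistent with the engineers' "several years" estimate.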
During the experiment, Google said the LLM was able to automatically generate a working target that rediscovered CVE-2022-3602 (see OpenSSL advisory), which was in an area of code that previously did not have fuzzing coverage. “Though this is not a new vulnerability, it suggests that as code coverage increases, we will find more vulnerabilities that are currently missed by fuzzing,” Google added.
The company plans to open source the evaluation framework to allow researchers to test their own automatic fuzz target generation.