Google’s Project Zero team has developed a framework to enable large language models (LLMs) to perform basic vulnerability research autonomously.
A recent blog post explained how the “Project Naptime” framework builds on research by Meta, which set benchmarks for the ability of LLMs to discover and exploit memory vulnerabilities, namely advanced memory corruption and buffer overflow flaws.
The project sought to address a fundamental shortcoming in LLMs when it comes to assessing security flaws. In the Meta experiments, dubbed “CyberSecEval 2,” LLMs were found to score low in their ability to perform basic vulnerability discovery, with none coming close to “passing” the benchmark challenge.
However, Google’s Project Zero researchers found that the Naptime framework, named for the idea that LLMs may one day allow security researchers to “take regular naps” during automated processes, improved the performance of LLMs on CyberSecEval 2 tests by up to 20-fold.
Project Naptime gives LLMs access to tools to mimic human workflows
The Naptime architecture designed by Project Zero includes a toolset consisting of a debugger, code browser, Python tool and reporter tool that enhance LLMs’ abilities to evaluate code, exploit vulnerabilities and verify successful exploitation autonomously.
For example, the code browser enables LLMs to navigate the target program’s source code similarly to how a human researcher would use something like Chromium Code Search to better identify the locations of referenced functions or variables.
The Python tool enables the LLMs to run Python scripts within a sandbox in order to both perform precise calculations and generate complex inputs to test and exploit the target program.
The debugger lets the LLMs observe, record and understand the behavior of the target program in response to different inputs. The reporter provides a mechanism for the LLM to signal its progress to a controller, which verifies whether a success condition, such as a crash, has been achieved.
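The division of labor between the tools and the controller can be illustrated with a minimal Python sketch. This is not Project Zero's code: every name in it (toy_target, debugger_tool, Controller, run_episode) is hypothetical, the "target program" is a toy function, a scripted list of actions stands in for the LLM, and a raised Python exception stands in for a crash.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def toy_target(data: bytes) -> str:
    """Hypothetical stand-in for a target program with a fixed 8-byte buffer."""
    buffer = bytearray(8)
    for i, byte in enumerate(data):
        buffer[i] = byte  # raises IndexError once the input exceeds 8 bytes
    return buffer.decode("latin-1")


def debugger_tool(target: Callable[[bytes], str], data: bytes) -> str:
    """Hypothetical debugger tool: run the target and describe what happened."""
    try:
        return f"ok: {target(data)!r}"
    except Exception as exc:
        return f"exception: {type(exc).__name__}"


@dataclass
class Controller:
    """Hypothetical controller that independently checks reported inputs."""
    target: Callable[[bytes], str]
    log: List[str] = field(default_factory=list)

    def verify(self, candidate: bytes) -> bool:
        # Re-run the target; in this toy, any exception counts as a crash.
        try:
            self.target(candidate)
        except Exception as exc:
            self.log.append(f"crash confirmed: {type(exc).__name__}")
            return True
        self.log.append("no crash")
        return False


def run_episode(actions: List[Tuple[str, bytes]], controller: Controller) -> bool:
    """Replay scripted tool calls; a real system would get them from the LLM."""
    for tool, payload in actions:
        if tool == "debugger":
            controller.log.append(debugger_tool(controller.target, payload))
        elif tool == "reporter":
            # The reporter only signals; the controller decides success.
            return controller.verify(payload)
    return False


# Scripted actions: probe with a short input, then a long one, then report.
actions = [
    ("debugger", b"A" * 4),
    ("debugger", b"A" * 16),
    ("reporter", b"A" * 16),
]
controller = Controller(toy_target)
success = run_episode(actions, controller)
print(success)
for entry in controller.log:
    print(entry)
```

The point of the sketch is that the model never grades itself: the reporter hands a candidate input to the controller, which re-runs the target and decides independently whether the success condition was met.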
The Naptime framework also aims to let LLMs work more like a human researcher by giving them more flexibility to use “reasoning” processes. For example, the framework encourages the LLMs to produce long explanations for their decisions, which has been shown to increase accuracy.
GPT 4 Turbo, Gemini 1.5 Pro excel in basic vulnerability research
The Naptime test results published by Project Zero reveal that GPT 4 Turbo performed best in the CyberSecEval 2 buffer overflow test, which required exploiting a buffer overflow vulnerability to trigger a program output outside of the program’s “normal” execution. Gemini 1.5 Pro scored highest in the advanced memory corruption test, in which triggering a program crash signaled success.
In the buffer overflow test, GPT 4 Turbo was the only LLM to receive a “passing” score of 1.00, with Gemini 1.5 Pro coming in a close second with a score of 0.99 over 20 test completions.
In the advanced memory corruption test, the researchers discovered that the LLMs achieved an unexpectedly high success rate by discovering and exploiting a separate unintended, easy-to-exploit vulnerability in the target program, with GPT 4 Turbo achieving the best results.
However, when this unintended flaw was removed, leaving only the original target vulnerability, Gemini 1.5 Pro came out on top with a score of 0.58 after 20 test completions.
The other models tested were GPT 3.5 Turbo and Gemini 1.5 Flash, which scored a maximum of 0.21 and 0.26 in the buffer overflow test and a maximum of 0.56 and 0.53 in the advanced memory corruption test, respectively.
“When provided with the right tools, current LLMs can really start to perform (admittedly rather basic) vulnerability research!” the researchers wrote.
However, the Project Zero team acknowledged that LLMs are still far from achieving the ability to autonomously aid researchers in real-life vulnerability research scenarios, which involve greater ambiguity and complexity than the benchmark tests of CyberSecEval 2.
“Solving these challenges is closer to the typical usage of targeted, domain-specific fuzzing performed as part of a manual review workflow than a fully autonomous researcher,” the authors concluded. “More importantly, we believe that in tasks where an expert human would rely on multiple iterative steps of reasoning, hypothesis formation, and validation, we need to provide the same flexibility to the models; otherwise, the results cannot reflect the true capability level of the models.”