Google’s Big Sleep LLM agent discovers exploitable bug in SQLite

Google has used a large language model (LLM) agent called “Big Sleep” to discover a previously unknown, exploitable memory flaw in widely used software for the first time, the company announced Friday.

The stack buffer underflow vulnerability in a development version of the popular open-source database engine SQLite was found through variant analysis by Big Sleep, which is a collaboration between Google Project Zero and Google DeepMind.

Big Sleep is an evolution of Project Zero’s Naptime project, which is a framework announced in June that enables LLMs to autonomously perform basic vulnerability research. The framework provides LLMs with tools to test software for potential flaws in a human-like workflow, including a code browser, debugger, reporter tool and sandbox environment for running Python scripts and recording outputs.

The researchers gave the Gemini 1.5 Pro-driven AI agent a previous SQLite vulnerability as a starting point, providing context for Big Sleep to search for similar vulnerabilities in newer versions of the software. The agent was presented with recent commit messages and diffs and asked to review the SQLite repository for unresolved issues.

Google’s Big Sleep ultimately identified a flaw involving the function “seriesBestIndex” mishandling the use of the special sentinel value -1 in the iColumn field. Since this field would typically be non-negative, all code that interacts with this field must be designed to handle this unique case properly, which seriesBestIndex fails to do, leading to a stack buffer underflow.
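The class of bug described above can be illustrated with a minimal sketch in C. This is not SQLite's code: the Constraint struct, the record_constraints function and the constants are hypothetical names chosen for illustration, and the layout (a column offset subtracted before indexing a small stack array) is an assumption about how such a sentinel can turn into a negative index. The sketch shows the kind of bounds check that keeps a -1 sentinel from being used as an array index.

```c
#include <assert.h>
#include <stdio.h>

/* All names below are hypothetical; this is a simplified model, not SQLite code. */
#define ROWID_SENTINEL (-1)   /* assumption: -1 marks a constraint on the rowid */
#define FIRST_PARAM_COLUMN 1  /* assumption: parameter columns start at index 1 */
#define NUM_PARAMS 3          /* assumption: three parameter columns */

/* Simplified stand-in for a query-planner constraint. */
typedef struct {
    int iColumn;  /* column index, or ROWID_SENTINEL */
    int usable;   /* nonzero if the planner may use this constraint */
} Constraint;

/* Records, for each parameter column, the index of the last usable
   constraint that targets it. Returns how many constraints were recorded. */
int record_constraints(const Constraint *c, int n, int slots[NUM_PARAMS]) {
    int recorded = 0;
    for (int i = 0; i < NUM_PARAMS; i++) {
        slots[i] = -1;  /* -1 here means "no constraint recorded" */
    }
    for (int i = 0; i < n; i++) {
        if (!c[i].usable) {
            continue;
        }
        int idx = c[i].iColumn - FIRST_PARAM_COLUMN;
        /* Guard: a sentinel iColumn of -1 yields idx == -2. Without this
           check, slots[idx] would write before the start of the array. */
        if (idx < 0 || idx >= NUM_PARAMS) {
            continue;
        }
        slots[idx] = i;
        recorded++;
    }
    return recorded;
}
```

Calling record_constraints with a list that includes a constraint whose iColumn is ROWID_SENTINEL simply skips that entry. If the range check were removed, the same input would produce a write below the slots array, which is the general shape of a stack buffer underflow.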

Project Zero’s blog further revealed how Big Sleep worked through multiple steps to search for and test the vulnerability using the provided context and tools, documenting its process through natural language outputs. The LLM agent autonomously drew connections between the previous bug and other parts of the code, developed a test case to run in the sandbox and, after triggering a crash, generated a root-cause analysis and full crash report.

Big Sleep ultimately generated a summary of its findings that was “almost ready to report directly,” the Google Project Zero and Google DeepMind researchers wrote, clearly explaining how a certain input triggered a crash due to the failure of seriesBestIndex to handle negative values in the iColumn field.

The Google researchers reported the issue to SQLite, which fixed the problem the same day, on Oct. 9, 2024. The researchers noted that because the flaw was in a development version of the database engine, it never made its way into the official release or impacted SQLite users.

“We think that this work has tremendous defensive potential. Finding vulnerabilities in software before it’s even released, means that there’s no scope for attackers to compete: the vulnerabilities are fixed before attackers even have the chance to use them,” the researchers stated.

The Big Sleep team also noted that the agent has the potential to find bugs that are difficult to uncover with typical fuzzing techniques, saying that attempts to rediscover the SQLite flaw through fuzzing failed after 150 CPU hours of testing. They attributed this most likely to limitations in the configuration of the fuzzing harnesses available for SQLite and to the fact that the tool traditionally used for SQLite fuzzing, American Fuzzy Lop (AFL), has “reached a natural saturation point” after long-term use.

However, the team emphasized that Big Sleep remains “highly experimental” and that they believe a target-specific fuzzer “would be at least as effective” at detecting vulnerabilities as the AI agent in its current state.