Can LLMs find bugs in large codebases?

TLDR

We built a new benchmark called "Bug In The Code Stack" (BICS) to test how well LLMs can find syntactic bugs in large Python codebases.
GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than the BABILONG benchmark at the same context length and target depth, indicating that LLMs struggle more on code-based tasks than text-based tasks at long context length.
The hype is real. GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4-Series especially performed well at long context lengths compared to other models.
Generally, longer context length resulted in lower accuracy. However, there were some exceptions to this.
Models react differently to the placement of the bug within the source code. GPT-3.5-Turbo and Claude 3 Opus were the most sensitive, and GPT-4-Series was the least sensitive. Generally, less sensitivity means a more robust model.
Gemini 1.5-Pro is ~3x better than Gemini 1.0 Pro; as Google claimed, 1.5-Pro's performance stays constant across all context lengths and target depths.
Codestral 22B performed on par with GPT-3.5-Turbo and Llama3-70B despite being a significantly smaller model.

Motivation

As LLMs' context window sizes grow, their use as coding assistants for large codebases is increasing. It's crucial to understand how longer context lengths impact their performance.

The "needle in the haystack" analysis tests LLMs' ability to find specific information in long documents. Previous benchmarks like BABILONG focused on text tasks. Now, as LLMs are used more for coding, it's important to see how they perform on code tasks and if the task type affects their accuracy.

Experimental design

We developed a new benchmark called Bug In The Code Stack (BICS), which contains auto-assembled Python source code as the haystack and a syntactic bug placed within the source code as the needle. The LLM is tasked with finding the line number and the type of the bug.

Figure 1. Sample source code with a syntactic bug, the LLM is tasked with retrieving both the line number (e.g. Line 4) and the type of the bug (e.g. Missing Parenthesis) that occurs in the input code. We place the bug at various depths within the source code to evaluate the LLM's ability to retrieve the bug accurately.

Each model was run on context lengths ranging from 500 tokens to 16K tokens and target depths ranging from 0% to 100%. We ran each experiment 25 times, and the average accuracy is shown in the following charts.

To give context, 16K tokens are around 25 pages long. The models are challenged to find a single syntactic bug, which could be as small as a missing parenthesis, within 25 pages of code! This benchmark poses quite a challenge to many of the models.

Comparing results on most popular models

Figure 2. Comparing the average accuracy of each model at a target length of 50%. The performance for GPT-4o and GPT-4-Turbo stays consistently high at various context lengths. Note: Llama3-70B has a context window of 8192 tokens; the 16K tokens tests are omitted.

Figure 3. Comparing the average accuracy across all context lengths and target depth for each model, excluding 16K tokens results. GPT-4o has ~2x higher accuracy than 3.5, Llama3-70B & Command-R+

Figure 4. Comparing the variance in average accuracy per target depth for each model, a higher variance indicates that the model is more sensitive to the placement of the bug within the source code. Claude 3 Opus, 3.5 Turbo and Command-R+ are the most sensitive.

From the charts above, we can see the performance gap between different models, with GPT-4o performing the best at both short and long context lengths, closely followed by GPT-4-Turbo. Claude 3 Opus shows a similar level of performance at short context lengths but struggles at long context lengths. Additionally, GPT-3.5-Turbo, Codestral 22B, Llama3-70B, and Command-R+ all show similar performance levels.

Comparing BICS and BABILONG

Figure 5. GPT-3.5-Turbo performs 5-10x worse on BICS than BABILONG at a target depth of 75%; i.e. our benchmark is harder than BABILONG.

In addition, we see that LLMs display much lower accuracy on the BICS benchmark than the BABILONG benchmark. This indicates that LLMs struggle more at understanding long codebases than long text, hinting at a future improvement to the models for code comprehension capabilities.

Detailed results

Here are the detailed results for each model.

Figure 6. GPT-4o: Extremely high retrieval performance across all context lengths and target depths.

Figure 7. GPT-4-Turbo: High retrieval performance at short context lengths; seems to struggle at 50% depth for longer contexts.

Figure 8. Claude 3 Opus: Sharp drop in accuracy at most depths for longer context lengths.

Figure 9. Gemini-1.5-Pro: Okay-ish performance that holds up for longer context lengths.

Figure 10. Gemini-1.5-Flash: Sharp drop in accuracy at most depths for longer context lengths like Opus.

Figure 11. GPT-3.5-Turbo: Weirdly better at bugs placed at the end of the code.

Figure 12. Codestral 22B: The best performance / model size out of the models we tested.

Figure 12. Command-R+: Poor performance at target depth < 0.25, worst at longer context lengths.

Figure 13. Gemini-1.0-Pro: Bad at everything.

Figure 14. Llama3-70B: On par with GPT-3.5-Turbo; very impressive for a small model.

Future experiments

The "Bug In The Code Stack" benchmark presents a new challenge measuring LLMs' capability at long context lengths. In the future, we would also like to extend the benchmark by adding logical errors that cannot be detected using static code analyzers, which further helps evaluate the capabilities of the models. In addition, we can run experiments with different programming languages, such as Javascript or C++, and observe the performance difference.

🎉

Special thanks to Bingxu Hu and Sumanyu Sharma for helping structure the experiments and revising the article. We've open-sourced the benchmark Bug In The Code Stack. Any help to improve the benchmark or run it on additional models would be highly appreciated 🙂.