Bug In The Code Stack
Our new code-focused needle-in-the-haystack test measures how well LLMs can find bugs in code.
Context
What
In collaboration with Andy Lee (from Wat.ai), we built a new benchmark called 'Bug In The Code Stack' (BICS) to test how well LLMs can find syntactic bugs in large Python codebases.
Why
As LLMs' context window sizes grow, their use as coding assistants for large codebases is increasing. We wanted to understand how longer context lengths impact their retrieval performance.
How
We use auto-assembled Python source code as the haystack and a syntactic bug placed within the source code as the needle. The LLM is tasked with finding the line number and the type of the bug.
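For concreteness, here is a minimal sketch of how one such test case could be generated. The filler functions, the bug taxonomy (`missing_colon`, `missing_parenthesis`), the `build_test_case` helper, and the prompt wording are all illustrative assumptions, not the actual BICS implementation.

```python
import random

# Illustrative filler functions; the real benchmark auto-assembles much larger
# Python source files to reach each target context length.
FILLER_FUNCTIONS = [
    "def add(a, b):\n    return a + b\n",
    "def greet(name):\n    return f'Hello, {name}!'\n",
    "def square(x):\n    return x * x\n",
]

# Hypothetical syntactic bug types and how each one corrupts a line of code.
BUG_INJECTORS = {
    "missing_colon": lambda line: line.replace(":", "", 1),
    "missing_parenthesis": lambda line: line.replace(")", "", 1),
}


def build_test_case(num_snippets: int, depth: float, bug_type: str, seed: int = 0):
    """Assemble a Python haystack and plant one syntactic bug (the needle).

    `depth` is the relative position of the needle (0.0 = top, 1.0 = bottom).
    Returns the prompt plus the ground truth: (1-indexed line number, bug type).
    """
    rng = random.Random(seed)
    code = "".join(rng.choice(FILLER_FUNCTIONS) for _ in range(num_snippets))
    lines = code.splitlines()

    # Both bug types above need a `def ...(...):` line, so corrupt the
    # candidate line closest to the requested depth.
    candidates = [i for i, line in enumerate(lines) if line.startswith("def ")]
    target = min(candidates, key=lambda i: abs(i - depth * (len(lines) - 1)))
    lines[target] = BUG_INJECTORS[bug_type](lines[target])

    prompt = (
        "The following Python code contains exactly one syntactic bug.\n"
        "Report the line number and the type of the bug.\n\n"
        + "\n".join(lines)
    )
    return prompt, (target + 1, bug_type)


if __name__ == "__main__":
    prompt, answer = build_test_case(num_snippets=50, depth=0.5,
                                     bug_type="missing_colon")
    print(answer)  # (51, 'missing_colon') with these toy snippets
```

In the benchmark itself, the two knobs being swept are context length and target depth, which is what produces the per-length, per-depth accuracy breakdowns summarized below.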
TLDR
1. GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than on the BABILong benchmark at the same context length and target depth, indicating that LLMs struggle more with code-based retrieval than with text-based retrieval at long context lengths.
2. GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models.
3. Gemini 1.5-Pro is roughly 3x more accurate than Gemini 1.0-Pro, and, as Google claimed, its performance stays constant across all context lengths and target depths.
4. Codestral 22B performed on par with GPT-3.5-Turbo and Llama3-70B despite being a significantly smaller model.
Results
Provider | Model Name | Average Retrieval Accuracy (%) |
---|---|---|
OpenAI | GPT-4o | 82.72 |
OpenAI | GPT-4-Turbo | 78.08 |
Anthropic | Claude 3 Opus | 59.52 |
Google | Gemini 1.5-Pro | 52.48 |
Google | Gemini 1.5-Flash | 49.44 |
OpenAI | GPT-3.5-Turbo | 36.48 |
Mistral AI | Codestral 22B | 36.32 |
Meta | Llama3-70B | 35.36 |
Cohere | Command-R+ | 30.08 |
Google | Gemini-1.0-Pro | 16.32 |
Alibaba | CodeQwen 1.5 | 0.48 |