Bug In The Code Stack

Our new code-focused needle-in-the-haystack test measures how well LLMs can find bugs in code.

Context

  • What

    In collaboration with Andy Lee (from Wat.ai), we built a new benchmark called 'Bug In The Code Stack' (BICS) to test how well LLMs can find syntactic bugs in large Python codebases.

  • Why

    As LLMs' context window sizes grow, their use as coding assistants for large codebases is increasing. We wanted to understand how longer context lengths impact their retrieval performance.

  • How

    We use auto-assembled Python source code as the haystack and a syntactic bug placed within the source code as the needle. The LLM is tasked with finding the line number and the type of the bug (see the sketch below).
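
To make this setup concrete, here is a minimal sketch of how such a haystack could be assembled and a bug planted at a chosen depth. The filler snippets, bug injectors, prompt wording, and function names (build_haystack, build_prompt) are illustrative assumptions, not the actual BICS harness; the real benchmark sweeps context length and target depth, which are fixed here for brevity.

```python
import random

# Illustrative pool of small, valid Python snippets used as haystack filler.
# These stand-ins are hypothetical; BICS auto-assembles real source code.
FILLER_SNIPPETS = [
    "def add(a, b):\n    return a + b\n",
    "def greet(name):\n    return f'Hello, {name}!'\n",
    "def squares(n):\n    return [i * i for i in range(n)]\n",
]

# Hypothetical bug injectors: each removes one character to create a syntactic bug.
BUG_TYPES = {
    "missing_parenthesis": lambda line: line.replace(")", "", 1),
    "missing_colon": lambda line: line.replace(":", "", 1),
    "missing_quotation": lambda line: line.replace("'", "", 1),
}


def build_haystack(num_snippets, depth, bug_type, seed=0):
    """Assemble a code haystack and plant one syntactic bug near the given relative depth."""
    rng = random.Random(seed)
    lines = []
    for _ in range(num_snippets):
        lines.extend(rng.choice(FILLER_SNIPPETS).splitlines())

    # Pick a target line near the requested depth (0.0 = top, 1.0 = bottom),
    # advancing until we hit a line the chosen injector can actually break.
    target = int(depth * (len(lines) - 1))
    while BUG_TYPES[bug_type](lines[target]) == lines[target]:
        target = (target + 1) % len(lines)

    lines[target] = BUG_TYPES[bug_type](lines[target])
    return "\n".join(lines), target + 1  # 1-indexed line number of the needle


def build_prompt(code):
    """Wrap the buggy code in a retrieval prompt; wording and line numbering are assumptions."""
    numbered = "\n".join(f"{i + 1}: {line}" for i, line in enumerate(code.splitlines()))
    return (
        "The following Python code contains exactly one syntactic bug.\n"
        "Reply with the line number and the bug type "
        "(missing_parenthesis, missing_colon, or missing_quotation).\n\n"
        + numbered
    )


if __name__ == "__main__":
    code, bug_line = build_haystack(num_snippets=50, depth=0.5, bug_type="missing_colon")
    prompt = build_prompt(code)
    print(f"Ground truth: line {bug_line}, bug type missing_colon")
    # The prompt is then sent to the model under test, and its answer is scored
    # against (bug_line, bug_type) to compute retrieval accuracy.
```

Retrieval accuracy at a given context length and target depth is then simply the fraction of such prompts for which the model returns the correct line number and bug type.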

TLDR

1. GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than on the BABILong benchmark at the same context length and target depth, indicating that LLMs struggle more with code-based retrieval than text-based retrieval at long context lengths.
2. GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models.
3. Gemini 1.5-Pro performed roughly 3x better than Gemini 1.0-Pro, and, as Google claimed, its performance remains consistent across all context lengths and target depths.
4. Codestral 22B performed on par with GPT-3.5-Turbo and Llama3-70B despite being a significantly smaller model.

Results

Provider      Model Name         Average Retrieval Accuracy (%)
OpenAI        GPT-4o             82.72
OpenAI        GPT-4-Turbo        78.08
Anthropic     Claude 3 Opus      59.52
Google        Gemini 1.5-Pro     52.48
Google        Gemini 1.5-Flash   49.44
OpenAI        GPT-3.5-Turbo      36.48
Mistral AI    Codestral 22B      36.32
Meta          Llama3-70B         35.36
Cohere        Command-R+         30.08
Google        Gemini 1.0-Pro     16.32
Alibaba       CodeQwen 1.5        0.48

Want to test your model?