Bug In The Code Stack
Our new code-focused needle-in-the-haystack test measures how well LLMs can find bugs in code.
Context
- What: In collaboration with Andy Lee (from Wat.ai), we built a new benchmark called 'Bug In The Code Stack' (BICS) to test how well LLMs can find syntactic bugs in large Python codebases.
- Why: As LLMs' context window sizes grow, their use as coding assistants for large codebases is increasing. We wanted to understand how longer context lengths impact their retrieval performance.
- How: We use auto-assembled Python source code as the haystack and a syntactic bug placed within the source code as the needle. The LLM is tasked with identifying the line number and the type of the bug (see the sketch after this list).
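Below is a minimal sketch of how a haystack and needle could be assembled under this setup. The function names (`build_haystack`, `insert_bug`) and the `missing_parenthesis` bug type are illustrative assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch only: function names and the bug type are assumptions,
# not the benchmark's real code.
def build_haystack(snippets: list[str]) -> str:
    """Concatenate clean Python snippets into one large source string."""
    return "\n\n".join(snippets)

def insert_bug(source: str, depth: float) -> tuple[str, int, str]:
    """Insert a syntactic bug (here: a removed closing parenthesis) at roughly
    `depth` (0.0 = top, 1.0 = bottom). Returns the buggy source, the
    1-indexed line of the bug, and the bug type."""
    lines = source.splitlines()
    target = min(len(lines) - 1, int(depth * len(lines)))
    # Walk forward to a line that actually contains a closing parenthesis.
    while ")" not in lines[target]:
        target = (target + 1) % len(lines)
    lines[target] = lines[target].replace(")", "", 1)
    return "\n".join(lines), target + 1, "missing_parenthesis"

snippets = [
    "def add(a, b):\n    return a + b",
    "def greet(name):\n    print(f'hello, {name}')",
]
haystack, bug_line, bug_type = insert_bug(build_haystack(snippets), depth=0.5)
prompt = (
    "The following Python code contains exactly one syntactic bug.\n"
    "Report the line number and the type of the bug.\n\n" + haystack
)
```

The model's answer (line number plus bug type) can then be scored against `bug_line` and `bug_type`, with haystack size and insertion depth varied to probe long-context retrieval.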
TLDR
1. GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than on the BABILong benchmark at the same context length and target depth, indicating that LLMs struggle more with code-based retrieval than text-based retrieval at long context lengths.
2. GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models.
3. Gemini 1.5-Pro achieved roughly 3x the average retrieval accuracy of Gemini 1.0-Pro, and, as Google claimed, its performance stays consistent across all context lengths and target depths.
4. Codestral 22B performed on par with GPT-3.5-Turbo and Llama3-70B despite being a significantly smaller model.
Results
| Provider | Model Name | Average Retrieval Accuracy (%) |
|---|---|---|
| OpenAI | GPT-4o | 82.72 |
| OpenAI | GPT-4-Turbo | 78.08 |
| Anthropic | Claude 3 Opus | 59.52 |
| Google | Gemini 1.5-Pro | 52.48 |
| Google | Gemini 1.5-Flash | 49.44 |
| OpenAI | GPT-3.5-Turbo | 36.48 |
| Mistral AI | Codestral 22B | 36.32 |
| Meta | Llama3-70B | 35.36 |
| Cohere | Command-R+ | 30.08 |
| Google | Gemini-1.0-Pro | 16.32 |
| Alibaba | CodeQwen 1.5 | 0.48 |
