Bug In The Code Stack
Our new code-focused needle-in-the-haystack test measures how well LLMs can find bugs in code.
Context
What
In collaboration with Andy Lee (from Wat.ai), we built a new benchmark called 'Bug In The Code Stack' (BICS) to test how well LLMs can find syntactic bugs in large Python codebases.
Why
As LLMs' context window sizes grow, their use as coding assistants for large codebases is increasing. We wanted to understand how longer context lengths impact their retrieval performance.
How
We use auto-assembled Python source code as the haystack and a syntactic bug placed within the source code as the needle. The LLM is tasked with finding the line number and the type of the bug.
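For concreteness, here is a minimal sketch of how one such test case could be generated. The filler functions, the bug taxonomy (`missing_colon`, `missing_parenthesis`), the `build_test_case` helper, and the prompt wording are all illustrative assumptions, not the actual BICS implementation.

```python
import random

# Illustrative filler functions; the real benchmark auto-assembles much larger
# Python source files to reach each target context length.
FILLER_FUNCTIONS = [
    "def add(a, b):\n    return a + b\n",
    "def greet(name):\n    return f'Hello, {name}!'\n",
    "def square(x):\n    return x * x\n",
]

# Hypothetical syntactic bug types and how each one corrupts a line of code.
BUG_INJECTORS = {
    "missing_colon": lambda line: line.replace(":", "", 1),
    "missing_parenthesis": lambda line: line.replace(")", "", 1),
}


def build_test_case(num_snippets: int, depth: float, bug_type: str, seed: int = 0):
    """Assemble a Python haystack and plant one syntactic bug (the needle).

    `depth` is the relative position of the needle (0.0 = top, 1.0 = bottom).
    Returns the prompt plus the ground truth: (1-indexed line number, bug type).
    """
    rng = random.Random(seed)
    code = "".join(rng.choice(FILLER_FUNCTIONS) for _ in range(num_snippets))
    lines = code.splitlines()

    # Both bug types above need a `def ...(...):` line, so corrupt the
    # candidate line closest to the requested depth.
    candidates = [i for i, line in enumerate(lines) if line.startswith("def ")]
    target = min(candidates, key=lambda i: abs(i - depth * (len(lines) - 1)))
    lines[target] = BUG_INJECTORS[bug_type](lines[target])

    prompt = (
        "The following Python code contains exactly one syntactic bug.\n"
        "Report the line number and the type of the bug.\n\n"
        + "\n".join(lines)
    )
    return prompt, (target + 1, bug_type)


if __name__ == "__main__":
    prompt, answer = build_test_case(num_snippets=50, depth=0.5,
                                     bug_type="missing_colon")
    print(answer)  # (51, 'missing_colon') with these toy snippets
```

In the benchmark itself, the two knobs being swept are context length and target depth, which is what produces the per-length, per-depth accuracy breakdowns summarized below.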
TLDR
1. GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than on the BABILong benchmark at the same context length and target depth, indicating that LLMs struggle more with code-based retrieval than with text-based retrieval at long context lengths.
2. GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models.
3. Gemini 1.5-Pro is roughly 3x more accurate than Gemini 1.0-Pro, and, as Google claimed, its performance stays constant across all context lengths and target depths.
4. Codestral 22B performed on par with GPT-3.5-Turbo and Llama3-70B despite being a significantly smaller model.
Results
Provider | Model Name | Average Retrieval Accuracy (%) |
---|---|---|
OpenAI | GPT-4o | 82.72 |
OpenAI | GPT-4-Turbo | 78.08 |
Anthropic | Claude 3 Opus | 59.52 |
Google | Gemini 1.5-Pro | 52.48 |
Google | Gemini 1.5-Flash | 49.44 |
OpenAI | GPT-3.5-Turbo | 36.48 |
Mistral AI | Codestral 22B | 36.32 |
Meta | Llama3-70B | 35.36 |
Cohere | Command-R+ | 30.08 |
Google | Gemini-1.0-Pro | 16.32 |
Alibaba | CodeQwen 1.5 | 0.48 |