Bug In The Code Stack

    Our new code-focused needle-in-the-haystack test measures how well LLMs can find bugs in code.

    Context

    • What

      In collaboration with Andy Lee (from Wat.ai), we built a new benchmark called 'Bug In The Code Stack' (BICS) to test how well LLMs can find syntactic bugs in large Python codebases.

    • Why

      As LLMs' context window sizes grow, their use as coding assistants for large codebases is increasing. We wanted to understand how longer context lengths impact their retrieval performance.

    • How

      We use auto-assembled Python source code as the haystack and a syntactic bug placed within that source code as the needle. The LLM is tasked with identifying the line number and the type of the bug (see the sketch below).
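
    To make the setup concrete, here is a minimal sketch of how a haystack could be assembled and a bug injected at a target depth. The snippet pool, bug type label, and `build_haystack` helper are illustrative assumptions, not the benchmark's actual code.

    ```python
    # Minimal sketch of a BICS-style task (illustrative assumptions only).
    import random

    # Hypothetical pool of clean Python snippets used to pad the haystack.
    CLEAN_SNIPPETS = [
        "def add(a, b):\n    return a + b\n",
        "def greet(name):\n    return 'Hello, ' + name\n",
    ]

    # A snippet containing a single syntactic bug (missing closing parenthesis).
    BUGGY_SNIPPET = "def area(radius):\n    return 3.14159 * (radius ** 2\n"
    BUG_TYPE = "missing_parenthesis"


    def build_haystack(num_snippets: int, target_depth: float) -> tuple[str, int]:
        """Assemble a Python haystack and insert the buggy snippet at target_depth.

        Returns the assembled source code and the 1-indexed line number
        where the bug starts.
        """
        snippets = [random.choice(CLEAN_SNIPPETS) for _ in range(num_snippets)]
        insert_at = int(target_depth * num_snippets)
        snippets.insert(insert_at, BUGGY_SNIPPET)

        # Count the lines that precede the buggy snippet to get its line number.
        bug_line = sum(s.count("\n") for s in snippets[:insert_at]) + 1
        return "".join(snippets), bug_line


    haystack, bug_line = build_haystack(num_snippets=50, target_depth=0.5)
    prompt = (
        "The following Python code contains exactly one syntactic bug.\n"
        "Report the line number and the type of the bug.\n\n" + haystack
    )
    # `prompt` is sent to the model under test; its answer is compared
    # against (bug_line, BUG_TYPE).
    print(bug_line, BUG_TYPE)
    ```

    Varying `num_snippets` scales the context length, and `target_depth` controls where in the haystack the needle sits.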

    TL;DR

    1. GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than on the BABILONG benchmark at the same context length and target depth, indicating that LLMs struggle more with code-based tasks than with text-based tasks at long context lengths.
    2. GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models.
    3. Gemini 1.5-Pro is ~3x better than Gemini 1.0-Pro; as Google claimed, 1.5-Pro's performance stays consistent across all context lengths and target depths.
    4. Codestral 22B performed on par with GPT-3.5-Turbo and Llama3-70B despite being a significantly smaller model.

    Results

    Provider     Model Name        Average Retrieval Accuracy (%)
    OpenAI       GPT-4o            82.72
    OpenAI       GPT-4-Turbo       78.08
    Anthropic    Claude 3 Opus     59.52
    Google       Gemini 1.5-Pro    52.48
    Google       Gemini 1.5-Flash  49.44
    OpenAI       GPT-3.5-Turbo     36.48
    Mistral AI   Codestral 22B     36.32
    Meta         Llama3-70B        35.36
    Cohere       Command-R+        30.08
    Google       Gemini 1.0-Pro    16.32
    Alibaba      CodeQwen 1.5       0.48
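
    For reference, the sketch below shows one plausible way the accuracy numbers above could be computed, assuming a grading scheme that gives full credit only when both the line number and the bug type match exactly, averaged over all context lengths and target depths. The `grade` helper and the trial data are hypothetical.

    ```python
    # Hedged sketch of scoring and averaging (assumed grading scheme).
    from statistics import mean


    def grade(pred_line: int, pred_type: str, true_line: int, true_type: str) -> float:
        """Full credit only when both the line number and bug type are correct."""
        return 1.0 if (pred_line == true_line and pred_type == true_type) else 0.0


    # Hypothetical per-trial results: (context_length, target_depth, score).
    trials = [
        (500, 0.0, 1.0),
        (500, 0.5, 1.0),
        (16_000, 0.5, 0.0),
    ]

    # Average retrieval accuracy across all context lengths and target depths.
    average_retrieval_accuracy = 100 * mean(score for _, _, score in trials)
    print(f"{average_retrieval_accuracy:.2f}%")
    ```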

    Want to test your model?