BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology

Ludovico Mitchener, Jon M Laurent, Alex Andonian, Benjamin Tenmann, Siddharth Narayanan, Geemi P Wellawatte, Andrew White, Lorenzo Sani et al.

arXiv

Large Language Models (LLMs) and LLM-based agents show great promise in accelerating scientific research.
We present the Bioinformatics Benchmark (BixBench), a dataset of over 50 real-world scenarios of practical biological data analysis with nearly 300 open-answer questions designed to measure the ability of LLM-based agents to explore biological datasets, perform long, multi-step analytical trajectories, and interpret the nuanced results of those analyses.
Even the latest frontier models achieve only 17% accuracy in the open-answer regime, and perform no better than random in a multiple-choice setting.

§ 01 · Benchmark results