Codesota · Papers · Agentic AI · 2025-02-28 · arXiv
Paper
BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology
Ludovico Mitchener, Jon M Laurent, Alex Andonian, Benjamin Tenmann, Siddharth Narayanan, Geemi P Wellawatte, Andrew White, Lorenzo Sani et al.
Large Language Models (LLMs) and LLM-based agents show great promise in accelerating scientific research.
We present the Bioinformatics Benchmark (BixBench), a dataset of over 50 real-world scenarios of practical biological data analysis with nearly 300 open-answer questions designed to measure the ability of LLM-based agents to explore biological datasets, perform long, multi-step analytical trajectories, and interpret the nuanced results of those analyses.
Even the latest frontier models achieve only 17% accuracy in the open-answer regime and perform no better than random in a multiple-choice setting.