BixBench is a benchmark of 50+ real-world biological data-analysis scenarios with ~300 open-answer questions, designed to evaluate LLM agents on long, multi-step analytical trajectories.
Accuracy is the reported evaluation metric for BixBench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher scores are better.
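As a rough illustration of how an accuracy figure might be computed, here is a minimal Python sketch. It assumes each open-answer response has already been graded as correct or incorrect; the grading procedure itself is not described on this page, and the function name `accuracy_percent` and the sample grades are hypothetical.

```python
from typing import Sequence


def accuracy_percent(graded: Sequence[bool]) -> float:
    """Return the share of answers graded correct, as a percentage."""
    if not graded:
        raise ValueError("need at least one graded answer")
    # True counts as 1 and False as 0, so sum() gives the number correct.
    return 100.0 * sum(graded) / len(graded)


# Hypothetical grades for four questions: one correct, three incorrect.
example = [True, False, False, False]
print(accuracy_percent(example))  # 25.0
```

Under this assumption, a model that answers one question in four correctly scores 25.0, and a higher value means more questions answered correctly.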
| Rank | Model | Trust | Score (accuracy, %) | Year | Links |
|---|---|---|---|---|---|
| 01 | GPT-4o | paper | 17 | N/A | Paper |
| 02 | Claude 3.5 Sonnet | paper | 17 | N/A | Paper |