Codesota · Natural Language Processing · Question Answering · BrowseComp
Question Answering · benchmark dataset · 2025 · EN

BrowseComp: A Benchmark for Browsing Agents.

Hard web-browsing QA benchmark with short factual answers that require persistent search over many online sources.

§ 01 · Leaderboard

Best published scores.

16 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: accuracy · higher is better · 16 rows
| # | Model | Org | Submitted | Paper / code | accuracy |
|---|---|---|---|---|---|
| 01 | DeepSeek-V4-Pro Max | DeepSeek | Apr 2026 | pwc-dump · code | 83.40 |
| 02 | Kimi K2.6 | - | Apr 2026 | pwc-dump | 83.20 |
| 03 | MiniMax-M2.5 (Open) | MiniMaxAI | Feb 2026 | pwc-dump · code | 76.30 |
| 04 | DeepSeek-V4-Flash Max | DeepSeek | Apr 2026 | pwc-dump · code | 73.20 |
| 05 | Qwen3.5-397B-A17B (Open) | Alibaba | Feb 2026 | pwc-dump · code | 69 |
| 06 | GLM-5.1 | - | Feb 2026 | GLM-5: from Vibe Coding to Agentic Engineering · code | 68 |
| 07 | Qwen3.5-122B-A10B (Open) | Alibaba | Feb 2026 | pwc-dump · code | 63.80 |
| 08 | GLM-5 (Open) | Zhipu AI | Feb 2026 | GLM-5: from Vibe Coding to Agentic Engineering · code | 62 |
| 09 | Qwen3.5-35B-A3B (Open) | Alibaba | Feb 2026 | pwc-dump · code | 61 |
| 10 | Qwen3.5-27B (Open) | Alibaba | Feb 2026 | pwc-dump · code | 61 |
| 11 | Kimi-K2.5 (Open) | Moonshot.AI | Feb 2026 | Kimi K2.5: Visual Agentic Intelligence · code | 60.60 |
| 12 | Step-3.5-Flash | - | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 51.60 |
| 13 | DeepSeek-V3.2 (Open) | DeepSeek | Dec 2025 | DeepSeek-V3.2: Pushing the Frontier of Open Large Langua… | 51.40 |
| 14 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | - | Dec 2025 | NVIDIA Nemotron 3: Efficient and Open Intelligence | 31.28 |
| 15 | GLM-4.5 (Open) | Zhipu AI | Aug 2025 | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation… · code | 26.40 |
| 16 | GLM-4.5-Air (Open) | Zhipu AI | Aug 2025 | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation… · code | 21.30 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
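
The ordering rule stated for this table (sorted by score, ties broken by submission date) can be sketched in a few lines of Python. This is a minimal illustration, not Codesota's implementation: `Row` and `rank_rows` are hypothetical names, the rows are a subset of the table, and dates stay at the month granularity the table shows. It also assumes the earlier submission wins a tie, which the page does not state.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    submitted: str  # "YYYY-MM", so plain string comparison orders by date
    accuracy: float


def rank_rows(rows: list[Row]) -> list[Row]:
    """Sort by accuracy (higher first); break ties by earlier submission month.

    Assumption: "ties broken by submission date" means the earlier date ranks first.
    """
    return sorted(rows, key=lambda r: (-r.accuracy, r.submitted))


# A subset of the leaderboard rows, deliberately out of order.
rows = [
    Row("GLM-4.5", "2025-08", 26.40),
    Row("Kimi K2.6", "2026-04", 83.20),
    Row("Qwen3.5-27B", "2026-02", 61.0),
    Row("DeepSeek-V4-Pro Max", "2026-04", 83.40),
    Row("DeepSeek-V3.2", "2025-12", 51.40),
]

ranked = rank_rows(rows)
for position, row in enumerate(ranked, start=1):
    print(f"{position:02d} {row.model:<22} {row.submitted} {row.accuracy:.2f}")
```

Because Python's `sorted` is stable, rows that tie on both score and month (such as rows 09 and 10) keep their input order.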
§ 03 · Progress

6 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy
  1. Aug 8, 2025 · GLM-4.5 · Zhipu AI · 26.40
  2. Dec 2, 2025 · DeepSeek-V3.2 · DeepSeek · 51.40
  3. Feb 2, 2026 · Kimi-K2.5 · Moonshot.AI · 60.60
  4. Feb 12, 2026 · MiniMax-M2.5 · MiniMaxAI · 76.30
  5. Apr 20, 2026 · Kimi K2.6 · 83.20
  6. Apr 24, 2026 · DeepSeek-V4-Pro Max · DeepSeek · 83.40
Fig 3 · SOTA-setting models only. 6 entries span Aug 2025 to Apr 2026.
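
The selection rule described in this section (keep only entries that broke the previous record) amounts to a running maximum over date-ordered results. The sketch below is one way to express it, not the site's code: `sota_steps` is a hypothetical name, and the entry labelled `hypothetical-model` is an invented placeholder added only to show a non-record result being filtered out. The other six entries are the dated rows listed in this section.

```python
from datetime import date


def sota_steps(results):
    """Return the entries that strictly beat every earlier score.

    `results` is an iterable of (date, model, accuracy) tuples.
    """
    steps = []
    best = float("-inf")
    for when, model, accuracy in sorted(results, key=lambda r: r[0]):
        if accuracy > best:
            best = accuracy
            steps.append((when, model, accuracy))
    return steps


results = [
    (date(2026, 4, 24), "DeepSeek-V4-Pro Max", 83.40),
    (date(2025, 8, 8), "GLM-4.5", 26.40),
    (date(2026, 2, 12), "MiniMax-M2.5", 76.30),
    (date(2025, 12, 2), "DeepSeek-V3.2", 51.40),
    (date(2026, 4, 20), "Kimi K2.6", 83.20),
    (date(2026, 2, 2), "Kimi-K2.5", 60.60),
    # Invented placeholder, not a real leaderboard entry: below the record at its date.
    (date(2026, 3, 1), "hypothetical-model", 70.00),
]

steps = sota_steps(results)
for when, model, accuracy in steps:
    print(f"{when:%b %d, %Y} · {model} · {accuracy:.2f}")
```

The strict `>` comparison means a result that merely equals the current best does not count as a new step; whether the site treats exact ties that way is an assumption.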
§ 04 · Literature

6 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
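
As a rough pre-submission check, the five requirements above could be encoded as a small manifest and validated locally. Everything in this sketch is an assumption: the page does not publish a submission schema, so the `Submission` fields, the `missing_items` helper, and all example values (URL, commit hash, contact address, score) are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional

# This dataset declares a single metric on its leaderboard.
DATASET_METRICS = ("accuracy",)


@dataclass
class Submission:
    """Hypothetical manifest; field names are illustrative, not a Codesota schema."""
    checkpoint_or_endpoint: str = ""
    repro_commit: str = ""
    seed: Optional[int] = None
    environment: dict[str, str] = field(default_factory=dict)
    scores: dict[str, float] = field(default_factory=dict)
    contact: str = ""


def missing_items(sub: Submission) -> list[str]:
    """Return the checklist items that the manifest does not yet satisfy."""
    problems = []
    if not sub.checkpoint_or_endpoint:
        problems.append("01 public checkpoint or API endpoint")
    if not sub.repro_commit or sub.seed is None:
        problems.append("02 reproduction script with frozen commit + seed")
    if "python" not in sub.environment:
        problems.append("03 declared evaluation environment")
    if set(sub.scores) != set(DATASET_METRICS):
        problems.append("04 one row per declared metric")
    if not sub.contact:
        problems.append("05 contact for follow-up")
    return problems


# Placeholder values only; none of these refer to a real submission.
example = Submission(
    checkpoint_or_endpoint="https://example.org/my-model",
    repro_commit="0123abc",
    seed=0,
    environment={"python": "3.11"},
    scores={"accuracy": 0.0},
    contact="maintainer@example.org",
)

print(missing_items(example))       # nothing missing for the filled-in example
print(missing_items(Submission()))  # every item missing for an empty manifest
```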