Codesota · Registry log · 9,080 rows · 7,132 new this month · Showing 200
Editorial · Registry log
Every score we've added, in order.
The append-only public ledger of every benchmark result on Codesota. Each row records when it was written, when the result itself is dated, which model it concerns, what value was claimed, and where the citation lives. New-SOTA rows are marked in colour; unverified rows still appear, but are labelled as such.
This is the audit trail. If a score is wrong, this is where the error will be visible; if a source is missing, this is where you'll see the gap.
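As a rough illustration of the fields listed above, one row could be modelled as an immutable record. This is a minimal sketch, not Codesota's actual schema: the class name `LogRow` and every field name are hypothetical, and the source link is left as `None` because the link target is not part of the text shown here.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)  # frozen mirrors the append-only idea: rows are never edited
class LogRow:
    written_at: datetime          # when the row was written to the log
    model: str                    # who the model was
    dataset: str                  # benchmark the score belongs to
    value: str                    # claimed value as displayed, e.g. "87.6%"
    delta: Optional[float]        # None stands for "first result"
    new_sota: bool                # row is marked as a new state of the art
    verified: bool                # unverified rows still show, but labelled
    source_url: Optional[str]     # where the citation lives (hypothetical field)
    result_date: Optional[date]   # when the result itself is dated, if known

    def label(self) -> str:
        """Build the status text shown next to a row."""
        parts = []
        if self.new_sota:
            parts.append("NEW SOTA")
        parts.append("verified" if self.verified else "unverified")
        return " · ".join(parts)


# Values taken from the 2026-04-23 20:40 row for Claude Opus 4.7.
row = LogRow(
    written_at=datetime(2026, 4, 23, 20, 40),
    model="Claude Opus 4.7",
    dataset="SWE-Bench Verified",
    value="87.6%",
    delta=6.70,
    new_sota=True,
    verified=True,
    source_url=None,
    result_date=date(2026, 4, 18),
)
print(row.label())
```

Keeping `written_at` and `result_date` as separate fields reflects the distinction the log draws between when a row was added and when the underlying result is dated.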
2026-04-23 · 105 rows
- 20:40 · Gemini 3 Flash · LiveCodeBench · 90.8% · -0.90 · source ↗ · verified · dated 2026-03-15
- 20:40 · Gemini 3 Pro Preview · LiveCodeBench · 91.7% · NEW SOTA · +6.70 · source ↗ · verified · dated 2026-03-15
- 20:40 · Claude Opus 4.7 · SWE-Bench Verified · 87.6% · NEW SOTA · +6.70 · source ↗ · verified · dated 2026-04-18
- 18:58 · Qwen3.6 Plus · MMMU-Pro · 73.8% · -8.20 · source ↗ · verified · dated 2026-03-15
- 18:58 · GPT-5.1 · MMMU-Pro · 76.5% · -5.50 · source ↗ · verified · dated 2025-11-13
- 18:58 · Gemini 3 Pro · MMMU-Pro · 80.0% · -2.00 · source ↗ · verified · dated 2026-01-15
- 18:58 · GPT-5.2 · MMMU-Pro · 81.0% · -1.00 · source ↗ · verified · dated 2025-12-11
- 18:58 · Gemini 3.1 Pro Preview · MMMU-Pro · 82.0% · NEW SOTA · first result · source ↗ · verified · dated 2026-03-18
- 18:57 · Qwen3.5-27B · MMMU · 82.3% · -3.70 · source ↗ · verified · dated 2025-09-01
- 18:57 · Qwen3.5-122B-A10B · MMMU · 83.9% · -2.10 · source ↗ · verified · dated 2025-09-01
- 18:57 · Qwen3.5-397B-A17B · MMMU · 83.9% · -2.10 · source ↗ · verified · dated 2025-09-01
- 18:57 · GPT-5.1 · MMMU · 85.4% · -0.60 · source ↗ · verified · dated 2025-11-13
- 18:57 · GPT-5.1 Instant · MMMU · 85.4% · -0.60 · source ↗ · verified · dated 2025-11-13
- 18:57 · GPT-5.1 Thinking · MMMU · 85.4% · -0.60 · source ↗ · verified · dated 2025-11-13
- 18:57 · Qwen3.6 Plus · MMMU · 86.0% · NEW SOTA · +12.70 · source ↗ · verified · dated 2026-03-15
- 10:52 · Gemini 3 Pro · SWE-Bench · 77.4% · -4.70 · source ↗ · verified · dated 2026-01-01
- 10:52 · GLM-5 · SWE-Bench · 77.8% · -4.30 · source ↗ · verified · dated 2026-01-01
- 10:52 · Claude Opus 4.5 · SWE-Bench · 79.2% · -2.90 · source ↗ · verified · dated 2026-01-01
- 10:52 · Sonar Foundation · SWE-Bench · 79.2% · -2.90 · source ↗ · verified · dated 2026-01-01
- 10:52 · GPT-5.2 · SWE-Bench · 80.0% · -2.10 · source ↗ · verified · dated 2026-02-01
- 10:52 · MiniMax M2.5 · SWE-Bench · 80.2% · -1.90 · source ↗ · verified · dated 2026-01-01
- 10:52 · Claude Opus 4.6 · SWE-Bench · 80.8% · -1.30 · source ↗ · verified · dated 2026-02-01
- 10:52 · Claude Opus 4.5 · SWE-Bench · 80.9% · -1.20 · source ↗ · verified · dated 2026-02-01
- 10:52 · BERT + AoA · SQuAD v2.0 · 88.6% · -2.80 · source ↗ · verified · dated 2019-03-01
- 10:52 · BERT (Google AI) · SQuAD v2.0 · 83.1% · -8.30 · source ↗ · verified · dated 2018-11-01
- 10:52 · Logistic Regression (SQuAD baseline) · SQuAD v2.0 · 51.0% · -40.40 · source ↗ · verified · dated 2016-06-01
- 10:52 · SLQA+ (single model) · SQuAD v2.0 · 87.0% · -4.38 · source ↗ · verified · dated 2018-01-01
- 10:52 · Hanvon_model (single model) · SQuAD v2.0 · 87.1% · -4.28 · source ↗ · verified · dated 2019-09-01
- 10:52 · Insight-baseline-BERT (single model) · SQuAD v2.0 · 87.6% · -3.76 · source ↗ · verified · dated 2019-04-01
- 10:52 · XLNet (single, Verified XiaoPAI) · SQuAD v2.0 · 88.0% · -3.40 · source ↗ · verified · dated 2019-09-01
- 10:52 · SpanBERT (single model) · SQuAD v2.0 · 88.7% · -2.69 · source ↗ · verified · dated 2019-07-01
- 10:52 · BERT + DAE + AoA (single model) · SQuAD v2.0 · 88.6% · -2.78 · source ↗ · verified · dated 2019-03-01
- 10:52 · XLNet+Verifier (single, Ping An) · SQuAD v2.0 · 89.1% · -2.34 · source ↗ · verified · dated 2019-08-01
- 10:52 · XLNet+Verifier (single, Google/CMU) · SQuAD v2.0 · 89.1% · -2.32 · source ↗ · verified · dated 2019-10-01
- 10:52 · BERT + ConvLSTM + MTL + Verifier (ensemble) · SQuAD v2.0 · 89.3% · -2.11 · source ↗ · verified · dated 2019-03-01
- 10:52 · RoBERTa+Verify (single model) · SQuAD v2.0 · 89.6% · -1.81 · source ↗ · verified · dated 2019-11-01
- 10:52 · Enhanced Albert+Verifier3 (ensemble) · SQuAD v2.0 · 89.8% · -1.62 · source ↗ · verified · dated 2020-05-01
- 10:52 · RoBERTa (single model) · SQuAD v2.0 · 89.8% · -1.61 · source ↗ · verified · dated 2020-07-01
- 10:51 · Claude Sonnet 5 · SWE-Bench · 82.1% · =0.0 · source ↗ · verified · dated 2026-02-01
- 10:51 · MiniMax M2.5 · SWE-Bench · 80.2% · -1.90 · source ↗ · verified · dated 2026-01-01
- 10:51 · Claude Opus 4.5 · SWE-Bench · 78.0% · -4.10 · source ↗ · verified · dated 2025-12-01
- 10:51 · Claude Sonnet 4.5 · SWE-Bench · 70.8% · -11.30 · source ↗ · verified · dated 2025-09-01
- 10:51 · GPT-4.5 · SWE-Bench · 62.0% · -20.10 · source ↗ · verified · dated 2025-06-01
- 10:51 · Claude Opus 4 · SWE-Bench · 55.2% · -26.90 · source ↗ · verified · dated 2025-03-01
- 10:51 · Claude 3.5 Sonnet v2 · SWE-Bench · 49.0% · -33.10 · source ↗ · verified · dated 2024-12-01
- 10:51 · o1-preview · SWE-Bench · 36.2% · -45.90 · source ↗ · verified · dated 2024-10-01
- 10:51 · Claude 3.5 Sonnet · SWE-Bench · 27.0% · -55.10 · source ↗ · verified · dated 2024-08-01
- 10:51 · GPT-4o · SWE-Bench · 19.0% · -63.10 · source ↗ · verified · dated 2024-06-01
- 10:51 · GPT-4 Turbo · SWE-Bench · 12.5% · -69.60 · source ↗ · verified · dated 2024-03-01
- 10:51 · Claude 2 · SWE-Bench · 2.0% · -80.14 · source ↗ · verified · dated 2023-10-01
- 10:51 · DeepSeek-Coder 33B · SWE-Bench · 15.6% · -66.50 · source ↗ · verified · dated 2024-06-01
- 10:51 · StarCoder2 15B · SWE-Bench · 18.3% · -63.80 · source ↗ · verified · dated 2024-10-01
- 10:51 · CodeLlama 70B · SWE-Bench · 29.8% · -52.30 · source ↗ · verified · dated 2024-12-01
- 10:51 · Qwen2.5-Coder 32B · SWE-Bench · 55.4% · -26.70 · source ↗ · verified · dated 2025-06-01
- 10:51 · DeepSeek-Coder V2.5 · SWE-Bench · 68.2% · -13.90 · source ↗ · verified · dated 2025-08-01
- 10:51 · Qwen3 72B · SWE-Bench · 72.4% · -9.70 · source ↗ · verified · dated 2025-10-01
- 10:51 · Step-3.5-Flash · SWE-Bench · 74.4% · -7.70 · source ↗ · verified · dated 2026-01-01
- 10:51 · DeepSeek V3.5 · SWE-Bench · 74.6% · -7.50 · source ↗ · verified · dated 2025-11-01
- 10:51 · Qwen3-Max-Thinking · SWE-Bench · 75.3% · -6.80 · source ↗ · verified · dated 2026-02-01
- 10:51 · Gemini 3 Flash · SWE-Bench · 75.8% · -6.30 · source ↗ · verified · dated 2026-02-01
- 10:51 · DeepSeek R1 · SWE-Bench · 76.3% · -5.80 · source ↗ · verified · dated 2025-12-01
- 10:51 · Kimi K2.5 · SWE-Bench · 76.8% · -5.30 · source ↗ · verified · dated 2026-01-01
- 10:51 · Claude Sonnet 4.5 · SWE-Bench · 77.2% · -4.90 · source ↗ · verified · dated 2025-12-01
- 10:51 · Gemini 3 Pro · SWE-Bench · 77.4% · -4.70 · source ↗ · verified · dated 2026-01-01
- 10:51 · GLM-5 · SWE-Bench · 77.8% · -4.30 · source ↗ · verified · dated 2026-01-01
- 10:51 · Claude Opus 4.6 · SWE-Bench · 79.8% · -2.30 · source ↗ · verified · dated 2026-02-01
- 10:51 · GPT-5.2 · SWE-Bench · 80.0% · -2.10 · source ↗ · verified · dated 2026-02-01
- 10:51 · MiniMax M2.5 · SWE-Bench · 80.2% · -1.90 · source ↗ · verified · dated 2026-01-01
- 10:51 · Claude Opus 4.5 · SWE-Bench · 80.9% · -1.20 · source ↗ · verified · dated 2026-02-01
- 10:51 · Claude Sonnet 5 · SWE-Bench · 82.1% · NEW SOTA · first result · source ↗ · verified · dated 2026-02-01
- 10:51 · SENet · ImageNet · 97.8% · NEW SOTA · +1.32 · source ↗ · verified · dated 2017-01-01
- 10:51 · ResNet-152 · ImageNet · 96.4% · NEW SOTA · +3.13 · source ↗ · verified · dated 2015-01-01
- 10:51 · GoogLeNet · ImageNet · 93.3% · NEW SOTA · +2.30 · source ↗ · verified · dated 2014-01-01
- 10:51 · AlexNet · ImageNet · 83.6% · -7.40 · source ↗ · verified · dated 2012-01-01
- 10:51 · NEC-UIUC · ImageNet · 71.8% · -19.20 · source ↗ · verified · dated 2010-01-01
- 10:51 · convnext_base.fb_in22k_ft_in1k · ImageNet · 86.3% · -4.70 · source ↗ · verified · dated 2022-01-01
- 10:51 · swin_large.ms_in22k_ft_in1k · ImageNet · 86.3% · -4.67 · source ↗ · verified · dated 2021-03-01
- 10:51 · nextvit_large.bd_ssld_6m_in1k_384 · ImageNet · 86.5% · -4.46 · source ↗ · verified · dated 2022-11-01
- 10:51 · coatnet_2_rw_224.sw_in12k_ft_in1k · ImageNet · 86.6% · -4.42 · source ↗ · verified · dated 2022-09-01
- 10:51 · maxvit_base_tf_512.in1k · ImageNet · 86.6% · -4.40 · source ↗ · verified · dated 2023-04-01
- 10:51 · InternViT-6B (InternVL) · ImageNet · 88.2% · -2.80 · source ↗ · verified · dated 2024-06-01
- 10:51 · ViT-22B/14 · ImageNet · 89.5% · -1.49 · source ↗ · verified · dated 2023-02-01
- 10:51 · EVA-02 (ViT-L/14+) · ImageNet · 90.0% · -1.00 · source ↗ · verified · dated 2023-03-01
- 10:51 · SoViT-400M/14 · ImageNet · 90.3% · -0.70 · source ↗ · verified · dated 2023-05-01
- 10:51 · CoCa (ViT-G/14) · ImageNet · 91.0% · NEW SOTA · first result · source ↗ · verified · dated 2022-05-01
- 10:51 · T5-11B · GLUE · 89.3% · -2.00 · source ↗ · verified · dated 2019-10-01
- 10:51 · DeBERTa (ensemble) · GLUE · 90.3% · -1.00 · source ↗ · verified · dated 2021-01-01
- 10:51 · ERNIE 3.0 · GLUE · 90.6% · -0.70 · source ↗ · verified · dated 2021-07-01
- 10:51 · ST-MoE-32B · GLUE · 91.2% · -0.10 · source ↗ · verified · dated 2022-02-01
- 10:51 · Vega v2 (6B) · GLUE · 91.3% · NEW SOTA · first result · source ↗ · verified · dated 2022-10-01
- 10:51 · clearOCR · OmniDocBench · 31.7% · -65.80 · source ↗ · verified
- 10:51 · mistral-ocr-2512 · OmniDocBench · 79.8% · -17.75 · source ↗ · verified
- 10:51 · Mistral OCR 3 · OmniDocBench · 79.8% · -17.75 · source ↗ · verified
- 10:51 · Codex (davinci-002) · HumanEval · 46.9% · -50.40 · source ↗ · verified · dated 2021-07-01
- 10:51 · DeepSeek-Coder-33B-Instruct · HumanEval · 79.3% · -18.00 · source ↗ · verified · dated 2023-11-01
- 10:51 · Codestral 25.01 · HumanEval · 85.3% · -12.00 · source ↗ · verified · dated 2025-01-01
- 10:51 · GPT-4 Turbo · HumanEval · 86.6% · -10.70 · source ↗ · verified · dated 2023-11-01
- 10:51 · Llama-3.3-70B-Instruct · HumanEval · 88.4% · -8.90 · source ↗ · verified · dated 2024-12-01
- 10:51 · GPT-4o · HumanEval · 90.2% · -7.10 · source ↗ · verified · dated 2024-05-01
- 10:51 · DeepSeek-Coder-V2-Instruct · HumanEval · 90.2% · -7.10 · source ↗ · verified · dated 2024-06-01
- 10:51 · Qwen2.5-Coder 32B · HumanEval · 92.7% · -4.60 · source ↗ · verified · dated 2025-03-01
- 10:51 · Claude Sonnet 4.6 · HumanEval · 94.1% · -3.20 · source ↗ · verified · dated 2026-01-01
- 10:51 · o3 · HumanEval · 94.8% · -2.50 · source ↗ · verified · dated 2025-04-01
- 10:51 · GPT-5 · HumanEval · 95.1% · -2.20 · source ↗ · verified · dated 2025-12-01
- 10:51 · Claude Opus 4.6 · HumanEval · 96.3% · -1.00 · source ↗ · verified · dated 2026-01-01
2026-04-13 · 14 rows
- 23:16 · LlamaParse Agentic · ParseBench · 84.9% · NEW SOTA · +13.00 · source ↗ · verified
- 23:16 · LlamaParse Cost Effective · ParseBench · 71.9% · NEW SOTA · +0.90 · source ↗ · verified
- 23:16 · LandingAI · ParseBench · 45.2% · -25.80 · source ↗ · verified
- 23:16 · Extend · ParseBench · 55.8% · -15.20 · source ↗ · verified
- 23:16 · Reducto · ParseBench · 67.8% · -3.20 · source ↗ · verified
- 23:16 · Azure Document Intelligence · ParseBench · 59.6% · -11.40 · source ↗ · verified
- 23:16 · Google Cloud Document AI · ParseBench · 50.4% · -20.60 · source ↗ · verified
- 23:16 · AWS Textract · ParseBench · 47.9% · -23.10 · source ↗ · verified
- 23:16 · Docling · ParseBench · 50.6% · -20.40 · source ↗ · verified
- 23:16 · Dots OCR 1.5 · ParseBench · 55.8% · -15.20 · source ↗ · verified
- 23:16 · Qwen3-VL-4B · ParseBench · 62.0% · -9.00 · source ↗ · verified
- 23:16 · Gemini 3 Flash · ParseBench · 71.0% · NEW SOTA · +24.20 · source ↗ · verified
- 23:16 · Anthropic Haiku 4.5 · ParseBench · 45.2% · -1.60 · source ↗ · verified
- 23:16 · GPT-5-mini · ParseBench · 46.8% · NEW SOTA · first result · source ↗ · verified
2026-04-12 · 9 rows
- 20:20 · o3 · LiveCodeBench Pro · 1010.00 · -1429.00 · source ↗ · verified
- 20:20 · DeepSeek R1 · LiveCodeBench Pro · 1161.00 · -1278.00 · source ↗ · verified
- 20:20 · Gemini 2.5 Flash · LiveCodeBench Pro · 1288.00 · -1151.00 · source ↗ · verified
- 20:20 · Claude Sonnet 4.5 · LiveCodeBench Pro · 1412.00 · -1027.00 · source ↗ · verified
- 20:20 · Qwen3-235B-A22B · LiveCodeBench Pro · 1673.00 · -766.00 · source ↗ · verified
- 20:20 · Gemini 2.5 Pro · LiveCodeBench Pro · 1769.00 · -670.00 · source ↗ · verified
- 20:20 · o4-mini · LiveCodeBench Pro · 2092.00 · -347.00 · source ↗ · verified
- 20:20 · GPT-5 · LiveCodeBench Pro · 2176.00 · -263.00 · source ↗ · verified
- 20:20 · Gemini 3 Pro · LiveCodeBench Pro · 2439.00 · NEW SOTA · first result · source ↗ · verified
2026-04-09 · 72 rows
- 02:01 · CPN (Complementary Proposal Network) · ic19-art · 79.9% · -6.50 · source ↗ · verified · dated 2024-02-18
- 02:01 · CPN (Complementary Proposal Network) · ic19-art · 83.6% · -2.80 · source ↗ · verified · dated 2024-02-18
- 02:01 · CPN (Complementary Proposal Network) · ic19-art · 81.7% · -4.70 · source ↗ · verified · dated 2024-02-18
- 02:00 · PLBART · codesearchnet---java · 18.4% · -4.16 · source ↗ · verified
- 02:00 · CoTexT · codesearchnet---java · 19.1% · -3.55 · source ↗ · verified
- 02:00 · ProphetNet-X · codesearchnet---java · 19.4% · -3.22 · source ↗ · verified
- 02:00 · PolyglotCodeBERT · codesearchnet---java · 20.1% · -2.50 · source ↗ · verified
- 02:00 · BART-base (STSM) · e2e · 2.2% · -69.50 · source ↗ · verified · dated 2024-01-19
- 02:00 · BART-base (STSM) · e2e · 68.8% · -2.94 · source ↗ · verified · dated 2024-01-19
- 02:00 · BART-base (STSM) · e2e · 45.6% · -26.10 · source ↗ · verified · dated 2024-01-19
- 02:00 · BART-base (STSM) · e2e · 8.5% · -63.24 · source ↗ · verified · dated 2024-01-19
- 02:00 · BART-base (STSM) · e2e · 65.7% · -5.96 · source ↗ · verified · dated 2024-01-19
- 02:00 · FLAN-T5-base (STSM) · e2e · 2.1% · -69.58 · source ↗ · verified · dated 2024-01-19
- 02:00 · FLAN-T5-base (STSM) · e2e · 67.8% · -3.85 · source ↗ · verified · dated 2024-01-19
- 02:00 · FLAN-T5-base (STSM) · e2e · 45.5% · -26.16 · source ↗ · verified · dated 2024-01-19
- 02:00 · FLAN-T5-base (STSM) · e2e · 8.5% · -63.21 · source ↗ · verified · dated 2024-01-19
- 02:00 · FLAN-T5-base (STSM) · e2e · 65.7% · -6.05 · source ↗ · verified · dated 2024-01-19
- 02:00 · T5-base (STSM) · e2e · 2.3% · -69.43 · source ↗ · verified · dated 2024-01-19
- 02:00 · T5-base (STSM) · e2e · 69.0% · -2.73 · source ↗ · verified · dated 2024-01-19
- 02:00 · T5-base (STSM) · e2e · 45.7% · -26.00 · source ↗ · verified · dated 2024-01-19
- 02:00 · T5-base (STSM) · e2e · 8.6% · -63.11 · source ↗ · verified · dated 2024-01-19
- 02:00 · T5-base (STSM) · e2e · 67.0% · -4.75 · source ↗ · verified · dated 2024-01-19
- 02:00 · ESALE · codesearchnet---javascript · 15.6% · -10.00 · source ↗ · verified · dated 2024-07-01
- 02:00 · UniXcoder · codesearchnet---javascript · 15.5% · -10.15 · source ↗ · verified · dated 2024-07-01
- 02:00 · GraphCodeBERT+AdvFusion · codesearchnet---javascript · 15.9% · -9.72 · source ↗ · verified · dated 2024-12-01
- 02:00 · CodeBERT+AdvFusion · codesearchnet---javascript · 16.8% · -8.81 · source ↗ · verified · dated 2024-12-01
- 02:00 · GraphCodeBERT · codesearchnet---javascript · 14.8% · -10.82 · source ↗ · verified · dated 2024-12-01
- 02:00 · CodeT5-base · codesearchnet---javascript · 16.2% · -9.37 · source ↗ · verified · dated 2024-12-01
- 02:00 · HTLM (prefix-tuning) · e2e · 2.5% · -69.25 · source ↗ · verified · dated 2021-07-14
- 02:00 · HTLM (prefix-tuning) · e2e · 71.2% · -0.50 · source ↗ · verified · dated 2021-07-14
- 02:00 · HTLM (prefix-tuning) · e2e · 46.1% · -25.60 · source ↗ · verified · dated 2021-07-14
- 02:00 · HTLM (prefix-tuning) · e2e · 8.8% · -62.85 · source ↗ · verified · dated 2021-07-14
- 02:00 · HTLM (prefix-tuning) · e2e · 70.1% · -1.60 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Large (prefix-tuning) · e2e · 2.5% · -69.23 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Large (prefix-tuning) · e2e · 71.7% · NEW SOTA · +0.30 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Large (prefix-tuning) · e2e · 46.2% · -25.20 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Large (prefix-tuning) · e2e · 8.8% · -62.55 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Large (prefix-tuning) · e2e · 70.3% · -1.10 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Medium (prefix-tuning) · e2e · 2.5% · -68.91 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Medium (prefix-tuning) · e2e · 71.4% · NEW SOTA · +0.40 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Medium (prefix-tuning) · e2e · 46.1% · -24.90 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Medium (prefix-tuning) · e2e · 8.8% · -62.19 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Medium (prefix-tuning) · e2e · 69.7% · -1.30 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Medium (fine-tuning) · e2e · 2.5% · -68.53 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Medium (fine-tuning) · e2e · 71.0% · NEW SOTA · +0.20 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Medium (fine-tuning) · e2e · 46.2% · -24.60 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Medium (fine-tuning) · e2e · 8.6% · -62.18 · source ↗ · verified · dated 2021-07-14
- 02:00 · GPT-2-Medium (fine-tuning) · e2e · 68.2% · -2.60 · source ↗ · verified · dated 2021-07-14
- 01:58 · Oracle-BERT (HowSumm-Method) · howsumm-method · 63.2% · NEW SOTA · +4.30 · source ↗ · verified
- 01:58 · Oracle-BOW (HowSumm-Method) · howsumm-method · 58.9% · NEW SOTA · +5.40 · source ↗ · verified
- 01:58 · Random Baseline (HowSumm-Method) · howsumm-method · 41.5% · -12.00 · source ↗ · verified
- 01:57 · Oracle-BERT · howsumm-step · 46.8% · NEW SOTA · +0.80 · source ↗ · verified · dated 2021-10-07
- 01:57 · GreedyRel (query: step + method + article titles) · howsumm-step · 30.1% · -15.90 · source ↗ · verified · dated 2021-10-07
- 01:57 · Oracle-BOW · howsumm-step · 46.0% · NEW SOTA · +6.40 · source ↗ · verified · dated 2021-10-07
- 01:57 · Oracle-HierSumm · howsumm-step · 35.6% · -4.00 · source ↗ · verified · dated 2021-10-07
- 01:57 · Random Baseline (HowSumm) · howsumm-step · 23.0% · -16.60 · source ↗ · verified · dated 2021-10-07
- 01:57 · InternVL2-76B · CC-OCR · 61.6% · -21.65 · source ↗ · verified
- 01:57 · InternVL2-76B · CC-OCR · 35.3% · -47.92 · source ↗ · verified
- 01:57 · GOT-OCR2.0 · CC-OCR · 39.2% · -44.07 · source ↗ · verified
- 01:57 · Claude 3.5 Sonnet · CC-OCR · 47.8% · -35.46 · source ↗ · verified
- 01:57 · GPT-4o · CC-OCR · 53.3% · -29.95 · source ↗ · verified
- 01:57 · Qwen2-VL 72B · CC-OCR · 53.8% · -29.47 · source ↗ · verified
- 01:57 · GOT-OCR2.0 · CC-OCR · 24.9% · -58.30 · source ↗ · verified
- 01:57 · KOSMOS-2.5 · CC-OCR · 36.2% · -47.02 · source ↗ · verified
- 01:57 · InternVL2-76B · CC-OCR · 46.6% · -36.68 · source ↗ · verified
- 01:57 · Florence-2-Large · CC-OCR · 49.7% · -33.55 · source ↗ · verified
- 01:57 · Claude 3.5 Sonnet · CC-OCR · 65.7% · -17.57 · source ↗ · verified
- 01:57 · Qwen2-VL 72B · CC-OCR · 71.1% · -12.11 · source ↗ · verified
- 01:57 · KOSMOS-2.5 · CC-OCR · 47.5% · -35.70 · source ↗ · verified
- 01:57 · Florence-2-Large · CC-OCR · 49.2% · -34.01 · source ↗ · verified
- 01:57 · TextMonkey · CC-OCR · 56.9% · -26.37 · source ↗ · verified
- 01:57 · GOT-OCR2.0 · CC-OCR · 61.0% · -22.25 · source ↗ · verified
Showing the 200 most recent rows. To inspect a single dataset's history, append ?dataset=ID to the log URL (e.g. /log?dataset=mmmu). Delta compares each row to the prior-best value on the same dataset at the moment that row was added. Hidden datasets and hidden models are not shown.
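The delta rule and the dataset filter described above can be sketched in a few lines. This is an illustration under stated assumptions, not the site's implementation: the names `annotate`, `Annotated`, and `dataset_url` are hypothetical, rows are processed in the order they were added (oldest first, the reverse of the display order), and higher values are assumed to be better for every dataset. Hidden rows still count toward the prior best in this sketch, which may explain deltas that do not match any visible row.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlencode


@dataclass
class Annotated:
    model: str
    dataset: str
    value: float
    delta: Optional[float]  # None stands for "first result" on the dataset
    new_sota: bool


def annotate(rows: Iterable[Tuple[str, str, float]]) -> List[Annotated]:
    """Attach a delta to each (model, dataset, value) row, in insertion order."""
    best: Dict[str, float] = {}  # prior-best value per dataset so far
    out: List[Annotated] = []
    for model, dataset, value in rows:
        prior = best.get(dataset)
        if prior is None:
            delta, new_sota = None, True       # nothing to compare against yet
        else:
            delta = round(value - prior, 2)    # compare to the prior best only
            new_sota = value > prior           # assumption: higher is better
        if new_sota:
            best[dataset] = value
        out.append(Annotated(model, dataset, value, delta, new_sota))
    return out


def dataset_url(dataset_id: str, base: str = "/log") -> str:
    """Build the per-dataset history URL by appending ?dataset=ID."""
    return f"{base}?{urlencode({'dataset': dataset_id})}"


# The 18:58 MMMU-Pro batch from 2026-04-23, oldest insertion first.
mmmu_pro = [
    ("Gemini 3.1 Pro Preview", "MMMU-Pro", 82.0),
    ("GPT-5.2", "MMMU-Pro", 81.0),
    ("Gemini 3 Pro", "MMMU-Pro", 80.0),
    ("GPT-5.1", "MMMU-Pro", 76.5),
    ("Qwen3.6 Plus", "MMMU-Pro", 73.8),
]
result = annotate(mmmu_pro)
for r in result:
    print(r.model, r.delta, "NEW SOTA" if r.new_sota else "")
print(dataset_url("mmmu"))
```

Under these assumptions the sketch yields the same deltas the log shows for that batch: a first result for Gemini 3.1 Pro Preview, then -1.0, -2.0, -5.5, and -8.2.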