Codesota · Registry log · 9,080 rows · 7,132 new this month · Showing 200
Editorial · Registry log

Every score we've added, in order.

The append-only public ledger of every benchmark result on Codesota. Each row records when it was written, when the result itself is dated, which model it belongs to, what value was claimed, and where the citation lives. New-SOTA rows are marked in colour; unverified rows still appear, but are labelled as such.

This is the audit trail. If a score is wrong, this is where the error will be visible; if a source is missing, this is where you'll see the gap.
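As a rough illustration of what one ledger row carries, the fields described above could be modelled as follows. This is a sketch only: the class name, field names, and the example URL are hypothetical and are not taken from Codesota's actual schema.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional


@dataclass(frozen=True)
class LogRow:
    """One ledger entry. All names here are illustrative assumptions."""
    written_at: datetime          # when the row was written
    result_date: Optional[date]   # when the result itself is dated, if known
    model: str                    # who the model was
    dataset: str
    value: float                  # the value that was claimed
    source_url: Optional[str]     # where the citation lives; None exposes a gap
    verified: bool = False
    new_sota: bool = False

    def labels(self) -> List[str]:
        """Return the textual markers shown next to a row."""
        out = []
        if self.new_sota:
            out.append("NEW SOTA")
        # Unverified rows are still shown, but labelled.
        out.append("verified" if self.verified else "unverified")
        if self.source_url is None:
            out.append("source missing")
        if self.result_date is not None:
            out.append(f"dated {self.result_date.isoformat()}")
        return out


# Values copied from the 2026-04-23 SWE-Bench Verified row; the URL is a placeholder.
row = LogRow(
    written_at=datetime(2026, 4, 23, 20, 40),
    result_date=date(2026, 4, 18),
    model="Claude Opus 4.7",
    dataset="SWE-Bench Verified",
    value=87.6,
    source_url="https://example.com/source",
    verified=True,
    new_sota=True,
)
print(" · ".join(row.labels()))
```

Keeping the write time and the result date as separate fields mirrors the distinction the ledger draws between when a row was added and when the underlying result was published.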

2026-04-23 · 105 rows
  1. 20:40 · Gemini 3 Flash · LiveCodeBench · 90.8% · -0.90 · source ↗ · verified · dated 2026-03-15
  2. 20:40 · Gemini 3 Pro Preview · LiveCodeBench · 91.7% · NEW SOTA · +6.70 · source ↗ · verified · dated 2026-03-15
  3. 20:40 · Claude Opus 4.7 · SWE-Bench Verified · 87.6% · NEW SOTA · +6.70 · source ↗ · verified · dated 2026-04-18
  4. 18:58 · Qwen3.6 Plus · MMMU-Pro · 73.8% · -8.20 · source ↗ · verified · dated 2026-03-15
  5. 18:58 · GPT-5.1 · MMMU-Pro · 76.5% · -5.50 · source ↗ · verified · dated 2025-11-13
  6. 18:58 · Gemini 3 Pro · MMMU-Pro · 80.0% · -2.00 · source ↗ · verified · dated 2026-01-15
  7. 18:58 · GPT-5.2 · MMMU-Pro · 81.0% · -1.00 · source ↗ · verified · dated 2025-12-11
  8. 18:58 · Gemini 3.1 Pro Preview · MMMU-Pro · 82.0% · NEW SOTA · first result · source ↗ · verified · dated 2026-03-18
  9. 18:57 · Qwen3.5-27B · MMMU · 82.3% · -3.70 · source ↗ · verified · dated 2025-09-01
  10. 18:57 · Qwen3.5-122B-A10B · MMMU · 83.9% · -2.10 · source ↗ · verified · dated 2025-09-01
  11. 18:57 · Qwen3.5-397B-A17B · MMMU · 83.9% · -2.10 · source ↗ · verified · dated 2025-09-01
  12. 18:57 · GPT-5.1 · MMMU · 85.4% · -0.60 · source ↗ · verified · dated 2025-11-13
  13. 18:57 · GPT-5.1 Instant · MMMU · 85.4% · -0.60 · source ↗ · verified · dated 2025-11-13
  14. 18:57 · GPT-5.1 Thinking · MMMU · 85.4% · -0.60 · source ↗ · verified · dated 2025-11-13
  15. 18:57 · Qwen3.6 Plus · MMMU · 86.0% · NEW SOTA · +12.70 · source ↗ · verified · dated 2026-03-15
  16. 10:52 · Gemini 3 Pro · SWE-Bench · 77.4% · -4.70 · source ↗ · verified · dated 2026-01-01
  17. 10:52 · GLM-5 · SWE-Bench · 77.8% · -4.30 · source ↗ · verified · dated 2026-01-01
  18. 10:52 · Claude Opus 4.5 · SWE-Bench · 79.2% · -2.90 · source ↗ · verified · dated 2026-01-01
  19. 10:52 · Sonar Foundation · SWE-Bench · 79.2% · -2.90 · source ↗ · verified · dated 2026-01-01
  20. 10:52 · GPT-5.2 · SWE-Bench · 80.0% · -2.10 · source ↗ · verified · dated 2026-02-01
  21. 10:52 · MiniMax M2.5 · SWE-Bench · 80.2% · -1.90 · source ↗ · verified · dated 2026-01-01
  22. 10:52 · Claude Opus 4.6 · SWE-Bench · 80.8% · -1.30 · source ↗ · verified · dated 2026-02-01
  23. 10:52 · Claude Opus 4.5 · SWE-Bench · 80.9% · -1.20 · source ↗ · verified · dated 2026-02-01
  24. 10:52 · BERT + AoA · SQuAD v2.0 · 88.6% · -2.80 · source ↗ · verified · dated 2019-03-01
  25. 10:52 · BERT (Google AI) · SQuAD v2.0 · 83.1% · -8.30 · source ↗ · verified · dated 2018-11-01
  26. 10:52 · Logistic Regression (SQuAD baseline) · SQuAD v2.0 · 51.0% · -40.40 · source ↗ · verified · dated 2016-06-01
  27. 10:52 · SLQA+ (single model) · SQuAD v2.0 · 87.0% · -4.38 · source ↗ · verified · dated 2018-01-01
  28. 10:52 · Hanvon_model (single model) · SQuAD v2.0 · 87.1% · -4.28 · source ↗ · verified · dated 2019-09-01
  29. 10:52 · Insight-baseline-BERT (single model) · SQuAD v2.0 · 87.6% · -3.76 · source ↗ · verified · dated 2019-04-01
  30. 10:52 · XLNet (single, Verified XiaoPAI) · SQuAD v2.0 · 88.0% · -3.40 · source ↗ · verified · dated 2019-09-01
  31. 10:52 · SpanBERT (single model) · SQuAD v2.0 · 88.7% · -2.69 · source ↗ · verified · dated 2019-07-01
  32. 10:52 · BERT + DAE + AoA (single model) · SQuAD v2.0 · 88.6% · -2.78 · source ↗ · verified · dated 2019-03-01
  33. 10:52 · XLNet+Verifier (single, Ping An) · SQuAD v2.0 · 89.1% · -2.34 · source ↗ · verified · dated 2019-08-01
  34. 10:52 · XLNet+Verifier (single, Google/CMU) · SQuAD v2.0 · 89.1% · -2.32 · source ↗ · verified · dated 2019-10-01
  35. 10:52 · BERT + ConvLSTM + MTL + Verifier (ensemble) · SQuAD v2.0 · 89.3% · -2.11 · source ↗ · verified · dated 2019-03-01
  36. 10:52 · RoBERTa+Verify (single model) · SQuAD v2.0 · 89.6% · -1.81 · source ↗ · verified · dated 2019-11-01
  37. 10:52 · Enhanced Albert+Verifier3 (ensemble) · SQuAD v2.0 · 89.8% · -1.62 · source ↗ · verified · dated 2020-05-01
  38. 10:52 · RoBERTa (single model) · SQuAD v2.0 · 89.8% · -1.61 · source ↗ · verified · dated 2020-07-01
  39. 10:51 · Claude Sonnet 5 · SWE-Bench · 82.1% · =0.0 · source ↗ · verified · dated 2026-02-01
  40. 10:51 · MiniMax M2.5 · SWE-Bench · 80.2% · -1.90 · source ↗ · verified · dated 2026-01-01
  41. 10:51 · Claude Opus 4.5 · SWE-Bench · 78.0% · -4.10 · source ↗ · verified · dated 2025-12-01
  42. 10:51 · Claude Sonnet 4.5 · SWE-Bench · 70.8% · -11.30 · source ↗ · verified · dated 2025-09-01
  43. 10:51 · GPT-4.5 · SWE-Bench · 62.0% · -20.10 · source ↗ · verified · dated 2025-06-01
  44. 10:51 · Claude Opus 4 · SWE-Bench · 55.2% · -26.90 · source ↗ · verified · dated 2025-03-01
  45. 10:51 · Claude 3.5 Sonnet v2 · SWE-Bench · 49.0% · -33.10 · source ↗ · verified · dated 2024-12-01
  46. 10:51 · o1-preview · SWE-Bench · 36.2% · -45.90 · source ↗ · verified · dated 2024-10-01
  47. 10:51 · Claude 3.5 Sonnet · SWE-Bench · 27.0% · -55.10 · source ↗ · verified · dated 2024-08-01
  48. 10:51 · GPT-4o · SWE-Bench · 19.0% · -63.10 · source ↗ · verified · dated 2024-06-01
  49. 10:51 · GPT-4 Turbo · SWE-Bench · 12.5% · -69.60 · source ↗ · verified · dated 2024-03-01
  50. 10:51 · Claude 2 · SWE-Bench · 2.0% · -80.14 · source ↗ · verified · dated 2023-10-01
  51. 10:51 · DeepSeek-Coder 33B · SWE-Bench · 15.6% · -66.50 · source ↗ · verified · dated 2024-06-01
  52. 10:51 · StarCoder2 15B · SWE-Bench · 18.3% · -63.80 · source ↗ · verified · dated 2024-10-01
  53. 10:51 · CodeLlama 70B · SWE-Bench · 29.8% · -52.30 · source ↗ · verified · dated 2024-12-01
  54. 10:51 · Qwen2.5-Coder 32B · SWE-Bench · 55.4% · -26.70 · source ↗ · verified · dated 2025-06-01
  55. 10:51 · DeepSeek-Coder V2.5 · SWE-Bench · 68.2% · -13.90 · source ↗ · verified · dated 2025-08-01
  56. 10:51 · Qwen3 72B · SWE-Bench · 72.4% · -9.70 · source ↗ · verified · dated 2025-10-01
  57. 10:51 · Step-3.5-Flash · SWE-Bench · 74.4% · -7.70 · source ↗ · verified · dated 2026-01-01
  58. 10:51 · DeepSeek V3.5 · SWE-Bench · 74.6% · -7.50 · source ↗ · verified · dated 2025-11-01
  59. 10:51 · Qwen3-Max-Thinking · SWE-Bench · 75.3% · -6.80 · source ↗ · verified · dated 2026-02-01
  60. 10:51 · Gemini 3 Flash · SWE-Bench · 75.8% · -6.30 · source ↗ · verified · dated 2026-02-01
  61. 10:51 · DeepSeek R1 · SWE-Bench · 76.3% · -5.80 · source ↗ · verified · dated 2025-12-01
  62. 10:51 · Kimi K2.5 · SWE-Bench · 76.8% · -5.30 · source ↗ · verified · dated 2026-01-01
  63. 10:51 · Claude Sonnet 4.5 · SWE-Bench · 77.2% · -4.90 · source ↗ · verified · dated 2025-12-01
  64. 10:51 · Gemini 3 Pro · SWE-Bench · 77.4% · -4.70 · source ↗ · verified · dated 2026-01-01
  65. 10:51 · GLM-5 · SWE-Bench · 77.8% · -4.30 · source ↗ · verified · dated 2026-01-01
  66. 10:51 · Claude Opus 4.6 · SWE-Bench · 79.8% · -2.30 · source ↗ · verified · dated 2026-02-01
  67. 10:51 · GPT-5.2 · SWE-Bench · 80.0% · -2.10 · source ↗ · verified · dated 2026-02-01
  68. 10:51 · MiniMax M2.5 · SWE-Bench · 80.2% · -1.90 · source ↗ · verified · dated 2026-01-01
  69. 10:51 · Claude Opus 4.5 · SWE-Bench · 80.9% · -1.20 · source ↗ · verified · dated 2026-02-01
  70. 10:51 · Claude Sonnet 5 · SWE-Bench · 82.1% · NEW SOTA · first result · source ↗ · verified · dated 2026-02-01
  71. 10:51 · SENet · ImageNet · 97.8% · NEW SOTA · +1.32 · source ↗ · verified · dated 2017-01-01
  72. 10:51 · ResNet-152 · ImageNet · 96.4% · NEW SOTA · +3.13 · source ↗ · verified · dated 2015-01-01
  73. 10:51 · GoogLeNet · ImageNet · 93.3% · NEW SOTA · +2.30 · source ↗ · verified · dated 2014-01-01
  74. 10:51 · AlexNet · ImageNet · 83.6% · -7.40 · source ↗ · verified · dated 2012-01-01
  75. 10:51 · NEC-UIUC · ImageNet · 71.8% · -19.20 · source ↗ · verified · dated 2010-01-01
  76. 10:51 · convnext_base.fb_in22k_ft_in1k · ImageNet · 86.3% · -4.70 · source ↗ · verified · dated 2022-01-01
  77. 10:51 · swin_large.ms_in22k_ft_in1k · ImageNet · 86.3% · -4.67 · source ↗ · verified · dated 2021-03-01
  78. 10:51 · nextvit_large.bd_ssld_6m_in1k_384 · ImageNet · 86.5% · -4.46 · source ↗ · verified · dated 2022-11-01
  79. 10:51 · coatnet_2_rw_224.sw_in12k_ft_in1k · ImageNet · 86.6% · -4.42 · source ↗ · verified · dated 2022-09-01
  80. 10:51 · maxvit_base_tf_512.in1k · ImageNet · 86.6% · -4.40 · source ↗ · verified · dated 2023-04-01
  81. 10:51 · InternViT-6B (InternVL) · ImageNet · 88.2% · -2.80 · source ↗ · verified · dated 2024-06-01
  82. 10:51 · ViT-22B/14 · ImageNet · 89.5% · -1.49 · source ↗ · verified · dated 2023-02-01
  83. 10:51 · EVA-02 (ViT-L/14+) · ImageNet · 90.0% · -1.00 · source ↗ · verified · dated 2023-03-01
  84. 10:51 · SoViT-400M/14 · ImageNet · 90.3% · -0.70 · source ↗ · verified · dated 2023-05-01
  85. 10:51 · CoCa (ViT-G/14) · ImageNet · 91.0% · NEW SOTA · first result · source ↗ · verified · dated 2022-05-01
  86. 10:51 · T5-11B · GLUE · 89.3% · -2.00 · source ↗ · verified · dated 2019-10-01
  87. 10:51 · DeBERTa (ensemble) · GLUE · 90.3% · -1.00 · source ↗ · verified · dated 2021-01-01
  88. 10:51 · ERNIE 3.0 · GLUE · 90.6% · -0.70 · source ↗ · verified · dated 2021-07-01
  89. 10:51 · ST-MoE-32B · GLUE · 91.2% · -0.10 · source ↗ · verified · dated 2022-02-01
  90. 10:51 · Vega v2 (6B) · GLUE · 91.3% · NEW SOTA · first result · source ↗ · verified · dated 2022-10-01
  91. 10:51 · clearOCR · OmniDocBench · 31.7% · -65.80 · source ↗ · verified
  92. 10:51 · mistral-ocr-2512 · OmniDocBench · 79.8% · -17.75 · source ↗ · verified
  93. 10:51 · Mistral OCR 3 · OmniDocBench · 79.8% · -17.75 · source ↗ · verified
  94. 10:51 · Codex (davinci-002) · HumanEval · 46.9% · -50.40 · source ↗ · verified · dated 2021-07-01
  95. 10:51 · DeepSeek-Coder-33B-Instruct · HumanEval · 79.3% · -18.00 · source ↗ · verified · dated 2023-11-01
  96. 10:51 · Codestral 25.01 · HumanEval · 85.3% · -12.00 · source ↗ · verified · dated 2025-01-01
  97. 10:51 · GPT-4 Turbo · HumanEval · 86.6% · -10.70 · source ↗ · verified · dated 2023-11-01
  98. 10:51 · Llama-3.3-70B-Instruct · HumanEval · 88.4% · -8.90 · source ↗ · verified · dated 2024-12-01
  99. 10:51 · GPT-4o · HumanEval · 90.2% · -7.10 · source ↗ · verified · dated 2024-05-01
  100. 10:51 · DeepSeek-Coder-V2-Instruct · HumanEval · 90.2% · -7.10 · source ↗ · verified · dated 2024-06-01
  101. 10:51 · Qwen2.5-Coder 32B · HumanEval · 92.7% · -4.60 · source ↗ · verified · dated 2025-03-01
  102. 10:51 · Claude Sonnet 4.6 · HumanEval · 94.1% · -3.20 · source ↗ · verified · dated 2026-01-01
  103. 10:51 · o3 · HumanEval · 94.8% · -2.50 · source ↗ · verified · dated 2025-04-01
  104. 10:51 · GPT-5 · HumanEval · 95.1% · -2.20 · source ↗ · verified · dated 2025-12-01
  105. 10:51 · Claude Opus 4.6 · HumanEval · 96.3% · -1.00 · source ↗ · verified · dated 2026-01-01
2026-04-13 · 14 rows
  1. 23:16 · LlamaParse Agentic · ParseBench · 84.9% · NEW SOTA · +13.00 · source ↗ · verified
  2. 23:16 · LlamaParse Cost Effective · ParseBench · 71.9% · NEW SOTA · +0.90 · source ↗ · verified
  3. 23:16 · LandingAI · ParseBench · 45.2% · -25.80 · source ↗ · verified
  4. 23:16 · Extend · ParseBench · 55.8% · -15.20 · source ↗ · verified
  5. 23:16 · Reducto · ParseBench · 67.8% · -3.20 · source ↗ · verified
  6. 23:16 · Azure Document Intelligence · ParseBench · 59.6% · -11.40 · source ↗ · verified
  7. 23:16 · Google Cloud Document AI · ParseBench · 50.4% · -20.60 · source ↗ · verified
  8. 23:16 · AWS Textract · ParseBench · 47.9% · -23.10 · source ↗ · verified
  9. 23:16 · Docling · ParseBench · 50.6% · -20.40 · source ↗ · verified
  10. 23:16 · Dots OCR 1.5 · ParseBench · 55.8% · -15.20 · source ↗ · verified
  11. 23:16 · Qwen3-VL-4B · ParseBench · 62.0% · -9.00 · source ↗ · verified
  12. 23:16 · Gemini 3 Flash · ParseBench · 71.0% · NEW SOTA · +24.20 · source ↗ · verified
  13. 23:16 · Anthropic Haiku 4.5 · ParseBench · 45.2% · -1.60 · source ↗ · verified
  14. 23:16 · GPT-5-mini · ParseBench · 46.8% · NEW SOTA · first result · source ↗ · verified
2026-04-12 · 9 rows
  1. 20:20 · o3 · LiveCodeBench Pro · 1010.00 · -1429.00 · source ↗ · verified
  2. 20:20 · DeepSeek R1 · LiveCodeBench Pro · 1161.00 · -1278.00 · source ↗ · verified
  3. 20:20 · Gemini 2.5 Flash · LiveCodeBench Pro · 1288.00 · -1151.00 · source ↗ · verified
  4. 20:20 · Claude Sonnet 4.5 · LiveCodeBench Pro · 1412.00 · -1027.00 · source ↗ · verified
  5. 20:20 · Qwen3-235B-A22B · LiveCodeBench Pro · 1673.00 · -766.00 · source ↗ · verified
  6. 20:20 · Gemini 2.5 Pro · LiveCodeBench Pro · 1769.00 · -670.00 · source ↗ · verified
  7. 20:20 · o4-mini · LiveCodeBench Pro · 2092.00 · -347.00 · source ↗ · verified
  8. 20:20 · GPT-5 · LiveCodeBench Pro · 2176.00 · -263.00 · source ↗ · verified
  9. 20:20 · Gemini 3 Pro · LiveCodeBench Pro · 2439.00 · NEW SOTA · first result · source ↗ · verified
2026-04-09 · 72 rows
  1. 02:01 · CPN (Complementary Proposal Network) · ic19-art · 79.9% · -6.50 · source ↗ · verified · dated 2024-02-18
  2. 02:01 · CPN (Complementary Proposal Network) · ic19-art · 83.6% · -2.80 · source ↗ · verified · dated 2024-02-18
  3. 02:01 · CPN (Complementary Proposal Network) · ic19-art · 81.7% · -4.70 · source ↗ · verified · dated 2024-02-18
  4. 02:00 · PLBART · codesearchnet---java · 18.4% · -4.16 · source ↗ · verified
  5. 02:00 · CoTexT · codesearchnet---java · 19.1% · -3.55 · source ↗ · verified
  6. 02:00 · ProphetNet-X · codesearchnet---java · 19.4% · -3.22 · source ↗ · verified
  7. 02:00 · PolyglotCodeBERT · codesearchnet---java · 20.1% · -2.50 · source ↗ · verified
  8. 02:00 · BART-base (STSM) · e2e · 2.2% · -69.50 · source ↗ · verified · dated 2024-01-19
  9. 02:00 · BART-base (STSM) · e2e · 68.8% · -2.94 · source ↗ · verified · dated 2024-01-19
  10. 02:00 · BART-base (STSM) · e2e · 45.6% · -26.10 · source ↗ · verified · dated 2024-01-19
  11. 02:00 · BART-base (STSM) · e2e · 8.5% · -63.24 · source ↗ · verified · dated 2024-01-19
  12. 02:00 · BART-base (STSM) · e2e · 65.7% · -5.96 · source ↗ · verified · dated 2024-01-19
  13. 02:00 · FLAN-T5-base (STSM) · e2e · 2.1% · -69.58 · source ↗ · verified · dated 2024-01-19
  14. 02:00 · FLAN-T5-base (STSM) · e2e · 67.8% · -3.85 · source ↗ · verified · dated 2024-01-19
  15. 02:00 · FLAN-T5-base (STSM) · e2e · 45.5% · -26.16 · source ↗ · verified · dated 2024-01-19
  16. 02:00 · FLAN-T5-base (STSM) · e2e · 8.5% · -63.21 · source ↗ · verified · dated 2024-01-19
  17. 02:00 · FLAN-T5-base (STSM) · e2e · 65.7% · -6.05 · source ↗ · verified · dated 2024-01-19
  18. 02:00 · T5-base (STSM) · e2e · 2.3% · -69.43 · source ↗ · verified · dated 2024-01-19
  19. 02:00 · T5-base (STSM) · e2e · 69.0% · -2.73 · source ↗ · verified · dated 2024-01-19
  20. 02:00 · T5-base (STSM) · e2e · 45.7% · -26.00 · source ↗ · verified · dated 2024-01-19
  21. 02:00 · T5-base (STSM) · e2e · 8.6% · -63.11 · source ↗ · verified · dated 2024-01-19
  22. 02:00 · T5-base (STSM) · e2e · 67.0% · -4.75 · source ↗ · verified · dated 2024-01-19
  23. 02:00 · ESALE · codesearchnet---javascript · 15.6% · -10.00 · source ↗ · verified · dated 2024-07-01
  24. 02:00 · UniXcoder · codesearchnet---javascript · 15.5% · -10.15 · source ↗ · verified · dated 2024-07-01
  25. 02:00 · GraphCodeBERT+AdvFusion · codesearchnet---javascript · 15.9% · -9.72 · source ↗ · verified · dated 2024-12-01
  26. 02:00 · CodeBERT+AdvFusion · codesearchnet---javascript · 16.8% · -8.81 · source ↗ · verified · dated 2024-12-01
  27. 02:00 · GraphCodeBERT · codesearchnet---javascript · 14.8% · -10.82 · source ↗ · verified · dated 2024-12-01
  28. 02:00 · CodeT5-base · codesearchnet---javascript · 16.2% · -9.37 · source ↗ · verified · dated 2024-12-01
  29. 02:00 · HTLM (prefix-tuning) · e2e · 2.5% · -69.25 · source ↗ · verified · dated 2021-07-14
  30. 02:00 · HTLM (prefix-tuning) · e2e · 71.2% · -0.50 · source ↗ · verified · dated 2021-07-14
  31. 02:00 · HTLM (prefix-tuning) · e2e · 46.1% · -25.60 · source ↗ · verified · dated 2021-07-14
  32. 02:00 · HTLM (prefix-tuning) · e2e · 8.8% · -62.85 · source ↗ · verified · dated 2021-07-14
  33. 02:00 · HTLM (prefix-tuning) · e2e · 70.1% · -1.60 · source ↗ · verified · dated 2021-07-14
  34. 02:00 · GPT-2-Large (prefix-tuning) · e2e · 2.5% · -69.23 · source ↗ · verified · dated 2021-07-14
  35. 02:00 · GPT-2-Large (prefix-tuning) · e2e · 71.7% · NEW SOTA · +0.30 · source ↗ · verified · dated 2021-07-14
  36. 02:00 · GPT-2-Large (prefix-tuning) · e2e · 46.2% · -25.20 · source ↗ · verified · dated 2021-07-14
  37. 02:00 · GPT-2-Large (prefix-tuning) · e2e · 8.8% · -62.55 · source ↗ · verified · dated 2021-07-14
  38. 02:00 · GPT-2-Large (prefix-tuning) · e2e · 70.3% · -1.10 · source ↗ · verified · dated 2021-07-14
  39. 02:00 · GPT-2-Medium (prefix-tuning) · e2e · 2.5% · -68.91 · source ↗ · verified · dated 2021-07-14
  40. 02:00 · GPT-2-Medium (prefix-tuning) · e2e · 71.4% · NEW SOTA · +0.40 · source ↗ · verified · dated 2021-07-14
  41. 02:00 · GPT-2-Medium (prefix-tuning) · e2e · 46.1% · -24.90 · source ↗ · verified · dated 2021-07-14
  42. 02:00 · GPT-2-Medium (prefix-tuning) · e2e · 8.8% · -62.19 · source ↗ · verified · dated 2021-07-14
  43. 02:00 · GPT-2-Medium (prefix-tuning) · e2e · 69.7% · -1.30 · source ↗ · verified · dated 2021-07-14
  44. 02:00 · GPT-2-Medium (fine-tuning) · e2e · 2.5% · -68.53 · source ↗ · verified · dated 2021-07-14
  45. 02:00 · GPT-2-Medium (fine-tuning) · e2e · 71.0% · NEW SOTA · +0.20 · source ↗ · verified · dated 2021-07-14
  46. 02:00 · GPT-2-Medium (fine-tuning) · e2e · 46.2% · -24.60 · source ↗ · verified · dated 2021-07-14
  47. 02:00 · GPT-2-Medium (fine-tuning) · e2e · 8.6% · -62.18 · source ↗ · verified · dated 2021-07-14
  48. 02:00 · GPT-2-Medium (fine-tuning) · e2e · 68.2% · -2.60 · source ↗ · verified · dated 2021-07-14
  49. 01:58 · Oracle-BERT (HowSumm-Method) · howsumm-method · 63.2% · NEW SOTA · +4.30 · source ↗ · verified
  50. 01:58 · Oracle-BOW (HowSumm-Method) · howsumm-method · 58.9% · NEW SOTA · +5.40 · source ↗ · verified
  51. 01:58 · Random Baseline (HowSumm-Method) · howsumm-method · 41.5% · -12.00 · source ↗ · verified
  52. 01:57 · Oracle-BERT · howsumm-step · 46.8% · NEW SOTA · +0.80 · source ↗ · verified · dated 2021-10-07
  53. 01:57 · GreedyRel (query: step + method + article titles) · howsumm-step · 30.1% · -15.90 · source ↗ · verified · dated 2021-10-07
  54. 01:57 · Oracle-BOW · howsumm-step · 46.0% · NEW SOTA · +6.40 · source ↗ · verified · dated 2021-10-07
  55. 01:57 · Oracle-HierSumm · howsumm-step · 35.6% · -4.00 · source ↗ · verified · dated 2021-10-07
  56. 01:57 · Random Baseline (HowSumm) · howsumm-step · 23.0% · -16.60 · source ↗ · verified · dated 2021-10-07
  57. 01:57 · InternVL2-76B · CC-OCR · 61.6% · -21.65 · source ↗ · verified
  58. 01:57 · InternVL2-76B · CC-OCR · 35.3% · -47.92 · source ↗ · verified
  59. 01:57 · GOT-OCR2.0 · CC-OCR · 39.2% · -44.07 · source ↗ · verified
  60. 01:57 · Claude 3.5 Sonnet · CC-OCR · 47.8% · -35.46 · source ↗ · verified
  61. 01:57 · GPT-4o · CC-OCR · 53.3% · -29.95 · source ↗ · verified
  62. 01:57 · Qwen2-VL 72B · CC-OCR · 53.8% · -29.47 · source ↗ · verified
  63. 01:57 · GOT-OCR2.0 · CC-OCR · 24.9% · -58.30 · source ↗ · verified
  64. 01:57 · KOSMOS-2.5 · CC-OCR · 36.2% · -47.02 · source ↗ · verified
  65. 01:57 · InternVL2-76B · CC-OCR · 46.6% · -36.68 · source ↗ · verified
  66. 01:57 · Florence-2-Large · CC-OCR · 49.7% · -33.55 · source ↗ · verified
  67. 01:57 · Claude 3.5 Sonnet · CC-OCR · 65.7% · -17.57 · source ↗ · verified
  68. 01:57 · Qwen2-VL 72B · CC-OCR · 71.1% · -12.11 · source ↗ · verified
  69. 01:57 · KOSMOS-2.5 · CC-OCR · 47.5% · -35.70 · source ↗ · verified
  70. 01:57 · Florence-2-Large · CC-OCR · 49.2% · -34.01 · source ↗ · verified
  71. 01:57 · TextMonkey · CC-OCR · 56.9% · -26.37 · source ↗ · verified
  72. 01:57 · GOT-OCR2.0 · CC-OCR · 61.0% · -22.25 · source ↗ · verified
Showing the 200 most recent rows. To inspect a single dataset’s history, append ?dataset=ID to the URL (e.g. /log?dataset=mmmu). Delta compares each row to the prior-best value on the same dataset at the moment the row was added. Hidden datasets and hidden models are not shown.
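The delta rule described above can be sketched in a few lines. This is a minimal illustration, not Codesota's implementation: the function names are hypothetical, and it assumes that a higher value is better and that rows are processed in the order they were added. The page does not say how lower-is-better metrics are handled.

```python
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlencode

Row = Tuple[str, str, float]  # (model, dataset, value), in insertion order


def annotate_deltas(rows: Iterable[Row]) -> List[Tuple[str, str, float, Optional[float], bool]]:
    """Attach (delta, new_sota) to each row, comparing against the prior best
    on the same dataset at the moment the row is added (assumes higher is better)."""
    best: Dict[str, float] = {}
    out = []
    for model, dataset, value in rows:
        prior = best.get(dataset)
        if prior is None:
            delta, new_sota = None, True      # shown as "first result"
        else:
            delta = round(value - prior, 2)
            new_sota = delta > 0              # a tie (=0.0) is not a new SOTA
        if prior is None or value > prior:
            best[dataset] = value
        out.append((model, dataset, value, delta, new_sota))
    return out


def dataset_log_path(dataset_id: str) -> str:
    """Build the per-dataset history path, e.g. /log?dataset=mmmu."""
    return "/log?" + urlencode({"dataset": dataset_id})


# The last three ParseBench rows of the 2026-04-13 batch, fed bottom-up.
rows = [
    ("GPT-5-mini", "ParseBench", 46.8),
    ("Anthropic Haiku 4.5", "ParseBench", 45.2),
    ("Gemini 3 Flash", "ParseBench", 71.0),
]
for model, dataset, value, delta, new_sota in annotate_deltas(rows):
    shown = "first result" if delta is None else f"{delta:+.2f}"
    print(model, dataset, value, "NEW SOTA" if new_sota else "", shown)
print(dataset_log_path("mmmu"))
```

Fed in that order, the sketch yields "first result", -1.60, and +24.20, which matches the deltas logged for those three rows. That suggests rows within a batch are listed newest-first, though the page does not state this.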