Codesota · Agentic AI · Task agents · AcademiClaw
Task agents · benchmark dataset · 2026 · EN

AcademiClaw: agentic frontier tasks benchmark.

Primary benchmark dataset for AcademiClaw: When Students Set Challenges for AI Agents.

§ 01 · Leaderboard

Best published scores.

35 results indexed across 6 metrics. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: avg-score · higher is better
All metrics: avg-score, avg-time-sec, avg-tokens-per-task-k, pass, safety-score, tool-calls-per-task

avg-score · primary
6 rows
| #  | Model | Org | Submitted | Paper / code | avg-score |
|----|-------|-----|-----------|--------------|-----------|
| 01 | Claude Opus 4.6 (API) | Anthropic | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 71.90 |
| 02 | Claude Sonnet 4.6 (API) | Anthropic | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 68.30 |
| 03 | GPT-5.4 (API) | OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 65.60 |
| 04 | Qwen3.5-397B-A17B† (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 64.70 |
| 05 | Gemini 3.1 Pro (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 64.30 |
| 06 | MiniMax M2.7 (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 63.10 |
avg-time-sec
5 rows
| #  | Model | Org | Submitted | Paper / code | avg-time-sec |
|----|-------|-----|-----------|--------------|--------------|
| 01 | Gemini 3.1 Pro (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 822 |
| 02 | MiniMax M2.7 (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 686 |
| 03 | Claude Opus 4.6 (API) | Anthropic | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 673 |
| 04 | Claude Sonnet 4.6 (API) | Anthropic | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 662 |
| 05 | GPT-5.4 (API) | OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 240 |
avg-tokens-per-task-k
6 rows
| #  | Model | Org | Submitted | Paper / code | avg-tokens-per-task-k |
|----|-------|-----|-----------|--------------|-----------------------|
| 01 | Gemini 3.1 Pro (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 2857 |
| 02 | MiniMax M2.7 (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 1663 |
| 03 | Claude Sonnet 4.6 (API) | Anthropic | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 1562 |
| 04 | Claude Opus 4.6 (API) | Anthropic | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 1425 |
| 05 | Qwen3.5-397B-A17B† (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 970 |
| 06 | GPT-5.4 (API) | OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 525 |
pass
6 rows
| #  | Model | Org | Submitted | Paper / code | pass |
|----|-------|-----|-----------|--------------|------|
| 01 | Claude Sonnet 4.6 (API) | Anthropic | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 55 |
| 02 | Claude Opus 4.6 (API) | Anthropic | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 55 |
| 03 | Gemini 3.1 Pro (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 43.80 |
| 04 | GPT-5.4 (API) | OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 42.50 |
| 05 | Qwen3.5-397B-A17B† (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 40 |
| 06 | MiniMax M2.7 (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 37.50 |
safety-score
6 rows
| #  | Model | Org | Submitted | Paper / code | safety-score |
|----|-------|-----|-----------|--------------|--------------|
| 01 | Claude Sonnet 4.6 (API) | Anthropic | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 88.70 |
| 02 | GPT-5.4 (API) | OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 87.50 |
| 03 | Claude Opus 4.6 (API) | Anthropic | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 87.40 |
| 04 | MiniMax M2.7 (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 86.50 |
| 05 | Qwen3.5-397B-A17B† (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 80.80 |
| 06 | Gemini 3.1 Pro (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 74.90 |
tool-calls-per-task
6 rows
| #  | Model | Org | Submitted | Paper / code | tool-calls-per-task |
|----|-------|-----|-----------|--------------|---------------------|
| 01 | Gemini 3.1 Pro (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 57 |
| 02 | MiniMax M2.7 (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 37 |
| 03 | Claude Opus 4.6 (API) | Anthropic | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 33 |
| 04 | Qwen3.5-397B-A17B† (API) | Anthropic/OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 26 |
| 05 | Claude Sonnet 4.6 (API) | Anthropic | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 26 |
| 06 | GPT-5.4 (API) | OpenAI | May 2026 | AcademiClaw: When Students Set Challenges for AI Agents · code | 19 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
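
The ordering rule described for these tables (sort by score, break ties by submission date) can be sketched as follows. This is a minimal illustration, not Codesota's actual code: the `Row` type, the `rank_rows` helper, and the `higher_is_better` flag are hypothetical names, and treating the earlier submission as the tie winner is an assumption, since the page only says ties are broken by submission date. The sample values are the top three `pass` rows from the leaderboard.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    submitted: str  # ISO year-month, e.g. "2026-05" (only month precision is published)
    score: float


def rank_rows(rows: list[Row], higher_is_better: bool = True) -> list[Row]:
    """Sort by score; ties go to the earlier submission date (assumed rule)."""
    sign = -1.0 if higher_is_better else 1.0
    # sorted() is stable, so rows with identical score and date keep input order.
    return sorted(rows, key=lambda r: (sign * r.score, r.submitted))


# Top three "pass" rows from the leaderboard, deliberately out of order.
pass_rows = [
    Row("Gemini 3.1 Pro", "2026-05", 43.80),
    Row("Claude Sonnet 4.6", "2026-05", 55.0),
    Row("Claude Opus 4.6", "2026-05", 55.0),
]

ranked = rank_rows(pass_rows)
for position, row in enumerate(ranked, start=1):
    print(f"{position:02d} {row.model} {row.score:.2f}")
```

Because both Claude rows share the same score and the same published month, the tie here is resolved only by input order; a finer-grained submission date would be needed to break it deterministically.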
§ 03 · Progress

1 step
of state of the art.

Each row below marks a model that broke the previous record on avg-score. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · avg-score
  1. May 4, 2026 · Claude Opus 4.6 · Anthropic · 71.90
Fig 3 · SOTA-setting models only. 1 entry, dated May 2026.
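
The SOTA line described in this section is a running maximum over submissions in date order. The sketch below shows one way to derive it, assuming that entries sharing the same date are collapsed to the best one before comparison; the page does not say how same-date submissions are handled, and `sota_steps` is a hypothetical helper name. The input is the six avg-score rows from the leaderboard, all dated May 2026.

```python
def sota_steps(rows):
    """Return the record-setting (date, model, score) entries in date order.

    rows: iterable of (date, model, score), where date is an ISO string
    so that lexicographic order matches chronological order.
    """
    # Assumption: among entries with the same date, only the best can set a record.
    best_per_date = {}
    for date, model, score in rows:
        current = best_per_date.get(date)
        if current is None or score > current[1]:
            best_per_date[date] = (model, score)

    steps = []
    record = float("-inf")
    for date in sorted(best_per_date):
        model, score = best_per_date[date]
        if score > record:  # higher scores win
            record = score
            steps.append((date, model, score))
    return steps


avg_score_rows = [
    ("2026-05", "Claude Opus 4.6", 71.90),
    ("2026-05", "Claude Sonnet 4.6", 68.30),
    ("2026-05", "GPT-5.4", 65.60),
    ("2026-05", "Qwen3.5-397B-A17B", 64.70),
    ("2026-05", "Gemini 3.1 Pro", 64.30),
    ("2026-05", "MiniMax M2.7", 63.10),
]

print(sota_steps(avg_score_rows))
```

With every row dated the same month, the sketch yields a single step for Claude Opus 4.6, which matches the one-step progress line above.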
§ 04 · Literature

1 paper
tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

  • AcademiClaw: When Students Set Challenges for AI Agents
    Junjie Yu, Pengrui Lu, Weiye Si, Hongliang Lu, Jiabao Wu, Kaiwen Tao, Kun Wang, Lingyu Yang, Qiran Zhang, Xiuting Guo, Xuanyu Wang, Yang Wang, Yanjie Wang, Yi Yang, Zijian Hu, Ziyi Yang, Zonghan Zhou, Binghao Qiang, Borui Zhang, Chenning Li, Enchang Zhang, Feifan Chen, Feng Jian, Fengyin Sun, Hao Qiu, Hao Zheng, Haoran Zhu, Hongyu Liu, Jianbin Deng, Jiaxin Song, Jiaying Chi, Jiayou Shi, Jie Fang, Jinghui Zhong, Jingyu Zhou, Jinze Li, Junfeng Yi, Junyan Yu, Junzhi Xue, Ni Song, Pengyi Chen, Qi Chen, Quansheng Li, Rui Tao, Shenghai Gong, Shenhang Lu, Tianqi Shen, Tianxiang Zhu, Tiehan Kang, Tingyu Li, Wendi Wu, Xiao Shen, Xiao Zhou, Xiaotao Zhang, Xinrong Li, Xuankun Yang, Xun Zhang, Yan Li, Ye Lu, Yi Wang, Yibo Zhou, Yichi Zhang, Yihao Sun, Yijun Huang, Yixin Zhu, Yixuan Wu, Yuchen Sun, Yue Wu, Yuheng Sun, Yukun Li, Yutian Tu, Yuxuan Qin, Yuzhuo Wu, Zeyu Li, Zhengyu Lou, Zhenning Ran, Zizhu He, PengFei Liu
    May 2026 · Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4 +3
§ 06 · Contribute

Have a score that beats
this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
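
The "one row per metric" requirement can be checked before submitting with a few lines of Python. This is a sketch under stated assumptions: the dictionary layout and the `missing_metrics` helper are hypothetical, not a format Codesota publishes. The metric names are the six declared by this dataset, and the sample values are the Qwen3.5-397B-A17B rows from the leaderboard above, which has no avg-time-sec entry.

```python
# The six metrics declared by this dataset, as listed on the leaderboard.
DECLARED_METRICS = (
    "avg-score",
    "avg-time-sec",
    "avg-tokens-per-task-k",
    "pass",
    "safety-score",
    "tool-calls-per-task",
)


def missing_metrics(reported: dict[str, float]) -> list[str]:
    """Return the declared metrics that have no reported value."""
    return [metric for metric in DECLARED_METRICS if metric not in reported]


# Rows currently indexed for Qwen3.5-397B-A17B (hypothetical dict layout).
qwen_rows = {
    "avg-score": 64.70,
    "avg-tokens-per-task-k": 970,
    "pass": 40,
    "safety-score": 80.80,
    "tool-calls-per-task": 26,
}

print(missing_metrics(qwen_rows))  # ['avg-time-sec']
```

An empty list means every declared metric has a row; anything else names the rows still to be supplied.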