Codesota · Benchmark · AcademiClaw

AcademiClaw.

Primary benchmark dataset for AcademiClaw: When Students Set Challenges for AI Agents.

§ 01 · Leaderboard

Results by metric.


Avg Tokens Per Task K

Avg Tokens Per Task K is the average number of tokens used per task, in thousands (K), and is one of the evaluation metrics reported for AcademiClaw. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Sorted highest first. This is a cost measure, so a lower value generally indicates a more efficient run.

Trust tiers for Avg Tokens Per Task K: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Source table column | Trust | Score | Year |
|---|---|---|---|---|---|
| 01 | Gemini-3.1-Pro | Tokens / task (K) | verified | 2857 | 2026 |
| 02 | MiniMax M2.7 | Tokens / task (K) | verified | 1663 | 2026 |
| 03 | Claude Sonnet 4.6 | Tokens / task (K) | verified | 1562 | 2026 |
| 04 | Claude Opus 4.6 | Tokens / task (K) | verified | 1425 | 2026 |
| 05 | Qwen3.5-397B-A17B† | Tokens / task (K) | verified | 970 | 2026 |
| 06 | GPT-5.4 | Tokens / task (K) | verified | 525 | 2026 |

Avg Time Sec

Avg Time Sec is the average run time in seconds and is one of the evaluation metrics reported for AcademiClaw. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Sorted highest first. This is a cost measure, so a lower value generally indicates a faster run.

Trust tiers for Avg Time Sec: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Source table column | Trust | Score | Year |
|---|---|---|---|---|---|
| 01 | Gemini-3.1-Pro | Time (s) | verified | 822 | 2026 |
| 02 | MiniMax M2.7 | Time (s) | verified | 686 | 2026 |
| 03 | Claude Opus 4.6 | Time (s) | verified | 673 | 2026 |
| 04 | Claude Sonnet 4.6 | Time (s) | verified | 662 | 2026 |
| 05 | GPT-5.4 | Time (s) | verified | 240 | 2026 |

Safety Score

Safety Score is one of the evaluation metrics reported for AcademiClaw. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Safety Score: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Source table column | Trust | Score | Year |
|---|---|---|---|---|---|
| 01 | Claude Sonnet 4.6 | Safety | verified | 88.7 | 2026 |
| 02 | GPT-5.4 | Safety | verified | 87.5 | 2026 |
| 03 | Claude Opus 4.6 | Safety | verified | 87.4 | 2026 |
| 04 | MiniMax M2.7 | Safety | verified | 86.5 | 2026 |
| 05 | Qwen3.5-397B-A17B† | Safety | verified | 80.8 | 2026 |
| 06 | Gemini-3.1-Pro | Safety | verified | 74.9 | 2026 |

Avg Score

Avg Score is one of the evaluation metrics reported for AcademiClaw. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Avg Score: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Source table column | Trust | Score | Year |
|---|---|---|---|---|---|
| 01 | Claude Opus 4.6 | Avg Score | verified | 71.9 | 2026 |
| 02 | Claude Sonnet 4.6 | Avg Score | verified | 68.3 | 2026 |
| 03 | GPT-5.4 | Avg Score | verified | 65.6 | 2026 |
| 04 | Qwen3.5-397B-A17B† | Avg Score | verified | 64.7 | 2026 |
| 05 | Gemini-3.1-Pro | Avg Score | verified | 64.3 | 2026 |
| 06 | MiniMax M2.7 | Avg Score | verified | 63.1 | 2026 |
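
Because the page lists both Avg Score and Avg Tokens Per Task K for the same six models, the two metrics can be read together. The sketch below is a rough illustration, not a metric reported by AcademiClaw or Codesota: it divides each model's average score by its token usage in millions. The `score_per_million_tokens` helper and the derived ratio are assumptions introduced here; the input numbers are copied from the rows on this page.

```python
# (avg_score, tokens_per_task_in_thousands), copied from the rows on this page.
RESULTS = {
    "Claude Opus 4.6": (71.9, 1425),
    "Claude Sonnet 4.6": (68.3, 1562),
    "GPT-5.4": (65.6, 525),
    "Qwen3.5-397B-A17B": (64.7, 970),
    "Gemini-3.1-Pro": (64.3, 2857),
    "MiniMax M2.7": (63.1, 1663),
}


def score_per_million_tokens(avg_score, tokens_k):
    """Illustrative ratio: average score per million tokens used per task."""
    return avg_score / (tokens_k / 1000.0)


# Rank models by the derived ratio, most token-efficient first.
ranked = sorted(
    ((model, score_per_million_tokens(score, tokens_k))
     for model, (score, tokens_k) in RESULTS.items()),
    key=lambda item: item[1],
    reverse=True,
)

for model, ratio in ranked:
    print(f"{model:20s} {ratio:7.1f}")
```

A ratio like this rewards frugal runs, so it is best read alongside the raw scores rather than in place of them.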

Tool Calls Per Task

Tool Calls Per Task is the number of tool calls made per task and is one of the evaluation metrics reported for AcademiClaw. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Sorted highest first. This is a usage measure, so a higher count is not necessarily better.

Trust tiers for Tool Calls Per Task: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Source table column | Trust | Score | Year |
|---|---|---|---|---|---|
| 01 | Gemini-3.1-Pro | Tools / task | verified | 57 | 2026 |
| 02 | MiniMax M2.7 | Tools / task | verified | 37 | 2026 |
| 03 | Claude Opus 4.6 | Tools / task | verified | 33 | 2026 |
| 04 | Qwen3.5-397B-A17B† | Tools / task | verified | 26 | 2026 |
| 05 | Claude Sonnet 4.6 | Tools / task | verified | 26 | 2026 |
| 06 | GPT-5.4 | Tools / task | verified | 19 | 2026 |

Pass

Pass is the pass rate, in percent, and is one of the evaluation metrics reported for AcademiClaw. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Pass: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Source table column | Trust | Score | Year |
|---|---|---|---|---|---|
| 01 | Claude Opus 4.6 | Pass (%) | verified | 55 | 2026 |
| 02 | Claude Sonnet 4.6 | Pass (%) | verified | 55 | 2026 |
| 03 | Gemini-3.1-Pro | Pass (%) | verified | 43.8 | 2026 |
| 04 | GPT-5.4 | Pass (%) | verified | 42.5 | 2026 |
| 05 | Qwen3.5-397B-A17B† | Pass (%) | verified | 40 | 2026 |
| 06 | MiniMax M2.7 | Pass (%) | verified | 37.5 | 2026 |