Codesota · Benchmark · AcademiClaw

AcademiClaw.

Primary benchmark dataset for AcademiClaw: When Students Set Challenges for AI Agents.

§ 01 · SOTA history

Year over year.

Not enough data to show a trend.

§ 02 · Leaderboard

Results by metric.

Avg Tokens Per Task (K)

Avg Tokens Per Task (K) is one of the metrics reported for AcademiClaw: the average number of tokens used per task, in thousands. Codesota tracks published model scores on this metric so readers can compare results across sources and model families.

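This is a cost metric, so a lower value usually indicates a more efficient run; the ranking below nevertheless lists the highest value first.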

Trust tiers: verified, paper, vendor, community, unverified.
Source table column: Tokens / task (K).

| Rank | Model | Trust | Score | Year |
|------|-------|-------|-------|------|
| 01 | Gemini 3.1 Pro | verified | 2857 | 2026 |
| 02 | MiniMax M2.7 | verified | 1663 | 2026 |
| 03 | Claude Sonnet 4.6 | verified | 1562 | 2026 |
| 04 | Claude Opus 4.6 | verified | 1425 | 2026 |
| 05 | Qwen3.5-397B-A17B† | verified | 970 | 2026 |
| 06 | GPT-5.4 | verified | 525 | 2026 |

Avg Time (s)

Avg Time (s) is one of the metrics reported for AcademiClaw: the average time, in seconds.

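This is a cost metric, so a lower value usually indicates a faster run; the ranking below nevertheless lists the highest value first.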

Trust tiers: verified, paper, vendor, community, unverified.
Source table column: Time (s).

| Rank | Model | Trust | Score | Year |
|------|-------|-------|-------|------|
| 01 | Gemini 3.1 Pro | verified | 822 | 2026 |
| 02 | MiniMax M2.7 | verified | 686 | 2026 |
| 03 | Claude Opus 4.6 | verified | 673 | 2026 |
| 04 | Claude Sonnet 4.6 | verified | 662 | 2026 |
| 05 | GPT-5.4 | verified | 240 | 2026 |

Safety Score

Safety Score is one of the metrics reported for AcademiClaw.

Higher is better

Trust tiers: verified, paper, vendor, community, unverified.
Source table column: Safety.

| Rank | Model | Trust | Score | Year |
|------|-------|-------|-------|------|
| 01 | Claude Sonnet 4.6 | verified | 88.7 | 2026 |
| 02 | GPT-5.4 | verified | 87.5 | 2026 |
| 03 | Claude Opus 4.6 | verified | 87.4 | 2026 |
| 04 | MiniMax M2.7 | verified | 86.5 | 2026 |
| 05 | Qwen3.5-397B-A17B† | verified | 80.8 | 2026 |
| 06 | Gemini 3.1 Pro | verified | 74.9 | 2026 |

Avg Score

Avg Score is one of the metrics reported for AcademiClaw.

Higher is better

Trust tiers: verified, paper, vendor, community, unverified.
Source table column: Avg Score.

| Rank | Model | Trust | Score | Year |
|------|-------|-------|-------|------|
| 01 | Claude Opus 4.6 | verified | 71.9 | 2026 |
| 02 | Claude Sonnet 4.6 | verified | 68.3 | 2026 |
| 03 | GPT-5.4 | verified | 65.6 | 2026 |
| 04 | Qwen3.5-397B-A17B† | verified | 64.7 | 2026 |
| 05 | Gemini 3.1 Pro | verified | 64.3 | 2026 |
| 06 | MiniMax M2.7 | verified | 63.1 | 2026 |

Tool Calls Per Task

Tool Calls Per Task is one of the metrics reported for AcademiClaw: the average number of tool calls per task.

More tool calls are not inherently better; the ranking below lists the highest value first.

Trust tiers: verified, paper, vendor, community, unverified.
Source table column: Tools / task.

| Rank | Model | Trust | Score | Year |
|------|-------|-------|-------|------|
| 01 | Gemini 3.1 Pro | verified | 57 | 2026 |
| 02 | MiniMax M2.7 | verified | 37 | 2026 |
| 03 | Claude Opus 4.6 | verified | 33 | 2026 |
| 04 | Qwen3.5-397B-A17B† | verified | 26 | 2026 |
| 05 | Claude Sonnet 4.6 | verified | 26 | 2026 |
| 06 | GPT-5.4 | verified | 19 | 2026 |

Pass (%)

Pass (%) is one of the metrics reported for AcademiClaw: the pass rate, in percent.

Higher is better

Trust tiers: verified, paper, vendor, community, unverified.
Source table column: Pass (%).

| Rank | Model | Trust | Score | Year |
|------|-------|-------|-------|------|
| 01 | Claude Opus 4.6 | verified | 55 | 2026 |
| 02 | Claude Sonnet 4.6 | verified | 55 | 2026 |
| 03 | Gemini 3.1 Pro | verified | 43.8 | 2026 |
| 04 | GPT-5.4 | verified | 42.5 | 2026 |
| 05 | Qwen3.5-397B-A17B† | verified | 40 | 2026 |
| 06 | MiniMax M2.7 | verified | 37.5 | 2026 |
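
Because the six metrics above mix quality measures (Avg Score, Pass, Safety Score) with cost measures (tokens, time, tool calls), it can help to hold them in one record per model and rank each metric with an explicit direction. The following is a minimal sketch, not part of Codesota or AcademiClaw: the names `Row`, `ROWS`, `LOWER_IS_BETTER`, and `rank` are hypothetical, and treating the three cost metrics as "lower is better" is an assumption made here for illustration. The values are copied from the tables on this page; the page lists no time for Qwen3.5-397B-A17B†, so that field is left empty rather than filled in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    model: str
    tokens_k: int            # Tokens / task (K)
    time_s: Optional[int]    # Time (s); None where the page lists no value
    safety: float            # Safety
    avg_score: float         # Avg Score
    tools: int               # Tools / task
    pass_pct: float          # Pass (%)


# Values copied from the leaderboard tables on this page.
ROWS = [
    Row("Claude Opus 4.6", 1425, 673, 87.4, 71.9, 33, 55.0),
    Row("Claude Sonnet 4.6", 1562, 662, 88.7, 68.3, 26, 55.0),
    Row("GPT-5.4", 525, 240, 87.5, 65.6, 19, 42.5),
    Row("Qwen3.5-397B-A17B†", 970, None, 80.8, 64.7, 26, 40.0),
    Row("Gemini 3.1 Pro", 2857, 822, 74.9, 64.3, 57, 43.8),
    Row("MiniMax M2.7", 1663, 686, 86.5, 63.1, 37, 37.5),
]

# Assumption for this sketch: cost metrics rank ascending, all others descending.
LOWER_IS_BETTER = {"tokens_k", "time_s", "tools"}


def rank(metric: str) -> list[str]:
    """Return model names ordered best-first for the given metric."""
    rows = [r for r in ROWS if getattr(r, metric) is not None]
    descending = metric not in LOWER_IS_BETTER
    # sorted() is stable, so tied models keep their order from ROWS.
    ordered = sorted(rows, key=lambda r: getattr(r, metric), reverse=descending)
    return [r.model for r in ordered]


if __name__ == "__main__":
    for metric in ("avg_score", "pass_pct", "safety", "tokens_k", "time_s", "tools"):
        print(f"{metric:>10}: {', '.join(rank(metric))}")
```

Under that assumption, the cost metrics come out in the reverse of the order shown on the page, while the quality metrics keep the page's order. Models with no value for a metric are skipped for that metric rather than ranked last.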
§ 04 · Submit a result

Add to the leaderboard.
