AcademiClaw.

Name: AcademiClaw Benchmark Results
Creator: Unknown
License: https://creativecommons.org/licenses/by/4.0/

Primary benchmark dataset for the paper "AcademiClaw: When Students Set Challenges for AI Agents".

§ 01 · SOTA history

Year over year.

Not enough data to show trend.

§ 02 · Leaderboard

Results by metric.

Avg Tokens Per Task K

Avg Tokens Per Task K is one of the evaluation metrics reported for AcademiClaw, given in thousands of tokens (K) per task. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Rows are sorted from highest to lowest value.

Trust tiers for Avg Tokens Per Task K: verified, paper, vendor, community, unverified

Source table column: Tokens / task (K)

Rank	Model	Trust	Score	Year	Source
01	Gemini 3.1 Pro	verified	2857	2026	Source ↗
02	MiniMax M2.7	verified	1663	2026	Source ↗
03	Claude Sonnet 4.6	verified	1562	2026	Source ↗
04	Claude Opus 4.6	verified	1425	2026	Source ↗
05	Qwen3.5-397B-A17B†	verified	970	2026	Source ↗
06	GPT-5.4	verified	525	2026	Source ↗

Avg Time Sec

Avg Time Sec is one of the evaluation metrics reported for AcademiClaw, given in seconds. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Rows are sorted from highest to lowest value.

Trust tiers for Avg Time Sec: verified, paper, vendor, community, unverified

Source table column: Time (s)

Rank	Model	Trust	Score	Year	Source
01	Gemini 3.1 Pro	verified	822	2026	Source ↗
02	MiniMax M2.7	verified	686	2026	Source ↗
03	Claude Opus 4.6	verified	673	2026	Source ↗
04	Claude Sonnet 4.6	verified	662	2026	Source ↗
05	GPT-5.4	verified	240	2026	Source ↗

Safety Score

Safety Score is one of the evaluation metrics reported for AcademiClaw. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Safety Score: verified, paper, vendor, community, unverified

Source table column: Safety

Rank	Model	Trust	Score	Year	Source
01	Claude Sonnet 4.6	verified	88.7	2026	Source ↗
02	GPT-5.4	verified	87.5	2026	Source ↗
03	Claude Opus 4.6	verified	87.4	2026	Source ↗
04	MiniMax M2.7	verified	86.5	2026	Source ↗
05	Qwen3.5-397B-A17B†	verified	80.8	2026	Source ↗
06	Gemini 3.1 Pro	verified	74.9	2026	Source ↗

Avg Score

Avg Score is one of the evaluation metrics reported for AcademiClaw. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Avg Score: verified, paper, vendor, community, unverified

Source table column: Avg Score

Rank	Model	Trust	Score	Year	Source
01	Claude Opus 4.6	verified	71.9	2026	Source ↗
02	Claude Sonnet 4.6	verified	68.3	2026	Source ↗
03	GPT-5.4	verified	65.6	2026	Source ↗
04	Qwen3.5-397B-A17B†	verified	64.7	2026	Source ↗
05	Gemini 3.1 Pro	verified	64.3	2026	Source ↗
06	MiniMax M2.7	verified	63.1	2026	Source ↗

Tool Calls Per Task

Tool Calls Per Task is one of the evaluation metrics reported for AcademiClaw. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Rows are sorted from highest to lowest value.

Trust tiers for Tool Calls Per Task: verified, paper, vendor, community, unverified

Source table column: Tools / task

Rank	Model	Trust	Score	Year	Source
01	Gemini 3.1 Pro	verified	57	2026	Source ↗
02	MiniMax M2.7	verified	37	2026	Source ↗
03	Claude Opus 4.6	verified	33	2026	Source ↗
04	Qwen3.5-397B-A17B†	verified	26	2026	Source ↗
05	Claude Sonnet 4.6	verified	26	2026	Source ↗
06	GPT-5.4	verified	19	2026	Source ↗

Pass

Pass is one of the evaluation metrics reported for AcademiClaw, given as a percentage. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Pass: verified, paper, vendor, community, unverified

Source table column: Pass (%)

Rank	Model	Trust	Score	Year	Source
01	Claude Opus 4.6	verified	55	2026	Source ↗
02	Claude Sonnet 4.6	verified	55	2026	Source ↗
03	Gemini 3.1 Pro	verified	43.8	2026	Source ↗
04	GPT-5.4	verified	42.5	2026	Source ↗
05	Qwen3.5-397B-A17B†	verified	40	2026	Source ↗
06	MiniMax M2.7	verified	37.5	2026	Source ↗
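
The per-metric tables can be joined by model name to compare models across metrics. The following is a minimal Python sketch and is not part of the Codesota page. The values are copied from the tables above (Avg Time Sec is left out because it has no Qwen3.5-397B-A17B entry). The names `RESULTS`, `rank_by`, and `score_per_million_tokens`, and the idea of treating token usage as a cost, are illustrative assumptions rather than anything the benchmark defines.

```python
# Values copied from the leaderboard tables (2026 entries).
# tokens_k is "Tokens / task (K)", i.e. thousands of tokens per task.
RESULTS = {
    "Claude Opus 4.6":   {"avg_score": 71.9, "pass_pct": 55.0, "safety": 87.4, "tokens_k": 1425, "tool_calls": 33},
    "Claude Sonnet 4.6": {"avg_score": 68.3, "pass_pct": 55.0, "safety": 88.7, "tokens_k": 1562, "tool_calls": 26},
    "GPT-5.4":           {"avg_score": 65.6, "pass_pct": 42.5, "safety": 87.5, "tokens_k": 525,  "tool_calls": 19},
    "Qwen3.5-397B-A17B": {"avg_score": 64.7, "pass_pct": 40.0, "safety": 80.8, "tokens_k": 970,  "tool_calls": 26},
    "Gemini 3.1 Pro":    {"avg_score": 64.3, "pass_pct": 43.8, "safety": 74.9, "tokens_k": 2857, "tool_calls": 57},
    "MiniMax M2.7":      {"avg_score": 63.1, "pass_pct": 37.5, "safety": 86.5, "tokens_k": 1663, "tool_calls": 37},
}


def rank_by(metric, higher_is_better=True):
    """Return model names ordered by one metric.

    The sort direction is a caller's choice: the sketch does not assume
    which direction is "better" for any given metric.
    """
    return sorted(RESULTS, key=lambda m: RESULTS[m][metric], reverse=higher_is_better)


def score_per_million_tokens(model):
    """Hypothetical efficiency ratio: Avg Score per million tokens per task."""
    row = RESULTS[model]
    return row["avg_score"] / (row["tokens_k"] / 1000)


if __name__ == "__main__":
    print("By Avg Score:", rank_by("avg_score"))
    print("By tokens (fewest first):", rank_by("tokens_k", higher_is_better=False))
    for model in sorted(RESULTS, key=score_per_million_tokens, reverse=True):
        print(f"{model}: {score_per_million_tokens(model):.1f} Avg Score points per million tokens")
```

For example, `rank_by("tokens_k", higher_is_better=False)` lists GPT-5.4 first, since it reports the lowest token usage (525K per task), while `rank_by("avg_score")` lists Claude Opus 4.6 first.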

§ 04 · Submit a result

Add to the leaderboard.
