Primary benchmark dataset for *AcademiClaw: When Students Set Challenges for AI Agents*.
Avg Tokens Per Task K (average tokens consumed per task, in thousands) is one of the evaluation metrics reported for AcademiClaw. Codesota tracks published model scores on each metric so readers can compare state-of-the-art results across sources and model families.
Lower is better for this cost metric; rows are ordered by reported value, highest first.
| Rank | Model | Trust | Score | Year | Source |
|---|---|---|---|---|---|
| 01 | Gemini 3.1 Pro | verified | 2857 | 2026 | Source ↗ |
| 02 | MiniMax M2.7 | verified | 1663 | 2026 | Source ↗ |
| 03 | Claude Sonnet 4.6 | verified | 1562 | 2026 | Source ↗ |
| 04 | Claude Opus 4.6 | verified | 1425 | 2026 | Source ↗ |
| 05 | Qwen3.5-397B-A17B† | verified | 970 | 2026 | Source ↗ |
| 06 | GPT-5.4 | verified | 525 | 2026 | Source ↗ |
Avg Time Sec (average time per task, in seconds) is another evaluation metric reported for AcademiClaw.
Lower is better for this cost metric; rows are ordered by reported value, highest first.
| Rank | Model | Trust | Score | Year | Source |
|---|---|---|---|---|---|
| 01 | Gemini 3.1 Pro | verified | 822 | 2026 | Source ↗ |
| 02 | MiniMax M2.7 | verified | 686 | 2026 | Source ↗ |
| 03 | Claude Opus 4.6 | verified | 673 | 2026 | Source ↗ |
| 04 | Claude Sonnet 4.6 | verified | 662 | 2026 | Source ↗ |
| 05 | GPT-5.4 | verified | 240 | 2026 | Source ↗ |
Safety Score is another evaluation metric reported for AcademiClaw.
Higher is better
| Rank | Model | Trust | Score | Year | Source |
|---|---|---|---|---|---|
| 01 | Claude Sonnet 4.6 | verified | 88.7 | 2026 | Source ↗ |
| 02 | GPT-5.4 | verified | 87.5 | 2026 | Source ↗ |
| 03 | Claude Opus 4.6 | verified | 87.4 | 2026 | Source ↗ |
| 04 | MiniMax M2.7 | verified | 86.5 | 2026 | Source ↗ |
| 05 | Qwen3.5-397B-A17B† | verified | 80.8 | 2026 | Source ↗ |
| 06 | Gemini 3.1 Pro | verified | 74.9 | 2026 | Source ↗ |
Avg Score is another evaluation metric reported for AcademiClaw.
Higher is better
| Rank | Model | Trust | Score | Year | Source |
|---|---|---|---|---|---|
| 01 | Claude Opus 4.6 | verified | 71.9 | 2026 | Source ↗ |
| 02 | Claude Sonnet 4.6 | verified | 68.3 | 2026 | Source ↗ |
| 03 | GPT-5.4 | verified | 65.6 | 2026 | Source ↗ |
| 04 | Qwen3.5-397B-A17B† | verified | 64.7 | 2026 | Source ↗ |
| 05 | Gemini 3.1 Pro | verified | 64.3 | 2026 | Source ↗ |
| 06 | MiniMax M2.7 | verified | 63.1 | 2026 | Source ↗ |
Tool Calls Per Task is another evaluation metric reported for AcademiClaw.
This is a descriptive measure of agent behavior rather than a quality score; rows are ordered by reported value, highest first.
| Rank | Model | Trust | Score | Year | Source |
|---|---|---|---|---|---|
| 01 | Gemini 3.1 Pro | verified | 57 | 2026 | Source ↗ |
| 02 | MiniMax M2.7 | verified | 37 | 2026 | Source ↗ |
| 03 | Claude Opus 4.6 | verified | 33 | 2026 | Source ↗ |
| 04 | Qwen3.5-397B-A17B† | verified | 26 | 2026 | Source ↗ |
| 05 | Claude Sonnet 4.6 | verified | 26 | 2026 | Source ↗ |
| 06 | GPT-5.4 | verified | 19 | 2026 | Source ↗ |
Pass (pass rate, in percent) is another evaluation metric reported for AcademiClaw.
Higher is better
| Rank | Model | Trust | Score | Year | Source |
|---|---|---|---|---|---|
| 01 | Claude Opus 4.6 | verified | 55 | 2026 | Source ↗ |
| 02 | Claude Sonnet 4.6 | verified | 55 | 2026 | Source ↗ |
| 03 | Gemini 3.1 Pro | verified | 43.8 | 2026 | Source ↗ |
| 04 | GPT-5.4 | verified | 42.5 | 2026 | Source ↗ |
| 05 | Qwen3.5-397B-A17B† | verified | 40 | 2026 | Source ↗ |
| 06 | MiniMax M2.7 | verified | 37.5 | 2026 | Source ↗ |
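
Because the same six models appear in every table, readers may want to weigh quality against cost, for example pass rate against token consumption. Below is a minimal sketch using only the figures reported above; the derived "pass-% per thousand tokens" ratio is an illustrative combination, not a metric Codesota publishes.

```python
# Minimal sketch combining two of the AcademiClaw tables above: Pass (%)
# against Avg Tokens Per Task K. The derived "pass-% per thousand tokens"
# figure is illustrative only and is not a metric Codesota publishes.

# Reported scores copied from the tables above (Avg Tokens Per Task is in
# thousands, per the metric's K suffix; Pass is in percent).
avg_tokens_k = {
    "Gemini 3.1 Pro": 2857,
    "MiniMax M2.7": 1663,
    "Claude Sonnet 4.6": 1562,
    "Claude Opus 4.6": 1425,
    "Qwen3.5-397B-A17B": 970,
    "GPT-5.4": 525,
}
pass_pct = {
    "Claude Opus 4.6": 55.0,
    "Claude Sonnet 4.6": 55.0,
    "Gemini 3.1 Pro": 43.8,
    "GPT-5.4": 42.5,
    "Qwen3.5-397B-A17B": 40.0,
    "MiniMax M2.7": 37.5,
}

# Pass percentage earned per thousand tokens spent, highest first.
efficiency = {m: pass_pct[m] / avg_tokens_k[m] for m in pass_pct}
for model, eff in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<20} {eff:.4f} pass-% per k tokens")
```

On these numbers GPT-5.4 yields the most pass percentage per token spent, even though Claude Opus 4.6 leads on raw Pass; which trade-off matters depends on the reader's cost constraints.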