{
"benchmark": "ADE20K Semantic Segmentation",
"metric": "mIoU (mean Intersection over Union)",
"split": "val (20,210 images, 150 classes)",
"papers": [
{
"model": "InternImage-H",
"score": 62.9,
"evaluation": "single-scale, UperNet, 896x896 crop",
"arxiv": "2211.05778",
"caveats": "ImageNet-22k + Object365 pretrain;
TTA reports 64.2 — not standard."
},
    {
      "model": "SwinV2-G",
      "score": 61.4,
      "caveats": "3B params; multi-scale testing inflates ~1.5 mIoU vs single-scale."
    }
    // ... 3 more
  ],
  "comparability_flags": [
    "crop_size_varies: 512 vs 640 vs 896",
    "test_time_augmentation: single vs multi-scale",
    "pretraining_data: ImageNet-1k → proprietary",
    "decoder_head: UperNet vs Mask2Former"
  ]
}

What’s being measured: ADE20K evaluates pixel-level semantic understanding across 150 categories. mIoU averages per-class IoU, so rare classes (chandelier, escalator) count exactly as much as common ones (wall, floor), and long-tail performance matters more than leaderboards suggest. A model scoring 60 mIoU can still fail catastrophically on 30+ rare categories.
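To make the class averaging concrete, here is a minimal sketch of mIoU computed from a confusion matrix, followed by a toy average showing how a headline score near 60 can coexist with near-total failure on dozens of rare classes. The per-class IoU values in the toy example are invented for illustration, not results from any of the cited papers.

```python
import numpy as np

def miou(conf: np.ndarray) -> float:
    """mIoU from a C x C confusion matrix (rows = ground truth, columns = prediction)."""
    tp = np.diag(conf).astype(float)
    fn = conf.sum(axis=1) - tp          # class-c pixels predicted as something else
    fp = conf.sum(axis=0) - tp          # pixels predicted as class c but labelled otherwise
    denom = tp + fp + fn
    iou = np.full_like(tp, np.nan)
    np.divide(tp, denom, out=iou, where=denom > 0)  # classes absent from GT and prediction stay NaN
    return float(np.nanmean(iou))                   # every class contributes 1/C, rare or not

# Toy average (invented IoUs, not real ADE20K results): 120 common classes at
# 0.75 IoU plus 30 rare classes at 0.05 IoU still yield ~0.61 mIoU overall.
per_class_iou = np.concatenate([np.full(120, 0.75), np.full(30, 0.05)])
print(round(per_class_iou.mean(), 2))  # 0.61
```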
Why results aren’t comparable: The five papers use three crop sizes, two decoder heads, and pretraining data ranging from ImageNet-1k to billion-scale proprietary sets. Two of the five use multi-scale TTA, which inflates scores by 1–2 mIoU, yet this isn't flagged alongside the headline numbers. Once normalized to a single protocol, the headline gap widens. Any leaderboard that mixes these results without methodology flags is misleading.
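As a sketch of how such methodology flags could be checked mechanically rather than stated in prose, the snippet below scans a set of results for any protocol axis that varies before they are ranked. The Protocol fields, model names, and protocol values are placeholders chosen to mirror the comparability_flags above, not details taken from the papers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Protocol:
    crop: int        # crop size, e.g. 512 / 640 / 896
    tta: str         # "single-scale" or "multi-scale"
    pretrain: str    # "ImageNet-1k", "ImageNet-22k", proprietary, ...
    decoder: str     # "UperNet" or "Mask2Former"

@dataclass
class Result:
    model: str
    miou: float
    protocol: Protocol

def protocol_flags(results: list[Result]) -> list[str]:
    """Return one human-readable flag per protocol axis that varies across results."""
    flags = []
    for axis in ("crop", "tta", "pretrain", "decoder"):
        values = {getattr(r.protocol, axis) for r in results}
        if len(values) > 1:
            flags.append(f"{axis} varies: {sorted(map(str, values))}")
    return flags

# Placeholder entries; protocols are invented for illustration only.
results = [
    Result("model-A", 62.9, Protocol(896, "single-scale", "ImageNet-22k + extra", "UperNet")),
    Result("model-B", 61.4, Protocol(640, "multi-scale", "ImageNet-22k", "UperNet")),
]
for flag in protocol_flags(results):
    print("WARNING:", flag)  # e.g. "crop varies: ['640', '896']"
```

Surfacing these warnings next to the scores is the programmatic counterpart of the comparability_flags array in the record above.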