Codesota · Research bounties, Knuth-style · 42 tasks · $280 total · First valid submission wins
§ 00 · Challenges

Research bounties,
Knuth-style.

Forty-two research tasks that produce real value for the ML community. Each one earns a symbolic reward — like Knuth’s checks, a proof of work you frame, not cash. AI agents can help, but they can’t carry you.

Every accepted deliverable is published on Codesota with full attribution. AI tools are encouraged; raw unverified AI output is rejected.

§ 01 · Reward

Symbolic, but real.

Rewards are symbolic — like Donald Knuth’s famous checks for finding errors in his books. Most recipients frame them. The real reward is published research with your name on it.

Tier         Tasks    Reward     Effort      Core challenge
Easy         1–8      $1 each    2–5 hours   Find + verify data
Medium       9–16     $2 each    4–8 hours   Analyze + synthesize
Hard         17–24    $4 each    1–3 days    Create + reproduce
Extra Hard   25–32    $8 each    3–7 days    Build + pioneer
Legendary    33–42    $16 each   1–3 weeks   Research + publish
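
The tiers reconcile with the 42-task, $280 header. A quick Python sanity check, with task ranges and amounts copied from the table above:

# Tier arithmetic from the table above: (task range, dollars per task).
tiers = {
    "Easy":       (range(1, 9),   1),   # tasks 1–8,   $1 each
    "Medium":     (range(9, 17),  2),   # tasks 9–16,  $2 each
    "Hard":       (range(17, 25), 4),   # tasks 17–24, $4 each
    "Extra Hard": (range(25, 33), 8),   # tasks 25–32, $8 each
    "Legendary":  (range(33, 43), 16),  # tasks 33–42, $16 each
}
assert sum(len(r) for r, _ in tiers.values()) == 42        # total tasks
assert sum(len(r) * usd for r, usd in tiers.values()) == 280  # total dollars
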
§ 02 · The list

All 42 challenges.

§ 03 · Process

How it works.

  1. Claim a task
     Click "I'm working on this" to claim a challenge. We send you tips and resources.

  2. Do the research
     Use AI tools to assist, but verify everything yourself. We check sources.

  3. Submit
     Submit a deliverable with a repo link. We review within 72 hours; community members peer-review.

  4. Get published
     Accepted work is published on codesota.com with attribution, plus a collectible check to frame.

§ 04 · Rules

The quality bar.

AI tools are encouraged for research assistance. Raw AI output without human verification is rejected.

Submission
  1. Complete the task and prepare your deliverable.
  2. Submit via the form with your work + source links.
  3. We review within 72 hours.
  4. Approved work is published with your name, and the reward is paid.
Quality bar
  • Every claim must cite a primary source.
  • AI tools are encouraged for research assistance.
  • Raw AI output without verification = rejection.
  • We spot-check sources; one wrong source fails the submission (see the sketch below).
  • First valid submission per task earns the check.
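
What a spot-check can look like in practice, as an illustrative sketch rather than our actual review tooling: it loads a deliverable shaped like the § 05 example below (the file name and the "papers"/"arxiv"/"model" fields are borrowed from there, assuming the elided entries are filled in) and confirms each cited arXiv ID still resolves.

import json
import urllib.error
import urllib.request

def arxiv_resolves(arxiv_id: str) -> bool:
    # HEAD request against the public abstract page; HTTP 200 means the ID exists.
    req = urllib.request.Request(f"https://arxiv.org/abs/{arxiv_id}", method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False

# Illustrative only: file and fields follow the § 05 example deliverable.
deliverable = json.load(open("ade20k-archaeology.json"))
for paper in deliverable["papers"]:
    arxiv_id = paper.get("arxiv")  # a missing citation fails too
    ok = bool(arxiv_id) and arxiv_resolves(arxiv_id)
    print("OK  " if ok else "FAIL", paper["model"])
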
§ 05 · Example

What a completed #01 looks like.

Benchmark Archaeology on ADE20K Semantic Segmentation. This is the quality bar.

ade20k-archaeology.json · 5 papers · structured data · $1
{
  "benchmark": "ADE20K Semantic Segmentation",
  "metric":    "mIoU (mean Intersection over Union)",
  "split":     "val (20,210 images, 150 classes)",
  "papers": [
    {
      "model":      "InternImage-H",
      "score":      62.9,
      "evaluation": "single-scale, UperNet, 896x896 crop",
      "arxiv":      "2211.05778",
      "caveats":    "ImageNet-22k + Object365 pretrain;
                     TTA reports 64.2 — not standard."
    },
    {
      "model":      "SwinV2-G",
      "score":      61.4,
      "caveats":    "3B params; multi-scale testing inflates
                     ~1.5 mIoU vs single-scale."
    }
    // ... 3 more
  ],
  "comparability_flags": [
    "crop_size_varies:      512 vs 640 vs 896",
    "test_time_augmentation: single vs multi-scale",
    "pretraining_data:      ImageNet-1k → proprietary",
    "decoder_head:          UperNet vs Mask2Former"
  ]
}
summary.md · 2 paragraphs

What’s being measured: ADE20K evaluates pixel-level semantic understanding across 150 categories. mIoU weights rare classes (chandelier, escalator) equally with common ones (wall, floor) — so long-tail performance matters more than leaderboards suggest. A model scoring 60 mIoU can still catastrophically fail on 30+ rare categories.

Why results aren’t comparable: The 5 papers use three crop sizes, two decoder heads, and pretraining ranging from ImageNet-1k to billion-scale proprietary sets. Multi-scale TTA (used by 2 of the 5) inflates scores by 1–2 mIoU but isn’t flagged. Normalized to one protocol, the headline gap widens. Any leaderboard mixing these without methodology flags is misleading.
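
The summary’s point that mIoU weights rare classes equally with common ones is easy to see in code. A minimal sketch (NumPy; the two-class confusion matrix and its pixel counts are invented for illustration):

import numpy as np

def mean_iou(conf: np.ndarray) -> float:
    # conf is a C x C confusion matrix: rows = ground truth, cols = prediction.
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp  # predicted as class c, actually something else
    fn = conf.sum(axis=1) - tp  # actually class c, predicted as something else
    return float(np.mean(tp / (tp + fp + fn)))  # every class counts once

# Toy numbers: "wall" covers a million pixels and is nearly perfect;
# "chandelier" covers 100 pixels and is mostly missed.
conf = np.array([[1_000_000,  0],
                 [       90, 10]])
print(mean_iou(conf))  # ~0.55

One rare class at 0.1 IoU drags an otherwise near-perfect score down to 0.55, which is exactly why long-tail performance matters more than leaderboards suggest.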

§ 06 · Why these exist

ML benchmarking is broken.

Papers report scores without methodology details. Leaderboards mix apples and oranges. Datasets rot. Human baselines were collected once in 2018 and never updated.

Codesota tracks 231 benchmarks. 188 of them need research. That’s not a backlog — it’s an opportunity for anyone willing to do the work.

These challenges are designed so AI agents get you 30–40% of the way there. The remaining 60–70% is verifying sources, making judgment calls, running real experiments, and finding what the data actually means. That’s where the value lives, and that’s what we pay for.

Ready to start?

Pick a task, do the research, ship the deliverable. Your work helps the entire ML community — and earns you a check worth framing.
