Public benchmarks don’t answer your question.
You’re evaluating 3–15 models for production. Each one boasts a different public score on a different test set. None of those test sets are your documents — and half the scores are now contaminated.
We build a private benchmark on your task, evaluate every model you’re considering on the same hold-out set, and refresh it quarterly so the answer stays current as new models ship. Outcome: a defensible model choice in 4 weeks instead of 4 months of internal speculation.
Three reasons public benchmarks fail you.
Every popular benchmark — MMLU, HumanEval, OmniDoc, OCRBench — is in the next training set within six months of release. A 2026 model scoring 92 on a 2024 benchmark might have memorised the answers. Your private hold-out doesn't have that problem.
Public benchmarks measure averaged performance on Wikipedia-style data. They miss the task that pays your rent: Polish invoices, German handwritten medical forms, scanned legacy PDFs with deliberate redactions, screenshot-based UI flows. A model that wins on OmniDoc may lose on your documents.
Two models you're evaluating may never have been tested on the same benchmark. Without a shared test, you're comparing marketing claims, not capabilities. We make the comparison real.
What you actually get.
Not a PDF report that ages out in six months. A living evaluation that updates as new models ship.
Hosted on codesota.com/private/<your-slug>, magic-link auth. Same visual language as our public Power Rankings — power score, per-metric breakdown, cost/latency Pareto, model lineage. Bookmark it; share it with the eval committee.
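For the curious, the cost/latency Pareto view computes exactly what it sounds like: a model makes the frontier only if no other candidate beats it on every axis at once. A minimal sketch of that filter (field names and the three axes are illustrative, not our internal code):

```python
# Minimal sketch of strict Pareto dominance over three axes.
# Field names are illustrative; the real dashboard uses your metrics.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelResult:
    name: str
    accuracy: float       # higher is better
    usd_per_call: float   # lower is better
    p95_latency_s: float  # lower is better

def dominates(b: ModelResult, a: ModelResult) -> bool:
    """True if b is at least as good as a everywhere, strictly better somewhere."""
    at_least = (b.accuracy >= a.accuracy
                and b.usd_per_call <= a.usd_per_call
                and b.p95_latency_s <= a.p95_latency_s)
    strictly = (b.accuracy > a.accuracy
                or b.usd_per_call < a.usd_per_call
                or b.p95_latency_s < a.p95_latency_s)
    return at_least and strictly

def pareto_frontier(results: list[ModelResult]) -> list[ModelResult]:
    """Keep only the models no other candidate beats on all three axes at once."""
    return [a for a in results if not any(dominates(b, a) for b in results)]
```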
Defensible to a CTO, an auditor, or a board. Test set construction, exclusion criteria, statistical significance, what was held out and why. The document you hand the procurement officer when they ask ‘how do you know?’
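'Statistical significance' here means, concretely, something like a paired bootstrap over the shared hold-out set: resample the per-item score differences and check how often the apparent winner's lead survives. A hedged sketch of that shape of test (function and variable names are ours for illustration, not the methodology document's actual code):

```python
# Illustrative paired bootstrap: are model A's wins over model B on the
# same hold-out items more than noise? Names are ours, not a real API.
import random

def paired_bootstrap(scores_a: list[float], scores_b: list[float],
                     n_resamples: int = 10_000, seed: int = 0) -> float:
    """Fraction of resamples in which A's lead over B disappears.

    scores_a[i] and scores_b[i] are the two models' scores on the SAME
    hold-out item i; pairing per item is what gives the test its power.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    worse = 0
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            worse += 1
    return worse / n_resamples  # small value: A's lead is unlikely to be noise
```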
5–10 redacted examples per category, public-shareable. Lets your stakeholders sanity-check the test set without exposing the held-out items. The rest stays private — that’s the point.
Every 90 days: new models added (the ones that shipped since last quarter), ~30% of test items rotated to stay ahead of contamination, 60-min review call to discuss what moved and why.
Four weeks, then quarterly.
| Week | What happens | Output |
|---|---|---|
| 01 | Scoping call. You describe the task, the candidate models, the production conditions (latency budget, $ budget, languages, document types). We sign a mutual NDA. You send a sample of 50–100 representative documents. | Signed scope |
| 02 | We curate the hold-out set with you (200–500 items), define metrics (3–8 depending on task), and lock the methodology. You sign off on methodology before we run anything. | Methodology doc |
| 03 | We run every candidate model on the hold-out set. Frontier APIs (OpenAI, Anthropic, Google, Mistral) plus 2–3 strong open-source models. We capture accuracy, latency, $/call, and refusal rate (see the sketch below this table). | Raw results |
| 04 | Dashboard goes live. 60-min readout call. You walk away with a defensible model choice and a private URL you can show internally. | Live dashboard |
| +90d | Refresh. New models added, ~30% of items rotated. Quarterly review call. Repeats for the duration of your engagement. | Refreshed dashboard |
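The week-03 measurement loop referenced above is, in shape, nothing exotic: one pass per model over the hold-out set, recording the four numbers. A sketch under obvious assumptions; `call_model` and `is_refusal` are stand-ins for provider-specific client code and a refusal classifier we don't show here:

```python
# Hedged sketch of the per-item measurement pass for one model.
# `call_model` and `is_refusal` are assumptions, not a real API.
import time
from typing import Callable

def evaluate(call_model: Callable[[str], tuple[str, float]],
             is_refusal: Callable[[str], bool],
             items: list[dict]) -> dict:
    correct = refused = 0
    latencies, costs = [], []
    for item in items:
        start = time.perf_counter()
        answer, usd_cost = call_model(item["prompt"])  # returns (text, $ cost)
        latencies.append(time.perf_counter() - start)
        costs.append(usd_cost)
        if is_refusal(answer):
            refused += 1
        elif answer == item["expected"]:  # exact match; real scoring varies by task
            correct += 1
    n = len(items)
    return {
        "accuracy": correct / n,
        "refusal_rate": refused / n,
        "mean_latency_s": sum(latencies) / n,
        "usd_per_call": sum(costs) / n,
    }
```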
Right fit if you...
- +...are evaluating 3–15 candidate models for a production deployment
- +...have ≥ €5k/mo in current or planned LLM/OCR API spend (the engagement pays for itself if it shifts even 20% of traffic to a cheaper tier)
- +...need to defend the choice internally — to a CTO, board, procurement, or compliance
- +...have a non-trivial document type that public benchmarks miss (CEE languages, regulated industries, legacy formats)
- +...want the answer to stay current as new models ship
Wrong fit if you...
- −...want us to build the production system. We evaluate; we don’t deploy. See consulting →
- −...want a one-shot benchmark with no refresh. Just buy a Gartner report; it’s the same shape and probably cheaper
- −...need us to make a specific model win. We run honest comparisons or we don’t run them.
- −...are a research lab benchmarking your own models for a paper (you have the in-house skill; you’re not our customer)
- −...won’t share representative production data under NDA. The whole point is your data — without it we’re just rerunning public benchmarks.
Why hold-out, not open data.
If we publish your test set, it’s in the next training corpus within six months. Your benchmark stops measuring capability and starts measuring memory. We’ve watched it happen to MMLU, HumanEval, GSM8K, OmniDoc.
The hold-out architecture solves it: methodology and sample items are public-shareable (you can put them in a slide deck), the actual test set stays private and rotates quarterly (so even if a question eventually leaks, it’s not the question we’re using anymore). Same approach used by SWE-bench Verified, ARC-AGI, GAIA — for the same reason.
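In code, the rotation is as simple as it sounds: each quarter, a random ~30% of live items retire and an equal number of vetted reserve items take their place, so a leaked item ages out of the live set within roughly a year. A sketch, assuming a pre-vetted reserve pool (names are ours):

```python
# Illustrative quarterly rotation; assumes reserve_pool holds at least
# as many vetted replacement items as we retire.
import random

def rotate(live_set: list[dict], reserve_pool: list[dict],
           fraction: float = 0.30, seed: int | None = None) -> list[dict]:
    """Retire ~`fraction` of live items and backfill from the reserve pool."""
    rng = random.Random(seed)
    k = round(len(live_set) * fraction)
    retire = set(rng.sample(range(len(live_set)), k))   # indices to drop
    kept = [item for i, item in enumerate(live_set) if i not in retire]
    fresh = rng.sample(reserve_pool, k)                 # vetted replacements
    return kept + fresh
```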
Your data never leaves our infrastructure. Mutual NDA, EU-hosted (Frankfurt), deletion on request, retention only for the duration of your engagement plus 90 days for audit.
The wrong model in production costs more than this engagement.
Send us the task family, the candidate models, and your timeline. We’ll respond within 48 hours with a fit assessment — no proposal theatre.