Public benchmarks don’t answer your question.
You’re evaluating 3–15 models for production. Each one boasts a different public score on a different test set. None of those test sets are your documents — and half the scores are now contaminated.
We build a private benchmark on your task, evaluate every model you’re considering on the same hold-out set, and refresh it quarterly so the answer stays current as new models ship. Outcome: a defensible model choice in 4 weeks instead of 4 months of internal speculation.
Three reasons public benchmarks fail you.
Every popular benchmark — MMLU, HumanEval, OmniDoc, OCRBench — is in the next training set within six months of release. A 2026 model scoring 92 on a 2024 benchmark might have memorised the answers. Your private hold-out doesn't have that problem.
Public benchmarks measure averaged performance on Wikipedia-style data. They miss the task that pays your rent: Polish invoices, German handwritten medical forms, scanned legacy PDFs with deliberate redactions, screenshot-based UI flows. A model that wins on OmniDoc may lose on your documents.
Two models you're evaluating may never have been tested on the same benchmark. Without a shared test, you're comparing marketing claims, not capabilities. We make the comparison real.
What you actually get.
Not a PDF report that ages out in six months. A living evaluation that updates as new models ship.
Hosted on codesota.com/private/<your-slug>, magic-link auth. Same visual language as our public Power Rankings — power score, per-metric breakdown, cost/latency Pareto, model lineage. Bookmark it; share it with the eval committee.
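For the curious, the cost/latency Pareto view computes exactly what it sounds like: a model makes the frontier only if no other candidate beats it on every axis at once. A minimal sketch of that filter (field names and the three axes are illustrative, not our internal code):

```python
# Minimal sketch of strict Pareto dominance over three axes.
# Field names are illustrative; the real dashboard uses your metrics.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelResult:
    name: str
    accuracy: float       # higher is better
    usd_per_call: float   # lower is better
    p95_latency_s: float  # lower is better

def dominates(b: ModelResult, a: ModelResult) -> bool:
    """True if b is at least as good as a everywhere, strictly better somewhere."""
    at_least = (b.accuracy >= a.accuracy
                and b.usd_per_call <= a.usd_per_call
                and b.p95_latency_s <= a.p95_latency_s)
    strictly = (b.accuracy > a.accuracy
                or b.usd_per_call < a.usd_per_call
                or b.p95_latency_s < a.p95_latency_s)
    return at_least and strictly

def pareto_frontier(results: list[ModelResult]) -> list[ModelResult]:
    """Keep only the models no other candidate beats on all three axes at once."""
    return [a for a in results if not any(dominates(b, a) for b in results)]
```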
Defensible to a CTO, an auditor, or a board. Test set construction, exclusion criteria, statistical significance, what was held out and why. The document you hand the procurement officer when they ask ‘how do you know?’
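'Statistical significance' here means, concretely, something like a paired bootstrap over the shared hold-out set: resample the per-item score differences and check how often the apparent winner's lead survives. A hedged sketch of that shape of test (function and variable names are ours for illustration, not the methodology document's actual code):

```python
# Illustrative paired bootstrap: are model A's wins over model B on the
# same hold-out items more than noise? Names are ours, not a real API.
import random

def paired_bootstrap(scores_a: list[float], scores_b: list[float],
                     n_resamples: int = 10_000, seed: int = 0) -> float:
    """Fraction of resamples in which A's lead over B disappears.

    scores_a[i] and scores_b[i] are the two models' scores on the SAME
    hold-out item i; pairing per item is what gives the test its power.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    worse = 0
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            worse += 1
    return worse / n_resamples  # small value: A's lead is unlikely to be noise
```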
5–10 redacted examples per category, public-shareable. Lets your stakeholders sanity-check the test set without exposing the held-out items. The rest stays private — that’s the point.
Every 90 days: new models added (the ones that shipped since last quarter), ~30% of test items rotated to stay ahead of contamination, 60-min review call to discuss what moved and why.
Four weeks, then quarterly.
| Week | What happens | Output |
|---|---|---|
| 01 | Scoping call. You describe the task, the candidate models, the production conditions (latency budget, $ budget, languages, document types). We sign a mutual NDA. You send a sample of 50–100 representative documents. | Signed scope |
| 02 | We curate the hold-out set with you (200–500 items), define metrics (3–8 depending on task), and lock the methodology. You sign off on methodology before we run anything. | Methodology doc |
| 03 | We run every candidate model on the hold-out set. Frontier APIs (OpenAI, Anthropic, Google, Mistral) plus 2–3 strong open-source models. We capture accuracy, latency, $/call, and refusal rate (see the sketch below this table). | Raw results |
| 04 | Dashboard goes live. 60-min readout call. You walk away with a defensible model choice and a private URL you can show internally. | Live dashboard |
| +90d | Refresh. New models added, ~30% of items rotated. Quarterly review call. Repeats for the duration of your engagement. | Refreshed dashboard |
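The week-03 measurement loop referenced above is, in shape, nothing exotic: one pass per model over the hold-out set, recording the four numbers. A sketch under obvious assumptions; `call_model` and `is_refusal` are stand-ins for provider-specific client code and a refusal classifier we don't show here:

```python
# Hedged sketch of the per-item measurement pass for one model.
# `call_model` and `is_refusal` are assumptions, not a real API.
import time
from typing import Callable

def evaluate(call_model: Callable[[str], tuple[str, float]],
             is_refusal: Callable[[str], bool],
             items: list[dict]) -> dict:
    correct = refused = 0
    latencies, costs = [], []
    for item in items:
        start = time.perf_counter()
        answer, usd_cost = call_model(item["prompt"])  # returns (text, $ cost)
        latencies.append(time.perf_counter() - start)
        costs.append(usd_cost)
        if is_refusal(answer):
            refused += 1
        elif answer == item["expected"]:  # exact match; real scoring varies by task
            correct += 1
    n = len(items)
    return {
        "accuracy": correct / n,
        "refusal_rate": refused / n,
        "mean_latency_s": sum(latencies) / n,
        "usd_per_call": sum(costs) / n,
    }
```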
Right fit if you...
- +...are evaluating 3–15 candidate models for a production deployment
- +...have ≥ €5k/mo in current or planned LLM/OCR API spend (the engagement pays for itself if it shifts even 20% of traffic to a cheaper tier)
- +...need to defend the choice internally — to a CTO, board, procurement, or compliance
- +...have a non-trivial document type that public benchmarks miss (CEE languages, regulated industries, legacy formats)
- +...want the answer to stay current as new models ship
Wrong fit if you...
- −...want us to build the production system. We evaluate; we don’t deploy. See consulting →
- −...want a one-shot benchmark with no refresh. Just buy a Gartner report; it’s the same shape and probably cheaper
- −...need us to make a specific model win. We run honest comparisons or we don’t run them.
- −...are a research lab benchmarking your own models for a paper (you have the in-house skill; you’re not our customer)
- −...won’t share representative production data under NDA. The whole point is your data — without it we’re just rerunning public benchmarks.
Why hold-out, not open data.
If we publish your test set, it’s in the next training corpus within six months. Your benchmark stops measuring capability and starts measuring memory. We’ve watched it happen to MMLU, HumanEval, GSM8K, OmniDoc.
The hold-out architecture solves it: methodology and sample items are public-shareable (you can put them in a slide deck), the actual test set stays private and rotates quarterly (so even if a question eventually leaks, it’s not the question we’re using anymore). Same approach used by SWE-bench Verified, ARC-AGI, GAIA — for the same reason.
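In code, the rotation is as simple as it sounds: each quarter, a random ~30% of live items retire and an equal number of vetted reserve items take their place, so a leaked item ages out of the live set within roughly a year. A sketch, assuming a pre-vetted reserve pool (names are ours):

```python
# Illustrative quarterly rotation; assumes reserve_pool holds at least
# as many vetted replacement items as we retire.
import random

def rotate(live_set: list[dict], reserve_pool: list[dict],
           fraction: float = 0.30, seed: int | None = None) -> list[dict]:
    """Retire ~`fraction` of live items and backfill from the reserve pool."""
    rng = random.Random(seed)
    k = round(len(live_set) * fraction)
    retire = set(rng.sample(range(len(live_set)), k))   # indices to drop
    kept = [item for i, item in enumerate(live_set) if i not in retire]
    fresh = rng.sample(reserve_pool, k)                 # vetted replacements
    return kept + fresh
```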
Your data never leaves our infrastructure. Mutual NDA, EU-hosted (Frankfurt), deletion on request, retention only for the duration of your engagement plus 90 days for audit.
The wrong model in production costs more than this engagement.
Send us the task family, the candidate models, and your timeline. We’ll respond within 48 hours with a fit assessment — no proposal theatre.