AI Implementation Methodology
We design the failure modes first.
Most AI projects fail not because of bad models, but because of undefined outcomes, ignored risks, and no adoption plan. We fix that with a process built on benchmarks of 40+ models under real production conditions.
40+
Models benchmarked
codesota.com/ocr
71%
CER reduction
RysOCR, Polish docs
6
Phase framework
outcome → risk → ship
The Problem
Why AI projects keep failing the same way
The pattern is consistent across industries. We have seen it, measured it, and built a process that does not repeat it.
01
No defined outcomes
Projects kick off with "discovery" before anyone agrees what success looks like in business terms.
02
Risk as afterthought
Teams rush to a PoC before mapping failure modes. The expensive risks surface late.
03
Vendor claims, no data
Vendors benchmark on their own test sets, so model selection rests on marketing claims rather than independent evaluations on standardised datasets.
04
No explainability
Confidence design happens after the model is built. Users don't trust the system and route around it.
05
Deployment = done
Projects "complete" at launch. No adoption tracking, no feedback loops, no measurement of impact.
06
No kill criteria
Without pre-defined stopping conditions, failing projects keep burning budget.
The Framework
Outcome-first AI development
A continuous loop that starts with outcomes, selects models using benchmark data, and never stops measuring.
Phase 00
Outcome Definition
Before any discovery, any model selection, any code — we define what success looks like in business terms. And what would make us stop.
North Star Metric
One number that moves
What single metric proves this worked? Not "model accuracy" — actual business impact. Invoice processing time. Screening throughput. Error rate in production.
Decision Rights
What does AI decide?
Recommend? Automate? Escalate? The boundary between AI and human decision is designed explicitly — not discovered later when something goes wrong.
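A minimal sketch of what explicit decision rights can look like in code, assuming a model that emits a calibrated confidence score. The tier names and thresholds below are illustrative, not defaults we ship; in a real engagement they come out of the Cost of Wrong analysis:

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    AUTOMATE = "automate"    # AI acts; humans audit samples
    RECOMMEND = "recommend"  # AI suggests; a human decides
    ESCALATE = "escalate"    # AI abstains; a human decides alone

@dataclass
class DecisionPolicy:
    # Illustrative thresholds, assuming calibrated confidence scores.
    automate_above: float = 0.95
    recommend_above: float = 0.70

    def route(self, confidence: float) -> Action:
        if confidence >= self.automate_above:
            return Action.AUTOMATE
        if confidence >= self.recommend_above:
            return Action.RECOMMEND
        return Action.ESCALATE

policy = DecisionPolicy()
print(policy.route(0.98))  # Action.AUTOMATE
print(policy.route(0.55))  # Action.ESCALATE
```

The point of writing this down before building is that the boundary becomes a reviewable artifact rather than an emergent behaviour.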
Cost of Wrong
Failure taxonomy
A false positive in PEP screening carries a very different cost from one in document OCR. We map the asymmetry of errors before touching a dataset.
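A toy illustration of that asymmetry, with hypothetical cost figures invented for this sketch. The point is that identical error rates produce wildly different expected costs per domain:

```python
# Hypothetical cost figures for illustration only.
ERROR_COSTS = {
    "pep_screening": {
        "false_positive": 40,       # an analyst reviews a clean hit
        "false_negative": 250_000,  # a missed match: regulatory exposure
    },
    "document_ocr": {
        "false_positive": 2,   # a field flagged for manual correction
        "false_negative": 15,  # a wrong value slips into the record
    },
}

def expected_error_cost(use_case: str, fp_rate: float, fn_rate: float,
                        volume: int) -> float:
    """Expected cost of errors over a given decision volume."""
    c = ERROR_COSTS[use_case]
    return volume * (fp_rate * c["false_positive"]
                     + fn_rate * c["false_negative"])

# Same error rates, two domains, three orders of magnitude apart.
print(expected_error_cost("pep_screening", fp_rate=0.05, fn_rate=0.001, volume=10_000))
print(expected_error_cost("document_ocr", fp_rate=0.05, fn_rate=0.001, volume=10_000))
```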
Kill Criteria
Pre-defined stopping conditions
What would make us recommend stopping? Defined upfront, in writing, signed by sponsor. Not a post-hoc rationalisation when budget is spent.
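One way kill criteria can be made concrete: a checked structure rather than a slide. The criteria below are invented examples; the real ones live in the Outcome Charter and carry the sponsor's signature:

```python
from dataclasses import dataclass

@dataclass
class KillCriterion:
    description: str
    breached: bool

# Illustrative criteria only; the real set is written in Phase 00.
criteria = [
    KillCriterion("North Star Metric moved < 10% after two evaluation cycles", False),
    KillCriterion("Best benchmarked model misses the accuracy floor on our data", False),
    KillCriterion("Adoption below 30% of target users at week 8", True),
]

breached = [c for c in criteria if c.breached]
if breached:
    print("Recommend stopping:")
    for c in breached:
        print(f"  - {c.description}")
```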
What Changes
The difference in practice
| Dimension | Typical agency | CodeSOTA approach |
|---|---|---|
| Model selection | Vendor demos, blog posts, familiarity | Independent benchmarks on standardised datasets |
| Risk detection | During or after build | Phase 01 — mapped before a single line of code |
| Adoption strategy | Training session at launch | Explicit parallel track from Phase 05 onwards |
| Trust architecture | Post-build, if at all | Designed in Phase 04 before engineering begins |
| Kill decision | After sunk costs, political | Pre-defined criteria in Outcome Charter, Phase 00 |
| Measurement | Project closes at deployment | Phase ∞ — continuous loop, business metrics first |
Selected Outcomes
What this looks like in production
Anonymised. Numbers are real.
Industrial inspection — NDT / Energy
~85%
Detection accuracy
Real-time
Processing speed
Computer vision for automated defect detection in industrial inspection. Replaced manual visual review. Key decision from Risk Architecture: manual override is always available and every override is logged as a training signal.
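As an illustration of that override-as-training-signal decision, here is a minimal sketch assuming a simple JSONL log that the next fine-tuning run consumes. The function name and field layout are ours, invented for this example, not the production system's:

```python
import json
import time

def log_override(record_id: str, model_label: str, human_label: str,
                 path: str = "overrides.jsonl") -> None:
    """Append a human override as a labelled example for the next training run."""
    event = {
        "ts": time.time(),
        "record_id": record_id,
        "model_label": model_label,
        "human_label": human_label,  # treated as ground truth downstream
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

# The inspector rejects the model's "defect" call; the disagreement
# becomes a hard example in the next fine-tuning set.
log_override("weld-0042", model_label="defect", human_label="no_defect")
```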
Compliance / AML screening — Fintech
3x
Screening throughput
-60%
False positive rate
LLM-powered adverse media and PEP screening. The Cost of Wrong analysis (Phase 00) determined that false negatives carried regulatory risk — so the model was tuned conservatively, with explainability designed for compliance officer review.
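A sketch of what "tuned conservatively" can mean in practice: pick the operating threshold by minimising expected error cost with a heavy false-negative weight, so the system over-flags rather than risk missing a true match. The cost ratio and toy validation data are illustrative:

```python
def pick_threshold(scores, labels, fn_cost=100.0, fp_cost=1.0):
    """Choose the decision threshold that minimises expected error cost.

    With fn_cost >> fp_cost the threshold drifts low: the system
    prefers extra analyst reviews over a missed match.
    """
    best_t, best_cost = 0.0, float("inf")
    for t in [i / 100 for i in range(1, 100)]:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = fp * fp_cost + fn * fn_cost
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# Toy validation data: screening-model scores, 1 = true match.
scores = [0.10, 0.45, 0.30, 0.70, 0.91]
labels = [0,    0,    1,    1,    1]
# ~0.11: accepts the 0.45 hit as a false positive rather than miss
# the 0.30 true match, because misses carry 100x the cost.
print(pick_threshold(scores, labels))
```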
Outcome Definition session
We work through your North Star Metric, decision rights, and what "done" actually means.
Initial Risk Architecture
We map your top 5 riskiest assumptions. The ones that would kill the project if you discovered them in month four.
Benchmark model shortlist
If relevant, we pull current benchmark data from CodeSOTA and give you a ranked shortlist — before you have spent anything.
Written report
A 2-page document you keep regardless of what happens next. Useful whether you work with us or not.