A benchmark on Codesota is not simply a dataset. It is a task-dataset-metric triple with enough structure that two people can run it and compare their numbers. That minimum structure comes down to four items: a declared metric with a direction, a fixed test split, a reproducibility script, and a dated submission.
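Concretely, a verified row might be stored as something like the sketch below. The field names are ours, chosen for illustration rather than taken from any actual schema; the point is only that all four items appear as first-class fields:

```python
# A sketch of the record behind one verified row. Field names are
# illustrative, not a committed Codesota schema.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class VerifiedScore:
    task: str                      # e.g. "code-completion"
    dataset: str                   # the dataset identifier
    metric_name: str               # the declared metric, e.g. "pass@1"
    higher_is_better: bool         # the declared direction
    test_split_id: str             # the fixed slice the score was computed on
    repro: str                     # command line or API descriptor (see below)
    run_date: date                 # the day the score was produced
    verified_date: Optional[date]  # the day it was independently re-run
    value: float
```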
A declared metric means the benchmark page states which number is being reported and whether higher or lower is better. Half the confusion in the field comes from tables that leave this implicit; ours do not.
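A two-line sorting example shows why the direction cannot stay implicit; the rows here are invented for illustration:

```python
# Illustrative: ranking a table requires knowing the metric's direction.
rows = [{"model": "a", "value": 41.2}, {"model": "b", "value": 38.7}]

def ranked(rows, higher_is_better: bool):
    # Accuracy-style metrics sort descending; perplexity- or latency-style
    # metrics sort ascending. Same numbers, opposite leaderboards.
    return sorted(rows, key=lambda r: r["value"], reverse=higher_is_better)

print(ranked(rows, higher_is_better=True))   # "a" leads
print(ranked(rows, higher_is_better=False))  # "b" leads
```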
A fixed test split means the slice of data used to produce the score is the same slice used by every other row in the table. Private test sets are recorded separately and are never mixed with public scores in the same ranking.
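One way to make that separation mechanical rather than a matter of discipline is to tag each row with its split and visibility, and filter before ranking. A sketch with invented field names and values:

```python
# Hypothetical rows; "split" and "private" are illustrative field names.
rows = [
    {"model": "a", "value": 41.2, "split": "test-v1", "private": False},
    {"model": "b", "value": 44.0, "split": "test-v1", "private": True},
    {"model": "c", "value": 39.8, "split": "test-v1", "private": False},
]

def public_ranking(rows, split_id):
    # Only rows scored on the same fixed public split are comparable;
    # private test-set rows go in their own table, never this one.
    # (Direction is taken as higher-is-better for the example.)
    comparable = [r for r in rows if r["split"] == split_id and not r["private"]]
    return sorted(comparable, key=lambda r: r["value"], reverse=True)

print(public_ranking(rows, "test-v1"))  # "b" never enters the public table
```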
A reproducibility script means there is a concrete path by which the evaluation can be re-run and the score reproduced. For open checkpoints this is a command, a commit hash and an environment; for closed models it is the API endpoint, version string, prompt template and decoding parameters. Without one of the two, the row does not publish as a verified score.
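Those two paths can be written down as two record shapes, with verification gated on one of them being present. The type names below are invented for the sketch:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class OpenCheckpointRepro:
    command: str      # the exact evaluation command
    commit: str       # commit hash of the evaluation code
    environment: str  # e.g. a lockfile or container image digest

@dataclass
class ClosedModelRepro:
    endpoint: str          # the API endpoint
    version: str           # the provider's version string
    prompt_template: str
    decoding_params: dict  # temperature, top_p, max tokens, ...

Repro = Union[OpenCheckpointRepro, ClosedModelRepro]

def publishable_as_verified(repro) -> bool:
    # Without one of the two descriptors, the row stays unverified.
    return isinstance(repro, (OpenCheckpointRepro, ClosedModelRepro))
```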
A dated submission means every score carries the day it was run and, separately, the day it was verified. That separation matters when you are reading the table six months later and trying to tell whether a “current” number is still current.
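Keeping the two dates separate also makes staleness checkable by a script rather than by memory. A sketch with invented dates; the 180-day threshold is an arbitrary choice for the example:

```python
from datetime import date, timedelta

run_date = date(2024, 3, 1)       # the day the score was produced
verified_date = date(2024, 3, 8)  # the day it was independently re-run

def is_stale(verified: date, today: date, max_age_days: int = 180) -> bool:
    # A "current" number six months on may no longer be current.
    return today - verified > timedelta(days=max_age_days)

print(is_stale(verified_date, date(2024, 9, 15)))  # True: due for re-verification
```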