RULER is a synthetic, configurable benchmark suite for evaluating how well language models use very long contexts. Introduced in the paper “RULER: What’s the Real Context Size of Your Long-Context Language Models?” (arXiv:2404.06654), RULER extends the common “needle-in-a-haystack” (NIAH) retrieval test into a richer set of controlled variations, with configurable sequence length and task complexity. Beyond simple retrieval, it adds multi-hop tracing, aggregation, and question-answering task categories, and it measures model performance across many context lengths (the authors report evaluations up to 1M tokens). The authors provide the code and data-generation tools in the public NVIDIA RULER GitHub repository (https://github.com/NVIDIA/RULER).
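To make the needle-in-a-haystack idea concrete, here is a minimal sketch of how one synthetic example could be built and scored. It is not taken from the RULER repository: the function names `make_niah_sample` and `accuracy`, the filler text, the prompt wording, and the substring-match scoring rule are all assumptions chosen for illustration.

```python
import random


def make_niah_sample(num_filler_sentences, depth, seed=0):
    """Build one illustrative needle-in-a-haystack example (hypothetical helper).

    num_filler_sentences controls the context length; depth in [0, 1]
    controls where the needle is placed inside the haystack.
    """
    rng = random.Random(seed)
    key = f"key-{rng.randint(1000, 9999)}"
    value = str(rng.randint(1_000_000, 9_999_999))
    needle = f"The special magic number for {key} is {value}."

    # Assumed filler text; a real benchmark would use varied distractor content.
    filler = "The grass is green. The sky is blue. The sun is yellow."
    haystack = [filler] * num_filler_sentences
    haystack.insert(int(depth * num_filler_sentences), needle)

    question = f"What is the special magic number for {key}?"
    prompt = " ".join(haystack) + "\n" + question
    return prompt, value


def accuracy(predictions, references):
    """Percentage of predictions that contain their reference string.

    Substring matching is an assumption of this sketch.
    """
    hits = sum(ref in pred for pred, ref in zip(predictions, references))
    return 100.0 * hits / len(references)


prompt, answer = make_niah_sample(num_filler_sentences=200, depth=0.5)
# A model output that repeats the hidden value counts as correct.
print(accuracy([f"The number is {answer}."], [answer]))
```

Raising `num_filler_sentences` lengthens the context, and sweeping `depth` moves the needle from the start to the end of the haystack, which is the kind of controlled variation described above.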
Accuracy is the reported evaluation metric for RULER. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
| Rank | Model | Trust | Score (accuracy, %) | Year | Source |
|---|---|---|---|---|---|
| 01 | Qwen2.5-72B-Instruct | paper | 95.1 | N/A | Source ↗ |