RULER.

Name: RULER Benchmark Results
Creator: Unknown
License: https://creativecommons.org/licenses/by/4.0/

RULER is a synthetic, configurable long-context benchmarking suite for evaluating language models’ ability to use very long contexts. Introduced in the paper “RULER: What’s the Real Context Size of Your Long-Context Language Models?” (arXiv:2404.06654), RULER extends the common “needle-in-a-haystack” (NIAH) retrieval test into a richer set of controlled variations with flexible configurations for sequence length and task complexity. The benchmark is designed to probe more than simple retrieval by varying task types and difficulty and to measure model performance across many context lengths (the authors report evaluations up to 1M tokens). The code and data-generation tools are provided by the authors in the public NVIDIA RULER GitHub repository (https://github.com/NVIDIA/RULER).

Paper ↗Leaderboard ↓

§ 01 · SOTA history

Year over year.

Not enough data to show trend.

§ 02 · Leaderboard

Results by metric.

Only 1 model on this benchmark

Help build the community leaderboard — submit your model results.

Accuracy

Accuracy is the reported evaluation metric for RULER. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Accuracyverifiedpapervendorcommunityunverified

Rank	Model	Trust	Score	Year	Source
01	Qwen2.5-72B-Instruct dataset: RULER; task: 5	paper	95.1	N/A	Source ↗

§ 04 · Submit a result

Add to the leaderboard.

← Back to Language Modeling