Codesota · Benchmark · RULER

RULER.

RULER is a synthetic, configurable benchmark suite for evaluating how well language models actually use long contexts. Introduced in the paper “RULER: What’s the Real Context Size of Your Long-Context Language Models?” (arXiv:2404.06654), it extends the common needle-in-a-haystack (NIAH) retrieval test into a set of controlled task variations with flexible configurations for sequence length and task complexity. Beyond simple retrieval, the tasks span four categories: NIAH retrieval variants, multi-hop tracing, aggregation, and question answering, and performance is measured across many context lengths (the authors report evaluations up to 1M tokens). The code and data-generation tools are available in the public NVIDIA RULER GitHub repository (https://github.com/NVIDIA/RULER).
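To make the task construction concrete, here is a toy sketch of a basic NIAH-style generator in the spirit of RULER: a key-value “needle” is buried at a random position in filler text of a configurable length, and the model is asked to retrieve the value. The function name, filler sentences, and prompt wording are illustrative assumptions, not the repository’s actual generator.

```python
import random
import string

def make_niah_example(context_len_words: int, seed: int = 0):
    """Toy needle-in-a-haystack generator: hide one key-value 'needle'
    at a random position inside filler text of the requested length.
    (Illustrative sketch only; not the NVIDIA/RULER implementation.)"""
    rng = random.Random(seed)
    key = "".join(rng.choices(string.ascii_lowercase, k=8))
    value = str(rng.randrange(1_000_000, 10_000_000))
    needle = f"The special magic number for {key} is {value}."
    filler = "The grass is green. The sky is blue. The sun is yellow."
    # Repeat the filler until the haystack reaches the target word count.
    n_repeats = context_len_words // len(filler.split()) + 1
    words = ((filler + " ") * n_repeats).split()[:context_len_words]
    words.insert(rng.randrange(len(words)), needle)
    question = f"What is the special magic number for {key}?"
    return " ".join(words) + "\n" + question, value

prompt, answer = make_niah_example(context_len_words=4096, seed=42)
```

Scaling `context_len_words` up, and varying the number of needles, distractors, or the task type, is what lets a suite like this separate models that merely accept long inputs from models that can actually use them.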

§ 01 · SOTA history

Year over year.

Not enough data to show trend.
§ 02 · Leaderboard

Results by metric.

Only one model is currently listed on this benchmark. Help build the community leaderboard by submitting your model results.

§ 03 · Accuracy

Accuracy is the reported evaluation metric for RULER. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
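As a point of reference, RULER-style tasks are typically scored by checking whether the reference answer strings appear in the model’s output, with per-example scores averaged into the reported accuracy. The sketch below assumes that string-matching scheme; the function names are illustrative, not the repository’s exact API.

```python
from typing import List

def string_match_score(prediction: str, references: List[str]) -> float:
    """Fraction of reference strings found verbatim in the prediction,
    case-insensitive, scaled to 0-100. (Illustrative assumption of the
    matching scheme, not the RULER repository's exact scorer.)"""
    pred = prediction.lower()
    hits = sum(ref.lower() in pred for ref in references)
    return 100.0 * hits / len(references)

def benchmark_accuracy(predictions: List[str],
                       references: List[List[str]]) -> float:
    """Average per-example scores into a single accuracy number."""
    scores = [string_match_score(p, r)
              for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)

# A single NIAH example with one needle, answered correctly:
print(benchmark_accuracy(["The magic number is 7031452."], [["7031452"]]))
# -> 100.0
```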

Trust tiers for Accuracy: verified, paper, vendor, community, unverified.
Rank  Model                                            Trust  Score  Year  Source
01    Qwen2.5-72B-Instruct (dataset: RULER; task: 5)   paper  95.1   N/A   Source ↗
§ 04 · Submit a result

Add to the leaderboard.
