LV-Eval is a bilingual long-context benchmark designed to evaluate large language models at very long context lengths, up to 256k tokens. It provides controllable evaluation across five length levels (16k, 32k, 64k, 128k, and 256k) and covers two QA-style task types, single-hop QA and multi-hop QA, drawn from several bilingual datasets. To reduce knowledge leakage and make evaluation harder and more objective, the benchmark applies confusing facts insertion (CFI) and keyword and phrase replacement (KPR) during construction, and scores predictions with a keyword-recall-based metric at every length level. The number of test instances is kept balanced across the five lengths, so LV-Eval serves as a controlled stress test of long-context capability.
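To make the construction techniques concrete, here is a minimal sketch of keyword and phrase replacement (KPR): substitutions are applied consistently to both the context and the gold answer so a model cannot answer from memorized pretraining text. The function name, replacement map, and example are illustrative assumptions, not LV-Eval's actual implementation.

```python
# Illustrative sketch of keyword-and-phrase replacement (KPR).
# apply_kpr and the replacement map below are hypothetical, not LV-Eval's code.

def apply_kpr(context: str, answer: str, replacements: dict[str, str]) -> tuple[str, str]:
    """Apply the same keyword/phrase substitutions to context and answer."""
    for original, replacement in replacements.items():
        context = context.replace(original, replacement)
        answer = answer.replace(original, replacement)
    return context, answer

# Example: swap an entity name so pretrained knowledge alone cannot answer;
# the model must actually read the (modified) long context.
ctx, ans = apply_kpr(
    context="Ada Lovelace wrote the first published algorithm.",
    answer="Ada Lovelace",
    replacements={"Ada Lovelace": "Mara Ellison"},
)
print(ctx)  # Mara Ellison wrote the first published algorithm.
print(ans)  # Mara Ellison
```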
The evaluation metric tracked here is LV-Eval's keyword-recall-based score rather than plain accuracy. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
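As a rough illustration of how a keyword-recall-based metric can work, the sketch below gates and scales a bag-of-words F1 by the fraction of annotated answer keywords found in the prediction. The threshold, hard gate, and scaling are assumptions for illustration only; see the LV-Eval paper for the exact metric definition.

```python
import re
from collections import Counter

def keyword_recall_score(prediction: str, reference: str,
                         keywords: list[str],
                         recall_threshold: float = 0.5) -> float:
    """Bag-of-words F1 gated and scaled by answer-keyword recall.

    The hard gate and the scaling factor are illustrative assumptions,
    not LV-Eval's exact formula.
    """
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Fraction of annotated answer keywords that appear in the prediction.
    pred_text = " ".join(pred_tokens)
    kw_recall = (sum(kw.lower() in pred_text for kw in keywords) / len(keywords)
                 if keywords else 1.0)
    if kw_recall < recall_threshold:
        return 0.0  # prediction misses too many keywords: score zero

    # Standard token-overlap F1 between prediction and reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1 * kw_recall  # scale F1 by keyword recall

print(keyword_recall_score(
    "The treaty was signed in Vienna in 1815.",
    "It was signed in Vienna in 1815.",
    keywords=["Vienna", "1815"],
))
```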
Higher is better
| Rank | Model | Trust | Score | Year | Source |
|---|---|---|---|---|---|
| 01 | Qwen2.5-72B-Instruct | paper | 60.4 | N/A | Source ↗ |