Codesota · Benchmark · IFEval
Home / Leaderboards / Language & Knowledge / Language Modeling / IFEval

IFEval.

A straightforward, easy-to-reproduce evaluation benchmark for large language models, focused on instruction-following capability. IFEval contains around 500 prompts (541 in the train split) with verifiable instructions that can be checked objectively by simple heuristics, such as "write in more than 400 words", "mention the keyword 'AI' at least 3 times", "use no commas", or "include at least 3 highlighted sections". The benchmark defines 25 types of verifiable instructions, including punctuation constraints, length requirements, detectable content/format requirements, and keyword usage. Each prompt carries one or more verifiable instructions together with the kwargs needed to verify them. The benchmark is designed for evaluating chat or instruction fine-tuned language models and is one of the core benchmarks used in the Open LLM Leaderboard.
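The verification idea can be sketched in a few lines: each instruction type maps to a deterministic check, and a response passes a prompt only if every attached instruction holds. This is an illustrative sketch, not the benchmark's actual code; the function names and the simplified instruction-id strings are assumptions for demonstration.

```python
# Illustrative sketch of IFEval-style heuristic verification.
# Function names and the CHECKS registry keys are hypothetical, chosen
# to mirror the instruction categories described above.

def check_no_commas(response: str) -> bool:
    # "use no commas" -- passes only if no comma appears.
    return "," not in response

def check_min_words(response: str, min_words: int) -> bool:
    # "write in more than N words" -- whitespace-split word count.
    return len(response.split()) > min_words

def check_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    # 'mention the keyword "AI" at least N times' (case-insensitive).
    return response.lower().count(keyword.lower()) >= min_count

# Registry mapping instruction ids to their heuristic checks.
CHECKS = {
    "punctuation:no_comma": check_no_commas,
    "length_constraints:number_words": check_min_words,
    "keywords:frequency": check_keyword_frequency,
}

def verify(response: str, instructions: list[tuple[str, dict]]) -> bool:
    # A prompt passes strictly only if every instruction is satisfied;
    # each instruction supplies its own kwargs for the check.
    return all(CHECKS[name](response, **kwargs) for name, kwargs in instructions)
```

For example, `verify(text, [("punctuation:no_comma", {}), ("length_constraints:number_words", {"min_words": 400})])` would enforce both constraints at once, which is how multi-instruction prompts are scored.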

Paper · Leaderboard
§ 01 · SOTA history

Year over year.

Not enough data to show trend.
§ 02 · Leaderboard

Results by metric.

Only 1 model on this benchmark
Help build the community leaderboard — submit your model results.

Accuracy

Accuracy is the reported evaluation metric for IFEval. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Accuracy: verified · paper · vendor · community · unverified
Rank | Model | Trust | Score | Year | Source
01 | Qwen2.5-Plus (dataset: IFEval; task: 5) | paper | 86.3 | N/A | Source ↗
§ 04 · Submit a result

Add to the leaderboard.

← Back to Language Modeling