Codesota · Benchmark · IFEval
Home / Leaderboards / Language & Knowledge / Language Modeling / IFEval

IFEval.

A straightforward, easy-to-reproduce evaluation benchmark for large language models, focused on instruction-following capability. IFEval contains around 500 prompts (541 in the train split) with verifiable instructions that can be checked objectively by simple heuristics, such as "write in more than 400 words", "mention the keyword 'AI' at least 3 times", "use no commas", or "include at least 3 highlighted sections". The benchmark defines 25 types of verifiable instructions, including punctuation constraints, length requirements, detectable content/format requirements, and keyword usage. Each prompt carries one or more verifiable instructions together with the kwargs needed to verify them. The benchmark is designed for evaluating chat or instruction fine-tuned language models and is one of the core benchmarks used in the Open LLM Leaderboard.
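The verification idea can be sketched in a few lines: each instruction type maps to a deterministic check, and a response passes a prompt only if every attached instruction holds. This is an illustrative sketch, not the benchmark's actual code; the function names and the simplified instruction-id strings are assumptions for demonstration.

```python
# Illustrative sketch of IFEval-style heuristic verification.
# Function names and the CHECKS registry keys are hypothetical, chosen
# to mirror the instruction categories described above.

def check_no_commas(response: str) -> bool:
    # "use no commas" -- passes only if no comma appears.
    return "," not in response

def check_min_words(response: str, min_words: int) -> bool:
    # "write in more than N words" -- whitespace-split word count.
    return len(response.split()) > min_words

def check_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    # 'mention the keyword "AI" at least N times' (case-insensitive).
    return response.lower().count(keyword.lower()) >= min_count

# Registry mapping instruction ids to their heuristic checks.
CHECKS = {
    "punctuation:no_comma": check_no_commas,
    "length_constraints:number_words": check_min_words,
    "keywords:frequency": check_keyword_frequency,
}

def verify(response: str, instructions: list[tuple[str, dict]]) -> bool:
    # A prompt passes strictly only if every instruction is satisfied;
    # each instruction supplies its own kwargs for the check.
    return all(CHECKS[name](response, **kwargs) for name, kwargs in instructions)
```

For example, `verify(text, [("punctuation:no_comma", {}), ("length_constraints:number_words", {"min_words": 400})])` would enforce both constraints at once, which is how multi-instruction prompts are scored.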

Paper · Leaderboard
§ 01 · SOTA history

Year over year.

Not enough data to show trend.
§ 02 · Leaderboard

Results by metric.

Only 1 model on this benchmark
Help build the community leaderboard — submit your model results.

Accuracy

Accuracy is the reported evaluation metric for IFEval. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Accuracy: verified · paper · vendor · community · unverified
Rank | Model | Trust | Score | Year | Source
01 | Qwen2.5-Plus (dataset: IFEval; task: 5) | paper | 86.3 | N/A | Source ↗
§ 04 · Submit a result

Add to the leaderboard.

← Back to Language Modeling