Who leads the Livebench benchmark?

Qwen2.5-Plus currently leads Livebench with a score of 54.6 on Accuracy.

What is the state-of-the-art score on Livebench?

The state-of-the-art result on Livebench is 54.6 (Accuracy), achieved by Qwen2.5-Plus.

How many models are tracked on Livebench?

Codesota tracks 1 model on Livebench.

Codesota · Benchmark · Livebench


Livebench.

Name: Livebench Benchmark Results
Creator: Unknown
License: https://creativecommons.org/licenses/by/4.0/

The Livebench dataset is a time-series dataset for language modeling. It gathers and processes data from the LiveBench GitHub repository and from the files used by the live version of the website, so that the data stays up to date. The dataset includes information such as question IDs, categories (consistently "language"), and release dates. It also contains counts broken down by date range and by label range (e.g., 0.00 - 10.00, 10.00 - 20.00).


§ 01 · Leaderboard

Results by metric.

Only 1 model on this benchmark


Accuracy

Accuracy is the reported evaluation metric for Livebench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Accuracy: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Trust	Score	Year
01	Qwen2.5-Plus (dataset: Livebench; task: 5)	paper	54.6	N/A
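
The muted-row rule stated above the table can be illustrated with a short sketch. This is not Codesota's implementation; it is a minimal reading of the rule for a higher-is-better metric. The `Result` class, the `muted_flags` function, and the `model-a`/`model-b`/`model-c` rows are hypothetical and exist only for illustration. The sketch also assumes that a row with no reported year (N/A) is never muted and never mutes another row, which the page does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    model: str
    score: float          # higher is better
    year: Optional[int]   # None when the year is not reported (N/A)


def muted_flags(results: list[Result]) -> list[bool]:
    """Return one flag per row: True if the row should be muted.

    A row is muted when some other row from an earlier or the same
    year already has a strictly higher score. Rows without a year are
    skipped on both sides of the comparison (an assumption).
    """
    flags = []
    for row in results:
        if row.year is None:
            flags.append(False)
            continue
        beaten = any(
            other is not row
            and other.year is not None
            and other.year <= row.year
            and other.score > row.score
            for other in results
        )
        flags.append(beaten)
    return flags


# Hypothetical rows plus the single tracked entry from the table.
rows = [
    Result("model-a", 50.0, 2023),
    Result("model-b", 45.0, 2024),   # beaten by model-a from 2023
    Result("model-c", 60.0, 2024),
    Result("Qwen2.5-Plus", 54.6, None),
]
print(muted_flags(rows))  # [False, True, False, False]
```

With only one tracked model and no reported year, the rule has nothing to compare against, so the single Livebench row is not muted under these assumptions.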
