Arena-Hard is a human-aligned benchmark of challenging, open-ended prompts sourced from live crowdsourcing platforms (notably Chatbot Arena), designed to separate models by capability while reflecting human preference. It was introduced in the paper “From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline” (arXiv:2406.11939). The Arena-Hard-Auto variant (published on Hugging Face as Arena-Hard-Auto, v0.1) is an automatic evaluation suite of 500 challenging user queries extracted from Chatbot Arena; it uses an LLM as judge (the dataset authors report prompting GPT-4-Turbo to compare each model's responses against a baseline such as GPT-4-0314). The BenchBuilder pipeline described in the paper automates extracting high-quality prompts from crowdsourced data and produces an automatically judged benchmark with high agreement with the live Chatbot Arena and strong separability between models. Common uses: automatic, human-aligned evaluation of instruction-tuned LLMs and benchmarking alignment, safety, and helpfulness.
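For illustration, here is a minimal sketch of the LLM-as-a-judge step described above, assuming the OpenAI Python SDK and simplified, unofficial judge instructions. The model names, prompt wording, and the `judge_pair` helper are placeholders; the official Arena-Hard-Auto repository uses its own judge templates and judges each pair twice with the answer order swapped to reduce position bias.

```python
# Minimal, unofficial sketch of an Arena-Hard-Auto-style pairwise judgment.
# Assumes the OpenAI Python SDK; prompt wording and helper names are placeholders,
# not the benchmark's own code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_MODEL = "gpt-4-turbo"      # judge reported by the dataset authors
BASELINE_MODEL = "gpt-4-0314"    # baseline the candidate model is compared against

JUDGE_INSTRUCTIONS = (
    "You are an impartial judge. Compare the two assistant answers to the user "
    "question and output exactly one verdict label: A>>B, A>B, A=B, B>A, or B>>A."
)

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model for a pairwise verdict between answers A and B."""
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_INSTRUCTIONS},
            {
                "role": "user",
                "content": (
                    f"[Question]\n{question}\n\n"
                    f"[Answer A]\n{answer_a}\n\n"
                    f"[Answer B]\n{answer_b}"
                ),
            },
        ],
    )
    return response.choices[0].message.content.strip()
```

A full run would loop `judge_pair` over all 500 prompts, judging each prompt twice with the candidate model's answer placed first and then second.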
The reported metric for Arena-Hard is the model's score: the judged win rate against the baseline model, expressed as a percentage. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
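To make the scoring concrete, the sketch below (a hypothetical `arena_hard_score` helper, not the benchmark's own code) turns pairwise verdicts of the form produced above into a win-rate percentage against the baseline. The official pipeline also reports bootstrapped confidence intervals; this simplified version returns only a point estimate and treats strong and weak preferences alike.

```python
# Simplified sketch of turning pairwise judge verdicts into an Arena-Hard-style score.
# Assumes verdict labels like "A>>B", where A is the candidate model and B the baseline.
from collections import Counter

# Weight each verdict by how favorable it is to the candidate model (A).
# This sketch does not distinguish strong from weak preferences.
VERDICT_WEIGHT = {
    "A>>B": 1.0,  # candidate strongly preferred
    "A>B": 1.0,   # candidate preferred
    "A=B": 0.5,   # tie counts as half a win
    "B>A": 0.0,
    "B>>A": 0.0,
}

def arena_hard_score(verdicts: list[str]) -> float:
    """Return the candidate's win rate (%) against the baseline over all prompts."""
    counts = Counter(v for v in verdicts if v in VERDICT_WEIGHT)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    wins = sum(VERDICT_WEIGHT[v] * n for v, n in counts.items())
    return 100.0 * wins / total

print(arena_hard_score(["A>>B", "A=B", "B>A", "A>B"]))  # -> 62.5
```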
Higher is better
| Rank | Model | Trust | Score (%) | Year | Source |
|---|---|---|---|---|---|
| 01 | Qwen2.5-Plus | paper | 81.4 | N/A | Source ↗ |