Open weights · updated April 2026

Open-Weight LLM Leaderboard.

Benchmark comparison across open-weight models: DeepSeek-R1, Llama 3, Qwen 2.5, Mistral, Gemma 3. Run locally or self-host — no API fees.

§ 01 · Multi-benchmark comparison

Open weights, three axes.

Sorted descending by each model's first available score (MMLU where reported, otherwise MATH, otherwise LCB). MMLU = MMLU accuracy; MATH = MATH-500 accuracy; LCB = LiveCodeBench Pass@1.

| # | Model | Provider | Params | MMLU | MATH | LCB | License |
|---|-------|----------|--------|------|------|-----|---------|
| 1 | DeepSeek-R1-Zero | DeepSeek | — | — | 95.9% | — | — |
| 2 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | — | — | 94.5% | 65.2% | — |
| 3 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | — | — | 94.3% | 62.1% | — |
| 4 | DeepSeek-V3-0324 | DeepSeek | — | — | 94% | 49.2% | — |
| 5 | DeepSeek R1 | DeepSeek | 671B MoE | 90.8% | 97.3% | 65.9% | — |
| 6 | QwQ-32B | Alibaba/Qwen | — | — | 90.6% | — | — |
| 7 | Llama-4-Maverick | Meta | 400B total / 17B active (128 experts) | 89.4% | 89.4% | 43.4% | — |
| 8 | Qwen 3 72B | Alibaba | 72B | 88.7% | — | — | — |
| 9 | Llama 3.1 405B | Meta | — | 88.6% | 73.8% | — | — |
| 10 | DeepSeek-V3 | DeepSeek | — | 88.5% | 90.2% | 49.2% | — |
| 11 | DeepSeek V3.5 | DeepSeek | 685B MoE | 88.2% | — | — | — |
| 12 | Llama 4 405B | Meta | 405B | 87.8% | — | — | — |
| 13 | Mistral Large 3 | Mistral | 123B | 87.1% | — | — | — |
| 14 | MiniMax M2.5 | MiniMax | Unknown | 86.5% | — | — | — |
| 15 | Qwen2.5-72B-Instruct | Alibaba | 72B | 86.1% | 83.1% | — | — |
| 16 | Qwen 3 14B | Alibaba | 14B | 84.3% | — | — | — |
| 17 | Phi-4 14B | Microsoft | 14B | 83.9% | — | — | — |
| 18 | Llama 3.1 70B | Meta | — | 82% | 68% | — | — |
| 19 | DeepSeek-R1-0528 | DeepSeek | — | — | — | 73.3% | — |
| 20 | Qwen3-235B-A22B | Alibaba | 235B (22B active) | — | — | 70.7% | — |
| 21 | Qwen2.5-Coder 32B | Alibaba | 32B | — | — | 47.8% | — |
| 22 | DeepSeek-Coder-V2-Instruct | DeepSeek | Unknown | — | — | 43.4% | — |
| 23 | Gemma-3-27b | Google | 27B | — | — | 39% | — |
| 24 | Llama-4-Scout | Meta | 109B total / 17B active (16 experts) | — | — | 32.8% | — |
| 25 | Gemma 3 12B IT | Google DeepMind | 12B | — | — | 32% | — |
| 26 | Codestral 22B | Mistral | Unknown | — | — | 29.5% | — |
| 27 | Gemma 3 4B IT | Google DeepMind | 4B | — | — | 23% | — |
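
The LCB column reports LiveCodeBench Pass@1: the fraction of problems solved by a sampled generation. Benchmarks that sample n generations per problem typically use the standard unbiased pass@k estimator to compute it; a minimal sketch in Python (the function name is mine):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples, drawn without replacement from n generations of which
    c are correct, passes all tests."""
    if n - c < k:
        return 1.0  # too few failures to fill all k slots: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Pass@1 reduces to the raw fraction of correct generations:
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

For k = 1 the estimator is simply c/n, which is why Pass@1 can be read as plain per-sample accuracy.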
§ 02 · Model notes

Where each model lives.

DeepSeek-R1
Reasoning leader

671B MoE model trained with reinforcement learning for chain-of-thought reasoning. Matches or exceeds GPT-4o on math and coding benchmarks. MIT license. Requires significant GPU resources to run locally.

Llama 3.3 70B
Best accessible size

Meta's most capable 70B model as of Dec 2024. A practical size for self-hosting on 2x A100 GPUs. Strong instruction following. Commercial use is subject to the terms of the Llama 3.3 community license.

Qwen 2.5 72B
Strong all-rounder

Alibaba's 72B model with excellent math and code performance. Apache 2.0 licensed. Strong multilingual capability including Chinese. Competitive with Llama 3.3 70B on most benchmarks.

Phi-4 14B
Efficient small model

Microsoft's 14B model punches above its weight on knowledge benchmarks. MIT license. Good choice for edge deployments.

§ 03 · FAQ

Frequently asked.

What is the best open-weight model to run locally in 2026?

Llama 3.3 70B and Qwen 2.5 72B are the best options for self-hosting — with 4-bit quantization they fit on 2x A100 or 4x 4090 GPUs. DeepSeek-R1 (671B) requires a full multi-GPU server. For edge devices, Phi-4 14B is the best quality/size tradeoff.
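
A quick way to sanity-check those hardware claims: weight memory is roughly parameter count times bits per weight. A back-of-the-envelope sketch (the function name and the ~20% overhead factor for KV cache and runtime buffers are my assumptions, not a published formula):

```python
def approx_vram_gb(params_billions: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough serving-memory estimate: weights stored at bits/8 bytes per
    parameter, inflated ~20% for KV cache and runtime buffers."""
    return params_billions * (bits / 8) * overhead

# A 70B model at 4-bit: ~42 GB, so it spans 2x A100-40GB or 4x 24 GB cards.
print(round(approx_vram_gb(70), 1))   # 42.0
# A 671B model at 4-bit: ~400 GB — multi-GPU server territory.
print(round(approx_vram_gb(671), 1))  # 402.6
```

The estimate ignores context length, which can push KV-cache needs well past 20% for long prompts, so treat it as a floor rather than a budget.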

How close are open models to frontier proprietary models?

DeepSeek-R1 matches GPT-4o on math and GPQA Diamond, while being fully open-weight. However, proprietary frontier models like Claude 3.7 and o3 still lead on complex reasoning, agentic tasks, and HLE. The gap was ~2 years in 2023; it's now ~6-12 months.

What does "open-weight" vs "open-source" mean?

Open-weight means the model weights are downloadable, but training code and data may not be released. True open-source includes training code and data. Most "open" models (Llama, Mistral) are open-weight. Only some (OLMo, Pythia) are fully open-source.

§ 04 · Related

Continue reading.

Reasoning · Reasoning Benchmarks · GPQA Diamond, MMLU-Pro, HLE
Math · Math Benchmarks · GSM8K, MATH-500, AIME 2024
Index · All LLM Benchmarks · Frontier model leaderboards