Polish Text Understanding · benchmark dataset · 2025 · PL

Complex Polish Text Understanding Benchmark (CPTU-Bench).

Evaluates LLMs on understanding Polish text across four dimensions: sentiment analysis, language understanding (implicatures, author intent), phraseology (idioms, phraseological compounds), and tricky questions (logic, ambiguity, hallucination resistance). Each category is scored on a 0-5 scale. The dataset contains 378 hand-written examples and was created by SpeakLeash/Spichlerz.

§ 01 · Leaderboard

Best published scores.

465 results indexed across 5 metrics (93 models per metric). The first row of each table marks the current SOTA; ties are broken by submission date.
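The ordering rule can be sketched as a two-key sort: score descending, then submission date. This is a minimal illustration, not the site's implementation. The page does not say whether earlier or later submissions win a tie, so the sketch assumes earlier first and exposes a hypothetical `newest_first` flag for the opposite case. The page lists only a month and year, so the first day of the month is assumed. The `Result` type and `rank` function are illustrative names; the three rows are copied from the average table.

```python
from datetime import date
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    submitted: date  # assumption: first day of the listed month
    score: float


def rank(results, newest_first=False):
    """Sort by score (descending), breaking ties by submission date.

    The tie-break direction is an assumption; the page does not state it.
    """
    def key(r):
        day = r.submitted.toordinal()
        return (-r.score, -day if newest_first else day)

    return sorted(results, key=key)


# Rows copied from the "average" table on this page.
rows = [
    Result("Mistral-Large-Instruct-2407", date(2024, 7, 1), 3.93),
    Result("Qwen2.5-72B-Instruct", date(2024, 9, 1), 3.95),
    Result("meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API)", date(2025, 4, 1), 3.93),
]

for position, r in enumerate(rank(rows), start=1):
    print(f"{position:02d} {r.model} {r.score:.2f}")
```

The displayed scores are rounded to two decimals, so two rows that look tied here may differ in the underlying data; a sort on the unrounded values would then need no tie-break at all.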


Primary metric: average (higher is better)
All metrics: average, language-understanding, phraseology, sentiment, tricky-questions
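The page does not define how the primary metric is computed. The published rows are consistent with an unweighted mean of the four category scores, and the sketch below checks that assumption for three models. The scores are copied from the per-metric tables on this page; the function name `unweighted_average` is illustrative, and the formula is an inference from the data rather than a documented rule.

```python
from statistics import mean

# Category scores copied from the per-metric tables on this page.
CATEGORY_SCORES = {
    "Qwen/Qwen3.5-27B thinking (API)": {
        "language-understanding": 4.21,
        "phraseology": 4.11,
        "sentiment": 4.42,
        "tricky-questions": 4.61,
    },
    "gemini-2.0-flash-001": {
        "language-understanding": 4.32,
        "phraseology": 4.34,
        "sentiment": 4.52,
        "tricky-questions": 3.99,
    },
    "deepseek-ai/DeepSeek-V3.2 (API)": {
        "language-understanding": 4.36,
        "phraseology": 3.54,
        "sentiment": 4.46,
        "tricky-questions": 4.20,
    },
}

# Values copied from the "average" table.
PUBLISHED_AVERAGE = {
    "Qwen/Qwen3.5-27B thinking (API)": 4.34,
    "gemini-2.0-flash-001": 4.29,
    "deepseek-ai/DeepSeek-V3.2 (API)": 4.14,
}


def unweighted_average(scores):
    """Mean of the four category scores (assumed formula, see the note above)."""
    return mean(scores.values())


for model, scores in CATEGORY_SCORES.items():
    computed = unweighted_average(scores)
    published = PUBLISHED_AVERAGE[model]
    print(f"{model}: computed {computed:.4f}, published {published:.2f}")
```

Because the category scores are themselves rounded, the comparison is only meaningful to within about 0.01.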
average · primary
93 rows
# | Model | Org | Submitted | Paper / code | average
01 | Qwen/Qwen3.5-27B thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.34
02 | gemini-2.0-flash-001 · OSS | Google | Feb 2025 | SpeakLeash/CPTU-Bench | 4.29
03 | Qwen/Qwen3.5-27B non-thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.27
04 | Qwen/Qwen3.5-35B-A3B thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.22
05 | Qwen/Qwen3.5-35B-A3B non-thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.18
06 | deepseek-ai/DeepSeek-V3.2 (API) · OSS | deepseek-ai | Jul 2025 | SpeakLeash/CPTU-Bench | 4.14
07 | deepseek-ai/DeepSeek-R1 (API) · OSS | deepseek-ai | Jan 2025 | SpeakLeash/CPTU-Bench | 4.14
08 | gemini-2.0-flash-lite-001 · OSS | Google | Feb 2025 | SpeakLeash/CPTU-Bench | 4.09
09 | 🚧 DeepSeek-V3-0324 · OSS | deepseek-ai | Mar 2025 | SpeakLeash/CPTU-Bench | 4.03
10 | deepseek-ai/DeepSeek-V3.1 (API) · OSS | deepseek-ai | May 2025 | SpeakLeash/CPTU-Bench | 4.03
11 | deepseek-ai/DeepSeek-V3 (API) · OSS | deepseek-ai | Dec 2024 | SpeakLeash/CPTU-Bench | 4.02
12 | Mistral-Large-Instruct-2411 · OSS | mistralai | Nov 2024 | SpeakLeash/CPTU-Bench | 4.00
13 | moonshotai/Kimi-K2-Instruct-0905 (API) · OSS | moonshotai | Sep 2025 | SpeakLeash/CPTU-Bench | 3.98
14 | Qwen2.5-72B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.95
15 | Mistral-Large-Instruct-2407 · OSS | mistralai | Jul 2024 | SpeakLeash/CPTU-Bench | 3.93
16 | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API) · OSS | meta-llama | Apr 2025 | SpeakLeash/CPTU-Bench | 3.93
17 | Qwen/Qwen3-235B-A22B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.91
18 | mistralai/Mistral-Small-3.1-24B-Instruct-2503 (API FP8) · OSS | mistralai | Mar 2025 | SpeakLeash/CPTU-Bench | 3.90
19 | mistralai/Mistral-Small-3.2-24B-Instruct-2506 (API FP8) · OSS | mistralai | Jun 2025 | SpeakLeash/CPTU-Bench | 3.83
20 | openai/gpt-oss-120b (API) · OSS | openai | Jun 2025 | SpeakLeash/CPTU-Bench | 3.82
21 | gemma-3-27b-it · OSS | google | Mar 2025 | SpeakLeash/CPTU-Bench | 3.81
22 | Meta-Llama-3-70B-Instruct · OSS | meta-llama | Apr 2024 | SpeakLeash/CPTU-Bench | 3.78
23 | Qwen2.5-32B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.75
24 | Llama-4-Scout-17B-16E-Instruct · OSS | meta-llama | Apr 2025 | SpeakLeash/CPTU-Bench | 3.75
25 | Bielik-11B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 3.73
26 | Qwen/Qwen3-32B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.71
27 | Mistral-Small-24B-Instruct-2501 · OSS | mistralai | Jan 2025 | SpeakLeash/CPTU-Bench | 3.71
28 | WizardLM-2-8x22B · OSS | alpindale | Apr 2024 | SpeakLeash/CPTU-Bench | 3.70
29 | pllum-12b-nc-chat-250715 · OSS | CYFRAGOVPL | Jul 2025 | SpeakLeash/CPTU-Bench | 3.67
30 | Qwen2-72B-Instruct · OSS | Qwen | Jun 2024 | SpeakLeash/CPTU-Bench | 3.65
31 | Llama-3.3-70B-Instruct · OSS | meta-llama | Dec 2024 | SpeakLeash/CPTU-Bench | 3.64
32 | Bielik-11B-v2.6-Instruct · OSS | speakleash | Feb 2025 | SpeakLeash/CPTU-Bench | 3.64
33 | Bielik-11B-v2.3-Instruct · OSS | speakleash | Nov 2024 | SpeakLeash/CPTU-Bench | 3.63
34 | Meta-Llama-3.1-70B-Instruct · OSS | meta-llama | Jul 2024 | SpeakLeash/CPTU-Bench | 3.62
35 | Bielik-11B-v2.1-Instruct · OSS | speakleash | Sep 2024 | SpeakLeash/CPTU-Bench | 3.61
36 | Mixtral-8x22B-Instruct-v0.1 · OSS | mistralai | Apr 2024 | SpeakLeash/CPTU-Bench | 3.56
37 | Qwen2.5-14B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.55
38 | Qwen/Qwen3-30B-A3B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.54
39 | Llama-PLLuM-70B-chat · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 3.53
40 | Qwen/Qwen3-14B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.51
41 | Bielik-11B-v2.5-Instruct · OSS | speakleash | Jan 2025 | SpeakLeash/CPTU-Bench | 3.48
42 | Bielik-11B-v2.2-Instruct · OSS | speakleash | Oct 2024 | SpeakLeash/CPTU-Bench | 3.46
43 | speakleash/Bielik-Minitron-7B-v3.0-Instruct · OSS | speakleash | Jul 2025 | SpeakLeash/CPTU-Bench | 3.38
44 | Bielik-4.5B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 3.38
45 | Llama-PLLuM-70B-instruct · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 3.33
46 | CYFRAGOVPL/pllum-12b-nc-instruct-250715 · OSS | CYFRAGOVPL | Jul 2025 | SpeakLeash/CPTU-Bench | 3.33
47 | phi-4 · OSS | microsoft | Jan 2025 | SpeakLeash/CPTU-Bench | 3.30
48 | Qwen/Qwen3.5-9B non-thinking (API, FP8) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 3.28
49 | Bielik-11B-v2.0-Instruct · OSS | speakleash | Aug 2024 | SpeakLeash/CPTU-Bench | 3.26
50 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 · OSS | nvidia | Jun 2025 | SpeakLeash/CPTU-Bench | 3.25
51 | Qwen1.5-72B-Chat · OSS | Qwen | Feb 2024 | SpeakLeash/CPTU-Bench | 3.16
52 | CYFRAGOVPL/PLLuM-12B-nc-chat · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.15
53 | EuroLLM-9B-Instruct · OSS | utter-project | Mar 2025 | SpeakLeash/CPTU-Bench | 3.15
54 | PLLuM-12B-chat · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.14
55 | PLLuM-8x7B-nc-instruct · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.11
56 | PLLuM-12B-instruct · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.09
57 | Qwen2.5-7B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.07
58 | Qwen/Qwen3-8B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.06
59 | PLLuM-8x7B-nc-chat · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.03
60 | Meta-Llama-3.1-8B-Instruct · OSS | meta-llama | Jul 2024 | SpeakLeash/CPTU-Bench | 3.01
61 | PLLuM-8x7B-instruct · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.01
62 | PLLuM-8x7B-chat · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.01
63 | Meta-Llama-3-8B-Instruct · OSS | meta-llama | Apr 2024 | SpeakLeash/CPTU-Bench | 3.00
64 | CYFRAGOVPL/PLLuM-12B-nc-instruct · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 2.96
65 | glm-4-9b-chat · OSS | THUDM | Jun 2024 | SpeakLeash/CPTU-Bench | 2.95
66 | Mistral-Nemo-Instruct-2407 · OSS | mistralai | Jul 2024 | SpeakLeash/CPTU-Bench | 2.94
67 | Llama-PLLuM-8B-chat · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 2.92
68 | Bielik-7B-Instruct-v0.1 · OSS | speakleash | Apr 2024 | SpeakLeash/CPTU-Bench | 2.88
69 | SOLAR-10.7B-Instruct-v1.0 · OSS | upstage | Dec 2023 | SpeakLeash/CPTU-Bench | 2.88
70 | CYFRAGOVPL/Llama-PLLuM-8B-instruct · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 2.82
71 | Mistral-7B-Instruct-v0.3 · OSS | mistralai | May 2024 | SpeakLeash/CPTU-Bench | 2.76
72 | openchat-3.5-0106-gemma · OSS | openchat | Dec 2023 | SpeakLeash/CPTU-Bench | 2.73
73 | Mixtral-8x7B-Instruct-v0.1 · OSS | mistralai | Dec 2023 | SpeakLeash/CPTU-Bench | 2.73
74 | gemma-2-2b-it · OSS | google | Jun 2024 | SpeakLeash/CPTU-Bench | 2.65
75 | Starling-LM-7B-alpha · OSS | berkeley-nest | Nov 2023 | SpeakLeash/CPTU-Bench | 2.63
76 | openchat-3.5-0106 · OSS | openchat | Dec 2023 | SpeakLeash/CPTU-Bench | 2.63
77 | Qwen2.5-3B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 2.50
78 | Bielik-1.5B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 2.36
79 | Yi-1.5-34B-Chat · OSS | 01-ai | May 2024 | SpeakLeash/CPTU-Bench | 2.33
80 | trurl-2-13b-academic · OSS | Voicelab | Jan 2024 | SpeakLeash/CPTU-Bench | 2.31
81 | NousResearch/Hermes-3-Llama-3.2-3B · OSS | NousResearch | Oct 2024 | SpeakLeash/CPTU-Bench | 2.31
82 | Phi-4-mini-instruct · OSS | microsoft | Apr 2025 | SpeakLeash/CPTU-Bench | 2.17
83 | internlm2-chat-20b · OSS | internlm | Jan 2024 | SpeakLeash/CPTU-Bench | 2.15
84 | Phi-3.5-mini-instruct · OSS | microsoft | Aug 2024 | SpeakLeash/CPTU-Bench | 2.01
85 | Llama-3.2-3B-Instruct · OSS | meta-llama | Sep 2024 | SpeakLeash/CPTU-Bench | 2.00
86 | granite-3.1-2b-instruct · OSS | ibm-granite | Jan 2025 | SpeakLeash/CPTU-Bench | 1.95
87 | Llama-3.2-1B-Instruct · OSS | meta-llama | Sep 2024 | SpeakLeash/CPTU-Bench | 1.92
88 | EuroLLM-1.7B-Instruct · OSS | utter-project | Jan 2025 | SpeakLeash/CPTU-Bench | 1.76
89 | Qwen2.5-1.5B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 1.76
90 | LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct · OSS | LGAI-EXAONE | Jan 2025 | SpeakLeash/CPTU-Bench | 1.67
91 | h2oai/h2o-danube2-1.8b-chat · OSS | h2oai | Apr 2024 | SpeakLeash/CPTU-Bench | 1.64
92 | SmolLM2-1.7B-Instruct · OSS | HuggingFaceTB | Feb 2025 | SpeakLeash/CPTU-Bench | 1.50
93 | Qwen/Qwen2.5-0.5B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 1.40
language-understanding
93 rows
# | Model | Org | Submitted | Paper / code | language-understanding
01 | deepseek-ai/DeepSeek-V3.2 (API) · OSS | deepseek-ai | Jul 2025 | SpeakLeash/CPTU-Bench | 4.36
02 | deepseek-ai/DeepSeek-R1 (API) · OSS | deepseek-ai | Jan 2025 | SpeakLeash/CPTU-Bench | 4.34
03 | deepseek-ai/DeepSeek-V3.1 (API) · OSS | deepseek-ai | May 2025 | SpeakLeash/CPTU-Bench | 4.33
04 | gemini-2.0-flash-001 · OSS | Google | Feb 2025 | SpeakLeash/CPTU-Bench | 4.32
05 | deepseek-ai/DeepSeek-V3 (API) · OSS | deepseek-ai | Dec 2024 | SpeakLeash/CPTU-Bench | 4.22
06 | Qwen/Qwen3.5-27B thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.21
07 | 🚧 DeepSeek-V3-0324 · OSS | deepseek-ai | Mar 2025 | SpeakLeash/CPTU-Bench | 4.20
08 | moonshotai/Kimi-K2-Instruct-0905 (API) · OSS | moonshotai | Sep 2025 | SpeakLeash/CPTU-Bench | 4.18
09 | Qwen/Qwen3.5-27B non-thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.17
10 | Qwen/Qwen3-235B-A22B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 4.16
11 | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API) · OSS | meta-llama | Apr 2025 | SpeakLeash/CPTU-Bench | 4.11
12 | gemini-2.0-flash-lite-001 · OSS | Google | Feb 2025 | SpeakLeash/CPTU-Bench | 4.05
13 | Qwen/Qwen3.5-35B-A3B non-thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.05
14 | mistralai/Mistral-Small-3.2-24B-Instruct-2506 (API FP8) · OSS | mistralai | Jun 2025 | SpeakLeash/CPTU-Bench | 4.00
15 | Mistral-Large-Instruct-2407 · OSS | mistralai | Jul 2024 | SpeakLeash/CPTU-Bench | 4.00
16 | Mistral-Large-Instruct-2411 · OSS | mistralai | Nov 2024 | SpeakLeash/CPTU-Bench | 3.98
17 | openai/gpt-oss-120b (API) · OSS | openai | Jun 2025 | SpeakLeash/CPTU-Bench | 3.97
18 | Qwen2.5-72B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.97
19 | pllum-12b-nc-chat-250715 · OSS | CYFRAGOVPL | Jul 2025 | SpeakLeash/CPTU-Bench | 3.96
20 | Bielik-11B-v2.6-Instruct · OSS | speakleash | Feb 2025 | SpeakLeash/CPTU-Bench | 3.94
21 | Qwen/Qwen3.5-35B-A3B thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 3.94
22 | Bielik-11B-v2.1-Instruct · OSS | speakleash | Sep 2024 | SpeakLeash/CPTU-Bench | 3.92
23 | Qwen/Qwen3-32B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.91
24 | Bielik-11B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 3.91
25 | Meta-Llama-3.1-70B-Instruct · OSS | meta-llama | Jul 2024 | SpeakLeash/CPTU-Bench | 3.91
26 | Qwen2-72B-Instruct · OSS | Qwen | Jun 2024 | SpeakLeash/CPTU-Bench | 3.89
27 | mistralai/Mistral-Small-3.1-24B-Instruct-2503 (API FP8) · OSS | mistralai | Mar 2025 | SpeakLeash/CPTU-Bench | 3.88
28 | Llama-3.3-70B-Instruct · OSS | meta-llama | Dec 2024 | SpeakLeash/CPTU-Bench | 3.87
29 | Bielik-11B-v2.5-Instruct · OSS | speakleash | Jan 2025 | SpeakLeash/CPTU-Bench | 3.86
30 | speakleash/Bielik-Minitron-7B-v3.0-Instruct · OSS | speakleash | Jul 2025 | SpeakLeash/CPTU-Bench | 3.83
31 | Meta-Llama-3-70B-Instruct · OSS | meta-llama | Apr 2024 | SpeakLeash/CPTU-Bench | 3.82
32 | WizardLM-2-8x22B · OSS | alpindale | Apr 2024 | SpeakLeash/CPTU-Bench | 3.81
33 | Llama-4-Scout-17B-16E-Instruct · OSS | meta-llama | Apr 2025 | SpeakLeash/CPTU-Bench | 3.81
34 | Bielik-11B-v2.3-Instruct · OSS | speakleash | Nov 2024 | SpeakLeash/CPTU-Bench | 3.79
35 | gemma-3-27b-it · OSS | google | Mar 2025 | SpeakLeash/CPTU-Bench | 3.79
36 | Bielik-11B-v2.0-Instruct · OSS | speakleash | Aug 2024 | SpeakLeash/CPTU-Bench | 3.75
37 | Bielik-11B-v2.2-Instruct · OSS | speakleash | Oct 2024 | SpeakLeash/CPTU-Bench | 3.73
38 | CYFRAGOVPL/pllum-12b-nc-instruct-250715 · OSS | CYFRAGOVPL | Jul 2025 | SpeakLeash/CPTU-Bench | 3.73
39 | Mixtral-8x22B-Instruct-v0.1 · OSS | mistralai | Apr 2024 | SpeakLeash/CPTU-Bench | 3.67
40 | Llama-PLLuM-70B-instruct · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 3.63
41 | Llama-PLLuM-70B-chat · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 3.61
42 | Bielik-4.5B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 3.61
43 | Mistral-Small-24B-Instruct-2501 · OSS | mistralai | Jan 2025 | SpeakLeash/CPTU-Bench | 3.60
44 | PLLuM-8x7B-nc-instruct · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.59
45 | Qwen2.5-32B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.56
46 | Qwen2.5-14B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.56
47 | Qwen/Qwen3-14B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.56
48 | phi-4 · OSS | microsoft | Jan 2025 | SpeakLeash/CPTU-Bench | 3.54
49 | Qwen1.5-72B-Chat · OSS | Qwen | Feb 2024 | SpeakLeash/CPTU-Bench | 3.52
50 | PLLuM-8x7B-nc-chat · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.48
51 | Bielik-7B-Instruct-v0.1 · OSS | speakleash | Apr 2024 | SpeakLeash/CPTU-Bench | 3.48
52 | PLLuM-8x7B-instruct · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.47
53 | glm-4-9b-chat · OSS | THUDM | Jun 2024 | SpeakLeash/CPTU-Bench | 3.46
54 | PLLuM-8x7B-chat · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.45
55 | Qwen/Qwen3-30B-A3B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.39
56 | Meta-Llama-3.1-8B-Instruct · OSS | meta-llama | Jul 2024 | SpeakLeash/CPTU-Bench | 3.38
57 | CYFRAGOVPL/PLLuM-12B-nc-instruct · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.31
58 | EuroLLM-9B-Instruct · OSS | utter-project | Mar 2025 | SpeakLeash/CPTU-Bench | 3.30
59 | Mistral-Nemo-Instruct-2407 · OSS | mistralai | Jul 2024 | SpeakLeash/CPTU-Bench | 3.29
60 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 · OSS | nvidia | Jun 2025 | SpeakLeash/CPTU-Bench | 3.27
61 | CYFRAGOVPL/PLLuM-12B-nc-chat · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.23
62 | Qwen/Qwen3-8B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.23
63 | PLLuM-12B-chat · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.21
64 | SOLAR-10.7B-Instruct-v1.0 · OSS | upstage | Dec 2023 | SpeakLeash/CPTU-Bench | 3.18
65 | Mixtral-8x7B-Instruct-v0.1 · OSS | mistralai | Dec 2023 | SpeakLeash/CPTU-Bench | 3.17
66 | PLLuM-12B-instruct · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.17
67 | Meta-Llama-3-8B-Instruct · OSS | meta-llama | Apr 2024 | SpeakLeash/CPTU-Bench | 3.15
68 | openchat-3.5-0106-gemma · OSS | openchat | Dec 2023 | SpeakLeash/CPTU-Bench | 3.08
69 | Mistral-7B-Instruct-v0.3 · OSS | mistralai | May 2024 | SpeakLeash/CPTU-Bench | 3.06
70 | Qwen2.5-7B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.02
71 | Qwen/Qwen3.5-9B non-thinking (API, FP8) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 2.98
72 | Llama-PLLuM-8B-chat · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 2.93
73 | Starling-LM-7B-alpha · OSS | berkeley-nest | Nov 2023 | SpeakLeash/CPTU-Bench | 2.92
74 | gemma-2-2b-it · OSS | google | Jun 2024 | SpeakLeash/CPTU-Bench | 2.90
75 | CYFRAGOVPL/Llama-PLLuM-8B-instruct · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 2.90
76 | Yi-1.5-34B-Chat · OSS | 01-ai | May 2024 | SpeakLeash/CPTU-Bench | 2.87
77 | openchat-3.5-0106 · OSS | openchat | Dec 2023 | SpeakLeash/CPTU-Bench | 2.83
78 | internlm2-chat-20b · OSS | internlm | Jan 2024 | SpeakLeash/CPTU-Bench | 2.79
79 | trurl-2-13b-academic · OSS | Voicelab | Jan 2024 | SpeakLeash/CPTU-Bench | 2.75
80 | NousResearch/Hermes-3-Llama-3.2-3B · OSS | NousResearch | Oct 2024 | SpeakLeash/CPTU-Bench | 2.71
81 | Qwen2.5-3B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 2.46
82 | Phi-4-mini-instruct · OSS | microsoft | Apr 2025 | SpeakLeash/CPTU-Bench | 2.43
83 | Bielik-1.5B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 2.33
84 | Llama-3.2-3B-Instruct · OSS | meta-llama | Sep 2024 | SpeakLeash/CPTU-Bench | 2.29
85 | granite-3.1-2b-instruct · OSS | ibm-granite | Jan 2025 | SpeakLeash/CPTU-Bench | 2.23
86 | Phi-3.5-mini-instruct · OSS | microsoft | Aug 2024 | SpeakLeash/CPTU-Bench | 2.13
87 | LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct · OSS | LGAI-EXAONE | Jan 2025 | SpeakLeash/CPTU-Bench | 2.12
88 | EuroLLM-1.7B-Instruct · OSS | utter-project | Jan 2025 | SpeakLeash/CPTU-Bench | 1.79
89 | Llama-3.2-1B-Instruct · OSS | meta-llama | Sep 2024 | SpeakLeash/CPTU-Bench | 1.74
90 | h2oai/h2o-danube2-1.8b-chat · OSS | h2oai | Apr 2024 | SpeakLeash/CPTU-Bench | 1.59
91 | Qwen2.5-1.5B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 1.35
92 | SmolLM2-1.7B-Instruct · OSS | HuggingFaceTB | Feb 2025 | SpeakLeash/CPTU-Bench | 1.10
93 | Qwen/Qwen2.5-0.5B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 0.835
phraseology
93 rows
# | Model | Org | Submitted | Paper / code | phraseology
01 | gemini-2.0-flash-001 · OSS | Google | Feb 2025 | SpeakLeash/CPTU-Bench | 4.34
02 | gemini-2.0-flash-lite-001 · OSS | Google | Feb 2025 | SpeakLeash/CPTU-Bench | 4.24
03 | Qwen/Qwen3.5-35B-A3B non-thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.23
04 | WizardLM-2-8x22B · OSS | alpindale | Apr 2024 | SpeakLeash/CPTU-Bench | 4.22
05 | Qwen/Qwen3.5-27B non-thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.20
06 | Qwen/Qwen3.5-35B-A3B thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.15
07 | mistralai/Mistral-Small-3.1-24B-Instruct-2503 (API FP8) · OSS | mistralai | Mar 2025 | SpeakLeash/CPTU-Bench | 4.15
08 | Qwen/Qwen3.5-27B thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.11
09 | Qwen2.5-32B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 4.04
10 | gemma-3-27b-it · OSS | google | Mar 2025 | SpeakLeash/CPTU-Bench | 4.03
11 | mistralai/Mistral-Small-3.2-24B-Instruct-2506 (API FP8) · OSS | mistralai | Jun 2025 | SpeakLeash/CPTU-Bench | 4.00
12 | Mistral-Large-Instruct-2411 · OSS | mistralai | Nov 2024 | SpeakLeash/CPTU-Bench | 3.99
13 | Bielik-11B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 3.96
14 | Qwen2.5-72B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.93
15 | Llama-4-Scout-17B-16E-Instruct · OSS | meta-llama | Apr 2025 | SpeakLeash/CPTU-Bench | 3.90
16 | Mistral-Small-24B-Instruct-2501 · OSS | mistralai | Jan 2025 | SpeakLeash/CPTU-Bench | 3.88
17 | Mistral-Large-Instruct-2407 · OSS | mistralai | Jul 2024 | SpeakLeash/CPTU-Bench | 3.86
18 | Bielik-4.5B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 3.67
19 | deepseek-ai/DeepSeek-R1 (API) · OSS | deepseek-ai | Jan 2025 | SpeakLeash/CPTU-Bench | 3.60
20 | PLLuM-12B-instruct · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.59
21 | Mixtral-8x22B-Instruct-v0.1 · OSS | mistralai | Apr 2024 | SpeakLeash/CPTU-Bench | 3.55
22 | Bielik-11B-v2.3-Instruct · OSS | speakleash | Nov 2024 | SpeakLeash/CPTU-Bench | 3.55
23 | deepseek-ai/DeepSeek-V3.2 (API) · OSS | deepseek-ai | Jul 2025 | SpeakLeash/CPTU-Bench | 3.54
24 | 🚧 DeepSeek-V3-0324 · OSS | deepseek-ai | Mar 2025 | SpeakLeash/CPTU-Bench | 3.54
25 | CYFRAGOVPL/PLLuM-12B-nc-chat · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.54
26 | deepseek-ai/DeepSeek-V3 (API) · OSS | deepseek-ai | Dec 2024 | SpeakLeash/CPTU-Bench | 3.52
27 | Qwen/Qwen3-30B-A3B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.50
28 | openai/gpt-oss-120b (API) · OSS | openai | Jun 2025 | SpeakLeash/CPTU-Bench | 3.49
29 | Qwen/Qwen3-235B-A22B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.48
30 | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API) · OSS | meta-llama | Apr 2025 | SpeakLeash/CPTU-Bench | 3.48
31 | deepseek-ai/DeepSeek-V3.1 (API) · OSS | deepseek-ai | May 2025 | SpeakLeash/CPTU-Bench | 3.48
32 | Qwen/Qwen3.5-9B non-thinking (API, FP8) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 3.48
33 | Meta-Llama-3-70B-Instruct · OSS | meta-llama | Apr 2024 | SpeakLeash/CPTU-Bench | 3.46
34 | CYFRAGOVPL/Llama-PLLuM-8B-instruct · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 3.46
35 | PLLuM-8x7B-instruct · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.46
36 | pllum-12b-nc-chat-250715 · OSS | CYFRAGOVPL | Jul 2025 | SpeakLeash/CPTU-Bench | 3.46
37 | moonshotai/Kimi-K2-Instruct-0905 (API) · OSS | moonshotai | Sep 2025 | SpeakLeash/CPTU-Bench | 3.43
38 | PLLuM-12B-chat · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.43
39 | Bielik-11B-v2.6-Instruct · OSS | speakleash | Feb 2025 | SpeakLeash/CPTU-Bench | 3.41
40 | Qwen2.5-14B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.37
41 | Llama-PLLuM-8B-chat · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 3.36
42 | Llama-PLLuM-70B-chat · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 3.35
43 | PLLuM-8x7B-chat · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.35
44 | CYFRAGOVPL/PLLuM-12B-nc-instruct · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.32
45 | CYFRAGOVPL/pllum-12b-nc-instruct-250715 · OSS | CYFRAGOVPL | Jul 2025 | SpeakLeash/CPTU-Bench | 3.29
46 | Qwen2-72B-Instruct · OSS | Qwen | Jun 2024 | SpeakLeash/CPTU-Bench | 3.28
47 | Llama-PLLuM-70B-instruct · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 3.26
48 | SOLAR-10.7B-Instruct-v1.0 · OSS | upstage | Dec 2023 | SpeakLeash/CPTU-Bench | 3.25
49 | Bielik-11B-v2.2-Instruct · OSS | speakleash | Oct 2024 | SpeakLeash/CPTU-Bench | 3.25
50 | Meta-Llama-3.1-70B-Instruct · OSS | meta-llama | Jul 2024 | SpeakLeash/CPTU-Bench | 3.25
51 | Qwen/Qwen3-14B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.25
52 | phi-4 · OSS | microsoft | Jan 2025 | SpeakLeash/CPTU-Bench | 3.23
53 | Qwen/Qwen3-32B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.23
54 | speakleash/Bielik-Minitron-7B-v3.0-Instruct · OSS | speakleash | Jul 2025 | SpeakLeash/CPTU-Bench | 3.23
55 | PLLuM-8x7B-nc-instruct · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.22
56 | EuroLLM-9B-Instruct · OSS | utter-project | Mar 2025 | SpeakLeash/CPTU-Bench | 3.17
57 | Bielik-11B-v2.5-Instruct · OSS | speakleash | Jan 2025 | SpeakLeash/CPTU-Bench | 3.13
58 | Bielik-11B-v2.0-Instruct · OSS | speakleash | Aug 2024 | SpeakLeash/CPTU-Bench | 3.13
59 | Bielik-11B-v2.1-Instruct · OSS | speakleash | Sep 2024 | SpeakLeash/CPTU-Bench | 3.10
60 | Qwen2.5-7B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.10
61 | PLLuM-8x7B-nc-chat · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.08
62 | Llama-3.3-70B-Instruct · OSS | meta-llama | Dec 2024 | SpeakLeash/CPTU-Bench | 3.04
63 | Meta-Llama-3-8B-Instruct · OSS | meta-llama | Apr 2024 | SpeakLeash/CPTU-Bench | 3.04
64 | Qwen1.5-72B-Chat · OSS | Qwen | Feb 2024 | SpeakLeash/CPTU-Bench | 2.98
65 | Mixtral-8x7B-Instruct-v0.1 · OSS | mistralai | Dec 2023 | SpeakLeash/CPTU-Bench | 2.88
66 | Starling-LM-7B-alpha · OSS | berkeley-nest | Nov 2023 | SpeakLeash/CPTU-Bench | 2.85
67 | Qwen2.5-3B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 2.80
68 | glm-4-9b-chat · OSS | THUDM | Jun 2024 | SpeakLeash/CPTU-Bench | 2.78
69 | Qwen/Qwen3-8B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 2.77
70 | NousResearch/Hermes-3-Llama-3.2-3B · OSS | NousResearch | Oct 2024 | SpeakLeash/CPTU-Bench | 2.77
71 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 · OSS | nvidia | Jun 2025 | SpeakLeash/CPTU-Bench | 2.76
72 | Mistral-Nemo-Instruct-2407 · OSS | mistralai | Jul 2024 | SpeakLeash/CPTU-Bench | 2.74
73 | Mistral-7B-Instruct-v0.3 · OSS | mistralai | May 2024 | SpeakLeash/CPTU-Bench | 2.68
74 | Qwen/Qwen2.5-0.5B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 2.60
75 | Meta-Llama-3.1-8B-Instruct · OSS | meta-llama | Jul 2024 | SpeakLeash/CPTU-Bench | 2.58
76 | openchat-3.5-0106 · OSS | openchat | Dec 2023 | SpeakLeash/CPTU-Bench | 2.56
77 | h2oai/h2o-danube2-1.8b-chat · OSS | h2oai | Apr 2024 | SpeakLeash/CPTU-Bench | 2.47
78 | openchat-3.5-0106-gemma · OSS | openchat | Dec 2023 | SpeakLeash/CPTU-Bench | 2.44
79 | Phi-3.5-mini-instruct · OSS | microsoft | Aug 2024 | SpeakLeash/CPTU-Bench | 2.42
80 | internlm2-chat-20b · OSS | internlm | Jan 2024 | SpeakLeash/CPTU-Bench | 2.38
81 | Bielik-1.5B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 2.38
82 | Yi-1.5-34B-Chat · OSS | 01-ai | May 2024 | SpeakLeash/CPTU-Bench | 2.38
83 | SmolLM2-1.7B-Instruct · OSS | HuggingFaceTB | Feb 2025 | SpeakLeash/CPTU-Bench | 2.35
84 | Llama-3.2-1B-Instruct · OSS | meta-llama | Sep 2024 | SpeakLeash/CPTU-Bench | 2.34
85 | Bielik-7B-Instruct-v0.1 · OSS | speakleash | Apr 2024 | SpeakLeash/CPTU-Bench | 2.31
86 | EuroLLM-1.7B-Instruct · OSS | utter-project | Jan 2025 | SpeakLeash/CPTU-Bench | 2.26
87 | Phi-4-mini-instruct · OSS | microsoft | Apr 2025 | SpeakLeash/CPTU-Bench | 2.25
88 | Qwen2.5-1.5B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 2.23
89 | trurl-2-13b-academic · OSS | Voicelab | Jan 2024 | SpeakLeash/CPTU-Bench | 2.17
90 | LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct · OSS | LGAI-EXAONE | Jan 2025 | SpeakLeash/CPTU-Bench | 2.13
91 | gemma-2-2b-it · OSS | google | Jun 2024 | SpeakLeash/CPTU-Bench | 2.10
92 | granite-3.1-2b-instruct · OSS | ibm-granite | Jan 2025 | SpeakLeash/CPTU-Bench | 1.88
93 | Llama-3.2-3B-Instruct · OSS | meta-llama | Sep 2024 | SpeakLeash/CPTU-Bench | 1.72
sentiment
93 rows
# | Model | Org | Submitted | Paper / code | sentiment
01 | gemini-2.0-flash-001 · OSS | Google | Feb 2025 | SpeakLeash/CPTU-Bench | 4.52
02 | deepseek-ai/DeepSeek-R1 (API) · OSS | deepseek-ai | Jan 2025 | SpeakLeash/CPTU-Bench | 4.49
03 | deepseek-ai/DeepSeek-V3.2 (API) · OSS | deepseek-ai | Jul 2025 | SpeakLeash/CPTU-Bench | 4.46
04 | deepseek-ai/DeepSeek-V3.1 (API) · OSS | deepseek-ai | May 2025 | SpeakLeash/CPTU-Bench | 4.42
05 | Qwen/Qwen3.5-27B thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.42
06 | moonshotai/Kimi-K2-Instruct-0905 (API) · OSS | moonshotai | Sep 2025 | SpeakLeash/CPTU-Bench | 4.39
07 | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API) · OSS | meta-llama | Apr 2025 | SpeakLeash/CPTU-Bench | 4.39
08 | deepseek-ai/DeepSeek-V3 (API) · OSS | deepseek-ai | Dec 2024 | SpeakLeash/CPTU-Bench | 4.36
09 | 🚧 DeepSeek-V3-0324 · OSS | deepseek-ai | Mar 2025 | SpeakLeash/CPTU-Bench | 4.36
10 | pllum-12b-nc-chat-250715 · OSS | CYFRAGOVPL | Jul 2025 | SpeakLeash/CPTU-Bench | 4.36
11 | Mistral-Large-Instruct-2411 · OSS | mistralai | Nov 2024 | SpeakLeash/CPTU-Bench | 4.33
12 | Meta-Llama-3.1-70B-Instruct · OSS | meta-llama | Jul 2024 | SpeakLeash/CPTU-Bench | 4.33
13 | Llama-3.3-70B-Instruct · OSS | meta-llama | Dec 2024 | SpeakLeash/CPTU-Bench | 4.29
14 | Qwen/Qwen3.5-27B non-thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.29
15 | Mistral-Large-Instruct-2407 · OSS | mistralai | Jul 2024 | SpeakLeash/CPTU-Bench | 4.23
16 | gemini-2.0-flash-lite-001 · OSS | Google | Feb 2025 | SpeakLeash/CPTU-Bench | 4.23
17 | Qwen/Qwen3.5-35B-A3B non-thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.23
18 | Qwen/Qwen3-235B-A22B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 4.17
19 | Meta-Llama-3-70B-Instruct · OSS | meta-llama | Apr 2024 | SpeakLeash/CPTU-Bench | 4.13
20 | Qwen/Qwen3-32B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 4.13
21 | mistralai/Mistral-Small-3.1-24B-Instruct-2503 (API FP8) · OSS | mistralai | Mar 2025 | SpeakLeash/CPTU-Bench | 4.13
22 | Llama-4-Scout-17B-16E-Instruct · OSS | meta-llama | Apr 2025 | SpeakLeash/CPTU-Bench | 4.10
23 | Qwen/Qwen3.5-35B-A3B thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.10
24 | Bielik-11B-v2.6-Instruct · OSS | speakleash | Feb 2025 | SpeakLeash/CPTU-Bench | 4.10
25 | Qwen2.5-72B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 4.08
26 | Bielik-11B-v2.5-Instruct · OSS | speakleash | Jan 2025 | SpeakLeash/CPTU-Bench | 4.01
27 | mistralai/Mistral-Small-3.2-24B-Instruct-2506 (API FP8) · OSS | mistralai | Jun 2025 | SpeakLeash/CPTU-Bench | 4.01
28 | Meta-Llama-3.1-8B-Instruct · OSS | meta-llama | Jul 2024 | SpeakLeash/CPTU-Bench | 3.97
29 | Bielik-11B-v2.0-Instruct · OSS | speakleash | Aug 2024 | SpeakLeash/CPTU-Bench | 3.97
30 | Bielik-11B-v2.3-Instruct · OSS | speakleash | Nov 2024 | SpeakLeash/CPTU-Bench | 3.97
31 | Bielik-11B-v2.1-Instruct · OSS | speakleash | Sep 2024 | SpeakLeash/CPTU-Bench | 3.96
32 | openai/gpt-oss-120b (API) · OSS | openai | Jun 2025 | SpeakLeash/CPTU-Bench | 3.94
33 | Llama-PLLuM-70B-chat · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 3.94
34 | Qwen/Qwen3-14B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.91
35 | Qwen2.5-14B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.91
36 | Mistral-Small-24B-Instruct-2501 · OSS | mistralai | Jan 2025 | SpeakLeash/CPTU-Bench | 3.91
37 | CYFRAGOVPL/pllum-12b-nc-instruct-250715 · OSS | CYFRAGOVPL | Jul 2025 | SpeakLeash/CPTU-Bench | 3.91
38 | PLLuM-8x7B-nc-instruct · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.88
39 | Bielik-11B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 3.88
40 | gemma-3-27b-it · OSS | google | Mar 2025 | SpeakLeash/CPTU-Bench | 3.88
41 | Qwen2.5-32B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.81
42 | Mixtral-8x22B-Instruct-v0.1 · OSS | mistralai | Apr 2024 | SpeakLeash/CPTU-Bench | 3.78
43 | Llama-PLLuM-70B-instruct · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 3.78
44 | Bielik-4.5B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 3.76
45 | Qwen2-72B-Instruct · OSS | Qwen | Jun 2024 | SpeakLeash/CPTU-Bench | 3.76
46 | PLLuM-8x7B-nc-chat · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.76
47 | openchat-3.5-0106-gemma · OSS | openchat | Dec 2023 | SpeakLeash/CPTU-Bench | 3.73
48 | Qwen/Qwen3-30B-A3B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.72
49 | speakleash/Bielik-Minitron-7B-v3.0-Instruct · OSS | speakleash | Jul 2025 | SpeakLeash/CPTU-Bench | 3.72
50 | phi-4 · OSS | microsoft | Jan 2025 | SpeakLeash/CPTU-Bench | 3.72
51 | Bielik-11B-v2.2-Instruct · OSS | speakleash | Oct 2024 | SpeakLeash/CPTU-Bench | 3.72
52 | PLLuM-12B-instruct · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.71
53 | WizardLM-2-8x22B · OSS | alpindale | Apr 2024 | SpeakLeash/CPTU-Bench | 3.71
54 | Mistral-Nemo-Instruct-2407 · OSS | mistralai | Jul 2024 | SpeakLeash/CPTU-Bench | 3.64
55 | PLLuM-8x7B-instruct · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.59
56 | glm-4-9b-chat · OSS | THUDM | Jun 2024 | SpeakLeash/CPTU-Bench | 3.59
57 | Bielik-7B-Instruct-v0.1 · OSS | speakleash | Apr 2024 | SpeakLeash/CPTU-Bench | 3.59
58 | Qwen2.5-7B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.56
59 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 · OSS | nvidia | Jun 2025 | SpeakLeash/CPTU-Bench | 3.53
60 | Bielik-1.5B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 3.53
61 | Qwen/Qwen3-8B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.49
62 | Qwen1.5-72B-Chat · OSS | Qwen | Feb 2024 | SpeakLeash/CPTU-Bench | 3.47
63 | PLLuM-8x7B-chat · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 3.44
64 | gemma-2-2b-it · OSS | google | Jun 2024 | SpeakLeash/CPTU-Bench | 3.40
65 | EuroLLM-9B-Instruct · OSS | utter-project | Mar 2025 | SpeakLeash/CPTU-Bench | 3.37
66 | Meta-Llama-3-8B-Instruct · OSS | meta-llama | Apr 2024 | SpeakLeash/CPTU-Bench | 3.33
67 | Mistral-7B-Instruct-v0.3 · OSS | mistralai | May 2024 | SpeakLeash/CPTU-Bench | 3.33
68 | PLLuM-12B-chat · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.32
69 | internlm2-chat-20b · OSS | internlm | Jan 2024 | SpeakLeash/CPTU-Bench | 3.30
70 | trurl-2-13b-academic · OSS | Voicelab | Jan 2024 | SpeakLeash/CPTU-Bench | 3.30
71 | CYFRAGOVPL/Llama-PLLuM-8B-instruct · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 3.24
72 | CYFRAGOVPL/PLLuM-12B-nc-instruct · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.24
73 | CYFRAGOVPL/PLLuM-12B-nc-chat · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 3.22
74 | openchat-3.5-0106 · OSS | openchat | Dec 2023 | SpeakLeash/CPTU-Bench | 3.16
75 | Llama-PLLuM-8B-chat · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 3.13
76 | Llama-3.2-1B-Instruct · OSS | meta-llama | Sep 2024 | SpeakLeash/CPTU-Bench | 3.08
77 | Yi-1.5-34B-Chat · OSS | 01-ai | May 2024 | SpeakLeash/CPTU-Bench | 3.08
78 | granite-3.1-2b-instruct · OSS | ibm-granite | Jan 2025 | SpeakLeash/CPTU-Bench | 3.08
79 | Starling-LM-7B-alpha · OSS | berkeley-nest | Nov 2023 | SpeakLeash/CPTU-Bench | 3.06
80 | Mixtral-8x7B-Instruct-v0.1 · OSS | mistralai | Dec 2023 | SpeakLeash/CPTU-Bench | 3.06
81 | Qwen/Qwen3.5-9B non-thinking (API, FP8) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 3.01
82 | SOLAR-10.7B-Instruct-v1.0 · OSS | upstage | Dec 2023 | SpeakLeash/CPTU-Bench | 2.97
83 | Qwen2.5-3B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 2.95
84 | Qwen2.5-1.5B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 2.79
85 | Llama-3.2-3B-Instruct · OSS | meta-llama | Sep 2024 | SpeakLeash/CPTU-Bench | 2.76
86 | Phi-4-mini-instruct · OSS | microsoft | Apr 2025 | SpeakLeash/CPTU-Bench | 2.69
87 | NousResearch/Hermes-3-Llama-3.2-3B · OSS | NousResearch | Oct 2024 | SpeakLeash/CPTU-Bench | 2.62
88 | Phi-3.5-mini-instruct · OSS | microsoft | Aug 2024 | SpeakLeash/CPTU-Bench | 2.44
89 | h2oai/h2o-danube2-1.8b-chat · OSS | h2oai | Apr 2024 | SpeakLeash/CPTU-Bench | 2.37
90 | SmolLM2-1.7B-Instruct · OSS | HuggingFaceTB | Feb 2025 | SpeakLeash/CPTU-Bench | 2.28
91 | EuroLLM-1.7B-Instruct · OSS | utter-project | Jan 2025 | SpeakLeash/CPTU-Bench | 2.24
92 | Qwen/Qwen2.5-0.5B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 1.96
93 | LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct · OSS | LGAI-EXAONE | Jan 2025 | SpeakLeash/CPTU-Bench | 1.94
tricky-questions
93 rows
# | Model | Org | Submitted | Paper / code | tricky-questions
01 | Qwen/Qwen3.5-35B-A3B thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.70
02 | Qwen/Qwen3.5-27B thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.61
03 | Qwen/Qwen3.5-27B non-thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.43
04 | deepseek-ai/DeepSeek-V3.2 (API) · OSS | deepseek-ai | Jul 2025 | SpeakLeash/CPTU-Bench | 4.20
05 | Qwen/Qwen3.5-35B-A3B non-thinking (API) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 4.19
06 | deepseek-ai/DeepSeek-R1 (API) · OSS | deepseek-ai | Jan 2025 | SpeakLeash/CPTU-Bench | 4.12
07 | 🚧 DeepSeek-V3-0324 · OSS | deepseek-ai | Mar 2025 | SpeakLeash/CPTU-Bench | 4.02
08 | deepseek-ai/DeepSeek-V3 (API) · OSS | deepseek-ai | Dec 2024 | SpeakLeash/CPTU-Bench | 3.99
09 | gemini-2.0-flash-001 · OSS | Google | Feb 2025 | SpeakLeash/CPTU-Bench | 3.99
10 | moonshotai/Kimi-K2-Instruct-0905 (API) · OSS | moonshotai | Sep 2025 | SpeakLeash/CPTU-Bench | 3.93
11 | openai/gpt-oss-120b (API) · OSS | openai | Jun 2025 | SpeakLeash/CPTU-Bench | 3.89
12 | deepseek-ai/DeepSeek-V3.1 (API) · OSS | deepseek-ai | May 2025 | SpeakLeash/CPTU-Bench | 3.87
13 | gemini-2.0-flash-lite-001 · OSS | Google | Feb 2025 | SpeakLeash/CPTU-Bench | 3.85
14 | Qwen/Qwen3-235B-A22B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.84
15 | Qwen2.5-72B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.81
16 | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API) · OSS | meta-llama | Apr 2025 | SpeakLeash/CPTU-Bench | 3.76
17 | Mistral-Large-Instruct-2411 · OSS | mistralai | Nov 2024 | SpeakLeash/CPTU-Bench | 3.72
18 | Meta-Llama-3-70B-Instruct · OSS | meta-llama | Apr 2024 | SpeakLeash/CPTU-Bench | 3.71
19 | Qwen2-72B-Instruct · OSS | Qwen | Jun 2024 | SpeakLeash/CPTU-Bench | 3.68
20 | Mistral-Large-Instruct-2407 · OSS | mistralai | Jul 2024 | SpeakLeash/CPTU-Bench | 3.65
21 | Qwen/Qwen3.5-9B non-thinking (API, FP8) · OSS | Qwen | Jul 2025 | SpeakLeash/CPTU-Bench | 3.64
22 | Qwen2.5-32B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.59
23 | Qwen/Qwen3-32B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.56
24 | Qwen/Qwen3-30B-A3B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.54
25 | gemma-3-27b-it · OSS | google | Mar 2025 | SpeakLeash/CPTU-Bench | 3.53
26 | Bielik-11B-v2.1-Instruct · OSS | speakleash | Sep 2024 | SpeakLeash/CPTU-Bench | 3.47
27 | Mistral-Small-24B-Instruct-2501 · OSS | mistralai | Jan 2025 | SpeakLeash/CPTU-Bench | 3.45
28 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 · OSS | nvidia | Jun 2025 | SpeakLeash/CPTU-Bench | 3.43
29 | mistralai/Mistral-Small-3.1-24B-Instruct-2503 (API FP8) · OSS | mistralai | Mar 2025 | SpeakLeash/CPTU-Bench | 3.42
30 | Llama-3.3-70B-Instruct · OSS | meta-llama | Dec 2024 | SpeakLeash/CPTU-Bench | 3.38
31 | Qwen2.5-14B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 3.34
32 | Qwen/Qwen3-14B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 3.33
33 | mistralai/Mistral-Small-3.2-24B-Instruct-2506 (API FP8) · OSS | mistralai | Jun 2025 | SpeakLeash/CPTU-Bench | 3.30
34 | Mixtral-8x22B-Instruct-v0.1 · OSS | mistralai | Apr 2024 | SpeakLeash/CPTU-Bench | 3.24
35 | Bielik-11B-v2.3-Instruct · OSS | speakleash | Nov 2024 | SpeakLeash/CPTU-Bench | 3.22
36 | Llama-PLLuM-70B-chat · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 3.21
37 | Llama-4-Scout-17B-16E-Instruct · OSS | meta-llama | Apr 2025 | SpeakLeash/CPTU-Bench | 3.19
38 | Bielik-11B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 3.19
39 | Bielik-11B-v2.2-Instruct · OSS | speakleash | Oct 2024 | SpeakLeash/CPTU-Bench | 3.12
40 | Bielik-11B-v2.6-Instruct · OSS | speakleash | Feb 2025 | SpeakLeash/CPTU-Bench | 3.10
41 | WizardLM-2-8x22B · OSS | alpindale | Apr 2024 | SpeakLeash/CPTU-Bench | 3.06
42 | Meta-Llama-3.1-70B-Instruct · OSS | meta-llama | Jul 2024 | SpeakLeash/CPTU-Bench | 3.01
43 | Bielik-11B-v2.5-Instruct · OSS | speakleash | Jan 2025 | SpeakLeash/CPTU-Bench | 2.91
44 | pllum-12b-nc-chat-250715 · OSS | CYFRAGOVPL | Jul 2025 | SpeakLeash/CPTU-Bench | 2.90
45 | Qwen/Qwen3-8B non-thinking (API) · OSS | Qwen | Apr 2025 | SpeakLeash/CPTU-Bench | 2.76
46 | EuroLLM-9B-Instruct · OSS | utter-project | Mar 2025 | SpeakLeash/CPTU-Bench | 2.75
47 | speakleash/Bielik-Minitron-7B-v3.0-Instruct · OSS | speakleash | Jul 2025 | SpeakLeash/CPTU-Bench | 2.74
48 | phi-4 · OSS | microsoft | Jan 2025 | SpeakLeash/CPTU-Bench | 2.72
49 | Qwen1.5-72B-Chat · OSS | Qwen | Feb 2024 | SpeakLeash/CPTU-Bench | 2.67
50 | Llama-PLLuM-70B-instruct · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 2.63
51 | CYFRAGOVPL/PLLuM-12B-nc-chat · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 2.62
52 | PLLuM-12B-chat · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 2.59
53 | Qwen2.5-7B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 2.58
54 | Meta-Llama-3-8B-Instruct · OSS | meta-llama | Apr 2024 | SpeakLeash/CPTU-Bench | 2.48
55 | Bielik-4.5B-v3.0-Instruct · OSS | speakleash | Jun 2025 | SpeakLeash/CPTU-Bench | 2.46
56 | CYFRAGOVPL/pllum-12b-nc-instruct-250715 · OSS | CYFRAGOVPL | Jul 2025 | SpeakLeash/CPTU-Bench | 2.37
57 | Llama-PLLuM-8B-chat · OSS | CYFRAGOVPL | Mar 2025 | SpeakLeash/CPTU-Bench | 2.25
58 | gemma-2-2b-it · OSS | google | Jun 2024 | SpeakLeash/CPTU-Bench | 2.21
59 | Bielik-11B-v2.0-Instruct · OSS | speakleash | Aug 2024 | SpeakLeash/CPTU-Bench | 2.20
60 | Bielik-7B-Instruct-v0.1 · OSS | speakleash | Apr 2024 | SpeakLeash/CPTU-Bench | 2.16
61 | SOLAR-10.7B-Instruct-v1.0 · OSS | upstage | Dec 2023 | SpeakLeash/CPTU-Bench | 2.12
62 | Meta-Llama-3.1-8B-Instruct · OSS | meta-llama | Jul 2024 | SpeakLeash/CPTU-Bench | 2.11
63 | Mistral-Nemo-Instruct-2407 · OSS | mistralai | Jul 2024 | SpeakLeash/CPTU-Bench | 2.09
64 | Mistral-7B-Instruct-v0.3 · OSS | mistralai | May 2024 | SpeakLeash/CPTU-Bench | 1.99
65 | glm-4-9b-chat · OSS | THUDM | Jun 2024 | SpeakLeash/CPTU-Bench | 1.98
66 | CYFRAGOVPL/PLLuM-12B-nc-instruct · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 1.98
67 | openchat-3.5-0106 · OSS | openchat | Dec 2023 | SpeakLeash/CPTU-Bench | 1.96
68 | PLLuM-12B-instruct · OSS | CYFRAGOVPL | Apr 2025 | SpeakLeash/CPTU-Bench | 1.90
69 | Qwen2.5-3B-Instruct · OSS | Qwen | Sep 2024 | SpeakLeash/CPTU-Bench | 1.81
70 | PLLuM-8x7B-nc-chat · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 1.80
71 | Mixtral-8x7B-Instruct-v0.1 · OSS | mistralai | Dec 2023 | SpeakLeash/CPTU-Bench | 1.80
72 | PLLuM-8x7B-chat · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 1.78
73 | PLLuM-8x7B-nc-instruct · OSS | CYFRAGOVPL | Feb 2025 | SpeakLeash/CPTU-Bench | 1.76
74 | openchat-3.5-0106-gemma · OSS | openchat | Dec 2023 | SpeakLeash/CPTU-Bench | 1.68
75 | Starling-LM-7B-alpha · OSS | berkeley-nest | Nov 2023 | SpeakLeash/CPTU-Bench | 1.68
76CYFRAGOVPL/Llama-PLLuM-8B-instructOSSCYFRAGOVPLMar 2025SpeakLeash/CPTU-Bench1.66
77PLLuM-8x7B-instructOSSCYFRAGOVPLFeb 2025SpeakLeash/CPTU-Bench1.51
78Phi-4-mini-instructOSSmicrosoftApr 2025SpeakLeash/CPTU-Bench1.30
79Bielik-1.5B-v3.0-InstructOSSspeakleashJun 2025SpeakLeash/CPTU-Bench1.22
80Llama-3.2-3B-InstructOSSmeta-llamaSep 2024SpeakLeash/CPTU-Bench1.22
81NousResearch/Hermes-3-Llama-3.2-3BOSSNousResearchOct 2024SpeakLeash/CPTU-Bench1.14
82Phi-3.5-mini-instructOSSmicrosoftAug 2024SpeakLeash/CPTU-Bench1.04
83trurl-2-13b-academicOSSVoicelabJan 2024SpeakLeash/CPTU-Bench1.02
84Yi-1.5-34B-ChatOSS01-aiMay 2024SpeakLeash/CPTU-Bench1.00
85EuroLLM-1.7B-InstructOSSutter-projectJan 2025SpeakLeash/CPTU-Bench0.758
86Qwen2.5-1.5B-InstructOSSQwenSep 2024SpeakLeash/CPTU-Bench0.663
87granite-3.1-2b-instructOSSibm-graniteJan 2025SpeakLeash/CPTU-Bench0.590
88Llama-3.2-1B-InstructOSSmeta-llamaSep 2024SpeakLeash/CPTU-Bench0.522
89LGAI-EXAONE/EXAONE-3.5-2.4B-InstructOSSLGAI-EXAONEJan 2025SpeakLeash/CPTU-Bench0.489
90SmolLM2-1.7B-InstructOSSHuggingFaceTBFeb 2025SpeakLeash/CPTU-Bench0.253
91Qwen/Qwen2.5-0.5B-InstructOSSQwenSep 2024SpeakLeash/CPTU-Bench0.219
92h2oai/h2o-danube2-1.8b-chatOSSh2oaiApr 2024SpeakLeash/CPTU-Bench0.129
93internlm2-chat-20bOSSinternlmJan 2024SpeakLeash/CPTU-Bench0.124
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
§ 03 · Progress

14 steps of state of the art.

Each row below marks a model that broke the previous record on the average metric. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · average
  1. Nov 30, 2023 · Starling-LM-7B-alpha · berkeley-nest · 2.63
  2. Dec 11, 2023 · Mixtral-8x7B-Instruct-v0.1 · mistralai · 2.73
  3. Dec 20, 2023 · openchat-3.5-0106-gemma · openchat · 2.73
  4. Dec 21, 2023 · SOLAR-10.7B-Instruct-v1.0 · upstage · 2.88
  5. Feb 5, 2024 · Qwen1.5-72B-Chat · Qwen · 3.16
  6. Apr 15, 2024 · WizardLM-2-8x22B · alpindale · 3.70
  7. Apr 18, 2024 · Meta-Llama-3-70B-Instruct · meta-llama · 3.78
  8. Jul 24, 2024 · Mistral-Large-Instruct-2407 · mistralai · 3.93
  9. Sep 19, 2024 · Qwen2.5-72B-Instruct · Qwen · 3.95
  10. Nov 18, 2024 · Mistral-Large-Instruct-2411 · mistralai · 4.00
  11. Dec 26, 2024 · deepseek-ai/DeepSeek-V3 (API) · deepseek-ai · 4.02
  12. Jan 20, 2025 · deepseek-ai/DeepSeek-R1 (API) · deepseek-ai · 4.14
  13. Feb 5, 2025 · gemini-2.0-flash-001 · Google · 4.29
  14. Jul 15, 2025 · Qwen/Qwen3.5-27B thinking (API) · Qwen · 4.34
Fig 3 · SOTA-setting models only. 14 entries span Nov 2023 to Jul 2025.
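
The record-tracking rule described in this section can be sketched in Python as a running maximum over date-ordered results. This is only an illustration under stated assumptions, not Codesota's actual pipeline: `Result` and `sota_line` are hypothetical names, and one input row is invented to show a non-record entry being skipped. The sketch requires strict improvement, so two entries that display the same rounded score (such as the two 2.73 rows above) would produce a single step.

```python
from datetime import date
from typing import NamedTuple


class Result(NamedTuple):
    released: date  # model or paper release date
    model: str
    score: float    # "average" metric, higher is better


def sota_line(results: list[Result]) -> list[Result]:
    """Return the entries that strictly beat every earlier score."""
    steps: list[Result] = []
    best = float("-inf")
    # Walk the results in chronological order; sorted() is stable,
    # so same-day entries keep their input order.
    for r in sorted(results, key=lambda r: r.released):
        if r.score > best:
            steps.append(r)
            best = r.score
    return steps


# Four rows copied from the list above, plus one invented row
# (hypothetical, not a real submission) that stays below the record.
results = [
    Result(date(2024, 9, 19), "Qwen2.5-72B-Instruct", 3.95),
    Result(date(2023, 11, 30), "Starling-LM-7B-alpha", 2.63),
    Result(date(2024, 7, 24), "Mistral-Large-Instruct-2407", 3.93),
    Result(date(2024, 4, 18), "Meta-Llama-3-70B-Instruct", 3.78),
    Result(date(2024, 7, 23), "hypothetical-model", 3.10),
]

for step in sota_line(results):
    print(step.released.isoformat(), step.model, step.score)
```

The invented row is dropped because 3.10 does not exceed the 3.78 record that stood on its date; the other four rows come out in chronological order.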
§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
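
As a rough pre-submission check, the five requirements above could be captured in a small record and validated locally. This is a sketch under assumptions: the `Submission` class, its field names, and the `problems` helper are hypothetical and do not reflect Codesota's actual submission format. The metric names and the 0-5 score range are taken from this page.

```python
from dataclasses import dataclass, field

# Metric names as listed on this page for CPTU-Bench.
METRICS = (
    "average",
    "language-understanding",
    "phraseology",
    "sentiment",
    "tricky-questions",
)


@dataclass
class Submission:  # hypothetical record layout, for illustration only
    model: str         # public checkpoint name or API endpoint
    commit: str        # frozen commit of the reproduction script
    seed: int
    environment: str   # Python version and pinned dependencies
    contact: str
    scores: dict[str, float] = field(default_factory=dict)


def problems(sub: Submission) -> list[str]:
    """Return human-readable problems; an empty list means the record looks complete."""
    found: list[str] = []
    for name in ("model", "commit", "environment", "contact"):
        if not getattr(sub, name).strip():
            found.append(f"missing {name}")
    for metric in METRICS:
        if metric not in sub.scores:
            found.append(f"no row for metric {metric!r}")
        elif not 0.0 <= sub.scores[metric] <= 5.0:
            found.append(f"{metric!r} outside the 0-5 score range")
    for extra in sorted(set(sub.scores) - set(METRICS)):
        found.append(f"unknown metric {extra!r}")
    return found


# Example with placeholder values (not a real submission).
draft = Submission(
    model="example-org/example-model",
    commit="",
    seed=0,
    environment="Python 3.11, deps pinned in requirements.txt",
    contact="team@example.org",
    scores={"average": 3.2, "sentiment": 3.9},
)
for issue in problems(draft):
    print(issue)
```

Returning a list of problems rather than raising on the first one lets a submitter fix every gap in a single pass.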