Codesota · Models · Grok 4 · xAI · 16 results · 7 benchmarks
Model card

Grok 4.

xAI · api
§ 01 · Benchmarks

Every benchmark Grok 4 has a recorded score for.

| # | Benchmark | Area · Task | Metric | Value | Rank | Date | Source |
|---|---|---|---|---|---|---|---|
| 01 | HLE | Reasoning · Multi-step Reasoning | accuracy | 24.5% | #3/13 | | unverified |
| 02 | PLCC | Natural Language Processing · Polish Cultural Competency | history | 94.0% | #3/165 | | source |
| 03 | PLCC | Natural Language Processing · Polish Cultural Competency | grammar | 90.0% | #3/165 | | source |
| 04 | LiveCodeBench | Computer Code · Code Generation | pass@1 | 79.0% | #4/30 | | source |
| 05 | PLCC | Natural Language Processing · Polish Cultural Competency | culture-and-tradition | 95.0% | #5/165 | | source |
| 06 | GPQA | Reasoning · Multi-step Reasoning | accuracy | 88.0% | #6/33 | | source |
| 07 | PLCC | Natural Language Processing · Polish Cultural Competency | average | 90.5% | #7/165 | | source |
| 08 | React Native Evals | Mobile Development · React Native Code Generation | animation-satisfaction | 59.4% | #8/10 | | source |
| 09 | React Native Evals | Mobile Development · React Native Code Generation | requirement-satisfaction | 70.1% | #9/10 | | source |
| 10 | React Native Evals | Mobile Development · React Native Code Generation | async-state-satisfaction | 73.8% | #9/10 | | source |
| 11 | React Native Evals | Mobile Development · React Native Code Generation | navigation-satisfaction | 84.4% | #9/10 | | source |
| 12 | PLCC | Natural Language Processing · Polish Cultural Competency | art-and-entertainment | 86.0% | #10/165 | | source |
| 13 | MMLU-Pro | Reasoning · Commonsense Reasoning | accuracy | 86.6% | #14/20 | 2026-04-20 | source |
| 14 | PLCC | Natural Language Processing · Polish Cultural Competency | geography | 94.0% | #14/165 | | source |
| 15 | PLCC | Natural Language Processing · Polish Cultural Competency | vocabulary | 84.0% | #18/165 | | source |
| 16 | MMLU | Reasoning · Commonsense Reasoning | accuracy | 86.6% | #31/41 | | source |
The Rank column shows this model's position among all models scored on the same benchmark and metric; the number after the slash is the count of competitors. A #1 shown in red marks the current SOTA. Rows are sorted by rank, then by newest result.
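The rank strings and the sort order described here can be reproduced with a few lines of Python. This is a minimal sketch, not Codesota's implementation: the names `Rank`, `parse_rank`, and `sort_key` are hypothetical, the row tuples are a hand-copied subset of the table, and placing undated rows after dated ones within the same rank is an assumption.

```python
import re
from datetime import date
from typing import NamedTuple, Optional, Tuple

# Matches rank strings such as "#3/13" or "#14/165".
RANK_RE = re.compile(r"^#(\d+)/(\d+)$")


class Rank(NamedTuple):
    position: int  # number before the slash
    field: int     # number after the slash, as displayed on the page


def parse_rank(text: str) -> Rank:
    """Split a '#position/field' string into two integers."""
    m = RANK_RE.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognized rank string: {text!r}")
    return Rank(int(m.group(1)), int(m.group(2)))


Row = Tuple[str, str, Optional[date]]  # (label, rank string, result date)


def sort_key(row: Row) -> Tuple[int, int]:
    """Order by rank position, then newest result first."""
    _, rank_text, result_date = row
    # Assumption: rows without a date sort after dated rows of equal rank.
    newest_first = -result_date.toordinal() if result_date else 0
    return (parse_rank(rank_text).position, newest_first)


rows = [
    ("PLCC geography", "#14/165", None),
    ("MMLU-Pro", "#14/20", date(2026, 4, 20)),
    ("LiveCodeBench", "#4/30", None),
    ("HLE", "#3/13", None),
]

ordered = sorted(rows, key=sort_key)
names = [label for label, _, _ in ordered]
print(names)
```

Because `sorted` is stable, rows that tie on both rank and date keep their input order.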
§ 02 · Strengths by area

Where Grok 4 actually performs.

Computer Code · 1 benchmark · avg rank #4.0
Natural Language Processing · 1 benchmark · avg rank #8.6
Mobile Development · 1 benchmark · avg rank #8.8
Reasoning · 4 benchmarks · avg rank #13.5
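The per-area figures appear to be a plain mean of the rank positions of every row in that area, with the benchmark count taken over distinct benchmark names. The page does not state the formula, so this is an inference; it does reproduce the displayed averages from the rows in § 01. The sketch below uses hypothetical names (`ROWS`, `area_summary`) and hand-copied data.

```python
from collections import defaultdict
from statistics import mean

# (area, benchmark, rank position) for each of the 16 rows in the table.
ROWS = [
    ("Reasoning", "HLE", 3),
    ("Natural Language Processing", "PLCC", 3),
    ("Natural Language Processing", "PLCC", 3),
    ("Computer Code", "LiveCodeBench", 4),
    ("Natural Language Processing", "PLCC", 5),
    ("Reasoning", "GPQA", 6),
    ("Natural Language Processing", "PLCC", 7),
    ("Mobile Development", "React Native Evals", 8),
    ("Mobile Development", "React Native Evals", 9),
    ("Mobile Development", "React Native Evals", 9),
    ("Mobile Development", "React Native Evals", 9),
    ("Natural Language Processing", "PLCC", 10),
    ("Reasoning", "MMLU-Pro", 14),
    ("Natural Language Processing", "PLCC", 14),
    ("Natural Language Processing", "PLCC", 18),
    ("Reasoning", "MMLU", 31),
]


def area_summary(rows):
    """Return {area: (distinct benchmark count, mean rank position)}."""
    ranks = defaultdict(list)
    benchmarks = defaultdict(set)
    for area, benchmark, rank in rows:
        ranks[area].append(rank)
        benchmarks[area].add(benchmark)
    return {area: (len(benchmarks[area]), mean(ranks[area])) for area in ranks}


summary = area_summary(ROWS)

# Print areas from best (lowest) average rank to worst.
for area, (count, avg) in sorted(summary.items(), key=lambda kv: kv[1][1]):
    noun = "benchmark" if count == 1 else "benchmarks"
    print(f"{area} · {count} {noun} · avg rank #{avg:.1f}")
```

An unweighted mean treats a rank of #3 out of 13 the same as #3 out of 165, so areas scored against small fields are not directly comparable to areas scored against large ones.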
§ 04 · Related models

Other xAI models scored on Codesota.

Grok 2 · 4 results
Grok 3 · 1 result
Grok Code Fast 1 · 1 result
Grok-2-1212 · 0 results
Grok-3-Beta · 0 results
Grok-3-Mini-Beta · 0 results
Grok-4-Fast · 0 results
Grok-4.1-Fast · 0 results
§ 05 · Sources & freshness

Where these numbers come from.

sdadas/PLCC · 7 results
Callstack Incubator · 4 results
xai-grok-4-announcement · 2 results
editorial · 1 result
pricepertoken · 1 result
artificial-analysis · 1 result
11 of 16 rows marked verified.