Phi-4 · Microsoft · 17 results · 3 benchmarks
Model card

Phi-4.

Microsoft · open-weights · 14B params · transformer

Microsoft Phi-4, December 2024

§ 01 · Benchmarks

Every benchmark Phi-4 has a recorded score for.

| # | Benchmark | Area · Task | Metric | Value | Rank |
|---|---|---|---|---|---|
| 01 | Polish MT-Bench | Natural Language Processing · Polish Conversation Quality | reasoning | 9.6% | #1/50 |
| 02 | Polish MT-Bench | Natural Language Processing · Polish Conversation Quality | stem | 10.0% | #1/50 |
| 03 | Polish MT-Bench | Natural Language Processing · Polish Conversation Quality | pl-score | 9.1% | #3/50 |
| 04 | Polish MT-Bench | Natural Language Processing · Polish Conversation Quality | coding | 7.6% | #7/50 |
| 05 | Polish MT-Bench | Natural Language Processing · Polish Conversation Quality | math | 7.7% | #7/50 |
| 06 | Polish MT-Bench | Natural Language Processing · Polish Conversation Quality | humanities | 9.9% | #7/50 |
| 07 | Polish MT-Bench | Natural Language Processing · Polish Conversation Quality | roleplay | 9.2% | #8/50 |
| 08 | Polish MT-Bench | Natural Language Processing · Polish Conversation Quality | writing | 9.3% | #10/50 |
| 09 | Polish MT-Bench | Natural Language Processing · Polish Conversation Quality | extraction | 9.3% | #14/50 |
| 10 | HumanEval | Computer Code · Code Generation | pass@1 | 82.6% | #32/42 |
| 11 | PLCC | Natural Language Processing · Polish Cultural Competency | art-and-entertainment | 23.0% | #144/165 |
| 12 | PLCC | Natural Language Processing · Polish Cultural Competency | geography | 35.0% | #147/165 |
| 13 | PLCC | Natural Language Processing · Polish Cultural Competency | grammar | 34.0% | #151/165 |
| 14 | PLCC | Natural Language Processing · Polish Cultural Competency | vocabulary | 26.0% | #152/165 |
| 15 | PLCC | Natural Language Processing · Polish Cultural Competency | average | 29.2% | #152/165 |
| 16 | PLCC | Natural Language Processing · Polish Cultural Competency | history | 40.0% | #152/165 |
| 17 | PLCC | Natural Language Processing · Polish Cultural Competency | culture-and-tradition | 17.0% | #155/165 |
The Rank column shows this model's position against all other models scored on the same benchmark and metric, with the number of competitors after the slash. A rank of #1 means current SOTA. Rows are sorted by rank, then by newest result.
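
The rank notation can be unpacked mechanically. The following is a minimal sketch, assuming every rank string has the form `#position/competitors`; the `Rank` type and the `parse_rank` helper are hypothetical names introduced for illustration and are not part of Codesota.

```python
import re
from typing import NamedTuple


class Rank(NamedTuple):
    position: int     # this model's position on the benchmark + metric
    competitors: int  # the number after the slash
    is_sota: bool     # True when the position is 1


def parse_rank(text: str) -> Rank:
    """Parse a rank string such as '#3/50' (hypothetical helper)."""
    match = re.fullmatch(r"#(\d+)/(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized rank string: {text!r}")
    position, competitors = int(match.group(1)), int(match.group(2))
    return Rank(position, competitors, position == 1)


# Three rows taken from the benchmark table: (benchmark, metric, rank string).
rows = [
    ("HumanEval", "pass@1", "#32/42"),
    ("Polish MT-Bench", "reasoning", "#1/50"),
    ("PLCC", "average", "#152/165"),
]

# Sort by rank position, as the table does. No dates are recorded for these
# rows, so the "newest result" tie-break is left out of this sketch.
sorted_rows = sorted(rows, key=lambda row: parse_rank(row[2]).position)
```

Here `sorted_rows` starts with the Polish MT-Bench reasoning row, the only one of the three where `is_sota` is true.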
§ 02 · Strengths by area

Where Phi-4 actually performs.

Computer Code · 1 benchmark · avg rank #32.0
Natural Language Processing · 2 benchmarks · avg rank #69.4
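
The per-area averages are consistent with an unweighted mean of the rank positions in the benchmark table. The sketch below reproduces them under that assumption; how Codesota actually computes the figure is not stated on this page, and the variable names are illustrative.

```python
from collections import defaultdict

NLP = "Natural Language Processing"
CODE = "Computer Code"

# (area, rank position) for each of the 17 rows in the benchmark table.
ranks = [(NLP, r) for r in (1, 1, 3, 7, 7, 7, 8, 10, 14)]            # Polish MT-Bench
ranks += [(CODE, 32)]                                                # HumanEval
ranks += [(NLP, r) for r in (144, 147, 151, 152, 152, 152, 155)]     # PLCC

# Group rank positions by area.
by_area = defaultdict(list)
for area, position in ranks:
    by_area[area].append(position)

# Assumption: the average is a plain mean over rows, rounded to one decimal.
avg_rank = {
    area: round(sum(positions) / len(positions), 1)
    for area, positions in by_area.items()
}
# avg_rank == {"Natural Language Processing": 69.4, "Computer Code": 32.0}
```

The Natural Language Processing figure combines nine top-14 ranks on Polish MT-Bench with seven ranks of 144 or lower on PLCC, so the average of #69.4 describes neither benchmark well on its own.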
§ 04 · Related models

Other Microsoft models scored on Codesota.

RAD-DINO · 2 results · 1 SOTA
NaturalSpeech 3 · ~500M params · 1 result · 1 SOTA
Swin Transformer V2 Large · 197M params · 1 result · 1 SOTA
WavLM Large (SV) · 316M params · 1 result · 1 SOTA
ResNet-50 · 25M params · 3 results
Florence-2-Large · 2 results
KOSMOS-2.5 · 2 results
ResNet-152 · 60M params · 2 results
§ 05 · Sources & freshness

Where these numbers come from.

SpeakLeash/MT-Bench-PL · 9 results
sdadas/PLCC · 7 results
arxiv-2412.08905 · 1 result
17 of 17 rows marked verified.