Codesota · Models · Llama 3 70B · Meta · 11 results · 11 benchmarks

Model card

Llama 3 70B.

Meta · open-source · LLM

The 70B-parameter instruct variant of Meta Llama 3. Released April 2024.

Hugging Face ↗

§ 02 · Benchmarks

Every benchmark Llama 3 70B has a recorded score for.

#	Benchmark	Area · Task	Metric	Value	Rank	Date	Source
01	CommonsenseQA	Reasoning · Commonsense Reasoning	accuracy	80.9%	#3/5	—	source ↗
02	MAWPS	Reasoning · Arithmetic Reasoning	accuracy	94.1%	#3/3	—	source ↗
03	SVAMP	Reasoning · Arithmetic Reasoning	accuracy	89.5%	#3/3	—	source ↗
04	WinoGrande	Reasoning · Commonsense Reasoning	accuracy	85.3%	#3/13	—	source ↗
05	CoNLL-2003	Natural Language Processing · Named Entity Recognition	f1	89.3%	#6/7	2024-07-31	source ↗
06	SNLI	Natural Language Processing · Natural Language Inference	accuracy	89.7%	#7/8	2024-07-31	source ↗
07	HellaSwag	Reasoning · Commonsense Reasoning	accuracy	88.0%	#7/17	—	source ↗
08	ARC-Challenge	Reasoning · Commonsense Reasoning	accuracy	93.0%	#10/10	—	source ↗
09	SQuAD v2.0	Natural Language Processing · Question Answering	f1	85.3%	#23/26	2024-07-31	source ↗
10	GSM8K	Reasoning · Mathematical Reasoning	accuracy	93.0%	#30/48	—	source ↗
11	HumanEval	Computer Code · Code Generation	pass@1	81.7%	#34/42	—	source ↗

The Rank column shows this model’s position against all other models scored on the same benchmark and metric (competitors after the slash). A #1 in red marks the current SOTA. Rows are sorted by rank, then by newest result.
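The rank notation and sort order described above can be sketched in Python. This is a minimal sketch under stated assumptions, not Codesota's implementation: `Row`, `parse_rank` and `sort_rows` are hypothetical names, and placing undated rows after dated ones at the same rank is inferred from the table (SNLI, dated, precedes HellaSwag, undated, at rank #7).

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Row:
    benchmark: str
    rank: str             # rank as printed in the table, e.g. "#3/5"
    day: Optional[date]   # None where the table shows a dash


def parse_rank(text: str) -> tuple[int, int]:
    """Split '#3/5' into (position, number after the slash)."""
    match = re.fullmatch(r"#(\d+)/(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised rank: {text!r}")
    return int(match.group(1)), int(match.group(2))


def sort_rows(rows: list[Row]) -> list[Row]:
    """Sort by rank position, then newest result first.

    Assumption: rows without a date sort after dated rows of equal rank.
    """
    def key(row: Row) -> tuple[int, int]:
        position, _ = parse_rank(row.rank)
        # Negated ordinal puts newer dates first; undated rows get 0 (last).
        recency = -row.day.toordinal() if row.day else 0
        return position, recency

    return sorted(rows, key=key)


# A few rows copied from the table above.
rows = [
    Row("HellaSwag", "#7/17", None),
    Row("HumanEval", "#34/42", None),
    Row("SNLI", "#7/8", date(2024, 7, 31)),
    Row("WinoGrande", "#3/13", None),
]
ordered = [row.benchmark for row in sort_rows(rows)]
print(ordered)  # ['WinoGrande', 'SNLI', 'HellaSwag', 'HumanEval']
```

Because Python's `sorted` is stable, rows that tie on both rank and date keep their input order.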

§ 03 · Strengths by area

Where Llama 3 70B actually performs.

Reasoning · 7 benchmarks · avg rank #8.4

Natural Language Processing
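The per-area average rank appears to be the plain mean of the rank positions in that area's rows. That reading is an assumption, but it reproduces the #8.4 figure for Reasoning from the benchmarks table. A short sketch, where `ROWS` and `average_rank_by_area` are hypothetical names and the data is copied from the table:

```python
from collections import defaultdict
from statistics import mean

# (area, rank position) for each of the 11 rows in the benchmarks table.
ROWS = [
    ("Reasoning", 3),                      # CommonsenseQA
    ("Reasoning", 3),                      # MAWPS
    ("Reasoning", 3),                      # SVAMP
    ("Reasoning", 3),                      # WinoGrande
    ("Natural Language Processing", 6),    # CoNLL-2003
    ("Natural Language Processing", 7),    # SNLI
    ("Reasoning", 7),                      # HellaSwag
    ("Reasoning", 10),                     # ARC-Challenge
    ("Natural Language Processing", 23),   # SQuAD v2.0
    ("Reasoning", 30),                     # GSM8K
    ("Computer Code", 34),                 # HumanEval
]


def average_rank_by_area(rows):
    """Return {area: (benchmark count, mean rank rounded to 1 decimal)}."""
    positions = defaultdict(list)
    for area, position in rows:
        positions[area].append(position)
    return {
        area: (len(values), round(mean(values), 1))
        for area, values in positions.items()
    }


summary = average_rank_by_area(ROWS)
print(summary["Reasoning"])  # (7, 8.4)
```

A mean of positions is sensitive to field size: rank #30 of 48 on GSM8K pulls the Reasoning average up far more than four #3 finishes pull it down.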

§ 04 · Papers

1 paper with results for Llama 3 70B.

2024-07-31 · Natural Language Processing · 3 results
The Llama 3 Herd of Models

§ 05 · Related models

Other Meta models scored on Codesota.

Llama 3 (405B, Instruct)

400B total / 17B active (128 experts) params · 7 results

Llama 3.1 70B

4 results

Code Llama 34B

Unknown params · 2 results

ConvNeXt V2 Huge

650M params · 2 results

DeiT-B Distilled

86M params · 2 results

Muse Spark

2 results

§ 06 · Sources & freshness

Where these numbers come from.

meta-blog · arxiv · openai-simple-evals

3 of 11 rows marked verified.