Who leads the React Native Evals benchmark?

Composer 2 currently leads React Native Evals on all four tracked metrics, with its highest score, 98.9, on Navigation Satisfaction.

What is the state-of-the-art score on React Native Evals?

The state-of-the-art result on React Native Evals is 98.9 (Navigation Satisfaction), achieved by Composer 2 as of 2026.

How many models are tracked on React Native Evals?

Codesota tracks 10 models on React Native Evals across 4 metrics.

When was the React Native Evals leaderboard last updated?

The React Native Evals leaderboard on Codesota includes results through 2026.

Codesota · Benchmark · React Native Evals

React Native Evals.

Name: React Native Evals Benchmark Results
Creator: Unknown
Published: 2026-01-01
License: https://creativecommons.org/licenses/by/4.0/

A benchmark suite that evaluates how AI coding models handle authentic React Native development tasks. It comprises 71 evals across 5 categories: animation (14), async-state management (14), lists (19), navigation (14), and React Native APIs (10). Each eval specifies explicit, judgeable requirements, and model outputs are scored on requirement satisfaction by an LLM judge. The evals cover real libraries and components: Reanimated, React Navigation, Zustand, Jotai, React Query, FlatList, FlashList, and LegendList.
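The description does not say how judged requirements turn into a score. The following is a minimal sketch, assuming that an eval's score is the share of its requirements the judge marked satisfied and that a category score is the unweighted mean over its evals. The function names, the aggregation rule, and the example verdicts are all assumptions for illustration, not the benchmark's documented method; only the category counts come from the description.

```python
from statistics import mean

# Eval counts per category, as listed in the benchmark description.
CATEGORY_COUNTS = {
    "animation": 14,
    "async-state": 14,
    "lists": 19,
    "navigation": 14,
    "react-native-apis": 10,
}


def eval_score(verdicts: list[bool]) -> float:
    """Percent of requirements judged satisfied for one eval (assumed rule)."""
    if not verdicts:
        raise ValueError("an eval needs at least one requirement")
    return 100.0 * sum(verdicts) / len(verdicts)


def category_score(evals: list[list[bool]]) -> float:
    """Unweighted mean of per-eval scores (assumed aggregation)."""
    return mean(eval_score(v) for v in evals)


# The per-category counts add up to the stated total of 71 evals.
total_evals = sum(CATEGORY_COUNTS.values())

# Hypothetical judge verdicts for two evals: 3 of 4 and 2 of 2 satisfied.
example = [[True, True, False, True], [True, True]]
print(total_evals, category_score(example))
```

Other aggregation rules, such as pooling all requirements before dividing, would give different numbers, so the published scores should not be assumed to follow this exact formula.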

§ 01 · Leaderboard

Results by metric.

Navigation Satisfaction

Navigation Satisfaction is one of the four evaluation metrics reported for React Native Evals. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Navigation Satisfaction: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Run notes	Trust	Score	Year
01	Composer 2	v0.2.0 run, 10x per model, LLM-judged. Integrated tool (no API cost)	verified	98.9	2026
02	GPT 5.3 Codex	v0.2.0 run, 10x per model, LLM-judged. Cost: $19.37/run, tokens: 488K	verified	95.6	2026
03	GPT-5.4	v0.2.0 run, 10x per model, LLM-judged. Cost: $20.44/run, tokens: 547K	verified	95.6	2026
04	Gemini-3.1-Pro	v0.2.0 run, 10x per model, LLM-judged. Cost: $32.5/run, tokens: 668K	verified	94.4	2026
05	Claude Sonnet 4.6	v0.2.0 run, 10x per model, LLM-judged. Cost: $22.41/run, tokens: 531K	verified	93.3	2026
06	Claude Opus 4.6	v0.2.0 run, 10x per model, LLM-judged. Cost: $38.8/run, tokens: 532K	verified	93.3	2026
07	Kimi K2.5	v0.2.0 run, 10x per model, LLM-judged. Cost: $12.2/run, tokens: 1.68M	verified	93.3	2026
08	GLM-5	v0.2.0 run, 10x per model, LLM-judged. Cost: $10.1/run, tokens: 812K	verified	86.7	2026
09	Grok 4	v0.2.0 run, 10x per model, LLM-judged. Cost: $63.05/run, tokens: 838K	verified	84.4	2026
10	DeepSeek-V3.2	v0.2.0 run, 10x per model, LLM-judged. Cost: $13.5/run, tokens: 5.13M	verified	75.7	2026
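
Several rows in the Navigation Satisfaction table share a score (95.6 twice, 93.3 three times) yet receive consecutive ranks. As an alternative way to read the table, here is a small sketch of standard competition ranking, in which tied scores share the lowest applicable rank. The helper name is hypothetical, and this is not how Codesota orders its rows.

```python
def competition_ranks(scores: list[float]) -> list[int]:
    """Rank scores in descending order; tied scores share the same rank."""
    ordered = sorted(scores, reverse=True)
    # index() returns the first position of a score, so ties get equal ranks.
    return [ordered.index(s) + 1 for s in scores]


# Navigation Satisfaction scores in the order listed in the table.
navigation = [98.9, 95.6, 95.6, 94.4, 93.3, 93.3, 93.3, 86.7, 84.4, 75.7]
print(competition_ranks(navigation))
# prints [1, 2, 2, 4, 5, 5, 5, 8, 9, 10]
```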

Async State Satisfaction

Async State Satisfaction is one of the four evaluation metrics reported for React Native Evals. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Async State Satisfaction: verified, paper, vendor, community, unverified

Rank	Model	Run notes	Trust	Score	Year
01	Composer 2	v0.2.0 run, 10x per model, LLM-judged. Integrated tool (no API cost)	verified	98.5	2026
02	GPT-5.4	v0.2.0 run, 10x per model, LLM-judged. Cost: $20.44/run, tokens: 547K	verified	85.4	2026
03	GPT 5.3 Codex	v0.2.0 run, 10x per model, LLM-judged. Cost: $19.37/run, tokens: 488K	verified	85.3	2026
04	Claude Opus 4.6	v0.2.0 run, 10x per model, LLM-judged. Cost: $38.8/run, tokens: 532K	verified	84.6	2026
05	Claude Sonnet 4.6	v0.2.0 run, 10x per model, LLM-judged. Cost: $22.41/run, tokens: 531K	verified	80.8	2026
06	Gemini-3.1-Pro	v0.2.0 run, 10x per model, LLM-judged. Cost: $32.5/run, tokens: 668K	verified	80.8	2026
07	DeepSeek-V3.2	v0.2.0 run, 10x per model, LLM-judged. Cost: $13.5/run, tokens: 5.13M	verified	77.7	2026
08	Kimi K2.5	v0.2.0 run, 10x per model, LLM-judged. Cost: $12.2/run, tokens: 1.68M	verified	77.7	2026
09	GLM-5	v0.2.0 run, 10x per model, LLM-judged. Cost: $10.1/run, tokens: 812K	verified	73.8	2026
10	Grok 4	v0.2.0 run, 10x per model, LLM-judged. Cost: $63.05/run, tokens: 838K	verified	73.8	2026

Requirement Satisfaction

Requirement Satisfaction is one of the four evaluation metrics reported for React Native Evals. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Requirement Satisfaction: verified, paper, vendor, community, unverified

Rank	Model	Run notes	Trust	Score	Year
01	Composer 2	v0.2.0 run, 10x per model, LLM-judged. Integrated tool (no API cost)	verified	96.2	2026
02	Claude Opus 4.6	v0.2.0 run, 10x per model, LLM-judged. Cost: $38.8/run, tokens: 532K	verified	84.36	2026
03	GPT-5.4	v0.2.0 run, 10x per model, LLM-judged. Cost: $20.44/run, tokens: 547K	verified	82.64	2026
04	GPT 5.3 Codex	v0.2.0 run, 10x per model, LLM-judged. Cost: $19.37/run, tokens: 488K	verified	80.88	2026
05	Gemini-3.1-Pro	v0.2.0 run, 10x per model, LLM-judged. Cost: $32.5/run, tokens: 668K	verified	78.9	2026
06	Claude Sonnet 4.6	v0.2.0 run, 10x per model, LLM-judged. Cost: $22.41/run, tokens: 531K	verified	77.91	2026
07	Kimi K2.5	v0.2.0 run, 10x per model, LLM-judged. Cost: $12.2/run, tokens: 1.68M	verified	74.91	2026
08	GLM-5	v0.2.0 run, 10x per model, LLM-judged. Cost: $10.1/run, tokens: 812K	verified	74.23	2026
09	Grok 4	v0.2.0 run, 10x per model, LLM-judged. Cost: $63.05/run, tokens: 838K	verified	70.06	2026
10	DeepSeek-V3.2	v0.2.0 run, 10x per model, LLM-judged. Cost: $13.5/run, tokens: 5.13M	verified	68.98	2026
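
Because each row also lists a cost per run, the Requirement Satisfaction table can be read for cost efficiency as well as raw score. The rough sketch below uses the scores and per-run costs from the table above. "Dollars per point" is an illustrative measure introduced here, not one Codesota reports, and Composer 2 is left out because the table gives no API cost for it.

```python
# (model, Requirement Satisfaction score, cost in USD per run), copied from the table.
# Composer 2 is omitted: it is listed as an integrated tool with no API cost.
ROWS = [
    ("Claude Opus 4.6", 84.36, 38.80),
    ("GPT-5.4", 82.64, 20.44),
    ("GPT 5.3 Codex", 80.88, 19.37),
    ("Gemini-3.1-Pro", 78.90, 32.50),
    ("Claude Sonnet 4.6", 77.91, 22.41),
    ("Kimi K2.5", 74.91, 12.20),
    ("GLM-5", 74.23, 10.10),
    ("Grok 4", 70.06, 63.05),
    ("DeepSeek-V3.2", 68.98, 13.50),
]


def dollars_per_point(score: float, cost: float) -> float:
    """Cost of one run divided by the score it achieved (illustrative measure)."""
    return cost / score


# Sort from cheapest to most expensive per score point.
ranked = sorted(
    ((model, dollars_per_point(score, cost)) for model, score, cost in ROWS),
    key=lambda row: row[1],
)

for model, dpp in ranked:
    print(f"{model:18s} ${dpp:.3f} per point")
```

On these figures GLM-5 comes out cheapest per point and Grok 4 most expensive. The measure ignores token counts and any fixed costs, so it is only a first-pass comparison.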

Animation Satisfaction

Animation Satisfaction is one of the four evaluation metrics reported for React Native Evals. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Animation Satisfaction: verified, paper, vendor, community, unverified

Rank	Model	Run notes	Trust	Score	Year
01	Composer 2	v0.2.0 run, 10x per model, LLM-judged. Integrated tool (no API cost)	verified	94.3	2026
02	Claude Opus 4.6	v0.2.0 run, 10x per model, LLM-judged. Cost: $38.8/run, tokens: 532K	verified	77.4	2026
03	GPT-5.4	v0.2.0 run, 10x per model, LLM-judged. Cost: $20.44/run, tokens: 547K	verified	68.9	2026
04	GLM-5	v0.2.0 run, 10x per model, LLM-judged. Cost: $10.1/run, tokens: 812K	verified	66	2026
05	Claude Sonnet 4.6	v0.2.0 run, 10x per model, LLM-judged. Cost: $22.41/run, tokens: 531K	verified	65.1	2026
06	Gemini-3.1-Pro	v0.2.0 run, 10x per model, LLM-judged. Cost: $32.5/run, tokens: 668K	verified	64.2	2026
07	GPT 5.3 Codex	v0.2.0 run, 10x per model, LLM-judged. Cost: $19.37/run, tokens: 488K	verified	63.2	2026
08	Grok 4	v0.2.0 run, 10x per model, LLM-judged. Cost: $63.05/run, tokens: 838K	verified	59.4	2026
09	Kimi K2.5	v0.2.0 run, 10x per model, LLM-judged. Cost: $12.2/run, tokens: 1.68M	verified	59.4	2026
10	DeepSeek-V3.2	v0.2.0 run, 10x per model, LLM-judged. Cost: $13.5/run, tokens: 5.13M	verified	56.4	2026
