React Native Code Generation (2025)

Callstack Incubator React Native Evaluation Suite

A benchmark suite evaluating how AI coding models handle authentic React Native development tasks. 71 evals across 5 categories: animation (14), async-state management (14), lists (19), navigation (14), and React Native APIs (10). Each eval specifies explicit, judgeable requirements. Model outputs are scored on requirement satisfaction using LLM-based judging. Covers real libraries: Reanimated, React Navigation, Zustand, Jotai, React Query, FlatList, FlashList, LegendList.

Samples: 71
Metrics: requirement-satisfaction, pass-rate
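The two metrics can be sketched as simple aggregations over judged evals. The shape below is an illustrative assumption (per-eval requirement counts marked satisfied by an LLM judge, averaged across evals); the type names, rounding, and pass criterion are hypothetical, not the suite's actual implementation.

```typescript
// Hypothetical shape of one judged eval: how many of its explicit
// requirements the LLM judge marked as satisfied.
type JudgedEval = { satisfied: number; total: number };

// requirement-satisfaction: per-eval satisfaction percentage,
// averaged across all evals, rounded to one decimal place.
function requirementSatisfaction(evals: JudgedEval[]): number {
  const perEval = evals.map((e) => (e.satisfied / e.total) * 100);
  const mean = perEval.reduce((a, b) => a + b, 0) / perEval.length;
  return Math.round(mean * 10) / 10;
}

// pass-rate: percentage of evals in which every requirement was
// satisfied (an all-or-nothing pass), rounded to one decimal place.
function passRate(evals: JudgedEval[]): number {
  const passed = evals.filter((e) => e.satisfied === e.total).length;
  return Math.round((passed / evals.length) * 1000) / 10;
}
```

For example, three evals judged at 4/4, 3/4, and 2/4 requirements would yield a requirement-satisfaction of 75.0 but a pass-rate of 33.3, which is why the two metrics can diverge.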
Current State of the Art

Composer 2 (Anysphere): 96.2 requirement-satisfaction

Top Models Performance Comparison

Top 10 models ranked by requirement-satisfaction

#    Model                    Score   % of best
1    Composer 2               96.2    100.0%
2    Claude Opus 4.6          84.4    87.7%
3    GPT 5.4                  82.6    85.9%
4    GPT 5.3 Codex            80.9    84.1%
5    Gemini 3.1 Pro Preview   78.9    82.0%
6    Claude Sonnet 4.6        77.9    81.0%
7    Kimi K2.5                74.9    77.9%
8    GLM 5                    74.2    77.2%
9    Grok 4                   70.1    72.8%
10   DeepSeek V3.2            69.0    71.7%
Best Score: 96.2
Top Model: Composer 2
Models Compared: 10
Score Range: 27.2

animation-satisfaction

#    Model                          Org        Score   Date
1    Composer 2                     Anysphere  94.3    Mar 2026
2    Claude Opus 4.6 (API)          Anthropic  77.4    Mar 2026
3    GPT 5.4 (API)                  OpenAI     68.9    Mar 2026
4    GLM 5 (API)                    Zhipu AI   66.0    Mar 2026
5    Claude Sonnet 4.6 (API)        Anthropic  65.1    Mar 2026
6    Gemini 3.1 Pro Preview (API)   Google     64.2    Mar 2026
7    GPT 5.3 Codex (API)            OpenAI     63.2    Mar 2026
8    Grok 4 (API)                   xAI        59.4    Mar 2026
9    Kimi K2.5 (API)                Moonshot   59.4    Mar 2026
10   DeepSeek V3.2 (API)            DeepSeek   56.4    Mar 2026

async-state-satisfaction

#    Model                          Org        Score   Date
1    Composer 2                     Anysphere  98.5    Mar 2026
2    GPT 5.4 (API)                  OpenAI     85.4    Mar 2026
3    GPT 5.3 Codex (API)            OpenAI     85.3    Mar 2026
4    Claude Opus 4.6 (API)          Anthropic  84.6    Mar 2026
5    Gemini 3.1 Pro Preview (API)   Google     80.8    Mar 2026
6    Claude Sonnet 4.6 (API)        Anthropic  80.8    Mar 2026
7    DeepSeek V3.2 (API)            DeepSeek   77.7    Mar 2026
8    Kimi K2.5 (API)                Moonshot   77.7    Mar 2026
9    Grok 4 (API)                   xAI        73.8    Mar 2026
10   GLM 5 (API)                    Zhipu AI   73.8    Mar 2026

navigation-satisfaction

#    Model                          Org        Score   Date
1    Composer 2                     Anysphere  98.9    Mar 2026
2    GPT 5.4 (API)                  OpenAI     95.6    Mar 2026
3    GPT 5.3 Codex (API)            OpenAI     95.6    Mar 2026
4    Gemini 3.1 Pro Preview (API)   Google     94.4    Mar 2026
5    Claude Opus 4.6 (API)          Anthropic  93.3    Mar 2026
6    Claude Sonnet 4.6 (API)        Anthropic  93.3    Mar 2026
7    Kimi K2.5 (API)                Moonshot   93.3    Mar 2026
8    GLM 5 (API)                    Zhipu AI   86.7    Mar 2026
9    Grok 4 (API)                   xAI        84.4    Mar 2026
10   DeepSeek V3.2 (API)            DeepSeek   75.7    Mar 2026

requirement-satisfaction (primary)

#    Model                          Org        Score   Date
1    Composer 2                     Anysphere  96.2    Mar 2026
2    Claude Opus 4.6 (API)          Anthropic  84.36   Mar 2026
3    GPT 5.4 (API)                  OpenAI     82.64   Mar 2026
4    GPT 5.3 Codex (API)            OpenAI     80.88   Mar 2026
5    Gemini 3.1 Pro Preview (API)   Google     78.9    Mar 2026
6    Claude Sonnet 4.6 (API)        Anthropic  77.91   Mar 2026
7    Kimi K2.5 (API)                Moonshot   74.91   Mar 2026
8    GLM 5 (API)                    Zhipu AI   74.23   Mar 2026
9    Grok 4 (API)                   xAI        70.06   Mar 2026
10   DeepSeek V3.2 (API)            DeepSeek   68.98   Mar 2026