React Native Code Generation
Evaluating AI models on generating correct, production-quality React Native implementations. Covers animation, navigation, state management, lists, and platform APIs using real-world libraries (Reanimated, React Navigation, Zustand, FlashList).
Most AI coding benchmarks test generic Python snippets. React Native Evals tests what actually matters in mobile development: can the model wire up Reanimated gestures, configure React Navigation stacks, manage async state with Zustand, and render performant lists with FlashList? 71 tasks, each with explicit requirements judged against file-level evidence. No multiple choice — the model writes real code.
React Native Evals
67 evals across 6 categories. 10 models. Requirement-based scoring with LLM judging.
The first rigorous benchmark for AI-generated React Native code.
96.2%
Best score
Composer 2
37.9pt
Animation spread
Hardest category
10x
Runs per model
Statistical rigor
Leaderboard
Requirement satisfaction across 39 evals, 10 runs per model
Composer 2
Anysphere
Opus 4.6
Anthropic
GPT 5.4
OpenAI
Codex 5.3
OpenAI
Gemini 3.1
Google
Sonnet 4.6
Anthropic
Kimi K2.5
Moonshot
GLM 5
Zhipu AI
Grok 4
xAI
DeepSeek
DeepSeek
Source: callstackincubator/evals v0.2.0 · March 24, 2026 · LLM-judged requirement satisfaction
rn-evals.vercel.app →
31.9pt
Animation gap
Best vs worst on animation evals
4.5pt
Navigation ceiling
Top 5 models within 4.5pts
$63/run
Grok 4 cost
Worst cost-to-performance ratio
5.13M
DeepSeek tokens
10x more tokens, 16pts less
By Category
Navigation is nearly solved. Animation is the true differentiator.
Navigation
6pt spread · 13 evals
Animation
31pt spread · 13 evals
Async State
18pt spread · 13 evals
Anatomy of an Eval
Each eval ships a prompt, scaffold code, structured requirements, and a gold-standard reference.
prompt.md
Task description
app/App.tsx
Scaffold code
AI Model
Generates solution
requirements.yaml
Judging criteria
LLM Judge
Scores per-requirement
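The requirements file is what the LLM judge scores against. The benchmark's actual schema isn't shown on this page; a plausible sketch, with hypothetical field names, might look like:

```yaml
# Hypothetical sketch of a requirements.yaml entry.
# Field names are illustrative, not the benchmark's actual schema.
eval: animation/collapsing-header
requirements:
  - id: scroll-handler
    description: Uses useAnimatedScrollHandler for scroll tracking
  - id: height-interpolation
    description: Header height interpolates 200 -> 60px with clamp
  - id: title-font-size
    description: Title fontSize interpolates 24 -> 16 with scroll
  - id: subtitle-fade
    description: Subtitle opacity fades to 0 below the 120px threshold
  - id: scroll-container
    description: Uses Animated.ScrollView (not FlatList)
```

Each requirement is judged independently against the generated files, which is why partial credit per eval is possible.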
Create a scroll-linked collapsing header using react-native-reanimated. The header should:
- Start at 200px height
- Collapse to 60px as user scrolls
- Interpolate title font size (24→16)
- Fade out subtitle below 120px
- Use useAnimatedScrollHandler
- Apply extrapolation clamping
Uses useAnimatedScrollHandler for scroll tracking
Header height interpolates 200 → 60px with clamp
Title fontSize interpolates 24 → 16 with scroll
Subtitle opacity fades to 0 below 120px threshold
Uses Animated.ScrollView (not FlatList)
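The clamped interpolation these requirements describe can be sketched as a pure function. The 140px scroll distance below is an assumption (derived from the 200→60 collapse); the prompt doesn't pin the input range, and a real solution would use Reanimated's interpolate with Extrapolation.CLAMP on the UI thread.

```typescript
// Mirrors the behavior of Reanimated's interpolate() with clamping:
// map `value` from an input range to an output range, never extrapolating.
function clampedInterpolate(
  value: number,
  [inMin, inMax]: [number, number],
  [outMin, outMax]: [number, number],
): number {
  const t = (value - inMin) / (inMax - inMin);
  const clamped = Math.min(1, Math.max(0, t));
  return outMin + clamped * (outMax - outMin);
}

// Header collapses 200px -> 60px over the first 140px of scroll (assumed range).
const headerHeight = (scrollY: number) =>
  clampedInterpolate(scrollY, [0, 140], [200, 60]);

// Title font size shrinks 24 -> 16 over the same range.
const titleFontSize = (scrollY: number) =>
  clampedInterpolate(scrollY, [0, 140], [24, 16]);

console.log(headerHeight(0)); // 200
console.log(headerHeight(70)); // 130
console.log(headerHeight(500)); // 60 (clamped, not extrapolated)
```

Clamping is what the last requirement targets: without it, over-scroll would drive the header height below 60px or the font size below 16.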
Eval Explorer
Per-eval pass rates across models. Green is high, red is low.
Composer 2
Opus 4.6
GPT 5.4
Sonnet 4.6
Gemini 3.1
DeepSeek
Where Models Break
The three hardest evals by average pass rate. These require deep framework understanding.
Pinch + pan simultaneous photo
50%
avg pass rate
Long-press then pan gate
57%
avg pass rate
Scroll-linked collapsing header
57%
avg pass rate
Try the Evals
Interactive web versions of actual eval tasks. These are the patterns AI models are tested on.
Press and hold the button
animation/01 — Pressable scale with timing
Example Evals
// Task: Create a toggle that animates using spring physics
// Requirements:
// - Must use useSharedValue (not Animated.Value)
// - Must use withSpring with damping/stiffness config
// - Must animate both scale and backgroundColor
// - Must handle onPress toggling state
import { useSharedValue, useAnimatedStyle, withSpring, withTiming, interpolate, interpolateColor } from 'react-native-reanimated';
// ❌ What most models generate (timing-based):
const opacity = useSharedValue(0);
const style = useAnimatedStyle(() => ({
  opacity: withTiming(opacity.value, { duration: 300 }),
}));

// ✅ What the eval requires (spring physics):
const progress = useSharedValue(0);
const animatedStyle = useAnimatedStyle(() => ({
  transform: [{ scale: interpolate(progress.value, [0, 1], [1, 0.95]) }],
  backgroundColor: interpolateColor(
    progress.value,
    [0, 1],
    ['#3b82f6', '#22c55e']
  ),
}));

const toggle = () => {
  progress.value = withSpring(
    progress.value === 0 ? 1 : 0,
    { damping: 15, stiffness: 150 }
  );
};

The eval distinguishes between timing-based animation (common model output) and spring physics (what the requirement demands). Models that default to withTiming fail this eval even if the result looks visually similar.
// Task: Build a chat interface with FlashList
// Requirements:
// - MUST use FlashList, not FlatList
// - Must set inverted={true} for bottom-to-top message flow
// - Must provide estimatedItemSize (required by FlashList)
// - Must implement getItemType for sent vs received messages
// - Must scroll to end on new message
import { useRef } from 'react';
import { FlashList } from '@shopify/flash-list';

const listRef = useRef(null);

<FlashList
  data={messages}
  inverted
  estimatedItemSize={72}
  renderItem={({ item }) => <MessageBubble message={item} />}
  getItemType={(item) => (item.sender === 'me' ? 'sent' : 'received')}
  ref={listRef}
  onContentSizeChange={() => listRef.current?.scrollToEnd()}
/>

Models frequently substitute FlatList when FlashList is specified. FlashList requires estimatedItemSize (FlatList doesn't) and has different recycling behavior. The eval catches both the wrong component and missing required props.
// Task: Prevent navigation when form has unsaved changes
// Requirements:
// - Must use beforeRemove event listener
// - Must track dirty state from form inputs
// - Must show Alert with Discard/Cancel options
// - Must remove listener when changes are saved
import { useEffect, useState } from 'react';
import { useNavigation } from '@react-navigation/native';
import { Alert } from 'react-native';

const navigation = useNavigation();
const [isDirty, setIsDirty] = useState(false);

useEffect(() => {
  if (!isDirty) return;

  const unsubscribe = navigation.addListener('beforeRemove', (e) => {
    e.preventDefault();
    Alert.alert(
      'Unsaved Changes',
      'You have unsaved changes. Discard them?',
      [
        { text: 'Cancel', style: 'cancel' },
        {
          text: 'Discard',
          style: 'destructive',
          onPress: () => navigation.dispatch(e.data.action),
        },
      ]
    );
  });

  return unsubscribe;
}, [isDirty, navigation]);

Tests understanding of React Navigation lifecycle events. Models often use navigation.goBack() prevention or custom back handlers instead of the correct beforeRemove event pattern.
Current Landscape
Before React Native Evals, there was no standardized way to measure how well AI models handle mobile development. HumanEval and SWE-bench test general coding. MBPP tests Python snippets. None touch the patterns mobile developers deal with daily: gesture handlers, navigation stacks, native module bridging, platform-specific rendering. This benchmark fills that gap with tasks extracted from production codebases.
Key Challenges
Animation requires understanding Reanimated's worklet threading model — running animations on the UI thread, not the JS thread. Most models mix paradigms.
React Navigation has grown complex: stack, tab, drawer, modal, and auth flow patterns each have distinct configuration. Models often merge incompatible patterns.
Async state management spans React Query, Zustand, and Jotai — each with different patterns for cache invalidation, optimistic updates, and hydration.
List performance is make-or-break. FlatList, FlashList, and LegendList have different APIs and recycling behaviors. Models frequently use the wrong component.
Platform-specific code (Platform.select, .ios.tsx/.android.tsx) is poorly represented in training data. Models write platform-agnostic code even when the eval requires branching.
Real RN projects use Expo and bare RN differently. The evals cover both, testing whether models adapt to project context.
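The optimistic-update pattern the async-state challenge refers to can be shown without any library. None of the names below come from the benchmark; this is a dependency-free sketch (a real solution would typically use a Zustand store or a React Query mutation with onMutate/onError rollback):

```typescript
// Dependency-free sketch of an optimistic update with rollback.
// ChatStore, send(), and the Message shape are hypothetical names.
type Message = { id: string; text: string; status: 'pending' | 'sent' | 'failed' };

class ChatStore {
  messages: Message[] = [];

  // Optimistically append a pending message, then reconcile with the server.
  async send(text: string, api: (text: string) => Promise<string>) {
    const tempId = `temp-${Date.now()}`;
    this.messages.push({ id: tempId, text, status: 'pending' });
    try {
      const serverId = await api(text);
      // Commit: swap the temporary record for the confirmed one.
      this.messages = this.messages.map((m) =>
        m.id === tempId ? { id: serverId, text, status: 'sent' } : m,
      );
    } catch {
      // Rollback: mark the message failed rather than silently dropping it.
      this.messages = this.messages.map((m) =>
        m.id === tempId ? { ...m, status: 'failed' } : m,
      );
    }
  }
}
```

The evals judge exactly these two branches: the UI must reflect the pending state immediately, and a failed mutation must roll back (or surface) rather than leave phantom state behind.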
What's Next
Three categories are in development: Expo SDK integration (camera, notifications, auth), brownfield embedding (RN inside existing native apps), and Nitro modules (native-to-JS bridge). The benchmark is open source and accepting contributions. As models improve on the existing 71 tasks, harder evals covering multi-screen flows and end-to-end feature implementation are planned.
Benchmarks & SOTA
Something wrong or missing?
Help keep React Native Code Generation benchmarks accurate. Report outdated results, missing benchmarks, or errors.