Mobile Development

React Native Code Generation

Evaluating AI models on generating correct, production-quality React Native implementations. Covers animation, navigation, state management, lists, and platform APIs using real-world libraries (Reanimated, React Navigation, Zustand, FlashList).


Most AI coding benchmarks test generic Python snippets. React Native Evals tests what actually matters in mobile development: can the model wire up Reanimated gestures, configure React Navigation stacks, manage async state with Zustand, and render performant lists with FlashList? 71 tasks, each with explicit requirements judged against file-level evidence. No multiple choice — the model writes real code.

Callstack Incubator · v0.2.0 · March 2026

React Native Evals

67 evals across 6 categories. 10 models. Requirement-based scoring with LLM judging. The first rigorous benchmark for AI-generated React Native code.

96.2% — Best score (Composer 2)
37.9pt — Animation spread (hardest category)
10x — Runs per model (statistical rigor)

Leaderboard

Requirement satisfaction across 39 evals, 10 runs per model

| Rank | Model | Organization | Score |
|------|-------------|--------------|-------|
| 1 | Composer 2 | Anysphere | 96.2% |
| 2 | Opus 4.6 | Anthropic | 84.4% |
| 3 | GPT 5.4 | OpenAI | 82.6% |
| 4 | Codex 5.3 | OpenAI | 80.9% |
| 5 | Gemini 3.1 | Google | 78.9% |
| 6 | Sonnet 4.6 | Anthropic | 77.9% |
| 7 | Kimi K2.5 | Moonshot | 74.9% |
| 8 | GLM 5 | Zhipu AI | 74.2% |
| 9 | Grok 4 | xAI | 70.1% |
| 10 | DeepSeek | DeepSeek | 69.0% |

Source: callstackincubator/evals v0.2.0 · March 24, 2026 · LLM-judged requirement satisfaction

rn-evals.vercel.app →

31.9pt — Animation gap: best vs. worst on animation evals
4.5pt — Navigation ceiling: top 5 models within 4.5pts
$63/run — Grok 4 cost: worst cost-to-performance ratio
5.13M — DeepSeek tokens: 10x more tokens, 16pts less

By Category

Navigation is nearly solved. Animation is the true differentiator.

Navigation — 6pt spread · 13 evals

| Rank | Model | Score |
|------|------------|------|
| 1 | Composer 2 | 98.9 |
| 2 | GPT 5.4 | 95.6 |
| 3 | Codex 5.3 | 95.6 |
| 4 | Gemini 3.1 | 94.4 |
| 5 | Opus 4.6 | 93.3 |
| 6 | Sonnet 4.6 | 93.3 |

Animation — 31pt spread · 13 evals

| Rank | Model | Score |
|------|------------|------|
| 1 | Composer 2 | 94.3 |
| 2 | Opus 4.6 | 77.4 |
| 3 | GPT 5.4 | 68.9 |
| 4 | Sonnet 4.6 | 65.1 |
| 5 | Gemini 3.1 | 64.2 |
| 6 | Codex 5.3 | 63.2 |

Async State — 18pt spread · 13 evals

| Rank | Model | Score |
|------|------------|------|
| 1 | Composer 2 | 98.5 |
| 2 | GPT 5.4 | 85.4 |
| 3 | Codex 5.3 | 85.3 |
| 4 | Opus 4.6 | 84.6 |
| 5 | Gemini 3.1 | 80.8 |
| 6 | Sonnet 4.6 | 80.8 |

Anatomy of an Eval

Each eval ships a prompt, scaffold code, structured requirements, and a gold-standard reference.

Pipeline: prompt.md (task description) + app/App.tsx (scaffold code) → AI model generates a solution → requirements.yaml (judging criteria) → LLM judge scores each requirement.

prompt.md — animation/04
Create a scroll-linked collapsing header
using react-native-reanimated.

The header should:
- Start at 200px height
- Collapse to 60px as user scrolls
- Interpolate title font size (24→16)
- Fade out subtitle below 120px
- Use useAnimatedScrollHandler
- Apply extrapolation clamping
requirements.yaml

- R1 — Uses useAnimatedScrollHandler for scroll tracking
- R2 — Header height interpolates 200 → 60px with clamp
- R3 — Title fontSize interpolates 24 → 16 with scroll
- R4 — Subtitle opacity fades to 0 below 120px threshold
- R5 — Uses Animated.ScrollView (not FlatList)

Claude Opus 4.6 on this eval: 4/5 requirements — 80%
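The clamped interpolation that R2 and R3 describe is just a bounded linear map. A plain-TypeScript sketch makes it concrete — the 140px scroll range is an assumption (the prompt only fixes the endpoints), and `interpolateClamped` is a stand-in for Reanimated's `interpolate` with `Extrapolation.CLAMP`, not the library's code:

```typescript
// Hypothetical stand-in for Reanimated's interpolate + Extrapolation.CLAMP.
// Linearly maps `value` from [inMin, inMax] to [outMin, outMax], clamping the
// input fraction so the output never overshoots the range.
function interpolateClamped(
  value: number,
  [inMin, inMax]: [number, number],
  [outMin, outMax]: [number, number],
): number {
  const t = Math.min(Math.max((value - inMin) / (inMax - inMin), 0), 1);
  return outMin + t * (outMax - outMin);
}

// Header height: 200px at scrollY = 0, pinned at 60px once scrollY reaches 140
// (140px of travel is an assumed value for illustration).
const headerHeight = (scrollY: number) =>
  interpolateClamped(scrollY, [0, 140], [200, 60]);

// Title font size tracks the same scroll range: 24 → 16.
const titleSize = (scrollY: number) =>
  interpolateClamped(scrollY, [0, 140], [24, 16]);
```

Without the clamp, over-scroll (bounce) would drive the header height past its bounds — which is exactly what R2's "with clamp" requirement is checking for.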

Eval Explorer

Per-eval pass rates across models; higher is better.

| Eval (difficulty) | Composer 2 | Opus 4.6 | GPT 5.4 | Sonnet 4.6 | Gemini 3.1 | DeepSeek |
|---|---|---|---|---|---|---|
| Pressable scale with timing (easy) | 100 | 95 | 88 | 85 | 80 | 65 |
| Scroll-linked collapsing header (hard) | 92 | 72 | 55 | 48 | 45 | 30 |
| Pan drag with snap points (hard) | 95 | 78 | 62 | 55 | 50 | 35 |
| Pinch + pan simultaneous photo (hard) | 88 | 65 | 48 | 42 | 38 | 20 |
| Long-press then pan gate (hard) | 90 | 70 | 55 | 50 | 48 | 28 |
| Stack product details (easy) | 100 | 100 | 100 | 100 | 100 | 90 |
| Tabs three sections (easy) | 100 | 95 | 98 | 95 | 95 | 80 |
| Deep link profile (medium) | 98 | 88 | 90 | 85 | 88 | 60 |
| Unsaved edit confirmation (medium) | 100 | 92 | 95 | 90 | 92 | 72 |
| Optimistic update rollback (hard) | 98 | 80 | 82 | 75 | 72 | 70 |
| Stale response guard (hard) | 95 | 78 | 80 | 72 | 70 | 68 |
| Persist hydration gate (hard) | 100 | 85 | 85 | 80 | 78 | 75 |

Where Models Break

The three hardest evals by average pass rate. These require deep framework understanding.

1. Pinch + pan simultaneous photo (animation · reanimated · gesture-handler) — 50% avg pass rate
   Composer 2 88% · Opus 4.6 65% · GPT 5.4 48% · Sonnet 4.6 42% · Gemini 3.1 38% · DeepSeek 20%

2. Long-press then pan gate (animation · gesture-handler) — 57% avg pass rate
   Composer 2 90% · Opus 4.6 70% · GPT 5.4 55% · Sonnet 4.6 50% · Gemini 3.1 48% · DeepSeek 28%

3. Scroll-linked collapsing header (animation · reanimated) — 57% avg pass rate
   Composer 2 92% · Opus 4.6 72% · GPT 5.4 55% · Sonnet 4.6 48% · Gemini 3.1 45% · DeepSeek 30%


Example Evals

Eval prompt: Spring-based toggle animation
animation/02 · tsx
// Task: Create a toggle that animates using spring physics
// Requirements:
// - Must use useSharedValue (not Animated.Value)
// - Must use withSpring with damping/stiffness config
// - Must animate both scale and backgroundColor
// - Must handle onPress toggling state

import {
  useSharedValue,
  useAnimatedStyle,
  withTiming,
  withSpring,
  interpolate,
  interpolateColor,
} from 'react-native-reanimated';

// ❌ What most models generate (timing-based):
const opacity = useSharedValue(0);
const style = useAnimatedStyle(() => ({
  opacity: withTiming(opacity.value, { duration: 300 }),
}));

// ✅ What the eval requires (spring physics):
const progress = useSharedValue(0);
const animatedStyle = useAnimatedStyle(() => ({
  transform: [{ scale: interpolate(progress.value, [0, 1], [1, 0.95]) }],
  backgroundColor: interpolateColor(
    progress.value,
    [0, 1],
    ['#3b82f6', '#22c55e']
  ),
}));

const toggle = () => {
  progress.value = withSpring(
    progress.value === 0 ? 1 : 0,
    { damping: 15, stiffness: 150 }
  );
};

The eval distinguishes between timing-based animation (common model output) and spring physics (what the requirement demands). Models that default to withTiming fail this eval even if the result looks visually similar.
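The damping/stiffness configuration maps onto a damped harmonic oscillator, which is why spring output cannot be faked with a fixed-duration curve. A toy integrator (plain TypeScript, not Reanimated's actual solver) makes the difference concrete:

```typescript
// Toy damped-spring integrator showing why withSpring output differs from a
// timing curve: the trajectory is driven by physics (damping, stiffness),
// not by a clock. Semi-implicit Euler, unit mass; not Reanimated's solver.
function springTrajectory(
  from: number,
  to: number,
  { damping = 15, stiffness = 150, dt = 1 / 120, steps = 600 } = {},
): number[] {
  let x = from;
  let v = 0;
  const samples: number[] = [];
  for (let i = 0; i < steps; i++) {
    const accel = -stiffness * (x - to) - damping * v; // F = -kx - cv
    v += accel * dt; // update velocity first (semi-implicit Euler)
    x += v * dt;
    samples.push(x);
  }
  return samples;
}
```

With damping 15 and stiffness 150 the spring is underdamped (damping ratio ≈ 0.61), so the value overshoots its target before settling — behavior a `withTiming` curve of any duration never produces.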

Eval prompt: FlashList chat with inverted scroll
lists/12 · tsx
// Task: Build a chat interface with FlashList
// Requirements:
// - MUST use FlashList, not FlatList
// - Must set inverted={true} for bottom-to-top message flow
// - Must provide estimatedItemSize (required by FlashList)
// - Must implement getItemType for sent vs received messages
// - Must scroll to end on new message

import { useRef } from 'react';
import { FlashList } from '@shopify/flash-list';

// Ref for imperative scrolling; Message is the app's message type.
const listRef = useRef<FlashList<Message>>(null);

<FlashList
  data={messages}
  inverted
  estimatedItemSize={72}
  renderItem={({ item }) => <MessageBubble message={item} />}
  getItemType={(item) => item.sender === 'me' ? 'sent' : 'received'}
  ref={listRef}
  onContentSizeChange={() => listRef.current?.scrollToEnd()}
/>

Models frequently substitute FlatList when FlashList is specified. FlashList requires estimatedItemSize (FlatList doesn't) and has different recycling behavior. The eval catches both the wrong component and missing required props.

Eval prompt: Unsaved changes navigation guard
navigation/13 · tsx
// Task: Prevent navigation when form has unsaved changes
// Requirements:
// - Must use beforeRemove event listener
// - Must track dirty state from form inputs
// - Must show Alert with Discard/Cancel options
// - Must remove listener when changes are saved

import { useEffect, useState } from 'react';
import { useNavigation } from '@react-navigation/native';
import { Alert } from 'react-native';

const navigation = useNavigation();
const [isDirty, setIsDirty] = useState(false);

useEffect(() => {
  if (!isDirty) return;
  
  const unsubscribe = navigation.addListener('beforeRemove', (e) => {
    e.preventDefault();
    Alert.alert(
      'Unsaved Changes',
      'You have unsaved changes. Discard them?',
      [
        { text: 'Cancel', style: 'cancel' },
        {
          text: 'Discard',
          style: 'destructive',
          onPress: () => navigation.dispatch(e.data.action),
        },
      ]
    );
  });
  
  return unsubscribe;
}, [isDirty, navigation]);

Tests understanding of React Navigation lifecycle events. Models often try to intercept navigation.goBack() or register custom hardware back-button handlers instead of using the correct beforeRemove event.
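Stripped of the React Navigation specifics, beforeRemove is an interceptable event whose listeners can veto the default action. A minimal sketch of that mechanism (hypothetical emitter, not the library's internals):

```typescript
// Minimal interceptable-event sketch: listeners may call preventDefault() to
// veto removal, mirroring the shape of React Navigation's beforeRemove.
type PreventableEvent = { preventDefault(): void; defaultPrevented: boolean };

class Screen {
  private listeners: Array<(e: PreventableEvent) => void> = [];
  removed = false;

  // Returns an unsubscribe function, like navigation.addListener does.
  addListener(fn: (e: PreventableEvent) => void): () => void {
    this.listeners.push(fn);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== fn);
    };
  }

  // The navigator fires the event before removing the screen; if any listener
  // prevented the default, the removal is aborted.
  tryRemove(): boolean {
    const e: PreventableEvent = {
      defaultPrevented: false,
      preventDefault() {
        this.defaultPrevented = true;
      },
    };
    for (const fn of this.listeners) fn(e);
    if (!e.defaultPrevented) this.removed = true;
    return this.removed;
  }
}
```

The eval's "remove listener when changes are saved" requirement corresponds to calling the returned unsubscribe function once the form is no longer dirty.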

callstackincubator/evals — open-source benchmark suite: https://github.com/callstackincubator/evals

Current Landscape

Before React Native Evals, there was no standardized way to measure how well AI models handle mobile development. HumanEval and SWE-bench test general coding. MBPP tests Python snippets. None touch the patterns mobile developers deal with daily: gesture handlers, navigation stacks, native module bridging, platform-specific rendering. This benchmark fills that gap with tasks extracted from production codebases.

Key Challenges

Animation requires understanding Reanimated's worklet threading model — running animations on the UI thread, not the JS thread. Most models mix paradigms.

React Navigation has grown complex: stack, tab, drawer, modal, and auth flow patterns each have distinct configuration. Models often merge incompatible patterns.
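Deep linking, one of those navigation patterns, reduces to mapping a URL path onto a screen plus params. A hypothetical path-pattern matcher (not React Navigation's actual linking implementation) shows the core idea:

```typescript
// Hypothetical path-pattern matcher illustrating what deep-link config does:
// map a URL path onto route params. Not React Navigation's linking code.
function matchRoute(
  pattern: string, // e.g. "profile/:id"
  path: string,    // e.g. "profile/42"
): Record<string, string> | null {
  const patParts = pattern.split('/').filter(Boolean);
  const pathParts = path.split('/').filter(Boolean);
  if (patParts.length !== pathParts.length) return null;

  const params: Record<string, string> = {};
  for (let i = 0; i < patParts.length; i++) {
    if (patParts[i].startsWith(':')) {
      // Segment starting with ':' is a named parameter.
      params[patParts[i].slice(1)] = decodeURIComponent(pathParts[i]);
    } else if (patParts[i] !== pathParts[i]) {
      return null; // literal segment mismatch
    }
  }
  return params;
}
```

Here `matchRoute('profile/:id', 'profile/42')` yields `{ id: '42' }`, which a navigator would pass to the screen as route params — the behavior the "Deep link profile" eval checks for.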

Async state management spans React Query, Zustand, and Jotai — each with different patterns for cache invalidation, optimistic updates, and hydration.
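The "stale response guard" eval in this category targets a classic race: two fetches in flight, the older one resolving last. A plain-TypeScript sketch of the guard (the store shape and fetcher are hypothetical, not any particular library's API):

```typescript
// Stale-response guard: only the latest in-flight request may write to state.
// Each run() is tagged with a monotonically increasing id; a response is
// applied only if its id is still the latest when it resolves.
function createLatestOnly<T>() {
  let latest = 0;
  let value: T | undefined;
  return {
    get: () => value,
    async run(fetcher: () => Promise<T>): Promise<void> {
      const id = ++latest;               // tag this request
      const result = await fetcher();
      if (id === latest) value = result; // drop out-of-order responses
    },
  };
}
```

Libraries like React Query handle this internally via query keys; hand-rolled Zustand stores have to implement the guard themselves, which is what the eval probes.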

List performance is make-or-break. FlatList, FlashList, and LegendList have different APIs and recycling behaviors. Models frequently use the wrong component.

Platform-specific code (Platform.select, .ios.tsx/.android.tsx) is poorly represented in training data. Models write platform-agnostic code even when the eval requires branching.
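The resolution order behind Platform.select can be sketched in plain TypeScript (simplified, not React Native's implementation): exact platform key first, then `native` for iOS/Android, then `default`:

```typescript
// Mimics the documented semantics of React Native's Platform.select:
// exact OS key → 'native' fallback (ios/android only) → 'default'.
// Simplified sketch; real Platform.OS also covers windows/macos.
type PlatformOS = 'ios' | 'android' | 'web';

function select<T>(
  os: PlatformOS,
  spec: Partial<Record<PlatformOS | 'native' | 'default', T>>,
): T | undefined {
  if (os in spec) return spec[os];
  if ((os === 'ios' || os === 'android') && 'native' in spec) return spec.native;
  return spec.default;
}
```

Evals in this category require the branching itself — e.g. different shadow styles per platform — so platform-agnostic output fails even when it renders.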

Real RN projects use Expo and bare RN differently. The evals cover both, testing whether models adapt to project context.

What's Next

Three categories are in development: Expo SDK integration (camera, notifications, auth), brownfield embedding (RN inside existing native apps), and Nitro modules (native-to-JS bridge). The benchmark is open source and accepting contributions. As models improve on the existing 71 tasks, harder evals covering multi-screen flows and end-to-end feature implementation are planned.
