Mobile Development

React Native Code Generation

Evaluating AI models on generating correct, production-quality React Native implementations. Covers animation, navigation, state management, lists, and platform APIs using real-world libraries (Reanimated, React Navigation, Zustand, FlashList).


Most AI coding benchmarks test generic Python snippets. React Native Evals tests what actually matters in mobile development: can the model wire up Reanimated gestures, configure React Navigation stacks, manage async state with Zustand, and render performant lists with FlashList? 71 tasks, each with explicit requirements judged against file-level evidence. No multiple choice — the model writes real code.

Callstack Incubator · v0.2.0 · March 2026

React Native Evals

67 evals across 6 categories. 10 models. Requirement-based scoring with LLM judging. The first rigorous benchmark for AI-generated React Native code.

96.2% — Best score (Composer 2)
37.9pt — Animation spread (hardest category)
10x — Runs per model (statistical rigor)

Leaderboard

Requirement satisfaction across 39 evals, 10 runs per model

| Rank | Model | Organization | Score |
|------|-------------|--------------|-------|
| 1 | Composer 2 | Anysphere | 96.2% |
| 2 | Opus 4.6 | Anthropic | 84.4% |
| 3 | GPT 5.4 | OpenAI | 82.6% |
| 4 | Codex 5.3 | OpenAI | 80.9% |
| 5 | Gemini 3.1 | Google | 78.9% |
| 6 | Sonnet 4.6 | Anthropic | 77.9% |
| 7 | Kimi K2.5 | Moonshot | 74.9% |
| 8 | GLM 5 | Zhipu AI | 74.2% |
| 9 | Grok 4 | xAI | 70.1% |
| 10 | DeepSeek | DeepSeek | 69.0% |

Source: callstackincubator/evals v0.2.0 · March 24, 2026 · LLM-judged requirement satisfaction

rn-evals.vercel.app →

31.9pt — Animation gap: best vs. worst on animation evals
4.5pt — Navigation ceiling: top 5 models within 4.5pts
$63/run — Grok 4 cost: worst cost-to-performance ratio
5.13M — DeepSeek tokens: 10x more tokens, 16pts less

By Category

Navigation is nearly solved. Animation is the true differentiator.

Navigation — 6pt spread · 13 evals

| Rank | Model | Score |
|------|------------|------|
| 1 | Composer 2 | 98.9 |
| 2 | GPT 5.4 | 95.6 |
| 3 | Codex 5.3 | 95.6 |
| 4 | Gemini 3.1 | 94.4 |
| 5 | Opus 4.6 | 93.3 |
| 6 | Sonnet 4.6 | 93.3 |

Animation — 31pt spread · 13 evals

| Rank | Model | Score |
|------|------------|------|
| 1 | Composer 2 | 94.3 |
| 2 | Opus 4.6 | 77.4 |
| 3 | GPT 5.4 | 68.9 |
| 4 | Sonnet 4.6 | 65.1 |
| 5 | Gemini 3.1 | 64.2 |
| 6 | Codex 5.3 | 63.2 |

Async State — 18pt spread · 13 evals

| Rank | Model | Score |
|------|------------|------|
| 1 | Composer 2 | 98.5 |
| 2 | GPT 5.4 | 85.4 |
| 3 | Codex 5.3 | 85.3 |
| 4 | Opus 4.6 | 84.6 |
| 5 | Gemini 3.1 | 80.8 |
| 6 | Sonnet 4.6 | 80.8 |

Anatomy of an Eval

Each eval ships a prompt, scaffold code, structured requirements, and a gold-standard reference.

Pipeline: prompt.md (task description) + app/App.tsx (scaffold code) → AI model generates a solution → requirements.yaml (judging criteria) → LLM judge scores each requirement.

prompt.md — animation/04
Create a scroll-linked collapsing header
using react-native-reanimated.

The header should:
- Start at 200px height
- Collapse to 60px as user scrolls
- Interpolate title font size (24→16)
- Fade out subtitle below 120px
- Use useAnimatedScrollHandler
- Apply extrapolation clamping
requirements.yaml

- R1 — Uses useAnimatedScrollHandler for scroll tracking
- R2 — Header height interpolates 200 → 60px with clamp
- R3 — Title fontSize interpolates 24 → 16 with scroll
- R4 — Subtitle opacity fades to 0 below 120px threshold
- R5 — Uses Animated.ScrollView (not FlatList)

Claude Opus 4.6 on this eval: 4/5 requirements — 80%
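The clamped interpolation that R2 and R3 describe is just a bounded linear map. A plain-TypeScript sketch makes it concrete — the 140px scroll range is an assumption (the prompt only fixes the endpoints), and `interpolateClamped` is a stand-in for Reanimated's `interpolate` with `Extrapolation.CLAMP`, not the library's code:

```typescript
// Hypothetical stand-in for Reanimated's interpolate + Extrapolation.CLAMP.
// Linearly maps `value` from [inMin, inMax] to [outMin, outMax], clamping the
// input fraction so the output never overshoots the range.
function interpolateClamped(
  value: number,
  [inMin, inMax]: [number, number],
  [outMin, outMax]: [number, number],
): number {
  const t = Math.min(Math.max((value - inMin) / (inMax - inMin), 0), 1);
  return outMin + t * (outMax - outMin);
}

// Header height: 200px at scrollY = 0, pinned at 60px once scrollY reaches 140
// (140px of travel is an assumed value for illustration).
const headerHeight = (scrollY: number) =>
  interpolateClamped(scrollY, [0, 140], [200, 60]);

// Title font size tracks the same scroll range: 24 → 16.
const titleSize = (scrollY: number) =>
  interpolateClamped(scrollY, [0, 140], [24, 16]);
```

Without the clamp, over-scroll (bounce) would drive the header height past its bounds — which is exactly what R2's "with clamp" requirement is checking for.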

Eval Explorer

Per-eval pass rates across models; higher is better.

| Eval (difficulty) | Composer 2 | Opus 4.6 | GPT 5.4 | Sonnet 4.6 | Gemini 3.1 | DeepSeek |
|---|---|---|---|---|---|---|
| Pressable scale with timing (easy) | 100 | 95 | 88 | 85 | 80 | 65 |
| Scroll-linked collapsing header (hard) | 92 | 72 | 55 | 48 | 45 | 30 |
| Pan drag with snap points (hard) | 95 | 78 | 62 | 55 | 50 | 35 |
| Pinch + pan simultaneous photo (hard) | 88 | 65 | 48 | 42 | 38 | 20 |
| Long-press then pan gate (hard) | 90 | 70 | 55 | 50 | 48 | 28 |
| Stack product details (easy) | 100 | 100 | 100 | 100 | 100 | 90 |
| Tabs three sections (easy) | 100 | 95 | 98 | 95 | 95 | 80 |
| Deep link profile (medium) | 98 | 88 | 90 | 85 | 88 | 60 |
| Unsaved edit confirmation (medium) | 100 | 92 | 95 | 90 | 92 | 72 |
| Optimistic update rollback (hard) | 98 | 80 | 82 | 75 | 72 | 70 |
| Stale response guard (hard) | 95 | 78 | 80 | 72 | 70 | 68 |
| Persist hydration gate (hard) | 100 | 85 | 85 | 80 | 78 | 75 |

Where Models Break

The three hardest evals by average pass rate. These require deep framework understanding.

1. Pinch + pan simultaneous photo (animation · reanimated · gesture-handler) — 50% avg pass rate
   Composer 2 88% · Opus 4.6 65% · GPT 5.4 48% · Sonnet 4.6 42% · Gemini 3.1 38% · DeepSeek 20%

2. Long-press then pan gate (animation · gesture-handler) — 57% avg pass rate
   Composer 2 90% · Opus 4.6 70% · GPT 5.4 55% · Sonnet 4.6 50% · Gemini 3.1 48% · DeepSeek 28%

3. Scroll-linked collapsing header (animation · reanimated) — 57% avg pass rate
   Composer 2 92% · Opus 4.6 72% · GPT 5.4 55% · Sonnet 4.6 48% · Gemini 3.1 45% · DeepSeek 30%


Example Evals

Eval prompt: Spring-based toggle animation
animation/02 · tsx
// Task: Create a toggle that animates using spring physics
// Requirements:
// - Must use useSharedValue (not Animated.Value)
// - Must use withSpring with damping/stiffness config
// - Must animate both scale and backgroundColor
// - Must handle onPress toggling state

import {
  useSharedValue,
  useAnimatedStyle,
  withTiming,
  withSpring,
  interpolate,
  interpolateColor,
} from 'react-native-reanimated';

// ❌ What most models generate (timing-based):
const opacity = useSharedValue(0);
const style = useAnimatedStyle(() => ({
  opacity: withTiming(opacity.value, { duration: 300 }),
}));

// ✅ What the eval requires (spring physics):
const progress = useSharedValue(0);
const animatedStyle = useAnimatedStyle(() => ({
  transform: [{ scale: interpolate(progress.value, [0, 1], [1, 0.95]) }],
  backgroundColor: interpolateColor(
    progress.value,
    [0, 1],
    ['#3b82f6', '#22c55e']
  ),
}));

const toggle = () => {
  progress.value = withSpring(
    progress.value === 0 ? 1 : 0,
    { damping: 15, stiffness: 150 }
  );
};

The eval distinguishes between timing-based animation (common model output) and spring physics (what the requirement demands). Models that default to withTiming fail this eval even if the result looks visually similar.
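The damping/stiffness configuration maps onto a damped harmonic oscillator, which is why spring output cannot be faked with a fixed-duration curve. A toy integrator (plain TypeScript, not Reanimated's actual solver) makes the difference concrete:

```typescript
// Toy damped-spring integrator showing why withSpring output differs from a
// timing curve: the trajectory is driven by physics (damping, stiffness),
// not by a clock. Semi-implicit Euler, unit mass; not Reanimated's solver.
function springTrajectory(
  from: number,
  to: number,
  { damping = 15, stiffness = 150, dt = 1 / 120, steps = 600 } = {},
): number[] {
  let x = from;
  let v = 0;
  const samples: number[] = [];
  for (let i = 0; i < steps; i++) {
    const accel = -stiffness * (x - to) - damping * v; // F = -kx - cv
    v += accel * dt; // update velocity first (semi-implicit Euler)
    x += v * dt;
    samples.push(x);
  }
  return samples;
}
```

With damping 15 and stiffness 150 the spring is underdamped (damping ratio ≈ 0.61), so the value overshoots its target before settling — behavior a `withTiming` curve of any duration never produces.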

Eval prompt: FlashList chat with inverted scroll
lists/12 · tsx
// Task: Build a chat interface with FlashList
// Requirements:
// - MUST use FlashList, not FlatList
// - Must set inverted={true} for bottom-to-top message flow
// - Must provide estimatedItemSize (required by FlashList)
// - Must implement getItemType for sent vs received messages
// - Must scroll to end on new message

import { useRef } from 'react';
import { FlashList } from '@shopify/flash-list';

// Ref for imperative scrolling; Message is the app's message type.
const listRef = useRef<FlashList<Message>>(null);

<FlashList
  data={messages}
  inverted
  estimatedItemSize={72}
  renderItem={({ item }) => <MessageBubble message={item} />}
  getItemType={(item) => item.sender === 'me' ? 'sent' : 'received'}
  ref={listRef}
  onContentSizeChange={() => listRef.current?.scrollToEnd()}
/>

Models frequently substitute FlatList when FlashList is specified. FlashList requires estimatedItemSize (FlatList doesn't) and has different recycling behavior. The eval catches both the wrong component and missing required props.

Eval prompt: Unsaved changes navigation guard
navigation/13 · tsx
// Task: Prevent navigation when form has unsaved changes
// Requirements:
// - Must use beforeRemove event listener
// - Must track dirty state from form inputs
// - Must show Alert with Discard/Cancel options
// - Must remove listener when changes are saved

import { useEffect, useState } from 'react';
import { useNavigation } from '@react-navigation/native';
import { Alert } from 'react-native';

const navigation = useNavigation();
const [isDirty, setIsDirty] = useState(false);

useEffect(() => {
  if (!isDirty) return;
  
  const unsubscribe = navigation.addListener('beforeRemove', (e) => {
    e.preventDefault();
    Alert.alert(
      'Unsaved Changes',
      'You have unsaved changes. Discard them?',
      [
        { text: 'Cancel', style: 'cancel' },
        {
          text: 'Discard',
          style: 'destructive',
          onPress: () => navigation.dispatch(e.data.action),
        },
      ]
    );
  });
  
  return unsubscribe;
}, [isDirty, navigation]);

Tests understanding of React Navigation lifecycle events. Models often try to intercept navigation.goBack() or register custom hardware back-button handlers instead of using the correct beforeRemove event.
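Stripped of the React Navigation specifics, beforeRemove is an interceptable event whose listeners can veto the default action. A minimal sketch of that mechanism (hypothetical emitter, not the library's internals):

```typescript
// Minimal interceptable-event sketch: listeners may call preventDefault() to
// veto removal, mirroring the shape of React Navigation's beforeRemove.
type PreventableEvent = { preventDefault(): void; defaultPrevented: boolean };

class Screen {
  private listeners: Array<(e: PreventableEvent) => void> = [];
  removed = false;

  // Returns an unsubscribe function, like navigation.addListener does.
  addListener(fn: (e: PreventableEvent) => void): () => void {
    this.listeners.push(fn);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== fn);
    };
  }

  // The navigator fires the event before removing the screen; if any listener
  // prevented the default, the removal is aborted.
  tryRemove(): boolean {
    const e: PreventableEvent = {
      defaultPrevented: false,
      preventDefault() {
        this.defaultPrevented = true;
      },
    };
    for (const fn of this.listeners) fn(e);
    if (!e.defaultPrevented) this.removed = true;
    return this.removed;
  }
}
```

The eval's "remove listener when changes are saved" requirement corresponds to calling the returned unsubscribe function once the form is no longer dirty.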

callstackincubator/evals — open-source benchmark suite: https://github.com/callstackincubator/evals

Current Landscape

Before React Native Evals, there was no standardized way to measure how well AI models handle mobile development. HumanEval and SWE-bench test general coding. MBPP tests Python snippets. None touch the patterns mobile developers deal with daily: gesture handlers, navigation stacks, native module bridging, platform-specific rendering. This benchmark fills that gap with tasks extracted from production codebases.

Key Challenges

Animation requires understanding Reanimated's worklet threading model — running animations on the UI thread, not the JS thread. Most models mix paradigms.

React Navigation has grown complex: stack, tab, drawer, modal, and auth flow patterns each have distinct configuration. Models often merge incompatible patterns.
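Deep linking, one of those navigation patterns, reduces to mapping a URL path onto a screen plus params. A hypothetical path-pattern matcher (not React Navigation's actual linking implementation) shows the core idea:

```typescript
// Hypothetical path-pattern matcher illustrating what deep-link config does:
// map a URL path onto route params. Not React Navigation's linking code.
function matchRoute(
  pattern: string, // e.g. "profile/:id"
  path: string,    // e.g. "profile/42"
): Record<string, string> | null {
  const patParts = pattern.split('/').filter(Boolean);
  const pathParts = path.split('/').filter(Boolean);
  if (patParts.length !== pathParts.length) return null;

  const params: Record<string, string> = {};
  for (let i = 0; i < patParts.length; i++) {
    if (patParts[i].startsWith(':')) {
      // Segment starting with ':' is a named parameter.
      params[patParts[i].slice(1)] = decodeURIComponent(pathParts[i]);
    } else if (patParts[i] !== pathParts[i]) {
      return null; // literal segment mismatch
    }
  }
  return params;
}
```

Here `matchRoute('profile/:id', 'profile/42')` yields `{ id: '42' }`, which a navigator would pass to the screen as route params — the behavior the "Deep link profile" eval checks for.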

Async state management spans React Query, Zustand, and Jotai — each with different patterns for cache invalidation, optimistic updates, and hydration.
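The "stale response guard" eval in this category targets a classic race: two fetches in flight, the older one resolving last. A plain-TypeScript sketch of the guard (the store shape and fetcher are hypothetical, not any particular library's API):

```typescript
// Stale-response guard: only the latest in-flight request may write to state.
// Each run() is tagged with a monotonically increasing id; a response is
// applied only if its id is still the latest when it resolves.
function createLatestOnly<T>() {
  let latest = 0;
  let value: T | undefined;
  return {
    get: () => value,
    async run(fetcher: () => Promise<T>): Promise<void> {
      const id = ++latest;               // tag this request
      const result = await fetcher();
      if (id === latest) value = result; // drop out-of-order responses
    },
  };
}
```

Libraries like React Query handle this internally via query keys; hand-rolled Zustand stores have to implement the guard themselves, which is what the eval probes.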

List performance is make-or-break. FlatList, FlashList, and LegendList have different APIs and recycling behaviors. Models frequently use the wrong component.

Platform-specific code (Platform.select, .ios.tsx/.android.tsx) is poorly represented in training data. Models write platform-agnostic code even when the eval requires branching.
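The resolution order behind Platform.select can be sketched in plain TypeScript (simplified, not React Native's implementation): exact platform key first, then `native` for iOS/Android, then `default`:

```typescript
// Mimics the documented semantics of React Native's Platform.select:
// exact OS key → 'native' fallback (ios/android only) → 'default'.
// Simplified sketch; real Platform.OS also covers windows/macos.
type PlatformOS = 'ios' | 'android' | 'web';

function select<T>(
  os: PlatformOS,
  spec: Partial<Record<PlatformOS | 'native' | 'default', T>>,
): T | undefined {
  if (os in spec) return spec[os];
  if ((os === 'ios' || os === 'android') && 'native' in spec) return spec.native;
  return spec.default;
}
```

Evals in this category require the branching itself — e.g. different shadow styles per platform — so platform-agnostic output fails even when it renders.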

Real RN projects use Expo and bare RN differently. The evals cover both, testing whether models adapt to project context.

What's Next

Three categories are in development: Expo SDK integration (camera, notifications, auth), brownfield embedding (RN inside existing native apps), and Nitro modules (native-to-JS bridge). The benchmark is open source and accepting contributions. As models improve on the existing 71 tasks, harder evals covering multi-screen flows and end-to-end feature implementation are planned.
