A benchmark suite evaluating how AI coding models handle authentic React Native development tasks. 71 evals across 5 categories: animation (14), async-state management (14), lists (19), navigation (14), and React Native APIs (10). Each eval specifies explicit, judgeable requirements. Model outputs are scored on requirement satisfaction using LLM-based judging. Covers real libraries: Reanimated, React Navigation, Zustand, Jotai, React Query, FlatList, FlashList, LegendList.
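The description does not say how judge verdicts are turned into a percentage. The sketch below shows one plausible reading. It assumes the judge returns a binary pass/fail verdict per requirement and that a score is the share of satisfied requirements pooled across evals. Both are assumptions, and `EvalResult` and `satisfaction_score` are hypothetical names, not part of the benchmark.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalResult:
    """Judge verdicts for one eval: one boolean per stated requirement."""
    category: str
    verdicts: List[bool]


def satisfaction_score(results: List[EvalResult]) -> float:
    """Percentage of requirements judged satisfied, pooled across evals.

    Assumption: every requirement carries equal weight (micro-average).
    """
    total = sum(len(r.verdicts) for r in results)
    if total == 0:
        raise ValueError("no requirements to score")
    passed = sum(sum(r.verdicts) for r in results)
    return round(100 * passed / total, 2)


# Made-up verdicts, for illustration only.
results = [
    EvalResult("animation", [True, True, False]),
    EvalResult("navigation", [True, True]),
]
overall = satisfaction_score(results)
animation_only = satisfaction_score(
    [r for r in results if r.category == "animation"]
)
```

Under this reading, a per-category metric is the same calculation restricted to that category's evals. A per-eval average (macro-average) would be an equally reasonable alternative and can give different numbers.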
40 results indexed across 4 metrics. Shaded row marks current SOTA; ties broken by submission date.
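The tie-break rule can be illustrated with a short sort. This is a sketch only: it assumes the earlier submission ranks first on a tie, and the exact days are hypothetical because the tables list only "Mar 2026".

```python
from datetime import date

# (model, score, submitted). Scores are taken from the animation table;
# the days of the month are hypothetical placeholders.
entries = [
    ("Kimi K2.5", 59.40, date(2026, 3, 12)),
    ("DeepSeek-V3.2", 56.40, date(2026, 3, 5)),
    ("Grok 4", 59.40, date(2026, 3, 9)),
]

# Sort by score descending, then by submission date ascending
# (assumption: the earlier submission wins a tie).
ranked = sorted(entries, key=lambda e: (-e[1], e[2]))
names = [name for name, _, _ in ranked]
```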
| # | Model | Org | Submitted | Paper / code | animation-satisfaction |
|---|---|---|---|---|---|
| 01 | Composer 2 | Anysphere | Mar 2026 | Callstack Incubator | 94.30 |
| 02 | Claude Opus 4.6 (API) | Anthropic | Mar 2026 | Callstack Incubator | 77.40 |
| 03 | GPT-5.4 (API) | OpenAI | Mar 2026 | Callstack Incubator | 68.90 |
| 04 | GLM-5 (OSS) | Zhipu AI | Mar 2026 | Callstack Incubator | 66.00 |
| 05 | Claude Sonnet 4.6 (API) | Anthropic | Mar 2026 | Callstack Incubator | 65.10 |
| 06 | Gemini 3.1 Pro (API) | Google | Mar 2026 | Callstack Incubator | 64.20 |
| 07 | GPT 5.3 Codex (API) | OpenAI | Mar 2026 | Callstack Incubator | 63.20 |
| 08 | Grok 4 (API) | xAI | Mar 2026 | Callstack Incubator | 59.40 |
| 09 | Kimi K2.5 (API) | Moonshot AI | Mar 2026 | Callstack Incubator | 59.40 |
| 10 | DeepSeek-V3.2 (API) | DeepSeek | Mar 2026 | Callstack Incubator | 56.40 |
| # | Model | Org | Submitted | Paper / code | async-state-satisfaction |
|---|---|---|---|---|---|
| 01 | Composer 2 | Anysphere | Mar 2026 | Callstack Incubator | 98.50 |
| 02 | GPT-5.4 (API) | OpenAI | Mar 2026 | Callstack Incubator | 85.40 |
| 03 | GPT 5.3 Codex (API) | OpenAI | Mar 2026 | Callstack Incubator | 85.30 |
| 04 | Claude Opus 4.6 (API) | Anthropic | Mar 2026 | Callstack Incubator | 84.60 |
| 05 | Claude Sonnet 4.6 (API) | Anthropic | Mar 2026 | Callstack Incubator | 80.80 |
| 06 | Gemini 3.1 Pro (API) | Google | Mar 2026 | Callstack Incubator | 80.80 |
| 07 | Kimi K2.5 (API) | Moonshot AI | Mar 2026 | Callstack Incubator | 77.70 |
| 08 | DeepSeek-V3.2 (API) | DeepSeek | Mar 2026 | Callstack Incubator | 77.70 |
| 09 | Grok 4 (API) | xAI | Mar 2026 | Callstack Incubator | 73.80 |
| 10 | GLM-5 (OSS) | Zhipu AI | Mar 2026 | Callstack Incubator | 73.80 |
| # | Model | Org | Submitted | Paper / code | navigation-satisfaction |
|---|---|---|---|---|---|
| 01 | Composer 2 | Anysphere | Mar 2026 | Callstack Incubator | 98.90 |
| 02 | GPT 5.3 Codex (API) | OpenAI | Mar 2026 | Callstack Incubator | 95.60 |
| 03 | GPT-5.4 (API) | OpenAI | Mar 2026 | Callstack Incubator | 95.60 |
| 04 | Gemini 3.1 Pro (API) | Google | Mar 2026 | Callstack Incubator | 94.40 |
| 05 | Claude Sonnet 4.6 (API) | Anthropic | Mar 2026 | Callstack Incubator | 93.30 |
| 06 | Claude Opus 4.6 (API) | Anthropic | Mar 2026 | Callstack Incubator | 93.30 |
| 07 | Kimi K2.5 (API) | Moonshot AI | Mar 2026 | Callstack Incubator | 93.30 |
| 08 | GLM-5 (OSS) | Zhipu AI | Mar 2026 | Callstack Incubator | 86.70 |
| 09 | Grok 4 (API) | xAI | Mar 2026 | Callstack Incubator | 84.40 |
| 10 | DeepSeek-V3.2 (API) | DeepSeek | Mar 2026 | Callstack Incubator | 75.70 |
| # | Model | Org | Submitted | Paper / code | requirement-satisfaction |
|---|---|---|---|---|---|
| 01 | Composer 2 | Anysphere | Mar 2026 | Callstack Incubator | 96.20 |
| 02 | Claude Opus 4.6 (API) | Anthropic | Mar 2026 | Callstack Incubator | 84.36 |
| 03 | GPT-5.4 (API) | OpenAI | Mar 2026 | Callstack Incubator | 82.64 |
| 04 | GPT 5.3 Codex (API) | OpenAI | Mar 2026 | Callstack Incubator | 80.88 |
| 05 | Gemini 3.1 Pro (API) | Google | Mar 2026 | Callstack Incubator | 78.90 |
| 06 | Claude Sonnet 4.6 (API) | Anthropic | Mar 2026 | Callstack Incubator | 77.91 |
| 07 | Kimi K2.5 (API) | Moonshot AI | Mar 2026 | Callstack Incubator | 74.91 |
| 08 | GLM-5 (OSS) | Zhipu AI | Mar 2026 | Callstack Incubator | 74.23 |
| 09 | Grok 4 (API) | xAI | Mar 2026 | Callstack Incubator | 70.06 |
| 10 | DeepSeek-V3.2 (API) | DeepSeek | Mar 2026 | Callstack Incubator | 68.98 |
Each row below marks a model that broke the previous record on requirement-satisfaction. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.
Higher scores win. Each subsequent entry improved upon the previous best.
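The record-setting filter described here amounts to a running maximum over submissions in chronological order. A minimal sketch, assuming strict improvement is required to set a new record; the function name `record_setters` and the submission order in the example are hypothetical, while the scores are requirement-satisfaction values from the leaderboard.

```python
from typing import List, Tuple


def record_setters(
    submissions: List[Tuple[str, float]],
) -> List[Tuple[str, float]]:
    """Keep only submissions that strictly beat the best score seen so far.

    `submissions` is assumed to be in chronological order.
    """
    best = float("-inf")
    records = []
    for name, score in submissions:
        if score > best:
            records.append((name, score))
            best = score
    return records


# Hypothetical chronological order, for illustration only.
history = [
    ("DeepSeek-V3.2", 68.98),
    ("GPT-5.4", 82.64),
    ("Kimi K2.5", 74.91),
    ("Claude Opus 4.6", 84.36),
    ("Composer 2", 96.20),
]
progress = record_setters(history)
```

In this made-up ordering, Kimi K2.5 stays on the leaderboard but is not re-listed as a record, because 74.91 does not exceed the 82.64 already on the board.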
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.