Both are terminal agents. Both can run unattended for half an hour. Both ship regularly. One is wrapped around Claude Opus 4.5 through 4.7; the other around GPT-5.2/5.3-Codex. Here is where they actually diverge.
| Attribute | Claude Code | OpenAI Codex CLI |
|---|---|---|
| Vendor | Anthropic | OpenAI |
| Primary model | Opus 4.5 / 4.6 / 4.7, Sonnet 4.5, Haiku 4.5 | GPT-5.2 / 5.3, Codex variants, Mini, Nano |
| Reasoning control | Soft (prompted) | Explicit: minimal / medium / high / xhigh |
| SWE-Bench Verified peak | 87.6% (Opus 4.7) | 85% (GPT-5.3-Codex xhigh) |
| Patch format (sketch below the table) | str_replace (surgical) | apply_patch (unified diff) |
| Sandboxing | Host shell + optional --safe | Default Docker sandbox |
| Extension mechanism | MCP servers (Linear, DB, etc) | custom_tools JSON schema |
| Context window | 200k / 1M (Opus 4.7 1M) | 200k (GPT-5) / 400k (Codex long) |
| Pricing signal | Pro $20/mo + API usage | Pro $20/mo + API usage |
| Best for | Multi-file refactor, long autonomy | Reasoning-heavy bugs, cheap bulk runs |
SWE-Bench Verified figures above are pass@1, as of April 2026.
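The patch-format row is the clearest mechanical split. Below is a minimal sketch of the two edit styles under stated assumptions: the function names, parameters, and file paths are illustrative, not either CLI's real interface. A str_replace-style edit names an exact old string and its replacement; an apply_patch-style edit hands a unified diff to a patcher.

```python
import subprocess
from pathlib import Path

def str_replace_edit(path: str, old_str: str, new_str: str) -> None:
    """Surgical edit: refuse to touch the file unless old_str matches exactly once."""
    text = Path(path).read_text()
    if text.count(old_str) != 1:
        raise ValueError("old_str must match exactly once")
    Path(path).write_text(text.replace(old_str, new_str, 1))

def apply_unified_diff(path: str, diff_text: str) -> None:
    """Diff-style edit: let an external patcher apply a unified diff hunk by hunk."""
    subprocess.run(["patch", path], input=diff_text, text=True, check=True)

# Illustrative calls; the target file and strings are made up.
# str_replace_edit("app.py", "retries = 3", "retries = 5")
# apply_unified_diff("app.py", Path("fix.patch").read_text())
```

The trade-off the table hints at: the surgical form fails fast when the target text has drifted, while a diff carries its own context and survives small offsets.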
Both are plan-act-reflect agents, but the tool verbs and reasoning modes differ.
[Pipeline diagrams. Claude Code architecture: str_replace + bash + MCP. Codex CLI architecture: apply_patch + sandboxed shell.]
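For readers who want the shared loop spelled out, here is a minimal plan-act-reflect skeleton. It is an editorial sketch of the control flow only; `call_model`, the tool registry, and the message shapes are placeholders, not either vendor's internals.

```python
def run_agent(task, call_model, tools, max_steps=30):
    """Generic plan-act-reflect loop (editorial sketch, not a vendor implementation)."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(history)                     # plan: model proposes the next action
        if step["type"] == "finish":
            return step["summary"]
        result = tools[step["tool"]](**step["args"])   # act: str_replace, apply_patch, shell, ...
        history.append({"role": "tool", "name": step["tool"], "content": result})  # reflect
    return "stopped: max steps reached"
```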
[Figure: the money visual. Eval-run cost for a full SWE-Bench Verified pass vs. resolve rate. X axis: $ per resolved issue (log scale); Y axis: Verified %; the pink line marks the Pareto frontier. Mini/Haiku tiers are competitive on price-per-score.]
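The chart's math is reproducible in a few lines: cost per resolved issue is the full-run cost divided by the resolved count (SWE-Bench Verified has 500 issues), and the Pareto frontier keeps every model that no other model beats on both axes. The entries below are placeholders, not the values plotted.

```python
# Placeholder rows: (name, full Verified run cost in $, resolve rate in %).
runs = [
    ("model-a", 1200.0, 87.6),
    ("model-b", 450.0, 85.0),
    ("model-c", 90.0, 62.0),
    ("model-d", 600.0, 80.0),   # dominated: pricier and weaker than model-b
]

N_ISSUES = 500  # SWE-Bench Verified size

def dollars_per_resolved(cost: float, rate_pct: float) -> float:
    return cost / (N_ISSUES * rate_pct / 100)

def pareto_frontier(points):
    """Keep models that nothing else beats on both $/resolved and resolve rate."""
    frontier = []
    for name, cost, rate in points:
        x = dollars_per_resolved(cost, rate)
        dominated = any(
            dollars_per_resolved(c2, r2) <= x and r2 >= rate
            and (dollars_per_resolved(c2, r2) < x or r2 > rate)
            for n2, c2, r2 in points if n2 != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(runs))   # model-d drops out; the rest sit on the frontier
```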
SWE-Bench Verified scores are taken from Anthropic's leaderboard runs (Claude Code) and OpenAI's release notes for the Codex variants of GPT-5.2/5.3. See our SWE-Bench hub for the full leaderboard.
Cost figures reflect a full Verified run at published API rates as of April 2026. Codex's Nano and Mini tiers are priced for bulk; the xhigh reasoning tier carries the same per-token rate but consumes more reasoning tokens.
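To make the xhigh caveat concrete: at a fixed per-token rate, run cost scales with tokens consumed, so a tier that thinks several times longer costs several times more per run even with an identical price sheet. The rate and token counts below are invented for illustration.

```python
RATE_PER_MTOK = 10.0  # hypothetical $ per million output tokens, same for both tiers

def run_cost(reasoning_tokens: int, answer_tokens: int) -> float:
    return (reasoning_tokens + answer_tokens) / 1_000_000 * RATE_PER_MTOK

medium = run_cost(reasoning_tokens=200_000, answer_tokens=50_000)  # $2.50
xhigh = run_cost(reasoning_tokens=900_000, answer_tokens=50_000)   # $9.50
print(f"medium ${medium:.2f}, xhigh ${xhigh:.2f}, {xhigh / medium:.1f}x the cost")
```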
The pipeline diagrams are editorial — they capture the dominant tool verbs each CLI exposes, not every internal step.