Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal human intervention.
Human-validated subset of 500 GitHub issues from real Python repositories. Models must produce a patch that passes hidden tests. Standard end-to-end benchmark for autonomous coding agents, covering repository navigation, editing, and testing.
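The pass/fail scoring described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `InstanceResult` record, the `resolve_rate` helper, and the instance IDs are all hypothetical, and the rule that an issue counts as resolved only when every hidden test passes is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome of running hidden tests against one generated patch (hypothetical record)."""
    instance_id: str
    tests_passed: int
    tests_total: int

    @property
    def resolved(self) -> bool:
        # Assumption: an issue is resolved only if every hidden test passes.
        return self.tests_total > 0 and self.tests_passed == self.tests_total


def resolve_rate(results: list[InstanceResult]) -> float:
    """Fraction of instances whose patch passed all hidden tests."""
    if not results:
        return 0.0
    return sum(r.resolved for r in results) / len(results)


# Made-up results for four issues; only the first and third are fully passing.
results = [
    InstanceResult("repo-a-1", tests_passed=4, tests_total=4),
    InstanceResult("repo-b-2", tests_passed=3, tests_total=5),
    InstanceResult("repo-c-3", tests_passed=2, tests_total=2),
    InstanceResult("repo-d-4", tests_passed=0, tests_total=1),
]
print(f"{resolve_rate(results):.0%}")  # prints 50%
```

Under this all-or-nothing assumption, a patch that fixes most but not all hidden tests scores the same as one that fixes none, so the headline number is a count of fully resolved issues divided by the number attempted.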
2 datasets tracked for this task.