arXiv:2606.05570
TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework
Authors pending
Abstract
TensorBench comprises 199 repo-level tasks on a real PyTorch extension. The strongest agent passes 64.8% of the tasks, and pairwise agreement between agents is low (Cohen's κ = 0.05), which suggests that the agents' skill gaps are task-specific rather than shared.
Tasks
Results
No benchmark results recorded yet.
Benchmark results referencing this paper haven't been added to the registry yet.
CodeSOTA extraction
Benchmark evidence
- TensorBench pass rates and pairwise Cohen's κ across 7 agents
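
For readers unfamiliar with the two statistics named above, the following is a minimal sketch of how a pass rate and pairwise Cohen's κ could be computed from per-task pass/fail outcomes. It is not the paper's evaluation code. The function names, the agent labels, and the four-task toy outcomes are all hypothetical and chosen only for illustration; returning 0.0 when κ is undefined is likewise an assumption of this sketch.

```python
from itertools import combinations


def pass_rate(outcomes):
    """Fraction of tasks passed, given a list of 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary (0/1) outcomes."""
    if len(a) != len(b) or not a:
        raise ValueError("outcome lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of tasks where both agents got the same outcome.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, computed from each agent's marginal pass rate.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        # Kappa is undefined here (both agents constant and identical);
        # this sketch reports 0.0 as an assumption.
        return 0.0
    return (observed - expected) / (1 - expected)


def pairwise_kappa(results):
    """Kappa for every unordered pair of agents in a name -> outcomes dict."""
    return {
        (m, n): cohen_kappa(results[m], results[n])
        for m, n in combinations(sorted(results), 2)
    }


# Hypothetical toy outcomes over four tasks (1 = pass, 0 = fail).
toy_results = {
    "agent_a": [1, 1, 0, 0],
    "agent_b": [1, 0, 1, 0],
    "agent_c": [1, 1, 0, 0],
}

rates = {name: pass_rate(o) for name, o in toy_results.items()}
kappas = pairwise_kappa(toy_results)
```

In the toy data, `agent_a` and `agent_b` have the same pass rate but succeed on different tasks, so their κ is 0 (agreement no better than chance), while `agent_a` and `agent_c` pass exactly the same tasks and have κ of 1. A κ near 0 across agents with similar pass rates is the pattern the abstract describes as low pairwise agreement.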