Benchmarks measuring AI agents ability to use tools and APIs to complete real-world tasks across domains like retail and airline customer service.
Seeking canonical benchmark for this task.
Suggest one →Didn't find the model, metric, or dataset you needed? Tell us in one line. We read every message and reply within 48 hours.
Still looking for something on Tool Use? A missing model, a stale score, a benchmark we should cover — drop it here and we'll handle it.
Real humans read every message. We track what people are asking for and prioritize accordingly.