Nexus (NexusBench) is a suite of function-calling / tool-use evaluation benchmarks and associated evaluation data released by Nexusflow (Nexusflow.ai). It measures an LLM's ability to (a) select the correct external function/API for a task, (b) fill in the call's arguments correctly, and (c) carry out multi-step agentic workflows. The suite is published as the NexusBench repository on GitHub and as multiple evaluation datasets on Hugging Face, including Nexusflow/NexusRaven_API_evaluation, Function_Call_Definitions, and several per-task benchmark shards such as VirusTotalBenchmark, NVDLibraryBenchmark, and TicketTrackingBenchmark. NexusBench underpins Nexusflow's function-calling leaderboard and model evaluations (e.g., NexusRaven). Primary sources: the NexusBench GitHub repository (nexusflowai/NexusBench) and the Nexusflow Hugging Face datasets (Nexusflow/*).
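For orientation, the sketch below loads one of the Hugging Face datasets named above with the standard `datasets` library and prints a sample record. The split name and record schema are assumptions made for illustration; check the dataset card on Hugging Face for the actual configuration.

```python
# Minimal sketch: load a Nexusflow evaluation dataset and inspect one record.
# Requires `pip install datasets`. The split name ("train") and the exact
# columns are assumptions -- consult the dataset card for the real schema.
from datasets import load_dataset

dataset = load_dataset("Nexusflow/NexusRaven_API_evaluation", split="train")

print(dataset)     # column names and row count
print(dataset[0])  # one raw record (query plus reference call, per the card)
```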
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.
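To give a concrete sense of what a function-calling score checks, here is an illustrative (not official) matcher for the single-call case: it parses the model's emitted call and the reference call as Python expressions and compares function name and arguments. NexusBench's actual harness and scoring rules live in the nexusflowai/NexusBench repository; this is only a sketch of the matching idea.

```python
# Illustrative only -- NOT NexusBench's official scorer. An exact-match
# checker for single function calls: parse both calls with `ast` and compare
# the callee name, positional arguments, and keyword arguments (the latter
# order-insensitively).
import ast


def call_signature(call_str: str):
    """Return (function_name, positional_args, {kwarg: value}) for a call
    string like 'lookup_cve(cve_id="CVE-2021-44228", verbose=True)'."""
    node = ast.parse(call_str.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple function call: {call_str!r}")
    args = tuple(ast.literal_eval(a) for a in node.args)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, args, kwargs


def calls_match(predicted: str, reference: str) -> bool:
    """True iff the predicted call names the same function with the same
    arguments as the reference call."""
    try:
        return call_signature(predicted) == call_signature(reference)
    except (SyntaxError, ValueError):
        return False  # unparseable predictions count as misses


# Tiny usage example with made-up calls:
print(calls_match('search(query="log4j", limit=10)',
                  'search(limit=10, query="log4j")'))  # True: kwarg order ignored
print(calls_match('search(query="log4j")',
                  'search(query="log4shell")'))        # False: wrong argument value
```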