Nexus (NexusBench) is a suite of function-calling / tool-use evaluation benchmarks and associated evaluation data released by Nexusflow (Nexusflow.ai). It measures an LLM's ability to (a) select the correct external function/API for a task, (b) fill in the call's arguments correctly, and (c) carry out multi-step agentic workflows. The suite is published as the NexusBench repository on GitHub and as multiple evaluation datasets on Hugging Face, including Nexusflow/NexusRaven_API_evaluation, Function_Call_Definitions, and several per-task benchmark shards such as VirusTotalBenchmark, NVDLibraryBenchmark, and TicketTrackingBenchmark. NexusBench underpins Nexusflow's function-calling leaderboard and model evaluations (e.g., NexusRaven). Primary sources: the NexusBench GitHub repository (nexusflowai/NexusBench) and the Nexusflow Hugging Face datasets (Nexusflow/*).
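For orientation, the sketch below loads one of the Hugging Face datasets named above with the standard `datasets` library and prints a sample record. The split name and record schema are assumptions made for illustration; check the dataset card on Hugging Face for the actual configuration.

```python
# Minimal sketch: load a Nexusflow evaluation dataset and inspect one record.
# Requires `pip install datasets`. The split name ("train") and the exact
# columns are assumptions -- consult the dataset card for the real schema.
from datasets import load_dataset

dataset = load_dataset("Nexusflow/NexusRaven_API_evaluation", split="train")

print(dataset)     # column names and row count
print(dataset[0])  # one raw record (query plus reference call, per the card)
```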
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.
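To give a concrete sense of what a function-calling score checks, here is an illustrative (not official) matcher for the single-call case: it parses the model's emitted call and the reference call as Python expressions and compares function name and arguments. NexusBench's actual harness and scoring rules live in the nexusflowai/NexusBench repository; this is only a sketch of the matching idea.

```python
# Illustrative only -- NOT NexusBench's official scorer. An exact-match
# checker for single function calls: parse both calls with `ast` and compare
# the callee name, positional arguments, and keyword arguments (the latter
# order-insensitively).
import ast


def call_signature(call_str: str):
    """Return (function_name, positional_args, {kwarg: value}) for a call
    string like 'lookup_cve(cve_id="CVE-2021-44228", verbose=True)'."""
    node = ast.parse(call_str.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple function call: {call_str!r}")
    args = tuple(ast.literal_eval(a) for a in node.args)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, args, kwargs


def calls_match(predicted: str, reference: str) -> bool:
    """True iff the predicted call names the same function with the same
    arguments as the reference call."""
    try:
        return call_signature(predicted) == call_signature(reference)
    except (SyntaxError, ValueError):
        return False  # unparseable predictions count as misses


# Tiny usage example with made-up calls:
print(calls_match('search(query="log4j", limit=10)',
                  'search(limit=10, query="log4j")'))  # True: kwarg order ignored
print(calls_match('search(query="log4j")',
                  'search(query="log4shell")'))        # False: wrong argument value
```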