BFCL is a comprehensive benchmark designed to evaluate the function calling (also known as tool use) capabilities of Large Language Models (LLMs) in a wide range of real-world settings. It assesses models across various scenarios, including serial (simple), parallel, and multi-turn interactions, and evaluates agentic capabilities such as reasoning in stateful multi-step environments, memory, web search, and format sensitivity.
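To make the difference between the serial (simple) and parallel categories concrete, here is a minimal sketch of how a predicted set of function calls might be compared against an expected one. This is an illustration only, not BFCL's actual evaluator: the call format (`name` plus `arguments`), the `get_weather` function, and the order-insensitive matching rule are all assumptions made for the example.

```python
import json
from collections import Counter


def normalize(call):
    """Canonical string for one call: name plus arguments, keys sorted."""
    return json.dumps(
        {"name": call["name"], "arguments": call["arguments"]},
        sort_keys=True,
    )


def calls_match(predicted, expected):
    """True if both lists hold the same calls, ignoring their order."""
    predicted_counts = Counter(normalize(c) for c in predicted)
    expected_counts = Counter(normalize(c) for c in expected)
    return predicted_counts == expected_counts


# Serial (simple) case: one user request maps to exactly one call.
# `get_weather` is a hypothetical tool used only for illustration.
expected_simple = [{"name": "get_weather", "arguments": {"city": "Paris"}}]
predicted_simple = [{"name": "get_weather", "arguments": {"city": "Paris"}}]

# Parallel case: one user request maps to several independent calls.
expected_parallel = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_weather", "arguments": {"city": "Tokyo"}},
]
predicted_parallel = list(reversed(expected_parallel))  # same calls, other order

# A prediction with a wrong argument value should not match.
predicted_wrong = [{"name": "get_weather", "arguments": {"city": "Rome"}}]

simple_ok = calls_match(predicted_simple, expected_simple)
parallel_ok = calls_match(predicted_parallel, expected_parallel)
wrong_ok = calls_match(predicted_wrong, expected_simple)
```

Comparing multisets rather than ordered lists is a design choice for this sketch: independent parallel calls have no inherent order, so a reordered but otherwise identical prediction still counts as a match. A real evaluator may apply stricter or looser rules, for example around argument types or optional parameters.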
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it and publish the score. If it takes the top spot, we will annotate that step on the progress chart with your name.