The Berkeley Function Calling Leaderboard (BFCL) is a comprehensive benchmark created by UC Berkeley's Gorilla project for evaluating the ability of large language models (LLMs) to invoke functions, APIs, and tools accurately. First released in February 2024, BFCL has become the de facto standard for measuring function calling performance across the AI industry. The benchmark uses a novel Abstract Syntax Tree (AST) evaluation method that enables scalable testing across thousands of functions without requiring actual execution. As of 2025, BFCL has evolved through four major versions, progressing from simple single-turn function calls to complex agentic evaluations involving multi-hop reasoning, error recovery, and memory management.
BFCL was presented as a paper titled "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models" at the 42nd International Conference on Machine Learning (ICML 2025). All leaderboard data, evaluation code, and model results are released under the Apache 2.0 license.
Function calling (also referred to as tool use) is a core capability that allows LLMs to interact with external services, databases, and APIs rather than relying solely on their internal knowledge. As LLM-powered AI agents became more prevalent in 2023 and 2024, the need for a rigorous, standardized benchmark to evaluate this capability grew urgent. Prior to BFCL, there was no comprehensive evaluation framework dedicated to assessing how accurately models could generate structured function calls in response to natural language queries.
The challenge of evaluating function calling at scale is significant. Deterministic validation of function calls typically requires executing the corresponding functions, which introduces dependencies on external services, authentication, rate limits, and infrastructure costs. This makes large-scale evaluation impractical. BFCL addressed this problem by introducing an AST-based evaluation strategy that avoids the need for function execution entirely, drawing on techniques from programming language research.
BFCL emerged from the broader Gorilla project at UC Berkeley, which began with the "Gorilla: Large Language Model Connected with Massive APIs" paper published at NeurIPS 2024. The Gorilla model, a fine-tuned LLaMA-based system, demonstrated that LLMs could be trained to accurately invoke over 1,600 API calls while reducing hallucination. BFCL extended this work from model development to model evaluation, providing the community with an open benchmark for comparing any model's function calling capabilities.
BFCL was created by researchers at UC Berkeley's Sky Computing Lab, who are the primary authors of the ICML 2025 paper.
The project is hosted on GitHub under the ShishirPatil/gorilla repository and is maintained as a community-driven, open-source effort.
The central innovation of BFCL is its use of Abstract Syntax Tree (AST) sub-string matching as a proxy for actual function execution. An AST is a tree representation of the syntactic structure of source code, commonly used in compilers and interpreters. BFCL parses the function calls generated by an LLM into AST form and then compares the resulting structure against a set of known correct answers.
The AST evaluation process works as follows: the model's output is parsed into an AST; the function name is checked against the ground truth; every required parameter must be present; and each parameter value is compared against a list of accepted answers, which may include multiple valid values and types.
This approach offers a significant advantage in scalability. Because it does not require running the actual functions, BFCL can evaluate function calls across thousands of distinct APIs and multiple programming languages without any infrastructure for executing those functions.
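The comparison described above can be sketched with Python's built-in `ast` module. This is a minimal illustration, not BFCL's actual grader: the function name, parameter names, and accepted values below are made up, and the real harness also handles positional arguments, type coercion, and nested structures.

```python
import ast

def parse_call(source: str) -> tuple[str, dict]:
    """Parse a single function-call string into (name, keyword arguments)."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    name = ast.unparse(node.func)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs

def matches(model_output: str, expected_name: str, accepted: dict) -> bool:
    """Check the call's name and each parameter against accepted values."""
    try:
        name, kwargs = parse_call(model_output)
    except (SyntaxError, ValueError):
        return False
    if name != expected_name:
        return False
    # Every parameter must take one of its accepted values.
    return all(kwargs.get(k) in vals for k, vals in accepted.items())

print(matches("get_weather(city='Berkeley', unit='celsius')",
              "get_weather",
              {"city": ["Berkeley"], "unit": ["celsius", "Celsius"]}))  # True
```

Because only the parsed structure is inspected, no `get_weather` function ever needs to exist or run, which is what makes the approach scale across thousands of APIs.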
Alongside AST evaluation, BFCL includes an executable evaluation track where function calls are actually executed against real endpoints or locally defined functions. In this track, manually written Python functions inspired by free REST API endpoints compute results directly, and the outputs are compared against expected values.
The BFCL authors have shown that AST-based evaluation and executable evaluation produce strongly correlated results. This correlation validates the AST approach as a reliable proxy for execution-based testing, while being far more practical at scale.
BFCL also evaluates a model's ability to determine when a function call should or should not be made. In irrelevance detection scenarios, the model is presented with function choices that are not relevant to the user's query. The correct behavior is to decline to invoke any function. In relevance detection scenarios, the model must correctly identify that at least one of the available functions is appropriate and proceed to call it.
BFCL has progressed through four major versions, each expanding the scope and complexity of evaluation.
The first version established the core benchmark with 2,000 question-function-answer pairs designed by domain experts. V1 focused exclusively on single-turn interactions where the model receives a query and a set of function definitions, then must produce the correct function call in a single response.
Test categories in V1 included:
| Category | Description | Evaluation Method |
|---|---|---|
| Simple Function (Python) | Single function call with one available function | AST + Executable |
| Simple Function (Java) | Single function call in Java syntax | AST |
| Simple Function (JavaScript) | Single function call in JavaScript syntax | AST |
| Multiple Function | Choosing the correct function from several options | AST + Executable |
| Parallel Function | Invoking the same function multiple times with different parameters | AST + Executable |
| Parallel Multiple Function | Combining parallel and multiple function patterns | AST + Executable |
| REST API | Function calls targeting RESTful API endpoints | Executable |
| Relevance Detection | Determining if any function is relevant to the query | Relevance metric |
Dataset composition by language and category:
| Language / Category | Number of Test Cases |
|---|---|
| Python | 1,680 |
| Java | 100 |
| JavaScript | 50 |
| REST API | 70 |
| SQL | 100 |
The dataset covered 40 sub-domains including mathematics, finance, sports, and law. V1 found that proprietary models such as GPT-4 demonstrated superior performance in multiple and parallel function scenarios, while open-source models achieved comparable performance in simple function calling.
V2 addressed a critical limitation of V1: the risk of data contamination. Because V1's test data was static and publicly available, there was a concern that models could achieve inflated scores by memorizing the test set during training. V2 introduced a "live" dataset composed entirely of real-world function calling queries submitted by actual users.
Data collection details:
The raw dataset was constructed from 64,517 queries received by the BFCL hosted endpoint between February 26 and April 1, 2024. A three-stage pipeline processed this raw data through deduplication (using ROUGE-L scores and text embeddings), quality filtering, and standardization while preserving the original intent.
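The deduplication stage can be sketched as a greedy similarity filter. This is a simplified stand-in: `difflib.SequenceMatcher` substitutes here for the ROUGE-L scores and text embeddings the actual pipeline used, and the threshold and example queries are illustrative.

```python
from difflib import SequenceMatcher

def dedupe(queries: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a query only if it is not too similar to any already-kept one
    (rough stand-in for ROUGE-L / embedding-based deduplication)."""
    kept = []
    for q in queries:
        if all(SequenceMatcher(None, q, k).ratio() < threshold for k in kept):
            kept.append(q)
    return kept

raw = [
    "book a flight from SFO to JFK tomorrow",
    "book a flight from SFO to JFK tomorrow!",   # near-duplicate, dropped
    "what is the weather in Berkeley today",
]
print(dedupe(raw))
```

The real pipeline additionally applied quality filtering and standardization passes after this step.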
Final dataset composition (2,251 entries):
| Category | Number of Entries |
|---|---|
| Simple | 258 |
| Multiple | 1,053 |
| Parallel | 16 |
| Parallel Multiple | 24 |
| Irrelevance Detection | 882 |
| Relevance Detection | 18 |
Entries in V2 offered an average of 3 function choices (maximum 37), and each function averaged 4 parameters (maximum 28). Contributors included large banks, technology corporations, agent developers, hobbyists, and enterprises.
Key findings from V2:
V3 represented a substantial expansion of the benchmark's scope by introducing multi-turn and multi-step function calling scenarios. While V1 and V2 tested only single-turn interactions (one query, one response), V3 evaluated how models perform in back-and-forth conversations that require planning, sequential reasoning, and state tracking across multiple dialogue turns.
Test categories in V3:
| Category | Test Cases | Description |
|---|---|---|
| Base Multi-Turn | 200 | Foundational multi-turn scenarios where all necessary information is available through user requests or prior execution results |
| Missing Parameters | 200 | Tests the model's ability to request clarification when essential information is absent |
| Missing Functions | 200 | Models must recognize when available functions are insufficient to fulfill the request |
| Long-Context Multi-Turn | 200 | Evaluates performance when large volumes of extraneous data are present |
| Composite | 200 | Combines all three augmented challenges simultaneously |
Total: 1,000 multi-turn test cases across eight API domains: Vehicle Control, Trading Bots, Travel Booking, Gorilla File System, Message API, Twitter API, Ticket API, and Math API.
Evaluation methodology shift:
V3 moved beyond AST matching to state-based verification for multi-turn scenarios. Instead of comparing the structure of individual function calls, the evaluation system checks the actual state of the backend API system (file systems, booking records, database modifications) after the model executes its sequence of function calls. This is complemented by response-based evaluation for read-only requests, ensuring models invoke necessary exploratory functions rather than hallucinating results.
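A minimal sketch of state-based verification, assuming a toy file-system backend (the class and its methods are hypothetical stand-ins for BFCL's API domains): the grader compares the backend's final state against the expected state, so any trajectory of calls that reaches that state passes.

```python
class MockFileSystem:
    """Minimal stateful backend; only its final state is graded."""
    def __init__(self):
        self.files: dict[str, str] = {}
    def touch(self, name: str):
        self.files.setdefault(name, "")
    def write(self, name: str, text: str):
        self.files[name] = text
    def rm(self, name: str):
        self.files.pop(name, None)

def run_turn(fs: MockFileSystem, calls: list[tuple[str, tuple]]):
    """Execute a model's sequence of (function, args) calls against the backend."""
    for fn, args in calls:
        getattr(fs, fn)(*args)

expected = {"notes.txt": "hello"}

fs = MockFileSystem()
# An extra touch/rm detour is fine: only the end state is checked.
run_turn(fs, [("touch", ("scratch.txt",)),
              ("rm", ("scratch.txt",)),
              ("write", ("notes.txt", "hello"))])
print(fs.files == expected)  # True
```

This is why state-based checking tolerates different but equally valid call orderings, which pure AST matching on each call cannot.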
Data curation process:
The V3 dataset underwent 11 rounds of data filtering. The team emphasized that synthetic data alone was insufficient and that human labeling was essential. Ground truth trajectories were manually labeled by experts, with validation covering question clarity, executability, initial configurations, and API code compliance.
Key failure patterns identified in V3:
V4, announced at ICML 2025, introduced holistic agentic evaluation. Recognizing that function calling forms the foundation of agentic AI systems, V4 tests tool-calling in real-world agentic settings across three new evaluation dimensions: web search with multi-hop reasoning, agent memory management, and format sensitivity.
The web search evaluation assesses how effectively models can answer questions that require retrieving and integrating information from multiple sources through sequential search queries. For example, answering "Who founded the fastest-growing U.S. business in 2024?" requires first identifying the business, then searching for its founder.
Hop distribution: Questions span 2-hop, 3-hop, and 4-hop categories across domains including technology, business, geography, and culture.
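The multi-hop pattern can be sketched as a loop in which each hop's answer is substituted into the next query. The `search` function and its lookup table below are hypothetical placeholders for a real web-search tool; "Acme Robotics" and "Jane Doe" are invented values for illustration.

```python
def search(query: str) -> str:
    """Stand-in for a real web-search tool (hypothetical lookup table)."""
    kb = {
        "fastest-growing U.S. business in 2024": "Acme Robotics",
        "founder of Acme Robotics": "Jane Doe",
    }
    return kb.get(query, "no result")

def answer_two_hop(first_query: str, second_query_template: str) -> str:
    # Hop 1: resolve the intermediate entity.
    entity = search(first_query)
    # Hop 2: feed the intermediate answer into the follow-up query.
    return search(second_query_template.format(entity))

print(answer_two_hop("fastest-growing U.S. business in 2024",
                     "founder of {}"))  # Jane Doe
```

Three- and four-hop questions extend this chain, so an error in any intermediate hop propagates to the final answer.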
Error injection for robustness testing: BFCL V4 injects randomized real-world failures into the search pipeline to test error recovery capabilities. The injected errors include:
| Error Type | Description |
|---|---|
| 503 Server Error | Service unavailable |
| 429 Too Many Requests | Rate limiting |
| 403 Forbidden | Access denied |
| ConnectTimeout | Connection timeout |
| ReadTimeout | Read timeout |
| ConnectionError | Network connection failure |
Models are tested at 0%, 50%, and 100% URL fetch failure rates. The evaluation also tests models with and without DuckDuckGo search snippets, and with tools entirely disabled to measure reliance on internal knowledge versus genuine search use.
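Error injection at a configurable failure rate can be sketched as follows, using the error types from the table above; the fetch function, URL, and return value are illustrative rather than BFCL's actual harness.

```python
import random

INJECTED_ERRORS = ["503 Server Error", "429 Too Many Requests", "403 Forbidden",
                   "ConnectTimeout", "ReadTimeout", "ConnectionError"]

def fetch_with_injection(url: str, failure_rate: float, rng: random.Random) -> str:
    """Simulate a URL fetch that fails with the configured probability."""
    if rng.random() < failure_rate:
        raise RuntimeError(rng.choice(INJECTED_ERRORS))
    return f"<contents of {url}>"

rng = random.Random(0)  # seeded for reproducibility
results = []
for _ in range(100):
    try:
        results.append(fetch_with_injection("https://example.com", 0.5, rng))
    except RuntimeError as exc:
        results.append(f"error: {exc}")

failures = sum(r.startswith("error") for r in results)
print(failures)  # roughly half of the fetches fail at a 50% rate
```

A model under test must notice these failures and retry, reformulate, or fall back to other sources rather than hallucinating page contents.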
Scoring: Responses must follow a structured format with answer and context fields. Evaluation uses exact-match metrics on the answer field, with text normalization (lowercase conversion, punctuation removal) before comparison.
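The normalization step before exact matching can be sketched like this (the example strings are illustrative):

```python
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())

def exact_match(predicted: str, gold: str) -> bool:
    return normalize(predicted) == normalize(gold)

print(exact_match("Jane Doe.", "jane doe"))  # True
```

Normalizing both sides makes the metric robust to superficial formatting differences while still demanding the exact answer string.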
Key web search findings:
The memory evaluation tests how models manage conversational memory using dedicated memory tools. This assesses the ability to store, retrieve, update, and reason over information accumulated across a conversation.
Three memory architectures are evaluated:
| Architecture | Mechanism | Best For |
|---|---|---|
| Key-Value Store | BM25+ keyword search for exact lookups | Structured recall with clear naming conventions |
| Vector Store | FAISS with all-MiniLM-L6-v2 embeddings for similarity-based retrieval | Semantic generalization across phrasing variations |
| Recursive Summarization | Running text buffer that accumulates and compresses conversation history (10,000 character limit) | Long conversation tracking with periodic compression |
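The recursive-summarization architecture can be sketched as a self-compressing buffer. This is a toy version: where a real system would call an LLM to summarize older turns, the placeholder `_compress` step simply truncates and tags them, and the character limit is shrunk for demonstration.

```python
class SummaryBuffer:
    """Running text buffer that compresses itself past a size limit."""
    def __init__(self, limit: int = 10_000):
        self.limit = limit
        self.text = ""

    def append(self, turn: str):
        self.text += turn + "\n"
        if len(self.text) > self.limit:
            self._compress()

    def _compress(self):
        # Placeholder compression: keep the most recent half and mark
        # the discarded prefix as summarized (a real system would call
        # an LLM to produce an actual summary here).
        keep = self.text[-self.limit // 2:]
        self.text = "[summary of earlier turns]\n" + keep

buf = SummaryBuffer(limit=50)  # tiny limit for demonstration
for i in range(10):
    buf.append(f"turn {i}: user said something")
print(len(buf.text) <= 80)  # buffer stays bounded despite 10 turns
```

The trade-off, reflected in the table above, is that periodic compression preserves long-conversation tracking at the cost of losing fine-grained detail from early turns.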
Five practical domains are tested:
The evaluation uses a "snapshot-and-reload" approach: models first populate memory through multi-turn prerequisite conversations, then the memory state is serialized and preserved. Follow-up evaluation questions test retrieval accuracy without the benefit of prior dialogue context.
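The snapshot-and-reload idea reduces to serializing memory after the prerequisite turns and answering follow-ups against the reloaded state alone. The stored facts below are hypothetical, and a plain dict with `json` stands in for whichever memory backend is under test.

```python
import json

# Phase 1: prerequisite conversation populates memory (hypothetical facts).
memory = {}
memory["user_name"] = "Alex"
memory["favorite_city"] = "Berkeley"

# Snapshot: serialize the memory state once the prerequisite turns finish.
snapshot = json.dumps(memory)

# Phase 2: follow-up questions run against the reloaded state only,
# with no access to the original dialogue transcript.
reloaded = json.loads(snapshot)
print(reloaded["favorite_city"])  # Berkeley, recalled without prior context
```

Separating the two phases ensures the score measures what the memory tools actually retained, not what the model can re-read from a long context window.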
Key memory findings:
The format sensitivity evaluation examines how consistently models produce correct output when the input prompt format varies. The principle is that a model should produce the same correct function call regardless of how the prompt or expected output format is specified.
Five dimensions are tested:
| Dimension | Variations |
|---|---|
| Return format | Python, JSON, verbose XML, concise XML |
| Function documentation format | Python, XML, JSON |
| Tool calling tags | With or without XML-style wrappers |
| Prompt format | Plaintext vs. Markdown |
| Prompt style | Variations in tone and phrasing |
A total of 26 distinct prompt configurations are generated by combining these variations. Because running every configuration over the full dataset is costly, evaluation uses a sampled subset; testing across 20 models confirmed a strong linear relationship between sampled and full-dataset results, validating the sample's representativeness.
Key format sensitivity findings:
The BFCL leaderboard is hosted at gorilla.cs.berkeley.edu/leaderboard.html and is updated periodically as new models are evaluated. Rankings reflect overall accuracy across all test categories.
| Rank | Model | Organization | Overall Score |
|---|---|---|---|
| 1 | GLM-4.5 (FC) | Zhipu AI | 70.85% |
| 2 | Claude Opus 4.1 | Anthropic | 70.36% |
| 3 | Claude Sonnet 4 | Anthropic | 70.29% |
| 7 | GPT-5 | OpenAI | 59.22% |
The rankings reveal that top models excel at single-turn function calling but continue to face challenges with memory management, long-horizon reasoning, and determining when not to invoke tools.
On the V1/V2 single-turn evaluations (using AST scoring), several models have achieved high scores. Meta's Llama 3.1 405B Instruct model scored 88.5% on the V1 AST evaluation, one of the highest recorded single-turn scores. Open-source models have generally narrowed the gap with proprietary models on simple function calling, though differences remain more pronounced for parallel and multiple function calling scenarios.
BFCL originated within the Gorilla project, which also produced the Gorilla LLM and the OpenFunctions model family. Gorilla OpenFunctions v2 is an open-source model specifically built for function calling, achieving performance on par with GPT-4 at the time of its release while supporting multiple programming languages. While BFCL serves as the evaluation benchmark, the OpenFunctions models serve as open-source baselines demonstrating that smaller, specialized models can compete with much larger general-purpose systems on function calling tasks.
The original Gorilla paper introduced APIBench, a dataset of HuggingFace, TorchHub, and TensorHub APIs, along with retriever-aware training (RAT), a technique that enables models to adapt to test-time changes in APIs such as version updates and argument modifications. BFCL built on the evaluation lessons from APIBench while substantially expanding the scope and rigor of testing.
Since its launch, BFCL has been widely adopted across the AI industry as a primary benchmark for function calling evaluation. Model developers including OpenAI, Anthropic, Google, Meta, Mistral, and numerous open-source projects report their BFCL scores alongside other benchmark results.
BFCL's approach to data contamination detection through its live dataset (V2) has been particularly influential. By comparing model performance on the static V1 set versus the continuously refreshed V2 set, researchers can identify models that may have memorized test data during training. This methodology has been adopted as a best practice in benchmark design more broadly.
The benchmark's evolution from single-turn to agentic evaluation tracks the industry's shift from basic prompt engineering toward building autonomous AI agents. V4's focus on multi-hop reasoning, error recovery, and memory management directly addresses the capabilities needed for production agent deployments.
BFCL is also available as a pip-installable package (bfcl-eval) and supports integration with experiment tracking tools like Weights & Biases (WandB). NVIDIA has incorporated BFCL evaluation into its NeMo Microservices platform for enterprise model evaluation.
BFCL provides a command-line interface for generating model responses and scoring them.

Generation phase:

```shell
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY
```

Evaluation phase:

```shell
bfcl evaluate --model MODEL_NAME --test-category TEST_CATEGORY
```
The framework supports both API-based models (GPT, Claude, Gemini, Mistral, Amazon Nova) and self-hosted models through vLLM or SGLang backends. GPU utilization, thread count, and LoRA adapter support are configurable. Results are output as CSV files covering overall, live, non-live, and multi-turn score breakdowns.
The complete BFCL dataset is available on Hugging Face under the gorilla-llm/Berkeley-Function-Calling-Leaderboard repository. The dataset includes all question-function-answer pairs, function definitions, and ground truth annotations across all versions.
While BFCL has become the standard benchmark for function calling, several limitations have been noted by the research community: