The Berkeley Function Calling Leaderboard (BFCL) is a comprehensive benchmark created by UC Berkeley's Gorilla project for evaluating the ability of large language models (LLMs) to invoke functions, APIs, and tools accurately. First released in February 2024, BFCL has become the de facto standard for measuring function calling performance across the AI industry. The benchmark uses a novel Abstract Syntax Tree (AST) evaluation method that enables scalable testing across thousands of functions without requiring actual execution. As of 2025, BFCL has evolved through four major versions, progressing from simple single-turn function calls to complex agentic evaluations involving multi-hop reasoning, error recovery, and memory management.
BFCL was presented as a paper titled "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models" at the 42nd International Conference on Machine Learning (ICML 2025). All leaderboard data, evaluation code, and model results are released under the Apache 2.0 license.
Function calling (also referred to as tool use) is a core capability that allows LLMs to interact with external services, databases, and APIs rather than relying solely on their internal knowledge. As LLM-powered AI agents became more prevalent in 2023 and 2024, the need for a rigorous, standardized benchmark to evaluate this capability grew urgent. Prior to BFCL, there was no comprehensive evaluation framework dedicated to assessing how accurately models could generate structured function calls in response to natural language queries.
The challenge of evaluating function calling at scale is significant. Deterministic validation of function calls typically requires executing the corresponding functions, which introduces dependencies on external services, authentication, rate limits, and infrastructure costs. This makes large-scale evaluation impractical. BFCL addressed this problem by introducing an AST-based evaluation strategy that avoids the need for function execution entirely, drawing on techniques from programming language research.
BFCL emerged from the broader Gorilla project at UC Berkeley, which began with the "Gorilla: Large Language Model Connected with Massive APIs" paper published at NeurIPS 2024. The Gorilla model, a fine-tuned LLaMA-based system, demonstrated that LLMs could be trained to accurately invoke over 1,600 API calls while reducing hallucination. BFCL extended this work from model development to model evaluation, providing the community with an open benchmark for comparing any model's function calling capabilities.
BFCL was created by researchers at UC Berkeley's Sky Computing Lab, who are the primary authors of the ICML 2025 paper.
The project is hosted on GitHub under the ShishirPatil/gorilla repository and is maintained as a community-driven, open-source effort.
The central innovation of BFCL is its use of Abstract Syntax Tree (AST) sub-string matching as a proxy for actual function execution. An AST is a tree representation of the syntactic structure of source code, commonly used in compilers and interpreters. BFCL parses the function calls generated by an LLM into AST form and then compares the resulting structure against a set of known correct answers.
The AST evaluation process works as follows: the model's output is parsed into an AST; the function name is checked against the ground truth; every required parameter must be present; and each parameter value is compared against a list of accepted answers, which may include multiple valid values and types.
This approach offers a significant advantage in scalability. Because it does not require running the actual functions, BFCL can evaluate function calls across thousands of distinct APIs and multiple programming languages without any infrastructure for executing those functions.
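The comparison described above can be sketched with Python's built-in `ast` module. This is a minimal illustration, not BFCL's actual grader: the function name, parameter names, and accepted values below are made up, and the real harness also handles positional arguments, type coercion, and nested structures.

```python
import ast

def parse_call(source: str) -> tuple[str, dict]:
    """Parse a single function-call string into (name, keyword arguments)."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    name = ast.unparse(node.func)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs

def matches(model_output: str, expected_name: str, accepted: dict) -> bool:
    """Check the call's name and each parameter against accepted values."""
    try:
        name, kwargs = parse_call(model_output)
    except (SyntaxError, ValueError):
        return False
    if name != expected_name:
        return False
    # Every parameter must take one of its accepted values.
    return all(kwargs.get(k) in vals for k, vals in accepted.items())

print(matches("get_weather(city='Berkeley', unit='celsius')",
              "get_weather",
              {"city": ["Berkeley"], "unit": ["celsius", "Celsius"]}))  # True
```

Because only the parsed structure is inspected, no `get_weather` function ever needs to exist or run, which is what makes the approach scale across thousands of APIs.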
Alongside AST evaluation, BFCL includes an executable evaluation track where function calls are actually executed against real endpoints or locally defined functions. In this track, manually written Python functions inspired by free REST API endpoints compute results directly, and the outputs are compared against expected values.
The BFCL authors have shown that AST-based evaluation and executable evaluation produce strongly correlated results. This correlation validates the AST approach as a reliable proxy for execution-based testing, while being far more practical at scale.
BFCL also evaluates a model's ability to determine when a function call should or should not be made. In irrelevance detection scenarios, the model is presented with function choices that are not relevant to the user's query. The correct behavior is to decline to invoke any function. In relevance detection scenarios, the model must correctly identify that at least one of the available functions is appropriate and proceed to call it.
BFCL has progressed through four major versions, each expanding the scope and complexity of evaluation.
The first version established the core benchmark with 2,000 question-function-answer pairs designed by domain experts. V1 focused exclusively on single-turn interactions where the model receives a query and a set of function definitions, then must produce the correct function call in a single response.
Test categories in V1 included:
| Category | Description | Evaluation Method |
|---|---|---|
| Simple Function (Python) | Single function call with one available function | AST + Executable |
| Simple Function (Java) | Single function call in Java syntax | AST |
| Simple Function (JavaScript) | Single function call in JavaScript syntax | AST |
| Multiple Function | Choosing the correct function from several options | AST + Executable |
| Parallel Function | Invoking the same function multiple times with different parameters | AST + Executable |
| Parallel Multiple Function | Combining parallel and multiple function patterns | AST + Executable |
| REST API | Function calls targeting RESTful API endpoints | Executable |
| Relevance Detection | Determining if any function is relevant to the query | Relevance metric |
Dataset composition by language and category:
| Language / Category | Number of Test Cases |
|---|---|
| Python | 1,680 |
| Java | 100 |
| JavaScript | 50 |
| REST API | 70 |
| SQL | 100 |
The dataset covered 40 sub-domains including mathematics, finance, sports, and law. V1 found that proprietary models such as GPT-4 demonstrated superior performance in multiple and parallel function scenarios, while open-source models achieved comparable performance in simple function calling.
V2 addressed a critical limitation of V1: the risk of data contamination. Because V1's test data was static and publicly available, there was a concern that models could achieve inflated scores by memorizing the test set during training. V2 introduced a "live" dataset composed entirely of real-world function calling queries submitted by actual users.
Data collection details:
The raw dataset was constructed from 64,517 queries received by the BFCL hosted endpoint between February 26 and April 1, 2024. A three-stage pipeline processed this raw data through deduplication (using ROUGE-L scores and text embeddings), quality filtering, and standardization while preserving the original intent.
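The deduplication stage can be sketched as a greedy similarity filter. This is a simplified stand-in: `difflib.SequenceMatcher` substitutes here for the ROUGE-L scores and text embeddings the actual pipeline used, and the threshold and example queries are illustrative.

```python
from difflib import SequenceMatcher

def dedupe(queries: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a query only if it is not too similar to any already-kept one
    (rough stand-in for ROUGE-L / embedding-based deduplication)."""
    kept = []
    for q in queries:
        if all(SequenceMatcher(None, q, k).ratio() < threshold for k in kept):
            kept.append(q)
    return kept

raw = [
    "book a flight from SFO to JFK tomorrow",
    "book a flight from SFO to JFK tomorrow!",   # near-duplicate, dropped
    "what is the weather in Berkeley today",
]
print(dedupe(raw))
```

The real pipeline additionally applied quality filtering and standardization passes after this step.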
Final dataset composition (2,251 entries):
| Category | Number of Entries |
|---|---|
| Simple | 258 |
| Multiple | 1,053 |
| Parallel | 16 |
| Parallel Multiple | 24 |
| Irrelevance Detection | 882 |
| Relevance Detection | 18 |
Entries in V2 offered an average of 3 function choices (maximum 37), and each function averaged 4 parameters (maximum 28). Contributors included large banks, technology corporations, agent developers, hobbyists, and enterprises.
Key findings from V2:
V3 represented a substantial expansion of the benchmark's scope by introducing multi-turn and multi-step function calling scenarios. While V1 and V2 tested only single-turn interactions (one query, one response), V3 evaluated how models perform in back-and-forth conversations that require planning, sequential reasoning, and state tracking across multiple dialogue turns.
Test categories in V3:
| Category | Test Cases | Description |
|---|---|---|
| Base Multi-Turn | 200 | Foundational multi-turn scenarios where all necessary information is available through user requests or prior execution results |
| Missing Parameters | 200 | Tests the model's ability to request clarification when essential information is absent |
| Missing Functions | 200 | Models must recognize when available functions are insufficient to fulfill the request |
| Long-Context Multi-Turn | 200 | Evaluates performance when large volumes of extraneous data are present |
| Composite | 200 | Combines all three augmented challenges simultaneously |
Total: 1,000 multi-turn test cases across eight API domains: Vehicle Control, Trading Bots, Travel Booking, Gorilla File System, Message API, Twitter API, Ticket API, and Math API.
Evaluation methodology shift:
V3 moved beyond AST matching to state-based verification for multi-turn scenarios. Instead of comparing the structure of individual function calls, the evaluation system checks the actual state of the backend API system (file systems, booking records, database modifications) after the model executes its sequence of function calls. This is complemented by response-based evaluation for read-only requests, ensuring models invoke necessary exploratory functions rather than hallucinating results.
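A minimal sketch of state-based verification, assuming a toy file-system backend (the class and its methods are hypothetical stand-ins for BFCL's API domains): the grader compares the backend's final state against the expected state, so any trajectory of calls that reaches that state passes.

```python
class MockFileSystem:
    """Minimal stateful backend; only its final state is graded."""
    def __init__(self):
        self.files: dict[str, str] = {}
    def touch(self, name: str):
        self.files.setdefault(name, "")
    def write(self, name: str, text: str):
        self.files[name] = text
    def rm(self, name: str):
        self.files.pop(name, None)

def run_turn(fs: MockFileSystem, calls: list[tuple[str, tuple]]):
    """Execute a model's sequence of (function, args) calls against the backend."""
    for fn, args in calls:
        getattr(fs, fn)(*args)

expected = {"notes.txt": "hello"}

fs = MockFileSystem()
# An extra touch/rm detour is fine: only the end state is checked.
run_turn(fs, [("touch", ("scratch.txt",)),
              ("rm", ("scratch.txt",)),
              ("write", ("notes.txt", "hello"))])
print(fs.files == expected)  # True
```

This is why state-based checking tolerates different but equally valid call orderings, which pure AST matching on each call cannot.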
Data curation process:
The V3 dataset underwent 11 rounds of data filtering. The team emphasized that synthetic data alone was insufficient and that human labeling was essential. Ground truth trajectories were manually labeled by experts, with validation covering question clarity, executability, initial configurations, and API code compliance.
Key failure patterns identified in V3:
V4, announced at ICML 2025, introduced holistic agentic evaluation. Recognizing that function calling forms the foundation of agentic AI systems, V4 tests tool-calling in real-world agentic settings across three new evaluation dimensions: web search with multi-hop reasoning, agent memory management, and format sensitivity.
The web search evaluation assesses how effectively models can answer questions that require retrieving and integrating information from multiple sources through sequential search queries. For example, answering "Who founded the fastest-growing U.S. business in 2024?" requires first identifying the business, then searching for its founder.
Hop distribution: Questions span 2-hop, 3-hop, and 4-hop categories across domains including technology, business, geography, and culture.
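The multi-hop pattern can be sketched as a loop in which each hop's answer is substituted into the next query. The `search` function and its lookup table below are hypothetical placeholders for a real web-search tool; "Acme Robotics" and "Jane Doe" are invented values for illustration.

```python
def search(query: str) -> str:
    """Stand-in for a real web-search tool (hypothetical lookup table)."""
    kb = {
        "fastest-growing U.S. business in 2024": "Acme Robotics",
        "founder of Acme Robotics": "Jane Doe",
    }
    return kb.get(query, "no result")

def answer_two_hop(first_query: str, second_query_template: str) -> str:
    # Hop 1: resolve the intermediate entity.
    entity = search(first_query)
    # Hop 2: feed the intermediate answer into the follow-up query.
    return search(second_query_template.format(entity))

print(answer_two_hop("fastest-growing U.S. business in 2024",
                     "founder of {}"))  # Jane Doe
```

Three- and four-hop questions extend this chain, so an error in any intermediate hop propagates to the final answer.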
Error injection for robustness testing: BFCL V4 injects randomized real-world failures into the search pipeline to test error recovery capabilities. The injected errors include:
| Error Type | Description |
|---|---|
| 503 Server Error | Service unavailable |
| 429 Too Many Requests | Rate limiting |
| 403 Forbidden | Access denied |
| ConnectTimeout | Connection timeout |
| ReadTimeout | Read timeout |
| ConnectionError | Network connection failure |
Models are tested at 0%, 50%, and 100% URL fetch failure rates. The evaluation also tests models with and without DuckDuckGo search snippets, and with tools entirely disabled to measure reliance on internal knowledge versus genuine search use.
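Error injection at a configurable failure rate can be sketched as follows, using the error types from the table above; the fetch function, URL, and return value are illustrative rather than BFCL's actual harness.

```python
import random

INJECTED_ERRORS = ["503 Server Error", "429 Too Many Requests", "403 Forbidden",
                   "ConnectTimeout", "ReadTimeout", "ConnectionError"]

def fetch_with_injection(url: str, failure_rate: float, rng: random.Random) -> str:
    """Simulate a URL fetch that fails with the configured probability."""
    if rng.random() < failure_rate:
        raise RuntimeError(rng.choice(INJECTED_ERRORS))
    return f"<contents of {url}>"

rng = random.Random(0)  # seeded for reproducibility
results = []
for _ in range(100):
    try:
        results.append(fetch_with_injection("https://example.com", 0.5, rng))
    except RuntimeError as exc:
        results.append(f"error: {exc}")

failures = sum(r.startswith("error") for r in results)
print(failures)  # roughly half of the fetches fail at a 50% rate
```

A model under test must notice these failures and retry, reformulate, or fall back to other sources rather than hallucinating page contents.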
Scoring: Responses must follow a structured format with answer and context fields. Evaluation uses exact-match metrics on the answer field, with text normalization (lowercase conversion, punctuation removal) before comparison.
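The normalization step before exact matching can be sketched like this (the example strings are illustrative):

```python
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())

def exact_match(predicted: str, gold: str) -> bool:
    return normalize(predicted) == normalize(gold)

print(exact_match("Jane Doe.", "jane doe"))  # True
```

Normalizing both sides makes the metric robust to superficial formatting differences while still demanding the exact answer string.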
Key web search findings:
The memory evaluation tests how models manage conversational memory using dedicated memory tools. This assesses the ability to store, retrieve, update, and reason over information accumulated across a conversation.
Three memory architectures are evaluated:
| Architecture | Mechanism | Best For |
|---|---|---|
| Key-Value Store | BM25+ keyword search for exact lookups | Structured recall with clear naming conventions |
| Vector Store | FAISS with all-MiniLM-L6-v2 embeddings for similarity-based retrieval | Semantic generalization across phrasing variations |
| Recursive Summarization | Running text buffer that accumulates and compresses conversation history (10,000 character limit) | Long conversation tracking with periodic compression |
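The recursive-summarization architecture can be sketched as a self-compressing buffer. This is a toy version: where a real system would call an LLM to summarize older turns, the placeholder `_compress` step simply truncates and tags them, and the character limit is shrunk for demonstration.

```python
class SummaryBuffer:
    """Running text buffer that compresses itself past a size limit."""
    def __init__(self, limit: int = 10_000):
        self.limit = limit
        self.text = ""

    def append(self, turn: str):
        self.text += turn + "\n"
        if len(self.text) > self.limit:
            self._compress()

    def _compress(self):
        # Placeholder compression: keep the most recent half and mark
        # the discarded prefix as summarized (a real system would call
        # an LLM to produce an actual summary here).
        keep = self.text[-self.limit // 2:]
        self.text = "[summary of earlier turns]\n" + keep

buf = SummaryBuffer(limit=50)  # tiny limit for demonstration
for i in range(10):
    buf.append(f"turn {i}: user said something")
print(len(buf.text) <= 80)  # buffer stays bounded despite 10 turns
```

The trade-off, reflected in the table above, is that periodic compression preserves long-conversation tracking at the cost of losing fine-grained detail from early turns.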
Five practical domains are tested:
The evaluation uses a "snapshot-and-reload" approach: models first populate memory through multi-turn prerequisite conversations, then the memory state is serialized and preserved. Follow-up evaluation questions test retrieval accuracy without the benefit of prior dialogue context.
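The snapshot-and-reload idea reduces to serializing memory after the prerequisite turns and answering follow-ups against the reloaded state alone. The stored facts below are hypothetical, and a plain dict with `json` stands in for whichever memory backend is under test.

```python
import json

# Phase 1: prerequisite conversation populates memory (hypothetical facts).
memory = {}
memory["user_name"] = "Alex"
memory["favorite_city"] = "Berkeley"

# Snapshot: serialize the memory state once the prerequisite turns finish.
snapshot = json.dumps(memory)

# Phase 2: follow-up questions run against the reloaded state only,
# with no access to the original dialogue transcript.
reloaded = json.loads(snapshot)
print(reloaded["favorite_city"])  # Berkeley, recalled without prior context
```

Separating the two phases ensures the score measures what the memory tools actually retained, not what the model can re-read from a long context window.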
Key memory findings:
The format sensitivity evaluation examines how consistently models produce correct output when the input prompt format varies. The principle is that a model should produce the same correct function call regardless of how the prompt or expected output format is specified.
Five dimensions are tested:
| Dimension | Variations |
|---|---|
| Return format | Python, JSON, verbose XML, concise XML |
| Function documentation format | Python, XML, JSON |
| Tool calling tags | With or without XML-style wrappers |
| Prompt format | Plaintext vs. Markdown |
| Prompt style | Variations in tone and phrasing |
A total of 26 distinct prompt configurations are generated by combining these variations. Because running every configuration over the full dataset is costly, evaluation uses a sampled subset; testing across 20 models confirmed a strong linear relationship between sampled and full-dataset results, validating the sample's representativeness.
Key format sensitivity findings:
The BFCL leaderboard is hosted at gorilla.cs.berkeley.edu/leaderboard.html and is updated periodically as new models are evaluated. Rankings reflect overall accuracy across all test categories.
| Rank | Model | Organization | Overall Score |
|---|---|---|---|
| 1 | GLM-4.5 (FC) | Zhipu AI | 70.85% |
| 2 | Claude Opus 4.1 | Anthropic | 70.36% |
| 3 | Claude Sonnet 4 | Anthropic | 70.29% |
| 7 | GPT-5 | OpenAI | 59.22% |
The rankings reveal that top models excel at single-turn function calling but continue to face challenges with memory management, long-horizon reasoning, and determining when not to invoke tools.
On the V1/V2 single-turn evaluations (using AST scoring), several models have achieved high scores. Meta's Llama 3.1 405B Instruct model scored 88.5% on the V1 AST evaluation, one of the highest recorded single-turn scores. Open-source models have generally narrowed the gap with proprietary models on simple function calling, though differences remain more pronounced for parallel and multiple function calling scenarios.
BFCL originated within the Gorilla project, which also produced the Gorilla LLM and the OpenFunctions model family. Gorilla OpenFunctions v2 is an open-source model specifically built for function calling, achieving performance on par with GPT-4 at the time of its release while supporting multiple programming languages. While BFCL serves as the evaluation benchmark, the OpenFunctions models serve as open-source baselines demonstrating that smaller, specialized models can compete with much larger general-purpose systems on function calling tasks.
The original Gorilla paper introduced APIBench, a dataset of HuggingFace, TorchHub, and TensorHub APIs, along with retriever-aware training (RAT), a technique that enables models to adapt to test-time changes in APIs such as version updates and argument modifications. BFCL built on the evaluation lessons from APIBench while substantially expanding the scope and rigor of testing.
Since its launch, BFCL has been widely adopted across the AI industry as a primary benchmark for function calling evaluation. Model developers including OpenAI, Anthropic, Google, Meta, Mistral, and numerous open-source projects report their BFCL scores alongside other benchmark results.
BFCL's approach to data contamination detection through its live dataset (V2) has been particularly influential. By comparing model performance on the static V1 set versus the continuously refreshed V2 set, researchers can identify models that may have memorized test data during training. This methodology has been adopted as a best practice in benchmark design more broadly.
The benchmark's evolution from single-turn to agentic evaluation tracks the industry's shift from basic prompt engineering toward building autonomous AI agents. V4's focus on multi-hop reasoning, error recovery, and memory management directly addresses the capabilities needed for production agent deployments.
BFCL is also available as a pip-installable package (bfcl-eval) and supports integration with experiment tracking tools like Weights & Biases (WandB). NVIDIA has incorporated BFCL evaluation into its NeMo Microservices platform for enterprise model evaluation.
BFCL provides a command-line interface for generating model responses and scoring them.

Generation phase:

```shell
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY
```

Evaluation phase:

```shell
bfcl evaluate --model MODEL_NAME --test-category TEST_CATEGORY
```
The framework supports both API-based models (GPT, Claude, Gemini, Mistral, Amazon Nova) and self-hosted models through vLLM or SGLang backends. GPU utilization, thread count, and LoRA adapter support are configurable. Results are output as CSV files covering overall, live, non-live, and multi-turn score breakdowns.
The complete BFCL dataset is available on Hugging Face under the gorilla-llm/Berkeley-Function-Calling-Leaderboard repository. The dataset includes all question-function-answer pairs, function definitions, and ground truth annotations across all versions.
While BFCL has become the standard benchmark for function calling, several limitations have been noted by the research community: