Function calling

Function calling (also called tool use or tool calling) is a capability of large language models (LLMs) that allows the model to generate structured JSON output specifying which external function or API to invoke and what arguments to pass. Rather than producing only natural language text, a model with function calling capabilities can decide during generation that it needs to call a specific tool, output the function name and parameters in a structured format, and then incorporate the result of that function call into its subsequent response.

Function calling was introduced as a formal API feature by OpenAI on June 13, 2023, and has since been adopted by every major LLM provider, including Anthropic (Claude), Google (Gemini), Mistral AI, xAI (Grok), and Meta (Llama). The capability is foundational to building AI agents, enabling LLMs to interact with databases, APIs, file systems, code interpreters, and other external systems in a controlled, structured manner [1][2].

In modern usage, function calling is often the lowest-level building block beneath higher abstractions such as tool use, the Model Context Protocol (MCP), and built-in agent platforms like the OpenAI Responses API and xAI Agent Tools API. By 2026, every major frontier model treats function calling as a default feature trained directly into the base model rather than a thin wrapper layered on top.

how function calling works

The function calling process follows a well-defined loop involving the developer, the model, and the application layer. Understanding this loop is essential to building reliable tool-augmented AI systems.

step 1: define available functions

The developer provides the model with a list of available functions (tools) as part of the API request. Each function definition includes a name, a natural language description of what the function does, and a JSON Schema specifying the expected parameters and their types. Most providers use a subset of the OpenAPI/JSON Schema specification, which means features like recursive schemas, references, or rare keywords may not be supported [3].

For example, a weather application might define a function like:

{
  "name": "get_weather",
  "description": "Get the current weather for a given location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name, e.g. San Francisco"
      },
      "unit": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"]
      }
    },
    "required": ["location"]
  }
}

The model uses the function descriptions and parameter schemas to decide when and how to call each function. Clear, specific descriptions significantly improve the model's ability to choose the right function and provide correct arguments [1].

step 2: model decides to call a function

When the user sends a message (for example, "What's the weather in Tokyo?"), the model analyzes the query against the available function definitions. If the model determines that a function call would help it answer the query, it generates a structured output indicating the function to call and the arguments to pass, rather than generating a direct text response.

The model does not execute the function itself. It outputs a structured request that the application must intercept and act upon. This is a critical design choice: the model proposes actions, but the application controls execution [4].

step 3: application executes the function

The application parses the model's function call output, validates the arguments, and executes the actual function (for example, calling a weather API with the specified location). This step happens entirely outside the model. The developer's code is responsible for the actual execution, error handling, and any security checks. Validation typically uses runtime libraries like Pydantic in Python or Zod in TypeScript to confirm that arguments match the declared schema.

step 4: return results to the model

The function's result is sent back to the model as a new message in the conversation. The model then uses this result to generate its final response to the user. For example, after receiving the weather data, the model might respond: "The current weather in Tokyo is 22 degrees Celsius with partly cloudy skies."

This four-step loop can repeat multiple times within a single interaction. The model might call one function, examine the result, decide it needs additional information, call another function, and then synthesize all results into a final response. This iterative process is sometimes called the agentic loop, and it sits at the core of every tool-using LLM application [4].

diagram of the function calling flow

Step	Actor	Action
1	Developer	Defines available functions with names, descriptions, and JSON schemas
2	User	Sends a message or query
3	Model	Analyzes query, decides to call a function, outputs function name and arguments as JSON
4	Application	Parses function call, validates arguments, executes the actual function
5	Application	Returns function result to the model as a new message
6	Model	Incorporates result into final response, or makes additional function calls

history

before formal function calling

Before API-level function calling support, developers used ad hoc methods to get LLMs to interact with external systems. Common approaches included instructing the model to output JSON in a specific format within its text response, then parsing that text with regular expressions or custom parsers. These approaches were fragile. Models would frequently produce malformed JSON, include extra text around the JSON, or forget the required format entirely.

The early ReAct paper (Yao et al., 2022) formalized a pattern where the model alternates between reasoning steps and action steps, with the actions being tool invocations parsed out of free-form text. Some early frameworks like LangChain (released October 2022) provided abstractions for tool use that worked around these limitations through careful prompt engineering and output parsing. Results were unreliable compared to native function calling support [5].

OpenAI introduces function calling (June 2023)

OpenAI launched function calling as a first-class API feature on June 13, 2023, alongside the release of updated GPT-3.5 Turbo and GPT-4 models specifically fine-tuned for function calling. The new model snapshots gpt-4-0613 and gpt-3.5-turbo-0613 were trained to detect when a function should be called and to respond with structured JSON. Developers could now define functions using JSON Schema and receive structured, validated function call outputs from the model with high reliability [1][6].

The initial release supported a single function call per turn. The API used a function_call parameter in the request and returned a function_call field in the response containing the function name and arguments as a JSON string. OpenAI explicitly framed this as an implementation of the ReAct pattern, but with models fine-tuned to execute it reliably rather than improvised through prompting [6].

Anthropic tool use (May 2024)

Anthropic introduced tool use for Claude models in beta during early 2024 and made it generally available on May 30, 2024 across the entire Claude 3 model family on the Anthropic Messages API, Amazon Bedrock, and Google Cloud Vertex AI [7]. Anthropic uses a slightly different API design. Rather than a separate function_call field, Anthropic integrated tool use into its content block structure. Claude models return a tool_use content block containing the tool name, a unique ID, and the input arguments. Results are sent back as tool_result content blocks referencing that ID. This design was motivated by Anthropic's broader content block architecture, which treats text, images, and tool calls as parallel content types within a single message [8].

Anthropic's GA release also introduced the tool_choice parameter (with values auto, any, or a specific tool name) and added support for tools that return images, enabling multimodal tool results. Later versions added strict tool definitions that guarantee schema conformance through constrained decoding.

Google Gemini function calling (December 2023)

Google's Gemini models support function calling through a functionCall response type, declared via FunctionDeclaration objects that use a subset of the OpenAPI schema format. Google's implementation supports both single-turn and multi-turn function calling patterns and exposes a function_calling_config parameter with modes AUTO, ANY, VALIDATED, and NONE to control whether the model is forced to emit a tool call [3]. Gemini 3 added unique IDs for each function call so applications can map results back unambiguously, which matters when several calls run in parallel.

Mistral, Llama, and xAI catch up (2024)

Mistral Large 2 (mistral-large-2407), released July 24, 2024, was trained for both parallel and sequential function calling on La Plateforme and Le Chat [9]. Meta's Llama 3.1 release on July 23, 2024 added native tool calling for the first time in the Llama family, including the 405B parameter open-weight model that briefly topped the BFCL leaderboard [10]. xAI added tool calling to Grok in 2024 and shipped a dedicated Agent Tools API alongside Grok 4.1 Fast on November 19, 2025, treating function calling as a server-managed primitive rather than a parameter on the chat endpoint [11].

Parallel function calling (November 2023)

In November 2023, OpenAI introduced parallel function calling with the release of GPT-4 Turbo. This feature allows the model to request multiple function calls in a single turn rather than calling functions one at a time. For example, if a user asks "What's the weather in Tokyo and New York?", the model can output two function calls simultaneously, allowing the application to execute them in parallel and return both results at once [1].

Parallel function calling significantly reduces the number of round trips between the application and the API, improving both latency and user experience for multi-tool queries. The feature can be controlled with the parallel_tool_calls parameter. Setting it to false ensures at most one tool call per turn. Anthropic, Google, Mistral, xAI, and Meta have since added equivalent parallel-call support to their own APIs.

Evolution of terminology

OpenAI initially used the term "function calling" but later transitioned to "tool calling" in its API, renaming the functions parameter to tools and function_call to tool_choice. The older function calling format is now considered legacy. Despite this rename, "function calling" remains the more widely recognized term in the broader community, and most providers use both terms interchangeably [1].

Responses API and built-in tools (March 2025)

On March 11, 2025, OpenAI launched the Responses API, a new API primitive that combines elements of the Chat Completions and Assistants APIs and natively supports built-in server-side tools including web search, file search, and computer use [12]. The same release introduced the open-source OpenAI Agents SDK and shifted the agent development story away from raw tools arrays toward higher-level agent objects with built-in tool routing. Reasoning models such as o3 and o4-mini can call tools and functions directly within their chain of thought in the Responses API, producing more contextually grounded answers and significantly higher tool-calling scores on industry benchmarks [12].

Anthropic added equivalent server tools (web_search, code_execution, web_fetch, tool_search) and an MCP connector to the Messages API in 2025, while Google added live tool routing inside Gemini 3, and xAI shipped the Agent Tools API alongside Grok 4.1 Fast in November 2025 [11].

json schema and tool definitions

Function definitions across all major providers share three required pieces: a name (an identifier the model emits when calling), a description (natural language guidance about when and how to use the tool), and a parameters block (a JSON Schema describing the arguments).

Most providers support a useful subset of JSON Schema:

type (object, string, number, integer, boolean, array)
properties and required
enum for fixed value sets
description on individual fields
items for arrays
oneOf, anyOf, allOf (with provider-specific limits)

Deeply nested objects, recursive references, and arbitrary $ref chains are commonly rejected or silently flattened. Gemini's documentation explicitly notes that very large or deeply nested schemas may be rejected in ANY mode [3]. OpenAI's strict mode and Anthropic's strict tool use both impose additional restrictions in exchange for guaranteed schema conformance, including required additionalProperties: false and a small set of supported keywords.

A practical rule: keep schemas flat, name fields descriptively, use enum whenever there is a closed set of options, and lean on the description field to disambiguate near-duplicate parameters. The model treats descriptions as part of the prompt, so a clear description of unit: "celsius" | "fahrenheit" does more work than a long if-then in the system prompt.

provider comparison

Each major LLM provider implements function calling with a different API design, though the core concept is the same across all of them.

Feature	OpenAI (GPT-4o, GPT-4.1, o-series)	Anthropic (Claude)	Google (Gemini)	Mistral AI	xAI (Grok)	Meta (Llama) 3.1+
API term	Tool calls	Tool use	Function calling	Tool calls	Tool calls / Agent Tools API	Tool calls
First public release	June 13, 2023	Beta in early 2024, GA May 30, 2024	December 2023 (Gemini API)	February 2024	2024, Agent Tools API Nov 19, 2025	July 23, 2024 (Llama 3.1)
Definition format	JSON Schema in `tools` array	JSON Schema in `tools` array with `input_schema`	`FunctionDeclaration` (OpenAPI subset)	JSON Schema in `tools` array	JSON Schema in `tools` array	JSON Schema or built-in tool tags
Response format	`tool_calls` array with `function.name` and `function.arguments`	`tool_use` content block with `name` and `input`	`functionCall` with `name` and `args` (with unique `id` in Gemini 3)	`tool_calls` with `function` object	`tool_calls` array	`<
Parallel calls	Yes, default on, `parallel_tool_calls` parameter	Yes, supported across Claude 3, 4, and 4.x families	Yes, with per-call IDs in Gemini 3	Yes (Mistral Large 2 onward)	Yes	Yes
Strict / schema enforcement	`strict: true` (Structured Outputs)	`strict: true` (strict tool use)	`function_calling_config: ANY` plus VALIDATED mode	JSON mode	Schema-aware decoding	Depends on inference engine (vLLM, llama.cpp)
Force tool use	`tool_choice: {"type": "function", "function": {"name": "..."}}`	`tool_choice: {"type": "tool", "name": "..."}`	`tool_config` with `ANY` mode	`tool_choice: "any"` or specific function	`tool_choice` with function name	Provider-dependent
Streaming	Yes, with delta chunks for tool call arguments	Yes, with `content_block_delta` events	Yes	Yes	Yes	Yes
Built-in server tools	Web search, file search, computer use, code interpreter, image generation, MCP (Responses API, March 2025)	Web search, code execution, web fetch, tool search, MCP connector	Google Search, code execution, URL context	Limited	Web search, X search, file search, code execution, MCP	None native; via inference frameworks

The overall picture by 2026: every provider exposes the same conceptual loop, but field names and content-block structures differ enough that most production codebases either pin to one provider or use an adapter library such as LiteLLM, LangChain, or the Vercel AI SDK to abstract the differences.

structured output and function calling

Function calling is closely related to, and often overlaps with, structured output capabilities. The two features share most of their underlying machinery (JSON Schema and constrained decoding) but solve slightly different problems.

json mode

JSON mode, available from OpenAI, Google, Mistral, and others, constrains the model to produce valid JSON in its response without enforcing a specific schema. It is useful when you need structured output but the schema varies between requests, or when an old code path expects a JSON blob and you do not want to refactor it into a tool definition.

structured outputs

OpenAI introduced Structured Outputs on August 6, 2024, going beyond JSON mode by guaranteeing that the model's output conforms to a developer-supplied JSON Schema. The new response_format option accepts a json_schema field, and a parallel strict: true flag on tool definitions delivers the same guarantee for function calls. Under the hood, OpenAI uses constrained decoding (sometimes called constrained sampling) to restrict the model's token generation at each step to only tokens that would produce valid JSON matching the schema. The launch model gpt-4o-2024-08-06 was trained to handle complicated schemas and reportedly hit 100% schema conformance on internal evaluations [13].

Anthropic added equivalent strict tool use in 2025, building on its existing tool use mechanism. Developers define a tool with an input_schema and add strict: true, and Claude returns structured data with the same conformance guarantees. Google's Gemini API offers responseSchema and the VALIDATED function calling mode for similar guarantees, and Mistral exposes JSON mode plus schema-constrained outputs on La Plateforme.

Approach	Schema enforcement	Provider support	Best for
JSON mode	Guarantees valid JSON, no schema enforcement	OpenAI, Google, Mistral, Anthropic	Variable-structure JSON output
Structured Outputs	Guarantees JSON matching a specific schema	OpenAI (native, Aug 2024), Anthropic (strict tool use, 2025), Google (responseSchema), Mistral (json_object with schema)	Strict data extraction, form filling, tool args
Function calling	Guarantees structured function call output matching tool schema	All major providers	Tool invocation with typed arguments

reasoning models and tool use

Reasoning-focused models that produce extended chains of thought have changed how tool use is integrated into the generation loop. OpenAI's o-series (o1, o3, o3-mini, o4-mini) and the GPT-4.1 line can call tools directly inside their reasoning traces in the Responses API, interleaving thought and action without a full round-trip to the application after every step [12]. This pattern improves complex tool-calling benchmarks by letting the model think, call, observe, and revise within a single response.

Claude's extended thinking, available on Claude Sonnet 4.5 and Claude Opus 4.x, similarly supports tool use during the thinking phase. Google's Gemini Deep Think and the Gemini 3 family expose a comparable interleaved tool mode, and xAI's grok-4-1-fast-reasoning is explicitly marketed as the reasoning variant of Grok 4.1 Fast for complex tool workflows [11]. The general trend by 2026 is that the boundary between "reasoning" and "tool use" has dissolved. Frontier models reason, call tools, observe results, and reason again as a single continuous process.

frameworks and ecosystem

The ecosystem around function calling has grown well beyond raw API calls. Developers commonly use higher-level frameworks that handle the agentic loop, memory, multi-agent coordination, and integration with vector stores or databases.

OpenAI Agents SDK (launched March 11, 2025) is OpenAI's open-source replacement for the experimental Swarm SDK, offering Agents, Handoffs, Guardrails, and tracing on top of the Responses API [12].
LangChain and LangGraph offer provider-agnostic tool abstractions and a graph-based runtime for stateful multi-step agents.
LlamaIndex provides agent loops and tight integration with retrieval pipelines.
CrewAI focuses on multi-agent collaboration with role-based agents.
AutoGen (Microsoft Research) supports conversational multi-agent systems with structured tool calling.
The Anthropic computer use API and OpenAI computer use tool let models drive a full desktop through screenshots and synthesized keyboard or mouse events, with each action expressed as a tool call.
Browser automation systems such as OpenAI's Operator, Google's Project Mariner, and the Anthropic computer use reference implementation extend the same pattern to a real browser.

These frameworks all share a common core: a function calling loop that emits tool calls, executes them, and feeds results back into the conversation.

benchmarks

Function calling capability is now evaluated by a small family of public benchmarks, with the Berkeley Function Calling Leaderboard (BFCL) and Sierra Research's tau-bench family being the most influential.

Berkeley Function Calling Leaderboard (BFCL)

The Berkeley Function Calling Leaderboard (Yan et al., 2024) is the de facto standard for evaluating tool use in LLMs. It uses an Abstract Syntax Tree (AST) evaluation method that scales to thousands of functions and tests both serial and parallel function calls across multiple programming languages [14]. BFCL has gone through four major versions:

BFCL v1 (2024) focused on single-turn function calling correctness.
BFCL v2 added multilingual coverage and live functions.
BFCL v3 added multi-turn evaluation.
BFCL v4 (2025) added agentic evaluation, including web search, memory, and format sensitivity. The overall score is computed as Agentic 40%, Multi-Turn 30%, Live 10%, Non-Live 10%, and Hallucination 10%.

The BFCL paper appeared at ICML 2025 (Patil et al.) under the title "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models" [14]. As of April 2026, top scores cluster between 0.70 and 0.89 on the public leaderboard, with Llama 3.1 405B Instruct at the top of the open-weight rankings (0.885) and a tight pack of frontier proprietary models behind it [14][15].

tau-bench and tau2-bench

Sierra Research released tau-bench in 2024 and the harder follow-up tau2-bench in 2025 (arXiv 2506.07982). tau-bench simulates dynamic conversations between a user (also played by an LLM) and a tool-using language agent that must follow domain-specific policies. It originally covered Mock, Airline, and Retail domains. tau2-bench (Barres et al., 2025) introduced a dual-control environment where both the agent and the user can modify a shared world state, and added a Telecom domain that tests technical troubleshooting, problem solving, and clear communication [16].

The Telecom subset of tau2-bench is one of the hardest public tool-use benchmarks. Modern reasoning models including the GLM 4.7 and GLM 5 series report scores above 98% on the Artificial Analysis tau2-bench Telecom leaderboard, and Grok 4.1 Fast reasoning posts strong results in the same range [11][17]. Older non-reasoning models often fall below 50% on the same domain, which is a useful indicator of how much reasoning depth a model brings to multi-turn tool conversations.

use cases

Function calling enables a broad range of applications by bridging the gap between language understanding and programmatic action.

api integration

The most common use case is connecting an LLM to external APIs. A travel assistant might call flight search, hotel booking, and weather APIs. A productivity assistant might call calendar, email, and task management APIs. Function calling provides the structured interface that makes these integrations reliable.

database queries

Models can generate SQL queries or call database wrapper functions to answer questions about structured data. A business analyst chatbot might receive a question like "What were our top 10 products by revenue last quarter?" and generate a function call to execute the appropriate database query, then present the results in natural language.

calculations and data processing

LLMs are unreliable at arithmetic and precise calculations. Function calling allows models to delegate mathematical operations, unit conversions, statistical analyses, and other computations to dedicated tools that produce exact results. Code interpreter tools take this further by letting the model write and execute Python in a sandbox.

web browsing and search

Models can call web search functions to retrieve up-to-date information beyond their training data. ChatGPT Search, Perplexity AI, and other AI search products use function calling internally to trigger web searches, scrape pages, and retrieve relevant content. The Responses API (OpenAI), Anthropic web_search server tool, Gemini Google Search tool, and xAI Agent Tools API all expose this as a built-in primitive.

code execution

Code interpreter tools allow models to write and execute code in sandboxed environments. OpenAI's Code Interpreter (now called Advanced Data Analysis), Anthropic's analysis tool, the Gemini code execution tool, and the Grok Agent Tools code execution sandbox all rely on function calling to pass generated code to an execution environment and return results.

file and document operations

Models can read, write, and manipulate files through function calls. This is the basis for AI-powered coding assistants like GitHub Copilot, Cursor, and Claude Code, which use tool calls to read source files, make edits, run tests, and interact with version control systems.

multi-step workflows

Complex tasks often require chaining multiple function calls together. An AI agent handling a customer refund might need to look up the customer's order, check the refund policy, calculate the refund amount, process the refund through the payment system, and send a confirmation email. Each step involves a separate function call, with the model orchestrating the sequence based on intermediate results.

desktop and browser automation

The Anthropic computer use API, OpenAI Operator, and Google Project Mariner translate full GUI interactions into tool calls. The model receives screenshots, decides on a click or keystroke, and emits a tool call describing the action. Despite the visual interface, the underlying transport is still function calling.

relationship to ai agents and mcp

Function calling is the foundational mechanism that enables AI agents to take actions in the world. Without it, LLMs can only produce text. With it, they can query databases, call APIs, execute code, and interact with external systems.

ai agents

An agent is a system where an LLM autonomously decides what actions to take, executes those actions via function calls, observes the results, and plans next steps. Frameworks like LangChain, LangGraph, CrewAI, and AutoGen provide abstractions for building agent loops on top of function calling [5].

The quality of function calling directly determines agent reliability. Models that accurately select the right function, provide correct arguments, and handle errors gracefully produce more effective agents. Conversely, function calling failures (wrong function selection, malformed arguments, hallucinated function names) are a primary source of agent failures.

model context protocol (mcp)

Model Context Protocol (MCP) is an open standard introduced by Anthropic in November 2024 that provides a standardized way for AI models to discover and interact with external tools and data sources. While function calling defines how a model invokes a specific tool, MCP standardizes how tools are described, discovered, and connected to models across different applications and providers [18].

MCP gained rapid adoption in 2025. OpenAI added official MCP support in March 2025, Google added MCP for Gemini, and xAI integrated MCP into the Agent Tools API in November 2025 [11]. On December 9, 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation co-founded by Anthropic, Block, and OpenAI, with support from Google, Microsoft, AWS, Cloudflare, and Bloomberg [19]. By early 2026, more than 10,000 MCP servers had been published, covering everything from databases and file systems to enterprise SaaS connectors [18].

MCP complements rather than replaces function calling. MCP provides the transport, discovery, and capability description layer, while function calling provides the actual invocation mechanism that the model uses inside a single conversation.

Concept	Function calling	MCP
What it does	Lets a model invoke a specific function with structured arguments	Standardizes how tools are described, discovered, and connected
Scope	Single model-to-tool interaction	Cross-application tool ecosystem
Defined by	Each LLM provider (OpenAI, Anthropic, Google, etc.)	Open standard (originally Anthropic, now AAIF / Linux Foundation)
Analogy	Making a specific API call	A USB-C connector for AI tools

best practices

Building reliable function calling applications requires attention to several design principles.

clear function descriptions

The model relies on function descriptions to decide when and how to use each tool. Descriptions should be specific, include examples of when the function should be used, and clarify any constraints or limitations. Vague descriptions lead to incorrect tool selection. Treat the description as part of the prompt, because that is exactly how the model uses it [1].

keep schemas flat and use enums

Deeply nested schemas hurt accuracy and increase token cost. Prefer flat objects with descriptive field names. Use enum whenever the parameter has a known finite set of values, and use the description field on each property to disambiguate similar parameters.

schema validation

Always validate the arguments generated by the model before executing a function. Modern models with strict tool use produce valid JSON with very high reliability, but edge cases still occur, especially with complex nested schemas. Libraries like Pydantic (Python) and Zod (TypeScript) provide runtime validation that catches malformed arguments before they reach production systems.

error handling

When a function call fails (due to invalid arguments, API errors, timeout, or permission issues), the error should be communicated back to the model in a clear, structured format. Good error messages allow the model to adjust its approach, perhaps trying different arguments or an alternative function. Silently swallowing errors leads to confusing model behavior.

limit tool scope and use tool routing

Provide only the functions relevant to the current task. Including too many function definitions wastes context tokens and increases the chance of the model selecting the wrong tool. For applications with dozens of tools, consider implementing a tool selection or routing layer (Anthropic's tool_search server tool is one production implementation) that presents only a relevant subset to the model on each turn.

security considerations

Function calling grants the model indirect access to external systems. Always implement proper authorization, input sanitization, and rate limiting on the execution side. Never allow the model to directly execute arbitrary code or unrestricted database queries without safeguards. The principle of least privilege applies: each function should have only the permissions it needs. Treat any output from a tool as untrusted user input from a prompt injection perspective, and require user confirmation for actions with real-world impact like sending email, making payments, or posting publicly [6].

challenges and limitations

hallucinated function calls

Models sometimes generate calls to functions that do not exist in the provided tool list, or fabricate argument values. This is a form of hallucination specific to function calling. Strict schema enforcement and validation reduce but do not eliminate this issue. The BFCL Hallucination subscore is a direct attempt to measure this failure mode [14].

complex multi-step reasoning

While models handle simple, single-step function calls reliably, complex workflows requiring many sequential function calls with conditional logic are more error-prone. The model may lose track of intermediate results, repeat function calls unnecessarily, or fail to synthesize information from multiple tool outputs. tau2-bench was designed in part to expose exactly this kind of multi-turn drift [16].

latency

Each function call adds a round trip between the model and the application, increasing total response time. For workflows requiring five or more sequential function calls, latency can become noticeable. Parallel function calling helps for independent calls, but sequential dependencies cannot be parallelized. Server-side built-in tools (Responses API, Anthropic server tools, xAI Agent Tools API) reduce client-side round trips at the cost of less control over execution.

token cost of long tool definitions

Tool schemas live in the context window and are charged as input tokens. A large tool catalog with verbose descriptions can easily consume thousands of tokens per request before the user message is even processed. Anthropic publishes per-model token costs for the special system prompt that enables tool use (346 input tokens for Claude Opus 4.x and Sonnet 4.x with tool_choice: auto), and the tool definitions themselves add on top of that [8].

interoperability

Despite conceptual similarity, the different API formats across providers create friction for developers building provider-agnostic applications. MCP addresses part of this problem at the tool description layer, but the invocation-level differences in request and response formats still require provider-specific code or adapter libraries.

current state (2025-2026)

As of early 2026, function calling is a mature and widely adopted capability supported by all major LLM providers and most open-source models. The focus has shifted from basic capability to reliability, performance, and ecosystem development.

Key 2025 to 2026 developments include:

The convergence around MCP as a standard for tool discovery and description, now stewarded by the Linux Foundation Agentic AI Foundation [19].
Server-side built-in tools across all major APIs (OpenAI Responses API, Anthropic server tools, Gemini built-in tools, xAI Agent Tools API), reducing the need for developers to implement common tools like web search and code execution from scratch.
Reasoning models that interleave tool calls with extended chain-of-thought, lifting top BFCL v4 and tau2-bench Telecom scores into the 0.85+ and 90%+ range respectively [11][14][17].
Open-weight models like Llama 3.1 405B, Qwen3 235B, GLM 5, and DeepSeek-V3 reaching parity with closed models on tool-calling benchmarks, making robust tool use accessible outside proprietary API ecosystems [10][15].
Increased emphasis on agentic evaluation (BFCL v4 agentic split, tau2-bench, AgentBench, SWE-bench) over single-turn benchmarks.

The ongoing challenge is improving reliability of function calling in long-horizon, multi-step scenarios where models must make dozens of tool decisions in sequence, often interleaved with reasoning, search, and memory. Research into better tool-use training data, improved constrained decoding, and more robust agent architectures continues across both industry and academia.

references

[1] OpenAI. "Function Calling." OpenAI API Documentation. https://platform.openai.com/docs/guides/function-calling

[2] Descope. "MCP vs. Function Calling: How They Differ and Which to Use." 2025. https://www.descope.com/blog/post/mcp-vs-function-calling

[3] Google. "Function Calling." Gemini API Documentation. https://ai.google.dev/gemini-api/docs/function-calling

[4] Anthropic. "Tool use with Claude." Anthropic Platform Documentation. https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/overview

[5] LangChain Documentation. "Tool Calling." https://docs.langchain.com/

[6] OpenAI. "Function calling and other API updates." June 13, 2023. https://openai.com/index/function-calling-and-other-api-updates/ (See also Simon Willison, "OpenAI: Function calling and other API updates," June 13, 2023, https://simonwillison.net/2023/Jun/13/function-calling/)

[7] Anthropic. "Claude can now use tools." May 30, 2024. https://claude.com/blog/tool-use-ga

[8] Anthropic. "Tool use overview." Anthropic Platform Documentation. https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/overview

[9] Mistral AI. "Large Enough." July 24, 2024. https://mistral.ai/news/mistral-large-2407

[10] Meta AI. "Introducing Llama 3.1: Our most capable models to date." July 23, 2024. https://ai.meta.com/blog/meta-llama-3-1/

[11] xAI. "Grok 4.1 Fast and Agent Tools API." November 19, 2025. https://x.ai/news/grok-4-1-fast

[12] OpenAI. "New tools for building agents." March 11, 2025. https://openai.com/index/new-tools-for-building-agents/

[13] OpenAI. "Introducing Structured Outputs in the API." August 6, 2024. https://openai.com/index/introducing-structured-outputs-in-the-api/

[14] Patil, Shishir, et al. "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models." ICML 2025. https://proceedings.mlr.press/v267/patil25a.html

[15] Berkeley Function Calling Leaderboard V4. https://gorilla.cs.berkeley.edu/leaderboard.html

[16] Barres, Victor, et al. "τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment." 2025. arXiv:2506.07982. https://arxiv.org/abs/2506.07982

[17] Artificial Analysis. "τ²-Bench Telecom Benchmark Leaderboard." https://artificialanalysis.ai/evaluations/tau2-bench

[18] Anthropic. "Introducing the Model Context Protocol." November 2024. https://www.anthropic.com/news/model-context-protocol

[19] Linux Foundation. "Linux Foundation Announces the Formation of the Agentic AI Foundation (AAIF)." December 9, 2025. https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation

how function calling works

step 1: define available functions

step 2: model decides to call a function

step 3: application executes the function

step 4: return results to the model

diagram of the function calling flow

history

before formal function calling

OpenAI introduces function calling (June 2023)

Anthropic tool use (May 2024)

Google Gemini function calling (December 2023)

Mistral, Llama, and xAI catch up (2024)

Parallel function calling (November 2023)

Evolution of terminology

Responses API and built-in tools (March 2025)

json schema and tool definitions

provider comparison

structured output and function calling

json mode

structured outputs

reasoning models and tool use

frameworks and ecosystem

benchmarks

Berkeley Function Calling Leaderboard (BFCL)

tau-bench and tau2-bench

use cases

api integration

database queries

calculations and data processing

web browsing and search

code execution

file and document operations

multi-step workflows

desktop and browser automation

relationship to ai agents and mcp

ai agents

model context protocol (mcp)

best practices

clear function descriptions

keep schemas flat and use enums

schema validation

error handling

limit tool scope and use tool routing

security considerations

challenges and limitations

hallucinated function calls

complex multi-step reasoning

latency

token cost of long tool definitions

interoperability

current state (2025-2026)

see also

references

Improve this article

Related Articles

22.500 Best Custom GPTs

ChatGPT Classic

GPT Search

GPT Shop Keeper

GPT Store Finder

GPTavern

how function calling works

step 1: define available functions

step 2: model decides to call a function

step 3: application executes the function

step 4: return results to the model

diagram of the function calling flow

history

before formal function calling

OpenAI introduces function calling (June 2023)

Anthropic tool use (May 2024)

Google Gemini function calling (December 2023)

Mistral, Llama, and xAI catch up (2024)

Parallel function calling (November 2023)

Evolution of terminology

Responses API and built-in tools (March 2025)

json schema and tool definitions

provider comparison

structured output and function calling

json mode