Function calling
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v8 · 8,152 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v8 · 8,152 words
Add missing citations, update stale details, or suggest a clearer explanation.
Function calling (also called tool use or tool calling) is a capability of large language models (LLMs) that allows the model to generate structured JSON output specifying which external function or API to invoke and what arguments to pass. Rather than producing only natural language text, a model with function calling capabilities can decide during generation that it needs to call a specific tool, output the function name and parameters in a structured format, and then incorporate the result of that function call into its subsequent response.
Function calling was introduced as a formal API feature by OpenAI on June 13, 2023, and has since been adopted by every major LLM provider, including Anthropic (Claude), Google (Gemini), Mistral AI, xAI (Grok), and Meta (Llama). The capability is foundational to building AI agents, enabling LLMs to interact with databases, APIs, file systems, code interpreters, and other external systems in a controlled, structured manner [1][2].
In modern usage, function calling is often the lowest-level building block beneath higher abstractions such as tool use, the Model Context Protocol (MCP), and built-in agent platforms like the OpenAI Responses API and xAI Agent Tools API. By 2026, every major frontier model treats function calling as a default feature trained directly into the base model rather than a thin wrapper layered on top, and OpenAI's GPT-5.2 reports 98.7% on the tau2-bench Telecom multi-turn tool-use benchmark, illustrating how much the capability has matured since 2023 [3].
The function calling process follows a well-defined loop involving the developer, the model, and the application layer. Understanding this loop is essential to building reliable tool-augmented AI systems.
The developer provides the model with a list of available functions (tools) as part of the API request. Each function definition includes a name, a natural language description of what the function does, and a JSON Schema specifying the expected parameters and their types. Most providers use a subset of the OpenAPI/JSON Schema specification, which means features like recursive schemas, references, or rare keywords may not be supported [4].
For example, a weather application might define a function like:
{
"name": "get_weather",
"description": "Get the current weather for a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name, e.g. San Francisco"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
The model uses the function descriptions and parameter schemas to decide when and how to call each function. Clear, specific descriptions significantly improve the model's ability to choose the right function and provide correct arguments [1].
When the user sends a message (for example, "What's the weather in Tokyo?"), the model analyzes the query against the available function definitions. If the model determines that a function call would help it answer the query, it generates a structured output indicating the function to call and the arguments to pass, rather than generating a direct text response.
The model does not execute the function itself. It outputs a structured request that the application must intercept and act upon. This is a critical design choice: the model proposes actions, but the application controls execution [5].
The application parses the model's function call output, validates the arguments, and executes the actual function (for example, calling a weather API with the specified location). This step happens entirely outside the model. The developer's code is responsible for the actual execution, error handling, and any security checks. Validation typically uses runtime libraries like Pydantic in Python or Zod in TypeScript to confirm that arguments match the declared schema.
The function's result is sent back to the model as a new message in the conversation. The model then uses this result to generate its final response to the user. For example, after receiving the weather data, the model might respond: "The current weather in Tokyo is 22 degrees Celsius with partly cloudy skies."
This four-step loop can repeat multiple times within a single interaction. The model might call one function, examine the result, decide it needs additional information, call another function, and then synthesize all results into a final response. This iterative process is sometimes called the agentic loop, and it sits at the core of every tool-using LLM application [5].
| Step | Actor | Action |
|---|---|---|
| 1 | Developer | Defines available functions with names, descriptions, and JSON schemas |
| 2 | User | Sends a message or query |
| 3 | Model | Analyzes query, decides to call a function, outputs function name and arguments as JSON |
| 4 | Application | Parses function call, validates arguments, executes the actual function |
| 5 | Application | Returns function result to the model as a new message |
| 6 | Model | Incorporates result into final response, or makes additional function calls |
Two research lines from 2022 and 2023 set the conceptual stage for production function calling. The ReAct paper by Shunyu Yao and colleagues (arXiv:2210.03629, October 6, 2022) showed that a language model could interleave reasoning traces and discrete actions, with each action expressed as a structured call against a tool such as a Wikipedia search API. On HotpotQA and Fever, ReAct reduced hallucination relative to chain-of-thought reasoning, and on the ALFWorld and WebShop environments it outperformed imitation and reinforcement-learning baselines by 34 and 10 absolute percentage points respectively [6].
Toolformer (Schick et al., Meta AI and Universitat Pompeu Fabra, arXiv:2302.04761, February 9, 2023) went a step further by training a 6.7B-parameter GPT-J model to insert API calls into its own generations through a self-supervised procedure. Toolformer integrated five tools (calculator, Q&A system, two search engines, a translation system, and a calendar), demonstrated zero-shot gains over the much larger GPT-3 on math and factual tasks, and was presented at NeurIPS 2023 [7].
Gorilla (Patil et al., UC Berkeley, arXiv:2305.15334, May 24, 2023) was the first system specifically focused on training a model to write valid API calls at scale. Gorilla fine-tuned LLaMA on 1,645 API calls from HuggingFace, TorchHub, and TensorHub via the APIBench dataset, introduced Retriever-Aware Training so the model adapts to changing API documentation, and reported that the resulting model outperformed GPT-4 on API-call correctness for that benchmark [8]. The same research group later published the Berkeley Function Calling Leaderboard (BFCL), which became the dominant public benchmark for tool use [9].
Before API-level function calling support, developers used ad hoc methods to get LLMs to interact with external systems. Common approaches included instructing the model to output JSON in a specific format within its text response, then parsing that text with regular expressions or custom parsers. These approaches were fragile. Models would frequently produce malformed JSON, include extra text around the JSON, or forget the required format entirely.
The early ReAct pattern was popularized in tooling through frameworks like LangChain (released October 2022), which provided abstractions for tool use that worked around the lack of native API support through careful prompt engineering and output parsing. Results were unreliable compared to native function calling support [10].
OpenAI launched function calling as a first-class API feature on June 13, 2023, alongside the release of updated GPT-3.5 Turbo and GPT-4 models specifically fine-tuned for function calling. The new model snapshots gpt-4-0613 and gpt-3.5-turbo-0613 were trained to detect when a function should be called and to respond with structured JSON. Developers could now define functions using JSON Schema and receive structured, validated function call outputs from the model with high reliability [1][11].
The initial release supported a single function call per turn. The API used a function_call parameter in the request and returned a function_call field in the response containing the function name and arguments as a JSON string. OpenAI explicitly framed this as an implementation of the ReAct pattern, but with models fine-tuned to execute it reliably rather than improvised through prompting [11].
Anthropic introduced tool use for Claude models in beta during early 2024 and made it generally available on May 30, 2024 across the entire Claude 3 model family on the Anthropic Messages API, Amazon Bedrock, and Google Cloud Vertex AI [12]. The GA release shipped with streaming tool calls (for incremental rendering of arguments), forced tool selection, and support for image inputs as tool inputs and outputs [12]. Anthropic uses a slightly different API design from OpenAI. Rather than a separate function_call field, Anthropic integrated tool use into its content block structure. Claude models return a tool_use content block containing the tool name, a unique ID, and the input arguments. Results are sent back as tool_result content blocks referencing that ID. This design was motivated by Anthropic's broader content block architecture, which treats text, images, and tool calls as parallel content types within a single message [13].
Anthropic's GA release also introduced the tool_choice parameter (with values auto, any, or a specific tool name) and added support for tools that return images, enabling multimodal tool results. Later versions added strict tool definitions that guarantee schema conformance through constrained decoding.
Google's Gemini models support function calling through a functionCall response type, declared via FunctionDeclaration objects that use a subset of the OpenAPI schema format. Google's implementation supports both single-turn and multi-turn function calling patterns and exposes a function_calling_config parameter with modes AUTO, ANY, VALIDATED, and NONE to control whether the model is forced to emit a tool call [4]. Gemini 3, released November 18, 2025, added unique IDs for each function call, encrypted thought signatures that callers must echo back to preserve reasoning continuity, and 100+ simultaneous tool calls in a single turn for the Flash tier [14].
Mistral Large 2 (mistral-large-2407), released July 24, 2024, was trained for both parallel and sequential function calling on La Plateforme and Le Chat [15]. Meta's Llama 3.1 release on July 23, 2024 added native tool calling for the first time in the Llama family, including the 405B parameter open-weight model that briefly topped the BFCL leaderboard [16]. Llama 4 (released April 2025) extended this with native parallel tool calling, optimized for the llama4_pythonic chat template, while community testing reports somewhat less consistent JSON adherence than Claude or GPT-5 [17]. xAI added tool calling to Grok in 2024 and shipped a dedicated Agent Tools API alongside Grok 4.1 Fast on November 19, 2025, treating function calling as a server-managed primitive rather than a parameter on the chat endpoint [18].
In November 2023, OpenAI introduced parallel function calling with the release of GPT-4 Turbo. This feature allows the model to request multiple function calls in a single turn rather than calling functions one at a time. For example, if a user asks "What's the weather in Tokyo and New York?", the model can output two function calls simultaneously, allowing the application to execute them in parallel and return both results at once [1].
Parallel function calling significantly reduces the number of round trips between the application and the API, improving both latency and user experience for multi-tool queries. The feature can be controlled with the parallel_tool_calls parameter. Setting it to false ensures at most one tool call per turn. Anthropic, Google, Mistral, xAI, and Meta have since added equivalent parallel-call support to their own APIs.
OpenAI initially used the term "function calling" but later transitioned to "tool calling" in its API, renaming the functions parameter to tools and function_call to tool_choice. The older function calling format is now considered legacy. Despite this rename, "function calling" remains the more widely recognized term in the broader community, and most providers use both terms interchangeably [1].
On October 22, 2024, Anthropic released the Computer Use capability for claude-3-5-sonnet-20241022 in public beta, marking the first frontier model trained to control a desktop through tool calls [19]. The model receives screenshots and emits computer tool calls (mouse_move, left_click, type, key, screenshot) using a polymorphic tool type signalled via the new tool.type property and the anthropic_beta: computer-use-2024-10-22 header [19]. OpenAI followed with a Computer Use tool in the Responses API (March 2025), Google released Project Mariner in the same window, and all three rely on standard function calling as the transport layer for GUI actions [20]. The pattern reframed function calling as the foundation not just for API access but for embodied action through a display.
On March 11, 2025, OpenAI launched the Responses API, a new API primitive that combines elements of the Chat Completions and Assistants APIs and natively supports built-in server-side tools including web search, file search, and computer use [20]. The same release introduced the open-source OpenAI Agents SDK and shifted the agent development story away from raw tools arrays toward higher-level agent objects with built-in tool routing. Reasoning models such as o3 and o4-mini can call tools and functions directly within their chain of thought in the Responses API, producing more contextually grounded answers and significantly higher tool-calling scores on industry benchmarks [20].
Anthropic added equivalent server tools (web_search, code_execution, web_fetch, tool_search) and an MCP connector to the Messages API in 2025, while Google added live tool routing inside Gemini 3, and xAI shipped the Agent Tools API alongside Grok 4.1 Fast in November 2025 [18].
Function calling expanded beyond text in October 2024 with OpenAI's Realtime API for voice agents, then matured with the gpt-realtime model GA in 2025. The new model lifted scores on the ComplexFuncBench audio evaluation from 49.7% (December 2024 baseline) to 66.5%, added support for remote MCP servers, SIP-based phone calling, and image inputs, and crucially allowed long-running tool calls to execute without disrupting an open voice session [21]. The same pattern (continuous audio in, tool calls streamed out, results streamed back) now underpins production voice agents for customer support, scheduling, and field operations.
The GPT-5 family, released by OpenAI in 2025, expanded function calling along two new dimensions [22]. Freeform tool calls let developers register a tool with { "type": "custom" } and receive raw text payloads from the model, with no JSON envelope. This is useful when the destination tool expects Python, SQL, a shell command, or another DSL natively, since wrapping that payload in JSON forced an awkward escaping step in earlier APIs [22]. Context-free grammar (CFG) constraints let the developer attach a Lark grammar (or a regex) to a custom tool, after which OpenAI's LLGuidance constrained-decoding backend forces the model's output to conform to the grammar [23]. Together these features cover use cases that pure JSON Schema could not express, like SQL dialects, structured configuration files, or domain-specific languages.
GPT-5.2 (released December 2025) added an allowed_tools parameter under tool_choice that defines a full toolkit but restricts the model to a specific subset per request, improving safety and prompt-cache hit rates without prompt-engineering hacks [3]. GPT-5.5 (April 24, 2026) continued the same trajectory with better error recovery mid-task, more efficient tool sequencing, and reduced rate of confidently bad plans, positioning it as an agentic model rather than a chat model [24].
On November 24, 2025, Anthropic launched a coordinated Advanced Tool Use beta on the Claude Developer Platform, introducing three new features [25]:
These features sit alongside Claude's prior strict tool use, parallel calling, and server tool catalog (web search, code execution, web fetch, tool search, MCP connector), and they marked Anthropic's clearest statement that tool catalogs of hundreds to thousands of tools are now the design target.
Function definitions across all major providers share three required pieces: a name (an identifier the model emits when calling), a description (natural language guidance about when and how to use the tool), and a parameters block (a JSON Schema describing the arguments).
Most providers support a useful subset of JSON Schema:
type (object, string, number, integer, boolean, array)properties and requiredenum for fixed value setsdescription on individual fieldsitems for arraysoneOf, anyOf, allOf (with provider-specific limits)Deeply nested objects, recursive references, and arbitrary $ref chains are commonly rejected or silently flattened. Gemini's documentation explicitly notes that very large or deeply nested schemas may be rejected in ANY mode [4]. OpenAI's strict mode and Anthropic's strict tool use both impose additional restrictions in exchange for guaranteed schema conformance, including required additionalProperties: false and a small set of supported keywords.
A practical rule: keep schemas flat, name fields descriptively, use enum whenever there is a closed set of options, and lean on the description field to disambiguate near-duplicate parameters. The model treats descriptions as part of the prompt, so a clear description of unit: "celsius" | "fahrenheit" does more work than a long if-then in the system prompt.
For inputs that do not fit JSON well, GPT-5 custom tools accept either a raw freeform string or a Lark/regex grammar that constrains the generated text, letting the developer enforce a SQL dialect, a shell-style command, or any other DSL without round-tripping through JSON [22][23]. Anthropic's Tool Use Examples cover a similar gap from the other direction by demonstrating correct usage in-context rather than relying solely on schema and description fields [25].
Each major LLM provider implements function calling with a different API design, though the core concept is the same across all of them.
| Feature | OpenAI (GPT-5.x, GPT-4o, o-series) | Anthropic (Claude) | Google (Gemini) | Mistral AI | xAI (Grok) | Meta (Llama) 3.1+ |
|---|---|---|---|---|---|---|
| API term | Tool calls | Tool use | Function calling | Tool calls | Tool calls / Agent Tools API | Tool calls |
| First public release | June 13, 2023 | Beta in early 2024, GA May 30, 2024 | December 2023 (Gemini API) | February 2024 | 2024, Agent Tools API Nov 19, 2025 | July 23, 2024 (Llama 3.1) |
| Definition format | JSON Schema in tools array; custom type for freeform / CFG | JSON Schema in tools array with input_schema | FunctionDeclaration (OpenAPI subset) | JSON Schema in tools array | JSON Schema in tools array | JSON Schema or built-in tool tags |
| Response format | tool_calls array with function.name and function.arguments; raw text for custom tools | tool_use content block with name and input | functionCall with name and args (with unique id in Gemini 3) | tool_calls with function object | tool_calls array | `< |
| Parallel calls | Yes, default on, parallel_tool_calls parameter | Yes, supported across Claude 3, 4, and 4.x families | Yes, 100+ per turn in Gemini 3 Flash, per-call IDs | Yes (Mistral Large 2 onward) | Yes | Yes (parallel in Llama 4; serial in Llama 3) |
| Strict / schema enforcement | strict: true (Structured Outputs); CFG via Lark/regex | strict: true (strict tool use) | function_calling_config: ANY plus VALIDATED mode | JSON mode | Schema-aware decoding | Depends on inference engine (vLLM, llama.cpp) |
| Force tool use | tool_choice: {"type": "function", "function": {"name": "..."}}; allowed_tools subset (GPT-5.2) | tool_choice: {"type": "tool", "name": "..."} | tool_config with ANY mode | tool_choice: "any" or specific function | tool_choice with function name | Provider-dependent |
| Streaming | Yes, with delta chunks for tool call arguments | Yes, with content_block_delta events | Yes | Yes | Yes | Yes |
| Built-in server tools | Web search, file search, computer use, code interpreter, image generation, MCP (Responses API, March 2025) | Web search, code execution, web fetch, tool search, MCP connector | Google Search, code execution, URL context | Limited | Web search, X search, file search, code execution, MCP | None native; via inference frameworks |
| Advanced features | Freeform/custom tools, Lark/CFG grammars, allowed_tools, Responses API agents | Programmatic Tool Calling (PTC), Tool Search, Tool Use Examples | Thought signatures, live tool routing | JSON mode + schema | Server-managed Agent Tools API | Inference-time grammar via vLLM/llama.cpp |
The overall picture by 2026: every provider exposes the same conceptual loop, but field names and content-block structures differ enough that most production codebases either pin to one provider or use an adapter library such as LiteLLM, LangChain, or the Vercel AI SDK to abstract the differences.
Function calling is closely related to, and often overlaps with, structured output capabilities. The two features share most of their underlying machinery (JSON Schema and constrained decoding) but solve slightly different problems.
JSON mode, available from OpenAI, Google, Mistral, and others, constrains the model to produce valid JSON in its response without enforcing a specific schema. It is useful when you need structured output but the schema varies between requests, or when an old code path expects a JSON blob and you do not want to refactor it into a tool definition.
OpenAI introduced Structured Outputs on August 6, 2024, going beyond JSON mode by guaranteeing that the model's output conforms to a developer-supplied JSON Schema. The new response_format option accepts a json_schema field, and a parallel strict: true flag on tool definitions delivers the same guarantee for function calls. Under the hood, OpenAI uses constrained decoding (sometimes called constrained sampling) to restrict the model's token generation at each step to only tokens that would produce valid JSON matching the schema. The launch model gpt-4o-2024-08-06 was trained to handle complicated schemas and reportedly hit 100% schema conformance on internal evaluations [27].
Anthropic added equivalent strict tool use in 2025, building on its existing tool use mechanism. Developers define a tool with an input_schema and add strict: true, and Claude returns structured data with the same conformance guarantees. Google's Gemini API offers responseSchema and the VALIDATED function calling mode for similar guarantees, and Mistral exposes JSON mode plus schema-constrained outputs on La Plateforme.
| Approach | Schema enforcement | Provider support | Best for |
|---|---|---|---|
| JSON mode | Guarantees valid JSON, no schema enforcement | OpenAI, Google, Mistral, Anthropic | Variable-structure JSON output |
| Structured Outputs | Guarantees JSON matching a specific schema | OpenAI (native, Aug 2024), Anthropic (strict tool use, 2025), Google (responseSchema), Mistral (json_object with schema) | Strict data extraction, form filling, tool args |
| Function calling | Guarantees structured function call output matching tool schema | All major providers | Tool invocation with typed arguments |
| Freeform custom tools + CFG | Guarantees raw text matching a Lark/regex grammar | OpenAI (GPT-5+) | SQL, DSLs, code payloads where JSON would be awkward |
Reasoning-focused models that produce extended chains of thought have changed how tool use is integrated into the generation loop. OpenAI's o-series (o1, o3, o3-mini, o4-mini) and the GPT-5 line can call tools directly inside their reasoning traces in the Responses API, interleaving thought and action without a full round-trip to the application after every step [20]. This pattern improves complex tool-calling benchmarks by letting the model think, call, observe, and revise within a single response.
Claude's extended thinking, available on Claude Sonnet 4.5 and Claude Opus 4.x (including Opus 4.7, released April 16, 2026), similarly supports tool use during the thinking phase [28]. Google's Gemini Deep Think and the Gemini 3 family expose a comparable interleaved tool mode using encrypted thought signatures that the developer echoes back across turns to preserve reasoning continuity [14], and xAI's grok-4-1-fast-reasoning is explicitly marketed as the reasoning variant of Grok 4.1 Fast for complex tool workflows [18]. The general trend by 2026 is that the boundary between "reasoning" and "tool use" has dissolved. Frontier models reason, call tools, observe results, and reason again as a single continuous process, and the resulting tau2-bench Telecom and BFCL v4 scores cluster much higher than pre-2025 non-reasoning baselines [3].
Tool catalogs are typically static across many requests, which makes them ideal candidates for prompt caching. Both OpenAI and Anthropic now cache tool definitions explicitly:
cache_control breakpoints per request. The request hierarchy is Tools then System then Messages, so placing a breakpoint on the last system block automatically caches the entire tools array as part of the prefix. Cache reads bill at 0.1 of normal input pricing. As of early 2026 Anthropic's default cache TTL is 5 minutes (resetting on each read), with an optional 1-hour TTL at a higher write cost [30].For agents with large tool catalogs (the multi-server MCP setups Anthropic measures at roughly 55,000 tokens of definitions [25]), caching can change the cost profile by an order of magnitude relative to including the same definitions uncached in every request.
The ecosystem around function calling has grown well beyond raw API calls. Developers commonly use higher-level frameworks that handle the agentic loop, memory, multi-agent coordination, and integration with vector stores or databases.
These frameworks all share a common core: a function calling loop that emits tool calls, executes them, and feeds results back into the conversation.
Function calling capability is now evaluated by a small family of public benchmarks, with the Berkeley Function Calling Leaderboard (BFCL) and Sierra Research's tau-bench family being the most influential.
The Berkeley Function Calling Leaderboard (Yan et al., 2024) is the de facto standard for evaluating tool use in LLMs. It uses an Abstract Syntax Tree (AST) evaluation method that scales to thousands of functions and tests both serial and parallel function calls across multiple programming languages [9]. BFCL has gone through four major versions:
The BFCL paper appeared at ICML 2025 (Patil et al.) under the title "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models" [9]. As of early 2026, top scores on the public leaderboard cluster between 0.70 and 0.89, with Llama 3.1 405B Instruct historically near the top of the open-weight rankings at 0.885 and a tight pack of frontier proprietary models behind it [9][31]. By April 2026, frontier proprietary models like GPT-5 and Claude Opus 4.6 also lead WebArena and OSWorld in addition to BFCL, with DeepSeek-V3.2 the strongest open-weight model and typically sitting within 10 percentage points of the proprietary leaders at a fraction of the per-token cost [32].
Sierra Research released tau-bench in 2024 and the harder follow-up tau2-bench in 2025 (arXiv 2506.07982). tau-bench simulates dynamic conversations between a user (also played by an LLM) and a tool-using language agent that must follow domain-specific policies. It originally covered Mock, Airline, and Retail domains. tau2-bench (Barres et al., 2025) introduced a dual-control environment where both the agent and the user can modify a shared world state, and added a Telecom domain that tests technical troubleshooting, problem solving, and clear communication [33].
The Telecom subset of tau2-bench is one of the hardest public tool-use benchmarks. GPT-5.2 Thinking reports a state-of-the-art 98.7% on tau2-bench Telecom, the GLM 4.7 and GLM 5 series exceed 98%, and Grok 4.1 Fast reasoning posts strong results in the same range [3][18][34]. Older non-reasoning models often fall below 50% on the same domain, which is a useful indicator of how much reasoning depth a model brings to multi-turn tool conversations.
Tool calling benchmarks have extended beyond text. OpenAI's ComplexFuncBench evaluates function calling on audio inputs, where gpt-realtime scored 66.5% versus 49.7% for the December 2024 baseline [21]. Voice agents introduce new failure modes (long-running calls during continuous audio, barge-in, interruption recovery) that text-only benchmarks did not previously cover.
Function calling enables a broad range of applications by bridging the gap between language understanding and programmatic action.
The most common use case is connecting an LLM to external APIs. A travel assistant might call flight search, hotel booking, and weather APIs. A productivity assistant might call calendar, email, and task management APIs. Function calling provides the structured interface that makes these integrations reliable.
Models can generate SQL queries or call database wrapper functions to answer questions about structured data. A business analyst chatbot might receive a question like "What were our top 10 products by revenue last quarter?" and generate a function call to execute the appropriate database query, then present the results in natural language. GPT-5's freeform custom tools combined with a Lark grammar for SQL syntax are particularly effective here because the model can emit valid SQL directly rather than through a JSON envelope [22][23].
LLMs are unreliable at arithmetic and precise calculations. Function calling allows models to delegate mathematical operations, unit conversions, statistical analyses, and other computations to dedicated tools that produce exact results. Code interpreter tools take this further by letting the model write and execute Python in a sandbox.
Models can call web search functions to retrieve up-to-date information beyond their training data. ChatGPT Search, Perplexity AI, and other AI search products use function calling internally to trigger web searches, scrape pages, and retrieve relevant content. The Responses API (OpenAI), Anthropic web_search server tool, Gemini Google Search tool, and xAI Agent Tools API all expose this as a built-in primitive.
Code interpreter tools allow models to write and execute code in sandboxed environments. OpenAI's Code Interpreter (now called Advanced Data Analysis), Anthropic's analysis tool, the Gemini code execution tool, and the Grok Agent Tools code execution sandbox all rely on function calling to pass generated code to an execution environment and return results. Anthropic's Programmatic Tool Calling pushes this further by letting Claude write Python that orchestrates many other tools in a single inference, reducing context pollution [25][26].
Models can read, write, and manipulate files through function calls. This is the basis for AI-powered coding assistants like GitHub Copilot, Cursor, and Claude Code, which use tool calls to read source files, make edits, run tests, and interact with version control systems.
Complex tasks often require chaining multiple function calls together. An AI agent handling a customer refund might need to look up the customer's order, check the refund policy, calculate the refund amount, process the refund through the payment system, and send a confirmation email. Each step involves a separate function call, with the model orchestrating the sequence based on intermediate results.
The Anthropic computer use API, OpenAI Operator, and Google Project Mariner translate full GUI interactions into tool calls. The model receives screenshots, decides on a click or keystroke, and emits a tool call describing the action. Despite the visual interface, the underlying transport is still function calling [19].
Voice agents built on OpenAI's Realtime API or equivalent providers use function calling to look up customer records, place orders, schedule calls, transfer to human agents, and so on, all without interrupting the audio session. The Realtime API specifically guarantees that long-running tool calls do not block the conversational flow, so an agent can keep speaking while a slow backend call resolves [21].
Function calling is the foundational mechanism that enables AI agents to take actions in the world. Without it, LLMs can only produce text. With it, they can query databases, call APIs, execute code, and interact with external systems.
An agent is a system where an LLM autonomously decides what actions to take, executes those actions via function calls, observes the results, and plans next steps. Frameworks like LangChain, LangGraph, CrewAI, and AutoGen provide abstractions for building agent loops on top of function calling [10].
The quality of function calling directly determines agent reliability. Models that accurately select the right function, provide correct arguments, and handle errors gracefully produce more effective agents. Conversely, function calling failures (wrong function selection, malformed arguments, hallucinated function names) are a primary source of agent failures.
Model Context Protocol (MCP) is an open standard introduced by Anthropic in November 2024 that provides a standardized way for AI models to discover and interact with external tools and data sources. While function calling defines how a model invokes a specific tool, MCP standardizes how tools are described, discovered, and connected to models across different applications and providers [35].
MCP gained rapid adoption in 2025. OpenAI added official MCP support in March 2025, Google added MCP for Gemini, and xAI integrated MCP into the Agent Tools API in November 2025 [18]. On December 9, 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation co-founded by Anthropic, Block, and OpenAI, with support from Google, Microsoft, AWS, Cloudflare, and Bloomberg [36]. By early 2026, more than 10,000 MCP servers had been published, covering everything from databases and file systems to enterprise SaaS connectors [35].
MCP complements rather than replaces function calling. MCP provides the transport, discovery, and capability description layer, while function calling provides the actual invocation mechanism that the model uses inside a single conversation.
| Concept | Function calling | MCP |
|---|---|---|
| What it does | Lets a model invoke a specific function with structured arguments | Standardizes how tools are described, discovered, and connected |
| Scope | Single model-to-tool interaction | Cross-application tool ecosystem |
| Defined by | Each LLM provider (OpenAI, Anthropic, Google, etc.) | Open standard (originally Anthropic, now AAIF / Linux Foundation) |
| Analogy | Making a specific API call | A USB-C connector for AI tools |
Building reliable function calling applications requires attention to several design principles.
The model relies on function descriptions to decide when and how to use each tool. Descriptions should be specific, include examples of when the function should be used, and clarify any constraints or limitations. Vague descriptions lead to incorrect tool selection. Treat the description as part of the prompt, because that is exactly how the model uses it [1]. Anthropic's Tool Use Examples feature formalizes this by letting developers attach concrete example calls to a tool definition for the model to imitate [25].
Deeply nested schemas hurt accuracy and increase token cost. Prefer flat objects with descriptive field names. Use enum whenever the parameter has a known finite set of values, and use the description field on each property to disambiguate similar parameters.
Always validate the arguments generated by the model before executing a function. Modern models with strict tool use produce valid JSON with very high reliability, but edge cases still occur, especially with complex nested schemas. Libraries like Pydantic (Python) and Zod (TypeScript) provide runtime validation that catches malformed arguments before they reach production systems.
When a function call fails (due to invalid arguments, API errors, timeout, or permission issues), the error should be communicated back to the model in a clear, structured format. Good error messages allow the model to adjust its approach, perhaps trying different arguments or an alternative function. Silently swallowing errors leads to confusing model behavior.
Provide only the functions relevant to the current task. Including too many function definitions wastes context tokens and increases the chance of the model selecting the wrong tool. Tool-selection accuracy on Claude is reported to degrade noticeably past 30 to 50 tools when the entire catalog is loaded at once, which is the motivation for Anthropic's tool_search server tool that surfaces only a relevant subset on each turn [25]. OpenAI's GPT-5.2 allowed_tools parameter solves the same problem from the other side by letting a developer keep a large registered toolset but restrict the model to a request-specific subset, which also improves prompt-cache hit rates [3].
Tool definitions are typically static and worth caching. Place a cache breakpoint on the last block of the prefix (after tools and system on Anthropic, or in the natural prefix order on OpenAI) and keep the tool array byte-identical and ordered identically across requests so the cache prefix matches [29][30].
Function calling grants the model indirect access to external systems. Always implement proper authorization, input sanitization, and rate limiting on the execution side. Never allow the model to directly execute arbitrary code or unrestricted database queries without safeguards. The principle of least privilege applies: each function should have only the permissions it needs. Treat any output from a tool as untrusted user input from a prompt injection perspective, and require user confirmation for actions with real-world impact like sending email, making payments, or posting publicly [11][37].
Models sometimes generate calls to functions that do not exist in the provided tool list, or fabricate argument values. This is a form of hallucination specific to function calling. Strict schema enforcement and validation reduce but do not eliminate this issue. The BFCL Hallucination subscore is a direct attempt to measure this failure mode [9].
While models handle simple, single-step function calls reliably, complex workflows requiring many sequential function calls with conditional logic are more error-prone. The model may lose track of intermediate results, repeat function calls unnecessarily, or fail to synthesize information from multiple tool outputs. tau2-bench was designed in part to expose exactly this kind of multi-turn drift [33].
Each function call adds a round trip between the model and the application, increasing total response time. For workflows requiring five or more sequential function calls, latency can become noticeable. Parallel function calling helps for independent calls, but sequential dependencies cannot be parallelized. Server-side built-in tools (Responses API, Anthropic server tools, xAI Agent Tools API) reduce client-side round trips at the cost of less control over execution. Programmatic Tool Calling pushes much further by letting the model itself orchestrate multiple tools inside a code-execution container, collapsing many round-trips into one [25][26].
Tool schemas live in the context window and are charged as input tokens. A large tool catalog with verbose descriptions can easily consume thousands of tokens per request before the user message is even processed. Multi-server MCP setups are reported to use roughly 55,000 tokens of tool definitions, which is a large fraction of common context windows. Mitigations include explicit prompt caching of the tool block, on-demand tool discovery (tool_search on Claude, allowed_tools on GPT-5.2), and Anthropic's per-model token cost transparency for the special system prompt that enables tool use (346 input tokens for Claude Opus 4.x and Sonnet 4.x with tool_choice: auto) [13][25][3].
Despite conceptual similarity, the different API formats across providers create friction for developers building provider-agnostic applications. MCP addresses part of this problem at the tool description layer, but the invocation-level differences in request and response formats still require provider-specific code or adapter libraries.
Because tools effectively grant the model write access to real systems, function calling is now the primary attack surface for prompt-injection-based exploits. OWASP's 2025 LLM Top 10 ranks prompt injection as the #1 risk for LLM applications, and 2025 saw documented production incidents including a "second-order" injection in ServiceNow Now Assist (a low-privilege agent persuading a higher-privilege peer to exfiltrate case files) and three injection CVEs in Anthropic's official Git MCP server (CVE-2025-68143, CVE-2025-68144, CVE-2025-68145) leading to code execution and data exfiltration when an attacker controlled file contents the agent reads [37][38]. Mitigations include treating all tool output as untrusted input, requiring confirmation for high-impact actions, narrowing tool scopes per request, and using policy engines that intercept and approve every tool call (for example Microsoft's Agent Governance Toolkit, released April 2026, which applies YAML, OPA Rego, or Cedar policies in under 0.1 ms p99) [39].
As of early to mid 2026, function calling is a mature and widely adopted capability supported by all major LLM providers and most open-source models. The focus has shifted from basic capability to reliability, performance, and ecosystem development.
Key 2025 to 2026 developments include:
The ongoing challenge is improving reliability of function calling in long-horizon, multi-step scenarios where models must make dozens of tool decisions in sequence, often interleaved with reasoning, search, and memory. Research into better tool-use training data, improved constrained decoding, more robust agent architectures, and tool-call security continues across both industry and academia.