Function calling (also called tool use or tool calling) is a capability of large language models (LLMs) that allows the model to generate structured JSON output specifying which external function or API to invoke and what arguments to pass. Rather than producing only natural language text, a model with function calling capabilities can decide during generation that it needs to call a specific tool, output the function name and parameters in a structured format, and then incorporate the result of that function call into its subsequent response.
Function calling was introduced as a formal API feature by OpenAI in June 2023 and has since been adopted by all major LLM providers, including Anthropic (Claude), Google (Gemini), and Mistral AI. The capability is foundational to building AI agents, enabling LLMs to interact with databases, APIs, file systems, code interpreters, and other external systems in a controlled, structured manner [1].
The function calling process follows a well-defined loop involving the developer, the model, and the application layer. Understanding this loop is essential to building reliable tool-augmented AI systems.
The developer provides the model with a list of available functions (tools) as part of the API request. Each function definition includes a name, a natural language description of what the function does, and a JSON Schema specifying the expected parameters and their types.
For example, a weather application might define a function like:
```json
{
  "name": "get_weather",
  "description": "Get the current weather for a given location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name, e.g. San Francisco"
      },
      "unit": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"]
      }
    },
    "required": ["location"]
  }
}
```
The model uses the function descriptions and parameter schemas to decide when and how to call each function. Clear, specific descriptions significantly improve the model's ability to choose the right function and provide correct arguments [1].
When the user sends a message (e.g., "What's the weather in Tokyo?"), the model analyzes the query against the available function definitions. If the model determines that a function call would help it answer the query, it generates a structured output indicating the function to call and the arguments to pass, rather than generating a direct text response.
The model does not execute the function itself. It outputs a structured request that the application must intercept and act upon. This is a critical design choice: the model proposes actions, but the application controls execution [2].
The application parses the model's function call output, validates the arguments, and executes the actual function (e.g., calling a weather API with the specified location). This step happens entirely outside the model; the developer's code is responsible for the actual execution, error handling, and any security checks.
The function's result is sent back to the model as a new message in the conversation. The model then uses this result to generate its final response to the user. For example, after receiving the weather data, the model might respond: "The current weather in Tokyo is 22 degrees Celsius with partly cloudy skies."
This loop can repeat multiple times within a single interaction. The model might call one function, examine the result, decide it needs additional information, call another function, and then synthesize all results into a final response.
| Step | Actor | Action |
|---|---|---|
| 1 | Developer | Define available functions with names, descriptions, and JSON schemas |
| 2 | User | Send a message or query |
| 3 | Model | Analyze query; decide to call a function; output function name and arguments as JSON |
| 4 | Application | Parse function call; validate arguments; execute the actual function |
| 5 | Application | Return function result to the model as a new message |
| 6 | Model | Incorporate result into final response (or make additional function calls) |
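The full loop in the table above can be sketched in Python. This is a minimal, provider-agnostic simulation: fake_model stands in for the LLM API, and the message shapes only loosely follow OpenAI's tool-calling format (the role and tool_calls fields here are illustrative, not an exact SDK contract).

```python
import json

def fake_model(messages, tools):
    """Stand-in for the LLM API: requests a tool call on the first turn,
    then answers in text once a tool result is present in the history."""
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    if tool_msgs:
        data = json.loads(tool_msgs[-1]["content"])
        return {"role": "assistant",
                "content": f"The current weather in {data['location']} is "
                           f"{data['temp_c']} degrees Celsius."}
    # Steps 2-3: the model outputs a structured call, not a text answer.
    return {"role": "assistant", "tool_calls": [
        {"name": "get_weather",
         "arguments": json.dumps({"location": "Tokyo"})}]}

def get_weather(location, unit="celsius"):
    return {"location": location, "temp_c": 22}  # stub for a real weather API

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
reply = fake_model(messages, tools=None)
while "tool_calls" in reply:                 # Steps 4-5: app executes, returns results
    for call in reply["tool_calls"]:
        args = json.loads(call["arguments"])  # parse (and, in real code, validate)
        result = get_weather(**args)
        messages.append({"role": "tool", "content": json.dumps(result)})
    reply = fake_model(messages, tools=None)  # Step 6: model uses the result

print(reply["content"])
```

Note that the model function never executes anything itself; the application code in the while loop owns execution, mirroring the separation of concerns described above.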
Before API-level function calling support, developers used ad hoc methods to get LLMs to interact with external systems. Common approaches included instructing the model to output JSON in a specific format within its text response, then parsing that text with regular expressions or custom parsers. These approaches were fragile: models would frequently produce malformed JSON, include extra text around the JSON, or forget the required format entirely.
Some early frameworks like LangChain (released October 2022) provided abstractions for tool use that worked around these limitations through careful prompt engineering and output parsing. However, the results were unreliable compared to native function calling support [3].
OpenAI launched function calling as a first-class API feature on June 13, 2023, alongside the release of updated GPT-3.5 Turbo and GPT-4 models specifically fine-tuned for function calling. This was a landmark moment. For the first time, developers could define functions with JSON Schema and receive structured, validated function call outputs from the model with high reliability [1].
The initial release supported a single function call per turn. The API used a function_call parameter in the request and returned a function_call field in the response containing the function name and arguments as a JSON string.
Anthropic introduced tool use for Claude models in 2024, using a slightly different API design. Rather than a separate function_call field, Anthropic integrated tool use into its content block structure. Claude models return a tool_use content block containing the tool name, a unique ID, and the input arguments. This design was motivated by Anthropic's broader content block architecture, which treats text, images, and tool calls as parallel content types within a single message [4].
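As a rough illustration of the content-block shape described above (the id value and text are invented; consult Anthropic's API reference for the exact contract), a Claude response containing a tool call might look like:

```python
# Illustrative only: a Claude message whose content mixes a text block
# and a tool_use block. The id value here is made up.
response_content = [
    {"type": "text", "text": "I'll check the weather for you."},
    {"type": "tool_use",
     "id": "toolu_01A",            # unique ID, echoed back with the tool result
     "name": "get_weather",
     "input": {"location": "Tokyo", "unit": "celsius"}},
]

# The application scans the blocks for tool_use entries to execute.
tool_calls = [b for b in response_content if b["type"] == "tool_use"]
```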
Google's Gemini models support function calling through a functionCall response type. Google's implementation emphasizes integration with Google Cloud services and supports both single-turn and multi-turn function calling patterns. Gemini's approach is similar in concept to OpenAI's but uses different field names and nesting structures [5].
In November 2023, OpenAI introduced parallel function calling with the release of GPT-4 Turbo. This feature allows the model to request multiple function calls in a single turn rather than calling functions one at a time. For example, if a user asks "What's the weather in Tokyo and New York?", the model can output two function calls simultaneously, allowing the application to execute them in parallel and return both results at once [6].
Parallel function calling significantly reduces the number of round trips between the application and the API, improving both latency and user experience for multi-tool queries. The feature can be controlled with the parallel_tool_calls parameter; setting it to false ensures at most one tool call per turn.
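When the model returns several independent tool calls in one turn, the application can run them concurrently. A minimal sketch (the tool_calls list below imitates the shape of a parallel response; the weather lookup is a stub):

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Imitates a model turn that requests two independent calls at once.
tool_calls = [
    {"id": "call_1", "name": "get_weather",
     "arguments": json.dumps({"location": "Tokyo"})},
    {"id": "call_2", "name": "get_weather",
     "arguments": json.dumps({"location": "New York"})},
]

def get_weather(location):
    return {"location": location, "temp_c": 22}  # stub for a real API call

def execute(call):
    args = json.loads(call["arguments"])
    return {"tool_call_id": call["id"], "result": get_weather(**args)}

# Execute both calls in parallel; each result is matched back by its ID
# so the model can associate results with the calls it made.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(execute, tool_calls))
```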
OpenAI initially used the term "function calling" but later transitioned to "tool calling" in its API, renaming the functions parameter to tools and function_call to tool_choice. The older function calling format is now considered legacy. Despite this rename, "function calling" remains the more widely recognized term in the broader community [1].
Each major LLM provider implements function calling with a different API design, though the core concept is the same across all of them.
| Feature | OpenAI (GPT-4, GPT-4o) | Anthropic (Claude) | Google (Gemini) | Mistral AI |
|---|---|---|---|---|
| API term | Tool calls | Tool use | Function calling | Tool calls |
| Definition format | JSON Schema in tools array | JSON Schema in tools array with input_schema | FunctionDeclaration with parameters | JSON Schema in tools array |
| Response format | tool_calls array with function.name and function.arguments | tool_use content block with name and input | functionCall with name and args | tool_calls with function object |
| Parallel calls | Yes (default on; parallel_tool_calls parameter) | Yes (supported in recent models) | Yes | Yes |
| Structured output | Native JSON Schema enforcement (Structured Outputs) | Via tool use with input_schema; dedicated structured output support added 2025 | JSON Schema enforcement | JSON mode supported |
| Force tool use | tool_choice: {"type": "function", "function": {"name": "..."}} | tool_choice: {"type": "tool", "name": "..."} | tool_config with FORCED mode | tool_choice: "any" or specific function |
| Streaming | Yes, with delta chunks for tool call arguments | Yes, with content_block_delta events | Yes | Yes |
Function calling enables a broad range of applications by bridging the gap between language understanding and programmatic action.
The most common use case is connecting an LLM to external APIs. A travel assistant might call flight search, hotel booking, and weather APIs. A productivity assistant might call calendar, email, and task management APIs. Function calling provides the structured interface that makes these integrations reliable [1].
Models can generate SQL queries or call database wrapper functions to answer questions about structured data. A business analyst chatbot might receive a question like "What were our top 10 products by revenue last quarter?" and generate a function call to execute the appropriate database query, then present the results in natural language.
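A sketch of the database pattern, using an in-memory SQLite table in place of a production warehouse. The run_query tool name and the read-only guard are illustrative; in practice the SQL string would arrive as a model-generated tool argument:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("widget", 500.0), ("gadget", 900.0), ("gizmo", 300.0)])

def run_query(sql: str):
    """Tool exposed to the model. Rejects anything but SELECT statements,
    a bare-minimum safeguard against destructive generated SQL."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchall()

# As if the model had generated this SQL in its tool-call arguments:
rows = run_query(
    "SELECT product, revenue FROM sales ORDER BY revenue DESC LIMIT 2")
```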
LLMs are notoriously unreliable at arithmetic and precise calculations. Function calling allows models to delegate mathematical operations, unit conversions, statistical analyses, and other computations to dedicated tools that produce exact results.
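A minimal arithmetic tool the model could delegate to. This uses Python's ast module to evaluate only whitelisted arithmetic operations, rather than calling eval on raw model output:

```python
import ast
import operator

# Whitelisted operators; anything else in the expression is rejected.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow, ast.USub: operator.neg}

def calculate(expression: str) -> float:
    """Tool body: safely evaluate an arithmetic expression string."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

# As if the model emitted {"expression": "(17.5 * 12) / 4"} as tool arguments:
calculate("(17.5 * 12) / 4")  # exact result, where a model might approximate
```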
Models can call web search functions to retrieve up-to-date information beyond their training data. ChatGPT Search, Perplexity AI, and other AI search products use function calling internally to trigger web searches, scrape pages, and retrieve relevant content [7].
Code interpreter tools allow models to write and execute code in sandboxed environments. OpenAI's Code Interpreter (now called Advanced Data Analysis), Anthropic's analysis tool, and similar features rely on function calling to pass generated code to an execution environment and return results.
Models can read, write, and manipulate files through function calls. This is the basis for AI-powered coding assistants like GitHub Copilot, Cursor, and Claude Code, which use tool calls to read source files, make edits, run tests, and interact with version control systems.
Complex tasks often require chaining multiple function calls together. An AI agent handling a customer refund might need to: (1) look up the customer's order, (2) check the refund policy, (3) calculate the refund amount, (4) process the refund through the payment system, and (5) send a confirmation email. Each step involves a separate function call, with the model orchestrating the sequence based on intermediate results.
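The refund workflow above can be sketched as a sequential chain in which each call consumes the previous result. All function bodies here are stubs, and in a real agent the model would choose each step from intermediate results rather than follow a fixed script:

```python
# Stub tools for the refund workflow; each step feeds the next.
def look_up_order(order_id):
    return {"order_id": order_id, "amount": 80.0, "days_since_purchase": 10}

def check_refund_policy(days_since_purchase):
    return {"eligible": days_since_purchase <= 30, "restocking_fee": 0.10}

def calculate_refund(amount, restocking_fee):
    return round(amount * (1 - restocking_fee), 2)

def process_refund(order_id, refund_amount):
    return {"status": "processed", "order_id": order_id,
            "refund": refund_amount}

# Sequential orchestration: later calls depend on earlier results,
# so unlike independent lookups these steps cannot be parallelized.
order = look_up_order("A-1001")
policy = check_refund_policy(order["days_since_purchase"])
if policy["eligible"]:
    refund = calculate_refund(order["amount"], policy["restocking_fee"])
    receipt = process_refund(order["order_id"], refund)
```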
Function calling is closely related to, and often overlaps with, structured output capabilities.
JSON mode (available from OpenAI, Google, and others) constrains the model to produce valid JSON in its response, but without enforcing a specific schema. This is useful when you need structured output but the schema varies between requests.
OpenAI introduced Structured Outputs in August 2024, which goes beyond JSON mode by guaranteeing that the model's output conforms to a developer-supplied JSON Schema. This uses constrained decoding, a technique where the model's token generation is restricted at each step to only tokens that would produce valid JSON matching the schema [8].
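For illustration, a schema-constrained request might look roughly like the following. Treat the field names as an approximation of OpenAI's documented response_format pattern rather than a verbatim SDK contract, since details can drift between API versions:

```python
# Approximate shape of a Structured Outputs request body (illustrative).
request_body = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Extract the event details."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "event",
            "strict": True,  # requests constrained decoding against the schema
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "date": {"type": "string"},
                },
                "required": ["title", "date"],
                "additionalProperties": False,
            },
        },
    },
}
```

With strict enforcement, the response is guaranteed to parse as JSON matching this schema, removing the retry-and-reparse logic that JSON mode alone still requires.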
Anthropic added dedicated structured output support in 2025, building on its existing tool use mechanism. With Anthropic's approach, developers define a "tool" with an input_schema, and Claude returns structured data as if calling that tool. This effectively uses the function calling infrastructure to deliver structured output [4].
| Approach | Schema enforcement | Provider support | Best for |
|---|---|---|---|
| JSON mode | Guarantees valid JSON; no schema enforcement | OpenAI, Google, Mistral | Variable-structure JSON output |
| Structured Outputs | Guarantees JSON matching a specific schema | OpenAI (native), Anthropic (via tool use) | Strict data extraction, form filling |
| Function calling | Guarantees structured function call output matching tool schema | All major providers | Tool invocation with typed arguments |
Function calling is the foundational mechanism that enables AI agents to take actions in the world. Without it, LLMs can only produce text; with it, they can query databases, call APIs, execute code, and interact with external systems.
An AI agent is a system where an LLM autonomously decides what actions to take, executes those actions via function calls, observes the results, and plans next steps. Frameworks like LangChain, LangGraph, CrewAI, and AutoGen provide abstractions for building agent loops on top of function calling [3].
The quality of function calling directly determines agent reliability. Models that accurately select the right function, provide correct arguments, and handle errors gracefully produce more effective agents. Conversely, function calling failures (wrong function selection, malformed arguments, hallucinated function names) are a primary source of agent failures.
Model Context Protocol (MCP) is an open standard introduced by Anthropic in November 2024 that provides a standardized way for AI models to discover and interact with external tools and data sources. While function calling defines how a model invokes a specific tool, MCP standardizes how tools are described, discovered, and connected to models across different applications and providers [9].
MCP gained rapid adoption in 2025. OpenAI officially adopted it in March 2025, and in December 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation co-founded by Anthropic, Block, and OpenAI [9]. MCP complements rather than replaces function calling: MCP provides the transport and discovery layer, while function calling provides the invocation mechanism.
| Concept | Function calling | MCP |
|---|---|---|
| What it does | Lets a model invoke a specific function with structured arguments | Standardizes how tools are described, discovered, and connected |
| Scope | Single model-to-tool interaction | Cross-application tool ecosystem |
| Defined by | Each LLM provider (OpenAI, Anthropic, Google) | Open standard (originally Anthropic, now AAIF) |
| Analogy | Making a specific API call | USB-C for AI tools: a universal connector |
Building reliable function calling applications requires attention to several design principles.
The model relies on function descriptions to decide when and how to use each tool. Descriptions should be specific, include examples of when the function should be used, and clarify any constraints or limitations. Vague descriptions lead to incorrect tool selection [1].
Always validate the arguments generated by the model before executing a function. While modern models produce valid JSON with high reliability, edge cases still occur, especially with complex nested schemas. Libraries like Pydantic (Python) and Zod (TypeScript) provide runtime validation that catches malformed arguments before they reach production systems [8].
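The validation step can be illustrated with a hand-rolled check against the get_weather schema from earlier. This stdlib-only version shows the idea; libraries like Pydantic or Zod perform the same role with far more rigor:

```python
import json

# Simplified constraints derived from the get_weather tool schema.
SCHEMA = {"required": ["location"],
          "types": {"location": str, "unit": str},
          "enums": {"unit": {"celsius", "fahrenheit"}}}

def validate_args(raw_json: str) -> dict:
    """Parse and validate model-generated arguments before execution."""
    args = json.loads(raw_json)               # raises on malformed JSON
    for field in SCHEMA["required"]:
        if field not in args:
            raise ValueError(f"missing required field: {field}")
    for field, value in args.items():
        expected = SCHEMA["types"].get(field)
        if expected and not isinstance(value, expected):
            raise TypeError(f"{field} must be {expected.__name__}")
        allowed = SCHEMA["enums"].get(field)
        if allowed and value not in allowed:
            raise ValueError(f"{field} must be one of {sorted(allowed)}")
    return args

args = validate_args('{"location": "Tokyo", "unit": "celsius"}')
```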
When a function call fails (due to invalid arguments, API errors, timeout, or permission issues), the error should be communicated back to the model in a clear, structured format. Good error messages allow the model to adjust its approach, perhaps trying different arguments or an alternative function. Silently swallowing errors leads to confusing model behavior.
Provide only the functions relevant to the current task. Including too many function definitions wastes context tokens and increases the chance of the model selecting the wrong tool. For applications with dozens of tools, consider implementing a tool selection or routing layer that presents only a relevant subset to the model on each turn.
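A crude sketch of such a routing layer. The keyword-overlap heuristic below is purely illustrative; production routers often use embedding similarity or a dedicated classifier model instead:

```python
# Full registry of tools (descriptions abbreviated). Only a relevant
# subset is sent to the model on each turn.
TOOL_REGISTRY = {
    "get_weather": "Get the current weather for a location",
    "search_flights": "Search for flights between two airports",
    "book_hotel": "Book a hotel room in a city",
    "convert_currency": "Convert an amount between currencies",
}

def select_tools(query: str, max_tools: int = 2):
    """Rank tools by word overlap with the query (illustrative heuristic)."""
    words = set(query.lower().split())
    scored = sorted(TOOL_REGISTRY.items(),
                    key=lambda kv: len(words & set(kv[1].lower().split())),
                    reverse=True)
    return [name for name, _ in scored[:max_tools]]

selected = select_tools("What's the weather like in Tokyo?")
```

Only the selected subset's full definitions would then be attached to the API request, keeping the context small and the choice space narrow.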
Function calling grants the model indirect access to external systems. Always implement proper authorization, input sanitization, and rate limiting on the execution side. Never allow the model to directly execute arbitrary code or unrestricted database queries without safeguards. The principle of least privilege applies: each function should have only the permissions it needs [2].
Models sometimes generate calls to functions that do not exist in the provided tool list, or fabricate argument values. This is a form of hallucination specific to function calling. Strict schema enforcement and validation reduce but do not eliminate this issue.
While models handle simple, single-step function calls reliably, complex workflows requiring many sequential function calls with conditional logic are more error-prone. The model may lose track of intermediate results, repeat function calls unnecessarily, or fail to synthesize information from multiple tool outputs.
Each function call adds a round trip between the model and the application, increasing total response time. For workflows requiring five or more sequential function calls, latency can become noticeable. Parallel function calling helps for independent calls, but sequential dependencies cannot be parallelized.
Despite conceptual similarity, the different API formats across providers create friction for developers building provider-agnostic applications. MCP addresses part of this problem at the tool description layer, but the invocation-level differences in request and response formats still require provider-specific code or adapter libraries.
As of early 2026, function calling is a mature, widely adopted capability supported by all major LLM providers and most open-source models. The focus has shifted from basic capability to reliability, performance, and ecosystem development.
Key developments include the convergence around MCP as a standard for tool discovery and description, the integration of function calling with structured output guarantees (constrained decoding), and the rise of agentic frameworks that orchestrate complex multi-tool workflows. Open-source models like LLaMA 4, Qwen 3, and Mistral Large 2 now support function calling natively, making tool use accessible outside of proprietary API ecosystems [10].
The ongoing challenge is improving the reliability of function calling in complex, multi-step scenarios where models must make dozens of tool use decisions in sequence. Research into better tool-use training data, improved constrained decoding, and more robust agent architectures continues across both industry and academia.