# Function calling

> Source: https://aiwiki.ai/wiki/function_calling
> Updated: 2026-06-21
> Categories: AI Agents, Large Language Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Function calling** is a capability of [large language models](/wiki/large_language_model) (LLMs) that lets a model decide, during generation, to invoke an external function or API and emit the function name plus its arguments as structured JSON, rather than answering only in natural language text. Also called **tool use** or **tool calling**, it is the foundational mechanism that lets LLMs query databases, call APIs, run code, and act in the world, and it is the lowest-level building block beneath higher abstractions such as [tool use](/wiki/tool_use), the [Model Context Protocol](/wiki/model_context_protocol) (MCP), and modern [AI agents](/wiki/ai_agent). The model proposes the call; the application executes it and feeds the result back, so the LLM stays in control of the decision while the developer's code keeps control of execution.

Function calling was introduced as a formal API feature by [OpenAI](/wiki/openai) on June 13, 2023, and has since been adopted by every major LLM provider, including [Anthropic](/wiki/anthropic) ([Claude](/wiki/claude)), Google ([Gemini](/wiki/gemini)), Mistral AI, [xAI](/wiki/xai) ([Grok](/wiki/grok)), and Meta ([Llama](/wiki/llama)). The capability is foundational to building [AI agents](/wiki/ai_agent), enabling LLMs to interact with databases, APIs, file systems, code interpreters, and other external systems in a controlled, structured manner [^1][^2].

In modern usage, function calling is often the lowest-level building block beneath higher abstractions such as [tool use](/wiki/tool_use), the [Model Context Protocol](/wiki/model_context_protocol) (MCP), and built-in agent platforms like the OpenAI Responses API and xAI Agent Tools API. By 2026, every major frontier model treats function calling as a default feature trained directly into the base model rather than a thin wrapper layered on top, and OpenAI's GPT-5.2 Thinking reports a state-of-the-art 98.7% on the tau2-bench Telecom multi-turn tool-use benchmark, illustrating how much the capability has matured since 2023 [^3].

## how does function calling work?

The function calling process follows a well-defined loop involving the developer, the model, and the application layer. Understanding this loop is essential to building reliable tool-augmented AI systems.

### step 1: define available functions

The developer provides the model with a list of available functions (tools) as part of the API request. Each function definition includes a name, a natural language description of what the function does, and a [JSON Schema](/wiki/json_schema) specifying the expected parameters and their types. Most providers use a subset of the OpenAPI/JSON Schema specification, which means features like recursive schemas, references, or rare keywords may not be supported [^4].

For example, a weather application might define a function like:

```json
{
  "name": "get_weather",
  "description": "Get the current weather for a given location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name, e.g. San Francisco"
      },
      "unit": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"]
      }
    },
    "required": ["location"]
  }
}
```

The model uses the function descriptions and parameter schemas to decide when and how to call each function. Clear, specific descriptions significantly improve the model's ability to choose the right function and provide correct arguments [^1].

### step 2: model decides to call a function

When the user sends a message (for example, "What's the weather in Tokyo?"), the model analyzes the query against the available function definitions. If the model determines that a function call would help it answer the query, it generates a structured output indicating the function to call and the arguments to pass, rather than generating a direct text response.

The model does not execute the function itself. It outputs a structured request that the application must intercept and act upon. This is a critical design choice: the model proposes actions, but the application controls execution [^5].

### step 3: application executes the function

The application parses the model's function call output, validates the arguments, and executes the actual function (for example, calling a weather API with the specified location). This step happens entirely outside the model. The developer's code is responsible for the actual execution, error handling, and any security checks. Validation typically uses runtime libraries like Pydantic in Python or Zod in TypeScript to confirm that arguments match the declared schema.

### step 4: return results to the model

The function's result is sent back to the model as a new message in the conversation. The model then uses this result to generate its final response to the user. For example, after receiving the weather data, the model might respond: "The current weather in Tokyo is 22 degrees Celsius with partly cloudy skies."

This four-step loop can repeat multiple times within a single interaction. The model might call one function, examine the result, decide it needs additional information, call another function, and then synthesize all results into a final response. This iterative process is sometimes called the agentic loop, and it sits at the core of every tool-using LLM application [^5].

### diagram of the function calling flow

| Step | Actor | Action |
|---|---|---|
| 1 | Developer | Defines available functions with names, descriptions, and JSON schemas |
| 2 | User | Sends a message or query |
| 3 | Model | Analyzes query, decides to call a function, outputs function name and arguments as JSON |
| 4 | Application | Parses function call, validates arguments, executes the actual function |
| 5 | Application | Returns function result to the model as a new message |
| 6 | Model | Incorporates result into final response, or makes additional function calls |

## history

### academic precursors (2022 to 2023)

Two research lines from 2022 and 2023 set the conceptual stage for production function calling. The **ReAct** paper by Shunyu Yao and colleagues (arXiv:2210.03629, October 6, 2022) showed that a language model could interleave reasoning traces and discrete actions, with each action expressed as a structured call against a tool such as a Wikipedia search API. On HotpotQA and Fever, ReAct reduced hallucination relative to chain-of-thought reasoning, and on the ALFWorld and WebShop environments it outperformed imitation and reinforcement-learning baselines by 34 and 10 absolute percentage points respectively, "while being prompted with only one or two in-context examples" [^6].

**Toolformer** (Schick et al., Meta AI and Universitat Pompeu Fabra, arXiv:2302.04761, February 9, 2023) went a step further by training a 6.7B-parameter GPT-J model to insert API calls into its own generations through a self-supervised procedure. Its authors describe Toolformer as "a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction" [^7]. Toolformer integrated five tools (calculator, Q&A system, two search engines, a translation system, and a calendar), demonstrated zero-shot gains over the much larger GPT-3 on math and factual tasks, and was presented at NeurIPS 2023 [^7].

**Gorilla** (Patil et al., UC Berkeley, arXiv:2305.15334, May 24, 2023) was the first system specifically focused on training a model to write valid API calls at scale. Gorilla fine-tuned LLaMA on 1,645 API calls from HuggingFace, TorchHub, and TensorHub via the APIBench dataset, introduced Retriever-Aware Training so the model adapts to changing API documentation, and reported that the resulting model outperformed GPT-4 on API-call correctness for that benchmark [^8]. The same research group later published the [Berkeley Function Calling Leaderboard](/wiki/bfcl) (BFCL), which became the dominant public benchmark for tool use [^9].

### before formal function calling

Before API-level function calling support, developers used ad hoc methods to get LLMs to interact with external systems. Common approaches included instructing the model to output JSON in a specific format within its text response, then parsing that text with regular expressions or custom parsers. These approaches were fragile. Models would frequently produce malformed JSON, include extra text around the JSON, or forget the required format entirely.

The early ReAct pattern was popularized in tooling through frameworks like LangChain (released October 2022), which provided abstractions for tool use that worked around the lack of native API support through careful [prompt engineering](/wiki/prompt_engineering) and output parsing. Results were unreliable compared to native function calling support [^10].

### when did OpenAI introduce function calling? (June 2023)

OpenAI launched function calling as a first-class API feature on June 13, 2023, alongside the release of updated [GPT-3.5](/wiki/gpt-3.5) Turbo and [GPT-4](/wiki/gpt-4) models specifically fine-tuned for function calling. The new model snapshots `gpt-4-0613` and `gpt-3.5-turbo-0613` were trained to detect when a function should be called and to respond with structured JSON. In the launch announcement, OpenAI wrote that the models "have been fine-tuned to both detect when a function needs to be called (depending on the user's input) and to respond with JSON that adheres to the function signature" [^11]. Developers could now define functions using JSON Schema and receive structured, validated function call outputs from the model with high reliability [^1][^11].

The initial release supported a single function call per turn. The API used a `function_call` parameter in the request and returned a `function_call` field in the response containing the function name and arguments as a JSON string. OpenAI explicitly framed this as an implementation of the ReAct pattern, but with models fine-tuned to execute it reliably rather than improvised through prompting [^11].

### Anthropic tool use (May 2024)

Anthropic introduced tool use for [Claude](/wiki/claude) models in beta during early 2024 and made it generally available on May 30, 2024 across the entire Claude 3 model family on the Anthropic Messages API, Amazon Bedrock, and Google Cloud Vertex AI [^12]. The GA release shipped with streaming tool calls (for incremental rendering of arguments), forced tool selection, and support for image inputs as tool inputs and outputs [^12]. Anthropic uses a slightly different API design from OpenAI. Rather than a separate `function_call` field, Anthropic integrated tool use into its content block structure. Claude models return a `tool_use` content block containing the tool name, a unique ID, and the input arguments. Results are sent back as `tool_result` content blocks referencing that ID. This design was motivated by Anthropic's broader content block architecture, which treats text, images, and tool calls as parallel content types within a single message [^13].

Anthropic's GA release also introduced the `tool_choice` parameter (with values `auto`, `any`, or a specific tool name) and added support for tools that return images, enabling multimodal tool results. Later versions added `strict` tool definitions that guarantee schema conformance through constrained decoding.

### Google Gemini function calling (December 2023)

Google's [Gemini](/wiki/gemini) models support function calling through a `functionCall` response type, declared via `FunctionDeclaration` objects that use a subset of the OpenAPI schema format. Google's implementation supports both single-turn and multi-turn function calling patterns and exposes a `function_calling_config` parameter with modes `AUTO`, `ANY`, `VALIDATED`, and `NONE` to control whether the model is forced to emit a tool call [^4]. Gemini 3, released November 18, 2025, added unique IDs for each function call, encrypted thought signatures that callers must echo back to preserve reasoning continuity, and 100+ simultaneous tool calls in a single turn for the Flash tier [^14].

### Mistral, Llama, and xAI catch up (2024)

Mistral Large 2 (`mistral-large-2407`), released July 24, 2024, was trained for both parallel and sequential function calling on La Plateforme and Le Chat [^15]. Meta's Llama 3.1 release on July 23, 2024 added native tool calling for the first time in the Llama family, including the 405B parameter open-weight model that briefly topped the BFCL leaderboard [^16]. Llama 4 (released April 2025) extended this with native parallel tool calling, optimized for the `llama4_pythonic` chat template, while community testing reports somewhat less consistent JSON adherence than Claude or GPT-5 [^17]. xAI added tool calling to Grok in 2024 and shipped a dedicated Agent Tools API alongside Grok 4.1 Fast on November 19, 2025, treating function calling as a server-managed primitive rather than a parameter on the chat endpoint [^18].

### Parallel function calling (November 2023)

In November 2023, OpenAI introduced parallel function calling with the release of GPT-4 Turbo. This feature allows the model to request multiple function calls in a single turn rather than calling functions one at a time. For example, if a user asks "What's the weather in Tokyo and New York?", the model can output two function calls simultaneously, allowing the application to execute them in parallel and return both results at once [^1].

Parallel function calling significantly reduces the number of round trips between the application and the API, improving both latency and user experience for multi-tool queries. The feature can be controlled with the `parallel_tool_calls` parameter. Setting it to `false` ensures at most one tool call per turn. Anthropic, Google, Mistral, xAI, and Meta have since added equivalent parallel-call support to their own APIs.

### Evolution of terminology

OpenAI initially used the term "function calling" but later transitioned to "tool calling" in its API, renaming the `functions` parameter to `tools` and `function_call` to `tool_choice`. The older function calling format is now considered legacy. Despite this rename, "function calling" remains the more widely recognized term in the broader community, and most providers use both terms interchangeably [^1].

### Computer use as function calling (October 2024)

On October 22, 2024, Anthropic released the **Computer Use** capability for `claude-3-5-sonnet-20241022` in public beta, marking the first frontier model trained to control a desktop through tool calls [^19]. The model receives screenshots and emits `computer` tool calls (`mouse_move`, `left_click`, `type`, `key`, `screenshot`) using a polymorphic tool type signalled via the new `tool.type` property and the `anthropic_beta: computer-use-2024-10-22` header [^19]. OpenAI followed with a Computer Use tool in the Responses API (March 2025), Google released Project Mariner in the same window, and all three rely on standard function calling as the transport layer for GUI actions [^20]. The pattern reframed function calling as the foundation not just for API access but for embodied action through a display.

### Responses API and built-in tools (March 2025)

On March 11, 2025, OpenAI launched the [Responses API](/wiki/openai_responses_api), a new API primitive that combines elements of the Chat Completions and Assistants APIs and natively supports built-in server-side tools including web search, file search, and computer use [^20]. The same release introduced the open-source OpenAI Agents SDK and shifted the agent development story away from raw `tools` arrays toward higher-level agent objects with built-in tool routing. Reasoning models such as o3 and o4-mini can call tools and functions directly within their chain of thought in the Responses API, producing more contextually grounded answers and significantly higher tool-calling scores on industry benchmarks [^20].

Anthropic added equivalent server tools (`web_search`, `code_execution`, `web_fetch`, `tool_search`) and an MCP connector to the Messages API in 2025, while Google added live tool routing inside Gemini 3, and xAI shipped the Agent Tools API alongside Grok 4.1 Fast in November 2025 [^18].

### Realtime and voice tool calls (2024 to 2025)

Function calling expanded beyond text in October 2024 with OpenAI's Realtime API for voice agents, then matured with the `gpt-realtime` model GA in 2025. The new model lifted scores on the ComplexFuncBench audio evaluation from 49.7% (December 2024 baseline) to 66.5%, added support for remote MCP servers, SIP-based phone calling, and image inputs, and crucially allowed long-running tool calls to execute without disrupting an open voice session [^21]. The same pattern (continuous audio in, tool calls streamed out, results streamed back) now underpins production voice agents for customer support, scheduling, and field operations.

### GPT-5 freeform tool calls and grammar constraints (August 2025)

The GPT-5 family, released by OpenAI in 2025, expanded function calling along two new dimensions [^22]. **Freeform tool calls** let developers register a tool with `{ "type": "custom" }` and receive raw text payloads from the model, with no JSON envelope. This is useful when the destination tool expects Python, SQL, a shell command, or another DSL natively, since wrapping that payload in JSON forced an awkward escaping step in earlier APIs [^22]. **Context-free grammar (CFG) constraints** let the developer attach a Lark grammar (or a regex) to a custom tool, after which OpenAI's LLGuidance constrained-decoding backend forces the model's output to conform to the grammar [^23]. Together these features cover use cases that pure JSON Schema could not express, like SQL dialects, structured configuration files, or domain-specific languages.

GPT-5.2 (released December 11, 2025) added an `allowed_tools` parameter under `tool_choice` that defines a full toolkit but restricts the model to a specific subset per request, improving safety and prompt-cache hit rates without prompt-engineering hacks [^3]. GPT-5.5 (April 24, 2026) continued the same trajectory with better error recovery mid-task, more efficient tool sequencing, and reduced rate of confidently bad plans, positioning it as an agentic model rather than a chat model [^24].

### Advanced tool use beta (November 2025)

On November 24, 2025, Anthropic launched a coordinated **Advanced Tool Use** beta on the Claude Developer Platform, introducing three new features [^25]:

- **Tool Search**: a server-side tool that lets Claude query a catalog of tool definitions on demand rather than loading all of them into context up front. Anthropic reports that a typical multi-server setup of 58 tools consumes approximately 55,000 tokens "before the conversation even starts," and that tool search cuts total context consumption from roughly 77,000 tokens to about 8,700 tokens, "an 85% reduction in token usage while maintaining access to your full tool library," while also keeping tool-selection accuracy high beyond the 30 to 50 tool threshold where unfiltered catalogs typically start degrading [^25].
- **Programmatic Tool Calling (PTC)**: Claude writes Python code inside a sandboxed code-execution container that invokes the registered tools, processes intermediate results, and decides what enters the model's main context. Anthropic reports that "average usage dropped from 43,588 to 27,297 tokens, a 37% reduction on complex research tasks," plus accuracy gains from explicit programmatic control flow [^25][^26].
- **Tool Use Examples**: a standardized way to attach concrete input-output examples to a tool definition so the model sees demonstrations of correct argument shapes, units, and conventions that a JSON Schema cannot fully express. Anthropic reports that tool use examples "improved accuracy from 72% to 90% on complex parameter handling" [^25].

These features sit alongside Claude's prior strict tool use, parallel calling, and server tool catalog (web search, code execution, web fetch, tool search, MCP connector), and they marked Anthropic's clearest statement that tool catalogs of hundreds to thousands of tools are now the design target.

## json schema and tool definitions

Function definitions across all major providers share three required pieces: a `name` (an identifier the model emits when calling), a `description` (natural language guidance about when and how to use the tool), and a `parameters` block (a JSON Schema describing the arguments).

Most providers support a useful subset of JSON Schema:

- `type` (`object`, `string`, `number`, `integer`, `boolean`, `array`)
- `properties` and `required`
- `enum` for fixed value sets
- `description` on individual fields
- `items` for arrays
- `oneOf`, `anyOf`, `allOf` (with provider-specific limits)

Deeply nested objects, recursive references, and arbitrary `$ref` chains are commonly rejected or silently flattened. Gemini's documentation explicitly notes that very large or deeply nested schemas may be rejected in `ANY` mode [^4]. OpenAI's strict mode and Anthropic's strict tool use both impose additional restrictions in exchange for guaranteed schema conformance, including required `additionalProperties: false` and a small set of supported keywords.

A practical rule: keep schemas flat, name fields descriptively, use `enum` whenever there is a closed set of options, and lean on the `description` field to disambiguate near-duplicate parameters. The model treats descriptions as part of the prompt, so a clear description of `unit: "celsius" | "fahrenheit"` does more work than a long if-then in the system prompt.

### beyond JSON Schema

For inputs that do not fit JSON well, GPT-5 custom tools accept either a raw freeform string or a Lark/regex grammar that constrains the generated text, letting the developer enforce a SQL dialect, a shell-style command, or any other DSL without round-tripping through JSON [^22][^23]. Anthropic's Tool Use Examples cover a similar gap from the other direction by demonstrating correct usage in-context rather than relying solely on schema and description fields [^25].

## how do providers differ on function calling?

Each major LLM provider implements function calling with a different API design, though the core concept is the same across all of them.

| Feature | OpenAI (GPT-5.x, GPT-4o, o-series) | Anthropic ([Claude](/wiki/claude)) | Google ([Gemini](/wiki/gemini)) | Mistral AI | xAI ([Grok](/wiki/grok_4_1_fast)) | Meta ([Llama](/wiki/llama)) 3.1+ |
|---|---|---|---|---|---|---|
| API term | Tool calls | Tool use | Function calling | Tool calls | Tool calls / Agent Tools API | Tool calls |
| First public release | June 13, 2023 | Beta in early 2024, GA May 30, 2024 | December 2023 (Gemini API) | February 2024 | 2024, Agent Tools API Nov 19, 2025 | July 23, 2024 (Llama 3.1) |
| Definition format | JSON Schema in `tools` array; `custom` type for freeform / CFG | JSON Schema in `tools` array with `input_schema` | `FunctionDeclaration` (OpenAPI subset) | JSON Schema in `tools` array | JSON Schema in `tools` array | JSON Schema or built-in tool tags |
| Response format | `tool_calls` array with `function.name` and `function.arguments`; raw text for custom tools | `tool_use` content block with `name` and `input` | `functionCall` with `name` and `args` (with unique `id` in Gemini 3) | `tool_calls` with `function` object | `tool_calls` array | `<|python_tag|>` tokens or JSON tool calls |
| Parallel calls | Yes, default on, `parallel_tool_calls` parameter | Yes, supported across Claude 3, 4, and 4.x families | Yes, 100+ per turn in Gemini 3 Flash, per-call IDs | Yes (Mistral Large 2 onward) | Yes | Yes (parallel in Llama 4; serial in Llama 3) |
| Strict / schema enforcement | `strict: true` (Structured Outputs); CFG via Lark/regex | `strict: true` (strict tool use) | `function_calling_config: ANY` plus VALIDATED mode | JSON mode | Schema-aware decoding | Depends on inference engine (vLLM, llama.cpp) |
| Force tool use | `tool_choice: {"type": "function", "function": {"name": "..."}}`; `allowed_tools` subset (GPT-5.2) | `tool_choice: {"type": "tool", "name": "..."}` | `tool_config` with `ANY` mode | `tool_choice: "any"` or specific function | `tool_choice` with function name | Provider-dependent |
| Streaming | Yes, with delta chunks for tool call arguments | Yes, with `content_block_delta` events | Yes | Yes | Yes | Yes |
| Built-in server tools | Web search, file search, computer use, code interpreter, image generation, MCP (Responses API, March 2025) | Web search, code execution, web fetch, tool search, MCP connector | Google Search, code execution, URL context | Limited | Web search, X search, file search, code execution, MCP | None native; via inference frameworks |
| Advanced features | Freeform/custom tools, Lark/CFG grammars, `allowed_tools`, Responses API agents | Programmatic Tool Calling (PTC), Tool Search, Tool Use Examples | Thought signatures, live tool routing | JSON mode + schema | Server-managed Agent Tools API | Inference-time grammar via vLLM/llama.cpp |

The overall picture by 2026: every provider exposes the same conceptual loop, but field names and content-block structures differ enough that most production codebases either pin to one provider or use an adapter library such as LiteLLM, LangChain, or the Vercel AI SDK to abstract the differences.

## structured output and function calling

Function calling is closely related to, and often overlaps with, structured output capabilities. The two features share most of their underlying machinery (JSON Schema and constrained decoding) but solve slightly different problems.

### json mode

JSON mode, available from OpenAI, Google, Mistral, and others, constrains the model to produce valid JSON in its response without enforcing a specific schema. It is useful when you need structured output but the schema varies between requests, or when an old code path expects a JSON blob and you do not want to refactor it into a tool definition.

### structured outputs

OpenAI introduced [Structured Outputs](/wiki/structured_output) on August 6, 2024, going beyond JSON mode by guaranteeing that the model's output conforms to a developer-supplied JSON Schema. The new `response_format` option accepts a `json_schema` field, and a parallel `strict: true` flag on tool definitions delivers the same guarantee for function calls. Under the hood, OpenAI uses constrained decoding (sometimes called constrained sampling) to restrict the model's token generation at each step to only tokens that would produce valid JSON matching the schema. The launch model `gpt-4o-2024-08-06` was trained to handle complicated schemas and reportedly hit 100% schema conformance on internal evaluations [^27].

Anthropic added equivalent strict tool use in 2025, building on its existing tool use mechanism. Developers define a tool with an `input_schema` and add `strict: true`, and Claude returns structured data with the same conformance guarantees. Google's Gemini API offers `responseSchema` and the `VALIDATED` function calling mode for similar guarantees, and Mistral exposes JSON mode plus schema-constrained outputs on La Plateforme.

| Approach | Schema enforcement | Provider support | Best for |
|---|---|---|---|
| JSON mode | Guarantees valid JSON, no schema enforcement | OpenAI, Google, Mistral, Anthropic | Variable-structure JSON output |
| Structured Outputs | Guarantees JSON matching a specific schema | OpenAI (native, Aug 2024), Anthropic (strict tool use, 2025), Google (responseSchema), Mistral (json_object with schema) | Strict data extraction, form filling, tool args |
| Function calling | Guarantees structured function call output matching tool schema | All major providers | Tool invocation with typed arguments |
| Freeform custom tools + CFG | Guarantees raw text matching a Lark/regex grammar | OpenAI (GPT-5+) | SQL, DSLs, code payloads where JSON would be awkward |

## reasoning models and tool use

Reasoning-focused models that produce extended chains of thought have changed how tool use is integrated into the generation loop. OpenAI's o-series (o1, o3, o3-mini, o4-mini) and the GPT-5 line can call tools directly inside their reasoning traces in the Responses API, interleaving thought and action without a full round-trip to the application after every step [^20]. This pattern improves complex tool-calling benchmarks by letting the model think, call, observe, and revise within a single response.

Claude's extended thinking, available on Claude Sonnet 4.5 and Claude Opus 4.x (including Opus 4.7, released April 16, 2026), similarly supports tool use during the thinking phase [^28]. Google's Gemini Deep Think and the Gemini 3 family expose a comparable interleaved tool mode using encrypted thought signatures that the developer echoes back across turns to preserve reasoning continuity [^14], and xAI's `grok-4-1-fast-reasoning` is explicitly marketed as the reasoning variant of Grok 4.1 Fast for complex tool workflows [^18]. The general trend by 2026 is that the boundary between "reasoning" and "tool use" has dissolved. Frontier models reason, call tools, observe results, and reason again as a single continuous process, and the resulting tau2-bench Telecom and BFCL v4 scores cluster much higher than pre-2025 non-reasoning baselines [^3].

## prompt caching for tool definitions

Tool catalogs are typically static across many requests, which makes them ideal candidates for [prompt caching](/wiki/prompt_caching). Both OpenAI and Anthropic now cache tool definitions explicitly:

- **OpenAI** caches the longest matching prefix of a request that exceeds 1,024 tokens, with cache hits extending in 128-token increments. The cacheable prefix includes the messages array, images, audio, **tool definitions, and structured-output schemas**. Tool definitions must be byte-identical and ordered identically across requests to share a cache entry [^29].
- **Anthropic** uses up to four explicit `cache_control` breakpoints per request. The request hierarchy is **Tools then System then Messages**, so placing a breakpoint on the last system block automatically caches the entire tools array as part of the prefix. Cache reads bill at 0.1 of normal input pricing. As of early 2026 Anthropic's default cache TTL is 5 minutes (resetting on each read), with an optional 1-hour TTL at a higher write cost [^30].

For agents with large tool catalogs (the multi-server MCP setups Anthropic measures at roughly 55,000 tokens of definitions [^25]), caching can change the cost profile by an order of magnitude relative to including the same definitions uncached in every request.

## frameworks and ecosystem

The ecosystem around function calling has grown well beyond raw API calls. Developers commonly use higher-level frameworks that handle the agentic loop, memory, multi-agent coordination, and integration with vector stores or databases.

- [OpenAI Agents SDK](/wiki/openai_agents_sdk) (launched March 11, 2025) is OpenAI's open-source replacement for the experimental Swarm SDK, offering Agents, Handoffs, Guardrails, and tracing on top of the Responses API [^20].
- LangChain and LangGraph offer provider-agnostic tool abstractions and a graph-based runtime for stateful multi-step agents.
- LlamaIndex provides agent loops and tight integration with retrieval pipelines.
- CrewAI focuses on multi-agent collaboration with role-based agents.
- AutoGen (Microsoft Research) supports conversational multi-agent systems with structured tool calling.
- The Anthropic computer use API and OpenAI computer use tool let models drive a full desktop through screenshots and synthesized keyboard or mouse events, with each action expressed as a tool call.
- Browser automation systems such as OpenAI's Operator, Google's Project Mariner, and the Anthropic computer use reference implementation extend the same pattern to a real browser.

These frameworks all share a common core: a function calling loop that emits tool calls, executes them, and feeds results back into the conversation.

## how is function calling benchmarked?

Function calling capability is now evaluated by a small family of public benchmarks, with the Berkeley Function Calling Leaderboard (BFCL) and Sierra Research's tau-bench family being the most influential.

### Berkeley Function Calling Leaderboard (BFCL)

The [Berkeley Function Calling Leaderboard](/wiki/bfcl) (Yan et al., 2024) is the de facto standard for evaluating tool use in LLMs. It uses an Abstract Syntax Tree (AST) evaluation method that scales to thousands of functions and tests both serial and parallel function calls across multiple programming languages [^9]. BFCL has gone through four major versions:

- BFCL v1 (2024) focused on single-turn function calling correctness.
- BFCL v2 added multilingual coverage and live functions.
- BFCL v3 added multi-turn evaluation.
- BFCL v4 (2025) added agentic evaluation, including web search, memory, and format sensitivity. The overall score is computed as Agentic 40%, Multi-Turn 30%, Live 10%, Non-Live 10%, and Hallucination 10%.

The BFCL paper appeared at ICML 2025 (Patil et al.) under the title "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models" [^9]. The authors note that "while state-of-the-art LLMs excel at single-turn calls, memory, dynamic decision-making, and long-horizon reasoning remain open challenges" [^9]. As of early 2026, top scores on the public leaderboard cluster between 0.70 and 0.89, with Llama 3.1 405B Instruct historically near the top of the open-weight rankings at 0.885 and a tight pack of frontier proprietary models behind it [^9][^31]. By April 2026, frontier proprietary models like GPT-5 and Claude Opus 4.6 also lead WebArena and OSWorld in addition to BFCL, with DeepSeek-V3.2 the strongest open-weight model and typically sitting within 10 percentage points of the proprietary leaders at a fraction of the per-token cost [^32].

### tau-bench and tau2-bench

Sierra Research released [tau-bench](https://taubench.com/) in 2024 and the harder follow-up tau2-bench in 2025 (arXiv 2506.07982). tau-bench simulates dynamic conversations between a user (also played by an LLM) and a tool-using language agent that must follow domain-specific policies. It originally covered Mock, Airline, and Retail domains. tau2-bench (Barres et al., 2025) introduced a dual-control environment where both the agent and the user can modify a shared world state, and added a Telecom domain that tests technical troubleshooting, problem solving, and clear communication [^33].

The Telecom subset of tau2-bench is one of the hardest public tool-use benchmarks. GPT-5.2 Thinking reports a state-of-the-art 98.7% on tau2-bench Telecom, the GLM 4.7 and GLM 5 series exceed 98%, and Grok 4.1 Fast reasoning posts strong results in the same range [^3][^18][^34]. Older non-reasoning models often fall below 50% on the same domain, which is a useful indicator of how much reasoning depth a model brings to multi-turn tool conversations.

### audio and other modalities

Tool calling benchmarks have extended beyond text. OpenAI's **ComplexFuncBench** evaluates function calling on audio inputs, where `gpt-realtime` scored 66.5% versus 49.7% for the December 2024 baseline [^21]. Voice agents introduce new failure modes (long-running calls during continuous audio, barge-in, interruption recovery) that text-only benchmarks did not previously cover.

## what is function calling used for?

Function calling enables a broad range of applications by bridging the gap between language understanding and programmatic action.

### api integration

The most common use case is connecting an LLM to external APIs. A travel assistant might call flight search, hotel booking, and weather APIs. A productivity assistant might call calendar, email, and task management APIs. Function calling provides the structured interface that makes these integrations reliable.

### database queries

Models can generate SQL queries or call database wrapper functions to answer questions about structured data. A business analyst chatbot might receive a question like "What were our top 10 products by revenue last quarter?" and generate a function call to execute the appropriate database query, then present the results in natural language. GPT-5's freeform custom tools combined with a Lark grammar for SQL syntax are particularly effective here because the model can emit valid SQL directly rather than through a JSON envelope [^22][^23].

### calculations and data processing

LLMs are unreliable at arithmetic and precise calculations. Function calling allows models to delegate mathematical operations, unit conversions, statistical analyses, and other computations to dedicated tools that produce exact results. Code interpreter tools take this further by letting the model write and execute Python in a sandbox.

### web browsing and search

Models can call web search functions to retrieve up-to-date information beyond their training data. ChatGPT Search, [Perplexity AI](/wiki/perplexity_ai), and other [AI search](/wiki/ai_search) products use function calling internally to trigger web searches, scrape pages, and retrieve relevant content. The Responses API (OpenAI), Anthropic `web_search` server tool, Gemini Google Search tool, and xAI Agent Tools API all expose this as a built-in primitive.

### code execution

Code interpreter tools allow models to write and execute code in sandboxed environments. OpenAI's Code Interpreter (now called Advanced Data Analysis), Anthropic's analysis tool, the Gemini code execution tool, and the Grok Agent Tools code execution sandbox all rely on function calling to pass generated code to an execution environment and return results. Anthropic's Programmatic Tool Calling pushes this further by letting Claude write Python that orchestrates many other tools in a single inference, reducing context pollution [^25][^26].

### file and document operations

Models can read, write, and manipulate files through function calls. This is the basis for AI-powered coding assistants like [GitHub Copilot](/wiki/github_copilot), [Cursor](/wiki/cursor), and [Claude Code](/wiki/claude_code), which use tool calls to read source files, make edits, run tests, and interact with version control systems.

### multi-step workflows

Complex tasks often require chaining multiple function calls together. An AI agent handling a customer refund might need to look up the customer's order, check the refund policy, calculate the refund amount, process the refund through the payment system, and send a confirmation email. Each step involves a separate function call, with the model orchestrating the sequence based on intermediate results.

### desktop and browser automation

The Anthropic computer use API, OpenAI Operator, and Google Project Mariner translate full GUI interactions into tool calls. The model receives screenshots, decides on a click or keystroke, and emits a tool call describing the action. Despite the visual interface, the underlying transport is still function calling [^19].

### voice and realtime agents

Voice agents built on OpenAI's Realtime API or equivalent providers use function calling to look up customer records, place orders, schedule calls, transfer to human agents, and so on, all without interrupting the audio session. The Realtime API specifically guarantees that long-running tool calls do not block the conversational flow, so an agent can keep speaking while a slow backend call resolves [^21].

## how does function calling relate to ai agents and mcp?

Function calling is the foundational mechanism that enables [AI agents](/wiki/ai_agent) to take actions in the world. Without it, LLMs can only produce text. With it, they can query databases, call APIs, execute code, and interact with external systems.

### ai agents

An [agent](/wiki/agent) is a system where an LLM autonomously decides what actions to take, executes those actions via function calls, observes the results, and plans next steps. Frameworks like LangChain, LangGraph, CrewAI, and AutoGen provide abstractions for building agent loops on top of function calling [^10].

The quality of function calling directly determines agent reliability. Models that accurately select the right function, provide correct arguments, and handle errors gracefully produce more effective agents. Conversely, function calling failures (wrong function selection, malformed arguments, hallucinated function names) are a primary source of agent failures.

### how does function calling differ from MCP?

[Model Context Protocol](/wiki/model_context_protocol) (MCP) is an open standard introduced by Anthropic in November 2024 that provides a standardized way for AI models to discover and interact with external tools and data sources. While function calling defines how a model invokes a specific tool, MCP standardizes how tools are described, discovered, and connected to models across different applications and providers [^35].

MCP gained rapid adoption in 2025. OpenAI added official MCP support in March 2025, Google added MCP for Gemini, and xAI integrated MCP into the Agent Tools API in November 2025 [^18]. On December 9, 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation co-founded by Anthropic, Block, and OpenAI, with support from Google, Microsoft, AWS, Cloudflare, and Bloomberg [^36]. By early 2026, more than 10,000 MCP servers had been published, covering everything from databases and file systems to enterprise SaaS connectors [^35].

MCP complements rather than replaces function calling. MCP provides the transport, discovery, and capability description layer, while function calling provides the actual invocation mechanism that the model uses inside a single conversation.

| Concept | Function calling | MCP |
|---|---|---|
| What it does | Lets a model invoke a specific function with structured arguments | Standardizes how tools are described, discovered, and connected |
| Scope | Single model-to-tool interaction | Cross-application tool ecosystem |
| Defined by | Each LLM provider (OpenAI, Anthropic, Google, etc.) | Open standard (originally Anthropic, now AAIF / Linux Foundation) |
| Analogy | Making a specific API call | A USB-C connector for AI tools |

## best practices

Building reliable function calling applications requires attention to several design principles.

### clear function descriptions

The model relies on function descriptions to decide when and how to use each tool. Descriptions should be specific, include examples of when the function should be used, and clarify any constraints or limitations. Vague descriptions lead to incorrect tool selection. Treat the description as part of the prompt, because that is exactly how the model uses it [^1]. Anthropic's Tool Use Examples feature formalizes this by letting developers attach concrete example calls to a tool definition for the model to imitate, which it reports raised accuracy from 72% to 90% on complex parameter handling [^25].

### keep schemas flat and use enums

Deeply nested schemas hurt accuracy and increase token cost. Prefer flat objects with descriptive field names. Use `enum` whenever the parameter has a known finite set of values, and use the `description` field on each property to disambiguate similar parameters.

### schema validation

Always validate the arguments generated by the model before executing a function. Modern models with strict tool use produce valid JSON with very high reliability, but edge cases still occur, especially with complex nested schemas. Libraries like Pydantic (Python) and Zod (TypeScript) provide runtime validation that catches malformed arguments before they reach production systems.

### error handling

When a function call fails (due to invalid arguments, API errors, timeout, or permission issues), the error should be communicated back to the model in a clear, structured format. Good error messages allow the model to adjust its approach, perhaps trying different arguments or an alternative function. Silently swallowing errors leads to confusing model behavior.

### limit tool scope and use tool routing

Provide only the functions relevant to the current task. Including too many function definitions wastes context tokens and increases the chance of the model selecting the wrong tool. Tool-selection accuracy on Claude is reported to degrade noticeably past 30 to 50 tools when the entire catalog is loaded at once, which is the motivation for Anthropic's `tool_search` server tool that surfaces only a relevant subset on each turn [^25]. OpenAI's GPT-5.2 `allowed_tools` parameter solves the same problem from the other side by letting a developer keep a large registered toolset but restrict the model to a request-specific subset, which also improves prompt-cache hit rates [^3].

### cache tool definitions

Tool definitions are typically static and worth caching. Place a cache breakpoint on the last block of the prefix (after tools and system on Anthropic, or in the natural prefix order on OpenAI) and keep the tool array byte-identical and ordered identically across requests so the cache prefix matches [^29][^30].

### security considerations

Function calling grants the model indirect access to external systems. Always implement proper authorization, input sanitization, and rate limiting on the execution side. Never allow the model to directly execute arbitrary code or unrestricted database queries without safeguards. The principle of least privilege applies: each function should have only the permissions it needs. Treat any output from a tool as untrusted user input from a prompt injection perspective, and require user confirmation for actions with real-world impact like sending email, making payments, or posting publicly [^11][^37].

## challenges and limitations

### hallucinated function calls

Models sometimes generate calls to functions that do not exist in the provided tool list, or fabricate argument values. This is a form of hallucination specific to function calling. Strict schema enforcement and validation reduce but do not eliminate this issue. The BFCL Hallucination subscore is a direct attempt to measure this failure mode [^9].

### complex multi-step reasoning

While models handle simple, single-step function calls reliably, complex workflows requiring many sequential function calls with conditional logic are more error-prone. The model may lose track of intermediate results, repeat function calls unnecessarily, or fail to synthesize information from multiple tool outputs. tau2-bench was designed in part to expose exactly this kind of multi-turn drift [^33].

### latency

Each function call adds a round trip between the model and the application, increasing total response time. For workflows requiring five or more sequential function calls, latency can become noticeable. Parallel function calling helps for independent calls, but sequential dependencies cannot be parallelized. Server-side built-in tools (Responses API, Anthropic server tools, xAI Agent Tools API) reduce client-side round trips at the cost of less control over execution. Programmatic Tool Calling pushes much further by letting the model itself orchestrate multiple tools inside a code-execution container, collapsing many round-trips into one [^25][^26].

### token cost of long tool definitions

Tool schemas live in the context window and are charged as input tokens. A large tool catalog with verbose descriptions can easily consume thousands of tokens per request before the user message is even processed. Multi-server MCP setups are reported to use roughly 55,000 tokens of tool definitions, which is a large fraction of common context windows. Mitigations include explicit prompt caching of the tool block, on-demand tool discovery (`tool_search` on Claude, `allowed_tools` on GPT-5.2), and Anthropic's per-model token cost transparency for the special system prompt that enables tool use (346 input tokens for Claude Opus 4.x and Sonnet 4.x with `tool_choice: auto`) [^13][^25][^3].

### interoperability

Despite conceptual similarity, the different API formats across providers create friction for developers building provider-agnostic applications. MCP addresses part of this problem at the tool description layer, but the invocation-level differences in request and response formats still require provider-specific code or adapter libraries.

### prompt injection and tool abuse

Because tools effectively grant the model write access to real systems, function calling is now the primary attack surface for prompt-injection-based exploits. OWASP's 2025 LLM Top 10 ranks prompt injection as the #1 risk for LLM applications, and 2025 saw documented production incidents including a "second-order" injection in ServiceNow Now Assist (a low-privilege agent persuading a higher-privilege peer to exfiltrate case files) and three injection CVEs in Anthropic's official Git MCP server (CVE-2025-68143, CVE-2025-68144, CVE-2025-68145) leading to code execution and data exfiltration when an attacker controlled file contents the agent reads [^37][^38]. Mitigations include treating all tool output as untrusted input, requiring confirmation for high-impact actions, narrowing tool scopes per request, and using policy engines that intercept and approve every tool call (for example Microsoft's Agent Governance Toolkit, released April 2026, which applies YAML, OPA Rego, or Cedar policies in under 0.1 ms p99) [^39].

## current state (2025-2026)

As of early to mid 2026, function calling is a mature and widely adopted capability supported by all major LLM providers and most open-source models. The focus has shifted from basic capability to reliability, performance, and ecosystem development.

Key 2025 to 2026 developments include:

- The convergence around MCP as a standard for tool discovery and description, now stewarded by the Linux Foundation Agentic AI Foundation [^36].
- Server-side built-in tools across all major APIs (OpenAI Responses API, Anthropic server tools, Gemini built-in tools, xAI Agent Tools API), reducing the need for developers to implement common tools like web search and code execution from scratch.
- Reasoning models that interleave tool calls with extended chain-of-thought, lifting top tau2-bench Telecom scores above 98% (GPT-5.2 Thinking at 98.7%, GLM 4.7/5 series above 98%) and BFCL v4 results well above pre-2025 baselines [^3][^9][^34].
- A wave of advanced tool-use features in late 2025: GPT-5 freeform custom tools with Lark/CFG grammars (August 2025), Anthropic's Advanced Tool Use beta (Programmatic Tool Calling, Tool Search, Tool Use Examples; November 24, 2025), and Gemini 3's thought signatures plus 100+ parallel calls per turn (November 18, 2025) [^14][^22][^23][^25].
- Open-weight models like Llama 3.1 405B, Llama 4, Qwen3 235B, GLM 5, and DeepSeek-V3 reaching parity with closed models on tool-calling benchmarks, making robust tool use accessible outside proprietary API ecosystems [^16][^17][^32].
- Increased emphasis on agentic evaluation (BFCL v4 agentic split, tau2-bench, AgentBench, SWE-bench, ComplexFuncBench audio) over single-turn benchmarks [^21].
- Prompt-injection-driven tool abuse moving from theoretical concern to documented CVEs, prompting policy-engine and governance products like Microsoft's Agent Governance Toolkit (April 2026) [^37][^38][^39].
- Continued model release cadence focused on tool-call reliability: GPT-5.2 (December 11, 2025), Claude Opus 4.7 (April 16, 2026), and GPT-5.5 (April 24, 2026) all foreground tool-use accuracy, error recovery, and long-horizon agent behavior as headline improvements [^3][^24][^28].

The ongoing challenge is improving reliability of function calling in long-horizon, multi-step scenarios where models must make dozens of tool decisions in sequence, often interleaved with reasoning, search, and memory. The BFCL authors frame the open problem directly: "while state-of-the-art LLMs excel at single-turn calls, memory, dynamic decision-making, and long-horizon reasoning remain open challenges" [^9]. Research into better tool-use training data, improved constrained decoding, more robust agent architectures, and tool-call security continues across both industry and academia.

## see also

- [AI agent](/wiki/ai_agent)
- [Agent](/wiki/agent)
- [Tool use](/wiki/tool_use)
- [Model Context Protocol](/wiki/model_context_protocol)
- [OpenAI Responses API](/wiki/openai_responses_api)
- [Agent Tools API](/wiki/agent_tools_api)
- [Structured outputs](/wiki/structured_output)
- [Berkeley Function Calling Leaderboard](/wiki/bfcl)
- [Prompt engineering](/wiki/prompt_engineering)
- [Prompt caching](/wiki/prompt_caching)
- [Large language model](/wiki/large_language_model)
- [JSON Schema](/wiki/json_schema)

## references

[^1]: OpenAI. "Function Calling." OpenAI API Documentation. https://platform.openai.com/docs/guides/function-calling

[^2]: Descope. "MCP vs. Function Calling: How They Differ and Which to Use." 2025. https://www.descope.com/blog/post/mcp-vs-function-calling

[^3]: OpenAI. "Introducing GPT-5.2." December 11, 2025. https://openai.com/index/introducing-gpt-5-2/ (See also Cobus Greyling, "Enhanced Safety, Predictability & Control in GPT-5.2 Tool Calling," https://cobusgreyling.medium.com/enhanced-safety-predictability-control-in-gpt-5-2-tool-calling-5a2452ed3e6a)

[^4]: Google. "Function Calling." Gemini API Documentation. https://ai.google.dev/gemini-api/docs/function-calling

[^5]: Anthropic. "Tool use with Claude." Anthropic Platform Documentation. https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview

[^6]: Yao, Shunyu, et al. "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629, October 6, 2022. https://arxiv.org/abs/2210.03629

[^7]: Schick, Timo, et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761, February 9, 2023. https://arxiv.org/abs/2302.04761 (NeurIPS 2023 poster, https://neurips.cc/virtual/2023/poster/71288)

[^8]: Patil, Shishir G., et al. "Gorilla: Large Language Model Connected with Massive APIs." arXiv:2305.15334, May 24, 2023. https://arxiv.org/abs/2305.15334

[^9]: Patil, Shishir, et al. "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models." ICML 2025. https://proceedings.mlr.press/v267/patil25a.html (Live leaderboard at https://gorilla.cs.berkeley.edu/leaderboard.html)

[^10]: LangChain Documentation. "Tool Calling." https://docs.langchain.com/

[^11]: OpenAI. "Function calling and other API updates." June 13, 2023. https://openai.com/index/function-calling-and-other-api-updates/ (See also Simon Willison, "OpenAI: Function calling and other API updates," June 13, 2023, https://simonwillison.net/2023/Jun/13/function-calling/)

[^12]: Anthropic. "Claude can now use tools." May 30, 2024. https://www.anthropic.com/news/tool-use-ga (mirror at https://claude.com/blog/tool-use-ga)

[^13]: Anthropic. "Tool use overview." Anthropic Platform Documentation. https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview

[^14]: Google. "New Gemini API updates for Gemini 3." Google Developers Blog, November 18, 2025. https://developers.googleblog.com/new-gemini-api-updates-for-gemini-3/ (See also "Gemini 3 Pro," https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-pro)

[^15]: Mistral AI. "Large Enough." July 24, 2024. https://mistral.ai/news/mistral-large-2407

[^16]: Meta AI. "Introducing Llama 3.1: Our most capable models to date." July 23, 2024. https://ai.meta.com/blog/meta-llama-3-1/

[^17]: Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/ (See also vLLM Tool Calling docs, https://docs.vllm.ai/en/latest/features/tool_calling/)

[^18]: xAI. "Grok 4.1 Fast and Agent Tools API." November 19, 2025. https://x.ai/news/grok-4-1-fast

[^19]: Anthropic. "Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku." October 22, 2024. https://www.anthropic.com/news/3-5-models-and-computer-use (See also Simon Willison, "Initial explorations of Anthropic's new Computer Use capability," October 22, 2024, https://simonwillison.net/2024/Oct/22/computer-use/)

[^20]: OpenAI. "New tools for building agents." March 11, 2025. https://openai.com/index/new-tools-for-building-agents/

[^21]: OpenAI. "Introducing gpt-realtime and Realtime API updates for production voice agents." 2025. https://openai.com/index/introducing-gpt-realtime/

[^22]: Microsoft Foundry Blog. "Unlocking GPT-5's Freeform Tool Calling: A New Era of Seamless Integration." 2025. https://devblogs.microsoft.com/foundry/unlocking-gpt-5s-freeform-tool-calling-a-new-era-of-seamless-integration/ (Reference implementation in OpenAI Cookbook, "GPT-5 New Params and Tools," https://cookbook.openai.com/examples/gpt-5/gpt-5_new_params_and_tools)

[^23]: OpenAI. "GPT-5 New Params and Tools" (CFG grammars via Lark/regex, custom tools). OpenAI Cookbook, 2025. https://cookbook.openai.com/examples/gpt-5/gpt-5_new_params_and_tools

[^24]: OpenAI. "Introducing GPT-5.5." April 23, 2026. https://openai.com/index/introducing-gpt-5-5/

[^25]: Anthropic. "Introducing advanced tool use on the Claude Developer Platform." November 24, 2025. https://www.anthropic.com/engineering/advanced-tool-use

[^26]: Anthropic. "Programmatic tool calling." Claude API Docs. https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling

[^27]: OpenAI. "Introducing Structured Outputs in the API." August 6, 2024. https://openai.com/index/introducing-structured-outputs-in-the-api/

[^28]: Anthropic. "Claude Opus 4.7." 2026. https://www.anthropic.com/claude/opus (See also "Models overview," https://platform.claude.com/docs/en/about-claude/models/overview)

[^29]: OpenAI. "Prompt caching." OpenAI API Documentation. https://developers.openai.com/api/docs/guides/prompt-caching (See also "Prompt Caching in the API," https://openai.com/index/api-prompt-caching/)

[^30]: Anthropic. "Tool use with prompt caching." Claude API Docs. https://platform.claude.com/docs/en/agents-and-tools/tool-use/tool-use-with-prompt-caching (See also "Prompt caching," https://platform.claude.com/docs/en/build-with-claude/prompt-caching)

[^31]: Berkeley Function Calling Leaderboard V4. https://gorilla.cs.berkeley.edu/leaderboard.html

[^32]: ClickRank.ai. "LLM Leaderboard 2026: Best AI Models Benchmark & Ranking." 2026. https://www.clickrank.ai/llm-leaderboard/

[^33]: Barres, Victor, et al. "τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment." 2025. arXiv:2506.07982. https://arxiv.org/abs/2506.07982

[^34]: Artificial Analysis. "τ²-Bench Telecom Benchmark Leaderboard." https://artificialanalysis.ai/evaluations/tau2-bench

[^35]: Anthropic. "Introducing the Model Context Protocol." November 2024. https://www.anthropic.com/news/model-context-protocol

[^36]: Linux Foundation. "Linux Foundation Announces the Formation of the Agentic AI Foundation (AAIF)." December 9, 2025. https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation

[^37]: OWASP. "LLM01:2025 Prompt Injection." OWASP Gen AI Security Project, 2025. https://genai.owasp.org/llmrisk/llm01-prompt-injection/

[^38]: Microsoft Security Blog. "When prompts become shells: RCE vulnerabilities in AI agent frameworks." May 7, 2026. https://www.microsoft.com/en-us/security/blog/2026/05/07/prompts-become-shells-rce-vulnerabilities-ai-agent-frameworks/

[^39]: Microsoft. "Agent Governance Toolkit." April 2026 (release referenced in Zylos Research, "Tool Use and Function Calling in AI Agents," April 7, 2026, https://zylos.ai/research/2026-04-07-tool-use-function-calling-standards-benchmarks).