Function calling (also called tool use or tool calling) is a capability of large language models (LLMs) that allows the model to generate structured JSON output specifying which external function or API to invoke and what arguments to pass. Rather than producing only natural language text, a model with function calling capabilities can decide during generation that it needs to call a specific tool, output the function name and parameters in a structured format, and then incorporate the result of that function call into its subsequent response.
Function calling was introduced as a formal API feature by OpenAI on June 13, 2023, and has since been adopted by every major LLM provider, including Anthropic (Claude), Google (Gemini), Mistral AI, xAI (Grok), and Meta (Llama). The capability is foundational to building AI agents, enabling LLMs to interact with databases, APIs, file systems, code interpreters, and other external systems in a controlled, structured manner [1][2].
In modern usage, function calling is often the lowest-level building block beneath higher abstractions such as tool use, the Model Context Protocol (MCP), and built-in agent platforms like the OpenAI Responses API and xAI Agent Tools API. By 2026, every major frontier model treats function calling as a default feature trained directly into the base model rather than a thin wrapper layered on top.
The function calling process follows a well-defined loop involving the developer, the model, and the application layer. Understanding this loop is essential to building reliable tool-augmented AI systems.
The developer provides the model with a list of available functions (tools) as part of the API request. Each function definition includes a name, a natural language description of what the function does, and a JSON Schema specifying the expected parameters and their types. Most providers use a subset of the OpenAPI/JSON Schema specification, which means features like recursive schemas, references, or rare keywords may not be supported [3].
For example, a weather application might define a function like:
```json
{
  "name": "get_weather",
  "description": "Get the current weather for a given location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name, e.g. San Francisco"
      },
      "unit": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"]
      }
    },
    "required": ["location"]
  }
}
```
The model uses the function descriptions and parameter schemas to decide when and how to call each function. Clear, specific descriptions significantly improve the model's ability to choose the right function and provide correct arguments [1].
When the user sends a message (for example, "What's the weather in Tokyo?"), the model analyzes the query against the available function definitions. If the model determines that a function call would help it answer the query, it generates a structured output indicating the function to call and the arguments to pass, rather than generating a direct text response.
The model does not execute the function itself. It outputs a structured request that the application must intercept and act upon. This is a critical design choice: the model proposes actions, but the application controls execution [4].
The application parses the model's function call output, validates the arguments, and executes the actual function (for example, calling a weather API with the specified location). This step happens entirely outside the model. The developer's code is responsible for the actual execution, error handling, and any security checks. Validation typically uses runtime libraries like Pydantic in Python or Zod in TypeScript to confirm that arguments match the declared schema.
The function's result is sent back to the model as a new message in the conversation. The model then uses this result to generate its final response to the user. For example, after receiving the weather data, the model might respond: "The current weather in Tokyo is 22 degrees Celsius with partly cloudy skies."
This four-step loop can repeat multiple times within a single interaction. The model might call one function, examine the result, decide it needs additional information, call another function, and then synthesize all results into a final response. This iterative process is sometimes called the agentic loop, and it sits at the core of every tool-using LLM application [4].
| Step | Actor | Action |
|---|---|---|
| 1 | Developer | Defines available functions with names, descriptions, and JSON schemas |
| 2 | User | Sends a message or query |
| 3 | Model | Analyzes query, decides to call a function, outputs function name and arguments as JSON |
| 4 | Application | Parses function call, validates arguments, executes the actual function |
| 5 | Application | Returns function result to the model as a new message |
| 6 | Model | Incorporates result into final response, or makes additional function calls |
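The loop in the table above can be sketched in a few lines of Python. This is a provider-neutral sketch with a scripted stand-in for the model; the reply shape (the tool_call and arguments fields) and the get_weather tool are illustrative, not any specific provider's wire format.

```python
import json

# Hypothetical tool implementation matching the get_weather definition above.
def get_weather(location: str, unit: str = "celsius") -> dict:
    # A real implementation would call a weather API here.
    return {"location": location, "temperature": 22, "unit": unit}

TOOLS = {"get_weather": get_weather}

def run_agent_loop(model_call, messages, max_turns=5):
    """Drive the loop: ask the model, execute any tool call it
    proposes, append the result, and repeat until it answers in text."""
    for _ in range(max_turns):
        reply = model_call(messages)              # model decides
        if reply.get("tool_call") is None:        # plain text: done
            return reply["content"]
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**json.loads(call["arguments"]))
        messages.append({"role": "tool",          # result goes back
                         "name": call["name"],
                         "content": json.dumps(result)})
    raise RuntimeError("agent loop did not terminate")

# Scripted stand-in for the model: first requests a tool, then answers.
scripted = [
    {"tool_call": {"name": "get_weather",
                   "arguments": '{"location": "Tokyo"}'}},
    {"tool_call": None, "content": "It is 22 degrees Celsius in Tokyo."},
]
final = run_agent_loop(lambda msgs: scripted.pop(0),
                       [{"role": "user", "content": "Weather in Tokyo?"}])
```

The important structural point is that the model never executes anything: the application owns the dispatch table, the execution, and the decision to continue or stop.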
Before API-level function calling support, developers used ad hoc methods to get LLMs to interact with external systems. Common approaches included instructing the model to output JSON in a specific format within its text response, then parsing that text with regular expressions or custom parsers. These approaches were fragile. Models would frequently produce malformed JSON, include extra text around the JSON, or forget the required format entirely.
The ReAct paper (Yao et al., 2022) formalized a pattern where the model alternates between reasoning steps and action steps, with the actions being tool invocations parsed out of free-form text. Early frameworks like LangChain (released October 2022) provided abstractions for tool use that worked around these limitations through careful prompt engineering and output parsing. Results were unreliable compared to native function calling support [5].
OpenAI launched function calling as a first-class API feature on June 13, 2023, alongside the release of updated GPT-3.5 Turbo and GPT-4 models specifically fine-tuned for function calling. The new model snapshots gpt-4-0613 and gpt-3.5-turbo-0613 were trained to detect when a function should be called and to respond with structured JSON. Developers could now define functions using JSON Schema and receive structured, validated function call outputs from the model with high reliability [1][6].
The initial release supported a single function call per turn. The API used a function_call parameter in the request and returned a function_call field in the response containing the function name and arguments as a JSON string. OpenAI explicitly framed this as an implementation of the ReAct pattern, but with models fine-tuned to execute it reliably rather than improvised through prompting [6].
Anthropic introduced tool use for Claude models in beta during early 2024 and made it generally available on May 30, 2024 across the entire Claude 3 model family on the Anthropic Messages API, Amazon Bedrock, and Google Cloud Vertex AI [7]. Anthropic uses a slightly different API design. Rather than a separate function_call field, Anthropic integrated tool use into its content block structure. Claude models return a tool_use content block containing the tool name, a unique ID, and the input arguments. Results are sent back as tool_result content blocks referencing that ID. This design was motivated by Anthropic's broader content block architecture, which treats text, images, and tool calls as parallel content types within a single message [8].
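Abridged, the two block types look roughly like the following (field values illustrative; see Anthropic's tool use documentation for the exact format). The model's turn carries the tool_use block:

```json
{
  "role": "assistant",
  "content": [
    {"type": "tool_use", "id": "toolu_01A", "name": "get_weather",
     "input": {"location": "Tokyo"}}
  ]
}
```

and the application replies with a tool_result block referencing the same ID:

```json
{
  "role": "user",
  "content": [
    {"type": "tool_result", "tool_use_id": "toolu_01A",
     "content": "22 degrees Celsius, partly cloudy"}
  ]
}
```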
Anthropic's GA release also introduced the tool_choice parameter (with values auto, any, or a specific tool name) and added support for tools that return images, enabling multimodal tool results. Later versions added strict tool definitions that guarantee schema conformance through constrained decoding.
Google's Gemini models support function calling through a functionCall response type, declared via FunctionDeclaration objects that use a subset of the OpenAPI schema format. Google's implementation supports both single-turn and multi-turn function calling patterns and exposes a function_calling_config parameter with modes AUTO, ANY, VALIDATED, and NONE to control whether the model is forced to emit a tool call [3]. Gemini 3 added unique IDs for each function call so applications can map results back unambiguously, which matters when several calls run in parallel.
Mistral Large 2 (mistral-large-2407), released July 24, 2024, was trained for both parallel and sequential function calling on La Plateforme and Le Chat [9]. Meta's Llama 3.1 release on July 23, 2024 added native tool calling for the first time in the Llama family, including the 405B parameter open-weight model that briefly topped the BFCL leaderboard [10]. xAI added tool calling to Grok in 2024 and shipped a dedicated Agent Tools API alongside Grok 4.1 Fast on November 19, 2025, treating function calling as a server-managed primitive rather than a parameter on the chat endpoint [11].
In November 2023, OpenAI introduced parallel function calling with the release of GPT-4 Turbo. This feature allows the model to request multiple function calls in a single turn rather than calling functions one at a time. For example, if a user asks "What's the weather in Tokyo and New York?", the model can output two function calls simultaneously, allowing the application to execute them in parallel and return both results at once [1].
Parallel function calling significantly reduces the number of round trips between the application and the API, improving both latency and user experience for multi-tool queries. The feature can be controlled with the parallel_tool_calls parameter. Setting it to false ensures at most one tool call per turn. Anthropic, Google, Mistral, xAI, and Meta have since added equivalent parallel-call support to their own APIs.
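Because parallel tool calls are independent by construction, the application can fan them out with ordinary concurrency primitives. A sketch using the standard library (the tool_calls message shape and get_weather tool are illustrative):

```python
import json
from concurrent.futures import ThreadPoolExecutor

def get_weather(location: str) -> dict:
    # Stand-in for a real network call to a weather API.
    return {"location": location, "temperature": 22}

TOOLS = {"get_weather": get_weather}

def execute_parallel(tool_calls):
    """Execute independent tool calls concurrently, returning one
    result message per call tagged with the call's id so the model
    can match results back to requests."""
    def run_one(call):
        result = TOOLS[call["name"]](**json.loads(call["arguments"]))
        return {"role": "tool", "tool_call_id": call["id"],
                "content": json.dumps(result)}
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_one, tool_calls))  # order preserved

# Two independent calls, as the model might emit for the
# Tokyo / New York question above.
results = execute_parallel([
    {"id": "call_1", "name": "get_weather",
     "arguments": '{"location": "Tokyo"}'},
    {"id": "call_2", "name": "get_weather",
     "arguments": '{"location": "New York"}'},
])
```

Threads are a reasonable default here because tool execution is usually I/O-bound (network calls); the call IDs are what lets the application return all results in one message without ambiguity.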
OpenAI initially used the term "function calling" but later transitioned to "tool calling" in its API, renaming the functions parameter to tools and function_call to tool_choice. The older function calling format is now considered legacy. Despite this rename, "function calling" remains the more widely recognized term in the broader community, and most providers use both terms interchangeably [1].
On March 11, 2025, OpenAI launched the Responses API, a new API primitive that combines elements of the Chat Completions and Assistants APIs and natively supports built-in server-side tools including web search, file search, and computer use [12]. The same release introduced the open-source OpenAI Agents SDK and shifted the agent development story away from raw tools arrays toward higher-level agent objects with built-in tool routing. Reasoning models such as o3 and o4-mini can call tools and functions directly within their chain of thought in the Responses API, producing more contextually grounded answers and significantly higher tool-calling scores on industry benchmarks [12].
Anthropic added equivalent server tools (web_search, code_execution, web_fetch, tool_search) and an MCP connector to the Messages API in 2025, while Google added live tool routing inside Gemini 3, and xAI shipped the Agent Tools API alongside Grok 4.1 Fast in November 2025 [11].
Function definitions across all major providers share three required pieces: a name (an identifier the model emits when calling), a description (natural language guidance about when and how to use the tool), and a parameters block (a JSON Schema describing the arguments).
Most providers support a useful subset of JSON Schema:
- type (object, string, number, integer, boolean, array)
- properties and required
- enum for fixed value sets
- description on individual fields
- items for arrays
- oneOf, anyOf, allOf (with provider-specific limits)

Deeply nested objects, recursive references, and arbitrary $ref chains are commonly rejected or silently flattened. Gemini's documentation explicitly notes that very large or deeply nested schemas may be rejected in ANY mode [3]. OpenAI's strict mode and Anthropic's strict tool use both impose additional restrictions in exchange for guaranteed schema conformance, including required additionalProperties: false and a small set of supported keywords.
A practical rule: keep schemas flat, name fields descriptively, use enum whenever there is a closed set of options, and lean on the description field to disambiguate near-duplicate parameters. The model treats descriptions as part of the prompt, so a clear description of unit: "celsius" | "fahrenheit" does more work than a long if-then in the system prompt.
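Applying those rules to a hypothetical order-lookup tool (the tool name and fields here are invented for illustration): flat structure, enums for closed sets, and descriptions that tell the model both what a field is and when to use the tool.

```json
{
  "name": "search_orders",
  "description": "Search a customer's order history. Use for questions about past or pending orders; do not use for refunds.",
  "parameters": {
    "type": "object",
    "properties": {
      "customer_id": {
        "type": "string",
        "description": "Internal customer ID, e.g. C-10423"
      },
      "status": {
        "type": "string",
        "enum": ["pending", "shipped", "delivered", "cancelled"]
      },
      "limit": {
        "type": "integer",
        "description": "Maximum number of orders to return"
      }
    },
    "required": ["customer_id"]
  }
}
```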
Each major LLM provider implements function calling with a different API design, though the core concept is the same across all of them.
| Feature | OpenAI (GPT-4o, GPT-4.1, o-series) | Anthropic (Claude) | Google (Gemini) | Mistral AI | xAI (Grok) | Meta (Llama) 3.1+ |
|---|---|---|---|---|---|---|
| API term | Tool calls | Tool use | Function calling | Tool calls | Tool calls / Agent Tools API | Tool calls |
| First public release | June 13, 2023 | Beta in early 2024, GA May 30, 2024 | December 2023 (Gemini API) | February 2024 | 2024, Agent Tools API Nov 19, 2025 | July 23, 2024 (Llama 3.1) |
| Definition format | JSON Schema in tools array | JSON Schema in tools array with input_schema | FunctionDeclaration (OpenAPI subset) | JSON Schema in tools array | JSON Schema in tools array | JSON Schema or built-in tool tags |
| Response format | tool_calls array with function.name and function.arguments | tool_use content block with name and input | functionCall with name and args (with unique id in Gemini 3) | tool_calls with function object | tool_calls array | Special tool-call tokens in the chat prompt format |
| Parallel calls | Yes, default on, parallel_tool_calls parameter | Yes, supported across Claude 3, 4, and 4.x families | Yes, with per-call IDs in Gemini 3 | Yes (Mistral Large 2 onward) | Yes | Yes |
| Strict / schema enforcement | strict: true (Structured Outputs) | strict: true (strict tool use) | function_calling_config: ANY plus VALIDATED mode | JSON mode | Schema-aware decoding | Depends on inference engine (vLLM, llama.cpp) |
| Force tool use | tool_choice: {"type": "function", "function": {"name": "..."}} | tool_choice: {"type": "tool", "name": "..."} | tool_config with ANY mode | tool_choice: "any" or specific function | tool_choice with function name | Provider-dependent |
| Streaming | Yes, with delta chunks for tool call arguments | Yes, with content_block_delta events | Yes | Yes | Yes | Yes |
| Built-in server tools | Web search, file search, computer use, code interpreter, image generation, MCP (Responses API, March 2025) | Web search, code execution, web fetch, tool search, MCP connector | Google Search, code execution, URL context | Limited | Web search, X search, file search, code execution, MCP | None native; via inference frameworks |
The overall picture by 2026: every provider exposes the same conceptual loop, but field names and content-block structures differ enough that most production codebases either pin to one provider or use an adapter library such as LiteLLM, LangChain, or the Vercel AI SDK to abstract the differences.
Function calling is closely related to, and often overlaps with, structured output capabilities. The two features share most of their underlying machinery (JSON Schema and constrained decoding) but solve slightly different problems.
JSON mode, available from OpenAI, Google, Mistral, and others, constrains the model to produce valid JSON in its response without enforcing a specific schema. It is useful when you need structured output but the schema varies between requests, or when an old code path expects a JSON blob and you do not want to refactor it into a tool definition.
OpenAI introduced Structured Outputs on August 6, 2024, going beyond JSON mode by guaranteeing that the model's output conforms to a developer-supplied JSON Schema. The new response_format option accepts a json_schema field, and a parallel strict: true flag on tool definitions delivers the same guarantee for function calls. Under the hood, OpenAI uses constrained decoding (sometimes called constrained sampling) to restrict the model's token generation at each step to only tokens that would produce valid JSON matching the schema. The launch model gpt-4o-2024-08-06 was trained to handle complicated schemas and reportedly hit 100% schema conformance on internal evaluations [13].
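The core idea of constrained decoding can be shown with a character-level toy: at each step, only continuations that can still complete a legal value are allowed. Real implementations apply the same idea per model token against a full JSON grammar, not per character against an enum, so this is purely illustrative.

```python
def allowed_next_chars(prefix: str, values: list[str]) -> set[str]:
    """Return the set of next characters that keep `prefix` a valid
    prefix of at least one allowed value. With this mask applied at
    every step, the output cannot escape the allowed set."""
    return {v[len(prefix)] for v in values
            if v.startswith(prefix) and len(v) > len(prefix)}

UNITS = ["celsius", "fahrenheit"]
first_chars = allowed_next_chars("", UNITS)   # both branches still open
after_c = allowed_next_chars("c", UNITS)      # continuation is forced
```

Once the model has emitted "c", every subsequent step has exactly one legal character until "celsius" is complete, which is why schema conformance can be guaranteed rather than merely encouraged.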
Anthropic added equivalent strict tool use in 2025, building on its existing tool use mechanism. Developers define a tool with an input_schema and add strict: true, and Claude returns structured data with the same conformance guarantees. Google's Gemini API offers responseSchema and the VALIDATED function calling mode for similar guarantees, and Mistral exposes JSON mode plus schema-constrained outputs on La Plateforme.
| Approach | Schema enforcement | Provider support | Best for |
|---|---|---|---|
| JSON mode | Guarantees valid JSON, no schema enforcement | OpenAI, Google, Mistral, Anthropic | Variable-structure JSON output |
| Structured Outputs | Guarantees JSON matching a specific schema | OpenAI (native, Aug 2024), Anthropic (strict tool use, 2025), Google (responseSchema), Mistral (json_object with schema) | Strict data extraction, form filling, tool args |
| Function calling | Guarantees structured function call output matching tool schema | All major providers | Tool invocation with typed arguments |
Reasoning-focused models that produce extended chains of thought have changed how tool use is integrated into the generation loop. OpenAI's o-series (o1, o3, o3-mini, o4-mini) and the GPT-4.1 line can call tools directly inside their reasoning traces in the Responses API, interleaving thought and action without a full round-trip to the application after every step [12]. This pattern improves complex tool-calling benchmarks by letting the model think, call, observe, and revise within a single response.
Claude's extended thinking, available on Claude Sonnet 4.5 and Claude Opus 4.x, similarly supports tool use during the thinking phase. Google's Gemini Deep Think and the Gemini 3 family expose a comparable interleaved tool mode, and xAI's grok-4-1-fast-reasoning is explicitly marketed as the reasoning variant of Grok 4.1 Fast for complex tool workflows [11]. The general trend by 2026 is that the boundary between "reasoning" and "tool use" has dissolved. Frontier models reason, call tools, observe results, and reason again as a single continuous process.
The ecosystem around function calling has grown well beyond raw API calls. Developers commonly use higher-level frameworks that handle the agentic loop, memory, multi-agent coordination, and integration with vector stores or databases.
Frameworks such as LangChain, LangGraph, CrewAI, and AutoGen all share a common core: a function calling loop that emits tool calls, executes them, and feeds results back into the conversation.
Function calling capability is now evaluated by a small family of public benchmarks, with the Berkeley Function Calling Leaderboard (BFCL) and Sierra Research's tau-bench family being the most influential.
The Berkeley Function Calling Leaderboard (Yan et al., 2024) is the de facto standard for evaluating tool use in LLMs. It uses an Abstract Syntax Tree (AST) evaluation method that scales to thousands of functions and tests both serial and parallel function calls across multiple programming languages [14]. BFCL has gone through four major versions, expanding from basic single-turn tool use to agentic evaluation.
The BFCL paper appeared at ICML 2025 (Patil et al.) under the title "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models" [14]. As of April 2026, top scores cluster between 0.70 and 0.89 on the public leaderboard, with Llama 3.1 405B Instruct at the top of the open-weight rankings (0.885) and a tight pack of frontier proprietary models behind it [14][15].
Sierra Research released tau-bench in 2024 and the harder follow-up tau2-bench in 2025 (arXiv 2506.07982). tau-bench simulates dynamic conversations between a user (also played by an LLM) and a tool-using language agent that must follow domain-specific policies. It originally covered Mock, Airline, and Retail domains. tau2-bench (Barres et al., 2025) introduced a dual-control environment where both the agent and the user can modify a shared world state, and added a Telecom domain that tests technical troubleshooting, problem solving, and clear communication [16].
The Telecom subset of tau2-bench is one of the hardest public tool-use benchmarks. Modern reasoning models including the GLM 4.7 and GLM 5 series report scores above 98% on the Artificial Analysis tau2-bench Telecom leaderboard, and Grok 4.1 Fast reasoning posts strong results in the same range [11][17]. Older non-reasoning models often fall below 50% on the same domain, which is a useful indicator of how much reasoning depth a model brings to multi-turn tool conversations.
Function calling enables a broad range of applications by bridging the gap between language understanding and programmatic action.
The most common use case is connecting an LLM to external APIs. A travel assistant might call flight search, hotel booking, and weather APIs. A productivity assistant might call calendar, email, and task management APIs. Function calling provides the structured interface that makes these integrations reliable.
Models can generate SQL queries or call database wrapper functions to answer questions about structured data. A business analyst chatbot might receive a question like "What were our top 10 products by revenue last quarter?" and generate a function call to execute the appropriate database query, then present the results in natural language.
LLMs are unreliable at arithmetic and precise calculations. Function calling allows models to delegate mathematical operations, unit conversions, statistical analyses, and other computations to dedicated tools that produce exact results. Code interpreter tools take this further by letting the model write and execute Python in a sandbox.
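A delegated calculator tool can be very small and still safe. The sketch below (a hypothetical tool, not any provider's built-in) evaluates arithmetic by walking the Python AST with a whitelist of operators, so model-supplied input can never reach eval():

```python
import ast
import operator

# Whitelisted binary operators for a minimal, safe calculator tool.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow}

def calculate(expression: str) -> float:
    """Evaluate plain arithmetic without eval(), rejecting anything
    other than numbers, whitelisted operators, and unary minus."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval").body)
```

The model emits a call like calculate("(17.5 - 3) / 2") and receives an exact result, rather than attempting the arithmetic in its own token stream.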
Models can call web search functions to retrieve up-to-date information beyond their training data. ChatGPT Search, Perplexity AI, and other AI search products use function calling internally to trigger web searches, scrape pages, and retrieve relevant content. The Responses API (OpenAI), Anthropic web_search server tool, Gemini Google Search tool, and xAI Agent Tools API all expose this as a built-in primitive.
Code interpreter tools allow models to write and execute code in sandboxed environments. OpenAI's Code Interpreter (now called Advanced Data Analysis), Anthropic's analysis tool, the Gemini code execution tool, and the Grok Agent Tools code execution sandbox all rely on function calling to pass generated code to an execution environment and return results.
Models can read, write, and manipulate files through function calls. This is the basis for AI-powered coding assistants like GitHub Copilot, Cursor, and Claude Code, which use tool calls to read source files, make edits, run tests, and interact with version control systems.
Complex tasks often require chaining multiple function calls together. An AI agent handling a customer refund might need to look up the customer's order, check the refund policy, calculate the refund amount, process the refund through the payment system, and send a confirmation email. Each step involves a separate function call, with the model orchestrating the sequence based on intermediate results.
The Anthropic computer use API, OpenAI Operator, and Google Project Mariner translate full GUI interactions into tool calls. The model receives screenshots, decides on a click or keystroke, and emits a tool call describing the action. Despite the visual interface, the underlying transport is still function calling.
Function calling is the foundational mechanism that enables AI agents to take actions in the world. Without it, LLMs can only produce text. With it, they can query databases, call APIs, execute code, and interact with external systems.
An agent is a system where an LLM autonomously decides what actions to take, executes those actions via function calls, observes the results, and plans next steps. Frameworks like LangChain, LangGraph, CrewAI, and AutoGen provide abstractions for building agent loops on top of function calling [5].
The quality of function calling directly determines agent reliability. Models that accurately select the right function, provide correct arguments, and handle errors gracefully produce more effective agents. Conversely, function calling failures (wrong function selection, malformed arguments, hallucinated function names) are a primary source of agent failures.
Model Context Protocol (MCP) is an open standard introduced by Anthropic in November 2024 that provides a standardized way for AI models to discover and interact with external tools and data sources. While function calling defines how a model invokes a specific tool, MCP standardizes how tools are described, discovered, and connected to models across different applications and providers [18].
MCP gained rapid adoption in 2025. OpenAI added official MCP support in March 2025, Google added MCP for Gemini, and xAI integrated MCP into the Agent Tools API in November 2025 [11]. On December 9, 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation co-founded by Anthropic, Block, and OpenAI, with support from Google, Microsoft, AWS, Cloudflare, and Bloomberg [19]. By early 2026, more than 10,000 MCP servers had been published, covering everything from databases and file systems to enterprise SaaS connectors [18].
MCP complements rather than replaces function calling. MCP provides the transport, discovery, and capability description layer, while function calling provides the actual invocation mechanism that the model uses inside a single conversation.
| Concept | Function calling | MCP |
|---|---|---|
| What it does | Lets a model invoke a specific function with structured arguments | Standardizes how tools are described, discovered, and connected |
| Scope | Single model-to-tool interaction | Cross-application tool ecosystem |
| Defined by | Each LLM provider (OpenAI, Anthropic, Google, etc.) | Open standard (originally Anthropic, now AAIF / Linux Foundation) |
| Analogy | Making a specific API call | A USB-C connector for AI tools |
Building reliable function calling applications requires attention to several design principles.
The model relies on function descriptions to decide when and how to use each tool. Descriptions should be specific, include examples of when the function should be used, and clarify any constraints or limitations. Vague descriptions lead to incorrect tool selection. Treat the description as part of the prompt, because that is exactly how the model uses it [1].
Deeply nested schemas hurt accuracy and increase token cost. Prefer flat objects with descriptive field names. Use enum whenever the parameter has a known finite set of values, and use the description field on each property to disambiguate similar parameters.
Always validate the arguments generated by the model before executing a function. Modern models with strict tool use produce valid JSON with very high reliability, but edge cases still occur, especially with complex nested schemas. Libraries like Pydantic (Python) and Zod (TypeScript) provide runtime validation that catches malformed arguments before they reach production systems.
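Even without a validation library, a stdlib check against the flat schema subset most providers support catches the common failure modes. A minimal sketch, using the get_weather schema from earlier (production code should prefer a real validator such as Pydantic or jsonschema):

```python
def validate_args(schema: dict, args: dict) -> list[str]:
    """Check model-generated arguments against a flat JSON Schema
    subset: required fields, basic types, and enums. Returns a list
    of problems; an empty list means the call is safe to execute."""
    type_map = {"string": str, "integer": int, "number": (int, float),
                "boolean": bool, "array": list, "object": dict}
    errors = []
    props = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in args:
            errors.append(f"missing required field: {field}")
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"unexpected field: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: must be one of {spec['enum']}")
        elif not isinstance(value, type_map.get(spec.get("type"), object)):
            errors.append(f"{name}: expected {spec.get('type')}")
    return errors

WEATHER_SCHEMA = {
    "type": "object",
    "properties": {"location": {"type": "string"},
                   "unit": {"type": "string",
                            "enum": ["celsius", "fahrenheit"]}},
    "required": ["location"],
}
```

Returning a list of problems rather than raising makes it easy to feed all violations back to the model in one message instead of failing on the first.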
When a function call fails (due to invalid arguments, API errors, timeout, or permission issues), the error should be communicated back to the model in a clear, structured format. Good error messages allow the model to adjust its approach, perhaps trying different arguments or an alternative function. Silently swallowing errors leads to confusing model behavior.
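One way to guarantee the model always sees a result, success or failure, is to wrap every execution and serialize errors into the tool-result message. A sketch under assumed message field names (role, tool_call_id); the ok/error/hint payload structure is an illustrative convention, not a provider requirement:

```python
import json

def run_tool_safely(fn, call_id: str, arguments: dict) -> dict:
    """Always return a tool-result message: on failure, a structured
    error the model can read and recover from, never a silent drop."""
    try:
        payload = {"ok": True, "result": fn(**arguments)}
    except Exception as exc:   # report the failure back to the model
        payload = {"ok": False,
                   "error": type(exc).__name__,
                   "message": str(exc),
                   "hint": "adjust the arguments or try another tool"}
    return {"role": "tool", "tool_call_id": call_id,
            "content": json.dumps(payload)}

def get_weather(location: str) -> dict:
    if not location:
        raise ValueError("location must be a non-empty city name")
    return {"location": location, "temperature": 22}

ok_msg = run_tool_safely(get_weather, "call_1", {"location": "Tokyo"})
err_msg = run_tool_safely(get_weather, "call_2", {"location": ""})
```

Because the error carries the exception type, message, and a hint, the model can retry with corrected arguments instead of hallucinating a result or stalling.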
Provide only the functions relevant to the current task. Including too many function definitions wastes context tokens and increases the chance of the model selecting the wrong tool. For applications with dozens of tools, consider implementing a tool selection or routing layer (Anthropic's tool_search server tool is one production implementation) that presents only a relevant subset to the model on each turn.
Function calling grants the model indirect access to external systems. Always implement proper authorization, input sanitization, and rate limiting on the execution side. Never allow the model to directly execute arbitrary code or unrestricted database queries without safeguards. The principle of least privilege applies: each function should have only the permissions it needs. Treat any output from a tool as untrusted user input from a prompt injection perspective, and require user confirmation for actions with real-world impact like sending email, making payments, or posting publicly [6].
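A least-privilege dispatch layer can enforce these rules before any tool runs: unknown tools are refused outright, and side-effecting tools are gated on explicit confirmation. The tool names and registry below are hypothetical, for illustration only:

```python
# Per-tool policy: which tools exist at all, and which require a
# human sign-off before execution.
READ_ONLY = {"get_weather"}
NEEDS_CONFIRMATION = {"send_email"}

def dispatch(name, args, registry, confirm=lambda name, args: False):
    """Execute a model-proposed call only if the tool is allowlisted;
    deny unknown tools and gate high-impact ones on confirm()."""
    if name not in READ_ONLY | NEEDS_CONFIRMATION:
        raise PermissionError(f"unknown or disallowed tool: {name}")
    if name in NEEDS_CONFIRMATION and not confirm(name, args):
        return {"ok": False, "message": "action declined by user"}
    return {"ok": True, "result": registry[name](**args)}

REGISTRY = {"get_weather": lambda location: {"temperature": 22},
            "send_email": lambda to, body: f"sent to {to}"}
```

Defaulting confirm to "always decline" means the safe behavior is what happens when the integration forgets to wire up a confirmation flow.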
Models sometimes generate calls to functions that do not exist in the provided tool list, or fabricate argument values. This is a form of hallucination specific to function calling. Strict schema enforcement and validation reduce but do not eliminate this issue. The BFCL Hallucination subscore is a direct attempt to measure this failure mode [14].
While models handle simple, single-step function calls reliably, complex workflows requiring many sequential function calls with conditional logic are more error-prone. The model may lose track of intermediate results, repeat function calls unnecessarily, or fail to synthesize information from multiple tool outputs. tau2-bench was designed in part to expose exactly this kind of multi-turn drift [16].
Each function call adds a round trip between the model and the application, increasing total response time. For workflows requiring five or more sequential function calls, latency can become noticeable. Parallel function calling helps for independent calls, but sequential dependencies cannot be parallelized. Server-side built-in tools (Responses API, Anthropic server tools, xAI Agent Tools API) reduce client-side round trips at the cost of less control over execution.
Tool schemas live in the context window and are charged as input tokens. A large tool catalog with verbose descriptions can easily consume thousands of tokens per request before the user message is even processed. Anthropic publishes per-model token costs for the special system prompt that enables tool use (346 input tokens for Claude Opus 4.x and Sonnet 4.x with tool_choice: auto), and the tool definitions themselves add on top of that [8].
Despite conceptual similarity, the different API formats across providers create friction for developers building provider-agnostic applications. MCP addresses part of this problem at the tool description layer, but the invocation-level differences in request and response formats still require provider-specific code or adapter libraries.
As of early 2026, function calling is a mature and widely adopted capability supported by all major LLM providers and most open-source models. The focus has shifted from basic capability to reliability, performance, and ecosystem development.
The ongoing challenge is improving the reliability of function calling in long-horizon, multi-step scenarios where models must make dozens of tool decisions in sequence, often interleaved with reasoning, search, and memory. Research into better tool-use training data, improved constrained decoding, and more robust agent architectures continues across both industry and academia.
[1] OpenAI. "Function Calling." OpenAI API Documentation. https://platform.openai.com/docs/guides/function-calling
[2] Descope. "MCP vs. Function Calling: How They Differ and Which to Use." 2025. https://www.descope.com/blog/post/mcp-vs-function-calling
[3] Google. "Function Calling." Gemini API Documentation. https://ai.google.dev/gemini-api/docs/function-calling
[4] Anthropic. "Tool use with Claude." Anthropic Platform Documentation. https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/overview
[5] LangChain Documentation. "Tool Calling." https://docs.langchain.com/
[6] OpenAI. "Function calling and other API updates." June 13, 2023. https://openai.com/index/function-calling-and-other-api-updates/ (See also Simon Willison, "OpenAI: Function calling and other API updates," June 13, 2023, https://simonwillison.net/2023/Jun/13/function-calling/)
[7] Anthropic. "Claude can now use tools." May 30, 2024. https://claude.com/blog/tool-use-ga
[8] Anthropic. "Tool use overview." Anthropic Platform Documentation. https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/overview
[9] Mistral AI. "Large Enough." July 24, 2024. https://mistral.ai/news/mistral-large-2407
[10] Meta AI. "Introducing Llama 3.1: Our most capable models to date." July 23, 2024. https://ai.meta.com/blog/meta-llama-3-1/
[11] xAI. "Grok 4.1 Fast and Agent Tools API." November 19, 2025. https://x.ai/news/grok-4-1-fast
[12] OpenAI. "New tools for building agents." March 11, 2025. https://openai.com/index/new-tools-for-building-agents/
[13] OpenAI. "Introducing Structured Outputs in the API." August 6, 2024. https://openai.com/index/introducing-structured-outputs-in-the-api/
[14] Patil, Shishir, et al. "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models." ICML 2025. https://proceedings.mlr.press/v267/patil25a.html
[15] Berkeley Function Calling Leaderboard V4. https://gorilla.cs.berkeley.edu/leaderboard.html
[16] Barres, Victor, et al. "τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment." 2025. arXiv:2506.07982. https://arxiv.org/abs/2506.07982
[17] Artificial Analysis. "τ²-Bench Telecom Benchmark Leaderboard." https://artificialanalysis.ai/evaluations/tau2-bench
[18] Anthropic. "Introducing the Model Context Protocol." November 2024. https://www.anthropic.com/news/model-context-protocol
[19] Linux Foundation. "Linux Foundation Announces the Formation of the Agentic AI Foundation (AAIF)." December 9, 2025. https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation