Toolformer
Last reviewed
Jun 3, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,619 words
Toolformer is a research language model from Meta AI that learns, in a self-supervised way, to call external software tools through simple text-based API calls. It was introduced in the paper "Toolformer: Language Models Can Teach Themselves to Use Tools" by Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom, posted to arXiv on 9 February 2023 and later accepted as an oral presentation at NeurIPS 2023.[1][2] The model decides on its own which API to call, when to call it, what arguments to pass, and how to fold the returned result into its subsequent text generation.[1]
The work targets a familiar weakness of large language models: although they handle few-shot tasks well, they stumble on basic functionality that much simpler programs handle easily, such as precise arithmetic, factual lookup, working in low-resource languages, and reasoning about the current date.[1] Rather than scaling the model to paper over these gaps, Toolformer gives the model access to outside tools and lets it teach itself when each one is worth using.
Toolformer represents every API call as a short text snippet wrapped in special tokens. An unanswered call is written as <API> name(input) </API>, and a completed call carries its result after an arrow: <API> name(input) → result </API>.[1] In the implementation, <API>, </API>, and → are represented by the literal token sequences [, ], and -> respectively, so the method works without changing the base model's vocabulary.[1] Because both the inputs and the outputs of each tool are plain text, calls can be inserted directly into ordinary running text.
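The call format described above can be illustrated with a small Python sketch. The helper names (`format_call`, `parse_calls`), the regular expression, and the tool names used in the examples are assumptions made for illustration and are not taken from the paper's code; only the `[`, `]`, and `->` markers come from the description above.

```python
import re
from typing import List, Optional, Tuple

# Literal markers that stand in for <API>, </API>, and the arrow.
API_START, API_END, ARROW = "[", "]", "->"

def format_call(name: str, arg: str, result: Optional[str] = None) -> str:
    """Render a call as plain text, with or without its result."""
    call = f"{name}({arg})"
    if result is None:
        return f"{API_START}{call}{API_END}"
    return f"{API_START}{call} {ARROW} {result}{API_END}"

# Hypothetical pattern: tool name, argument, and an optional result.
CALL_RE = re.compile(r"\[(\w+)\((.*?)\)(?:\s*->\s*(.*?))?\]")

def parse_calls(text: str) -> List[Tuple[str, str, Optional[str]]]:
    """Return (name, argument, result-or-None) for every call found in text."""
    return [(m.group(1), m.group(2), m.group(3)) for m in CALL_RE.finditer(text)]
```

For instance, `format_call("Calculator", "27 + 4 * 2", "35")` produces a string that can be dropped into a sentence, and `parse_calls` recovers the tool name, argument, and result from it. Because everything is ordinary text, no tokenizer changes are needed for this representation.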
The training pipeline runs in three stages on a plain-text corpus.[1]
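In the first stage, the model samples candidate API calls: guided by a few-shot prompt for each tool, it finds positions in the text where a call could start (those where the probability of the <API> token exceeds a threshold), keeps the top candidate positions, and then samples several possible calls at each one. In the second stage, the sampled calls are executed to obtain their results. In the third stage, the calls are filtered: a call is kept only if providing it together with its result lowers the model's loss on the following tokens by at least a set margin, compared with making no call or making the call without its result.[1]
The surviving calls are then interleaved back into the original text, producing an augmented dataset. The base model is finetuned on this dataset with a standard language modeling objective.[1] A key property is that, apart from the inserted calls, the augmented data contains exactly the same text the model was already trained on, so finetuning exposes it to the same content and is meant not to erode its general abilities.[1] The corpus used was a subset of CCNet, and to cut the cost of annotation the authors applied heuristics so that, for instance, only passages containing at least three numbers were considered for the calculator.[1] The procedure is data-hungry: processing more than a million documents yielded only a few thousand useful calculator calls.[1]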
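The filtering step that decides which calls survive can be sketched as follows. In the paper's description, a call is kept when conditioning on the call and its result lowers the loss on the following tokens by at least a threshold, relative to the better of making no call and making the call without its result. The sketch below takes those three losses as given numbers; the class and function names, the threshold value used in the example, and the character-offset insertion are illustrative assumptions, and the paper's weighting of the loss over following tokens is omitted.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    call: str                # e.g. "Calculator(400 / 1400)"
    result: str              # text returned by the tool
    loss_with_result: float  # loss on following tokens, prefixed by call + result
    loss_call_only: float    # loss when the call is shown without its result
    loss_no_call: float      # loss with no call at all

def keep(c: Candidate, tau_f: float) -> bool:
    """Keep a call only if its result helps by at least tau_f."""
    baseline = min(c.loss_no_call, c.loss_call_only)
    return baseline - c.loss_with_result >= tau_f

def interleave(text: str, position: int, c: Candidate) -> str:
    """Insert the completed call into the text at a character offset."""
    return f"{text[:position]}[{c.call} -> {c.result}] {text[position:]}"
```

With made-up losses of 1.0 (call and result), 2.0 (call only), and 2.2 (no call) and a threshold of 0.5, the call is kept because the improvement over the better baseline is 1.0. A call whose result does not reduce the loss is dropped, which is one reason so few candidates survive.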
At inference time the finetuned model generates normally until it emits the arrow token, which signals that it expects a tool result. Decoding pauses, the API is called, the response and closing tag are inserted, and generation resumes.[1] To make the model more willing to use tools, the authors let it begin a call whenever <API> is among the ten most likely next tokens rather than only when it is the single most likely, and they cap usage at one call per input to avoid loops.[1]
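The decoding procedure above can be sketched with a toy loop. This is a simplified illustration under stated assumptions: `next_ranked` is a hypothetical stand-in for the language model that returns candidate tokens ordered from most to least likely, tokens are whole strings rather than subword units, and the call parsing is deliberately naive. Only the top-k rule, the pause at the arrow, and the one-call cap come from the description above.

```python
from typing import Callable, Dict, List, Optional

API_START, API_END, ARROW, EOS = "[", "]", "->", "<eos>"

def decode(
    next_ranked: Callable[[List[str]], List[str]],
    tools: Dict[str, Callable[[str], str]],
    k: int = 10,
    max_calls: int = 1,
    max_tokens: int = 50,
) -> List[str]:
    out: List[str] = []
    calls_made = 0
    call_start: Optional[int] = None  # index of first call token while inside a call
    for _ in range(max_tokens):
        ranked = next_ranked(out)
        can_call = calls_made < max_calls and call_start is None
        if can_call and API_START in ranked[:k]:
            token = API_START  # start a call even if "[" is not the top token
        else:
            # Otherwise take the most likely token that does not open a call.
            token = next((t for t in ranked if t != API_START), EOS)
        if token == EOS:
            break
        out.append(token)
        if token == API_START:
            call_start = len(out)
        elif token == ARROW and call_start is not None:
            # Pause decoding, run the tool, then insert result and closing marker.
            call_text = "".join(out[call_start:-1])
            name, _, rest = call_text.partition("(")
            arg = rest[:-1] if rest.endswith(")") else rest
            out += [tools[name](arg), API_END]
            calls_made, call_start = calls_made + 1, None
    return out

# Scripted stand-in for a model: maps the number of tokens so far to a ranking.
script = {0: ["The"], 1: ["total"], 2: ["is"], 3: ["35", "["],
          4: ["Calculator(27 + 4 * 2)"], 5: ["->"], 8: ["35", "["], 9: ["."]}
scripted_model = lambda out: script.get(len(out), [EOS])
demo_tools = {"Calculator": lambda arg: {"27 + 4 * 2": "35"}[arg]}
```

Running `decode(scripted_model, demo_tools)` opens a call at the fourth step even though `[` is only the second-ranked token there, and it declines to open a second call later because the cap has been reached.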
The paper demonstrates five tools, chosen to address distinct shortcomings of plain language models.[1] The table below lists each one with its underlying implementation and an example.
| Tool | Implementation | Example call → result |
|---|---|---|
| Question answering | Atlas, a retrieval-augmented model finetuned on Natural Questions | Where was the Knights of Columbus founded? → New Haven, Connecticut |
| Wikipedia search | A BM25 retriever indexing the KILT Wikipedia dump, returning short snippets | Fishing Reel Types → a passage on spin fishing reels |
| Calculator | Supports the four basic arithmetic operations, results rounded to two decimals | 27 + 4 * 2 → 35 |
| Calendar | Returns the current date, taking no input | (empty) → Today is Monday, January 30, 2023 |
| Machine translation | The 600M-parameter NLLB model covering 200 languages, with source language detected by a fastText classifier and target fixed to English | sûreté nucléaire → nuclear safety |
The question answering tool and the Wikipedia search tool both supply factual information, but they differ in kind: the QA system returns a direct answer, whereas the search engine returns raw text that the model must read and extract from itself.[1] The calendar exists to give the model temporal context, since a plain model has no awareness of what day it is.[1]
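As an illustration of how simple such a tool can be, the calculator row in the table can be approximated by a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes the tool receives an arithmetic expression as a string, it restricts evaluation to the four basic operations via the standard `ast` module, and the way trailing zeros are trimmed from the output is an assumption.

```python
import ast
import operator

# Only the four basic arithmetic operations are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def _evaluate(node: ast.AST) -> float:
    """Recursively evaluate a parsed expression, rejecting anything unexpected."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")

def calculator(expr: str) -> str:
    """Evaluate expr and return the result as text, rounded to two decimals."""
    value = round(_evaluate(ast.parse(expr.strip(), mode="eval").body), 2)
    # Assumed output style: drop trailing zeros so 35.00 is returned as "35".
    return f"{value:.2f}".rstrip("0").rstrip(".")
```

Calling `calculator("27 + 4 * 2")` returns the string `"35"`, matching the table's example, while anything other than numbers and the four operators raises `ValueError` instead of being executed.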
Toolformer is built on a pretrained GPT-J model from EleutherAI.[1][3] The paper describes this base as having 6.7 billion parameters; the publicly released model, GPT-J-6B, is documented as a 6-billion-parameter transformer trained on the Pile.[3] The authors evaluate in a strict zero-shot setting, with no in-context examples provided, which is harder than the task-specific prompting used in much of the earlier tool-use work.[1]
The main comparisons are against GPT-J without tools, GPT-J finetuned on the same corpus without API calls (GPT-J + CC), a version of Toolformer with calls disabled at decoding, and the much larger OPT (66B) and GPT-3 (175B), which are roughly 10 and 25 times the size of the base model.[1] Selected scores follow.
| Benchmark (metric) | GPT-J | Toolformer | OPT (66B) | GPT-3 (175B) |
|---|---|---|---|---|
| LAMA, T-REx (accuracy) | 31.9 | 53.5 | 30.1 | 39.8 |
| LAMA, SQuAD | 17.8 | 33.8 | 21.6 | 26.8 |
| Math, SVAMP | 5.2 | 29.4 | 4.9 | 10.0 |
| Math, MAWPS | 9.9 | 44.0 | 7.9 | 19.8 |
| TriviaQA | 43.9 | 48.8 | 45.7 | 65.9 |
| Web Questions | 18.5 | 26.3 | 18.6 | 29.0 |
On the factual LAMA subsets, Toolformer leans on the question answering tool for almost all examples (98.1 percent on T-REx) and beats both of the much larger baselines.[1] On the math datasets it calls the calculator for 97.9 percent of examples and more than doubles the score of the next-best GPT-J variant, again surpassing OPT and GPT-3.[1] For open-domain question answering it relies mainly on Wikipedia search (99.3 percent of examples) and clearly beats the same-size baselines, though it still trails GPT-3, which the authors attribute partly to their simple, non-interactive search engine.[1] On multilingual MLQA the translation tool helps for every language, and on the DATESET temporal benchmark the calendar tool, used for about 55 percent of examples, drives a large improvement.[1] Notably, on the related TempLAMA dataset the gains came from search and QA rather than from the calendar, which was used in only 0.2 percent of cases.[1]
Two further findings stand out. First, finetuning with API calls does not raise perplexity on WikiText or held-out CCNet when calls are disabled, supporting the claim that tool use is added without degrading core language modeling.[1] Second, in a scaling study across GPT-2 models from 124M to 1.6B parameters plus GPT-J, the ability to make good use of tools only emerges around 775M parameters; smaller models perform about the same with or without tools.[1]
The authors list several constraints of the method as published.[1] Toolformer cannot chain tools, since each tool's calls are sampled independently and the training data therefore contains no examples of feeding one tool's output into another. It also cannot use tools interactively, so it cannot browse through many search results or reformulate a query that returns nothing useful. The model is sensitive to the exact wording of its input when deciding whether to call an API, a known trait of prompted language models. The pipeline is sample-inefficient, as the calculator example shows. Finally, the model does not weigh the computational cost of a call when deciding whether to make it.[1]
Toolformer was an early demonstration that a model could learn tool use from its own feedback rather than from task-specific human supervision or hand-crafted prompts, and that doing so could let a modest 6.7B model match or beat models more than twenty times its size on the tasks the tools addressed.[1] The paper's closest predecessor was TALM, which used a similar self-supervised objective but only in finetuned, task-specific settings.[1] Toolformer is widely cited as a step toward the tool-using and agentic language models that followed, in which an LLM invokes calculators, search, code execution, and other services as part of solving a task. Its limitations, especially the lack of chained and interactive tool use, point directly to the function-calling and multi-step agent frameworks developed afterward, and the broad theme of grounding generation in external systems connects it to retrieval-augmented generation.