Toolformer

Updated Jul 16, 2026

Toolformer is a research language model from Meta AI that learns, in a self-supervised way, to call external software tools through simple text-based API calls. It was introduced in the paper "Toolformer: Language Models Can Teach Themselves to Use Tools" by Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom, posted to arXiv on 9 February 2023 and later accepted as an oral presentation at NeurIPS 2023.^[1]^[2] The model decides on its own which API to call, when to call it, what arguments to pass, and how to fold the returned result into its subsequent text generation.^[1]

The work targets a familiar weakness of large language models: although they handle few-shot tasks well, they stumble on functions that simpler programs do easily, such as precise arithmetic, factual lookup, working in low-resource languages, and reasoning about the current date.^[1] Rather than scaling the model to paper over these gaps, Toolformer gives the model access to outside tools and lets it teach itself when each one is worth using.

Approach

Toolformer represents every API call as a short text snippet wrapped in special tokens. A call that has not yet been answered is written as <API> name(input) </API>, and a completed call carries its result after an arrow: <API> name(input) → result </API>.^[1] In the implementation, <API>, </API>, and → are the literal token sequences [, ], and ->, so the method works without changing the base model's vocabulary.^[1] Because both the inputs and the outputs of each tool are plain text, calls can be inserted directly into ordinary running text.
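The linearization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `format_call` and `parse_call` and the regular expression are assumptions introduced here, and only the three literal markers come from the description above.

```python
import re
from typing import Optional, Tuple

# Literal markers that stand in for <API>, </API>, and the arrow.
API_START, API_END, ARROW = "[", "]", "->"


def format_call(name: str, arg: str, result: Optional[str] = None) -> str:
    """Linearize a call as text, with or without its result."""
    call = f"{name}({arg})"
    if result is None:
        return f"{API_START}{call}{API_END}"
    return f"{API_START}{call} {ARROW} {result}{API_END}"


# Hypothetical pattern: tool name, argument text, optional result after the arrow.
CALL_RE = re.compile(r"\[(\w+)\((.*?)\)(?:\s*->\s*(.*?))?\]")


def parse_call(text: str) -> Optional[Tuple[str, str, Optional[str]]]:
    """Return (name, argument, result) for the first call found, or None."""
    match = CALL_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)
```

Because a call is ordinary text, `format_call("Calculator", "27 + 4 * 2", "35")` can be spliced into a sentence at any position, and `parse_call` recovers the parts again.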

The training pipeline runs in three stages on a plain-text corpus.^[1]

Sampling. Using a handful of human-written demonstrations as a prompt, the model annotates passages with candidate API calls. It first finds positions where it is likely to want a call (those where the probability it assigns to emitting the <API> token next exceeds a threshold), keeps the top candidate positions, and then samples several possible calls at each one.
Executing. Each candidate call is actually run against its tool to obtain a text result. How this happens depends on the tool: it may query another neural network, run a small script, or search a corpus.
Filtering. A call is kept only if providing both the call and its result lowers the model's weighted loss on the tokens that follow, compared with the better of two alternatives: making no call at all, or making the call but withholding its result. Formally, a call is retained when this loss reduction is at least a threshold τ_f. The weighting function concentrates the loss on the tokens just after the call, where the retrieved information should matter most.
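The filtering rule can be illustrated with a short Python sketch. It assumes the per-token log-probabilities of the following tokens have already been computed under three prefixes (call with result, call without result, no call). The function names, the linearly decaying weight shape, and its step of 0.2 are illustrative assumptions rather than details taken from this article.

```python
from typing import List, Sequence


def decaying_weights(n: int, step: float = 0.2) -> List[float]:
    """Illustrative weights that emphasise tokens right after the call (assumed shape)."""
    raw = [max(0.0, 1.0 - step * t) for t in range(n)]
    total = sum(raw)
    return [w / total for w in raw] if total > 0 else raw


def weighted_loss(token_logprobs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted negative log-likelihood of the tokens that follow the call position."""
    return -sum(w * lp for w, lp in zip(weights, token_logprobs))


def keep_call(
    lp_with_result: Sequence[float],  # log-probs given the call and its result
    lp_call_only: Sequence[float],    # log-probs given the call with the result hidden
    lp_no_call: Sequence[float],      # log-probs given no call at all
    tau_f: float,                     # filtering threshold
) -> bool:
    """Keep the call only if seeing its result reduces the loss by at least tau_f."""
    weights = decaying_weights(len(lp_with_result))
    loss_plus = weighted_loss(lp_with_result, weights)
    # Compare against the better (lower-loss) of the two alternatives.
    loss_minus = min(
        weighted_loss(lp_no_call, weights),
        weighted_loss(lp_call_only, weights),
    )
    return loss_minus - loss_plus >= tau_f
```

Taking the minimum over the two alternatives means a call survives only when the result itself is informative, not merely the presence of the call text.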

The surviving calls are then interleaved back into the original text, producing an augmented dataset. The base model is finetuned on this dataset with a standard language modeling objective.^[1] A key property is that, apart from the inserted calls, the augmented data contains exactly the same text as the original corpus, so finetuning on it exposes the model to the same content as finetuning on the unannotated corpus and is meant not to erode its general abilities.^[1] The corpus used was a subset of CCNet. To cut the cost of annotation, the authors applied heuristics so that, for instance, only passages containing at least three numbers were considered for the calculator.^[1] The procedure is data-hungry: processing more than a million documents yielded only a few thousand useful calculator calls.^[1]

At inference time the finetuned model generates normally until it emits the arrow token, which signals that it expects a tool result. Decoding pauses, the API is called, the response and closing tag are inserted, and generation resumes.^[1] To make the model more willing to use tools, the authors let it begin a call whenever <API> is among the ten most likely next tokens rather than only when it is the single most likely, and they cap usage at one call per input to avoid loops.^[1]
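A minimal Python sketch of this decoding loop follows. It is written against a stand-in model interface, `next_token_probs(text)`, which returns a token-to-probability mapping. That interface, the `<eos>` stop token, the function name, and the string-based call parsing are all assumptions made for illustration, and an unknown tool name simply raises `KeyError`.

```python
from typing import Callable, Dict

API_START, API_END, ARROW, EOS = "[", "]", "->", "<eos>"


def decode_with_tools(
    next_token_probs: Callable[[str], Dict[str, float]],  # hypothetical model interface
    tools: Dict[str, Callable[[str], str]],
    prompt: str,
    top_k: int = 10,
    max_steps: int = 50,
) -> str:
    text = prompt
    call_used = False   # at most one call per input
    call_start = None   # index where the open call's text begins, if any
    for _ in range(max_steps):
        probs = next_token_probs(text)
        ranked = sorted(probs, key=probs.get, reverse=True)
        if call_used:
            # The single allowed call has been spent, so never open another.
            ranked = [t for t in ranked if t != API_START]
        if not ranked:
            break
        # Relaxed rule: open a call if the start marker is anywhere in the top k.
        token = API_START if API_START in ranked[:top_k] else ranked[0]
        if token == EOS:
            break
        if token == API_START:
            call_used, call_start = True, len(text) + 1
            text += token
        elif token == ARROW and call_start is not None:
            # Pause decoding, run the tool, then insert its result and the closing marker.
            name, _, rest = text[call_start:].partition("(")
            arg = rest.rsplit(")", 1)[0]
            text += f" {ARROW} {tools[name.strip()](arg)}{API_END}"
            call_start = None
        else:
            text += token
    return text
```

The `call_used` flag implements the one-call cap, and the top-k test replaces the stricter requirement that the start marker be the single most likely token.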

Tools

The paper demonstrates five tools, chosen to address distinct shortcomings of plain language models.^[1] The table below lists each one with its underlying implementation and an example.

Tool	Implementation	Example call → result
Question answering	Atlas, a retrieval-augmented model finetuned on Natural Questions	`Where was the Knights of Columbus founded?` → New Haven, Connecticut
Wikipedia search	A BM25 retriever indexing the KILT Wikipedia dump, returning short snippets	`Fishing Reel Types` → a passage on spin fishing reels
Calculator	Supports the four basic arithmetic operations, results rounded to two decimals	`27 + 4 * 2` → 35
Calendar	Returns the current date, taking no input	(empty) → Today is Monday, January 30, 2023
Machine translation	The 600M-parameter NLLB model covering 200 languages, with source language detected by a fastText classifier and target fixed to English	`sûreté nucléaire` → nuclear safety

The question answering tool and the Wikipedia search tool both supply factual information, but they differ in kind: the QA system returns a direct answer, whereas the search engine returns raw text that the model must read and extract from itself.^[1] The calendar exists to give the model temporal context, since a plain model has no awareness of what day it is.^[1]
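The two simplest tools in the table, the calculator and the calendar, are small enough to sketch directly. The version below is an assumption-laden illustration in Python rather than the paper's implementation: it parses expressions with the standard `ast` module, accepts only the four basic operations (plus unary minus), rounds to two decimals, and formats the date in the style of the table's example.

```python
import ast
import operator
from datetime import date
from typing import Optional

# Only the four basic arithmetic operations are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST) -> float:
    """Recursively evaluate a parsed arithmetic expression, rejecting anything else."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def calculator(expr: str) -> str:
    """Evaluate an arithmetic expression and round the result to two decimals."""
    value = _evaluate(ast.parse(expr, mode="eval").body)
    return str(round(value, 2))


def calendar(today: Optional[date] = None) -> str:
    """Return the current date as text; takes no input from the model."""
    today = today or date.today()
    return f"Today is {today:%A}, {today:%B} {today.day}, {today.year}"
```

Walking the syntax tree instead of calling `eval` keeps model-written input from executing arbitrary code. The `today` parameter exists only so the output can be reproduced in tests, and the weekday and month names assume the default English locale.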

Base model and results

Toolformer is built on a pretrained GPT-J model from EleutherAI.^[1]^[3] The paper describes this base as having 6.7 billion parameters; the publicly released model, GPT-J-6B, is documented as a 6-billion-parameter transformer trained on the Pile.^[3] The authors evaluate in a strict zero-shot setting, with no in-context examples provided, which is harder than the task-specific prompting used in much of the earlier tool-use work.^[1]

The main comparisons are against GPT-J without tools, GPT-J finetuned on the same corpus without API calls (GPT-J + CC), a version of Toolformer with calls disabled at decoding, and the much larger OPT (66B) and GPT-3 (175B), which are roughly 10 and 25 times the size of the base model.^[1] Selected scores follow.

Benchmark (metric)	GPT-J	Toolformer	OPT (66B)	GPT-3 (175B)
LAMA, T-REx (accuracy)	31.9	53.5	30.1	39.8
LAMA, SQuAD	17.8	33.8	21.6	26.8
Math, SVAMP	5.2	29.4	4.9	10.0
Math, MAWPS	9.9	44.0	7.9	19.8
TriviaQA	43.9	48.8	45.7	65.9
Web Questions	18.5	26.3	18.6	29.0

On the factual LAMA subsets, Toolformer leans on the question answering tool for almost all examples (98.1 percent on T-REx) and beats both of the much larger baselines.^[1] On the math datasets it calls the calculator for 97.9 percent of examples and more than doubles the score of the next-best GPT-J variant, again surpassing OPT and GPT-3.^[1] For open-domain question answering it relies mainly on Wikipedia search (99.3 percent of examples) and clearly beats same-size baselines, though it still trails GPT-3, which the authors attribute partly to their simple, non-interactive search engine.^[1] On multilingual MLQA the translation tool helps for every language, and on the DATESET temporal benchmark the calendar tool, used for about 55 percent of examples, drives a large improvement.^[1] Notably, on the related TempLAMA dataset the gains came from search and QA rather than the calendar, which was used in only 0.2 percent of cases.^[1]
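The comparisons in this paragraph can be checked against the selected scores in the table with a few lines of Python. The numbers are copied from the table; the helper names `best_model` and `gain_over_base` are introduced here for illustration only.

```python
from typing import Dict

# Selected scores copied from the table above.
SCORES: Dict[str, Dict[str, float]] = {
    "LAMA, T-REx": {"GPT-J": 31.9, "Toolformer": 53.5, "OPT (66B)": 30.1, "GPT-3 (175B)": 39.8},
    "LAMA, SQuAD": {"GPT-J": 17.8, "Toolformer": 33.8, "OPT (66B)": 21.6, "GPT-3 (175B)": 26.8},
    "Math, SVAMP": {"GPT-J": 5.2, "Toolformer": 29.4, "OPT (66B)": 4.9, "GPT-3 (175B)": 10.0},
    "Math, MAWPS": {"GPT-J": 9.9, "Toolformer": 44.0, "OPT (66B)": 7.9, "GPT-3 (175B)": 19.8},
    "TriviaQA": {"GPT-J": 43.9, "Toolformer": 48.8, "OPT (66B)": 45.7, "GPT-3 (175B)": 65.9},
    "Web Questions": {"GPT-J": 18.5, "Toolformer": 26.3, "OPT (66B)": 18.6, "GPT-3 (175B)": 29.0},
}


def best_model(benchmark: str) -> str:
    """Name of the highest-scoring model on one benchmark."""
    row = SCORES[benchmark]
    return max(row, key=row.get)


def gain_over_base(benchmark: str) -> float:
    """Absolute improvement of Toolformer over plain GPT-J, in points."""
    row = SCORES[benchmark]
    return round(row["Toolformer"] - row["GPT-J"], 1)


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name}: best = {best_model(name)}, gain over GPT-J = {gain_over_base(name)}")
```

On these figures Toolformer leads on the four LAMA and math rows, while GPT-3 leads on TriviaQA and Web Questions, which matches the description above.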

Two further findings stand out. First, with calls disabled, finetuning with API calls does not raise perplexity on WikiText or held-out CCNet relative to finetuning on the same corpus without calls, supporting the claim that tool use is added without degrading core language modeling.^[1] Second, in a scaling study across GPT-2 models from 124M to 1.6B parameters plus GPT-J, the ability to make good use of tools only emerges around 775M parameters; smaller models perform about the same with or without tools.^[1]

Limitations

The authors list several constraints of the method as published.^[1] Toolformer cannot chain tools, since each tool's calls are sampled independently and the training data therefore contains no examples of feeding one tool's output into another. It also cannot use tools interactively, so it cannot browse through many search results or reformulate a query that returns nothing useful. The model is sensitive to the exact wording of its input when deciding whether to call an API, a known trait of prompted language models. The pipeline is sample-inefficient, as the calculator example shows. Finally, the model does not weigh the computational cost of a call when deciding whether to make it.^[1]

Significance

Toolformer was an early demonstration that a model could learn tool use from its own feedback rather than from task-specific human supervision or hand-crafted prompts, and that doing so could let a modest 6.7B model match or beat models more than twenty times its size on the tasks the tools addressed.^[1] The paper's closest predecessor was TALM, which used a similar self-supervised objective but only in finetuned, task-specific settings.^[1] Toolformer is widely cited as a step toward the tool-using and agentic language models that followed, in which an LLM invokes calculators, search, code execution, and other services as part of solving a task. Its limitations, especially the lack of chained and interactive tool use, point directly to the function-calling and multi-step agent frameworks developed afterward, and the broad theme of grounding generation in external systems connects it to retrieval-augmented generation.

References

1. Schick, Timo; Dwivedi-Yu, Jane; Dessì, Roberto; Raileanu, Roberta; Lomeli, Maria; Zettlemoyer, Luke; Cancedda, Nicola; Scialom, Thomas. "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761, 9 February 2023. https://arxiv.org/abs/2302.04761
2. "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS 2023 poster and oral. https://neurips.cc/virtual/2023/poster/71288
3. "GPT-J." Wikipedia. https://en.wikipedia.org/wiki/GPT-J ; EleutherAI, "GPT-J-6B" model card, Hugging Face. https://huggingface.co/EleutherAI/gpt-j-6b
