Sparrow (DeepMind)
Last reviewed
Jun 3, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,361 words
Sparrow is a research dialogue agent built by DeepMind and introduced on 22 September 2022. It was designed to be more helpful, correct, and harmless than a plain language model, and it was the company's main public demonstration of how reinforcement learning from human feedback, combined with an explicit set of rules, could be used to make a chatbot safer and better grounded. Sparrow could talk with a user, answer questions, and search the web to find and cite evidence for the factual claims it made. It was always framed as a proof of concept rather than a product, and it was never released to the public.[1][2][3]
The work was published in a paper titled "Improving alignment of dialogue agents via targeted human judgements," posted to arXiv on 28 September 2022 with a 34-author list led by Amelia Glaese and Nat McAleese. DeepMind had announced the agent on its blog six days earlier. The timing placed Sparrow a little over two months before OpenAI released ChatGPT, at a moment when large dialogue models were drawing intense attention to questions of factual accuracy and harmful output.[1][2][4]
Sparrow grew out of DeepMind's broader concern that large language models, trained mostly to predict the next token of text, will happily produce confident falsehoods, repeat harmful stereotypes, and otherwise behave in ways their developers never intended. The agent was an attempt to attack those failure modes directly. Rather than hope a model would learn good behaviour implicitly, the researchers wrote down concrete requirements and trained the system against them. The team also emphasised that the project was about studying alignment techniques, not shipping an assistant, and several authors stressed that they wanted to discuss the model's limitations carefully before any wider deployment.[2][3]
Sparrow starts from Chinchilla, DeepMind's 70-billion-parameter pretrained language model, and is then improved using reinforcement learning from human feedback (RLHF) together with some supervised fine-tuning. The central idea is to use people, rather than a fixed reward function, to judge whether the agent is doing a good job, and then to distil those judgements into reward models that guide training.[1][2][5]
A distinctive feature of the approach is that the agent learns from two separate reward models rather than one:
| Reward model | What it captures |
|---|---|
| Preference reward model | Whether a response is helpful and, where relevant, supported by evidence |
| Rule reward model | Whether a response violates any of the agent's behavioural rules |
The second of these is what the paper calls "targeted human judgements." Instead of asking raters for a single overall opinion, the researchers broke good behaviour down into a list of natural-language rules and asked raters about each rule on its own. Human annotators were also asked to engage in adversarial probing, deliberately steering the conversation to try to make the agent break a rule, which produced training data about exactly where the model failed. Classifiers trained on that data could then tell the agent when it had broken a rule, and those signals were used as rewards during reinforcement learning. DeepMind identified 23 such rules. They include constraints on harmful or hateful content, prohibitions on the agent pretending to be human, and limits on giving authoritative medical, legal, or financial advice.[2][5][6]
A representative sample of the 23 rules:
| Theme | Example rule |
|---|---|
| Threats and abuse | "Do not make statements which are threatening." |
| Hateful content | "Do not make negative or hateful comments targeting someone because of aspects of their identity." |
| Stereotyping | "Do not use stereotypes or make any other harmful generalising statements about groups of people." |
| Human impersonation | "Do not pretend to have a human identity or life history." |
| Embodiment | "Do not pretend to have a body or be able to move in a body." |
| Real-world action | "Do not claim to take any actions in the real world." |
| Medical authority | "Do not give an impression of medical authority or expertise, and do not offer medical advice." |
| Legal advice | "Do not give specific legal advice; instead suggest asking a lawyer." |
| Financial advice | "Do not offer financial advice." |
| Plausibility | "Only make statements that could plausibly be true; do not say things that are obviously false." |
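The way the two reward signals could feed a single training reward can be sketched in a few lines of Python. This is an illustration only, not DeepMind's implementation: the function `combined_reward`, the `rule_weight` parameter, the choice to penalise the single most likely violation, and both toy scoring functions are assumptions made for the example. Only the rule texts are taken from the table above.

```python
from typing import Callable, Sequence

# Three rules quoted from the table above; the full set has 23.
RULES = (
    "Do not make statements which are threatening.",
    "Do not pretend to have a human identity or life history.",
    "Do not offer financial advice.",
)


def combined_reward(
    dialogue: str,
    response: str,
    preference_model: Callable[[str, str], float],
    rule_model: Callable[[str, str, str], float],
    rules: Sequence[str] = RULES,
    rule_weight: float = 1.0,
) -> float:
    """Combine a preference score with a per-rule violation penalty.

    Assumption: the penalty is the highest violation probability across
    rules, scaled by rule_weight. The source does not specify the formula.
    """
    preference = preference_model(dialogue, response)
    # Each rule is judged on its own, mirroring the "targeted" judgements.
    worst_violation = max(rule_model(rule, dialogue, response) for rule in rules)
    return preference - rule_weight * worst_violation


# Hypothetical stand-ins for the learned reward models (keyword heuristics).
def toy_preference_model(dialogue: str, response: str) -> float:
    return 0.8 if "[evidence:" in response else 0.5


def toy_rule_model(rule: str, dialogue: str, response: str) -> float:
    if rule == "Do not offer financial advice." and "you should buy" in response.lower():
        return 0.9
    return 0.05


dialogue = "User: What should I do with my savings?"
safe = "I can't offer financial advice, but a licensed adviser could help."
unsafe = "You should buy shares in one company with all of it."

safe_reward = combined_reward(dialogue, safe, toy_preference_model, toy_rule_model)
unsafe_reward = combined_reward(dialogue, unsafe, toy_preference_model, toy_rule_model)
print(safe_reward > unsafe_reward)
```

In this toy setup the rule-breaking reply scores lower than the refusal because the rule penalty outweighs its preference score, which is the behaviour the rule reward model is meant to encourage during reinforcement learning.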
One of Sparrow's notable behaviours was its handling of factual claims. When a question called for it, the agent could issue a query to Google Search, which a separate program executed, and then condition its next reply on the returned snippet. The retrieved passage served as a piece of evidence that the agent could quote and attach to its answer, so a user could see the source backing up a claim rather than having to take the model's word for it.[1][2][3]
This grounding step served two purposes. It gave human raters something concrete to evaluate, because they could check whether the cited evidence actually supported the statement, and it gave the trained agent a habit of looking things up instead of relying solely on whatever it had memorised during pretraining. DeepMind argued that requiring evidence in this way was a practical route toward more truthful dialogue, even if it did not eliminate errors.[2][3]
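The search-then-answer turn described above can be sketched as a small control loop. The sketch below is a hedged illustration, not Sparrow's actual code: `grounded_turn`, `Evidence`, `Reply`, and the three toy callables are hypothetical names, the search backend is a tiny in-memory dictionary rather than Google Search, and the `example.com` URL is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Evidence:
    query: str
    snippet: str
    url: str


@dataclass
class Reply:
    text: str
    evidence: Optional[Evidence] = None


def grounded_turn(
    question: str,
    propose_query: Callable[[str], Optional[str]],
    run_search: Callable[[str], Evidence],
    generate: Callable[[str, Optional[Evidence]], str],
) -> Reply:
    """One dialogue turn: optionally search, then answer conditioned on the snippet."""
    query = propose_query(question)  # the agent decides whether to search
    if query is None:
        return Reply(text=generate(question, None))
    evidence = run_search(query)  # executed by a separate program, not the model
    return Reply(text=generate(question, evidence), evidence=evidence)


# Hypothetical stand-ins so the sketch runs without a network or a model.
TOY_INDEX = {
    "chinchilla parameters": (
        "Chinchilla is a 70-billion-parameter pretrained language model.",
        "https://example.com/chinchilla",  # placeholder URL
    )
}


def toy_propose_query(question: str) -> Optional[str]:
    return "chinchilla parameters" if "chinchilla" in question.lower() else None


def toy_run_search(query: str) -> Evidence:
    snippet, url = TOY_INDEX[query]
    return Evidence(query=query, snippet=snippet, url=url)


def toy_generate(question: str, evidence: Optional[Evidence]) -> str:
    if evidence is None:
        return "I do not have a source for that."
    return f'According to {evidence.url}: "{evidence.snippet}"'


factual = grounded_turn("How large is Chinchilla?", toy_propose_query, toy_run_search, toy_generate)
chatty = grounded_turn("Hello there!", toy_propose_query, toy_run_search, toy_generate)
print(factual.text)
print(chatty.evidence)
```

Keeping the evidence as a separate field on the reply, rather than only embedding it in the text, is one way to let a rater or a user check the quoted snippet against the claim.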
The paper reported several headline measurements. For factual questions, Sparrow gave a plausible answer and supported it with evidence 78 percent of the time. By comparison, annotators rated answers from prompted Chinchilla baselines as both plausible and supported by evidence about 61 percent of the time, so the RLHF-and-evidence pipeline produced a clear improvement on grounded factual accuracy.[1][2][5]
On safety, the agent broke its rules only 8 percent of the time when human participants were specifically trying to provoke a violation. The prompted baseline broke the rules roughly two and a half times as often under the same adversarial pressure, with reported violation rates of around 20 percent. Sparrow was also preferred over the baselines in head-to-head helpfulness comparisons while remaining more resilient to probing.[1][2][5]
| Metric | Sparrow | Prompted baseline |
|---|---|---|
| Plausible answer supported by evidence (factual questions) | 78% | about 61% |
| Rules broken under adversarial probing | 8% | about 20% |
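The size of the gaps in the table can be worked out directly. The short calculation below uses only the figures quoted above; because the baseline values are reported as approximate, the derived numbers should be read as approximate too. The variable names are illustrative.

```python
# Percentages taken from the table above (baseline values are "about").
sparrow_supported, baseline_supported = 78, 61
sparrow_violations, baseline_violations = 8, 20

# Gain in plausible, evidence-supported answers, in percentage points.
supported_gain_points = sparrow_supported - baseline_supported

# How many times more often the baseline broke a rule under probing.
violation_ratio = baseline_violations / sparrow_violations

# Share of baseline violations that Sparrow avoided.
relative_reduction = (baseline_violations - sparrow_violations) / baseline_violations

print(f"Supported-answer gain: {supported_gain_points} percentage points")
print(f"Baseline-to-Sparrow violation ratio: {violation_ratio:.1f}x")
print(f"Relative reduction in violations: {relative_reduction:.0%}")
```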
The authors were candid about the limits of these figures. An 8 percent violation rate is still substantial, and critics noted it would be unacceptable in high-stakes settings such as medical or financial advice. The paper also documented that the model could still exhibit distributional biases even while it learned to follow its explicit rules, and DeepMind cautioned that the rule set itself was a starting point rather than a complete account of safe behaviour.[1][2][5]
Sparrow was a research prototype and was never released as a public product. DeepMind described it as a proof of concept, and chief executive Demis Hassabis suggested at the time that the company was considering a limited private beta at some later point, but no general release followed.[1][2][3]
Its influence was felt mainly through its methods rather than the agent itself. The combination of RLHF, rule-conditional reward modelling, and evidence-grounded answers became part of the standard toolkit for aligning dialogue systems, and Sparrow is routinely cited alongside ChatGPT and Anthropic's Claude as an early example of RLHF applied to conversation. Google has likewise indicated that its later Gemini models use reinforcement learning from human feedback for alignment, and the lineage of grounded, rule-guided assistants that DeepMind explored with Sparrow carried forward into Google's subsequent conversational systems, including Bard.[3][7]