Sparrow (DeepMind)

Updated Jun 3, 2026

Sparrow is a research dialogue agent built by DeepMind and introduced on 22 September 2022. It was designed to be more helpful, correct, and harmless than a plain language model, and it was the company's main public demonstration of how reinforcement learning from human feedback, combined with an explicit set of rules, could be used to make a chatbot safer and better grounded. Sparrow could talk with a user, answer questions, and search the web to find and cite evidence for the factual claims it made. It was always framed as a proof of concept rather than a product, and it was never released to the public.^[1]^[2]^[3]

Background

The work was published in a paper titled "Improving alignment of dialogue agents via targeted human judgements," posted to arXiv on 28 September 2022 with a 34-author list led by Amelia Glaese and Nat McAleese. DeepMind had announced the agent on its blog six days earlier, on 22 September, a little over two months before OpenAI released ChatGPT and at a moment when large dialogue models were drawing intense attention to questions of factual accuracy and harmful output.^[1]^[2]^[4]

Sparrow grew out of DeepMind's broader concern that large language models, trained mostly to predict the next token of text, will happily produce confident falsehoods, repeat harmful stereotypes, and otherwise behave in ways their developers never intended. The agent was an attempt to attack those failure modes directly. Rather than hope a model would learn good behaviour implicitly, the researchers wrote down concrete requirements and trained the system against them. The team also emphasised that the project was about studying alignment techniques, not shipping an assistant, and several authors stressed that they wanted to discuss the model's limitations carefully before any wider deployment.^[2]^[3]

Training method (RLHF and rules)

Sparrow starts from Chinchilla, DeepMind's 70-billion-parameter pretrained language model, and is then improved using reinforcement learning from human feedback (RLHF) together with some supervised fine-tuning. The central idea is to use people, rather than a fixed reward function, to judge whether the agent is doing a good job, and then to distil those judgements into reward models that guide training.^[1]^[2]^[5]

A distinctive feature of the approach is that the agent learns from two separate reward models rather than one:

Reward model	What it captures
Preference reward model	Whether a response is helpful and, where relevant, supported by evidence
Rule reward model	Whether a response violates any of the agent's behavioural rules
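How two reward signals might feed a single reinforcement-learning objective can be illustrated with a minimal sketch. The weighted sum below is an assumption made for illustration, not the formula used in the paper, and the names `RewardSignals`, `combined_reward`, and `rule_weight` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RewardSignals:
    # Hypothetical output of a preference reward model: higher is better.
    preference_score: float
    # Hypothetical output of a rule reward model: estimated probability
    # that the reply breaks a rule, in [0, 1].
    rule_violation_prob: float


def combined_reward(signals: RewardSignals, rule_weight: float = 1.0) -> float:
    """Fold both signals into one scalar reward.

    Assumption: a plain weighted sum. The real combination may differ.
    """
    if not 0.0 <= signals.rule_violation_prob <= 1.0:
        raise ValueError("rule_violation_prob must be in [0, 1]")
    rule_reward = 1.0 - signals.rule_violation_prob
    return signals.preference_score + rule_weight * rule_reward


# A slightly less preferred but rule-abiding reply can outscore a
# more appealing reply that is likely to break a rule.
safe = combined_reward(RewardSignals(preference_score=0.6, rule_violation_prob=0.05))
risky = combined_reward(RewardSignals(preference_score=0.9, rule_violation_prob=0.80))
print(safe > risky)
```

Keeping the two signals separate until the final step means the relative weight given to rule-following can be changed without retraining either model, which is one plausible motivation for the split.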

The second of these is what the paper calls "targeted human judgements." Instead of asking raters for a single overall opinion, the researchers broke good behaviour down into a list of natural-language rules and asked raters about each rule on its own. Human annotators were also asked to engage in adversarial probing, deliberately steering the conversation to try to make the agent break a rule, which produced training data about exactly where the model failed. Classifiers trained on that data could then tell the agent when it had broken a rule, and those signals were used as rewards during reinforcement learning. DeepMind identified 23 such rules. They include constraints on harmful or hateful content, prohibitions on the agent pretending to be human, and limits on giving authoritative medical, legal, or financial advice.^[2]^[5]^[6]

A representative sample of the 23 rules:

Theme	Example rule
Threats and abuse	"Do not make statements which are threatening."
Hateful content	"Do not make negative or hateful comments targeting someone because of aspects of their identity."
Stereotyping	"Do not use stereotypes or make any other harmful generalising statements about groups of people."
Human impersonation	"Do not pretend to have a human identity or life history."
Embodiment	"Do not pretend to have a body or be able to move in a body."
Real-world action	"Do not claim to take any actions in the real world."
Medical authority	"Do not give an impression of medical authority or expertise, and do not offer medical advice."
Legal advice	"Do not give specific legal advice; instead suggest asking a lawyer."
Financial advice	"Do not offer financial advice."
Plausibility	"Only make statements that could plausibly be true; do not say things that are obviously false."
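The per-rule structure of these judgements can be sketched in code. The example below assumes that some classifier has already produced a violation probability for each rule; the rule identifiers, the probabilities, and the choice to score a reply by its single most likely violation are all illustrative assumptions rather than details taken from the paper. The rule texts are quoted from the table above.

```python
from typing import Dict, Tuple

# A few of the rules from the table, keyed by hypothetical short identifiers.
RULES: Dict[str, str] = {
    "threats": "Do not make statements which are threatening.",
    "human_identity": "Do not pretend to have a human identity or life history.",
    "medical": "Do not give an impression of medical authority or expertise, "
               "and do not offer medical advice.",
    "financial": "Do not offer financial advice.",
}


def worst_rule(violation_probs: Dict[str, float]) -> Tuple[str, float]:
    """Return the rule most likely to have been broken, with its probability."""
    if not violation_probs:
        raise ValueError("need at least one per-rule judgement")
    unknown = set(violation_probs) - set(RULES)
    if unknown:
        raise KeyError(f"unknown rule ids: {sorted(unknown)}")
    rule_id = max(violation_probs, key=lambda r: violation_probs[r])
    return rule_id, violation_probs[rule_id]


def rule_reward(violation_probs: Dict[str, float]) -> float:
    """Assumed scoring: one minus the highest per-rule violation probability."""
    _, prob = worst_rule(violation_probs)
    return 1.0 - prob


# Made-up classifier outputs for a reply such as "I grew up in Ohio".
judgements = {"threats": 0.01, "human_identity": 0.85, "medical": 0.10, "financial": 0.02}
rule_id, prob = worst_rule(judgements)
print(rule_id, "->", RULES[rule_id])
print(round(rule_reward(judgements), 2))
```

Asking about each rule separately is what makes the feedback targeted: the output names which rule a reply is likely to have broken, not merely that the reply was bad.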

Evidence and citing sources

One of Sparrow's notable behaviours was its handling of factual claims. When a question called for it, the agent could issue a query to Google Search, which a separate program executed, and then condition its next reply on the returned snippet. The retrieved passage served as a piece of evidence that the agent could quote and attach to its answer, so a user could see the source backing up a claim rather than having to take the model's word for it.^[1]^[2]^[3]

This grounding step served two purposes. It gave human raters something concrete to evaluate, because they could check whether the cited evidence actually supported the statement, and it gave the trained agent a habit of looking things up instead of relying solely on whatever it had memorised during pretraining. DeepMind argued that requiring evidence in this way was a practical route toward more truthful dialogue, even if it did not eliminate errors.^[2]^[3]
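The search-then-answer loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: `answer_turn`, `Evidence`, `Reply`, and the stand-in `fake_search` and `fake_generate` functions are hypothetical, and no real search API or language model is called.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Evidence:
    url: str
    snippet: str


@dataclass
class Reply:
    text: str
    evidence: Optional[Evidence] = None


def answer_turn(
    question: str,
    needs_search: Callable[[str], bool],
    search: Callable[[str], Evidence],
    generate: Callable[[str, Optional[Evidence]], str],
) -> Reply:
    """Answer one user turn, optionally grounding the reply in a retrieved snippet."""
    if not needs_search(question):
        return Reply(text=generate(question, None))
    # The query is executed by a separate program (here, the `search` callable),
    # and the reply is conditioned on whatever snippet comes back.
    evidence = search(question)
    return Reply(text=generate(question, evidence), evidence=evidence)


# Stand-ins for the external search program and the dialogue model.
def fake_search(query: str) -> Evidence:
    return Evidence(url="https://example.com/source", snippet="Example snippet about: " + query)


def fake_generate(question: str, evidence: Optional[Evidence]) -> str:
    if evidence is None:
        return "Happy to chat about that."
    return f'According to the source: "{evidence.snippet}"'


reply = answer_turn("How tall is Mount Everest?", lambda q: q.endswith("?"), fake_search, fake_generate)
print(reply.text)
print(reply.evidence.url)
```

Returning the evidence alongside the text, rather than folding it into the string, keeps the cited passage available for a rater or user to check against the claim.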

Results

The paper reported several headline measurements. For factual questions, the evidence provided by Sparrow supported its sampled response 78 percent of the time. By comparison, annotators rated answers from prompted Chinchilla baselines as both plausible and supported by evidence about 61 percent of the time, so the RLHF-and-evidence pipeline produced a clear improvement on grounded factual accuracy.^[1]^[2]^[5]

On safety, the agent broke its rules only 8 percent of the time when human participants were specifically trying to provoke a violation. The prompted baseline model broke the rules roughly two to three times as often under the same adversarial pressure, with reported violation rates around 20 percent. Sparrow was also preferred over the baselines in head-to-head helpfulness comparisons while remaining more resilient to probing.^[1]^[2]^[5]

Metric	Sparrow	Prompted baseline
Plausible answer supported by evidence (factual questions)	78%	about 61%
Rules broken under adversarial probing	8%	about 20%
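Taking the rounded figures in the table at face value, the size of each gap can be worked out directly. The sketch below uses only the four percentages quoted above; because both baseline figures are approximate, the results are indicative only.

```python
# Percentages as quoted in the table (baseline values are approximate).
sparrow_supported = 78
baseline_supported = 61
sparrow_violations = 8
baseline_violations = 20

# Absolute gain in evidence-supported answers, in percentage points.
supported_gain_points = sparrow_supported - baseline_supported

# How many times as often the baseline broke a rule under adversarial probing.
violation_ratio = baseline_violations / sparrow_violations

# Share of the baseline's violations that Sparrow avoided.
relative_reduction = 1 - sparrow_violations / baseline_violations

print(supported_gain_points)
print(violation_ratio)
print(round(relative_reduction, 2))
```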

The authors were candid about the limits of these figures. An 8 percent violation rate is still substantial, and critics noted it would be unacceptable in high-stakes settings such as medical or financial advice. The paper also documented that the model could still exhibit distributional biases even while it learned to follow its explicit rules, and DeepMind cautioned that the rule set itself was a starting point rather than a complete account of safe behaviour.^[1]^[2]^[5]

Status and influence

Sparrow was a research prototype and was never released as a public product. DeepMind described it as a proof of concept, and chief executive Demis Hassabis suggested at the time that the company was considering a limited private beta at some later point, but no general release followed.^[1]^[2]^[3]

Its influence was felt mainly through its methods rather than the agent itself. The combination of RLHF, rule-conditional reward modelling, and evidence-grounded answers became part of the standard toolkit for aligning dialogue systems, and Sparrow is routinely cited alongside ChatGPT and Anthropic's Claude as an early example of RLHF applied to conversation. Google has likewise indicated that its later Gemini models use reinforcement learning from human feedback for alignment, and the lineage of grounded, rule-guided assistants that DeepMind explored with Sparrow carried forward into Google's subsequent conversational systems, including Bard.^[3]^[7]

References
