Instruction backtranslation (Humpback)

Updated Jun 8, 2026

Overview

Instruction backtranslation is a self-alignment method for generating instruction-tuning data, introduced by researchers at Meta AI in the paper "Self-Alignment with Instruction Backtranslation," first posted in August 2023 and later published at ICLR 2024. ^[1] It targets a central bottleneck in aligning large language models: the supervised data used to teach a base model to follow instructions is scarce and costly. Such data normally comes either from paid human annotators writing prompts and answers, or from distilling the outputs of a stronger proprietary model, a practice that raises both cost and licensing concerns. Instruction backtranslation avoids both routes by harvesting instruction-following examples from the large quantity of unlabeled human-written text already on the web. ^[1]

The method inverts the usual order of data creation. Rather than starting from an instruction and collecting a response, it starts from a document, treats that document as though it were a high-quality response, and asks a model to reconstruct the instruction that the document would plausibly answer. A second stage then has the model grade its own generated pairs and retain only the best. The model produced this way, built on LLaMA and named Humpback, became the top-performing LLaMA-based system on the AlpacaEval leaderboard among models that do not depend on distillation, at both the 33B and 65B parameter scales. ^[1]

Background: the backtranslation analogy

The name and the core mechanism come from backtranslation, a data-augmentation technique established in neural machine translation by Rico Sennrich and colleagues in 2016. ^[2] In translation, parallel sentence pairs are scarce, but monolingual text in the target language is plentiful. Backtranslation exploits this by training a reverse model that translates target-language sentences back into the source language, manufacturing synthetic source sentences paired with real target sentences. The synthetic source side may be imperfect, but the target side, which is what the final system must learn to produce, is genuine human text.

Instruction backtranslation maps this idea onto instruction tuning. The response plays the role of the abundant "target" text, and a web document is treated as such a response. The instruction plays the role of the scarce "source," which a backward model synthesizes. As in translation, the generated instruction may be noisy, but the response the model ultimately learns to write is authentic, fluent human writing drawn from the web rather than from another model. This is a key reason the approach is described as non-distilled: no stronger teacher model supplies the answers. ^[1]

How it works

The procedure begins with two ingredients: a small seed set of human-written (instruction, output) pairs, and a large corpus of unlabeled documents. Two models are derived from the same base LLaMA checkpoint. A forward seed model, denoted M0, is finetuned on the seed pairs to follow instructions in the normal direction, predicting an output from an instruction. A backward model, denoted Myx, is finetuned on the same pairs but with the fields swapped, so that it predicts an instruction from a given output. ^[1]
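
The relationship between the two finetuning sets can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `TrainingExample` class, the two helper functions, and the seed pair are hypothetical names and data invented here, and real training would also apply a prompt template and tokenization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    prompt: str   # text the model conditions on
    target: str   # text the model is trained to produce


def forward_examples(seed_pairs):
    """Instruction -> output: training data for the forward seed model M0."""
    return [TrainingExample(prompt=instr, target=out) for instr, out in seed_pairs]


def backward_examples(seed_pairs):
    """Output -> instruction: the same pairs with fields swapped, for Myx."""
    return [TrainingExample(prompt=out, target=instr) for instr, out in seed_pairs]


# Hypothetical seed pair, invented for illustration only.
seed = [("Explain what a prime number is.",
         "A prime number is an integer greater than 1 whose only divisors are 1 and itself.")]

fwd = forward_examples(seed)
bwd = backward_examples(seed)
```

Both lists are built from the same seed pairs; only the direction of prediction differs, which is why a single seed set suffices to train both models.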

Self-augmentation

In the self-augmentation step, the backward model Myx is run over every document in the unlabeled corpus. Each document is treated as if it were the output, and the model generates a candidate instruction for it. This yields a large pool of synthetic (instruction, output) pairs in which every output is real human text and every instruction is machine-generated. Because the instructions are inferred rather than curated, the pool is large but uneven in quality. ^[1]
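
A rough sketch of this step follows. The function `self_augment` and the stand-in `toy_backward_model` are assumptions made for illustration; in practice the backward model would be a finetuned LLM sampled through an inference library, and the two documents are invented.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (instruction, output)


def self_augment(documents: Iterable[str],
                 backward_model: Callable[[str], str]) -> List[Pair]:
    """Pair every document with an instruction predicted by the backward model."""
    pairs = []
    for doc in documents:
        instruction = backward_model(doc)  # machine-generated side
        pairs.append((instruction, doc))   # document is kept verbatim as the output
    return pairs


def toy_backward_model(document: str) -> str:
    """Stand-in for Myx; a real system would sample from a finetuned LLM."""
    first_words = " ".join(document.split()[:4])
    return f"Write a passage that begins: {first_words}"


# Hypothetical documents, invented for illustration only.
docs = ["Sourdough starter needs regular feeding with flour and water.",
        "The tide rises and falls twice a day in most coastal regions."]
candidates = self_augment(docs, toy_backward_model)
```

Note that the output side of every pair is the untouched document, so any noise introduced at this stage is confined to the instruction.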

Self-curation

The self-curation step filters that pool. The current instruction-following model, starting from M0, is prompted to score each candidate pair for quality on a scale from 1 to 5, using a rubric that rewards pairs whose output is a helpful, complete answer to the instruction. Only pairs that receive the top score are kept. The model is then finetuned on the seed data plus this curated subset, producing an improved model. That improved model re-scores the pool in the next round, an iterative loop the authors run for two iterations. Because the same model both generates and judges the data, the process is self-contained, with no external annotator or teacher in the loop. ^[1]
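
The loop described above might be sketched as follows. All names here (`curate`, `iterate_curation`, `toy_scorer`, `toy_finetune`) are hypothetical, the scorer is a crude rule rather than a prompted language model, and the finetuning step is a placeholder that returns the same scorer unchanged.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]            # (instruction, output)
Scorer = Callable[[Pair], int]    # returns a quality score from 1 to 5


def curate(pool: List[Pair], score: Scorer, threshold: int = 5) -> List[Pair]:
    """Keep only the candidate pairs rated at or above the threshold."""
    return [pair for pair in pool if score(pair) >= threshold]


def iterate_curation(seed: List[Pair], pool: List[Pair], initial_model: Scorer,
                     finetune: Callable[[List[Pair]], Scorer],
                     rounds: int = 2) -> Tuple[Scorer, List[Pair]]:
    """Alternate between scoring the full pool and finetuning on seed + kept pairs."""
    model = initial_model
    kept: List[Pair] = []
    for _ in range(rounds):
        kept = curate(pool, model)      # the whole pool is re-scored each round
        model = finetune(seed + kept)   # the improved model judges the next round
    return model, kept


# Toy stand-ins so the sketch runs; neither reflects a real model.
def toy_scorer(pair: Pair) -> int:
    instruction, output = pair
    return 5 if instruction.endswith("?") and len(output.split()) >= 5 else 2


def toy_finetune(training_pairs: List[Pair]) -> Scorer:
    return toy_scorer  # a real implementation would return a newly trained model


seed = [("What is 2 + 2?", "Two plus two equals four.")]
pool = [("How do tides work?", "Tides are driven mainly by the Moon's gravity."),
        ("text", "Click here to subscribe.")]
final_model, curated = iterate_curation(seed, pool, toy_scorer, toy_finetune)
```

The design point the sketch tries to capture is that the pool itself never changes between rounds; only the judge does.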

The data funnel for the published run is summarized below. ^[1]

Stage	Source	Count
Seed (instruction, output) pairs	OpenAssistant, first turn, rank 0	3,200
Unlabeled documents	ClueWeb (English, preprocessed)	about 502,000
Curated at score 4 or higher, iteration 2	augmented set A4	195,043
Curated at the top score, iteration 2	augmented set A5, used to train Humpback	41,821

The contrast between roughly 502,000 candidates and the roughly 42,000 retained illustrates the method's central bet: a smaller, aggressively filtered set of high-quality pairs trains a better instruction follower than a larger but noisier one. ^[1]
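
As a quick check of the funnel, the retention rates implied by the table above can be computed directly. The variable names are illustrative, and because the document count is itself approximate, the percentages are approximate too.

```python
# Counts as reported in the table above; 502,000 is an approximate figure.
unlabeled_documents = 502_000
curated_score_4_plus = 195_043
curated_score_5 = 41_821

share_4_plus = curated_score_4_plus / unlabeled_documents
share_5 = curated_score_5 / unlabeled_documents

print(f"kept at score >= 4: {share_4_plus:.1%}")
print(f"kept at score 5:    {share_5:.1%}")
```

Under these figures, a little under 39 percent of candidates clear a threshold of 4, and only about 8 percent receive the top score.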

Humpback model and results

The seed data was drawn from the OpenAssistant human-conversation dataset, keeping 3,200 first-turn, top-ranked examples. ^[1]^[4] The unlabeled corpus was the English portion of ClueWeb, reduced to about 502,000 segments after deduplication, length filtering, and removal of low-quality text. The base models were LLaMA at 7B, 33B, and 65B parameters. ^[1]^[3] The final models, finetuned on the seed plus the top-scored curated set, were named Humpback, a play on the camelid naming of LLaMA and its relatives that signals a jump in scale from camels to whales. ^[1]

Humpback was evaluated on AlpacaEval, which reports the pairwise win rate of a model's responses against those of the reference system text-davinci-003, as judged automatically by GPT-4. Performance improved steadily with model scale, and Humpback was the strongest non-distilled LLaMA-based model at the 33B and 65B sizes. ^[1]

Model (65B, non-distilled)	AlpacaEval win rate vs text-davinci-003
Humpback 65B	83.71%
Guanaco 65B	71.80%
LIMA 65B	62.70%
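
For readers unfamiliar with the metric, a pairwise win rate of the kind shown in the table above can be computed from per-prompt verdicts roughly as follows. This is a simplified sketch, not AlpacaEval's actual implementation: the `win_rate` function, the verdict labels, and the choice to count a tie as half a win are assumptions, and the verdicts are invented.

```python
from typing import Iterable


def win_rate(judgments: Iterable[str]) -> float:
    """Percentage of comparisons won by the model; a tie counts as half a win (assumption)."""
    points = {"model": 1.0, "tie": 0.5, "reference": 0.0}
    scores = [points[j] for j in judgments]
    if not scores:
        raise ValueError("need at least one judgment")
    return 100.0 * sum(scores) / len(scores)


# Hypothetical judge verdicts for five prompts, invented for illustration.
verdicts = ["model", "model", "reference", "tie", "model"]
rate = win_rate(verdicts)
```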

At 33B, Humpback reached 79.84 percent, again leading its non-distilled peers. ^[1] A later revision of the paper applied the method to a LLaMA 2 70B base and reported an AlpacaEval win rate of roughly 87.9 percent. ^[1] Some distilled models, such as Vicuna, posted higher raw scores, but they were trained on outputs collected from stronger systems like ChatGPT, the very dependency instruction backtranslation was designed to avoid. The authors also measured data efficiency by fitting how response quality scales with the number of training examples, and found their self-curated data more efficient per example than instruction sets such as those used by WizardLM, Alpaca-GPT4, and LIMA. ^[1]
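
A data-efficiency comparison of this kind can be illustrated with a simple curve fit. The sketch below assumes a log-linear relationship between win rate and the number of training examples; that functional form, the `fit_log_linear` helper, and the data points are all assumptions made here for illustration, and the points are synthetic rather than measurements from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(sizes: Sequence[int], win_rates: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of win_rate = alpha * ln(N) + c; returns (alpha, c)."""
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(win_rates) / count
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, win_rates))
    var = sum((x - mean_x) ** 2 for x in xs)
    alpha = cov / var
    return alpha, mean_y - alpha * mean_x


# Synthetic points generated from alpha = 6.0, c = 10.0; not measurements from the paper.
sizes = [100, 1_000, 10_000]
rates = [6.0 * math.log(n) + 10.0 for n in sizes]
alpha, c = fit_log_linear(sizes, rates)
```

Under this framing, a larger fitted slope means each multiplicative increase in data buys more win rate, which is one way to compare datasets of different sizes on equal terms.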

Significance

Instruction backtranslation showed that a base model, a small seed set, and a pile of unlabeled text could bootstrap competitive instruction-following ability without a stronger teacher, placing it within the broader family of synthetic data and self-improvement methods for LLMs. It contrasts with distillation-based pipelines such as the original Alpaca recipe, which applied a Self-Instruct-style procedure to a more capable model to generate training data. ^[1]^[5] It is conceptually adjacent to self-training methods like STaR and ReST, in which a model also filters its own outputs and then learns from them, but instruction backtranslation is distinctive in generating the instruction side while anchoring the response side in real human documents.

The two-part recipe (generate candidates, then have the model curate them) reinforced a wider lesson of the instruction-tuning literature: data quality matters more than sheer quantity. The approach influenced later work, including a 2024 follow-up, "Better Alignment with Instruction Back-and-Forth Translation," which adds a step that rewrites the source web documents into cleaner responses before training, ^[6] and REInstruct, which builds instruction data from unlabeled corpora along similar lines. ^[7]

Limitations

The method has several constraints. Its ceiling is set by the seed and base models: the backward model can write instructions only as well as 3,200 examples allow, and the judge can rate quality only as well as the seed model can discern it, so errors in self-curation propagate into training. Because the model both writes and grades the data, the loop can reinforce its own biases rather than correct them. The responses are limited to what already exists as web documents, so tasks whose ideal answers are not well represented online, such as multi-step reasoning, code, or carefully formatted outputs, are covered poorly, and any factual errors or unsafe content in the source text can be inherited. The published work focuses on single-turn English instructions, leaving multi-turn dialogue, multilingual coverage, and safety alignment to other techniques. Finally, the quality scores come from a model prompted with a fixed rubric, so the filter is only as calibrated as that prompt and the model behind it. ^[1]

References

[1] Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Omer Levy, Luke Zettlemoyer, Jason Weston, Mike Lewis. "Self-Alignment with Instruction Backtranslation." ICLR 2024. arXiv:2308.06259 (August 2023). https://arxiv.org/abs/2308.06259
[2] Rico Sennrich, Barry Haddow, Alexandra Birch. "Improving Neural Machine Translation Models with Monolingual Data." Proceedings of ACL 2016. arXiv:1511.06709. https://arxiv.org/abs/1511.06709
[3] Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971 (February 2023). https://arxiv.org/abs/2302.13971
[4] Andreas Kopf, Yannic Kilcher, Dimitri von Rutte, et al. "OpenAssistant Conversations: Democratizing Large Language Model Alignment." NeurIPS 2023. arXiv:2304.07327. https://arxiv.org/abs/2304.07327
[5] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, et al. "Self-Instruct: Aligning Language Models with Self-Generated Instructions." Proceedings of ACL 2023. arXiv:2212.10560. https://arxiv.org/abs/2212.10560
[6] "Better Alignment with Instruction Back-and-Forth Translation." arXiv:2408.04614 (2024). https://arxiv.org/abs/2408.04614
[7] "REInstruct: Building Instruction Data from Unlabeled Corpus." Findings of ACL 2024. arXiv:2408.10663. https://arxiv.org/abs/2408.10663
