Instruction backtranslation (Humpback)
Last reviewed: Jun 8, 2026
Sources: 7 citations
Review status: Source-backed
Revision: v1 · 1,566 words
Instruction backtranslation is a self-alignment method for generating instruction tuning data, introduced by researchers at Meta AI in the paper "Self-Alignment with Instruction Backtranslation," first posted in August 2023 and later published at ICLR 2024. [1] It targets a central bottleneck in aligning large language models: the supervised data used to teach a base model to follow instructions is scarce and costly. Such data normally comes either from paid human annotators writing prompts and answers, or from distilling the outputs of a stronger proprietary model, a practice that raises both cost and licensing concerns. Instruction backtranslation avoids both routes by harvesting instruction-following examples from the large quantity of unlabeled human-written text already on the web. [1]
The method inverts the usual order of data creation. Rather than starting from an instruction and collecting a response, it starts from a document, treats that document as though it were a high-quality response, and asks a model to reconstruct the instruction that the document would plausibly answer. A second stage then has the model grade its own generated pairs and retain only the best. The model produced this way, built on LLaMA and named Humpback, became the top-performing LLaMA-based system on the AlpacaEval leaderboard among models that do not depend on distillation, at both the 33B and 65B parameter scales. [1]
The name and the core mechanism come from backtranslation, a data-augmentation technique established in neural machine translation by Rico Sennrich and colleagues in 2016. [2] In translation, parallel sentence pairs are scarce, but monolingual text in the target language is plentiful. Backtranslation exploits this by training a reverse model that translates target-language sentences back into the source language, manufacturing synthetic source sentences paired with real target sentences. The synthetic source side may be imperfect, but the target side, which is what the final system must learn to produce, is genuine human text.
Instruction backtranslation maps this idea onto instruction tuning. The response plays the role of the abundant "target" text, and a web document is treated as such a response. The instruction plays the role of the scarce "source," which a backward model synthesizes. As in translation, the generated instruction may be noisy, but the response the model ultimately learns to write is authentic, fluent human writing drawn from the web rather than from another model. This is a key reason the approach is described as non-distilled: no stronger teacher model supplies the answers. [1]
The procedure begins with two ingredients: a small seed set of human-written (instruction, output) pairs, and a large corpus of unlabeled documents. Two models are derived from the same base LLaMA checkpoint. A forward seed model, denoted M0, is finetuned on the seed pairs to follow instructions in the normal direction, predicting an output from an instruction. A backward model, denoted Myx, is finetuned on the same pairs but with the fields swapped, so that it predicts an instruction from a given output. [1]
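The field swap can be illustrated with a minimal sketch. The `Example` type, the helper names, and the single seed pair are hypothetical and used only for illustration; the paper's actual data format and prompt templates are not described here.

```python
from typing import List, NamedTuple, Tuple


class Example(NamedTuple):
    prompt: str  # text the model conditions on
    target: str  # text the model learns to generate


def forward_example(instruction: str, output: str) -> Example:
    """Example for the forward model M0: instruction -> output."""
    return Example(prompt=instruction, target=output)


def backward_example(instruction: str, output: str) -> Example:
    """Example for the backward model Myx: output -> instruction."""
    return Example(prompt=output, target=instruction)


# Hypothetical seed pair, standing in for the human-written seed set.
seed_pairs: List[Tuple[str, str]] = [
    (
        "Explain why the sky looks blue.",
        "Air molecules scatter short blue wavelengths more than long red ones.",
    ),
]

forward_data = [forward_example(i, o) for i, o in seed_pairs]
backward_data = [backward_example(i, o) for i, o in seed_pairs]

print(forward_data[0].prompt)   # the instruction
print(backward_data[0].prompt)  # the output, now used as the input
```

Both training sets hold the same text; only the direction of prediction differs, which is why one seed set can yield two models.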
In the self-augmentation step, the backward model Myx is run over every document in the unlabeled corpus. Each document is treated as if it were the output, and the model generates a candidate instruction for it. This yields a large pool of synthetic (instruction, output) pairs in which every output is real human text and every instruction is machine-generated. Because the instructions are inferred rather than curated, the pool is large but uneven in quality. [1]
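A rough sketch of self-augmentation follows. The `backward_model` argument is assumed to be any callable that maps a document to an instruction, and `toy_backward_model` is a hypothetical stand-in, not a real language model.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (instruction, output)


def self_augment(
    documents: Iterable[str],
    backward_model: Callable[[str], str],
) -> List[Pair]:
    """Pair each document with an instruction generated for it."""
    pairs = []
    for doc in documents:
        instruction = backward_model(doc)  # machine-generated side
        pairs.append((instruction, doc))   # the document is kept verbatim
    return pairs


def toy_backward_model(doc: str) -> str:
    # Hypothetical stand-in: a real backward model would be a finetuned LLM.
    opening = " ".join(doc.split()[:4])
    return f"Write a passage that begins: {opening}"


docs = [
    "Sourdough starter needs regular feeding with flour and water.",
    "The Pacific is the largest and deepest ocean on Earth.",
]
candidate_pool = self_augment(docs, toy_backward_model)
for instruction, output in candidate_pool:
    print(instruction, "->", output)
```

The output side of every pair is the untouched source document, which is the property that distinguishes this step from distillation.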
The self-curation step filters that pool. The current instruction-following model, starting from M0, is prompted to score each candidate pair for quality on a scale from 1 to 5, using a rubric that rewards pairs whose output is a helpful, complete answer to the instruction. For the final model, only pairs that receive the top score of 5 are kept; the authors also report a looser cut that keeps pairs scoring 4 or higher. The model is then finetuned on the seed data plus the curated subset, producing an improved model. That improved model re-scores the pool in the next round, and the authors run this loop for two iterations. Because the same model both generates and judges the data, the process is self-contained, with no external annotator or teacher in the loop. [1]
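The curation loop can be sketched as follows. The `train` and `make_scorer` callables are assumptions standing in for finetuning and rubric-based scoring, the toy versions below are hypothetical, and retraining on the seed plus curated data each round is a simplification of whatever finetuning schedule the authors used.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]              # (instruction, output)
Scorer = Callable[[str, str], int]  # quality score from 1 to 5


def self_curate(pool: List[Pair], score: Scorer, threshold: int = 5) -> List[Pair]:
    """Keep only pairs whose score meets the threshold (5 is the top score)."""
    return [(ins, out) for ins, out in pool if score(ins, out) >= threshold]


def run_rounds(seed, pool, train, make_scorer, rounds: int = 2):
    """Alternate between scoring the pool and retraining on seed + curated data."""
    model = train(seed)  # forward seed model M0
    for _ in range(rounds):
        curated = self_curate(pool, make_scorer(model))
        model = train(seed + curated)  # improved model judges the next round
    return model


# Hypothetical stand-ins so the sketch runs without a real model.
def toy_train(data: List[Pair]) -> dict:
    return {"n_train": len(data)}


def toy_make_scorer(model: dict) -> Scorer:
    def score(instruction: str, output: str) -> int:
        # Toy rubric: treat outputs of five or more words as top quality.
        return 5 if len(output.split()) >= 5 else 3
    return score


seed = [("Name a primary color.", "Red is a primary color.")]
pool = [
    ("Describe how tides work.", "Tides rise and fall because of the moon's gravity."),
    ("Say something.", "Click here."),
    ("Explain composting.", "Composting turns food scraps into soil over several months."),
]
final_model = run_rounds(seed, pool, toy_train, toy_make_scorer)
print(final_model)
```

In this toy run the short "Click here." pair is filtered out, so the final training set contains the seed pair plus two curated pairs.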
The data funnel for the published run is summarized below. [1]
| Stage | Source | Count |
|---|---|---|
| Seed (instruction, output) pairs | OpenAssistant, first turn, rank 0 | 3,200 |
| Unlabeled documents | ClueWeb (English, preprocessed) | about 502,000 |
| Curated at score 4 or higher, iteration 2 | augmented set A4 | 195,043 |
| Curated at the top score, iteration 2 | augmented set A5, used to train Humpback | 41,821 |
The contrast between roughly 502,000 candidates and the roughly 42,000 retained illustrates the method's central bet: a smaller, aggressively filtered set of high-quality pairs trains a better instruction follower than a larger but noisier one. [1]
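The funnel can be restated as retention rates with a few lines of arithmetic. The counts come from the table above; the candidate count is the approximate figure of 502,000, so the percentages are approximate as well.

```python
# Counts taken from the data funnel table; `candidates` is approximate.
candidates = 502_000
a4_kept = 195_043  # score 4 or higher, iteration 2
a5_kept = 41_821   # top score only, iteration 2
seed_size = 3_200


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


a4_share = share(a4_kept, candidates)
a5_share = share(a5_kept, candidates)
final_training_size = seed_size + a5_kept

print(f"A4 keeps about {a4_share:.1f}% of candidates")  # about 38.9%
print(f"A5 keeps about {a5_share:.1f}% of candidates")  # about 8.3%
print(f"Seed plus A5: {final_training_size:,} pairs")    # 45,021
```

Roughly one candidate in twelve survives the top-score filter, and the curated set is still about thirteen times larger than the seed set.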
The seed data was drawn from the OpenAssistant human-conversation dataset, keeping 3,200 first-turn, top-ranked examples. [1][4] The unlabeled corpus was the English portion of ClueWeb, reduced to about 502,000 segments after deduplication, length filtering, and removal of low-quality text. The base models were LLaMA at 7B, 33B, and 65B parameters. [1][3] The final models, finetuned on the seed plus the top-scored curated set, were named Humpback, a play on the camelid naming of LLaMA and its relatives that signals a jump in scale from camels to whales. [1]
Humpback was evaluated on AlpacaEval, which reports the pairwise win rate of a model's responses against those of the reference system text-davinci-003, with preferences assigned by an automatic LLM-based judge rather than by human raters. Performance improved steadily with scale, and Humpback was the strongest non-distilled LLaMA-based model at the 33B and 65B sizes. [1]
| Model (65B, non-distilled) | AlpacaEval win rate vs text-davinci-003 |
|---|---|
| Humpback 65B | 83.71% |
| Guanaco 65B | 71.80% |
| LIMA 65B | 62.70% |
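A pairwise win rate of this kind can be computed from per-prompt judgments, as in the sketch below. The label names and the choice to count a tie as half a win are assumptions made for illustration, not a description of AlpacaEval's exact scoring rules.

```python
from typing import Iterable

# Assumed scoring: a tie counts as half a win for the evaluated model.
_POINTS = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def win_rate(judgments: Iterable[str]) -> float:
    """Percentage of comparisons won against the reference system."""
    labels = list(judgments)
    if not labels:
        raise ValueError("at least one judgment is required")
    return 100.0 * sum(_POINTS[label] for label in labels) / len(labels)


# Hypothetical judgments for four prompts, from the model's perspective.
example = ["win", "win", "loss", "tie"]
print(win_rate(example))  # 62.5
```

A reported figure such as 83.71% therefore means the judge preferred the model's response over the reference response on roughly five prompts out of six.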
At 33B, Humpback reached 79.84 percent, again leading its non-distilled peers. [1] A later revision of the paper applied the method to a LLaMA 2 70B base and reported an AlpacaEval win rate of roughly 87.9 percent. [1] Some distilled models, such as Vicuna, posted higher raw scores, but they were trained on outputs collected from stronger systems like ChatGPT, the very dependency instruction backtranslation was designed to avoid. The authors also measured data efficiency by fitting how response quality scales with the number of training examples, and found their self-curated data more efficient per example than instruction sets such as those used by WizardLM, Alpaca-GPT4, and LIMA. [1]
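One way to compare data efficiency is to fit a curve of quality against training-set size and compare slopes. The sketch below assumes a log-linear form, win rate = alpha * log10(N) + C, which is an illustrative choice rather than a confirmed description of the paper's fit. The three data points are synthetic, placed exactly on a known line to check the fitting code, and are not results from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(sizes: Sequence[int], scores: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of score = alpha * log10(size) + c."""
    xs = [math.log10(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    alpha = sxy / sxx            # gain per tenfold increase in data
    c = mean_y - alpha * mean_x  # intercept
    return alpha, c


# Synthetic points on the line score = 5 * log10(N) + 40 (not real results).
sizes = [100, 1_000, 10_000]
scores = [50.0, 55.0, 60.0]
alpha, c = fit_log_linear(sizes, scores)
print(f"alpha={alpha:.2f}, c={c:.2f}")  # alpha=5.00, c=40.00
```

Under this assumed form, a dataset with a larger alpha gains more quality from each tenfold increase in examples, which is one way to read the claim that one dataset is more efficient per example than another.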
Instruction backtranslation showed that a base model, a small seed set, and a pile of unlabeled text could bootstrap competitive instruction-following ability without a stronger teacher, placing it within the broader family of synthetic data and self-improvement methods for LLMs. It contrasts with pipelines that synthesize both sides of each example with a language model, such as Self-Instruct, and with distillation recipes such as the original Alpaca, which applies that style of generation using a more capable model as the source. [1][5] It is conceptually adjacent to self-training methods like STaR and ReST, in which a model generates outputs, filters them, and then learns from the retained ones, but instruction backtranslation is distinctive in generating the instruction side while anchoring the response side in real human documents.
The two-part recipe (generate candidates, then have the model curate them) reinforced a wider lesson of the instruction-tuning literature: data quality matters more than sheer quantity. The approach influenced later work, including a 2024 follow-up, "Better Alignment with Instruction Back-and-Forth Translation," which adds a step that rewrites retrieved documents into cleaner responses before training [6], and REInstruct, which builds instruction data from unlabeled corpora along similar lines. [7]
The method has several constraints. Its ceiling is set by the seed and base models: the backward model can write instructions only as well as 3,200 examples allow, and the judge can rate quality only as well as the seed model can discern it, so errors in self-curation propagate into training. Because the model both writes and grades the data, the loop can reinforce its own biases rather than correct them. The responses are limited to what already exists as web documents, so tasks whose ideal answers are not well represented online, such as multi-step reasoning, code, or carefully formatted outputs, are covered poorly, and any factual errors or unsafe content in the source text can be inherited. The published work focuses on single-turn English instructions, leaving multi-turn dialogue, multilingual coverage, and safety alignment to other techniques. Finally, the quality scores come from a model prompted with a fixed rubric, so the filter is only as calibrated as that prompt and the model behind it. [1]