WRAP (Web Rephrase Augmented Pre-training)
WRAP (Web Rephrase Augmented Pre-training) is a synthetic-data pre-training method introduced in the paper "Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling," posted to arXiv on 29 January 2024 by Pratyush Maini (Carnegie Mellon University), Skyler Seto, He Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly (Apple).[^1][^2] The method uses an off-the-shelf instruction-tuned generator, specifically Mistral-7B-Instruct-v0.2, to rephrase raw web documents into one of several controlled styles (Wikipedia-like, terse, simple, or question/answer), then trains a language model jointly on the original web text and the synthetic rephrases at a one-to-one mixing ratio.[^1][^3] On the noisy C4 web corpus, WRAP reports an approximately threefold pre-training speedup at matched downstream quality, plus average perplexity improvements above ten percent across subsets of The Pile and gains above two percent in zero-shot accuracy across thirteen question-answering tasks.[^1][^2][^4] WRAP was accepted to ACL 2024 (long paper) and is widely cited as a foundational reference for the synthetic-rephrasing line of pre-training research that includes the Phi family, Cosmopedia, and Nemotron-CC.[^5][^6][^7]
Background and motivation
Raw web text scraped from sources such as Common Crawl is the dominant data substrate for large-scale pre-training, but it is stylistically heterogeneous, noisy, and often poorly phrased; many documents are boilerplate, SEO content, fragmented HTML, or low-information listings.[^1][^4] Under prevailing scaling laws, a compute-optimal model trained from such data needs an abundance of both compute and tokens, and the supply of unique high-quality tokens on the open web is finite.[^1][^8] Maini and coauthors observe that the gap between the style distribution of pre-training data and the style of downstream reading-comprehension and QA evaluations is a tractable inefficiency: if the same factual content were re-expressed in a cleaner, more pedagogical register, a model could learn the same knowledge with fewer gradient updates.[^1][^4]
The motivation builds on two adjacent lines of work. The first is the Microsoft "Textbooks Are All You Need" program, which produced Phi-1 in June 2023 by training a 1.3 billion parameter code model on roughly six billion tokens of filtered web data plus about one billion tokens of GPT-3.5-generated textbook-style synthetic data, showing that small models on carefully curated synthetic content could match much larger conventionally trained baselines.[^9] The second is research on noise filtering, distillation, and reading-comprehension augmentation. WRAP differs from the Phi recipe in that it does not generate synthetic content from scratch around curated topics; instead, it conditions on real web documents and asks an instruction-tuned model to rewrite them, preserving the underlying factual content while changing register and structure.[^1][^4]
A related practical motivation is data exhaustion. Several 2023 and 2024 studies warned that high-quality public text would be largely exhausted by frontier training runs before 2030; rephrasing offers a way to multiply the effective number of high-utility tokens without licensing new sources.[^8]
Method
WRAP starts from raw web documents in C4 (the cleaned Common Crawl corpus released with T5) and chunks them into segments of at most three hundred tokens; the authors note that asking the generator to rephrase longer spans tended to lose or omit information.[^3][^4] Each segment is then passed to a frozen instruction-tuned generator with a style-conditioned prompt. The default generator in the paper is Mistral-7B-Instruct-v0.2 from Mistral AI, chosen because it offered the best quality-throughput trade-off in the authors' ablations; alternatives explored include Qwen-1.8B-Chat (about three times faster but slightly lower quality) and Vicuna-13B-v1.3.[^3][^4] A T5-base fine-tuned rephraser was also tested and underperformed the larger instruction-tuned models, indicating that the quality of the rephrasing model matters more than its raw parameter count above a certain threshold.[^3][^4]
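The chunking step can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: whitespace splitting stands in for the real tokenizer, and the function name `chunk_document` is hypothetical.

```python
from typing import Iterator, List

MAX_SEGMENT_TOKENS = 300  # segment cap described in the text


def chunk_document(text: str, max_tokens: int = MAX_SEGMENT_TOKENS) -> Iterator[str]:
    """Yield consecutive segments of at most `max_tokens` tokens.

    Assumption: whitespace-separated words approximate tokens; a real
    pipeline would count tokens with the generator's own tokenizer.
    """
    tokens: List[str] = text.split()
    for start in range(0, len(tokens), max_tokens):
        yield " ".join(tokens[start:start + max_tokens])


if __name__ == "__main__":
    doc = " ".join(f"w{i}" for i in range(650))
    print([len(seg.split()) for seg in chunk_document(doc)])  # [300, 300, 50]
```

Each yielded segment would then be sent to the frozen generator on its own, so no single request exceeds the length at which rephrasing starts to drop content.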
Four rephrasing styles
The paper defines four target styles, each operationalized as a short natural-language instruction in the system prompt of the rephrasing model.[^1][^3][^4]
| Style | Prompt gloss | Intended effect |
|---|---|---|
| Easy | Rephrase as text that even a toddler could understand. | Simplified vocabulary, short sentences, accessible register. |
| Medium | Rephrase in high-quality English as on Wikipedia. | Encyclopedic, declarative, factually structured prose. |
| Hard | Rephrase in terse and abstruse language. | Dense, technical register, mimicking academic or formal text. |
| Q/A | Rephrase as a conversational question-and-answer exchange. | Reading-comprehension-aligned format with explicit question framings. |
The Medium and Q/A styles are the primary workhorses; Medium aligns the rewritten text with the style of language modeling benchmarks dominated by clean expository prose, while Q/A aligns with downstream reading-comprehension and zero-shot QA evaluations such as PIQA, ARC, and OpenBookQA.[^1][^3][^4]
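A style-conditioned prompt could be assembled along the following lines. The instruction strings are the glosses from the table above, not necessarily the paper's verbatim prompts, and the template layout and the name `build_rephrase_prompt` are assumptions made for illustration.

```python
# Prompt glosses taken from the style table; the exact wording used in the
# paper may differ.
STYLE_INSTRUCTIONS = {
    "easy": "Rephrase as text that even a toddler could understand.",
    "medium": "Rephrase in high-quality English as on Wikipedia.",
    "hard": "Rephrase in terse and abstruse language.",
    "qa": "Rephrase as a conversational question-and-answer exchange.",
}


def build_rephrase_prompt(segment: str, style: str) -> str:
    """Combine a style instruction with one source segment.

    Assumption: a plain-text template; a real pipeline would apply the
    generator's own chat template instead.
    """
    if style not in STYLE_INSTRUCTIONS:
        raise ValueError(f"unknown style: {style!r}")
    instruction = STYLE_INSTRUCTIONS[style]
    return f"{instruction}\n\nText:\n{segment.strip()}\n\nRephrased text:"


if __name__ == "__main__":
    print(build_rephrase_prompt("cheap flights!!! book now, best deals", "medium"))
```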
Mixing real and synthetic text
WRAP does not replace original web text with synthetic rephrases; instead, the pre-training stream draws from real C4 and from synthetic rephrases in a one-to-one ratio by token count.[^1][^3] The authors find that purely synthetic training degrades performance on domains containing special characters, code-like content, and structured tokens that the rephraser tends to smooth away, while a balanced mix preserves robustness to natural web noise without sacrificing the gains from rewritten content.[^3] Alternative mixing strategies were ablated, including 1:2 ratios and combining multiple synthetic styles together; mixing two styles offered only marginal gains over the simpler one-style-plus-real baseline.[^3]
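One simple way to realize a one-to-one mix by token count is to always draw the next document from whichever source has contributed fewer tokens so far. The sketch below assumes that strategy and whitespace token counts; the paper does not specify its sampler at this level of detail, and `mix_one_to_one` is a hypothetical name.

```python
from typing import Iterable, Iterator, Tuple


def mix_one_to_one(
    real: Iterable[str], synthetic: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Interleave two document streams so their token counts stay balanced.

    Assumptions: whitespace words approximate tokens, and mixing stops as
    soon as the stream that is due next has run out of documents.
    """
    real_it, syn_it = iter(real), iter(synthetic)
    real_tokens = 0
    syn_tokens = 0
    while True:
        take_real = real_tokens <= syn_tokens
        doc = next(real_it if take_real else syn_it, None)
        if doc is None:
            return
        n_tokens = len(doc.split())
        if take_real:
            real_tokens += n_tokens
        else:
            syn_tokens += n_tokens
        yield ("real" if take_real else "synthetic", doc)


if __name__ == "__main__":
    stream = mix_one_to_one(["a b c d"] * 5, ["x y"] * 10)
    for source, doc in stream:
        print(source, doc)
```

Because balancing is done on tokens rather than documents, shorter synthetic segments are simply drawn more often than longer real documents.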
Pre-training configuration
Three decoder-only transformer sizes were trained from scratch in the paper to demonstrate that the gains generalize across scale: a 128 million parameter model (12 layers, 12 heads, hidden dimension 768), a 350 million parameter model (24 layers, 16 heads, hidden 1024), and an XL model at 1.3 billion parameters (24 layers, 16 heads, hidden 2048).[^3] Each was trained for 300,000 steps at a one-million-token batch size, with the Adam optimizer (beta values 0.9 and 0.999), cosine schedule with one percent warmup, peak learning rates between 2e-4 and 3e-4, weight decay 0.01, gradient clip norm 1.0, and a maximum sequence length of 1,024.[^3]
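The schedule described above can be written out as a small function. This is a sketch under stated assumptions: linear warmup, decay to a minimum learning rate of zero, and the upper end of the quoted peak range; none of these details are confirmed by the text.

```python
import math

TOTAL_STEPS = 300_000       # training steps, from the text
BATCH_TOKENS = 1_000_000    # tokens per batch, from the text
WARMUP_FRACTION = 0.01      # one percent warmup, from the text
PEAK_LR = 3e-4              # upper end of the quoted 2e-4 to 3e-4 range
MIN_LR = 0.0                # assumption: the text does not give a floor


def learning_rate(step: int) -> float:
    """Cosine schedule with linear warmup (the warmup shape is an assumption)."""
    warmup_steps = round(TOTAL_STEPS * WARMUP_FRACTION)
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("total training tokens:", TOTAL_STEPS * BATCH_TOKENS)
    for step in (0, 2_999, 150_000, 300_000):
        print(step, learning_rate(step))
```

Multiplying steps by batch size gives 300 billion training tokens per model, which matches the token budget quoted in the cost analysis.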
Per-style ablations were run primarily on the mid-sized 350 million parameter model to keep compute budgets tractable while still producing statistically meaningful comparisons across the four rephrasing styles, the choice of generator, and the mixing ratio.[^3] The 1.3 billion parameter XL configuration was reserved for the headline results in the main tables and for comparisons against external baselines such as TinyLlama and Pythia-1.4B.[^3]
Cost analysis
A key argument of the paper is that synthetic rephrasing is economically rational despite its up-front compute cost because the cost is amortized across multiple training runs and is fully parallelizable.[^1][^3] The paper benchmarks generation on a single A100 with vLLM, reporting roughly three million generated tokens per hour with Mistral-7B; at that rate, rephrasing the 85 billion-token synthetic corpus used in the paper required approximately twenty-five thousand GPU-hours of inference.[^3] For comparison, training the 1.3 billion parameter model on three hundred billion tokens across 64 A100s consumed about six thousand GPU-hours, while training a 13 billion parameter baseline at the same token budget would consume on the order of thirty thousand GPU-hours.[^3] In other words, at thirteen-billion-parameter scale and above, the up-front rephrasing cost is outweighed by the savings from a single matched-quality training run, and the synthetic corpus can be reused indefinitely.
The authors identify two paths to further reduction. First, smaller paraphrasers such as Qwen-1.8B-Chat run roughly three times faster than Mistral-7B at comparable downstream quality.[^3] Second, contemporary inference-optimization techniques (speculative decoding, fused attention kernels, batched continuous generation) offer an additional three to five times throughput on the same hardware.[^3] Combined, these would bring the cost of generating a large WRAP-style corpus to a small fraction of training cost for any frontier model.
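The arithmetic behind these figures can be checked with a few lines. The inputs are the numbers quoted in this section; treating the two speedups as multiplicative is an assumption, and `rephrase_gpu_hours` is a hypothetical helper. A straightforward division gives roughly 28,000 GPU-hours, the same order as the approximately 25,000 quoted.

```python
SYNTHETIC_TOKENS = 85e9        # size of the synthetic corpus
TOKENS_PER_GPU_HOUR = 3e6      # Mistral-7B throughput quoted in the text
TRAIN_HOURS_1_3B = 6_000       # quoted training cost, 1.3B model
TRAIN_HOURS_13B = 30_000       # quoted training cost, 13B baseline


def rephrase_gpu_hours(tokens: float, tokens_per_hour: float, speedup: float = 1.0) -> float:
    """GPU-hours needed to generate `tokens` at a given throughput."""
    return tokens / (tokens_per_hour * speedup)


if __name__ == "__main__":
    base = rephrase_gpu_hours(SYNTHETIC_TOKENS, TOKENS_PER_GPU_HOUR)
    print(f"baseline generation: {base:,.0f} GPU-hours")
    print(f"ratio to 1.3B training: {base / TRAIN_HOURS_1_3B:.1f}x")
    print(f"ratio to 13B training: {base / TRAIN_HOURS_13B:.2f}x")
    # Assumption: a 3x faster paraphraser and a 3x to 5x inference-stack
    # gain multiply, giving a combined 9x to 15x speedup.
    for combined in (9, 15):
        reduced = rephrase_gpu_hours(SYNTHETIC_TOKENS, TOKENS_PER_GPU_HOUR, combined)
        print(f"{combined}x faster generation: {reduced:,.0f} GPU-hours")
```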
Experimental results
Pre-training efficiency on C4
WRAP-trained models reach the same validation perplexity as the C4-only baseline using roughly one third of the data or compute, with the largest reported speedup of about fifteen times at very early checkpoints (where stylistic priors dominate the loss curve).[^1][^4] At matched compute, the 1.3 billion parameter WRAP model improves average perplexity over The Pile subsets by more than ten percent compared with the C4-only baseline, with domain-level gains of up to a threefold reduction on ArXiv and HackerNews subsets where the Wikipedia-style and Q/A rephrasings most closely match downstream register.[^1][^3]
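A speedup of this kind can be read off two training curves as the ratio between the baseline's final step and the first step at which the other run matches the baseline's final perplexity. The sketch below assumes that definition; the curve values are invented for illustration and are not results from the paper.

```python
from typing import List, Optional, Tuple

Curve = List[Tuple[int, float]]  # (training step, validation perplexity)


def steps_to_reach(curve: Curve, target: float) -> Optional[int]:
    """Return the first logged step whose perplexity is at or below `target`."""
    for step, perplexity in curve:
        if perplexity <= target:
            return step
    return None


def speedup(baseline: Curve, candidate: Curve) -> Optional[float]:
    """Baseline steps divided by candidate steps at matched perplexity."""
    final_step, final_perplexity = baseline[-1]
    reached = steps_to_reach(candidate, final_perplexity)
    if not reached:
        return None  # candidate never matched the baseline
    return final_step / reached


if __name__ == "__main__":
    # Invented curves, for illustration only.
    c4_only = [(100_000, 30.0), (200_000, 25.0), (300_000, 22.0)]
    mixed = [(50_000, 28.0), (100_000, 22.0), (150_000, 20.0)]
    print(speedup(c4_only, mixed))  # 3.0
```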
Downstream zero-shot tasks
The paper evaluates zero-shot accuracy across thirteen standard benchmarks and reports an average gain above two percent for WRAP-trained models over the C4-only baseline at matched compute.[^1][^4] The thirteen tasks are partitioned by the authors into general-understanding evaluations (ARC-Easy, BoolQ, WinoGrande, PIQA, HellaSwag, TruthfulQA, OpenBookQA, and LogiQA-2) and specialized-knowledge evaluations (ARC-Challenge, SciQ, PubMedQA, MathQA, and MMLU).[^14] Evaluations were run with the LLM Evaluation Harness at batch size thirty-two, and perplexity numbers were computed over twenty-one Pile subsets (excluding Europarl), using the first ten thousand documents per domain capped at 1,024 tokens.[^14]
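The perplexity protocol (first ten thousand documents per domain, each capped at 1,024 tokens) can be sketched from per-token negative log-likelihoods. How tokens and domains are weighted is an assumption here: tokens are pooled within a domain and domains are averaged with equal weight. Both function names are hypothetical.

```python
import math
from typing import Dict, Sequence

MAX_DOCS = 10_000    # documents per domain, from the text
MAX_TOKENS = 1_024   # token cap per document, from the text


def domain_perplexity(doc_nlls: Sequence[Sequence[float]]) -> float:
    """Perplexity over one domain from per-token negative log-likelihoods.

    Assumption: NLLs are natural-log values and tokens are pooled across
    documents before exponentiating.
    """
    total_nll = 0.0
    total_tokens = 0
    for nlls in doc_nlls[:MAX_DOCS]:
        capped = nlls[:MAX_TOKENS]
        total_nll += sum(capped)
        total_tokens += len(capped)
    return math.exp(total_nll / total_tokens)


def average_perplexity(per_domain: Dict[str, Sequence[Sequence[float]]]) -> float:
    """Unweighted mean of domain perplexities (the weighting is an assumption)."""
    values = [domain_perplexity(docs) for docs in per_domain.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    toy = {
        "domain_a": [[math.log(2.0)] * 5, [math.log(2.0)] * 3],
        "domain_b": [[math.log(4.0)] * 4],
    }
    print(average_perplexity(toy))
```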
Selected per-task numbers from the 1.3 billion parameter setting include PIQA (76.1 percent for a synthetic-only variant versus 74.9 percent on C4), ARC-Challenge (29.9 percent on synthetic plus C4 versus 26.3 percent on C4 only), and OpenBookQA (24.1 percent on synthetic plus C4 versus 22.4 percent on C4 only).[^4] Across the benchmark suite, WRAP models reach roughly 49.4 percent average accuracy on the standardized zero-shot evaluation, compared to 47.4 percent on the C4-only baseline.[^4] In a head-to-head comparison reported in the paper, the WRAP-trained 1.3 billion parameter model outperforms TinyLlama (1.1 billion parameters, trained on roughly one trillion SlimPajama plus StarCoder tokens for three epochs) on several QA benchmarks despite consuming substantially less data and compute.[^3]
Ablations
Several ablations support the design choices. Combining real C4 with synthetic rephrases is necessary; synthetic-only training underperforms on domains with non-prose content because the rephraser strips special tokens and code-like text.[^3] Different downstream domains benefit from different rephrase styles, and an oracle that selects the best style per evaluation domain would yield an additional sixteen percent perplexity improvement, suggesting room for style-routing extensions.[^3] Finally, the authors confirm using semantic-similarity probes that rephrased documents preserve meaning relative to their sources without introducing new factual content drawn from the rephraser's parametric knowledge, addressing concerns about knowledge leakage.[^3]
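The oracle comparison can be illustrated with a small perplexity table indexed by domain and style. The sketch assumes the improvement is measured relative to the best single style averaged over domains, which may differ from the paper's exact definition; the numbers are invented and `oracle_improvement` is a hypothetical name.

```python
from typing import Dict


def oracle_improvement(ppl: Dict[str, Dict[str, float]]) -> float:
    """Relative perplexity gain of per-domain style choice over one fixed style.

    `ppl[domain][style]` holds the perplexity on `domain` of a model trained
    with `style`. Assumption: domains are averaged with equal weight.
    """
    domains = list(ppl)
    styles = list(ppl[domains[0]])
    mean_by_style = {
        style: sum(ppl[d][style] for d in domains) / len(domains) for style in styles
    }
    best_single = min(mean_by_style.values())
    oracle = sum(min(ppl[d].values()) for d in domains) / len(domains)
    return (best_single - oracle) / best_single


if __name__ == "__main__":
    # Invented perplexities, for illustration only.
    table = {
        "arxiv": {"medium": 10.0, "qa": 14.0},
        "hackernews": {"medium": 18.0, "qa": 12.0},
    }
    print(f"{oracle_improvement(table):.1%}")
```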
The choice of rephraser was itself ablated. The team compared T5-base (a fine-tuned encoder-decoder rephraser), Qwen-1.8B-Chat (a smaller instruction-tuned model), Mistral-7B-Instruct-v0.2, and Vicuna-13B-v1.3.[^3] T5-base produced significantly worse downstream perplexity, suggesting that the encoder-decoder structure and limited capacity prevented faithful and stylistically diverse rewriting; among the instruction-tuned decoders, Qwen-1.8B-Chat and Mistral-7B-Instruct-v0.2 were closest to the Pareto frontier of throughput and quality, with the larger Vicuna-13B occasionally producing slightly higher perplexity outcomes despite the additional inference cost.[^3] This non-monotonic relationship between rephraser size and downstream gain has been a key empirical finding for subsequent groups designing rephrasing pipelines.
Research questions and analysis
The paper organizes its ablations around six explicit research questions, providing a checklist that subsequent rephrasing methods reuse.[^14]
- RQ1: Is real data necessary? Synthetic-only training shows significant degradation in perplexity on Pile subdomains that contain special characters or code-like content; pairing synthetic rephrases with the original web text restores robustness.[^3][^14]
- RQ2: How many styles are needed? Combining multiple synthetic styles yields only small additional improvements beyond the best single style; the Q/A style alone performs comparably to two-style mixtures on most downstream tasks.[^14]
- RQ3: How small can the paraphraser be? Qwen-1.8B-Chat produces high-quality rephrases nearly on par with Mistral-7B; however, T5-base (fine-tuned, encoder-decoder) underperforms significantly, indicating a quality floor below which the recipe breaks down.[^3][^14]
- RQ4: How does rephrasing compare to traditional data augmentation? WRAP-style synthesis substantially outperforms standard text augmentation techniques (back-translation, masked span filling, paraphrase by lexical substitution).[^14]
- RQ5: Does style matter per domain? No single style is dominant across all 21 Pile subdomains; an oracle that selects the best style per evaluation domain yields an additional 16 percent perplexity improvement, suggesting that style-routing during training is a fruitful direction.[^3][^14]
- RQ6: Does the rephraser leak knowledge? Semantic-similarity probes confirm rephrases preserve meaning without introducing factual content drawn from the rephraser's parametric memory.[^3][^14]
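The semantic-similarity probe in RQ6 compares each rephrase with its source. The sketch below shows only the shape of such a check: it uses cosine similarity over bag-of-words counts as a crude stand-in, whereas a real probe would use sentence embeddings from a trained encoder. The function name and example strings are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors.

    Assumption: lowercase whitespace tokens replace a learned sentence
    embedding, so paraphrases that change wording will score lower than
    they would under a semantic encoder.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    source = "the river floods the valley every spring"
    rephrase = "every spring the river floods the whole valley"
    unrelated = "stock prices fell sharply on monday"
    print(cosine_similarity(source, rephrase))
    print(cosine_similarity(source, unrelated))
```

A probe of this kind passes when matched source-rephrase pairs score clearly higher than mismatched pairs.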
The conclusion explicitly positions rephrasing as a means of obtaining compute and data efficiency without sacrificing semantic fidelity, and the authors emphasize three practitioner-facing design choices that subsequent work has had to navigate: choice of generator, real-synthetic mixing ratio, and diminishing returns from repeated synthetic generation cycles.[^14]
WRAP belongs to a cluster of synthetic-data pre-training methods that emerged in 2023 and 2024. Each adopts a different stance on how strongly to depart from raw web text.
Phi family ("Textbooks Are All You Need")
The first widely circulated demonstration that small models trained on synthetic textbook-style data could rival much larger conventionally trained models was Suriya Gunasekar and coauthors' Phi-1 paper "Textbooks Are All You Need," posted in June 2023 by Microsoft Research.[^9] Phi-1, a 1.3 billion parameter code model, was trained on roughly six billion tokens of filtered web code and about one billion tokens of GPT-3.5-generated synthetic textbook-style exercises, achieving 50.6 percent pass@1 on HumanEval despite only four days of training on eight A100s.[^9] Ronen Eldan and Yuanzhi Li followed with Phi-1.5 in September 2023, and the Phi family later extended to Phi-4 and downstream variants, all explicitly relying on heavily synthetic curricula.[^9] WRAP differs from Phi in that it conditions on real documents and rewrites them rather than synthesizing free-form content from topic prompts.[^1][^4]
Cosmopedia
In March 2024, Loubna Ben Allal and colleagues at Hugging Face released Cosmopedia, an open synthetic-data corpus of more than thirty million synthetic textbooks, blog posts, stories, and WikiHow articles totaling roughly twenty-five billion tokens, all generated by Mixtral-8x7B-Instruct-v0.1.[^10] Cosmopedia was explicitly motivated by the lack of public detail in the Phi recipes and required more than ten thousand GPU-hours to generate.[^10] Unlike WRAP, Cosmopedia is closer in spirit to the Phi recipe (prompting around curated topics) rather than rewriting source documents, although it does seed many prompts from clustered web data.[^10]
Nemotron-CC
NVIDIA's Nemotron-CC dataset, released in late 2024, takes the rephrasing recipe to trillion-token scale, producing a 6.3 trillion-token English corpus of which roughly 1.9 trillion tokens are synthetically generated.[^7] Nemotron-CC uses multiple targeted prompts for high-quality web documents (knowledge extraction, diverse QA generation, distillation, condensation) and explicitly cites the WRAP recipe of rewriting noisy web text into cleaner formats. A reported eight billion parameter model trained on Nemotron-CC outperforms Llama 3.1 8B on several tasks at matched compute.[^7]
Comparison across recipes
| Method | First public release | Generator | Style of synthesis | Scale | Mixes with real text |
|---|---|---|---|---|---|
| Phi-1 / Phi-1.5 (Microsoft) | 2023-06 / 2023-09 | GPT-3.5 (closed) | Topic-conditioned synthetic textbooks and exercises | About 1B synthetic tokens for Phi-1 | Filtered web code or web text alongside synthetic |
| WRAP (Apple, CMU) | 2024-01 | Mistral-7B-Instruct-v0.2 | Rewriting source web docs into Easy / Medium / Hard / Q-A | About 85B synthetic tokens | 1:1 real C4 plus synthetic rephrases |
| Cosmopedia (Hugging Face) | 2024-03 | Mixtral-8x7B-Instruct-v0.1 | Topic-conditioned synthetic textbooks, blogs, stories, WikiHows | About 25B synthetic tokens | Synthetic-dominant for cosmo-1b; can be mixed |
| Nemotron-CC (NVIDIA) | 2024-12 | Multiple Nemotron generators | Five-prompt rephrasing plus distillation, condensation, diverse QA on Common Crawl | About 1.9T synthetic tokens within a 6.3T-token corpus | Original Common Crawl plus synthetic rewrites |
Sources: Phi-1 "Textbooks Are All You Need" paper.[^9] WRAP paper and Apple ML Research page.[^1][^2][^3] Cosmopedia blog.[^10] Nemotron-CC research page and developer blog.[^7][^8]
Other follow-up work
The synthetic-rephrasing recipe has since become standard in production pre-training. Subsequent papers and product releases described as building on or generalizing the WRAP idea include Phi-4 (Microsoft), which reportedly trained on around forty percent synthetic data; the FineWeb-style curation efforts at Hugging Face that combine filtering with rephrase-style augmentation; and other rewriting-based corpora that target additional formats such as tutorials, FAQs, and mathematical reformulations.[^6][^10] Several open-source replications and extensions appeared on arXiv across 2024 and 2025, including studies of multilingual rephrasing and of selectively rephrasing only low-quality segments.[^11]
The DCLM (DataComp for Language Models) benchmark, released in 2024 as a standardized testbed for pre-training data interventions, has been used to compare rephrasing-based augmentation against more conservative filtering and deduplication baselines.[^15] Rephrasing techniques have also been ported to coding corpora and to specialized scientific domains, demonstrating that the underlying recipe (use a moderately sized instruction-tuned generator to rewrite source documents into a register matching downstream evaluation style) is broadly portable.
Why rephrasing helps
The paper offers two complementary explanations for why mixing rephrased text with raw web data systematically lowers downstream perplexity and improves zero-shot QA accuracy.[^1][^3][^4]
The first is style alignment. Standard evaluations such as PIQA, ARC, OpenBookQA, BoolQ, HellaSwag, and MMLU are predominantly written in encyclopedic or reading-comprehension register. Raw web text is stylistically heterogeneous and includes large amounts of forum discussion, classified listings, machine-translated content, and SEO boilerplate. By rewriting source documents into a Wikipedia-like register or into explicit Q/A format, WRAP narrows the distributional gap between the pre-training distribution and the evaluation distribution. Empirically, the Medium and Q/A styles produce the largest gains on the subset of evaluations whose own text most closely resembles those registers.[^3][^4]
The second is signal-to-noise improvement. Although the rephrasing model is not a perfect denoiser, an instruction-tuned model trained on supervised dialogue data tends to produce well-formed sentences, fewer typos, fewer fragments, and more consistent grammar than the raw source. This effectively re-weights the loss toward higher-quality tokens. Combined with the original text (which preserves linguistic noise the model must still learn to handle), the resulting mixed corpus contains roughly half "cleaned" tokens at no additional labeling cost.[^1][^3]
The paper is careful to note that these two effects are not fully separable in its experiments, since changing the rephrasing prompt simultaneously changes both style and quality. Subsequent work has attempted to isolate the two by varying one while holding the other constant, but the joint contribution remains the simplest explanation of the empirical gains.[^3][^4]
Authorship and institutional context
The paper's first author, Pratyush Maini, was a PhD student in Machine Learning at Carnegie Mellon University advised by Zico Kolter and Zachary Lipton at the time of submission, and conducted the work as an Apple research intern.[^12] The remaining authors, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly, were all affiliated with Apple's machine-learning research group.[^1][^12] Apple has since published the work in its Apple Machine Learning Research portal and lists it among its publicly disclosed contributions to efficient pre-training for foundation models.[^2] Maini subsequently co-founded DatologyAI, a data-centric company focused on training-set curation, and continues to develop WRAP-style techniques in that setting.[^12]
The paper was accepted at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) in Bangkok, Thailand, appearing in the Long Papers volume on pages 14044 to 14072.[^5] It is listed under DOI 10.18653/v1/2024.acl-long.757 in the ACL Anthology.[^5]
Limitations and criticisms
The authors and subsequent commentators have catalogued several limitations of WRAP-style rephrasing.[^1][^3][^4]
- Cost of the rephraser. Although amortized across training runs, generating tens of billions of rephrased tokens with a 7B-class instruction model still requires tens of thousands of GPU-hours and assumes access to a sufficiently capable open-source generator.[^3]
- Diversity collapse. Instruction-tuned models tend to produce stylistically uniform outputs and may degrade lexical and topical diversity in the training stream; this has been linked in other work to model collapse when synthetic data dominates the corpus across recursive training generations.[^4][^13]
- Knowledge bias. Although the WRAP authors verify that rephrases do not measurably leak parametric knowledge from the generator into the student, there is no guarantee that subtler factual drift or stylistic bias does not enter at scale.[^3]
- Domain mismatch. Synthetic-only training degrades performance on web domains rich in code, special characters, and structural markers; the 1:1 real-synthetic mix is empirically necessary.[^3]
- Minimum rephraser quality. The paper does not settle the question of how small a rephraser can be while preserving downstream gains; small fine-tuned encoder-decoder models such as T5-base were found insufficient.[^3]
A broader concern raised in coverage of the paper is that recursive use of LLM-generated content for training may reduce content diversity across the open web ecosystem if generators converge toward similar outputs.[^4]
Open questions
Open questions raised by the work and by subsequent commentary include: the optimal mixing ratio at scale beyond the explored 1:1 default; whether rephrasing should be applied uniformly across all source documents or selectively to those flagged as low-quality by a separate classifier; how the rephraser should be chosen when the rephrasing model itself becomes part of a recursive training loop; and how to handle content that cannot be safely rephrased (for instance, code, structured data, or content where exact wording matters legally or factually).[^3][^4][^7] The authors explicitly note that the gain attribution between style alignment and quality improvement is not fully separable and that future work could try to isolate the two effects, for instance by holding style constant while varying rephraser quality and vice versa.[^3]
Significance
WRAP is widely cited as the moment at which rewriting-based synthetic data became a credible alternative to either purely scaled-up real-text pre-training or fully synthetic Phi-style curricula.[^4][^7] It articulated three claims that have proven durable: that the bottleneck in web data is style and signal-to-noise rather than raw token count, that an instruction-tuned generator of moderate size is sufficient to rewrite at scale, and that a one-to-one mix of real and synthetic content is the most robust composition.[^1][^3] Subsequent open-source efforts including Cosmopedia, FineWeb's educational sub-corpora, and Nemotron-CC all bear methodological traces of WRAP, and commercial models including the Phi-4 series and several Qwen and Llama derivatives use related techniques.[^6][^7][^10]
Practical implications for compute budgets
The cost analysis in the paper has been particularly influential for organizations choosing between scaling raw web data and investing in synthetic data pipelines. For an open lab with access to a single H100 pod or A100 cluster, the marginal cost of rephrasing twenty to one hundred billion tokens is comparable to a single training run at the 1 to 3 billion parameter scale, but the resulting corpus can be reused across many architectures and many training-recipe iterations. As model sizes grow, the ratio between rephrasing cost and training cost shifts further in favor of rephrasing; at frontier scale, the synthetic-corpus cost is small relative to the savings from a single matched-quality training run.[^1][^3][^7]
Reception
Press coverage at the time of release framed WRAP as a credible response to the impending exhaustion of high-quality public web text and to the rising cost of frontier pre-training; commentators highlighted that the method shifted the data-quality conversation from filtering and deduplication toward active rewriting, and that the use of an off-the-shelf Mistral 7B-class generator made the recipe immediately reproducible by smaller labs.[^4][^6][^11] Within academic discourse, the paper has been cited in surveys of synthetic-data pre-training and in technical reports for several open and proprietary models.[^7][^10][^13]
See also
References