Synthetic data is data that has been artificially generated rather than collected from real-world events. In the context of artificial intelligence and machine learning, synthetic data is produced by algorithms, simulations, or generative models to serve as a substitute for or supplement to real data when training, validating, or testing AI systems. Use of synthetic data has grown dramatically since 2022, driven by the scaling demands of large language models, the increasing scarcity of high-quality human-generated training data, mounting privacy regulations, and the cost of human annotation. By late 2024, synthetic data sat at the center of frontier model training, with Microsoft's Phi-4 trained on roughly 400 billion synthetic tokens, Meta using its 405B Llama 3 model to generate training data for its smaller siblings, and DeepSeek using rejection-sampled reasoning traces to teach its V3 and R1 models.
At the same time, research has revealed serious risks. The most widely cited is model collapse, the phenomenon documented by Shumailov and colleagues in a 2024 Nature paper, where models trained recursively on synthetic data progressively lose the ability to represent the tails of the original distribution. Bias amplification, homogenization of style, and the steadily growing share of AI-generated content on the open web have all complicated the picture. The current consensus is that synthetic data is most effective when it augments rather than replaces human-generated data, and when the generation pipeline includes aggressive filtering, diversity controls, and verification.
Synthetic data is any data not collected from direct measurement of the real world. The category is broader than it sounds. A weather simulator producing fake satellite images, a language model writing instruction-response pairs, a game engine rendering pedestrians for self-driving training, and a statistical model sampling new rows from a fitted distribution all produce synthetic data. The data may be used to train models, to validate or test systems, to augment scarce real datasets, or to stand in for sensitive records.
The key property is fidelity: how closely the synthetic data matches the structure and statistical properties of real data for a given downstream task. A synthetic dataset that fools a discriminator may still fail to train a useful classifier if the relationships it captures are superficial. Quality is always task-dependent.
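One standard way to make "task-dependent quality" concrete is the train-on-synthetic, test-on-real (TSTR) protocol: fit a model on the synthetic data and measure its performance on held-out real data. The sketch below illustrates the protocol with a deliberately trivial nearest-centroid classifier on one-dimensional features; the classifier and data layout are illustrative, not from any particular library.

```python
import statistics

def tstr_accuracy(synthetic, real):
    """Train-on-Synthetic, Test-on-Real: fit a nearest-centroid classifier on
    synthetic (feature, label) pairs and report its accuracy on real pairs."""
    by_label = {}
    for x, y in synthetic:
        by_label.setdefault(y, []).append(x)
    centroids = {y: statistics.fmean(xs) for y, xs in by_label.items()}

    def predict(x):
        # Assign the label whose centroid is closest to the feature value.
        return min(centroids, key=lambda y: abs(centroids[y] - x))

    return sum(predict(x) == y for x, y in real) / len(real)
```

A synthetic dataset is only as useful as the TSTR score of models trained on it, measured relative to the same model trained on the real data directly.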
Synthetic data takes many forms. The categories below are not mutually exclusive; modern pipelines often combine several modalities.
Synthetic tabular data consists of structured rows and columns that mimic the statistical properties of a real dataset while containing no actual records. This is the oldest form of synthetic data and remains widely used in healthcare, finance, telecommunications, and software testing. The Synthetic Data Vault (SDV), released by researchers at MIT in 2016, provides open-source frameworks for generating synthetic tabular data using Gaussian copulas, variational autoencoders (VAEs), and conditional tabular GANs (CTGAN). Commercial platforms from MOSTLY AI, Tonic.ai, Gretel.ai, and Hazy generate synthetic versions of customer databases, transaction logs, and electronic health records.
Synthetic text includes instruction-response pairs, dialogues, code, articles, reasoning chains, and entire books. Since the release of ChatGPT in late 2022, LLM-generated text has become the dominant form of synthetic data in AI training. Applications range from generating instruction tuning datasets like Alpaca and UltraChat to creating synthetic textbooks for pretraining as in the Phi series and Cosmopedia.
Synthetic images are generated using generative adversarial networks (GANs), diffusion models, or rendering engines like Unity, Unreal Engine, and Blender. Common applications include training object detection and computer vision models when real labeled images are scarce, expensive, or privacy-sensitive. Synthetic face datasets have been used to train facial recognition systems without using photographs of real people, and rendered images of warehouse environments are widely used to train pick-and-place robots.
Synthetic video extends image synthesis into the temporal domain, generating sequences of frames for autonomous driving simulation, action recognition training, robotics, and surveillance research. Simulation platforms like CARLA and NVIDIA Isaac Sim produce photorealistic synthetic video for training reinforcement learning agents and perception systems. Generative video models like Sora, Veo, and Runway can produce short clips that have been proposed as data sources for downstream training, though their use as training data for other generative models is contested.
Synthetic audio includes text-to-speech outputs, voice cloning, and music. It is used to train speech recognition and audio classification systems, particularly for low-resource languages where recorded speech is limited. TTS-generated training data has become standard for fine-tuning ASR systems on rare accents, code-switched speech, and domain-specific vocabularies.
Synthetic 3D scenes, point clouds, LiDAR returns, radar signatures, and IMU traces are generated using physics simulators and game engines. The CARLA simulator, introduced by Dosovitskiy and colleagues in 2017, became a standard benchmark for autonomous driving research, providing labeled sensor data for camera, depth, and semantic segmentation streams. NVIDIA's Omniverse and Isaac Sim extend this approach to industrial robotics, generating photorealistic, physically-accurate synthetic data with full ground-truth labels.
Generation methods can be grouped into five broad approaches. Each makes different trade-offs between fidelity, controllability, cost, and the kinds of structure it can capture.
| Method | How it works | Best for | Limitations |
|---|---|---|---|
| Rule-based | Predefined templates, grammars, and heuristics generate data | Software testing, fuzz testing, simulation, regulatory compliance | Cannot capture complex real-world distributions |
| Statistical | Fits a model (Gaussian copula, Bayesian network, KDE) to the empirical distribution and samples new points | Tabular data with well-defined distributions | Struggles with high dimensionality, mixed types, and complex dependencies |
| Generative neural networks | GAN, VAE, or diffusion model learns a generative model of the data | Images, audio, tabular data, time series | Mode collapse, training instability, difficulty with discrete outputs like text |
| Simulation | A physics or game engine renders synthetic environments and produces labeled sensor data | Robotics, autonomous driving, embodied AI, physical simulation | Sim-to-real gap; expensive engineering; may miss real-world quirks |
| LLM generation | A pretrained large language model is prompted to produce text, code, or structured data | Instruction tuning, distillation, code, reasoning chains, synthetic textbooks | Inherits the source model's biases and knowledge limits; risk of homogenization |
Rule-based methods are the simplest and oldest approach. A human defines rules: "generate a customer record with a random name from this list, an age between 18 and 90 drawn from a normal distribution, and a purchase amount correlated with age." These methods are transparent, reproducible, and fast, and they remain widely used in software testing, simulation, and regulatory compliance. Fuzz testing, which feeds programs randomly generated or template-derived inputs to find security bugs, is a long-standing rule-based application.
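The customer-record rule in the quoted example translates directly into code. A minimal sketch, where the field names, the name list, and the specific correlation rule are illustrative choices:

```python
import random

NAMES = ["Ada", "Grace", "Alan", "Edsger", "Barbara"]

def make_customer(rng=random):
    """Rule-based record: name drawn from a fixed list, age ~ N(45, 15) clipped
    to [18, 90], purchase amount positively correlated with age by construction."""
    age = max(18, min(90, round(rng.gauss(45, 15))))
    purchase = round(2.0 * age + rng.uniform(-20, 20), 2)  # tied to age
    return {"name": rng.choice(NAMES), "age": age, "purchase": purchase}
```

Because the rules are explicit, the generated distribution is transparent and reproducible, which is exactly what makes rule-based methods attractive for testing and compliance, and also why they cannot capture dependencies nobody thought to encode.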
Statistical approaches fit a model to the empirical distribution of real data and then draw new samples from that fitted distribution. Techniques include Gaussian copulas (which model the dependency structure between variables separately from their marginal distributions), Bayesian networks, and kernel density estimation. The SDV library implements several of these methods. Statistical approaches work well for moderately complex tabular data but struggle when the data has high dimensionality, mixed types, or intricate conditional dependencies.
Generative adversarial networks, introduced by Ian Goodfellow and colleagues in 2014, revolutionized synthetic data generation for images and have been adapted for tabular and time-series data. The GAN framework pits two neural networks against each other: a generator that creates synthetic samples and a discriminator that tries to tell synthetic from real. Through this adversarial process, the generator learns to produce increasingly realistic data.
Key GAN variants for synthetic data include CTGAN for mixed-type tabular data, TimeGAN for time series, StyleGAN for high-resolution images, and differentially private variants such as DP-GAN and PATE-GAN.
GANs have well-known challenges including mode collapse (where the generator produces only a narrow subset of possible outputs), training instability, and difficulty generating discrete data like text. Since 2022 they have been largely superseded for image generation by diffusion models, which are more stable to train and tend to produce more diverse outputs.
Simulation uses physics engines, game engines, or domain-specific simulators to render synthetic environments and produce labeled sensor data. The CARLA driving simulator, NVIDIA Isaac Sim and Omniverse Replicator, Unity Perception, and Microsoft's AirSim are widely used examples. The major advantage is exact ground truth: every pixel comes with perfect depth, semantic, and instance labels. The major drawback is the sim-to-real gap, the systematic differences between simulated and real-world distributions of light, texture, motion, and sensor noise. Domain randomization, where each scene varies textures, lighting, and physics parameters within wide ranges, is a standard technique for closing this gap.
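Domain randomization amounts to sampling every nuisance parameter of the scene from wide ranges before each render, so that the real world looks like just one more variation. A schematic sketch; the parameter names and ranges are illustrative, not any engine's API:

```python
import random

def randomize_scene(rng=random):
    """Domain randomization: draw scene parameters from deliberately wide
    ranges so a model trained on the renders learns features that transfer
    to real-world conditions."""
    return {
        "light_intensity": rng.uniform(0.2, 3.0),
        "light_azimuth_deg": rng.uniform(0.0, 360.0),
        "texture_id": rng.randrange(500),      # random surface texture swap
        "camera_height_m": rng.uniform(0.5, 2.5),
        "friction": rng.uniform(0.3, 1.2),     # physics-level randomization
        "motion_blur": rng.random() < 0.5,
    }
```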
Cost economics have driven adoption. NVIDIA reports that manually annotating an image typically costs around six dollars, while generating a labeled synthetic image in Omniverse Replicator costs about six cents, a roughly 100x reduction. For applications like warehouse robots and surface defect detection, these economics have made synthetic-first pipelines standard.
Since 2023, the most impactful method for generating synthetic training data has been prompting large language models. This approach leverages the broad knowledge and generative capabilities of models like GPT-4, Claude, and Llama to produce training examples at scale. The Self-Instruct method, introduced by Yizhong Wang and colleagues in 2022, pioneered this approach by using a language model to generate its own instruction-following training data through an iterative bootstrapping process.
LLM-generated synthetic data can take many forms: question-answer pairs, multi-turn conversations, code solutions, reasoning chains, textbook passages, and structured outputs. Quality and diversity depend heavily on the prompting strategy, the source model's capabilities, and the filtering and curation pipeline applied after generation.
LLM-generated synthetic data became a defining feature of post-2022 model development. A handful of papers and projects established the patterns that the rest of the field built on.
Self-Instruct (Wang et al., 2022) was the first widely-cited recipe for generating instruction-following data with a language model. The pipeline starts from a small seed set of human-written instructions (175 in the original paper), prompts a base model to generate new instructions in the same style, classifies each as a classification or generation task, generates inputs and outputs, and filters duplicates. Applied to GPT-3, the method produced 52,000 instructions and yielded a 33-point absolute improvement on Super-NaturalInstructions, roughly matching InstructGPT-001 (which had been trained on much more expensive human annotations). The paper, presented at ACL 2023, set the template for nearly every later instruction dataset.
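The bootstrapping loop and its similarity filter can be sketched in standard-library Python. The paper filters new instructions by ROUGE-L overlap against the existing pool; `SequenceMatcher` serves as a stdlib stand-in here, and `propose` stands in for prompting the base model with in-context examples drawn from the pool:

```python
from difflib import SequenceMatcher

def is_novel(candidate, pool, threshold=0.7):
    """Reject a candidate too similar to anything already in the pool.
    (Self-Instruct uses ROUGE-L; SequenceMatcher is a stdlib stand-in.)"""
    return all(SequenceMatcher(None, candidate, s).ratio() < threshold for s in pool)

def bootstrap(seed_pool, propose, rounds=3, per_round=4):
    """Grow an instruction pool by repeatedly asking `propose` (a stand-in
    for prompting the base model) for new candidate instructions, keeping
    only those that pass the novelty filter."""
    pool = list(seed_pool)
    for _ in range(rounds):
        for cand in propose(pool, per_round):
            if is_novel(cand, pool):
                pool.append(cand)
    return pool
```

The filter is what keeps the pool from collapsing into paraphrases of the seed set; the real pipeline adds task classification and input-output generation steps around this loop.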
Stanford's Alpaca, released in March 2023 by Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, and colleagues, applied the Self-Instruct recipe to OpenAI's text-davinci-003. They generated 52,000 instruction-output pairs at a total API cost of under $500, then fine-tuned a LLaMA 7B base model on the result. The released model qualitatively matched text-davinci-003 on simple instruction-following tasks. Alpaca is widely credited with starting the open-source instruction-tuning boom, though its dependence on a closed teacher model raised licensing questions.
Vicuna, released in March 2023 by Wei-Lin Chiang and colleagues at LMSYS (UC Berkeley, CMU, Stanford, UC San Diego, and MBZUAI), took a different approach. Rather than generating fresh instruction data, the team scraped roughly 70,000 user-shared conversations from ShareGPT.com (later expanded to 125,000 in v1.3). LLaMA was then fine-tuned on this multi-turn dialogue corpus. Vicuna-13B was rated by GPT-4 as reaching about 90% of ChatGPT's quality on a small evaluation set, and it became one of the most-downloaded open chat models of 2023. The data quality was uneven, since ShareGPT included low-quality and inappropriate content, but the project showed that scraped LLM outputs could substantially improve open base models.
WizardLM, introduced by Can Xu and colleagues at Microsoft and Peking University in 2023, addressed a weakness of Self-Instruct and Alpaca: the generated instructions tended to be simple. Their Evol-Instruct method takes an existing instruction and rewrites it into a more complex version using a fixed set of "evolution" prompts that add constraints, deepen the question, increase reasoning requirements, or broaden scope. Starting from Alpaca's 52K examples and using GPT-3.5-Turbo, the team produced a corpus of progressively harder instructions. Human evaluation showed Evol-Instruct outputs were preferred over the original Alpaca examples on a complexity-balanced test set. The same idea was extended to code (WizardCoder, 2023, presented at ICLR 2024) and math (WizardMath).
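The evolution step is a small loop over rewriting prompts. In this sketch the templates paraphrase the spirit of Evol-Instruct's operations rather than quoting the paper's exact prompts, and `call_llm` is a hypothetical stand-in for querying a model such as GPT-3.5-Turbo:

```python
import random

# Paraphrased evolution operations in the spirit of Evol-Instruct;
# the exact prompt wordings in the paper differ.
EVOLUTIONS = {
    "add_constraint": "Rewrite the instruction below, adding one new constraint or requirement:\n{instr}",
    "deepen": "Rewrite the instruction below so it asks about the topic in greater depth:\n{instr}",
    "increase_reasoning": "Rewrite the instruction below so answering it requires multi-step reasoning:\n{instr}",
    "broaden": "Write a new instruction on a related but rarer topic than the one below:\n{instr}",
}

def evolve(instruction, call_llm, rounds=3, rng=random):
    """Apply a random chain of evolution prompts, each producing a harder
    instruction than the last. `call_llm` stands in for the rewriting model."""
    for _ in range(rounds):
        op = rng.choice(list(EVOLUTIONS))
        instruction = call_llm(EVOLUTIONS[op].format(instr=instruction))
    return instruction
```

Chaining several evolutions is what produces the "progressively harder" corpus; the full method also includes an elimination step that discards failed evolutions.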
The Orca series, from Microsoft Research, focused on transferring not just the answers but the reasoning style of a stronger teacher. Orca 1 (Mukherjee et al., 2023) collected roughly 5 million examples in which GPT-4 was prompted to provide step-by-step reasoning, then fine-tuned a 13B base model on the resulting traces. Orca 2 (November 2023) added "explanation tuning": teaching the student to choose among different solution strategies (step-by-step processing, recall-then-generate, extract-generate, direct answer) depending on the task. Orca 2 was trained on a mix of FLAN-v2, 5 million ChatGPT examples from Orca 1, and 1 million GPT-4 examples. The 13B Orca 2 outperformed the 13B Llama 2 baseline by 47.5% on reasoning benchmarks, and the 7B model was reported as competitive with Llama 2 70B on reasoning tasks.
Microsoft's Phi family is the most prominent demonstration that small models trained on synthetic data can punch well above their parameter count. The thesis was set out in the 2023 paper by Suriya Gunasekar and colleagues, "Textbooks Are All You Need." Phi-1 is a 1.3B-parameter Transformer trained for four days on eight A100s, on a mix of 6B tokens of "textbook quality" code from the web (filtered using a GPT-4 classifier) and 1B tokens of synthetic Python textbooks and exercises generated by GPT-3.5. Despite its small size, Phi-1 reached 50.6% pass@1 on HumanEval and 55.5% on MBPP, beating models 10x its size at the time.
Phi-1.5 extended the recipe to general reasoning. Phi-2 (2.7B) outperformed models up to 25x its size on certain reasoning benchmarks. Phi-3 (released April 2024) introduced a more sophisticated synthetic data pipeline. Phi-4 (December 2024, technical report by Marah Abdin and colleagues) is a 14B model trained on roughly 400 billion synthetic tokens generated across 50 distinct synthetic dataset types, each produced through different seed sets and multi-stage prompting. Crucially, Phi-4 surpassed its teacher model (GPT-4) on STEM-focused QA, the first time a Phi model meaningfully outperformed the model used to generate its training data. A separate Phi-4-reasoning report followed in 2025.
Meta's Llama 3.1 release in July 2024 marked the broader arrival of synthetic data in production-scale LLM development. The 405B model was trained on more than 15 trillion tokens, but its more important role was as a teacher for the 70B and 8B variants. Meta updated the Llama license specifically to allow developers to use Llama outputs to train other models. The Llama 3 paper describes an iterative post-training procedure where each round used supervised fine-tuning and direct preference optimization on synthetic data generated by the previous round's model. AWS, NVIDIA, and Hugging Face have all published tutorials and pipelines on using Llama 3.1 405B to generate task-specific synthetic data for fine-tuning smaller models, making distillation from a strong open teacher a routine workflow.
DeepSeek-V3 and DeepSeek-R1, released in late 2024 and early 2025, made aggressive use of synthetic reasoning data. R1's training pipeline includes a stage where the model generates its own labeled reasoning data through rejection sampling: many candidate solutions are generated for each prompt, V3 is used as a judge, and the best examples are kept for supervised fine-tuning. DeepSeek then used R1 to generate roughly 800,000 high-quality reasoning samples and distilled six smaller open models (variants of Llama 3.1, Llama 3.3, and Qwen 2.5) on this synthetic corpus. The R1 paper (arXiv:2501.12948, later also published in Nature in 2025) showed that pure reinforcement learning with verifiable rewards on self-generated reasoning traces can produce strong reasoning capabilities without any human-annotated reasoning data, an unusual demonstration of bootstrapping.
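The rejection-sampling stage follows a simple best-of-k pattern. In the sketch below, `generate` stands in for sampling the model and `verify` for the judge or verifier (in R1's pipeline, V3 played the judge role); neither mirrors DeepSeek's actual code.

```python
def rejection_sample_corpus(prompts, generate, verify, k=8):
    """For each prompt, draw up to k candidate solutions and keep the first
    one the verifier accepts; prompts with no accepted candidate are dropped."""
    corpus = []
    for p in prompts:
        # The generator expression is lazy: sampling stops at the first accept.
        kept = next((c for c in (generate(p) for _ in range(k)) if verify(p, c)), None)
        if kept is not None:
            corpus.append({"prompt": p, "response": kept})
    return corpus
```

When the verifier is mechanical, as in math with checkable answers or code with unit tests, this loop turns raw model sampling into a self-cleaning data pipeline.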
Anthropic's Constitutional AI (Bai et al., December 2022, arXiv:2212.08073) is the earliest documented large-scale use of synthetic data for RLHF-style training. Rather than asking humans to label which of two model responses was safer, Anthropic used a language model itself to perform the labeling, guided by a written "constitution" of principles like "Choose the response that is least harmful." The pipeline produces synthetic critiques and revisions during a supervised stage, and synthetic preference labels during an RL stage (a procedure Anthropic called RLAIF, reinforcement learning from AI feedback). Constitutional AI is now part of Claude's training and has been replicated externally. As Nathan Lambert notes in the RLHF Book, synthetic preference data tends to be lower-noise but higher-bias than human preference data, since AI labelers apply rules consistently but encode the labeler model's blind spots.
Hugging Face's Cosmopedia, released in March 2024, generated over 30 million synthetic textbooks, blog posts, stories, and WikiHow articles using Mixtral-8x7B-Instruct-v0.1, totaling 25 billion tokens. The dataset was built using the llm-swarm library on H100 GPUs with TGI, taking over 10,000 GPU hours. Web data accounted for more than 80% of Cosmopedia's prompts: the team clustered RefinedWeb-style samples into 145 topic groups and asked Mixtral to identify the topic and then write educational content covering it. Cosmopedia was at the time the largest open synthetic dataset for pretraining and provided the first widely-available reproduction of the Phi-style synthetic pretraining recipe.
Synthetic data plays several distinct roles in modern AI training pipelines.
Knowledge distillation uses a large, capable teacher model to generate training data for a smaller student model. The student learns not just the correct answers but also the teacher's reasoning patterns, response style, and factual knowledge. This approach has been used extensively in the open-source LLM ecosystem: Stanford's Alpaca distilled instruction-following from text-davinci-003 into LLaMA 7B, the Orca project distilled GPT-4's reasoning traces into smaller models, and DeepSeek distilled R1 into six smaller open base models. The common pattern in 2024 to 2026 has been a strong model teaching smaller open models, for example Llama 3.1 405B generating post-training data for the 70B and 8B variants.
Synthetic data can augment real datasets by filling gaps, balancing class distributions, or increasing diversity. In natural language processing this includes paraphrasing existing examples, translating into other languages, or generating additional examples for underrepresented categories. In computer vision, augmentation includes rendering rare objects, unusual lighting conditions, and edge-case scenarios. The boundary between augmentation and full synthetic generation is blurry: techniques like RandAugment (Cubuk et al., 2020) treat augmentation as a pipeline of random perturbations, while more recent diffusion-based augmentation produces what amounts to fully synthetic but distribution-anchored examples.
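In code, augmentation-as-pipeline follows RandAugment's recipe: maintain a pool of simple transforms and apply a random subset to each example. The word-level transforms below are toy illustrations; real NLP pipelines use paraphrasing, back-translation, or synonym substitution instead.

```python
import random

def random_augment(words, n_ops=2, rng=random):
    """RandAugment-style augmentation: apply n_ops transforms chosen at
    random from a fixed pool to a tokenized example."""
    def drop_word(w):          # word dropout
        if len(w) <= 1:
            return w
        j = rng.randrange(len(w))
        return [x for i, x in enumerate(w) if i != j]

    def swap_adjacent(w):      # local word-order perturbation
        if len(w) > 1:
            i = rng.randrange(len(w) - 1)
            w = w[:i] + [w[i + 1], w[i]] + w[i + 2:]
        return w

    def title_case(w):         # surface-form variation
        return [x.title() for x in w]

    ops = [drop_word, swap_adjacent, title_case]
    for op in rng.sample(ops, k=n_ops):
        words = op(words)
    return words
```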
In self-play, a model generates data by interacting with itself or with copies of itself, and this data is then used for further training. AlphaGo and AlphaZero famously used self-play to achieve superhuman performance at board games. In the LLM domain, DeepSeek-R1 demonstrated that pure reinforcement learning with self-generated reasoning traces and verifiable rewards can produce emergent reasoning capabilities without any human-annotated reasoning data.
SPIN (Self-Play Fine-Tuning), proposed by Zixiang Chen and colleagues in early 2024 (arXiv:2401.01335), frames fine-tuning as a self-play game: the model learns to distinguish its own generated responses from human-written ground truth, then updates itself so that its next iteration's outputs are harder to tell apart from the human data, iteratively improving alignment without additional human annotation.
The most ambitious use of synthetic data is in pretraining. The Phi series demonstrated that small models pretrained primarily on synthetic textbook-quality data could outperform much larger models trained on web scrapes. Phi-1, trained on synthetic textbooks and exercises for coding, achieved strong code generation results despite its small size. Phi-4 used roughly 400 billion tokens of synthetic data across 50 distinct synthetic dataset types. Hugging Face's Cosmopedia took the same approach in the open, generating a 25-billion-token synthetic pretraining corpus with Mixtral.
Synthetic data is now standard in post-training. Llama 3's iterative post-training procedure used synthetic data at every round of supervised fine-tuning and direct preference optimization. Constitutional AI generates synthetic preference data for Claude. Most modern instruction-tuned and reasoning models use a mix of human and synthetic preference data; many use synthetic data exclusively for some capabilities like math reasoning and code, where verifiable rewards make quality control tractable.
In robotics, synthetic data is used for sim-to-real transfer: train a policy in simulation, then deploy it on physical hardware. Domain randomization and domain adaptation reduce the gap. As of 2024 to 2026 the trend has shifted toward foundation-model-based bridging: latent diffusion models conditioned on text or image prompts transform simulated images into more realistic counterparts, supporting few-shot adaptation. NVIDIA's Cosmos and Omniverse Replicator have made synthetic data generation a routine part of industrial robotics pipelines.
The following table summarizes notable examples of synthetic data use in AI training.
| Project | Year | Creator | Synthetic data type | Key details |
|---|---|---|---|---|
| Self-Instruct | 2022 | Allen Institute / UW | Instruction-following data from base model | 33-point gain on Super-NaturalInstructions; foundational recipe |
| Alpaca | 2023 | Stanford | 52K instruction-response pairs | Generated from text-davinci-003 for under $500; fine-tuned LLaMA 7B |
| Vicuna | 2023 | LMSYS | ~70K ShareGPT conversations | Multi-turn dialogue; LLaMA 13B reached ~90% of ChatGPT on simple eval |
| WizardLM / Evol-Instruct | 2023 | Microsoft / PKU | Evolved instructions of varying complexity | Iterative complexity rewriting via GPT-3.5-Turbo |
| Phi-1 | 2023 | Microsoft | Synthetic Python textbooks (1B tokens) | 1.3B model; 50.6% HumanEval; "Textbooks Are All You Need" |
| Phi-2 | 2023 | Microsoft | Synthetic textbooks plus filtered web | 2.7B model; outperformed models 25x its size |
| Phi-4 | 2024 | Microsoft | 400B tokens across 50 synthetic dataset types | 14B model; first Phi to surpass its teacher (GPT-4) on STEM QA |
| Orca | 2023 | Microsoft | 5M GPT-4 reasoning explanations | Explanation tuning: student learns teacher's reasoning |
| Orca 2 | 2023 | Microsoft | GPT-4 multi-strategy reasoning | 13B beat Llama 2 13B by 47.5% on reasoning benchmarks |
| UltraChat | 2023 | Tsinghua | 1.5M multi-turn synthetic dialogues | Generated by GPT-3.5-Turbo |
| WizardCoder | 2023 | Microsoft | Evolved code instructions | Code Evol-Instruct on Code Alpaca |
| Magicoder | 2023 | UIUC / Tsinghua | OSS-Instruct generated code problems | Drew from real OSS code snippets to generate novel problems |
| Cosmopedia | 2024 | Hugging Face | 25B-token synthetic textbook corpus | Generated by Mixtral; 30M+ documents; 10k+ GPU hours |
| AgentInstruct | 2024 | Microsoft | Agentic instruction data | Multi-turn trajectories for tool use and reasoning |
| Llama 3.1 | 2024 | Meta | Iterative post-training synthetic data | 405B used to teach 70B and 8B; license updated to allow distillation |
| DeepSeek-R1 distillation | 2025 | DeepSeek | 800K reasoning traces from R1 | Used to fine-tune 6 open base models in Llama 3.1/3.3 and Qwen 2.5 families |
| Constitutional AI / Claude | 2022-present | Anthropic | Synthetic critiques and preference labels | First large-scale RLAIF; backbone of Claude alignment |
Synthetic data offers several advantages that have driven its rapid adoption.
Synthetic data can replicate the statistical properties of sensitive datasets without containing actual personal information. This is particularly valuable in healthcare, where patient records cannot be freely shared, and in finance, where transaction data is subject to strict regulatory requirements. Synthea, the open-source patient generator developed at MITRE by Jason Walonoski and colleagues (described in a 2018 JAMIA paper), simulates the lifespans of synthetic patients including the ten most frequent reasons for primary care visits and the ten chronic conditions with the highest morbidity in the United States. One million Synthea patient records, encoded in HL7 FHIR and C-CDA standards, are now freely available online and used widely for healthcare AI research.
Differential privacy synthesis combines synthetic data with formal privacy guarantees. Calibrated random noise is added to the generation process so that the presence or absence of any single individual's record has a provably bounded effect on the output. A 2024 study warned that strong differential privacy guarantees (epsilon less than or equal to 1) can inflate Type I error in downstream statistical tests, so practitioners need to validate that the synthetic data still supports valid inference for their target task.
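A minimal illustration of where the noise enters, using the Laplace mechanism on a histogram: the raw records influence only noisy bin counts, and synthetic values are sampled from those counts. A real DP synthesizer is far more sophisticated; this sketch only shows the mechanism. (A Laplace(0, 1/epsilon) variate is generated as the difference of two exponential variates, and a count histogram has sensitivity 1, so the noisy counts satisfy epsilon-DP.)

```python
import random

def dp_histogram_synthesize(values, bins, epsilon, n_new, seed=0):
    """Release a Laplace-noised histogram and sample synthetic values from it.
    `bins` is a list of (lo, hi) intervals covering the data range."""
    rng = random.Random(seed)
    counts = [0] * len(bins)
    for v in values:
        for i, (lo, hi) in enumerate(bins):
            if lo <= v < hi:
                counts[i] += 1
                break
    # Laplace(0, 1/epsilon) noise = difference of two Exp(epsilon) draws.
    noisy = [max(0.0, c + rng.expovariate(epsilon) - rng.expovariate(epsilon))
             for c in counts]
    total = sum(noisy) or 1.0
    probs = [c / total for c in noisy]
    out = []
    for _ in range(n_new):
        r, acc = rng.random(), 0.0
        for (lo, hi), p in zip(bins, probs):
            acc += p
            if r <= acc:
                out.append(rng.uniform(lo, hi))  # uniform within chosen bin
                break
        else:
            out.append(rng.uniform(*bins[-1]))
    return out
```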
Collecting and annotating real data is expensive: costs range from cents per label on crowdsourcing platforms to dollars per example for expert annotation in domains like medical imaging or legal document analysis. Synthetic data generation can reduce these costs by orders of magnitude. Stanford's Alpaca demonstrated that an entire instruction tuning dataset could be generated for under $500, and NVIDIA reports that synthetic image generation in Omniverse Replicator costs about six cents per image, compared to roughly six dollars for human annotation, a 100x reduction.
Modern AI training is data-hungry, and high-quality human-generated data is finite. Research from Epoch AI and others suggests that the stock of high-quality public text on the internet may be largely exhausted by 2026 to 2028. Synthetic data provides a way to continue scaling beyond the limits of organically available text. Microsoft's Phi-4 used 400 billion synthetic tokens, a scale that would be extraordinarily difficult to achieve through human authorship.
Synthetic data generation can be targeted to fill specific gaps in training data. If a model struggles with a particular type of question or task, synthetic examples covering those areas can be generated on demand. This targeted approach is more efficient than hoping organic data collection will cover all needed scenarios. WizardLM's Evol-Instruct, for example, deliberately generates harder examples than the seed set to push model capability ceilings.
Synthetic data generation is fully controllable and reproducible. Researchers can specify what characteristics the data should have, repeat generation with different random seeds, and version-control the generation pipeline. This level of control is impossible with organic data collection.
Synthetic data carries significant risks that the research community has increasingly recognized.
The most widely discussed risk is model collapse, the phenomenon where models trained on synthetic data generated by previous model generations progressively lose the ability to represent the tails of the original data distribution. In July 2024, Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal published a Nature paper titled "AI Models Collapse When Trained on Recursively Generated Data" (Nature 631, 755-759).
The paper showed that when models are trained iteratively on data generated by previous model generations (a "replace" scenario), each successive generation degrades in quality and diversity. The tails of the original distribution disappear first, meaning rare but important patterns are lost. In a vivid example, a model initially trained on text about medieval architecture devolved by the ninth generation into producing repetitive lists of jackrabbits. The effect was observed across LLMs, variational autoencoders, and Gaussian mixture models.
The mechanism involves two interacting effects: statistical approximation error, in which finite sampling means rare events in the distribution's tails are underrepresented in each generation's training set and eventually vanish, and functional approximation error, in which the model's limited expressivity and imperfect optimization systematically distort the distribution it is trying to reproduce.
These errors compound across generations, causing a progressive narrowing of the distribution. Even contamination of training data with as little as 0.1% synthetic data from previous model generations can contribute to eventual collapse in a pure replace setup.
The paper's findings come with important nuances. In an "accumulate" scenario, where each model generation trains on all previous real and synthetic data combined rather than replacing real data, collapse can be avoided or significantly delayed. Subsequent commentary, including a 2024 note by Ali Borji (arXiv:2410.12954), pointed out that the original experimental setup was more pessimistic than typical real-world workflows, since real pipelines mix human and synthetic data and apply quality filters. The practical takeaway has been clear: never fully replace real data with synthetic data; always anchor synthetic generations in human content.
| Aspect | Replace setup | Accumulate setup | Mitigation |
|---|---|---|---|
| Training data | Each generation trained only on synthetic data from prior model | Each generation trained on all real plus synthetic data so far | Always retain human anchor |
| Outcome | Severe distribution narrowing; tails vanish; collapse | Collapse delayed or avoided | Mix real and synthetic |
| Tail patterns | Lost first | Preserved | Diversity filtering |
| Practical relevance | Worst-case scenario | Closer to real workflows | Quality verification |
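The replace-setup dynamics can be reproduced in a few lines with a one-dimensional Gaussian standing in for the data distribution: each "generation" fits a mean and spread to a small sample drawn from the previous generation's model and discards everything else. The fitted spread drifts toward zero, a toy version of the tails vanishing.

```python
import random, statistics

def collapse_demo(generations=1000, n=10, mu=0.0, sigma=1.0, seed=0):
    """Replace setup: generation t trains (here: fits a Gaussian) only on n
    samples from generation t-1's model. Finite-sample bias compounds across
    generations and the learned spread shrinks toward zero."""
    rng = random.Random(seed)
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
    return sigma
```

In the accumulate setup, refitting each generation to the union of all earlier samples keeps the spread anchored near its original value, which is the toy analogue of why retaining real data delays or prevents collapse.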
Synthetic data inherits and can amplify the biases present in the model that generated it. If a language model has learned gender stereotypes, racial biases, or cultural assumptions from its training data, these biases will be reflected in the synthetic data it produces. When this biased synthetic data is used to train new models, the biases can be amplified in a feedback loop. This is particularly concerning when synthetic data is used at scale, since the volume of biased examples can overwhelm any debiasing efforts applied to the real data portion.
Synthetic data quality is bounded by the capabilities of the generating model. A model cannot generate training data containing knowledge it does not have; it can only rearrange, recombine, and re-present what it has already learned. This means synthetic data is most useful for distillation (training smaller models to approximate larger ones) but has inherent limitations for pushing the frontier of capabilities. Subtle errors, inconsistencies, and hallucinated facts in synthetic text data can propagate to models trained on it. Phi-4 is a partial counterexample: by carefully constructing synthetic data with verified solutions and rejection sampling, Microsoft was able to surpass the teacher model on STEM QA.
LLM-generated text tends to be more homogeneous in style, vocabulary, and structure than human-written text. Models trained heavily on LLM-generated data may lose the diversity of expression found in human language, converging toward a narrow "AI voice." This is related to but distinct from model collapse: even without recursive training, a single generation of synthetic data can lack the variety of human-authored content.
As synthetic data becomes ubiquitous, distinguishing synthetic from real data becomes harder. By April 2025, one estimate suggested that more than 74% of newly created web pages contained AI-generated text. This contamination of the public web means future models trained on internet scrapes will inevitably train on synthetic data, whether intentionally or not, raising the risk of unintended recursive training effects and complicating any attempt to maintain a clean human-only baseline.
Serious synthetic data pipelines invest heavily in filtering and verification. Naively prompting an LLM and saving the outputs produces poor training data. The current best practice involves several layers.
Quality classifiers built on top of LLMs are now standard. The FineWeb-Edu pipeline, released by Hugging Face in 2024, uses Llama-3-70B-Instruct to score 500,000 web samples for educational quality on a 0 to 5 scale, then trains a lightweight regression classifier on Snowflake-arctic-embed embeddings to score the remaining web data at scale. The classifier achieves an F1 of 82% on the binary classification task at threshold 3, and on MMLU, FineWeb-Edu can match the final performance of the much larger Matrix dataset using roughly 10x fewer tokens. Similar LLM-as-judge pipelines are used at filtering time in nearly every modern pretraining run.
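The two-stage pattern (an expensive judge scores a small seed set, then a cheap proxy model scores everything else) can be sketched as follows. Everything here is illustrative: `llm_judge` is a stub standing in for the actual LLM-as-judge API call, and a single keyword-density feature stands in for the embedding model.

```python
EDU_TERMS = {"theorem", "equation", "hypothesis", "experiment", "tutorial"}

def feature(doc):
    # Toy stand-in for an embedding: density of educational vocabulary.
    words = doc.lower().split()
    return sum(w.strip(".,") in EDU_TERMS for w in words) / max(len(words), 1)

def llm_judge(doc):
    # Stand-in for the expensive LLM-as-judge call returning a 0-5 score.
    return min(5.0, 25.0 * feature(doc))

def fit_linear(xs, ys):
    # Cheap proxy: closed-form simple linear regression on the judged seed set.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) or 1e-9
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

seed_docs = [
    "the theorem follows from the equation in this tutorial",
    "a fun story about a cat and a dog",
    "we test the hypothesis with a controlled experiment",
]
slope, intercept = fit_linear(
    [feature(d) for d in seed_docs],
    [llm_judge(d) for d in seed_docs],  # judge only the small seed set
)

def keep(doc, threshold=3.0):
    # Score the rest of the corpus with the cheap proxy alone.
    return slope * feature(doc) + intercept >= threshold
```

The point of the design is cost: the judge is called a few hundred thousand times, while the proxy scores trillions of tokens.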
Deduplication and diversity controls address a known weakness of LLM-generated data: prompting the same model with similar prompts yields similar outputs. MinHash and exact-substring deduplication remove near-duplicate generations, while topic clustering spreads generation across a wide range of inputs. Cosmopedia handled this by clustering web data into 145 topic groups before generation; Phi-4 used 50 distinct synthetic dataset types with different seed sets and prompting procedures.
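A minimal MinHash near-duplicate check might look like the following sketch (word 3-grams and MD5-salted hash functions are arbitrary choices made here for illustration):

```python
import hashlib

def shingles(text, k=3):
    # Represent a document as its set of word k-grams.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    # One minimum value per salted hash function; the fraction of equal
    # positions between two signatures estimates Jaccard similarity.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(a, b):
    sa = minhash_signature(shingles(a))
    sb = minhash_signature(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

In a production pipeline the signatures are banded into a locality-sensitive-hashing index so candidate duplicate pairs are found without an all-pairs comparison.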
For code and math, verification by execution provides a natural quality filter: keep only generated solutions that pass test cases or arrive at the correct answer. This is the basis of DeepSeek's reasoning data pipeline and most modern math/code post-training. The 2025 paper "Escaping Model Collapse via Synthetic Data Verification" (Feng et al., arXiv:2510.16657) extends this approach more broadly, showing that proper verification can mitigate many aspects of model collapse.
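The filter itself is simple in principle. A sketch for Python code candidates follows; the candidates and test cases are invented for illustration, and a real pipeline would sandbox the `exec` call rather than run untrusted generations in-process.

```python
def passes(candidate, test_cases):
    # Keep a generated solution only if it defines solve() and passes all tests.
    namespace = {}
    try:
        exec(candidate, namespace)
        return all(namespace["solve"](*args) == expected
                   for args, expected in test_cases)
    except Exception:
        return False  # syntax errors, crashes, missing solve(), wrong types

candidates = [
    "def solve(a, b):\n    return a + b",   # correct
    "def solve(a, b):\n    return a - b",   # wrong answer
    "def solve(a, b):\n    return a +",     # does not even parse
]
test_cases = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
verified = [c for c in candidates if passes(c, test_cases)]
```

Only the first candidate survives; the rejected generations are discarded before any model ever trains on them.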
For high-stakes applications, human-in-the-loop review remains essential. Constitutional AI's principles were authored by Anthropic researchers; Phi-4's synthetic data was checked through extensive ablations; medical synthetic data is typically validated by domain experts before use. Human review is the slowest and most expensive layer, but it remains the only way to catch certain categories of failure.
Synthetic data is used across nearly every domain that uses machine learning. The following table summarizes the major application areas.
| Domain | Typical use | Representative tools and datasets |
|---|---|---|
| Computer vision | Object detection, segmentation, face recognition; rare-class augmentation | RandAugment, StyleGAN-generated faces, NVIDIA Omniverse Replicator |
| NLP | Instruction tuning, code training, reasoning data | Alpaca, WizardLM, Phi-4, Cosmopedia, UltraChat |
| Speech and audio | TTS for low-resource languages; voice cloning for ASR augmentation | ElevenLabs synthetic voices, Whisper fine-tuning corpora |
| Robotics | Sim-to-real transfer; manipulation policies; navigation | NVIDIA Isaac Sim, Mujoco, Habitat, AI2-THOR |
| Self-driving | Perception, planning, edge-case scenario generation | CARLA, Waymo simulators, NVIDIA DRIVE Sim, Mindtech |
| Healthcare | Privacy-preserving training; rare disease modeling | Synthea, MDClone, MOSTLY AI |
| Finance | Fraud detection; rare-event modeling; regulatory testing | CTGAN, MOSTLY AI, Hazy, Tonic.ai |
| Cybersecurity | Adversarial example generation; intrusion detection training | Fuzz testers, GAN-based malware variants |
| Industrial inspection | Defect detection on rare or expensive parts | Datagen, NVIDIA Omniverse Replicator, Unity Perception |
| Gaming | NPC behavior data; procedural content | Self-play in AlphaGo, AlphaZero; LLM dialogue generation |
In computer vision, synthetic data ranges from classical augmentation (RandAugment, AutoAugment, MixUp) to fully rendered scenes. RandAugment, introduced by Ekin Cubuk and colleagues at CVPR 2020, reduced the data augmentation search space from roughly 10^32 candidate policies to about 100 by using a single severity parameter shared across operations, achieving 85.0% top-1 accuracy on ImageNet at the time of publication. NVIDIA Isaac Sim, Unity Perception, and CARLA produce labeled synthetic images and video for object detection, semantic segmentation, and depth estimation.
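The core of RandAugment is small enough to sketch: sample N operations uniformly at random and apply each at a shared magnitude M. The toy grayscale operations below are invented stand-ins for the paper's actual image transforms.

```python
import random

def brightness(img, m):
    return [[min(255, px + 10 * m) for px in row] for row in img]

def contrast(img, m):
    mean = sum(map(sum, img)) / (len(img) * len(img[0]))
    scale = 1 + 0.05 * m
    return [[max(0, min(255, int(mean + (px - mean) * scale))) for px in row]
            for row in img]

def mirror(img, m):
    return [row[::-1] for row in img]  # magnitude-free, like the paper's flips

OPS = [brightness, contrast, mirror]

def randaugment(img, n=2, m=9, rng=random):
    # The whole policy: n uniformly sampled ops, one shared severity m.
    for op in rng.choices(OPS, k=n):
        img = op(img, m)
    return img
```

With N and M each swept over roughly ten values, the tuning grid has about 100 points, which is what replaces AutoAugment's search over ~10^32 learned policies.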
Synthetic data is now central to LLM post-training for instruction following, code, math, reasoning, and safety alignment. The patterns set by Self-Instruct, Alpaca, and Evol-Instruct dominate the open ecosystem; Constitutional AI dominates synthetic safety alignment; the Phi recipe dominates synthetic pretraining.
In robotics and self-driving, simulation-based synthetic data is the only practical way to obtain certain edge cases (jaywalking pedestrians at night, sensor failures, rare weather). CARLA has been a workhorse since 2017. NVIDIA's Omniverse and Cosmos platforms now generate large-scale synthetic video for training robot foundation models. The 2024-2026 trend toward video world models and physics simulators feeding robotics pipelines makes synthetic data central to embodied AI.
Synthea provides openly available synthetic patient records that comply with U.S. healthcare data formats. Differentially private synthesis is being applied to behavioral health datasets and clinical trial data. A 2024 Google DeepMind study found that complementing real data with synthetic data improved robustness across histopathology, radiology, and dermatology tasks.
Synthetic data is widely used in finance for fraud detection (where positive examples are rare) and regulatory testing (where real data cannot be shared between institutions). MOSTLY AI counts Fortune 100 banks and insurers among its core clients. CTGAN-based pipelines are commonly used for tabular fraud datasets.
| Company / project | Focus | Notes |
|---|---|---|
| MOSTLY AI | Synthetic tabular data for enterprises | Core clients are Fortune 100 banks, insurers, and telcos; raised $25M Series B in 2022 |
| Gretel.ai | Synthetic data platform with natural-language interface | Acquired by NVIDIA in 2025 to support its AI and cloud offerings |
| Tonic.ai | Synthetic data for software testing and ML | Strong presence in healthcare and fintech |
| Hazy | Synthetic data for financial services | UK-based, regulated industries focus |
| MITRE Synthea | Open-source synthetic patient generator | 1M+ patient records freely available; HL7 FHIR compliant |
| MIT SDV | Open-source synthetic data library | Statistical, GAN, and VAE methods for tabular data |
| Datagen | Synthetic data for computer vision | Faces, hands, indoor scenes |
| Mindtech | Synthetic data for video and surveillance | Focus on ethical, balanced datasets |
| NVIDIA Omniverse / Replicator / Cosmos | Industrial-grade synthetic data for robotics and self-driving | Used by Skild AI and others for robot policy training |
| Microsoft Phi team | Synthetic-first LLM pretraining | Phi-1 through Phi-4 demonstrated the approach |
| Anthropic | Synthetic preference data via Constitutional AI | RLAIF foundation of Claude alignment |
| Hugging Face Cosmopedia | Open synthetic pretraining dataset | 25B tokens via Mixtral |
The rapid growth of synthetic data has attracted regulatory attention worldwide.
The European Union has been the most active regulator in this space. The EU AI Act, which began phased implementation in 2024, includes provisions relevant to synthetic data. High-risk AI systems must document their training data, including any synthetic components. In April 2025, the European Data Protection Board issued guidelines specifically addressing synthetic data generation under GDPR, recognizing its potential for privacy preservation while establishing a framework for compliant generation. The guidelines require organizations to demonstrate that synthetic data cannot be re-identified and that the generation process does not involve unauthorized processing of personal data.
Several jurisdictions now require or are considering requirements for labeling AI-generated content, which extends to synthetic training data. The goal is to maintain data provenance and enable downstream users to understand what proportion of a model's training data was synthetic versus human-generated.
The use of one model's outputs to train another raises unresolved IP questions. OpenAI's terms of service have historically restricted using its API outputs to train competing models. Meta's Llama 3 license update specifically allows distillation. The legal status of synthetic training data generated by commercial models remains uncertain and varies by jurisdiction.
The synthetic data market has experienced rapid growth, though estimates vary widely with the research firm and market definition. Industry estimates place the global synthetic data generation market at approximately $580 million in 2025, with projections ranging from $2.67 billion by 2030 (at a 39.4% CAGR) to $7.22 billion by 2033 (at a 37.65% CAGR). By a broader market definition, the synthetic tabular sub-market alone was estimated at $1.36 billion in 2024 and projected to reach $1.88 billion in 2025.
| Year | Estimated market size (USD) | Key drivers |
|---|---|---|
| 2023 | ~$300M | LLM training boom; Phi models demonstrate synthetic pretraining viability |
| 2024 | ~$400-575M | Enterprise adoption for privacy compliance; EU AI Act implementation begins; Phi-4 released |
| 2025 | ~$580M | Mainstream adoption across industries; GDPR synthetic data guidelines issued; NVIDIA acquires Gretel |
| 2026 (projected) | ~$770M | Continued growth driven by data scarcity and regulatory requirements |
| 2030 (projected) | ~$2.7B | Established component of AI training infrastructure |
Major technology companies are heavily invested. Microsoft has built its Phi line around synthetic data. Google has used synthetic data extensively in training its Gemini models. Anthropic relies on synthetic preference data for Claude. Meta updated the Llama license specifically to encourage distillation. The data licensing market has also grown, with companies like Reddit and News Corp signing deals with AI labs to provide verified human-generated content as an anchor against the risks of pure synthetic training.
As of early 2026, synthetic data is a routine component of AI training pipelines, and the field is grappling with several evolving challenges.
The supply of high-quality, human-generated text data is tightening. Research suggests that the stock of high-quality text on the internet suitable for LLM training may approach its practical limits within the next few years. This scarcity has made synthetic data not merely convenient but necessary for continued scaling. The contamination of the public web with AI-generated content also means that even "organic" web scrapes increasingly contain synthetic text, blurring the distinction between real and synthetic data.
The most successful current approaches combine synthetic and real data strategically. Apple's "Rephrasing the Web" approach, Microsoft's Phi series, and Hugging Face's Cosmopedia all demonstrate that the best results come from anchoring synthetic data in human-generated foundations. The consensus recommendation is to never fully replace real data with synthetic data, but to use synthetic data to augment, diversify, and extend it.
Increasingly sophisticated pipelines verify and filter synthetic data before training. These include automated quality scoring (FineWeb-Edu style), factual consistency checking, diversity metrics, and human-in-the-loop review. Recent research on synthetic data verification suggests that proper verification can mitigate many of the risks associated with synthetic data, including some aspects of model collapse.
One of the most active areas is using synthetic data to improve reasoning. This includes step-by-step mathematical proofs, code solutions with test cases, and logical reasoning chains. The advantage in this domain is that verifiable rewards provide a natural quality filter: if the generated solution passes the test cases, it is useful regardless of who or what produced it. DeepSeek's R1 distillation pipeline and Microsoft's Phi-4-reasoning report exemplify this trend.
With the rise of multimodal models, synthetic data generation has expanded beyond text and images to include video, audio, 3D scenes, and interleaved modalities. Sora and other generative video models are being used to produce training data for robotics, autonomous driving, and embodied AI. NVIDIA's Cosmos world models generate photorealistic synthetic video specifically intended for training robot policies.
The open-source tooling around synthetic data has matured rapidly. Active projects include the Synthetic Data Vault (SDV), YData synthetic, Hugging Face's synthetic-data-generator, llm-swarm (used for Cosmopedia), distilabel (used by Argilla and others to build instruction datasets), and Microsoft's AgentInstruct framework. Commercial offerings from Gretel (now NVIDIA), MOSTLY AI, Tonic.ai, and Hazy provide enterprise platforms with built-in privacy controls.