Synthetic data is data that has been artificially generated rather than collected from real-world events. In the context of artificial intelligence and machine learning, synthetic data is produced by algorithms, simulations, or generative models to serve as a substitute for or supplement to real data when training, validating, or testing AI systems. Use of synthetic data has grown dramatically since 2022, driven by the scaling demands of large language models, the increasing scarcity of high-quality human-generated training data, mounting privacy regulations, and the cost of human annotation. By late 2024, synthetic data sat at the center of frontier model training, with Microsoft's Phi-4 trained on roughly 400 billion synthetic tokens, Meta using its 405B Llama 3 model to generate training data for its smaller siblings, and DeepSeek using rejection-sampled reasoning traces to teach its V3 and R1 models.
At the same time, research has revealed serious risks. The most widely cited is model collapse, the phenomenon documented by Shumailov and colleagues in a 2024 Nature paper, where models trained recursively on synthetic data progressively lose the ability to represent the tails of the original distribution. Bias amplification, homogenization of style, and the steadily growing share of AI-generated content on the open web have all complicated the picture. The current consensus is that synthetic data is most effective when it augments rather than replaces human-generated data, and when the generation pipeline includes aggressive filtering, diversity controls, and verification.
Synthetic data is any data not collected from direct measurement of the real world. The category is broader than it sounds. A weather simulator producing fake satellite images, a language model writing instruction-response pairs, a game engine rendering pedestrians for self-driving training, and a statistical model sampling new rows from a fitted distribution all produce synthetic data. The data may be used to train models, to validate or test systems, to augment scarce real datasets, or to stand in for sensitive records.
The key property is fidelity: how closely the synthetic data matches the structure and statistical properties of real data for a given downstream task. A synthetic dataset that fools a discriminator may still fail to train a useful classifier if the relationships it captures are superficial. Quality is always task-dependent.
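One standard way to make "task-dependent quality" concrete is the train-on-synthetic, test-on-real (TSTR) protocol: fit a model on the synthetic data and measure its performance on held-out real data. The sketch below illustrates the protocol with a deliberately trivial nearest-centroid classifier on one-dimensional features; the classifier and data layout are illustrative, not from any particular library.

```python
import statistics

def tstr_accuracy(synthetic, real):
    """Train-on-Synthetic, Test-on-Real: fit a nearest-centroid classifier on
    synthetic (feature, label) pairs and report its accuracy on real pairs."""
    by_label = {}
    for x, y in synthetic:
        by_label.setdefault(y, []).append(x)
    centroids = {y: statistics.fmean(xs) for y, xs in by_label.items()}

    def predict(x):
        # Assign the label whose centroid is closest to the feature value.
        return min(centroids, key=lambda y: abs(centroids[y] - x))

    return sum(predict(x) == y for x, y in real) / len(real)
```

A synthetic dataset is only as useful as the TSTR score of models trained on it, measured relative to the same model trained on the real data directly.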
Synthetic data takes many forms. The categories below are not mutually exclusive; modern pipelines often combine several modalities.
Synthetic tabular data consists of structured rows and columns that mimic the statistical properties of a real dataset while containing no actual records. This is the oldest form of synthetic data and remains widely used in healthcare, finance, telecommunications, and software testing. The Synthetic Data Vault (SDV), released by researchers at MIT in 2016, provides open-source frameworks for generating synthetic tabular data using Gaussian copulas, variational autoencoders (VAEs), and conditional tabular GANs (CTGAN). Commercial platforms from MOSTLY AI, Tonic.ai, Gretel.ai, and Hazy generate synthetic versions of customer databases, transaction logs, and electronic health records.
Synthetic text includes instruction-response pairs, dialogues, code, articles, reasoning chains, and entire books. Since the release of ChatGPT in late 2022, LLM-generated text has become the dominant form of synthetic data in AI training. Applications range from generating instruction tuning datasets like Alpaca and UltraChat to creating synthetic textbooks for pretraining as in the Phi series and Cosmopedia.
Synthetic images are generated using generative adversarial networks (GANs), diffusion models, or rendering engines like Unity, Unreal Engine, and Blender. Common applications include training object detection and computer vision models when real labeled images are scarce, expensive, or privacy-sensitive. Synthetic face datasets have been used to train facial recognition systems without using photographs of real people, and rendered images of warehouse environments are widely used to train pick-and-place robots.
Synthetic video extends image synthesis into the temporal domain, generating sequences of frames for autonomous driving simulation, action recognition training, robotics, and surveillance research. Simulation platforms like CARLA and NVIDIA Isaac Sim produce photorealistic synthetic video for training reinforcement learning agents and perception systems. Generative video models like Sora, Veo, and Runway can produce short clips that have been proposed as data sources for downstream training, though their use as training data for other generative models is contested.
Synthetic audio includes text-to-speech outputs, voice cloning, and music. It is used to train speech recognition and audio classification systems, particularly for low-resource languages where recorded speech is limited. TTS-generated training data has become standard for fine-tuning ASR systems on rare accents, code-switched speech, and domain-specific vocabularies.
Synthetic 3D scenes, point clouds, LiDAR returns, radar signatures, and IMU traces are generated using physics simulators and game engines. The CARLA simulator, introduced by Dosovitskiy and colleagues in 2017, became a standard benchmark for autonomous driving research, providing labeled sensor data for camera, depth, and semantic segmentation streams. NVIDIA's Omniverse and Isaac Sim extend this approach to industrial robotics, generating photorealistic, physically-accurate synthetic data with full ground-truth labels.
Generation methods can be grouped into five broad approaches. Each makes different trade-offs between fidelity, controllability, cost, and the kinds of structure it can capture.
| Method | How it works | Best for | Limitations |
|---|---|---|---|
| Rule-based | Predefined templates, grammars, and heuristics generate data | Software testing, fuzz testing, simulation, regulatory compliance | Cannot capture complex real-world distributions |
| Statistical | Fits a model (Gaussian copula, Bayesian network, KDE) to the empirical distribution and samples new points | Tabular data with well-defined distributions | Struggles with high dimensionality, mixed types, and complex dependencies |
| Generative neural networks | GAN, VAE, or diffusion model learns a generative model of the data | Images, audio, tabular data, time series | Mode collapse, training instability, difficulty with discrete outputs like text |
| Simulation | A physics or game engine renders synthetic environments and produces labeled sensor data | Robotics, autonomous driving, embodied AI, physical simulation | Sim-to-real gap; expensive engineering; may miss real-world quirks |
| LLM generation | A pretrained large language model is prompted to produce text, code, or structured data | Instruction tuning, distillation, code, reasoning chains, synthetic textbooks | Inherits the source model's biases and knowledge limits; risk of homogenization |
Rule-based methods are the simplest and oldest approach. A human defines rules: "generate a customer record with a random name from this list, an age between 18 and 90 drawn from a normal distribution, and a purchase amount correlated with age." These methods are transparent, reproducible, and fast, and they remain widely used in software testing, simulation, and regulatory compliance. Fuzz testing, which feeds programs randomly generated or template-derived inputs to find security bugs, is a long-standing rule-based application.
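The customer-record rule in the quoted example translates directly into code. A minimal sketch, where the field names, the name list, and the specific correlation rule are illustrative choices:

```python
import random

NAMES = ["Ada", "Grace", "Alan", "Edsger", "Barbara"]

def make_customer(rng=random):
    """Rule-based record: name drawn from a fixed list, age ~ N(45, 15) clipped
    to [18, 90], purchase amount positively correlated with age by construction."""
    age = max(18, min(90, round(rng.gauss(45, 15))))
    purchase = round(2.0 * age + rng.uniform(-20, 20), 2)  # tied to age
    return {"name": rng.choice(NAMES), "age": age, "purchase": purchase}
```

Because the rules are explicit, the generated distribution is transparent and reproducible, which is exactly what makes rule-based methods attractive for testing and compliance, and also why they cannot capture dependencies nobody thought to encode.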
Statistical approaches fit a model to the empirical distribution of real data and then draw new samples from that fitted distribution. Techniques include Gaussian copulas (which model the dependency structure between variables separately from their marginal distributions), Bayesian networks, and kernel density estimation. The SDV library implements several of these methods. Statistical approaches work well for moderately complex tabular data but struggle when the data has high dimensionality, mixed types, or intricate conditional dependencies.
Generative adversarial networks, introduced by Ian Goodfellow and colleagues in 2014, revolutionized synthetic data generation for images and have been adapted for tabular and time-series data. The GAN framework pits two neural networks against each other: a generator that creates synthetic samples and a discriminator that tries to tell synthetic from real. Through this adversarial process, the generator learns to produce increasingly realistic data.
Key GAN variants for synthetic data include CTGAN for mixed-type tabular data, TimeGAN for time series, StyleGAN for high-resolution images, and differentially private variants such as DP-GAN and PATE-GAN.
GANs have well-known challenges including mode collapse (where the generator produces only a narrow subset of possible outputs), training instability, and difficulty generating discrete data like text. Since 2022 they have been largely superseded for image generation by diffusion models, which are more stable to train and tend to produce more diverse outputs.
Simulation uses physics engines, game engines, or domain-specific simulators to render synthetic environments and produce labeled sensor data. The CARLA driving simulator, NVIDIA Isaac Sim and Omniverse Replicator, Unity Perception, and Microsoft's AirSim are widely used examples. The major advantage is exact ground truth: every pixel comes with perfect depth, semantic, and instance labels. The major drawback is the sim-to-real gap, the systematic differences between simulated and real-world distributions of light, texture, motion, and sensor noise. Domain randomization, where each scene varies textures, lighting, and physics parameters within wide ranges, is a standard technique for closing this gap.
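Domain randomization amounts to sampling every nuisance parameter of the scene from wide ranges before each render, so that the real world looks like just one more variation. A schematic sketch; the parameter names and ranges are illustrative, not any engine's API:

```python
import random

def randomize_scene(rng=random):
    """Domain randomization: draw scene parameters from deliberately wide
    ranges so a model trained on the renders learns features that transfer
    to real-world conditions."""
    return {
        "light_intensity": rng.uniform(0.2, 3.0),
        "light_azimuth_deg": rng.uniform(0.0, 360.0),
        "texture_id": rng.randrange(500),      # random surface texture swap
        "camera_height_m": rng.uniform(0.5, 2.5),
        "friction": rng.uniform(0.3, 1.2),     # physics-level randomization
        "motion_blur": rng.random() < 0.5,
    }
```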
Cost economics have driven adoption. NVIDIA reports that manually annotating an image typically costs around six dollars, while generating a labeled synthetic image in Omniverse Replicator costs about six cents, a roughly 100x reduction. For applications like warehouse robots and surface defect detection, these economics have made synthetic-first pipelines standard.
Since 2023, the most impactful method for generating synthetic training data has been prompting large language models. This approach leverages the broad knowledge and generative capabilities of models like GPT-4, Claude, and Llama to produce training examples at scale. The Self-Instruct method, introduced by Yizhong Wang and colleagues in 2022, pioneered this approach by using a language model to generate its own instruction-following training data through an iterative bootstrapping process.
LLM-generated synthetic data can take many forms: question-answer pairs, multi-turn conversations, code solutions, reasoning chains, textbook passages, and structured outputs. Quality and diversity depend heavily on the prompting strategy, the source model's capabilities, and the filtering and curation pipeline applied after generation.
LLM-generated synthetic data became a defining feature of post-2022 model development. A handful of papers and projects established the patterns that the rest of the field built on.
Self-Instruct (Wang et al., 2022) was the first widely-cited recipe for generating instruction-following data with a language model. The pipeline starts from a small seed set of human-written instructions (175 in the original paper), prompts a base model to generate new instructions in the same style, classifies each as a classification or generation task, generates inputs and outputs, and filters duplicates. Applied to GPT-3, the method produced 52,000 instructions and yielded a 33-point absolute improvement on Super-NaturalInstructions, roughly matching InstructGPT-001 (which had been trained on much more expensive human annotations). The paper, presented at ACL 2023, set the template for nearly every later instruction dataset.
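The bootstrapping loop and its similarity filter can be sketched in standard-library Python. The paper filters new instructions by ROUGE-L overlap against the existing pool; `SequenceMatcher` serves as a stdlib stand-in here, and `propose` stands in for prompting the base model with in-context examples drawn from the pool:

```python
from difflib import SequenceMatcher

def is_novel(candidate, pool, threshold=0.7):
    """Reject a candidate too similar to anything already in the pool.
    (Self-Instruct uses ROUGE-L; SequenceMatcher is a stdlib stand-in.)"""
    return all(SequenceMatcher(None, candidate, s).ratio() < threshold for s in pool)

def bootstrap(seed_pool, propose, rounds=3, per_round=4):
    """Grow an instruction pool by repeatedly asking `propose` (a stand-in
    for prompting the base model) for new candidate instructions, keeping
    only those that pass the novelty filter."""
    pool = list(seed_pool)
    for _ in range(rounds):
        for cand in propose(pool, per_round):
            if is_novel(cand, pool):
                pool.append(cand)
    return pool
```

The filter is what keeps the pool from collapsing into paraphrases of the seed set; the real pipeline adds task classification and input-output generation steps around this loop.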
Stanford's Alpaca, released in March 2023 by Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, and colleagues, applied the Self-Instruct recipe to OpenAI's text-davinci-003. They generated 52,000 instruction-output pairs at a total API cost of under $500, then fine-tuned a LLaMA 7B base model on the result. The released model qualitatively matched text-davinci-003 on simple instruction-following tasks. Alpaca is widely credited with starting the open-source instruction-tuning boom, though its dependence on a closed teacher model raised licensing questions.
Vicuna, released in March 2023 by Wei-Lin Chiang and colleagues at LMSYS (UC Berkeley, CMU, Stanford, UC San Diego, and MBZUAI), took a different approach. Rather than generating fresh instruction data, the team scraped roughly 70,000 user-shared conversations from ShareGPT.com (later expanded to 125,000 in v1.3). LLaMA was then fine-tuned on this multi-turn dialogue corpus. Vicuna-13B was rated by GPT-4 as reaching about 90% of ChatGPT's quality on a small evaluation set, and it became one of the most-downloaded open chat models of 2023. The data quality was uneven, since ShareGPT included low-quality and inappropriate content, but the project showed that scraped LLM outputs could substantially improve open base models.
WizardLM, introduced by Can Xu and colleagues at Microsoft and Peking University in 2023, addressed a weakness of Self-Instruct and Alpaca: the generated instructions tended to be simple. Their Evol-Instruct method takes an existing instruction and rewrites it into a more complex version using a fixed set of "evolution" prompts that add constraints, deepen the question, increase reasoning requirements, or broaden scope. Starting from Alpaca's 52K examples and using GPT-3.5-Turbo, the team produced a corpus of progressively harder instructions. Human evaluation showed Evol-Instruct outputs were preferred over the original Alpaca examples on a complexity-balanced test set. The same idea was extended to code (WizardCoder, 2023, presented at ICLR 2024) and math (WizardMath).
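The evolution step is a small loop over rewriting prompts. In this sketch the templates paraphrase the spirit of Evol-Instruct's operations rather than quoting the paper's exact prompts, and `call_llm` is a hypothetical stand-in for querying a model such as GPT-3.5-Turbo:

```python
import random

# Paraphrased evolution operations in the spirit of Evol-Instruct;
# the exact prompt wordings in the paper differ.
EVOLUTIONS = {
    "add_constraint": "Rewrite the instruction below, adding one new constraint or requirement:\n{instr}",
    "deepen": "Rewrite the instruction below so it asks about the topic in greater depth:\n{instr}",
    "increase_reasoning": "Rewrite the instruction below so answering it requires multi-step reasoning:\n{instr}",
    "broaden": "Write a new instruction on a related but rarer topic than the one below:\n{instr}",
}

def evolve(instruction, call_llm, rounds=3, rng=random):
    """Apply a random chain of evolution prompts, each producing a harder
    instruction than the last. `call_llm` stands in for the rewriting model."""
    for _ in range(rounds):
        op = rng.choice(list(EVOLUTIONS))
        instruction = call_llm(EVOLUTIONS[op].format(instr=instruction))
    return instruction
```

Chaining several evolutions is what produces the "progressively harder" corpus; the full method also includes an elimination step that discards failed evolutions.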
The Orca series, from Microsoft Research, focused on transferring not just the answers but the reasoning style of a stronger teacher. Orca 1 (Mukherjee et al., 2023) collected roughly 5 million examples in which GPT-4 was prompted to provide step-by-step reasoning, then fine-tuned a 13B base model on the resulting traces. Orca 2 (November 2023) added "explanation tuning": teaching the student to choose among different solution strategies (step-by-step processing, recall-then-generate, extract-generate, direct answer) depending on the task. Orca 2 was trained on a mix of FLAN-v2, 5 million ChatGPT examples from Orca 1, and 1 million GPT-4 examples. The 13B Orca 2 outperformed the 13B Llama 2 baseline by 47.5% on reasoning benchmarks, and the 7B model was reported as competitive with Llama 2 70B on reasoning tasks.
Microsoft's Phi family is the most prominent demonstration that small models trained on synthetic data can punch well above their parameter count. The thesis was set out in the 2023 paper by Suriya Gunasekar and colleagues, "Textbooks Are All You Need." Phi-1 is a 1.3B-parameter Transformer trained for four days on eight A100s, on a mix of 6B tokens of "textbook quality" code from the web (filtered using a GPT-4 classifier) and 1B tokens of synthetic Python textbooks and exercises generated by GPT-3.5. Despite its small size, Phi-1 reached 50.6% pass@1 on HumanEval and 55.5% on MBPP, beating models 10x its size at the time.
Phi-1.5 extended the recipe to general reasoning. Phi-2 (2.7B) outperformed models up to 25x its size on certain reasoning benchmarks. Phi-3 (released April 2024) introduced a more sophisticated synthetic data pipeline. Phi-4 (December 2024, technical report by Marah Abdin and colleagues) is a 14B model trained on roughly 400 billion synthetic tokens generated across 50 distinct synthetic dataset types, each produced through different seed sets and multi-stage prompting. Crucially, Phi-4 surpassed its teacher model (GPT-4) on STEM-focused QA, the first time a Phi model meaningfully outperformed the model used to generate its training data. A separate Phi-4-reasoning report followed in 2025.
Meta's Llama 3.1 release in July 2024 marked the broader arrival of synthetic data in production-scale LLM development. The 405B model was trained on more than 15 trillion tokens, but its more important role was as a teacher for the 70B and 8B variants. Meta updated the Llama license specifically to allow developers to use Llama outputs to train other models. The Llama 3 paper describes an iterative post-training procedure where each round used supervised fine-tuning and direct preference optimization on synthetic data generated by the previous round's model. AWS, NVIDIA, and Hugging Face have all published tutorials and pipelines on using Llama 3.1 405B to generate task-specific synthetic data for fine-tuning smaller models, making distillation from a strong open teacher a routine workflow.
DeepSeek-V3 and DeepSeek-R1, released in late 2024 and early 2025, made aggressive use of synthetic reasoning data. R1's training pipeline includes a stage where the model generates its own labeled reasoning data through rejection sampling: many candidate solutions are generated for each prompt, V3 is used as a judge, and the best examples are kept for supervised fine-tuning. DeepSeek then used R1 to generate roughly 800,000 high-quality reasoning samples and distilled six smaller open models (variants of Llama 3.1, Llama 3.3, and Qwen 2.5) on this synthetic corpus. The R1 paper (arXiv:2501.12948, later also published in Nature in 2025) showed that pure reinforcement learning with verifiable rewards on self-generated reasoning traces can produce strong reasoning capabilities without any human-annotated reasoning data, an unusual demonstration of bootstrapping.
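The rejection-sampling stage follows a simple best-of-k pattern. In the sketch below, `generate` stands in for sampling the model and `verify` for the judge or verifier (in R1's pipeline, V3 played the judge role); neither mirrors DeepSeek's actual code.

```python
def rejection_sample_corpus(prompts, generate, verify, k=8):
    """For each prompt, draw up to k candidate solutions and keep the first
    one the verifier accepts; prompts with no accepted candidate are dropped."""
    corpus = []
    for p in prompts:
        # The generator expression is lazy: sampling stops at the first accept.
        kept = next((c for c in (generate(p) for _ in range(k)) if verify(p, c)), None)
        if kept is not None:
            corpus.append({"prompt": p, "response": kept})
    return corpus
```

When the verifier is mechanical, as in math with checkable answers or code with unit tests, this loop turns raw model sampling into a self-cleaning data pipeline.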
Anthropic's Constitutional AI (Bai et al., December 2022, arXiv:2212.08073) is the earliest documented large-scale use of synthetic data for RLHF-style training. Rather than asking humans to label which of two model responses was safer, Anthropic used a language model itself to perform the labeling, guided by a written "constitution" of principles like "Choose the response that is least harmful." The pipeline produces synthetic critiques and revisions during a supervised stage, and synthetic preference labels during an RL stage (a procedure Anthropic called RLAIF, reinforcement learning from AI feedback). Constitutional AI is now part of Claude's training and has been replicated externally. As Nathan Lambert notes in the RLHF Book, synthetic preference data tends to be lower-noise but higher-bias than human preference data, since AI labelers apply rules consistently but encode the labeler model's blind spots.
Hugging Face's Cosmopedia, released in March 2024, generated over 30 million synthetic textbooks, blog posts, stories, and WikiHow articles using Mixtral-8x7B-Instruct-v0.1, totaling 25 billion tokens. The dataset was built using the llm-swarm library on H100 GPUs with TGI, taking over 10,000 GPU hours. Web data accounted for more than 80% of Cosmopedia's prompts: the team clustered RefinedWeb-style samples into 145 topic groups and asked Mixtral to identify the topic and then write educational content covering it. Cosmopedia was at the time the largest open synthetic dataset for pretraining and provided the first widely-available reproduction of the Phi-style synthetic pretraining recipe.
Synthetic data plays several distinct roles in modern AI training pipelines.
Knowledge distillation uses a large, capable teacher model to generate training data for a smaller student model. The student learns not just the correct answers but also the teacher's reasoning patterns, response style, and factual knowledge. This approach has been used extensively in the open-source LLM ecosystem: Stanford's Alpaca distilled instruction-following from text-davinci-003 into LLaMA 7B, the Orca project distilled GPT-4's reasoning traces into smaller models, and DeepSeek distilled R1 into six smaller open base models. The common pattern in 2024 to 2026 has been a strong model teaching smaller open models, for example Llama 3.1 405B generating post-training data for the 70B and 8B variants.
Synthetic data can augment real datasets by filling gaps, balancing class distributions, or increasing diversity. In natural language processing this includes paraphrasing existing examples, translating into other languages, or generating additional examples for underrepresented categories. In computer vision, augmentation includes rendering rare objects, unusual lighting conditions, and edge-case scenarios. The boundary between augmentation and full synthetic generation is blurry: techniques like RandAugment (Cubuk et al., 2020) treat augmentation as a pipeline of random perturbations, while more recent diffusion-based augmentation produces what amounts to fully synthetic but distribution-anchored examples.
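In code, augmentation-as-pipeline follows RandAugment's recipe: maintain a pool of simple transforms and apply a random subset to each example. The word-level transforms below are toy illustrations; real NLP pipelines use paraphrasing, back-translation, or synonym substitution instead.

```python
import random

def random_augment(words, n_ops=2, rng=random):
    """RandAugment-style augmentation: apply n_ops transforms chosen at
    random from a fixed pool to a tokenized example."""
    def drop_word(w):          # word dropout
        if len(w) <= 1:
            return w
        j = rng.randrange(len(w))
        return [x for i, x in enumerate(w) if i != j]

    def swap_adjacent(w):      # local word-order perturbation
        if len(w) > 1:
            i = rng.randrange(len(w) - 1)
            w = w[:i] + [w[i + 1], w[i]] + w[i + 2:]
        return w

    def title_case(w):         # surface-form variation
        return [x.title() for x in w]

    ops = [drop_word, swap_adjacent, title_case]
    for op in rng.sample(ops, k=n_ops):
        words = op(words)
    return words
```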
In self-play, a model generates data by interacting with itself or with copies of itself, and this data is then used for further training. AlphaGo and AlphaZero famously used self-play to achieve superhuman performance at board games. In the LLM domain, DeepSeek-R1 demonstrated that pure reinforcement learning with self-generated reasoning traces and verifiable rewards can produce emergent reasoning capabilities without any human-annotated reasoning data.
SPIN (Self-Play Fine-Tuning), proposed by Zixiang Chen and colleagues in early 2024 (arXiv:2401.01335), frames fine-tuning as a self-play game: the model learns to distinguish its own generated responses from human-written ground truth, then updates itself so that its next iteration's outputs are harder to tell apart from the human data, iteratively improving alignment without additional human annotation.
The most ambitious use of synthetic data is in pretraining. The Phi series demonstrated that small models pretrained primarily on synthetic textbook-quality data could outperform much larger models trained on web scrapes. Phi-1, trained on synthetic textbooks and exercises for coding, achieved strong code generation results despite its small size. Phi-4 used roughly 400 billion tokens of synthetic data across 50 distinct synthetic dataset types. Hugging Face's Cosmopedia took the same approach in the open, generating a 25-billion-token synthetic pretraining corpus with Mixtral.
Synthetic data is now standard in post-training. Llama 3's iterative post-training procedure used synthetic data at every round of supervised fine-tuning and direct preference optimization. Constitutional AI generates synthetic preference data for Claude. Most modern instruction-tuned and reasoning models use a mix of human and synthetic preference data; many use synthetic data exclusively for some capabilities like math reasoning and code, where verifiable rewards make quality control tractable.
In robotics, synthetic data is used for sim-to-real transfer: train a policy in simulation, then deploy it on physical hardware. Domain randomization and domain adaptation reduce the gap. As of 2024 to 2026 the trend has shifted toward foundation-model-based bridging: latent diffusion models conditioned on text or image prompts transform simulated images into more realistic counterparts, supporting few-shot adaptation. NVIDIA's Cosmos and Omniverse Replicator have made synthetic data generation a routine part of industrial robotics pipelines.
The following table summarizes notable examples of synthetic data use in AI training.
| Project | Year | Creator | Synthetic data type | Key details |
|---|---|---|---|---|
| Self-Instruct | 2022 | Allen Institute / UW | Instruction-following data from base model | 33-point gain on Super-NaturalInstructions; foundational recipe |
| Alpaca | 2023 | Stanford | 52K instruction-response pairs | Generated from text-davinci-003 for under $500; fine-tuned LLaMA 7B |
| Vicuna | 2023 | LMSYS | ~70K ShareGPT conversations | Multi-turn dialogue; LLaMA 13B reached ~90% of ChatGPT on simple eval |
| WizardLM / Evol-Instruct | 2023 | Microsoft / PKU | Evolved instructions of varying complexity | Iterative complexity rewriting via GPT-3.5-Turbo |
| Phi-1 | 2023 | Microsoft | Synthetic Python textbooks (1B tokens) | 1.3B model; 50.6% HumanEval; "Textbooks Are All You Need" |
| Phi-2 | 2023 | Microsoft | Synthetic textbooks plus filtered web | 2.7B model; outperformed models 25x its size |
| Phi-4 | 2024 | Microsoft | 400B tokens across 50 synthetic dataset types | 14B model; first Phi to surpass its teacher (GPT-4) on STEM QA |
| Orca | 2023 | Microsoft | 5M GPT-4 reasoning explanations | Explanation tuning: student learns teacher's reasoning |
| Orca 2 | 2023 | Microsoft | GPT-4 multi-strategy reasoning | 13B beat Llama 2 13B by 47.5% on reasoning benchmarks |
| UltraChat | 2023 | Tsinghua | 1.5M multi-turn synthetic dialogues | Generated by GPT-3.5-Turbo |
| WizardCoder | 2023 | Microsoft | Evolved code instructions | Code Evol-Instruct on Code Alpaca |
| Magicoder | 2023 | UIUC / Tsinghua | OSS-Instruct generated code problems | Drew from real OSS code snippets to generate novel problems |
| Cosmopedia | 2024 | Hugging Face | 25B-token synthetic textbook corpus | Generated by Mixtral; 30M+ documents; 10k+ GPU hours |
| AgentInstruct | 2024 | Microsoft | Agentic instruction data | Multi-turn trajectories for tool use and reasoning |
| Llama 3.1 | 2024 | Meta | Iterative post-training synthetic data | 405B used to teach 70B and 8B; license updated to allow distillation |
| DeepSeek-R1 distillation | 2025 | DeepSeek | 800K reasoning traces from R1 | Used to fine-tune 6 open base models in Llama 3.1/3.3 and Qwen 2.5 families |
| Constitutional AI / Claude | 2022-present | Anthropic | Synthetic critiques and preference labels | First large-scale RLAIF; backbone of Claude alignment |
Synthetic data offers several advantages that have driven its rapid adoption.
Synthetic data can replicate the statistical properties of sensitive datasets without containing actual personal information. This is particularly valuable in healthcare, where patient records cannot be freely shared, and in finance, where transaction data is subject to strict regulatory requirements. Synthea, the open-source patient generator developed at MITRE by Jason Walonoski and colleagues (described in a 2018 JAMIA paper), simulates the lifespans of synthetic patients including the ten most frequent reasons for primary care visits and the ten chronic conditions with the highest morbidity in the United States. One million Synthea patient records, encoded in HL7 FHIR and C-CDA standards, are now freely available online and used widely for healthcare AI research.
Differential privacy synthesis combines synthetic data with formal privacy guarantees. Calibrated random noise is added to the generation process so that the presence or absence of any single individual's record has a provably bounded effect on the output. A 2024 study warned that strong differential privacy guarantees (epsilon less than or equal to 1) can inflate Type I error in downstream statistical tests, so practitioners need to validate that the synthetic data still supports valid inference for their target task.
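A minimal illustration of where the noise enters, using the Laplace mechanism on a histogram: the raw records influence only noisy bin counts, and synthetic values are sampled from those counts. A real DP synthesizer is far more sophisticated; this sketch only shows the mechanism. (A Laplace(0, 1/epsilon) variate is generated as the difference of two exponential variates, and a count histogram has sensitivity 1, so the noisy counts satisfy epsilon-DP.)

```python
import random

def dp_histogram_synthesize(values, bins, epsilon, n_new, seed=0):
    """Release a Laplace-noised histogram and sample synthetic values from it.
    `bins` is a list of (lo, hi) intervals covering the data range."""
    rng = random.Random(seed)
    counts = [0] * len(bins)
    for v in values:
        for i, (lo, hi) in enumerate(bins):
            if lo <= v < hi:
                counts[i] += 1
                break
    # Laplace(0, 1/epsilon) noise = difference of two Exp(epsilon) draws.
    noisy = [max(0.0, c + rng.expovariate(epsilon) - rng.expovariate(epsilon))
             for c in counts]
    total = sum(noisy) or 1.0
    probs = [c / total for c in noisy]
    out = []
    for _ in range(n_new):
        r, acc = rng.random(), 0.0
        for (lo, hi), p in zip(bins, probs):
            acc += p
            if r <= acc:
                out.append(rng.uniform(lo, hi))  # uniform within chosen bin
                break
        else:
            out.append(rng.uniform(*bins[-1]))
    return out
```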
Collecting and annotating real data is expensive: costs range from cents per label on crowdsourcing platforms to dollars per example for expert annotation in domains like medical imaging or legal document analysis. Synthetic data generation can reduce these costs by orders of magnitude. Stanford's Alpaca demonstrated that an entire instruction tuning dataset could be generated for under $500, and NVIDIA reports that synthetic image generation in Omniverse Replicator costs about six cents per image, compared to roughly six dollars for human annotation, a 100x reduction.
Modern AI training is data-hungry, and high-quality human-generated data is finite. Research from Epoch AI and others suggests that the stock of high-quality public text on the internet may be largely exhausted by 2026 to 2028. Synthetic data provides a way to continue scaling beyond the limits of organically available text. Microsoft's Phi-4 used 400 billion synthetic tokens, a scale that would be extraordinarily difficult to achieve through human authorship.
Synthetic data generation can be targeted to fill specific gaps in training data. If a model struggles with a particular type of question or task, synthetic examples covering those areas can be generated on demand. This targeted approach is more efficient than hoping organic data collection will cover all needed scenarios. WizardLM's Evol-Instruct, for example, deliberately generates harder examples than the seed set to push model capability ceilings.
Synthetic data generation is fully controllable and reproducible. Researchers can specify what characteristics the data should have, repeat generation with different random seeds, and version-control the generation pipeline. This level of control is impossible with organic data collection.
Synthetic data carries significant risks that the research community has increasingly recognized.
The most widely discussed risk is model collapse, the phenomenon where models trained on synthetic data generated by previous model generations progressively lose the ability to represent the tails of the original data distribution. In July 2024, Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal published a Nature paper titled "AI Models Collapse When Trained on Recursively Generated Data" (Nature 631, 755-759).
The paper showed that when models are trained iteratively on data generated by previous model generations (a "replace" scenario), each successive generation degrades in quality and diversity. The tails of the original distribution disappear first, meaning rare but important patterns are lost. In a vivid example, a model initially trained on text about medieval architecture devolved by the ninth generation into producing repetitive lists of jackrabbits. The effect was observed across LLMs, variational autoencoders, and Gaussian mixture models.
The mechanism involves two interacting effects: statistical approximation error, in which finite sampling means rare events in the distribution's tails are underrepresented in each generation's training set and eventually vanish, and functional approximation error, in which the model's limited expressivity and imperfect optimization systematically distort the distribution it is trying to reproduce.
These errors compound across generations, causing a progressive narrowing of the distribution. Even contamination of training data with as little as 0.1% synthetic data from previous model generations can contribute to eventual collapse in a pure replace setup.
The paper's findings come with important nuances. In an "accumulate" scenario, where each model generation trains on all previous real and synthetic data combined rather than replacing real data, collapse can be avoided or significantly delayed. Subsequent commentary, including a 2024 note by Ali Borji (arXiv:2410.12954), pointed out that the original experimental setup was more pessimistic than typical real-world workflows, since real pipelines mix human and synthetic data and apply quality filters. The practical takeaway has been clear: never fully replace real data with synthetic data; always anchor synthetic generations in human content.
| Aspect | Replace setup | Accumulate setup | Mitigation |
|---|---|---|---|
| Training data | Each generation trained only on synthetic data from prior model | Each generation trained on all real plus synthetic data so far | Always retain human anchor |
| Outcome | Severe distribution narrowing; tails vanish; collapse | Collapse delayed or avoided | Mix real and synthetic |
| Tail patterns | Lost first | Preserved | Diversity filtering |
| Practical relevance | Worst-case scenario | Closer to real workflows | Quality verification |
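The replace-setup dynamics can be reproduced in a few lines with a one-dimensional Gaussian standing in for the data distribution: each "generation" fits a mean and spread to a small sample drawn from the previous generation's model and discards everything else. The fitted spread drifts toward zero, a toy version of the tails vanishing.

```python
import random, statistics

def collapse_demo(generations=1000, n=10, mu=0.0, sigma=1.0, seed=0):
    """Replace setup: generation t trains (here: fits a Gaussian) only on n
    samples from generation t-1's model. Finite-sample bias compounds across
    generations and the learned spread shrinks toward zero."""
    rng = random.Random(seed)
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
    return sigma
```

In the accumulate setup, refitting each generation to the union of all earlier samples keeps the spread anchored near its original value, which is the toy analogue of why retaining real data delays or prevents collapse.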
Synthetic data inherits and can amplify the biases present in the model that generated it. If a language model has learned gender stereotypes, racial biases, or cultural assumptions from its training data, these biases will be reflected in the synthetic data it produces. When this biased synthetic data is used to train new models, the biases can be amplified in a feedback loop. This is particularly concerning when synthetic data is used at scale, since the volume of biased examples can overwhelm any debiasing efforts applied to the real data portion.
Synthetic data quality is bounded by the capabilities of the generating model. A model cannot generate training data containing knowledge it does not have; it can only rearrange, recombine, and re-present what it has already learned. This means synthetic data is most useful for distillation (training smaller models to approximate larger ones) but has inherent limitations for pushing the frontier of capabilities. Subtle errors, inconsistencies, and hallucinated facts in synthetic text data can propagate to models trained on it. Phi-4 is a partial counterexample: by carefully constructing synthetic data with verified solutions and rejection sampling, Microsoft was able to surpass the teacher model on STEM QA.
LLM-generated text tends to be more homogeneous in style, vocabulary, and structure than human-written text. Models trained heavily on LLM-generated data may lose the diversity of expression found in human language, converging toward a narrow "AI voice." This is related to but distinct from model collapse: even without recursive training, a single generation of synthetic data can lack the variety of human-authored content.
As synthetic data becomes ubiquitous, distinguishing synthetic from real data becomes harder. By April 2025, one estimate suggested that more than 74% of newly created web pages contained AI-generated text. This contamination of the public web means future models trained on internet scrapes will inevitably train on synthetic data, whether intentionally or not, raising the risk of unintended recursive training effects and complicating any attempt to maintain a clean human-only baseline.
Serious synthetic data pipelines invest heavily in filtering and verification. Naively prompting an LLM and saving the outputs produces poor training data. The current best practice involves several layers.
Quality classifiers built on top of LLMs are now standard. The FineWeb-Edu pipeline, released by Hugging Face in 2024, uses Llama-3-70B-Instruct to score 500,000 web samples for educational quality on a 0 to 5 scale, then trains a lightweight regression classifier on Snowflake-arctic-embed embeddings to score the remaining web data at scale. The classifier achieves an F1 of 82% on the binary classification task at threshold 3, and on MMLU, FineWeb-Edu can match the final performance of the much larger Matrix dataset using roughly 10x fewer tokens. Similar LLM-as-judge pipelines are used at filtering time in nearly every modern pretraining run.
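The two-stage pattern (an expensive judge scores a small seed set, then a cheap proxy model scores everything else) can be sketched as follows. Everything here is illustrative: `llm_judge` is a stub standing in for the actual LLM-as-judge API call, and a single keyword-density feature stands in for the embedding model.

```python
EDU_TERMS = {"theorem", "equation", "hypothesis", "experiment", "tutorial"}

def feature(doc):
    # Toy stand-in for an embedding: density of educational vocabulary.
    words = doc.lower().split()
    return sum(w.strip(".,") in EDU_TERMS for w in words) / max(len(words), 1)

def llm_judge(doc):
    # Stand-in for the expensive LLM-as-judge call returning a 0-5 score.
    return min(5.0, 25.0 * feature(doc))

def fit_linear(xs, ys):
    # Cheap proxy: closed-form simple linear regression on the judged seed set.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) or 1e-9
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

seed_docs = [
    "the theorem follows from the equation in this tutorial",
    "a fun story about a cat and a dog",
    "we test the hypothesis with a controlled experiment",
]
slope, intercept = fit_linear(
    [feature(d) for d in seed_docs],
    [llm_judge(d) for d in seed_docs],  # judge only the small seed set
)

def keep(doc, threshold=3.0):
    # Score the rest of the corpus with the cheap proxy alone.
    return slope * feature(doc) + intercept >= threshold
```

The point of the design is cost: the judge is called a few hundred thousand times, while the proxy scores trillions of tokens.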
Deduplication and diversity controls address a known weakness of LLM-generated data: prompting the same model with similar prompts yields similar outputs. MinHash and exact-substring deduplication remove near-duplicate generations, while topic clustering spreads generation across a wide range of inputs. Cosmopedia handled this by clustering web data into 145 topic groups before generation; Phi-4 used 50 distinct synthetic dataset types with different seed sets and prompting procedures.
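A minimal MinHash near-duplicate check might look like the following sketch (word 3-grams and MD5-salted hash functions are arbitrary choices made here for illustration):

```python
import hashlib

def shingles(text, k=3):
    # Represent a document as its set of word k-grams.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    # One minimum value per salted hash function; the fraction of equal
    # positions between two signatures estimates Jaccard similarity.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(a, b):
    sa = minhash_signature(shingles(a))
    sb = minhash_signature(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

In a production pipeline the signatures are banded into a locality-sensitive-hashing index so candidate duplicate pairs are found without an all-pairs comparison.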
For code and math, verification by execution provides a natural quality filter: keep only generated solutions that pass test cases or arrive at the correct answer. This is the basis of DeepSeek's reasoning data pipeline and most modern math/code post-training. The 2025 paper "Escaping Model Collapse via Synthetic Data Verification" (Feng et al., arXiv:2510.16657) extends this approach more broadly, showing that proper verification can mitigate many aspects of model collapse.
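The filter itself is simple in principle. A sketch for Python code candidates follows; the candidates and test cases are invented for illustration, and a real pipeline would sandbox the `exec` call rather than run untrusted generations in-process.

```python
def passes(candidate, test_cases):
    # Keep a generated solution only if it defines solve() and passes all tests.
    namespace = {}
    try:
        exec(candidate, namespace)
        return all(namespace["solve"](*args) == expected
                   for args, expected in test_cases)
    except Exception:
        return False  # syntax errors, crashes, missing solve(), wrong types

candidates = [
    "def solve(a, b):\n    return a + b",   # correct
    "def solve(a, b):\n    return a - b",   # wrong answer
    "def solve(a, b):\n    return a +",     # does not even parse
]
test_cases = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
verified = [c for c in candidates if passes(c, test_cases)]
```

Only the first candidate survives; the rejected generations are discarded before any model ever trains on them.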
For high-stakes applications, human-in-the-loop review remains essential. Constitutional AI's principles were authored by Anthropic researchers; Phi-4's synthetic data was checked through extensive ablations; medical synthetic data is typically validated by domain experts before use. Human review is the slowest and most expensive layer, but it remains the only way to catch certain categories of failure.
Synthetic data is used across nearly every domain that uses machine learning. The following table summarizes the major application areas.
| Domain | Typical use | Representative tools and datasets |
|---|---|---|
| Computer vision | Object detection, segmentation, face recognition; rare-class augmentation | RandAugment, StyleGAN-generated faces, NVIDIA Omniverse Replicator |
| NLP | Instruction tuning, code training, reasoning data | Alpaca, WizardLM, Phi-4, Cosmopedia, UltraChat |
| Speech and audio | TTS for low-resource languages; voice cloning for ASR augmentation | ElevenLabs synthetic voices, Whisper fine-tuning corpora |
| Robotics | Sim-to-real transfer; manipulation policies; navigation | NVIDIA Isaac Sim, Mujoco, Habitat, AI2-THOR |
| Self-driving | Perception, planning, edge-case scenario generation | CARLA, Waymo simulators, NVIDIA DRIVE Sim, Mindtech |
| Healthcare | Privacy-preserving training; rare disease modeling | Synthea, MDClone, MOSTLY AI |
| Finance | Fraud detection; rare-event modeling; regulatory testing | CTGAN, MOSTLY AI, Hazy, Tonic.ai |
| Cybersecurity | Adversarial example generation; intrusion detection training | Fuzz testers, GAN-based malware variants |
| Industrial inspection | Defect detection on rare or expensive parts | Datagen, NVIDIA Omniverse Replicator, Unity Perception |
| Gaming | NPC behavior data; procedural content | Self-play in AlphaGo, AlphaZero; LLM dialogue generation |
In computer vision, synthetic data ranges from classical augmentation (RandAugment, AutoAugment, MixUp) to fully rendered scenes. RandAugment, introduced by Ekin Cubuk and colleagues at CVPR 2020, reduced the data augmentation search space from roughly 10^32 candidate policies to about 100 by using a single severity parameter shared across operations, achieving 85.0% top-1 accuracy on ImageNet at the time of publication. NVIDIA Isaac Sim, Unity Perception, and CARLA produce labeled synthetic images and video for object detection, semantic segmentation, and depth estimation.
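The core of RandAugment is small enough to sketch: sample N operations uniformly at random and apply each at a shared magnitude M. The toy grayscale operations below are invented stand-ins for the paper's actual image transforms.

```python
import random

def brightness(img, m):
    return [[min(255, px + 10 * m) for px in row] for row in img]

def contrast(img, m):
    mean = sum(map(sum, img)) / (len(img) * len(img[0]))
    scale = 1 + 0.05 * m
    return [[max(0, min(255, int(mean + (px - mean) * scale))) for px in row]
            for row in img]

def mirror(img, m):
    return [row[::-1] for row in img]  # magnitude-free, like the paper's flips

OPS = [brightness, contrast, mirror]

def randaugment(img, n=2, m=9, rng=random):
    # The whole policy: n uniformly sampled ops, one shared severity m.
    for op in rng.choices(OPS, k=n):
        img = op(img, m)
    return img
```

With N and M each swept over roughly ten values, the tuning grid has about 100 points, which is what replaces AutoAugment's search over ~10^32 learned policies.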
Synthetic data is now central to LLM post-training for instruction following, code, math, reasoning, and safety alignment. The patterns set by Self-Instruct, Alpaca, and Evol-Instruct dominate the open ecosystem; Constitutional AI dominates synthetic safety alignment; the Phi recipe dominates synthetic pretraining.
In robotics and self-driving, simulation-based synthetic data is the only practical way to obtain certain edge cases (jaywalking pedestrians at night, sensor failures, rare weather). CARLA has been a workhorse since 2017. NVIDIA's Omniverse and Cosmos platforms now generate large-scale synthetic video for training robot foundation models. The 2024-2026 trend toward video world models and physics simulators feeding robotics pipelines makes synthetic data central to embodied AI.
Synthea provides openly available synthetic patient records that comply with U.S. healthcare data formats. Differentially private synthesis is being applied to behavioral health datasets and clinical trial data. A 2024 Google DeepMind study found that complementing real data with synthetic data improved robustness across histopathology, radiology, and dermatology tasks.
Synthetic data is widely used in finance for fraud detection (where positive examples are rare) and regulatory testing (where real data cannot be shared between institutions). MOSTLY AI counts Fortune 100 banks and insurers among its core clients. CTGAN-based pipelines are commonly used for tabular fraud datasets.
| Company / project | Focus | Notes |
|---|---|---|
| MOSTLY AI | Synthetic tabular data for enterprises | Core clients are Fortune 100 banks, insurers, and telcos; raised $25M Series B in 2022 |
| Gretel.ai | Synthetic data platform with natural-language interface | Acquired by NVIDIA in 2025 to support its AI and cloud offerings |
| Tonic.ai | Synthetic data for software testing and ML | Strong presence in healthcare and fintech |
| Hazy | Synthetic data for financial services | UK-based, regulated industries focus |
| MITRE Synthea | Open-source synthetic patient generator | 1M+ patient records freely available; HL7 FHIR compliant |
| MIT SDV | Open-source synthetic data library | Statistical, GAN, and VAE methods for tabular data |
| Datagen | Synthetic data for computer vision | Faces, hands, indoor scenes |
| Mindtech | Synthetic data for video and surveillance | Focus on ethical, balanced datasets |
| NVIDIA Omniverse / Replicator / Cosmos | Industrial-grade synthetic data for robotics and self-driving | Used by Skild AI and others for robot policy training |
| Microsoft Phi team | Synthetic-first LLM pretraining | Phi-1 through Phi-4 demonstrated the approach |
| Anthropic | Synthetic preference data via Constitutional AI | RLAIF foundation of Claude alignment |
| Hugging Face Cosmopedia | Open synthetic pretraining dataset | 25B tokens via Mixtral |
The rapid growth of synthetic data has attracted regulatory attention worldwide.
The European Union has been the most active regulator in this space. The EU AI Act, which began phased implementation in 2024, includes provisions relevant to synthetic data. High-risk AI systems must document their training data, including any synthetic components. In April 2025, the European Data Protection Board issued guidelines specifically addressing synthetic data generation under GDPR, recognizing its potential for privacy preservation while establishing a framework for compliant generation. The guidelines require organizations to demonstrate that synthetic data cannot be re-identified and that the generation process does not involve unauthorized processing of personal data.
Several jurisdictions now require or are considering requirements for labeling AI-generated content, which extends to synthetic training data. The goal is to maintain data provenance and enable downstream users to understand what proportion of a model's training data was synthetic versus human-generated.
The use of one model's outputs to train another raises unresolved IP questions. OpenAI's terms of service have historically restricted using its API outputs to train competing models. Meta's Llama 3 license update specifically allows distillation. The legal status of synthetic training data generated by commercial models remains uncertain and varies by jurisdiction.
The synthetic data market has experienced rapid growth, though estimates vary widely with the research firm and market definition. Industry estimates place the global synthetic data generation market at approximately $580 million in 2025, with projections ranging from $2.67 billion by 2030 (at a 39.4% CAGR) to $7.22 billion by 2033 (at a 37.65% CAGR). By a broader market definition, the synthetic tabular sub-market alone was estimated at $1.36 billion in 2024 and projected to reach $1.88 billion in 2025.
| Year | Estimated market size (USD) | Key drivers |
|---|---|---|
| 2023 | ~$300M | LLM training boom; Phi models demonstrate synthetic pretraining viability |
| 2024 | ~$400-575M | Enterprise adoption for privacy compliance; EU AI Act implementation begins; Phi-4 released |
| 2025 | ~$580M | Mainstream adoption across industries; GDPR synthetic data guidelines issued; NVIDIA acquires Gretel |
| 2026 (projected) | ~$770M | Continued growth driven by data scarcity and regulatory requirements |
| 2030 (projected) | ~$2.7B | Established component of AI training infrastructure |
Major technology companies are heavily invested. Microsoft has built its Phi line around synthetic data. Google has used synthetic data extensively in training its Gemini models. Anthropic relies on synthetic preference data for Claude. Meta updated the Llama license specifically to encourage distillation. The data licensing market has also grown, with companies like Reddit and News Corp signing deals with AI labs to provide verified human-generated content as an anchor against the risks of pure synthetic training.
As of early 2026, synthetic data is a routine component of AI training pipelines, and the field is grappling with several evolving challenges.
The supply of high-quality, human-generated text data is tightening. Research suggests that the stock of high-quality text on the internet suitable for LLM training may approach its practical limits within the next few years. This scarcity has made synthetic data not merely convenient but necessary for continued scaling. The contamination of the public web with AI-generated content also means that even "organic" web scrapes increasingly contain synthetic text, blurring the distinction between real and synthetic data.
The most successful current approaches combine synthetic and real data strategically. Apple's "Rephrasing the Web" approach, Microsoft's Phi series, and Hugging Face's Cosmopedia all demonstrate that the best results come from anchoring synthetic data in human-generated foundations. The consensus recommendation is to never fully replace real data with synthetic data, but to use synthetic data to augment, diversify, and extend it.
Increasingly sophisticated pipelines verify and filter synthetic data before training. These include automated quality scoring (FineWeb-Edu style), factual consistency checking, diversity metrics, and human-in-the-loop review. Recent research on synthetic data verification suggests that proper verification can mitigate many of the risks associated with synthetic data, including some aspects of model collapse.
One of the most active areas is using synthetic data to improve reasoning. This includes step-by-step mathematical proofs, code solutions with test cases, and logical reasoning chains. The advantage in this domain is that verifiable rewards provide a natural quality filter: if the generated solution passes the test cases, it is useful regardless of who or what produced it. DeepSeek's R1 distillation pipeline and Microsoft's Phi-4-reasoning report exemplify this trend.
With the rise of multimodal models, synthetic data generation has expanded beyond text and images to include video, audio, 3D scenes, and interleaved modalities. Sora and other generative video models are being used to produce training data for robotics, autonomous driving, and embodied AI. NVIDIA's Cosmos world models generate photorealistic synthetic video specifically intended for training robot policies.
The open-source tooling around synthetic data has matured rapidly. Active projects include the Synthetic Data Vault (SDV), YData synthetic, Hugging Face's synthetic-data-generator, llm-swarm (used for Cosmopedia), distilabel (used by Argilla and others to build instruction datasets), and Microsoft's AgentInstruct framework. Commercial offerings from Gretel (now NVIDIA), MOSTLY AI, Tonic.ai, and Hazy provide enterprise platforms with built-in privacy controls.