# Autoregressive Model

> Source: https://aiwiki.ai/wiki/autoregressive_model
> Updated: 2026-06-22
> Categories: Large Language Models, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

An **autoregressive model** predicts each element of a sequence from the elements that precede it, feeding its own earlier outputs back in as context for every later prediction. The term covers both the linear AR(p) models that statisticians have fitted to [time series](/wiki/time_series) since the 1920s and the neural sequence models at the center of modern generative AI: the GPT family and most other [large language models](/wiki/large_language_model) (LLMs) are trained on exactly this next-token prediction task. The approach's defining weakness, strictly sequential decoding, has made room since early 2025 for parallel challengers, most visibly text [diffusion models](/wiki/diffusion_model) such as Mercury and [Gemini Diffusion](/wiki/gemini_diffusion).

Arguably the most consequential autoregressive model to date is [GPT-3](/wiki/gpt_3), which its 2020 paper describes as "an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model" [9]. Its central result, that a network trained only to predict the next token becomes a few-shot, in-context learner at sufficient scale, is a major reason next-token prediction now underpins the flagship assistants from [OpenAI](/wiki/openai), [Anthropic](/wiki/anthropic), and [Google](/wiki/google).

## Overview

Generation from an autoregressive model proceeds by ancestral sampling: draw the first element, feed it back in to draw the second, and continue until an end-of-sequence token or a length limit. The chain rule of probability guarantees that any joint distribution over a sequence can be written as a product of per-step conditionals, so the factorization itself loses no generality; models differ in the order in which elements are generated and in the function that computes each conditional. A classical AR(p) process uses a fixed linear map over the last p values, while a decoder-only [Transformer](/wiki/transformer) applies a deep network to everything in its context window.
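
The sampling loop and the chain-rule factorization described above can be sketched in a few lines of Python. This is a minimal illustration, not any library's API: the three-symbol vocabulary, the `TRANSITIONS` and `START` tables, and the function names are all hypothetical, and the toy model conditions only on the previous symbol to keep the example short.

```python
import numpy as np

def ancestral_sample(next_dist, eos, max_len, rng):
    """Draw one sequence left to right from a conditional next-element model."""
    seq = []
    for _ in range(max_len):
        probs = next_dist(seq)                      # p(x_t | x_1, ..., x_{t-1})
        token = int(rng.choice(len(probs), p=probs))
        seq.append(token)                           # fed back in as context
        if token == eos:
            break
    return seq

# Hypothetical toy model: vocabulary {0, 1, 2}, with 2 acting as end-of-sequence.
TRANSITIONS = np.array([
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
    [0.0, 0.0, 1.0],
])
START = np.array([0.5, 0.5, 0.0])

def toy_next_dist(prefix):
    """Distribution over the next symbol given the prefix generated so far."""
    return START if not prefix else TRANSITIONS[prefix[-1]]

def sequence_log_prob(seq, next_dist):
    """Chain rule: log p(x) is the sum of per-step conditional log-probabilities."""
    return sum(np.log(next_dist(seq[:t])[seq[t]]) for t in range(len(seq)))

rng = np.random.default_rng(0)
sample = ancestral_sample(toy_next_dist, eos=2, max_len=20, rng=rng)
```

Swapping `toy_next_dist` for a function that runs a neural network over the whole prefix changes the quality of the conditionals but not the loop itself.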

Three properties explain the recipe's dominance in generative AI. Training is ordinary supervised prediction of the next element at every position of unlabeled data, so it scales to internet-sized corpora without annotation. The model assigns an exact likelihood to any sequence, which a [generative adversarial network](/wiki/generative_adversarial_network) does not, and that likelihood enables clean evaluation by [perplexity](/wiki/perplexity). And a single trained network can emit outputs of arbitrary length. The price is that element t cannot be sampled until element t-1 exists.

## What is the autoregressive factorization?

An autoregressive model factorizes the joint probability of a sequence as

`p(x_1, ..., x_T) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ... p(x_T | x_1, ..., x_{T-1})`

so that each factor conditions only on earlier elements. The classical AR(p) model of time-series analysis makes each step a linear function of the p most recent values plus noise:

`X_t = c + phi_1 X_{t-1} + phi_2 X_{t-2} + ... + phi_p X_{t-p} + e_t`

where the phi coefficients are estimated from data, for instance via the Yule-Walker equations, and e_t is zero-mean white noise. The process is stationary when every root of the characteristic polynomial `1 - phi_1 z - ... - phi_p z^p` lies outside the unit circle; for AR(1) this reduces to |phi_1| < 1.
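
As a hedged illustration of the AR(p) recursion, the Yule-Walker estimate, and the stationarity check, the following NumPy sketch simulates a process and recovers its coefficients. The function names, the burn-in length, and the example coefficients are assumptions made for this example; the autocovariances use the common biased (divide-by-n) estimator.

```python
import numpy as np

def simulate_ar(phi, c, n, rng, burn_in=500):
    """Simulate X_t = c + phi_1 X_{t-1} + ... + phi_p X_{t-p} + e_t."""
    p = len(phi)
    x = np.zeros(n + burn_in)
    noise = rng.standard_normal(n + burn_in)        # zero-mean white noise e_t
    for t in range(p, n + burn_in):
        # x[t - p:t][::-1] is (X_{t-1}, X_{t-2}, ..., X_{t-p})
        x[t] = c + np.dot(phi, x[t - p:t][::-1]) + noise[t]
    return x[burn_in:]                              # drop the start-up transient

def yule_walker(x, p):
    """Estimate phi_1..phi_p by solving the Yule-Walker equations R phi = r."""
    x = x - x.mean()
    n = len(x)
    # Biased sample autocovariances at lags 0..p
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

def is_stationary(phi):
    """True if all roots of 1 - phi_1 z - ... - phi_p z^p lie outside the unit circle."""
    coeffs = np.r_[-np.asarray(phi, dtype=float)[::-1], 1.0]  # highest degree first
    return bool(np.all(np.abs(np.roots(coeffs)) > 1.0))

rng = np.random.default_rng(0)
true_phi = np.array([0.6, -0.3])                    # hypothetical AR(2) coefficients
series = simulate_ar(true_phi, c=1.0, n=20000, rng=rng)
estimated_phi = yule_walker(series, p=2)
```

With a long enough series the estimate should land close to `true_phi`; a coefficient vector such as `[1.2]` fails the `is_stationary` check because its root lies inside the unit circle.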

Neural autoregressive models keep the factorization but compute each conditional with a network that outputs a full distribution over the next element, usually a softmax over a discrete vocabulary. Minimizing next-token cross-entropy is then exactly maximum likelihood training. Where AR(p) is Markovian of order p, a Transformer conditions on its whole context window, which in 2026 frontier models can span hundreds of thousands of tokens.
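
The link between cross-entropy and likelihood can be checked numerically. The sketch below is illustrative only: the logits are random stand-ins for a network's outputs, and the shapes (5 positions, vocabulary of 8) and function names are assumptions. It shows that the summed next-token loss is the negative log of the factorized joint probability, and that exponentiating the mean loss gives perplexity.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log of the softmax along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def next_token_cross_entropy(logits, targets):
    """Mean negative log-probability of each target token under its step's softmax."""
    log_probs = log_softmax(logits)
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))        # hypothetical: row t scores the token at step t
targets = np.array([3, 1, 7, 0, 2])     # the tokens that actually occurred

loss = next_token_cross_entropy(logits, targets)
joint_log_prob = -loss * len(targets)   # log p(x_1, ..., x_T) under the factorization
perplexity = np.exp(loss)               # exponentiated mean loss
```

Lowering `loss` therefore raises `joint_log_prob` by construction, which is why gradient descent on this loss is maximum likelihood estimation.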

## When did autoregressive modeling start?

George Udny Yule introduced autoregression in 1927 to explain the quasi-periodic behavior of sunspot numbers, modeling each observation as a linear function of past observations plus random shocks rather than as a deterministic cycle [1]. Gilbert Walker extended the scheme in the 1930s, and the Box-Jenkins methodology of 1970 made AR and ARIMA models the workhorses of statistical forecasting in economics and industry [2].

The neural lineage began with the 2003 neural probabilistic language model of Bengio and colleagues, a feed-forward network that predicted the next word from learned embeddings of the preceding words [3]. Larochelle and Murray's NADE (2011) generalized the idea beyond language, estimating distributions over high-dimensional binary data with a weight-shared chain of neural conditionals [4]. In 2016 [Google DeepMind](/wiki/google_deepmind) showed the approach's reach across modalities: [PixelRNN](/wiki/pixelrnn) and PixelCNN generated images pixel by pixel, winning an ICML 2016 best paper award [5], while [WaveNet](/wiki/wavenet) used dilated causal convolutions to synthesize raw audio one sample at a time [6].

The Transformer (2017) removed the main training bottleneck: causal self-attention lets every position of a training sequence be predicted in parallel [7]. OpenAI's [GPT](/wiki/gpt) ([GPT-1](/wiki/gpt_1), June 2018), a 12-layer [decoder](/wiki/decoder)-only Transformer trained on the BookCorpus to predict the next token from left context, established decoder-only generative pretraining [8]. [GPT-2](/wiki/gpt_2) (2019) and GPT-3 (2020), the latter with 175 billion parameters, showed that scale turns next-token prediction into few-shot, in-context learning [9], and [ChatGPT](/wiki/chatgpt) (November 2022) carried the paradigm to a mass audience. Autoregression has since pushed back into vision as well: visual autoregressive modeling (VAR), which generates images by next-scale prediction, won the NeurIPS 2024 best paper award [10].

## How do autoregressive language models work?

A modern LLM is a [decoder](/wiki/decoder)-only Transformer trained with a causal attention mask: position t attends only to positions 1 through t, never ahead. During training the model therefore predicts every next token of a sequence simultaneously, each prediction conditioned on the true prefix rather than on the model's own output, a regime known as teacher forcing in the recurrent-network literature [11]. The loss, average next-token cross-entropy, falls approximately as a power law in model size, dataset size, and training compute, the empirical foundation of LLM [scaling laws](/wiki/scaling_laws) [12]. Text is first segmented into subword units by a tokenizer such as [byte pair encoding](/wiki/byte_pair_encoding), so "next token" means the next subword rather than the next character or word.
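
A single-head causal attention layer and the teacher-forcing input/target shift can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: one head, no learned projections, random vectors in place of real activations, and made-up token ids.

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention where position t sees only positions <= t."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)           # block attention to later positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 3)) for _ in range(3))    # hypothetical activations
out, weights = causal_attention(q, k, v)

# Teacher forcing: inputs are the true tokens, targets are the same tokens shifted by one.
tokens = np.array([5, 2, 9, 4, 7])                       # hypothetical token ids
inputs, targets = tokens[:-1], tokens[1:]
```

Because the mask zeroes every weight on later positions, all rows of `out` can be computed in one pass while each row still depends only on its own prefix, which is what lets training predict every position in parallel.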

Post-training stages such as instruction tuning and [RLHF](/wiki/rlhf) reshape which continuations the model prefers but leave the autoregressive machinery untouched: a chat assistant still emits its reply one token at a time, conditioned on the prompt plus everything it has generated so far.

## How is text sampled from the model?

At each step the network defines a probability distribution over the next token; a decoding algorithm decides what to do with it. Greedy decoding takes the most probable token, and [beam search](/wiki/beam_search) tracks several high-likelihood prefixes, but likelihood-maximizing decoding is known to produce repetitive, degenerate text in open-ended generation [13]. Production systems therefore sample: temperature scaling divides the logits by a constant before the softmax, sharpening (below 1) or flattening (above 1) the distribution; top-k sampling restricts choices to the k most probable tokens [14]; nucleus (top-p) sampling restricts them to the smallest set of most probable tokens whose cumulative probability reaches p [13]. Repetition and frequency penalties are common heuristics layered on top, and constrained decoding masks any token that would violate a target grammar or JSON schema.
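
The three sampling controls compose naturally into one function. The following is a minimal sketch rather than the implementation of any particular inference library; the function name, argument names, and example logits are assumptions, and ties between equally probable tokens are broken arbitrarily.

```python
import numpy as np

def sample_next(logits, rng, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id after temperature, top-k, and nucleus (top-p) filtering."""
    scaled = np.asarray(logits, dtype=float) / temperature   # <1 sharpens, >1 flattens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(-probs)                  # token ids, most probable first
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False             # drop everything outside the top k
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # Smallest prefix of the ranking whose cumulative mass reaches top_p
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False

    filtered = np.where(keep, probs, 0.0)
    filtered /= filtered.sum()                  # renormalize over the surviving tokens
    return int(rng.choice(len(filtered), p=filtered)), filtered

rng = np.random.default_rng(0)
logits = np.log([0.5, 0.3, 0.15, 0.05])         # hypothetical next-token distribution
token, dist = sample_next(logits, rng, temperature=0.8, top_k=3, top_p=0.9)
```

Greedy decoding is the limit of this function as `temperature` approaches zero, where nearly all mass collapses onto the most probable token.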

Two systems techniques define modern autoregressive inference. The KV cache stores every layer's attention keys and values for past positions, so each new token costs one incremental forward pass instead of a full reread of the prefix, at the price of memory that grows with context length. [Speculative decoding](/wiki/speculative_decoding) attacks the sequential bottleneck directly: a cheap draft model (or auxiliary draft heads) proposes several tokens ahead, and the large model verifies them in a single parallel pass, accepting or rejecting them so that the output distribution is provably unchanged; reported speedups are roughly two to three times [15][16].
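
The accept-or-reject rule at the heart of speculative decoding can be illustrated for a single drafted token. This sketch follows the rule described in the speculative decoding papers [15][16] in simplified form: the two distributions are hypothetical fixed vectors rather than model outputs, and a real system would apply the step to several drafted positions verified in one pass.

```python
import numpy as np

def speculative_step(p_target, q_draft, rng):
    """Return (token, accepted) with the token distributed exactly as p_target."""
    x = int(rng.choice(len(q_draft), p=q_draft))          # draft model proposes x ~ q
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True                                     # target model accepts the draft
    # Rejected: resample from the part of p that q under-covers, renormalized.
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False

rng = np.random.default_rng(0)
p_target = np.array([0.6, 0.3, 0.1])    # hypothetical large-model distribution
q_draft = np.array([0.2, 0.5, 0.3])     # hypothetical draft-model distribution

n = 20000
draws = [speculative_step(p_target, q_draft, rng) for _ in range(n)]
empirical = np.bincount([tok for tok, _ in draws], minlength=3) / n
acceptance_rate = np.mean([accepted for _, accepted in draws])
```

The empirical frequencies should match `p_target` rather than `q_draft`, which is the sense in which the output distribution is unchanged; the acceptance rate, and hence the speedup, depends on how closely the draft distribution tracks the target.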

## How does autoregression differ from diffusion?

The central limitation is latency. A T-token answer requires T sequential network evaluations, and decoding is usually bound by memory bandwidth rather than arithmetic, so accelerators sit partly idle; long [chain-of-thought](/wiki/chain_of_thought) reasoning traces make the cost more visible. A second issue is exposure bias, the train-test mismatch created by teacher forcing: the model only ever saw true prefixes during training, yet at inference it conditions on its own samples, so an early mistake can drag generation off-distribution and compound. Scheduled sampling was an early remedy [17], and Ranzato and colleagues named and analyzed the problem [18]; large modern models are fairly robust to it, but error accumulation over very long outputs remains a practical concern. Finally, the left-to-right order is itself a modeling choice, natural for text and audio but arbitrary for images and tables.
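
A back-of-the-envelope estimate shows why bandwidth-bound decoding makes latency scale with output length. The numbers below are hypothetical round figures, not measurements of any real model or accelerator, and the bound ignores KV-cache reads, batching, and any overlap of compute with memory traffic.

```python
def min_decode_seconds_per_token(weight_bytes, bandwidth_bytes_per_s):
    """Lower bound on single-stream latency if each step streams all weights once."""
    return weight_bytes / bandwidth_bytes_per_s

# Hypothetical round numbers chosen only for illustration:
weight_bytes = 8e9 * 2      # 8 billion parameters stored at 2 bytes each
bandwidth = 2e12            # 2 TB/s of accelerator memory bandwidth

per_token = min_decode_seconds_per_token(weight_bytes, bandwidth)
tokens_per_second = 1.0 / per_token
answer_seconds = 1000 * per_token   # a 1,000-token answer takes 1,000 sequential steps
```

Under these assumptions a single stream tops out near 125 tokens per second regardless of how much arithmetic throughput the accelerator has, so a 1,000-token answer takes about 8 seconds.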

Several research lines relax the recipe. [BERT](/wiki/bert)-style masked language models train bidirectionally by reconstructing randomly masked tokens; they make strong encoders for understanding tasks but do not define a tractable generative process for free-form text [19]. [XLNet](/wiki/xlnet) autoregressed over random permutations of the factorization order [20], and non-autoregressive machine translation generated all output positions in parallel, recovering quality through iterative refinement [21].

The strongest current challenger is diffusion. Discrete diffusion for text was developed in D3PM (2021) [22] and Diffusion-LM (2022) [23], and it reached scale in 2025. [LLaDA](/wiki/llada), an 8-billion-parameter masked diffusion model trained from scratch on 2.3 trillion tokens by researchers from Renmin University of China and Ant Group, performed comparably to [Llama 3](/wiki/llama_3) 8B and, freed from left-to-right order, mitigated the [reversal curse](/wiki/reversal_curse) that afflicts autoregressive LLMs [24]. [Inception Labs](/wiki/inception_labs), co-founded by Stanford professor Stefano Ermon, launched Mercury in February 2025 as the first commercial-scale diffusion LLM; in independent [Artificial Analysis](/wiki/artificial_analysis) measurements, Mercury Coder Mini reached 1,109 tokens per second and Mercury Coder Small 737 tokens per second on NVIDIA H100 GPUs [25]. Google DeepMind demoed Gemini Diffusion, an experimental text model reported at 1,000 to 2,000 tokens per second, at Google I/O in May 2025 [26]. Open and industrial follow-ups such as Dream and [ByteDance](/wiki/bytedance)'s [Seed Diffusion](/wiki/seed_diffusion) target the same speed advantage [27], while block diffusion hybrids generate text autoregressively block by block and denoise in parallel within each block [28].
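
The contrast with token-by-token decoding can be made concrete with a toy parallel-unmasking loop. This is a loose sketch of confidence-based unmasking, not the published sampler of LLaDA, Mercury, or any other named model: the `toy_denoiser` returns random distributions in place of a trained network, and the `MASK` sentinel and commit schedule are assumptions.

```python
import numpy as np

MASK = -1  # hypothetical sentinel marking a position that is still masked

def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for a trained network: a random distribution for every position."""
    logits = rng.normal(size=(len(tokens), vocab_size))
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return probs / probs.sum(axis=-1, keepdims=True)

def masked_diffusion_decode(length, steps, vocab_size, rng):
    """Fill a fully masked sequence in a fixed number of parallel refinement steps."""
    tokens = np.full(length, MASK)
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = toy_denoiser(tokens, vocab_size, rng)    # one pass scores every position
        guesses = probs.argmax(axis=-1)
        confidence = probs.max(axis=-1)
        # Commit an equal share of the remaining masked positions, most confident
        # first; the others stay masked and are predicted again on the next step.
        n_commit = int(np.ceil(masked.size / (steps - step)))
        chosen = masked[np.argsort(-confidence[masked])[:n_commit]]
        tokens[chosen] = guesses[chosen]
    return tokens

rng = np.random.default_rng(0)
decoded = masked_diffusion_decode(length=16, steps=4, vocab_size=10, rng=rng)
```

Here 16 positions are filled with 4 network evaluations instead of 16, which is the source of the throughput advantage; output quality then depends on how well the denoiser handles several positions committed at once.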

| Paradigm | Training signal | Generation | Representative models |
|---|---|---|---|
| Autoregressive | Next-token prediction with causal masking | Sequential, one token per step | [GPT-4](/wiki/gpt_4), [Claude](/wiki/claude), [Llama](/wiki/llama), WaveNet |
| Masked language model | Reconstruct randomly masked tokens | Not naturally generative; used for embeddings and classification | BERT, [RoBERTa](/wiki/roberta) |
| Discrete diffusion | Iteratively unmask or denoise the whole sequence | Parallel refinement over a fixed number of steps | LLaDA, Mercury, Gemini Diffusion, Seed Diffusion |
| Non-autoregressive and any-order | Permutation, insertion, or refinement objectives | Parallel or flexible order | XLNet, non-autoregressive translation models |

As of June 2026 the flagship assistants from OpenAI, Anthropic, and Google remain autoregressive, with speculative decoding and other inference optimizations narrowing the speed gap, and controlled comparisons between the paradigms at matched scale are still an active research topic [27].

## References

1. Yule, G. U. (1927). "On a Method of Investigating Periodicities in Disturbed Series, with Special Reference to Wolfer's Sunspot Numbers". Philosophical Transactions of the Royal Society of London, Series A, 226: 267-298.
2. Box, G. E. P.; Jenkins, G. M. (1970). Time Series Analysis: Forecasting and Control. Holden-Day.
3. Bengio, Y.; Ducharme, R.; Vincent, P.; Jauvin, C. (2003). "A Neural Probabilistic Language Model". Journal of Machine Learning Research 3: 1137-1155. https://www.jmlr.org/papers/v3/bengio03a.html
4. Larochelle, H.; Murray, I. (2011). "The Neural Autoregressive Distribution Estimator". AISTATS 2011. https://proceedings.mlr.press/v15/larochelle11a.html
5. van den Oord, A.; Kalchbrenner, N.; Kavukcuoglu, K. (2016). "Pixel Recurrent Neural Networks". ICML 2016. https://arxiv.org/abs/1601.06759
6. van den Oord, A. et al. (2016). "WaveNet: A Generative Model for Raw Audio". arXiv. https://arxiv.org/abs/1609.03499
7. Vaswani, A. et al. (2017). "Attention Is All You Need". NeurIPS 2017. https://arxiv.org/abs/1706.03762
8. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training". OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
9. Brown, T. et al. (2020). "Language Models are Few-Shot Learners". NeurIPS 2020. https://arxiv.org/abs/2005.14165
10. Tian, K.; Jiang, Y.; Yuan, Z.; Peng, B.; Wang, L. (2024). "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". NeurIPS 2024. https://arxiv.org/abs/2404.02905
11. Williams, R. J.; Zipser, D. (1989). "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks". Neural Computation 1(2): 270-280.
12. Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models". arXiv. https://arxiv.org/abs/2001.08361
13. Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; Choi, Y. (2020). "The Curious Case of Neural Text Degeneration". ICLR 2020. https://arxiv.org/abs/1904.09751
14. Fan, A.; Lewis, M.; Dauphin, Y. (2018). "Hierarchical Neural Story Generation". ACL 2018. https://arxiv.org/abs/1805.04833
15. Leviathan, Y.; Kalman, M.; Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding". ICML 2023. https://arxiv.org/abs/2211.17192
16. Chen, C. et al. (2023). "Accelerating Large Language Model Decoding with Speculative Sampling". arXiv. https://arxiv.org/abs/2302.01318
17. Bengio, S.; Vinyals, O.; Jaitly, N.; Shazeer, N. (2015). "Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks". NeurIPS 2015. https://arxiv.org/abs/1506.03099
18. Ranzato, M.; Chopra, S.; Auli, M.; Zaremba, W. (2016). "Sequence Level Training with Recurrent Neural Networks". ICLR 2016. https://arxiv.org/abs/1511.06732
19. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL 2019. https://arxiv.org/abs/1810.04805
20. Yang, Z. et al. (2019). "XLNet: Generalized Autoregressive Pretraining for Language Understanding". NeurIPS 2019. https://arxiv.org/abs/1906.08237
21. Gu, J.; Bradbury, J.; Xiong, C.; Li, V. O. K.; Socher, R. (2018). "Non-Autoregressive Neural Machine Translation". ICLR 2018. https://arxiv.org/abs/1711.02281
22. Austin, J. et al. (2021). "Structured Denoising Diffusion Models in Discrete State-Spaces". NeurIPS 2021. https://arxiv.org/abs/2107.03006
23. Li, X. L. et al. (2022). "Diffusion-LM Improves Controllable Text Generation". NeurIPS 2022. https://arxiv.org/abs/2205.14217
24. Nie, S. et al. (2025). "Large Language Diffusion Models". NeurIPS 2025. https://arxiv.org/abs/2502.09992
25. Inception Labs (2025). "Mercury: Ultra-Fast Language Models Based on Diffusion". arXiv. https://arxiv.org/abs/2506.17298 (launch announcement: https://www.inceptionlabs.ai/blog/introducing-mercury)
26. Google DeepMind (2025). "Gemini Diffusion". https://deepmind.google/models/gemini-diffusion/
27. "A Survey on Diffusion Language Models" (2025). arXiv. https://arxiv.org/abs/2508.10875
28. Arriola, M. et al. (2025). "Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models". ICLR 2025. https://arxiv.org/abs/2503.09573
