Papers
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 5,250 words
This page is a chronological index of AI research papers that shaped the field. It covers the foundations of deep learning (LeNet, ImageNet, AlexNet, ResNet), the transformer era (Attention Is All You Need, BERT, GPT, T5), the scaling era (GPT-3, Chinchilla, PaLM, LLaMA), the alignment era (InstructGPT, RLHF, Constitutional AI, DPO), and the multimodal and reasoning era (GPT-4, Gemini, Llama 3, DeepSeek-R1). It also tracks landmark work in computer vision, reinforcement learning, audio, robotics, and scientific AI (AlphaGo, MuZero, AlphaFold).
There is no committee that ranks AI papers. Importance is a rough mix of three things: technical novelty, citation count, and downstream impact on products people actually use. A paper that introduces a new architecture (the transformer, residual networks, mixture of experts) tends to stay important for years because every later paper builds on top of it. A paper that introduces a benchmark (ImageNet, GLUE, SuperGLUE, MMLU) shapes what the next decade of research optimizes for. A paper that opens up a new capability (GANs, diffusion, in-context learning, chain of thought) gets re-cited every time someone tries to extend or critique it.
There is also a softer kind of importance: papers that change how the field talks to itself. "Sparks of AGI" did this for GPT-4. "Emergent Abilities of Large Language Models" did it for scaling. "Constitutional AI" did it for alignment without exhaustive human labeling. These papers are not always the most technically deep, but they reframe debates that everyone else then has to respond to.
A few practical filters are useful when reading the table below:
The table below skews toward papers that meet at least one of those tests. It is not exhaustive. The "Important Papers" section lists the canonical first appearances of major ideas. The "Other Papers" section lists notable follow-ups, benchmarks, and applied work.
Dates use the arXiv submission date (v1) when available, the conference or journal publication date otherwise. "Source" links go to the arXiv abstract page, the publisher PDF, or the lab's official release. "Organization" lists the primary affiliation of the first author or the lab that led the work. "Product" lists the model, system, or technique name that the paper introduced. Some early papers predate the modern convention of naming a model in the title, so the product column is blank.
For papers that have their own dedicated wiki entry, the title links to that entry.
Important Papers

| Name | Date | Source | Type | Organization | Product | Note |
|---|---|---|---|---|---|---|
| Long Short-Term Memory | 1997/11 | Neural Computation 9(8) | Natural Language Processing | | LSTM | Hochreiter and Schmidhuber introduce gated memory cells for recurrent networks |
| Gradient-Based Learning Applied to Document Recognition (LeNet-5) | 1998/11 | Proceedings of the IEEE | Computer Vision | AT&T Labs | LeNet-5 | LeCun et al., convolutional networks for digit recognition |
| ImageNet: A Large-Scale Hierarchical Image Database | 2009/06/20 | CVPR 2009 PDF | Computer Vision | Princeton | ImageNet | Deng, Dong, Socher, Li, Li, Fei-Fei |
| ImageNet Classification with Deep Convolutional Neural Networks (AlexNet) | 2012 | AlexNet Paper | Computer Vision | University of Toronto | AlexNet | Krizhevsky, Sutskever, Hinton |
| Efficient Estimation of Word Representations in Vector Space (Word2Vec) | 2013/01/16 | arxiv:1301.3781 | Natural Language Processing | | Word2Vec | |
| Playing Atari with Deep Reinforcement Learning (DQN) | 2013/12/19 | arxiv:1312.5602 | Reinforcement Learning | DeepMind | DQN (Deep Q-Learning) | |
| Generative Adversarial Networks (GAN) | 2014/06/10 | arxiv:1406.2661 | Computer Vision | Universite de Montreal | GAN (Generative Adversarial Network) | Goodfellow et al. |
| Very Deep Convolutional Networks for Large-Scale Image Recognition (VGGNet) | 2014/09/04 | arxiv:1409.1556 | Computer Vision | Oxford VGG | VGGNet | |
| Sequence to Sequence Learning with Neural Networks (Seq2Seq) | 2014/09/10 | arxiv:1409.3215 | Natural Language Processing | | Seq2Seq | |
| Going Deeper with Convolutions (GoogLeNet) | 2014/09/17 | arxiv:1409.4842 | Computer Vision | | GoogLeNet | |
| Adam: A Method for Stochastic Optimization | 2014/12/22 | arxiv:1412.6980 | Optimization | University of Amsterdam, University of Toronto | Adam | Kingma and Ba, the default optimizer for years |
| Deep Residual Learning for Image Recognition (ResNet) | 2015/12/10 | arxiv:1512.03385 | Computer Vision | Microsoft Research | ResNet | He et al., identity skip connections for very deep networks |
| Mastering the game of Go with deep neural networks and tree search (AlphaGo) | 2016/01/28 | Nature 529 | Reinforcement Learning | DeepMind | AlphaGo | Silver et al., defeated Lee Sedol two months later |
| Asynchronous Methods for Deep Reinforcement Learning (A3C) | 2016/02/04 | arxiv:1602.01783 | Reinforcement Learning | DeepMind | A3C | |
| WaveNet: A Generative Model for Raw Audio | 2016/09/12 | arxiv:1609.03499 | Audio | DeepMind | WaveNet | |
| Attention Is All You Need (Transformer) | 2017/06/12 | arxiv:1706.03762 | Natural Language Processing | Google Brain | Transformer | Vaswani et al., the foundation of every modern LLM |
| Deep reinforcement learning from human preferences | 2017/06/12 | arxiv:1706.03741 | Reinforcement Learning | OpenAI, DeepMind | RLHF | Christiano et al., the original RLHF paper |
| Proximal Policy Optimization Algorithms (PPO) | 2017/07/20 | arxiv:1707.06347 | Reinforcement Learning | OpenAI | PPO | Used in ChatGPT and most RLHF pipelines |
| Mastering the game of Go without human knowledge (AlphaGo Zero) | 2017/10/19 | Nature 550 | Reinforcement Learning | DeepMind | AlphaGo Zero | Self-play from scratch, no human games |
| Deep contextualized word representations (ELMo) | 2018/02/15 | arxiv:1802.05365 | Natural Language Processing | Allen AI | ELMo | |
| GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding | 2018/04/20 | arxiv:1804.07461, website | Natural Language Processing | NYU, U Washington, DeepMind | GLUE | |
| Improving Language Understanding by Generative Pre-Training (GPT) | 2018/06 | paper source | Natural Language Processing | OpenAI | GPT | Radford et al., the first GPT |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | 2018/10/11 | arxiv:1810.04805 | Natural Language Processing | | BERT (Bidirectional Encoder Representations from Transformers) | Devlin et al. |
| Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019/01/09 | arxiv:1901.02860, github | Natural Language Processing | CMU, Google Brain | Transformer-XL | |
| Language Models are Unsupervised Multitask Learners (GPT-2) | 2019/02/14 | paper | Natural Language Processing | OpenAI | GPT-2 | Originally withheld due to misuse concerns |
| RoBERTa: A Robustly Optimized BERT Pretraining Approach | 2019/07/26 | arxiv:1907.11692 | Natural Language Processing | Meta | RoBERTa | |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5) | 2019/10/23 | arxiv:1910.10683 | Natural Language Processing | | T5 | |
| Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero) | 2019/11/19 | arxiv:1911.08265 | Reinforcement Learning | DeepMind | MuZero | Plans without knowing the rules |
| Scaling Laws for Neural Language Models | 2020/01/23 | arxiv:2001.08361 | Natural Language Processing | OpenAI | Scaling Laws | Kaplan et al., power-law relationships between loss, parameters, and compute |
| REALM: Retrieval-Augmented Language Model Pre-Training | 2020/02/10 | arxiv:2002.08909, blog post | Natural Language Processing | | REALM | |
| Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG) | 2020/05/22 | arxiv:2005.11401 | Natural Language Processing | Meta, UCL | RAG | Lewis et al., coined the term "RAG" |
| Language Models are Few-Shot Learners (GPT-3) | 2020/05/28 | arxiv:2005.14165 | Natural Language Processing | OpenAI | GPT-3 | 175B parameters, in-context learning |
| Denoising Diffusion Probabilistic Models (DDPM) | 2020/06/19 | arxiv:2006.11239 | Computer Vision | UC Berkeley | DDPM | Ho, Jain, Abbeel, the modern diffusion baseline |
| An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT) | 2020/10/22 | arxiv:2010.11929, GitHub | Computer Vision | | ViT | |
| Learning Transferable Visual Models From Natural Language Supervision (CLIP) | 2021/02/26 | arxiv:2103.00020, Blog Post | Computer Vision | OpenAI | CLIP | |
| LoRA: Low-Rank Adaptation of Large Language Models | 2021/06/17 | arxiv:2106.09685, GitHub | Natural Language Processing | Microsoft | LoRA | The standard parameter-efficient fine-tuning method |
| Highly accurate protein structure prediction with AlphaFold (AlphaFold 2) | 2021/07/15 | Nature 596 | Science | DeepMind | AlphaFold 2 | Jumper et al.; Hassabis and Jumper shared the 2024 Nobel Prize in Chemistry for this work |
| MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer | 2021/10/05 | arxiv:2110.02178, GitHub | Computer Vision | Apple | MobileViT | |
| Improving language models by retrieving from trillions of tokens (RETRO) | 2021/12/08 | arxiv:2112.04426, Blog post | Natural Language Processing | DeepMind | RETRO | |
| High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion) | 2021/12/20 | arxiv:2112.10752 | Computer Vision | LMU Munich, Runway | Latent Diffusion, Stable Diffusion | Rombach et al., the architecture behind Stable Diffusion |
| LaMDA: Language Models for Dialog Applications | 2022/01/20 | arxiv:2201.08239, Blog Post | Natural Language Processing | | LaMDA | |
| Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | 2022/01/28 | arxiv:2201.11903 | Natural Language Processing | | Chain of Thought | Wei et al., few-shot prompts with worked intermediate reasoning steps |
| Training language models to follow instructions with human feedback (InstructGPT) | 2022/03/04 | arxiv:2203.02155 | Natural Language Processing | OpenAI | InstructGPT | Ouyang et al., the direct predecessor of ChatGPT |
| Training Compute-Optimal Large Language Models (Chinchilla) | 2022/03/29 | arxiv:2203.15556 | Natural Language Processing | DeepMind | Chinchilla | Hoffmann et al., revised the scaling laws |
| PaLM: Scaling Language Modeling with Pathways | 2022/04/05 | arxiv:2204.02311 | Natural Language Processing | | PaLM | |
| Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | 2022/04/12 | arxiv:2204.05862, GitHub | Natural Language Processing | Anthropic | RLHF | Bai et al., the HH-RLHF dataset |
| Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) | 2022/04/13 | arxiv:2204.06125 | Computer Vision | OpenAI | DALL-E 2 | Ramesh, Dhariwal, Nichol, Chu, Chen |
| A Generalist Agent (Gato) | 2022/05/12 | arxiv:2205.06175, Blog Post | Multimodal | DeepMind | Gato | |
| Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen) | 2022/05/23 | arxiv:2205.11487, Blog Post | Computer Vision | | Imagen | |
| Emergent Abilities of Large Language Models | 2022/06/15 | arxiv:2206.07682 | Natural Language Processing | Google, Stanford, UNC, DeepMind | Emergent Abilities | |
| AudioLM: a Language Modeling Approach to Audio Generation | 2022/09/07 | arxiv:2209.03143 | Audio | | AudioLM | |
| ReAct: Synergizing Reasoning and Acting in Language Models | 2022/10/06 | arxiv:2210.03629, GitHub | Natural Language Processing | Google, Princeton | ReAct | |
| BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | 2022/11/09 | arxiv:2211.05100, Blog Post | Natural Language Processing | BigScience, Hugging Face | BLOOM | Open-access multilingual model at GPT-3 scale |
| Robust Speech Recognition via Large-Scale Weak Supervision (Whisper) | 2022/12/06 | arxiv:2212.04356 | Audio | OpenAI | Whisper | |
| Constitutional AI: Harmlessness from AI Feedback | 2022/12/15 | arxiv:2212.08073 | Natural Language Processing | Anthropic | Constitutional AI, Claude | Bai et al., RLAIF rather than RLHF |
| LLaMA: Open and Efficient Foundation Language Models | 2023/02/25 | arxiv:2302.13971, blog post, github | Natural Language Processing | Meta | LLaMA | First open-weights model competitive with GPT-3 |
| GPT-4 Technical Report | 2023/03/15 | arxiv:2303.08774, blog post, system card | Multimodal | OpenAI | GPT-4 | Withheld technical details, included a long system card |
| Sparks of Artificial General Intelligence: Early experiments with GPT-4 | 2023/03/22 | arxiv:2303.12712 | Multimodal | Microsoft Research | Sparks of AGI | Bubeck et al., reframed AGI discourse |
| Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO) | 2023/05/29 | arxiv:2305.18290 | Natural Language Processing | Stanford | DPO | Rafailov et al., a simpler alternative to PPO-based RLHF |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | 2023/07/18 | arxiv:2307.09288 | Natural Language Processing | Meta | Llama 2 | Touvron et al., released under a custom license that allows most commercial use |
| Mistral 7B | 2023/10/10 | arxiv:2310.06825 | Natural Language Processing | Mistral | Mistral 7B | Apache 2.0 licensed, sliding-window attention |
| Gemini: A Family of Highly Capable Multimodal Models | 2023/12/19 | arxiv:2312.11805 | Multimodal | Google DeepMind | Gemini | Ultra, Pro, and Nano sizes |
| Mixtral of Experts | 2024/01/08 | arxiv:2401.04088 | Natural Language Processing | Mistral | Mixtral 8x7B | Sparse mixture of experts, 13B active parameters per token |
| Accurate structure prediction of biomolecular interactions with AlphaFold 3 | 2024/05/08 | Nature 630 | Science | Google DeepMind, Isomorphic Labs | AlphaFold 3 | Abramson et al., handles proteins, nucleic acids, and ligands |
| The Llama 3 Herd of Models | 2024/07/23 | arxiv:2407.21783 | Multimodal | Meta | Llama 3 | 405B dense transformer, 128K context |
| DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | 2025/01/22 | arxiv:2501.12948 | Natural Language Processing | DeepSeek | DeepSeek-R1 | Pure RL produces emergent reasoning, comparable to OpenAI o1 |

Other Papers

| Name | Date | Source | Type | Organization | Product | Note |
|---|---|---|---|---|---|---|
| Self-Rewarding Language Models | 2024/01/18 | arxiv:2401.10020 | Natural Language Processing | Meta | ||
| LLM in a flash: Efficient Large Language Model Inference with Limited Memory | 2023/12/12 | arxiv:2312.11514, HuggingFace | Natural Language Processing | Apple | ||
| Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation | 2023/12/07 | arxiv:2311.17117, Website, Video, GitHub, Tutorial | Computer Vision | Alibaba | Animate Anyone | |
| MatterGen: a generative model for inorganic materials design | 2023/12/06 | arxiv:2312.03687, Tweet | Materials Science | Microsoft | MatterGen | |
| Audiobox: Generating audio from voice and natural language prompts | 2023/11/30 | Paper, Website | Audio | Meta | Audiobox | |
| Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text | 2023/11/30 | arxiv:2311.18805 | Natural Language Processing | University of Tokyo | Scrambled Bench | |
| MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers | 2023/11/27 | arxiv:2311.15475, Website | Computer Vision | | MeshGPT | |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | 2023/10/11 | arxiv:2310.07704, GitHub | Multimodal, Natural Language Processing | Apple | Ferret | |
| SeamlessM4T - Massively Multilingual and Multimodal Machine Translation | 2023/08/23 | Paper, Website, Demo, GitHub | Natural Language Processing | Meta | SeamlessM4T | |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023/08/01 | arxiv:2307.15818, Website, Blogpost | Robotics | | RT-2 | |
| Towards Generalist Biomedical AI | 2023/07/26 | arxiv:2307.14334 | Natural Language Processing | | Med-PaLM | |
| Large Language Models Understand and Can be Enhanced by Emotional Stimuli | 2023/07/14 | arxiv:2307.11760 | Natural Language Processing | Microsoft, CAS | EmotionPrompt | |
| MusicGen: Simple and Controllable Music Generation | 2023/06/08 | arxiv:2306.05284, GitHub, Example | Audio | Meta | MusicGen | |
| CodeTF: One-stop Transformer Library for State-of-the-art Code LLM | 2023/05/31 | arxiv:2306.00029, GitHub | Natural Language Processing | Salesforce | CodeTF | |
| Bytes Are All You Need: Transformers Operating Directly On File Bytes | 2023/05/31 | arxiv:2306.00238 | Computer Vision | Apple | ||
| Scaling Speech Technology to 1,000+ Languages | 2023/05/22 | Paper, Blogpost, GitHub | Natural Language Processing | Meta | Massively Multilingual Speech (MMS) | |
| RWKV: Reinventing RNNs for the Transformer Era | 2023/05/22 | arxiv:2305.13048 | Natural Language Processing | | RWKV | |
| ImageBind: One Embedding Space To Bind Them All | 2023/05/09 | arxiv:2305.05665, Website, Demo, Blog, GitHub | Multimodal, Computer Vision, Natural Language Processing | Meta | ImageBind | |
| Real-Time Neural Appearance Models | 2023/05/05 | Paper, Blog | | NVIDIA | | |
| Poisoning Language Models During Instruction Tuning | 2023/05/01 | arxiv:2305.00944 | Natural Language Processing | |||
| Generative Agents: Interactive Simulacra of Human Behavior | 2023/04/07 | arxiv:2304.03442 | Human-AI Interaction, Natural Language Processing | Stanford | Generative agents | |
| Segment Anything | 2023/04/05 | Paper, Website, Blog, GitHub | Computer Vision | Meta | Segment Anything Model (SAM) | |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace (Microsoft JARVIS) | 2023/03/30 | arxiv:2303.17580, HuggingFace Space, JARVIS GitHub | Natural Language Processing, Multimodal | Microsoft, Hugging Face | HuggingGPT, JARVIS | |
| BloombergGPT: A Large Language Model for Finance | 2023/03/30 | arxiv:2303.17564, press release | Natural Language Processing | Bloomberg | BloombergGPT | |
| Reflexion: an autonomous agent with dynamic memory and self-reflection | 2023/03/20 | arxiv:2303.11366, GitHub | Natural Language Processing | Northeastern, MIT | Reflexion | |
| PaLM-E: An Embodied Multimodal Language Model | 2023/03/06 | arxiv:2303.03378, blog | Natural Language Processing, Multimodal | | PaLM-E | |
| Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages | 2023/03/02 | arxiv:2303.01037, blog | Natural Language Processing | | Universal Speech Model (USM) | |
| Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1) | 2023/02/27 | arxiv:2302.14045 | Natural Language Processing | Microsoft | Kosmos-1 | |
| Structure and Content-Guided Video Synthesis with Diffusion Models (Gen-1) | 2023/02/06 | arxiv:2302.03011, blog post | Video-to-Video | Runway | Gen-1 | |
| Dreamix: Video Diffusion Models are General Video Editors | 2023/02/03 | arxiv:2302.01329, blog post | Video | | Dreamix | |
| FLAME: A small language model for spreadsheet formulas | 2023/01/31 | arxiv:2301.13779 | Natural Language Processing | Microsoft | FLAME | |
| SingSong: Generating musical accompaniments from singing | 2023/01/30 | arxiv:2301.12662, blog post | Audio | | SingSong | |
| MusicLM: Generating Music From Text | 2023/01/26 | arxiv:2301.11325, blog post | Audio | | MusicLM | |
| Mastering Diverse Domains through World Models (DreamerV3) | 2023/01/10 | arxiv:2301.04104v1, blogpost | Reinforcement Learning | DeepMind | DreamerV3 | |
| Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E) | 2023/01/05 | arxiv:2301.02111, Demo | Audio | Microsoft | VALL-E | |
| Muse: Text-To-Image Generation via Masked Generative Transformers | 2023/01/02 | arxiv:2301.00704, blog post | Computer Vision | | Muse | |
| InstructPix2Pix: Learning to Follow Image Editing Instructions | 2022/11/17 | arxiv:2211.09800, Blog Post | Computer Vision | UC Berkeley | InstructPix2Pix | |
| Block-Recurrent Transformers | 2022/03/11 | arxiv:2203.07852 | Natural Language Processing | |||
| Memorizing Transformers | 2022/03/16 | arxiv:2203.08913 | Natural Language Processing | |||
| STaR: Bootstrapping Reasoning With Reasoning | 2022/03/28 | arxiv:2203.14465 | Natural Language Processing | Stanford, Google | STaR | |
| Probabilistic Face Embeddings | 2019/04/21 | arxiv:1904.09658 | Computer Vision | Michigan State | PFEs | |
If you read the table top to bottom, four broad waves show up.
The first is the deep learning revival, roughly 2012 to 2016. AlexNet (Krizhevsky, Sutskever, Hinton, 2012) showed that GPUs and large labeled datasets (ImageNet) could push convolutional networks past hand-engineered features. It won the 2012 ImageNet challenge with a top-5 error of 15.3 percent against 26.2 percent for the runner-up, and most of the computer vision community switched to deep networks within a couple of years. Word2Vec (Mikolov et al., 2013) brought a similar learned-representation approach to language, mapping words to dense vectors where geometric relationships matched semantic ones. The famous "king minus man plus woman equals queen" example comes from this line of work. GANs (Goodfellow et al., 2014) introduced adversarial training for generation, pitting a generator against a discriminator in a minimax game. DQN, A3C, and the AlphaGo line showed that deep RL could master games that had resisted earlier methods. AlphaGo beat Lee Sedol in March 2016, a milestone that many experts had predicted was still a decade away. ResNet (He et al., 2015) made it possible to train networks more than a hundred layers deep by adding identity skip connections, which keep gradients flowing and avoid the accuracy degradation seen in very deep plain networks. Most of these papers are still heavily cited, and the ideas they introduced (CNNs, GANs, residual blocks, replay buffers) show up across modern systems.
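The word-vector arithmetic mentioned for Word2Vec can be sketched in a few lines. This is only an illustration: the two-dimensional vectors in `TOY_VECTORS` are hand-built and hypothetical (real Word2Vec embeddings are learned from text and have hundreds of dimensions), and the helper names `cosine` and `analogy` are not from the paper.

```python
import math

# Hypothetical 2-D embeddings, hand-built for illustration only.
# Axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
TOY_VECTORS = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.5, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman", TOY_VECTORS))
```

With these toy vectors the nearest remaining word is "queen". On real embeddings the result depends on the training corpus and on excluding the query words from the candidates, as done here.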
The second is the transformer era, roughly 2017 to 2020. "Attention Is All You Need" (Vaswani et al., 2017) replaced recurrence with self-attention, which eased the long-range dependency problems of LSTMs and let training parallelize across sequence positions, making far better use of TPUs and GPU clusters. BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) split the family tree into encoders and decoders. BERT used masked language modeling to learn bidirectional context, which turned out to be ideal for classification and retrieval. GPT used autoregressive next-token prediction, which proved better suited to generation and to scaling. T5 (Raffel et al., 2019) recast every task as text-to-text. GPT-3 (Brown et al., 2020) made the bet that scale alone unlocks new behaviors, and it mostly paid off: a 175B-parameter model that had never been fine-tuned on a target task could often match or beat smaller fine-tuned models from a few prompted examples. The Scaling Laws paper (Kaplan et al., 2020) gave the field a quantitative recipe for how loss should drop as parameters, data, and compute grow. ViT (Dosovitskiy et al., 2020) brought transformers to vision, and shortly afterward CLIP (Radford et al., 2021) bridged vision and language by training a joint embedding space on 400 million image-text pairs collected from the web. By the end of 2020, nearly every state-of-the-art system on the major NLP benchmarks used the same basic architecture.
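The core operation behind self-attention, and the difference between BERT-style bidirectional context and GPT-style autoregressive context, can be sketched as follows. This is a minimal illustration under simplifying assumptions: it implements single-head scaled dot-product attention on plain Python lists and omits the learned query, key, and value projections, multiple heads, and everything else in a full transformer layer. The function names and the `causal` flag are illustrative choices, not an API from any library.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors.

    With causal=False every position attends to every other position
    (the bidirectional setting). With causal=True position i may only
    attend to positions <= i (the autoregressive setting).
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            # Mask out future positions so they receive zero weight.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# Three toy token vectors used as queries, keys, and values at once.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(x, x, x, causal=True)[0])   # first token can only see itself
print(attention(x, x, x, causal=False)[0])  # first token mixes in all three
```

Because each output row is computed independently of the others, all positions can be processed in parallel, which is the property that makes this formulation friendlier to accelerators than a recurrent update.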
The third is the alignment era, roughly 2021 to 2023. The bottleneck shifted from raw capability to getting large models to behave as intended. Christiano et al. (2017) had introduced preference-based RL years earlier, but InstructGPT (Ouyang et al., 2022) showed it at GPT-3 scale: human labelers ranked model outputs, a reward model learned to predict those rankings, and PPO fine-tuned the language model against the reward. Human raters preferred outputs from the resulting 1.3B-parameter InstructGPT to those of the 175B GPT-3, despite its having over 100 times fewer parameters. ChatGPT, launched in November 2022, used a closely related recipe. Constitutional AI (Bai et al., 2022) replaced most of the human harmlessness labels with model-generated feedback. In a supervised stage, the model critiques and revises its own responses against a written constitution and is fine-tuned on the revisions; in an RL stage, a model compares pairs of responses against the constitution, and those AI-generated preferences train the reward model. This made alignment cheaper to scale and gave Anthropic a way to articulate the behavior they wanted in plain English rather than implicitly through labels. DPO (Rafailov et al., 2023) skipped the separate reward model entirely: it showed that a language model can be optimized directly against preference data with a simple binary cross-entropy objective, which is typically more stable and cheaper than PPO. Many open-source preference-tuning pipelines have moved to DPO or one of its variants. Chain of Thought (Wei et al., 2022), ReAct (Yao et al., 2022), Reflexion, and self-rewarding methods turned chatbots into something closer to agents that can plan, call tools, and revise their own outputs. Variants of RLHF, DPO, and Constitutional AI now underpin most commercial chatbots.
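The DPO objective described above is compact enough to write out. The sketch below computes the per-example loss from four summed sequence log-probabilities: the chosen and rejected responses under the policy being trained and under a frozen reference model. The function name, argument names, and the default `beta=0.1` are illustrative assumptions. A real training loop would compute these log-probabilities with a language model and average the loss over a batch.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin))).

    Each margin is how much more likely the policy makes a response
    than the frozen reference model does, in log space.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z); branch to avoid overflow in exp.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: no preference signal yet, loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has raised the chosen response and lowered the rejected one: loss falls.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0, beta=0.1))
```

No reward model appears anywhere in the computation. The log-probability ratio against the reference model plays that role, which is why the method needs only a preference dataset and two forward passes per response.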
The fourth wave, still in progress, is the reasoning and multimodal era. GPT-4 (March 2023) added vision and dramatically improved performance on professional exams. The Sparks of AGI paper (Bubeck et al., 2023) argued that GPT-4 already showed early signs of general intelligence, which kicked off a long debate about how much the benchmark numbers actually measured. Gemini (December 2023) was multimodal from day one, trained jointly on text, images, audio, and video rather than bolted together. Llama 2 (July 2023), Mistral 7B (October 2023), and Mixtral 8x7B (January 2024) pushed open-weights performance close to the frontier, with Mixtral matching Llama 2 70B on most benchmarks while using only 13B active parameters per token. The Llama 3 herd (July 2024) released a 405B dense transformer with 128K context, narrowing the gap between open and closed frontier models to a few months. DeepSeek-R1 (January 2025), and in particular its R1-Zero variant, showed that pure reinforcement learning, without supervised reasoning traces, can elicit emergent long-horizon thinking: the model learns to backtrack, verify, and self-correct on its own. AlphaFold 2 and 3 extended attention-based machinery to biology, predicting protein structures with near-experimental accuracy. AlphaFold 3, released in 2024, also handles nucleic acids and small-molecule ligands. Demis Hassabis and John Jumper shared the 2024 Nobel Prize in Chemistry for the AlphaFold work.
A few entries in the table do not get the same airtime as the headline architectures, but they end up doing a lot of work.
Adam (Kingma and Ba, 2014) is the optimizer that almost every model on this page was trained with, at least at some point. Variants like AdamW are now the default. The paper itself is short and unflashy.
LoRA (Hu et al., 2021) made fine-tuning large models accessible to anyone with a single GPU. By freezing the base model and learning a low-rank update to selected weight matrices, you can adapt a 7B model on a single consumer GPU. Hugging Face, Replicate, and most of the open-source fine-tuning ecosystem rely on it.
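The low-rank update can be sketched on plain Python lists. The frozen weight matrix W is left untouched, and the effective weight is W + (alpha / r) * B A, where A and B are the only trained parameters. The zero initialization of B and the random initialization of A follow the LoRA paper. The helper names, the 0.01 standard deviation, and the toy sizes are assumptions made for illustration, and no training step is shown.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_init(d_out, d_in, rank, seed=0):
    """Create LoRA factors: A is small and random, B is zero, so B @ A starts at zero."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A without modifying the frozen W."""
    rank = len(a)
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_fraction(d_out, d_in, rank):
    """Parameters in A and B relative to the full d_out x d_in matrix."""
    return rank * (d_in + d_out) / (d_out * d_in)


w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base weight
a, b = lora_init(d_out=3, d_in=3, rank=2)
print(lora_effective_weight(w, a, b, alpha=4.0) == w)  # update starts as a no-op
print(trainable_fraction(4096, 4096, 8))               # share of weights trained
```

For a 4096 by 4096 matrix at rank 8, the adapter trains 1/256 of the parameters of the full matrix, which is where the memory savings come from.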
The Chinchilla paper (Hoffmann et al., 2022) corrected the original Kaplan scaling laws. Kaplan's recipe had pushed most of a growing compute budget into model size; Chinchilla showed that for a given compute budget, parameters and training tokens should grow in roughly equal proportion, which meant the largest models of the time were undertrained. Many frontier models since 2022 have used Chinchilla-style or larger token budgets.
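A back-of-the-envelope version of the Chinchilla allocation can be written as a short calculation. It rests on two widely used approximations rather than the paper's fitted curves: training compute C is about 6 * N * D FLOPs for N parameters and D tokens, and the compute-optimal ratio is about 20 tokens per parameter. Both constants appear as named, overridable arguments, and the function name is illustrative.

```python
import math


def compute_optimal_allocation(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a FLOP budget into (parameters, tokens).

    Assumes C = k * N * D and D = r * N, with the rule-of-thumb values
    k = 6 and r = 20, so N = sqrt(C / (k * r)).
    """
    params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Budget matching Chinchilla itself: 6 * 70e9 parameters * 1.4e12 tokens.
budget = 6 * 70e9 * 1.4e12
n, d = compute_optimal_allocation(budget)
print(f"{n:.3g} parameters, {d:.3g} tokens")

# Scaling the budget 100x scales parameters and tokens by 10x each.
n_big, d_big = compute_optimal_allocation(100 * budget)
print(n_big / n, d_big / d)
```

Under these assumptions parameters and tokens both scale with the square root of compute, which is the "grow them together" message of the paper in its simplest form.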
Whisper (Radford et al., 2022) made high-quality multilingual speech recognition freely available. The model was trained on 680,000 hours of weakly supervised audio collected from the internet, and it has become a common backbone for transcription products, podcast tools, and voice agents.
The original Latent Diffusion paper (Rombach et al., 2021, published at CVPR 2022) is the actual technical foundation of Stable Diffusion. Stability AI funded the open release a few months later. Without this paper, the consumer text-to-image boom of 2022 to 2023 probably would not have happened.
If you are new to the field and want a reading order, a reasonable path is: ImageNet, AlexNet, Word2Vec, Seq2Seq, Attention Is All You Need, BERT, GPT-2, Scaling Laws, GPT-3, InstructGPT, Constitutional AI, Chinchilla, Chain of Thought, LLaMA, GPT-4 Technical Report, DPO, Llama 3, DeepSeek-R1. That gives you the spine of the modern stack in roughly chronological order, with each paper building on the previous.
If you are tracking a specific subfield, the "Type" column groups papers by area. Vision papers are easy to skim by filtering on Computer Vision. RL papers are sparser but include the most cited single results in the field (AlphaGo, MuZero, DQN, PPO). Audio is covered by WaveNet, AudioLM, MusicGen, Whisper, and SeamlessM4T. Robotics shows up under RT-2 and PaLM-E.
If you are looking for the alignment and safety literature in particular, the core spine is: Christiano et al. (2017), InstructGPT (2022), the Anthropic HH paper (2022), Constitutional AI (2022), and DPO (2023). Each of these is short, readable, and shapes how production models are trained today.
This page is a living index. Some papers are obviously canonical and missing only because no one has written a row yet. Likely additions worth tracking: the original GAN follow-ups (DCGAN, StyleGAN), the long context line (FlashAttention, Mamba, state-space models), agent benchmarks (SWE-bench, GAIA, OSWorld), safety and evaluation work (RealToxicityPrompts, TruthfulQA, HELM), the o-series reasoning papers from OpenAI, and the Anthropic interpretability papers on dictionary learning and circuits. New rows can be added in chronological order under either table.