Papers

This page is a chronological index of AI research papers that shaped the field. It covers the foundations of deep learning (LeNet, AlexNet, ImageNet, ResNet), the transformer era (Attention is All You Need, BERT, GPT, T5), the scaling era (GPT-3, Chinchilla, PaLM, LLaMA), the alignment era (InstructGPT, RLHF, Constitutional AI, DPO), and the multimodal and reasoning era (GPT-4, Gemini, Llama 3, DeepSeek-R1). It also tracks landmark work in computer vision, reinforcement learning, audio, robotics, and scientific AI (AlphaGo, MuZero, AlphaFold).

What makes a paper "important"

There is no committee that ranks AI papers. Importance is a rough mix of three things: technical novelty, citation count, and downstream impact on products people actually use. A paper that introduces a new architecture (the transformer, residual networks, mixture of experts) tends to stay important for years because every later paper builds on top of it. A paper that introduces a benchmark (ImageNet, GLUE, SuperGLUE, MMLU) shapes what the next decade of research optimizes for. A paper that opens up a new capability (GANs, diffusion, in-context learning, chain of thought) gets re-cited every time someone tries to extend or critique it.

There is also a softer kind of importance: papers that change how the field talks to itself. "Sparks of AGI" did this for GPT-4. "Emergent Abilities of Large Language Models" did it for scaling. "Constitutional AI" did it for alignment without exhaustive human labeling. These papers are not always the most technically deep, but they reframe debates that everyone else then has to respond to.

A few practical filters are useful when reading the table below:

Did the paper introduce a model that other people kept training on (BERT, GPT-2, LLaMA, Mistral 7B)?
Did it open up a new task or capability (Word2Vec for embeddings, CLIP for vision and language, Whisper for speech, AlphaFold for biology)?
Did it become a standard reference cited by every follow-up paper (Adam, ResNet, Attention is All You Need, Chinchilla, Scaling Laws)?
Did it shift how labs build systems (RLHF, DPO, Constitutional AI, Chain of Thought)?

The tables below skew toward papers that meet at least one of those tests. They are not exhaustive. The "Important papers" section lists the canonical first appearances of major ideas. The "Other papers" section lists notable follow-ups, benchmarks, and applied work.

Reading the table

Dates use the arXiv submission date (v1) when available, the conference or journal publication date otherwise. "Source" links go to the arXiv abstract page, the publisher PDF, or the lab's official release. "Organization" lists the primary affiliation of the first author or the lab that led the work. "Product" lists the model, system, or technique name that the paper introduced. Some early papers predate the modern convention of naming a model in the title, so the product column is blank.

For papers that have their own dedicated wiki entry, the title links to that entry.

Important papers

Name	Date	Source	Type	Organization	Product	Note
Long Short-Term Memory	1997/11	Neural Computation 9(8)	Natural Language Processing		LSTM	Hochreiter and Schmidhuber introduce gated memory cells for recurrent networks
Gradient-Based Learning Applied to Document Recognition (LeNet-5)	1998/11	Proceedings of the IEEE	Computer Vision	AT&T Labs	LeNet-5	LeCun et al., convolutional networks for digit recognition
ImageNet: A Large-Scale Hierarchical Image Database	2009/06/20	CVPR 2009 PDF	Computer Vision	Princeton	ImageNet	Deng, Dong, Socher, Li, Li, Fei-Fei
ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)	2012	AlexNet Paper	Computer Vision	University of Toronto	AlexNet	Krizhevsky, Sutskever, Hinton
Efficient Estimation of Word Representations in Vector Space (Word2Vec)	2013/01/16	arxiv:1301.3781	Natural Language Processing	Google	Word2Vec
Playing Atari with Deep Reinforcement Learning (DQN)	2013/12/19	arxiv:1312.5602	Reinforcement Learning	DeepMind	DQN (Deep Q-Learning)
Generative Adversarial Networks (GAN)	2014/06/10	arxiv:1406.2661	Computer Vision	Universite de Montreal	GAN (Generative Adversarial Network)	Goodfellow et al.
Very Deep Convolutional Networks for Large-Scale Image Recognition (VGGNet)	2014/09/04	arxiv:1409.1556	Computer Vision	Oxford VGG	VGGNet
Sequence to Sequence Learning with Neural Networks (Seq2Seq)	2014/09/10	arxiv:1409.3215	Natural Language Processing	Google	Seq2Seq
Going Deeper with Convolutions (GoogLeNet)	2014/09/17	arxiv:1409.4842	Computer Vision	Google	GoogLeNet
Adam: A Method for Stochastic Optimization	2014/12/22	arxiv:1412.6980	Optimization	University of Amsterdam, University of Toronto	Adam	Kingma and Ba, the default optimizer for years
Deep Residual Learning for Image Recognition (ResNet)	2015/12/10	arxiv:1512.03385	Computer Vision	Microsoft Research	ResNet	He et al., introduced skip connections
Mastering the game of Go with deep neural networks and tree search (AlphaGo)	2016/01/28	Nature 529	Reinforcement Learning	DeepMind	AlphaGo	Silver et al., defeated Lee Sedol two months later
Asynchronous Methods for Deep Reinforcement Learning (A3C)	2016/02/04	arxiv:1602.01783	Reinforcement Learning	DeepMind	A3C
WaveNet: A Generative Model for Raw Audio	2016/09/12	arxiv:1609.03499	Audio	DeepMind	WaveNet
Attention Is All You Need (Transformer)	2017/06/12	arxiv:1706.03762	Natural Language Processing	Google Brain	Transformer	Vaswani et al., the foundation of every modern LLM
Deep reinforcement learning from human preferences	2017/06/12	arxiv:1706.03741	Reinforcement Learning	OpenAI, DeepMind	RLHF	Christiano et al., the original RLHF paper
Proximal Policy Optimization Algorithms (PPO)	2017/07/20	arxiv:1707.06347	Reinforcement Learning	OpenAI	PPO	Used in ChatGPT and most RLHF pipelines
Mastering the game of Go without human knowledge (AlphaGo Zero)	2017/10/19	Nature 550	Reinforcement Learning	DeepMind	AlphaGo Zero	Self-play from scratch, no human games
Deep contextualized word representations (ELMo)	2018/02/15	arxiv:1802.05365	Natural Language Processing	Allen AI	ELMo
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding	2018/04/20	arxiv:1804.07461, website	Natural Language Processing	NYU, U Washington, DeepMind	GLUE
Improving Language Understanding by Generative Pre-Training (GPT)	2018/06	paper source	Natural Language Processing	OpenAI	GPT	Radford et al., the first GPT
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding	2018/10/11	arxiv:1810.04805	Natural Language Processing	Google	BERT (Bidirectional Encoder Representations from Transformers)	Devlin et al.
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context	2019/01/09	arxiv:1901.02860, github	Natural Language Processing	CMU, Google Brain	Transformer-XL
Language Models are Unsupervised Multitask Learners (GPT-2)	2019/02/14	paper	Natural Language Processing	OpenAI	GPT-2	Originally withheld due to misuse concerns
RoBERTa: A Robustly Optimized BERT Pretraining Approach	2019/07/26	arxiv:1907.11692	Natural Language Processing	Meta	RoBERTa
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)	2019/10/23	arxiv:1910.10683	Natural Language Processing	Google	T5
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero)	2019/11/19	arxiv:1911.08265	Reinforcement Learning	DeepMind	MuZero	Plans without knowing the rules
Scaling Laws for Neural Language Models	2020/01/23	arxiv:2001.08361	Natural Language Processing	OpenAI	Scaling Laws	Kaplan et al., power-law relationships between loss, parameters, and compute
REALM: Retrieval-Augmented Language Model Pre-Training	2020/02/10	arxiv:2002.08909, blog post	Natural Language Processing	Google	REALM
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG)	2020/05/22	arxiv:2005.11401	Natural Language Processing	Meta, UCL	RAG	Lewis et al., coined the term "RAG"
Language Models are Few-Shot Learners (GPT-3)	2020/05/28	arxiv:2005.14165	Natural Language Processing	OpenAI	GPT-3	175B parameters, in-context learning
Denoising Diffusion Probabilistic Models (DDPM)	2020/06/19	arxiv:2006.11239	Computer Vision	UC Berkeley	DDPM	Ho, Jain, Abbeel, the modern diffusion baseline
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)	2020/10/22	arxiv:2010.11929, GitHub	Computer Vision	Google	ViT
Learning Transferable Visual Models From Natural Language Supervision (CLIP)	2021/02/26	arxiv:2103.00020, Blog Post	Computer Vision	OpenAI	CLIP
LoRA: Low-Rank Adaptation of Large Language Models	2021/06/17	arxiv:2106.09685, GitHub	Natural Language Processing	Microsoft	LoRA	The standard parameter-efficient fine-tuning method
Highly accurate protein structure prediction with AlphaFold (AlphaFold 2)	2021/07/15	Nature 596	Science	DeepMind	AlphaFold 2	Jumper et al.; Hassabis and Jumper shared the 2024 Nobel Prize in Chemistry for this work
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer	2021/10/05	arxiv:2110.02178, GitHub	Computer Vision	Apple	MobileViT
Improving language models by retrieving from trillions of tokens (RETRO)	2021/12/08	arxiv:2112.04426, Blog post	Natural Language Processing	DeepMind	RETRO
High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)	2021/12/20	arxiv:2112.10752	Computer Vision	LMU Munich, Runway	Latent Diffusion, Stable Diffusion	Rombach et al., the architecture behind Stable Diffusion
LaMDA: Language Models for Dialog Applications	2022/01/20	arxiv:2201.08239, Blog Post	Natural Language Processing	Google	LaMDA
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models	2022/01/28	arxiv:2201.11903	Natural Language Processing	Google	Chain of Thought	Wei et al., few-shot prompts that include worked reasoning steps
Training language models to follow instructions with human feedback (InstructGPT)	2022/03/04	arxiv:2203.02155	Natural Language Processing	OpenAI	InstructGPT	Ouyang et al., the direct predecessor of ChatGPT
Training Compute-Optimal Large Language Models (Chinchilla)	2022/03/29	arxiv:2203.15556	Natural Language Processing	DeepMind	Chinchilla	Hoffmann et al., revised the scaling laws
PaLM: Scaling Language Modeling with Pathways	2022/04/05	arxiv:2204.02311	Natural Language Processing	Google	PaLM
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback	2022/04/12	arxiv:2204.05862, GitHub	Natural Language Processing	Anthropic	RLHF	Bai et al., the HH-RLHF dataset
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)	2022/04/13	arxiv:2204.06125	Computer Vision	OpenAI	DALL-E 2	Ramesh, Dhariwal, Nichol, Chu, Chen
A Generalist Agent (Gato)	2022/05/12	arxiv:2205.06175, Blog Post	Multimodal	DeepMind	Gato
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)	2022/05/23	arxiv:2205.11487, Blog Post	Computer Vision	Google	Imagen
Emergent Abilities of Large Language Models	2022/06/15	arxiv:2206.07682	Natural Language Processing	Google, Stanford, UNC, DeepMind	Emergent Abilities
AudioLM: a Language Modeling Approach to Audio Generation	2022/09/07	arxiv:2209.03143	Audio	Google	AudioLM
ReAct: Synergizing Reasoning and Acting in Language Models	2022/10/06	arxiv:2210.03629, GitHub	Natural Language Processing	Google, Princeton	ReAct
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model	2022/11/09	arxiv:2211.05100, Blog Post	Natural Language Processing	BigScience, Hugging Face	BLOOM	Open-access competitor to GPT-3
Robust Speech Recognition via Large-Scale Weak Supervision (Whisper)	2022/12/06	arxiv:2212.04356	Audio	OpenAI	Whisper
Constitutional AI: Harmlessness from AI Feedback	2022/12/15	arxiv:2212.08073	Natural Language Processing	Anthropic	Constitutional AI, Claude	Bai et al., RLAIF rather than RLHF
LLaMA: Open and Efficient Foundation Language Models	2023/02/27	arxiv:2302.13971, blog post, github	Natural Language Processing	Meta	LLaMA	First open-weights model competitive with GPT-3
GPT-4 Technical Report	2023/03/15	arxiv:2303.08774, blog post, system card	Multimodal	OpenAI	GPT-4	Withheld technical details, included a long system card
Sparks of Artificial General Intelligence: Early experiments with GPT-4	2023/03/22	arxiv:2303.12712	Multimodal	Microsoft Research	Sparks of AGI	Bubeck et al., reframed AGI discourse
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO)	2023/05/29	arxiv:2305.18290	Natural Language Processing	Stanford	DPO	Rafailov et al., a simpler alternative to PPO-based RLHF
Llama 2: Open Foundation and Fine-Tuned Chat Models	2023/07/18	arxiv:2307.09288	Natural Language Processing	Meta	Llama 2	Touvron et al., released under a custom license that allows most commercial use
Mistral 7B	2023/10/10	arxiv:2310.06825	Natural Language Processing	Mistral	Mistral 7B	Apache 2.0 licensed, sliding-window attention
Gemini: A Family of Highly Capable Multimodal Models	2023/12/19	arxiv:2312.11805	Multimodal	Google DeepMind	Gemini	Ultra, Pro, and Nano sizes
Mixtral of Experts	2024/01/08	arxiv:2401.04088	Natural Language Processing	Mistral	Mixtral 8x7B	Sparse mixture of experts, 13B active parameters per token
Accurate structure prediction of biomolecular interactions with AlphaFold 3	2024/05/08	Nature 630	Science	Google DeepMind, Isomorphic Labs	AlphaFold 3	Abramson et al., handles proteins, nucleic acids, and ligands
The Llama 3 Herd of Models	2024/07/31	arxiv:2407.21783	Multimodal	Meta	Llama 3	405B dense transformer, 128K context
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning	2025/01/22	arxiv:2501.12948	Natural Language Processing	DeepSeek	DeepSeek-R1	Large-scale RL elicits long-form reasoning (R1-Zero uses RL alone), comparable to OpenAI o1

Other papers

Name	Date	Source	Type	Organization	Product
Self-Rewarding Language Models	2024/01/18	arxiv:2401.10020	Natural Language Processing	Meta
LLM in a flash: Efficient Large Language Model Inference with Limited Memory	2023/12/12	arxiv:2312.11514, HuggingFace	Natural Language Processing	Apple
Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation	2023/12/07	arxiv:2311.17117, Website, Video, GitHub, Tutorial	Computer Vision	Alibaba	Animate Anyone
MatterGen: a generative model for inorganic materials design	2023/12/06	arxiv:2312.03687, Tweet	Materials Science	Microsoft	MatterGen
Audiobox: Generating audio from voice and natural language prompts	2023/11/30	Paper, Website	Audio	Meta	Audiobox
Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text	2023/11/30	arxiv:2311.18805	Natural Language Processing	University of Tokyo	Scrambled Bench
MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers	2023/11/27	arxiv:2311.15475, Website	Computer Vision		MeshGPT
Ferret: Refer and Ground Anything Anywhere at Any Granularity	2023/10/11	arxiv:2310.07704, GitHub	Multimodal, Natural Language Processing	Apple	Ferret
SeamlessM4T - Massively Multilingual and Multimodal Machine Translation	2023/08/23	Paper, Website, Demo, GitHub	Natural Language Processing	Meta	SeamlessM4T
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control	2023/08/01	arxiv:2307.15818, Website, Blogpost	Robotics	Google	RT-2
Towards Generalist Biomedical AI	2023/07/26	arxiv:2307.14334	Natural Language Processing	Google	Med-PaLM M
Large Language Models Understand and Can be Enhanced by Emotional Stimuli	2023/07/14	arxiv:2307.11760	Natural Language Processing	Microsoft, CAS	EmotionPrompt
MusicGen: Simple and Controllable Music Generation	2023/06/08	arxiv:2306.05284, GitHub, Example	Audio	Meta	MusicGen
CodeTF: One-stop Transformer Library for State-of-the-art Code LLM	2023/05/31	arxiv:2306.00029, GitHub	Natural Language Processing	Salesforce	CodeTF
Bytes Are All You Need: Transformers Operating Directly On File Bytes	2023/05/31	arxiv:2306.00238	Computer Vision	Apple
Scaling Speech Technology to 1,000+ Languages	2023/05/22	Paper, Blogpost, GitHub	Natural Language Processing	Meta	Massively Multilingual Speech (MMS)
RWKV: Reinventing RNNs for the Transformer Era	2023/05/22	arxiv:2305.13048	Natural Language Processing		RWKV
ImageBind: One Embedding Space To Bind Them All	2023/05/09	arxiv:2305.05665, Website, Demo, Blog, GitHub	Multimodal, Computer Vision, Natural Language Processing	Meta	ImageBind
Real-Time Neural Appearance Models	2023/05/05	Paper, Blog		NVIDIA
Poisoning Language Models During Instruction Tuning	2023/05/01	arxiv:2305.00944	Natural Language Processing
Generative Agents: Interactive Simulacra of Human Behavior	2023/04/07	arxiv:2304.03442	Human-AI Interaction, Natural Language Processing	Stanford	Generative agents
Segment Anything	2023/04/05	Paper, Website, Blog, GitHub	Computer Vision	Meta	Segment Anything Model (SAM)
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace (Microsoft JARVIS)	2023/03/30	arxiv:2303.17580, HuggingFace Space, JARVIS GitHub	Natural Language Processing, Multimodal	Microsoft, Hugging Face	HuggingGPT, JARVIS
BloombergGPT: A Large Language Model for Finance	2023/03/30	arxiv:2303.17564, press release	Natural Language Processing	Bloomberg	BloombergGPT
Reflexion: an autonomous agent with dynamic memory and self-reflection	2023/03/20	arxiv:2303.11366, GitHub	Natural Language Processing	Northeastern, MIT	Reflexion
PaLM-E: An Embodied Multimodal Language Model	2023/03/06	arxiv:2303.03378, blog	Natural Language Processing, Multimodal	Google	PaLM-E
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages	2023/03/02	arxiv:2303.01037, blog	Natural Language Processing	Google	Universal Speech Model (USM)
Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)	2023/02/27	arxiv:2302.14045	Natural Language Processing	Microsoft	Kosmos-1
Structure and Content-Guided Video Synthesis with Diffusion Models (Gen-1)	2023/02/06	arxiv:2302.03011, blog post	Video-to-Video	Runway	Gen-1
Dreamix: Video Diffusion Models are General Video Editors	2023/02/03	arxiv:2302.01329, blog post	Video	Google	Dreamix
FLAME: A small language model for spreadsheet formulas	2023/01/31	arxiv:2301.13779	Natural Language Processing	Microsoft	FLAME
SingSong: Generating musical accompaniments from singing	2023/01/30	arxiv:2301.12662, blog post	Audio	Google	SingSong
MusicLM: Generating Music From Text	2023/01/26	arxiv:2301.11325, blog post	Audio	Google	MusicLM
Mastering Diverse Domains through World Models (DreamerV3)	2023/01/10	arxiv:2301.04104v1, blogpost	Reinforcement Learning	DeepMind	DreamerV3
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)	2023/01/05	arxiv:2301.02111, Demo	Audio	Microsoft	VALL-E
Muse: Text-To-Image Generation via Masked Generative Transformers	2023/01/02	arxiv:2301.00704, blog post	Computer Vision	Google	Muse
InstructPix2Pix: Learning to Follow Image Editing Instructions	2022/11/17	arxiv:2211.09800, Blog Post	Computer Vision	UC Berkeley	InstructPix2Pix
STaR: Bootstrapping Reasoning With Reasoning	2022/03/28	arxiv:2203.14465	Natural Language Processing	Stanford, Google	STaR
Memorizing Transformers	2022/03/16	arxiv:2203.08913	Natural Language Processing	Google
Block-Recurrent Transformers	2022/03/11	arxiv:2203.07852	Natural Language Processing	Google
Probabilistic Face Embeddings	2019/04/21	arxiv:1904.09658	Computer Vision	Michigan State	PFEs

How the field evolved

If you read the table top to bottom, four broad waves show up.

The first is the deep learning revival, roughly 2012 to 2016. AlexNet (Krizhevsky, Sutskever, Hinton, 2012) showed that GPUs and large labeled datasets (ImageNet) could push convolutional networks past hand-engineered features. In the 2012 ImageNet challenge it reached a top-5 error of 15.3 percent, against 26.2 percent for the second-best entry, and most of the computer vision community switched to deep networks over the next couple of years. Word2Vec (Mikolov et al., 2013) brought learned representations to language, mapping words to dense vectors where geometric relationships matched semantic ones. The famous "king minus man plus woman equals queen" example came from this paper. GANs (Goodfellow et al., 2014) introduced adversarial training for generation, pitting a generator against a discriminator in a minimax game. DQN, A3C, and the AlphaGo line showed that deep RL could master games that had resisted earlier methods. AlphaGo beat Lee Sedol in March 2016, a milestone that many experts had expected to be about a decade away. ResNet (He et al., 2015) made it practical to train networks more than a hundred layers deep by adding identity skip connections, which keep very deep networks optimizable. Most of these papers are still heavily cited, and the ideas they introduced (CNNs, GANs, residual blocks, replay buffers) show up throughout modern systems.
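The "king minus man plus woman" arithmetic mentioned above can be sketched in a few lines. This is only an illustration: the two-dimensional vectors are hand-picked for the example and are not real Word2Vec embeddings, and the helper names (TOY_VECTORS, cosine, analogy) are hypothetical.

```python
import math

# Hand-picked toy vectors, NOT real Word2Vec output.
# Axis 0 loosely stands for "royalty", axis 1 for "maleness".
TOY_VECTORS = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman", TOY_VECTORS))  # prints "queen"
```

Real embeddings have hundreds of dimensions and the match is approximate rather than exact, but the nearest-neighbor lookup works the same way.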

The second is the transformer era, roughly 2017 to 2020. "Attention Is All You Need" (Vaswani et al., 2017) replaced recurrence with self-attention, which both eased the long-range dependency problems of LSTMs and let whole sequences be processed in parallel, making training on TPUs and GPU clusters far more efficient. BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) split the family tree into encoders and decoders. BERT used masked language modeling to learn bidirectional context, which turned out to be ideal for classification and retrieval. GPT used autoregressive next-token prediction, which scaled much better for generation. T5 (Raffel et al., 2019) unified everything as text-to-text. GPT-3 (Brown et al., 2020) made the bet that scale alone unlocks new behaviors, and it mostly paid off: a 175B-parameter model that had never been fine-tuned on a target task could often match or beat smaller fine-tuned models from a few prompted examples. The Scaling Laws paper (Kaplan et al., 2020) gave the field a quantitative recipe for how loss should drop as parameters, data, and compute grew. ViT (Dosovitskiy et al., 2020) brought transformers to vision, and CLIP (Radford et al., 2021) bridged vision and language by training a joint embedding space on 400 million image-text pairs scraped from the web. By the end of 2020, almost every state-of-the-art system on the major NLP benchmarks used the same basic architecture.
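As a rough illustration of the self-attention step and of the encoder/decoder split described above, here is a minimal pure-Python sketch of scaled dot-product attention. It is a simplification under stated assumptions: a real transformer applies learned query, key, and value projections and uses multiple heads, both omitted here, and the function names are hypothetical. Setting causal=True gives the GPT-style mask where a token only sees earlier positions; causal=False gives BERT-style bidirectional attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Learned projections and multiple heads are left out for brevity.
    """
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                # Decoder-style mask: position i may not look at later positions.
                scores.append(float("-inf"))
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights = softmax(scores)
        width = len(values[0])
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(width)]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bidirectional = attention(tokens, tokens, tokens)          # every token sees all tokens
autoregressive = attention(tokens, tokens, tokens, causal=True)  # token 0 sees only itself
```

Every output row is computed independently of the others, which is why attention parallelizes across sequence positions in a way that a recurrent network cannot.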

The third is the alignment era, roughly 2021 to 2023. The bottleneck shifted from raw capability to getting large models to behave as intended. Christiano et al. (2017) had introduced preference-based RL years earlier, but InstructGPT (Ouyang et al., 2022) showed it at GPT-3 scale: human labelers ranked model outputs, a reward model learned to imitate those rankings, and PPO fine-tuned the language model against the reward. The resulting 1.3B-parameter InstructGPT was preferred to the 175B GPT-3 by human raters, despite having over 100 times fewer parameters. ChatGPT, launched in November 2022, used a similar recipe. Constitutional AI (Bai et al., 2022) replaced most of the human harmlessness labels with model-generated feedback. In a supervised stage, the model critiques its own responses against a written constitution and revises them, and the revisions become fine-tuning data. In an RL stage, a model rather than a human picks the better of two responses, and those AI preference labels train the preference model. This made alignment cheaper to scale and gave Anthropic a way to articulate the behavior they wanted in plain English rather than implicitly through labels. DPO (Rafailov et al., 2023) skipped the separate reward model entirely: it showed that a language model can be optimized directly against preference data with a simple binary cross-entropy objective, which is typically more stable and cheaper than PPO. Many open-source preference-tuning pipelines have moved to DPO or one of its variants. Chain of Thought (Wei et al., 2022), ReAct (Yao et al., 2022), Reflexion, and self-rewarding methods turned chatbots into something closer to agents that can plan, call tools, and revise their own outputs. RLHF, DPO, Constitutional AI, or close relatives of them are now part of how most commercial chatbots are trained.
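The DPO objective described above can be written down compactly. The sketch below computes the per-pair loss from summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the policy being trained and under a frozen reference model. The function and argument names are hypothetical, and beta=0.1 is only an illustrative default, not a recommendation from the paper.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss is a logistic (binary cross-entropy) loss on the margin, so no reward model is sampled or trained separately; the reference log-probabilities play the role of the KL anchor used in PPO-based RLHF.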

The fourth wave, still in progress, is the reasoning and multimodal era. GPT-4 (March 2023) added vision and dramatically improved performance on professional exams. The Sparks of AGI paper (Bubeck et al., 2023) argued that GPT-4 already showed early signs of general intelligence, which kicked off a long debate about how much the benchmark numbers actually measured. Gemini (December 2023) was multimodal from day one, trained jointly on text, images, audio, and video rather than bolted together. Llama 2 (July 2023), Mistral 7B (October 2023), and Mixtral 8x7B (January 2024) pushed open-weights performance closer to the frontier, with Mixtral matching Llama 2 70B on most benchmarks while using only 13B active parameters per token. The Llama 3 herd (July 2024) released a 405B dense transformer with 128K context, substantially narrowing the gap between open and closed frontier models. DeepSeek-R1 (January 2025) showed that large-scale reinforcement learning can elicit long chains of reasoning. Its precursor, DeepSeek-R1-Zero, was trained with RL alone and no supervised reasoning traces, and learned to backtrack, verify, and self-correct on its own; R1 itself added a small set of cold-start examples before RL. AlphaFold 2 and 3 brought attention-based architectures to biology, predicting protein structures with accuracy that is often close to experimental methods. AlphaFold 3, released in 2024, also handles nucleic acids and small-molecule ligands. Demis Hassabis and John Jumper shared the 2024 Nobel Prize in Chemistry for the AlphaFold work.

Papers that are easy to underrate

A few entries in the table do not get the same airtime as the headline architectures, but they end up doing a lot of work.

Adam (Kingma and Ba, 2014) is the optimizer that almost every model on this page was trained with, at least at some point. Variants like AdamW are now the default. The paper itself is short and unflashy.

LoRA (Hu et al., 2021) made fine-tuning large models far more accessible. By freezing the base model and learning only a low-rank update to selected weight matrices, you can adapt a 7B model on a single consumer GPU, especially when combined with quantization. Hugging Face, Replicate, and much of the open-source fine-tuning ecosystem rely on it.
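To make the low-rank update concrete, here is a small sketch of the LoRA forward pass, h = W x + (alpha / r) * B (A x), together with the parameter arithmetic. The 4096 x 4096 layer size and rank 8 are illustrative assumptions rather than values taken from any particular model, and the helper names are hypothetical.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning vs. a rank-r LoRA update."""
    full = d_out * d_in                # every entry of W is trainable
    low_rank = rank * (d_out + d_in)   # B is d_out x r, A is r x d_in
    return full, low_rank


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base output plus the scaled low-rank correction."""
    base = matvec(W, x)                # W stays frozen during training
    update = matvec(B, matvec(A, x))   # only A and B receive gradients
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


full, low_rank = lora_param_counts(4096, 4096, 8)
print(full, low_rank, full // low_rank)  # 16777216 65536 256
```

In the paper B is initialized to zero, so the adapted layer starts out identical to the frozen base layer and only drifts away from it as A and B are trained.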

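The Chinchilla paper (Hoffmann et al., 2022) corrected the original Kaplan scaling laws. Kaplan's recipe favored spending extra compute mostly on larger models; Chinchilla showed that for a given compute budget, parameters and training tokens should grow in roughly equal proportion, which works out to about 20 tokens per parameter. The 70B Chinchilla model, trained on 1.4 trillion tokens, outperformed the 280B Gopher at a similar compute budget. Many models since 2022 have followed Chinchilla-style ratios, or deliberately trained on even more data to make inference cheaper.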
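The compute-optimal trade-off can be worked through with back-of-the-envelope arithmetic. The sketch below assumes the common approximation that training cost is about 6 FLOPs per parameter per token (C = 6 * N * D) and a rule of thumb of roughly 20 tokens per parameter; both are simplifications of the paper's fitted results, and the function name is hypothetical.

```python
import math


def compute_optimal(flops, tokens_per_param=20.0):
    """Split a training FLOP budget into parameters N and tokens D.

    Assumes C = 6 * N * D and D = tokens_per_param * N,
    so N = sqrt(C / (6 * tokens_per_param)).
    """
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Budget implied by a 70B-parameter model trained on 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
params, tokens = compute_optimal(budget)
print(f"{params / 1e9:.0f}B parameters, {tokens / 1e12:.1f}T tokens")
```

Under these assumptions, multiplying the budget by 4 doubles both the parameter count and the token count, which is the "scale both together" message of the paper.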

Whisper (Radford et al., 2022) made high-quality multilingual speech recognition freely available. The model was trained on 680,000 hours of weakly supervised audio collected from the internet, and it has become a common default backbone for transcription products, podcast tools, and voice agents.

The original Latent Diffusion paper (Rombach et al., 2021, published at CVPR 2022) is the actual technical foundation of Stable Diffusion, which was released openly in August 2022 with compute support from Stability AI. Without this paper, the consumer text-to-image boom of 2022 to 2023 probably would not have happened.

How to use this index

If you are new to the field and want a reading order, a reasonable path is: ImageNet, AlexNet, Word2Vec, Seq2Seq, Attention Is All You Need, BERT, GPT-2, Scaling Laws, GPT-3, Chain of Thought, InstructGPT, Chinchilla, Constitutional AI, LLaMA, GPT-4 Technical Report, DPO, Llama 3, DeepSeek-R1. That gives you the spine of the modern stack in chronological order, with most papers building on the ones before them.

If you are tracking a specific subfield, the "Type" column groups papers by area. Vision papers are easy to skim by filtering on Computer Vision. RL papers are sparser but include the most cited single results in the field (AlphaGo, MuZero, DQN, PPO). Audio is covered by WaveNet, AudioLM, MusicGen, Whisper, and SeamlessM4T. Robotics shows up under RT-2 and PaLM-E.

If you are looking for the alignment and safety literature in particular, the core spine is: Christiano et al. (2017), InstructGPT (2022), the Anthropic HH paper (2022), Constitutional AI (2022), and DPO (2023). Each of these is readable without much background, and together they shape how production models are trained today.

A note on what is missing

This page is a living index. Some papers are obviously canonical and missing only because no one has written a row yet. Likely additions worth tracking: the original GAN follow-ups (DCGAN, StyleGAN), the long context line (FlashAttention, Mamba, state-space models), agent benchmarks (SWE-bench, GAIA, OSWorld), safety and evaluation work (RealToxicityPrompts, TruthfulQA, HELM), the o-series reasoning papers from OpenAI, and the Anthropic interpretability papers on dictionary learning and circuits. New rows can be added in chronological order under either table.
