Models
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,929 words
See also: Terms, Applications, Guides, Foundation models, LLMs
An artificial intelligence model is a mathematical function, typically a neural network, whose parameters are fit to data so the function can map inputs (text, images, audio, sensor readings, actions) to useful outputs. Modern systems are dominated by large pre-trained networks: a single foundation model is trained once on broad data, then adapted for many tasks through prompting, fine-tuning, or tool use. The term "foundation model" was popularized by the Stanford CRFM report "On the Opportunities and Risks of Foundation Models" (Bommasani et al., 2021), which described a paradigm shift toward general-purpose models that serve as a base for downstream applications.
This page is the gateway index to the models covered on AI Wiki. It links to detailed articles on individual model families, surveys the major modalities (text, image, video, audio, vision, embeddings, robotics), and summarizes architectural ideas, scaling trends, and the open versus closed-weight landscape that shape the field as of 2026.
In modern practice an "AI model" almost always refers to a deep learning network trained by gradient-based optimization. The dominant architecture is the transformer introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need." Transformers replaced earlier sequence models such as recurrent networks and convolutional language models, and they now underlie nearly all large language models, most state-of-the-art image and video generators (via diffusion transformers), and a growing share of vision and speech systems.
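The core operation of a transformer, scaled dot-product attention, can be illustrated with a minimal sketch. This version uses plain Python lists and, as a simplifying assumption, feeds the token vectors in directly as queries, keys, and values; a real transformer first applies learned projection matrices and runs many attention heads in parallel. The function and variable names are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy "sequence" of three 2-dimensional token vectors (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention(tokens, tokens, tokens)
```

Each row of `mixed` is a blend of all three input vectors, weighted by how strongly the corresponding token attends to the others.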
A model is identified by:
Progress in generative AI has been driven less by single inventions than by repeatedly increasing scale, and several rough eras can be drawn from public data.
| era | rough years | representative models | notable scale |
|---|---|---|---|
| pre-transformer | up to 2017 | AlexNet, VGG, ResNet, seq2seq LSTMs | tens to hundreds of millions of parameters |
| early transformer | 2017 to 2019 | original Transformer, BERT, GPT-2, T5 | up to ~1.5B parameters |
| GPT-3 era | 2020 to 2022 | GPT-3, Gopher, Chinchilla, PaLM | 70B to 540B parameters |
| chat era | late 2022 to 2023 | ChatGPT, GPT-4, Claude, Bard, LLaMA | hundreds of billions, mixture of experts begins |
| multimodal era | 2024 to 2025 | GPT-4o, Claude 3.5 Sonnet, Gemini, DeepSeek V3 | trillion-parameter sparse models, native multimodality |
| reasoning and agentic era | late 2024 to 2026 | OpenAI o1, OpenAI o3, Claude Opus 4.7, GPT-5, Gemini 3 | inference-time scaling with test-time compute |
The shift from "chat" models to "reasoning" models in 2024 popularized the practice of spending extra compute at inference time, as described in OpenAI's o1 system card and in later work on test-time compute.
Large language models (LLMs) are transformer-based networks trained on text corpora to predict the next token. They are the most widely deployed type of AI model and the basis of chatbots, AI agents, and most tool use systems.
The "frontier" tier (see Frontier models) is dominated by API-only releases from a handful of US labs.
| family | maker | notable members |
|---|---|---|
| GPT | OpenAI | GPT-2, GPT-3, GPT-4, GPT-4o, GPT-4.5, GPT-5, GPT-5.1, GPT-5.2, GPT-5.4, GPT-5.5 |
| OpenAI reasoning | OpenAI | o1, o3, o4-mini |
| Claude | Anthropic | Claude 3 Opus, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 4, Claude Opus 4, Claude Sonnet 4, Claude Opus 4.1, Claude Haiku 4.5, Claude Opus 4.5, Claude Sonnet 4.6, Claude Opus 4.6, Claude Opus 4.7 |
| Gemini | Google DeepMind | Gemini, Gemini 2.5 Pro, Gemini 3, Gemini 3 Pro |
| Grok | xAI | Grok, Grok 4 |
Open-weight models publish trained parameters under permissive or community licenses, allowing local use, fine-tuning, and quantization. Open-weight LLMs cluster around a few national and corporate ecosystems.
| family | origin | notable members |
|---|---|---|
| Llama | Meta | LLaMA, LLaMA 3, Llama 3.2, Llama 3.3 |
| Mistral and Mixtral | Mistral AI | Mistral 7B, Mixtral, Mistral Medium 3 |
| Gemma | Google | Gemma, Gemma 2 |
| Qwen | Alibaba | Qwen, Qwen3 |
| DeepSeek | DeepSeek | DeepSeek, DeepSeek V3, DeepSeek V4, DeepSeek-R1-Distill |
| gpt-oss | OpenAI | gpt-oss |
| Phi | Microsoft | Phi, Phi-4 |
| Falcon | Technology Innovation Institute | Falcon |
| Kimi | Moonshot AI | Kimi, Kimi K2 |
| Doubao and ERNIE | ByteDance, Baidu AI | Doubao, Baidu ERNIE family |
Many open-weight releases use a mixture-of-experts topology, where only a fraction of parameters activate per token. Mixtral 8x7B from Mistral AI was an early high-profile example, and DeepSeek V3 published technical details for a 671B-parameter sparse model with around 37B active parameters per token in its 2024 technical report.
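A minimal sketch of top-k expert routing illustrates why only a fraction of parameters is active per token. The toy experts, router logits, and function names below are hypothetical; in a real model the router is a learned layer, each expert is a feed-forward network, and individual releases differ in how they normalize and balance routing weights.

```python
import math

def top_k_route(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, router_logits, k=2):
    """Combine the outputs of the k routed experts; the rest are never run."""
    routing = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in routing.items())

# Hypothetical toy experts: each one is just a scalar function.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
logits = [0.1, 2.0, 2.0, -1.0]  # assumed router scores for one token
y = moe_layer(3.0, experts, logits, k=2)

# Active-parameter share implied by the DeepSeek V3 figures quoted above.
active_fraction = 37 / 671
```

With these assumed logits, experts 1 and 2 each receive half the weight, and `active_fraction` works out to roughly 5.5 percent of total parameters per token.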
LLM weights are usually delivered to end users through a chat product or developer API. The largest of these have separate articles.
| product | underlying model family | notes |
|---|---|---|
| ChatGPT | GPT and o-series | first mass-market chat product, launched November 2022 |
| Claude | Claude family | chat product from Anthropic |
| Gemini | Gemini family | rebranded from Google Bard in February 2024 |
| Grok | Grok family | chat product from xAI integrated with X |
| Kimi | Moonshot AI Kimi family | popular chat assistant in China |
| Doubao | ByteDance Doubao family | leading consumer chatbot in China |
A multimodal model accepts or produces more than one modality. Native multimodality means the same network handles text and other modalities through shared tokens or embeddings, in contrast to early systems that bolted a vision encoder onto a frozen language model.
| model | modalities in | modalities out | notable feature |
|---|---|---|---|
| GPT-4o | text, image, audio | text, image, audio | single network with native voice, released by OpenAI in May 2024 |
| Claude 3.5 Sonnet | text, image | text | strong vision and coding scores at release |
| Gemini 2.5 Pro | text, image, video, audio | text | long-context, video understanding |
| Gemini 3 Pro | text, image, video, audio | text | follow-on to the original Gemini Pro |
| GPT-5 and GPT-5.1 | text, image, audio | text, image, audio | reasoning plus multimodal |
| Grok 4 | text, image | text | xAI flagship multimodal model |
| Reka family | text, image, video, audio | text | early native multimodal effort |
| GPT Image 1 | text, image | image | image generation tied to ChatGPT |
| Mistral OCR 3 | document image | text | document understanding |
For a deeper survey see Multimodal AI, which covers vision-language fusion, audio tokens, and video understanding.
Text-to-image generation is dominated by diffusion models and, more recently, diffusion transformers. These models reverse a noising process to sample images from a learned distribution.
| model | maker | architecture |
|---|---|---|
| DALL-E | OpenAI | autoregressive then diffusion |
| Stable Diffusion | Stability AI | latent diffusion, U-Net |
| Stable Diffusion 3 | Stability AI | rectified flow, MMDiT transformer |
| Imagen | Google | cascaded diffusion |
| Imagen 4 | Google | latest Imagen generation |
| Midjourney | Midjourney | proprietary, image-focused |
| Flux | Black Forest Labs | flow matching, MMDiT |
| FLUX.2 | Black Forest Labs | next-generation Flux family |
| GPT Image 1 | OpenAI | autoregressive image plus diffusion decoder |
| Adobe Firefly | Adobe | commercial-safe training |
| Recraft AI | Recraft | vector and design oriented |
| Leonardo.AI | Leonardo | game and creative pipelines |
| Magnific AI | Magnific | upscaling and detail enhancement |
For architectural details see Stable Diffusion, Diffusion Transformer (DiT), GAN, and the broader AI art page. Local interfaces such as AUTOMATIC1111 and ComfyUI are widely used to run open-weight image models.
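The noising process that diffusion models learn to reverse can be sketched in a few lines. The example below assumes a DDPM-style linear noise schedule with commonly quoted default values; it shows only the forward (noising) direction, not a trained denoiser, and rectified-flow or flow-matching models such as those in the table use a different formulation. All names are illustrative.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances rising linearly (assumed DDPM-style values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_signal(betas):
    """alpha_bar[t]: running product of (1 - beta), the surviving signal fraction."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from x_0: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
alpha_bars = cumulative_signal(linear_beta_schedule(1000))
pixels = [0.5, -0.25, 1.0]  # toy "image" of three values
slightly_noisy = add_noise(pixels, 10, alpha_bars, rng)
almost_pure_noise = add_noise(pixels, 999, alpha_bars, rng)
```

At early steps the sample stays close to the original values; by the final step almost no signal remains, which is the starting point a trained model denoises from when generating.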
Video models extend image diffusion to time, typically using 3D attention or factorized space-time attention. They became broadly viable between 2024 and 2026.
| model | maker | notes |
|---|---|---|
| Sora | OpenAI | first widely demoed long-form text-to-video, previewed in February 2024 |
| Sora 2 | OpenAI | follow-on with audio and longer clips |
| Veo 2 | Google DeepMind | high-resolution generation |
| Runway Gen-3 Alpha | Runway | creator-focused product |
| Pika | Pika Labs | short-form video |
| Kling 2.1 | Kuaishou | strong physical realism |
| Higgsfield AI | Higgsfield | cinematic and motion controls |
| Tavus | Tavus | personalized video avatars |
| HeyGen | HeyGen | digital presenter videos |
The broader article Text-to-video generation covers benchmarks, training data, and product timelines.
Audio models split into recognition (speech to text), synthesis (text to speech), and music or sound generation. Most modern systems are transformer-based, and some music generators use diffusion techniques borrowed from image generation.
| model | maker | task |
|---|---|---|
| Whisper | OpenAI | multilingual speech recognition |
| ElevenLabs | ElevenLabs | text-to-speech and voice cloning |
| ElevenLabs Music | ElevenLabs | music generation |
| Cartesia | Cartesia | low-latency state-space speech models |
| Suno and Udio | Suno, Udio | song generation from prompts |
| GPT-4o audio | OpenAI | end-to-end voice in chat |
| Gemini Live | Google | streaming voice and vision |
For an overview of speech recognition and the Whisper architecture, see Automatic Speech Recognition and Text-to-Speech.
Classical computer vision models still play a major role in production systems for object detection, image segmentation, depth estimation, facial recognition, and autonomous driving.
| model | year of original release | task |
|---|---|---|
| VGG | 2014 | image classification |
| ResNet | 2015 | image classification, residual blocks |
| MobileNet | 2017 | efficient on-device vision |
| YOLO | 2016 onward | real-time object detection |
| DETR | 2020 | transformer-based object detection |
| Vision Transformer | 2020 | pure-attention vision backbone |
| Swin Transformer | 2021 | shifted-window vision transformer |
| CLIP | 2021 | contrastive image-text alignment |
| Segment Anything Model | 2023 | promptable image segmentation |
| NeRF | 2020 | neural radiance fields, novel view synthesis |
For convolutional foundations see Convolutional Neural Network, Convolutional Layer, and Pooling. The Image Recognition page surveys benchmark history including CIFAR-10 and ImageNet.
Embedding models map inputs to dense vectors used for retrieval, clustering, and downstream classification. They are central to retrieval-augmented generation pipelines and to vector search backends.
| model family | maker | notes |
|---|---|---|
| text-embedding-3 (small and large) | OpenAI | API embeddings, 1536 and 3072 dimensions |
| BGE family (bge-large, bge-m3) | Beijing Academy of AI | strong open-weight embeddings, multilingual M3 variant |
| Sentence Transformers | UKP Lab and community | popular Python framework, all-MiniLM-L6-v2 widely used baseline |
| E5 family (multilingual, instruct) | Microsoft Research | Wang et al. embedding line |
| GTE | Alibaba | general text embeddings, multilingual |
| Voyage | Voyage AI | retrieval-tuned embeddings, partner to Anthropic |
| Nomic Embed | Nomic | open-weight, fully reproducible training |
| Cohere Embed | Cohere | enterprise embedding API |
For underlying ideas see vector databases, retrieval-augmented generation, and Sentence Similarity.
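Retrieval over embeddings usually comes down to comparing vectors. The sketch below ranks documents by cosine similarity to a query; the three-dimensional vectors and document names are made up for illustration, whereas real embedding models emit hundreds or thousands of dimensions and production systems rely on a vector index rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_matches(query_vec, index, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]

# Hypothetical embeddings; in practice these come from an embedding model.
index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
results = top_matches(query, index, k=2)
```

In a retrieval-augmented generation pipeline, the text behind the returned ids would then be placed into the language model's prompt.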
In 2024 OpenAI released o1, the first widely deployed model that uses long chains of internal reasoning before producing an answer. The pattern was quickly followed by other labs.
| model | maker | notes |
|---|---|---|
| OpenAI o1 | OpenAI | first o-series reasoning model, public preview September 2024 |
| OpenAI o3 | OpenAI | successor to o1, strong on math and coding benchmarks |
| OpenAI o4-mini | OpenAI | smaller reasoning model |
| Claude reasoning modes | Anthropic | extended thinking added to Claude 3.7 Sonnet and successors |
| DeepSeek-R1 and DeepSeek-R1-Distill | DeepSeek | open-weight reasoning models, January 2025 release |
| Gemini 2.5 Pro | Google DeepMind | thinking mode and toolformer-style reasoning |
| Qwen QwQ | Alibaba | open-weight reasoning model |
The topic page Test-time compute covers benchmarks like AIME, MMLU-Pro, GSM8K, and the trend of trading inference compute for accuracy.
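One simple way to trade inference compute for accuracy is to sample several answers and keep the most common one. The simulation below is a sketch of that voting idea only; it is not a description of how o1 or any other listed model works internally. The noisy solver, its 60 percent success rate, and the answer values are all assumptions chosen for illustration.

```python
import random
from collections import Counter

def noisy_solver(rng, correct_answer=42, p_correct=0.6):
    """Hypothetical stand-in for one sampled reasoning chain."""
    if rng.random() < p_correct:
        return correct_answer
    return rng.choice([41, 43, 44])  # wrong answers scatter across alternatives

def majority_vote(rng, num_samples):
    """Spend more inference compute: sample several answers, keep the mode."""
    answers = [noisy_solver(rng) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

def accuracy(num_samples, trials=2000, seed=0):
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    hits = sum(majority_vote(rng, num_samples) == 42 for _ in range(trials))
    return hits / trials

single = accuracy(num_samples=1)
voted = accuracy(num_samples=15)
```

Under these assumptions, voting over 15 samples is markedly more accurate than a single sample, at roughly 15 times the inference cost.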
Vision-language-action (VLA) models extend large pre-trained transformers to robot control by adding action tokens. The Stanford CRFM 2024 survey, the Google DeepMind RT-2 paper, and Physical Intelligence's 2024 papers are core references.
| model | maker | notes |
|---|---|---|
| RT-1 and RT-2 | Google DeepMind | early scaling of vision-language-action policies, RT-2 announced July 2023 |
| OpenVLA | Stanford and partners | open-weight 7B VLA built on Llama-2 and Prismatic VLM, June 2024 |
| Pi0 (π0) | Physical Intelligence | flow-matching VLA, October 2024 paper |
| π0.5 | Physical Intelligence | follow-on with hierarchical action |
| Helix | Figure AI | proprietary humanoid VLA |
| GR00T | NVIDIA | humanoid robot foundation model line, announced GTC 2024 |
For architectural and training details see Robot foundation model, Robot manipulation, and SLAM.
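One way to represent robot actions as tokens is to discretize each continuous action dimension into a fixed number of bins. The sketch below assumes 256 uniform bins over a normalized range of -1 to 1; the exact scheme varies between models, and flow-matching policies instead produce continuous actions. Function names and the example command are hypothetical.

```python
def action_to_tokens(action, low=-1.0, high=1.0, num_bins=256):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        fraction = (clipped - low) / (high - low)
        # The top edge of the range falls into the last bin.
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens

def tokens_to_action(tokens, low=-1.0, high=1.0, num_bins=256):
    """Recover the centre of each bin as the continuous command."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]

# Hypothetical 4-dimensional action, already normalized to [-1, 1].
arm_command = [0.25, -0.8, 0.0, 1.0]
tokens = action_to_tokens(arm_command)
decoded = tokens_to_action(tokens)
```

The round trip loses at most half a bin width per dimension, which is the precision cost of treating actions as vocabulary entries a language model can predict.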
Not all models target language or images. Some of the most influential systems are highly specialized.
| model | task | maker |
|---|---|---|
| AlphaGo and AlphaZero | board games via MCTS plus deep RL | DeepMind |
| MuZero | model-based reinforcement learning | DeepMind |
| AlphaFold 2 and AlphaFold 3 | protein and biomolecular structure prediction | DeepMind and Isomorphic Labs |
| GraphCast | medium-range weather forecasting | Google DeepMind |
| Aurora | high-resolution weather and atmospheric model | Microsoft Research |
| ESM-2 and ESM-3 | protein language models | Meta and EvolutionaryScale |
| ClimaX | climate modeling | Microsoft Research and partners |
| RoseTTAFold | protein structure prediction | Baker Lab |
| Tx-LLM and biomedical LLMs | medical reasoning | various |
AlphaFold 2 was reported in Jumper et al., Nature 2021. AlphaFold 3, covering protein-ligand and nucleic acid complexes, was reported in Abramson et al., Nature 2024. The 2024 Nobel Prize in Chemistry was awarded half to David Baker for computational protein design and half jointly to Demis Hassabis and John Jumper for protein structure prediction with AlphaFold.
Most modern AI models can be summarized by a few architectural ideas, each with its own page.
| concept | summary |
|---|---|
| Transformer | self-attention, the dominant architecture |
| Convolutional Neural Network | spatial weight sharing for vision |
| Diffusion model | iterative denoising for image, video, audio |
| Diffusion Transformer (DiT) | transformer backbones replacing U-Net in diffusion |
| Generative adversarial network | adversarial pair for generation, dominant pre-2022 |
| Mixture of experts | sparse routing for parameter scaling |
| Vision transformer | transformer applied to image patches |
| Swin Transformer | hierarchical vision transformer |
Training recipes pull from a smaller set of components.
| component | role |
|---|---|
| pre-training | learn general representations from broad data |
| fine-tuning | adapt a pre-trained model to a task or domain |
| LoRA and QLoRA | parameter-efficient fine-tuning |
| RLHF | reinforcement learning from human feedback |
| Constitutional AI | AI feedback guided by a written set of principles, developed by Anthropic |
| In-context learning | task adaptation purely from prompt examples |
| Prompt engineering | designing effective inputs |
| Tool use | letting models call external functions |
| Speculative decoding | faster inference using a draft model |
| Byte pair encoding | dominant subword tokenization scheme |
| Model merging | combining trained checkpoints |
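As an illustration of why LoRA counts as parameter-efficient, the sketch below adds a low-rank update to a frozen weight matrix and compares parameter counts. The toy matrices, the scaling convention `alpha / rank`, and the 4096-by-4096 layer size are assumptions for illustration, not the configuration of any particular model.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def trainable_fraction(d_out, d_in, rank):
    """Adapter parameters as a share of the full weight matrix."""
    full = d_out * d_in
    adapter = rank * (d_in + d_out)
    return adapter / full

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pre-trained weight (toy 2x2)
A = [[0.5, 0.5]]              # rank-1 down-projection, shape (rank, d_in)
B = [[0.0], [0.0]]            # up-projection initialized to zero, shape (d_out, rank)
y = lora_forward([2.0, 4.0], W, A, B, alpha=1.0, rank=1)
share = trainable_fraction(4096, 4096, rank=8)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and for the assumed layer size a rank-8 adapter trains well under one percent of the parameters in the full matrix.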
Empirical scaling laws describe how loss decreases as a power law in model size, dataset size, and compute. Two papers anchor the field:
Later work added wrinkles:
A model's release form determines who can run it, fine-tune it, or audit it. Three common categories:
| category | rights granted | examples |
|---|---|---|
| closed-weight API | inference only via API | GPT-4, Claude family, Gemini, Grok |
| open weights, restricted license | weights available, license restricts uses or commercial scale | Llama family, Mistral large models, Gemma |
| permissive open source | weights, code, often training scripts under Apache or MIT | Falcon, Pythia, OLMo, gpt-oss, parts of DeepSeek and Qwen |
Local runners such as Ollama, LM Studio, and the GGUF format have made open-weight models broadly usable on consumer hardware. OpenRouter acts as a unified API across both open and closed providers.
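A rough estimate shows why quantization matters for consumer hardware. The helper below computes the memory needed for weights alone at a given precision; it is a back-of-the-envelope sketch that ignores the KV cache, activations, and the per-block metadata that quantized file formats add, so real footprints are somewhat larger. The 7-billion-parameter size is an assumed example.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Compare common precisions for a hypothetical 7B-parameter model.
for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

Dropping from 16-bit to 4-bit weights cuts the estimate by a factor of four, which is often the difference between needing a data-center GPU and fitting on a laptop.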
Model quality is reported on standardized benchmarks. Popular ones with their own pages include MMLU-Pro, GSM8K, MBPP, LongBench, JailbreakBench, and MT-Bench. Each measures different capabilities:
| benchmark | what it measures |
|---|---|
| MMLU-Pro | multi-domain academic knowledge, harder revision of MMLU |
| GSM8K | grade-school arithmetic word problems |
| MBPP | basic Python programming tasks |
| HumanEval | function-completion coding |
| LongBench | long-context understanding |
| MT-Bench | open-ended multi-turn quality, judged by GPT-4 |
| JailbreakBench | resistance to adversarial prompts |
| AIME and MATH | competition-level math |
| ARC-AGI | abstract pattern reasoning |
Benchmark saturation has driven a steady cadence of harder evaluations: MMLU was superseded by MMLU-Pro and GPQA, HumanEval was superseded by SWE-bench and LiveCodeBench, and academic math benchmarks increasingly include AIME, USAMO, and Putnam-style problems.
For finer task taxonomies, see the existing index pages: