Flamingo (visual language model)
Last reviewed
Jun 3, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,537 words
Flamingo is a family of visual language models (VLMs) developed by DeepMind and introduced in April 2022 in the paper "Flamingo: a Visual Language Model for Few-Shot Learning." The work was led by Jean-Baptiste Alayrac, and the paper lists 27 authors in total. An accompanying blog post appeared on 28 April 2022, one day before the first arXiv preprint [1][2]. Flamingo accepts sequences of arbitrarily interleaved images, videos, and text and produces free-form text in response. Its defining property is few-shot, in-context learning: a single trained model adapts to a new vision-language task when prompted with a handful of examples, without any task-specific fine-tuning or weight updates [1][3].
The model was notable both for its results and for its design philosophy. Rather than training a large multimodal network from scratch, Flamingo bridges two powerful pretrained and frozen models, one for vision and one for language, using a small set of newly trained connecting components. This recipe proved influential, and it was reproduced in the open-source projects OpenFlamingo and IDEFICS and echoed in many later VLMs [4][5].
By 2022 the dominant approach to vision-language tasks such as image captioning and visual question answering relied on fine-tuning: a model was specialized to each benchmark using thousands of annotated examples. In natural language processing, by contrast, large language models had shown that a single model could handle many tasks from a few in-context prompts. Flamingo set out to bring that few-shot, prompt-based flexibility to multimodal problems [1][3].
The project drew on several lines of work at DeepMind. Its language backbone is Chinchilla, the compute-optimal large language model that the lab had introduced shortly before [2][6]. Its visual token bottleneck is an adaptation of the Perceiver architecture. These pieces were combined so that the expensive, separately pretrained models could be reused while only lightweight bridging layers were learned [1].
Flamingo has four main parts: a frozen vision encoder, a Perceiver Resampler, a frozen pretrained language model, and gated cross-attention layers inserted into that language model [1][3].
The vision encoder is a Normalizer-Free ResNet, specifically the NFNet-F6 model, pretrained from scratch as a dual encoder with a contrastive image-text objective similar to CLIP. After this contrastive pretraining the encoder is frozen and reused unchanged [1][7]. The encoder turns an image, or each frame of a video, into a grid of feature vectors.
Because a raw feature grid can be large and variable in size, Flamingo passes it through a Perceiver Resampler. Using a fixed set of learned latent queries, the resampler attends over the visual features and emits a small, fixed number of output tokens, 64 in practice, regardless of input resolution or video length [1][7]. This keeps the number of visual tokens manageable and uniform across images and videos.
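The fixed-size output can be illustrated with a minimal sketch. It assumes a single attention head with no learned projections, and it leaves out the feed-forward layers and stacked structure of the actual resampler. The function names, the feature dimension of 8, and the random values are hypothetical. Only the count of 64 latent queries comes from the description above.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def resample(features, latents):
    """Cross-attention with latents as queries and features as keys and values.

    Simplified for illustration: one head, no learned projections.
    Returns one output vector per latent query.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in latents:
        scores = [scale * sum(q * f for q, f in zip(query, feat)) for feat in features]
        weights = softmax(scores)
        outputs.append(
            [sum(w * feat[j] for w, feat in zip(weights, features)) for j in range(dim)]
        )
    return outputs

rng = random.Random(0)
DIM = 8            # illustrative feature size, not the real model's
NUM_LATENTS = 64   # number of output visual tokens, as stated in the text

# Stand-ins for learned latent queries and for encoder feature grids.
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]
small_grid = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(49)]
large_grid = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(400)]

tokens_small = resample(small_grid, latents)
tokens_large = resample(large_grid, latents)
print(len(tokens_small), len(tokens_large))
```

Both calls return 64 vectors even though the inputs contain 49 and 400 feature vectors, because the number of outputs is set by the number of latent queries and not by the size of the feature grid.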
The language model is a pretrained Chinchilla Transformer whose original weights stay frozen. To let text attend to images, the authors interleave new gated cross-attention dense layers between the frozen language layers. These layers cross-attend from the text to the resampled visual tokens. Crucially, the output of each newly added layer is multiplied by tanh(alpha) before it is added back to the residual stream, where alpha is a per-layer learnable scalar initialized to zero. Because tanh(0) = 0, the network behaves exactly like the original language model at the start of training, and the visual pathway opens up gradually as alpha moves away from zero. This tanh-gating stabilizes training when mixing fresh components into a frozen pretrained model [1][7].
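The effect of the gate can be shown with a few lines of arithmetic. This is a sketch under the assumption that the gated output is added to the hidden state as a residual. The function name and the numbers are hypothetical, and the cross-attention computation itself is replaced by a fixed placeholder vector.

```python
import math

def gated_residual(hidden, new_layer_output, alpha):
    """Add the new layer's output to the hidden state, scaled by tanh(alpha)."""
    gate = math.tanh(alpha)
    return [h + gate * u for h, u in zip(hidden, new_layer_output)]

hidden = [0.5, -1.0, 2.0]          # activations from the frozen language model
cross_attn_out = [3.0, 3.0, 3.0]   # placeholder for a cross-attention output

at_init = gated_residual(hidden, cross_attn_out, alpha=0.0)
later = gated_residual(hidden, cross_attn_out, alpha=0.5)

print(at_init == hidden)  # the new layer has no effect when alpha is zero
print(later)
```

With alpha at zero the hidden state passes through unchanged, so the frozen model's behavior is preserved. Once alpha becomes nonzero, the visual contribution is mixed in with a weight bounded between -1 and 1.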
A masking scheme controls which images each piece of text can see. By default a text token attends only to the visual tokens of the single image or video that immediately precedes it in the sequence, although the model can still reason about earlier images through the language model's own self-attention [1][7].
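A small sketch can make the default masking rule concrete. It assumes a toy sequence in which the string "<image>" marks an image position and every other item is a text token. The marker, the function names, and the choice to let text before any image see no visual tokens are illustrative assumptions. The mask is expressed per image, and in a real model each image would expand to its block of resampled visual tokens.

```python
def image_index_per_text_token(sequence, image_marker="<image>"):
    """For each text token, return the index of the most recent preceding image.

    Text that appears before any image is assigned None.
    """
    assignments = []
    latest_image = None
    images_seen = 0
    for item in sequence:
        if item == image_marker:
            latest_image = images_seen
            images_seen += 1
        else:
            assignments.append(latest_image)
    return assignments, images_seen

def cross_attention_mask(sequence, image_marker="<image>"):
    """Boolean mask with one row per text token and one column per image."""
    assignments, num_images = image_index_per_text_token(sequence, image_marker)
    return [[idx == image for image in range(num_images)] for idx in assignments]

sequence = ["<image>", "A", "cat", "<image>", "A", "dog"]
mask = cross_attention_mask(sequence)
for row in mask:
    print(row)
```

Here the tokens "A" and "cat" may attend only to the first image, and the second "A" and "dog" only to the second. Information about the first image can still reach later tokens indirectly, through self-attention over the earlier text.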
Flamingo was trained on a mixture of web-scale datasets. The most heavily weighted is MultiModal MassiveWeb (M3W), a corpus of text and images interleaved as they appear on web pages, which is what gives the model its interleaved few-shot ability. This is combined with two image-text pair datasets (ALIGN and a dataset called LTIP) and a video-text dataset (VTP). The per-dataset weights are 1.0 for M3W, 0.2 for ALIGN, 0.2 for LTIP, and 0.03 for VTP [1][7].
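One way to read the mixture weights is as coefficients on each dataset's training loss. The sketch below assumes that interpretation. The weights are those given above, but the loss values are placeholders chosen only for the arithmetic, and the function name is hypothetical.

```python
# Per-dataset weights as listed in the text.
WEIGHTS = {"M3W": 1.0, "ALIGN": 0.2, "LTIP": 0.2, "VTP": 0.03}

def mixture_loss(per_dataset_loss, weights=WEIGHTS):
    """Weighted sum of per-dataset losses (assumed form of the mixture)."""
    return sum(weights[name] * per_dataset_loss[name] for name in weights)

# Placeholder loss values, not measurements from any real training run.
example_losses = {"M3W": 2.0, "ALIGN": 3.0, "LTIP": 3.0, "VTP": 4.0}
total = mixture_loss(example_losses)
print(round(total, 2))
```

Under this reading, the interleaved web data dominates the training signal, while the video-text data contributes only a small share.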
Flamingo is designed to be used the way a large language model is prompted. A user constructs a sequence that interleaves example image-and-answer pairs with a final query image, and the model completes the text. Because it was trained on naturally interleaved web data, it can fold in several such examples and infer the intended task from them, an ability the paper calls in-context few-shot learning [1][3].
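The structure of such a prompt can be sketched as a list of interleaved segments. The tuple representation, the function name, the "Output:" prefix, and the file names are all hypothetical. Flamingo's weights and interface were not publicly released, so this illustrates only the ordering of examples and query, not a real API.

```python
def build_few_shot_prompt(examples, query_image, prefix="Output:"):
    """Interleave (image, answer) support examples, then append the query image.

    The final text segment is left open so the model can complete it.
    """
    prompt = []
    for image, answer in examples:
        prompt.append(("image", image))
        prompt.append(("text", f"{prefix} {answer}"))
    prompt.append(("image", query_image))
    prompt.append(("text", prefix))
    return prompt

# Hypothetical file names standing in for actual images.
support = [
    ("chinchilla.jpg", "This is a chinchilla."),
    ("shiba.jpg", "This is a shiba."),
]
prompt = build_few_shot_prompt(support, "flamingo.jpg")
for segment in prompt:
    print(segment)
```

Changing the task, for example from captioning to question answering, would mean changing only the text in the support examples. The model and its weights stay the same.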
This matters because it removes the per-task fine-tuning step that earlier vision-language systems required. The same frozen Flamingo weights can do captioning, open-ended visual question answering, multiple-choice questions, and visual dialogue, with the task specified entirely through the prompt. The blog post reports that Flamingo beats all previous few-shot approaches when given as few as four examples per task [2][3].
The Flamingo family comes in three sizes, distinguished by the frozen Chinchilla language model each one wraps. The trainable parameters are concentrated in the gated cross-attention layers and the Perceiver Resampler.
| Model | Frozen language model | Added trained parameters | Total parameters |
|---|---|---|---|
| Flamingo-3B | Chinchilla 1.4B | about 1.4B | about 3B |
| Flamingo-9B | Chinchilla 7B | about 1.8B | about 9B |
| Flamingo (Flamingo-80B) | Chinchilla 70B | about 10B | about 80B |
The largest configuration starts from the 70B-parameter Chinchilla model and reaches roughly 80 billion parameters in total, which is why it is often called Flamingo-80B [2][6][8].
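The table's figures can be checked with simple arithmetic. This sketch uses only the rounded values from the table and does not count the frozen vision encoder, whose size the table does not list, so the sums come out slightly below the nominal totals. The variable names are illustrative.

```python
# (frozen language model, added trained parameters), in billions, from the table.
MODELS = {
    "Flamingo-3B": (1.4, 1.4),
    "Flamingo-9B": (7.0, 1.8),
    "Flamingo-80B": (70.0, 10.0),
}

summary = {}
for name, (frozen_lm, trained) in MODELS.items():
    subtotal = frozen_lm + trained
    summary[name] = {
        "subtotal_billions": subtotal,
        "trained_fraction": trained / subtotal,
    }

for name, stats in summary.items():
    print(name, round(stats["subtotal_billions"], 1), round(stats["trained_fraction"], 3))
```

On these rounded figures, the trained share of the language-side parameters shrinks as the frozen model grows: about half for the smallest model and one eighth for the largest.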
Across the 16 image and video understanding benchmarks the authors evaluated, a single Flamingo model set a new state of the art in the few-shot setting, outperforming prior zero-shot and few-shot methods by a wide margin. More striking, on six of those 16 tasks Flamingo's few-shot results surpassed the best published results from models that had been fine-tuned on far larger amounts of task-specific data [1][8]. The benchmarks spanned captioning, visual question answering, and multiple-choice tasks over both still images and video.
Flamingo helped establish a template that many subsequent vision-language models followed: take strong frozen unimodal models and connect them with a lightweight, trainable bridge instead of training one giant multimodal model end to end. The Perceiver Resampler and gated cross-attention became reference designs that later systems borrowed or adapted [4][9].
DeepMind itself was pursuing multimodal generalist systems in the same period, including Gato, and the broader push toward natively multimodal models continued at the merged organization that produced the Gemini family [9]. Because Flamingo's weights were never publicly released, however, much of its direct practical impact came through open reproductions.
OpenFlamingo is an open-source framework released in 2023 by researchers associated with the ML Foundations group to replicate DeepMind's Flamingo models. Described in a paper submitted in August 2023, it offers autoregressive vision-language models ranging from 3B to 9B parameters, built on CLIP ViT-L/14 vision encoders paired with openly available language models. On seven vision-language datasets, OpenFlamingo models reached on average between 80 and 89 percent of the performance of the corresponding Flamingo models, while being fully open in code and weights [4].
IDEFICS, whose name stands for "Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS," is an open-access reproduction of Flamingo built by Hugging Face and released on 22 August 2023. It was published in 9-billion- and 80-billion-parameter variants, each in base and instruction-tuned versions, and was constructed entirely from publicly available data and models, using a LLaMA language model and an OpenCLIP vision encoder. Its training set included a new web-scale corpus called OBELICS, comprising 141 million interleaved image-text documents, 353 million images, and about 115 billion text tokens. IDEFICS was reported to be comparable in performance to the original closed-source Flamingo across several image-text benchmarks [5][10].
Flamingo demonstrated that few-shot, prompt-based learning, already familiar from large language models, could be extended to multimodal inputs, and that this could be achieved by reusing frozen pretrained components rather than retraining everything. Its architectural choices, the Perceiver Resampler bottleneck and the zero-initialized tanh-gated cross-attention, offered a stable and parameter-efficient way to graft vision onto a language model. Together with its open reproductions OpenFlamingo and IDEFICS, Flamingo became a widely cited reference point in the development of the visual language models that followed [4][5][9].