Flamingo (visual language model)
Last reviewed
Sources
10 citations
Review status
Source-backed
Revision
v2 · 1,874 words
Flamingo is a family of visual language models (VLMs) built by DeepMind and introduced in April 2022 that brought few-shot, in-context learning to multimodal inputs. A single Flamingo model accepts sequences of arbitrarily interleaved images, videos, and text and produces free-form text in response, adapting to a new vision-language task purely from a handful of prompt examples, with no task-specific fine-tuning or weight updates [1][3]. Its largest configuration has about 80 billion parameters, and on six of the 16 benchmarks the authors studied it outperformed state-of-the-art models that had been fine-tuned on roughly 1,000 times more task-specific data, using just 32 examples per task and never updating its own weights [1][2][8].
The work was led by Jean-Baptiste Alayrac, first among the 27 authors of the paper "Flamingo: a Visual Language Model for Few-Shot Learning." An accompanying blog post was published on 28 April 2022, one day before the first arXiv preprint, and the paper later appeared at NeurIPS 2022 [1][2][3]. Flamingo was notable both for these results and for its design philosophy: rather than training a large multimodal network from scratch, it bridges two powerful pretrained and frozen models, one for vision and one for language, using a small set of newly trained connecting components. That recipe proved influential and was reproduced in the open-source projects OpenFlamingo and IDEFICS and echoed in many later VLMs [4][5].
Flamingo is a vision-language model that treats multimodal tasks the way a large language model treats text tasks: by prompting. As the paper states in its abstract, "For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples" [1]. The same frozen weights can caption images, answer open-ended visual questions, handle multiple-choice questions, and carry on visual dialogue, with the task specified entirely through the prompt rather than through retraining [1][3].
The central problem Flamingo set out to solve, in the authors' words, was that "building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research" [1]. Flamingo's answer was to combine large frozen pretrained models with new, lightweight bridging layers trained on web-scale interleaved data.
By 2022 the dominant approach to vision-language tasks such as image captioning and visual question answering relied on fine-tuning: a model was specialized to each benchmark using thousands of annotated examples. In natural language processing, by contrast, large language models had shown that a single model could handle many tasks from a few in-context prompts. Flamingo set out to bring that few-shot, prompt-based flexibility to multimodal problems [1][3].
The project drew on several lines of work at DeepMind. Its language backbone is Chinchilla, the compute-optimal large language model that the lab had introduced shortly before [2][6]. Its visual token bottleneck is an adaptation of the Perceiver architecture. These pieces were combined so that the expensive, separately pretrained models could be reused while only lightweight bridging layers were learned [1].
Flamingo has four main parts: a frozen vision encoder, a Perceiver Resampler, a frozen pretrained language model, and gated cross-attention layers inserted into that language model [1][3].
The vision encoder is a Normalizer-Free ResNet, specifically the NFNet-F6 model, pretrained from scratch as a dual encoder with a contrastive image-text objective similar to CLIP. After this contrastive pretraining the encoder is frozen and reused unchanged [1][7]. The encoder turns an image, or each frame of a video, into a grid of feature vectors.
Because a raw feature grid can be large and variable in size, Flamingo passes it through a Perceiver Resampler. Using a fixed set of learned latent queries, the resampler attends over the visual features and emits a small, fixed number of output tokens, 64 in practice, regardless of input resolution or video length [1][7]. This keeps the number of visual tokens manageable and uniform across images and videos.
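The fixed-size bottleneck described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a single cross-attention step with no learned projections, feed-forward blocks, or temporal embeddings, and the names `resample`, `random_vectors`, and the toy dimension `DIM` are hypothetical. It only shows why the number of output tokens depends on the number of latent queries and not on the input size.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def resample(visual_features, latent_queries):
    """One simplified cross-attention step (assumption: no learned projections).

    visual_features: N feature vectors; N varies with resolution or frame count.
    latent_queries:  K learned vectors; K is fixed (64 in Flamingo).
    Returns K output vectors, whatever N is.
    """
    dim = len(latent_queries[0])
    outputs = []
    for query in latent_queries:
        # Scaled dot-product score between this latent and every visual feature.
        scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
                  for feat in visual_features]
        weights = softmax(scores)
        # Each output token is an attention-weighted average of the features.
        outputs.append([sum(w * feat[d] for w, feat in zip(weights, visual_features))
                        for d in range(dim)])
    return outputs

def random_vectors(count, dim, rng):
    """Random stand-ins for learned queries or encoder features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
NUM_LATENTS, DIM = 64, 8
latents = random_vectors(NUM_LATENTS, DIM, rng)

small_grid = random_vectors(49, DIM, rng)      # e.g. one image, 7x7 feature grid
long_video = random_vectors(8 * 49, DIM, rng)  # e.g. eight frames of 7x7 features

tokens_image = resample(small_grid, latents)
tokens_video = resample(long_video, latents)
print(len(tokens_image), len(tokens_video))
```

Both calls return 64 vectors even though the video input is eight times larger, which is the property that keeps the language model's visual context uniform.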
The language model is a pretrained Chinchilla Transformer whose original weights stay frozen. To let text attend to images, the authors interleave new gated cross-attention dense layers between the frozen language layers. These layers cross-attend from the text to the resampled visual tokens. Crucially, the output of each newly added layer is multiplied by tanh(alpha), where alpha is a per-layer learnable scalar initialized to zero, before it is added back to the language model's residual stream. Since tanh(0) = 0, at the start of training the network behaves exactly like the original language model, and the visual pathway is opened up gradually as alpha moves away from zero. This tanh-gating stabilizes training when mixing fresh components into a frozen pretrained model [1][7].
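The effect of the zero-initialized gate can be seen in a few lines. This is a sketch under stated assumptions, not the model's code: `gated_residual` is a hypothetical helper, and the vectors are made-up stand-ins for the frozen language model's activations and a new layer's output.

```python
import math

def gated_residual(hidden, new_layer_output, alpha):
    """Add a new layer's output to the frozen stream, scaled by tanh(alpha)."""
    gate = math.tanh(alpha)
    return [h + gate * o for h, o in zip(hidden, new_layer_output)]

hidden = [0.5, -1.0, 2.0]         # activations from the frozen language model
cross_attn_out = [3.0, 3.0, 3.0]  # hypothetical output of a new cross-attention layer

at_init = gated_residual(hidden, cross_attn_out, alpha=0.0)
after_training = gated_residual(hidden, cross_attn_out, alpha=0.5)

print(at_init)         # identical to `hidden`, because tanh(0) == 0
print(after_training)  # visual signal now mixed in, scaled by tanh(0.5)
```

Because tanh is bounded between -1 and 1, the gate also limits how strongly the new pathway can scale its contribution, whatever value alpha reaches.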
A masking scheme controls which images each piece of text can see. By default a text token attends only to the visual tokens of the single image or video that immediately precedes it in the sequence, although the model can still reason about earlier images through the language model's own self-attention [1][7].
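The default masking rule can be sketched as follows. The `<image>` placeholder string and both function names are assumptions made for illustration; the sketch works on a toy token list and only computes which image each position is allowed to cross-attend to.

```python
IMAGE = "<image>"  # hypothetical placeholder marking where an image sits in the sequence

def preceding_image_index(tokens):
    """For each position, the index of the most recent image, or None if none has appeared."""
    current = None
    seen = 0
    indices = []
    for token in tokens:
        if token == IMAGE:
            current = seen
            seen += 1
        indices.append(current)
    return indices

def cross_attention_mask(tokens):
    """mask[t][i] is True when position t may cross-attend to image i."""
    owner = preceding_image_index(tokens)
    num_images = sum(1 for token in tokens if token == IMAGE)
    return [[owner[t] == i for i in range(num_images)] for t in range(len(tokens))]

tokens = [IMAGE, "a", "cat", IMAGE, "a", "dog"]
mask = cross_attention_mask(tokens)
for token, row in zip(tokens, mask):
    print(f"{token:8s} {row}")
```

In this example "cat" may see only the first image and "dog" only the second. Information about the first image can still reach "dog" indirectly, through self-attention over the earlier text positions.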
Flamingo was trained on a mixture of web-scale datasets. The largest is MultiModal MassiveWeb (M3W), a corpus of text and images interleaved as they appear on web pages, which is what gives the model its interleaved few-shot ability. This is combined with two image-text pair datasets (ALIGN and a dataset called LTIP) and a video-text dataset (VTP). The four datasets M3W, ALIGN, LTIP, and VTP are mixed with per-dataset weights of 1.0, 0.2, 0.2, and 0.03 respectively [1][7].
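One way to read the mixture weights is as coefficients in a weighted sum of per-dataset losses. The sketch below assumes that reading; the loss values are invented purely to make the arithmetic concrete, and `combined_loss` and `relative_share` are hypothetical helper names.

```python
# Per-dataset weights as given in the text.
MIXTURE_WEIGHTS = {"M3W": 1.0, "ALIGN": 0.2, "LTIP": 0.2, "VTP": 0.03}

def combined_loss(per_dataset_loss):
    """Weighted sum of per-dataset losses (assumed interpretation of the weights)."""
    return sum(MIXTURE_WEIGHTS[name] * loss for name, loss in per_dataset_loss.items())

def relative_share(weights):
    """Each dataset's weight as a fraction of the total weight."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Made-up loss values, for illustration only.
losses = {"M3W": 2.0, "ALIGN": 3.0, "LTIP": 3.0, "VTP": 4.0}

total = combined_loss(losses)
shares = relative_share(MIXTURE_WEIGHTS)
print(round(total, 2))
for name, share in shares.items():
    print(f"{name}: {share:.1%} of total weight")
```

Under this reading the interleaved M3W data carries roughly 70 percent of the total weight, consistent with its role as the source of the model's interleaved few-shot ability.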
Flamingo is designed to be used the way a large language model is prompted. A user constructs a sequence that interleaves example image-and-answer pairs with a final query image, and the model completes the text. Because it was trained on naturally interleaved web data, it can fold in several such examples and infer the intended task from them, an ability the paper calls in-context few-shot learning [1][3].
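A prompt of this kind might be assembled as in the sketch below. Flamingo has no public API, so everything here is hypothetical: the `<image>` tag, the "Output:" template, the file names, and the `build_few_shot_prompt` helper are illustrative choices rather than the model's actual input format.

```python
def build_few_shot_prompt(support_examples, query_image, image_tag="<image>"):
    """Interleave (image, answer) examples, then end on the query image.

    Returns the text prompt and the images in the order their tags appear.
    """
    parts, images = [], []
    for image, answer in support_examples:
        parts.append(f"{image_tag}Output: {answer}")
        images.append(image)
    # The final entry has an image but no answer; the model is asked to complete it.
    parts.append(f"{image_tag}Output:")
    images.append(query_image)
    return " ".join(parts), images

examples = [
    ("chinchilla.jpg", "This is a chinchilla."),
    ("shiba.jpg", "This is a shiba."),
]
prompt, images = build_few_shot_prompt(examples, "flamingo.jpg")
print(prompt)
print(images)
```

Swapping the support examples for question-and-answer pairs would turn the same prompt structure into a visual question answering task, with no change to the model.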
This matters because it removes the per-task fine-tuning step that earlier vision-language systems required. As DeepMind's announcement put it, "On the 16 tasks we studied, Flamingo beats all previous few-shot learning approaches when given as few as four examples per task" [2]. The lab added that "in several cases, the same Flamingo model outperforms methods that are fine-tuned and optimised for each task independently and use multiple orders of magnitude more task-specific data" [2].
The Flamingo family comes in three sizes, distinguished by the frozen Chinchilla language model each one wraps. The trainable parameters are concentrated in the gated cross-attention layers and the Perceiver Resampler.
| Model | Frozen language model | Added trained parameters | Total parameters |
|---|---|---|---|
| Flamingo-3B | Chinchilla 1.4B | about 1.4B | about 3B |
| Flamingo-9B | Chinchilla 7B | about 1.8B | about 9B |
| Flamingo (Flamingo-80B) | Chinchilla 70B | about 10B | about 80B |
The largest configuration starts from the 70B-parameter Chinchilla model and reaches roughly 80 billion parameters in total, which is why it is often called Flamingo-80B [2][6][8].
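The table's rounded figures also show how the share of newly trained parameters shrinks as the frozen language model grows. The short calculation below uses only the approximate values from the table, so the percentages are indicative rather than exact; `trained_fraction` is a hypothetical helper.

```python
# Approximate parameter counts in billions, copied from the table above.
FAMILY = {
    "Flamingo-3B": {"frozen_lm": 1.4, "trained": 1.4, "total": 3.0},
    "Flamingo-9B": {"frozen_lm": 7.0, "trained": 1.8, "total": 9.0},
    "Flamingo-80B": {"frozen_lm": 70.0, "trained": 10.0, "total": 80.0},
}

def trained_fraction(entry):
    """Share of the total parameters that were newly trained."""
    return entry["trained"] / entry["total"]

fractions = {name: trained_fraction(entry) for name, entry in FAMILY.items()}
for name, fraction in fractions.items():
    print(f"{name}: about {fraction:.0%} of parameters trained")
```

On these figures the trained share falls from a little under half for the smallest model to about one eighth for the largest.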
Across the 16 image and video understanding benchmarks the authors evaluated, a single Flamingo model set a new state of the art in the few-shot setting, outperforming prior zero-shot and few-shot methods by a wide margin. More striking, on six of those 16 tasks Flamingo's few-shot results (with 32 task-specific examples and no weight updates) surpassed the best published results from models that had been fine-tuned on roughly 1,000 times more task-specific data [1][2][8]. The benchmarks spanned captioning, visual question answering, and multiple-choice tasks over both still images and video. Flamingo could also be fine-tuned, and when it was, it set a new state of the art on five additional benchmarks: VQAv2, VATEX, VizWiz, MSRVTTQA, and HatefulMemes [1].
Flamingo helped establish a template that many subsequent vision-language models followed: take strong frozen unimodal models and connect them with a lightweight, trainable bridge instead of training one giant multimodal model end to end. The Perceiver Resampler and gated cross-attention became reference designs that later systems borrowed or adapted [4][9].
DeepMind itself was pursuing multimodal generalist systems in the same period, including Gato, and the broader push toward natively multimodal models continued at the merged organization that produced the Gemini family [9].
Flamingo's own weights and code were never publicly released by DeepMind. Its practical influence therefore spread mainly through two open reproductions, OpenFlamingo and IDEFICS, which rebuilt the recipe from openly available models and data.
OpenFlamingo is an open-source framework released in 2023 by researchers associated with the ML Foundations group to replicate DeepMind's Flamingo models. Described in a paper submitted in August 2023, it offers autoregressive vision-language models ranging from 3B to 9B parameters, built on CLIP ViT-L/14 vision encoders paired with openly available language models. On seven vision-language datasets, OpenFlamingo models reached on average between 80 and 89 percent of the performance of the corresponding Flamingo models, while being fully open in code and weights [4].
IDEFICS, whose name stands for "Image-aware Decoder Enhanced a la Flamingo with Interleaved Cross-attentionS," is an open-access reproduction of Flamingo built by Hugging Face and released on 22 August 2023. It was published in 9-billion- and 80-billion-parameter variants, each in base and instruction-tuned versions, and was constructed entirely from publicly available data and models, using a LLaMA language model and an OpenCLIP vision encoder. Its training set included a new web-scale corpus called OBELICS, comprising 141 million interleaved image-text documents, 353 million images, and about 115 billion text tokens. IDEFICS was reported to be comparable in performance to the original closed-source Flamingo across several image-text benchmarks [5][10].
Flamingo demonstrated that few-shot, prompt-based learning, already familiar from large language models, could be extended to multimodal inputs, and that this could be achieved by reusing frozen pretrained components rather than retraining everything. Its architectural choices, the Perceiver Resampler bottleneck and the zero-initialized tanh-gated cross-attention, offered a stable and parameter-efficient way to graft vision onto a language model. Together with its open reproductions OpenFlamingo and IDEFICS, Flamingo became a widely cited reference point in the development of the visual language models that followed [4][5][9].