See also: Machine learning terms, Large language model, Computer vision
A multimodal model in machine learning is a system that can process, relate, and generate information across multiple types of data, or modalities, such as text, images, audio, and video. Unlike unimodal models that operate on a single data type, multimodal models learn joint representations that capture patterns and relationships across different modalities. This enables capabilities that no single-modality model can achieve on its own, such as describing an image in natural language, answering questions about a photograph, or generating an image from a text prompt.
Multimodal models have become one of the most active areas of artificial intelligence research. Advances in deep learning and the transformer architecture have driven rapid progress, enabling models like GPT-4V, Gemini, and Claude 3 to process text and images within a single conversation. The field spans a wide range of tasks, from visual question answering and image captioning to text-to-image generation and audio transcription.
Imagine you have a robot friend that can look at pictures, read words, and listen to sounds all at the same time. Most robots can only do one of those things. But a multimodal model is like a robot that understands all of them together. If you show it a photo of a dog playing in a park and ask "what is the dog doing?", it can look at the picture and answer in words: "The dog is playing fetch." It connects what it sees with what it reads and hears, just like you do every day when you watch a movie (you see images and hear sounds at the same time to understand the story).
The development of multimodal models has progressed through several distinct phases, moving from isolated single-modality systems toward increasingly unified architectures.
Early multimodal systems relied on hand-crafted features and separate processing pipelines for each modality. In computer vision, bag-of-visual-words models extracted image features, while natural language processing relied on techniques like TF-IDF and word2vec for text. Combining modalities typically meant extracting features independently and then concatenating them for a downstream classifier. These approaches were limited by the quality of the hand-engineered features and the difficulty of aligning representations across modalities.
The rise of convolutional neural network architectures (such as VGGNet and ResNet) for image understanding and recurrent neural networks (LSTMs) for text processing enabled end-to-end trainable multimodal systems. Models for image captioning, such as Show and Tell (Vinyals et al., 2015), used a CNN to encode an image and an LSTM to generate a caption. Visual question answering (VQA) systems combined image features from CNNs with question embeddings from LSTMs to predict answers. These models demonstrated that neural network architectures could learn to bridge modalities, but they still used separate encoders for each data type with relatively simple fusion mechanisms.
The introduction of CLIP (Contrastive Language-Image Pre-training) by Radford et al. in January 2021 marked a turning point. CLIP trained a vision encoder and a text encoder jointly on 400 million image-text pairs from the internet, using a contrastive learning objective to align their embedding spaces. This produced a model with strong zero-shot transfer capabilities across a wide range of visual tasks. Around the same time, Google introduced ALIGN (Jia et al., 2021), which demonstrated that scaling to over one billion noisy image-text pairs could compensate for data quality issues and achieve state-of-the-art results on image-text retrieval benchmarks.
DeepMind's Flamingo (Alayrac et al., 2022) took a different approach, bridging a frozen pretrained vision encoder with a frozen large language model using novel cross-attention layers. Flamingo could handle arbitrarily interleaved sequences of images and text, enabling few-shot learning on multimodal tasks. The 80-billion-parameter model set new state-of-the-art results on several visual question answering and captioning benchmarks with only a handful of examples.
The current era is defined by natively multimodal large language models. Google DeepMind's Gemini (December 2023) was designed from the ground up as a multimodal model, jointly processing text, images, audio, and video within a single transformer decoder. OpenAI's GPT-4V (September 2023) extended GPT-4 with vision capabilities, allowing users to upload images alongside text prompts. Anthropic's Claude 3 (March 2024) introduced vision across all model sizes. LLaVA (Liu et al., 2023), developed at the University of Wisconsin-Madison and Microsoft Research, demonstrated that connecting a CLIP vision encoder to a language model (Vicuna) through a trainable projection matrix, followed by visual instruction tuning, could produce strong multimodal capabilities at a fraction of the cost of proprietary models.
The trend has accelerated in 2024 and 2025, with models like GPT-4o (which processes text, images, and audio natively), Gemini 2.0 (which adds multimodal output including generated images and text-to-speech), and open-source alternatives such as LLaVA-NeXT and VideoLLaMA 2 advancing the state of the art.
Multimodal learning encompasses several distinct paradigms, each addressing different aspects of how models can work with multiple data types.
| Type | Description | Example |
|---|---|---|
| Fusion | Combining information from multiple modalities into a joint representation for prediction | A sentiment analysis model that combines text and facial expression features to classify emotion |
| Alignment | Learning a shared representation space where items from different modalities can be directly compared | CLIP aligns image and text embeddings so that an image of a cat is close to the text "a photo of a cat" |
| Translation | Converting data from one modality to another | Image captioning (image to text), text-to-image generation (text to image), speech-to-text (audio to text) |
| Co-learning | Using information from a resource-rich modality to improve learning in a resource-poor modality | Using large amounts of unlabeled images to improve performance on a text classification task with limited labeled data |
Fusion refers to how information from different modalities is combined within a model. The choice of fusion strategy has significant implications for model performance, computational cost, and the types of cross-modal interactions the model can capture.
Early fusion (also called input-level fusion) combines raw or lightly processed inputs from all modalities at the beginning of the model pipeline. For example, image patch embeddings and text token embeddings might be concatenated into a single sequence before being fed into a transformer. Early fusion allows the model to learn fine-grained cross-modal interactions from the very first layers. However, it can be computationally expensive because the model must process the combined input at every layer.
Gemini uses a form of early fusion: text, image, audio, and video are tokenized and processed as a single interleaved sequence by a shared transformer decoder. GPT-4o similarly processes text, image, and audio tokens within a single model.
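The sketch below illustrates the early fusion pattern in PyTorch; it is a minimal illustration, not any production model's actual implementation. It assumes precomputed image-patch and text-token embeddings of a shared width, tags each with a learned modality-type embedding, and processes the concatenated sequence in one shared transformer.

```python
import torch
import torch.nn as nn

class EarlyFusionModel(nn.Module):
    """Illustrative early fusion: concatenate image-patch and text-token
    embeddings into one sequence and process them in a shared transformer."""
    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Learned embeddings that tag each token with its modality.
        self.type_embed = nn.Embedding(2, d_model)  # 0 = image, 1 = text
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, image_patches, text_tokens):
        # image_patches: (B, N_img, d_model); text_tokens: (B, N_txt, d_model)
        img = image_patches + self.type_embed.weight[0]
        txt = text_tokens + self.type_embed.weight[1]
        fused = torch.cat([img, txt], dim=1)  # one joint sequence
        # Every layer sees both modalities, so cross-modal interactions
        # can form from the very first block.
        return self.encoder(fused)

model = EarlyFusionModel()
out = model(torch.randn(2, 49, 512), torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 65, 512])
```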
Late fusion (also called decision-level fusion) processes each modality independently through separate encoders and combines their outputs only at the final prediction stage. This is the simplest fusion approach: each modality is processed by its own specialized model, and the predictions or feature vectors are combined (for example, by averaging, voting, or a learned combination layer) to produce a final output. Late fusion is computationally efficient and allows each encoder to be independently pretrained. However, it limits the ability to model fine-grained interactions between modalities.
CLIP and ALIGN use a late fusion approach: separate image and text encoders produce embeddings that are compared only through a cosine similarity function.
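A minimal late-fusion sketch in PyTorch follows, assuming two arbitrary pretrained encoders that each emit a fixed-size feature vector; the stub encoders and the learned combination head are illustrative placeholders.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Illustrative late fusion: each modality has its own encoder,
    and features are combined only at the final prediction stage."""
    def __init__(self, image_encoder, text_encoder, feat_dim=512, n_classes=2):
        super().__init__()
        self.image_encoder = image_encoder  # any model emitting (B, feat_dim)
        self.text_encoder = text_encoder    # any model emitting (B, feat_dim)
        # A learned combination layer over the concatenated features.
        self.head = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, image, text):
        img_feat = self.image_encoder(image)  # (B, feat_dim)
        txt_feat = self.text_encoder(text)    # (B, feat_dim)
        # Modalities interact only here, after all encoding is done.
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))

# Stub encoders standing in for pretrained models:
img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
txt_enc = nn.Linear(300, 512)  # e.g. averaged word vectors -> features
model = LateFusionClassifier(img_enc, txt_enc)
logits = model(torch.randn(4, 3, 32, 32), torch.randn(4, 300))  # (4, 2)
```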
Cross-attention fusion inserts attention layers that allow one modality to attend to another at intermediate stages of processing. In this approach, queries come from one modality (for example, text tokens) and keys and values come from another (for example, image patch features). This enables rich, fine-grained cross-modal interactions while still maintaining separate processing streams for each modality.
Flamingo uses a Perceiver Resampler to condense visual features into a fixed number of tokens, then inserts gated cross-attention dense layers that let the frozen language model attend to them. LLaVA instead projects visual features into the language model's embedding space through a trainable linear projection.
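The sketch below shows the core mechanism in PyTorch, loosely modeled on Flamingo's gated cross-attention; the gating and layer placement here are simplified illustrations, not the published architecture.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Illustrative Flamingo-style block: text tokens (queries) attend to
    visual features (keys/values); a tanh gate initialized at zero lets
    the pretrained language model start out unchanged."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 at init

    def forward(self, text_hidden, visual_feats):
        # text_hidden: (B, N_txt, d); visual_feats: (B, N_img, d)
        attended, _ = self.attn(query=text_hidden,
                                key=visual_feats,
                                value=visual_feats)
        # Gated residual: visual information is blended in gradually.
        return text_hidden + torch.tanh(self.gate) * attended

block = GatedCrossAttentionBlock()
out = block(torch.randn(2, 16, 512), torch.randn(2, 49, 512))  # (2, 16, 512)
```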
| Fusion strategy | When modalities are combined | Strengths | Limitations |
|---|---|---|---|
| Early fusion | At the input layer | Rich cross-modal interactions from the start; unified representation | Computationally expensive; requires all modalities at input time |
| Late fusion | At the output/decision layer | Modular; allows independent pretraining; computationally efficient | Limited cross-modal interaction; cannot capture fine-grained dependencies |
| Cross-attention fusion | At intermediate layers | Fine-grained cross-modal interaction; flexible architecture | Adds architectural complexity; requires careful design of attention patterns |
Contrastive learning has emerged as one of the most effective techniques for aligning representations across modalities. The core idea is to train a model so that matching pairs (for example, an image and its correct caption) have similar representations, while non-matching pairs have dissimilar representations.
CLIP (Radford et al., 2021) uses a dual-encoder architecture with a vision transformer (ViT) for images and a text transformer for captions. During training, a batch of N image-text pairs produces N correct pairings and N^2 - N incorrect pairings. The model maximizes the cosine similarity between embeddings of correct pairs while minimizing similarity for incorrect pairs, using a symmetric cross-entropy loss. This simple objective, when scaled to 400 million image-text pairs, produces representations that transfer to a wide range of downstream tasks without any task-specific fine-tuning (zero-shot transfer).
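A compact PyTorch rendering of this objective follows; it is illustrative, and CLIP additionally learns the temperature during training rather than fixing it as done here.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss in the style of CLIP (illustrative).
    image_emb, text_emb: (N, d) embeddings for N matching image-text pairs."""
    # L2-normalize so dot products equal cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (N, N) similarity matrix: the diagonal holds the N correct pairings,
    # the off-diagonal entries hold the N^2 - N incorrect ones.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(logits))
    # Cross-entropy in both directions (image->text and text->image).
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```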
SigLIP (Zhai et al., 2023) replaced CLIP's softmax-based contrastive loss with a sigmoid loss. Instead of normalizing similarities across the entire batch (which requires a global view of all pairwise similarities), SigLIP treats each image-text pair as an independent binary classification problem. This change enables more efficient scaling to larger batch sizes and improves performance at smaller batch sizes. SigLIP has become a popular vision encoder choice in recent multimodal systems, including PaLI-3 and PaliGemma.
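An illustrative PyTorch version of the sigmoid objective is shown below, with fixed values `t` and `b` standing in for SigLIP's learned temperature and bias.

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_emb, text_emb, t=10.0, b=-10.0):
    """Sigmoid loss in the style of SigLIP (illustrative). Each of the
    N*N image-text pairs is an independent binary classification:
    label +1 on the diagonal (matching pairs), -1 everywhere else."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() * t + b    # (N, N)
    labels = 2 * torch.eye(len(logits)) - 1      # +1 diag, -1 off-diag
    # No batch-wide softmax normalization: each pair is scored on its own,
    # which is what makes the loss easy to scale across large batches.
    return -F.logsigmoid(labels * logits).sum() / len(logits)

loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512))
```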
ALIGN (Jia et al., 2021) demonstrated that scaling training data to over one billion noisy image alt-text pairs, without expensive filtering or curation, could achieve state-of-the-art results. Using a dual-encoder architecture with EfficientNet-L2 as the image encoder and BERT-Large as the text encoder, ALIGN showed that data scale could compensate for noise in the training data.
Vision-language models (VLMs) are multimodal systems specifically designed to process and relate visual and textual information. They represent one of the most mature categories of multimodal models.
| Model | Organization | Year | Architecture | Key contribution |
|---|---|---|---|---|
| CLIP | OpenAI | 2021 | Dual-encoder (ViT + text transformer) | Contrastive pretraining on 400M image-text pairs; strong zero-shot transfer |
| ALIGN | Google | 2021 | Dual-encoder (EfficientNet + BERT) | Scaling to 1B+ noisy image-text pairs without curation |
| Flamingo | DeepMind | 2022 | Frozen LM + vision encoder with cross-attention | Few-shot multimodal learning; handles interleaved image-text sequences |
| SigLIP | Google | 2023 | Dual-encoder (ViT + text transformer) | Sigmoid loss for more scalable contrastive learning |
| LLaVA | University of Wisconsin-Madison / Microsoft Research | 2023 | CLIP ViT + Vicuna LLM with projection layer | Visual instruction tuning; open-source multimodal LLM |
Multimodal large language models (MLLMs) extend large language model architectures to accept inputs beyond text, typically images, audio, and video. These systems represent the current frontier of multimodal AI.
OpenAI's GPT-4V (September 2023) added vision capabilities to GPT-4, allowing users to upload images and receive text responses that reason about the visual content. GPT-4o (May 2024) went further, processing text, images, and audio natively within a single model, enabling real-time voice conversations with visual understanding. The exact architectures of GPT-4V and GPT-4o have not been publicly disclosed, but they represent some of the most capable multimodal systems available.
Google DeepMind's Gemini family (launched December 2023) was built as a natively multimodal model from the start. Unlike systems that bolt vision modules onto existing language models, Gemini's transformer decoder processes text, image, audio, and video tokens through a unified architecture. The family includes Gemini Ultra (for complex tasks), Gemini Pro (general purpose), and Gemini Nano (on-device). Gemini 2.0 (December 2024) extended capabilities to include multimodal output, generating images and speech alongside text.
Anthropic's Claude 3 family (March 2024) added vision capabilities across all model sizes (Haiku, Sonnet, and Opus). Claude 3 models can process and analyze images, extract data from documents, and reason about visual content within the context of a conversation.
LLaVA (Large Language and Vision Assistant) by Liu et al. (2023) demonstrated an effective open-source approach to building multimodal LLMs. The architecture connects a pretrained CLIP ViT-L/14 vision encoder to the Vicuna language model through a trainable projection matrix. Training proceeds in two stages: (1) feature alignment pretraining on 558K image-text pairs, and (2) visual instruction tuning on 150K GPT-generated multimodal instruction-following examples plus 515K VQA samples. LLaVA achieved 85.1% of GPT-4's performance on synthetic multimodal benchmarks and spawned a family of follow-up models including LLaVA-1.5 and LLaVA-NeXT.
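The connector itself is conceptually simple. The sketch below is an illustrative PyTorch rendering, with dimensions chosen to match CLIP ViT-L/14 features (1024) and a Vicuna-7B embedding width (4096); it is not LLaVA's released code.

```python
import torch
import torch.nn as nn

class VisionToLanguageProjector(nn.Module):
    """Illustrative LLaVA-style connector: a trainable linear projection
    maps frozen CLIP visual features into the language model's token
    embedding space, so image patches can be consumed as if they were
    word embeddings. (LLaVA-1.5 later replaced this with a two-layer MLP.)"""
    def __init__(self, vision_dim=1024, lm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_features):
        # patch_features: (B, N_patches, vision_dim) from the vision encoder.
        # Returns (B, N_patches, lm_dim) pseudo-token embeddings that are
        # interleaved with the text embeddings fed to the language model.
        return self.proj(patch_features)

projector = VisionToLanguageProjector()
tokens = projector(torch.randn(1, 576, 1024))  # (1, 576, 4096)
```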
Text-to-image generation is a prominent application of multimodal learning, translating textual descriptions into visual content. These systems are a type of generative model that operates across the text and image modalities.
OpenAI's DALL-E (January 2021) was among the first large-scale text-to-image models, using a transformer to generate images autoregressively from text tokens. DALL-E 2 (April 2022) switched to a diffusion-based architecture, using CLIP embeddings to guide image generation. DALL-E 3 (October 2023) improved text rendering and prompt adherence by training on highly descriptive captions and integrating directly with ChatGPT.
Stable Diffusion (Rombach et al., August 2022), developed by Stability AI in collaboration with LMU Munich and Runway, is a latent diffusion model that performs the denoising process in a compressed latent space rather than in pixel space. This approach dramatically reduces computational requirements, making high-quality image generation accessible on consumer GPUs. The model uses a pretrained text encoder (typically CLIP) to convert text prompts into embeddings that guide the diffusion process. Stable Diffusion XL and subsequent versions have improved image quality and resolution.
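For readers who want to try the pipeline, a brief usage sketch with Hugging Face's `diffusers` library follows, assuming the library is installed and suitable weights are available; the checkpoint identifier below is illustrative.

```python
# pip install diffusers transformers accelerate
import torch
from diffusers import StableDiffusionPipeline

# Checkpoint identifier is illustrative; any Stable Diffusion checkpoint
# hosted on the Hugging Face Hub in diffusers format should work.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # a consumer GPU suffices for 512x512 images

# The prompt is encoded by the CLIP text encoder and conditions the
# denoising process, which runs in the VAE's compressed latent space.
image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```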
Midjourney is a proprietary text-to-image generation service that became widely popular for its distinctive artistic style and high-quality outputs. While the specific architecture has not been publicly disclosed, it is generally understood to be based on diffusion model techniques.
| Model | Developer | Year | Architecture | Key feature |
|---|---|---|---|---|
| DALL-E | OpenAI | 2021 | Autoregressive transformer | First large-scale text-to-image model |
| DALL-E 2 | OpenAI | 2022 | CLIP-guided diffusion | CLIP embeddings for text-image alignment |
| Stable Diffusion | Stability AI / LMU Munich / Runway | 2022 | Latent diffusion model | Open-source; runs on consumer GPUs |
| DALL-E 3 | OpenAI | 2023 | Improved diffusion with descriptive captions | Better prompt adherence; ChatGPT integration |
| Midjourney | Midjourney, Inc. | 2022 | Diffusion-based (proprietary) | Distinctive artistic quality |
Audio-language models bridge the gap between spoken or acoustic data and text, enabling tasks such as speech recognition, audio captioning, and music generation.
OpenAI's Whisper (Radford et al., 2022) is a general-purpose speech recognition model trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It uses an encoder-decoder transformer architecture, with input audio split into 30-second chunks and converted into log-Mel spectrograms. Whisper performs multilingual speech recognition, speech translation, and language identification. Its robustness across accents, background noise, and technical language has made it a reference audio encoder in many multimodal architectures, including MMS-LLaMA and Omni-AVSR.
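A brief usage sketch with the open-source `openai-whisper` package, assuming it is installed and a local audio file such as `speech.mp3` exists:

```python
# pip install openai-whisper
import whisper

# Load a pretrained checkpoint; sizes range from "tiny" to "large".
model = whisper.load_model("base")

# transcribe() handles chunking the audio into 30-second windows and
# converting each window to a log-Mel spectrogram internally.
result = model.transcribe("speech.mp3")
print(result["text"])
```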
Google's AudioLM (Borsos et al., 2023) generates realistic audio (speech and music) by treating audio generation as a language modeling problem. It maps audio to discrete tokens using neural audio codecs and then uses a transformer-based language model to predict the next token in the sequence. AudioLM can generate coherent speech continuations and music that maintains the style and structure of a given prompt.
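Conceptually, once audio is discretized, generation reduces to ordinary next-token language modeling. The sketch below is a deliberately simplified illustration of that reduction, not AudioLM's actual hierarchy of semantic and acoustic tokens.

```python
import torch
import torch.nn as nn

class AudioTokenLM(nn.Module):
    """Conceptual sketch of the AudioLM idea: once a neural codec has
    mapped audio to discrete token IDs, generation is next-token
    prediction over that vocabulary, just as in text language modeling."""
    def __init__(self, vocab_size=1024, d_model=256, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (B, T) codec token IDs; a causal mask enforces
        # left-to-right prediction.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)  # (B, T, vocab_size) next-token logits

lm = AudioTokenLM()
logits = lm(torch.randint(0, 1024, (2, 50)))  # (2, 50, 1024)
```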
Video understanding extends multimodal learning to temporal sequences of images, often combined with audio. This is one of the most challenging multimodal domains because it requires capturing both spatial and temporal information.
TimeSformer (Bertasius et al., 2021), developed at Meta, applies the transformer architecture to video understanding by treating a video as a sequence of spatial patches across time. It replaces 3D convolutions entirely with space-time self-attention, achieving strong results on action recognition benchmarks while being roughly three times faster to train than 3D CNNs and requiring less than one-tenth the inference compute.
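The sketch below illustrates the divided space-time attention pattern in PyTorch: a simplified single block that omits the residual connections, layer norms, and classification head of the real model.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Illustrative TimeSformer-style divided attention: temporal
    self-attention across frames, then spatial self-attention within
    each frame, instead of full 3D space-time attention."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, D) -- T frames, N patches per frame
        B, T, N, D = x.shape
        # Temporal attention: each patch position attends across time.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial attention: patches within a frame attend to each other.
        xs = x.reshape(B * T, N, D)
        xs, _ = self.space_attn(xs, xs, xs)
        return xs.reshape(B, T, N, D)

block = DividedSpaceTimeAttention()
out = block(torch.randn(2, 8, 196, 512))  # 8 frames of 14x14 patches
```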
VideoMAE (Tong et al., 2022) applies masked autoencoder pretraining to video data. By masking a high percentage (90-95%) of video patches and training a model to reconstruct them, VideoMAE learns effective spatiotemporal representations for downstream tasks like action recognition and video retrieval.
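A small sketch of the masking step follows; the frame and patch counts are arbitrary examples, and real VideoMAE uses "tube" masking, hiding the same spatial positions in every frame so the model cannot simply copy a patch from a neighboring frame.

```python
import torch

def tube_mask(n_frames=8, n_patches=196, mask_ratio=0.9):
    """Illustrative VideoMAE-style tube masking: choose a random set of
    spatial patch positions and mask them in every frame. The encoder
    sees only the visible tokens; a lightweight decoder reconstructs
    the masked patches."""
    n_masked = int(n_patches * mask_ratio)
    perm = torch.randperm(n_patches)
    spatial_mask = torch.zeros(n_patches, dtype=torch.bool)
    spatial_mask[perm[:n_masked]] = True
    # Repeat the spatial mask across all frames: (n_frames, n_patches).
    return spatial_mask.unsqueeze(0).expand(n_frames, -1)

mask = tube_mask()
print(mask.float().mean())  # ~0.9 of all tokens are hidden
```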
VideoLLaMA 2 (Cheng et al., 2024) is a video large language model that advances spatial-temporal modeling through a Spatial-Temporal Convolution (STC) connector and integrates an audio branch through joint training. It achieves competitive results on video question answering and video captioning tasks, representing the trend of extending LLM-based multimodal systems to video.
Multimodal models have been applied across a wide range of domains.
Visual question answering (VQA) requires a model to answer natural language questions about an image. Given an image and a question such as "How many people are in the picture?", the model must understand both the visual content and the linguistic query to produce a correct answer. VQA has been one of the primary benchmarks for evaluating multimodal models, with datasets like VQAv2 containing over 1 million questions about real images.
Image captioning generates natural language descriptions of images. Modern captioning systems use vision encoders to extract image features and language models to generate fluent descriptions. The COCO Captions benchmark, with over 1.5 million captions for 330,000 images, is the standard evaluation dataset. Metrics such as CIDEr, BLEU, and METEOR measure the quality of generated captions against human references.
Models like CLIP and ALIGN enable visual search, where a text query is used to find relevant images (or vice versa) by comparing embeddings in a shared representation space. This has practical applications in stock photo search, e-commerce product discovery, and content moderation.
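A minimal retrieval sketch with NumPy is shown below, assuming the image embeddings have already been computed with a CLIP-style dual encoder and stored as a matrix (in practice, a vector index would replace the brute-force comparison).

```python
import numpy as np

def search_images(text_emb, image_embs, top_k=5):
    """Illustrative cross-modal retrieval: rank stored image embeddings
    by cosine similarity to a text-query embedding from the same
    CLIP-style shared representation space."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = image_embs @ text_emb            # cosine similarities
    return np.argsort(scores)[::-1][:top_k]   # indices of best matches

ranked = search_images(np.random.randn(512), np.random.randn(1000, 512))
```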
Multimodal models improve accessibility for people with visual or hearing impairments. Image captioning and visual description systems can narrate visual content for blind and low-vision users, while speech recognition models like Whisper provide accurate transcriptions for deaf and hard-of-hearing users.
Multimodal models that combine medical images (X-rays, MRIs, pathology slides) with clinical text (radiology reports, patient notes) can assist in diagnosis, report generation, and clinical decision support. These applications require careful validation and are subject to regulatory oversight.
Text-to-image models like DALL-E, Stable Diffusion, and Midjourney have transformed creative workflows in design, advertising, and entertainment. These tools allow artists and non-artists alike to generate, iterate on, and refine visual content from text descriptions.
Evaluating multimodal models requires benchmarks that test cross-modal understanding across different tasks and difficulty levels.
| Benchmark | Task | Modalities | Key metric | Description |
|---|---|---|---|---|
| VQAv2 | Visual question answering | Image + text | VQA accuracy | 1M+ questions about real images; accounts for annotator variance |
| COCO Captions | Image captioning | Image + text | CIDEr | 1.5M captions for 330K images from Microsoft COCO |
| ImageNet | Image classification | Image (+ text for zero-shot) | Top-1 / Top-5 accuracy | 14M+ images across 21K categories; used to evaluate zero-shot transfer |
| MMLU | Language understanding | Text | Accuracy | 57 academic subjects; tests knowledge and reasoning; used for general LLM evaluation |
| MMBench | Multimodal understanding | Image + text | Accuracy across 20 abilities | Tests perception, reasoning, and knowledge in MLLMs |
| TextVQA | Text recognition in images | Image + text | VQA accuracy | Questions that require reading text within images |
| SEED-Bench | Comprehensive MLLM evaluation | Image + text | Multiple choice accuracy | Tests 12 dimensions including spatial and temporal understanding |
Multimodal hallucination occurs when a model generates outputs that are inconsistent with the visual or auditory input. For example, a model might describe objects that are not present in an image, fabricate text that supposedly appears in a document, or incorrectly identify spatial relationships between objects. Hallucination arises from several sources: cross-modal misalignment (where the model's text generation drifts away from the grounding provided by the visual input), overreliance on unimodal priors (where the language model's learned biases override visual evidence), and spurious inter-modality correlations learned during training. Mitigating hallucination remains an active area of research, with approaches including reinforcement learning from human feedback (RLHF), improved training data curation, and specialized decoding strategies.
Even when trained to align representations across modalities, models often exhibit a "modality gap" in which embeddings from different modalities occupy distinct regions of the representation space rather than being fully intermixed. This gap can lead to systematic biases in cross-modal retrieval and classification. Research by Liang et al. (2022) showed that CLIP embeddings from images and text occupy two separate, narrow cones in the embedding space, limiting the effectiveness of direct embedding comparison.
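One common way to quantify the gap, following Liang et al. (2022), is the distance between the centroids of the normalized embeddings from each modality. The NumPy sketch below is an illustrative measurement, not the paper's full analysis.

```python
import numpy as np

def modality_gap(image_embs, text_embs):
    """Illustrative modality-gap measurement (after Liang et al., 2022):
    the distance between the centroids of normalized image and text
    embeddings. A fully intermixed space would have a gap near zero."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))

gap = modality_gap(np.random.randn(100, 512), np.random.randn(100, 512))
```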
Multimodal models, particularly those using early fusion, are extremely computationally expensive. Processing high-resolution images, long videos, or audio alongside text requires substantially more memory and compute than text-only models. Most vision-language models process images at relatively low resolutions (224x224 or 336x336 pixels) to manage costs. Training frontier multimodal models from scratch requires thousands of GPUs over weeks or months, making this area accessible primarily to well-resourced research labs and companies.
Multimodal training data, often scraped from the web, contains noise, errors, and biases. Image-text pairs from alt-text can be loosely correlated or misleading. Training datasets may underrepresent certain demographics, cultures, or languages, leading to models that perform unevenly across populations. Ensuring data quality at the scale required for modern multimodal models (hundreds of millions to billions of examples) remains a significant challenge.
Current multimodal models, particularly those built on CLIP-style vision encoders, often struggle with fine-grained visual understanding. They may fail to accurately count objects, understand spatial relationships, read small text in images, or distinguish between visually similar items. CLIP's training objective focuses on matching images to overall descriptions rather than understanding detailed spatial composition, which limits downstream models that rely on CLIP features.