See also: Machine learning terms, Large language model, Computer vision
A multimodal model in machine learning is a system that can process, relate, and generate information across multiple types of data, or modalities, such as text, images, audio, and video. Unlike unimodal models that operate on a single data type, multimodal models learn joint representations that capture patterns and relationships across different modalities. This enables capabilities that no single-modality model can achieve on its own, such as describing an image in natural language, answering questions about a photograph, or generating an image from a text prompt.
Multimodal models have become one of the most active areas of artificial intelligence research. Advances in deep learning and the transformer architecture have driven rapid progress, enabling models like GPT-4V, Gemini, and Claude 3 to process text and images within a single conversation. The field spans a wide range of tasks, from visual question answering and image captioning to text-to-image generation and audio transcription.
Imagine you have a robot friend that can look at pictures, read words, and listen to sounds all at the same time. Most robots can only do one of those things. But a multimodal model is like a robot that understands all of them together. If you show it a photo of a dog playing in a park and ask "what is the dog doing?", it can look at the picture and answer in words: "The dog is playing fetch." It connects what it sees with what it reads and hears, just like you do every day when you watch a movie (you see images and hear sounds at the same time to understand the story).
The development of multimodal models has progressed through several distinct phases, moving from isolated single-modality systems toward increasingly unified architectures.
Early multimodal systems relied on hand-crafted features and separate processing pipelines for each modality. In computer vision, bag-of-visual-words models extracted image features, while natural language processing relied on techniques like TF-IDF and word2vec for text. Combining modalities typically meant extracting features independently and then concatenating them for a downstream classifier. These approaches were limited by the quality of the hand-engineered features and the difficulty of aligning representations across modalities.
The rise of convolutional neural network architectures (such as VGGNet and ResNet) for image understanding and recurrent neural networks (LSTMs) for text processing enabled end-to-end trainable multimodal systems. Models for image captioning, such as Show and Tell (Vinyals et al., 2015), used a CNN to encode an image and an LSTM to generate a caption. Visual question answering (VQA) systems combined image features from CNNs with question embeddings from LSTMs to predict answers. These models demonstrated that neural network architectures could learn to bridge modalities, but they still used separate encoders for each data type with relatively simple fusion mechanisms.
The introduction of CLIP (Contrastive Language-Image Pre-training) by Radford et al. in January 2021 marked a turning point. CLIP trained a vision encoder and a text encoder jointly on 400 million image-text pairs from the internet, using a contrastive learning objective to align their embedding spaces. This produced a model with strong zero-shot transfer capabilities across a wide range of visual tasks. Around the same time, Google introduced ALIGN (Jia et al., 2021), which demonstrated that scaling to over one billion noisy image-text pairs could compensate for data quality issues and achieve state-of-the-art results on image-text retrieval benchmarks.
DeepMind's Flamingo (Alayrac et al., 2022) took a different approach, bridging a frozen pretrained vision encoder with a frozen large language model using novel cross-attention layers. Flamingo could handle arbitrarily interleaved sequences of images and text, enabling few-shot learning on multimodal tasks. The 80-billion-parameter model set new state-of-the-art results on several visual question answering and captioning benchmarks with only a handful of examples.
The current era is defined by natively multimodal large language models. Google DeepMind's Gemini (December 2023) was designed from the ground up as a multimodal model, jointly processing text, images, audio, and video within a single transformer decoder. OpenAI's GPT-4V (September 2023) extended GPT-4 with vision capabilities, allowing users to upload images alongside text prompts. Anthropic's Claude 3 (March 2024) introduced vision across all model sizes. LLaVA (Liu et al., 2023), developed at the University of Wisconsin-Madison and Microsoft Research, demonstrated that connecting a CLIP vision encoder to a language model (Vicuna) through a trainable projection matrix, followed by visual instruction tuning, could produce strong multimodal capabilities at a fraction of the cost of proprietary models.
The trend has accelerated in 2024 and 2025, with models like GPT-4o (which processes text, images, and audio natively), Gemini 2.0 (which adds multimodal output including generated images and text-to-speech), and open-source alternatives such as LLaVA-NeXT and VideoLLaMA 2 advancing the state of the art.
Multimodal learning encompasses several distinct paradigms, each addressing different aspects of how models can work with multiple data types.
| Type | Description | Example |
|---|---|---|
| Fusion | Combining information from multiple modalities into a joint representation for prediction | A sentiment analysis model that combines text and facial expression features to classify emotion |
| Alignment | Learning a shared representation space where items from different modalities can be directly compared | CLIP aligns image and text embeddings so that an image of a cat is close to the text "a photo of a cat" |
| Translation | Converting data from one modality to another | Image captioning (image to text), text-to-image generation (text to image), speech-to-text (audio to text) |
| Co-learning | Using information from a resource-rich modality to improve learning in a resource-poor modality | Using large amounts of unlabeled images to improve performance on a text classification task with limited labeled data |
Fusion refers to how information from different modalities is combined within a model. The choice of fusion strategy has significant implications for model performance, computational cost, and the types of cross-modal interactions the model can capture.
Early fusion (also called input-level fusion) combines raw or lightly processed inputs from all modalities at the beginning of the model pipeline. For example, image patch embeddings and text token embeddings might be concatenated into a single sequence before being fed into a transformer. Early fusion allows the model to learn fine-grained cross-modal interactions from the very first layers. However, it can be computationally expensive because the model must process the combined input at every layer.
Gemini uses a form of early fusion: text, image, audio, and video are tokenized and processed as a single interleaved sequence by a shared transformer decoder. GPT-4o similarly processes text, image, and audio tokens within a single model.
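The sketch below illustrates the early fusion pattern in PyTorch; it is a minimal illustration, not any production model's actual implementation. It assumes precomputed image-patch and text-token embeddings of a shared width, tags each with a learned modality-type embedding, and processes the concatenated sequence in one shared transformer.

```python
import torch
import torch.nn as nn

class EarlyFusionModel(nn.Module):
    """Illustrative early fusion: concatenate image-patch and text-token
    embeddings into one sequence and process them in a shared transformer."""
    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Learned embeddings that tag each token with its modality.
        self.type_embed = nn.Embedding(2, d_model)  # 0 = image, 1 = text
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, image_patches, text_tokens):
        # image_patches: (B, N_img, d_model); text_tokens: (B, N_txt, d_model)
        img = image_patches + self.type_embed.weight[0]
        txt = text_tokens + self.type_embed.weight[1]
        fused = torch.cat([img, txt], dim=1)  # one joint sequence
        # Every layer sees both modalities, so cross-modal interactions
        # can form from the very first block.
        return self.encoder(fused)

model = EarlyFusionModel()
out = model(torch.randn(2, 49, 512), torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 65, 512])
```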
Late fusion (also called decision-level fusion) processes each modality independently through separate encoders and combines their outputs only at the final prediction stage. This is the simplest fusion approach: each modality is processed by its own specialized model, and the predictions or feature vectors are combined (for example, by averaging, voting, or a learned combination layer) to produce a final output. Late fusion is computationally efficient and allows each encoder to be independently pretrained. However, it limits the ability to model fine-grained interactions between modalities.
CLIP and ALIGN use a late fusion approach: separate image and text encoders produce embeddings that are compared only through a cosine similarity function.
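A minimal late-fusion sketch in PyTorch follows, assuming two arbitrary pretrained encoders that each emit a fixed-size feature vector; the stub encoders and the learned combination head are illustrative placeholders.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Illustrative late fusion: each modality has its own encoder,
    and features are combined only at the final prediction stage."""
    def __init__(self, image_encoder, text_encoder, feat_dim=512, n_classes=2):
        super().__init__()
        self.image_encoder = image_encoder  # any model emitting (B, feat_dim)
        self.text_encoder = text_encoder    # any model emitting (B, feat_dim)
        # A learned combination layer over the concatenated features.
        self.head = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, image, text):
        img_feat = self.image_encoder(image)  # (B, feat_dim)
        txt_feat = self.text_encoder(text)    # (B, feat_dim)
        # Modalities interact only here, after all encoding is done.
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))

# Stub encoders standing in for pretrained models:
img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
txt_enc = nn.Linear(300, 512)  # e.g. averaged word vectors -> features
model = LateFusionClassifier(img_enc, txt_enc)
logits = model(torch.randn(4, 3, 32, 32), torch.randn(4, 300))  # (4, 2)
```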
Cross-attention fusion inserts attention layers that allow one modality to attend to another at intermediate stages of processing. In this approach, queries come from one modality (for example, text tokens) and keys and values come from another (for example, image patch features). This enables rich, fine-grained cross-modal interactions while still maintaining separate processing streams for each modality.
Flamingo uses a Perceiver Resampler to condense visual features into a fixed number of tokens, then inserts gated cross-attention dense layers that let the frozen language model attend to them. LLaVA instead projects visual features into the language model's embedding space through a trainable linear projection.
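The sketch below shows the core mechanism in PyTorch, loosely modeled on Flamingo's gated cross-attention; the gating and layer placement here are simplified illustrations, not the published architecture.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Illustrative Flamingo-style block: text tokens (queries) attend to
    visual features (keys/values); a tanh gate initialized at zero lets
    the pretrained language model start out unchanged."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 at init

    def forward(self, text_hidden, visual_feats):
        # text_hidden: (B, N_txt, d); visual_feats: (B, N_img, d)
        attended, _ = self.attn(query=text_hidden,
                                key=visual_feats,
                                value=visual_feats)
        # Gated residual: visual information is blended in gradually.
        return text_hidden + torch.tanh(self.gate) * attended

block = GatedCrossAttentionBlock()
out = block(torch.randn(2, 16, 512), torch.randn(2, 49, 512))  # (2, 16, 512)
```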
| Fusion strategy | When modalities are combined | Strengths | Limitations |
|---|---|---|---|
| Early fusion | At the input layer | Rich cross-modal interactions from the start; unified representation | Computationally expensive; requires all modalities at input time |
| Late fusion | At the output/decision layer | Modular; allows independent pretraining; computationally efficient | Limited cross-modal interaction; cannot capture fine-grained dependencies |
| Cross-attention fusion | At intermediate layers | Fine-grained cross-modal interaction; flexible architecture | Adds architectural complexity; requires careful design of attention patterns |
Contrastive learning has emerged as one of the most effective techniques for aligning representations across modalities. The core idea is to train a model so that matching pairs (for example, an image and its correct caption) have similar representations, while non-matching pairs have dissimilar representations.
CLIP (Radford et al., 2021) uses a dual-encoder architecture with a vision transformer (ViT) for images and a text transformer for captions. During training, a batch of N image-text pairs produces N correct pairings and N^2 - N incorrect pairings. The model maximizes the cosine similarity between embeddings of correct pairs while minimizing similarity for incorrect pairs, using a symmetric cross-entropy loss. This simple objective, when scaled to 400 million image-text pairs, produces representations that transfer to a wide range of downstream tasks without any task-specific fine-tuning (zero-shot transfer).
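A compact PyTorch rendering of this objective follows; it is illustrative, and CLIP additionally learns the temperature during training rather than fixing it as done here.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss in the style of CLIP (illustrative).
    image_emb, text_emb: (N, d) embeddings for N matching image-text pairs."""
    # L2-normalize so dot products equal cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (N, N) similarity matrix: the diagonal holds the N correct pairings,
    # the off-diagonal entries hold the N^2 - N incorrect ones.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(logits))
    # Cross-entropy in both directions (image->text and text->image).
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```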
SigLIP (Zhai et al., 2023) replaced CLIP's softmax-based contrastive loss with a sigmoid loss. Instead of normalizing similarities across the entire batch (which requires a global view of all pairwise similarities), SigLIP treats each image-text pair as an independent binary classification problem. This change enables more efficient scaling to larger batch sizes and improves performance at smaller batch sizes. SigLIP has become a popular vision encoder choice in recent multimodal systems, including PaLI-3 and PaliGemma.
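An illustrative PyTorch version of the sigmoid objective is shown below, with fixed values `t` and `b` standing in for SigLIP's learned temperature and bias.

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_emb, text_emb, t=10.0, b=-10.0):
    """Sigmoid loss in the style of SigLIP (illustrative). Each of the
    N*N image-text pairs is an independent binary classification:
    label +1 on the diagonal (matching pairs), -1 everywhere else."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() * t + b    # (N, N)
    labels = 2 * torch.eye(len(logits)) - 1      # +1 diag, -1 off-diag
    # No batch-wide softmax normalization: each pair is scored on its own,
    # which is what makes the loss easy to scale across large batches.
    return -F.logsigmoid(labels * logits).sum() / len(logits)

loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512))
```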
ALIGN (Jia et al., 2021) demonstrated that scaling training data to over one billion noisy image alt-text pairs, without expensive filtering or curation, could achieve state-of-the-art results. Using a dual-encoder architecture with EfficientNet-L2 as the image encoder and BERT-Large as the text encoder, ALIGN showed that data scale could compensate for noise in the training data.
Vision-language models (VLMs) are multimodal systems specifically designed to process and relate visual and textual information. They represent one of the most mature categories of multimodal models.
| Model | Organization | Year | Architecture | Key contribution |
|---|---|---|---|---|
| CLIP | OpenAI | 2021 | Dual-encoder (ViT + text transformer) | Contrastive pretraining on 400M image-text pairs; strong zero-shot transfer |
| ALIGN | Google | 2021 | Dual-encoder (EfficientNet + BERT) | Scaling to 1B+ noisy image-text pairs without curation |
| Flamingo | DeepMind | 2022 | Frozen LM + vision encoder with cross-attention | Few-shot multimodal learning; handles interleaved image-text sequences |
| SigLIP | Google | 2023 | Dual-encoder (ViT + text transformer) | Sigmoid loss for more scalable contrastive learning |
| LLaVA | University of Wisconsin-Madison / Microsoft Research | 2023 | CLIP ViT + Vicuna LLM with projection layer | Visual instruction tuning; open-source multimodal LLM |
Multimodal large language models (MLLMs) extend large language model architectures to accept inputs beyond text, typically images, audio, and video. These systems represent the current frontier of multimodal AI.
OpenAI's GPT-4V (September 2023) added vision capabilities to GPT-4, allowing users to upload images and receive text responses that reason about the visual content. GPT-4o (May 2024) went further, processing text, images, and audio natively within a single model, enabling real-time voice conversations with visual understanding. The exact architectures of GPT-4V and GPT-4o have not been publicly disclosed, but they represent some of the most capable multimodal systems available.
Google DeepMind's Gemini family (launched December 2023) was built as a natively multimodal model from the start. Unlike systems that bolt vision modules onto existing language models, Gemini's transformer decoder processes text, image, audio, and video tokens through a unified architecture. The family includes Gemini Ultra (for complex tasks), Gemini Pro (general purpose), and Gemini Nano (on-device). Gemini 2.0 (December 2024) extended capabilities to include multimodal output, generating images and speech alongside text.
Anthropic's Claude 3 family (March 2024) added vision capabilities across all model sizes (Haiku, Sonnet, and Opus). Claude 3 models can process and analyze images, extract data from documents, and reason about visual content within the context of a conversation.
LLaVA (Large Language and Vision Assistant) by Liu et al. (2023) demonstrated an effective open-source approach to building multimodal LLMs. The architecture connects a pretrained CLIP ViT-L/14 vision encoder to the Vicuna language model through a trainable projection matrix. Training proceeds in two stages: (1) feature alignment pretraining on 558K image-text pairs, and (2) visual instruction tuning on 150K GPT-generated multimodal instruction-following examples plus 515K VQA samples. LLaVA achieved 85.1% of GPT-4's performance on synthetic multimodal benchmarks and spawned a family of follow-up models including LLaVA-1.5 and LLaVA-NeXT.
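The connector itself is conceptually simple. The sketch below is an illustrative PyTorch rendering, with dimensions chosen to match CLIP ViT-L/14 features (1024) and a Vicuna-7B embedding width (4096); it is not LLaVA's released code.

```python
import torch
import torch.nn as nn

class VisionToLanguageProjector(nn.Module):
    """Illustrative LLaVA-style connector: a trainable linear projection
    maps frozen CLIP visual features into the language model's token
    embedding space, so image patches can be consumed as if they were
    word embeddings. (LLaVA-1.5 later replaced this with a two-layer MLP.)"""
    def __init__(self, vision_dim=1024, lm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_features):
        # patch_features: (B, N_patches, vision_dim) from the vision encoder.
        # Returns (B, N_patches, lm_dim) pseudo-token embeddings that are
        # interleaved with the text embeddings fed to the language model.
        return self.proj(patch_features)

projector = VisionToLanguageProjector()
tokens = projector(torch.randn(1, 576, 1024))  # (1, 576, 4096)
```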
Text-to-image generation is a prominent application of multimodal learning, translating textual descriptions into visual content. These systems are a type of generative model that operates across the text and image modalities.
OpenAI's DALL-E (January 2021) was among the first large-scale text-to-image models, using a transformer to generate images autoregressively from text tokens. DALL-E 2 (April 2022) switched to a diffusion-based architecture, using CLIP embeddings to guide image generation. DALL-E 3 (October 2023) improved text rendering and prompt adherence by training on highly descriptive captions and integrating directly with ChatGPT.
Stable Diffusion (Rombach et al., August 2022), developed by Stability AI in collaboration with LMU Munich and Runway, is a latent diffusion model that performs the denoising process in a compressed latent space rather than in pixel space. This approach dramatically reduces computational requirements, making high-quality image generation accessible on consumer GPUs. The model uses a pretrained text encoder (typically CLIP) to convert text prompts into embeddings that guide the diffusion process. Stable Diffusion XL and subsequent versions have improved image quality and resolution.
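For readers who want to try the pipeline, a brief usage sketch with Hugging Face's `diffusers` library follows, assuming the library is installed and suitable weights are available; the checkpoint identifier below is illustrative.

```python
# pip install diffusers transformers accelerate
import torch
from diffusers import StableDiffusionPipeline

# Checkpoint identifier is illustrative; any Stable Diffusion checkpoint
# hosted on the Hugging Face Hub in diffusers format should work.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # a consumer GPU suffices for 512x512 images

# The prompt is encoded by the CLIP text encoder and conditions the
# denoising process, which runs in the VAE's compressed latent space.
image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```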
Midjourney is a proprietary text-to-image generation service that became widely popular for its distinctive artistic style and high-quality outputs. While the specific architecture has not been publicly disclosed, it is generally understood to be based on diffusion model techniques.
| Model | Developer | Year | Architecture | Key feature |
|---|---|---|---|---|
| DALL-E | OpenAI | 2021 | Autoregressive transformer | First large-scale text-to-image model |
| DALL-E 2 | OpenAI | 2022 | CLIP-guided diffusion | CLIP embeddings for text-image alignment |
| Stable Diffusion | Stability AI / LMU Munich / Runway | 2022 | Latent diffusion model | Open-source; runs on consumer GPUs |
| DALL-E 3 | OpenAI | 2023 | Improved diffusion with descriptive captions | Better prompt adherence; ChatGPT integration |
| Midjourney | Midjourney, Inc. | 2022 | Diffusion-based (proprietary) | Distinctive artistic quality |
Audio-language models bridge the gap between spoken or acoustic data and text, enabling tasks such as speech recognition, audio captioning, and music generation.
OpenAI's Whisper (Radford et al., 2022) is a general-purpose speech recognition model trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It uses an encoder-decoder transformer architecture, with input audio split into 30-second chunks and converted into log-Mel spectrograms. Whisper performs multilingual speech recognition, speech translation, and language identification. Its robustness across accents, background noise, and technical language has made it a reference audio encoder in many multimodal architectures, including MMS-LLaMA and Omni-AVSR.
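A brief usage sketch with the open-source `openai-whisper` package, assuming it is installed and a local audio file such as `speech.mp3` exists:

```python
# pip install openai-whisper
import whisper

# Load a pretrained checkpoint; sizes range from "tiny" to "large".
model = whisper.load_model("base")

# transcribe() handles chunking the audio into 30-second windows and
# converting each window to a log-Mel spectrogram internally.
result = model.transcribe("speech.mp3")
print(result["text"])
```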
Google's AudioLM (Borsos et al., 2023) generates realistic audio (speech and music) by treating audio generation as a language modeling problem. It maps audio to discrete tokens using neural audio codecs and then uses a transformer-based language model to predict the next token in the sequence. AudioLM can generate coherent speech continuations and music that maintains the style and structure of a given prompt.
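Conceptually, once audio is discretized, generation reduces to ordinary next-token language modeling. The sketch below is a deliberately simplified illustration of that reduction, not AudioLM's actual hierarchy of semantic and acoustic tokens.

```python
import torch
import torch.nn as nn

class AudioTokenLM(nn.Module):
    """Conceptual sketch of the AudioLM idea: once a neural codec has
    mapped audio to discrete token IDs, generation is next-token
    prediction over that vocabulary, just as in text language modeling."""
    def __init__(self, vocab_size=1024, d_model=256, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (B, T) codec token IDs; a causal mask enforces
        # left-to-right prediction.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)  # (B, T, vocab_size) next-token logits

lm = AudioTokenLM()
logits = lm(torch.randint(0, 1024, (2, 50)))  # (2, 50, 1024)
```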
Video understanding extends multimodal learning to temporal sequences of images, often combined with audio. This is one of the most challenging multimodal domains because it requires capturing both spatial and temporal information.
TimeSformer (Bertasius et al., 2021), developed at Meta, applies the transformer architecture to video understanding by treating a video as a sequence of spatial patches across time. It replaces 3D convolutions entirely with space-time self-attention, achieving strong results on action recognition benchmarks while being roughly three times faster to train than 3D CNNs and requiring less than one-tenth the inference compute.
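The sketch below illustrates the divided space-time attention pattern in PyTorch: a simplified single block that omits the residual connections, layer norms, and classification head of the real model.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Illustrative TimeSformer-style divided attention: temporal
    self-attention across frames, then spatial self-attention within
    each frame, instead of full 3D space-time attention."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, D) -- T frames, N patches per frame
        B, T, N, D = x.shape
        # Temporal attention: each patch position attends across time.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial attention: patches within a frame attend to each other.
        xs = x.reshape(B * T, N, D)
        xs, _ = self.space_attn(xs, xs, xs)
        return xs.reshape(B, T, N, D)

block = DividedSpaceTimeAttention()
out = block(torch.randn(2, 8, 196, 512))  # 8 frames of 14x14 patches
```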
VideoMAE (Tong et al., 2022) applies masked autoencoder pretraining to video data. By masking a high percentage (90-95%) of video patches and training a model to reconstruct them, VideoMAE learns effective spatiotemporal representations for downstream tasks like action recognition and video retrieval.
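A small sketch of the masking step follows; the frame and patch counts are arbitrary examples, and real VideoMAE uses "tube" masking, hiding the same spatial positions in every frame so the model cannot simply copy a patch from a neighboring frame.

```python
import torch

def tube_mask(n_frames=8, n_patches=196, mask_ratio=0.9):
    """Illustrative VideoMAE-style tube masking: choose a random set of
    spatial patch positions and mask them in every frame. The encoder
    sees only the visible tokens; a lightweight decoder reconstructs
    the masked patches."""
    n_masked = int(n_patches * mask_ratio)
    perm = torch.randperm(n_patches)
    spatial_mask = torch.zeros(n_patches, dtype=torch.bool)
    spatial_mask[perm[:n_masked]] = True
    # Repeat the spatial mask across all frames: (n_frames, n_patches).
    return spatial_mask.unsqueeze(0).expand(n_frames, -1)

mask = tube_mask()
print(mask.float().mean())  # ~0.9 of all tokens are hidden
```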
VideoLLaMA 2 (Cheng et al., 2024) is a video large language model that advances spatial-temporal modeling through a Spatial-Temporal Convolution (STC) connector and integrates an audio branch through joint training. It achieves competitive results on video question answering and video captioning tasks, representing the trend of extending LLM-based multimodal systems to video.
Multimodal models have been applied across a wide range of domains.
Visual question answering (VQA) requires a model to answer natural language questions about an image. Given an image and a question such as "How many people are in the picture?", the model must understand both the visual content and the linguistic query to produce a correct answer. VQA has been one of the primary benchmarks for evaluating multimodal models, with datasets like VQAv2 containing over 1 million questions about real images.
Image captioning generates natural language descriptions of images. Modern captioning systems use vision encoders to extract image features and language models to generate fluent descriptions. The COCO Captions benchmark, with over 1.5 million captions for 330,000 images, is the standard evaluation dataset. Metrics such as CIDEr, BLEU, and METEOR measure the quality of generated captions against human references.
Models like CLIP and ALIGN enable visual search, where a text query is used to find relevant images (or vice versa) by comparing embeddings in a shared representation space. This has practical applications in stock photo search, e-commerce product discovery, and content moderation.
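A minimal retrieval sketch with NumPy is shown below, assuming the image embeddings have already been computed with a CLIP-style dual encoder and stored as a matrix (in practice, a vector index would replace the brute-force comparison).

```python
import numpy as np

def search_images(text_emb, image_embs, top_k=5):
    """Illustrative cross-modal retrieval: rank stored image embeddings
    by cosine similarity to a text-query embedding from the same
    CLIP-style shared representation space."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = image_embs @ text_emb            # cosine similarities
    return np.argsort(scores)[::-1][:top_k]   # indices of best matches

ranked = search_images(np.random.randn(512), np.random.randn(1000, 512))
```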
Multimodal models improve accessibility for people with visual or hearing impairments. Image captioning and visual description systems can narrate visual content for blind and low-vision users, while speech recognition models like Whisper provide accurate transcriptions for deaf and hard-of-hearing users.
Multimodal models that combine medical images (X-rays, MRIs, pathology slides) with clinical text (radiology reports, patient notes) can assist in diagnosis, report generation, and clinical decision support. These applications require careful validation and are subject to regulatory oversight.
Text-to-image models like DALL-E, Stable Diffusion, and Midjourney have transformed creative workflows in design, advertising, and entertainment. These tools allow artists and non-artists alike to generate, iterate on, and refine visual content from text descriptions.
Evaluating multimodal models requires benchmarks that test cross-modal understanding across different tasks and difficulty levels.
| Benchmark | Task | Modalities | Key metric | Description |
|---|---|---|---|---|
| VQAv2 | Visual question answering | Image + text | VQA accuracy | 1M+ questions about real images; accounts for annotator variance |
| COCO Captions | Image captioning | Image + text | CIDEr | 1.5M captions for 330K images from Microsoft COCO |
| ImageNet | Image classification | Image (+ text for zero-shot) | Top-1 / Top-5 accuracy | 14M+ images across 21K categories; used to evaluate zero-shot transfer |
| MMLU | Language understanding | Text | Accuracy | 57 academic subjects; tests knowledge and reasoning; used for general LLM evaluation |
| MMBench | Multimodal understanding | Image + text | Accuracy across 20 abilities | Tests perception, reasoning, and knowledge in MLLMs |
| TextVQA | Text recognition in images | Image + text | VQA accuracy | Questions that require reading text within images |
| SEED-Bench | Comprehensive MLLM evaluation | Image + text | Multiple choice accuracy | Tests 12 dimensions including spatial and temporal understanding |
Multimodal hallucination occurs when a model generates outputs that are inconsistent with the visual or auditory input. For example, a model might describe objects that are not present in an image, fabricate text that supposedly appears in a document, or incorrectly identify spatial relationships between objects. Hallucination arises from several sources: cross-modal misalignment (where the model's text generation drifts away from the grounding provided by the visual input), overreliance on unimodal priors (where the language model's learned biases override visual evidence), and spurious inter-modality correlations learned during training. Mitigating hallucination remains an active area of research, with approaches including reinforcement learning from human feedback (RLHF), improved training data curation, and specialized decoding strategies.
Even when trained to align representations across modalities, models often exhibit a "modality gap" in which embeddings from different modalities occupy distinct regions of the representation space rather than being fully intermixed. This gap can lead to systematic biases in cross-modal retrieval and classification. Research by Liang et al. (2022) showed that CLIP embeddings from images and text occupy two separate, narrow cones in the embedding space, limiting the effectiveness of direct embedding comparison.
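One common way to quantify the gap, following Liang et al. (2022), is the distance between the centroids of the normalized embeddings from each modality. The NumPy sketch below is an illustrative measurement, not the paper's full analysis.

```python
import numpy as np

def modality_gap(image_embs, text_embs):
    """Illustrative modality-gap measurement (after Liang et al., 2022):
    the distance between the centroids of normalized image and text
    embeddings. A fully intermixed space would have a gap near zero."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))

gap = modality_gap(np.random.randn(100, 512), np.random.randn(100, 512))
```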
Multimodal models, particularly those using early fusion, are extremely computationally expensive. Processing high-resolution images, long videos, or audio alongside text requires substantially more memory and compute than text-only models. Most vision-language models process images at relatively low resolutions (224x224 or 336x336 pixels) to manage costs. Training frontier multimodal models from scratch requires thousands of GPUs over weeks or months, making this area accessible primarily to well-resourced research labs and companies.
Multimodal training data, often scraped from the web, contains noise, errors, and biases. Image-text pairs from alt-text can be loosely correlated or misleading. Training datasets may underrepresent certain demographics, cultures, or languages, leading to models that perform unevenly across populations. Ensuring data quality at the scale required for modern multimodal models (hundreds of millions to billions of examples) remains a significant challenge.
Current multimodal models, particularly those built on CLIP-style vision encoders, often struggle with fine-grained visual understanding. They may fail to accurately count objects, understand spatial relationships, read small text in images, or distinguish between visually similar items. CLIP's training objective focuses on matching images to overall descriptions rather than understanding detailed spatial composition, which limits downstream models that rely on CLIP features.