Multimodal AI refers to artificial intelligence systems that can process, understand, and generate information across multiple types of data, or modalities, such as text, images, audio, video, and code. Unlike unimodal systems that operate on a single data type, multimodal AI integrates information from different sources to build richer representations and perform tasks that require reasoning across modalities. A person reading a textbook, for example, draws on both the written text and the accompanying diagrams; multimodal AI aims to replicate this kind of cross-modal understanding in machines.
The field has accelerated rapidly since 2021, driven by breakthroughs in contrastive learning, vision transformers, and large language models. Models like CLIP, GPT-4, Gemini, and Claude can now accept images alongside text prompts, answer questions about photographs, interpret charts, and analyze documents. On the generative side, systems like DALL-E, Stable Diffusion, and Sora produce images and videos from text descriptions. These capabilities have moved multimodal AI from a research curiosity into a technology with broad commercial applications in healthcare, education, creative industries, and software development.
The idea of combining multiple data modalities in AI systems predates the current wave of deep learning. Early work in the 1990s and 2000s focused on audiovisual speech recognition, where combining lip movements (video) with audio signals improved transcription accuracy over audio-only systems. Researchers also explored multimodal sentiment analysis, fusing text, audio tone, and facial expressions to gauge a speaker's emotional state.
During this period, the dominant approach was "fusion," typically categorized into three strategies. Early fusion concatenates raw features from each modality into a single representation before feeding them into a model. Late fusion processes each modality independently and combines the predictions at the output stage. Hybrid fusion mixes both approaches, merging features at intermediate layers. A survey by Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency ("Multimodal Machine Learning: A Survey and Taxonomy," published in IEEE TPAMI in 2019) provided a comprehensive taxonomy of these fusion strategies and identified key challenges including alignment, translation, and co-learning across modalities.
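The difference between early and late fusion can be made concrete with a toy sketch. The snippet below is an illustrative NumPy example, not any published system's code; the feature dimensions, three-class output, and equal late-fusion weights are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality feature vectors (hypothetical dimensions).
text_feat = rng.standard_normal(8)    # e.g. from a text encoder
audio_feat = rng.standard_normal(6)   # e.g. from an audio encoder

def linear(x, w, b):
    return x @ w + b

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# --- Early fusion: concatenate raw features, then one joint classifier.
w_early = rng.standard_normal((8 + 6, 3))
early_logits = linear(np.concatenate([text_feat, audio_feat]), w_early, np.zeros(3))

# --- Late fusion: independent classifiers per modality, combine predictions.
w_text = rng.standard_normal((8, 3))
w_audio = rng.standard_normal((6, 3))
late_probs = 0.5 * softmax(linear(text_feat, w_text, np.zeros(3))) \
           + 0.5 * softmax(linear(audio_feat, w_audio, np.zeros(3)))

print(early_logits.shape, late_probs.shape)  # both map to the same 3 classes
```

Hybrid fusion would merge the two feature vectors at an intermediate hidden layer instead, trading the simplicity of either extreme for richer cross-modal interactions.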
Before the transformer era, multimodal systems tended to be task-specific. Visual question answering (VQA) models combined convolutional neural networks for image encoding with recurrent neural networks for question processing. Image captioning systems used encoder-decoder architectures where a CNN encoded the image and an RNN decoded it into a natural language sentence. These systems worked well within narrow domains but lacked the general-purpose flexibility of modern multimodal models.
The modern era of multimodal AI began with CLIP (Contrastive Language-Image Pre-training), released by OpenAI in January 2021. CLIP introduced a new training paradigm: rather than training a vision model on fixed categories (like ImageNet's 1,000 classes), it learned to associate images with free-form text descriptions using contrastive learning on 400 million image-text pairs scraped from the internet.
CLIP performed zero-shot image classification: it could categorize images it had never seen during training by comparing them against arbitrary text descriptions, and it matched the performance of a fully supervised ResNet-50 on ImageNet. This demonstrated that natural language supervision could serve as a scalable alternative to manually labeled datasets. CLIP's shared vision-language embedding space became a foundational building block for downstream systems, including DALL-E 2 and Stable Diffusion, which use CLIP embeddings to guide image generation.
DeepMind published Flamingo in April 2022, advancing multimodal AI in a different direction. While CLIP could match images to text, Flamingo could hold conversations about images and perform tasks like visual question answering with only a handful of examples (few-shot learning). Flamingo achieved state-of-the-art results on 6 out of 16 vision-language benchmarks using just 32 few-shot examples, without any task-specific fine-tuning.
Flamingo's architecture influenced many subsequent models. It connected a frozen vision encoder to a frozen large language model using two key components: a Perceiver Resampler that converted variable-length visual features into a fixed number of visual tokens, and gated cross-attention layers interleaved between the language model's existing layers. This design pattern of bridging a pretrained vision encoder with a pretrained language model became the template for an entire generation of multimodal models.
In March 2023, OpenAI released GPT-4, which included vision capabilities (known as GPT-4V when the visual input feature launched in September 2023). GPT-4V could interpret photographs, read handwriting, analyze charts, understand memes, and reason about complex visual scenes. In May 2024, OpenAI released GPT-4o (the "o" stands for "omni"), a natively multimodal model trained from the ground up to process text, images, and audio within a single neural network, rather than bolting vision onto an existing language model through an adapter.
Google DeepMind announced Gemini 1.0 in December 2023, the first model in Google's lineup explicitly designed as natively multimodal from the start of training. Gemini processes text, images, audio, and video jointly. The Gemini 1.5 series, announced in February 2024, extended the context window to 1 million tokens and demonstrated strong long-context multimodal understanding. By 2025, Gemini 2.5 Pro scored 84.8% on VideoMME, establishing it as the leading production model for video understanding.
Anthropic launched the Claude 3 model family (Haiku, Sonnet, and Opus) in March 2024, introducing vision capabilities across all three tiers. Claude 3.5 Sonnet, released in June 2024, became Anthropic's strongest vision model, surpassing Claude 3 Opus on standard vision benchmarks. Claude's vision capabilities excel at interpreting charts, graphs, diagrams, and document images, with particular strength in accurately transcribing text from imperfect images.
CLIP uses a dual-encoder architecture with separate encoders for images and text that map both modalities into a shared embedding space. The image encoder is a Vision Transformer (ViT) or ResNet variant; the text encoder is a transformer-based language model with 63 million parameters, 12 layers, and 512-dimensional embeddings. During training, CLIP maximizes the cosine similarity between embeddings of matched image-text pairs in a batch while minimizing similarity for all mismatched pairs, optimizing a symmetric cross-entropy loss.
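The symmetric contrastive objective can be sketched in a few lines. This is a simplified NumPy illustration of the loss described above, not OpenAI's code; the real model uses a learned temperature and very large batches, which this toy omits.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the N x N similarity matrix:
    the i-th image and i-th text in a batch form the matched pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # scaled cosine similarities
    labels = np.arange(len(logits))              # matched pairs on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stabilization
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(l)), labels].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Perfectly aligned embeddings give a near-zero loss.
print(clip_loss(np.eye(4), np.eye(4)))
```

Minimizing this loss pulls matched image-text pairs together in the shared space while pushing all mismatched pairs in the batch apart.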
The model was trained on WebImageText (WIT), a dataset of 400 million image-text pairs. The largest ViT-based CLIP model (ViT-L/14) took 12 days to train on 256 NVIDIA V100 GPUs, while the largest ResNet model required 18 days on 592 V100 GPUs. CLIP's text encoder uses byte pair encoding with a vocabulary of 49,152 tokens and a maximum context length of 76 tokens.
CLIP cannot generate text or images on its own. Its strength lies in creating a shared semantic space where images and text can be compared, enabling zero-shot classification, image-text retrieval, and serving as a guiding signal for generative models.
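Zero-shot classification in that shared space reduces to a nearest-neighbor search over prompt embeddings. The sketch below uses a hypothetical `toy_encoder` as a stand-in for CLIP's text tower; in real use you would call an actual CLIP model, and the prompt template and dimensions here are illustrative assumptions.

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, text_encoder):
    """Embed one prompt per class and return the class whose text
    embedding is closest (by cosine similarity) to the image embedding."""
    prompts = [f"a photo of a {name}" for name in class_names]
    txt = np.stack([text_encoder(p) for p in prompts])
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    img = image_emb / np.linalg.norm(image_emb)
    return class_names[int(np.argmax(txt @ img))]

def toy_encoder(text):
    """Deterministic stand-in for a real text/image encoder."""
    rng = np.random.default_rng(sum(ord(c) for c in text))
    return rng.standard_normal(16)

# The image embedding here is, by construction, the 'dog' prompt's vector,
# so the matching prompt wins the similarity search.
print(zero_shot_classify(toy_encoder("a photo of a dog"),
                         ["cat", "dog", "car"], toy_encoder))  # -> dog
```

Because the class list is just text, the same model can classify against any vocabulary chosen at inference time, which is what makes the approach "zero-shot."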
Flamingo bridges a frozen pretrained vision encoder (a CLIP-style model) and a frozen large language model (Chinchilla) using two trainable components. The Perceiver Resampler takes variable-length visual features from the vision encoder and compresses them into a fixed set of visual tokens. These tokens then condition the language model through gated cross-attention layers inserted between the language model's existing transformer blocks.
DeepMind trained three Flamingo variants: Flamingo-3B (built on Chinchilla 1.4B), Flamingo-9B (on Chinchilla 7B), and Flamingo-80B (on Chinchilla 70B). The training data consisted of large-scale multimodal web corpora containing arbitrarily interleaved text and images. Flamingo could handle both still images and video frames as visual input.
The key insight behind Flamingo was that by keeping both the vision encoder and language model frozen and only training the bridging components, the model could leverage the full capabilities of both pretrained systems while learning to connect them efficiently.
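The gating mechanism that makes this safe can be sketched minimally. The code below is a single-head toy in NumPy with small assumed dimensions; the real model uses multi-head attention and a learned per-layer gate.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, visual_tokens, wq, wk, wv):
    """Single-head cross-attention: text queries attend over visual keys/values."""
    q = text_tokens @ wq
    k = visual_tokens @ wk
    v = visual_tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def gated_cross_attention(text_tokens, visual_tokens, params, alpha):
    """Flamingo-style gate: scale the new layer's output by tanh(alpha).
    With alpha initialized to 0 the block is an identity, so inserting it
    between frozen LM layers leaves the LM's behavior unchanged at init."""
    return text_tokens + np.tanh(alpha) * cross_attention(text_tokens, visual_tokens, *params)

rng = np.random.default_rng(0)
d = 16
params = tuple(rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
text = rng.standard_normal((5, d))    # 5 text-token states
vis = rng.standard_normal((8, d))     # 8 visual tokens from the resampler

# With alpha = 0 the gated layer passes text states through untouched.
assert np.allclose(gated_cross_attention(text, vis, params, alpha=0.0), text)
```

As training proceeds, the gate opens gradually, letting visual information flow into the language model without destabilizing its pretrained capabilities.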
GPT-4V added visual understanding to OpenAI's GPT-4 by attaching a vision encoder to the existing language model via an adapter module. This approach allowed GPT-4 to accept images as input alongside text, but the vision component was not deeply integrated into the model's core architecture.
GPT-4o took a fundamentally different approach. Released in May 2024, it is an autoregressive omni model that accepts any combination of text, audio, image, and video as input and can generate text, audio, and image outputs. The key architectural distinction is that GPT-4o was trained as a single unified neural network from the start, rather than combining separately trained components. This native multimodality gives GPT-4o significantly better performance on tasks requiring tight integration of visual and linguistic reasoning, such as describing spatial relationships in images or following multi-step visual instructions.
GPT-4o also introduced native voice-to-voice capabilities. Previous OpenAI voice features relied on a pipeline of separate models (speech-to-text, then GPT, then text-to-speech), but GPT-4o handles audio input and output directly, enabling it to capture tone, emotion, and nuance that would be lost in a text-based intermediary step.
Google DeepMind's Gemini models are built on a decoder-only transformer architecture and are natively multimodal, meaning they were trained from the beginning on interleaved data across text, images, audio, and video. This differs from approaches that bolt vision onto an existing language model.
The original Gemini 1.0 family (December 2023) shipped in three sizes: Ultra (for complex reasoning tasks), Pro (for general use), and Nano (for on-device deployment). Gemini 1.5 Pro (February 2024) introduced a mixture-of-experts architecture and a context window of up to 1 million tokens, enabling the model to process extremely long documents, multi-hour videos, and large codebases in a single prompt.
Gemini 2.0, announced in December 2024, added native tool use and agentic capabilities alongside its multimodal foundation. By early 2026, the Gemini family had evolved through multiple generations, with each iteration improving multimodal reasoning, extending context lengths, and expanding the range of supported modalities.
Anthropic's Claude 3 family, released in March 2024, introduced vision capabilities across all model tiers. The models can process photographs, charts, graphs, technical diagrams, screenshots, and document images. Users can include up to 20 images per request on claude.ai and up to 600 images via the API.
Claude 3.5 Sonnet showed step-change improvements in visual reasoning compared to Claude 3 Opus, particularly on tasks involving chart and graph interpretation. The model can accurately transcribe text from imperfect or low-quality images, making it useful in retail, logistics, and financial services where documents may be photographed under poor conditions.
Claude's vision architecture processes images as part of the model's context window. The model analyzes visual content and reasons about it using the same language understanding capabilities it applies to text, enabling it to answer complex questions that require integrating information from both images and text within a conversation.
LLaVA (Large Language-and-Vision Assistant), introduced in 2023 by researchers at the University of Wisconsin-Madison and Microsoft Research, demonstrated that competitive multimodal performance could be achieved with open-source components. LLaVA connects a CLIP vision encoder to an open-source large language model (such as LLaMA or Vicuna) through a simple projection layer (an MLP) that maps visual embeddings into the language model's token space.
LLaVA's training follows a two-stage process. In the first stage (feature alignment), the model learns to map visual features to the language model's embedding space using 558,000 image-text pairs, with both the vision encoder and language model frozen. In the second stage (visual instruction tuning), the model is trained on multimodal instruction-following data generated by GPT-4, teaching it to respond to complex visual questions and instructions.
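The projection step at the heart of stage one can be sketched as a small forward pass. The dimensions below (1024-d vision features, a 4096-d LLM embedding space, a 576-patch grid) match common LLaVA-1.5 configurations but are assumptions here, and the two-layer MLP uses ReLU where the real model uses GELU.

```python
import numpy as np

def mlp_projector(visual_feats, w1, b1, w2, b2):
    """Two-layer MLP (as in LLaVA-1.5) mapping vision-encoder patch
    features into the language model's embedding space."""
    h = np.maximum(visual_feats @ w1 + b1, 0.0)   # ReLU here for brevity
    return h @ w2 + b2

rng = np.random.default_rng(0)
d_vis, d_llm = 1024, 4096   # assumed: ViT feature dim -> LLM hidden dim
w1 = rng.standard_normal((d_vis, d_llm)) * 0.01
w2 = rng.standard_normal((d_llm, d_llm)) * 0.01

patches = rng.standard_normal((576, d_vis))       # assumed 24x24 patch grid
visual_tokens = mlp_projector(patches, w1, np.zeros(d_llm), w2, np.zeros(d_llm))
print(visual_tokens.shape)   # (576, 4096): ready to prepend to text embeddings
```

The projected visual tokens are simply concatenated with the text token embeddings, so the frozen language model treats image content as if it were part of the prompt.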
LLaVA-NeXT (January 2024) improved on the original with higher input resolution and enhanced reasoning capabilities. LLaVA-OneVision-1.5 continued the series with the same "ViT-MLP-LLM" architecture but with expanded training data and improved performance.
The open-source multimodal ecosystem has grown substantially beyond LLaVA. Notable models include:
| Model | Organization | Parameters | Key feature |
|---|---|---|---|
| LLaVA-1.5 | UW-Madison / Microsoft | 7B, 13B | Two-stage training with MLP projector |
| InternVL 2.5 | Shanghai AI Lab | 1B to 78B | Competitive with GPT-4V on many benchmarks |
| Qwen-VL | Alibaba | 2B to 72B | Strong multilingual vision-language performance |
| Molmo | Allen Institute for AI | 1B to 72B | Trained on high-quality, human-annotated data |
| CogVLM | Tsinghua / Zhipu AI | 17B | Deep fusion of vision and language features |
| Idefics2 | Hugging Face | 8B | Fully open with training data, code, and model weights |
By 2025, open-source vision-language models like InternVL 3.5 (78B parameters, scoring 71.4 on WildVision) were matching proprietary models on public benchmarks, significantly narrowing the gap between open and closed multimodal systems.
Multimodal AI systems can work with a growing range of data types. Each modality presents unique representation challenges and requires specialized encoding strategies.
| Modality | Description | Common encoding approach | Example tasks |
|---|---|---|---|
| Text | Written language, code, structured data | Tokenization with BPE or SentencePiece; transformer encoder | Question answering, summarization, translation |
| Images | Photographs, diagrams, charts, screenshots | Vision Transformer (ViT) patch embeddings; CNN feature extraction | Classification, object detection, visual QA |
| Audio | Speech, music, environmental sounds | Mel spectrogram conversion; audio transformer encoder | Speech recognition, music generation, sound classification |
| Video | Temporal sequences of frames with optional audio | Frame sampling + spatial-temporal transformers | Action recognition, video captioning, temporal reasoning |
| 3D | Point clouds, meshes, volumetric data | Point cloud transformers; voxel-based encoding | 3D object recognition, scene reconstruction |
| Code | Programming languages, markup, configuration files | Tokenization similar to text but with code-specific vocabulary | Code generation, bug detection, code explanation |
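As one concrete example of the encoding approaches in the table, the ViT patch-embedding step for images can be sketched in a few lines of NumPy. Patch size 16 and a 768-d embedding are typical ViT-Base values, assumed here for illustration.

```python
import numpy as np

def patch_embed(image, patch_size, w_proj):
    """Split an image into non-overlapping patches, flatten each, and
    linearly project to the transformer's embedding dimension --
    the standard ViT input pipeline."""
    h, w, c = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p, c)        # grid of p x p tiles
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches @ w_proj                                  # (num_patches, d_model)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))                  # a 224x224 RGB image
w = rng.standard_normal((16 * 16 * 3, 768)) * 0.02
tokens = patch_embed(img, 16, w)
print(tokens.shape)   # (196, 768): a 14x14 grid of patches, each a 768-d token
```

The resulting patch tokens (plus positional embeddings, omitted here) are what a Vision Transformer actually attends over, making images look structurally similar to token sequences of text.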
Modern multimodal models increasingly handle combinations of these modalities within a single interaction. GPT-4o, for instance, can accept text, images, and audio simultaneously, while Gemini processes text, images, audio, and video in a unified architecture.
Visual question answering (VQA) is the task of answering natural language questions about the content of an image. Given an image and a question like "What color is the car in the background?" or "How many people are sitting at the table?", a VQA system must understand both the visual content and the linguistic structure of the question to produce a correct answer.
Early VQA systems combined CNNs for visual feature extraction with RNNs for question encoding, merging the two representations through attention mechanisms. Modern VQA is largely handled by multimodal large language models. GPT-4V, Gemini, and Claude 3 can all answer complex visual questions that require multi-step reasoning, world knowledge, and spatial understanding.
The VQA dataset (Antol et al., 2015) and its successors (VQA v2, VizWiz) established standard evaluation protocols. More recent benchmarks like MMMU and MathVista test much harder multimodal reasoning, including questions that require understanding college-level diagrams, scientific figures, and mathematical notation.
Image captioning generates a natural language description of an image's content. The task requires identifying objects, understanding their relationships, and expressing that understanding in grammatically correct, contextually appropriate language.
The MS COCO captioning benchmark (Lin et al., 2014) was the standard evaluation dataset for years, with models scored on metrics like BLEU, METEOR, CIDEr, and SPICE. Modern multimodal models have largely surpassed purpose-built captioning systems; when prompted to describe an image, GPT-4V, Gemini, and Claude produce detailed, nuanced descriptions that go well beyond simple object enumeration.
Video understanding extends multimodal AI to temporal data, requiring models to track objects across frames, understand actions and events, and reason about cause-and-effect relationships over time. Key tasks include action recognition (classifying what is happening in a video clip), temporal grounding (locating when a specific event occurs), and video summarization.
Gemini 2.5 Pro leads on video understanding benchmarks as of 2025, scoring 84.8% on VideoMME. The model can process multi-hour videos within its million-token context window, enabling applications like automated meeting summarization, sports analysis, and surveillance review.
Multimodal AI has proven particularly useful for analyzing documents that combine text, tables, figures, and formatting. Traditional optical character recognition (OCR) extracts text but loses structural information; multimodal models can understand the layout, interpret charts within documents, parse tables, and answer questions about the document's content as a whole.
Applications include automated invoice processing, legal document review, financial report analysis, and research paper comprehension. Claude's vision capabilities, which excel at transcribing text from imperfect images, are particularly suited to document analysis workflows where source materials may be photographs of printed documents or low-resolution scans.
Multimodal AI is transforming medical imaging by enabling systems that can interpret medical scans alongside clinical notes, patient histories, and diagnostic criteria. Medical VQA systems allow clinicians to ask natural language questions about radiology images, pathology slides, and other medical visuals.
Google's Med-PaLM M (2023) was one of the first generalist medical AI models capable of interpreting medical images across multiple imaging modalities (X-rays, CT scans, dermatology photos, pathology slides) while also engaging in medical dialogue. Microsoft's BiomedParse (2024), trained on 6 million visual objects across nine imaging modalities, advanced the state of the art for medical image segmentation and analysis.
Key medical VQA datasets include VQA-RAD (for radiology images) and PathVQA (for pathology), though the field faces challenges with data privacy, annotation quality, and the high stakes of clinical deployment.
Text-to-image generation creates visual content from natural language descriptions. The field has evolved rapidly from producing blurry, low-resolution outputs to generating photorealistic images that are difficult to distinguish from photographs.
OpenAI's DALL-E (January 2021) was among the first large-scale text-to-image models, using a discrete variational autoencoder and autoregressive transformer. DALL-E 2 (April 2022) switched to a diffusion model architecture guided by CLIP embeddings, producing dramatically higher-quality images. DALL-E 3 (September 2023) improved prompt adherence by training on highly descriptive captions generated by a captioning model, addressing the common problem of models ignoring parts of complex prompts. DALL-E 3 is integrated directly into ChatGPT, making it accessible through conversational interaction.
DALL-E 3 particularly excels at rendering text within images, a task that earlier text-to-image models struggled with. This capability makes it suitable for generating marketing materials, social media graphics, and other content where legible text is required.
Stable Diffusion, released by Stability AI in August 2022, was the first high-quality open-source text-to-image model. It uses a latent diffusion architecture that performs the denoising process in a compressed latent space rather than in pixel space, significantly reducing computational requirements. The model's open-source nature enabled a large community of developers to create custom fine-tuned models, training techniques (like LoRA and DreamBooth), and user interfaces.
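The forward (noising) half of the diffusion process that latent diffusion runs in compressed space can be written down directly. This NumPy sketch uses the standard linear beta schedule; the 4x64x64 latent shape is an assumption chosen to contrast with 512x512x3 pixel space.

```python
import numpy as np

def forward_diffusion(x0, t, alphas_cumprod, rng):
    """Diffusion forward process on a latent x0:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise.
    Latent diffusion runs this (and the learned reversal) on a
    compressed latent rather than on raw pixels."""
    abar = alphas_cumprod[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise, noise

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # standard linear noise schedule
alphas_cumprod = np.cumprod(1.0 - betas)    # abar_t shrinks toward 0 as t grows

latent = rng.standard_normal((4, 64, 64))   # assumed SD-style latent (vs 512x512x3 pixels)
x_t, eps = forward_diffusion(latent, t=500, alphas_cumprod=alphas_cumprod, rng=rng)
print(x_t.shape)
```

Training teaches a network to predict the added noise `eps` from `x_t`; generation then reverses the process step by step, and working on a 4x64x64 latent instead of a 512x512x3 image is what makes this tractable on consumer hardware.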
Stable Diffusion XL (SDXL), released in 2023, improved image quality and resolution. Stable Diffusion 3 and 3.5 (2024) adopted a multimodal diffusion transformer (MMDiT) architecture that processes text and image tokens jointly, further improving prompt following and visual quality. The open-source ecosystem around Stable Diffusion has produced thousands of specialized models for different artistic styles, domains, and applications.
Midjourney, operated by an independent research lab founded by David Holz (formerly of Leap Motion), gained widespread attention for its ability to produce highly aesthetic, artistic images. The service operates primarily through a Discord bot interface, though a web application became available in 2024.
Midjourney v5 (March 2023) significantly improved photorealism and detail. Midjourney v6 (December 2023) added better text rendering and more precise prompt following. Midjourney v7 (April 2025) is regarded as producing some of the most aesthetically compelling AI-generated imagery available, dominating artistic and creative use cases.
| System | Organization | Release (latest major version) | Architecture | Open source | Key strength |
|---|---|---|---|---|---|
| DALL-E 3 | OpenAI | September 2023 | Diffusion + caption model | No | Text rendering; prompt adherence |
| Stable Diffusion 3.5 | Stability AI | 2024 | Multimodal diffusion transformer (MMDiT) | Yes | Customizability; open ecosystem |
| Midjourney v7 | Midjourney, Inc. | April 2025 | Undisclosed (diffusion-based) | No | Artistic quality; aesthetic output |
| Imagen 3 | Google DeepMind | 2024 | Diffusion + T5 text encoder | No | Photorealism; detail accuracy |
| Flux | Black Forest Labs | 2024 | Rectified flow transformer | Yes (base model) | Fast generation; high quality |
Text-to-video generation extends image generation into the temporal domain, producing moving visual content from text descriptions. The field made dramatic progress between 2023 and 2025.
OpenAI previewed Sora in February 2024, demonstrating text-to-video generation at a quality level that surprised the research community. Sora uses a diffusion transformer architecture that operates on spacetime patches, treating video as collections of smaller units of data (analogous to tokens in GPT). This approach allows the model to handle videos of different durations, resolutions, and aspect ratios.
Sora was released publicly for ChatGPT Plus and Pro users in December 2024. Sora 2, released in September 2025, improved visual quality, physics understanding, and motion consistency, and added synchronized dialogue and sound effect generation, making Sora capable of producing videos with native audio.
Runway, a startup founded by Cristobal Valenzuela, Alejandro Matamala, and Anastasis Germanidis, has been one of the most active companies in AI video generation. Runway Gen-1 (February 2023) enabled video-to-video transformation. Gen-2 (June 2023) added text-to-video generation. Gen-3 Alpha (June 2024) substantially improved temporal consistency and motion quality. Gen-4, available in 2025, focuses on visual consistency across multiple shots and scenes, making it particularly useful for creative professionals who need coherent multi-shot sequences.
Pika Labs, founded by former Stanford AI researchers Demi Guo and Chenlin Meng, entered the text-to-video space in 2023 initially through a Discord-based interface. Pika evolved into a full web platform, with version 2.0 (late 2024) introducing creative effects like "Pikaffects" for adding fantastical transformations to video clips. Pika 2.5 balances accessibility and affordability, targeting individual creators and small teams rather than large production studios.
| System | Organization | Key feature |
|---|---|---|
| Sora 2 | OpenAI | Highest visual quality; native audio generation |
| Runway Gen-4 | Runway | Multi-shot consistency; creative professional tools |
| Pika 2.5 | Pika Labs | Accessible pricing; creative effects |
| Veo 3 | Google DeepMind | Native dialogue generation; strong physics |
| Kling | Kuaishou | Strong motion consistency; available in China |
| Luma Dream Machine | Luma AI | Fast generation; 3D-aware video synthesis |
Audio processing represents another important dimension of multimodal AI, bridging the gap between written text and spoken language.
OpenAI's Whisper, released as open-source software in September 2022, transformed automatic speech recognition by training a single model on 680,000 hours of multilingual, multitask supervised data from the web. Whisper uses an encoder-decoder transformer architecture: input audio is split into 30-second chunks, converted to log-Mel spectrograms, and processed by the encoder. The decoder then predicts a sequence of tokens representing the transcription, language identification, timestamps, and other metadata.
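The spectrogram front end can be sketched in NumPy. This is a simplified illustration, not Whisper's actual preprocessing code, though the frame and mel-bin parameters (25 ms windows, 10 ms hop, 80 mel bins at 16 kHz) follow the published setup.

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Whisper-style front end (simplified): frame the waveform, take
    magnitude FFTs, apply a triangular mel filterbank, and take logs."""
    # --- Short-time Fourier transform (power spectrum)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (frames, n_fft//2 + 1)

    # --- Triangular mel filterbank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

    return np.log10(np.maximum(power @ fbank.T, 1e-10))    # (frames, n_mels)

# One second of a 440 Hz tone -> 98 frames of 80 mel features.
t = np.arange(16000) / 16000.0
mel = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(mel.shape)
```

The encoder consumes these log-mel frames rather than raw samples, which compresses the audio into a perceptually motivated representation that transformers handle well.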
Whisper supports transcription in multiple languages and translation into English. Its robustness to accents, background noise, and technical language comes from the scale and diversity of its training data. Whisper Large V3 (November 2023) improved accuracy further, and the model has been widely adopted as a building block in larger multimodal pipelines.
Other notable speech-to-text systems include Google's Universal Speech Model (USM), which supports over 100 languages, and ElevenLabs' Scribe (February 2025), which achieves 96.7% accuracy for English with features like word-level timestamps and speaker diarization across 99 languages.
Modern text-to-speech (TTS) systems have moved well beyond robotic-sounding output. ElevenLabs, founded in 2022, has become the leading TTS platform, with its models capable of producing speech that captures emotion, emphasis, pauses, and natural cadence. ElevenLabs' Eleven v3 (June 2025) supports over 70 languages, natural multi-speaker dialogue, and audio tags like [excited], [whispers], and [sighs] that give users fine-grained control over delivery.
OpenAI integrated TTS capabilities into GPT-4o, enabling real-time voice conversations where the model can detect and respond to emotional cues in the speaker's voice. Unlike earlier voice assistants that operated through a pipeline of separate speech-to-text and text-to-speech models, GPT-4o processes audio natively, preserving nuances that would be lost in a text intermediary.
| System | Type | Organization | Release | Key capability |
|---|---|---|---|---|
| Whisper | Speech-to-text | OpenAI | September 2022 | Multilingual; robust to noise; open source |
| USM | Speech-to-text | Google | 2023 | 100+ languages; 12M hours of speech data |
| Scribe | Speech-to-text | ElevenLabs | February 2025 | 99 languages; speaker diarization; 96.7% English accuracy |
| ElevenLabs Eleven v3 | Text-to-speech | ElevenLabs | June 2025 | 70+ languages; emotional expressiveness; audio tags |
| GPT-4o Voice | Both (native) | OpenAI | May 2024 | Native audio I/O; real-time conversation; emotion detection |
| Bark | Text-to-speech | Suno AI | 2023 | Open source; nonverbal sounds; music generation |
One of the core technical challenges in multimodal AI is aligning representations across different modalities so that the model can reason about their relationships. Images, text, and audio have fundamentally different structures: text is sequential and discrete, images are spatial and continuous, and audio is temporal and continuous. Mapping these diverse formats into a shared representation space where meaningful comparisons can be made remains difficult.
Simple projection layers (such as those used in LLaVA) are computationally efficient but may miss fine-grained correspondences between visual and textual information. More complex alignment mechanisms, like the gated cross-attention in Flamingo, improve integration but increase computational cost and architectural complexity. The choice of alignment strategy has a direct impact on model performance, particularly for tasks that require detailed spatial reasoning or precise grounding of text to visual regions.
Hallucination in multimodal models occurs when a model generates information that is not supported by the input data. In text-only language models, hallucination means fabricating facts; in multimodal systems, it can also mean describing objects that are not present in an image, misidentifying spatial relationships, or generating captions that contradict the visual evidence.
Research has identified several causes of multimodal hallucination. Models often over-rely on language priors, generating plausible-sounding descriptions based on textual patterns rather than actually attending to the visual input. This is especially problematic for unusual or unexpected scenes. Limited token constraints in bridging modules (like Q-Former in BLIP-2) can cause information loss, where the model simply does not have enough representational capacity to encode all relevant visual details. Training data biases, where certain objects or scenes are overrepresented, can lead models to "guess" based on statistical co-occurrence rather than genuine visual understanding.
Benchmarks like POPE (Polling-based Object Probing Evaluation) and CHAIR (Caption Hallucination Assessment with Image Relevance) specifically evaluate hallucination rates. Mitigating hallucination remains an active research area, with approaches including reinforcement learning from human feedback (RLHF) applied to visual outputs, improved training data curation, and architectural modifications that strengthen visual grounding.
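POPE-style evaluation is simple to express: poll the model with binary object-presence questions and score its yes/no answers against annotations. The helper below is a hypothetical illustration of that scoring, not the benchmark's official code.

```python
def pope_style_accuracy(model_answers, ground_truth):
    """POPE-style polling: ask binary 'Is there a <object> in the image?'
    questions and score yes/no answers against image annotations.
    Both arguments map question ids to 'yes'/'no' strings."""
    correct = sum(model_answers[q] == ground_truth[q] for q in ground_truth)
    return correct / len(ground_truth)

# Hypothetical results: the model hallucinates a 'dog' that is not present.
truth   = {"person?": "yes", "car?": "yes", "dog?": "no"}
answers = {"person?": "yes", "car?": "yes", "dog?": "yes"}
print(pope_style_accuracy(answers, truth))   # 2 of 3 correct
```

In the real benchmark, the negative questions are sampled adversarially (e.g., objects that frequently co-occur with the scene), which is exactly what exposes models that lean on language priors instead of the image.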
Evaluating multimodal models is harder than evaluating unimodal systems because the output quality depends on multiple interacting capabilities. A model might excel at recognizing objects but fail at understanding their spatial relationships, or correctly identify the content of a chart but misinterpret the data it presents. Single-number metrics often fail to capture these nuances.
Automatic metrics for generated images (FID, CLIP Score, Inception Score) measure distributional similarity or text-image alignment but do not fully capture perceptual quality, factual accuracy, or creative merit. For video generation, evaluation is even harder because temporal consistency, motion realism, and narrative coherence all matter but are difficult to quantify. Human evaluation remains the gold standard for many multimodal tasks, but it is expensive, slow, and subjective.
Multimodal models are generally more expensive to train and run than unimodal models because they must process and integrate multiple data types. Video understanding models, in particular, face enormous computational demands because video data is orders of magnitude larger than still images. Gemini 1.5 Pro's ability to process multi-hour videos within a million-token context window requires substantial infrastructure that is not accessible to most researchers.
The computational gap between proprietary and open-source multimodal models has narrowed since 2023, but cutting-edge multimodal training still requires resources available primarily to large technology companies. This raises concerns about equitable access to multimodal AI capabilities and the concentration of AI development among a small number of organizations.
Standardized benchmarks are essential for measuring progress in multimodal AI. The following benchmarks are among the most widely used for evaluating multimodal large language models.
MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning), introduced in late 2023 and published at CVPR 2024, evaluates expert-level multimodal understanding across a broad range of academic disciplines. It contains 11,500 questions drawn from college exams, quizzes, and textbooks, covering six core disciplines: Art and Design, Business, Science, Health and Medicine, Humanities and Social Science, and Tech and Engineering. The questions span 30 subjects and 183 subfields, with 30 different image types including charts, diagrams, maps, tables, music sheets, and chemical structures.
Even advanced models like GPT-4V initially achieved only around 56% accuracy on MMMU, indicating significant room for improvement. MMMU-Pro, a harder variant introduced in 2024, filters out questions answerable by text-only models and augments candidate options to reduce the impact of guessing. Model performance on MMMU-Pro ranges from approximately 16.8% to 26.9%, highlighting the gap between current capabilities and genuine expert-level multimodal reasoning.
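The effect of augmenting candidate options is easy to quantify: uniform random guessing on a 4-option question succeeds 25% of the time, but only 10% of the time with 10 options. A quick simulation of this guessing floor (illustrative only, not the MMMU-Pro methodology itself):

```python
import random

def random_guess_accuracy(num_options: int, trials: int = 100_000,
                          seed: int = 0) -> float:
    """Empirical accuracy of uniform random guessing on multiple-choice items.

    The correct answer is arbitrarily fixed at index 0; by symmetry the
    expected accuracy is 1 / num_options.
    """
    rng = random.Random(seed)
    correct = sum(rng.randrange(num_options) == 0 for _ in range(trials))
    return correct / trials

for k in (4, 10):
    print(f"{k} options: ~{random_guess_accuracy(k):.1%} by chance")
```

With the guessing floor pushed down to 10%, MMMU-Pro scores in the 16.8% to 26.9% range sit much closer to chance than raw MMMU scores do, which is exactly the point of the harder variant.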
MathVista evaluates mathematical reasoning in visual contexts, testing whether models can interpret graphs, geometric figures, statistical charts, and scientific diagrams to solve quantitative problems. The benchmark combines the challenges of mathematical reasoning and visual perception, requiring models to read data from charts, apply formulas, and perform multi-step calculations. The leading models as of 2024 achieve scores around 63.9, slightly surpassing the human average of 60.3.
MM-Vet evaluates large multimodal models on integrated capabilities that require combining multiple skills simultaneously. Rather than testing individual abilities in isolation (e.g., object recognition alone), MM-Vet poses questions that demand recognition, knowledge, spatial understanding, language generation, and mathematical reasoning together. This makes it a useful complement to benchmarks that test narrower skills.
| Benchmark | Focus area | Size | Key characteristic |
|---|---|---|---|
| MMMU | Expert-level multimodal reasoning | 11,500 questions | College-level across 30 subjects |
| MMMU-Pro | Robust multimodal reasoning | Subset of MMMU | Filters text-only solvable questions; harder options |
| MathVista | Mathematical visual reasoning | 6,141 problems | Charts, geometry, statistics |
| MM-Vet | Integrated multimodal capabilities | 218 questions | Tests combined skills |
| VQAv2 | Visual question answering | 1.1M questions | Balanced pairs to reduce language bias |
| TextVQA | Text reading in images | 45,336 questions | Requires OCR + reasoning |
| DocVQA | Document understanding | 50,000 questions | Scanned documents and forms |
| VideoMME | Video understanding | 900 videos | Short, medium, and long video comprehension |
| HallusionBench | Visual hallucination detection | 1,129 examples | Tests resistance to visual illusions and leading questions |
| WildVision | Real-world vision tasks | 500+ samples | User-submitted queries; open-ended |
The following table summarizes the most prominent multimodal AI models as of early 2026.
| Model | Organization | Release date | Input modalities | Output modalities | Open source | Notes |
|---|---|---|---|---|---|---|
| CLIP | OpenAI | January 2021 | Images, text | Embeddings (shared space) | Yes | Foundation for zero-shot classification and generative guidance |
| DALL-E 2 | OpenAI | April 2022 | Text | Images | No | Diffusion model guided by CLIP embeddings |
| Flamingo | DeepMind | April 2022 | Images, video, text | Text | No | Few-shot multimodal learning; Perceiver Resampler |
| Stable Diffusion | Stability AI | August 2022 | Text | Images | Yes | Latent diffusion; large open-source ecosystem |
| Whisper | OpenAI | September 2022 | Audio | Text | Yes | 680K hours training data; multilingual ASR |
| LLaVA | UW-Madison / Microsoft | April 2023 | Images, text | Text | Yes | ViT-MLP-LLM architecture; visual instruction tuning |
| GPT-4V | OpenAI | September 2023 | Images, text | Text | No | Vision adapter on GPT-4; strong visual reasoning |
| DALL-E 3 | OpenAI | September 2023 | Text | Images | No | Improved prompt adherence; text rendering in images |
| Gemini 1.0 | Google DeepMind | December 2023 | Text, images, audio, video | Text | No | Natively multimodal; Ultra/Pro/Nano tiers |
| Gemini 1.5 Pro | Google DeepMind | February 2024 | Text, images, audio, video | Text | No | 1M token context; mixture-of-experts |
| Claude 3 | Anthropic | March 2024 | Images, text | Text | No | Haiku/Sonnet/Opus tiers; strong document analysis |
| GPT-4o | OpenAI | May 2024 | Text, images, audio, video | Text, audio, images | No | Natively multimodal omni model; real-time voice |
| Claude 3.5 Sonnet | Anthropic | June 2024 | Images, text | Text | No | Strongest Claude vision model; chart/graph interpretation |
| Sora | OpenAI | December 2024 | Text | Video | No | Diffusion transformer on spacetime patches |
| InternVL 2.5 | Shanghai AI Lab | 2024 | Images, text | Text | Yes | Competitive with GPT-4V at 78B parameters |
| Qwen-VL | Alibaba | 2024 | Images, text | Text | Yes | Strong multilingual vision-language performance |
| Gemini 2.5 Pro | Google DeepMind | 2025 | Text, images, audio, video | Text, images | No | 84.8% on VideoMME; leading video understanding |
| Sora 2 | OpenAI | Late 2025 | Text | Video, audio | No | Native audio; improved physics understanding |
As of early 2026, several trends are shaping the trajectory of multimodal AI.
Unified any-to-any models. The boundary between understanding and generation is blurring. Rather than separate models for image understanding and image generation, researchers are working toward systems that can both comprehend and produce content across all modalities within a single architecture. GPT-4o's ability to accept and generate text, images, and audio in a unified model points toward this future.
World models and simulation. OpenAI described Sora as a "world simulator," and the text-to-video field is increasingly framed not just as content generation but as learning physical world dynamics. If a video model can accurately predict how objects move, interact, and behave under different conditions, it has effectively learned a model of physics that could be useful for robotics, autonomous vehicles, and scientific simulation.
Embodied multimodal AI. Google DeepMind's Gemini Robotics (2025) demonstrated robots using multimodal models to see, understand, and interact with physical environments. Combining vision, language, and action in embodied agents represents a natural extension of multimodal AI from digital content to the physical world.
Efficiency and accessibility. While frontier multimodal models require massive computational resources, there is strong momentum toward smaller, more efficient models that can run on consumer hardware or edge devices. Google's Gemini Nano runs on mobile phones; open-source models like Molmo and LLaVA variants offer competitive performance at 7B to 8B parameters. Techniques like quantization, distillation, and efficient attention mechanisms are making multimodal capabilities more accessible.
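Of these techniques, quantization is the most direct: weights are mapped from 32-bit floats to 8-bit integers plus a per-tensor scale factor, cutting memory roughly 4x at a small accuracy cost. A minimal symmetric int8 sketch in numpy (production toolchains use far more sophisticated schemes, such as per-channel scales and calibration):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from the int8 codes."""
    return q.astype(np.float32) * scale

# A random weight matrix standing in for a model layer.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
err = float(np.abs(dequantize(q, scale) - w).max())
print(f"memory: {w.nbytes} -> {q.nbytes} bytes; max abs error {err:.4f}")
```

The worst-case rounding error is half a quantization step (scale / 2), which is why int8 works well for weights whose values cluster near zero and degrades when a tensor has extreme outliers.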
Improved evaluation. The community recognizes that current benchmarks do not fully capture multimodal model capabilities. MMMU-Pro, HallusionBench, and WildVision represent efforts to create more rigorous, harder-to-game evaluations. Future benchmarks will likely focus more on real-world task completion, long-form reasoning, and resistance to adversarial inputs.
Safety and alignment. As multimodal models become more capable, concerns about misuse grow. Deepfake generation, misinformation through manipulated images, and privacy violations through visual surveillance are all exacerbated by increasingly powerful multimodal systems. Developing robust safeguards, content provenance systems (like C2PA metadata for generated images), and alignment techniques specific to multimodal outputs remains an open challenge.