Multimodal AI refers to artificial intelligence systems that can process, understand, and generate information across multiple types of data, or modalities, such as text, images, audio, video, and code. Unlike unimodal systems that operate on a single data type, multimodal AI integrates information from different sources to build richer representations and perform tasks that require reasoning across modalities. A person reading a textbook, for example, draws on both the written text and the accompanying diagrams; multimodal AI aims to replicate this kind of cross-modal understanding in machines.
The field has accelerated rapidly since 2021, driven by breakthroughs in contrastive learning, vision transformers, and large language models. Models like CLIP, GPT-4, Gemini, and Claude can now accept images alongside text prompts, answer questions about photographs, interpret charts, and analyze documents. On the generative side, systems like DALL-E, Stable Diffusion, and Sora produce images and videos from text descriptions. These capabilities have moved multimodal AI from a research curiosity into a technology with broad commercial applications in healthcare, education, creative industries, and software development.
The idea of combining multiple data modalities in AI systems predates the current wave of deep learning. Early work in the 1990s and 2000s focused on audiovisual speech recognition, where combining lip movements (video) with audio signals improved transcription accuracy over audio-only systems. Researchers also explored multimodal sentiment analysis, fusing text, audio tone, and facial expressions to gauge a speaker's emotional state.
During this period, the dominant approach was "fusion," typically categorized into three strategies. Early fusion concatenates raw features from each modality into a single representation before feeding them into a model. Late fusion processes each modality independently and combines the predictions at the output stage. Hybrid fusion mixes both approaches, merging features at intermediate layers. A survey by Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency ("Multimodal Machine Learning: A Survey and Taxonomy," published in IEEE TPAMI in 2019) provided a comprehensive taxonomy of these fusion strategies and identified key challenges including alignment, translation, and co-learning across modalities.
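The difference between early and late fusion can be made concrete with a toy sketch. The snippet below is an illustrative NumPy example, not any published system's code; the feature dimensions, three-class output, and equal late-fusion weights are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality feature vectors (hypothetical dimensions).
text_feat = rng.standard_normal(8)    # e.g. from a text encoder
audio_feat = rng.standard_normal(6)   # e.g. from an audio encoder

def linear(x, w, b):
    return x @ w + b

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# --- Early fusion: concatenate raw features, then one joint classifier.
w_early = rng.standard_normal((8 + 6, 3))
early_logits = linear(np.concatenate([text_feat, audio_feat]), w_early, np.zeros(3))

# --- Late fusion: independent classifiers per modality, combine predictions.
w_text = rng.standard_normal((8, 3))
w_audio = rng.standard_normal((6, 3))
late_probs = 0.5 * softmax(linear(text_feat, w_text, np.zeros(3))) \
           + 0.5 * softmax(linear(audio_feat, w_audio, np.zeros(3)))

print(early_logits.shape, late_probs.shape)  # both map to the same 3 classes
```

Hybrid fusion would merge the two feature vectors at an intermediate hidden layer instead, trading the simplicity of either extreme for richer cross-modal interactions.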
Before the transformer era, multimodal systems tended to be task-specific. Visual question answering (VQA) models combined convolutional neural networks for image encoding with recurrent neural networks for question processing. Image captioning systems used encoder-decoder architectures where a CNN encoded the image and an RNN decoded it into a natural language sentence. These systems worked well within narrow domains but lacked the general-purpose flexibility of modern multimodal models.
The modern era of multimodal AI began with CLIP (Contrastive Language-Image Pre-training), released by OpenAI in January 2021. CLIP introduced a new training paradigm: rather than training a vision model on fixed categories (like ImageNet's 1,000 classes), it learned to associate images with free-form text descriptions using contrastive learning on 400 million image-text pairs scraped from the internet.
CLIP performed zero-shot image classification: it could categorize images it had never seen during training by comparing them against arbitrary text descriptions, and it matched the performance of a fully supervised ResNet-50 on ImageNet. This demonstrated that natural language supervision could serve as a scalable alternative to manually labeled datasets. CLIP's shared vision-language embedding space became a foundational building block for downstream systems, including DALL-E 2 and Stable Diffusion, which use CLIP embeddings to guide image generation.
DeepMind published Flamingo in April 2022, advancing multimodal AI in a different direction. While CLIP could match images to text, Flamingo could hold conversations about images and perform tasks like visual question answering with only a handful of examples (few-shot learning). Flamingo achieved state-of-the-art results on 6 out of 16 vision-language benchmarks using just 32 few-shot examples, without any task-specific fine-tuning.
Flamingo's architecture influenced many subsequent models. It connected a frozen vision encoder to a frozen large language model using two key components: a Perceiver Resampler that converted variable-length visual features into a fixed number of visual tokens, and gated cross-attention layers interleaved between the language model's existing layers. This design pattern of bridging a pretrained vision encoder with a pretrained language model became the template for an entire generation of multimodal models.
In March 2023, OpenAI released GPT-4, which included vision capabilities (known as GPT-4V when the visual input feature launched in September 2023). GPT-4V could interpret photographs, read handwriting, analyze charts, understand memes, and reason about complex visual scenes. In May 2024, OpenAI released GPT-4o (the "o" stands for "omni"), a natively multimodal model trained from the ground up to process text, images, and audio within a single neural network, rather than bolting vision onto an existing language model through an adapter.
Google DeepMind announced Gemini 1.0 in December 2023, the first model in Google's lineup explicitly designed as natively multimodal from the start of training. Gemini processes text, images, audio, and video jointly. The Gemini 1.5 series, announced in February 2024, extended the context window to 1 million tokens and demonstrated strong long-context multimodal understanding. By 2025, Gemini 2.5 Pro scored 84.8% on VideoMME, establishing it as the leading production model for video understanding.
Anthropic launched the Claude 3 model family (Haiku, Sonnet, and Opus) in March 2024, introducing vision capabilities across all three tiers. Claude 3.5 Sonnet, released in June 2024, became Anthropic's strongest vision model, surpassing Claude 3 Opus on standard vision benchmarks. Claude's vision capabilities excel at interpreting charts, graphs, diagrams, and document images, with particular strength in accurately transcribing text from imperfect images.
CLIP uses a dual-encoder architecture with separate encoders for images and text that map both modalities into a shared embedding space. The image encoder is a Vision Transformer (ViT) or ResNet variant; the text encoder is a transformer-based language model with 63 million parameters, 12 layers, and 512-dimensional embeddings. During training, CLIP maximizes the cosine similarity between embeddings of matched image-text pairs in a batch while minimizing similarity for all mismatched pairs, optimizing a symmetric cross-entropy loss.
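The symmetric contrastive objective can be sketched in a few lines. This is a simplified NumPy illustration of the loss described above, not OpenAI's code; the real model uses a learned temperature and very large batches, which this toy omits.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the N x N similarity matrix:
    the i-th image and i-th text in a batch form the matched pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # scaled cosine similarities
    labels = np.arange(len(logits))              # matched pairs on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stabilization
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(l)), labels].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Perfectly aligned embeddings give a near-zero loss.
print(clip_loss(np.eye(4), np.eye(4)))
```

Minimizing this loss pulls matched image-text pairs together in the shared space while pushing all mismatched pairs in the batch apart.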
The model was trained on WebImageText (WIT), a dataset of 400 million image-text pairs. The largest ViT-based CLIP model (ViT-L/14) took 12 days to train on 256 NVIDIA V100 GPUs, while the largest ResNet model required 18 days on 592 V100 GPUs. CLIP's text encoder uses byte pair encoding with a vocabulary of 49,152 tokens and a maximum context length of 76 tokens.
CLIP cannot generate text or images on its own. Its strength lies in creating a shared semantic space where images and text can be compared, enabling zero-shot classification, image-text retrieval, and serving as a guiding signal for generative models.
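Zero-shot classification in that shared space reduces to a nearest-neighbor search over prompt embeddings. The sketch below uses a hypothetical `toy_encoder` as a stand-in for CLIP's text tower; in real use you would call an actual CLIP model, and the prompt template and dimensions here are illustrative assumptions.

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, text_encoder):
    """Embed one prompt per class and return the class whose text
    embedding is closest (by cosine similarity) to the image embedding."""
    prompts = [f"a photo of a {name}" for name in class_names]
    txt = np.stack([text_encoder(p) for p in prompts])
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    img = image_emb / np.linalg.norm(image_emb)
    return class_names[int(np.argmax(txt @ img))]

def toy_encoder(text):
    """Deterministic stand-in for a real text/image encoder."""
    rng = np.random.default_rng(sum(ord(c) for c in text))
    return rng.standard_normal(16)

# The image embedding here is, by construction, the 'dog' prompt's vector,
# so the matching prompt wins the similarity search.
print(zero_shot_classify(toy_encoder("a photo of a dog"),
                         ["cat", "dog", "car"], toy_encoder))  # -> dog
```

Because the class list is just text, the same model can classify against any vocabulary chosen at inference time, which is what makes the approach "zero-shot."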
Flamingo bridges a frozen pretrained vision encoder (a CLIP-style model) and a frozen large language model (Chinchilla) using two trainable components. The Perceiver Resampler takes variable-length visual features from the vision encoder and compresses them into a fixed set of visual tokens. These tokens then condition the language model through gated cross-attention layers inserted between the language model's existing transformer blocks.
DeepMind trained three Flamingo variants: Flamingo-3B (built on Chinchilla 1.4B), Flamingo-9B (on Chinchilla 7B), and Flamingo-80B (on Chinchilla 70B). The training data consisted of large-scale multimodal web corpora containing arbitrarily interleaved text and images. Flamingo could handle both still images and video frames as visual input.
The key insight behind Flamingo was that by keeping both the vision encoder and language model frozen and only training the bridging components, the model could leverage the full capabilities of both pretrained systems while learning to connect them efficiently.
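The gating mechanism that makes this safe can be sketched minimally. The code below is a single-head toy in NumPy with small assumed dimensions; the real model uses multi-head attention and a learned per-layer gate.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, visual_tokens, wq, wk, wv):
    """Single-head cross-attention: text queries attend over visual keys/values."""
    q = text_tokens @ wq
    k = visual_tokens @ wk
    v = visual_tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def gated_cross_attention(text_tokens, visual_tokens, params, alpha):
    """Flamingo-style gate: scale the new layer's output by tanh(alpha).
    With alpha initialized to 0 the block is an identity, so inserting it
    between frozen LM layers leaves the LM's behavior unchanged at init."""
    return text_tokens + np.tanh(alpha) * cross_attention(text_tokens, visual_tokens, *params)

rng = np.random.default_rng(0)
d = 16
params = tuple(rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
text = rng.standard_normal((5, d))    # 5 text-token states
vis = rng.standard_normal((8, d))     # 8 visual tokens from the resampler

# With alpha = 0 the gated layer passes text states through untouched.
assert np.allclose(gated_cross_attention(text, vis, params, alpha=0.0), text)
```

As training proceeds, the gate opens gradually, letting visual information flow into the language model without destabilizing its pretrained capabilities.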
GPT-4V added visual understanding to OpenAI's GPT-4 by attaching a vision encoder to the existing language model via an adapter module. This approach allowed GPT-4 to accept images as input alongside text, but the vision component was not deeply integrated into the model's core architecture.
GPT-4o took a fundamentally different approach. Released in May 2024, it is an autoregressive omni model that accepts any combination of text, audio, image, and video as input and can generate text, audio, and image outputs. The key architectural distinction is that GPT-4o was trained as a single unified neural network from the start, rather than combining separately trained components. This native multimodality gives GPT-4o significantly better performance on tasks requiring tight integration of visual and linguistic reasoning, such as describing spatial relationships in images or following multi-step visual instructions.
GPT-4o also introduced native voice-to-voice capabilities. Previous OpenAI voice features relied on a pipeline of separate models (speech-to-text, then GPT, then text-to-speech), but GPT-4o handles audio input and output directly, enabling it to capture tone, emotion, and nuance that would be lost in a text-based intermediary step.
Google DeepMind's Gemini models are built on a decoder-only transformer architecture and are natively multimodal, meaning they were trained from the beginning on interleaved data across text, images, audio, and video. This differs from approaches that bolt vision onto an existing language model.
The original Gemini 1.0 family (December 2023) shipped in three sizes: Ultra (for complex reasoning tasks), Pro (for general use), and Nano (for on-device deployment). Gemini 1.5 Pro (February 2024) introduced a mixture-of-experts architecture and a context window of up to 1 million tokens, enabling the model to process extremely long documents, multi-hour videos, and large codebases in a single prompt.
Gemini 2.0, announced in December 2024, added native tool use and agentic capabilities alongside its multimodal foundation. By early 2026, the Gemini family had evolved through multiple generations, with each iteration improving multimodal reasoning, extending context lengths, and expanding the range of supported modalities.
Anthropic's Claude 3 family, released in March 2024, introduced vision capabilities across all model tiers. The models can process photographs, charts, graphs, technical diagrams, screenshots, and document images. Users can include up to 20 images per request on claude.ai and up to 600 images via the API.
Claude 3.5 Sonnet showed step-change improvements in visual reasoning compared to Claude 3 Opus, particularly on tasks involving chart and graph interpretation. The model can accurately transcribe text from imperfect or low-quality images, making it useful in retail, logistics, and financial services where documents may be photographed under poor conditions.
Claude's vision architecture processes images as part of the model's context window. The model analyzes visual content and reasons about it using the same language understanding capabilities it applies to text, enabling it to answer complex questions that require integrating information from both images and text within a conversation.
LLaVA (Large Language-and-Vision Assistant), introduced in 2023 by researchers at the University of Wisconsin-Madison and Microsoft Research, demonstrated that competitive multimodal performance could be achieved with open-source components. LLaVA connects a CLIP vision encoder to an open-source large language model (such as LLaMA or Vicuna) through a simple projection layer (an MLP) that maps visual embeddings into the language model's token space.
LLaVA's training follows a two-stage process. In the first stage (feature alignment), the model learns to map visual features to the language model's embedding space using 558,000 image-text pairs, with both the vision encoder and language model frozen. In the second stage (visual instruction tuning), the model is trained on multimodal instruction-following data generated by GPT-4, teaching it to respond to complex visual questions and instructions.
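The projection step at the heart of stage one can be sketched as a small forward pass. The dimensions below (1024-d vision features, a 4096-d LLM embedding space, a 576-patch grid) match common LLaVA-1.5 configurations but are assumptions here, and the two-layer MLP uses ReLU where the real model uses GELU.

```python
import numpy as np

def mlp_projector(visual_feats, w1, b1, w2, b2):
    """Two-layer MLP (as in LLaVA-1.5) mapping vision-encoder patch
    features into the language model's embedding space."""
    h = np.maximum(visual_feats @ w1 + b1, 0.0)   # ReLU here for brevity
    return h @ w2 + b2

rng = np.random.default_rng(0)
d_vis, d_llm = 1024, 4096   # assumed: ViT feature dim -> LLM hidden dim
w1 = rng.standard_normal((d_vis, d_llm)) * 0.01
w2 = rng.standard_normal((d_llm, d_llm)) * 0.01

patches = rng.standard_normal((576, d_vis))       # assumed 24x24 patch grid
visual_tokens = mlp_projector(patches, w1, np.zeros(d_llm), w2, np.zeros(d_llm))
print(visual_tokens.shape)   # (576, 4096): ready to prepend to text embeddings
```

The projected visual tokens are simply concatenated with the text token embeddings, so the frozen language model treats image content as if it were part of the prompt.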
LLaVA-NeXT (January 2024) improved on the original with higher input resolution and enhanced reasoning capabilities. LLaVA-OneVision-1.5 continued the series with the same "ViT-MLP-LLM" architecture but with expanded training data and improved performance.
The open-source multimodal ecosystem has grown substantially beyond LLaVA. Notable models include:
| Model | Organization | Parameters | Key feature |
|---|---|---|---|
| LLaVA-1.5 | UW-Madison / Microsoft | 7B, 13B | Two-stage training with MLP projector |
| InternVL 2.5 | Shanghai AI Lab | 1B to 78B | Competitive with GPT-4V on many benchmarks |
| Qwen-VL | Alibaba | 2B to 72B | Strong multilingual vision-language performance |
| Molmo | Allen Institute for AI | 1B to 72B | Trained on high-quality, human-annotated data |
| CogVLM | Tsinghua / Zhipu AI | 17B | Deep fusion of vision and language features |
| Idefics2 | Hugging Face | 8B | Fully open with training data, code, and model weights |
By 2025, open-source vision-language models like InternVL 3.5 (78B parameters, scoring 71.4 on WildVision) were matching proprietary models on public benchmarks, significantly narrowing the gap between open and closed multimodal systems.
Multimodal AI systems can work with a growing range of data types. Each modality presents unique representation challenges and requires specialized encoding strategies.
| Modality | Description | Common encoding approach | Example tasks |
|---|---|---|---|
| Text | Written language, code, structured data | Tokenization with BPE or SentencePiece; transformer encoder | Question answering, summarization, translation |
| Images | Photographs, diagrams, charts, screenshots | Vision Transformer (ViT) patch embeddings; CNN feature extraction | Classification, object detection, visual QA |
| Audio | Speech, music, environmental sounds | Mel spectrogram conversion; audio transformer encoder | Speech recognition, music generation, sound classification |
| Video | Temporal sequences of frames with optional audio | Frame sampling + spatial-temporal transformers | Action recognition, video captioning, temporal reasoning |
| 3D | Point clouds, meshes, volumetric data | Point cloud transformers; voxel-based encoding | 3D object recognition, scene reconstruction |
| Code | Programming languages, markup, configuration files | Tokenization similar to text but with code-specific vocabulary | Code generation, bug detection, code explanation |
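As one concrete example of the encoding approaches in the table, the ViT patch-embedding step for images can be sketched in a few lines of NumPy. Patch size 16 and a 768-d embedding are typical ViT-Base values, assumed here for illustration.

```python
import numpy as np

def patch_embed(image, patch_size, w_proj):
    """Split an image into non-overlapping patches, flatten each, and
    linearly project to the transformer's embedding dimension --
    the standard ViT input pipeline."""
    h, w, c = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p, c)        # grid of p x p tiles
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches @ w_proj                                  # (num_patches, d_model)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))                  # a 224x224 RGB image
w = rng.standard_normal((16 * 16 * 3, 768)) * 0.02
tokens = patch_embed(img, 16, w)
print(tokens.shape)   # (196, 768): a 14x14 grid of patches, each a 768-d token
```

The resulting patch tokens (plus positional embeddings, omitted here) are what a Vision Transformer actually attends over, making images look structurally similar to token sequences of text.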
Modern multimodal models increasingly handle combinations of these modalities within a single interaction. GPT-4o, for instance, can accept text, images, and audio simultaneously, while Gemini processes text, images, audio, and video in a unified architecture.
Visual question answering (VQA) is the task of answering natural language questions about the content of an image. Given an image and a question like "What color is the car in the background?" or "How many people are sitting at the table?", a VQA system must understand both the visual content and the linguistic structure of the question to produce a correct answer.
Early VQA systems combined CNNs for visual feature extraction with RNNs for question encoding, merging the two representations through attention mechanisms. Modern VQA is largely handled by multimodal large language models. GPT-4V, Gemini, and Claude 3 can all answer complex visual questions that require multi-step reasoning, world knowledge, and spatial understanding.
The VQA dataset (Antol et al., 2015) and its successors (VQA v2, VizWiz) established standard evaluation protocols. More recent benchmarks like MMMU and MathVista test much harder multimodal reasoning, including questions that require understanding college-level diagrams, scientific figures, and mathematical notation.
Image captioning generates a natural language description of an image's content. The task requires identifying objects, understanding their relationships, and expressing that understanding in grammatically correct, contextually appropriate language.
The MS COCO captioning benchmark (Lin et al., 2014) was the standard evaluation dataset for years, with models scored on metrics like BLEU, METEOR, CIDEr, and SPICE. Modern multimodal models have largely surpassed purpose-built captioning systems; when prompted to describe an image, GPT-4V, Gemini, and Claude produce detailed, nuanced descriptions that go well beyond simple object enumeration.
Video understanding extends multimodal AI to temporal data, requiring models to track objects across frames, understand actions and events, and reason about cause-and-effect relationships over time. Key tasks include action recognition (classifying what is happening in a video clip), temporal grounding (locating when a specific event occurs), and video summarization.
Gemini 2.5 Pro leads on video understanding benchmarks as of 2025, scoring 84.8% on VideoMME. The model can process multi-hour videos within its million-token context window, enabling applications like automated meeting summarization, sports analysis, and surveillance review.
Multimodal AI has proven particularly useful for analyzing documents that combine text, tables, figures, and formatting. Traditional optical character recognition (OCR) extracts text but loses structural information; multimodal models can understand the layout, interpret charts within documents, parse tables, and answer questions about the document's content as a whole.
Applications include automated invoice processing, legal document review, financial report analysis, and research paper comprehension. Claude's vision capabilities, which excel at transcribing text from imperfect images, are particularly suited to document analysis workflows where source materials may be photographs of printed documents or low-resolution scans.
Multimodal AI is transforming medical imaging by enabling systems that can interpret medical scans alongside clinical notes, patient histories, and diagnostic criteria. Medical VQA systems allow clinicians to ask natural language questions about radiology images, pathology slides, and other medical visuals.
Google's Med-PaLM M (2023) was one of the first generalist medical AI models capable of interpreting medical images across multiple imaging modalities (X-rays, CT scans, dermatology photos, pathology slides) while also engaging in medical dialogue. Microsoft's BiomedParse (2024), trained on 6 million visual objects across nine imaging modalities, advanced the state of the art for medical image segmentation and analysis.
Key medical VQA datasets include VQA-RAD (for radiology images) and PathVQA (for pathology), though the field faces challenges with data privacy, annotation quality, and the high stakes of clinical deployment.
Text-to-image generation creates visual content from natural language descriptions. The field has evolved rapidly from producing blurry, low-resolution outputs to generating photorealistic images that are difficult to distinguish from photographs.
OpenAI's DALL-E (January 2021) was among the first large-scale text-to-image models, using a discrete variational autoencoder and autoregressive transformer. DALL-E 2 (April 2022) switched to a diffusion model architecture guided by CLIP embeddings, producing dramatically higher-quality images. DALL-E 3 (September 2023) improved prompt adherence by training on highly descriptive captions generated by a captioning model, addressing the common problem of models ignoring parts of complex prompts. DALL-E 3 is integrated directly into ChatGPT, making it accessible through conversational interaction.
DALL-E 3 particularly excels at rendering text within images, a task that earlier text-to-image models struggled with. This capability makes it suitable for generating marketing materials, social media graphics, and other content where legible text is required.
Stable Diffusion, released by Stability AI in August 2022, was the first high-quality open-source text-to-image model. It uses a latent diffusion architecture that performs the denoising process in a compressed latent space rather than in pixel space, significantly reducing computational requirements. The model's open-source nature enabled a large community of developers to create custom fine-tuned models, training techniques (like LoRA and DreamBooth), and user interfaces.
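The forward (noising) half of the diffusion process that latent diffusion runs in compressed space can be written down directly. This NumPy sketch uses the standard linear beta schedule; the 4x64x64 latent shape is an assumption chosen to contrast with 512x512x3 pixel space.

```python
import numpy as np

def forward_diffusion(x0, t, alphas_cumprod, rng):
    """Diffusion forward process on a latent x0:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise.
    Latent diffusion runs this (and the learned reversal) on a
    compressed latent rather than on raw pixels."""
    abar = alphas_cumprod[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise, noise

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # standard linear noise schedule
alphas_cumprod = np.cumprod(1.0 - betas)    # abar_t shrinks toward 0 as t grows

latent = rng.standard_normal((4, 64, 64))   # assumed SD-style latent (vs 512x512x3 pixels)
x_t, eps = forward_diffusion(latent, t=500, alphas_cumprod=alphas_cumprod, rng=rng)
print(x_t.shape)
```

Training teaches a network to predict the added noise `eps` from `x_t`; generation then reverses the process step by step, and working on a 4x64x64 latent instead of a 512x512x3 image is what makes this tractable on consumer hardware.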
Stable Diffusion XL (SDXL), released in 2023, improved image quality and resolution. Stable Diffusion 3 and 3.5 (2024) adopted a multimodal diffusion transformer (MMDiT) architecture that processes text and image tokens jointly, further improving prompt following and visual quality. The open-source ecosystem around Stable Diffusion has produced thousands of specialized models for different artistic styles, domains, and applications.
Midjourney, operated by an independent research lab founded by David Holz (formerly of Leap Motion), gained widespread attention for its ability to produce highly aesthetic, artistic images. The service operates primarily through a Discord bot interface, though a web application became available in 2024.
Midjourney v5 (March 2023) significantly improved photorealism and detail. Midjourney v6 (December 2023) added better text rendering and more precise prompt following. Midjourney v7 (April 2025) is regarded as producing some of the most aesthetically compelling AI-generated imagery available, dominating artistic and creative use cases.
| System | Organization | Release (latest major version) | Architecture | Open source | Key strength |
|---|---|---|---|---|---|
| DALL-E 3 | OpenAI | September 2023 | Diffusion + caption model | No | Text rendering; prompt adherence |
| Stable Diffusion 3.5 | Stability AI | 2024 | Multimodal diffusion transformer (MMDiT) | Yes | Customizability; open ecosystem |
| Midjourney v7 | Midjourney, Inc. | April 2025 | Undisclosed (diffusion-based) | No | Artistic quality; aesthetic output |
| Imagen 3 | Google DeepMind | 2024 | Diffusion + T5 text encoder | No | Photorealism; detail accuracy |
| Flux | Black Forest Labs | 2024 | Rectified flow transformer | Yes (base model) | Fast generation; high quality |
Text-to-video generation extends image generation into the temporal domain, producing moving visual content from text descriptions. The field made dramatic progress between 2023 and 2025.
OpenAI previewed Sora in February 2024, demonstrating text-to-video generation at a quality level that surprised the research community. Sora uses a diffusion transformer architecture that operates on spacetime patches, treating video as collections of smaller units of data (analogous to tokens in GPT). This approach allows the model to handle videos of different durations, resolutions, and aspect ratios.
Sora was released publicly for ChatGPT Plus and Pro users in December 2024. Sora 2, released in September 2025, improved visual quality, physics understanding, and motion consistency, and added synchronized dialogue and sound effect generation, making Sora capable of producing videos with native audio.
Runway, a startup founded by Cristobal Valenzuela, Alejandro Matamala, and Anastasis Germanidis, has been one of the most active companies in AI video generation. Runway Gen-1 (February 2023) enabled video-to-video transformation. Gen-2 (June 2023) added text-to-video generation. Gen-3 Alpha (June 2024) substantially improved temporal consistency and motion quality. Gen-4, available in 2025, focuses on visual consistency across multiple shots and scenes, making it particularly useful for creative professionals who need coherent multi-shot sequences.
Pika Labs, founded by former Stanford AI researchers Demi Guo and Chenlin Meng, entered the text-to-video space in 2023 initially through a Discord-based interface. Pika evolved into a full web platform, with version 2.0 (late 2024) introducing creative effects like "Pikaffects" for adding fantastical transformations to video clips. Pika 2.5 balances accessibility and affordability, targeting individual creators and small teams rather than large production studios.
| System | Organization | Key feature |
|---|---|---|
| Sora 2 | OpenAI | Highest visual quality; native audio generation |
| Runway Gen-4 | Runway | Multi-shot consistency; creative professional tools |
| Pika 2.5 | Pika Labs | Accessible pricing; creative effects |
| Veo 3 | Google DeepMind | Native dialogue generation; strong physics |
| Kling | Kuaishou | Strong motion consistency; available in China |
| Luma Dream Machine | Luma AI | Fast generation; 3D-aware video synthesis |
Audio processing represents another important dimension of multimodal AI, bridging the gap between written text and spoken language.
OpenAI's Whisper, released as open-source software in September 2022, transformed automatic speech recognition by training a single model on 680,000 hours of multilingual, multitask supervised data from the web. Whisper uses an encoder-decoder transformer architecture: input audio is split into 30-second chunks, converted to log-Mel spectrograms, and processed by the encoder. The decoder then predicts a sequence of tokens representing the transcription, language identification, timestamps, and other metadata.
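The spectrogram front end can be sketched in NumPy. This is a simplified illustration, not Whisper's actual preprocessing code, though the frame and mel-bin parameters (25 ms windows, 10 ms hop, 80 mel bins at 16 kHz) follow the published setup.

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Whisper-style front end (simplified): frame the waveform, take
    magnitude FFTs, apply a triangular mel filterbank, and take logs."""
    # --- Short-time Fourier transform (power spectrum)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (frames, n_fft//2 + 1)

    # --- Triangular mel filterbank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

    return np.log10(np.maximum(power @ fbank.T, 1e-10))    # (frames, n_mels)

# One second of a 440 Hz tone -> 98 frames of 80 mel features.
t = np.arange(16000) / 16000.0
mel = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(mel.shape)
```

The encoder consumes these log-mel frames rather than raw samples, which compresses the audio into a perceptually motivated representation that transformers handle well.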
Whisper supports transcription in multiple languages and translation into English. Its robustness to accents, background noise, and technical language comes from the scale and diversity of its training data. Whisper Large V3 (November 2023) improved accuracy further, and the model has been widely adopted as a building block in larger multimodal pipelines.
Other notable speech-to-text systems include Google's Universal Speech Model (USM), which supports over 100 languages, and ElevenLabs' Scribe (February 2025), which achieves 96.7% accuracy for English with features like word-level timestamps and speaker diarization across 99 languages.
Modern text-to-speech (TTS) systems have moved well beyond robotic-sounding output. ElevenLabs, founded in 2022, has become the leading TTS platform, with its models capable of producing speech that captures emotion, emphasis, pauses, and natural cadence. ElevenLabs' Eleven v3 (June 2025) supports over 70 languages, natural multi-speaker dialogue, and audio tags like [excited], [whispers], and [sighs] that give users fine-grained control over delivery.
OpenAI integrated TTS capabilities into GPT-4o, enabling real-time voice conversations where the model can detect and respond to emotional cues in the speaker's voice. Unlike earlier voice assistants that operated through a pipeline of separate speech-to-text and text-to-speech models, GPT-4o processes audio natively, preserving nuances that would be lost in a text intermediary.
| System | Type | Organization | Release | Key capability |
|---|---|---|---|---|
| Whisper | Speech-to-text | OpenAI | September 2022 | Multilingual; robust to noise; open source |
| USM | Speech-to-text | Google | 2023 | 100+ languages; 12M hours of speech data |
| Scribe | Speech-to-text | ElevenLabs | February 2025 | 99 languages; speaker diarization; 96.7% English accuracy |
| ElevenLabs Eleven v3 | Text-to-speech | ElevenLabs | June 2025 | 70+ languages; emotional expressiveness; audio tags |
| GPT-4o Voice | Both (native) | OpenAI | May 2024 | Native audio I/O; real-time conversation; emotion detection |
| Bark | Text-to-speech | Suno AI | 2023 | Open source; nonverbal sounds; music generation |
One of the core technical challenges in multimodal AI is aligning representations across different modalities so that the model can reason about their relationships. Images, text, and audio have fundamentally different structures: text is sequential and discrete, images are spatial and continuous, and audio is temporal and continuous. Mapping these diverse formats into a shared representation space where meaningful comparisons can be made remains difficult.
Simple projection layers (such as those used in LLaVA) are computationally efficient but may miss fine-grained correspondences between visual and textual information. More complex alignment mechanisms, like the gated cross-attention in Flamingo, improve integration but increase computational cost and architectural complexity. The choice of alignment strategy has a direct impact on model performance, particularly for tasks that require detailed spatial reasoning or precise grounding of text to visual regions.
Hallucination in multimodal models occurs when a model generates information that is not supported by the input data. In text-only language models, hallucination means fabricating facts; in multimodal systems, it can also mean describing objects that are not present in an image, misidentifying spatial relationships, or generating captions that contradict the visual evidence.
Research has identified several causes of multimodal hallucination. Models often over-rely on language priors, generating plausible-sounding descriptions based on textual patterns rather than actually attending to the visual input. This is especially problematic for unusual or unexpected scenes. Limited token constraints in bridging modules (like Q-Former in BLIP-2) can cause information loss, where the model simply does not have enough representational capacity to encode all relevant visual details. Training data biases, where certain objects or scenes are overrepresented, can lead models to "guess" based on statistical co-occurrence rather than genuine visual understanding.
Benchmarks like POPE (Polling-based Object Probing Evaluation) and CHAIR (Caption Hallucination Assessment with Image Relevance) specifically evaluate hallucination rates. Mitigating hallucination remains an active research area, with approaches including reinforcement learning from human feedback (RLHF) applied to visual outputs, improved training data curation, and architectural modifications that strengthen visual grounding.
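POPE-style evaluation is simple to express: poll the model with binary object-presence questions and score its yes/no answers against annotations. The helper below is a hypothetical illustration of that scoring, not the benchmark's official code.

```python
def pope_style_accuracy(model_answers, ground_truth):
    """POPE-style polling: ask binary 'Is there a <object> in the image?'
    questions and score yes/no answers against image annotations.
    Both arguments map question ids to 'yes'/'no' strings."""
    correct = sum(model_answers[q] == ground_truth[q] for q in ground_truth)
    return correct / len(ground_truth)

# Hypothetical results: the model hallucinates a 'dog' that is not present.
truth   = {"person?": "yes", "car?": "yes", "dog?": "no"}
answers = {"person?": "yes", "car?": "yes", "dog?": "yes"}
print(pope_style_accuracy(answers, truth))   # 2 of 3 correct
```

In the real benchmark, the negative questions are sampled adversarially (e.g., objects that frequently co-occur with the scene), which is exactly what exposes models that lean on language priors instead of the image.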
Evaluating multimodal models is harder than evaluating unimodal systems because the output quality depends on multiple interacting capabilities. A model might excel at recognizing objects but fail at understanding their spatial relationships, or correctly identify the content of a chart but misinterpret the data it presents. Single-number metrics often fail to capture these nuances.
Automatic metrics for generated images (FID, CLIP Score, Inception Score) measure distributional similarity or text-image alignment but do not fully capture perceptual quality, factual accuracy, or creative merit. For video generation, evaluation is even harder because temporal consistency, motion realism, and narrative coherence all matter but are difficult to quantify. Human evaluation remains the gold standard for many multimodal tasks, but it is expensive, slow, and subjective.
Multimodal models are generally more expensive to train and run than unimodal models because they must process and integrate multiple data types. Video understanding models, in particular, face enormous computational demands because video data is orders of magnitude larger than still images. Gemini 1.5 Pro's ability to process multi-hour videos within a million-token context window requires substantial infrastructure that is not accessible to most researchers.
The computational gap between proprietary and open-source multimodal models has narrowed since 2023, but cutting-edge multimodal training still requires resources available primarily to large technology companies. This raises concerns about equitable access to multimodal AI capabilities and the concentration of AI development among a small number of organizations.
Standardized benchmarks are essential for measuring progress in multimodal AI. The following benchmarks are among the most widely used for evaluating multimodal large language models.
MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning), introduced in late 2023 and published at CVPR 2024, evaluates expert-level multimodal understanding across a broad range of academic disciplines. It contains 11,500 questions drawn from college exams, quizzes, and textbooks, covering six core disciplines: Art and Design, Business, Science, Health and Medicine, Humanities and Social Science, and Tech and Engineering. The questions span 30 subjects and 183 subfields, with 30 different image types including charts, diagrams, maps, tables, music sheets, and chemical structures.
Even advanced models like GPT-4V initially achieved only around 56% accuracy on MMMU, indicating significant room for improvement. MMMU-Pro, a harder variant introduced in 2024, filters out questions answerable by text-only models and augments candidate options to reduce the impact of guessing. Model performance on MMMU-Pro ranges from approximately 16.8% to 26.9%, highlighting the gap between current capabilities and genuine expert-level multimodal reasoning.
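The effect of augmenting candidate options is easy to quantify: uniform random guessing on a 4-option question succeeds 25% of the time, but only 10% of the time with 10 options. A quick simulation of this guessing floor (illustrative only, not the MMMU-Pro methodology itself):

```python
import random

def random_guess_accuracy(num_options: int, trials: int = 100_000,
                          seed: int = 0) -> float:
    """Empirical accuracy of uniform random guessing on multiple-choice items.

    The correct answer is arbitrarily fixed at index 0; by symmetry the
    expected accuracy is 1 / num_options.
    """
    rng = random.Random(seed)
    correct = sum(rng.randrange(num_options) == 0 for _ in range(trials))
    return correct / trials

for k in (4, 10):
    print(f"{k} options: ~{random_guess_accuracy(k):.1%} by chance")
```

With the guessing floor pushed down to 10%, MMMU-Pro scores in the 16.8% to 26.9% range sit much closer to chance than raw MMMU scores do, which is exactly the point of the harder variant.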
MathVista evaluates mathematical reasoning in visual contexts, testing whether models can interpret graphs, geometric figures, statistical charts, and scientific diagrams to solve quantitative problems. The benchmark combines the challenges of mathematical reasoning and visual perception, requiring models to read data from charts, apply formulas, and perform multi-step calculations. The leading models as of 2024 achieve scores around 63.9, slightly surpassing the human average of 60.3.
MM-Vet evaluates large multimodal models on integrated capabilities that require combining multiple skills simultaneously. Rather than testing individual abilities in isolation (e.g., object recognition alone), MM-Vet poses questions that demand recognition, knowledge, spatial understanding, language generation, and mathematical reasoning together. This makes it a useful complement to benchmarks that test narrower skills.
| Benchmark | Focus area | Size | Key characteristic |
|---|---|---|---|
| MMMU | Expert-level multimodal reasoning | 11,500 questions | College-level across 30 subjects |
| MMMU-Pro | Robust multimodal reasoning | Subset of MMMU | Filters text-only solvable questions; harder options |
| MathVista | Mathematical visual reasoning | 6,141 problems | Charts, geometry, statistics |
| MM-Vet | Integrated multimodal capabilities | 218 questions | Tests combined skills |
| VQAv2 | Visual question answering | 1.1M questions | Balanced pairs to reduce language bias |
| TextVQA | Text reading in images | 45,336 questions | Requires OCR + reasoning |
| DocVQA | Document understanding | 50,000 questions | Scanned documents and forms |
| VideoMME | Video understanding | 900 videos | Short, medium, and long video comprehension |
| HallusionBench | Visual hallucination detection | 1,129 examples | Tests resistance to visual illusions and leading questions |
| WildVision | Real-world vision tasks | 500+ samples | User-submitted queries; open-ended |
The following table summarizes the most prominent multimodal AI models as of early 2026.
| Model | Organization | Release date | Input modalities | Output modalities | Open source | Notes |
|---|---|---|---|---|---|---|
| CLIP | OpenAI | January 2021 | Images, text | Embeddings (shared space) | Yes | Foundation for zero-shot classification and generative guidance |
| DALL-E 2 | OpenAI | April 2022 | Text | Images | No | Diffusion model guided by CLIP embeddings |
| Flamingo | DeepMind | April 2022 | Images, video, text | Text | No | Few-shot multimodal learning; Perceiver Resampler |
| Stable Diffusion | Stability AI | August 2022 | Text | Images | Yes | Latent diffusion; large open-source ecosystem |
| Whisper | OpenAI | September 2022 | Audio | Text | Yes | 680K hours training data; multilingual ASR |
| LLaVA | UW-Madison / Microsoft | April 2023 | Images, text | Text | Yes | ViT-MLP-LLM architecture; visual instruction tuning |
| GPT-4V | OpenAI | September 2023 | Images, text | Text | No | Vision adapter on GPT-4; strong visual reasoning |
| DALL-E 3 | OpenAI | September 2023 | Text | Images | No | Improved prompt adherence; text rendering in images |
| Gemini 1.0 | Google DeepMind | December 2023 | Text, images, audio, video | Text | No | Natively multimodal; Ultra/Pro/Nano tiers |
| Gemini 1.5 Pro | Google DeepMind | February 2024 | Text, images, audio, video | Text | No | 1M token context; mixture-of-experts |
| Claude 3 | Anthropic | March 2024 | Images, text | Text | No | Haiku/Sonnet/Opus tiers; strong document analysis |
| GPT-4o | OpenAI | May 2024 | Text, images, audio, video | Text, audio, images | No | Natively multimodal omni model; real-time voice |
| Claude 3.5 Sonnet | Anthropic | June 2024 | Images, text | Text | No | Strongest Claude vision model; chart/graph interpretation |
| Sora | OpenAI | December 2024 | Text | Video | No | Diffusion transformer on spacetime patches |
| InternVL 2.5 | Shanghai AI Lab | 2024 | Images, text | Text | Yes | Competitive with GPT-4V at 78B parameters |
| Qwen-VL | Alibaba | 2024 | Images, text | Text | Yes | Strong multilingual vision-language performance |
| Gemini 2.5 Pro | Google DeepMind | 2025 | Text, images, audio, video | Text, images | No | 84.8% on VideoMME; leading video understanding |
| Sora 2 | OpenAI | Late 2025 | Text | Video, audio | No | Native audio; improved physics understanding |
As of early 2026, several trends are shaping the trajectory of multimodal AI.
Unified any-to-any models. The boundary between understanding and generation is blurring. Rather than separate models for image understanding and image generation, researchers are working toward systems that can both comprehend and produce content across all modalities within a single architecture. GPT-4o's ability to accept and generate text, images, and audio in a unified model points toward this future.
World models and simulation. OpenAI described Sora as a "world simulator," and the text-to-video field is increasingly framed not just as content generation but as learning physical world dynamics. If a video model can accurately predict how objects move, interact, and behave under different conditions, it has effectively learned a model of physics that could be useful for robotics, autonomous vehicles, and scientific simulation.
Embodied multimodal AI. Google DeepMind's Gemini Robotics (2025) demonstrated robots using multimodal models to see, understand, and interact with physical environments. Combining vision, language, and action in embodied agents represents a natural extension of multimodal AI from digital content to the physical world.
Efficiency and accessibility. While frontier multimodal models require massive computational resources, there is strong momentum toward smaller, more efficient models that can run on consumer hardware or edge devices. Google's Gemini Nano runs on mobile phones; open-source models like Molmo and LLaVA variants offer competitive performance at 7B to 8B parameters. Techniques like quantization, distillation, and efficient attention mechanisms are making multimodal capabilities more accessible.
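Of these techniques, quantization is the most direct: weights are mapped from 32-bit floats to 8-bit integers plus a per-tensor scale factor, cutting memory roughly 4x at a small accuracy cost. A minimal symmetric int8 sketch in numpy (production toolchains use far more sophisticated schemes, such as per-channel scales and calibration):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from the int8 codes."""
    return q.astype(np.float32) * scale

# A random weight matrix standing in for a model layer.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
err = float(np.abs(dequantize(q, scale) - w).max())
print(f"memory: {w.nbytes} -> {q.nbytes} bytes; max abs error {err:.4f}")
```

The worst-case rounding error is half a quantization step (scale / 2), which is why int8 works well for weights whose values cluster near zero and degrades when a tensor has extreme outliers.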
Improved evaluation. The community recognizes that current benchmarks do not fully capture multimodal model capabilities. MMMU-Pro, HallusionBench, and WildVision represent efforts to create more rigorous, harder-to-game evaluations. Future benchmarks will likely focus more on real-world task completion, long-form reasoning, and resistance to adversarial inputs.
Safety and alignment. As multimodal models become more capable, concerns about misuse grow. Deepfake generation, misinformation through manipulated images, and privacy violations through visual surveillance are all exacerbated by increasingly powerful multimodal systems. Developing robust safeguards, content provenance systems (like C2PA metadata for generated images), and alignment techniques specific to multimodal outputs remains an open challenge.