# Multimodal Model

> Source: https://aiwiki.ai/wiki/multimodal_model
> Updated: 2026-06-20
> Categories: Computer Vision, Deep Learning, Machine Learning, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms), [Large language model](/wiki/large_language_model), [Computer vision](/wiki/computer_vision)*

A **multimodal model** is an artificial intelligence system that processes, relates, and generates information across two or more types of data, or modalities, such as text, images, audio, and video, within a single model. Unlike unimodal models that operate on one data type, multimodal models learn joint representations that capture relationships across modalities, enabling tasks no single-modality model can perform alone, such as describing an image in words, answering questions about a photo, or generating an image from a text prompt. The approach was crystallized by OpenAI's CLIP in 2021, which trained on 400 million image-text pairs and reached 76.2% zero-shot accuracy on ImageNet, matching a fully supervised ResNet-50 without using any of ImageNet's 1.28 million labeled training examples.[1][13] By late 2023 the paradigm had produced natively multimodal large language models such as Google DeepMind's Gemini, whose Ultra model scored 90.0% on MMLU, reported as the first model result to surpass the 89.8% human-expert score on that benchmark.[8]

## Introduction

A **multimodal model** in [machine learning](/wiki/machine_learning) is a system that can process, relate, and generate information across multiple types of data, or modalities, such as text, images, audio, and video. Unlike unimodal models that operate on a single data type, multimodal models learn joint representations that capture patterns and relationships across different modalities. This enables capabilities that no single-modality model can achieve on its own, such as describing an image in natural language, answering questions about a photograph, or generating an image from a text prompt.

Multimodal models have become one of the most active areas of artificial intelligence research. Advances in [deep learning](/wiki/deep_model) and the [transformer](/wiki/transformer) architecture have driven rapid progress, enabling models like GPT-4V, Gemini, and Claude 3 to process text and images within a single conversation. The field spans a wide range of tasks, from visual question answering and image captioning to text-to-image generation and audio transcription.

## Explain like I'm 5 (ELI5)

Imagine you have a robot friend that can look at pictures, read words, and listen to sounds all at the same time. Most robots can only do one of those things. But a multimodal model is like a robot that understands all of them together. If you show it a photo of a dog playing in a park and ask "what is the dog doing?", it can look at the picture and answer in words: "The dog is playing fetch." It connects what it sees with what it reads and hears, just like you do every day when you watch a movie (you see images and hear sounds at the same time to understand the story).

## What is a multimodal model used for?

Multimodal models power image captioning, visual question answering, text-to-image and text-to-video generation, speech recognition, document understanding, visual search, and accessibility tools. In practice this means a single model can read a chart, transcribe a meeting, generate an illustration from a sentence, or answer questions about an uploaded photo. These applications are detailed in the Applications section below.

## Historical evolution

The development of multimodal models has progressed through several distinct phases, moving from isolated single-modality systems toward increasingly unified architectures.

### Early approaches (pre-2015)

Early multimodal systems relied on hand-crafted features and separate processing pipelines for each modality. In computer vision, bag-of-visual-words models extracted image features, while natural language processing relied on techniques like TF-IDF and word2vec for text. Combining modalities typically meant extracting features independently and then concatenating them for a downstream classifier. These approaches were limited by the quality of the hand-engineered features and the difficulty of aligning representations across modalities.

### Deep learning era (2015 to 2020)

The rise of [convolutional neural network](/wiki/convolutional_neural_network) architectures (such as VGGNet and ResNet) for image understanding and recurrent neural networks (LSTMs) for text processing enabled end-to-end trainable multimodal systems. Models for image captioning, such as Show and Tell (Vinyals et al., 2015), used a CNN to encode an image and an LSTM to generate a caption.[12] Visual question answering (VQA) systems combined image features from CNNs with question embeddings from LSTMs to predict answers. These models demonstrated that [neural network](/wiki/neural_network) architectures could learn to bridge modalities, but they still used separate encoders for each data type with relatively simple fusion mechanisms.

### Vision-language pretraining (2021 to 2022)

The introduction of CLIP (Contrastive Language-Image Pre-training) by Radford et al. in January 2021 marked a turning point.[1] CLIP trained a vision encoder and a text encoder jointly on 400 million image-text pairs from the internet, using a contrastive learning objective to align their [embedding vector](/wiki/embedding_vector) spaces.[1] The authors framed the motivation directly: "State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories," a constraint CLIP removed by learning from natural language instead.[1] This produced a model with strong zero-shot transfer: CLIP reached 76.2% top-1 accuracy on ImageNet without any task-specific fine-tuning, matching the original ResNet-50 while earlier zero-shot methods had managed only 11.5%.[1][13] Around the same time, Google introduced ALIGN (Jia et al., 2021), which demonstrated that scaling to over one billion noisy image-text pairs could compensate for data quality issues and achieve state-of-the-art results on image-text retrieval benchmarks.[2]

DeepMind's Flamingo (Alayrac et al., 2022) took a different approach, bridging a frozen pretrained vision encoder with a frozen large language model using novel cross-[attention](/wiki/attention) layers.[3] Flamingo could handle arbitrarily interleaved sequences of images and text, enabling few-shot learning on multimodal tasks. Using only a handful of examples per task, the 80-billion-parameter model outperformed prior zero-shot and few-shot methods by a large margin on all 16 multimodal benchmarks considered.[3] The authors described their key innovations as an architecture that can "(i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs."[3]

### Multimodal large language models (2023 to present)

The current era is defined by natively multimodal large language models. Google DeepMind's Gemini (December 2023) was designed from the ground up as a multimodal model, jointly processing text, images, audio, and video within a single transformer decoder.[8] The Gemini technical report states the family "was designed to be natively multimodal, pre-trained from the start on different modalities," and reports that Gemini Ultra reached 90.0% on the MMLU benchmark, the first model to exceed the 89.8% human-expert score.[8] OpenAI's GPT-4V (September 2023) extended GPT-4 with vision capabilities, allowing users to upload images alongside text prompts. Anthropic's Claude 3 (March 2024) introduced vision across all model sizes. LLaVA (Liu et al., 2023), an open-source project led by researchers at the University of Wisconsin-Madison, demonstrated that connecting a CLIP vision encoder to a language model (Vicuna) through a trainable projection matrix, followed by visual instruction tuning, could produce strong multimodal capabilities at a fraction of the cost of proprietary models.[4]

The trend has accelerated in 2024 and 2025, with models like GPT-4o (which processes text, images, and audio natively), Gemini 2.0 (which adds multimodal output including generated images and text-to-speech), and open-source alternatives such as LLaVA-NeXT and VideoLLaMA 2 advancing the state of the art.

## Types of multimodal learning

Multimodal learning encompasses several distinct paradigms, each addressing different aspects of how models can work with multiple data types.

| Type | Description | Example |
|---|---|---|
| Fusion | Combining information from multiple modalities into a joint representation for prediction | A sentiment analysis model that combines text and facial expression features to classify emotion |
| Alignment | Learning a shared representation space where items from different modalities can be directly compared | CLIP aligns image and text embeddings so that an image of a cat is close to the text "a photo of a cat" |
| Translation | Converting data from one modality to another | Image captioning (image to text), text-to-image generation (text to image), speech-to-text (audio to text) |
| Co-learning | Using information from a resource-rich modality to improve learning in a resource-poor modality | Using large amounts of unlabeled images to improve performance on a text classification task with limited labeled data |

## Fusion strategies

Fusion refers to how information from different modalities is combined within a model. The choice of fusion strategy has significant implications for model performance, computational cost, and the types of cross-modal interactions the model can capture.

### Early fusion

Early fusion (also called input-level fusion) combines raw or lightly processed inputs from all modalities at the beginning of the model pipeline. For example, image patch embeddings and text token embeddings might be concatenated into a single sequence before being fed into a transformer. Early fusion allows the model to learn cross-modal interactions from the very first layers, enabling fine-grained interactions between modalities. However, it can be computationally expensive because the model must process the combined input at every layer.

Gemini uses a form of early fusion, with all modalities entering a shared transformer decoder via type embeddings. GPT-4o similarly processes text, image, and audio tokens within a single model.
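The input-level concatenation described above can be sketched in a few lines. This is a toy illustration, not the design of any named model: the function names, the 2-dimensional embeddings, and the specific type-embedding values are all hypothetical.

```python
from typing import List

Vector = List[float]


def add_type_embedding(tokens: List[Vector], type_emb: Vector) -> List[Vector]:
    """Tag every token with its modality by adding a type embedding."""
    return [[x + t for x, t in zip(token, type_emb)] for token in tokens]


def fuse_early(image_patches: List[Vector], text_tokens: List[Vector],
               image_type: Vector, text_type: Vector) -> List[Vector]:
    """Concatenate tagged image and text tokens into one input sequence."""
    return (add_type_embedding(image_patches, image_type)
            + add_type_embedding(text_tokens, text_type))


# Toy inputs (hypothetical values): 4 image patches and 3 text tokens.
image_patches = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

sequence = fuse_early(image_patches, text_tokens,
                      image_type=[10.0, 0.0], text_type=[0.0, 10.0])

# Self-attention scores every token against every other token, so the
# number of pairwise scores grows with the square of the fused length.
attention_pairs = len(sequence) ** 2
print(len(sequence), attention_pairs)
```

The quadratic growth of `attention_pairs` with sequence length is the source of the computational cost noted above: every added image patch interacts with every text token at every layer.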

### Late fusion

Late fusion (also called decision-level fusion) processes each modality independently through separate encoders and combines their outputs only at the final prediction stage. This is the simplest fusion approach: each modality is processed by its own specialized model, and the predictions or feature vectors are combined (for example, by averaging, voting, or a learned combination layer) to produce a final output. Late fusion is computationally efficient and allows each encoder to be independently pretrained. However, it limits the ability to model fine-grained interactions between modalities.

CLIP and ALIGN use a late fusion approach: separate image and text encoders produce embeddings that are compared only through a cosine similarity function.[1][2]
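As a rough illustration of decision-level fusion by weighted averaging, the sketch below combines class probabilities from two independent unimodal classifiers. The probability values, label names, and the `late_fusion` helper are hypothetical; a real system would obtain the per-modality predictions from trained encoders.

```python
from typing import Dict, List


def late_fusion(predictions: List[Dict[str, float]],
                weights: List[float]) -> Dict[str, float]:
    """Weighted average of per-modality class probabilities."""
    total = sum(weights)
    labels = predictions[0].keys()
    return {
        label: sum(w * p[label] for w, p in zip(weights, predictions)) / total
        for label in labels
    }


# Hypothetical outputs of a text classifier and a facial-expression classifier.
text_probs = {"positive": 0.7, "negative": 0.3}
face_probs = {"positive": 0.4, "negative": 0.6}

fused = late_fusion([text_probs, face_probs], weights=[1.0, 1.0])
best_label = max(fused, key=fused.get)
print(fused, best_label)
```

Because the two classifiers never see each other's inputs, the fusion step can only reconcile their final scores; it cannot recover interactions such as sarcasm that is visible only when words and expression are read together.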

### Cross-attention fusion

Cross-attention fusion inserts attention layers that allow one modality to attend to another at intermediate stages of processing. In this approach, queries come from one modality (for example, text tokens) and keys and values come from another (for example, image patch features). This enables rich, fine-grained cross-modal interactions while still maintaining separate processing streams for each modality.

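Flamingo combines two components: a Perceiver Resampler, which converts a variable number of visual features into a fixed set of visual tokens, and gated cross-attention dense layers, which are interleaved with the frozen language model's layers so that text tokens can attend to those visual tokens.[3] LLaVA, by contrast, does not add cross-attention layers; it projects visual features into the language model's embedding space through a trainable linear projection and feeds them in as input tokens.[4]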
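The query/key/value arrangement described above can be sketched as single-head scaled dot-product attention. This is a simplified illustration that omits the learned projection matrices, gating, and multiple heads used in real models; the toy vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], keys: List[Vector],
                    values: List[Vector]) -> List[Vector]:
    """Each query (text token) takes a weighted average of the values
    (image patches), weighted by query-key similarity."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Two text-token queries attend over three image-patch keys/values.
text_queries = [[1.0, 0.0], [0.0, 1.0]]
image_keys = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
image_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

attended = cross_attention(text_queries, image_keys, image_values)
print(attended)
```

Each output row has the dimensionality of the image values but one row per text token, which is how the text stream receives visual information without the two streams being merged.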

| Fusion strategy | When modalities are combined | Strengths | Limitations |
|---|---|---|---|
| Early fusion | At the input layer | Rich cross-modal interactions from the start; unified representation | Computationally expensive; requires all modalities at input time |
| Late fusion | At the output/decision layer | Modular; allows independent pretraining; computationally efficient | Limited cross-modal interaction; cannot capture fine-grained dependencies |
| Cross-attention fusion | At intermediate layers | Fine-grained cross-modal interaction; flexible architecture | Adds architectural complexity; requires careful design of attention patterns |

## Contrastive learning for multimodal alignment

Contrastive learning has emerged as one of the most effective techniques for aligning representations across modalities. The core idea is to train a model so that matching pairs (for example, an image and its correct caption) have similar representations, while non-matching pairs have dissimilar representations.

### CLIP's approach

CLIP (Radford et al., 2021) uses a dual-encoder architecture: an image encoder (a ResNet or a vision transformer, with the ViT variants performing best) and a text transformer for captions.[1] During training, a batch of N image-text pairs produces N correct pairings and N^2 - N incorrect pairings. The model maximizes the cosine similarity between embeddings of correct pairs while minimizing similarity for incorrect pairs, using a symmetric cross-entropy loss.[1] This simple objective, when scaled to 400 million image-text pairs, produces representations that transfer to a wide range of downstream tasks without any task-specific fine-tuning (zero-shot transfer). On a 27-dataset evaluation suite, zero-shot CLIP outperformed a fully supervised linear classifier fitted on ResNet-50 features on 16 of the datasets.[1]
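The symmetric cross-entropy objective can be sketched on toy embeddings as follows. This is an illustrative reimplementation rather than OpenAI's code: the temperature is fixed at 0.07 here as an assumption (in CLIP it is a learned parameter), and the 3-dimensional embeddings are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row: List[float]) -> List[float]:
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embs: List[Vector], text_embs: List[Vector],
                     temperature: float = 0.07) -> float:
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity of image i and text j.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: row k should assign the highest score to column k.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: column k should assign the highest score to row k.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched_texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled_texts = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

n = len(images)
negatives = n * n - n  # incorrect pairings in the batch
loss_matched = contrastive_loss(images, matched_texts)
loss_shuffled = contrastive_loss(images, shuffled_texts)
print(negatives, loss_matched, loss_shuffled)
```

With a batch of 3 pairs there are 6 incorrect pairings, and the loss is near zero when each image lines up with its own caption and large when the captions are shuffled.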

### SigLIP

SigLIP (Zhai et al., 2023) replaced CLIP's softmax-based contrastive loss with a sigmoid loss.[7] Instead of normalizing similarities across the entire batch (which requires a global view of all pairwise similarities), SigLIP treats each image-text pair as an independent binary classification problem. This change enables more efficient scaling to larger batch sizes and improves performance at smaller batch sizes.[7] SigLIP has become a popular vision encoder choice in recent multimodal systems, including PaLI-3 and PaliGemma.
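A minimal sketch of the pairwise sigmoid objective is shown below, for contrast with a batch-normalized softmax loss. It is not the reference implementation: the temperature and bias are fixed at 10 and -10 as an assumption (the paper describes them as learnable, with these values as initializations), and the toy embeddings are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(image_embs: List[Vector], text_embs: List[Vector],
                          temperature: float = 10.0,
                          bias: float = -10.0) -> float:
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            # Each pair is its own binary problem: +1 if matching, else -1.
            label = 1.0 if i == j else -1.0
            total += -log_sigmoid(label * logit)
    return total / len(imgs)


images = [[1.0, 0.0], [0.0, 1.0]]
matched = sigmoid_pairwise_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched = sigmoid_pairwise_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(matched, mismatched)
```

No term in the double loop depends on any other pair, so the sum can be computed in independent chunks; that independence is what removes the need for a global view of the batch.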

### ALIGN

ALIGN (Jia et al., 2021) demonstrated that scaling training data to over one billion noisy image alt-text pairs, without expensive filtering or curation, could achieve state-of-the-art results.[2] Using a dual-encoder architecture with EfficientNet-L2 as the image encoder and BERT-Large as the text encoder, ALIGN showed that data scale could compensate for noise in the training data.[2]

## Vision-language models

Vision-language models (VLMs) are multimodal systems specifically designed to process and relate visual and textual information. They represent one of the most mature categories of multimodal models.

| Model | Organization | Year | Architecture | Key contribution |
|---|---|---|---|---|
| CLIP | OpenAI | 2021 | Dual-encoder (ViT + text transformer) | Contrastive pretraining on 400M image-text pairs; 76.2% zero-shot ImageNet |
| ALIGN | Google | 2021 | Dual-encoder (EfficientNet + BERT) | Scaling to 1B+ noisy image-text pairs without curation |
| Flamingo | DeepMind | 2022 | Frozen LM + vision encoder with cross-attention | Few-shot SOTA on 16 multimodal tasks; handles interleaved image-text sequences |
| SigLIP | Google | 2023 | Dual-encoder (ViT + text transformer) | Sigmoid loss for more scalable contrastive learning |
| LLaVA | University of Wisconsin-Madison | 2023 | CLIP ViT + Vicuna LLM with projection layer | Visual instruction tuning; open-source multimodal LLM |

## Multimodal large language models

Multimodal large language models (MLLMs) extend [large language model](/wiki/large_language_model) architectures to accept inputs beyond text, typically images, audio, and video. These systems represent the current frontier of multimodal AI.

### GPT-4V and GPT-4o

OpenAI's GPT-4V (September 2023) added vision capabilities to GPT-4, allowing users to upload images and receive text responses that reason about the visual content. GPT-4o (released May 13, 2024) went further, processing text, images, and audio natively within a single model that was trained end-to-end across all three modalities.[14] OpenAI reports that GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which it notes is "similar to human response time in a conversation."[14] The exact architecture of GPT-4V and GPT-4o has not been publicly disclosed, but they represent some of the most capable multimodal systems available.

### Gemini

Google DeepMind's Gemini family (launched December 2023) was built as a natively multimodal model from the start.[8] Unlike systems that bolt vision modules onto existing language models, Gemini's transformer decoder processes text, image, audio, and video tokens through a unified architecture.[8] Gemini Ultra scored 90.0% on MMLU, which the technical report describes as the first model result to surpass the 89.8% score attributed to human experts on that benchmark.[8] The family includes Gemini Ultra (for complex tasks), Gemini Pro (general purpose), and Gemini Nano (on-device). Gemini 2.0 (December 2024) extended capabilities to include multimodal output, generating images and speech alongside text.

### Claude 3

Anthropic's Claude 3 family (March 2024) added vision capabilities across all model sizes (Haiku, Sonnet, and Opus). Claude 3 models can process and analyze images, extract data from documents, and reason about visual content within the context of a conversation.

### LLaVA

LLaVA (Large Language and Vision Assistant) by Liu et al. (2023) demonstrated an effective open-source approach to building multimodal LLMs.[4] The architecture connects a pretrained CLIP ViT-L/14 vision encoder to the Vicuna language model through a trainable projection matrix.[4] Training proceeds in two stages: (1) feature alignment pretraining, which trains the projection on image-text pairs, and (2) visual instruction tuning on GPT-generated multimodal instruction-following examples. The original paper used roughly 595K filtered image-text pairs and 158K instruction-following samples; the figures often quoted for these stages (558K pairs, then 150K GPT-generated examples plus 515K VQA samples) belong to the later LLaVA-1.5 recipe. LLaVA achieved an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset and spawned a family of follow-up models including LLaVA-1.5 and LLaVA-NeXT.[4]
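The role of the projection matrix can be illustrated with a toy example: visual features of one width are mapped to the language model's embedding width and then placed in the same sequence as the text embeddings. The dimensions, weights, and function names below are hypothetical and far smaller than in the real model.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(features: List[Vector], weight: Matrix) -> List[Vector]:
    """Map n vectors of width d_vision to width d_lm using a
    d_vision x d_lm weight matrix (a plain linear projection)."""
    d_lm = len(weight[0])
    return [
        [sum(f[k] * weight[k][j] for k in range(len(weight)))
         for j in range(d_lm)]
        for f in features
    ]


# Toy sizes: vision features are 3-d, language model embeddings are 2-d.
vision_features = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
projection = [[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]]
text_embeddings = [[0.5, 0.5], [0.1, 0.9], [0.3, 0.7]]

visual_tokens = project(vision_features, projection)
# The language model then sees visual tokens followed by text tokens.
lm_input = visual_tokens + text_embeddings
print(visual_tokens, len(lm_input))
```

In this arrangement only `projection` needs to be learned during feature alignment, which is one reason the approach is cheap relative to training a multimodal model from scratch.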

## Text-to-image models

Text-to-image generation is a prominent application of multimodal learning, translating textual descriptions into visual content. These systems are a type of [generative model](/wiki/generative_model) that operates across the text and image modalities.

### DALL-E

OpenAI's DALL-E (January 2021) was among the first large-scale text-to-image models, using a transformer to generate images autoregressively from text tokens. DALL-E 2 (April 2022) switched to a diffusion-based architecture, using CLIP embeddings to guide image generation. DALL-E 3 (October 2023) improved text rendering and prompt adherence by training on highly descriptive captions and integrating directly with ChatGPT.

### Stable Diffusion

Stable Diffusion (Rombach et al., August 2022), developed by Stability AI in collaboration with LMU Munich and Runway, is a latent diffusion model that performs the denoising process in a compressed latent space rather than in pixel space.[6] This approach dramatically reduces computational requirements, making high-quality image generation accessible on consumer GPUs.[6] The model uses a pretrained text encoder (typically CLIP) to convert text prompts into embeddings that guide the diffusion process. Stable Diffusion XL and subsequent versions have improved image quality and resolution.

### Midjourney

Midjourney is a proprietary text-to-image generation service that became widely popular for its distinctive artistic style and high-quality outputs. While the specific architecture has not been publicly disclosed, it is generally understood to be based on diffusion model techniques.

| Model | Developer | Year | Architecture | Key feature |
|---|---|---|---|---|
| DALL-E | OpenAI | 2021 | Autoregressive transformer | Among the first large-scale text-to-image models |
| DALL-E 2 | OpenAI | 2022 | CLIP-guided diffusion | CLIP embeddings for text-image alignment |
| Stable Diffusion | Stability AI / LMU Munich / Runway | 2022 | Latent diffusion model | Open-source; runs on consumer GPUs |
| DALL-E 3 | OpenAI | 2023 | Improved diffusion with descriptive captions | Better prompt adherence; ChatGPT integration |
| Midjourney | Midjourney, Inc. | 2022 | Diffusion-based (proprietary) | Distinctive artistic quality |

## Audio-language models

Audio-language models bridge the gap between spoken or acoustic data and text, enabling tasks such as speech recognition, audio captioning, and music generation.

### Whisper

OpenAI's Whisper (Radford et al., 2022) is a general-purpose speech recognition model trained on 680,000 hours of multilingual and multitask supervised data collected from the web.[5] It uses an encoder-decoder transformer architecture, with input audio split into 30-second chunks and converted into log-Mel spectrograms.[5] The authors report that when "scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning."[5] Whisper performs multilingual speech recognition, speech translation, and language identification.[5] Its robustness across accents, background noise, and technical language has made it a common audio encoder in later multimodal architectures, including MMS-LLaMA and Omni-AVSR.
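The 30-second chunking step can be sketched as follows. This is a simplified illustration, not Whisper's own preprocessing code: the `chunk_audio` helper is hypothetical, the 16 kHz default sample rate and zero-padding of the final chunk are assumptions, and the log-Mel spectrogram step is omitted.

```python
from typing import List


def chunk_audio(samples: List[float], sample_rate: int = 16000,
                chunk_seconds: int = 30) -> List[List[float]]:
    """Split a mono waveform into fixed-length chunks, zero-padding the last."""
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        chunk = chunk + [0.0] * (chunk_len - len(chunk))
        chunks.append(chunk)
    return chunks


# A 70-second dummy signal at a low toy sample rate to keep the example small.
toy_rate = 100
signal = [0.5] * (70 * toy_rate)
chunks = chunk_audio(signal, sample_rate=toy_rate)
print(len(chunks), [len(c) for c in chunks])
```

Fixed-length inputs let the encoder operate on spectrograms of a single shape regardless of the original recording length.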

### AudioLM

Google's AudioLM (Borsos et al., 2023) generates realistic audio (speech and music) by treating audio generation as a language modeling problem.[9] It maps audio to discrete tokens using neural audio codecs and then uses a transformer-based language model to predict the next token in the sequence.[9] AudioLM can generate coherent speech continuations and music that maintains the style and structure of a given prompt.[9]

## Video understanding models

Video understanding extends multimodal learning to temporal sequences of images, often combined with audio. This is one of the most challenging multimodal domains because it requires capturing both spatial and temporal information.

### TimeSformer

TimeSformer (Bertasius et al., 2021), developed at Facebook AI (now Meta AI), applies the transformer architecture to video understanding by treating a video as a sequence of spatial patches across time.[10] It replaces 3D convolutions entirely with space-time self-attention, achieving strong results on action recognition benchmarks while being roughly three times faster to train than 3D CNNs and requiring less than one-tenth the inference compute.[10]

### VideoMAE

VideoMAE (Tong et al., 2022) applies masked autoencoder pretraining to video data. By masking a high percentage (90-95%) of video patches and training a model to reconstruct them, VideoMAE learns effective spatiotemporal representations for downstream tasks like action recognition and video retrieval.
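The high masking ratio can be illustrated with a small mask generator. The sketch below uses a tube-style mask, in which the same spatial positions are hidden in every frame; that choice is an assumption drawn from the VideoMAE paper rather than from the text above, and the frame and patch counts are hypothetical.

```python
import random
from typing import List


def tube_mask(num_frames: int, patches_per_frame: int,
              mask_ratio: float = 0.9, seed: int = 0) -> List[List[bool]]:
    """Return a [frame][patch] grid where True means the patch is masked.
    The same spatial positions are masked in every frame."""
    rng = random.Random(seed)
    num_masked = int(round(patches_per_frame * mask_ratio))
    masked = set(rng.sample(range(patches_per_frame), num_masked))
    frame_mask = [p in masked for p in range(patches_per_frame)]
    return [list(frame_mask) for _ in range(num_frames)]


# Toy setup: 8 temporal slices, 14 x 14 = 196 spatial patches per slice.
mask = tube_mask(num_frames=8, patches_per_frame=196, mask_ratio=0.9)
visible_per_frame = sum(1 for m in mask[0] if not m)
visible_total = visible_per_frame * len(mask)
print(visible_per_frame, visible_total)
```

Only the visible patches are passed to the encoder, so at a 90% ratio the encoder processes roughly a tenth of the tokens, which keeps pretraining on video tractable.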

### VideoLLaMA 2

VideoLLaMA 2 (Cheng et al., 2024) is a video large language model that advances spatial-temporal modeling through a Spatial-Temporal Convolution (STC) connector and integrates an audio branch through joint training. It achieves competitive results on video question answering and video captioning tasks, representing the trend of extending LLM-based multimodal systems to video.

## Applications

Multimodal models have been applied across a wide range of domains.

### Visual question answering

Visual question answering (VQA) requires a model to answer natural language questions about an image. Given an image and a question such as "How many people are in the picture?", the model must understand both the visual content and the linguistic query to produce a correct answer. VQA has been one of the primary benchmarks for evaluating multimodal models, with datasets like VQAv2 containing over 1 million questions about real images.

### Image captioning

Image captioning generates natural language descriptions of images. Modern captioning systems use vision encoders to extract image features and language models to generate fluent descriptions. The COCO Captions benchmark, with over 1.5 million captions for 330,000 images, is the standard evaluation dataset. Metrics such as CIDEr, BLEU, and METEOR measure the quality of generated captions against human references.

### Visual search and retrieval

Models like CLIP and ALIGN enable visual search, where a text query is used to find relevant images (or vice versa) by comparing embeddings in a shared representation space. This has practical applications in stock photo search, e-commerce product discovery, and content moderation.
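Retrieval in a shared embedding space reduces to ranking by cosine similarity. The sketch below assumes the embeddings have already been produced by an image encoder and a text encoder; the file names and vector values are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_emb: Vector, index: Dict[str, Vector],
           top_k: int = 2) -> List[str]:
    """Rank indexed images by similarity to the text query embedding."""
    scored = [(cosine(query_emb, emb), name) for name, emb in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical precomputed image embeddings.
image_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
# Hypothetical embedding of the text query "a photo of a cat".
query = [1.0, 0.2, 0.0]

results = search(query, image_index)
print(results)
```

Because image embeddings can be computed once and stored, each new query costs only one text-encoder pass plus the similarity ranking.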

### Accessibility

Multimodal models improve accessibility for people with visual or hearing impairments. Image captioning and visual description systems can narrate visual content for blind and low-vision users, while speech recognition models like Whisper provide accurate transcriptions for deaf and hard-of-hearing users.

### Medical imaging and diagnosis

Multimodal models that combine medical images (X-rays, MRIs, pathology slides) with clinical text (radiology reports, patient notes) can assist in diagnosis, report generation, and clinical decision support. These applications require careful validation and are subject to regulatory oversight.

### Creative generation

Text-to-image models like DALL-E, Stable Diffusion, and Midjourney have transformed creative workflows in design, advertising, and entertainment. These tools allow artists and non-artists alike to generate, iterate on, and refine visual content from text descriptions.

## Benchmarks and evaluation

Evaluating multimodal models requires benchmarks that test cross-modal understanding across different tasks and difficulty levels.

| Benchmark | Task | Modalities | Key metric | Description |
|---|---|---|---|---|
| VQAv2 | Visual question answering | Image + text | VQA accuracy | 1M+ questions about real images; accounts for annotator variance |
| COCO Captions | Image captioning | Image + text | CIDEr | 1.5M captions for 330K images from Microsoft COCO |
| ImageNet | Image classification | Image (+ text for zero-shot) | Top-1 / Top-5 accuracy | Full dataset has 14M+ images across 21K categories; zero-shot results are usually reported on the 1,000-class ILSVRC subset (1.28M training images) |
| MMLU | Language understanding | Text | Accuracy | 57 academic subjects; tests knowledge and reasoning; used for general LLM evaluation |
| MMBench | Multimodal understanding | Image + text | Accuracy across 20 abilities | Tests perception, reasoning, and knowledge in MLLMs |
| TextVQA | Text recognition in images | Image + text | VQA accuracy | Questions that require reading text within images |
| SEED-Bench | Comprehensive MLLM evaluation | Image + text | Multiple choice accuracy | Tests 12 dimensions including spatial and temporal understanding |
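The VQA accuracy metric listed for VQAv2 handles annotator variance by giving partial credit when only some human annotators agree with the model. A simplified sketch of the commonly cited formula, min(matching annotators / 3, 1), is shown below; the official evaluation script additionally normalizes answer strings and averages over annotator subsets, which this hypothetical helper does not do.

```python
from typing import List


def vqa_accuracy(predicted: str, human_answers: List[str]) -> float:
    """Simplified VQA accuracy: full credit if at least 3 annotators agree."""
    target = predicted.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == target)
    return min(matches / 3.0, 1.0)


# Ten hypothetical annotator answers to "How many people are in the picture?"
answers = ["2", "2", "2", "2", "2", "2", "3", "3", "two", "two"]

score_majority = vqa_accuracy("2", answers)
score_minority = vqa_accuracy("3", answers)
score_wrong = vqa_accuracy("4", answers)
print(score_majority, score_minority, score_wrong)
```

An answer given by only two annotators earns two-thirds credit here, reflecting that reasonable people sometimes disagree about an image.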

## How does a multimodal model differ from a unimodal model?

A unimodal model accepts and reasons over a single data type, for example a text-only [large language model](/wiki/large_language_model) or an image-only classifier. A multimodal model accepts two or more modalities and learns a shared or aligned representation so that information from one modality can inform another. The practical consequence is capability: a unimodal language model cannot see an uploaded chart, and a unimodal image classifier cannot follow a written instruction, whereas a multimodal model such as GPT-4o or Gemini can do both within one exchange. Natively multimodal systems are pre-trained on multiple modalities together, as opposed to bolting a vision encoder onto a pre-trained text model after the fact.[8]

## Challenges and limitations

### Hallucination

Multimodal hallucination occurs when a model generates outputs that are inconsistent with the visual or auditory input. For example, a model might describe objects that are not present in an image, fabricate text that supposedly appears in a document, or incorrectly identify spatial relationships between objects. Hallucination arises from several sources: cross-modal misalignment (where the model's text generation drifts away from the grounding provided by the visual input), overreliance on unimodal priors (where the language model's learned biases override visual evidence), and spurious inter-modality correlations learned during training. Mitigating hallucination remains an active area of research, with approaches including reinforcement learning from human feedback (RLHF), improved training data curation, and specialized decoding strategies.

### Modality gap

Even when trained to align representations across modalities, models often exhibit a "modality gap" in which embeddings from different modalities occupy distinct regions of the representation space rather than being fully intermixed. This gap can lead to systematic biases in cross-modal retrieval and classification. Research by Liang et al. (2022) showed that CLIP embeddings from images and text occupy two separate, narrow cones in the embedding space, limiting the effectiveness of direct embedding comparison.[11]
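One simple way to quantify the gap is the distance between the centroid of the image embeddings and the centroid of the text embeddings after normalization. The sketch below follows that idea on hypothetical 2-dimensional vectors; it is an illustration of the measurement, not a reproduction of the cited experiments.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(vectors: List[Vector]) -> Vector:
    count = len(vectors)
    return [sum(v[j] for v in vectors) / count for j in range(len(vectors[0]))]


def modality_gap(image_embs: List[Vector], text_embs: List[Vector]) -> float:
    """Euclidean distance between the two modality centroids."""
    image_center = centroid([normalize(v) for v in image_embs])
    text_center = centroid([normalize(v) for v in text_embs])
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(image_center, text_center)))


# Hypothetical embeddings: each modality sits in its own narrow region.
image_embs = [[1.0, 0.1], [1.0, -0.1]]
text_embs = [[0.1, 1.0], [-0.1, 1.0]]

gap = modality_gap(image_embs, text_embs)
no_gap = modality_gap(image_embs, image_embs)
print(gap, no_gap)
```

A gap near zero would indicate that the two modalities are intermixed; a large value indicates that they occupy separate regions even if matching pairs are relatively closer than non-matching ones.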

### Computational cost

Multimodal models, particularly those using early fusion, are extremely computationally expensive. Processing high-resolution images, long videos, or audio alongside text requires substantially more memory and compute than text-only models. Most vision-language models process images at relatively low resolutions (224x224 or 336x336 pixels) to manage costs. Training frontier multimodal models from scratch requires thousands of GPUs over weeks or months, making this area accessible primarily to well-resourced research labs and companies.
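The effect of input resolution on cost can be made concrete by counting visual tokens. The sketch below assumes a vision transformer with a patch size of 14 pixels (as in the ViT-L/14 encoder mentioned earlier); the helper name is hypothetical.

```python
def num_patches(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


tokens_224 = num_patches(224)
tokens_336 = num_patches(336)
# Self-attention cost grows roughly with the square of the token count.
attention_ratio = (tokens_336 ** 2) / (tokens_224 ** 2)
print(tokens_224, tokens_336, attention_ratio)
```

Moving from 224x224 to 336x336 raises the token count from 256 to 576 and the pairwise attention work by about five times, which is why higher resolutions are adopted cautiously.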

### Data quality and bias

Multimodal training data, often scraped from the web, contains noise, errors, and biases. Image-text pairs from alt-text can be loosely correlated or misleading. Training datasets may underrepresent certain demographics, cultures, or languages, leading to models that perform unevenly across populations. Ensuring data quality at the scale required for modern multimodal models (hundreds of millions to billions of examples) remains a significant challenge.

### Fine-grained visual understanding

Current multimodal models, particularly those built on CLIP-style vision encoders, often struggle with fine-grained visual understanding. They may fail to accurately count objects, understand spatial relationships, read small text in images, or distinguish between visually similar items. CLIP's training objective focuses on matching images to overall descriptions rather than understanding detailed spatial composition, which limits downstream models that rely on CLIP features.

## References

1. Radford, A., Kim, J.W., Hallacy, C., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." *Proceedings of the 38th International Conference on Machine Learning (ICML)*. arXiv:2103.00020.

2. Jia, C., Yang, Y., Xia, Y., et al. (2021). "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision." *Proceedings of the 38th International Conference on Machine Learning (ICML)*. arXiv:2102.05918.

3. Alayrac, J.-B., Donahue, J., Luc, P., et al. (2022). "Flamingo: a Visual Language Model for Few-Shot Learning." *Advances in Neural Information Processing Systems 35 (NeurIPS)*. arXiv:2204.14198.

4. Liu, H., Li, C., Wu, Q., and Lee, Y.J. (2023). "Visual Instruction Tuning." *Advances in Neural Information Processing Systems 36 (NeurIPS)*. arXiv:2304.08485.

5. Radford, A., Kim, J.W., Xu, T., et al. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision." *arXiv preprint*. arXiv:2212.04356.

6. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. arXiv:2112.10752.

7. Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L. (2023). "Sigmoid Loss for Language Image Pre-Training." *Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)*. arXiv:2303.15343.

8. Gemini Team, Google DeepMind. (2023). "Gemini: A Family of Highly Capable Multimodal Models." *arXiv preprint*. arXiv:2312.11805.

9. Borsos, Z., Marinier, R., Vincent, D., et al. (2023). "AudioLM: a Language Modeling Approach to Audio Generation." *IEEE/ACM Transactions on Audio, Speech, and Language Processing*.

10. Bertasius, G., Wang, H., and Torresani, L. (2021). "Is Space-Time Attention All You Need for Video Understanding?" *Proceedings of the 38th International Conference on Machine Learning (ICML)*. arXiv:2102.05095.

11. Liang, V.W., Zhang, Y., Kwon, Y., Yeung, S., and Zou, J.Y. (2022). "Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning." *Advances in Neural Information Processing Systems 35 (NeurIPS)*.

12. Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). "Show and Tell: A Neural Image Caption Generator." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.

13. Radford, A., et al. (2021). "CLIP: Connecting text and images." OpenAI. Reports that zero-shot CLIP matches the ImageNet accuracy of ResNet-50 (76.2% top-1) without using any of ImageNet's 1.28 million labeled training examples. https://openai.com/index/clip/

14. OpenAI. (2024). "Hello GPT-4o." Reports GPT-4o trained end-to-end across text, vision, and audio and responding to audio in as little as 232 ms (320 ms average). https://openai.com/index/hello-gpt-4o/