ImageBind

Updated Jul 16, 2026

ImageBind is a multimodal model from Meta AI (its Fundamental AI Research lab) that learns a single joint embedding space across six different modalities: images and video, text, audio, depth, thermal (infrared) data, and inertial measurement unit (IMU) signals. It was announced on 9 May 2023, alongside open-source code and model weights, and the accompanying paper, "ImageBind: One Embedding Space To Bind Them All" by Rohit Girdhar and colleagues, was published as a highlight at CVPR 2023 ^[1]^[2]^[3].

The central idea is that you do not need training data in which all six modalities appear together. Instead, each modality is aligned only to images, and that shared anchor is enough for the rest of the modalities to line up with one another. Meta calls this an "emergent" alignment: pairs the model never saw together, such as audio and text, end up close in the embedding space because each was independently bound to images ^[2]^[3]. This lets a single model perform zero-shot tasks like retrieving an image from a sound, and even combine embeddings arithmetically across modalities.

The binding idea

Most prior work on joint embeddings needed paired examples for each combination of modalities it wanted to relate. Collecting such data for every pair (audio with depth, thermal with text, and so on) does not scale. ImageBind sidesteps this by treating images as a hub. Images naturally co-occur with many other signals: a video frame comes with its soundtrack, a photo from an RGB-D camera comes with a depth map, a scene from a thermal camera comes with an infrared frame, and footage from a head-mounted camera comes with motion sensor readings. By binding each non-image modality to images, all of the modalities become mutually comparable through that common reference ^[2]^[3].

Training uses the standard InfoNCE contrastive objective. For each modality, the model is given naturally paired (image-or-video, modality) examples and learns to pull matching pairs together and push non-matching pairs apart in the embedding space ^[4]^[5]. Only image-paired data is used; no (audio, depth) or (text, thermal) pairs are ever shown, yet those relationships emerge. The strength of the emergent behavior scales with the quality of the image encoder: a stronger vision backbone produces better zero-shot results on the non-vision modalities ^[1]^[6].
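
The contrastive objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Meta's training code: the function names, the toy three-dimensional embeddings, and the temperature of 0.07 are assumptions made for the example. It computes InfoNCE in both directions (image to modality and modality to image) and sums the two terms, treating the other items in the batch as negatives.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss where queries[i] is paired with keys[i].

    Every other key in the batch acts as a negative for queries[i].
    The temperature value is an illustrative assumption.
    """
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # equals -log softmax(logits)[i]
    return total / len(q)

def symmetric_info_nce(image_emb, modality_emb, temperature=0.07):
    """Sum of the image-to-modality and modality-to-image InfoNCE terms."""
    return (info_nce(image_emb, modality_emb, temperature)
            + info_nce(modality_emb, image_emb, temperature))

# Toy batch: three image embeddings and three audio embeddings (hypothetical values).
image_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_audio = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# The same audio embeddings paired with the wrong images.
shuffled_audio = [aligned_audio[1], aligned_audio[2], aligned_audio[0]]

loss_aligned = symmetric_info_nce(image_emb, aligned_audio)
loss_shuffled = symmetric_info_nce(image_emb, shuffled_audio)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

The loss is low when each image sits closest to its own audio clip and high when the pairing is scrambled, which is the signal that pulls each non-image modality toward the image embeddings during training.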

Architecture and backbone

ImageBind reuses large-scale pretrained image-text models rather than training vision and language understanding from scratch. The image and text encoders are initialized from a pretrained CLIP-style model (OpenCLIP) and kept frozen; the image encoder is a ViT-H with roughly 630M parameters and the text encoder has on the order of 300M parameters. The encoders for the remaining modalities are trained to map their inputs into the same space as these frozen image features ^[4]^[5].

Each modality has its own encoder, and all of the encoders are Transformer-based. Depth and thermal data are treated as single-channel images and passed through a Vision Transformer. Audio is converted to a mel-spectrogram and encoded with a Vision Transformer over spectrogram patches. IMU data (a multi-channel time series of accelerometer and gyroscope readings) is split into fixed-length clips, projected with a 1D convolution, and encoded with a Transformer ^[4]^[5].
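
As a rough illustration of the IMU path described above, the sketch below splits a multi-channel time series into fixed-length clips and turns each clip into a short sequence of tokens with a strided 1D convolution. The helper names, the clip length of 8, the kernel size and stride of 4, and the hand-set weights are all hypothetical choices for the example; a real encoder would learn the convolution weights and pass the tokens to a Transformer.

```python
def split_into_clips(series, clip_len):
    """Split a [time][channel] series into non-overlapping clips.

    A ragged tail shorter than clip_len is dropped.
    """
    n_clips = len(series) // clip_len
    return [series[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]

def conv1d_tokens(clip, weights, bias, stride):
    """Project a [time][channel] clip to tokens with a strided 1D convolution.

    weights has shape [out_dim][kernel][channel]; each output token
    summarizes one window of `kernel` consecutive time steps.
    """
    kernel = len(weights[0])
    tokens = []
    for start in range(0, len(clip) - kernel + 1, stride):
        window = clip[start:start + kernel]
        token = []
        for w_out, b in zip(weights, bias):
            acc = b
            for t in range(kernel):
                for c in range(len(window[t])):
                    acc += w_out[t][c] * window[t][c]
            token.append(acc)
        tokens.append(token)
    return tokens

# Toy IMU stream: 20 time steps, 6 channels (3 accelerometer + 3 gyroscope axes).
series = [[float(t), 0.0, 0.0, 2.0 * t, 0.0, 0.0] for t in range(20)]

# Hand-set weights for illustration only: output 0 averages channel 0 over the
# window, output 1 averages channel 3. A trained model would learn these.
weights = [
    [[0.25 if c == 0 else 0.0 for c in range(6)] for _ in range(4)],
    [[0.25 if c == 3 else 0.0 for c in range(6)] for _ in range(4)],
]
bias = [0.0, 0.0]

clips = split_into_clips(series, clip_len=8)
tokens_per_clip = [conv1d_tokens(clip, weights, bias, stride=4) for clip in clips]
print(len(clips), "clips,", len(tokens_per_clip[0]), "tokens per clip")
print(tokens_per_clip[0])
```

Each clip becomes a fixed number of fixed-width tokens, which is the form a Transformer encoder expects regardless of the original sensor sampling layout.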

Modality	Encoder	Paired data source for binding
Image / video	ViT-H, frozen OpenCLIP weights (~630M params)	Web-scale image-text (with text encoder)
Text	Frozen OpenCLIP text encoder (~300M params)	Web-scale image-text
Audio	Vision Transformer over mel-spectrogram patches	AudioSet (video paired with its audio)
Depth	Vision Transformer (depth as 1-channel image)	SUN RGB-D (images paired with depth)
Thermal	Vision Transformer (thermal as 1-channel image)	LLVIP (visible images paired with infrared)
IMU	1D convolution then Transformer	Ego4D (egocentric video paired with IMU)

The naturally occurring pairings come from existing datasets: video and audio from AudioSet, image and depth from SUN RGB-D, image and thermal from LLVIP, and video and IMU from Ego4D, in addition to large web image-text collections for the vision-language side ^[4]^[5]. Because depth, thermal, and audio are recast into image-like or spectrogram representations, the same contrastive recipe applies uniformly across them.

Emergent zero-shot capabilities

Once trained, ImageBind supports several behaviors without any task-specific fine-tuning:

Cross-modal retrieval. Given a query in one modality, retrieve matching items in another. The model can fetch images from audio clips, retrieve audio for a text prompt, or find depth maps for an image, including pairs it never saw together during training ^[2]^[3].
Embedding-space arithmetic. Because all modalities share one space, their embeddings can be added to compose meaning. Combining an image of a bird with the sound of waves, for example, can retrieve images of birds near water ^[2]^[3].
Audio-driven generation. Feeding ImageBind audio embeddings into the decoder of a pretrained image generator (Meta demonstrated this with a DALLE-2-style decoder) produces images from sounds, such as rain or a forest ambience ^[2]^[3].
Cross-modal detection and few-shot recognition. The paper also shows the embeddings can upgrade detectors to new modalities and serve as strong few-shot recognizers ^[1]^[6].
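
Cross-modal retrieval and embedding arithmetic both reduce to simple vector operations once every modality lives in one space. The sketch below assumes the embeddings have already been computed; the three-dimensional vectors, the gallery names, and the helper functions are hypothetical stand-ins rather than real model outputs. Retrieval ranks gallery items by cosine similarity to the query, and composition adds two unit-normalized embeddings before retrieving.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def retrieve(query, gallery, top_k=1):
    """Return the names of the top_k gallery items most similar to the query."""
    ranked = sorted(gallery.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

def compose(a, b):
    """Add two unit-normalized embeddings and renormalize the sum."""
    na, nb = normalize(a), normalize(b)
    return normalize([x + y for x, y in zip(na, nb)])

# Hypothetical image embeddings; the axes loosely mean (bird, water, traffic).
image_gallery = {
    "bird_on_branch": [1.0, 0.0, 0.0],
    "bird_at_shore": [0.7, 0.7, 0.0],
    "empty_beach": [0.0, 1.0, 0.0],
    "city_street": [0.0, 0.0, 1.0],
}

# Hypothetical query embeddings from two different modalities.
waves_audio = [0.1, 0.9, 0.0]   # audio clip of waves
bird_image = [0.95, 0.05, 0.0]  # photo of a bird

print(retrieve(waves_audio, image_gallery))                       # audio -> image
print(retrieve(compose(bird_image, waves_audio), image_gallery))  # image + audio -> image
```

With these toy values the audio query alone lands on the beach image, while the combined bird-plus-waves query lands on the bird near water, mirroring the example in the list above.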

On standard benchmarks, ImageBind set a new state of the art for emergent zero-shot recognition across modalities, in several cases matching or surpassing specialist models trained with direct supervision on the target modality. The repository README for Meta's released checkpoint (imagebind_huge) reports the following zero-shot classification accuracies ^[3]^[6]:

Benchmark	Modality	Top-1 accuracy
ImageNet-1k	Image	77.7%
Kinetics-400	Video	50.0%
NYU Depth v2	Depth	54.0%
ESC-50	Audio	66.9%
LLVIP	Thermal	63.4%
Ego4D	IMU	25.0%

On audio specifically, ImageBind's emergent zero-shot classification and retrieval matched or beat prior specialist audio-text models on tasks built from datasets such as ESC, Clotho, and AudioCaps, despite never being trained on paired audio-text data ^[1]^[6].
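
Zero-shot classification in a joint embedding space is usually done by comparing a sample's embedding with text embeddings of the class names and taking a softmax over the similarities. The sketch below shows that procedure under stated assumptions: the embeddings are hand-written toy vectors, and the function name and temperature are illustrative rather than taken from the ImageBind code.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(sample_emb, class_text_embs, temperature=0.05):
    """Pick the class whose text embedding is most similar to the sample.

    class_text_embs maps a label to the embedding of its text prompt.
    Returns (best_label, probabilities). The temperature is an assumption.
    """
    s = normalize(sample_emb)
    labels = list(class_text_embs)
    logits = [
        sum(x * y for x, y in zip(s, normalize(class_text_embs[label]))) / temperature
        for label in labels
    ]
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = {label: e / z for label, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs

# Hypothetical text embeddings for three class prompts.
class_text_embs = {
    "dog barking": [1.0, 0.0, 0.0],
    "rain": [0.0, 1.0, 0.0],
    "siren": [0.0, 0.0, 1.0],
}

# Hypothetical embedding of an audio clip; no audio-text pairs are needed
# as long as both encoders map into the same space.
audio_emb = [0.2, 0.9, 0.1]

label, probs = zero_shot_classify(audio_emb, class_text_embs)
print(label)
print({name: round(p, 4) for name, p in probs.items()})
```

No classifier is trained for the target modality here; the class set can be changed simply by embedding a different list of text prompts.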

Release, license, and use in later work

Meta released the code and the imagebind_huge weights publicly on GitHub (facebookresearch/ImageBind), under the CC-BY-NC 4.0 license, which permits research use but not commercial use ^[3]. A project page with interactive demos was published at imagebind.metademolab.com ^[2].

Meta framed ImageBind partly as a foundation for generative and search systems. The announcement noted that the embeddings could let a text-to-image system such as Make-A-Scene generate images from audio (for example, the sounds of a market or a rainforest), and more broadly enable richer multimodal search and content understanding ^[2]. The work sits in a line of Meta research on aligning many modalities to a single representation and on extending pretrained vision-language models to senses beyond sight and language.

Relationship to other multimodal binders

ImageBind should not be confused with Meta-Transformer, a separate framework introduced in July 2023 by a different group (Yiyuan Zhang, Kaixiong Gong, and colleagues, arXiv:2307.10802). Meta-Transformer targets twelve modalities, including point clouds, hyperspectral and X-ray images, graphs, and tabular and time-series data. It uses a unified tokenizer and a modality-shared encoder with frozen parameters, is designed to work without paired multimodal training data, and reports results on several tasks where it outperforms ImageBind ^[7]^[8]. The naming is coincidental: "Meta-Transformer" refers to the meta (shared) encoder rather than to Meta the company, and it is not a Meta AI product. ImageBind's distinguishing contribution is the image-as-hub binding that produces emergent alignment from image-paired data alone.

References

1. Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., Misra, I. "ImageBind: One Embedding Space To Bind Them All." arXiv:2305.05665, May 2023. https://arxiv.org/abs/2305.05665
2. Meta AI. "ImageBind: Holistic AI learning across six modalities." Meta AI Blog, 9 May 2023. https://ai.meta.com/blog/imagebind-six-modalities-binding-ai/
3. facebookresearch/ImageBind, GitHub repository (README, checkpoints, license, benchmark table). https://github.com/facebookresearch/ImageBind
4. "ImageBind: One Embedding Space To Bind Them All." CREATIS MYRIAD paper review (encoder and dataset details). https://creatis-myriad.github.io/2024/03/20/ImageBind.html
5. "ImageBind MultiJoint Embedding Model from Meta Explained." Encord Blog. https://encord.com/blog/imagebind-embedding-model-explained/
6. Girdhar et al. "ImageBind: One Embedding Space To Bind Them All." CVPR 2023 Open Access. https://openaccess.thecvf.com/content/CVPR2023/html/Girdhar_ImageBind_One_Embedding_Space_To_Bind_Them_All_CVPR_2023_paper.html
7. Zhang, Y., Gong, K., et al. "Meta-Transformer: A Unified Framework for Multimodal Learning." arXiv:2307.10802, July 2023. https://arxiv.org/abs/2307.10802
8. "Meta-Transformer: A Unified Framework for Multimodal Learning." Encord Blog. https://encord.com/blog/meta-transformer-explained/
