Back in spring 2023, Reddit and Telegram communities were buzzing about Meta AI's technology. They introduced Segment Anything in April and ImageBind in May. Meta AI had already published meaningful technical papers such as LLaMA in February 2023, but ImageBind expanded the possibilities of multi-modality by binding more than two modalities. Consequently, it received a CVPR 2023 highlighted paper award.
Today, we'll briefly delve into ImageBind: One Embedding Space To Bind Them All. It not only succeeds in binding more than two modalities in a single embedding space, but also surpasses prior zero-shot capability on text-audio tasks.
Before we start, some background: in computer vision, methods like CMC (Contrastive Multiview Coding) consider various viewpoints of a scene and use diverse image types such as thermal and optical flow. Each view is passed through its own encoder, and the resulting feature vectors of the same class are mapped close together, while those of different classes are pushed far apart using the InfoNCE loss function.
To briefly explain the InfoNCE loss (we'll need it later): each query vector q is compared against the key vectors k produced by the other encoders. A high similarity score for a positive pair means the two vectors are mapped as close together as possible, while the negatives from other classes, which should stay far away, appear in the denominator. If q and its positive key k are very similar (or from the same class), the fraction inside the log approaches 1 and the loss approaches "-log 1 = 0". As the negatives become more similar to q, the denominator grows and the loss increases.
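For reference, here is a minimal sketch of the InfoNCE loss in its standard form, assuming a query q, a positive key k⁺, negative keys k⁻, and a temperature τ:

```latex
\mathcal{L}_{\text{InfoNCE}} = -\log
\frac{\exp(q^\top k^{+}/\tau)}
     {\exp(q^\top k^{+}/\tau) + \sum_{k^{-}} \exp(q^\top k^{-}/\tau)}
```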
Let's look at this timeline of multi-modality research. Starting with OpenAI, most researchers have now turned their attention to multi-modal tasks. ALIGN and CLIP showed impressive zero-shot performance using contrastive learning in a joint embedding space. Flamingo achieved SOTA few-shot benchmarks by processing arbitrarily interleaved images and text. Today's main topic, ImageBind, achieves SOTA results on text-audio zero-shot benchmarks while combining six modalities in one embedding space.
Whereas earlier computer-vision multi-modality relied on augmented image views as in CMC (explained above), CLIP extended this to text data by exploiting the "alt text" attached to web images. It builds a similarity matrix to pair matching features in the vector space. Following the contrastive learning recipe, the scores in the positive area on the diagonal are pushed as high as possible, while the negative area outside the diagonal is pushed as low as possible.
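As a rough illustration (not the authors' exact code), a CLIP-style symmetric contrastive loss over a batch can be sketched like this, with the positive pairs sitting on the diagonal of the similarity matrix:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a (batch, batch) similarity matrix.

    image_emb, text_emb: (batch, dim) embeddings from the two encoders;
    the i-th image and i-th text are assumed to be the positive pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Cosine-similarity matrix scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Positives lie on the diagonal: image i matches text i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```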
CLIP simply collects web-scale image-text pair data, yet it achieves scalable SOTA performance on ImageNet and strong zero-shot capability. Building on this CLIP recipe, ImageBind combines more modalities and improves on it.
This octagon figure mixes dotted and bold lines. Before ImageBind came along, it was infeasible to collect data across every modality: no dataset properly pairs all modalities for the same concept. ImageBind solves this by using image and video data (video being a sequence of images) as the binding medium.
Honestly, when we look at the source code above, it just looks like we are gathering three modalities, or feeding in three inputs. Even though the inputs look unrelated and randomly chosen, each modality is paired with "Image" or "Video." Previously in multi-modal work, we could not reuse an (Image, Text) pre-trained model for a (Video, Audio) task. Now, ImageBind can be used "out of the box" across tasks involving any modality it was trained with. Moreover, ImageBind shows zero-shot capability on Audio-Text datasets, a pairing it never saw directly during training.
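For context, the usage pattern looks roughly like the example in the official ImageBind repository's README (module paths and the example asset names may differ by version, so treat this as a sketch rather than the exact code):

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained ImageBind model (ViT-H based).
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Three inputs, one per modality (hypothetical file paths).
text_list = ["A dog.", "A car", "A bird"]
image_paths = [".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths = [".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Cross-modal similarity, e.g. which audio clip matches which text.
print(torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1))
```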
This figure looks like an expanded version of the CLIP paper, and indeed ImageBind is based on the CLIP architecture and uses the InfoNCE loss in the same manner. The (Image, Text) pairs come from a web-scale dataset of roughly 400 million pairs; (Video, Audio) pairs come from AudioSet (an ontology and human-labeled dataset for audio events); (Image, Depth) pairs from SUN RGB-D; (Image, Thermal) pairs from LLVIP; and (Video, IMU) pairs from Ego4D.
This table shows zero-shot capability compared with training from scratch (random initialization) and with text-paired multi-modal pre-trained models. The SOTA scores from supervised approaches are also listed for reference.
You might not find this table exciting, since ImageBind only looks better than random initialization or some text-paired baselines, and all results stay below the supervised SOTA scores. The positive takeaway, however, is that ImageBind gets close to ResNet-level performance despite being zero-shot, and on the depth benchmark it outperforms the text-based modality models. The audio scores look terrible, but note that AudioCLIP is not zero-shot there; it was trained with supervision.
Are you still bored?
Then let's look at one more interesting figure, which compares two models on two tasks. ImageBind performed worse on the audio task against supervised models, but the picture changes for few-shot learning: AudioMAE (left) is the latest SOTA model for audio, yet ImageBind gives superior results in the low-shot regime.
Since ViT-H scores the highest within ImageBind, one might conclude that the gains come simply from using ViT-H. Looking more closely, however, you can see a characteristic called emergent properties.
Emergent properties appear when, as a model's FLOPs grow, performance increases unpredictably. This was reported in OpenAI's 2020 paper "Scaling Laws for Neural Language Models," where OpenAI also stated that performance improves as the amount of data increases.
The following picture comes from the 2023 Stanford paper "Are Emergent Abilities of Large Language Models a Mirage?". Based on this, it can be seen that each modality in ImageBind also exhibits these unpredictable emergent properties, and that ImageBind has constructed a one-embedding space that effectively harnesses them.
The last interesting part is their embedding-space arithmetic examples: images can be composed or retrieved by adding embeddings from different modalities, and objects in an image can be located using audio or text prompts.
Embedding-space arithmetic indicates the ability to express semantic relations in a vector space. In NLP this started with embedding spaces where King - Man + Woman = Queen; likewise, GANs can create a smiling man from "Smiling Woman - Neutral Woman + Neutral Man" through interpretable vector math. The same idea carries over when we move from a single-modality embedding to a one-embedding space covering multiple modalities, which can map various semantic changes and correlations in that space.
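Here is a toy sketch of that arithmetic, assuming a hypothetical dictionary `vectors` of pre-trained, L2-normalized word embeddings:

```python
import numpy as np

def analogy(vectors, a, b, c, top_k=1):
    """Return the words closest to vec(a) - vec(b) + vec(c)."""
    query = vectors[a] - vectors[b] + vectors[c]
    query /= np.linalg.norm(query)

    scores = {
        word: float(vec @ query)
        for word, vec in vectors.items()
        if word not in {a, b, c}  # exclude the input words themselves
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# analogy(vectors, "king", "man", "woman")  ->  ["queen"], ideally
```

ImageBind applies the same principle across modalities, so an image embedding plus an audio embedding can retrieve images that combine both concepts.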
Their objective is to develop a model that unifies six modalities into a single space, based on a web-scale image dataset. Configuring multi-modal training typically requires complete data across all modalities, which is often infeasible due to various constraints and limitations. Instead, they use image-centric training data: only pairs consisting of images and the corresponding modality-specific data need to be prepared. The model is trained with the same methodology as CLIP, using the InfoNCE loss over a similarity matrix. To sum up, ImageBind creates a multi-modal embedding space that consolidates the various modalities into one shared space, and it achieves state-of-the-art (SOTA) results on audio-text zero-shot benchmarks.
If you enjoyed this, give it a clap or subscribe!! 😃😃