Who is this post for?
- Reader Audience [🟢⚪️⚪️]: AI beginners who are familiar with popular concepts, models, and their applications
- Level [🟢🟢⚪️]: Intermediate topic
- Complexity [🟢⚪️⚪️]: Easy to digest, no mathematical formulas or complex theory here
Foundational large language models (LLMs), pre-trained on huge datasets, are remarkably good at handling generic, multi-task work via prompts, whether through zero-shot, few-shot, or transfer learning.
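If you have never seen few-shot prompting in action, here is a minimal sketch using the Hugging Face transformers text-generation pipeline. GPT-2 is an assumption for illustration (any larger instruction-following LLM handles this far better); the prompt itself carries the "training" as a couple of labeled examples:

```python
# A minimal sketch of few-shot prompting; GPT-2 is a stand-in model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Two labeled examples, then the new input we want completed.
few_shot_prompt = (
    "Review: The battery died in an hour. Sentiment: negative\n"
    "Review: Gorgeous screen and fast delivery. Sentiment: positive\n"
    "Review: It broke on the second day. Sentiment:"
)

result = generator(few_shot_prompt, max_new_tokens=3, do_sample=False)
print(result[0]["generated_text"])
```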
- What if there were a way to extend the intelligence of these models by enabling them to accept different modalities of input, such as photos, audio, and video? In other words, make them multimodal!
- That could greatly improve how we search the web, and even how we understand the world around us, for example in real-world applications such as medicine and pathology.
- There is a solution! Multimodal deep learning models can combine the embeddings from different types of input, enabling, for example, an LLM to “see” what you are asking for and return relevant results, as the sketch below illustrates.
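To make that concrete, here is a minimal sketch using OpenAI's CLIP model via the Hugging Face transformers library (the model checkpoint and example image URL are illustrative choices; swap in any photo you like). CLIP encodes an image and several text captions into the same embedding space, so comparing the vectors tells us which caption the model "sees" in the image:

```python
# A minimal sketch of a multimodal model: CLIP maps both images and
# text into a shared embedding space, so their vectors are comparable.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image (a COCO photo of cats); point it at any image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
captions = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher scores mean the caption's embedding sits closer to the image's.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0]):
    print(f"{caption}: {p:.2f}")
```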
⚡️Stick around if you want to learn more about how this all works and play around with a working demo!
It starts with embeddings
One of the most powerful building blocks in training deep learning models is the creation of embedding vectors.
During training, the model encodes the different categories (for example, people, foods, and toys) it encounters into a numerical representation, a.k.a. an embedding, which is stored as a vector of numbers.
Embeddings are useful when we want to move from a sparse representation of a category (or class), for example a long string of text or an image, to something more compact that can be reused across other models too.
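As a toy illustration, here is a minimal sketch of an embedding table in PyTorch (the category names and the embedding size are arbitrary assumptions). Each category index maps to a dense, learnable vector, and those vectors are exactly what the model tunes during training:

```python
# A toy sketch of an embedding table: each category gets a dense,
# learnable vector instead of a sparse one-hot representation.
import torch
import torch.nn as nn

categories = ["people", "foods", "toys"]  # arbitrary example classes
embedding_dim = 4                         # tiny, for readability

# One row per category; real models use thousands of rows and
# hundreds of dimensions, all learned during training.
embedding = nn.Embedding(num_embeddings=len(categories), embedding_dim=embedding_dim)

# Look up the vector for "foods" (index 1).
foods_index = torch.tensor([categories.index("foods")])
print(embedding(foods_index))  # e.g. tensor([[ 0.12, -0.98, 0.44, 0.31]])
```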