Structural Embedding Alignment for Multimodal Large Language Model

![[Pasted image 20240930030451.png]]

Beautiful!

In this work, a research team uses probabilistic visual tokens together with a structured visual embedding table to process visual information in a more "complete" way, addressing the misalignment between visual and textual embeddings.


Architecture overview


Probabilistic Visual Tokens

Both images and text are fed into the MLLM, but they follow different tokenization strategies.

Given a pre-trained vision transformer (ViT) backbone $g_\theta$ with parameters $\theta$, the input image is split into patches, which are then transformed into a sequence of visual representations $\{r_i\}_{i=1}^{n}$.

For the textual input, let $\{t_i\}_{i=1}^{m}$ be the input sequence of textual tokens, which are further processed by an LLM $f_\phi$, parameterized by $\phi$. In an MLLM, both the visual tokens $\{r_i\}_{i=1}^{n}$ and the textual tokens $\{t_i\}_{i=1}^{m}$ should be transformed into the same form, so that the LLM can process all of them into an output sequence of textual tokens. We use $\lambda$ to denote the index of the image indicator token $t_\lambda$, and the multimodal input tokens become:

$$[t_1, \dots, t_{\lambda-1}, r_1, \dots, r_n, t_{\lambda+1}, \dots, t_m] \tag{1}$$
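A minimal PyTorch sketch of how the sequence in Equation 1 can be assembled; the function name, shapes, and the 0-indexed position of the image indicator token are illustrative assumptions, not from the paper:

```python
import torch

def splice_visual_tokens(text_embeds: torch.Tensor,
                         visual_tokens: torch.Tensor,
                         lam: int) -> torch.Tensor:
    """Assemble the multimodal input of Eq. (1): the single image indicator
    token at (0-indexed) position `lam` is replaced by the n visual tokens,
    while the surrounding textual tokens keep their order."""
    before = text_embeds[:lam]        # textual tokens before the image indicator
    after = text_embeds[lam + 1:]     # textual tokens after the image indicator
    return torch.cat([before, visual_tokens, after], dim=0)

# toy shapes: m = 8 textual tokens, n = 16 visual tokens, hidden size d = 32
seq = splice_visual_tokens(torch.randn(8, 32), torch.randn(16, 32), lam=3)
print(seq.shape)  # torch.Size([23, 32]) = (m - 1 + n, d)
```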

Instead of using the continuous visual tokens of Equation 1 directly, we align the internal tokenization strategies of images and texts to better exploit the potential of the MLLM.

To mimic the discrete textual tokens, we use a linear head $W \in \mathbb{R}^{K \times d}$ to transform the continuous visual tokens.

Assuming $K$ is the visual vocabulary size, i.e., the number of unique visual words, then given a visual token $r_i$, we first map $r_i$ onto the $(K-1)$-dimensional probability simplex $\Delta_K$ by a linear projection followed by a softmax normalization:

$$v_i = \operatorname{softmax}(W r_i), \qquad W \in \mathbb{R}^{K \times d}. \tag{2}$$

We treat $v_i \in \Delta_K$ as a probabilistic token: a probability distribution over the visual vocabulary of $K$ visual words. If $r_i$ is more related to certain visual patterns, the corresponding elements of $v_i$ should be larger.
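A minimal PyTorch sketch of Equation 2; the class name, hidden size $d$, and vocabulary size $K$ below are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

class VisualTokenizerHead(nn.Module):
    """Linear head + softmax of Eq. (2): maps each continuous visual token
    r_i in R^d to a probabilistic token v_i on the simplex Delta_K."""

    def __init__(self, d: int, K: int):
        super().__init__()
        self.W = nn.Linear(d, K, bias=False)   # W in R^{K x d}

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        # r: (n, d) visual representations from the ViT backbone g_theta
        # returns v: (n, K); each row is non-negative and sums to 1
        return torch.softmax(self.W(r), dim=-1)

head = VisualTokenizerHead(d=1024, K=8192)   # illustrative sizes
v = head(torch.randn(16, 1024))              # 16 probabilistic visual tokens
print(v.shape, float(v[0].sum()))            # torch.Size([16, 8192]), ~1.0
```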

Visual Embedding Table

In LLMs, it is common practice to employ a textual embedding table, which maps each word in the vocabulary to an embedding vector. For each textual token $t_i$ in one-hot form, its embedding $T_i \in \mathbb{R}^d$ is the row of the textual embedding table indicated by the non-zero index of $t_i$.

Analogously, we introduce an additional visual embedding table, where each visual word (each row) is associated with an embedding vector $e_k \in \mathbb{R}^d$, with $d$ being the embedding dimension. To make the embeddings of visual and textual tokens have compatible shapes, we simply set the dimension of the visual embedding table to be the same as that of the textual embedding table.

Accordingly, the embedding $V_i$ of each visual token can be derived from its probabilistic token $v_i$:

$$V_i = \sum_{k=1}^{K} v_{i,k}\, e_k \;\in\; \mathbb{R}^d,$$

where $v_{i,k}$ denotes the $k$-th component of $v_i$.

On the other hand, since $v_i \in \Delta_K$, the above formula can be rewritten as:

$$V_i = \mathbb{E}_{k \sim v_i}\left[e_k\right],$$

which is the expectation of a visual word's embedding, with the visual word drawn from $v_i$. In other words, the visual embedding can be seen as sampled from the discrete visual embedding table according to the probabilistic token $v_i$ of the patch.
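A minimal PyTorch sketch of the visual embedding table and the expected-embedding lookup above; again, the class name and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VisualEmbeddingTable(nn.Module):
    """Visual embedding table with K rows e_1..e_K in R^d (d matches the
    textual embedding table). A probabilistic token v_i is embedded as
    V_i = sum_k v_{i,k} e_k = E_{k ~ v_i}[e_k], i.e. a probability-weighted
    average of the table rows, computed as a single matrix product."""

    def __init__(self, K: int, d: int):
        super().__init__()
        self.table = nn.Embedding(K, d)          # table.weight has shape (K, d)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (n, K) probabilistic visual tokens; returns V: (n, d)
        return v @ self.table.weight

table = VisualEmbeddingTable(K=8192, d=4096)       # illustrative sizes
v = torch.softmax(torch.randn(16, 8192), dim=-1)   # 16 probabilistic tokens
V = table(v)
print(V.shape)  # torch.Size([16, 4096])
```

Note that when $v_i$ is one-hot, this matrix product reduces to an ordinary row lookup, which is exactly how the textual embedding table behaves; the probabilistic token is the soft generalization of that lookup.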

![[Pasted image 20240930030019.png|666]]

demo on HF: https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B

source: https://arxiv.org/pdf/2405.20797

#ai #LLM #transformers #MLLM #embeddings