Structural Embedding Alignment for Multimodal Large Language Model

![[Pasted image 20240930030451.png]]

Beautiful!

In this work, a research team uses probabilistic visual tokens together with a structured visual embedding table to process visual information in a more "complete" way, addressing the misalignment between visual and textual embeddings.


Architecture overview


Probabilistic Visual Tokens

Both images and text are fed into the MLLM, but they follow different tokenization strategies.

Given a pre-trained vision transformer (ViT) backbone $g_\theta$ with parameters $\theta$, the input image is split into patches, which are then transformed into a sequence of visual representations $\{r_i\}_{i=1}^{n}$.

For the textual input, let $\{t_i\}_{i=1}^{m}$ be the input sequence of textual tokens, which are further processed by an LLM $f_\phi$, parameterized by $\phi$. In an MLLM, both the visual tokens $\{r_i\}_{i=1}^{n}$ and the textual tokens $\{t_i\}_{i=1}^{m}$ should be transformed into the same form, so that the LLM can process all of them into an output sequence of textual tokens. We use $\lambda$ to denote the index of the image indicator token $t_\lambda$, and the multimodal input tokens become:

$$[t_1, \dots, t_{\lambda-1}, r_1, \dots, r_n, t_{\lambda+1}, \dots, t_m] \tag{1}$$
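A minimal PyTorch sketch of how the sequence in Equation 1 can be assembled; the function name, shapes, and the 0-indexed position of the image indicator token are illustrative assumptions, not from the paper:

```python
import torch

def splice_visual_tokens(text_embeds: torch.Tensor,
                         visual_tokens: torch.Tensor,
                         lam: int) -> torch.Tensor:
    """Assemble the multimodal input of Eq. (1): the single image indicator
    token at (0-indexed) position `lam` is replaced by the n visual tokens,
    while the surrounding textual tokens keep their order."""
    before = text_embeds[:lam]        # textual tokens before the image indicator
    after = text_embeds[lam + 1:]     # textual tokens after the image indicator
    return torch.cat([before, visual_tokens, after], dim=0)

# toy shapes: m = 8 textual tokens, n = 16 visual tokens, hidden size d = 32
seq = splice_visual_tokens(torch.randn(8, 32), torch.randn(16, 32), lam=3)
print(seq.shape)  # torch.Size([23, 32]) = (m - 1 + n, d)
```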

Instead of using the continuous visual tokens of Equation 1 directly, we align the internal tokenization strategies of images and texts to better exploit the potential of the MLLM.

To mimic the discrete textual tokens, we use a linear head $W \in \mathbb{R}^{K \times d}$ to transform the continuous visual tokens.

Assuming $K$ is the visual vocabulary size, i.e., the number of unique visual words, then given a visual token $r_i$, we first map $r_i$ onto the $(K-1)$-dimensional probability simplex $\Delta_K$ by a linear projection followed by a softmax normalization:

$$v_i = \operatorname{softmax}(W r_i), \qquad W \in \mathbb{R}^{K \times d}. \tag{2}$$

We treat $v_i \in \Delta_K$ as a probabilistic token: a probability distribution over the visual vocabulary of $K$ visual words. If $r_i$ is more related to certain visual patterns, the corresponding elements of $v_i$ should be larger.
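A minimal PyTorch sketch of Equation 2; the class name, hidden size $d$, and vocabulary size $K$ below are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

class VisualTokenizerHead(nn.Module):
    """Linear head + softmax of Eq. (2): maps each continuous visual token
    r_i in R^d to a probabilistic token v_i on the simplex Delta_K."""

    def __init__(self, d: int, K: int):
        super().__init__()
        self.W = nn.Linear(d, K, bias=False)   # W in R^{K x d}

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        # r: (n, d) visual representations from the ViT backbone g_theta
        # returns v: (n, K); each row is non-negative and sums to 1
        return torch.softmax(self.W(r), dim=-1)

head = VisualTokenizerHead(d=1024, K=8192)   # illustrative sizes
v = head(torch.randn(16, 1024))              # 16 probabilistic visual tokens
print(v.shape, float(v[0].sum()))            # torch.Size([16, 8192]), ~1.0
```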

Visual Embedding Table

In LLMs, it is common practice to employ a textual embedding table, which maps each word in the vocabulary to an embedding vector. For each textual token $t_i$ in one-hot form, its embedding $T_i \in \mathbb{R}^d$ is the row of the textual embedding table indicated by the non-zero index of $t_i$.

Analogously, we introduce an additional visual embedding table, where each visual word (each row) is associated with an embedding vector $e_k \in \mathbb{R}^d$, with $d$ being the embedding dimension. To make the embeddings of visual and textual tokens have compatible shapes, we simply set the dimension of the visual embedding table to be the same as that of the textual embedding table.

Accordingly, the embedding $V_i$ of each visual token can be derived from its probabilistic token $v_i$:

$$V_i = \sum_{k=1}^{K} v_{i,k}\, e_k \;\in\; \mathbb{R}^d,$$

where $v_{i,k}$ denotes the $k$-th component of $v_i$.

On the other hand, since $v_i \in \Delta_K$, the above formula can be rewritten as:

$$V_i = \mathbb{E}_{k \sim v_i}\left[e_k\right],$$

which is the expectation of a visual word's embedding, with the visual word drawn from $v_i$. In other words, the visual embedding can be seen as sampled from the discrete visual embedding table according to the probabilistic token $v_i$ of the patch.
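A minimal PyTorch sketch of the visual embedding table and the expected-embedding lookup above; again, the class name and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VisualEmbeddingTable(nn.Module):
    """Visual embedding table with K rows e_1..e_K in R^d (d matches the
    textual embedding table). A probabilistic token v_i is embedded as
    V_i = sum_k v_{i,k} e_k = E_{k ~ v_i}[e_k], i.e. a probability-weighted
    average of the table rows, computed as a single matrix product."""

    def __init__(self, K: int, d: int):
        super().__init__()
        self.table = nn.Embedding(K, d)          # table.weight has shape (K, d)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (n, K) probabilistic visual tokens; returns V: (n, d)
        return v @ self.table.weight

table = VisualEmbeddingTable(K=8192, d=4096)       # illustrative sizes
v = torch.softmax(torch.randn(16, 8192), dim=-1)   # 16 probabilistic tokens
V = table(v)
print(V.shape)  # torch.Size([16, 4096])
```

Note that when $v_i$ is one-hot, this matrix product reduces to an ordinary row lookup, which is exactly how the textual embedding table behaves; the probabilistic token is the soft generalization of that lookup.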

![[Pasted image 20240930030019.png|666]]

demo on HF: https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B

source: https://arxiv.org/pdf/2405.20797

#ai #LLM #transformers #MLLM #embeddings