# Matryoshka Multimodal Models

Published as a conference paper at ICLR 2025

Mu Cai¹, Jianwei Yang², Jianfeng Gao², Yong Jae Lee¹
¹University of Wisconsin-Madison  ²Microsoft Research, Redmond
https://matryoshka-mm.github.io/

Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed, large number of visual tokens and then feed them into a Large Language Model (LLM). However, this design produces an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While token pruning and merging methods exist, they produce a single-length output for each image and cannot offer flexibility in trading off information density vs. efficiency. Inspired by the concept of Matryoshka dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) one can explicitly control the visual granularity per test instance during inference, e.g., adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M3 provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) our approach provides a foundation to explore the best trade-off between performance and visual token length at the sample level, where our investigation reveals that a large gap exists between the oracle upper bound and current fixed-scale representations.
## 1 INTRODUCTION

Large Multimodal Models (LMMs) (OpenAI, 2023a; Liu et al., 2023a; Zhu et al., 2024; Liu et al., 2024b;a; Wang et al., 2023; Bai et al., 2023) have shown strong performance in visual-linguistic understanding and reasoning. Models such as LLaVA (Liu et al., 2023a; 2024a;b) first embed the input image with a fixed number of visual tokens, and then feed them as prefix tokens to a Large Language Model (LLM) (Vicuna, 2023; Meta, 2024) to reason about the input image. Similar designs are adopted in video LMMs (Lin et al., 2023b; Zhang et al., 2023a), where each frame contributes a fixed number of tokens to form the final video representation. In reality, the number of visual tokens can be prohibitively large for high-resolution images, and even more so for long videos. Existing works (Lin et al., 2023b; Liu et al., 2024b; Zhang et al., 2024b; Team, 2024) mainly tackle this issue by increasing the input context length and, consequently, feeding a large number (e.g., 3k-8k) of visual tokens into the LLM. This approach has two significant drawbacks: (1) the extremely long context makes both training and inference inefficient; (2) an excessive number of visual tokens can actually harm the LMM's performance, distracting it from attending to the relevant information, as we show in Sec. 4.3. Several recent works (Bolya et al., 2023; Chen et al., 2024; Shang et al., 2024) use heuristics to prune and merge visual tokens to reduce the sequence length. However, they produce a single-length output and do not afford control over the final sequence length, which could be useful for trading information density against efficiency under deployment-time resource constraints.
Images and videos naturally exhibit a hierarchical structure from coarse to fine details, and the human visual system has evolved to recognize visual information in this coarse-to-fine manner, as biologists and psychologists showed decades ago (Harris & Giachritsis, 2000; Hegdé, 2008). Can we create a similar structure for LMMs, where, within one suite of model weights, the visual content tokens are organized into different scales of granularity? Conceptually, our goal is to learn the visual tokens to have a nested structure, similar to Matryoshka dolls (Kusupati et al., 2022). Matryoshka Representation Learning (MRL) (Kusupati et al., 2022) builds the Matryoshka mechanism over a neural network's representation vector, where each segment with a given feature dimension is capable of handling tasks like classification or retrieval. However, for LMMs, the inefficiency mainly comes from the number of tokens.

[Figure 1: Matryoshka Multimodal Models. We enforce the coarser set of visual tokens $X_{S_{i-1}}$ to be derived from the finer level of visual tokens $X_{S_i}$. As a result, the granularity of Matryoshka visual tokens changes gradually in a controllable manner. The image is from the MSCOCO (Lin et al., 2014) validation set, and the captions are generated given 1, 9, and 576 tokens, respectively. For the prompt "Describe this image for me.", the generated captions range from "In the heart of a bustling restaurant, a young girl finds solace at a table" (1 token), to a version adding the wooden table and the girl's attention to the camera (9 tokens), to a detailed description mentioning the blue and white striped sweater, white paper bag, and blue Pepsi cup (576 tokens).]
Thus, inspired by, but different from, MRL, our work builds Matryoshka Multimodal Models over the token length dimension, so that the number of visual tokens can be flexibly adjusted.

[Figure 2: MMBench evaluation results for M3, the oracle under LLaVA-1.5-M3, LLaVA-1.5 with average pooling at inference time, LLaVA-1.5 trained separately for each specific scale, and other methods. M3 shows at least as good performance as LLaVA trained for each specific scale. A large gap exists between the oracle upper bound and the model's actual performance at a specific scale.]

Specifically, we propose M3: Matryoshka Multimodal Models, which enforces an LMM to learn a hierarchy of visual representation granularities at the token sequence level, instead of the feature dimension level as in MRL (Kusupati et al., 2022). With this representation, at inference time, the visual granularity can be flexibly controlled based on specific requirements, e.g., to account for the input image's information density and efficiency constraints. Our training process is simple and straightforward. During training, we encode the image into $M$ sets of visual tokens from coarse to fine, $X_{S_i}, i = 1, \dots, M$, where the number of visual tokens gradually increases, i.e., $|X_{S_{i-1}}| < |X_{S_i}|$. Importantly, the visual tokens at a coarser level are derived from the visual tokens at a finer level, i.e., $X_{S_{i-1}} \subset X_{S_i}, \forall i$. In this way, the visual information in $[X_{S_1}, X_{S_2}, \dots, X_{S_M}]$ gradually includes more fine-grained details. For example, given a natural image as shown in Figure 1, $X_{S_1}$ includes high-level semantics such as the restaurant and girl, while $X_{S_M}$ includes more details such as the Pepsi cup and white paper bag. All other training settings, such as the loss function and model architecture, are kept the same as in LLaVA (Liu et al., 2023a; 2024a;b). Our approach, M3, introduces several novel properties and benefits for LMMs.
First, our approach can efficiently represent visual content, letting users flexibly choose the visual token scale at inference time. Under one suite of weights, it generates multiple nested sets of visual tokens with different granularities of information density. This enables flexibility in the number of visual tokens used for any image during inference, allowing control over the best trade-off between cost and performance based on the image or video content. For example, one can use all visual tokens for images with dense details and just a few tokens for simpler images. This flexibility is particularly significant when handling very long visual sequences, such as videos. For instance, given a fixed budget of 2880 visual tokens, a user could represent a video of 2880 frames with one token each, or represent the same video by sampling 5 frames with 576 tokens each.

Second, our method can be used as a general framework to evaluate the visual complexity of vision-language benchmarks, i.e., which level of granularity is needed to perform a given task correctly. Surprisingly, we find that most benchmarks, especially those mainly crafted from natural scenes such as COCO (Goyal et al., 2017; Li et al., 2023c; Liu et al., 2023b), can be handled well with only 9 tokens per image. In contrast, dense visual perception tasks such as document understanding or OCR (Singh et al., 2019; Masry et al., 2022) require a greater number of tokens (144-576) per image to handle the task well. The detailed findings are presented in Sec. 4.2.

Finally, our approach provides a foundation for tackling a critical task in LMMs: how to use the fewest visual tokens while still answering visual questions correctly. Based on the model's predictions on the test set, we find that, compared to using all visual tokens, the oracle can use far fewer tokens while performing much better.
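To make the video budget arithmetic concrete: a fixed token budget trades temporal coverage against per-frame detail, one Matryoshka scale per frame. A minimal sketch, assuming the scale set {1, 9, 36, 144, 576} used throughout the paper; the helper name is ours:

```python
def frame_token_options(budget=2880, scales=(1, 9, 36, 144, 576)):
    """Enumerate (tokens_per_frame, num_frames) allocations that fit a
    fixed visual-token budget, assuming every sampled frame uses the
    same Matryoshka scale."""
    return [(s, budget // s) for s in scales if s <= budget]

# For budget=2880 this recovers both extremes from the text:
# 2880 frames at 1 token each, or 5 frames at 576 tokens each,
# plus the intermediate allocations (320 frames x 9, 80 x 36, 20 x 144).
options = frame_token_options(2880)
```

This simple enumeration illustrates why nested scales matter for video: without them, the per-frame token count is fixed and the budget determines the frame count alone.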
For example, on the six common LMM benchmarks used in LLaVA-NeXT (Liu et al., 2024b), the oracle with the trained M3 model can use as few as 8.9 visual tokens on average to achieve performance that is 8 percentage points better than LLaVA-NeXT, which uses 576 tokens per image grid. This indicates that there is large room for improvement relative to the oracle upper bound, as we show in Sec. 4.2. To enable further research on adaptive LMMs that learn diverse information granularities, we publicly release our code and models.

## 2 RELATED WORK

**Large Multimodal Models.** Large Language Models (LLMs) like ChatGPT (OpenAI, 2023b), GPT-4 (OpenAI, 2023c), and LLaMA (Touvron et al., 2023) have demonstrated impressive reasoning and generalization capabilities for text. The LLM landscape has been significantly transformed by models that also incorporate visual information, e.g., GPT-4V (OpenAI, 2023a). Building upon open-source LLMs (Touvron et al., 2023; Vicuna, 2023), a plethora of multimodal models have made significant strides, spearheaded by models like LLaVA (Liu et al., 2023a; 2024a) and MiniGPT-4 (Zhu et al., 2024), which combine LLaMA's (Touvron et al., 2023) language capabilities with a CLIP (Radford et al., 2021) image encoder. Recently, LMMs covering more tasks and modalities have emerged, such as region-level LMMs (Cai et al., 2024; Zhang et al., 2023c; Chen et al., 2023; Peng et al., 2023; Zhang et al., 2023b), 3D LMMs (Hong et al., 2023), and video LMMs (Lin et al., 2023b; Zhang et al., 2023a; 2024b). However, existing LMMs typically represent the visual content with a large, fixed number of tokens, making it challenging to scale to very long visual sequences such as high-resolution images or long-form videos. In this work, we propose to efficiently represent the visual content by learning multiple nested sets of visual tokens, providing flexibility in the number of visual tokens used for any image during inference.

**Matryoshka Representation Learning.**
Matryoshka Representation Learning (MRL) (Kusupati et al., 2022) addresses the need for flexible representations that can adapt to multiple downstream tasks with varying computational resources. This approach, inspired by the nested nature of Matryoshka dolls, encodes information at different granularities within the same high-dimensional feature vector produced by a neural network. The adaptability of MRL extends across different modalities, including vision (ResNet (He et al., 2016), ViT (Dosovitskiy et al., 2021)), vision + language (ALIGN (Jia et al., 2021)), and language (BERT (Devlin et al., 2018)), demonstrating its versatility and efficiency. Recent work (Li et al., 2024) extends MRL to both the text embedding space and the Transformer layer space. Our approach is inspired by MRL, but instead of learning multiple nested embeddings within a high-dimensional feature vector, we learn nested visual tokens along the token length dimension for the visual input. We are the first to show that the idea of Matryoshka learning can enable explicit control over the visual granularity of the visual content that an LMM processes.

**Token Reduction.** One of the main causes of inefficiency in recent LMMs is the large number of prefix visual tokens fed into the LLM (Liu et al., 2023a; Zhu et al., 2024). The quadratic complexity of Transformers (Vaswani et al., 2017) is the key obstacle to scaling the input sequence length. Token reduction serves as an effective technique to reduce computational costs in Transformers. Sparse attention methods such as Linformer (Wang et al., 2020) and Reformer (Kitaev et al., 2020) conduct attention operations within local windows rather than the full context, thereby reducing the quadratic complexity of the vanilla attention operation. Another notable method is Token Merging (ToMe) (Bolya et al., 2023), which utilizes full attention but gradually reduces the number of tokens in each transformer block by selecting the most representative tokens through bipartite matching for the Vision Transformer (ViT). A recent work (Haurum et al., 2023) further studies different families of token reduction methods for ViTs. However, prior approaches produce a single-length output per input image and do not offer multiple granularities over the reduced token sequence. Our M3 approach instead learns a multi-granularity, coarse-to-fine token representation within the same model architecture and weights, enabling it to be easily adjusted to various computational or memory constraints. A concurrent work (Hu et al., 2024) shares a similar spirit with our approach, representing an image with varying numbers of visual tokens using a single set of model weights. While their method reformats the visual tokens into a sequential list via transformation layers, we use average pooling to preserve the spatial structure of the visual tokens, demonstrating effectiveness in our experiments.

## 3 M3: MATRYOSHKA MULTIMODAL MODELS

Our goal is to learn a Large Multimodal Model (LMM) that represents visual content as nested sets of visual tokens capturing information across multiple coarse-to-fine granularities, so that one can explicitly control the visual granularity per test instance during inference. Here we introduce how we learn a Matryoshka-doll-like token sequence.

[Figure 3: Architecture of our proposed Matryoshka Multimodal Models. The visual features from CLIP are represented as several groups of coarse-to-fine visual tokens; users select a granularity, and the chosen token set is fed with the text prompt (e.g., "Describe the scene for me.") into the Large Language Model, which generates the response. At test time, users can explicitly control the granularity of the visual features.]
LMMs such as LLaVA (Liu et al., 2023a) typically input a sequence of visual tokens as prefix tokens to the LLM for visual-linguistic reasoning. The visual encoder from a pretrained vision-language model, such as CLIP (Radford et al., 2021) or SigLIP (Zhai et al., 2023), is typically utilized to project the image into the set of visual tokens. In particular, the CLIP visual encoder represents an input image $I$ as an $H \times W$ grid of visual tokens $X_{H \times W}$, where each $X_i \in \mathbb{R}^C$ is a $C$-dimensional feature vector. Our goal is to learn nested sets of visual tokens $[X_{S_1}, X_{S_2}, \dots, X_{S_M}]$ which encode the visual information in a coarse-to-fine manner. To this end, we enforce $X_{S_i} \subset X_{S_{i+1}}, \forall i$. Importantly, we do not introduce any new learnable parameters into the LMM. We instead optimize the CLIP visual encoder to learn the nested visual representation directly, and train the ensuing LLM to adapt to the learned nested sets of tokens. For ease of exposition, we consider CLIP-ViT-L-336 (Radford et al., 2021) as the visual encoder, where an image is encoded as $24 \times 24$ visual tokens (576 total). We create $M$ sets of tokens, e.g., $|S_i| \in \{1, 9, 36, 144, 576\}$, in which the visual tokens at the coarser level are derived directly from those at the finer level. Specifically, given the initial $24 \times 24$ visual tokens, we sequentially apply $2 \times 2$ average pooling with stride 2, resulting in $12 \times 12$, $6 \times 6$, and $3 \times 3$ visual tokens. Finally, we apply $3 \times 3$ pooling to obtain the most condensed single visual token. In this way, the sets of Matryoshka visual tokens gradually preserve the spatial information of the original tokens while forming a coarse-to-fine nested representation.

We train M3 by averaging the autoregressive next-token prediction loss over each scale $S_i$ for each image $I_i$.
Specifically, given a Matryoshka visual representation $X_{S_i}$ for scale $S_i$, we maximize the likelihood of the predicted tokens matching the ground-truth answer $X_a$:

$$P(X_a \mid X_{S_i}, X_q) = \prod_{j=1}^{L} P_\theta(x_j \mid X_{S_i}, X_q, X_{a,<j}),$$

where $X_q$ denotes the question tokens, $L$ is the length of the answer $X_a$, and $X_{a,<j}$ denotes the answer tokens preceding the current prediction token $x_j$.
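The pooling hierarchy and the averaged training objective described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the actual model applies the same pooling to CLIP feature maps inside the LMM and computes the loss with the LLM, and all function names here are ours:

```python
import numpy as np

def pool2x2(x):
    """Average-pool an (H, W, C) token grid with a 2x2 kernel, stride 2."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def matryoshka_token_sets(tokens, grid=24):
    """Build nested coarse-to-fine token sets from a (grid*grid, C) sequence.

    For grid=24 this yields sets of 1, 9, 36, 144, and 576 tokens; each
    coarser set is derived purely by pooling the finer set, so the
    hierarchy is nested by construction."""
    x = tokens.reshape(grid, grid, -1)
    sets = [tokens]                          # finest scale: 24x24 = 576 tokens
    for _ in range(3):                       # 2x2 pooling: 24 -> 12 -> 6 -> 3
        x = pool2x2(x)
        sets.append(x.reshape(-1, x.shape[-1]))
    sets.append(x.mean(axis=(0, 1))[None])   # final 3x3 pooling: 3x3 -> 1 token
    return sets[::-1]                        # coarse-to-fine order

def m3_loss(per_scale_losses):
    """M3 objective: average the next-token prediction loss over the M scales."""
    return sum(per_scale_losses) / len(per_scale_losses)
```

Note that because every pooling step averages disjoint equal-sized blocks, the single coarsest token equals the global mean of all 576 original tokens, so the coarse scales summarize rather than discard the fine ones.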