Published in Transactions on Machine Learning Research (01/2024)

Prismer: A Vision-Language Model with Multi-Task Experts

Shikun Liu 1,2   Linxi Fan 2   Edward Johns 1   Zhiding Yu 2   Chaowei Xiao 2,3   Anima Anandkumar 2,4
1 Imperial College London   2 NVIDIA   3 University of Wisconsin, Madison   4 Caltech

Reviewed on OpenReview: https://openreview.net/forum?id=R7H43YD6Lo

Recent vision-language models have shown impressive multi-modal generation capabilities. However, they typically require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of task-specific experts. Prismer only requires training of a small number of components, with the majority of network weights inherited from multiple readily-available, pre-trained experts and kept frozen during training. By leveraging experts from a wide range of domains, we show that Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance competitive with the current state of the art, whilst requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer.

1 Introduction

Large pre-trained models have demonstrated exceptional generalisation capabilities across a wide range of tasks. However, these capabilities come at a hefty cost in terms of the computational resources required for training and inference, as well as the need for large amounts of training data. In the language domain, models with hundreds of billions of learnable parameters typically require a compute budget on the yotta-FLOP scale (Chowdhery et al., 2022; Brown et al., 2020; Black et al., 2022; Rae et al., 2021). The problems in vision-language learning are arguably more challenging.
This domain is a strict superset of language processing, whilst also requiring extra skills unique to visual and multi-modal reasoning. For example, many image captioning and visual question answering problems require the model to be capable of fine-grained object recognition, detection, counting, and 3D perception (Antol et al., 2015; Chen et al., 2015). A typical solution is to use a massive amount of image-text data to train one giant, monolithic model that learns to develop these task-specific skills from scratch, simultaneously, and within the same generic architecture. Instead, we investigate an alternative approach to learning these skills and domain knowledge via distinct and separate sub-networks, referred to as experts. As such, each expert can be optimised independently for a specific task, allowing for the use of domain-specific data and architectures that would not be feasible with a single large network. This leads to improved training efficiency, as the model can focus on integrating specialised skills and domain knowledge rather than trying to learn everything at once, making it an effective way to scale down multi-modal learning.

To achieve this, we propose Prismer¹, a visually conditioned autoregressive text generation model, trained to make better use of diverse pre-trained task experts for open-ended vision-language reasoning tasks. Prismer's key design elements include: i) powerful vision-only and language-only models with web-scale knowledge to construct our core network backbones, and ii) multi-task vision experts encoding multiple types of visual information,

Corresponding author: shikun.liu17@imperial.ac.uk. Work done during an internship at NVIDIA.
¹The model name Prismer draws an analogy to an optical prism, which breaks white light into a spectrum of colours; here, we break down a single reasoning task into diverse domain-specific reasoning.
including low-level vision signals such as depth, and high-level vision signals such as instance and semantic labels, as a form of auxiliary knowledge, directly from their corresponding network outputs. All expert models are individually pre-trained and frozen, and are connected through lightweight trainable components which contribute roughly 20% of the total network parameters.

Despite Prismer being trained on only 13M publicly available image/alt-text data examples, it shows strong multi-modal reasoning performance in tasks such as image captioning, image classification, and visual question answering, competitive with many state-of-the-art vision-language models (Alayrac et al., 2022; Wang et al., 2022a; 2021) that were trained with one or two orders of magnitude more data. Finally, we conduct an in-depth analysis of Prismer's learning behaviours and observe some encouraging properties. For example, i) Prismer exhibits strong robustness against the inclusion of noisy experts, and ii) its learning performance scales favourably with increases in both the quantity and quality of experts.

Figure 1: Prismer model overview. Prismer is a data-efficient vision-language model that leverages diverse pre-trained experts through its predicted multi-task signals. It can perform vision-language reasoning tasks such as image captioning and visual question answering. The analogy is with an optical prism: Prismer splits a single reasoning task into diverse domain-specific reasoning.
2 Related Work

Vision-Language Models (VLMs)

Inspired by the breakthrough of transformers in the language domain (Vaswani et al., 2017; Devlin et al., 2019), early works aimed to model the vision-language relationship using a shared transformer-based network in a single-stream design (Li et al., 2020a; Chen et al., 2020; Li et al., 2020b; Su et al., 2020). These works usually leverage a pre-trained object detector, encoding images as sequences of visual words parameterised by object- or region-level features. Prismer takes a slightly different approach by using pre-trained models to provide their output predictions as auxiliary signals, whilst still relying on the original images to encode visual features. Another line of work encodes vision and language features in separate networks in a dual-stream design, where the vision-only and language-only features are aligned through contrastive learning (Radford et al., 2021; Zhai et al., 2022; Jia et al., 2021; Li et al., 2021). These works typically focus on close-ended multi-modal alignment tasks such as image-text classification and retrieval. In contrast, Prismer's vision encoder also aligns its vision features with the language embedding through pre-training with contrastive learning, but with a greater emphasis on multi-modal generation tasks.

In recent years, both single- and dual-stream VLMs have often been pre-trained with a combination of multiple objectives, such as masked language modelling, masked region modelling, word-region alignment, visual grounding, and more (Li et al., 2020a; Cho et al., 2021; Li et al., 2022; 2021; Lu et al., 2019). These multiple objectives can make the training process more complex and require careful balancing of the different losses.
Prismer adopts a different approach, aligning with recent developments in VLMs that focus on language generation and only require a single autoregressive training objective (Wang et al., 2022a; 2021; Hu et al., 2022). Despite the reduced complexity, training these large-scale VLMs is data-intensive and computationally demanding, often requiring billions of training examples. To overcome these challenges, Prismer leverages powerful pre-trained task-specific expert models for data-efficient training. Unlike another set of works that prioritise in-context capability by conditioning on a large frozen language model with no task-specific fine-tuning (Eichenberg et al., 2021; Tsimpoukelli et al., 2021; Alayrac et al., 2022), Prismer focuses on fine-tuned performance with an emphasis on parameter efficiency, using smaller but diverse pre-trained experts.

Multi-task and Auxiliary Learning

Multi-task learning and auxiliary learning aim to train models to predict multiple outputs (such as semantic segmentation, object detection, and depth estimation) from a single input, thereby improving performance across one or multiple tasks. This is often achieved through the design of effective multi-task networks that balance task-shared and task-specific features (Liu et al., 2019b; Misra et al., 2016; Sun et al., 2020; Xu et al., 2018), or through the explicit modelling of task relationships (Liu et al., 2019a; 2022; Navon et al., 2021; Zamir et al., 2018; Fifty et al., 2021). Recently, multi-task learning has been further generalised to unify vision-only, language-only, and vision-language tasks by considering them within a sequence-to-sequence framework (Wang et al., 2022b; Lu et al., 2022; Zhu et al., 2022). Prismer also employs multiple tasks, specifically in the vision domain, similar to these methods, but uniquely uses them solely as input, serving as auxiliary knowledge.
Prismer is more closely related to works such as Bachmann et al. (2022) and Ghiasi et al. (2021), which utilise pre-trained experts to create pseudo-labels for multi-task self-training. However, whilst those methods focus on learning task-agnostic features through multi-task supervision, Prismer focuses purely on multi-modal reasoning with a single-task objective.

Unifying Pre-trained Experts

The utilisation of diverse pre-trained domain experts for multi-modal reasoning has been investigated in previous studies. Socratic Models (Zeng et al., 2022) use language as a one-way communication interface to connect different pre-trained experts. ViperGPT (Surís et al., 2023) and Visual Programming (Gupta & Kembhavi, 2023) harness the in-context learning capabilities of large language models, breaking down complex multi-modal reasoning into modular programs, which are then solved sequentially by leveraging pre-trained vision experts through APIs. These methods excel at modular problem decomposition and at establishing connections among pre-trained experts, but they are thereby limited to zero-shot multi-modal reasoning within the domains on which the experts were pre-trained, and errors made by earlier experts can be carried forward to later experts. Prismer, in contrast, has a distinct objective: to better bridge these pre-trained experts through a unified architecture design. As such, Prismer aims to create a more seamless collaboration between these experts, ultimately optimising multi-modal reasoning in a more integrated manner that is more robust to non-optimal experts.

Finally, we would like to highlight the distinction between the concept of experts as defined in Mixture of Experts (MoE) (Riquelme et al., 2021; Nguyen & Chamroukhi, 2018; Masoudnia & Ebrahimpour, 2014) and in Prismer.
In MoE, the experts are sub-modules in a single network, interconnected through their corresponding gating networks, encoding implicit knowledge guided by a shared training objective. In Prismer, on the other hand, the experts are independently pre-trained models, encoding explicit knowledge based on their pre-trained tasks or domains.

3 Prismer: Open-ended Reasoning with Multi-Task Knowledge

In this section, we introduce the Prismer model, a vision-language generative model that takes multi-task signals as input and outputs free-form text.

Figure 2: Prismer architecture design overview. Prismer has two main trainable components: the Experts Resampler, which converts variable multi-task signals to a fixed number of outputs, and the Adaptor, which enhances the model's expressivity for vision-language reasoning. To ensure that the model takes advantage of the rich domain-specific knowledge encoded in the pre-trained experts, the majority of network weights are frozen during training.

3.1 Model Overview

The design of the Prismer model is illustrated in Fig. 2. Prismer is an encoder-decoder transformer model (Vaswani et al., 2017) that leverages a library of existing pre-trained experts. It consists of a vision encoder and an autoregressive language decoder. The vision encoder takes an RGB image and its multi-task labels as input (e.g. depth, surface normal, and segmentation labels, predicted by the frozen pre-trained experts), and outputs a sequence of RGB and multi-task features.
The language decoder is then conditioned on these multi-task features via cross attention, and produces a sequence of text tokens.

Prismer is designed to leverage pre-trained experts whilst keeping the number of trainable parameters to a minimum. To do this, the network weights of the pre-trained experts are frozen to maintain the integrity of their learned knowledge and prevent catastrophic forgetting (Kemker et al., 2018; Kirkpatrick et al., 2017). To link the multi-task knowledge as well as the vision and language parts of Prismer, we insert two parameter-efficient trainable components: the Experts Resampler and the Adaptor. The Experts Resampler is used in the vision encoder to map a variable length of multi-task signals to a sequence of multi-task features with a fixed length. The Adaptors are inserted into each transformer layer of the vision and language parts of the model to better adapt the pre-trained experts to new tasks and modalities.

Prismer is a generative model, and we re-formulate all vision-language reasoning tasks as a language modelling or prefix language modelling problem. For example, given the input image along with its multi-task tokens (predicted by the multi-task experts) and a question as the prefix, the model generates the answer for the visual question answering task; given the input image along with its multi-task tokens, the model generates a caption for the image captioning task. Once we have a prefix prompt, we may either sample the output text in an autoregressive manner, as in an open-ended setting, or rank the log-likelihoods of a fixed set of completions, as in a closed-ended setting.

3.2 Pre-trained Experts

In Prismer, we include two types of pre-trained experts:

Backbone Experts

The vision-only and language-only pre-trained models, which are responsible for encoding images and texts into a meaningful sequence of tokens.
Both models are required to be based on the transformer architecture (Vaswani et al., 2017), so that we can easily connect them with a few trainable components of similar design. To preserve the rich domain-specific knowledge encoded in their network parameters, the majority of the weights are frozen during pre-training.

Task Experts

The models that produce multiple task-specific labels, depending on their training datasets, are treated as black-box predictors. These task experts can be designed either as a single multi-task expert or as an ensemble of multiple task-specific experts, and their predicted labels are utilised as input to the Prismer model. All network weights of the task experts are therefore frozen, and the experts may have any design. In Prismer, we incorporate up to 6 task-specific experts, all within the vision domain. These experts encode three low-level vision signals (depth, surface normals, and edges) and three high-level vision signals (object labels, segmentation labels, and OCR labels). Our selection of these 6 vision experts is based on tasks commonly studied in the multi-task learning community (Zamir et al., 2018; Standley et al., 2020; Liu et al., 2022), which have demonstrated varying levels of benefit in learning generalised visual representations. Additionally, these expert models are relatively lightweight, incurring minimal additional training and inference costs with simple model parallelism. We apply task-specific post-processing on the predicted labels, transforming them into an $\mathbb{R}^{H \times W \times C}$ tensor (where H, W, C denote image height, width, and number of channels respectively; e.g. C = 1 for depth and edge labels, and C = 3 for the surface-normal label). For all expert labels encoding high-level signals, we tile each pixel with its corresponding text embedding from a pre-trained CLIP text model (Radford et al., 2021), and then apply PCA to down-sample the dimensionality to C = 64 for efficient training.
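The tiling-and-PCA step for high-level labels can be made concrete with a minimal numpy sketch. The function name and shapes here are illustrative (not from the paper's code), and in practice the PCA basis would likely be estimated once over the dataset rather than per image:

```python
import numpy as np

def compress_semantic_labels(label_map, text_embeds, c_out=64):
    """Post-process a high-level expert label into an (H, W, c_out) tensor.

    Hypothetical shapes for illustration:
      label_map:   (H, W) integer class ids, e.g. from a segmentation expert
      text_embeds: (num_classes, D) pre-computed CLIP text embeddings
    Each pixel is tiled with its class text embedding, then PCA reduces the
    embedding dimension to c_out (C = 64 in the paper).
    """
    h, w = label_map.shape
    tiled = text_embeds[label_map.reshape(-1)]              # (H*W, D)
    centred = tiled - tiled.mean(axis=0, keepdims=True)
    # PCA via SVD of the centred data matrix; rows of vt are principal axes.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    reduced = centred @ vt[:c_out].T                        # keep top c_out axes
    return reduced.reshape(h, w, c_out)
```

The output has the same spatial resolution as the label map but a fixed, small channel dimension, matching the $\mathbb{R}^{H \times W \times C}$ format used for the low-level signals.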
Detailed descriptions of all task experts, including their pre-training datasets and architecture designs, are listed in Appendix A.

3.3 Key Architectural Components

Task-Specific Convolutional Stem

All expert labels are first processed with randomly initialised convolutional layers that map them to the same dimensionality. Specifically, we apply 5 convolutional layers, each composed of a small 3×3 kernel, which has been shown to perform better than the single convolutional layer with a larger kernel used in the original Vision Transformer design (Dosovitskiy et al., 2020), consistent with the findings of Xiao et al. (2021). The convolutional stem is designed to be task-specific, which we have found to yield superior performance compared to a shared design in a multi-task learning setting (Liu et al., 2019b; Misra et al., 2016). For high-level semantic labels such as those from object detection, semantic segmentation, and OCR detection, we down-sample the resolution by a factor of 4 to conserve running memory. Furthermore, for each object instance, we add a trainable, randomly sampled embedding to distinguish between different object instances. The size of this instance embedding is set to 128, corresponding to the maximum number of object instances assumed to be present in a single image. RGB images are simply processed with the pre-trained convolutional stem defined by the original vision backbone. All task expert embeddings, including RGB, are then added with a pre-trained positional embedding before being further processed by the transformer layers.

Experts Resampler

The computational complexity of self-attention grows quadratically with the number of input tokens. As such, the vision encoder can easily require a tremendous amount of memory when a large number of task experts is included.
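To make the memory issue concrete, the remedy described next compresses a variable number of expert tokens into a fixed-size set by cross-attending from learned latent queries. Below is a minimal single-head numpy sketch of that mechanism, with hypothetical names and none of the multi-head or feed-forward machinery of the real module:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def latent_resample(features, latents, w_q, w_k, w_v):
    """Compress variable-length expert tokens into a fixed-size set.

    features: (N, D) flattened multi-task expert tokens; N varies with
              the number of experts included.
    latents:  (M, D) learned latent queries; M is fixed.
    Keys and values are computed over the concatenation [features; latents],
    in the Flamingo style; the output is always (M, D), independent of N.
    """
    kv = np.concatenate([features, latents], axis=0)   # (N + M, D)
    q, k, v = latents @ w_q, kv @ w_k, kv @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (M, N + M)
    return attn @ v                                    # (M, D)
```

Because only the M latent queries attend over the expert tokens, downstream attention cost depends on M, not on how many experts are plugged in.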
To address this, we propose the Experts Resampler, which takes a variable number of expert labels as input and outputs a fixed number of tokens, as illustrated in Fig. 3 Left. This design yields a constant memory cost for the self-attention in the vision encoder, as well as for the vision-text cross attention in the language decoder (shown in Fig. 2), independent of the number of experts included. Inspired by the design of the Perceiver (Jaegle et al., 2021) and the Flamingo model (Alayrac et al., 2022), the Experts Resampler learns a pre-defined number of latent input queries which cross-attend to the flattened multi-task features. The resampler then compresses the multi-task features into a much smaller number of tokens, equal to the number of learned latent queries, as a form of auxiliary knowledge distillation. We design the keys and values to be a concatenation of both the multi-task features and the learned latent queries, which is shown to be more effective, consistent with the design of the Flamingo model (Alayrac et al., 2022).

Figure 3: Design details of the Experts Resampler and Adaptor. Left: The Experts Resampler takes multi-task features of variable length as input, and outputs a fixed number of tokens via cross attention. Right: The Adaptor has a residual connection to the input and two fully-connected layers that down-project the input features to a smaller bottleneck dimension and then up-project back to the original dimension.

Lightweight Adaptor

We insert one lightweight adaptor into each transformer layer of both the vision and language backbones in order to improve Prismer's expressivity and conditioning on multi-task features, as illustrated in Fig. 3 Right.
The adaptor has an encoder-decoder design, which has proven successful for efficient transfer learning in the NLP domain (Houlsby et al., 2019; Pfeiffer et al., 2020). It first down-projects the input features into a smaller dimension, applies a non-linearity, and then up-projects the features back to the original input dimension. We choose the non-linearity to be squared ReLU (So et al., 2021), a simple and parameter-free function that delivers strong training stability. With the residual connection, we initialise all adaptors with near-zero weights so that they approximate the identity function. Combined with a standard cross attention block in the language decoder, the model is able to smoothly transition from the domain-specific vision-only and language-only backbones to a vision-language model during pre-training with paired image-text data. The model performance, memory usage, and time complexity of other design choices are systematically evaluated and ablated in Appendix B.

3.4 Training Objective

For simplicity, we train Prismer with a single objective: to predict the next text token autoregressively. Following the standard encoder-decoder architecture, the vision encoder predicts the multi-task features z, and the language decoder learns to maximise the conditional likelihood of the paired text caption y under the forward autoregressive factorisation:

$\mathcal{L} = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, z).$
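This objective can be written out numerically. The following numpy sketch assumes teacher-forced decoder logits that already condition on $y_{<t}$ and the multi-task features $z$ (function names are illustrative, not from the paper's code):

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def caption_log_likelihood(logits, targets):
    """Compute L = sum_t log p(y_t | y_<t, z) from decoder outputs.

    logits:  (T, V) next-token logits; row t is the decoder's output after
             seeing the prefix y_<t (and the multi-task features z).
    targets: (T,) ground-truth token ids y_1..y_T.
    """
    logp = log_softmax(logits)
    # Pick out the log-probability assigned to each ground-truth token.
    return logp[np.arange(len(targets)), targets].sum()
```

Training maximises this quantity (equivalently, minimises its negative as the cross-entropy loss) with respect to the trainable Resampler and Adaptor parameters, whilst the expert weights stay frozen.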