Published as a conference paper at ICLR 2025

MULTIMODAL QUANTITATIVE LANGUAGE FOR GENERATIVE RECOMMENDATION

Jianyang Zhai1,2, Zi-Feng Mai1,3, Chang-Dong Wang1,3, Feidiao Yang2, Xiawu Zheng2,4, Hui Li4, Yonghong Tian2,5
1Sun Yat-sen University, 2Pengcheng Laboratory, 3Guangdong Key Laboratory of Big Data Analysis and Processing, 4Xiamen University, 5Peking University
{zhaijy01, yangfd}@pcl.ac.cn, changdongwang@hotmail.com

ABSTRACT

Generative recommendation has emerged as a promising paradigm that aims to directly generate the identifiers of target candidates. Most existing methods attempt to leverage prior knowledge embedded in Pre-trained Language Models (PLMs) to improve recommendation performance. However, they often fail to accommodate the differences between the general linguistic knowledge of PLMs and the specific needs of recommendation systems. Moreover, they rarely consider the complementary knowledge between the multimodal information of items, which represents the multi-faceted preferences of users. To facilitate efficient recommendation knowledge transfer, we propose a novel approach called Multimodal Quantitative Language for Generative Recommendation (MQL4GRec). Our key idea is to transform items from different domains and modalities into a unified language, which can serve as a bridge for transferring recommendation knowledge. Specifically, we first introduce quantitative translators to convert the text and image content of items from various domains into a new and concise language, known as quantitative language, with all items sharing the same vocabulary. Then, we design a series of quantitative language generation tasks to enrich quantitative language with semantic information and prior knowledge. Finally, we achieve the transfer of recommendation knowledge from different domains and modalities to the recommendation task through pre-training and fine-tuning.
We evaluate the effectiveness of MQL4GRec through extensive experiments and comparisons with existing methods, achieving improvements over the baseline of 11.18%, 14.82%, and 7.95% on the NDCG metric across three different datasets, respectively.1

1 INTRODUCTION

Figure 1: Illustration of our MQL4GRec. We translate items from different domains and modalities into a new unified language, which can then serve as a bridge for transferring recommendation knowledge.

Recommendation systems (RS) aim to recommend items to users that they may be interested in, and are widely used on many online platforms, such as e-commerce and social networking (Chaves et al., 2022; Covington et al., 2016). For a long time, recommendation models that represent users and items using their unique IDs (known as IDRec) have been dominant in the field of RS (Kang & McAuley, 2018a; Sun et al., 2019; Zhang et al., 2024). However, IDRec may encounter cold-start and knowledge-transferability issues due to its inherent properties. To address these limitations, some literature (Hou et al., 2022; Sun et al., 2023) employs modal encoders (Devlin et al., 2018; He et al., 2016) to learn universal representations of items or sequences. While promising, these modal encoders are typically not specifically designed for recommendation tasks, resulting in suboptimal performance. Recently, generative recommendation has emerged as a promising paradigm, which employs an end-to-end generative model to directly predict identifiers of target candidates (Geng et al., 2022; Rajput et al., 2023).

*Corresponding authors.
1Our implementation is available at: https://github.com/zhaijianyang/MQL4GRec.
Due to the success of PLMs in natural language generation (NLG) (Raffel et al., 2020a; Brown et al., 2020; Touvron et al., 2023), most existing methods attempt to leverage the prior knowledge of PLMs to improve recommendation performance (Bao et al., 2023; Zhang et al., 2023; Zheng et al., 2023). They formalize the recommendation task as a sequence-to-sequence generation process, where the input sequence contains data of items the user has interacted with, and the output sequence represents the identifiers of target items. They then enable PLMs to perform recommendation tasks by adding instructions or prompts. Despite achieving decent performance, these methods suffer from the following limitations: 1) there are significant task differences between PLMs and RS, which may lead to inconsistencies between the general linguistic knowledge of PLMs and the specific requirements of RS; 2) they often overlook the complementary knowledge between the multimodal information of items, which is crucial for capturing the multi-faceted preferences of users.

To address these limitations, it is crucial to bridge the gaps between different domains and modalities, leveraging their recommendation knowledge to enhance performance in the target domains. Inspired by significant advancements in NLG, such as pretraining-finetuning (Devlin et al., 2018; Raffel et al., 2020b) and prompt-tuning (Brown et al., 2020; Touvron et al., 2023), we propose transforming items from various domains and modalities into a new and unified language. A key factor contributing to these significant advances is the use of a shared vocabulary, whose tokens are endowed with rich semantic information and prior knowledge across various tasks, which can then be effectively transferred to downstream tasks. Thus, we aspire for this new language to encompass a vocabulary in which tokens can represent items from various domains and modalities, as depicted in Figure 1.
Specifically, this language not only serves as a bridge for knowledge transfer but also as item identifiers, and should be more concise than the original modalities (text and image) to avoid issues in generation (Hua et al., 2023). To this end, we propose a novel approach, Multimodal Quantitative Language for Generative Recommendation (MQL4GRec). Specifically, we first introduce quantitative translators to convert the content of items (text and images) into the quantitative language. We train a separate quantitative translator for each modality of the item, each consisting of a modal encoder and a vector quantizer. Together, the codebooks of the two quantitative translators constitute the vocabulary. Then, we design a series of quantitative language generation tasks aiming at endowing quantitative language with rich semantic information and prior knowledge; these tasks can be viewed as microcosms of NLG tasks. Specifically, we additionally incorporate some special tokens as task prompts. Finally, we transfer the source-domain and multimodal recommendation knowledge to the recommendation tasks through pre-training and fine-tuning.

To evaluate the effectiveness of our proposed MQL4GRec, we conduct extensive experiments and comparisons with existing methods. Relative to the baseline, we observe improvements of 11.18%, 14.82%, and 7.95% on the NDCG metric across three datasets, respectively. In summary, our proposed MQL4GRec achieves the transfer of recommendation knowledge by breaking down barriers between items across different domains and modalities, demonstrating strong scalability and potential. Our main contributions can be summarized as follows:

- We propose MQL4GRec, a novel approach that translates items from various domains and modalities into a unified quantitative language, thereby breaking down the barriers between them and facilitating the transfer of recommendation knowledge.
- We design a series of quantitative language generation tasks that endow quantitative language with rich semantic information and prior knowledge, and enhance the performance of recommendation tasks through pre-training and fine-tuning.
- We conduct extensive experiments and analyses on three public datasets, and the results validate the effectiveness of our proposed method.

2 RELATED WORKS

Generative Recommendation. Generative models are among the hottest research topics in machine learning, with representative works such as Variational Autoencoders (VAEs) (Kingma & Welling, 2014), Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), and diffusion models (Ho et al., 2020). Generally, generative models aim to learn the distribution of the training data P(x) and generate new samples z ∼ P(x). These generative models have also been applied to recommendation, resulting in many remarkable works of VAE-based (Cai & Cai, 2022; Shenbin et al., 2020), GAN-based (He et al., 2018; Guo et al., 2022; Wang et al., 2022) and diffusion-based (Jiang et al., 2024; Wang et al., 2023c) recommendation. Recently, Transformer-based PLMs such as LLaMA (Touvron et al., 2023) and GPT (Brown et al., 2020) have also shown promising capabilities in language generation. With the help of such powerful generative PLMs, some PLM-based recommendation methods have also been proposed. Some early works, such as P5 (Geng et al., 2022) and M6-Rec (Cui et al., 2022), attempt to transform recommendation into a language generation task by designing prompts to bridge the gap between the downstream task and the pre-training task of PLMs. Other works focus on leveraging the prior knowledge in PLMs for recommendation via various tuning techniques such as parameter-efficient fine-tuning (PEFT) (Bao et al., 2023) and instruction tuning (Zhang et al., 2023).
One of the most important tasks in PLM-based recommendation is how to assign a unique sequence of tokens to each item as its ID. Early works (Geng et al., 2022; Cui et al., 2022) directly use the original name of the item or randomly assign an integer to each item, which has weak transferability and is sometimes unintelligible to PLMs. SEATER (Si et al., 2023) constructs tree-structured item IDs from a pretrained SASRec (Kang & McAuley, 2018b) model. P5-ID (Hua et al., 2023) investigates the effect of different item IDs on recommendation. ColaRec (Wang et al., 2024) captures the collaborative signals between items to construct generative item IDs. Notably, TIGER (Rajput et al., 2023) is the first attempt to use RQ-VAE to construct item IDs by quantizing item embeddings.

Multi-modal Recommendation. Multi-modal side information of items, such as descriptive text and images, has been shown to be effective in improving recommendations by providing richer contexts for interactions. Early works such as VBPR (He & McAuley, 2016) extract visual features via matrix factorization to achieve more personalized ranking. Several works (Wei et al., 2019; Sun et al., 2020; Wei et al., 2020) leverage various types of graph neural networks (GNNs) to fuse multimodal features. For example, LATTICE (Zhang et al., 2021) designs a modality-aware learning layer to learn item-item structures for each modality and aggregates them to obtain latent item graphs. DualGNN (Wang et al., 2023b) proposes a multi-modal representation learning module to model user attention across modalities and inductively learn user preferences. MVGAE (Yi & Chen, 2022) uses a modality-specific variational graph autoencoder to fuse the modality-specific node embeddings.
Recently, with the profound development of foundation models in different modalities (Radford et al., 2021; Brown et al., 2020; Raffel et al., 2020b), some recent works attempt to leverage pretrained foundation models as feature encoders to encode the multi-modal side information. Following P5 (Geng et al., 2022), VIP5 (Geng et al., 2023b) extends it into a multi-modal version which encodes item images with a pretrained CLIP image encoder. MMGRec (Liu et al., 2024) utilizes a Graph RQ-VAE to construct item IDs from both multi-modal and collaborative information. Moreover, IISAN (Fu et al., 2024) proposes a simple plug-and-play architecture using a decoupled PEFT structure and exploiting both intra- and inter-modal adaptation.

3 METHODOLOGY

In this section, we elaborate on the proposed MQL4GRec, a novel approach for transferring recommendation knowledge across different domains and modalities. We first translate item content into a unified quantitative language, which bridges the gaps between different domains and modalities. Then, we design a series of quantitative language generation tasks, and achieve the transfer of recommendation knowledge through pre-training and fine-tuning. The overall framework of the method is illustrated in Figure 2.
Figure 2: The overall framework of MQL4GRec. We regard the quantizer as a translator, converting item content from different domains and modalities into a unified quantitative language, thus bridging the gap between them (left). Subsequently, we design a series of quantitative language generation tasks to facilitate the transfer of recommendation knowledge through pre-training and fine-tuning (right).

3.1 QUANTITATIVE LANGUAGE

The original modal content of items is complex, which can affect the efficiency and performance of recommendations (Hua et al., 2023). Therefore, we translate item content from various domains and modalities into a concise and unified quantitative language. In this subsection, we introduce a quantitative translator to accomplish the aforementioned conversion.

Quantitative Translator. Vector Quantization (VQ) is an information compression technique widely utilized across various domains (Van Den Oord et al., 2017; Zeghidour et al., 2021), which maps high-dimensional data onto a finite set of discrete vectors, known as the codebook.
In this paper, we treat the quantizer as a translator that converts complex item content into a concise quantitative language; here, the codebook serves as the vocabulary of the quantitative language. To obtain a unified quantitative language, we first employ a frozen modal encoder (LLaMA or ViT (Dosovitskiy et al., 2020)) to encode item content (text or image) and obtain the item representation. We then take the item representation as input and train a Residual-Quantized Variational AutoEncoder (RQ-VAE) (Zeghidour et al., 2021) to generate item tokens. RQ-VAE is a multi-level vector quantizer that applies quantization on residuals to generate a tuple of codewords (i.e., item tokens). As shown in Figure 2 (left), for an item representation $h$, RQ-VAE first encodes it into a latent representation $z$. At each level $l$, we have a codebook $C^l = \{v^l_k\}_{k=1}^{K}$, where each codebook vector is a learnable cluster center. The residual quantization process can be represented as:

$$c_i = \arg\min_k \|r_i - v^i_k\|_2^2, \quad (1)$$
$$r_{i+1} = r_i - v^i_{c_i}, \quad (2)$$

where $c_i$ is the codeword of the $i$-th level, $r_i$ is the residual vector of the $i$-th level, and $r_1 = z$. Assuming we have $L$-level codebooks, the quantized representation of $z$ can be obtained as $\hat{z} = \sum_{i=1}^{L} v^i_{c_i}$. Then $\hat{z}$ is used as the decoder input to reconstruct the item representation $h$. The loss function can be represented as:

$$\mathcal{L}_{recon} = \|h - \hat{h}\|_2^2, \quad (3)$$
$$\mathcal{L}_{rqvae} = \sum_{i=1}^{L} \left( \|\mathrm{sg}[r_i] - v^i_{c_i}\|_2^2 + \beta \|r_i - \mathrm{sg}[v^i_{c_i}]\|_2^2 \right), \quad (4)$$
$$\mathcal{L}(h) = \mathcal{L}_{recon} + \mathcal{L}_{rqvae}, \quad (5)$$

where $\hat{h}$ is the output of the decoder, $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator, and $\beta$ is a loss coefficient. The overall loss is divided into two parts: $\mathcal{L}_{recon}$ is the reconstruction loss, and $\mathcal{L}_{rqvae}$ is the RQ loss used to minimize the distance between codebook vectors and residual vectors.

Items typically encompass content from multiple modalities, representing various aspects of user preferences. In our setup, each item comprises two modalities: text and image.
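As a concrete illustration of Eqs. (1)-(2), the following minimal sketch performs L-level residual quantization with plain Python lists standing in for the learned codebooks (in the actual model the codebooks are trained jointly with the encoder/decoder; all names here are illustrative):

```python
def residual_quantize(z, codebooks):
    """L-level residual quantization (Eqs. 1-2): at each level, pick the
    codeword nearest to the current residual, then subtract it.
    z: latent vector (list of floats); codebooks: list of L levels,
    each a list of K codeword vectors.
    Returns (codes, z_hat) where codes are the item's token indices and
    z_hat = sum_i v^i_{c_i} is the quantized representation."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    r = list(z)                     # r_1 = z
    codes = []
    z_hat = [0.0] * len(z)
    for C in codebooks:
        c = min(range(len(C)), key=lambda k: sq_dist(r, C[k]))  # Eq. (1)
        codes.append(c)
        z_hat = [zh + v for zh, v in zip(z_hat, C[c])]
        r = [ri - v for ri, v in zip(r, C[c])]                  # Eq. (2)
    return codes, z_hat
```

For example, with two levels of two codewords each, `residual_quantize([0.9, 1.1], [[[0.0, 0.0], [1.0, 1.0]], [[0.25, 0.0], [0.0, 0.25]]])` first picks (1, 1) at level one, leaving residual (-0.1, 0.1), then picks (0, 0.25) at level two.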
We train a quantitative translator for each modality, then add prefixes to the codewords from each of the two codebooks to form a dictionary. Specifically, for the text quantitative translator, we prepend lowercase letter prefixes to the codewords to obtain $V_t$ = {a_1, b_2, ..., d_K}; for the image quantitative translator, we prepend uppercase letter prefixes to the codewords to obtain $V_v$ = {A_1, B_2, ..., D_K}. Here, a/A represents the first-level codebook, d/D represents the fourth-level codebook, and so on. The dictionary can then be represented as $V = \{V_t, V_v\}$. With each quantitative translator having $LK$ codewords, the size of our dictionary is $2LK$, enabling us to represent a total of $K^L$ items. Once the quantitative translators are trained, we can directly use them to translate new items into quantitative language. For example, for the item text "Sengoku Basara: The Last Party", after encoding it through the text encoder and RQ-VAE, we obtain a tuple of codewords (2, 3, 1, 6). Then, by prepending the lowercase letters to each number, we obtain the text quantitative language of the item as <a_2><b_3><c_1><d_6>. Similarly, for the item's image, we can obtain its image quantitative language as <A_1><B_4><C_2><D_6>.

Handling Collisions. Translating item content into quantitative language may lead to item collisions, where multiple items possess the same tokens. To address this issue, some methods (Rajput et al., 2023; Hua et al., 2023) append an additional identifier after the item indices, which may introduce semantically unrelated distributions. LC-Rec (Zheng et al., 2023) introduces a uniform distribution constraint to prevent multiple items from clustering in the same leaf node. However, this method does not completely resolve collisions, such as when items have the same modality information or when the number of collisions exceeds the size of the last-level codebook, which can lead to inflated performance metrics. (More discussion in Appendix E.1.)
To address the above issue, we reallocate tokens for colliding items based on the distance from the residual vector to the code vectors. Specifically, for $N$ colliding items, we first calculate the distances $D \in \mathbb{R}^{N \times L \times K}$ between the residual vectors and the code vectors at each level via $d^i_k = \|r_i - v^i_k\|_2^2$, and sort the distances to obtain the indices $I = \mathrm{argsort}(D, \mathrm{axis}{=}2) \in \mathbb{R}^{N \times L \times K}$. Then, we sort the colliding items by their minimum distance to the code vectors of the last level, i.e., $(\mathrm{item}_1, \mathrm{item}_2, \ldots, \mathrm{item}_N) = \mathrm{sort}_{\min(d^L)}(\text{colliding items})$. Finally, we reallocate tokens for the sorted colliding items based on $I$, following these principles: 1) Start from the last level, assigning the nearest token to each item; if a collision occurs, assign the next-nearest token. 2) If there are insufficient tokens at the last level, for the remaining colliding items, reallocate tokens from the second-to-last level based on distance, and then reallocate tokens from the last level. We repeat this process until all colliding items are handled.

3.2 QUANTITATIVE LANGUAGE GENERATION TASKS

In this subsection, we design several quantitative language generation tasks with the aim of imbuing quantitative language with more semantic information, thereby transferring prior knowledge to the target task, as illustrated in Figure 2 (right). Specifically, we additionally include some special tokens in the dictionary, which serve as prompts to differentiate the task types.

Next Item Generation. Since our primary goal is to predict the next item, the next-item generation task is our main optimization objective. Specifically, each item contains both text and image modalities, so we have two subtasks: 1) Next Text Item Generation; 2) Next Image Item Generation. In this context, the input sequence is the item token sequence from the user interaction history, and the output sequence is the target item's tokens in the corresponding modality.
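To make the token construction concrete, the sketch below turns codeword tuples into prefixed quantitative-language tokens (prefixing scheme as in Section 3.1) and assembles one next-item generation training pair; the helper names and the prompt-token spelling are illustrative, not taken from our released implementation:

```python
# Lowercase prefixes for the four text codebook levels, uppercase for image.
TEXT_PREFIXES = ["a", "b", "c", "d"]
IMAGE_PREFIXES = ["A", "B", "C", "D"]

def to_tokens(codewords, prefixes):
    """Turn a tuple of codeword indices into prefixed tokens,
    e.g. (2, 3, 1, 6) -> ['<a_2>', '<b_3>', '<c_1>', '<d_6>']."""
    return [f"<{p}_{c}>" for p, c in zip(prefixes, codewords)]

def next_item_example(history, target, prefixes, task_prompt):
    """Build one (source, target) pair for next-item generation: the source
    is the task-prompt token(s) followed by the tokens of the interaction
    history; the target is the tokens of the next item."""
    src = list(task_prompt)
    for item in history:
        src += to_tokens(item, prefixes)
    return src, to_tokens(target, prefixes)
```

For instance, a two-item text history followed by the target item (2, 3, 1, 6) yields the target sequence `<a_2><b_3><c_1><d_6>`, matching the running example above.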
Different modal sequences reflect different aspects of user preferences.

Asymmetric Item Generation. In the next-item generation task, the input and output are tokens of the same modality, and we refer to this task as symmetric. To facilitate the interaction of recommendation knowledge between the two modalities, we introduce asymmetric item generation tasks. Here, there are two subtasks: 1) Asymmetric Text Item Generation, where the input is the image tokens of the interaction-history items and the output is the text tokens of the target item; 2) Asymmetric Image Item Generation, where the input is the text tokens of the interaction-history items and the output is the image tokens of the target item. For example, the input sequence "<*_6><*_7><*_8>" can be described in human-understandable language as follows: "Based on the user's text interaction sequence, please predict the next item's image quantitative language: ".

Quantitative Language Alignment. Asymmetric item generation tasks enable the interaction of knowledge between the two modalities, but they fall under the category of implicit alignment. We further introduce explicit quantitative language alignment tasks to directly achieve alignment between the text and image quantitative languages of items. Here, we also have two subtasks: 1) Text-to-Image Alignment; 2) Image-to-Text Alignment. For example, the input sequence "<*_12><*_13><*_14>" can be described in human-understandable language as follows: "Please provide the image quantitative language for the following item: ".

3.3 TRAINING AND RECOMMENDATION

Training. Quantitative language can be viewed as a microcosm of natural language. We employ a two-stage paradigm of pre-training and fine-tuning to optimize the model, which is similar to NLG tasks.
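The six generation tasks of Section 3.2 can be assembled per training instance along the following lines. This is a hedged sketch: the prompt tokens `<P_0>`...`<P_5>` are placeholders for the special task-prompt tokens (the actual prompt-token ids are an implementation detail), and the function name is illustrative:

```python
def build_tasks(history_text, history_image, target_text, target_image):
    """Assemble (source, target) pairs for the quantitative language
    generation tasks of one training instance.
    history_*: list of per-item token lists; target_*: token list."""
    flat_t = [tok for item in history_text for tok in item]
    flat_v = [tok for item in history_image for tok in item]
    return {
        # symmetric next-item generation (same modality in and out)
        "next_text":  (["<P_0>"] + flat_t, target_text),
        "next_image": (["<P_1>"] + flat_v, target_image),
        # asymmetric generation (cross-modal history -> target)
        "asym_text":  (["<P_2>"] + flat_v, target_text),
        "asym_image": (["<P_3>"] + flat_t, target_image),
        # explicit alignment between the two quantitative languages
        "align_t2v":  (["<P_4>"] + target_text, target_image),
        "align_v2t":  (["<P_5>"] + target_image, target_text),
    }
```

Pre-training would draw only the two `next_*` tasks on source domains, while fine-tuning on the target domain mixes all six.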
For pre-training, we utilize the source-domain datasets, where the pre-training task consists of the two next-item generation subtasks. The purpose is to transfer recommendation knowledge from the source domains to the target domains. Fine-tuning is conducted on the target-domain dataset, with tasks encompassing all quantitative language generation tasks; the aim is to leverage recommendation knowledge from different modalities to explore users' multi-faceted preferences. The tasks mentioned above are conditional language generation tasks performed in a sequence-to-sequence manner. We optimize the negative log-likelihood of the generation target as follows:

$$\mathcal{L} = -\sum_{j=1}^{|Y|} \log P_\theta \left( Y_j \mid Y_{<j}, X \right),$$

where $X$ is the input sequence and $Y$ is the target token sequence.
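The objective above can be computed as in the following sketch, where `probs` is a stand-in callable for the trained model's conditional distribution (in practice this would be a Transformer decoder; the stand-in is an assumption for illustration):

```python
import math

def sequence_nll(probs, target):
    """Negative log-likelihood of a target token sequence under a model
    returning P_theta(Y_j | Y_<j, X). `probs(prefix, token)` is a stand-in
    for the model's next-token probability given the decoded prefix
    (the conditioning on the source sequence X is folded into `probs`)."""
    nll = 0.0
    for j, tok in enumerate(target):
        p = probs(target[:j], tok)   # P_theta(Y_j | Y_<j, X)
        nll -= math.log(p)
    return nll
```

For a toy model that assigns probability 0.5 to every token, a four-token target gives a loss of 4·ln 2.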