# In-Context Meta LoRA Generation

Yihua Shao1,2, Minxi Yan3, Yang Liu4, Siyu Chen3, Wenjie Chen3, Xinwei Long2, Ziyang Yan5, Lei Li6, Chenyu Zhang5, Nicu Sebe5, Hao Tang2, Yan Wang3, Hao Zhao3, Mengzhu Wang1 and Jingcai Guo1

1Department of COMP/LSGI, The Hong Kong Polytechnic University, Hong Kong SAR
2School of Computer Science, Peking University, China
3Institute for AI Industry Research (AIR), Tsinghua University, China
4Beijing Institute for General Artificial Intelligence, China
5Department of Information Engineering and Computer Science, University of Trento, Italy
6Department of Computer Science, University of Copenhagen, Denmark

yihuajerry@gmail.com, jc-jingcai.guo@polyu.edu.hk

Low-Rank Adaptation (LoRA) has demonstrated remarkable capabilities for task-specific fine-tuning. However, in scenarios involving multiple tasks, training a separate LoRA model for each task results in considerable inefficiency in terms of storage and inference. Moreover, existing parameter generation methods fail to capture the correlations among these tasks, making multi-task LoRA parameter generation challenging. To address these limitations, we propose In-Context Meta LoRA (ICM-LoRA), a novel approach that efficiently achieves task-specific customization of large language models (LLMs). Specifically, we use training data from all tasks to train a tailored generator, a Conditional Variational Autoencoder (CVAE). The CVAE takes task descriptions as inputs and produces task-aware LoRA weights as outputs. These LoRA weights are then merged with LLMs to create task-specialized models without additional fine-tuning. Furthermore, we utilize in-context meta-learning for knowledge enhancement and task mapping to capture the relationship between tasks and parameter distributions. Consequently, our method achieves more accurate LoRA parameter generation for diverse tasks with the CVAE. ICM-LoRA enables more accurate LoRA parameter reconstruction than current parameter reconstruction methods and is useful for implementing task-specific enhancements of LoRA parameters. At the same time, our method occupies only 283MB, which is about 1% of the storage space required by the original LoRA. The code is available at https://github.com/YihuaJerry/ICM-LoRA

Corresponding author: Jingcai Guo.

Figure 1: ICM-LoRA achieves accurate reconstruction of LoRA parameters through task-vector context modeling.

## 1 Introduction

Large-scale models (LLMs/MLMs) have become the cornerstone of modern AI applications [Xiao et al., 2024; Dubey et al., 2024; Shao et al., 2024]. However, these models typically require substantial amounts of data for fine-tuning, and LLMs are commonly fine-tuned with Low-Rank Adaptation (LoRA) [Hu et al., 2021] on task-specific data. In scenarios with numerous subtasks [Erkoç et al., 2023; Yan et al., 2024], the current approach of training a separate LoRA for each subtask leads to inefficiency in storage and inference: in multi-task scenarios, the LoRA weights can become prohibitively expensive to store, necessitating more efficient solutions. Although FM-Delta [Mizrahi et al., 2017] employs a novel compression scheme that significantly reduces storage needs by storing compressed fine-tuned models, it does not address the problem of capturing the correlations between sub-tasks.
Current parameter generation methods [Platanios et al., 2018; Wortsman et al., 2022; Jin et al., 2024] can only generate LoRA parameters for a single task; it is not possible to generate the LoRA weights required for different tasks simultaneously with a single generator. Moreover, current parameter generation training methods lack context modeling capability, making it difficult to implement multi-task enhancement of LoRA weights. This causes a significant storage burden when storing the LoRA weights and training data.

Therefore, we propose In-Context Meta LoRA (ICM-LoRA) to generate LoRAs for different tasks with a self-designed generator, a Conditional Variational Autoencoder (CVAE). As shown in Figure 1, ICM-LoRA utilizes task vectors for context modeling through in-context meta-learning, allowing the augmented CVAE to learn the features of the parameter distribution. By combining in-context learning and meta-learning and using task vectors as modeling labels, we achieve meta-enhancement of LoRA parameters. Furthermore, ICM-LoRA eliminates the dependence on data and storage and requires only a generator to implement parameter generation.

We evaluate our method on both text and visual tasks with different models. For visual tasks, we select object detection and use the COCO dataset [Lin et al., 2014], dividing it into subclasses based on the detection task labels. For language tasks, we choose The Pile [Gao et al., 2020] as the training corpus and use five different subsets to simulate multi-class training tasks, validating the model on the validation set. The results indicate that the CVAE can successfully generate LoRA parameters for different tasks. Compared with current methods, the generated LoRA parameters exhibit less accuracy loss. In addition, compared to the original datasets and LoRA weights, our generator significantly reduces storage.

In summary, the contributions of our approach can be summarized as follows:

1) We propose a novel framework, In-Context Meta LoRA (ICM-LoRA), which uses a self-designed parameter generator, a Conditional Variational Autoencoder (CVAE), to generate LoRA weights, addressing the inefficiency of training separate LoRA models for multiple sub-tasks.
2) We employ in-context meta-learning for knowledge enhancement and task matching, which enables the generator to better learn the correspondence between tasks and model parameter distributions.
3) Compared to existing methods, the CVAE can generate task-specific LoRA parameters that match or even outperform the original LoRA. In addition, ICM-LoRA costs only about 1% of the storage of the original datasets.

## 2 Related Works

### 2.1 Parameters Generation

The core of parameter generation is to help the model produce a parameter distribution similar to that of the original model. As one of the pioneers, [Platanios et al., 2018] introduced a contextual parameter generator (CPG) to address the challenge of training separate models for each language pair in neural machine translation (NMT).
Some methods, such as stochastic neural networks [Sompolinsky et al., 1988; Bottou and others, 1991; Wong, 1991; Schmidt et al., 1992; Murata et al., 1994; Graves, 2011] and Bayesian neural networks [Neal, 2012; Kingma, 2013; Rezende et al., 2014; Kingma et al., 2015; Blundell et al., 2015; Gal and Ghahramani, 2016], improved the robustness and generalization of models through prior probability distributions over the parameters, but these methods perform poorly in large-scale or complex scenarios. HyperNetworks [Ha et al., 2016] generate the parameters of large networks through small networks. With the development of diffusion models, methods such as G.pt [Peebles et al., 2022] and the p-diff family [Wang et al., 2024; Zhao et al., 2021] began to use diffusion to generate normal-scale parameters; however, they are limited when the parameters to generate are very large or very small. Furthermore, COND P-DIFF [Jin et al., 2024] first applied parameter generation to LoRA parameters, but it only generates LoRA models for coarse-grained tasks, and the LoRA parameters it generates for fine-grained tasks do not perform well. Therefore, we design a fine-grained-task LoRA generator that uses in-context learning (Sec. 2.2) to enhance the context-understanding ability of a generator model such as diffusion.

### 2.2 In-Context Learning

In-Context Learning (ICL) has emerged as a powerful paradigm in machine learning. As a pioneering work, [Brown et al., 2020] revealed for the first time the learning ability of large language models given a small number of examples. Building upon this, MetaICL [Min et al., 2021] integrated tasks into the ICL format for training LLMs and enabled models to achieve performance similar to direct fine-tuning. LaMDA [Thoppilan et al., 2022] emphasizes instruction tuning so that models better understand task descriptions instead of just examples. Self-Instruct [Wang et al., 2022b] lets LLMs generate instructions for task alignment to enhance ICL, while [Wei et al., 2022] introduced Chain-of-Thought (CoT) as an intermediate step between input and output to boost LLM reasoning in ICL. Task-separation ICL, such as Self-Ask [Press et al., 2022] and ICAP [Chi and Wylie, 2014], has explored multi-stage ICL, where tasks are broken down into simpler sub-tasks, each with its own set of demonstrations, so that LLMs can process them individually. SuperICL [Xu et al., 2023] utilizes smaller models as plugins to effectively execute tasks within the LLM framework, demonstrating the potential of hybrid model approaches in ICL. For scenario understanding, In-Context LoRA (IC-LoRA) [Huang et al., 2024] and [Hendel et al., 2023] respectively apply ICL to image generation with diffusion and to context classification with LLMs. In this paper, we apply ICL to the generator to help it better understand the context information in LoRA.

### 2.3 Dataset Condensation

Dataset Condensation (DC) aims to create a compact and representative subset of the original training data. As a foundational work, [Zhao et al., 2020] first compressed data by matching the gradients of the synthetic dataset with those of the original, ensuring that the condensed dataset retains the essential characteristics for effective model training. [Zhao and Bilen, 2021] further achieves more efficient data augmentation, synthesizing more informative synthetic images using differentiable Siamese augmentation (DSA).
Building on this, [Zhao and Bilen, 2023] also explores DC with distribution matching, optimizing synthetic data to match the original distribution in embedding spaces via maximum mean discrepancy (MMD). [Wei et al., 2024] compresses data by matching latent-space quantiles and minimizing distribution-fit statistics, and [Lee et al., 2022] further advances the field by modifying the loss function to capture class differences, with a bi-level warm-up strategy for stable optimization. [He et al., 2024] integrates multiple dataset compression processes to obtain datasets of various sizes and introduces an adaptive subset loss to reduce subset degradation. [Wang et al., 2022a] proposes a new method for dataset compression by aligning features, while [Liu et al., 2024] introduces a dual-domain matching method for dataset condensation in time-series classification, further extending the applicability of DC to different data types. In our work, we discard the original dataset and deposit the different task information of the dataset into the generator model for data-information aggregation and compression.

## 3 Methodology

In this section, we present our approach in terms of an overview, task vector extraction, and parameter sampling and reconstruction for model customization.

### 3.1 Overview

Figure 2: Task hidden-space distribution. Hidden states of task-specific LoRAs exhibit clear clustering.

As shown in Figure 2, we extracted the final hidden states of five different categories (dog, sofa, cat, bicycle, and motorbike) at the last time step from the last layer of Florence-2's [Xiao et al., 2024] decoder and visualized them with t-SNE [Van der Maaten and Hinton, 2008]. The visualization demonstrates that hidden states from different categories form distinct clusters. For simplicity, we refer to the final hidden states at the last time step from the last layer of the decoder as task vectors [Hendel et al., 2023]. From the above, we can draw two conclusions. First, task vectors from different categories are discriminative, as they form distinct clusters. Second, task vectors can represent the high-level features of different categories, because the last layer and final time step typically capture high-level representations. Therefore, task vectors satisfy the two key properties of condition vectors: discriminability and representativeness of the target condition. Based on this, we hypothesize that task vectors can serve as condition vectors to effectively guide the generation process in the CVAE model.

Figure 3 provides an illustrative overview of the proposed method, which comprises three parts.

1) Preparing LoRA parameter data and extracting task vectors. We fine-tune a Large Language Model (LLM) or a Large Vision-Language Model (LVLM) on a specific task category and save checkpoints from the final stages of the training process. These checkpoints serve as training data for the subsequent generative model. Next, we perform inference with the fine-tuned model on randomly selected samples from a specific task category, such as cat or dog detection. During inference, we extract the hidden states from the last layer at the final time step. These hidden states are then averaged to derive a task vector that represents the specific task category.

2) Training the CVAE model.
The LoRA parameters extracted from the fine-tuned model checkpoints, along with the task vectors, are used as training data for a Conditional Variational Autoencoder (CVAE). We use in-context meta-learning so that the CVAE models the relationships between multiple tasks and better learns the LoRA parameter distribution.

3) Generating and applying LoRA parameters. Using the trained CVAE, we sample from a Gaussian distribution to reconstruct the LoRA parameters for the target task category. The reconstructed LoRA parameters are then used to perform inference on the test set, enabling the model to generalize effectively to the specific task.

Figure 3: Overall framework of ICM-LoRA. Step 1: train task LoRAs (e.g., cat, dog, and sofa detection) and extract their task vectors. Step 2: train a self-designed CVAE on these task data via in-context meta-learning. Step 3: generate the task LoRA with the trained CVAE.

### 3.2 Task Vector Extraction

Considering the contextual capabilities of large-scale models and the evidence that in-context learning can produce task-specific representations [Hendel et al., 2023], we fine-tune a pre-trained model with LoRA [Hu et al., 2021] for a specific task category. Because the hidden state of the last token summarizes the input, we extract for each sample the hidden state $h_i^{\text{last}}$ of the last token generated by the LLM. For a specific set of task samples $\{x_i\}_{i=1}^{N}$, the LLM generates hidden states for the task from the last layer. These hidden states are then averaged to produce a compact task vector, expressed as Eq. (1):

$$v_{\text{task}} = \frac{1}{N} \sum_{i=1}^{N} h_i^{\text{last}}, \tag{1}$$

where $N$ is the number of task-specific samples, $v_{\text{task}} \in \mathbb{R}^d$ represents the task vector for the given category, and $d$ is the dimensionality of the hidden state.

Since the last time step in many natural language processing scenarios contains complete information about the input sequence, we choose the last token to produce the hidden state $h_i^{\text{last}}$. For the task of mapping an input sequence to a single vector, the hidden state of the last token typically integrates information from all previous time steps and thus becomes a compact representation of the overall semantics of the input $x_i$.

Furthermore, we extract the hidden states from the final layer because this layer typically represents the most abstract and task-specific feature space. As information flows through the network, the early layers typically encode general linguistic or structural features, whereas the last layer captures high-level semantic features specific to the current task. This progression allows the final layer to act as a task-feature extractor, providing a representation well suited for generating task vectors. The abstract nature of the last layer ensures that the resulting task vector $v_{\text{task}}$ can effectively capture the key features of the task.
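As a concrete illustration, the following minimal sketch shows how the task-vector extraction of Eq. (1) could be implemented for a Hugging Face causal LM. It is a sketch under stated assumptions, not the paper's exact pipeline: the model name and sample prompts are placeholders, and for brevity the hidden state of the final input token stands in for the last generated token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: assumed to be the base model already merged with the
# task LoRA; any causal LM that exposes hidden states works the same way.
model_name = "meta-llama/Meta-Llama-3-8B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

@torch.no_grad()
def task_vector(samples):
    """Eq. (1): v_task = (1/N) * sum_i h_i^last."""
    states = []
    for text in samples:
        out = model(**tok(text, return_tensors="pt"))
        # hidden_states[-1] is the last layer; [:, -1, :] is the last time step.
        states.append(out.hidden_states[-1][:, -1, :].squeeze(0))
    return torch.stack(states).mean(dim=0)  # shape (d,)

# Hypothetical task samples for a "cat detection" category.
v_task = task_vector(["Detect the cat in the picture.",
                      "A photo of a cat on a sofa."])
```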
### 3.3 Conditional Variational Autoencoder

Based on the Variational Autoencoder (VAE) [Kingma, 2013], we employ a Conditional Variational Autoencoder (CVAE) to model the distribution of LoRA parameters conditioned on task vectors. The CVAE consists of an encoder $q_\phi(z \mid l, v_{\text{task}})$, which maps the LoRA parameter $l$ and task vector $v_{\text{task}}$ to a latent representation $z$, and a decoder $p_\theta(l \mid z, v_{\text{task}})$, which reconstructs $l$ from $z$ and $v_{\text{task}}$. Both are conditioned on the task vector $v_{\text{task}}$, which provides additional information to guide the generation of LoRA parameters.

The encoder $q_\phi(z \mid l, v_{\text{task}})$ models the approximate posterior distribution over the latent variables $z$, given the LoRA parameter $l$ and task vector $v_{\text{task}}$. It takes as input the concatenation of $l$ and $v_{\text{task}}$, denoted as Eq. (2):

$$x = [l; v_{\text{task}}] \in \mathbb{R}^{d_l + d_{\text{task}}}, \tag{2}$$

where $d_l$ and $d_{\text{task}}$ are the dimensions of the LoRA parameter $l$ and the task vector $v_{\text{task}}$, respectively. The concatenated vector $x$ is passed through a neural network that outputs the parameters of the approximate posterior distribution, as in Eq. (3):

$$q_\phi(z \mid l, v_{\text{task}}) = \mathcal{N}\!\left(z; \mu_\phi(x), \sigma_\phi^2(x)\right), \tag{3}$$

where $\mu_\phi(x)$ and $\sigma_\phi^2(x)$ are the mean and variance of the latent variable $z$, computed from the input $x$. The latent variable $z$ is then sampled from this distribution using the reparameterization trick, Eq. (4):

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \tag{4}$$

where $\epsilon \sim \mathcal{N}(0, I)$ is a noise term and $\odot$ denotes element-wise multiplication.

The decoder $p_\theta(l \mid z, v_{\text{task}})$ models the likelihood of the LoRA parameter $l$ given the latent variable $z$ and task vector $v_{\text{task}}$. It receives $z$ and $v_{\text{task}}$ as input, concatenated as Eq. (5):

$$x' = [z; v_{\text{task}}] \in \mathbb{R}^{d_z + d_{\text{task}}}, \tag{5}$$

where $d_z$ is the dimensionality of the latent variable $z$. The concatenated vector $x'$ is then fed into a neural network that outputs the parameters of the likelihood distribution of the LoRA parameter $l$, Eq. (6):

$$p_\theta(l \mid z, v_{\text{task}}) = \mathcal{N}\!\left(l; \hat{\mu}_\theta(x'), \hat{\sigma}_\theta^2(x')\right), \tag{6}$$

where $\hat{\mu}_\theta(x')$ and $\hat{\sigma}_\theta^2(x')$ are the mean and variance of the predicted LoRA parameter $l$, computed from the input $x'$. The decoder aims to minimize the reconstruction error, ensuring that the generated LoRA parameters match the true parameters as closely as possible.

The latent space $z \in \mathbb{R}^k$ is assumed to follow a Gaussian prior, Eq. (7):

$$p(z) = \mathcal{N}(z; 0, I), \tag{7}$$

where $I$ is the identity matrix. The objective is to maximize the evidence lower bound (ELBO), which consists of a reconstruction term and a regularization term, Eq. (8):

$$\mathcal{L} = \mathbb{E}_{q_\phi(z \mid l, v_{\text{task}})}\!\left[\log p_\theta(l \mid z, v_{\text{task}})\right] - \mathrm{KL}\!\left(q_\phi(z \mid l, v_{\text{task}}) \,\|\, p(z)\right), \tag{8}$$

where $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the Kullback-Leibler divergence. The first term encourages the decoder to reconstruct accurate LoRA parameters, while the second term regularizes the latent space toward the Gaussian prior. By conditioning both the encoder and decoder on the task vector $v_{\text{task}}$, the model learns to generate LoRA parameters that are specific to the given task, leading to task-aware representations in the latent space. For the CVAE to generate task LoRA parameters, we utilize CLIP's [Radford et al., 2021] text encoder to produce the task vector $v_{\text{task}}$.

To generate parameters for a task, a sample $z$ is drawn from the prior distribution $p(z)$, and the decoder produces the corresponding LoRA parameter:

$$l_{\text{generated}} \sim p_\theta(l \mid z, v_{\text{task}}). \tag{9}$$
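The following is a minimal PyTorch sketch of the CVAE described by Eqs. (2)-(8). It is an illustration, not the paper's exact generator: plain MLPs stand in for the 12-layer 1D-CNN encoder and decoder used in the experiments, and a fixed-variance Gaussian decoder reduces the log-likelihood term of Eq. (8) to a mean-squared error.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    """Minimal CVAE over flattened LoRA vectors l, conditioned on v_task.
    MLPs stand in for the paper's 12-layer 1D-CNN encoder/decoder."""
    def __init__(self, d_l: int, d_task: int, d_z: int = 256, d_h: int = 1024):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_l + d_task, d_h), nn.ReLU())
        self.mu = nn.Linear(d_h, d_z)
        self.logvar = nn.Linear(d_h, d_z)
        self.dec = nn.Sequential(
            nn.Linear(d_z + d_task, d_h), nn.ReLU(), nn.Linear(d_h, d_l))

    def forward(self, l, v_task):
        h = self.enc(torch.cat([l, v_task], dim=-1))             # Eq. (2)
        mu, logvar = self.mu(h), self.logvar(h)                  # Eq. (3)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # Eq. (4)
        l_hat = self.dec(torch.cat([z, v_task], dim=-1))         # Eqs. (5)-(6)
        return l_hat, mu, logvar

def cvae_loss(l_hat, l, mu, logvar, kld_weight=0.005):
    """Negative ELBO, Eq. (8): reconstruction plus weighted KL to N(0, I)."""
    recon = F.mse_loss(l_hat, l, reduction="mean")
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kld_weight * kld
```

The `kld_weight` of 0.005 matches the value reported in the training strategies of Section 4.1; the other dimensions are illustrative defaults.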
## 4 Experiments

In this section, we evaluate several tasks on LLMs and MLMs, measuring the task performance of the LoRA [Hu et al., 2021] weights produced by current LoRA parameter generation methods. This demonstrates the effectiveness and reasonableness of our approach.

### 4.1 Experiment Setting

Baselines. We chose the original model, the original LoRA, and LoRA generated by Model Soup [Wortsman et al., 2022] and COND P-DIFF [Jin et al., 2024] as baselines to test the advantages of our method on different tasks.

Datasets. For the computer vision task, we select the representative object detection task. We choose the COCO [Lin et al., 2014] dataset and divide it into different subclasses based on the detection task labels. For the language modeling task, we employ The Pile [Gao et al., 2020] as the training corpus. To simulate multi-category training tasks, we pick five different subsets from The Pile and validate our method on the test sets.

Data Preparation. The fine-tuning process produces a series of LoRA matrices $\{L_t\}_{t=1}^{T}$ with different ranks $r$, where $T$ is the number of fine-tuning steps. Each matrix $L_t \in \mathbb{R}^{m \times n}$ is flattened into a one-dimensional vector $l_t \in \mathbb{R}^{mn}$ to facilitate alignment with the task vector $v_{\text{task}}$. These flattened LoRA parameters, along with the corresponding task vectors, form the training dataset $\{(v_{\text{task}}, l_t)\}$ for the self-designed CVAE (a sketch of this preparation and of parameter generation follows at the end of this subsection).

Training Strategies. The CVAE model employs a 12-layer 1D CNN architecture for both the encoder and decoder. The loss combines the Kullback-Leibler divergence (KLD) and the reconstruction loss, with the KLD weight set to 0.005, as expressed in Eq. (8). We fine-tuned the model on a specific task using LoRA for a total of 150 epochs, saving the LoRA parameters from the final 50 epochs. The task vector is extracted from the last token of the last layer in CLIP [Radford et al., 2021]. Subsequently, the CVAE model is trained for 2,000 epochs to ensure robust learning of the latent space. All experiments were conducted on a single NVIDIA A800 GPU, with each experiment taking approximately 3 hours to complete.
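A hypothetical sketch of the data preparation and generation steps around the CVAE sketched above: `flatten_lora` and `unflatten_lora` are illustrative helpers, not the authors' code, and `generate_lora` realizes Eq. (9) by sampling $z$ from the prior of Eq. (7) and decoding.

```python
import math
import torch

def flatten_lora(state_dict):
    """Flatten each L_t in R^{m x n} into l_t in R^{mn} and concatenate,
    remembering the shapes so the vector can be unflattened later."""
    shapes = {k: v.shape for k, v in state_dict.items()}
    flat = torch.cat([v.reshape(-1) for v in state_dict.values()])
    return flat, shapes

def unflatten_lora(flat, shapes):
    """Inverse of flatten_lora: rebuild the LoRA state dict from the vector."""
    out, i = {}, 0
    for k, shape in shapes.items():
        n = math.prod(shape)
        out[k] = flat[i:i + n].reshape(shape)
        i += n
    return out

# Training pairs {(v_task, l_t)} for the CVAE, given saved LoRA checkpoints:
# dataset = [(v_task, flatten_lora(sd)[0]) for sd in lora_checkpoints]

@torch.no_grad()
def generate_lora(cvae, v_task, shapes, d_z=256):
    """Eq. (9): sample z ~ N(0, I) and decode a task-specific LoRA."""
    z = torch.randn(1, d_z)                                   # prior, Eq. (7)
    flat = cvae.dec(torch.cat([z, v_task.unsqueeze(0)], dim=-1)).squeeze(0)
    return unflatten_lora(flat, shapes)  # merge into the base model afterwards
```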
### 4.2 Main Results

We conduct experiments on computer vision tasks and natural language processing tasks, and demonstrate that our approach generalizes across models and can be adapted to tasks of multiple modalities.

Object Detection. As shown in Table 1, we selected several subsets of conventional tasks and fine-tuned Florence-2 [Xiao et al., 2024]. The LoRA parameters generated by ICM-LoRA on the expert-task subsets of the COCO dataset show the smallest gap to the original LoRA parameters, and ICM-LoRA even outperforms the original LoRA on some tasks. This suggests that our method reconstructs LoRA parameters more completely than the other methods. By adding in-context learning, ICM-LoRA's understanding of task scenarios is enhanced compared to COND P-DIFF, which explains why it outperforms the original LoRA on some tasks. As shown in Table 3, our method achieves task-specific dataset compression using much less storage than the original dataset together with the original LoRA weights.

The compression of the visual dataset is achieved through parameter generation, which significantly reduces the storage cost. This shows that our approach not only generates task-corresponding LoRAs more accurately but also enables task-based data compression. As shown in Figure 4, both Model Soup and COND P-DIFF produced poor detection labels, and COND P-DIFF even produced a false detection in a complex environment. This indicates that the LoRAs they reconstruct cannot be successfully adapted to the LVLM.

Language Modeling. For the language modeling task, we set the LoRA rank r = 2. We fine-tuned the Llama-3-8B [Dubey et al., 2024] model across different tasks and report results on five subsets of The Pile: ArXiv, Books, Ubuntu, Wikipedia, and Gutenberg. As shown in Table 2, compared to other methods, ICM-LoRA achieves the lowest perplexity and bits-per-character, clearly showing its superiority on these language modeling tasks.

| Method | Dog MAP50/75 | Bicycle MAP50/75 | Cat MAP50/75 | Sofa MAP50/75 | Motorbike MAP50/75 |
|---|---|---|---|---|---|
| Original LoRA | 0.96 / 0.89 | 0.90 / 0.81 | 0.94 / 0.90 | 0.95 / 0.86 | 0.82 / 0.78 |
| Original Model | 0.92 / 0.87 | 0.89 / 0.80 | 0.93 / 0.89 | 0.00 / 0.00 | 0.00 / 0.00 |
| Model Soup | 0.93 / 0.87 | 0.90 / 0.78 | 0.93 / 0.88 | 0.81 / 0.74 | 0.81 / 0.72 |
| COND P-DIFF | 0.94 / 0.87 | 0.90 / 0.77 | 0.93 / 0.89 | 0.84 / 0.78 | 0.80 / 0.74 |
| ICM-LoRA | 0.96 / 0.89 | 0.90 / 0.81 | 0.95 / 0.91 | 0.95 / 0.86 | 0.83 / 0.78 |

Table 1: Parameter reconstruction results (MAP50/MAP75) for LoRA rank r = 2 on the object detection task. ICM-LoRA generates LoRA weights closest to the original LoRA, and even better than the original LoRA on some tasks.

| Method | ArXiv PPL/BPC | Books PPL/BPC | Ubuntu PPL/BPC | Wikipedia PPL/BPC | Gutenberg PPL/BPC |
|---|---|---|---|---|---|
| Original LoRA | 6.75 / 0.41 | 7.07 / 0.48 | 9.66 / 0.58 | 5.54 / 0.43 | 8.60 / 0.62 |
| Original Model | 7.76 / 0.44 | 7.67 / 0.51 | 10.00 / 0.59 | 6.07 / 0.46 | 8.75 / 0.63 |
| Model Soup | 7.00 / 0.43 | 7.08 / 0.48 | 9.67 / 0.59 | 5.56 / 0.45 | 8.61 / 0.63 |
| COND P-DIFF | 6.73 / 0.41 | 7.10 / 0.49 | 9.67 / 0.58 | 5.55 / 0.44 | 8.60 / 0.62 |
| ICM-LoRA | 6.74 / 0.40 | 7.07 / 0.48 | 9.65 / 0.58 | 5.54 / 0.43 | 8.59 / 0.61 |

Table 2: Parameter reconstruction results for LoRA rank r = 2 in language modeling. Compared with baseline methods, ICM-LoRA generates LoRA parameters with lower PPL (↓) and BPC (↓) on different subsets, equaling or even surpassing the original LoRA.

| Method | r = 1 | r = 2 | r = 4 | r = 8 |
|---|---|---|---|---|
| Original COCO | 25G | 25G | 25G | 25G |
| Original LoRA | 2.1G | 4.3G | 8.5G | 16.9G |
| Model Soup | 423MB | 453MB | 478MB | 504MB |
| COND P-DIFF | 314MB | 314MB | 318MB | 326MB |
| ICM-LoRA | 283MB | 283MB | 283MB | 283MB |

Table 3: Storage memory required by different methods. On the vision task, ICM-LoRA and COND P-DIFF incur far less storage overhead than the original dataset plus the original LoRA parameter weights.

On some subtasks, ICM-LoRA achieved the same PPL and BPC as the original LoRA, and even lower PPL and BPC. This indicates that even on language tasks, ICM-LoRA can reconstruct the LoRA parameters, and on some specific subtasks it reconstructs LoRA with a more reasonable parameter distribution. Since Llama-3 [Dubey et al., 2024] and Florence-2 [Xiao et al., 2024] have different architectures and parameter distributions, their LoRA weights also follow different distributions. We argue that ICM-LoRA can adapt to both multimodal and language tasks, and to different models with diverse parameter distributions.
Therefore, we conclude that ICM-LoRA is highly effective at parameter generation for language modeling.

### 4.3 Ablation Studies

In this section, we first discuss the effect of different LoRA ranks and parameter counts on the generation of task LoRAs. We then discuss the effect of the number of convolutional layers n on model performance during sampling. Finally, we discuss the impact of task vectors generated by CLIP's vision encoder versus its text encoder.

Figure 4: Visual comparison of different methods for generating LoRA on the prompts "Detecting the giraffes/cats/horse in the image." The LoRA generated by ICM-LoRA is most similar in effect to the original LoRA.

Impact of LoRA Rank r and Parameter Number P. We trained LoRA parameters with ranks r = 1, 2, 4, 8, trained the CVAE on these LoRA parameters, generated task LoRA parameters from the task vectors, and evaluated them on the COCO dataset. Meanwhile, we tested several other methods and examined the effect of LoRA parameter size on the generated LoRA parameters. We selected dogs and cats as examples and report their MAP50. As shown in Table 4 and Figure 5, as the rank r of LoRA increases, the number of LoRA parameters also increases. The other methods reconstruct LoRA progressively worse as the parameter count grows, which indicates that they cannot adapt to the reconstruction of LoRA with a large number of parameters. The detection performance of our method stays almost the same as the original LoRA as the LoRA parameters increase, which indicates that our method is more robust and can adapt to the reconstruction of LoRA weights with different numbers of parameters.

| Rank | P | Model Soup (Dog / Cat) | COND P-DIFF (Dog / Cat) | ICM-LoRA (Dog / Cat) |
|---|---|---|---|---|
| r = 1 | 241,241 | 0.93 / 0.93 | 0.94 / 0.93 | 0.95 / 0.97 |
| r = 2 | 482,482 | 0.93 / 0.93 | 0.94 / 0.93 | 0.96 / 0.95 |
| r = 4 | 964,964 | 0.93 / 0.92 | 0.93 / 0.93 | 0.95 / 0.96 |
| r = 8 | 1,929,928 | 0.91 / 0.91 | 0.90 / 0.90 | 0.95 / 0.96 |

Table 4: Impact of LoRA rank and parameter number (MAP50). Our method is more robust at higher ranks and with more parameters.

Figure 5: Visualization of LoRA rank impacts. For the task "Detecting cats in the image.", ICM-LoRA is less affected by LoRA rank than other methods.

Impact of the Number of Convolutional Layers. We evaluate the effect of the number of sampling convolutional layers, under different LoRA ranks, on the model's generation of LoRA weights for the task "Detecting cats in the image." As shown in Table 5, as the LoRA rank and parameter count rise, sampling quality progressively degrades when fewer convolutional layers are used. This indicates that deeper networks sample the parameters better. However, when the convolutional stack is too deep, the model fails to learn the parameter distribution of LoRAs with fewer parameters. Therefore, we use 12 convolutional layers to sample the LoRA parameters in our experiments.

Impact of Text and Vision Task Vectors.
We generate text task vectors from "Detect the cat in the picture." and "Detect the dog in the picture." using the text encoder, and vision task vectors from cat and dog images. Finally, we evaluate the reconstructed LoRA parameters on the cat and dog subsets of COCO and report MAP50.

| n | r = 1 | r = 2 | r = 4 | r = 8 |
|---|---|---|---|---|
| n = 10 | 0.93 | 0.93 | 0.91 | 0.89 |
| n = 11 | 0.93 | 0.93 | 0.92 | 0.91 |
| n = 12 | 0.93 | 0.93 | 0.93 | 0.93 |
| n = 13 | 0.93 | 0.93 | 0.93 | 0.93 |
| n = 14 | 0.93 | 0.93 | 0.93 | 0.93 |
| n = 20 | 0.90 | 0.91 | 0.92 | 0.93 |

Table 5: Impact of the number of sampling convolutional layers. LoRAs with larger ranks require more convolutional layers for sampling, but too many convolutional layers lead to poor sampling.

| Task Vector | Dog | Cat |
|---|---|---|
| Vision | 0.96 | 0.95 |
| Text | 0.96 | 0.95 |
| Vision + Text | 0.96 | 0.95 |

Table 6: Impact of text and vision task vectors (MAP50). The vision task vectors and the text task vectors yield essentially the same LoRA reconstruction effect on both tasks.

As shown in Table 6, task vectors derived from vision input and from text input have an equal impact on generation, so we consider task vectors of different modalities to have equivalent effects on parameter generation; using multimodal task vectors simultaneously does not further enhance parameter generation.

## 5 Conclusion

In this paper, we propose ICM-LoRA, a novel framework that uses a self-designed parameter generator, a Conditional Variational Autoencoder (CVAE), to generate LoRA parameters for model customization. ICM-LoRA performs context modeling of task vectors and LoRA parameters by combining in-context learning and meta-learning, allowing the CVAE to learn LoRA parameter distributions more accurately. Our method achieves accurate task-instructed LoRA parameter generation with only a CVAE, eliminating the need for additional training data and storage. Experimental results on both language modeling and object detection further validate that our approach applies to different models and tasks. ICM-LoRA also reduces storage costs and improves computational efficiency. Overall, ICM-LoRA represents a significant advancement in parameter generation and large-scale model customization.

## Acknowledgements

This work was supported by funding from the Hong Kong RGC General Research Fund (152211/23E, 15216424/24E, and 152115/25E), the PolyU Internal Fund (P0056171), and the Huawei Gifted Fund. The first two authors (i.e., Yihua Shao and Minxi Yan) contributed equally to this work.

## References

[Blundell et al., 2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622. PMLR, 2015.
[Bottou and others, 1991] Léon Bottou et al. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 91(8):12, 1991.
[Brown et al., 2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[Chi and Wylie, 2014] Michelene TH Chi and Ruth Wylie. The ICAP framework: Linking cognitive engagement to active learning outcomes. Educational Psychologist, 49(4):219–243, 2014.
[Dubey et al., 2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[Erkoç et al., 2023] Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. HyperDiffusion: Generating implicit neural fields with weight-space diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14300–14310, 2023.
[Gal and Ghahramani, 2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059. PMLR, 2016.
[Gao et al., 2020] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
[Graves, 2011] Alex Graves. Practical variational inference for neural networks. Advances in Neural Information Processing Systems, 24, 2011.
[Ha et al., 2016] David Ha, Andrew Dai, and Quoc V Le. HyperNetworks. arXiv preprint arXiv:1609.09106, 2016.
[He et al., 2024] Yang He, Lingao Xiao, Joey Tianyi Zhou, and Ivor Tsang. Multisize dataset condensation. arXiv preprint arXiv:2403.06075, 2024.
[Hendel et al., 2023] Roee Hendel, Mor Geva, and Amir Globerson. In-context learning creates task vectors. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9318–9333, 2023.
[Hu et al., 2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[Huang et al., 2024] Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context LoRA for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024.
[Jin et al., 2024] Xiaolong Jin, Kai Wang, Dongwen Tang, Wangbo Zhao, Yukun Zhou, Junshu Tang, and Yang You. Conditional LoRA parameter generation. arXiv preprint arXiv:2408.01415, 2024.
[Kingma et al., 2015] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems, 28, 2015.
[Kingma, 2013] Diederik P Kingma. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[Lee et al., 2022] Saehyung Lee, Sanghyuk Chun, Sangwon Jung, Sangdoo Yun, and Sungroh Yoon. Dataset condensation with contrastive signals. In International Conference on Machine Learning, pages 12352–12364. PMLR, 2022.
[Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision, ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
[Liu et al., 2024] Zhanyu Liu, Ke Hao, Guanjie Zheng, and Yanwei Yu. Dataset condensation for time series classification via dual domain matching. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1980–1991, 2024.
[Min et al., 2021] Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. arXiv preprint arXiv:2110.15943, 2021.
[Mizrahi et al., 2017] Tal Mizrahi, Yoram Revah, Yehonathan Refael Kalim, Elad Kapuza, and Yuval Cassuto. FM-Delta: Fault management packet compression. In 2017 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), pages 596–599. IEEE, 2017.
[Murata et al., 1994] Noboru Murata, Shuji Yoshizawa, and Shun-ichi Amari. Network information criterion: Determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6):865–872, 1994.
[Neal, 2012] Radford M Neal. Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media, 2012.
[Peebles et al., 2022] William Peebles, Ilija Radosavovic, Tim Brooks, Alexei A Efros, and Jitendra Malik. Learning to learn with generative models of neural network checkpoints. arXiv preprint arXiv:2209.12892, 2022.
[Platanios et al., 2018] Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. Contextual parameter generation for universal neural machine translation. arXiv preprint arXiv:1808.08493, 2018.
[Press et al., 2022] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022.
[Radford et al., 2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[Rezende et al., 2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286. PMLR, 2014.
[Schmidt et al., 1992] Wouter F Schmidt, Martin A Kraaijveld, Robert PW Duin, et al. Feed forward neural networks with random weights. In International Conference on Pattern Recognition, pages 1–1. IEEE Computer Society Press, 1992.
[Shao et al., 2024] Yihua Shao, Siyu Liang, Xiaolin Lin, Zijian Ling, Zixian Zhu, Minxi Yan, Haiyang Liu, Siyu Chen, Ziyang Yan, Yilan Meng, et al. GWQ: Gradient-aware weight quantization for large language models. arXiv preprint arXiv:2411.00850, 2024.
[Sompolinsky et al., 1988] Haim Sompolinsky, Andrea Crisanti, and Hans-Jürgen Sommers. Chaos in random neural networks. Physical Review Letters, 61(3):259, 1988.
[Thoppilan et al., 2022] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
[Van der Maaten and Hinton, 2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
[Wang et al., 2022a] Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You. CAFE: Learning to condense dataset by aligning features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12196–12205, 2022.
[Wang et al., 2022b] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
[Wang et al., 2024] Kai Wang, Zhaopan Xu, Yukun Zhou, Zelin Zang, Trevor Darrell, Zhuang Liu, and Yang You. Neural network diffusion. arXiv preprint arXiv:2402.13144, 2024.
[Wei et al., 2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[Wei et al., 2024] Wei Wei, Tom De Schepper, and Kevin Mets. Dataset condensation with latent quantile matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7703–7712, 2024.
[Wong, 1991] Eugene Wong. Stochastic neural networks. Algorithmica, 6(1):466–478, 1991.
[Wortsman et al., 2022] Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pages 23965–23998. PMLR, 2022.
[Xiao et al., 2024] Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4818–4829, 2024.
[Xu et al., 2023] Canwen Xu, Yichong Xu, Shuohang Wang, Yang Liu, Chenguang Zhu, and Julian McAuley. Small models are valuable plug-ins for large language models. arXiv preprint arXiv:2305.08848, 2023.
[Yan et al., 2024] Ziyang Yan, Lei Li, Yihua Shao, Siyu Chen, Wuzong Kai, Jenq-Neng Hwang, Hao Zhao, and Fabio Remondino. 3DSceneEditor: Controllable 3D scene editing with Gaussian splatting. arXiv preprint arXiv:2412.01583, 2024.
[Zhao and Bilen, 2021] Bo Zhao and Hakan Bilen. Dataset condensation with differentiable Siamese augmentation. In International Conference on Machine Learning, pages 12674–12685. PMLR, 2021.
[Zhao and Bilen, 2023] Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6514–6523, 2023.
[Zhao et al., 2020] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. arXiv preprint arXiv:2006.05929, 2020.
[Zhao et al., 2021] Qi Hao Zhao, Wei Hu, Yangyu Huang, and Fan Zhang. P-DIFF+: Improving learning classifier with noisy labels by noisy negative learning loss. Neural Networks, 144:1–10, 2021.