# LLM AUGMENTED LLMS: EXPANDING CAPABILITIES THROUGH COMPOSITION

Rachit Bansal1 Bidisha Samanta1 Siddharth Dalmia2 Nitish Gupta1 Shikhar Vashishth1 Sriram Ganapathy1 Abhishek Bapna1 Prateek Jain1 Partha Talukdar1  1Google Research India  2Google DeepMind

Foundational models with billions of parameters which have been trained on large corpora of data have demonstrated non-trivial skills in a variety of domains. However, due to their monolithic structure, it is challenging and expensive to augment them or impart new skills. On the other hand, due to their adaptation abilities, several new instances of these models are being trained towards new domains and tasks. In this work, we study the problem of efficient and practical composition of existing foundation models with more specific models to enable newer capabilities. To this end, we propose CALM (Composition to Augment Language Models), which introduces cross-attention between models to compose their representations and enable new capabilities. Salient features of CALM are: (i) it scales up LLMs on new tasks by re-using existing LLMs along with a few additional parameters and data, (ii) existing model weights are kept intact, hence preserving existing capabilities, and (iii) it applies to diverse domains and settings. We illustrate that augmenting PaLM2-S with a smaller model trained on low-resource languages results in an absolute improvement of up to 13% on tasks like translation into English and arithmetic reasoning for low-resource languages. Similarly, when PaLM2-S is augmented with a code-specific model, we see a relative improvement of 40% over the base model for code generation and explanation tasks, on par with fully fine-tuned counterparts.

1 INTRODUCTION

Large Language Models (LLMs) have been shown to encompass a range of foundational capabilities such as commonsense and factual reasoning, world knowledge, and coherent language generation (Bubeck et al., 2023; Google et al., 2023). Leveraging these foundational capabilities, a number of efforts in the community have fine-tuned these models to enable domain-specific capabilities such as code generation, copy editing, and mathematical problem solving (Lewkowycz et al., 2022; Singhal et al., 2023). This has resulted in the development of several specialized large models with domain-specific capabilities. For example, there are models that do well on standard code generation but are not as proficient in general logical reasoning, and vice-versa. The presence of such a large number of domain-specific models leads to a natural question: Can we compose an anchor model with a domain-specific augmenting model to enable new capabilities? For example, can we compose an augmenting model's code understanding capability with an anchor LLM's language generation capability to enable code-to-text generation? The typical approach for this problem is to further pre-train or (efficiently) fine-tune the anchor model on the data that was originally used to train the augmenting model (Hu et al., 2021; Kessler et al., 2022). However, such solutions are often not feasible, since training large models is computationally expensive, especially when the augmenting model itself is an LLM trained on a massive corpus. Further, processing data from multiple sources might not be feasible due to privacy concerns and organizational boundaries.
Working with multiple distinct models is also desirable since it allows the reuse of existing models with established capabilities, providing better control and avoiding the catastrophic forgetting that is prevalent in conventional approaches. Correspondence to Rachit and Bidisha: [brachit, bidishasamanta]@google.com

[Figure 1 shows three example queries: translating "Everything but the kitchen sink" from a low-resource language into English, answering "What does this Python code do?" with "Implements the classic word game of Hangman", and answering "What is the value of x1 + x8 * xn?" with "Since x1=10, x8=14, xn=2, x1 + x8 * xn = 38".]

Figure 1: Overview of CALM: augmenting an anchor LLM (m_B) with new capabilities through composition with a specialized augmenting model (m_A). The figure illustrates three m_A with different capabilities: key-value mapping (left), low-resource languages (center), and code (right). Models m_A and m_B remain frozen during composition; a few additional parameters are learnt over the models' layer representations. The leftmost panel shows an m_A trained on a set of string-integer mappings, e.g., {x1: 10, ..., xn: 2}, while m_B is a large LM with arithmetic capabilities. CALM composes these two frozen models to solve the task of arithmetic over keys, which neither model could solve on its own (§4.1). Notably, CALM generalizes to the entire key-value set despite training with arithmetic examples spanning only 20% of the keys.

To address the training and the data challenges mentioned above, we propose and study a practical setting for model composition: (i) we are given access to one (or more) augmenting model(s) and an anchor model, (ii) we are not allowed to modify the weights of either model, and (iii) we only have access to a small amount of data representing the combined skills of the given models, e.g., code generation with complex logical reasoning. Prior work has largely approached the question of composition from either a routing or a merging standpoint, neither of which provides an effective solution for this setting. Routing between the given models, i.e., choosing an output of one model over the other (Ma et al., 2019), or performing a soft ensemble (Muqeeth et al., 2023), is not effective when neither of the models can demonstrate the desired capability. Another body of work creates a combined model by an arithmetic combination of base model parameters (Wortsman et al., 2022; Ilharco et al., 2022; Matena & Raffel, 2022). However, these settings are naturally restrictive and their efficacy is unclear when combining models with different sizes and pre-training objectives (Yadav et al., 2023).

In this work, we propose a novel Composition to Augment Language Models (CALM) framework to address the general model composition setting mentioned above. Rather than a shallow combination of the augmenting and anchor LMs (Wortsman et al., 2022; Ilharco et al., 2022), CALM introduces a small number of trainable parameters over both the augmenting and anchor models' intermediate layer representations. CALM finds an effective combination of the given models to perform new challenging tasks more accurately than either of the models alone, while preserving the capabilities of the individual models. Figure 1 highlights a few motivating scenarios for CALM. We study key practical applications of CALM: language inclusivity and code generation. For language inclusivity (§4.2), we use a model that has been trained on a set of low-resource languages.
We observe that composing this model with the LLM allows us to borrow its generation and reasoning capabilities to achieve significantly better performance on translation and arithmetic reasoning tasks for low-resource languages (Tables 2 and 3). This composed model outperforms not only the two base models but also versions of the LLM that have been further pre-trained or LoRA (Hu et al., 2021) fine-tuned for the set of low-resource languages. For code generation (§4.3), we use a model that has been trained on open-source code across a variety of programming languages. Composing this model with the LLM, hence borrowing its low-level logic and generation capabilities, outperforms the two base models (Table 4) on code explanation and code completion tasks.

2 RELATED WORKS

Parameter efficient fine-tuning: A large body of prior work focuses on parameter efficient ways of fine-tuning models for new tasks by introducing a small number of trainable parameters, keeping the original model intact (Houlsby et al., 2019; Wang et al., 2020; Pfeiffer et al., 2021; Hu et al., 2021; Kessler et al., 2022). Since this paradigm allows only a small set of new parameters to be trained, it is challenging to use these approaches to augment novel domains and knowledge sources that are entirely absent from the original training corpus. In contrast, CALM enables a model to be adapted to new domains using augmenting models. In Section 4.4, we draw empirical comparisons between CALM and LoRA (Hu et al., 2021), a representative parameter efficient fine-tuning method.

Model Merging: Merging different expert models with simple techniques like task vector averaging provides a way of recombining different capabilities of these models (Ilharco et al., 2022; Matena & Raffel, 2022). However, these methods are only relevant when the original models are well aligned. Other related approaches are applicable only when the models are derived from the same model (Matena & Raffel, 2022) or are of the same size (Muqeeth et al., 2023). In contrast, CALM is more generic and is applicable to any set of models.

Model and Task Compositionality: The modular encoder-decoder based method in Dalmia et al. (2022) adapts components of encoder-decoder models to allow flexible re-usability of different encoders, each with their own capabilities. Several past studies explore compositionality of modality-specific encoders with language models to serve multi-modal use-cases (Ziegler et al., 2019; Alayrac et al., 2022). Typically, they introduce cross-attention parameters across a language model in order to attend to representations from an image encoder, and show an effective transfer of modalities across models. In this work, we extend this idea of model re-use and modularity to the composition of capabilities in large language models.

Models as Tools: Another interesting direction for using multiple language models to solve a downstream task has been to perform composition in the models' input text space (Zeng et al., 2022; Shen et al., 2023). Schick et al. (2023) have demonstrated how a model can be taught to use external tools; there might be an opportunity to investigate whether other models can be called as part of the same framework. Since these approaches require a large amount of prompt engineering, in this work we focus on composition through representations that can be learnt automatically.
3 COMPOSITION TO AUGMENT LANGUAGE MODELS (CALM)

Given an anchor model m_B and an augmenting model m_A, CALM aims to compose the two models (m_{A⊕B}) to enable new capabilities as a composition of the capabilities of the two individual models. As discussed in the introduction, we study this composition in a practical setting with the following assumptions: (i) we can access the weights, run forward and backward passes, and access intermediate representations of both m_B and m_A; (ii) we are not allowed to change the weights of either model; (iii) we do not have access to the training data, hyperparameters, or training states of either base model; (iv) we are provided a few examples from the target composition domain.

The goal is to learn a composition m_{A⊕B} = f(m_A, m_B, Θ_C, D_C) to achieve some joint task C. The weights of m_A and m_B are frozen. Θ_C is the additional set of trainable parameters introduced to learn the composition, and D_C refers to the set of examples that are used to learn this composition.

3.1 LEARNING TO COMPOSE (Θ_C)

As outlined in Figure 1, we operate over a selected set of layers from m_B and m_A at all times. We learn two sets of additional parameters over these layers: (i) a simple set of linear transformations, f_proj(·), that map the i-th layer representation from m_A to the dimensionality of representations from m_B, and (ii) a set of cross-attention layers, f_cross(·, ·), that cross-attend between this transformed layer representation and the j-th layer representation from m_B.

Compositional Layers: Let the augmenting model m_A and the anchor model m_B have N_A and N_B layers, respectively. Also, let D_A and D_B be the token dimensionality of the two models. We first choose a set of compositional layers, L_A and L_B, for the two models, over which the set of new learnable parameters is introduced during composition, with n_A = |L_A| and n_B = |L_B|. For simplicity, we set n_A = n_B = n, and the gap between two contiguous selected layers is kept uniform, i.e., (l_2 − l_1) = · · · = (l_n − l_{n−1}) = N/n. Further, let H_A ≜ {H_A^1, H_A^2, ..., H_A^{n_A}} denote the layer representations of a given input after each layer in L_A.

Learned Projections: Next, we map representations from m_A to those of m_B via a projection layer. In particular, for each layer in L_A, we learn a projection function f_proj : R^{D_A} → R^{D_B} that projects representations from these layers to the desired representation size of m_B. Let

f_proj(H_A) ≜ {f_proj(H_A^1), f_proj(H_A^2), ..., f_proj(H_A^{n_A})}.

This transformation enables cross-attention across models, and also performs an alignment of representations from m_A and m_B despite the frozen weights of the base models.

Cross-attention Layers: Similar to the multi-headed cross-attention in encoder-decoder models (for example, Vaswani et al. (2017) and Raffel et al. (2020)), we introduce cross-attention between representations of the anchor and the augmenting model. In particular, we use f_proj(H_A^i) from the augmenting model as the key and value vectors for each head in the cross-attention, and the vector H_B^j from the anchor model as the query vector, which leads to the following cross-attention setup:

f_cross(f_proj(H_A^i), H_B^j) = Concat_{k ∈ [N_H]}(head_k) W^O,
where head_k = Attn(Q_B, K_A, V_A), with
Q_B = H_B^j W_k^Q,  K_A = f_proj(H_A^i) W_k^K,  V_A = f_proj(H_A^i) W_k^V.

Here, N_H represents the number of attention heads used for cross-attention which, in our case, is typically the same as the number of heads used for self-attention in m_B.
Each of W^O ∈ R^{D_B × D_B} and W_k^Q, W_k^K, W_k^V ∈ R^{D_B × (D_B / N_H)} is a learnable weight matrix, where k ∈ {1, ..., N_H}. Finally, the cross-attention output is added as a residual connection to the layer representation of m_B. The resultant output vector, in turn, is the input to the succeeding layer in m_B:

H_{A⊕B}^j = H_B^j + f_cross(f_proj(H_A^i), H_B^j).

Here, H_{A⊕B}^j denotes the input to the (j + 1)-th layer of the composed model. All layers in L_A and L_B are utilized in a similar manner. Propagating over the remaining layers in m_B gives us the final output token y_t decoded for the t-th timestep. Akin to usual auto-regressive decoding, the output token for each timestep is appended to the input: x_{t+1} = x_t ⊕ y_t. Since the updated input at each timestep is passed to both models, all representations for the two models are refreshed.

3.2 COMPOSITION TRAINING DATA (D_C)

Since the target model m_{A⊕B} involves a composition of the two models m_A and m_B, we construct the set of training examples D_C to depict a combined skill that enables Θ_C to attend over the two models appropriately for the target task. Ideally, if the tasks involved in the composition are distinguished as t_1 and t_2, respectively, then we design D_C to depict the joint task C. For example, with respect to our synthetic key-value setup, our final task (C) is to perform arithmetic over a set of keys. The augmenting model m_A is trained to learn the given key-value pairs (notated as task t_1) and the anchor model m_B is a generic model that can perform numeric arithmetic well (task t_2). For learning the set of composition parameters Θ_C, we consider D_C to be arithmetic over a held-in set of keys (task C), encompassing combined skills from the two models. In contrast to fine-tuning approaches like LoRA (Hu et al., 2021) that would require the entire knowledge source (here, key-values) during training time, we find that training the composition on only a fraction of the keys generalizes to the full set. In other real-world settings, a clear distinction between the specializing tasks of each model might be difficult to formulate, and hence defining a task that captures the combined skills can be challenging. We find that using a set of examples that capture certain capabilities of the two models suffices, i.e., some rough notion of t_{A⊕B}. For our language inclusivity task, we use a mixture of examples containing a small amount of low-resource language and high-resource language data.

Composing multiple models: Finally, we note that while the method has been presented for a setting with one anchor model and only one augmenting model, CALM is applicable to multiple augmenting models as well. In particular, CALM would require learning similar projection and cross-attention components between the anchor and each of the augmenting models. We leave a thorough investigation of this as a topic of future work.
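As a concrete illustration of Section 3.1, the following is a minimal NumPy sketch of a single composition block: f_proj maps a selected m_A representation to m_B's width, f_cross attends with the anchor representation as queries and the projected augmenting representation as keys and values, and the result is added back as a residual. The class name, shapes, random initialization, and the absence of masking, normalization, and training code are all simplifications for illustration; this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CompositionBlock:
    """One (L_A, L_B) layer pair: f_proj maps D_A -> D_B, f_cross attends with
    queries from m_B and keys/values from the projected m_A representation."""

    def __init__(self, d_a, d_b, n_heads, scale=0.02):
        assert d_b % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_b // n_heads
        self.W_proj = rng.normal(0.0, scale, (d_a, d_b))                  # f_proj
        self.W_q = rng.normal(0.0, scale, (n_heads, d_b, self.d_head))
        self.W_k = rng.normal(0.0, scale, (n_heads, d_b, self.d_head))
        self.W_v = rng.normal(0.0, scale, (n_heads, d_b, self.d_head))
        self.W_o = rng.normal(0.0, scale, (d_b, d_b))                     # W^O

    def __call__(self, h_a, h_b):
        """h_a: [T_a, D_A] from m_A's selected layer; h_b: [T_b, D_B] from m_B."""
        h_a_proj = h_a @ self.W_proj                       # f_proj(H_A^i)
        heads = []
        for k in range(self.n_heads):
            q = h_b @ self.W_q[k]                          # queries from the anchor
            key = h_a_proj @ self.W_k[k]                   # keys from the augmenting model
            val = h_a_proj @ self.W_v[k]                   # values from the augmenting model
            attn = softmax(q @ key.T / np.sqrt(self.d_head))
            heads.append(attn @ val)
        f_cross = np.concatenate(heads, axis=-1) @ self.W_o
        return h_b + f_cross                               # residual: H_{A⊕B}^j

# Toy shapes: D_A=8, D_B=16, 4 heads, 5 augmenting tokens, 3 anchor tokens.
block = CompositionBlock(d_a=8, d_b=16, n_heads=4)
print(block(rng.normal(size=(5, 8)), rng.normal(size=(3, 16))).shape)  # (3, 16)
```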
4 EXPERIMENTS

We demonstrate the following in three domains: (a) an anchor LLM (m_B) can be composed with an augmenting model (m_A) trained on mappings between string keys and number values to solve arithmetic expressions over those keys, requiring both knowledge of the KV mappings and arithmetic capabilities (§4.1); (b) CALM can be used to expand the language coverage of an anchor LLM (m_B) to low-resource languages it has not seen during pre-training: we show that an augmenting model (m_A) pre-trained on low-resource languages can be composed with such an anchor model to significantly improve translation and math-word problem solving capabilities in low-resource languages (§4.2); and (c) code completion and explanation can be improved by composing an anchor LLM with an augmenting model (m_A) specializing in the code domain (§4.3).

In all experiments, we start with a PaLM2-XXS model and further train it on domain-specific data to arrive at an augmenting model (m_A) that is then kept frozen during composition. Note that no task-specific training data was used to train CALM. We use PaLM2-XS or PaLM2-S models as the anchor LLM (m_B), which is also kept frozen during composition training. For all our experiments, we set N_A/n = 4, i.e., we perform composition using every 4th layer output from m_A. Correspondingly, layers from m_B (L_B) are chosen such that n_B = n_A = n, spaced uniformly across the N_B layers of m_B.

4.1 KEY-VALUE ARITHMETIC

We first study the setting where we have a small augmenting LM that has been trained to memorize string-to-integer key-value (KV) mappings, and a large anchor LM that is capable of performing arithmetic over integers. We wish to use CALM to compose them and enable a new capability of solving arithmetic expressions containing those keys.

Key-Value Domain Knowledge: We first generate a repository of KV pairs containing N_KV = 25K pairs by sampling English strings of 2–6 characters from the vocabulary of the PaLM2-XXS model and randomly assigning them unique integer values in the range [1, N_KV]. This constitutes the knowledge artifact, D_KV. We further generate a collection of arithmetic expressions (D_KV-EXP) containing addition (+), subtraction (−), and multiplication (×) operations between 3–6 keys, by randomly sampling keys from D_KV and operations to perform between them. From these, we generate three datasets: (i) KV-Substitution (D_KV-SUBS): this dataset maps each expression in D_KV-EXP to the expression obtained by replacing the keys with their corresponding values; e.g., an expression of the form ⟨key1⟩ + ⟨key2⟩ − ⟨key3⟩ maps to 10 + 22 − 24. (ii) KV-Arithmetic (D_KV-MATH): this dataset maps each expression in D_KV-EXP to the numeric value obtained by solving the arithmetic expression after the keys are replaced by their values; e.g., ⟨key1⟩ + ⟨key2⟩ − ⟨key3⟩ maps to 8. (iii) Numeric-Arithmetic (D_NUM-MATH): this dataset maps the value-substituted version of each expression in D_KV-EXP to the numeric value obtained by solving it; e.g., 10 + 22 − 24 maps to 8.

Models: We obtain the augmenting model m_A by further training a pre-trained PaLM2-XXS model on D_KV-SUBS to make it memorize the KV pairs in D_KV. Note that training on D_KV-SUBS does not teach this augmenting model how to solve arithmetic expressions. Next, we use a pre-trained PaLM2-XS model as the anchor model m_B. This model is capable of solving numeric expressions with decent performance (see Table 1), but has no knowledge of the KV pairs in D_KV. We now take examples from the KV-Substitution dataset D_KV-SUBS that span only 20% of the keys in D_KV to form the training data for composition (D_C). We use D_C to compose the augmenting model m_A, which has knowledge of D_KV, and the pre-trained anchor model m_B by training the composition parameters Θ_C using CALM, as explained in §3. Both m_A and m_B are kept unchanged.
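As a rough illustration of this setup, the sketch below generates a small key-value repository, derives the three parallel datasets, and restricts the composition examples to a held-in 20% of the keys. The sizes, random key strings, and helper names are ours; the paper samples key strings from the PaLM2-XXS vocabulary and uses N_KV = 25K.

```python
import random
import string

random.seed(0)
N_KV = 1000                 # the paper uses 25K pairs; kept small for illustration
OPS = ["+", "-", "*"]

# D_KV: random 2-6 character string keys mapped to unique integers in [1, N_KV].
keys = set()
while len(keys) < N_KV:
    keys.add("".join(random.choices(string.ascii_lowercase, k=random.randint(2, 6))))
keys = sorted(keys)
kv = dict(zip(keys, random.sample(range(1, N_KV + 1), N_KV)))

def make_example(pool, n_keys):
    """One expression over `n_keys` distinct keys from `pool`; returns the three
    parallel views: KV-Substitution, KV-Arithmetic, and Numeric-Arithmetic."""
    ks = random.sample(pool, n_keys)
    ops = random.choices(OPS, k=n_keys - 1)
    key_expr = " ".join([ks[0]] + [tok for op, k in zip(ops, ks[1:]) for tok in (op, k)])
    num_expr = " ".join([str(kv[ks[0]])]
                        + [tok for op, k in zip(ops, ks[1:]) for tok in (op, str(kv[k]))])
    ans = eval(num_expr)    # safe here: the expression was constructed above from ints/ops
    return (key_expr, num_expr), (key_expr, ans), (num_expr, ans)

# D_C for composition: arithmetic examples restricted to a held-in 20% of the keys;
# evaluation would use expressions over the remaining keys.
held_in = random.sample(keys, int(0.2 * N_KV))
d_c = [make_example(held_in, random.randint(3, 6))[1] for _ in range(5)]
print(d_c[0])
```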
Evaluation Task: We evaluate the composed model m_{A⊕B} for its ability to solve arithmetic expressions containing keys from D_KV. Specifically, we evaluate on the subset of the D_KV-MATH dataset that does not contain expressions used in D_C during training. This way, we are able to measure the composed model's ability to generalize to keys beyond those observed during training.

| | m_A | m_B | CALM (m_{A⊕B}) |
|---|---|---|---|
| D_KV-SUBS | 98.1 | 0.0 | 92.9 |
| D_NUM-MATH | 4.2 | 73.7 | 72.0 |
| D_KV-MATH | 0.7 | 0.0 | 84.3 |

Table 1: Evaluation accuracy (%) for the synthetic key-value (KV) task. m_A is trained to memorize the KV mappings while m_B excels at arithmetic. The composition m_{A⊕B} is able to perform arithmetic over held-out keys.

Results: Table 1 shows the performance of the three models, m_A, m_B, and m_{A⊕B}, across the aforementioned datasets. First, we observe that the augmenting model m_A achieves 98.1% on the KV-Substitution task, showing that it memorizes D_KV well. Next, we see that it performs poorly (4.2%) on the Numeric-Arithmetic task, showing that it does not have arithmetic capabilities. As a result, this model is not able to solve arithmetic expressions containing keys from D_KV. As expected, the anchor model m_B gets 0% accuracy on the KV-Substitution and KV-Arithmetic tasks as it has not seen any data from D_KV. However, it performs well (73.7%) on the Numeric-Arithmetic task, demonstrating its capability for arithmetic over numerals. Lastly, we see that the composed model m_{A⊕B} is able to solve all tasks with high accuracy, especially the KV-Arithmetic task (84.3%), which both of the underlying models fail at. This shows that the composed model is able to leverage the relevant capabilities from both the augmenting and anchor models to solve a complex task.

4.2 LOW-RESOURCE LANGUAGE INCLUSIVITY

FLORES-200 (XX to En; chrF1)

| Model | lij | mr | taq | nn | su | ban | pl | th | min | acm | avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PaLM2-XXS | 24.0 | 16.5 | 21.6 | 33.3 | 20.6 | 2.1 | 5.3 | 63.2 | 44.0 | 59.8 | 29.0 |
| + NTL (m_A) | 32.0 | 21.6 | 46.9 | 50.0 | 40.6 | 4.1 | 4.0 | 63.8 | 47.8 | 61.1 | 37.2 |
| PaLM2-S (m_B) | 32.6 | 24.2 | 44.6 | 50.8 | 50.9 | 5.4 | 9.5 | 69.0 | 61.0 | 68.6 | 41.7 |
| CALM (m_{A⊕B}) | 44.1 | 30.4 | 55.1 | 54.6 | 54.4 | 11.8 | 11.3 | 69.4 | 61.1 | 68.9 | 46.1 |
| m_B + NTL (m_B^NTL) | 48.1 | 39.1 | 59.2 | 57.5 | 57.3 | 11.4 | 9.9 | 69.4 | 61.4 | 69.0 | 48.2 |

Table 2: Translation performance in the XX to English direction on the FLORES-200 dataset (Costa-jussà et al., 2022). We show results for a subset of 10 low-resource languages. Note that the composed model m_{A⊕B} significantly outperforms both m_A and m_B. On the complete language list, m_{A⊕B} outperforms both of the underlying models for 175 of 192 languages (Appendix A; Figure 2). m_B^NTL represents a skyline where m_B has been further pre-trained on D_NTL; the composed model achieves similar performance for a tiny fraction of the training cost.

In this section, we study whether we can compose such a large anchor LM m_B with a smaller augmenting LM m_A that has been pre-trained on low-resource languages, to perform translation and math-word problem solving tasks presented in these low-resource languages.

Low-resource Language Corpora: We use the long-tail language set and the associated corpora from the Next Thousand Languages (NTL) effort (Caswell et al., 2020; Bapna et al., 2022) as the domain data D_NTL.

GSM8K (Low-resource Languages; Accuracy)

| Model | meo | mfa | pcm | efi | min | ilo | ady | mai | nso | mzn | avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PaLM2-XXS | 5.2 | 6.8 | 6.8 | 4.0 | 5.6 | 7.2 | 6.0 | 3.6 | 7.2 | 6.8 | 5.9 |
| + NTL (m_A) | 7.6 | 4.0 | 4.4 | 3.2 | 6.0 | 4.8 | 6.4 | 3.2 | 6.0 | 4.8 | 5.0 |
| PaLM2-S (m_B) | 28.8 | 14.0 | 34.4 | 14.8 | 25.2 | 14.8 | 30.0 | 22.8 | 8.4 | 31.6 | 22.5 |
| CALM (m_{A⊕B}) | 34.0 | 17.6 | 33.6 | 18.0 | 23.6 | 16.8 | 36.4 | 24.8 | 8.4 | 36.4 | 25.0 |
| m_B^NTL | 33.2 | 20.4 | 31.6 | 14.0 | 24.8 | 14.0 | 29.2 | 21.2 | 9.6 | 27.6 | 22.6 |

GSM8K (High-resource Languages; Accuracy)

| Model | en | te | bn | sw | ja | zh | th | fr | es | de | avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PaLM2-XXS | 5.6 | 4.0 | 2.0 | 7.6 | 2.0 | 4.4 | 6.0 | 6.8 | 5.6 | 9.2 | 5.3 |
| + NTL (m_A) | 4.8 | 3.6 | 3.2 | 4.8 | 3.2 | 7.6 | 6.4 | 9.2 | 5.6 | 7.2 | 5.6 |
| PaLM2-S (m_B) | 36.8 | 19.2 | 23.2 | 16.0 | 2.0 | 39.2 | 29.6 | 38.0 | 32.4 | 43.2 | 28.0 |
| CALM (m_{A⊕B}) | 37.2 | 28.0 | 27.2 | 18.0 | 2.4 | 43.6 | 33.2 | 42.8 | 36.0 | 49.2 | 31.8 |
| m_B^NTL | 36.0 | 17.6 | 18.4 | 14.4 | 0.8 | 33.6 | 27.2 | 34.8 | 31.2 | 42.0 | 25.6 |

Table 3: Evaluations for grade-school mathematics (GSM) problems in low-resource (LRL) and high-resource (HRL) languages. We observe that CALM yields significant gains for both evaluation sets. Gains on the HRL set suggest that CALM avoids catastrophic forgetting.

This large-scale corpus contains web-crawled monolingual sentences and translation pairs for 1000 languages. The dataset has been used for language expansion in translation systems and language models (Garcia et al., 2021; Siddhant et al., 2022).

Models: Akin to §4.1, we obtain the augmenting model m_A by training the PaLM2-XXS model on D_NTL to impart knowledge about these low-resource languages to the model. For m_B, we use the pre-trained PaLM2-S model. We use 5% of the same low-resource language corpora D_NTL as the training data D_C to compose m_A and m_B via CALM. Since both models remain frozen during composition, the anchor model m_B is never trained on any of the low-resource language data.

Evaluation Tasks: We evaluate the composed model m_{A⊕B} on two tasks. (i) Translating text from a non-English language into English: we carry out these evaluations in a 5-shot in-context learning paradigm on the FLORES-200 dataset (Costa-jussà et al., 2022), which contains examples for 200 high- and low-resource languages. (ii) Performing grade-school math word problems expressed in a non-English language: we evaluate on the multilingual version of the GSM-8K dataset (Shi et al., 2023), containing math word problems for English and 9 other high-resource languages. We further generated a silver-standard GSM-8K dataset for low-resource languages by automatically translating the English examples in GSM-8K into 25 low-resource languages supported by Google Translate (quality evaluations are reported in Table 7 in the Appendix).
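For the translation task above, each query is preceded by five in-context exemplars. A minimal prompt-builder sketch is shown below; the actual template, language tags, and exemplar selection used for the reported results are not specified here, so the format is an assumption.

```python
def build_translation_prompt(exemplars, source_sentence, src_tag="XX", tgt_tag="English"):
    """Assemble a k-shot XX -> English prompt from (source, english) exemplar pairs.
    The "XX:/English:" template below is illustrative only."""
    blocks = [f"{src_tag}: {src}\n{tgt_tag}: {tgt}" for src, tgt in exemplars]
    blocks.append(f"{src_tag}: {source_sentence}\n{tgt_tag}:")
    return "\n\n".join(blocks)

# Usage with five placeholder exemplars (real ones would come from the FLORES-200 dev split).
shots = [(f"<source sentence {i}>", f"<English reference {i}>") for i in range(5)]
print(build_translation_prompt(shots, "<sentence to translate>"))
```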
Results: Table 2 shows results on the FLORES-200 dataset (Costa-jussà et al., 2022), where the input is a low-resource (XX) language sentence and the output should be the corresponding English translation. For the 10 low-resource languages shown in the table, we see that both of the underlying models, m_A and m_B, are outperformed by our composed model m_{A⊕B}. We find that the composed model m_{A⊕B} outperforms m_B on 175 of the complete set of 192 languages (Appendix A).

Table 3 shows the performance of these models on the grade-school math word problems from the GSM8K task (Cobbe et al., 2021) in low-resource languages (top) and high-resource languages (Shi et al. (2023); bottom). Firstly, we observe that the augmenting model m_A does not perform well on this task due to its limited mathematical reasoning capabilities. On the other hand, the anchor model m_B does much better given its mathematical reasoning capabilities and transfer learning from high-resource languages. Finally, we observe that m_{A⊕B} outperforms both m_A and m_B on 18 of 25 low-resource and 9 of 10 high-resource languages, demonstrating effective composition of models. Note that the last row of Table 3 shows that m_B, when fine-tuned on D_NTL, performs worse than the pre-trained m_B, indicating forgetting; composing the domain-specific model m_A with m_B using CALM avoids this.

4.3 CODE UNDERSTANDING AND GENERATION

Code understanding and generation require two distinct types of capabilities: (a) knowledge of the syntax and semantics of code, and (b) knowledge of the world that the code is manipulating. While LLMs have a wealth of world knowledge, they can often lack specific knowledge of code syntax due to a skewed representation of code data in their pre-training corpora. Conversely, small models trained specifically on code data could exhibit a good understanding of code syntax, but they may lack broad world knowledge and reasoning. CALM can enable the best of both worlds.

Code Domain Data: Here, we use a code-specific corpus, D_Code, consisting of open-source code extracted from GitHub heads for a variety of programming languages, to train m_A.

Models: Similar to §4.1, a version of the PaLM2-XXS model that has been further pre-trained on D_Code is used as m_A, while the base pre-trained PaLM2-S model acts as m_B. We build m_{A⊕B} by training CALM with only 7% of the same code data (the data used for m_A) to maintain data parity.

Evaluation Tasks: We evaluate the efficacy of CALM on three different tasks. (i) Code Completion (CC): given an initial set of lines of code, the model is prompted to complete the code snippet. The aim here is to evaluate the model for code syntax. We perform zero-shot evaluations on the HumanEval benchmark dataset (Chen et al., 2021) and report the Pass@1 (P@1) metric. (ii) Text-to-Code (T2C): given a textual context, the model is prompted to generate the corresponding code snippet. Here, the evaluation indicates language understanding and code generation capabilities. We perform 3-shot inference on the MBPP dataset (Austin et al., 2021) and report P@1. (iii) Code-to-Text (C2T): given a code snippet, the goal is to generate a natural language explanation of the code. This task evaluates code understanding and text generation. We perform 3-shot evaluations on the CodeXGLUE benchmark (Lu et al., 2021) and report chrF1 scores across languages.
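The CC and T2C tasks above are scored with Pass@1. For reference, the snippet below sketches the standard unbiased pass@k estimator from Chen et al. (2021); the sample counts in the example are placeholders, since the number of generations per problem behind the reported numbers is not stated here.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k (Chen et al., 2021): probability that at least one of k
    samples passes, given n generated samples of which c pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 5 of which pass -> pass@1 = 0.25
print(round(pass_at_k(n=20, c=5, k=1), 3))
```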
| Model | CC (P@1): HumanEval | T2C (P@1): MBPP | C2T (chrF1): Python | C2T: PHP | C2T: Go | C2T: Java | C2T: JS | C2T: Ruby |
|---|---|---|---|---|---|---|---|---|
| PaLM2-XXS + Code (m_A) | 19.5 | 28.0 | 28.0 | 34.7 | 32.6 | 29.6 | 26.5 | 26.0 |
| PaLM2-S (m_B) | 16.4 | 28.6 | 30.4 | 35.5 | 40.4 | 31.0 | 28.8 | 27.9 |
| CALM (m_{A⊕B}) | 22.5 | 32.2 | 30.5 | 35.8 | 40.6 | 31.4 | 29.3 | 29.0 |
| m_B^Code | 24.3 | 43.0 | 18.9 | 35.0 | 41.1 | 31.1 | 20.2 | 27.6 |

Table 4: Evaluations for code generation and understanding across three tasks: Code Completion (CC), Text-to-Code (T2C), and Code-to-Text (C2T). Augmenting code understanding onto m_B using m_A significantly improves performance across all datasets. m_B^Code represents a skyline where m_B is further pre-trained on D_Code, which shows catastrophic forgetting of the text generation task. See Table 6 (Appendix A.2) for a complete set of evaluations.

Results: Table 4 reports comparative performance for the individual models m_A and m_B, the composed version m_{A⊕B}, and a fine-tuned anchor baseline m_B^Code. Firstly, evaluations on the HumanEval dataset suggest that m_A has a superior understanding of code syntax as a result of its additional training on D_Code. In contrast, due to the larger scale and general-purpose pre-training of m_B, the anchor excels at general language understanding and hence performs better on the T2C and C2T tasks. When employing CALM to compose the two models, we observe a clear transfer and composition of capabilities through significant performance improvements: 6.1% and 3.6% absolute gains over m_B on the CC and T2C tasks, respectively. We observe that fine-tuning m_B on D_Code leads to a significant decline in C2T performance due to catastrophic forgetting; CALM retains this performance and is marginally better than m_B across all languages. We also study qualitative examples on the C2T task and observe interesting common patterns that are discussed in Appendix B.

4.4 ABLATIONS

| Model | FLORES-200 (XX–En) chrF1 | #(>m_B) | GSM-8K (LRL) acc. | #(>m_B) | GSM-8K (HRL) acc. | #(>m_B) | CC P@1 | T2C P@1 | C2T chrF1 |
|---|---|---|---|---|---|---|---|---|---|
| m_{A⊕B} | 60.5 | 175 | 21.4 | 20 | 33.1 | 11 | 22.5 | 32.2 | 32.6 |
| Vanilla m_A | 59.2 | 115 | 19.0 | 15 | 29.7 | 8 | 20.0 | 28.0 | 32.2 |
| Random m_A | 58.8 | 43 | 17.8 | 9 | 28.5 | 4 | 20.1 | 27.0 | 32.1 |
| m_A as an encoder | 59.3 | 102 | 19.1 | 12 | 29.1 | 6 | 16.0 | 27.0 | 32.0 |
| LoRA | 59.2 | 82 | 20.9 | 15 | 31.2 | 9 | 18.3 | 28.7 | 32.6 |

Table 5: Comparative performance of CALM (m_{A⊕B}) across various ablations. The metric #(>m_B) depicts the number of languages for which the corresponding model is better than the base model m_B, out of 192, 25, and 11 languages for the three language tasks, respectively.

Influence of m_A: We first study the influence of m_A by replacing it with vanilla and random variants during composition. Table 5 shows the variation in performance across the NTL and code tasks when the specialized m_A is replaced with a vanilla PaLM2-XXS checkpoint or an untrained version of the model, i.e., a random model. We see a considerable drop in performance with these variants across all tasks. On the FLORES-200 XX–En task, the number of languages improved by composition drops to 115 and 43 with the vanilla and random variants, respectively. A slight improvement of the vanilla model over m_B indicates that an un-specialized model (with a different training regime than m_B) might have orthogonal capabilities leading to an enhanced model. This finding validates that the performance gains seen with CALM are a result of utilizing m_A and not just the additional Θ_C parameters.

Influence of iterative decoding: We also investigate a variation where we use m_A as an encoder, i.e., an output token decoded at a given timestep is not appended to m_A's input. In this case, only the prefix representations of m_A are used. This setting alludes to past work on composing image and text models (Alayrac et al., 2022), where encoder and decoder models are composed. We observe a significant decline in performance across our various tasks when employing this setting.

Comparison with LoRA: Finally, we evaluate a parameter efficient fine-tuning approach by training LoRA (Hu et al., 2021) layers to adapt m_B. For all experiments, we set the LoRA rank such that the number of added parameters is equal to the number of parameters introduced with CALM. We also train LoRA on the same data as CALM, i.e., D_C. We see a considerable difference in performance between the two approaches across all tasks and metrics.

5 CONCLUSION

The proposed CALM framework composes an anchor LLM with specialized augmenting models to enable new tasks that are not achievable by either model individually. CALM does not require updating the individual models and learns a dense interaction between the models through a few trainable cross-attention parameters.
Our experiments present consistent evidence that CALM learns to utilize the expertise of the two models: when composed with relevant augmenting models, we observe a significant uptick in the anchor model's performance across multiple challenging tasks, such as low-resource translation, reasoning, and code explanation/generation. CALM is especially useful in scenarios where proprietary data and knowledge are stored in parametric models. With CALM, a foundational LLM could be augmented with such proprietary models to extend a variety of foundational capabilities, such as reasoning, world knowledge, and coherent generation, over the target proprietary domains. Finally, extensions of CALM could be used to acquire distinct knowledge from multiple augmenting models.

ACKNOWLEDGMENTS

This work was done during RB's pre-doctoral tenure at Google Research, India (GRI) with PT and PJ. RB is indebted to Manish Gupta, Divvy Thakkar, and all others who enabled this opportunity. RB would also like to thank the members of the Languages team and other researchers at GRI (and beyond), including the incredible pre-doctoral cohort. This work wouldn't have been possible without their constant support. Namely: Aishwarya P.S., Laurent El Shafey, and Qiao Zhang for their massive help in coding and debugging; Palak Jain and Sagar Gubbi for their feedback and support throughout the project; Kartikeya Badola, Shreyas Havaldar, Amandeep Kaur, and Rishabh Tiwari for being the first ears to all ideas; Cyrus Rashtchian and Richa Dixit for their mentorship.

REFERENCES

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a Visual Language Model for Few-Shot Learning, April 2022. URL http://arxiv.org/abs/2204.14198. arXiv:2204.14198 [cs].

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao, Mia Xu Chen, Klaus Macherey, Maxim Krikun, Pidong Wang, Alexander Gutkin, Apurva Shah, Yanping Huang, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. Building machine translation systems for the next thousand languages, 2022.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Túlio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4. CoRR, abs/2303.12712, 2023. doi: 10.48550/arXiv.2303.12712. URL https://doi.org/10.48550/arXiv.2303.12712.

Isaac Caswell, Theresa Breiner, Daan van Esch, and Ankur Bapna. Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 6588–6608, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics.
doi: 10.18653/v1/2020.coling-main.579. URL https://aclanthology.org/2020.coling-main.579.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.

Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loïc Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. No language left behind: Scaling human-centered machine translation. CoRR, abs/2207.04672, 2022. doi: 10.48550/arXiv.2207.04672. URL https://doi.org/10.48550/arXiv.2207.04672.

Siddharth Dalmia, Dmytro Okhonko, Mike Lewis, Sergey Edunov, Shinji Watanabe, Florian Metze, Luke Zettlemoyer, and Abdelrahman Mohamed. LegoNN: Building Modular Encoder-Decoder Models, June 2022. URL http://arxiv.org/abs/2206.03318. arXiv:2206.03318 [cs, eess].

Xavier Garcia, Aditya Siddhant, Orhan Firat, and Ankur P. Parikh. Harnessing multilinguality in unsupervised machine translation for rare languages. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tür, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pp. 1126–1137. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.naacl-main.89. URL https://doi.org/10.18653/v1/2021.naacl-main.89.

Google, Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry,
Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. PaLM 2 technical report, 2023.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 2790–2799. PMLR, 2019. URL http://proceedings.mlr.press/v97/houlsby19a.html.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models, October 2021. URL http://arxiv.org/abs/2106.09685. arXiv:2106.09685 [cs].

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.

Samuel Kessler, Bethan Thomas, and Salah Karout. An Adapter Based Pre-Training for Efficient and Scalable Self-Supervised Speech Representation Learning, February 2022. URL http://arxiv.org/abs/2107.13530. arXiv:2107.13530 [cs, eess].

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V. Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/18abbeef8cfe9203fdf9053c9c4fe191-Abstract-Conference.html.

Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. CoRR, abs/2102.04664, 2021.

Jiaqi Ma, Zhe Zhao, Jilin Chen, Ang Li, Lichan Hong, and Ed H. Chi. SNR: Sub-network routing for flexible parameter sharing in multi-task learning.
In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 216–223. AAAI Press, 2019. doi: 10.1609/aaai.v33i01.3301216. URL https://doi.org/10.1609/aaai.v33i01.3301216.

Michael S. Matena and Colin A. Raffel. Merging models with Fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716, 2022.

Mohammed Muqeeth, Haokun Liu, and Colin Raffel. Soft merging of experts with adaptive routing. arXiv preprint arXiv:2306.03745, 2023.

Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. AdapterFusion: Non-Destructive Task Composition for Transfer Learning, January 2021. URL http://arxiv.org/abs/2005.00247. arXiv:2005.00247 [cs].

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. CoRR, abs/2302.04761, 2023. doi: 10.48550/arXiv.2302.04761. URL https://doi.org/10.48550/arXiv.2302.04761.

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, April 2023. URL http://arxiv.org/abs/2303.17580. arXiv:2303.17580 [cs].

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=fR3wGCk-IXp.

Aditya Siddhant, Ankur Bapna, Orhan Firat, Yuan Cao, Mia Xu Chen, Isaac Caswell, and Xavier Garcia. Towards the next 1000 languages in multilingual machine translation: Exploring the synergy between supervised and self-supervised learning. CoRR, abs/2201.03110, 2022. URL https://arxiv.org/abs/2201.03110.

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Andrew Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Agüera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle K. Barral, Dale R. Webster, Gregory S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, and Vivek Natarajan. Towards expert-level medical question answering with large language models. CoRR, abs/2305.09617, 2023. doi: 10.48550/arXiv.2305.09617. URL https://doi.org/10.48550/arXiv.2305.09617.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus,
S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters, December 2020. URL http://arxiv.org/abs/2002.01808. arXiv:2002.01808 [cs].

Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 23965–23998. PMLR, 2022. URL https://proceedings.mlr.press/v162/wortsman22a.html.

Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Resolving interference when merging models. arXiv preprint arXiv:2306.01708, 2023.

Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, May 2022. URL http://arxiv.org/abs/2204.00598. arXiv:2204.00598 [cs].

Zachary M. Ziegler, Luke Melas-Kyriazi, Sebastian Gehrmann, and Alexander M. Rush. Encoder-agnostic adaptation for conditional language generation. arXiv, abs/1908.06938, 2019. URL https://api.semanticscholar.org/CorpusID:201070774.

A SUPPLEMENTARY MATERIAL FOR NTL

A.1 FLORES-200

Figure 2 depicts the gains over the anchor PaLM2-S model when it is augmented with a model that has been trained on D_NTL. We see a positive gain through CALM for 175 of 192 languages. The highest gains are seen for low-resource languages, since they are the most underrepresented in the original model. Diminishing returns are seen for higher-resource languages, and this trend is similar to that seen for m_B^NTL.

Figure 2: Gains seen by the composed model m_{A⊕B} over the anchor model m_B for the complete set of 192 FLORES-200 languages (x-axis: languages sorted from low- to high-resource; y-axis: gain over the anchor model).

| Language | m_A | m_B | m_{A⊕B} (CALM) | m_B^NTL |
|---|---|---|---|---|
| meo | 7.6 | 28.8 | 34.0 | 33.2 |
| mfa | 4.0 | 14.0 | 17.6 | 20.4 |
| pcm | 4.4 | 34.4 | 33.6 | 31.6 |
| efi | 3.2 | 14.8 | 18.0 | 14.0 |
| min | 6.0 | 25.2 | 23.6 | 24.8 |
| ilo | 4.8 | 14.8 | 16.8 | 14.0 |
| ady | 6.4 | 30.0 | 36.4 | 29.2 |
| mai | 3.2 | 22.8 | 24.8 | 21.2 |
| nso | 6.0 | 8.4 | 8.4 | 9.6 |
| mzn | 4.8 | 31.6 | 36.4 | 27.6 |
| bew | 4.4 | 33.6 | 34.8 | 33.6 |
| ts | 4.8 | 7.2 | 10.0 | 11.6 |
| dv | 2.8 | 11.2 | 14.8 | 13.2 |
| bho | 4.0 | 23.6 | 29.2 | 22.8 |
| cv | 6.0 | 17.6 | 16.4 | 20.4 |
| mni | 3.6 | 2.8 | 4.4 | 6.0 |
| or | 2.4 | 9.6 | 12.4 | 12.0 |
| kri | 5.6 | 12.4 | 18.8 | 20.0 |
| tk | 5.2 | 27.2 | 29.2 | 28.8 |
| gom | 4.8 | 22.4 | 25.2 | 22.8 |
| ug | 6.0 | 23.2 | 29.2 | 26.4 |
| ckb | 3.2 | 25.6 | 28.0 | 27.2 |
| as | 1.2 | 5.2 | 9.2 | 4.0 |
| doi | 3.6 | 17.2 | 22.4 | 21.6 |
| dz | 4.4 | 0.8 | 0.4 | 0.0 |
| avg. | 4.5 | 18.6 | 21.4 | 19.8 |

Table 6: Performance evaluations on the complete set of low-resource languages for GSM-8K. Composing m_B with m_A as m_{A⊕B} improves performance over m_B across a majority of languages. On average, we see an improvement of 2.8%.
| Language | Overlap | Delta |
|---|---|---|
| meo | 83.17 | 1.15 |
| mfa | 75.54 | 1.25 |
| pcm | 81.28 | 1.18 |
| efi | 78.35 | 1.22 |
| min | 77.90 | 1.23 |
| ilo | 77.80 | 1.24 |
| ady | 76.21 | 1.28 |
| mai | 76.63 | 1.26 |
| nso | 69.58 | 1.40 |
| mzn | 71.32 | 1.38 |
| bew | 71.37 | 1.37 |
| ts | 61.62 | 1.55 |
| dv | 55.18 | 1.70 |
| bho | 73.67 | 1.30 |
| cv | 58.52 | 1.62 |
| mni | 58.94 | 1.60 |
| or | 68.03 | 1.45 |
| kri | 77.18 | 1.27 |
| tk | 66.06 | 1.48 |
| gom | 71.21 | 1.36 |
| ug | 57.66 | 1.65 |

Table 7: Quality evaluation for the LRL GSM-8K dataset across languages. We created the dataset by translating the original English sentences of GSM-8K into the target language using the Google Translate API. We measure quality by back-translating the obtained examples into English and computing: (i) the overlap between the back-translated and the original English sentence, and (ii) the delta change in performance when PaLM2-S is evaluated on this back-translated version of GSM-8K as compared to the original version.

Quality evaluation for LRL GSM-8K: As described in Section 4.2, we created the GSM-8K dataset (Cobbe et al., 2021) for low-resource languages by using the Google Translate API to obtain silver translations in the target language from the source English sentences in the original dataset. We perform a quality evaluation of these examples by back-translating them into English using the same translation API and defining two metrics over the result: (i) Overlap: the BLEU score between the actual example and the back-translated example, and (ii) Delta: the change in performance of the PaLM2-S model when evaluated on the original GSM-8K set as compared to the back-translated version. Table 7 shows the values of these metrics across the various languages. We see a reasonably high overlap value across all languages. At the same time, the delta in performance is also minimal, indicating that key attributes of the GSM-8K examples are not affected by translation.

Results on the complete language set: Table 6 shows the comparative evaluations on the complete set of 25 low-resource languages for which GSM evaluations are performed. We see an improvement over the anchor model m_B for 20 of 25 languages. We also compare against the fully continued pre-trained version m_B^NTL and observe that m_{A⊕B} outperforms it for 18 of 25 languages.

B QUALITATIVE ANALYSIS

Table 8 depicts a few qualitative examples for the code-to-text (code explanation) task on Python. These examples fall into the three broad buckets that we observe in cases where CALM yields the correct response: (1) neither m_A nor m_B generates the correct response, but m_{A⊕B} correctly attends over their latent representations to yield the correct output; (2) either m_A or m_B gives the correct response while the other one is incorrect, and m_{A⊕B} generates the correct response matching the generation from the correct model; and (3) both m_A and m_B generate the correct response and m_{A⊕B} reproduces those generations. We also observed similar qualitative patterns on other tasks for language inclusivity.

| Code snippet | Ground truth | m_A | m_B | CALM (m_{A⊕B}) |
|---|---|---|---|---|
| `def ConsumeBool(self): try: result = ParseBool(self.token) except ValueError as e: raise self.ParseError(str(e)) self.NextToken() return result` | Consumes a boolean | Consumes a boolean | The object is not a member | Consumes a boolean |
| `def value(self): if self.has_value: return self._impl[OBJ].get_val(K) else: raise ValueError("Not found") return` | Print an error message and exit. | [a part of the given model prefix] | Exit with error message | Print an error message and exit |
| `def get_positions(url): data = get_resource(url) positions = [x for x in data['p']] return positions` | Returns a list of positions. | Positions of specified instruments. | Get all positions. | Returns a list of positions. |
| `def distance(x0, y0, x1, y1): return sqrt(pow(x1 - x0, 2) + pow(y1 - y0, 2))` | Returns the distance between two points | Calculates the distance between two points | Return the distance between two points | Calculates the distance between two points |

Table 8: Cherry-picked qualitative examples for the code-to-text task on Python, falling into the broader buckets of patterns that we observe across examples. CALM does well in various settings: (i) when m_A produces the correct output but not m_B, (ii) vice-versa, when m_B does well, and (iii) when neither of the two base models does well but a combination of intermediate representations allows the composed model to give the correct output. This shows that composition implicitly learns to do both routing across models and a combination of their capabilities, based on the given input.
C OVERHEAD WITH CALM

In this section, we include a detailed computation of the expected parametric and training overhead of composing given models using our proposed CALM framework.

C.1 PARAMETRIC OVERHEAD

Building from the notation in §3.1, let the two models m_A and m_B have N_A and N_B standard transformer layers, respectively, with layer output dimensionalities D_A and D_B. As mentioned, we choose n = n_A = n_B layers to perform the composition. Then:

#params per f_proj layer = D_A × D_B
#params per f_cross layer = 3 × D_B²
#total added params = n × (D_A × D_B + 3 × D_B²)
#params in m_B = N_B × (V_B × D_B + 3 × D_B² + 2 × D_B × D_B × K_B),

where V_B and K_B denote the vocabulary size and the hidden (feed-forward) multiplication factor, respectively.

Let us consider some standard transformer configurations to understand the parameter overhead. As an example, consider the layer configurations of standard BERT models, BERT-small (m_A) and BERT-large (m_B), i.e., N_A = 4, D_A = 512, N_B = 24, D_B = 1024, V_B = 30K, K_B = 4. Assuming that we select all layers of m_A, the value of n is 4. Then:

#params for CALM = 4 × (512 × 1024 + 3 × 1024²) ≈ 1.5 × 10⁷ ≈ 15M
#params in m_B = 24 × (30K × 1024 + 3 × 1024² + 2 × 1024² × 4) ≈ 1B
percentage of added parameters = (15M / 1B) × 100 ≈ 1.5%

Hence, the number of parameters added during composition is approximately 1.5% of those in m_B.
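The arithmetic above can be reproduced with a few lines of Python; the helper names are ours, and the per-layer accounting simply mirrors this section.

```python
def calm_added_params(n, d_a, d_b):
    """Parameters added by CALM: one f_proj (D_A x D_B) plus cross-attention
    weights counted as 3 * D_B^2, per selected layer pair (Section C.1)."""
    return n * (d_a * d_b + 3 * d_b ** 2)

def anchor_params(n_b, d_b, v_b, k_b):
    """Mirrors Section C.1's per-layer accounting for m_B."""
    return n_b * (v_b * d_b + 3 * d_b ** 2 + 2 * d_b * d_b * k_b)

added = calm_added_params(n=4, d_a=512, d_b=1024)          # BERT-small side
base = anchor_params(n_b=24, d_b=1024, v_b=30_000, k_b=4)  # BERT-large side
print(f"{added:,} added vs {base:,} in m_B -> {100 * added / base:.2f}% overhead")
# 14,680,064 added vs 1,014,104,064 in m_B -> 1.45% overhead
```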
C.2 TRAINING OVERHEAD

Although learning the new parameters during CALM training requires back-propagating through m_B, the total training cost is still significantly lower than that of fine-tuning m_B, owing to the significantly smaller number of training examples. Firstly, as discussed in the previous section, the number of additional parameters introduced during composition is about 1.5% of those in m_B, hence a negligible parametric addition. Further, since only 5-7% of the total m_B fine-tuning data is required to train CALM, the training cost of CALM is minimal with respect to that of fine-tuning m_B. Let us assume that: (i) the number of parameters in m_B is P_B and the number of examples required to fine-tune m_B is D_B, (ii) the cost (in FLOPS) of training, C(·), scales linearly with parameters and data, i.e., C(m_B) = O(P_B · D_B), (iii) the number of parameters in m_A is 10% of those in m_B, (iv) the number of parameters added for CALM is 2% of those in m_B, and (v) the amount of data required to train CALM is 5% of that used for m_B. Then:

P_{m_{A⊕B}} = P_A + P_B + P_{Θ_C} = 0.10 P_B + P_B + 0.02 P_B = 1.12 P_B
D_{m_{A⊕B}} = 0.05 D_B
C(m_{A⊕B}) = O(P_{m_{A⊕B}} · D_{m_{A⊕B}}) = O((1.12 P_B) × (0.05 D_B)) = O(0.056 P_B · D_B) = 5.6% × O(P_B · D_B) = 5.6% × C(m_B)

Hence, the cost of training CALM is significantly lower (<10%) than that of fine-tuning m_B.
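The same back-of-the-envelope training-cost estimate, under assumptions (i)-(v) above and normalized to the cost of fine-tuning m_B:

```python
# Relative cost model from Section C.2: FLOPS ~ parameters x examples.
P_B, D_B = 1.0, 1.0        # parameters and data for fine-tuning m_B (normalized)
P_A = 0.10 * P_B           # m_A is ~10% the size of m_B
P_THETA_C = 0.02 * P_B     # CALM adds ~2% new parameters
D_CALM = 0.05 * D_B        # CALM trains on ~5% of the fine-tuning data

cost_finetune = P_B * D_B
cost_calm = (P_A + P_B + P_THETA_C) * D_CALM
print(f"CALM training cost ~= {100 * cost_calm / cost_finetune:.1f}% of fine-tuning m_B")  # 5.6%
```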