# Layer-Wise Representation Fusion for Compositional Generalization

Yafang Zheng¹,²*, Lei Lin¹,²,³*, Shuangtao Li¹,², Yuxuan Yuan¹,², Zhaohong Lai¹,², Shan Liu¹,², Biao Fu¹,², Yidong Chen¹,², Xiaodong Shi¹,²

¹ Department of Artificial Intelligence, School of Informatics, Xiamen University
² Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China
³ Kuaishou Technology, Beijing, China

{zhengyafang, linlei}@stu.xmu.edu.cn, {ydchen, mandel}@xmu.edu.cn

*These authors contributed equally. Corresponding author.

## Abstract

Existing neural models have been shown to struggle with compositional generalization (CG), i.e., the ability to systematically generalize to unseen compositions of seen components. A key reason for failure on CG is that the syntactic and semantic representations of sequences in the uppermost layers of both the encoder and the decoder are entangled. However, previous work concentrates on separating the learning of syntax and semantics rather than exploring the reasons behind this representation entanglement (RE) problem in order to solve it. We explain why RE arises by analyzing how representations evolve from the bottom to the top of the Transformer layers. We find that the shallow residual connections within each layer fail to fuse information from previous layers effectively, leading to information forgetting between layers and, in turn, the RE problem. Inspired by this, we propose LRF, a novel Layer-wise Representation Fusion framework for CG, which learns to fuse information from previous layers back into the encoding and decoding process by introducing a fuse-attention module at each encoder and decoder layer. LRF achieves promising results on two realistic benchmarks, empirically demonstrating the effectiveness of our proposal. Code is available at https://github.com/thinkaboutzero/LRF.

## Introduction

The remarkable progress of sequence-to-sequence (seq2seq) models in language modeling has been primarily attributed to their ability to learn intricate patterns and representations from vast amounts of data (Sutskever, Vinyals, and Le 2014; Dong and Lapata 2016; Vaswani et al. 2017). However, a critical challenge that remains unsolved for neural sequence models is the ability to understand and produce novel combinations of known components (Fodor and Pylyshyn 1988; Lake et al. 2017), i.e., compositional generalization (CG). For example, if a person knows "the doctor has lunch" [Der Arzt hat Mittagessen] and "the lawyer" [Der Anwalt], where the segment in brackets denotes the German translation, then it is natural for the person to know that the translation of "the lawyer has lunch" is [Der Anwalt hat Mittagessen], even though they have never seen it before.

Figure 1: Comparison between (a) the standard Transformer and (b) LRF (ours). (a) illustrates the standard architecture of the Transformer model and (b) illustrates the architecture of our method, which fuses representations from previous layers at each layer effectively.
Such an ability is essential for models to perform robustly in real-world scenarios, since even massive training data cannot cover the potentially infinite number of novel combinations.

Recent studies have demonstrated that a key reason for failure on CG is that the syntactic and semantic representations of sequences in the uppermost layers of both the encoder and the decoder are entangled, i.e., the encoder or decoder RE problem (Russin et al. 2019; Li et al. 2019; Raunak et al. 2019; Liu et al. 2020a, 2021; Mittal et al. 2022; Zheng and Lapata 2022; Thrush 2020; Akyurek and Andreas 2021; Yao and Koller 2022; Yin et al. 2022). To alleviate it, one line of research on CG concentrates on utilizing separate syntactic and semantic representations. Specifically, these approaches either produce two separate syntactic and semantic representations and then compose them appropriately (Li et al. 2019; Russin et al. 2019; Jiang and Bansal 2021), or design external modules and then employ a multi-stage generation process (Liu et al. 2020b, 2021; Ruis and Lake 2022; Li et al. 2022; Cazzaro et al. 2023). However, they focus on separating the learning of syntax and semantics. Different from them, we explore the reasons behind the RE problem in order to solve it.

We draw insights from the observation that residual connections are shallow, and from analyses of the Transformer (Peters et al. 2018; He, Tan, and Qin 2019; Voita, Sennrich, and Titov 2019; Belinkov et al. 2020) showing that its bottom layers contain more syntactic information while its top layers contain more semantic information. As shown in Figure 1a, the residual connections within each layer are themselves shallow and only pass through simple, one-step operations (Yu et al. 2018), which makes the model forget distant layers and fail to fuse information from previous layers effectively (Bapna et al. 2018; Dou et al. 2018; Wang et al. 2018, 2019). Furthermore, the RE problem is, at its core, a lack of effective fusion of the syntactic and semantic information stored in different layers, since representations become gradually entangled from the bottom to the top of the Transformer layers (Raunak et al. 2019; Russin et al. 2019; Zheng and Lapata 2022). Based on the above findings, we hypothesize that the shallow residual connections within each layer are one of the causes of the RE problem.

To this end, we propose LRF, a novel Layer-wise Representation Fusion framework for CG, which learns to fuse information from previous layers at each layer effectively. Specifically, we extend the base model by fusing information from previous layers back into the encoding and decoding process through a fuse-attention module introduced in each encoder and decoder layer. Experimental results on CFQ (Keysers et al. 2020) (semantic parsing) and CoGnition (Li et al. 2021) (machine translation, MT) empirically show that our method achieves better generalization performance, outperforming competitive baselines and other techniques. Notably, LRF achieves 20.0% and 50.3% instance-level and aggregate-level error rates on CoGnition (about 30% and 20% relative improvements, respectively). Extensive analyses demonstrate that effectively fusing information from previous layers at each layer leads to better generalization results.

## Methodology

We adopt the Transformer (Vaswani et al. 2017) as a running example for clarity; however, our proposed method is applicable to any seq2seq model. In the following, we first introduce the Transformer and then our proposed LRF.
### Transformer

Given a source sentence $X = \{x_1, \ldots, x_S\}$ and a target sentence $Y = \{y_1, \ldots, y_T\}$, where $S$ and $T$ denote the number of source and target tokens, respectively, the Transformer encoder first maps $X$ to an embedding matrix $H^0$, and then takes $H^0$ as input and outputs a contextualized representation $H^L \in \mathbb{R}^{d \times S}$, where $d$ and $L$ denote the hidden size and the number of encoder layers. Similarly, the Transformer decoder first maps $Y$ to an embedding matrix $\hat{H}^0$, and then takes $\hat{H}^0$ as input and outputs a contextualized representation $\hat{H}^L \in \mathbb{R}^{d \times T}$.

Figure 2: Architecture of LRF based on the Transformer. The dotted boxes denote the fuse-attention module.

**Attention Mechanism.** An attention function maps a query ($Q$) and a set of key-value ($K$-$V$) pairs to an output. Formally, given $Q$, $K$, and $V$, the scaled dot-product attention mechanism is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \tag{1}$$

where $d_k$ is the dimension of $K$. A typical extension of the above is multi-head attention (MHA), where multiple linear projections are executed in parallel. The calculation process is as follows:

$$\mathrm{MHA}(Q, K, V) = [\mathrm{head}_1; \ldots; \mathrm{head}_h]W^O, \tag{2}$$

$$\mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i), \tag{3}$$

where $W^Q_i \in \mathbb{R}^{d \times d_k}$, $W^K_i \in \mathbb{R}^{d \times d_k}$, $W^V_i \in \mathbb{R}^{d \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d}$ are model parameters, and $h$ denotes the number of heads.

**Layer Structure.** The Transformer encoder has $L$ identical layers, each consisting of two sub-layers (self-attention and feed-forward networks). The Transformer decoder also has $L$ identical layers, each consisting of three sub-layers (masked self-attention, cross-attention, and feed-forward networks). In the $l$-th self-attention sub-layer of the encoder, the query, key, and value are all the hidden states output by the previous layer, $H^{l-1}$; the self-attention mechanism in the decoder operates in a similar manner:

$$H^l_a = \mathrm{MHA}(H^{l-1}, H^{l-1}, H^{l-1}), \tag{4}$$

$$\hat{H}^l_a = \mathrm{MHA}(\hat{H}^{l-1}, \hat{H}^{l-1}, \hat{H}^{l-1}). \tag{5}$$

In the $l$-th cross-attention sub-layer of the decoder, the query is the hidden states output by the $l$-th self-attention sub-layer, $\hat{H}^l_a$, and the key and value are the hidden states output by the uppermost layer of the encoder, $H^L$:

$$\hat{H}^l_{ca} = \mathrm{MHA}(\hat{H}^l_a, H^L, H^L). \tag{6}$$

The feed-forward sub-layer is a two-layer transformation with a ReLU activation function:

$$H^l = W^l_2\,\mathrm{ReLU}(W^l_1 H^l_a + b^l_1) + b^l_2, \tag{7}$$

$$\hat{H}^l = W^l_4\,\mathrm{ReLU}(W^l_3 \hat{H}^l_{ca} + b^l_3) + b^l_4, \tag{8}$$

where $W^l_1$, $b^l_1$, $W^l_2$, $b^l_2$, $W^l_3$, $b^l_3$, $W^l_4$, and $b^l_4$ are all trainable model parameters. We omit layer normalization and residual connections for brevity.
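To make the notation above concrete, the following is a minimal sketch of Eqs. 1–4 and 7 in PyTorch-style code. It is an illustration under our own naming conventions, not code from the authors' repository, and it omits layer normalization and residual connections exactly as the equations above do.

```python
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """Multi-head attention (Eqs. 1-3); the per-head projections W^Q_i, W^K_i,
    W^V_i are packed into single linear layers, and w_o plays the role of W^O."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        # q: (batch, len_q, d_model); k, v: (batch, len_kv, d_model)
        B, Lq, _ = q.shape
        Lk = k.shape[1]

        def split(x, length):
            # (batch, length, d_model) -> (batch, heads, length, d_k)
            return x.view(B, length, self.n_heads, self.d_k).transpose(1, 2)

        q = split(self.w_q(q), Lq)
        k = split(self.w_k(k), Lk)
        v = split(self.w_v(v), Lk)
        # Eq. 1: softmax(Q K^T / sqrt(d_k)) V, computed per head.
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        heads = F.softmax(scores, dim=-1) @ v
        # Eq. 2: concatenate the heads and project with W^O.
        return self.w_o(heads.transpose(1, 2).reshape(B, Lq, -1))


class EncoderLayer(nn.Module):
    """Standard encoder layer: self-attention (Eq. 4) followed by the
    two-layer feed-forward transformation (Eq. 7)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, h_prev):
        h_a = self.self_attn(h_prev, h_prev, h_prev)  # Eq. 4
        return self.ffn(h_a)                          # Eq. 7
```

In the full Transformer, each sub-layer is additionally wrapped in a residual connection and layer normalization; they are left out here only to mirror the simplified equations.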
### Layer-Wise Representation Fusion (LRF)

Our proposed LRF extends the Transformer by introducing a fuse-attention module below the feed-forward module in each encoder and decoder layer, which learns to fuse information from previous layers at each encoder and decoder layer effectively.

**Fuse-Attention Module.** In the $l$-th encoder layer, the outputs of all previous layers are stacked as $H_{plo} \in \mathbb{R}^{d \times l}$, where $l$ is the number of previous layers (including the embedding layer). The fuse-attention module fuses different aspects of linguistic information from previous layers via the multi-head attention mechanism:

$$H^l_p = \mathrm{MHA}(H^l_a, H_{plo}, H_{plo}), \tag{9}$$

where $H_{plo} = \{H^0, \ldots, H^{l-1}\}$, and the localness of the fuse-attention module is implemented with a mask mechanism. The output $H^l_p$ is fed into the $l$-th feed-forward sub-layer of the encoder (Eq. 7).

In the $l$-th decoder layer, the outputs of all previous layers are stacked as $\hat{H}_{plo} \in \mathbb{R}^{d \times l}$, where $l$ is again the number of previous layers (including the embedding layer), and the fuse-attention module operates analogously:

$$\hat{H}^l_p = \mathrm{MHA}(\hat{H}^l_{ca}, \hat{H}_{plo}, \hat{H}_{plo}), \tag{10}$$

where $\hat{H}_{plo} = \{\hat{H}^0, \ldots, \hat{H}^{l-1}\}$, and the localness of the fuse-attention module is likewise implemented with a mask mechanism. The output $\hat{H}^l_p$ is fed into the $l$-th feed-forward sub-layer of the decoder (Eq. 8).

The differences between LRF and the Transformer are illustrated by the dotted boxes in Figure 2. By introducing a fuse-attention module into every encoder and decoder layer, each layer of the encoder and decoder is able to access and fuse information from previous layers effectively.

**Training.** Formally, let $D = \{(X, Y)\}$ denote the training corpus and $V$ the vocabulary of $D$. LRF aims to estimate the conditional probability $p(y_1, \ldots, y_T \mid x_1, \ldots, x_S)$, where $(x_1, \ldots, x_S)$ is an input sequence and $(y_1, \ldots, y_T)$ is its corresponding output sequence:

$$p(Y \mid X; \{\theta^{(0)}, \theta^{+}\}) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, X; \{\theta^{(0)}, \theta^{+}\}).$$
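For illustration, below is a minimal sketch of how the fuse-attention module of Eq. 9 could be wired into an encoder layer. This is not the authors' released implementation (see the repository linked in the abstract for that); in particular, stacking the previous layers' outputs along a new axis and letting each position attend only to its own representations across layers is our reading of the "localness" constraint, realized here by construction rather than with an explicit attention mask. All class and function names are our own.

```python
import torch
import torch.nn as nn


class FuseAttention(nn.Module):
    """Hypothetical fuse-attention (Eq. 9): each position queries the
    representations that all previous layers (including the embedding layer)
    produced for that same position."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h_a, prev_outputs):
        # h_a: (batch, seq, d) from self-attention;
        # prev_outputs: list of l tensors, each (batch, seq, d).
        B, S, D = h_a.shape
        h_plo = torch.stack(prev_outputs, dim=2)           # (B, S, l, D)
        q = h_a.reshape(B * S, 1, D)                       # one query per token position
        kv = h_plo.reshape(B * S, len(prev_outputs), D)    # that position's history across layers
        fused, _ = self.attn(q, kv, kv)                    # Eq. 9
        return fused.reshape(B, S, D)


class LRFEncoderLayer(nn.Module):
    """Encoder layer with the fuse-attention module inserted below the
    feed-forward sub-layer, as described in the text."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse_attn = FuseAttention(d_model, n_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, h_prev, prev_outputs):
        h_a, _ = self.self_attn(h_prev, h_prev, h_prev)    # Eq. 4
        h_p = self.fuse_attn(h_a, prev_outputs)            # Eq. 9
        return self.ffn(h_p)                               # Eq. 7 applied to H^l_p


def encode(layers, h0):
    """Run the encoder while keeping every layer's output (embeddings first)
    available to the fuse-attention modules of later layers."""
    outputs, h = [h0], h0
    for layer in layers:
        h = layer(h, outputs)
        outputs.append(h)
    return h
```

A decoder layer would follow the same pattern, with the fuse-attention query taken from the cross-attention output $\hat{H}^l_{ca}$ as in Eq. 10.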