# Sequence-to-Sequence Learning with Latent Neural Grammars

Yoon Kim
MIT CSAIL
yoonkim@mit.edu

*Much of the work was completed while the author was at MIT-IBM Watson AI. Code is available at https://github.com/yoonkim/neural-qcfg.*

Sequence-to-sequence learning with neural networks has become the de facto standard for sequence prediction tasks. This approach typically models the local distribution over the next word with a powerful neural network that can condition on arbitrary context. While flexible and performant, these models often require large datasets for training and can fail spectacularly on benchmarks designed to test for compositional generalization. This work explores an alternative, hierarchical approach to sequence-to-sequence learning with quasi-synchronous grammars, where each node in the target tree is transduced by a node in the source tree. Both the source and target trees are treated as latent and induced during training. We develop a neural parameterization of the grammar which enables parameter sharing over the combinatorial space of derivation rules without the need for manual feature engineering. We apply this latent neural grammar to various domains: a diagnostic language navigation task designed to test for compositional generalization (SCAN), style transfer, and small-scale machine translation. We find that it performs respectably compared to standard baselines.

## 1 Introduction

Sequence-to-sequence learning with neural networks [62, 22, 106] encompasses a powerful and general class of methods for modeling the distribution over an output target sequence $y$ given an input source sequence $x$. Key to its success is a factorization of the output distribution via the chain rule, coupled with a richly-parameterized neural network that models the local conditional distribution over the next word given the previous words and the input. While architectural innovations such as attention [8], convolutional layers [39], and Transformers [110] have led to significant improvements, this word-by-word modeling remains core to the approach, and with good reason: since any distribution over the output can be factorized autoregressively via the chain rule, this approach should be able to well-approximate the true target distribution given large-enough data and model.[^1]

However, despite their excellent performance across key benchmarks, these models are often sample-inefficient and can moreover fail spectacularly on diagnostic tasks designed to test for compositional generalization [68, 63]. This is partially attributable to the fact that standard sequence-to-sequence models have relatively weak inductive biases (e.g. for capturing hierarchical structure [79]), which can result in learners that over-rely on surface-level (as opposed to structural) correlations.

In this work, we explore an alternative, hierarchical approach to sequence-to-sequence learning with latent neural grammars. This work departs from previous approaches in three ways. First, we model the distribution over the target sequence with a quasi-synchronous grammar [103], which assumes a hierarchical generative process whereby each node in the target tree is transduced by
nodes in the source tree. Such node-level alignments provide provenance and a causal mechanism for how each output part is generated, thereby making the generation process more interpretable. We additionally find that the explicit modeling of source- and target-side hierarchy improves compositional generalization compared to non-hierarchical models. Second, in contrast to the existing line of work on incorporating (often observed) tree structures into sequence modeling with neural networks [35, 5, 89, 37, 126, 1, 97, 18, 34, inter alia], we treat the source and target trees as fully latent and induce them during training. Finally, whereas previous work on synchronous grammars typically utilized log-linear models over handcrafted/pipelined features [20, 56, 115, 103, 112, 27, 42, inter alia], we make use of neural features to parameterize the grammar's rule probabilities, which enables efficient sharing of parameters over the combinatorial space of derivation rules without the need for any manual feature engineering. We also use the grammar directly for end-to-end generation instead of as part of a larger pipelined system (e.g. to extract alignments) [122, 41, 14].

We apply our approach to a variety of sequence-to-sequence learning tasks: the SCAN language navigation task designed to test for compositional generalization [68], style transfer on the English Penn Treebank [78], and small-scale English-French machine translation. We find that it performs respectably compared to baseline approaches.

[^1]: There are, however, weighted languages whose next-word conditional distributions are hard to compute in a formal sense, and these distributions cannot be captured by locally normalized autoregressive models unless one allows the number of parameters (or runtime) to grow superpolynomially in sequence length [72].

## 2 Neural Synchronous Grammars for Sequence-to-Sequence Learning

We use $x = x_1, \dots, x_S$ and $y = y_1, \dots, y_T$ to denote the source and target strings, and further use $s, t$ to refer to source and target trees, represented as sets of nodes including the leaves (i.e. $\mathrm{yield}(s) = x$ and $\mathrm{yield}(t) = y$).

### 2.1 Quasi-Synchronous Grammars

Quasi-synchronous grammars, introduced by Smith and Eisner [103], define a monolingual grammar over target strings conditioned on a source tree, where the grammar's rule set depends dynamically on the source tree $s$. In this paper we work with probabilistic quasi-synchronous context-free grammars (QCFG), which can be represented as a tuple $G[s] = (S, N, P, \Sigma, R[s], \theta)$, where $S$ is the distinguished start symbol, $N$ is the set of nonterminals which expand to other nonterminals, $P$ is the set of nonterminals which expand to terminals (i.e. preterminals), $\Sigma$ is the set of terminals, and $R[s]$ is a set of context-free rules conditioned on $s$, where each rule is one of

$$
\begin{aligned}
S &\rightarrow A[\alpha_i], && A \in N,\ \alpha_i \subseteq s, \\
A[\alpha_i] &\rightarrow B[\alpha_j]\,C[\alpha_k], && A \in N,\ B, C \in N \cup P,\ \alpha_i, \alpha_j, \alpha_k \subseteq s, \\
D[\alpha_i] &\rightarrow w, && D \in P,\ w \in \Sigma,\ \alpha_i \subseteq s.
\end{aligned}
$$

We use $\theta$ to parameterize the rule probabilities $p_\theta(r)$ for each $r \in R[s]$. In the above, the $\alpha_i$'s are subsets of nodes in the source tree $s$, and thus a QCFG transduces the output tree by aligning each target tree node to a subset of source tree nodes.
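To make the space of derivation rules concrete, the following toy sketch (not taken from the released neural-qcfg code) enumerates the three rule templates of $R[s]$ for a two-word source string, treating each alignment variable as a single source node, anticipating the restriction adopted below. The symbol inventories and vocabulary are made up for illustration.

```python
from itertools import product

# Toy illustration: enumerate the rule templates of R[s] for a small source
# tree over x = "jump twice". All inventories below are made up.
N = ["NT1", "NT2"]          # nonterminals
P = ["T1", "T2"]            # preterminals
Sigma = ["jump", "twice"]   # terminals

# Source-tree nodes as spans over x: the root plus the two leaves.
source_nodes = [(0, 2), (0, 1), (1, 2)]

rules = []
# S -> A[alpha_i]
for A, a_i in product(N, source_nodes):
    rules.append(("S", (A, a_i)))
# A[alpha_i] -> B[alpha_j] C[alpha_k]
for A, a_i in product(N, source_nodes):
    for (B, a_j), (C, a_k) in product(product(N + P, source_nodes), repeat=2):
        rules.append(((A, a_i), (B, a_j), (C, a_k)))
# D[alpha_i] -> w
for (D, a_i), w in product(product(P, source_nodes), Sigma):
    rules.append(((D, a_i), w))

print(len(rules))  # grows quickly with |N|, |P|, |Sigma| and the tree size
```

Even for this two-word example the binary-rule block dominates, which is why sharing parameters across rules with a neural scoring function (Section 2.2) becomes important.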
This monolingual generation process differs from that of classic synchronous context-free grammars [118], which jointly generate source and target trees in tandem (and therefore require that source and target trees be isomorphic), making QCFGs appropriate tools for tasks such as machine translation where syntactic divergences are common.[^2] Since the $\alpha_i$'s are elements of the power set of $s$, the above formulation as presented is completely intractable. We follow prior work [103, 112] and restrict the $\alpha_i$'s to be single nodes (i.e. $\alpha_i, \alpha_j, \alpha_k \in s$), which amounts to assuming that each target tree node is aligned to exactly one source tree node.

In contrast to standard, flat sequence-to-sequence models, where any hierarchical structure necessary for the task must be captured implicitly within a neural network's hidden layers, synchronous grammars explicitly model the hierarchical structure on both the source and target side, which acts as a strong source of inductive bias. This tree transduction process furthermore results in a more interpretable generation process, as each span in the target is aligned to a span in the source via node-level alignments.[^3] More generally, the grammar's rules provide a symbolic interface to the model with which to operationalize constraints and imbue inductive biases, and we show how this mechanism can be used to, for example, incorporate phrase-level copy mechanisms (Section 2.4).

[^2]: It is also possible to model syntactic divergences with richer grammatical formalisms [101, 81]. However, these approaches require more expensive algorithms for learning and inference.

[^3]: Similarly, latent variable attention [121, 7, 28, 96, 119] provides for a more interpretable generation process than standard soft attention via explicit word-level alignments.

### 2.2 Parameterization

Since each source tree node $\alpha_i$ is likely to occur only a few times (or just once) in the training corpus, parameter sharing becomes crucial. Prior work on QCFGs typically utilized log-linear models over handcrafted features to share parameters across rules [103, 42]. In this work we instead use a neural parameterization which allows for easy parameter sharing without the need for manual feature engineering. Concretely, we represent each target nonterminal and source node combination $A[\alpha_i]$ as an embedding, $\mathbf{e}_{A[\alpha_i]} = \mathbf{u}_A + \mathbf{h}_{\alpha_i}$, where $\mathbf{u}_A$ is the embedding for $A$, and $\mathbf{h}_{\alpha_i}$ is the representation of node $\alpha_i$ given by running a TreeLSTM over the source tree $s$ [107, 134]. These embeddings are then combined to produce the probability of each rule,

$$
\begin{aligned}
p_\theta(S \rightarrow A[\alpha_i]) &\propto \exp\!\big(\mathbf{u}_S^\top \mathbf{e}_{A[\alpha_i]}\big), \\
p_\theta(A[\alpha_i] \rightarrow B[\alpha_j]\,C[\alpha_k]) &\propto \exp\!\big(f_1(\mathbf{e}_{A[\alpha_i]})^\top \big(f_2(\mathbf{e}_{B[\alpha_j]}) + f_3(\mathbf{e}_{C[\alpha_k]})\big)\big), \\
p_\theta(D[\alpha_i] \rightarrow w) &\propto \exp\!\big(f_4(\mathbf{e}_{D[\alpha_i]})^\top \mathbf{u}_w + b_w\big),
\end{aligned}
$$

where $f_1, f_2, f_3, f_4$ are feedforward networks with residual layers (see Appendix A.1 for the exact parameterization). The learnable parameters in this model are therefore the nonterminal embeddings (i.e. $\mathbf{u}_A$ for $A \in \{S\} \cup N \cup P$), the terminal embeddings/biases (i.e. $\mathbf{u}_w, b_w$ for $w \in \Sigma$), and the parameters of the TreeLSTM and the feedforward networks.
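As a concrete illustration, here is a minimal PyTorch sketch of this scoring scheme under simplifying assumptions: plain two-layer MLPs stand in for the residual feedforward networks of Appendix A.1, the TreeLSTM node representations are replaced by random vectors, and all sizes and names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

D = 64                            # embedding size (assumption)
num_NT, num_PT, V = 8, 8, 100     # toy |N|, |P|, |Sigma|

u_S  = nn.Parameter(torch.randn(D))            # start-symbol embedding
u_NT = nn.Parameter(torch.randn(num_NT, D))    # nonterminal embeddings u_A
u_PT = nn.Parameter(torch.randn(num_PT, D))    # preterminal embeddings u_D
u_w  = nn.Parameter(torch.randn(V, D))         # terminal embeddings
b_w  = nn.Parameter(torch.zeros(V))            # terminal biases

def mlp():  # stand-in for the paper's residual feedforward layers
    return nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))
f1, f2, f3, f4 = mlp(), mlp(), mlp(), mlp()

# h_alpha: source-tree node representations, normally from a TreeLSTM run
# over s; faked here with random vectors for M nodes.
M = 5
h_alpha = torch.randn(M, D)

# e_{A[alpha_i]} = u_A + h_{alpha_i} for every (symbol, node) combination.
e_NT = u_NT[:, None, :] + h_alpha[None, :, :]   # (num_NT, M, D)
e_PT = u_PT[:, None, :] + h_alpha[None, :, :]   # (num_PT, M, D)

# Root rules S -> A[alpha_i]: normalize over all (A, alpha_i).
root_logits = (e_NT * u_S).sum(-1)              # (num_NT, M)
root_probs  = root_logits.flatten().softmax(-1)

# Binary rules A[alpha_i] -> B[alpha_j] C[alpha_k]:
# score = f1(e_A)^T (f2(e_B) + f3(e_C)), normalized over right-hand sides.
e_all  = torch.cat([e_NT, e_PT], dim=0)         # (num_NT + num_PT, M, D)
parent = f1(e_NT).reshape(-1, D)                # (num_NT * M, D)
left   = f2(e_all).reshape(-1, D)
right  = f3(e_all).reshape(-1, D)
pair   = left[:, None, :] + right[None, :, :]   # all (B[a_j], C[a_k]) pairs
bin_logits = torch.einsum("pd,lrd->plr", parent, pair)
bin_probs  = bin_logits.reshape(parent.shape[0], -1).softmax(-1)

# Terminal rules D[alpha_i] -> w: normalize over the vocabulary.
term_logits = f4(e_PT).reshape(-1, D) @ u_w.T + b_w   # (num_PT * M, V)
term_probs  = term_logits.softmax(-1)
```

Each softmax normalizes over the appropriate set of right-hand sides: root rules over all $(A, \alpha_i)$ pairs, binary rules over all $(B[\alpha_j], C[\alpha_k])$ pairs given the parent $A[\alpha_i]$, and preterminal rules over the vocabulary.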
### 2.3 Learning and Inference

The QCFG described above defines a distribution over target trees (and, by marginalization, target strings) given a source tree. While prior work on QCFGs typically relied on an off-the-shelf parser over the source to obtain its parse tree, this limits the generality of the approach. In this work, we learn a probabilistic source-side parser along with the QCFG. This parser is a monolingual PCFG with parameters $\phi$ that defines a posterior distribution $p_\phi(s \mid x)$ over binary parse trees given source strings. Our PCFG uses the neural parameterization from Kim et al. [64]. With the parser in hand, we are now ready to define the log marginal likelihood,

$$
\log p_{\theta,\phi}(y \mid x) = \log \sum_{s \in T(x)} \sum_{t \in T(y)} p_\theta(t \mid s)\, p_\phi(s \mid x).
$$

Here $T(x)$ and $T(y)$ are the sets of trees whose yields are $x$ and $y$ respectively. Unlike in synchronous context-free grammars, it is not possible to efficiently marginalize over both $T(y)$ and $T(x)$ due to the non-isomorphic assumption. However, we observe that the inner summation $\sum_{t \in T(y)} p_\theta(t \mid s) = p_\theta(y \mid s)$ can be computed with the usual inside algorithm [9] in $O(|N|(|N| + |P|)^2 S^3 T^3)$, where $S$ is the source length and $T$ is the target length. This motivates the following lower bound on the log marginal likelihood,

$$
\log p_{\theta,\phi}(y \mid x) \ge \mathbb{E}_{s \sim p_\phi(s \mid x)}\big[\log p_\theta(y \mid s)\big],
$$

which is obtained by the usual application of Jensen's inequality (see Appendix A.2).[^4]

[^4]: As is standard in variational approaches, one can tighten this bound with the use of a variational distribution $q_\psi(s \mid x, y)$, which results in the following evidence lower bound, $\log p_{\theta,\phi}(y \mid x) \ge \mathbb{E}_{s \sim q_\psi(s \mid x, y)}[\log p_\theta(y \mid s)] - \mathrm{KL}[q_\psi(s \mid x, y)\,\|\,p_\phi(s \mid x)]$. This is equivalent to our objective if we set $q_\psi(s \mid x, y) = p_\phi(s \mid x)$. Rearranging some terms, we then have $\mathbb{E}_{s \sim p_\phi(s \mid x)}[\log p_\theta(y \mid s)] = \log p_{\theta,\phi}(y \mid x) - \mathrm{KL}[p_\phi(s \mid x)\,\|\,p_{\theta,\phi}(s \mid x, y)]$. Hence, using $p_\phi(s \mid x)$ as the variational distribution encourages learning a model which achieves good likelihood but at the same time has a posterior distribution $p_{\theta,\phi}(s \mid x, y)$ that is close to the prior $p_\phi(s \mid x)$ (i.e. a model where most of the uncertainty about $s$ is captured by $x$ alone). This is arguably reasonable for many language applications since parse trees are often assumed to be task-agnostic.

An unbiased Monte Carlo estimator for the gradient with respect to $\theta$ is straightforward to compute given a sample from $p_\phi(s \mid x)$, since we can just backpropagate through the inside algorithm. For the gradient with respect to $\phi$, we use the score function estimator with a self-critical baseline [92],

$$
\nabla_\phi\, \mathbb{E}_{s \sim p_\phi(s \mid x)}\big[\log p_\theta(y \mid s)\big] \approx \big(\log p_\theta(y \mid s') - \log p_\theta(y \mid \hat{s})\big)\, \nabla_\phi \log p_\phi(s' \mid x),
$$

where $s'$ is a sample from $p_\phi(s \mid x)$ and $\hat{s}$ is the MAP tree from $p_\phi(s \mid x)$. We also found it important to regularize the source parser by simultaneously training it as a monolingual PCFG, and therefore add $\nabla_\phi \log p_\phi(x)$ to the gradient expression above.[^5] Obtaining the sample tree $s'$, the argmax tree $\hat{s}$, and scoring the sampled tree $\log p_\phi(s' \mid x)$ all require $O(S^3)$ dynamic programs. Hence the runtime is still dominated by the $O(S^3 T^3)$ dynamic program needed to compute $\log p_\theta(y \mid s')$ and $\log p_\theta(y \mid \hat{s})$.[^6] We found this to be manageable on modern GPUs with a vectorized implementation of the inside algorithm. Our implementation uses the Torch-Struct library [93].

**Predictive inference.** For decoding, we first run MAP inference with the source parser to obtain $\hat{s} = \operatorname{argmax}_s p_\phi(s \mid x)$. Given $\hat{s}$, finding the most probable sequence $\operatorname{argmax}_y p_\theta(y \mid \hat{s})$ (i.e. the consensus string of the grammar $G[\hat{s}]$) is still difficult, and in fact NP-hard [102, 16, 77]. We therefore resort to an approximate decoding scheme where we sample $K$ target trees $t^{(1)}, \dots, t^{(K)}$ from $G[\hat{s}]$, rescore the yields of the sampled trees, and return the tree whose yield has the lowest perplexity.
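To show how these pieces might fit together, here is a rough sketch of one gradient step on the surrogate objective. It assumes the caller supplies wrappers around the relevant dynamic programs (for example built on Torch-Struct); the helper names below are placeholders and do not reflect the released code's API.

```python
import torch

# Sketch of one training step for the lower bound in Section 2.3. The five
# callables are placeholders the caller must supply, e.g. wrappers around
# Torch-Struct inside-algorithm routines.
def train_step(x, y, optimizer,
               parser_sample,     # x -> sampled source tree s' ~ p_phi(. | x)
               parser_argmax,     # x -> MAP source tree s_hat
               parser_log_prob,   # (s, x) -> log p_phi(s | x), differentiable in phi
               parser_marginal,   # x -> log p_phi(x), the monolingual PCFG term
               qcfg_log_lik):     # (y, s) -> log p_theta(y | s), differentiable in theta
    optimizer.zero_grad()

    s_sample = parser_sample(x)
    s_map = parser_argmax(x)

    ll_sample = qcfg_log_lik(y, s_sample)      # O(S^3 T^3) inside pass
    with torch.no_grad():
        ll_map = qcfg_log_lik(y, s_map)        # self-critical baseline value

    # Gradient w.r.t. theta: backpropagate through the inside algorithm on the sample.
    recon_loss = -ll_sample

    # Gradient w.r.t. phi: score-function estimator with self-critical baseline,
    # (log p_theta(y|s') - log p_theta(y|s_hat)) * grad_phi log p_phi(s' | x).
    reward = (ll_sample - ll_map).detach()
    score_fn_loss = -reward * parser_log_prob(s_sample, x)

    # Regularize the parser by also training it as a monolingual PCFG.
    pcfg_loss = -parser_marginal(x)

    loss = recon_loss + score_fn_loss + pcfg_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the baseline term enters only through the detached reward, so it reduces the variance of the $\phi$ gradient without contributing a gradient of its own.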
### 2.4 Extensions

Here we show that the formalism of synchronous grammars provides a flexible interface with which to interact with the model.

**Phrase-level copying.** Incorporating copy mechanisms into sequence-to-sequence models has led to significant improvements for tasks where there is overlap between the source and target sequences [58, 83, 48, 73, 47, 95]. These models typically define a latent variable at each time step that learns to decide whether to copy from the source or generate from the target vocabulary. While useful, most existing copy mechanisms can only copy singletons due to the word-level encoder/decoder.[^7] In contrast, the hierarchical generative process of QCFGs makes it convenient to incorporate phrase-level copy mechanisms by using a special-purpose nonterminal/preterminal that always copies the yield of the source subtree that it is combined with. Concretely, letting $A_{\text{COPY}} \in N$ be a COPY nonterminal, we can expand the rule set $R[s]$ to include rules of the form $A_{\text{COPY}}[\alpha_i] \rightarrow v$ for $v \in \Sigma^+$, and define the probabilities to be

$$
p_\theta(A_{\text{COPY}}[\alpha_i] \rightarrow v) \stackrel{\text{def}}{=} \mathbb{1}\{v = \mathrm{yield}(\alpha_i)\}.
$$

(The preterminal copy mechanism is similarly defined.) Computing $p_\theta(y \mid s)$ in this modified grammar requires a straightforward modification of the inside algorithm.[^8] In our style transfer experiments in Section 3.2 we show that this phrase-level copying is important for obtaining good performance. While not explored in the present work, such a mechanism could readily be employed to incorporate external transformation rules (e.g. from bilingual lexicons or transliteration tables) into the modeling process, which has previously been investigated at the singleton level [88, 3].

**Adding constraints on rules.** For some applications we may want to place additional restrictions on the rule set to operationalize domain-specific constraints and inductive biases. For example, requiring $\alpha_j, \alpha_k \in \mathrm{descendant}(\alpha_i)$ for rules of the form $A[\alpha_i] \rightarrow B[\alpha_j]\,C[\alpha_k]$ would constrain the target tree hierarchy to respect the source tree hierarchy, while restricting $\alpha_i$ to source terminals (i.e. $\alpha_i \in \mathrm{yield}(s)$) for rules of the form $D[\alpha_i] \rightarrow w$ would enforce that each target terminal be aligned to a source terminal. We indeed make use of such restrictions in our experiments.

**Incorporating autoregressive language models.** Finally, we remark that simple extensions of the QCFG can incorporate standard autoregressive language models. Let $p_{\text{LM}}(w \mid \gamma)$ be a distribution over the next word given by a (potentially conditional) language model given arbitrary context γ (e.g. γ = y