Published as a conference paper at ICLR 2024

ENSEMBLE DISTILLATION FOR UNSUPERVISED CONSTITUENCY PARSING

Behzad Shayegh1, Yanshuai Cao2, Xiaodan Zhu3,4, Jackie C.K. Cheung5,6, Lili Mou1,6
1Dept. Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta
2Borealis AI
3Dept. Electrical and Computer Engineering, Queen's University
4Ingenuity Labs Research Institute, Queen's University
5Quebec Artificial Intelligence Institute (MILA), McGill University
6Canada CIFAR AI Chair
the.shayegh@gmail.com  yanshuai.cao@borealisai.com  xiaodan.zhu@queensu.ca
jcheung@cs.mcgill.ca  doublepower.mou@gmail.com

ABSTRACT

We investigate the unsupervised constituency parsing task, which organizes words and phrases of a sentence into a hierarchical structure without using linguistically annotated data. We observe that existing unsupervised parsers capture different aspects of parsing structures, which can be leveraged to enhance unsupervised parsing performance. To this end, we propose a notion of tree averaging, based on which we further propose a novel ensemble method for unsupervised parsing. To improve inference efficiency, we further distill the ensemble knowledge into a student model; such an ensemble-then-distill process is an effective approach to mitigating the over-smoothing problem that exists in common multi-teacher distilling methods. Experiments show that our method surpasses all previous approaches, consistently demonstrating its effectiveness and robustness across various runs, with different ensemble components, and under domain-shift conditions.¹

Work partially done during a Mitacs internship at Borealis AI.
¹Code available at https://github.com/MANGA-UOFA/ED4UCP

1 INTRODUCTION

Constituency parsing is a well-established task in natural language processing (NLP), which interprets a sentence and induces its constituency tree, a syntactic structure representation that organizes words and phrases into a hierarchy (Chomsky, 1967). It has wide applications in various downstream tasks, including semantic role labeling (Mohammadshahi & Henderson, 2023) and explainability of AI models (Tenney et al., 2019; Wu et al., 2022).

Traditionally, parsing is accomplished by supervised models trained with linguistically annotated treebanks (Charniak, 2000), which are expensive to obtain and may not be available for low-resource scenarios. Also, these supervised parsers often underperform when encountering domain shifts. This motivates researchers to explore unsupervised methods, as they eliminate the need for annotated data.

To address unsupervised parsing, researchers have proposed various heuristics and indirect supervision signals. Clark (2001) employs context distribution clustering to induce a probabilistic context-free grammar (PCFG; Booth, 1969). Klein & Manning (2002) define a joint distribution for sentences and parse structures, the latter learned by expectation maximization (EM) algorithms. Snyder et al. (2009) further extend unsupervised parsing to the multilingual setting with bilingual supervision. In the deep learning era, unsupervised parsing techniques keep advancing. Cao et al. (2020) utilize linguistic constituency tests (Chomsky, 1967) as heuristics, evaluating all spans as potential constituents for selection. Li & Lu (2023) modify each span based on linguistic perturbations and observe changes in the contextual representations of a masked language model; according to the level of distortion, they determine how likely the span is a constituent.
Maveli & Cohen (2022) use rules to train two classifiers with local features and contextual features, respectively, which are further refined in a co-training fashion. Another way to obtain the parsing structure in an unsupervised way is to treat it as a latent variable and train it in downstream tasks, such as text classification (Li et al., 2019), language modeling (Shen et al., 2019; Kim et al., 2019b), and sentence reconstruction (Drozdov et al., 2019; Kim et al., 2019a). Overall, unsupervised parsing is made feasible by such heuristics and indirect supervision, and has become a curious research direction in NLP.

               Compound PCFG   DIORA   S-DIORA
Compound PCFG  100
DIORA          55.8            100
S-DIORA        58.1            63.6    100

               DIORA (run 1)   DIORA (run 2)   DIORA (run 3)
DIORA (run 1)  100
DIORA (run 2)  74.1            100
DIORA (run 3)  74.3            74.9            100

Table 1: Correlation analysis of unsupervised parsers. Numbers are the F1 score of one model against another on the Penn Treebank dataset (Marcus et al., 1993). The left table considers three heterogeneous models (Compound PCFG, DIORA, and S-DIORA), whereas the right table considers three runs of the same model (DIORA). All their F1 scores against the ground truth fall within the range of 59–61, thus providing a controlled experimental setting.

In our work, we uncover an intriguing phenomenon of low correlation among different unsupervised parsers, despite their similar overall F1 scores (the main evaluation metric for parsing), shown in Table 1. While Williams et al. (2018) have shown low self-consistency in early latent-tree models, we go further and show that the correlation among different models is even lower than that among restarts of the same model. This suggests that existing unsupervised parsers capture different aspects of the structures, and our insight is that combining these parsers may leverage their different expertise to achieve higher performance for unsupervised parsing.

To this end, we propose an ensemble method for unsupervised parsing. We first introduce a notion of tree averaging based on the similarity of two constituency trees. Given a few existing unsupervised parsers, referred to as teachers,² we then propose to use a CYK-like algorithm (Kasami, 1966; Younger, 1967; Manacher, 1978; Sennrich, 2014) that utilizes dynamic programming to search for the tree that is most similar to all teachers' outputs. In this way, we are able to obtain an average parse tree, taking advantage of different existing unsupervised parsers.

²Our full approach involves training a student model from the ensemble; thus, it is appropriate to use the term teacher for an ensemble component.

To improve inference efficiency, we distill our ensemble knowledge into a student model. In particular, we choose the recurrent neural network grammar (RNNG; Dyer et al., 2016) with an unsupervised self-training procedure (URNNG; Kim et al., 2019b), following the common practice in unsupervised parsing (Kim et al., 2019a; Cao et al., 2020). Our ensemble-then-distill process is able to mitigate the over-smoothing problem, where the standard cross-entropy loss encourages the student to learn an overly smooth distribution (Wen et al., 2023b). Such a problem exists in common multi-teacher distilling methods (Wu et al., 2021), and would be especially severe when the teachers are heterogeneous, signifying the importance of our approach.

We evaluated our ensemble method on the Penn Treebank (PTB; Marcus et al., 1993) and SUSANNE (Sampson, 2002) corpora.
Results show that our approach outperforms existing unsupervised parsers by a large margin in terms of F1 scores, and that it achieves results comparable to the supervised counterpart in the domain-shift setting. Overall, our paper largely bridges the gap between supervised and unsupervised constituency parsing.

In short, the main contributions of this paper include: 1) We reveal an intriguing phenomenon that existing unsupervised parsers have diverse expertise, which may be leveraged by model ensembles; 2) We propose a notion of tree averaging and utilize a CYK-like algorithm that searches for the average tree of existing unsupervised parsers; and 3) We propose an ensemble-then-distill approach to improve inference efficiency and to alleviate the over-smoothing problem in common multi-teacher distilling approaches.

2.1 UNSUPERVISED CONSTITUENCY PARSING

In linguistics, a constituent refers to a word or a group of words that function as a single unit in a hierarchical tree structure (Chomsky, 1967). In the sentence The quick brown fox jumps over the lazy dog, for example, the phrase the lazy dog serves as a noun phrase constituent, whereas jumps over the lazy dog is a verb phrase constituent.

In this study, we address unsupervised constituency parsing, where no linguistic annotations are used for training. This reduces human labor and is potentially useful for low-resource languages. Following most previous work in this direction (Cao et al., 2020; Maveli & Cohen, 2022; Li & Lu, 2023), we focus on unlabeled, binary parse trees, in which each constituent has a binary branching and is not labeled with its syntactic role (such as a noun phrase or a verb phrase).

The standard evaluation metric for constituency parsing is the F1 score, which is the harmonic mean of precision and recall (Shen et al., 2018; Zhang et al., 2021):

$$
P = \frac{|C(T_{\mathrm{pred}}) \cap C(T_{\mathrm{ref}})|}{|C(T_{\mathrm{pred}})|}, \quad
R = \frac{|C(T_{\mathrm{pred}}) \cap C(T_{\mathrm{ref}})|}{|C(T_{\mathrm{ref}})|}, \quad
F_1 = \frac{2PR}{P + R}
\tag{1}
$$

where $T_{\mathrm{pred}}$ and $T_{\mathrm{ref}}$ are the predicted and reference trees, respectively, and $C(T)$ is the set of constituents in a tree $T$.

2.2 A NOTION OF AVERAGING CONSTITUENCY TREES

In our study, we propose an ensemble approach to combine the expertise of existing unsupervised parsers (called teachers), as we observe they have low correlation among themselves despite their similar overall F1 scores (Table 1). To accomplish ensemble binary constituency parsing, we need to define a notion of tree averaging; that is, our ensemble output is the average tree that is most similar to all teachers' outputs. Inspired by the evaluation metric, we suggest the average tree should have the highest total F1 score compared with different teachers.

Let $s$ be a sentence and $T_k$ be the $k$th teacher parser. Given $K$ teachers, we define the average tree to be

$$
\mathrm{AvgTree}\big(s, \{T_k\}_{k=1}^{K}\big) = \operatorname*{arg\,max}_{T \in \mathcal{T}(s)} \sum_{k=1}^{K} F_1\big(T, T_k(s)\big)
\tag{2}
$$

where $\mathcal{T}(s)$ is the set of all possible unlabeled binary trees on sentence $s$, and $T_k(s)$ is the $k$th teacher's output. It is emphasized that only trees of the same sentence can be averaged. This simplifies the F1 score of binary trees, as the denominators for both precision and recall are $2|s| - 1$ for a sentence with $|s|$ words, i.e., $|C(T_{\mathrm{pred}})| = |C(T_{\mathrm{ref}})| = 2|s| - 1$.
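As a minimal illustration of the metric (not part of our released code; the nested-tuple tree encoding and the helper names constituents and f1 below are for exposition only), the following Python sketch collects the constituent set $C(T)$ of an unlabeled binary tree and computes the F1 of Eqn. (1):

```python
# Illustrative sketch only; the nested-tuple tree encoding and helper
# names below are expository assumptions, not the paper's released code.

def constituents(tree):
    """Collect the constituent set C(T) of an unlabeled binary tree.

    A tree is a nested tuple of word indices, e.g. ((0, 1), (2, (3, 4)))
    for a 5-word sentence. Single words and the whole sentence count as
    constituents, so |C(T)| = 2|s| - 1.
    """
    spans = set()

    def visit(node):
        if isinstance(node, int):        # leaf: a single word
            return node, node + 1
        left, right = node
        b, _ = visit(left)
        _, e = visit(right)
        spans.add((b, e))                # internal constituent
        return b, e

    b, e = visit(tree)
    spans.add((b, e))                    # whole-sentence span (1-word case)
    spans.update((i, i + 1) for i in range(b, e))   # single-word spans
    return spans


def f1(pred, ref):
    """F1 of Eqn. (1) between two trees over the same sentence."""
    c_pred, c_ref = constituents(pred), constituents(ref)
    hits = len(c_pred & c_ref)
    p, r = hits / len(c_pred), hits / len(c_ref)
    return 2 * p * r / (p + r) if p + r else 0.0
```

For example, the two 5-word trees ((0, 1), (2, (3, 4))) and (0, ((1, 2), (3, 4))) share 7 of their 9 constituents each, giving P = R = F1 = 7/9; since $|C(T_{\mathrm{pred}})| = |C(T_{\mathrm{ref}})| = 2|s| - 1$, precision, recall, and F1 always coincide when comparing binary trees of the same sentence.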
Thus, Eqn. (2) can be rewritten as:

$$
\mathrm{AvgTree}\big(s, \{T_k\}_{k=1}^{K}\big)
= \operatorname*{arg\,max}_{T \in \mathcal{T}(s)} \sum_{k=1}^{K} F_1\big(T, T_k(s)\big)
= \operatorname*{arg\,max}_{T \in \mathcal{T}(s)} \sum_{k=1}^{K} \big|C(T) \cap C(T_k(s))\big|
\tag{3}
$$

$$
= \operatorname*{arg\,max}_{T \in \mathcal{T}(s)} \sum_{c \in C(T)} \underbrace{\sum_{k=1}^{K} \mathbb{1}\big[c \in C(T_k(s))\big]}_{\mathrm{HitCount}(c,\,\{T_k(s)\}_{k=1}^{K})}
\tag{4}
$$

Here, we define the HitCount function to be the number of times that a constituent $c$ appears in the teachers' outputs. In other words, Eqn. (4) suggests that the average tree should be the one that hits the teachers' predicted constituents most.

Discussion on MBR decoding. Our work can be seen as minimum Bayes risk (MBR) decoding (Bickel & Doksum, 2015). In general, MBR yields an output that minimizes an expected error (called the Bayes risk), defined according to the task of interest. In our case, the error function can be thought of as $-\sum_{c \in C(T)} \mathrm{HitCount}(c, \{T_k(s)\}_{k=1}^{K})$, and minimizing the Bayes risk so defined is equivalent to maximizing the total hit count in Eqn. (4).

However, our MBR approach significantly differs from prior MBR studies in NLP. In fact, MBR has been widely applied to text generation (Kumar & Byrne, 2004; Freitag et al., 2022; Suzgun et al., 2023), where a set of candidate output sentences are obtained by sampling or beam search, and the best one is selected based on a given error function, e.g., the dissimilarity against others; such an MBR method is selective, meaning that the output can only be selected from a candidate set. On the contrary, our MBR is generative, as the sentence's entire parse space $\mathcal{T}(s)$ will be considered during the arg max process in Eqn. (4). This follows Petrov & Klein (2007), who search for the globally lowest-risk tree in the task of supervised constituency parsing. Here, the global search is feasible because the nature of tree structures facilitates efficient exact decoding with dynamic programming, discussed in the next subsection.

2.3 OUR CYK VARIANT

As indicated by Eqn. (4), our method searches for the constituency tree with the highest total hit count of its constituents in the teachers' outputs. We can achieve this by a CYK (Kasami, 1966; Younger, 1967)-like dynamic programming algorithm, because an optimal constituency parse structure of a span (a continuous subsequence of a sentence) is independent of the rest of the sentence.

Given a sentence $s$, we denote by $s_{b:e}$ a span starting from the $b$th word and ending right before the $e$th word. Considering a set of teachers $\{T_k\}_{k=1}^{K}$, we define a recursion variable

$$
H_{b:e} = \max_{T \in \mathcal{T}(s_{b:e})} \sum_{c \in C(T)} \mathrm{HitCount}\big(c, \{T_k(s)\}_{k=1}^{K}\big)
\tag{5}
$$

which is the best total hit count for this span.³ We also define $L_{b:e}$ to be the corresponding, locally best parse structure, unambiguously represented by the set of all constituents in it.

The base cases are $H_{b:b+1} = K$ and $L_{b:b+1} = \{s_{b:b+1}\}$ for $b = 1, \dots, |s|$, suggesting that the best parse tree of a single-word span is the word itself, which appears in all teachers' outputs and has a hit count of $K$.

For the recursion, a span $s_{b:e}$ will be split into two subspans $s_{b:j}$ and $s_{j:e}$ for some split position $j$, because our work focuses on binary constituency parsing. Given $j$, the total hit count for the span $s_{b:e}$ is the summation over those of the two subspans $s_{b:j}$ and $s_{j:e}$, plus its own hit count. To obtain the best split, we need to vary $j$ between $b$ and $e$ (exclusive), given by

$$
j^{*}_{b:e} = \operatorname*{arg\,max}_{b < j < e} \big(H_{b:j} + H_{j:e}\big)
$$
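This recursion translates directly into code. The following is a minimal sketch for illustration (not our released implementation; the function name average_tree and the 0-indexed, end-exclusive span-set representation of teacher trees are expository assumptions): it tabulates HitCount from the teachers' constituent sets, fills $H_{b:e}$ bottom-up, and records the best split $j^{*}_{b:e}$ so that the best parse can be recovered by backtracking.

```python
# Illustrative sketch only; average_tree and the 0-indexed, end-exclusive
# span-set representation of teacher trees are expository assumptions.
from collections import Counter

def average_tree(n, teacher_span_sets):
    """CYK-style dynamic program for the average tree of Eqn. (4).

    n                 : sentence length |s|
    teacher_span_sets : K sets of (b, e) spans, one per teacher tree
    Returns the span set of a binary tree with the highest total hit count.
    """
    K = len(teacher_span_sets)

    # HitCount(c, {T_k(s)}): the number of teachers containing each span
    hit = Counter()
    for spans in teacher_span_sets:
        hit.update(spans)

    H = {}      # H[(b, e)]: best total hit count for span s_{b:e}
    split = {}  # backpointer: best split position j* for span (b, e)

    # base case: a single-word span appears in every teacher's output
    for b in range(n):
        H[(b, b + 1)] = K

    # recursion: fill in spans bottom-up, from short to long
    for length in range(2, n + 1):
        for b in range(n - length + 1):
            e = b + length
            j = max(range(b + 1, e), key=lambda m: H[(b, m)] + H[(m, e)])
            split[(b, e)] = j
            H[(b, e)] = H[(b, j)] + H[(j, e)] + hit[(b, e)]

    # recover the best parse by following the backpointers
    def collect(b, e, spans):
        spans.add((b, e))
        if e - b > 1:
            j = split[(b, e)]
            collect(b, j, spans)
            collect(j, e, spans)
        return spans

    return collect(0, n, set())
```

For instance, with two left-branching teachers and one right-branching teacher over a three-word sentence, average_tree returns the left-branching span set, since the span (0, 2) collects two votes against one for (1, 3). As in standard CYK parsing, the program runs in O(|s|^3) time per sentence; ties in the arg max are broken arbitrarily in this sketch.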