# Weakly-Supervised Grammar-Informed Bayesian CCG Parser Learning

Dan Garrette, Chris Dyer, Jason Baldridge, Noah A. Smith

Department of Computer Science, University of Texas at Austin, dhg@cs.utexas.edu
School of Computer Science, Carnegie Mellon University, {cdyer,nasmith}@cs.cmu.edu
Department of Linguistics, University of Texas at Austin, jbaldrid@utexas.edu

Combinatory Categorial Grammar (CCG) is a lexicalized grammar formalism in which words are associated with categories that, in combination with a small universal set of rules, specify the syntactic configurations in which they may occur. Previous work has shown that learning sequence models for CCG tagging can be improved by using priors that are sensitive to the formal properties of CCG as well as cross-linguistic universals. We extend this approach to the task of learning a full CCG parser from weak supervision. We present a Bayesian formulation for CCG parser induction that assumes only supervision in the form of an incomplete tag dictionary mapping some word types to sets of potential categories. Our approach outperforms a baseline model trained with uniform priors by exploiting universal, intrinsic properties of the CCG formalism to bias the model toward simpler, more cross-linguistically common categories.

## Introduction

Supervised learning of natural language parsers of various types (context-free grammars, dependency grammars, categorial grammars, and the like) is by now a well-understood task with plenty of high-performing models when training data is abundant. Learning from sparse, incomplete information is, naturally, a greater challenge. To build parsers for domains and languages where resources are scarce, we need techniques that take advantage of very limited kinds and amounts of supervision. The strategy we pursue in this paper is to approach the problem in a Bayesian framework, using priors built from linguistic knowledge such as grammar universals, linguistic typology, and cheaply obtained annotations from a linguist.

We focus on the task of learning parsers for Combinatory Categorial Grammar (CCG) (Steedman 2000; Steedman and Baldridge 2011) without having access to annotated parse trees. CCG is a lexicalized grammar formalism in which every constituent in a parse is assigned a category that describes its grammatical role in the sentence. CCG categories, in contrast to the labels used in standard phrase structure grammars, are not atomic labels like ADJECTIVE or VERB PHRASE, but instead have a detailed recursive structure. Instead of VERB, a CCG category might encode information such as "this category can combine with a noun phrase to the right (an object) and then a noun phrase to the left (a subject) to produce a sentence." Practical interest in CCG has grown in the last few years within an array of NLP applications; of particular note are semantic parsing (Zettlemoyer and Collins 2005) and machine translation (Weese, Callison-Burch, and Lopez 2012). As these tasks mature and move into new domains and languages, the ability to learn CCG parsers from scarce data will be increasingly important.
In this regard, CCG has particular appeal, both theoretical and practical, in the setting of weakly-supervised learning because the structure of its categories and its small set of universal rules provide a foundation for constructing linguistically-informative priors that go beyond general preferences for sparseness that are commonly expressed with generic priors. In this paper, we show that the intrinsic structure of CCG categories can be exploited cross-linguistically to learn better parsers when supervision is scarce.

Our starting point is the method we presented in Garrette et al. (2014) for specifying a probability distribution over the space of potential CCG categories. This distribution served as a prior for the training of a hidden Markov model (HMM) supertagger, which assigns a category to each word in a sentence (lexical categories are often called supertags). The prior is designed to bias the HMM toward the use of cross-linguistically common categories: those that are less complex or those that are modifiers of other categories. This method improves supertagging performance in limited-data situations; however, supertags are just the start of a full syntactic analysis of a sentence, and for all but very short sentences an HMM is very unlikely to produce a sequence of supertags that can actually be combined into a complete tree. Here we model the entire tree, which additionally requires supertags for different words to be compatible with each other.

Bisk and Hockenmaier (2013) present a model of CCG parser induction from only basic properties of the CCG formalism. However, their model produces trees that use only a simplified form of CCG consisting of just two atomic categories, and they require gold-standard part-of-speech tags for each token. We wish to model the same kinds of complex categories encoded in resources like CCGbank, or which are used in hand-crafted grammars such as those used with OpenCCG (Baldridge et al. 2007), and to do so by starting with words rather than gold tags.

Our inputs are unannotated sentences and an incomplete tag dictionary mapping some words to their potential categories, and we model CCG trees with a probabilistic context-free grammar (PCFG). The parameters of the PCFG are estimated using a blocked sampling algorithm based on the Markov chain Monte Carlo approach of Johnson, Griffiths, and Goldwater (2007). This allows us to efficiently sample parse trees for sentences in an unlabeled training corpus according to their posterior probabilities as informed by the linguistically-informed priors. This approach yields improvements over a baseline uniform-prior PCFG. Further, as a demonstration of the universality of our approach in capturing valuable grammatical biases, we evaluate on three diverse languages: English, Italian, and Chinese.

## Combinatory Categorial Grammar

In the CCG formalism, every constituent, including those at the lexical level, is associated with a structured CCG category that provides information about that constituent's relationships in the overall grammar of the sentence.
Categories are defined by a simple recursive structure: a category is either atomic (potentially carrying features that restrict the categories with which it can unify) or a function from one category to another, as indicated by one of two slash operators:

$$C \rightarrow \{s,\ s_{dcl},\ s_{adj},\ s_{b},\ n,\ np,\ np_{nb},\ pp,\ \ldots\}$$
$$C \rightarrow \{(C/C),\ (C\backslash C)\}$$

Categories of adjacent constituents can be combined using one of a set of combination rules to form categories of higher-level constituents, as seen in Figure 1. The direction of the slash operator gives the behavior of the function. A category $(s\backslash np)/pp$ might describe an intransitive verb with a prepositional phrase complement; it combines on the right ($/$) with a constituent with category $pp$, and then on the left ($\backslash$) with a noun phrase ($np$) that serves as its subject.

Figure 1: CCG parse for "The man walks to work."

We follow Lewis and Steedman (2014) in allowing only a small set of generic, linguistically-plausible grammar rules. More details can be found there; in addition to the standard binary combination rules, we use their set of 13 unary category-rewriting rules, as well as rules for combining with punctuation to the left and right. We further allow for a merge rule $X \rightarrow X\ X$ since this is seen frequently in the corpora (Clark and Curran 2007).

Because of their structured nature, CCG categories (unlike the part-of-speech tags and non-terminals of a standard PCFG) contain intrinsic information that gives evidence of their frequencies. First, simple categories are a priori more likely than complex categories. For example, the transitive verb "buy" appears with supertag $(s_{b}\backslash np)/np$ 342 times in CCGbank, but just once with $(((s_{b}\backslash np)/pp)/pp)/np$. Second, modifier categories, those of the form $X/X$ or $X\backslash X$, are more likely than non-modifiers of similar complexity. For example, a category containing six atoms may, in general, be very unlikely, but a six-atom category that is merely a modifier of a three-atom category (like an adverb modifying a transitive verb) would be fairly common.

Baldridge (2008) used this information to bias supertagger learning via an ad hoc initialization of Expectation-Maximization for an HMM. Garrette et al. (2014) built upon these concepts to introduce a probabilistic grammar over CCG categories that provides a well-founded notion of category complexity. The ideas are described in detail in previous work, but we restate some of them here briefly. The category grammar captures important aspects of what makes a category complex: smaller categories are more likely (governed by $p_{term} > \frac{1}{2}$, the probability of generating a terminal, i.e., atomic, category), some atomic categories are more likely than others ($p_{atom}$), modifier categories are more likely than non-modifiers ($p_{mod}$), and slash operators may occur with different likelihoods ($p_{fwd}$). This category grammar defines the probability distribution $P_G$ via the following recursive definition (let $\bar{p}$ denote $1-p$):

$$\begin{aligned}
C \rightarrow a &: \quad p_{term} \cdot p_{atom}(a)\\
C \rightarrow A/A &: \quad \bar{p}_{term} \cdot p_{fwd} \cdot p_{mod} \cdot P_G(A)\\
C \rightarrow A/B,\ A \neq B &: \quad \bar{p}_{term} \cdot p_{fwd} \cdot \bar{p}_{mod} \cdot P_G(A) \cdot P_G(B)\\
C \rightarrow A\backslash A &: \quad \bar{p}_{term} \cdot \bar{p}_{fwd} \cdot p_{mod} \cdot P_G(A)\\
C \rightarrow A\backslash B,\ A \neq B &: \quad \bar{p}_{term} \cdot \bar{p}_{fwd} \cdot \bar{p}_{mod} \cdot P_G(A) \cdot P_G(B)
\end{aligned}$$

where $A$, $B$, $C$ are recursively defined categories and $a$ is an atomic category: $a \in \{s,\ s_{dcl},\ s_{adj},\ s_{b},\ n,\ np,\ pp,\ \ldots\}$.

Supertagging accuracy was improved further by complementing the language-universal knowledge from CCG with corpus-specific information extracted automatically. Counts were estimated using a tag dictionary and unannotated data, then used to empirically set the parameters of the prior.
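To make the recursive definition of $P_G$ concrete, here is a minimal sketch under assumed, illustrative parameter values; the `Atom`/`Slash` category representation and the specific numbers are ours for the example, not the paper's.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical category representation: an atom, or a slashed functor A/B or A\B.
@dataclass(frozen=True)
class Atom:
    name: str                      # e.g. "np", "s_dcl"

@dataclass(frozen=True)
class Slash:
    direction: str                 # "/" (forward) or "\\" (backward)
    left: "Cat"
    right: "Cat"

Cat = Union[Atom, Slash]

# Illustrative parameter settings (not the values used in the paper).
P_TERM = 0.6                       # probability of generating an atomic category
P_FWD = 0.5                        # probability of a forward slash
P_MOD = 0.8                        # probability that a functor is a modifier (X/X or X\X)
P_ATOM = {"s": 0.2, "np": 0.3, "n": 0.3, "pp": 0.2}   # distribution over atoms

def p_g(cat: Cat) -> float:
    """Recursive category probability P_G, following the category grammar."""
    if isinstance(cat, Atom):
        return P_TERM * P_ATOM[cat.name]
    p_slash = P_FWD if cat.direction == "/" else 1.0 - P_FWD
    if cat.left == cat.right:      # modifier category X/X or X\X
        return (1 - P_TERM) * p_slash * P_MOD * p_g(cat.left)
    return (1 - P_TERM) * p_slash * (1 - P_MOD) * p_g(cat.left) * p_g(cat.right)

# Example: the transitive-verb-like category (s\np)/np
np_, s_ = Atom("np"), Atom("s")
print(p_g(Slash("/", Slash("\\", s_, np_), np_)))
```

Under these settings, complex categories receive geometrically smaller probabilities than simple ones, and modifiers pay only for their argument category once, which is the bias described above.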
## Generative Model

Our CCG parsing model assumes the following generative process. First, the parameters that define our PCFG are drawn. We generate a distribution $\sigma$ over root categories, a conditional distribution $\theta_t$ over binary branching non-terminal productions given each category $t$, a conditional distribution $\pi_t$ over unary non-terminal productions given each category $t$, and a conditional distribution $\mu_t$ over terminal (word) productions given each category $t$. Each of these parameters is drawn from a Dirichlet distribution parameterized by a concentration parameter ($\alpha_\sigma$, $\alpha_\theta$, $\alpha_\pi$, $\alpha_\mu$) and a prior mean distribution ($\sigma^0$, $\theta^0$, $\pi^0$, $\mu^0_t$). By setting each $\alpha$ close to zero, we can bias learning toward relatively peaked distributions. The prior means, explained in detail below, are used to encode both universal linguistic knowledge as well as information automatically extracted from the weak supervision.

Note that unlike a standard phrase-structure grammar, where the sets of terminal and non-terminal labels are non-overlapping (part-of-speech tags vs. internal nodes), a CCG category may appear at any level of the tree and, thus, may yield binary, unary, or terminal word productions. Therefore, we also generate a distribution $\lambda_t$ for every category $t$ that defines the mixture over production types (binary, unary, terminal) yielded by $t$. For simplicity, these parameters are generated by draws from an unbiased Dirichlet.

Next, the process generates each sentence in the corpus. This begins by generating a root category $s$ and then recursively generating subtrees. For each subtree rooted by a category $t$, with probability determined by $\lambda_t$, we generate either a binary ($\langle u, v \rangle$), unary ($\langle u \rangle$), or terminal ($w$) production from $t$; for binary and unary productions, we generate child categories and recursively generate subtrees. A tree is complete when all branches end in terminal words. Borrowing from the recursive generative function notation of Johnson, Griffiths, and Goldwater (2007), our process can be summarized as:

Parameters:
$$\begin{aligned}
&\sigma \sim \text{Dirichlet}(\alpha_\sigma, \sigma^0) && \text{root categories}\\
&\theta_t \sim \text{Dirichlet}(\alpha_\theta, \theta^0) \quad \forall t \in T && \text{binary productions}\\
&\pi_t \sim \text{Dirichlet}(\alpha_\pi, \pi^0) \quad \forall t \in T && \text{unary productions}\\
&\mu_t \sim \text{Dirichlet}(\alpha_\mu, \mu^0_t) \quad \forall t \in T && \text{terminal productions}\\
&\lambda_t \sim \text{Dir}(\langle 1, 1, 1 \rangle) \quad \forall t \in T && \text{production mixture}
\end{aligned}$$

Sentence:
$$s \sim \text{Categorical}(\sigma); \qquad \text{generate}(s)$$

where
$$\begin{aligned}
&\text{function generate}(t):\\
&\quad z \sim \text{Categorical}(\lambda_t)\\
&\quad \text{if } z = 1:\ \langle u, v \rangle \mid t \sim \text{Categorical}(\theta_t);\ \ \text{Tree}(t, \text{generate}(u), \text{generate}(v))\\
&\quad \text{if } z = 2:\ \langle u \rangle \mid t \sim \text{Categorical}(\pi_t);\ \ \text{Tree}(t, \text{generate}(u))\\
&\quad \text{if } z = 3:\ w \mid t \sim \text{Categorical}(\mu_t);\ \ \text{Leaf}(t, w)
\end{aligned}$$

### Root prior mean ($\sigma^0$)

Since $\sigma$ is a distribution over root categories, we can use $P_G$, the probability of a category as defined above in terms of the category grammar, as its prior mean, biasing our model toward simpler root categories. Thus, $\sigma^0(t) = P_G(t)$.

### Non-terminal production prior means ($\theta^0$ and $\pi^0$)

Our model includes two types of non-terminal productions: binary productions of the form $A \rightarrow B\ C$, and unary productions of the form $A \rightarrow B$. As with the root distribution prior, we would like our model to prefer productions that yield high-likelihood categories. To provide this bias, we again use $P_G$:

$$\theta^0(\langle u, v \rangle) = P_G(u) \cdot P_G(v) \qquad\qquad \pi^0(\langle u \rangle) = P_G(u)$$
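As a rough illustration, the sketch below builds these prior means from precomputed $P_G$ values over a small candidate set; the candidate set, the illustrative numbers, and the renormalization over that finite set are our assumptions for the example, not details specified in the paper.

```python
import itertools

# Hypothetical precomputed P_G values for a small candidate category set
# (illustrative numbers only; in the model these come from the category grammar).
P_G = {"s": 0.12, "np": 0.18, "n": 0.18, r"s\np": 0.02, r"(s\np)/np": 0.004}

def normalize(weights):
    """Rescale a dict of weights into a proper probability distribution."""
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}

# Root prior mean: sigma0(t) = P_G(t)
sigma0 = normalize(P_G)

# Unary production prior mean: pi0(<u>) = P_G(u)
pi0 = normalize({(u,): p for u, p in P_G.items()})

# Binary production prior mean: theta0(<u, v>) = P_G(u) * P_G(v)
theta0 = normalize({(u, v): P_G[u] * P_G[v]
                    for u, v in itertools.product(P_G, repeat=2)})
```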
### Terminal production prior means ($\mu^0_t$)

Because we model terminal productions separately, we are able to borrow directly from Garrette et al. (2014) to define the terminal production prior mean $\mu^0_t$ in a way that exploits the dictionary and unlabeled corpus to estimate the distribution over words for each supertag. Terminal productions in our grammar are defined as word given supertag, which is exactly the relationship of the emission distribution in an HMM supertagger. Thus, we simply use the supertagger's emission prior mean, as defined in the previous work, for our terminal productions:

$$\mu^0_t(w) = P_{em}(w \mid t)$$

If $C(w)$ is the number of times word $w$ appears in the raw corpus, $TD(w)$ is the set of supertags associated with $w$ in the tag dictionary, and $TD(t)$ is the set of known words (words appearing in the tag dictionary) for which supertag $t \in TD(w)$, the count of a word/tag pair for a known word is estimated by uniformly distributing the word's ($\delta$-smoothed) raw counts over its tag dictionary entries:

$$C_{known}(t, w) = \begin{cases} \frac{C(w)+\delta}{|TD(w)|} & \text{if } t \in TD(w)\\[4pt] 0 & \text{otherwise} \end{cases}$$

To address unknown words, we employ the concept of "tag openness," estimating the probability of a tag $t$ applying to some unknown word: if a tag is known to apply to many word types, it is likely to also apply to some new word type.

$$P(\textit{unk} \mid t) \propto |\{\text{known words } w \text{ s.t. } t \in TD(w)\}|$$

We can calculate $P(t \mid \textit{unk})$ using Bayes' rule, which allows us to estimate word/tag counts for unknown words:

$$P(t \mid \textit{unk}) \propto P(\textit{unk} \mid t) \cdot P_G(t) \qquad\qquad C_{unk}(t, w) = C(w) \cdot P(t \mid \textit{unk})$$

Finally, we can calculate a probability estimate considering the relationship between $t$ and all known and unknown words:

$$P_{em}(w \mid t) = \frac{C_{known}(t, w) + C_{unk}(t, w)}{\sum_{w'} C_{known}(t, w') + C_{unk}(t, w')}$$

In order to parse with our model, we seek the highest-probability parse tree for a given sentence $\mathbf{w}$: $\hat{y} = \text{argmax}_y\ P(y \mid \mathbf{w})$. This can be computed efficiently using the well-known probabilistic CKY algorithm.

## Posterior Inference

Since inference about the parameters of our model using a corpus of unlabeled training data is intractable, we resort to Gibbs sampling to find an approximate solution. Our strategy is based on that of Johnson, Griffiths, and Goldwater (2007), using a block sampling approach. We initialize our parameters by setting each distribution to its prior mean ($\sigma = \sigma^0$, $\theta_t = \theta^0$, etc.) and $\lambda_t = \langle \frac{1}{3}, \frac{1}{3}, \frac{1}{3} \rangle$. We then alternate between sampling trees given the current model parameters and observed word sequences, and sampling model parameters ($\sigma$, $\theta$, $\pi$, $\mu$, $\lambda$) given the current set of parse trees. To efficiently sample new model parameters, we exploit Dirichlet-multinomial conjugacy. We accumulate all parse trees sampled across all sampling iterations and use them to approximate the posterior quantities.
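As a concrete illustration of the parameter-resampling half of this alternation, here is a minimal sketch assuming production counts have already been collected from the currently sampled trees; the function name and the numbers are illustrative, not from the paper. By Dirichlet-multinomial conjugacy, the posterior is itself a Dirichlet whose pseudocounts are the observed counts plus $\alpha$ times the prior mean.

```python
import numpy as np

def resample_dirichlet_multinomial(counts, prior_mean, alpha, rng):
    """Draw a categorical parameter vector from its Dirichlet posterior.

    counts     : observed production counts from the currently sampled trees
    prior_mean : prior mean distribution (e.g. theta_0 built from P_G)
    alpha      : concentration parameter (small alpha -> peaked distributions)
    """
    # Posterior pseudocounts = alpha * prior_mean + observed counts.
    return rng.dirichlet(alpha * prior_mean + counts)

# Illustrative example: resampling theta_t for one category t over three
# candidate binary productions (hypothetical numbers).
rng = np.random.default_rng(0)
counts = np.array([12.0, 3.0, 0.0])     # how often each production was used
theta0_t = np.array([0.5, 0.3, 0.2])    # prior mean from P_G(u) * P_G(v)
theta_t = resample_dirichlet_multinomial(counts, theta0_t, alpha=1.0, rng=rng)
print(theta_t)
```

The same resampling step applies to $\sigma$, $\pi_t$, $\mu_t$, and $\lambda_t$, each with its own counts and prior mean.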
Our inference procedure takes as input each of the distribution prior means ($\sigma^0$, $\theta^0$, $\pi^0$, $\mu^0$), along with the raw corpus and tag dictionary. During sampling, we always restrict the possible supertag choices for a word $w$ to the categories found in the tag dictionary entry for that $w$: $TD(w)$. Since real-world learning scenarios will always lack complete knowledge of the lexicon, we, too, want to allow for unknown words. Thus, we use incomplete tag dictionaries in our experiments, meaning that for a word $w$ not present in the dictionary, we assign $TD(w)$ to be the full set of known categories, indicating maximal ambiguity. It is also possible that the correct supertag for a given word is not present in the tag dictionary, though in these scenarios we hope that the parse will succeed through a different route.

Our Gibbs sampler, based on the one proposed by Goodman (1998) and used by Johnson, Griffiths, and Goldwater (2007), uses a block sampling approach to sample an entire parse tree at once. The procedure is similar in principle to the Forward-Filter Backward-Sampler algorithm used by Garrette et al. (2014) for the HMM supertagger, but sampling trees instead of sequences (Carter and Kohn 1996). To sample a tree for a sentence $\mathbf{w}$, the strategy is to use the Inside algorithm (Lari and Young 1990) to inductively compute, for each potential non-terminal position $(i, j)$ (spanning words $w_i$ through $w_{j-1}$) and category $t$, going up the tree, the probability of generating $w_i, \ldots, w_{j-1}$ via any arrangement of productions that is rooted by $y_{ij} = t$:

$$p(y_{i,i+1} = t \mid w_i) = \lambda_t(3)\,\mu_t(w_i) + \sum_{t \rightarrow \langle u \rangle} \lambda_t(2)\,\pi_t(\langle u \rangle)\,p(y_{i,i+1} = u \mid w_i)$$

$$p(y_{ij} = t \mid w_{i:j-1}) = \sum_{t \rightarrow \langle u \rangle} \lambda_t(2)\,\pi_t(\langle u \rangle)\,p(y_{ij} = u \mid w_{i:j-1}) \;+\; \sum_{i < k < j}\ \sum_{t \rightarrow \langle u, v \rangle} \lambda_t(1)\,\theta_t(\langle u, v \rangle)\,p(y_{ik} = u \mid w_{i:k-1})\,p(y_{kj} = v \mid w_{k:j-1})$$
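Below is a minimal sketch of the inside computation described by these equations, using plain-string categories and dictionary-valued parameters ($\lambda_t$ as a triple of production-type probabilities, and $\theta_t$, $\pi_t$, $\mu_t$ as nested dictionaries). For simplicity, unary productions are applied once per cell rather than to convergence, and the tag-dictionary pruning of lexical categories is assumed to have been done already; this illustrates the recurrence, and is not the authors' implementation.

```python
from collections import defaultdict

def inside_chart(words, categories, lam, theta, pi, mu):
    """Compute inside probabilities p(y_ij = t | w_{i:j-1}) bottom-up.

    chart[(i, j)][t] is the probability of generating words[i:j] from category t.
    lam[t] = (p_binary, p_unary, p_terminal), i.e. lambda_t(1..3).
    """
    n = len(words)
    chart = defaultdict(lambda: defaultdict(float))

    def apply_unary(cell):
        # One round of unary productions t -> <u> on top of existing entries.
        for t in categories:
            for u, p_u in pi.get(t, {}).items():
                cell[t] += lam[t][1] * p_u * cell[u]

    for i in range(n):                                    # width-1 spans: terminals
        cell = chart[(i, i + 1)]
        for t in categories:
            cell[t] = lam[t][2] * mu.get(t, {}).get(words[i], 0.0)
        apply_unary(cell)

    for width in range(2, n + 1):                         # wider spans: binary + unary
        for i in range(n - width + 1):
            j = i + width
            cell = chart[(i, j)]
            for t in categories:
                for (u, v), p_uv in theta.get(t, {}).items():
                    for k in range(i + 1, j):             # split point
                        cell[t] += (lam[t][0] * p_uv *
                                    chart[(i, k)][u] * chart[(k, j)][v])
            apply_unary(cell)
    return chart

# Tiny illustrative grammar: "dogs bark" with categories np, s\np, s.
cats = ["s", "np", r"s\np"]
lam = {"s": (1.0, 0.0, 0.0), "np": (0.0, 0.0, 1.0), r"s\np": (0.0, 0.0, 1.0)}
theta = {"s": {("np", r"s\np"): 1.0}}
pi = {}
mu = {"np": {"dogs": 1.0}, r"s\np": {"bark": 1.0}}

chart = inside_chart(["dogs", "bark"], cats, lam, theta, pi, mu)
print(chart[(0, 2)]["s"])   # inside probability of the whole sentence under root s
```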