# Weakly-Supervised Grammar-Informed Bayesian CCG Parser Learning

Dan Garrette, Chris Dyer, Jason Baldridge, Noah A. Smith

Department of Computer Science, University of Texas at Austin, dhg@cs.utexas.edu
School of Computer Science, Carnegie Mellon University, {cdyer,nasmith}@cs.cmu.edu
Department of Linguistics, University of Texas at Austin, jbaldrid@utexas.edu

Combinatory Categorial Grammar (CCG) is a lexicalized grammar formalism in which words are associated with categories that, in combination with a small universal set of rules, specify the syntactic configurations in which they may occur. Previous work has shown that learning sequence models for CCG tagging can be improved by using priors that are sensitive to the formal properties of CCG as well as cross-linguistic universals. We extend this approach to the task of learning a full CCG parser from weak supervision. We present a Bayesian formulation for CCG parser induction that assumes only supervision in the form of an incomplete tag dictionary mapping some word types to sets of potential categories. Our approach outperforms a baseline model trained with uniform priors by exploiting universal, intrinsic properties of the CCG formalism to bias the model toward simpler, more cross-linguistically common categories.

## Introduction

Supervised learning of natural language parsers of various types (context-free grammars, dependency grammars, categorial grammars, and the like) is by now a well-understood task with plenty of high-performing models when training data is abundant. Learning from sparse, incomplete information is, naturally, a greater challenge. To build parsers for domains and languages where resources are scarce, we need techniques that take advantage of very limited kinds and amounts of supervision. The strategy we pursue in this paper is to approach the problem in a Bayesian framework, using priors built from linguistic knowledge such as grammar universals, linguistic typology, and cheaply obtained annotations from a linguist.

We focus on the task of learning parsers for Combinatory Categorial Grammar (CCG) (Steedman 2000; Steedman and Baldridge 2011) without having access to annotated parse trees. CCG is a lexicalized grammar formalism in which every constituent in a parse is assigned a category that describes its grammatical role in the sentence. CCG categories, in contrast to the labels used in standard phrase structure grammars, are not atomic labels like ADJECTIVE or VERB PHRASE, but instead have a detailed recursive structure. Instead of VERB, a CCG category might encode information such as "this category can combine with a noun phrase to the right (an object) and then a noun phrase to the left (a subject) to produce a sentence." Practical interest in CCG has grown in the last few years within an array of NLP applications; of particular note are semantic parsing (Zettlemoyer and Collins 2005) and machine translation (Weese, Callison-Burch, and Lopez 2012). As these tasks mature and move into new domains and languages, the ability to learn CCG parsers from scarce data will be increasingly important.
In this regard, CCG has particular appeal, both theoretical and practical, in the setting of weakly-supervised learning because the structure of its categories and its small set of universal rules provide a foundation for constructing linguistically-informative priors that go beyond general preferences for sparseness that are commonly expressed with generic priors. In this paper, we show that the intrinsic structure of CCG categories can be exploited cross-linguistically to learn better parsers when supervision is scarce.

Our starting point is the method we presented in Garrette et al. (2014) for specifying a probability distribution over the space of potential CCG categories. This distribution served as a prior for the training of a hidden Markov model (HMM) supertagger, which assigns a category to each word in a sentence (lexical categories are often called supertags). The prior is designed to bias the HMM toward the use of cross-linguistically common categories: those that are less complex or those that are modifiers of other categories. This method improves supertagging performance in limited-data situations; however, supertags are just the start of a full syntactic analysis of a sentence, and for all but very short sentences an HMM is very unlikely to produce a sequence of supertags that can actually be combined into a complete tree. Here we model the entire tree, which additionally requires supertags for different words to be compatible with each other.

Bisk and Hockenmaier (2013) present a model of CCG parser induction from only basic properties of the CCG formalism. However, their model produces trees that use only a simplified form of CCG consisting of just two atomic categories, and they require gold-standard part-of-speech tags for each token. We wish to model the same kinds of complex categories encoded in resources like CCGbank, or which are used in hand-crafted grammars such as those used with OpenCCG (Baldridge et al. 2007), and to do so by starting with words rather than gold tags.

Our inputs are unannotated sentences and an incomplete tag dictionary mapping some words to their potential categories, and we model CCG trees with a probabilistic context-free grammar (PCFG). The parameters of the PCFG are estimated using a blocked sampling algorithm based on the Markov chain Monte Carlo approach of Johnson, Griffiths, and Goldwater (2007). This allows us to efficiently sample parse trees for sentences in an unlabeled training corpus according to their posterior probabilities as informed by the linguistically-informed priors. This approach yields improvements over a baseline uniform-prior PCFG. Further, as a demonstration of the universality of our approach in capturing valuable grammatical biases, we evaluate on three diverse languages: English, Italian, and Chinese.

## Combinatory Categorial Grammar

In the CCG formalism, every constituent, including those at the lexical level, is associated with a structured CCG category that provides information about that constituent's relationships in the overall grammar of the sentence.
Categories are defined by a simple recursive structure: a category is either atomic (potentially carrying features that restrict the categories with which it can unify) or a function from one category to another, as indicated by one of two slash operators:

$$C \rightarrow \{s,\ s_{dcl},\ s_{adj},\ s_{b},\ n,\ np,\ np_{nb},\ pp,\ \ldots\}$$
$$C \rightarrow \{(C/C),\ (C\backslash C)\}$$

Categories of adjacent constituents can be combined using one of a set of combination rules to form categories of higher-level constituents, as seen in Figure 1. The direction of the slash operator gives the behavior of the function. A category $(s\backslash np)/pp$ might describe an intransitive verb with a prepositional phrase complement; it combines on the right ($/$) with a constituent with category $pp$, and then on the left ($\backslash$) with a noun phrase ($np$) that serves as its subject.

Figure 1: CCG parse for "The man walks to work."

We follow Lewis and Steedman (2014) in allowing only a small set of generic, linguistically-plausible grammar rules. More details can be found there; in addition to the standard binary combination rules, we use their set of 13 unary category-rewriting rules, as well as rules for combining with punctuation to the left and right. We further allow for a merge rule $X \rightarrow X\ X$ since this is seen frequently in the corpora (Clark and Curran 2007).

Because of their structured nature, CCG categories (unlike the part-of-speech tags and non-terminals of a standard PCFG) contain intrinsic information that gives evidence of their frequencies. First, simple categories are a priori more likely than complex categories. For example, the transitive verb "buy" appears with supertag $(s_{b}\backslash np)/np$ 342 times in CCGbank, but just once with $(((s_{b}\backslash np)/pp)/pp)/np$. Second, modifier categories, those of the form $X/X$ or $X\backslash X$, are more likely than non-modifiers of similar complexity. For example, a category containing six atoms may, in general, be very unlikely, but a six-atom category that is merely a modifier of a three-atom category (like an adverb modifying a transitive verb) would be fairly common.

Baldridge (2008) used this information to bias supertagger learning via an ad hoc initialization of Expectation-Maximization for an HMM. Garrette et al. (2014) built upon these concepts to introduce a probabilistic grammar over CCG categories that provides a well-founded notion of category complexity. The ideas are described in detail in previous work, but we restate some of them here briefly. The category grammar captures important aspects of what makes a category complex: smaller categories are more likely (governed by $p_{term} > \frac{1}{2}$, the probability of generating a terminal, i.e., atomic, category), some atomic categories are more likely than others ($p_{atom}$), modifier categories are more likely than non-modifiers ($p_{mod}$), and slash operators may occur with different likelihoods ($p_{fwd}$). This category grammar defines the probability distribution $P_G$ via the following recursive definition (let $\bar{p}$ denote $1-p$):

$$\begin{aligned}
C \rightarrow a &: \quad p_{term} \cdot p_{atom}(a)\\
C \rightarrow A/A &: \quad \bar{p}_{term} \cdot p_{fwd} \cdot p_{mod} \cdot P_G(A)\\
C \rightarrow A/B,\ A \neq B &: \quad \bar{p}_{term} \cdot p_{fwd} \cdot \bar{p}_{mod} \cdot P_G(A) \cdot P_G(B)\\
C \rightarrow A\backslash A &: \quad \bar{p}_{term} \cdot \bar{p}_{fwd} \cdot p_{mod} \cdot P_G(A)\\
C \rightarrow A\backslash B,\ A \neq B &: \quad \bar{p}_{term} \cdot \bar{p}_{fwd} \cdot \bar{p}_{mod} \cdot P_G(A) \cdot P_G(B)
\end{aligned}$$

where $A$, $B$, $C$ are recursively defined categories and $a$ is an atomic category: $a \in \{s,\ s_{dcl},\ s_{adj},\ s_{b},\ n,\ np,\ pp,\ \ldots\}$.

Supertagging accuracy was improved further by complementing the language-universal knowledge from CCG with corpus-specific information extracted automatically. Counts were estimated using a tag dictionary and unannotated data, then used to empirically set the parameters of the prior.
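To make the recursive definition of $P_G$ concrete, here is a minimal sketch under assumed, illustrative parameter values; the `Atom`/`Slash` category representation and the specific numbers are ours for the example, not the paper's.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical category representation: an atom, or a slashed functor A/B or A\B.
@dataclass(frozen=True)
class Atom:
    name: str                      # e.g. "np", "s_dcl"

@dataclass(frozen=True)
class Slash:
    direction: str                 # "/" (forward) or "\\" (backward)
    left: "Cat"
    right: "Cat"

Cat = Union[Atom, Slash]

# Illustrative parameter settings (not the values used in the paper).
P_TERM = 0.6                       # probability of generating an atomic category
P_FWD = 0.5                        # probability of a forward slash
P_MOD = 0.8                        # probability that a functor is a modifier (X/X or X\X)
P_ATOM = {"s": 0.2, "np": 0.3, "n": 0.3, "pp": 0.2}   # distribution over atoms

def p_g(cat: Cat) -> float:
    """Recursive category probability P_G, following the category grammar."""
    if isinstance(cat, Atom):
        return P_TERM * P_ATOM[cat.name]
    p_slash = P_FWD if cat.direction == "/" else 1.0 - P_FWD
    if cat.left == cat.right:      # modifier category X/X or X\X
        return (1 - P_TERM) * p_slash * P_MOD * p_g(cat.left)
    return (1 - P_TERM) * p_slash * (1 - P_MOD) * p_g(cat.left) * p_g(cat.right)

# Example: the transitive-verb-like category (s\np)/np
np_, s_ = Atom("np"), Atom("s")
print(p_g(Slash("/", Slash("\\", s_, np_), np_)))
```

Under these settings, complex categories receive geometrically smaller probabilities than simple ones, and modifiers pay only for their argument category once, which is the bias described above.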
## Generative Model

Our CCG parsing model assumes the following generative process. First, the parameters that define our PCFG are drawn. We generate a distribution $\sigma$ over root categories, a conditional distribution $\theta_t$ over binary branching non-terminal productions given each category $t$, a conditional distribution $\pi_t$ over unary non-terminal productions given each category $t$, and a conditional distribution $\mu_t$ over terminal (word) productions given each category $t$. Each of these parameters is drawn from a Dirichlet distribution parameterized by a concentration parameter ($\alpha_\sigma$, $\alpha_\theta$, $\alpha_\pi$, $\alpha_\mu$) and a prior mean distribution ($\sigma^0$, $\theta^0$, $\pi^0$, $\mu^0_t$). By setting each $\alpha$ close to zero, we can bias learning toward relatively peaked distributions. The prior means, explained in detail below, are used to encode both universal linguistic knowledge as well as information automatically extracted from the weak supervision.

Note that unlike a standard phrase-structure grammar, where the sets of terminal and non-terminal labels are non-overlapping (part-of-speech tags vs. internal nodes), a CCG category may appear at any level of the tree and, thus, may yield binary, unary, or terminal word productions. Therefore, we also generate a distribution $\lambda_t$ for every category $t$ that defines the mixture over production types (binary, unary, terminal) yielded by $t$. For simplicity, these parameters are generated by draws from an unbiased Dirichlet.

Next, the process generates each sentence in the corpus. This begins by generating a root category $s$ and then recursively generating subtrees. For each subtree rooted by a category $t$, with probability determined by $\lambda_t$, we generate either a binary ($\langle u, v \rangle$), unary ($\langle u \rangle$), or terminal ($w$) production from $t$; for binary and unary productions, we generate child categories and recursively generate subtrees. A tree is complete when all branches end in terminal words. Borrowing from the recursive generative function notation of Johnson, Griffiths, and Goldwater (2007), our process can be summarized as:

Parameters:
$$\begin{aligned}
&\sigma \sim \text{Dirichlet}(\alpha_\sigma, \sigma^0) && \text{root categories}\\
&\theta_t \sim \text{Dirichlet}(\alpha_\theta, \theta^0) \quad \forall t \in T && \text{binary productions}\\
&\pi_t \sim \text{Dirichlet}(\alpha_\pi, \pi^0) \quad \forall t \in T && \text{unary productions}\\
&\mu_t \sim \text{Dirichlet}(\alpha_\mu, \mu^0_t) \quad \forall t \in T && \text{terminal productions}\\
&\lambda_t \sim \text{Dir}(\langle 1, 1, 1 \rangle) \quad \forall t \in T && \text{production mixture}
\end{aligned}$$

Sentence:
$$s \sim \text{Categorical}(\sigma); \qquad \text{generate}(s)$$

where
$$\begin{aligned}
&\text{function generate}(t):\\
&\quad z \sim \text{Categorical}(\lambda_t)\\
&\quad \text{if } z = 1:\ \langle u, v \rangle \mid t \sim \text{Categorical}(\theta_t);\ \ \text{Tree}(t, \text{generate}(u), \text{generate}(v))\\
&\quad \text{if } z = 2:\ \langle u \rangle \mid t \sim \text{Categorical}(\pi_t);\ \ \text{Tree}(t, \text{generate}(u))\\
&\quad \text{if } z = 3:\ w \mid t \sim \text{Categorical}(\mu_t);\ \ \text{Leaf}(t, w)
\end{aligned}$$

### Root prior mean ($\sigma^0$)

Since $\sigma$ is a distribution over root categories, we can use $P_G$, the probability of a category as defined above in terms of the category grammar, as its prior mean, biasing our model toward simpler root categories. Thus, $\sigma^0(t) = P_G(t)$.

### Non-terminal production prior means ($\theta^0$ and $\pi^0$)

Our model includes two types of non-terminal productions: binary productions of the form $A \rightarrow B\ C$, and unary productions of the form $A \rightarrow B$. As with the root distribution prior, we would like our model to prefer productions that yield high-likelihood categories. To provide this bias, we again use $P_G$:

$$\theta^0(\langle u, v \rangle) = P_G(u) \cdot P_G(v) \qquad\qquad \pi^0(\langle u \rangle) = P_G(u)$$
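As a rough illustration, the sketch below builds these prior means from precomputed $P_G$ values over a small candidate set; the candidate set, the illustrative numbers, and the renormalization over that finite set are our assumptions for the example, not details specified in the paper.

```python
import itertools

# Hypothetical precomputed P_G values for a small candidate category set
# (illustrative numbers only; in the model these come from the category grammar).
P_G = {"s": 0.12, "np": 0.18, "n": 0.18, r"s\np": 0.02, r"(s\np)/np": 0.004}

def normalize(weights):
    """Rescale a dict of weights into a proper probability distribution."""
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}

# Root prior mean: sigma0(t) = P_G(t)
sigma0 = normalize(P_G)

# Unary production prior mean: pi0(<u>) = P_G(u)
pi0 = normalize({(u,): p for u, p in P_G.items()})

# Binary production prior mean: theta0(<u, v>) = P_G(u) * P_G(v)
theta0 = normalize({(u, v): P_G[u] * P_G[v]
                    for u, v in itertools.product(P_G, repeat=2)})
```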
### Terminal production prior means ($\mu^0_t$)

Because we model terminal productions separately, we are able to borrow directly from Garrette et al. (2014) to define the terminal production prior mean $\mu^0_t$ in a way that exploits the dictionary and unlabeled corpus to estimate the distribution over words for each supertag. Terminal productions in our grammar are defined as word given supertag, which is exactly the relationship of the emission distribution in an HMM supertagger. Thus, we simply use the supertagger's emission prior mean, as defined in the previous work, for our terminal productions:

$$\mu^0_t(w) = P_{em}(w \mid t)$$

If $C(w)$ is the number of times word $w$ appears in the raw corpus, $TD(w)$ is the set of supertags associated with $w$ in the tag dictionary, and $TD(t)$ is the set of known words (words appearing in the tag dictionary) for which supertag $t \in TD(w)$, the count of a word/tag pair for a known word is estimated by uniformly distributing the word's ($\delta$-smoothed) raw counts over its tag dictionary entries:

$$C_{known}(t, w) = \begin{cases} \frac{C(w)+\delta}{|TD(w)|} & \text{if } t \in TD(w)\\[4pt] 0 & \text{otherwise} \end{cases}$$

To address unknown words, we employ the concept of "tag openness," estimating the probability of a tag $t$ applying to some unknown word: if a tag is known to apply to many word types, it is likely to also apply to some new word type.

$$P(\textit{unk} \mid t) \propto |\{\text{known words } w \text{ s.t. } t \in TD(w)\}|$$

We can calculate $P(t \mid \textit{unk})$ using Bayes' rule, which allows us to estimate word/tag counts for unknown words:

$$P(t \mid \textit{unk}) \propto P(\textit{unk} \mid t) \cdot P_G(t) \qquad\qquad C_{unk}(t, w) = C(w) \cdot P(t \mid \textit{unk})$$

Finally, we can calculate a probability estimate considering the relationship between $t$ and all known and unknown words:

$$P_{em}(w \mid t) = \frac{C_{known}(t, w) + C_{unk}(t, w)}{\sum_{w'} C_{known}(t, w') + C_{unk}(t, w')}$$

In order to parse with our model, we seek the highest-probability parse tree for a given sentence $\mathbf{w}$: $\hat{y} = \text{argmax}_y\ P(y \mid \mathbf{w})$. This can be computed efficiently using the well-known probabilistic CKY algorithm.

## Posterior Inference

Since inference about the parameters of our model using a corpus of unlabeled training data is intractable, we resort to Gibbs sampling to find an approximate solution. Our strategy is based on that of Johnson, Griffiths, and Goldwater (2007), using a block sampling approach. We initialize our parameters by setting each distribution to its prior mean ($\sigma = \sigma^0$, $\theta_t = \theta^0$, etc.) and $\lambda_t = \langle \frac{1}{3}, \frac{1}{3}, \frac{1}{3} \rangle$. We then alternate between sampling trees given the current model parameters and observed word sequences, and sampling model parameters ($\sigma$, $\theta$, $\pi$, $\mu$, $\lambda$) given the current set of parse trees. To efficiently sample new model parameters, we exploit Dirichlet-multinomial conjugacy. We accumulate all parse trees sampled across all sampling iterations and use them to approximate the posterior quantities.
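As a concrete illustration of the parameter-resampling half of this alternation, here is a minimal sketch assuming production counts have already been collected from the currently sampled trees; the function name and the numbers are illustrative, not from the paper. By Dirichlet-multinomial conjugacy, the posterior is itself a Dirichlet whose pseudocounts are the observed counts plus $\alpha$ times the prior mean.

```python
import numpy as np

def resample_dirichlet_multinomial(counts, prior_mean, alpha, rng):
    """Draw a categorical parameter vector from its Dirichlet posterior.

    counts     : observed production counts from the currently sampled trees
    prior_mean : prior mean distribution (e.g. theta_0 built from P_G)
    alpha      : concentration parameter (small alpha -> peaked distributions)
    """
    # Posterior pseudocounts = alpha * prior_mean + observed counts.
    return rng.dirichlet(alpha * prior_mean + counts)

# Illustrative example: resampling theta_t for one category t over three
# candidate binary productions (hypothetical numbers).
rng = np.random.default_rng(0)
counts = np.array([12.0, 3.0, 0.0])     # how often each production was used
theta0_t = np.array([0.5, 0.3, 0.2])    # prior mean from P_G(u) * P_G(v)
theta_t = resample_dirichlet_multinomial(counts, theta0_t, alpha=1.0, rng=rng)
print(theta_t)
```

The same resampling step applies to $\sigma$, $\pi_t$, $\mu_t$, and $\lambda_t$, each with its own counts and prior mean.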
Our inference procedure takes as input each of the distribution prior means ($\sigma^0$, $\theta^0$, $\pi^0$, $\mu^0$), along with the raw corpus and tag dictionary. During sampling, we always restrict the possible supertag choices for a word $w$ to the categories found in the tag dictionary entry for that $w$: $TD(w)$. Since real-world learning scenarios will always lack complete knowledge of the lexicon, we, too, want to allow for unknown words. Thus, we use incomplete tag dictionaries in our experiments, meaning that for a word $w$ not present in the dictionary, we assign $TD(w)$ to be the full set of known categories, indicating maximal ambiguity. It is also possible that the correct supertag for a given word is not present in the tag dictionary, though in these scenarios we hope that the parse will succeed through a different route.

Our Gibbs sampler, based on the one proposed by Goodman (1998) and used by Johnson, Griffiths, and Goldwater (2007), uses a block sampling approach to sample an entire parse tree at once. The procedure is similar in principle to the Forward-Filter Backward-Sampler algorithm used by Garrette et al. (2014) for the HMM supertagger, but sampling trees instead of sequences (Carter and Kohn 1996). To sample a tree for a sentence $\mathbf{w}$, the strategy is to use the Inside algorithm (Lari and Young 1990) to inductively compute, for each potential non-terminal position $(i, j)$ (spanning words $w_i$ through $w_{j-1}$) and category $t$, going up the tree, the probability of generating $w_i, \ldots, w_{j-1}$ via any arrangement of productions that is rooted by $y_{ij} = t$:

$$p(y_{i,i+1} = t \mid w_i) = \lambda_t(3)\,\mu_t(w_i) + \sum_{t \rightarrow \langle u \rangle} \lambda_t(2)\,\pi_t(\langle u \rangle)\,p(y_{i,i+1} = u \mid w_i)$$

$$p(y_{ij} = t \mid w_{i:j-1}) = \sum_{t \rightarrow \langle u \rangle} \lambda_t(2)\,\pi_t(\langle u \rangle)\,p(y_{ij} = u \mid w_{i:j-1}) \;+\; \sum_{i < k < j}\ \sum_{t \rightarrow \langle u, v \rangle} \lambda_t(1)\,\theta_t(\langle u, v \rangle)\,p(y_{ik} = u \mid w_{i:k-1})\,p(y_{kj} = v \mid w_{k:j-1})$$
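Below is a minimal sketch of the inside computation described by these equations, using plain-string categories and dictionary-valued parameters ($\lambda_t$ as a triple of production-type probabilities, and $\theta_t$, $\pi_t$, $\mu_t$ as nested dictionaries). For simplicity, unary productions are applied once per cell rather than to convergence, and the tag-dictionary pruning of lexical categories is assumed to have been done already; this illustrates the recurrence, and is not the authors' implementation.

```python
from collections import defaultdict

def inside_chart(words, categories, lam, theta, pi, mu):
    """Compute inside probabilities p(y_ij = t | w_{i:j-1}) bottom-up.

    chart[(i, j)][t] is the probability of generating words[i:j] from category t.
    lam[t] = (p_binary, p_unary, p_terminal), i.e. lambda_t(1..3).
    """
    n = len(words)
    chart = defaultdict(lambda: defaultdict(float))

    def apply_unary(cell):
        # One round of unary productions t -> <u> on top of existing entries.
        for t in categories:
            for u, p_u in pi.get(t, {}).items():
                cell[t] += lam[t][1] * p_u * cell[u]

    for i in range(n):                                    # width-1 spans: terminals
        cell = chart[(i, i + 1)]
        for t in categories:
            cell[t] = lam[t][2] * mu.get(t, {}).get(words[i], 0.0)
        apply_unary(cell)

    for width in range(2, n + 1):                         # wider spans: binary + unary
        for i in range(n - width + 1):
            j = i + width
            cell = chart[(i, j)]
            for t in categories:
                for (u, v), p_uv in theta.get(t, {}).items():
                    for k in range(i + 1, j):             # split point
                        cell[t] += (lam[t][0] * p_uv *
                                    chart[(i, k)][u] * chart[(k, j)][v])
            apply_unary(cell)
    return chart

# Tiny illustrative grammar: "dogs bark" with categories np, s\np, s.
cats = ["s", "np", r"s\np"]
lam = {"s": (1.0, 0.0, 0.0), "np": (0.0, 0.0, 1.0), r"s\np": (0.0, 0.0, 1.0)}
theta = {"s": {("np", r"s\np"): 1.0}}
pi = {}
mu = {"np": {"dogs": 1.0}, r"s\np": {"bark": 1.0}}

chart = inside_chart(["dogs", "bark"], cats, lam, theta, pi, mu)
print(chart[(0, 2)]["s"])   # inside probability of the whole sentence under root s
```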