# Word Segmentation for Chinese Novels

Likun Qiu and Yue Zhang
Singapore University of Technology and Design
20 Dover Drive, Singapore 138682
qiulikun@gmail.com, yue_zhang@sutd.edu.sg

Abstract

Word segmentation is a necessary first step for automatic syntactic analysis of Chinese text. Chinese segmentation is highly accurate on news data, but accuracy drops significantly on other domains, such as science and literature. For scientific domains, a significant portion of out-of-vocabulary words are domain-specific terms, and therefore lexicons can be used to improve segmentation significantly. For the literature domain, however, there is no fixed set of domain terms. For example, each novel can contain a specific set of person, organization and location names. We investigate a method for automatically mining common noun entities for each novel using information extraction techniques, and use the resulting entities to improve a state-of-the-art segmentation model for the novel. In particular, we design a novel double-propagation algorithm that mines noun entities together with common contextual patterns, and use them as plug-in features to a model trained on the source domain. An advantage of our method is that no retraining of the segmentation model is needed for each novel, and hence it can be applied efficiently given the huge number of novels on the web. Results on five different novels show significantly improved accuracies, in particular for OOV words.

1 Introduction

Word segmentation is a necessary first step for automatic syntactic analysis of Chinese text. Statistical Chinese word segmentation systems are highly accurate on the news domain, thanks to large-scale manually-annotated training data. However, robust wide-coverage Chinese word segmentation remains an open problem, because performance usually degrades significantly on other domains, such as science and literature. There has been a line of research on improving cross-domain word segmentation (Chang and Han 2010; Liu and Zhang 2012; Li and Xue 2014); for scientific texts such as patents, a domain dictionary can enhance performance significantly.

Figure 1: Flowchart of the proposed method.

The challenges of word segmentation for the literature domain, and novels in particular, are quite different from those for scientific texts. For scientific domains such as chemistry and computer science, OOV words mainly belong to domain terms, which form a relatively stable vocabulary. For novels, however, OOV words are usually named entities such as person, location and organization names, and other common noun entities that are specific to the setting of each individual novel. Clearly, novel-specific lexicons or annotated sentences can reduce a large proportion of segmentation errors for a specific novel (Zhang et al. 2014). However, the large number of novels on the web makes it prohibitively expensive to annotate resources manually for each novel.

This paper addresses the domain adaptation problem for segmenting Chinese novels by automatically mining novel-specific noun entities using information extraction (IE) techniques. Our method does not use any target-domain annotation, such as domain dictionaries or small-scale annotated target-domain sentences.
There has been work on semi-supervised word segmentation under the same setting, typically incorporating target-domain statistics into source news-domain training (Suzuki and Isozaki 2008; Chang and Han 2010; Wang et al. 2011). However, such methods require retraining a statistical model for each target domain, using the news-domain annotated data. Given the large number of novels on the Internet, these methods can be prohibitively costly in terms of training time. In contrast, the proposed model takes novel-specific nouns as plug-in resources to a model trained on a large-scale annotated news corpus, and does not need retraining for each novel.

Our method is inspired by the observation that noun entities tend to occur frequently in common contexts. Formally, a context pattern (l, x, r) consists of a left context word l, a right context word r and a noun entity x, which we call the target word. For example, in the context pattern ("come to", x, "side"), l = "come to", r = "side", and x is typically a person name; in the context pattern ("go", x, "inside"), x is typically a location or organization name.

Context patterns can be classified into two types according to their occurrences in the source and target domains. In the first type of pattern, the context words l and r occur frequently in both the source-domain corpus and the target-domain corpus. In the second type of pattern, at least one of the context words is common in the target domain but not in the source domain. A segmentation algorithm is likely to segment a new word correctly under contexts similar to the source-domain training data, but incorrectly under contexts rarely seen in the source domain. Several examples from our experimental data are given in Table 1. The left and right columns list instances of the target word "Tianbuyi" occurring under the two types of contexts, respectively. It is more frequently recognized correctly under the first type of context pattern, but incorrectly under the second type, being split into "Tian" (field) and "Buyi" (hard).

| First type of context (correctly segmented instances) | Similar words from the news corpus | Second type of context (incorrectly segmented instances) |
| --- | --- | --- |
| (Tianlinger ... went to Tianbuyi's side.) | (He ... came to the Crown Prince's side.) | (He saw that Tian Buyi came in wobbling.) |
| (Tianbuyi listened to his daughter's words.) | (Gaoguozhu was very angry after listening.) | (Tian Buyi nodded slowly.) |
| (His looks left Tianbuyi.) | (The Vietnamese team lost to Thailand.) | (She helped her husband Tian Buyi to teach these students.) |

Table 1: Instances of the novel-specific word "Tianbuyi" occurring in different contexts.

For a human reader, the first type of context pattern can help recognize the meaning of a target word, which in turn helps the understanding of the second type of context pattern in a novel. Based on this observation, we develop a bootstrapping process that iteratively mines new words and context patterns. Given a novel, a double propagation process first detects new noun entities using the first type of context; these are in turn used to find the second type of context, before more noun entities and patterns are detected iteratively.
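To make the two pattern types concrete, the following Python sketch classifies a context pattern (l, x, r) by how common its context words are in each domain. The frequency tables and the threshold are illustrative assumptions, not values taken from the paper.

```python
from collections import Counter

# Hypothetical frequency tables: in practice these would be counted over
# the annotated news corpus (source) and the auto-segmented novel (target).
source_freq = Counter({"come to": 950, "side": 800, "listened to": 20})
target_freq = Counter({"come to": 120, "side": 90, "listened to": 60})

COMMON = 50  # illustrative threshold for a "frequent" context word

def pattern_type(l, r):
    """Return 1 if both context words are frequent in both domains,
    2 if at least one is common only in the target domain, else 0."""
    def in_source(w):
        return source_freq[w] >= COMMON
    def in_target(w):
        return target_freq[w] >= COMMON
    if all(in_source(w) and in_target(w) for w in (l, r)):
        return 1
    if any(in_target(w) and not in_source(w) for w in (l, r)):
        return 2
    return 0

# ("come to", x, "side") is a first-type pattern; a pattern whose left
# context word is common only in the novel is a second-type pattern.
print(pattern_type("come to", "side"))      # 1
print(pattern_type("listened to", "side"))  # 2
```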
We use the joint segmentation and part-of-speech (POS) tagging model of Zhang and Clark (2010) as the baseline segmentor, which gives state-of-the-art accuracies on news data. The set of noun entities and patterns mined from each novel is incorporated into the baseline as new features to improve segmentation for the novel.

We perform experiments on segmenting five different novels. For each novel, we manually annotate a set of sentences as the test data. (To facilitate future comparisons, we release our annotated datasets at http://people.sutd.edu.sg/%7Eyue_zhang/publication.html.) Results show that the proposed method significantly improves the accuracies of both word segmentation and POS tagging over a self-training baseline. In addition, unlike most previous semi-supervised segmentation methods, which improve recall on out-of-vocabulary (OOV) words at the expense of slightly decreased in-vocabulary (IV) recall, our method achieves an overall 30.2% error reduction on OOV recall, together with an 8.7% error reduction on IV recall.

2 Overview of the Proposed Method

The flowchart of the proposed method is shown in Figure 1. It consists of a training process and a testing process. Given an annotated news corpus, which we call GenCorpus, the training process is executed only once, resulting in a set of models that are used to analyze different novels without retraining. The testing process refers to the segmentation of novels; it is carried out for each novel separately.

In the training process, a baseline segmentor (§3.1) and an enhanced segmentor (§3.2) that supports noun entity and pattern features are trained on GenCorpus separately. In addition, two log-linear classifiers (§4.1 and §4.3) are trained on GenCorpus for detecting new words and assigning POS tags to unknown words, respectively.

The segmentation process for a given novel consists of three steps, sketched below. First, the novel is segmented and POS-tagged by the baseline segmentor (§3.1), resulting in a set of automatically segmented sentences SEN. Second, new word and context pattern mining is executed iteratively on SEN through a double propagation process (§4.2), using the log-linear word (§4.1) and POS (§4.3) classifiers. Finally, the newly-mined words together with their context patterns are converted into features, and given to the enhanced segmentor (§3.2) to produce the final segmentation for the novel.
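The three steps can be summarized in a short Python sketch. The function and argument names are placeholders for illustration; the actual components are the trained models described in §3 and §4.

```python
def analyze_novel(novel_text, baseline_tag, enhanced_tag, mine_nouns_patterns):
    """Per-novel testing process of Figure 1. All models are trained once
    on GenCorpus and reused for every novel, so no retraining happens here.
    The three callables are illustrative stand-ins for trained components."""
    # Step 1: first-pass segmentation and POS tagging (Sec. 3.1).
    sen = baseline_tag(novel_text)
    # Step 2: double propagation mines novel-specific nouns W and
    # context patterns P from the auto-analyzed sentences (Sec. 4).
    W, P = mine_nouns_patterns(sen)
    # Step 3: the enhanced segmentor (Sec. 3.2) consumes W and P as
    # plug-in features and produces the final analysis.
    return enhanced_tag(novel_text, nouns=W, patterns=P)
```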
3 Segmentation and POS-tagging

Joint segmentation and POS-tagging has received growing research attention due to improvements in lexical analysis over pipelined segmentors and POS taggers (Zhang and Clark 2008; Jiang et al. 2008; Kruengkrai et al. 2009; Zhang and Clark 2010). It makes better use of POS information, reduces segmentation error propagation, and is preferred when POS annotation is available.

3.1 The Baseline Segmentor

We apply the joint segmentor and POS-tagger of Zhang and Clark (2010) as our baseline system (from http://sourceforge.net/zpar, version 0.6, using the agenda implementation of chinese.postagger). The segmentor processes a sentence from left to right, using a buffer to maintain partially-built outputs and a queue to hold the next incoming characters. In the initial state, the buffer is empty, and the queue contains the whole input sentence. Two transition actions are defined to consume input characters from the queue and construct output sentences on the buffer: APPEND, which removes the front character from the queue and appends it to the last word in the buffer; and SEPARATE-x, which removes the front character from the queue and puts it as the start of a new word in the buffer, assigning the POS tag x to the new word.

Given an input sentence, the system starts from the initial state, and repeatedly applies transition actions until all the characters on the queue are consumed and a full sentence is constructed on the buffer. Beam search is applied to find a highest-scored sequence of transitions heuristically. The system scores search candidates using a linear model, which is trained using the averaged perceptron (Collins 2002) with early update (Collins and Roark 2004).

The Base rows of Table 2 list the feature templates of our baseline segmentor, which are taken from Zhang and Clark (2010). w, t and c denote a word, a POS tag and a character, respectively. The subscripts are relative to the current character, which is the front character of the queue: w-1 denotes the first word to the left of the current character, and t-2 denotes the POS tag on the second word to the left of the current character. start(w), end(w) and len(w) denote the first character, the last character and the length of word w, respectively. cat(c) denotes the set of all possible POS tags seen on the character c.

| Type | Feature |
| --- | --- |
| Base | w-1; w-1w-2; w-1, where len(w-1) = 1; |
| Base | start(w-1)len(w-1); end(w-1)len(w-1); |
| Base | c-1c0; start(w-1)end(w-1); end(w-2)w-1; |
| Base | start(w-1)c0; end(w-2)end(w-1); w-1c0; |
| Base | w-2len(w-1); len(w-2)w-1; end(w-1)c0; |
| Base | w-1t-1; t-1t0; t-2t-1t0; w-1t0; t-2w-1; |
| Base | w-1t-1end(w-2); w-1t-1c0; start(w0)t0; |
| Base | c-2c-1c0t-1, where len(w-1) = 1; t-1start(w-1); |
| Base | ct-1end(w-1), where c ∈ w-1 and c ≠ end(w-1); |
| Base | t0c0; c0t0start(w0); c0t0c-1t-1; c0t0c-1; |
| Base | c0t0cat(start(w0)); |
| New | ISNOUN(w-2)t-2len(w-2); |
| New | ISNOUN(w-1)t-1len(w-1); |
| New | ISNOUN(c0)t0len(w0); |
| New | ISTRIPLE(w-2, w-1, c0)t-1len(w-1); |
| New | ISPATTERN(w-2, c0)t-1len(w-1); |

Table 2: Feature templates for the joint word segmentation and POS tagging system.

3.2 The Enhanced Segmentor

The enhanced segmentor is the baseline segmentor with additional feature templates that exploit information on new words (W) and patterns (P) in a target novel. The new features are shown in the New rows of Table 2. ISNOUN(w) indicates whether the word w is in W. ISPATTERN(w-2, c0) indicates whether the word w-2 and the word starting with c0 form a pattern in P, regardless of whether w-1 is in W. ISTRIPLE(w-2, w-1, c0) indicates whether the words w-2 and w-1 and the word starting with c0 form a pattern in P, with the noun w-1 being in W.

To train weights for the W and P features on the source-domain GenCorpus, the new feature templates are instantiated using a set of source-domain noun entities Ws and patterns Ps. We construct Ps by extracting the set of source-domain patterns that occur at least 10 times in GenCorpus, and Ws by extracting the set of source-domain noun entities that occur under at least two context patterns in Ps. All the words in the general dictionary, the PKU Grammatical Dictionary (Yu et al. 1998), are removed from Ws so that the remaining words simulate the distribution of new words in target novels.

For the final segmentation of a novel, a set of novel-specific nouns (W) and the corresponding set of patterns (P) are extracted from the novel using the double propagation algorithm of §4, and replace Ws and Ps for instantiating the new features in the enhanced segmentor.
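A minimal sketch of how the three new predicates could be realized, assuming W is a set of mined nouns and P a set of (left word, right word) context pairs; representing P as word pairs is our assumption, not a detail given in the paper.

```python
def is_noun(w, W):
    # ISNOUN(w): w is a mined noun entity.
    return w in W

def is_pattern(w_minus2, w0, P):
    # ISPATTERN(w_-2, c0): w_-2 and the word w0 starting with the current
    # character form a known context pattern, regardless of w_-1.
    return (w_minus2, w0) in P

def is_triple(w_minus2, w_minus1, w0, W, P):
    # ISTRIPLE(w_-2, w_-1, c0): the same pattern test, additionally
    # requiring the middle word w_-1 to be a mined noun.
    return (w_minus2, w0) in P and w_minus1 in W

W = {"Tianbuyi"}
P = {("come to", "side")}
print(is_triple("come to", "Tianbuyi", "side", W, P))  # True
```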
4 Noun Entity and Pattern Mining

Our approach iteratively identifies new noun entities and context patterns, using known and newly extracted noun entities to identify new patterns, and vice versa. Because of this two-way information passing between noun entities and patterns, the method is also called double propagation (Wu et al. 2009; Carlson et al. 2010; Qiu et al. 2011; Qiu and Zhang 2014).

4.1 Training a Word Classifier

For noun entity mining, a logistic regression classifier, WordClassifier, is used to score candidates, with the features shown in Table 3. In the table, p denotes a context pattern that a target noun entity occurs in, and Ci denotes a substring of the target noun entity; for example, C1 and C1..2 denote the substrings consisting of the first character and the first two characters, respectively. COUNT(p) and FREQ(p) denote the number of distinct patterns p and their total count, respectively, while PMI(C1, C2..n) measures the pointwise mutual information between the two substrings C1 and C2..n of the target noun entity. The feature templates cover both context and word structure patterns.

| Type | Feature |
| --- | --- |
| Context | both the left and right contexts are punctuation |
| Context | 20 ≤ COUNT(p); 10 ≤ COUNT(p) < 20 |
| Context | 2 ≤ COUNT(p) < 10; COUNT(p) = 1 |
| Context | 50 ≤ FREQ(p); 20 ≤ FREQ(p) < 50 |
| Context | 5 ≤ FREQ(p) < 20; FREQ(p) < 5 |
| Structure | PMI(C1, C2..n); PMI(C1..n-1, Cn) |
| Structure | PMI(C1..2, C3..n); PMI(C1..n-2, Cn-1..n) |

Table 3: Feature templates of the word classifier.

We train WordClassifier on GenCorpus. All the nouns and their context patterns in the corpus could be taken as positive training examples. However, in order to simulate the test scenario of analyzing novels, we choose the top 30% most frequent person names, locations, organization names and common nouns as the set of positive examples Wi, counting the frequencies of the different types of nouns separately. In addition, the context patterns of the chosen nouns are taken as the pattern set Pi. For negative examples, we concatenate each noun in Wi with its right context word. However, if the resulting word also occurs in Wi, it is removed from the set of negative examples. This ensures that no word occurs in both the positive and negative sets.
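The structure features of Table 3 can be computed as below. This is a minimal sketch assuming substring probabilities are estimated from simple counts over the novel; the exact estimator and smoothing are not specified in the paper, so the unseen-event handling here is an assumption.

```python
import math
from collections import Counter

ngram_count = Counter()  # counts of character strings observed in the novel
total = 1                # normalizer for relative frequencies

def pmi(left, right):
    """Pointwise mutual information between two adjacent substrings,
    e.g. pmi(word[:1], word[1:]) realizes PMI(C1, C2..n)."""
    p_joint = ngram_count[left + right] / total
    p_left = ngram_count[left] / total
    p_right = ngram_count[right] / total
    if min(p_joint, p_left, p_right) == 0:
        return float("-inf")  # unseen; a real system would smooth instead
    return math.log(p_joint / (p_left * p_right))

def structure_features(word):
    # The four PMI splits of Table 3 for a candidate noun entity.
    return {
        "PMI(C1,C2..n)":        pmi(word[:1], word[1:]),
        "PMI(C1..n-1,Cn)":      pmi(word[:-1], word[-1:]),
        "PMI(C1..2,C3..n)":     pmi(word[:2], word[2:]),
        "PMI(C1..n-2,Cn-1..n)": pmi(word[:-2], word[-2:]),
    }
```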
4.2 Double Propagation

The pattern mining algorithm consists of three steps. Step 1 is an initialization process, in which the tokens in the automatically-segmented novel SEN that have been tagged as nouns (including person names, locations, organization names and common nouns) are extracted and put into the candidate new noun entity set Wc. The output noun entity set Wo is initialized to the empty set, and the output pattern set Po is initialized to Pi.

In Step 2, WordClassifier is used to classify the candidate new noun entities in Wc, taking words with probabilities above a threshold α as novel-specific nouns and putting them into the set Wo. WordClassifier uses Po for context features in the classification.

In Step 3, the updated noun set Wo is used to expand Po, with the context patterns of all words in Wo being added to Po. If new patterns are detected, the process is repeated from Step 2. Otherwise, the algorithm terminates, returning Wo and Po. Algorithm 1 shows pseudocode of the double propagation algorithm.

```
Input : auto-analyzed sentences SEN, context patterns Pi,
        annotated news corpus GenCorpus,
        noun entity classifier WordClassifier
Output: new words Wo, context patterns Po
 1  Wo = ∅; Wc = ∅; Po = Pi;
 2  for each word ∈ SEN and word ∉ GenCorpus do
 3      AddToSet(word, Wc);
 4  end
 5  while True do
 6      count = 0;
 7      for each word ∈ Wc do
 8          if IsWord(word, Po, WordClassifier) then
 9              RemoveFromSet(word, Wc);
10              AddToSet(word, Wo);
11              pattern = GetContextPat(word);
12              AddToSet(pattern, Po);
13              count++;
14          end
15      end
16      if count = 0 then
17          return Wo, Po;
18      end
19  end
```

Algorithm 1: The double propagation algorithm.

We take the novel "Zhuxian" as an example to illustrate the workflow of the algorithm. The original context pattern set Pi contains the context patterns in the left column of Table 1. By lines 2 to 4 of Algorithm 1, the target word "Tianbuyi" is put into the candidate new word set Wc. "Tianbuyi" then passes the confidence test in line 8, and is removed from Wc and put into Wo. As a result, the context patterns in the right column of Table 1 are put into Po. By repeating lines 7 to 15, more new words are extracted, including the person names "Wanjianyi" and "Shentiandou", and the locations "Mount Longshou" and "Mount Tongtian".

4.3 Tagging Unknown Words

Because our POS set differentiates common nouns (n), person names (nr), locations (ns) and organization names (nt), a new noun should typically have only one POS tag. However, some of the mined words are tagged with two or more POS tags under different contexts by the baseline segmentor, due to segmentation errors or inherent ambiguities. To disambiguate between the POS tags, we develop a log-linear regression classifier, POSClassifier, with the feature templates listed in Table 4, and try to attach a single POS tag to each new word. The POS classifier incorporates both global context features and internal structure features, and is trained on the source-domain news corpus. The training data is selected in the same way as the positive examples for WordClassifier.

| Type | Feature |
| --- | --- |
| Context | first context word on the left |
| Context | second context word on the left |
| Context | first context word on the right |
| Context | second context word on the right |
| Context | POS of the first context word on the left |
| Context | POS of the first context word on the right |
| Structure | the first character of the target word |
| Structure | the last character of the target word |

Table 4: Feature templates of the POS classifier.

POSClassifier is used to filter POS tags for the words in Wo. For a word whose most probable POS tag has a probability above a threshold β, only that most probable POS tag is kept. For the remaining words, all their POS tags in Wo are kept. In the segmentation process, the word and POS information in Wo is used to instantiate features with templates such as ISNOUN(w-2)t-2len(w-2) and ISNOUN(c0)t0len(w0) in Table 2.

| Data Set | #Sents | #Words | OOV Rate |
| --- | --- | --- | --- |
| GenCorpus | 42,132 | 1,052,119 | - |
| Zhuxian (ZX-dev) | 300 | 8,117 | 0.137 |
| Zhuxian (ZX-test) | 667 | 21,066 | 0.161 |
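For concreteness, Algorithm 1 from §4.2 can be transcribed into Python as below. This is a minimal sketch: the classifier and pattern-extractor interfaces, and applying the threshold α inside the word test, are assumptions standing in for the trained components described above.

```python
def double_propagation(sen_tokens, P_i, gen_vocab, word_classifier,
                       get_context_patterns, alpha=0.5):
    """Transcription of Algorithm 1 (Sec. 4.2).

    sen_tokens: noun-tagged tokens from the auto-analyzed novel SEN;
    P_i: initial context pattern set; gen_vocab: words seen in GenCorpus;
    word_classifier(word, patterns) -> probability of being a noun entity;
    get_context_patterns(word) -> context patterns of word in SEN.
    All interfaces and the default alpha are illustrative assumptions.
    """
    W_o = set()
    W_c = {w for w in sen_tokens if w not in gen_vocab}  # lines 2-4
    P_o = set(P_i)
    while True:                                          # line 5
        count = 0
        for word in list(W_c):                           # lines 7-15
            if word_classifier(word, P_o) >= alpha:      # line 8
                W_c.remove(word)
                W_o.add(word)
                P_o.update(get_context_patterns(word))
                count += 1
        if count == 0:                                   # line 16
            return W_o, P_o
```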