# Segmental Recurrent Neural Networks

Published as a conference paper at ICLR 2016

Lingpeng Kong, Chris Dyer
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{lingpenk, cdyer}@cs.cmu.edu

Noah A. Smith
Computer Science & Engineering
University of Washington
Seattle, WA 98195, USA
nasmith@cs.washington.edu

ABSTRACT

We introduce segmental recurrent neural networks (SRNNs), which define, given an input sequence, a joint probability distribution over segmentations of the input and labelings of the segments. Representations of the input segments (i.e., contiguous subsequences of the input) are computed by encoding their constituent tokens using bidirectional recurrent neural nets, and these segment embeddings are used to define compatibility scores with output labels. These local compatibility scores are integrated using a global semi-Markov conditional random field. Both fully supervised training, in which segment boundaries and labels are observed, and partially supervised training, in which segment boundaries are latent, are straightforward. Experiments on handwriting recognition and joint Chinese word segmentation/POS tagging show that, compared to models that do not explicitly represent segments, such as BIO tagging schemes and connectionist temporal classification (CTC), SRNNs obtain substantially higher accuracies.

1 INTRODUCTION

For sequential data like speech, handwriting, and DNA, segmentation and segment-labeling are abstractions that capture many common data analysis challenges. We consider the joint task of breaking an input sequence into contiguous, arbitrary-length segments while labeling each segment.

Our new approach to this problem is the segmental recurrent neural network (SRNN). SRNNs combine two powerful machine learning tools: representation learning and structured prediction. First, bidirectional recurrent neural networks (RNNs) embed every feasible segment of the input in a continuous space, and these embeddings are then used to calculate the compatibility of each candidate segment with a label. Unlike past RNN-based approaches (e.g., connectionist temporal classification or CTC; Graves et al., 2006), each candidate segment is represented explicitly, allowing application in settings where an alignment between segments and labels is desired as part of the output (e.g., protein secondary structure prediction or information extraction from text). At the same time, SRNNs are a variant of semi-Markov conditional random fields (Sarawagi & Cohen, 2004), in that they define a conditional probability distribution over the output space (segmentation and labeling) given the input sequence (§2). This allows explicit modeling of statistical dependencies, such as those between adjacent labels, and also of segment lengths (unlike widely used symbolic approaches based on BIO tagging; Ramshaw & Marcus, 1995). Because the probability score decomposes into chain-structured clique potentials, polynomial-time dynamic programming algorithms exist for prediction and parameter estimation (§3). Parameters can be learned either with a fully supervised objective, where both segment boundaries and segment labels are provided at training time, or with partially supervised training objectives, where segment boundaries are latent (§4).
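To make the first ingredient concrete, the sketch below shows one straightforward way candidate segments could be embedded with a bidirectional RNN and scored against labels. It is a minimal sketch, not the paper's implementation: the class name `SegmentScorer`, the `max_seg_len` cap on segment length, the use of PyTorch, and the particular pooling of the forward and backward states are all assumptions introduced here for illustration.

```python
# A minimal sketch (not the authors' code): embed every candidate segment of a
# sequence with a bidirectional LSTM and score its compatibility with each label.
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_labels, max_seg_len=6):
        super().__init__()
        self.birnn = nn.LSTM(input_dim, hidden_dim,
                             bidirectional=True, batch_first=True)
        self.label_scores = nn.Linear(2 * hidden_dim, num_labels)
        self.hidden_dim = hidden_dim
        self.max_seg_len = max_seg_len  # assumed cap on segment length

    def forward(self, x):
        """x: (T, input_dim) token embeddings for a single input sequence."""
        T = x.size(0)
        scores = {}  # (start, length) -> tensor of per-label compatibility scores
        for start in range(T):
            for length in range(1, min(self.max_seg_len, T - start) + 1):
                segment = x[start:start + length].unsqueeze(0)  # (1, length, input_dim)
                out, _ = self.birnn(segment)                    # (1, length, 2 * hidden_dim)
                # Segment embedding: last forward state and first (leftmost) backward state.
                h_fwd = out[0, -1, :self.hidden_dim]
                h_bwd = out[0, 0, self.hidden_dim:]
                seg_emb = torch.cat([h_fwd, h_bwd])
                scores[(start, length)] = self.label_scores(seg_emb)
        return scores
```

For example, `SegmentScorer(input_dim=8, hidden_dim=16, num_labels=5)(torch.randn(10, 8))` returns a score vector for every candidate segment of a length-10 input, up to the assumed length cap.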
We compare SRNNs to strong models that do not explicitly represent segments on handwriting recognition and on joint word segmentation and part-of-speech tagging for Chinese text. SRNNs obtain significant accuracy improvements on both tasks, demonstrating the value of models that explicitly model segmentation even when segmentation is not necessary for downstream tasks (§5).

2 MODEL

Given a sequence of input observations $x = x_1, x_2, \ldots, x_{|x|}$ with length $|x|$, a segmental recurrent neural network (SRNN) defines a joint distribution $p(y, z \mid x)$ over a sequence of labeled segments, each of which is characterized by a duration ($z_i \in \mathbb{Z}_+$) and a label ($y_i \in Y$). The segment durations are constrained such that $\sum_{i=1}^{|z|} z_i = |x|$. The length of the output sequence $|y| = |z|$ is a random variable, and $|y| \le |x|$ with probability 1. We write the starting time of segment $i$ as $s_i = 1 + \sum_{j<i} z_j$.
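To make this notation concrete, the following sketch recovers the starting times $s_i$ from the durations $z$ and decodes a highest-scoring labeled segmentation with a simple semi-Markov dynamic program. It is a minimal illustration under assumptions not in the text: `segment_score` is a hypothetical callable returning the compatibility of a candidate segment with a label, segment lengths are capped at `max_seg_len`, and dependencies between adjacent labels are ignored for brevity (the full model allows them).

```python
# A minimal sketch of the segmentation bookkeeping and a zeroth-order
# semi-Markov Viterbi decoder; `segment_score(start, length, label)` is assumed.

def starting_times(z):
    """s_i = 1 + sum_{j<i} z_j, using 1-based positions as in the paper."""
    s, total = [], 1
    for duration in z:
        s.append(total)
        total += duration
    return s

def viterbi_segment(T, labels, segment_score, max_seg_len):
    """Best labeled segmentation of a length-T input.
    best[t] holds the score of the best segmentation of the first t tokens."""
    best = [0.0] + [float("-inf")] * T
    back = [None] * (T + 1)
    for t in range(1, T + 1):
        for length in range(1, min(max_seg_len, t) + 1):
            start = t - length
            for y in labels:
                cand = best[start] + segment_score(start, length, y)
                if cand > best[t]:
                    best[t], back[t] = cand, (start, length, y)
    # Recover (duration, label) pairs by walking the backpointers.
    segs, t = [], T
    while t > 0:
        start, length, y = back[t]
        segs.append((length, y))
        t = start
    return best[T], list(reversed(segs))
```

For instance, `starting_times([2, 3, 1])` returns `[1, 3, 6]`, consistent with $s_i = 1 + \sum_{j<i} z_j$ for an input of length 6; the decoder runs in time polynomial in the input length, the label set size, and the assumed length cap.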