# Segmental Recurrent Neural Networks

Published as a conference paper at ICLR 2016

Lingpeng Kong, Chris Dyer
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{lingpenk, cdyer}@cs.cmu.edu

Noah A. Smith
Computer Science & Engineering
University of Washington
Seattle, WA 98195, USA
nasmith@cs.washington.edu

ABSTRACT

We introduce segmental recurrent neural networks (SRNNs), which define, given an input sequence, a joint probability distribution over segmentations of the input and labelings of the segments. Representations of the input segments (i.e., contiguous subsequences of the input) are computed by encoding their constituent tokens using bidirectional recurrent neural nets, and these segment embeddings are used to define compatibility scores with output labels. These local compatibility scores are integrated using a global semi-Markov conditional random field. Both fully supervised training, in which segment boundaries and labels are observed, and partially supervised training, in which segment boundaries are latent, are straightforward. Experiments on handwriting recognition and joint Chinese word segmentation/POS tagging show that, compared to models that do not explicitly represent segments, such as BIO tagging schemes and connectionist temporal classification (CTC), SRNNs obtain substantially higher accuracies.

1 INTRODUCTION

For sequential data like speech, handwriting, and DNA, segmentation and segment-labeling are abstractions that capture many common data analysis challenges. We consider the joint task of breaking an input sequence into contiguous, arbitrary-length segments while labeling each segment.

Our new approach to this problem is the segmental recurrent neural network (SRNN). SRNNs combine two powerful machine learning tools: representation learning and structured prediction. First, bidirectional recurrent neural networks (RNNs) embed every feasible segment of the input in a continuous space, and these embeddings are then used to calculate the compatibility of each candidate segment with a label. Unlike past RNN-based approaches (e.g., connectionist temporal classification or CTC; Graves et al., 2006), each candidate segment is represented explicitly, allowing application in settings where an alignment between segments and labels is desired as part of the output (e.g., protein secondary structure prediction or information extraction from text). At the same time, SRNNs are a variant of semi-Markov conditional random fields (Sarawagi & Cohen, 2004), in that they define a conditional probability distribution over the output space (segmentation and labeling) given the input sequence (§2). This allows explicit modeling of statistical dependencies, such as those between adjacent labels, and also of segment lengths (unlike widely used symbolic approaches based on BIO tagging; Ramshaw & Marcus, 1995). Because the probability score decomposes into chain-structured clique potentials, polynomial-time dynamic programming algorithms exist for prediction and parameter estimation (§3). Parameters can be learned either with a fully supervised objective, where both segment boundaries and segment labels are provided at training time, or with partially supervised training objectives, where segment boundaries are latent (§4).
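To make the first ingredient concrete, the sketch below shows one straightforward way candidate segments could be embedded with a bidirectional RNN and scored against labels. It is a minimal sketch, not the paper's implementation: the class name `SegmentScorer`, the `max_seg_len` cap on segment length, the use of PyTorch, and the particular pooling of the forward and backward states are all assumptions introduced here for illustration.

```python
# A minimal sketch (not the authors' code): embed every candidate segment of a
# sequence with a bidirectional LSTM and score its compatibility with each label.
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_labels, max_seg_len=6):
        super().__init__()
        self.birnn = nn.LSTM(input_dim, hidden_dim,
                             bidirectional=True, batch_first=True)
        self.label_scores = nn.Linear(2 * hidden_dim, num_labels)
        self.hidden_dim = hidden_dim
        self.max_seg_len = max_seg_len  # assumed cap on segment length

    def forward(self, x):
        """x: (T, input_dim) token embeddings for a single input sequence."""
        T = x.size(0)
        scores = {}  # (start, length) -> tensor of per-label compatibility scores
        for start in range(T):
            for length in range(1, min(self.max_seg_len, T - start) + 1):
                segment = x[start:start + length].unsqueeze(0)  # (1, length, input_dim)
                out, _ = self.birnn(segment)                    # (1, length, 2 * hidden_dim)
                # Segment embedding: last forward state and first (leftmost) backward state.
                h_fwd = out[0, -1, :self.hidden_dim]
                h_bwd = out[0, 0, self.hidden_dim:]
                seg_emb = torch.cat([h_fwd, h_bwd])
                scores[(start, length)] = self.label_scores(seg_emb)
        return scores
```

For example, `SegmentScorer(input_dim=8, hidden_dim=16, num_labels=5)(torch.randn(10, 8))` returns a score vector for every candidate segment of a length-10 input, up to the assumed length cap.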
We compare SRNNs to strong models that do not explicitly represent segments on handwriting recognition and on joint word segmentation and part-of-speech tagging for Chinese text. SRNNs obtain significant accuracy improvements on both tasks, demonstrating the value of models that explicitly model segmentation even when segmentation is not necessary for downstream tasks (§5).

2 MODEL

Given a sequence of input observations $x = x_1, x_2, \ldots, x_{|x|}$ with length $|x|$, a segmental recurrent neural network (SRNN) defines a joint distribution $p(y, z \mid x)$ over a sequence of labeled segments, each of which is characterized by a duration ($z_i \in \mathbb{Z}_+$) and a label ($y_i \in Y$). The segment durations are constrained such that $\sum_{i=1}^{|z|} z_i = |x|$. The length of the output sequence $|y| = |z|$ is a random variable, and $|y| \le |x|$ with probability 1. We write the starting time of segment $i$ as $s_i = 1 + \sum_{j<i} z_j$.
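To make this notation concrete, the following sketch recovers the starting times $s_i$ from the durations $z$ and decodes a highest-scoring labeled segmentation with a simple semi-Markov dynamic program. It is a minimal illustration under assumptions not in the text: `segment_score` is a hypothetical callable returning the compatibility of a candidate segment with a label, segment lengths are capped at `max_seg_len`, and dependencies between adjacent labels are ignored for brevity (the full model allows them).

```python
# A minimal sketch of the segmentation bookkeeping and a zeroth-order
# semi-Markov Viterbi decoder; `segment_score(start, length, label)` is assumed.

def starting_times(z):
    """s_i = 1 + sum_{j<i} z_j, using 1-based positions as in the paper."""
    s, total = [], 1
    for duration in z:
        s.append(total)
        total += duration
    return s

def viterbi_segment(T, labels, segment_score, max_seg_len):
    """Best labeled segmentation of a length-T input.
    best[t] holds the score of the best segmentation of the first t tokens."""
    best = [0.0] + [float("-inf")] * T
    back = [None] * (T + 1)
    for t in range(1, T + 1):
        for length in range(1, min(max_seg_len, t) + 1):
            start = t - length
            for y in labels:
                cand = best[start] + segment_score(start, length, y)
                if cand > best[t]:
                    best[t], back[t] = cand, (start, length, y)
    # Recover (duration, label) pairs by walking the backpointers.
    segs, t = [], T
    while t > 0:
        start, length, y = back[t]
        segs.append((length, y))
        t = start
    return best[T], list(reversed(segs))
```

For instance, `starting_times([2, 3, 1])` returns `[1, 3, 6]`, consistent with $s_i = 1 + \sum_{j<i} z_j$ for an input of length 6; the decoder runs in time polynomial in the input length, the label set size, and the assumed length cap.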