# Unsupervised Sequence Classification using Sequential Output Statistics

Yu Liu, Jianshu Chen, and Li Deng*

Microsoft Research, Redmond, WA 98052, USA (jianshuc@microsoft.com); Citadel LLC, Seattle/Chicago, USA (Li.Deng@citadel.com)

*All three authors contributed equally to the paper. The work was done while Yu Liu and Li Deng were at Microsoft Research.

## Abstract

We consider learning a sequence classifier without labeled data by using sequential output statistics. The problem is highly valuable since obtaining labels in training data is often costly, while the sequential output statistics (e.g., language models) could be obtained independently of input data and thus with low or no cost. To address the problem, we propose an unsupervised learning cost function and study its properties. We show that, compared to earlier works, it is less inclined to be stuck in trivial solutions and avoids the need for a strong generative model. Although it is harder to optimize in its functional form, a stochastic primal-dual gradient method is developed to effectively solve the problem. Experimental results on real-world datasets demonstrate that the new unsupervised learning method gives drastically lower errors than other baseline methods. Specifically, it reaches test errors about twice those obtained by fully supervised learning.

## 1 Introduction

Unsupervised learning is one of the most challenging problems in machine learning. It is often formulated as the modeling of how the world works without requiring a huge amount of human labeling effort, e.g., [8]. To reach this grand goal, it is necessary to first solve a sub-goal of unsupervised learning with high practical value: learning to predict output labels from input data without requiring costly labeled data. Toward this end, we study in this paper the learning of a sequence classifier without labels by using sequential output statistics. The problem is highly valuable since the sequential output statistics, such as language models, could be obtained independently of the input data and thus with no labeling cost.

The problem we consider here is different from most studies on unsupervised learning, which concern automatic discovery of inherent regularities of the input data to learn their representations [13, 28, 18, 17, 5, 1, 31, 20, 14, 12]. When these methods are applied to prediction tasks, either the learned representations are used as feature vectors [22] or the learned unsupervised models are used to initialize a supervised learning algorithm [9, 18, 2, 24, 10]. In both ways, these unsupervised methods play an auxiliary role in helping supervised learning.

Recently, various solutions have been proposed to address the input-to-output prediction problem without using labeled training data, though without demonstrated successes on real-world tasks [11, 30, 7]. Similar to this work, the authors in [7] proposed an unsupervised cost that also exploits the sequence prior of the output samples to train classifiers. The power of such a strong prior in the form of language models in unsupervised learning was also demonstrated in earlier studies [21, 3]. However, these earlier methods did not perform well in practical prediction tasks with real-world data without using additional strong generative models.
Possible reasons are inappropriately formulated cost functions and inappropriate choices of optimization methods. For example, it was shown in [7] that optimizing the highly non-convex unsupervised cost function could easily get stuck in trivial solutions, although adding a special regularization mitigated the problem somewhat.

The solution provided in this paper fundamentally improves on these prior works [11, 30, 7] in the following aspects. First, we propose a novel cost function for unsupervised learning, and find that it has a desired coverage-seeking property that makes the learning algorithm less inclined to be stuck in trivial solutions than the cost function in [7]. Second, we develop a special empirical formulation of this cost function that avoids the need for a strong generative model as in [30, 11]. (The work [11] only proposed a conceptual idea of using generative models to integrate the output structure and the output-to-input structure for unsupervised learning in speech recognition, where the generative models are built from domain knowledge of the speech waveform generation mechanism; no mathematical formulation or successful experimental results are provided in [11].) Third, although the proposed cost function is more difficult to optimize in its functional form, we develop a stochastic primal-dual gradient (SPDG) algorithm to effectively solve the problem. Our analysis of SPDG demonstrates how it is able to reduce the high barriers in the cost function by transforming it into a primal-dual domain. Finally, and most importantly, we demonstrate that the new cost function and the associated SPDG optimization algorithm work well in two real-world classification tasks. In the rest of the paper, we proceed to demonstrate these points and discuss related works along the way.

## 2 Empirical-ODM: An unsupervised learning cost for sequence classifiers

In this section, we extend the earlier work of [30] and propose an unsupervised learning cost named Empirical Output Distribution Match (Empirical-ODM) for training classifiers without labeled data. We first formulate the unsupervised learning problem with sequential output structures. Then, we introduce the Empirical-ODM cost and discuss its important properties that are closely related to unsupervised learning.

### 2.1 Problem formulation

We consider the problem of learning a sequence classifier that predicts an output sequence $(y_1, \ldots, y_{T_0})$ from an input sequence $(x_1, \ldots, x_{T_0})$ without using labeled data, where $T_0$ denotes the length of the sequence. Specifically, the learning algorithm does not have access to a labeled training set $\mathcal{D}_{XY} \triangleq \{(x_1^n, \ldots, x_{T_n}^n, y_1^n, \ldots, y_{T_n}^n) : n = 1, \ldots, M\}$, where $T_n$ denotes the length of the $n$-th sequence. Instead, what is available is a collection of input sequences, denoted as $\mathcal{D}_X \triangleq \{(x_1^n, \ldots, x_{T_n}^n) : n = 1, \ldots, M\}$. In addition, we assume that the sequential output statistics (or sequence prior), in the form of an N-gram probability, are available:

$$p_{\mathrm{LM}}(i_1, \ldots, i_N) \triangleq p_{\mathrm{LM}}(y_{t-N+1}^n = i_1, \ldots, y_t^n = i_N),$$

where $i_1, \ldots, i_N \in \{1, \ldots, C\}$ and the subscript LM stands for language model. Our objective is to train the sequence classifier using only $\mathcal{D}_X$ and $p_{\mathrm{LM}}(\cdot)$. Note that the sequence prior $p_{\mathrm{LM}}(\cdot)$, in the form of language models, is a type of structure commonly found in natural language data, which can be learned from a large amount of text data freely available without labeling cost. For example, in optical character recognition (OCR) tasks, $y_t^n$ could be an English character and $x_t^n$ the input image containing this character. We can estimate an N-gram character-level language model $p_{\mathrm{LM}}(\cdot)$ from a separate text corpus, as sketched below. Therefore, our learning algorithm works in a fully unsupervised manner, without any human labeling cost. In our experiment section, we will demonstrate the effectiveness of our method on such a real OCR task.
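For instance, a character-level N-gram prior can be estimated from raw text as in the following minimal sketch (the function name and the add-alpha smoothing are our illustration choices; Section 4 describes the corpora actually used in the paper):

```python
from collections import Counter
from itertools import product

def estimate_char_ngram_lm(corpus: str, n: int = 2, alpha: float = 1.0):
    """Estimate the N-gram prior p_LM(i_1, ..., i_N) from raw text,
    with add-alpha smoothing so that no N-gram gets zero probability."""
    vocab = sorted(set(corpus))
    counts = Counter(corpus[t:t + n] for t in range(len(corpus) - n + 1))
    total = len(corpus) - n + 1
    num_grams = len(vocab) ** n
    return {''.join(g): (counts[''.join(g)] + alpha) / (total + alpha * num_grams)
            for g in product(vocab, repeat=n)}

p_lm = estimate_char_ngram_lm("the cat sat on the mat. the dog sat on the log.", n=2)
assert abs(sum(p_lm.values()) - 1.0) < 1e-9   # a valid probability distribution
```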
Other potential applications include speech recognition, machine translation, and image/video captioning. In this paper, we focus on sequence classifiers of the form $p_\theta(y_t^n \mid x_t^n)$; that is, the classifier computes the posterior probability of $y_t^n$ based only on the current input sample $x_t^n$ in the sequence. Furthermore, we restrict our choice of $p_\theta(y_t^n \mid x_t^n)$ to linear classifiers, i.e., $p_\theta(y_t = i \mid x_t) = e^{\gamma w_i^T x_t} / \sum_{j=1}^{C} e^{\gamma w_j^T x_t}$, where the model parameter is $\theta \triangleq \{w_i \in \mathbb{R}^d : i = 1, \ldots, C\}$, and focus our attention on designing and understanding unsupervised learning costs and methods for label-free prediction. In fact, as we will show in later sections, even with linear models the unsupervised learning problem is still highly nontrivial and the cost function is highly non-convex. We emphasize that developing a successful unsupervised learning approach for linear classifiers, as we do in this paper, provides important insights and is an important first step towards more advanced nonlinear models (e.g., deep neural networks). We expect that, in future work, the insights obtained here could help us generalize our techniques to nonlinear models.

A recent work that shares the same motivation as ours is [29], which also recognizes the high cost of obtaining labeled data and seeks label-free prediction. Different from our setting, they exploit domain knowledge from laws of physics in computer vision applications, whereas our approach exploits sequential statistics in the natural language outputs. Finally, our problem is fundamentally different from the sequence transduction method in [15], although it also exploits language models for sequence prediction. Specifically, the method in [15] is fully supervised in that it requires supervision at the sequence level; that is, for each input sequence, a corresponding output sequence (of possibly different length) is provided as a label. The language model in [15] only serves the purpose of regularization in the sequence-level supervised learning. In stark contrast, the unsupervised learning we propose does not require supervision at any level, including the sequence level; we do not need the sequence labels but only the prior distribution $p_{\mathrm{LM}}(\cdot)$ of the output sequences.

### 2.2 The Empirical-ODM

We now introduce an unsupervised learning cost that exploits the sequence structure in $p_{\mathrm{LM}}(\cdot)$. It is mainly inspired by the approach to breaking the Caesar cipher, one of the simplest forms of encryption [23]. The Caesar cipher is a substitution cipher in which each letter of the original message is replaced with the letter a fixed number of positions up or down the alphabet. For example, the letter D is replaced by the letter A, the letter E is replaced by the letter B, and so on. In this way, the original message that was readable ends up being unintelligible. The amount of shifting is known to the intended receiver of the message, who can decode the message by shifting back each letter in the encrypted message.
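As a toy illustration of the cipher, and of the frequency-matching attack described in the next paragraph, consider the following sketch (the function names and the add-one smoothing are ours):

```python
import math
from collections import Counter

ALPHABET = [chr(c) for c in range(ord('a'), ord('z') + 1)]

def caesar(text: str, shift: int) -> str:
    """Shift each lowercase letter by `shift` positions (mod 26)."""
    return ''.join(chr((ord(c) - 97 + shift) % 26 + 97) if c.islower() else c
                   for c in text)

def break_caesar(cipher: str, prior_text: str) -> int:
    """Recover the unknown shift by frequency matching: choose the shift whose
    decryption has the highest log-probability under a unigram letter prior
    estimated from `prior_text` (equivalently, the lowest cross entropy
    between the decrypted letter frequencies and the prior)."""
    counts = Counter(c for c in prior_text if c.islower())
    total = sum(counts.values())
    log_prior = {c: math.log((counts[c] + 1) / (total + 26)) for c in ALPHABET}
    def score(shift: int) -> float:
        return sum(log_prior[c] for c in caesar(cipher, -shift) if c.islower())
    return max(range(26), key=score)

plain = "attack at dawn"
cipher = caesar(plain, 3)          # 'dwwdfn dw gdzq'
# With a sufficiently representative English corpus as `prior_text`,
# break_caesar(cipher, prior_text) recovers the shift 3.
```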
However, the Caesar cipher can also be broken by an unintended receiver (who does not know the shift) by analyzing the frequencies of the letters in the encrypted messages and matching them up with the letter distribution of the original text [4, pp. 9-11]. More formally, let $y_t = f(x_t)$ denote a function that maps each encrypted letter $x_t$ into an original letter $y_t$, and let $p_{\mathrm{LM}}(i) \triangleq p_{\mathrm{LM}}(y_t = i)$ denote the prior letter distribution of the original message, estimated from a regular text corpus. When $f(\cdot)$ is constructed so that all mapped letters $\{y_t : y_t = f(x_t),\, t = 1, \ldots, T\}$ have the same distribution as the prior $p_{\mathrm{LM}}(i)$, it is able to break the Caesar cipher and recover the original letters at the mapping outputs.

Inspired by the above approach, the posterior probability $p_\theta(y_t^n \mid x_t^n)$ in our classification problem can be interpreted as a stochastic mapping, which maps each input vector $x_t^n$ (the "encrypted letter") into an output $y_t^n$ (the "original letter") with probability $p_\theta(y_t^n \mid x_t^n)$. Then, in a samplewise manner, each input sequence $(x_1^n, \ldots, x_{T_n}^n)$ is stochastically mapped into an output sequence $(y_1^n, \ldots, y_{T_n}^n)$. We move a step further than the above approach by requiring that the distribution of the N-grams among all the mapped output sequences be close to the prior N-gram distribution $p_{\mathrm{LM}}(i_1, \ldots, i_N)$. With this motivation, we propose to learn the classifier $p_\theta(y_t \mid x_t)$ by minimizing the cross entropy between the prior distribution and the expected N-gram frequency of the output sequences:

$$\mathcal{J}(\theta) \triangleq -\sum_{(i_1, \ldots, i_N)} p_{\mathrm{LM}}(i_1, \ldots, i_N) \ln \bar{p}_\theta(i_1, \ldots, i_N), \tag{1}$$

where $\bar{p}_\theta(i_1, \ldots, i_N)$ denotes the expected frequency of a given N-gram $(i_1, \ldots, i_N)$ among all the output sequences. In Appendix B of the supplementary material, we derive its expression as

$$\bar{p}_\theta(i_1, \ldots, i_N) \triangleq \frac{1}{T} \sum_{n=1}^{M} \sum_{t} \prod_{k=0}^{N-1} p_\theta(y_{t-k}^n = i_{N-k} \mid x_{t-k}^n), \tag{2}$$

where $T \triangleq T_1 + \cdots + T_M$ is the total number of samples in all sequences. Note that minimizing the cross entropy in (1) is also equivalent to minimizing the Kullback-Leibler (KL) divergence between the two distributions, since they differ only by the constant term $\sum p_{\mathrm{LM}} \ln p_{\mathrm{LM}}$. Therefore, the cost function (1) seeks to estimate $\theta$ by matching the two output distributions, where the expected N-gram distribution in (2) is an empirical average over all the samples in the training set. For this reason, we name the cost (1) the Empirical Output Distribution Match (Empirical-ODM) cost. A small computational sketch is given below.
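To make (1) and (2) concrete, here is a minimal NumPy sketch for bigrams ($N = 2$) with a linear softmax classifier on a single input sequence; the names are ours, and for simplicity the sketch normalizes (2) over the number of bigram positions rather than over $T$:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def empirical_odm_cost(W, X, p_lm, eps=1e-12):
    """Eqs. (1)-(2) for N = 2 on a single sequence.
    W: (C, d) linear classifier, X: (T, d) inputs, p_lm: (C, C) bigram prior."""
    post = softmax(X @ W.T)                # p_theta(y_t = i | x_t), shape (T, C)
    # Expected bigram frequency, eq. (2): average over positions t of the
    # outer product p_theta(y_{t-1} | x_{t-1}) * p_theta(y_t | x_t).
    p_bar = np.einsum('ti,tj->ij', post[:-1], post[1:]) / (len(X) - 1)
    # Cross entropy with the prior, eq. (1).
    return -np.sum(p_lm * np.log(p_bar + eps))

C, d, T = 4, 16, 500
rng = np.random.default_rng(0)
W, X = rng.normal(size=(C, d)), rng.normal(size=(T, d))
p_lm = rng.random((C, C)); p_lm /= p_lm.sum()
print(empirical_odm_cost(W, X, p_lm))
```

Note that the dataset-wide average forming `p_bar` sits inside the logarithm of the cost; as discussed in Section 2.4, this is precisely what makes naive SGD on (1) biased.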
In [30], the authors proposed to minimize an output distribution match (ODM) cost, defined as the KL divergence between the prior output distribution and the marginalized output distribution, $D(p_{\mathrm{LM}}(y) \| p_\theta(y))$, where $p_\theta(y) \triangleq \int p_\theta(y|x) p(x) \, dx$. However, evaluating $p_\theta(y)$ requires integrating over the input space using a generative model $p(x)$. Due to the lack of such a generative model, they were not able to optimize this proposed ODM cost. Instead, alternative approaches such as dual autoencoders and GANs were proposed as heuristics, and their results were not successful without using a few labeled data. Our proposed Empirical-ODM cost differs from the ODM cost in [30] in three key aspects. (i) We do not need any labeled data for training. (ii) We exploit the sequence structure of the output statistics, i.e., in our case $y = (y_1, \ldots, y_N)$ (N-gram), whereas in [30] $y = y_t$ (unigram, i.e., no sequence structure). This is crucial in developing a working unsupervised learning algorithm: the change from unigram to N-gram allows us to explicitly exploit the sequence structure at the output, which turns the technique from non-working into working (see Table 2 in Section 4), and it might also explain why the method in [30] failed, as it does not exploit the sequence structure. (iii) We replace the marginalized distribution $p_\theta(y)$ by the expected N-gram frequency in (2). This is critical in that it allows us to directly minimize the divergence between two output distributions without the need for a generative model, which [30] could not do. In fact, we can further show that $\bar{p}_\theta(i_1, \ldots, i_N)$ is an empirical approximation of $p_\theta(y)$ with $y = (y_1, \ldots, y_N)$ (see Appendix B.2 of the supplementary material). In this way, our cost (1) can be understood as an N-gram and empirical version of the ODM cost, up to an additive constant: $y$ is replaced by $y = (y_1, \ldots, y_N)$ and $p_\theta(y)$ is replaced by its empirical approximation.

### 2.3 Coverage-seeking versus mode-seeking

We now discuss an important property of the proposed Empirical-ODM cost (1) by comparing it with the cost proposed in [7]. We show that the Empirical-ODM cost has a coverage-seeking property, which makes it more suitable for unsupervised learning than the mode-seeking cost in [7].

In [7], the authors proposed the expected negative log-likelihood as the unsupervised learning cost function that exploits the output sequential statistics. The intuition was to maximize the aggregated log-likelihood of all the output sequences assumed to be generated by the stochastic mapping $p_\theta(y_t^n \mid x_t^n)$. We show in Appendix A of the supplementary material that their cost is equivalent to

$$-\sum_{i_1, \ldots, i_N} \bar{p}_\theta(i_1, \ldots, i_N) \ln p_{\mathrm{LM}}(i_N \mid i_{N-1}, \ldots, i_1), \tag{3}$$

where $p_{\mathrm{LM}}(i_N \mid i_{N-1}, \ldots, i_1) \triangleq p(y_t^n = i_N \mid y_{t-1}^n = i_{N-1}, \ldots, y_{t-N+1}^n = i_1)$, and the summation is over all possible values of $i_1, \ldots, i_N \in \{1, \ldots, C\}$. In contrast, we can rewrite our cost (1) as

$$-\sum_{i_1, \ldots, i_N} p_{\mathrm{LM}}(i_1, \ldots, i_{N-1}) \, p_{\mathrm{LM}}(i_N \mid i_{N-1}, \ldots, i_1) \ln \bar{p}_\theta(i_1, \ldots, i_N), \tag{4}$$

where we used the chain rule of conditional probabilities. Note that both costs (3) and (4) are in a cross entropy form. However, a key difference is that the positions of the distributions $\bar{p}_\theta(\cdot)$ and $p_{\mathrm{LM}}(\cdot)$ are swapped. We show that the cost (3) proposed in [7] is a mode-seeking divergence between two distributions, while, by swapping $\bar{p}_\theta(\cdot)$ and $p_{\mathrm{LM}}(\cdot)$, our cost (4) becomes a coverage-seeking divergence (see [25] for a detailed discussion of divergences with these two different behaviors). To understand this, consider the following two situations:

- If $p_{\mathrm{LM}}(i_N \mid i_{N-1}, \ldots, i_1) \to 0$ and $\bar{p}_\theta(i_1, \ldots, i_N) > 0$ for a certain $(i_1, \ldots, i_N)$, the cross entropy in (3) goes to $+\infty$ while the cross entropy in (4) approaches zero.
- If $p_{\mathrm{LM}}(i_N \mid i_{N-1}, \ldots, i_1) > 0$ and $\bar{p}_\theta(i_1, \ldots, i_N) \to 0$ for a certain $(i_1, \ldots, i_N)$, the cross entropy in (3) approaches zero while the cross entropy in (4) goes to $+\infty$.

Therefore, the cost function (3) heavily penalizes the classifier if it predicts an output that is believed to be less probable by the prior distribution $p_{\mathrm{LM}}(\cdot)$, but it does not penalize the classifier when it fails to predict an output that $p_{\mathrm{LM}}(\cdot)$ believes to be probable. That is, the classifier is encouraged to predict a single output mode with high probability in $p_{\mathrm{LM}}(\cdot)$, a behavior called mode-seeking in [25]. This probably explains the phenomenon observed in [7]: the training process easily converges to a trivial solution of predicting the same output that has the largest probability in $p_{\mathrm{LM}}(\cdot)$. In contrast, the cost (4) heavily penalizes the classifier if it does not predict an output for which $p_{\mathrm{LM}}(\cdot)$ is positive, and penalizes it less if it predicts outputs for which $p_{\mathrm{LM}}(\cdot)$ is zero. That is, this cost encourages $p_\theta(y|x)$ to cover as much of $p_{\mathrm{LM}}(\cdot)$ as possible, a behavior called coverage-seeking in [25]. Therefore, training the classifier using (4) makes it less inclined to learn trivial solutions than the cost in [7], since trivial solutions are heavily penalized; we verify this in Section 4. In addition, the coverage-seeking property could make the learning less sensitive to the sparseness of language models (i.e., $p_{\mathrm{LM}}$ being zero for some N-grams), since the cost does not penalize these N-grams. In summary, our proposed cost (1) is more suitable for unsupervised learning than that of [7]. The toy computation below illustrates the contrast.
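The following toy computation (ours, not from the paper) makes the contrast concrete for $C = 2$ and bigrams: a trivial classifier that always predicts the most probable symbol attains a lower mode-seeking cost (3) than a classifier whose output bigram distribution matches the prior, but it is enormously penalized by the coverage-seeking cost (4):

```python
import numpy as np

p_lm = np.array([[0.4, 0.1],
                 [0.3, 0.2]])           # toy bigram prior p_LM(i1, i2), C = 2
eps = 1e-12

def mode_seeking_cost(p_bar):           # the form of cost (3)
    cond = p_lm / p_lm.sum(axis=1, keepdims=True)   # p_LM(i2 | i1)
    return -np.sum(p_bar * np.log(cond + eps))

def coverage_seeking_cost(p_bar):       # the form of cost (4), i.e., cost (1)
    return -np.sum(p_lm * np.log(p_bar + eps))

trivial = np.zeros((2, 2)); trivial[0, 0] = 1.0     # always predict symbol 0
matched = p_lm.copy()                               # output bigrams match the prior

for name, p_bar in [("trivial", trivial), ("matched", matched)]:
    print(f"{name:8s} mode-seeking={mode_seeking_cost(p_bar):6.3f}  "
          f"coverage-seeking={coverage_seeking_cost(p_bar):7.3f}")
# The trivial solution attains a *lower* mode-seeking cost than the matched
# one, but an enormous coverage-seeking cost: cost (1) rules it out.
```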
*Figure 1: The profiles of $\mathcal{J}(\theta)$ for the OCR dataset on a two-dimensional affine space passing through the supervised solution. The three panels show the same profile from different angles, where the red dot is the supervised solution. The contours of the profiles are shown at the bottom.*

### 2.4 The difficulties of optimizing $\mathcal{J}(\theta)$

There are two main challenges in optimizing the Empirical-ODM cost $\mathcal{J}(\theta)$ in (1). The first is that the sample average (over the entire training set) in the expression of $\bar{p}_\theta(\cdot)$ (see (2)) sits inside the logarithmic loss, unlike traditional machine learning problems where the average is outside the loss functions (e.g., $\sum_t f_t(\theta)$). This functional form prevents us from applying stochastic gradient descent (SGD) to minimize (1), as the stochastic gradients would be intrinsically biased (see Appendix C for a detailed discussion and Section 4 for the experimental results). The second challenge is that the cost function $\mathcal{J}(\theta)$ is highly non-convex, even with linear classifiers. To see this, we visualize the profile of the cost function $\mathcal{J}(\theta)$ (restricted to a two-dimensional subspace) around the supervised solution in Figure 1. (The approach to visualizing the profile is explained in more detail in Appendix F; more slices and a video of the profiles from many angles can be found in the supplementary material. Note also that the supervised solution, the red dot, coincides with the global optimal solution of $\mathcal{J}(\theta)$: intuitively, the classifier trained by supervised learning should also produce an output N-gram distribution that is close to the prior output N-gram distribution given by $p_{\mathrm{LM}}(\cdot)$.) We observe that there are local optimal solutions and that there are high barriers between the local and global optimal solutions. Therefore, besides the difficulty of having the sample average inside the logarithmic loss, minimizing this cost function directly will be hard, since crossing the high barriers to reach the global optimal solution is difficult without proper initialization.

## 3 The Stochastic Primal-Dual Gradient (SPDG) Algorithm

To address the first difficulty in Section 2.4, we transform the original cost (1) into an equivalent min-max problem in order to bring the sample average out of the logarithmic loss. Then, we can obtain unbiased stochastic gradients to solve the problem. To this end, we first introduce the concept of convex conjugate functions. For a given convex function $f(u)$, its convex conjugate function $f^\star(\nu)$ is defined as $f^\star(\nu) \triangleq \sup_u (\nu^T u - f(u))$ [6, pp. 90-95], where $u$ and $\nu$ are called the primal and dual variables, respectively. For the scalar function $f(u) = -\ln u$, the conjugate function can be calculated as $f^\star(\nu) = -1 - \ln(-\nu)$ with $\nu < 0$. Furthermore, it holds that $f(u) = \sup_\nu (u\nu - f^\star(\nu))$, by which we have $-\ln u = \max_{\nu < 0} (u\nu + 1 + \ln(-\nu))$, with the maximum attained at $\nu = -1/u$ (the supremum is attainable and is thus replaced by a maximum).
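A quick numerical sanity check of this conjugate identity (our illustration):

```python
import numpy as np

# Verify -ln(u) = max_{v<0} (u*v + 1 + ln(-v)), with the maximum at v = -1/u.
for u in [0.2, 1.0, 3.7]:
    v = -np.exp(np.linspace(-8, 8, 200001))        # dense grid over v < 0
    inner = u * v + 1 + np.log(-v)
    assert np.isclose(inner.max(), -np.log(u), atol=1e-6)
    assert np.isclose(v[inner.argmax()], -1.0 / u, rtol=1e-3)
```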
Substituting this identity into (1), the original minimization problem becomes the following equivalent min-max problem:

$$\min_\theta \max_{\{\nu_{i_1,\ldots,i_N} < 0\}} \; \mathcal{L}(\theta, V) \triangleq \frac{1}{T} \sum_{n=1}^{M} \sum_{t} L_t^n(\theta, V) + \sum_{i_1,\ldots,i_N} p_{\mathrm{LM}}(i_1, \ldots, i_N) \ln(-\nu_{i_1,\ldots,i_N}), \tag{5}$$

where $V \triangleq \{\nu_{i_1,\ldots,i_N}\}$ is the collection of all the dual variables and $L_t^n(\theta, V)$ is the $t$-th component function of the $n$-th sequence, defined as

$$L_t^n(\theta, V) \triangleq \sum_{i_1,\ldots,i_N} p_{\mathrm{LM}}(i_1, \ldots, i_N) \, \nu_{i_1,\ldots,i_N} \prod_{k=0}^{N-1} p_\theta(y_{t-k}^n = i_{N-k} \mid x_{t-k}^n).$$

In the equivalent min-max problem (5), we find the optimal solution $(\theta^\star, V^\star)$ by minimizing $\mathcal{L}$ with respect to the primal variable $\theta$ and maximizing $\mathcal{L}$ with respect to the dual variable $V$. The obtained optimal solution $(\theta^\star, V^\star)$ of (5) is called the saddle point of $\mathcal{L}$ [6]. Once it is obtained, we keep only $\theta^\star$, which is also the optimal solution to (1) and thus the model parameter. We further note that the equivalent min-max problem (5) is now in a form that sums over the $T = T_1 + \cdots + T_M$ component functions $L_t^n(\theta, V)$. The empirical average has therefore been brought out of the logarithmic loss, and we are ready to apply stochastic gradient methods. Specifically, we minimize $\mathcal{L}$ with respect to the primal variable $\theta$ by stochastic gradient descent and maximize $\mathcal{L}$ with respect to the dual variable $V$ by stochastic gradient ascent. We therefore name the algorithm the stochastic primal-dual gradient (SPDG) method; see Algorithm 1. We implement the SPDG algorithm in TensorFlow, which automatically computes the stochastic gradients (the code will be released). Finally, the constraints on the dual variables $\nu_{i_1,\ldots,i_N}$ are automatically enforced by the inherent log-barrier $\ln(-\nu_{i_1,\ldots,i_N})$ in (5) [6], so we do not need a separate method to enforce them.

**Algorithm 1: Stochastic Primal-Dual Gradient Method**

1. Input data: $\mathcal{D}_X = \{(x_1^n, \ldots, x_{T_n}^n) : n = 1, \ldots, M\}$ and $p_{\mathrm{LM}}(i_1, \ldots, i_N)$.
2. Initialize $\theta$ and $V$, where the elements of $V$ are negative.
3. Repeat until convergence or a stopping condition is met:
   - Randomly sample a mini-batch of $B$ subsequences of length $N$ from all the sequences in the training set $\mathcal{D}_X$, i.e., $\mathcal{B} = \{(x_{t_m-N+1}^{n_m}, \ldots, x_{t_m}^{n_m})\}_{m=1}^{B}$.
   - Compute the stochastic gradients of $\mathcal{L}(\theta, V)$ with respect to $\theta$ and $V$ for each subsequence in the mini-batch and average them.
   - Update $\theta \leftarrow \theta - \mu_\theta \widehat{\nabla_\theta \mathcal{L}}$ and $V \leftarrow V + \mu_\nu \widehat{\nabla_V \mathcal{L}}$.

We now show that the above min-max (primal-dual) reformulation also alleviates the second difficulty discussed in Section 2.4. As in the case of $\mathcal{J}(\theta)$, we examine the profile of $\mathcal{L}(\theta, V)$ in (5) (restricted to a two-dimensional subspace) around the optimal (supervised) solution in Figure 2a (see Appendix F for the visualization details). Comparing Figure 2a to Figure 1, we observe that the profile of $\mathcal{L}(\theta, V)$ is smoother than that of $\mathcal{J}(\theta)$ and the barrier is significantly lower. To further compare $\mathcal{J}(\theta)$ and $\mathcal{L}(\theta, V)$, we plot in Figure 2b the values of $\mathcal{J}(\theta)$ and $\mathcal{L}(\theta, V)$ along the same line $\theta^\star + \lambda_p(\theta_1 - \theta^\star)$ for different values of $\lambda_p$. It shows that the barrier of $\mathcal{L}(\theta, V)$ along the primal direction is lower than that of $\mathcal{J}(\theta)$. These observations imply that the reformulated min-max problem (5) is better conditioned than the original problem (1), which further justifies the use of the SPDG method.

*Figure 2: The profiles of $\mathcal{L}(\theta, V)$ for the OCR dataset. (a) The profile on a two-dimensional affine space passing through the optimal solution (red dot). (b) The profile along the line $\theta^\star + \lambda_p(\theta_1 - \theta^\star)$ for different values of $\lambda_p \in \mathbb{R}$, where the circles are the optimal solutions.*
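As a concrete illustration of Algorithm 1, below is a minimal PyTorch sketch (ours, not the paper's released code) for bigrams on synthetic data. It parameterizes the dual variables as $\nu = -e^{\rho}$ to keep them negative, whereas the paper relies on the log-barrier in (5) alone; all names, dimensions, and learning rates are assumptions:

```python
import torch

# Synthetic stand-ins: C classes, d-dim inputs, one long sequence of length T.
C, d, T, B = 5, 8, 2000, 32
torch.manual_seed(0)
X = torch.randn(T, d)
p_lm = torch.rand(C, C)
p_lm /= p_lm.sum()                                 # bigram prior p_LM(i1, i2)

W = (0.01 * torch.randn(C, d)).requires_grad_()    # primal: linear classifier
rho = torch.zeros(C, C, requires_grad=True)        # dual: nu = -exp(rho) < 0

opt_primal = torch.optim.SGD([W], lr=0.1)
opt_dual = torch.optim.SGD([rho], lr=0.1)

for step in range(3000):
    t = torch.randint(1, T, (B,))                  # mini-batch of bigram positions
    post_prev = torch.softmax(X[t - 1] @ W.T, dim=1)   # p_theta(y_{t-1} | x_{t-1})
    post_cur = torch.softmax(X[t] @ W.T, dim=1)        # p_theta(y_t | x_t)
    nu = -torch.exp(rho)
    # Mini-batch average of the component functions L_t^n plus the
    # log-barrier term of eq. (5).
    pair = post_prev.unsqueeze(2) * post_cur.unsqueeze(1)       # (B, C, C)
    L = (p_lm * nu * pair).sum(dim=(1, 2)).mean() + (p_lm * torch.log(-nu)).sum()
    opt_primal.zero_grad()
    opt_dual.zero_grad()
    L.backward()
    opt_primal.step()                              # gradient descent in theta
    rho.grad.neg_()                                # flip sign to ascend in the dual
    opt_dual.step()
```

Because the mini-batch enters (5) only through the average of the bilinear terms $L_t^n$, these stochastic gradients are unbiased, unlike stochastic gradients taken directly on (1).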
## 4 Experiments

### 4.1 Experimental setup

We evaluate our unsupervised learning scheme described in the earlier sections on two classification tasks: unsupervised character-level OCR and unsupervised English spelling correction (Spell-Corr). In both tasks, no labels are provided during training; hence, both are fully unsupervised.

For the OCR task, we obtain our dataset from the public UWIII English Document Image Database [27], which contains images of lines of text with the corresponding ground truth. We first use Tesseract [19] to segment the image of each line of text into character tiles and assign each tile one character. We verify the segmentation quality by training a simple neural network classifier on the segmented results, which achieves a 0.9% error rate on the test set. Then, we select sentence segments that are longer than 100 characters and contain only lowercase English characters and common punctuation (space, comma, and period). As a result, we have a vocabulary of size 29, and we obtain 1,175 sentence segments comprising 153,221 characters for our OCR task. To represent the images, we extract VGG19 features of dimension 4096 and project them onto 200-dimensional vectors using principal component analysis. We train the language models (LMs) $p_{\mathrm{LM}}(\cdot)$, which provide the required output sequence statistics, from both in-domain and out-of-domain data sources. The out-of-domain data sources are completely different databases, namely three different language partitions (CNA, NYT, XIN) of the English Gigaword database [26].

In the Spell-Corr task, we learn to correct the spelling of mis-spelled text. From the AFP partition of the Gigaword database, we select 500 sentence segments for our Spell-Corr dataset. We select sentences that are longer than 100 characters and contain only English characters and common punctuation, resulting in a total of 83,567 characters. The mis-spelled texts are generated by simulated character substitutions and are treated as our inputs; the objective of this task is to recover the original text.
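The paper does not detail the substitution simulation; the sketch below shows one plausible scheme, where the substitution rate and all names are our assumptions:

```python
import random

def corrupt(text: str, vocab: str, rate: float = 0.2, seed: int = 0) -> str:
    """Simulate misspellings: each character is independently replaced by a
    random character from `vocab` with probability `rate` (our assumption;
    the paper only states that substitutions are simulated)."""
    rng = random.Random(seed)
    return ''.join(rng.choice(vocab) if rng.random() < rate else c for c in text)

clean = "the quick brown fox jumps over the lazy dog."
noisy = corrupt(clean, vocab="abcdefghijklmnopqrstuvwxyz ,.")   # model input
# The classifier is then trained, without labels, to map `noisy` back
# toward `clean` by matching the output N-gram statistics to the prior.
```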
### 4.2 Results: Comparing optimization algorithms

In the first set of experiments, we evaluate the effectiveness of the SPDG method of Section 3, which is designed to optimize the Empirical-ODM cost of Section 2. The analysis in Sections 2 and 3 sheds light on why SPDG is superior to the method in [7] and to standard stochastic gradient descent (SGD): the coverage-seeking behavior of the proposed Empirical-ODM cost helps avoid trivial solutions, and the simultaneous optimization of the primal-dual variables reduces the barriers in the highly non-convex profile of the cost function. We do not include the methods from [30], because their approaches could not achieve satisfactory results without a few labeled data, while we consider only the fully unsupervised learning setting; in addition, the methods in [30] do not optimize the ODM cost and do not exploit the output sequential statistics. Table 1 provides strong experimental evidence of the substantially greater effectiveness of the primal-dual method over SGD and the method in [7] on both tasks.

All these results are obtained by training the models until convergence. Let us examine the results on the OCR task in detail. First, SPDG on the unsupervised cost function achieves a 9.59% error rate, much lower than the error rate of any of the mini-batch SGD runs, where the mini-batch size ranges from 10 to 10,000. Note that larger mini-batch sizes produce lower errors here because the gradient becomes closer to the full-batch gradient and thus less biased. On the other hand, when the mini-batch size is as small as 10, the high error rate of 83.09% is close to that of a majority-rule guess that always predicts the character (space) with the largest proportion in the training set, i.e., an error rate of $1 - 25{,}499/153{,}221 \approx 83.4\%$. Furthermore, the method from [7] does not perform well no matter how we tune the hyperparameters of its generative regularization. Finally, and perhaps most interestingly, with no labels provided during training, the classification error produced by our method is only about twice that of supervised learning (4.63%, shown in Table 1). This clearly demonstrates that the unsupervised learning scheme proposed in this paper is an effective one. For the Spell-Corr dataset (the Spell-Corr row in Table 1), we observe results consistent with those on the OCR dataset.

Table 1: Test error rates on two datasets: OCR and Spell-Corr. The 2-gram character LM is trained from in-domain data. The numbers inside ⟨·⟩ are the mini-batch sizes of the SGD method.

| Data set | SPDG (Ours) | Method from [7] | SGD ⟨10⟩ | SGD ⟨100⟩ | SGD ⟨1,000⟩ | SGD ⟨10,000⟩ | Supervised Learning | Majority Guess |
|---|---|---|---|---|---|---|---|---|
| OCR | 9.59% | 83.37% | 83.09% | 78.05% | 67.14% | 56.48% | 4.63% | 83.37% |
| Spell-Corr | 1.94% | 82.91% | 82.91% | 72.93% | 65.69% | 45.24% | 0.00% | 82.91% |

### 4.3 Results: Comparing orders of language modeling

In the second set of experiments, we examine to what extent the use of sequential statistics (e.g., 2-gram and 3-gram LMs) improves over the unigram LM (no sequential information) in unsupervised learning. The unsupervised prediction results are shown in Table 2, using different data sources to estimate the N-gram LM parameters. Consistently across all four ways of estimating reliable N-gram LMs, we observe significantly lower error rates when the unsupervised learning exploits a 2-gram or 3-gram LM as sequential statistics than when it exploits a prior with no sequential statistics (i.e., 1-gram). In three of the four cases, exploiting a 3-gram LM gives better results than a 2-gram LM. Furthermore, the error rate obtained with a 3-gram LM estimated from out-of-domain output character data (10.17% in Table 2) is comparable to that obtained with in-domain data (9.59% in Table 1), indicating that the effectiveness of the unsupervised learning paradigm presented in this paper is robust to the quality of the LM acting as the sequential prior.

Table 2: Test error rates on the OCR dataset. Character-level language models (LMs) of different orders are trained from three out-of-domain datasets and from the fused in-domain and out-of-domain data.

| | NYT-LM | XIN-LM | CNA-LM | Fused-LM |
|---|---|---|---|---|
| No. Sents | 1,206,903 | 155,647 | 12,234 | 15,409 |
| No. Chars | 86,005,542 | 18,626,451 | 1,911,124 | 2,064,345 |
| 1-gram | 71.83% | 72.14% | 71.51% | 71.25% |
| 2-gram | 10.93% | 12.55% | 10.56% | 10.33% |
| 3-gram | 10.17% | 12.89% | 10.29% | 9.21% |

## 5 Conclusions and future work

In this paper, we study the problem of learning a sequence classifier without the need for labeled training data. The practical benefit of such unsupervised learning is tremendous.
For example, in large-scale speech recognition systems, the currently dominant supervised learning methods typically require a few thousand hours of training data, where each utterance in acoustic form needs to be labeled by humans. Although there are millions of hours of natural speech data available for training, labeling all of them for supervised learning is hardly feasible. To make effective use of such huge amounts of acoustic data, the practical unsupervised learning approach discussed in this paper would be called for. Other potential applications, such as machine translation and image and video captioning, could also benefit from our paradigm. This is mainly because of their common natural-language output structure, whose sequential statistics we can exploit to learn the classifier without labels. For other (non-natural-language) applications where there is also a sequential output structure, our proposed approach could be applicable in a similar manner.

Furthermore, our proposed Empirical-ODM cost function significantly improves over the one in [7] by emphasizing the coverage-seeking behavior. Although the new cost function has a functional form that is more difficult to optimize, a novel SPDG algorithm is developed to effectively address the problem. An analysis of the profiles of the cost functions sheds light on why SPDG works well and why previous methods failed. Finally, we demonstrate on two datasets that our unsupervised learning method is highly effective, producing only about twice the errors of fully supervised learning, which no previous unsupervised learning method could achieve without additional steps of supervised learning. While the current work is restricted to linear classifiers, we intend to generalize the approach to nonlinear models (e.g., deep neural networks [16]) in future work. We also plan to extend our current method from exploiting N-gram LMs to exploiting state-of-the-art neural LMs. Finally, one challenge that remains to be addressed is scaling the current method to large vocabularies and high-order LMs (i.e., large $C$ and $N$). In this case, the summation over all $(i_1, \ldots, i_N)$ in (5) becomes computationally expensive. A potential solution is to parameterize the dual variable $\nu_{i_1,\ldots,i_N}$ by a recurrent neural network and approximate the sum using beam search, which we leave as future work.

## Acknowledgments

The authors would like to thank all the anonymous reviewers for their constructive feedback.

## References

[1] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, January 2009.

[2] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pages 153-160, 2007.

[3] Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein. Unsupervised transcription of historical documents. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 207-217, 2013.

[4] Albrecht Beutelspacher. Cryptology. Mathematical Association of America, 1994.

[5] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, March 2003.

[6] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[7] Jianshu Chen, Po-Sen Huang, Xiaodong He, Jianfeng Gao, and Li Deng. Unsupervised learning of predictors from unpaired input-output samples. arXiv:1606.04646, 2016.
[8] Soumith Chintala and Yann LeCun. A path to unsupervised learning through adversarial networks. https://code.facebook.com/posts/1587249151575490/a-path-to-unsupervised-learning-through-adversarial-networks/, 2016.

[9] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30-42, 2012.

[10] Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pages 3079-3087, 2015.

[11] Li Deng. Deep learning for speech and language processing. Tutorial at Interspeech, Dresden, Germany, https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/interspeech-tutorial-2015-lideng-sept6a.pdf, August-September 2015.

[12] Ian Goodfellow. Generative adversarial nets. Tutorial at NIPS, http://www.cs.toronto.edu/~dtarlow/pos14/talks/goodfellow.pdf, 2016.

[13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pages 2672-2680, 2014.

[15] Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.

[16] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-Rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, November 2012.

[17] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

[18] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

[19] Anthony Kay. Tesseract: An open-source optical character recognition engine. Linux Journal, 2007.

[20] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[21] Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. Unsupervised analysis for decipherment problems. In Proceedings of COLING/ACL, pages 499-506, 2006.

[22] Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, and Andrew Ng. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning, 2012.

[23] Dennis Luciano and Gordon Prichett. Cryptology: From Caesar ciphers to public-key cryptosystems. The College Mathematics Journal, 18(1):2-17, 1987.

[24] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[25] Tom Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.

[26] Robert Parker et al. English Gigaword fourth edition LDC2009T13. Philadelphia: Linguistic Data Consortium, 2009.

[27] Ihsin Phillips, Bhabatosh Chanda, and Robert Haralick. UW-III English Document Image Database. http://isisdata.science.uva.nl/events/dlia//datasets/uwash3.html.

[28] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pages 194-281. 1986.
[29] Russell Stewart and Stefano Ermon. Label-free supervision of neural networks with physics and domain knowledge. In Proceedings of AAAI, 2017.

[30] Ilya Sutskever, Rafal Jozefowicz, Karol Gregor, Danilo Rezende, Tim Lillicrap, and Oriol Vinyals. Towards principled unsupervised learning. arXiv preprint arXiv:1511.06440, 2015.

[31] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371-3408, 2010.