Unsupervised Learning for Lexicon-Based Classification

Jacob Eisenstein
Georgia Institute of Technology
801 Atlantic Drive NW, Atlanta, Georgia 30318

Abstract

In lexicon-based classification, documents are assigned labels by comparing the number of words that appear from two opposed lexicons, such as positive and negative sentiment. Creating such word lists is often easier than labeling instances, and they can be debugged by non-experts if classification performance is unsatisfactory. However, there is little analysis or justification of this classification heuristic. This paper describes a set of assumptions that can be used to derive a probabilistic justification for lexicon-based classification, as well as an analysis of its expected accuracy. One key assumption behind lexicon-based classification is that all words in each lexicon are equally predictive. This is rarely true in practice, which is why lexicon-based approaches are usually outperformed by supervised classifiers that learn distinct weights on each word from labeled instances. This paper shows that it is possible to learn such weights without labeled data, by leveraging co-occurrence statistics across the lexicons. This offers the best of both worlds: light supervision in the form of lexicons, and data-driven classification with higher accuracy than traditional word-counting heuristics.

Introduction

Lexicon-based classification refers to a classification rule in which documents are assigned labels based on the count of words from lexicons associated with each label (Taboada et al. 2011). For example, suppose that we have opposed labels $Y \in \{0, 1\}$ and associated lexicons $\mathcal{W}_0$ and $\mathcal{W}_1$. Then for a document with a vector of word counts $x$, the lexicon-based decision rule is,

$$\sum_{i \in \mathcal{W}_0} x_i \;\gtrless\; \sum_{j \in \mathcal{W}_1} x_j, \qquad (1)$$

where the operator $\gtrless$ indicates a decision rule: choose $\hat{y} = 0$ if the left side is larger, and $\hat{y} = 1$ if the right side is larger. Put simply, the rule is to select the label whose lexicon matches the most word tokens.

Lexicon-based classification is widely used in industry and academia, with applications ranging from sentiment classification and opinion mining (Pang and Lee 2008; Liu 2015) to the psychological and ideological analysis of texts (Laver and Garry 2000; Tausczik and Pennebaker 2010). The popularity of this approach can be explained by its relative simplicity and ease of use: for domain experts, creating lexicons is intuitive, and, in comparison with labeling instances, it may offer a faster path towards a reasonably accurate classifier (Settles 2011). Furthermore, classification errors can be iteratively debugged by refining the lexicons.

However, from a machine learning perspective, there are a number of drawbacks to lexicon-based classification. First, while intuitively reasonable, lexicon-based classification lacks theoretical justification: it is not clear what conditions are necessary for it to work. Second, the lexicons may be incomplete, even for designers with strong substantive intuitions. Third, lexicon-based classification assigns an equal weight to each word, but some words may be more strongly predictive than others.[1] Fourth, lexicon-based classification ignores multi-word phenomena, such as negation (e.g., not so good) and discourse (e.g., the movie would be watchable if it had better acting).

[1] Some lexicons attach coarse-grained predefined weights to each word. For example, the OpinionFinder subjectivity lexicon labels words as strongly or weakly subjective (Wilson, Wiebe, and Hoffmann 2005). This poses an additional burden on the lexicon creator.
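To make the counting rule concrete, the following is a minimal sketch of rule (1) in Python. The toy lexicons and whitespace tokenizer are placeholders for illustration only, not the resources or preprocessing used in the experiments reported below.

```python
# Minimal sketch of the lexicon-counting rule in (1): label 0 wins if its
# lexicon matches more tokens, label 1 otherwise. Lexicons and tokenizer
# are toy placeholders, not the resources used in the experiments.
W0 = {"good", "great", "enjoyable"}   # hypothetical label-0 (positive) lexicon
W1 = {"bad", "boring", "awful"}       # hypothetical label-1 (negative) lexicon

def lexicon_classify(text):
    tokens = text.lower().split()
    m0 = sum(1 for tok in tokens if tok in W0)  # tokens matching lexicon 0
    m1 = sum(1 for tok in tokens if tok in W1)  # tokens matching lexicon 1
    return 0 if m0 >= m1 else 1                 # ties broken arbitrarily

print(lexicon_classify("a great and enjoyable film , never boring"))  # -> 0
```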
Supervised classification systems, which are trained on labeled examples, tend to outperform lexicon-based classifiers, even without accounting for multi-word phenomena (Liu 2015; Pang and Lee 2008). Several researchers have proposed methods for lexicon expansion, automatically growing lexicons from an initial seed set (Hatzivassiloglou and McKeown 1997; Qiu et al. 2011). There is also work on handling multi-word phenomena such as negation (Wilson, Wiebe, and Hoffmann 2005; Polanyi and Zaenen 2006) and discourse (Somasundaran, Wiebe, and Ruppenhofer 2008; Bhatia, Ji, and Eisenstein 2015). However, the theoretical foundations of lexicon-based classification remain poorly understood, and we lack principled means for automatically assigning weights to lexicon items without resorting to labeled instances.

This paper elaborates a set of assumptions under which lexicon-based classification is equivalent to Naïve Bayes classification. I then derive expected error rates under these assumptions. These expected error rates are not matched by observations on real data, suggesting that the underlying assumptions are invalid. Of key importance is the assumption that each lexicon item is equally predictive. To relax this assumption, I derive a principled method for estimating word probabilities under each label, using a method-of-moments estimator on cross-lexical co-occurrence counts. Overall, this paper makes the following contributions:

- justifying lexicon-based classification as a special case of multinomial Naïve Bayes;
- mathematically analyzing this model to compute the expected performance of lexicon-based classifiers;
- extending the model to justify a popular variant of lexicon-based classification, which incorporates word presence rather than raw counts;
- deriving a method-of-moments estimator for the parameters of this model, enabling lexicon-based classification with unique weights per word, without labeled data;
- empirically demonstrating that this classifier outperforms lexicon-based classification and alternative approaches.

Lexicon-Based Classification as Naïve Bayes

I begin by showing how the lexicon-based classification rule in (1) can be derived as a special case of Naïve Bayes classification. Suppose we have a prior probability $P_Y$ for the label $Y$, and a likelihood function $P_{X|Y}$, where $X$ is a random variable corresponding to a vector of word counts. The conditional label probability can be computed by Bayesian inversion,

$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{\sum_{y'} P(x \mid y')\,P(y')}. \qquad (2)$$

Assuming that the costs for each type of misclassification error are identical, the minimum Bayes risk classification rule is,

$$\log \Pr(Y = 0) + \log P(x \mid Y = 0) \;\gtrless\; \log \Pr(Y = 1) + \log P(x \mid Y = 1), \qquad (3)$$

moving to the log domain for simplicity of notation. I now show that lexicon-based classification can be justified under this decision rule, given a set of assumptions about the probability distributions.

Let us introduce some assumptions about the likelihood function $P_{X|Y}$. The random variable $X$ is defined over vectors of counts, so a natural choice for the form of this likelihood is the multinomial distribution, corresponding to a multinomial Naïve Bayes classifier.
For a specific vector of counts $X = x$, write $P(x \mid y) = P_{\text{multinomial}}(x; \theta_y, N)$, where $\theta_y$ is a probability vector associated with label $y$, $N = \sum_{i=1}^{V} x_i$ is the total count of tokens in $x$, and $x_i$ is the count of word $i \in \{1, 2, \ldots, V\}$. The multinomial likelihood is proportional to a product of likelihoods of categorical variables corresponding to individual words (tokens),

$$\Pr(W = i \mid Y = y; \theta) = \theta_{y,i}, \qquad (4)$$

where the random variable $W$ corresponds to a single token, whose probability of being word type $i$ is equal to $\theta_{y,i}$ in a document with label $y$. The multinomial log-likelihood can be written as,

$$\log P(x \mid y) = \log P_{\text{multinomial}}(x; \theta_y, N) = K(x) + \sum_{i=1}^{V} x_i \log \Pr(W = i \mid Y = y; \theta) = K(x) + \sum_{i=1}^{V} x_i \log \theta_{y,i}, \qquad (5)$$

where $K(x)$ is a function of $x$ that is constant in $y$.

The first necessary assumption about the likelihood function is that the lexicons are complete: words that are in neither lexicon have identical probability under both labels. Formally, for any word $i \notin \mathcal{W}_0 \cup \mathcal{W}_1$, we assume,

$$\Pr(W = i \mid Y = 0) = \Pr(W = i \mid Y = 1), \qquad (6)$$

which implies that these words are irrelevant to the classification boundary. Next, we must assume that each in-lexicon word is equally predictive. Specifically, for words in lexicon $\mathcal{W}_y$,

$$\frac{\Pr(W = i \mid Y = y)}{\Pr(W = i \mid Y = \bar{y})} = \frac{1 + \gamma}{1 - \gamma}, \qquad (7)$$

where $\bar{y}$ is the opposite label from $y$. The parameter $\gamma$ controls the predictiveness of the lexicon: for example, if $\gamma = 0.5$ in a sentiment classification problem, this would indicate that words in the positive sentiment lexicon are three times more likely to appear in documents with positive sentiment than in documents with negative sentiment, and vice versa. The word atrocious might be less likely overall than good, but still three times more likely in the negative class than in the positive class. In the limit, $\gamma = 0$ implies that the lexicons do not distinguish the classes at all, and $\gamma = 1$ implies that the lexicons distinguish the classes perfectly, so that the observation of a single in-lexicon word would completely determine the document label.

The conditions enumerated in (6) and (7) are ensured by the following definition,

$$\theta_{y,i} = \begin{cases} (1 + \gamma)\,\mu_i, & i \in \mathcal{W}_y \\ (1 - \gamma)\,\mu_i, & i \in \mathcal{W}_{\bar{y}} \\ \mu_i, & i \notin \mathcal{W}_y \cup \mathcal{W}_{\bar{y}}, \end{cases} \qquad (8)$$

where $\bar{y}$ is the opposite label from $y$, and $\mu$ is a vector of baseline probabilities, which are independent of the label. Because the probability vectors $\theta_0$ and $\theta_1$ must each sum to one, we require an assumption of equal coverage,

$$\sum_{i \in \mathcal{W}_0} \mu_i = \sum_{j \in \mathcal{W}_1} \mu_j. \qquad (9)$$

Finally, assume that the labels have equal prior likelihood, $\Pr(Y = 0) = \Pr(Y = 1)$. It is trivial to relax this assumption by adding a constant term to one side of the decision rule in (1).

With these assumptions in hand, it is now possible to simplify the decision rule in (3). Thanks to the assumption of equal prior probability, we can drop the priors $P(Y)$, so that the decision rule is a comparison of the likelihoods,

$$\log P(x \mid Y = 0) \;\gtrless\; \log P(x \mid Y = 1) \qquad (10)$$
$$K(x) + \sum_i x_i \log \theta_{0,i} \;\gtrless\; K(x) + \sum_i x_i \log \theta_{1,i}. \qquad (11)$$

Canceling $K(x)$ and applying the definition from (8),

$$\sum_{i \in \mathcal{W}_0} x_i \log\big((1 + \gamma)\mu_i\big) + \sum_{i \in \mathcal{W}_1} x_i \log\big((1 - \gamma)\mu_i\big) \;\gtrless\; \sum_{i \in \mathcal{W}_0} x_i \log\big((1 - \gamma)\mu_i\big) + \sum_{i \in \mathcal{W}_1} x_i \log\big((1 + \gamma)\mu_i\big). \qquad (12)$$

The $\mu_i$ terms cancel after distributing the log, leaving,

$$\sum_{i \in \mathcal{W}_0} x_i \log \frac{1 + \gamma}{1 - \gamma} \;\gtrless\; \sum_{i \in \mathcal{W}_1} x_i \log \frac{1 + \gamma}{1 - \gamma}. \qquad (13)$$

For any $\gamma \in (0, 1)$, the term $\log \frac{1 + \gamma}{1 - \gamma}$ is a finite and positive constant. Therefore, (13) is identical to the counting-based classification rule in (1). In other words, lexicon-based classification is minimum Bayes risk classification in a multinomial probability model, under the assumptions of equal prior likelihood, lexicon completeness, equal predictiveness of words, and equal coverage.
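This equivalence can also be checked numerically: with $\theta$ constructed as in (8), the log-likelihood comparison in (11) and the counting rule in (1) select the same label on any document, up to ties. The sketch below uses arbitrary illustrative values for the lexicons, $\mu$, and $\gamma$; it is a sanity check of the algebra, not part of the proposed method.

```python
import numpy as np

# Numerical check of the equivalence: with theta built as in (8), the
# Naive Bayes comparison (11) agrees with the raw counting rule (1).
# Vocabulary indices 0-1 form W0, indices 2-3 form W1, the rest are
# out-of-lexicon; mu and gamma are arbitrary illustrative values.
W0, W1 = [0, 1], [2, 3]
mu = np.array([0.02, 0.01, 0.02, 0.01, 0.50, 0.44])   # equal coverage over W0 and W1
gamma = 0.5

theta = {0: mu.copy(), 1: mu.copy()}
theta[0][W0] *= 1 + gamma; theta[0][W1] *= 1 - gamma  # definition (8) for y = 0
theta[1][W1] *= 1 + gamma; theta[1][W0] *= 1 - gamma  # definition (8) for y = 1

rng = np.random.default_rng(0)
for _ in range(1000):
    y = int(rng.integers(2))
    p = theta[y] / theta[y].sum()                      # guard against rounding error
    x = rng.multinomial(30, p)                         # a random 30-token document
    nb = int(np.argmax([x @ np.log(theta[0]), x @ np.log(theta[1])]))  # rule (11)
    counting = 0 if x[W0].sum() > x[W1].sum() else 1                   # rule (1)
    assert nb == counting or x[W0].sum() == x[W1].sum()                # agree except on ties
```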
Analysis of Lexicon-Based Classification

One advantage of deriving a formal foundation for lexicon-based classification is that it is possible to analyze its expected performance. For a label $y$, let us write the count of in-lexicon words as $m_y = \sum_{i \in \mathcal{W}_y} x_i$, and the count of opposite-lexicon words as $m_{\bar{y}} = \sum_{i \in \mathcal{W}_{\bar{y}}} x_i$. Lexicon-based classification makes a correct prediction whenever $m_y > m_{\bar{y}}$ for the correct label $y$. To assess the likelihood that $m_y > m_{\bar{y}}$, it is sufficient to compute the expectation and variance of the difference $m_y - m_{\bar{y}}$; under the central limit theorem, we can treat this difference as approximately normally distributed, and compute the probability that the difference is positive using the Gaussian cumulative distribution function (CDF).

Let us use the convenience notation

$$s_\mu \triangleq \sum_{i \in \mathcal{W}_0} \mu_i = \sum_{i \in \mathcal{W}_1} \mu_i. \qquad (14)$$

Recall that we have already taken the assumption that the sums of baseline word probabilities for the two lexicons are equal. Under the multinomial probability model, given a document with $N$ tokens, the expected counts are,

$$E[m_y] = N \sum_{i \in \mathcal{W}_y} \theta_{y,i} = N(1 + \gamma)s_\mu \qquad (15)$$
$$E[m_{\bar{y}}] = N \sum_{i \in \mathcal{W}_{\bar{y}}} \theta_{y,i} = N(1 - \gamma)s_\mu \qquad (16)$$
$$E[m_y - m_{\bar{y}}] = 2N\gamma s_\mu. \qquad (17)$$

Next we compute the variance of this margin,

$$V[m_y - m_{\bar{y}}] = V[m_y] + V[m_{\bar{y}}] - 2\,\mathrm{Cov}(m_y, m_{\bar{y}}). \qquad (18)$$

Each of these terms is the variance of a sum of counts. Under the multinomial distribution, the variance of a single count is $V[x_i] = N\theta_i(1 - \theta_i)$, and the covariance of a pair of counts is $\mathrm{Cov}(x_i, x_j) = -N\theta_i\theta_j$. The variance of the sum $m_y$ is,

$$V[m_y] = \sum_{i \in \mathcal{W}_y} N\theta_i(1 - \theta_i) - \sum_{i \in \mathcal{W}_y}\sum_{j \in \mathcal{W}_y, j \neq i} N\theta_i\theta_j = \sum_{i \in \mathcal{W}_y} \big(N\theta_i - N\theta_i^2\big) - \sum_{i \in \mathcal{W}_y}\sum_{j \in \mathcal{W}_y, j \neq i} N\theta_i\theta_j \qquad (19)$$
$$\le N \sum_{i \in \mathcal{W}_y} \theta_i = N \sum_{i \in \mathcal{W}_y} (1 + \gamma)\mu_i = N(1 + \gamma)s_\mu. \qquad (20)$$

An equivalent upper bound can be computed for the variance of the count of opposite-lexicon words,

$$V[m_{\bar{y}}] \le N(1 - \gamma)s_\mu. \qquad (21)$$

These bounds are fairly tight because the products of probabilities $\theta_i^2$ and $\theta_i\theta_j$ are nearly always small, due to the fact that most words are rare. Because the covariance $\mathrm{Cov}(m_y, m_{\bar{y}})$ is negative (and also involves a product of word probabilities), we can further bound the variance of the margin, obtaining the upper bound,

$$V[m_y - m_{\bar{y}}] \le N(1 + \gamma)s_\mu + N(1 - \gamma)s_\mu = 2Ns_\mu. \qquad (22)$$

By the central limit theorem, the margin $m_y - m_{\bar{y}}$ is approximately normally distributed, with mean $2N\gamma s_\mu$ and variance upper-bounded by $2Ns_\mu$. The probability of making a correct prediction (which occurs when $m_y > m_{\bar{y}}$) is then equal to the cumulative density of a standard normal distribution, $\Phi(z)$, where the z-score is equal to the ratio of the expectation and the standard deviation,

$$z = \frac{E[m_y - m_{\bar{y}}]}{\sqrt{V[m_y - m_{\bar{y}}]}} \ge \frac{2N\gamma s_\mu}{\sqrt{2Ns_\mu}} = \gamma\sqrt{2Ns_\mu}. \qquad (23)$$

Note that by upper-bounding the variance, we obtain a lower bound on the z-score, and thus a lower bound on the expected accuracy. According to this approximation, accuracy is expected to increase with the predictiveness $\gamma$, the document length $N$, and the lexicon coverage $s_\mu$. This helps to explain a dilemma in lexicon design: as more words are added, the coverage increases, but the average predictiveness of each word decreases (assuming the most predictive words are added first). Thus, increasing the size of a lexicon by adding marginal words may not improve performance.

The analysis also predicts that longer documents should be easier to classify. This is because the expected size of the gap $m_y - m_{\bar{y}}$ grows with $N$, while its standard deviation grows only with $\sqrt{N}$. This prediction can be tested empirically, and on all four datasets considered in this paper, it is false: longer documents are harder to classify accurately. This is a clue that the underlying assumptions are not valid.
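The lower bound on expected accuracy implied by (23), $\Phi(\gamma\sqrt{2Ns_\mu})$, is easy to tabulate. The sketch below uses arbitrary illustrative settings of $\gamma$ and $s_\mu$, not values estimated from any of the datasets.

```python
from math import sqrt
from scipy.stats import norm

def expected_accuracy(gamma, N, s_mu):
    """Lower bound on expected accuracy from (23): Phi(gamma * sqrt(2 N s_mu))."""
    return norm.cdf(gamma * sqrt(2 * N * s_mu))

# Illustrative values only: gamma = 0.25, lexicon coverage s_mu = 0.05.
for N in (10, 50, 200, 1000):
    print(N, round(expected_accuracy(0.25, N, 0.05), 3))
```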
The decreased accuracy for especially long reviews may be due to these reviews being more complex, perhaps requiring modeling of the discourse structure (Somasundaran, Wiebe, and Ruppenhofer 2008).

Justifying the Word-Appearance Heuristic

An alternative heuristic to lexicon-based classification is to consider only the presence of each word type, and not its count. This corresponds to the decision rule,

$$\sum_{i \in \mathcal{W}_0} \delta(x_i > 0) \;\gtrless\; \sum_{j \in \mathcal{W}_1} \delta(x_j > 0), \qquad (24)$$

where $\delta(\cdot)$ is a delta function that returns one if the Boolean condition is true, and zero otherwise. In the context of supervised classification, Pang, Lee, and Vaithyanathan (2002) find that word presence is a more predictive feature than word frequency. By ignoring repeated mentions of the same word, heuristic (24) emphasizes the diversity of ways in which a document covers a lexicon, and is more robust to document-specific idiosyncrasies, such as a review of The Joy Luck Club, which might include the positive words joy and luck many times even if the review is negative.

The word-appearance heuristic can also be explained in the framework defined above. The multinomial likelihood $P_{X|Y}$ can be replaced by a Dirichlet compound multinomial (DCM) distribution, also known as a multivariate Polya distribution (Madsen, Kauchak, and Elkan 2005). This distribution is written $P_{\text{dcm}}(x; \alpha_y)$, where $\alpha_y$ is a vector of parameters associated with label $y$, with $\alpha_{y,i} > 0$ for all $i \in \{1, 2, \ldots, V\}$. The DCM is a compound distribution because it treats the parameter of the multinomial as a latent variable to be marginalized out,

$$P_{\text{dcm}}(x; \alpha_y) = \int_\nu P_{\text{multinomial}}(x \mid \nu)\, P_{\text{Dirichlet}}(\nu \mid \alpha_y)\, d\nu. \qquad (25)$$

Intuitively, one can think of the DCM distribution as encoding a model in which each document has its own multinomial distribution over words; this document-specific distribution is itself drawn from a prior that depends on the class label $y$.

Suppose we set the DCM parameter $\alpha = \tau\theta$, with $\theta$ as defined in (8). The constant $\tau > 0$ is then the concentration of the distribution: as $\tau$ grows, the probability distribution over $\nu$ is more closely concentrated around the prior expectation $\theta$. Because $\sum_{i=1}^{V} \theta_i = 1$, the likelihood function under this model is,

$$P_{\text{dcm}}(x \mid y) = \frac{\Gamma(\tau)}{\Gamma(N + \tau)} \prod_{i=1}^{V} \frac{\Gamma(x_i + \tau\theta_{y,i})}{\Gamma(\tau\theta_{y,i})}, \qquad (26)$$

where $\Gamma(\cdot)$ is the gamma function. Minimum Bayes risk classification in this model implies the decision rule:

$$\sum_{i \in \mathcal{W}_0} \log \frac{r_{\text{in}}(x_i)}{r_{\text{out}}(x_i)} \;\gtrless\; \sum_{i \in \mathcal{W}_1} \log \frac{r_{\text{in}}(x_i)}{r_{\text{out}}(x_i)} \qquad (27)$$
$$r_{\text{in}}(x_i) \triangleq \frac{\Gamma(x_i + \tau(1 + \gamma)\mu_i)}{\Gamma(\tau(1 + \gamma)\mu_i)} \qquad (28)$$
$$r_{\text{out}}(x_i) \triangleq \frac{\Gamma(x_i + \tau(1 - \gamma)\mu_i)}{\Gamma(\tau(1 - \gamma)\mu_i)}. \qquad (29)$$

As $\tau \to \infty$, the prior on $\nu$ is tightly linked to $\theta$, so that the model reduces to the multinomial defined above. Another way to see this is to apply the equality $\Gamma(x + 1) = x\Gamma(x)$ to (28) and (29) when $\tau\mu_i \gg x_i$. As $\tau \to 0$, the prior on $\nu$ becomes increasingly diffuse. Repeated counts of any word are better explained by document-specific variation from the prior than by properties of the label. This situation is shown in Figure 1, which plots the effective counts implied by the classification rule (27) for a range of values of the concentration parameter $\tau$, holding the other parameters constant ($\mu_i = 10^{-3}$, $\gamma = 0.5$). For high values of $\tau$, the effective counts track the observed counts linearly, as in the multinomial model; for low values of $\tau$, the effective counts barely increase beyond 1.

[Figure 1: Effective counts as a function of observed counts, for varying values of $\tau$ ($\gamma = 0.5$, $\mu_i = 0.001$; curves for $\tau = 100$, $1000$, $10000$, $100000$). For the datasets considered in this paper, $\tau$ usually falls in the range between 500 and 1000.]
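One way to make the notion of an effective count concrete is to divide the per-word contribution to rule (27), $\log r_{\text{in}}(x_i) - \log r_{\text{out}}(x_i)$, by the per-token multinomial weight $\log\frac{1+\gamma}{1-\gamma}$, so that it reduces to the raw count in the large-$\tau$ limit. The sketch below follows this convention (an assumption about how the figure is normalized, not a detail recoverable from the figure itself) with the same illustrative settings $\gamma = 0.5$ and $\mu_i = 10^{-3}$.

```python
import numpy as np
from scipy.special import gammaln

def effective_count(x, tau, gamma=0.5, mu=1e-3):
    """Per-word contribution to rule (27), log r_in(x) - log r_out(x),
    rescaled by log((1+gamma)/(1-gamma)) so that it reduces to the raw
    count x in the large-tau (multinomial) limit."""
    a_in, a_out = tau * (1 + gamma) * mu, tau * (1 - gamma) * mu
    contrib = (gammaln(x + a_in) - gammaln(a_in)) - (gammaln(x + a_out) - gammaln(a_out))
    return contrib / np.log((1 + gamma) / (1 - gamma))

counts = np.arange(8)
for tau in (100, 1000, 10000, 100000):
    print(tau, np.round(effective_count(counts, tau), 2))
```

For large $\tau$ the printed effective counts track the raw counts roughly linearly, while for small $\tau$ they saturate near 1, matching the behavior described for Figure 1.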
Minka (2012) presents a number of estimators for the concentration parameter $\tau$ from a corpus of text. When the label $y$ is unknown, we cannot apply these estimators directly. However, as described above, out-of-lexicon words are assumed to have identical probability under both labels. This assumption can be exploited to estimate $\tau$ exclusively from the first and second moments of these out-of-lexicon words. Analysis of the expected accuracy of this model is left to future work.

Estimating Word Predictiveness

A crucial simplification made by lexicon-based classification is that all words in each lexicon are equally predictive. In reality, words may be more or less predictive of class labels, for reasons such as sense ambiguity (e.g., well) and degree (e.g., good vs. flawless). By introducing a per-word predictiveness factor $\gamma_i$ into (8), we arrive at a model that is a restricted form of Naïve Bayes. (The restriction is that the probabilities of non-lexicon words are constrained to be identical across classes.) If labeled data were available, this model could be estimated by maximum likelihood. This section shows how to estimate the model without labeled data, using the method of moments.

First, note that the baseline probabilities $\mu_i$ can be estimated directly from counts on an unlabeled corpus; the challenge is to estimate the parameters $\gamma_i$ for all words in the two lexicons. The key intuition that makes this possible is that highly predictive words should rarely appear with words in the opposite lexicon. This idea can be formalized in terms of cross-label counts: the cross-label count $c_i$ is the co-occurrence count of word $i$ with all words in the opposite lexicon,

$$c_i \triangleq \sum_{t=1}^{T} \sum_{j \in \mathcal{W}_{\bar{y}}} x^{(t)}_i x^{(t)}_j, \qquad (30)$$

where $x^{(t)}$ is the vector of word counts for document $t$, with $t \in \{1, \ldots, T\}$. Under the multinomial model defined above, for a single document with $N$ tokens, the expected product of counts for a word pair $(i, j)$ is equal to,

$$E[x_i x_j] = E[x_i]E[x_j] + \mathrm{Cov}(x_i, x_j) = N\theta_i \cdot N\theta_j - N\theta_i\theta_j = N(N - 1)\theta_i\theta_j. \qquad (31)$$

Let us focus on the expected products of counts for cross-lexicon word pairs ($i \in \mathcal{W}_0$, $j \in \mathcal{W}_1$). The parameter $\theta$ depends on the document label $y$, as defined in (8). As a result, we have the following expectations,

$$E[x_i x_j \mid Y = 0] = N(N - 1)\,\mu_i(1 + \gamma_i)\,\mu_j(1 - \gamma_j) = N(N - 1)\,\mu_i\mu_j(1 + \gamma_i - \gamma_j - \gamma_i\gamma_j) \qquad (32)$$
$$E[x_i x_j \mid Y = 1] = N(N - 1)\,\mu_i(1 - \gamma_i)\,\mu_j(1 + \gamma_j) = N(N - 1)\,\mu_i\mu_j(1 - \gamma_i + \gamma_j - \gamma_i\gamma_j) \qquad (33)$$
$$E[x_i x_j] = P(Y = 0)E[x_i x_j \mid Y = 0] + P(Y = 1)E[x_i x_j \mid Y = 1] = N(N - 1)\,\mu_i\mu_j(1 - \gamma_i\gamma_j). \qquad (34)$$

Summing over all words $j \in \mathcal{W}_1$ and all documents $t$,

$$E[c_i] = \sum_{t=1}^{T} \sum_{j \in \mathcal{W}_1} E[x^{(t)}_i x^{(t)}_j] = \sum_{t=1}^{T} N_t(N_t - 1)\,\mu_i \sum_{j \in \mathcal{W}_1} \mu_j(1 - \gamma_i\gamma_j). \qquad (35)$$

Let us write $\gamma^{(1)}$ to indicate the vector of $\gamma_j$ parameters for all $j \in \mathcal{W}_1$, and $\gamma^{(0)}$ for all $i \in \mathcal{W}_0$. The expectation in (35) is a linear function of $\gamma_i$, and a linear function of the vector $\gamma^{(1)}$. Analogously, for all $j \in \mathcal{W}_1$, $E[c_j]$ is a linear function of $\gamma_j$ and $\gamma^{(0)}$. Our goal is to choose $\gamma$ so that the expectations $E[c_i]$ closely match the observed counts $c_i$. This can be viewed as a form of method-of-moments estimation, with the following objective,

$$\frac{1}{2}\sum_{i \in \mathcal{W}_0} (c_i - E[c_i])^2 + \frac{1}{2}\sum_{j \in \mathcal{W}_1} (c_j - E[c_j])^2, \qquad (36)$$

which can be minimized in terms of $\gamma^{(0)}$ and $\gamma^{(1)}$. However, there is an additional constraint: the probability distributions $\theta_0$ and $\theta_1$ must still sum to one. We can express this as a linear constraint on $\gamma^{(0)}$ and $\gamma^{(1)}$,

$$\mu^{(0)} \cdot \gamma^{(0)} - \mu^{(1)} \cdot \gamma^{(1)} = 0, \qquad (37)$$

where $\mu^{(y)}$ is the vector of baseline probabilities for words $i \in \mathcal{W}_y$, and $\mu^{(0)} \cdot \gamma^{(0)}$ indicates a dot product. We therefore formulate the following constrained optimization problem,

$$\min_{\gamma^{(0)}, \gamma^{(1)}} \;\frac{1}{2}\sum_{i \in \mathcal{W}_0} (c_i - E[c_i])^2 + \frac{1}{2}\sum_{j \in \mathcal{W}_1} (c_j - E[c_j])^2$$
$$\text{s.t.} \quad \mu^{(0)} \cdot \gamma^{(0)} - \mu^{(1)} \cdot \gamma^{(1)} = 0, \qquad \gamma_i \in [0, 1) \;\;\forall i \in \mathcal{W}_0 \cup \mathcal{W}_1. \qquad (38)$$

This problem can be solved by the alternating direction method of multipliers (Boyd et al. 2011). The equality constraint can be incorporated into an augmented Lagrangian,

$$\mathcal{L}_\rho(\gamma^{(0)}, \gamma^{(1)}) = \frac{1}{2}\sum_{i \in \mathcal{W}_0} (c_i - E[c_i])^2 + \frac{1}{2}\sum_{j \in \mathcal{W}_1} (c_j - E[c_j])^2 + \frac{\rho}{2}\big(\mu^{(0)} \cdot \gamma^{(0)} - \mu^{(1)} \cdot \gamma^{(1)}\big)^2, \qquad (39)$$

where $\rho > 0$ is the penalty parameter. The augmented Lagrangian is biconvex in $\gamma^{(0)}$ and $\gamma^{(1)}$, which suggests an iterative solution (Boyd et al. 2011, page 76). Specifically, we hold $\gamma^{(1)}$ fixed and solve for $\gamma^{(0)}$, subject to $\gamma_i \in [0, 1)$ for all $i \in \mathcal{W}_0$. We then solve for $\gamma^{(1)}$ under the same conditions. Finally, we update a dual variable $u$, representing the extent to which the equality constraint is violated. These updates are iterated until convergence. The unconstrained local updates to $\gamma^{(0)}$ and $\gamma^{(1)}$ can be computed by solving a system of linear equations, and the result can be projected back onto the feasible region. The penalty parameter $\rho$ is initialized at 1, and then dynamically updated based on the primal and dual residuals (Boyd et al. 2011, pages 20-21). More details are available in the online supplement and source code.[2]

[2] Online supplement: http://link.to/appendix. Source code: https://github.com/jacobeisenstein/probabilistic-lexicon-classification
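The quantities entering (36) through (39) are straightforward to compute from a document-term count matrix. The sketch below is illustrative only: the lexicon index sets, parameter values, and toy corpus are placeholders, and the full projected ADMM updates described above are omitted. It shows the cross-label counts (30), their expectations (35), and the penalized objective.

```python
import numpy as np

def cross_label_counts(X, W_same, W_opp):
    """c_i from (30): co-occurrence of each word in W_same with the opposite lexicon."""
    return (X[:, W_same] * X[:, W_opp].sum(axis=1, keepdims=True)).sum(axis=0)

def expected_cross_label_counts(X, mu, W_same, W_opp, g_same, g_opp):
    """E[c_i] from (35): sum_t N_t (N_t - 1) mu_i sum_j mu_j (1 - gamma_i gamma_j)."""
    N = X.sum(axis=1)
    scale = np.sum(N * (N - 1))
    inner = np.sum(mu[W_opp] * (1 - np.outer(g_same, g_opp)), axis=1)  # one value per word in W_same
    return scale * mu[W_same] * inner

def penalized_objective(X, mu, W0, W1, g0, g1, rho=1.0):
    """Least-squares moment-matching objective with the quadratic penalty of (39),
    omitting the ADMM dual variable."""
    c0, c1 = cross_label_counts(X, W0, W1), cross_label_counts(X, W1, W0)
    e0 = expected_cross_label_counts(X, mu, W0, W1, g0, g1)
    e1 = expected_cross_label_counts(X, mu, W1, W0, g1, g0)
    constraint = mu[W0] @ g0 - mu[W1] @ g1
    return 0.5 * np.sum((c0 - e0) ** 2) + 0.5 * np.sum((c1 - e1) ** 2) + 0.5 * rho * constraint ** 2

# Toy usage with placeholder data: 200 documents, 50 word types,
# lexicon index sets and candidate gamma values chosen arbitrarily.
rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(200, 50))
mu = X.sum(axis=0) / X.sum()                    # baseline probabilities from corpus counts
W0, W1 = np.arange(0, 5), np.arange(5, 10)
g0, g1 = np.full(5, 0.3), np.full(5, 0.3)
print(penalized_objective(X, mu, W0, W1, g0, g1))
```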
Evaluation

An empirical evaluation is performed on four datasets in two languages. All datasets involve binary classification problems, and performance is quantified by the area under the curve (AUC), a measure of classification performance that is robust to unbalanced class distributions. A perfect classifier achieves AUC = 1; in expectation, a random decision rule gives AUC = 0.5.

Datasets

The proposed method relies on co-occurrence counts, and is therefore best suited to documents containing at least a few sentences each. With this in mind, the following datasets are used in the evaluation:

- Amazon: English-language product reviews across four domains; of these reviews, 8000 are labeled and another 19677 are unlabeled (Blitzer, Dredze, and Pereira 2007).
- Cornell: 2000 English-language film reviews (version 2.0), labeled as positive or negative (Pang and Lee 2004).
- Corpus Cine: 3800 Spanish-language movie reviews, rated on a scale of one to five (Vilares, Alonso, and Gómez-Rodríguez 2015). Ratings of four or five are considered positive; ratings of one and two are considered negative. Reviews with a rating of three are excluded.
- IMDB: 50,000 English-language film reviews (Maas et al. 2011). This evaluation includes only the test set of 25,000 reviews, of which half are positive and half are negative.

Lexicons

A preliminary evaluation compared several English-language sentiment lexicons. The Liu lexicon (Liu 2015) consistently obtained the best performance on all three English-language datasets, so it was made the focus of all subsequent experiments. Ribeiro et al. (2016) also found that the Liu lexicon is one of the strongest lexicons for review analysis. For the Spanish data, the ISOL lexicon was used (Molina-González et al. 2013); it is a modified translation of the Liu lexicon.
Classifiers

The evaluation compares the following unsupervised classification strategies:

- LEXICON: basic word counting, as in decision rule (1);
- LEX-PRESENCE: counting word presence rather than frequency, as in decision rule (24);
- PROBLEX-MULT: probabilistic lexicon-based classification, as proposed in this paper, using the multinomial likelihood model;
- PROBLEX-DCM: probabilistic lexicon-based classification, using the Dirichlet compound multinomial likelihood to reduce effective counts for repeated words;
- PMI: an alternative approach, discussed in the related work, which imputes document labels from a seed set of words, and then computes sentiment scores for individual words from pointwise mutual information between the words and the imputed labels (Turney 2002). The implementation of this method is based on the description from Kiritchenko, Zhu, and Mohammad (2014), using the lexicons as the seed word sets.

As an upper bound, a supervised logistic regression classifier is also considered. This classifier is trained using five-fold cross-validation, and it is the only classifier with access to training data. For the PROBLEX-MULT and PROBLEX-DCM methods, lexicon words which co-occur with the opposite lexicon at greater than chance frequency are eliminated from the lexicon in a preprocessing step.

Results

Results are shown in Table 1. The superior performance of the logistic regression classifier confirms the principle that supervised classification is far more accurate than lexicon-based classification. Therefore, supervised classification should be preferred when labeled data is available. Nonetheless, the probabilistic lexicon-based classifiers developed in this paper (PROBLEX-MULT and PROBLEX-DCM) go a considerable way towards closing the gap, with improvements in AUC ranging from less than 1% on the Corpus Cine data to nearly 8% on the IMDB data. The PMI approach performs poorly, improving over the simpler lexicon-based classifiers on only one of the four datasets. The word-presence heuristic offers no consistent improvements, and the Bayesian adjustment to the classification rule (PROBLEX-DCM) offers only modest improvements on two of the four datasets.

Table 1: Area under the curve (AUC) for all classifiers. The best unsupervised result is shown in bold for each dataset.

                 Amazon   Cornell   Cine    IMDB
  LEXICON         .820     .765     .636    .807
  LEX-PRESENCE    .820     .770     .638    .805
  PMI             .793     .761     .638    .868
  PROBLEX-MULT    .832     .810     .644    .884
  PROBLEX-DCM     .836     .824     .645    .884
  LOGREG          .897     .914     .889    .955

Related work

Turney (2002) uses pointwise mutual information to estimate the semantic orientation of all vocabulary words from co-occurrence with a small seed set. This approach was later extended to the social media domain by using emoticons as the seed set (Kiritchenko, Zhu, and Mohammad 2014). Like the approach proposed here, the basic intuition is to leverage co-occurrence statistics to learn weights for individual words; however, PMI is a heuristic score that is not justified by a probabilistic model of the text classification problem. PMI-based classification underperforms PROBLEX-MULT and PROBLEX-DCM on all four datasets in our evaluation.
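As a rough sketch of this baseline, following the description above rather than the exact recipe of Kiritchenko, Zhu, and Mohammad (2014), the imputed-label PMI scores can be computed as follows; the smoothing constant and tie-breaking rule are illustrative choices.

```python
import numpy as np

# Rough sketch of the PMI baseline (Turney 2002): impute a document label by
# lexicon counting, then score each word by PMI(word, label 0) - PMI(word, label 1)
# under the imputed labels. X is a document-term count matrix; W0 and W1 are
# lexicon index arrays. Smoothing (eps) and tie-breaking are illustrative.
def pmi_sentiment_scores(X, W0, W1, eps=0.5):
    imputed0 = X[:, W0].sum(axis=1) > X[:, W1].sum(axis=1)   # imputed label-0 documents
    n_pos = X[imputed0].sum(axis=0) + eps      # per-word counts in imputed label-0 docs
    n_neg = X[~imputed0].sum(axis=0) + eps     # per-word counts in imputed label-1 docs
    return np.log(n_pos * n_neg.sum() / (n_neg * n_pos.sum()))  # positive leans toward label 0

def pmi_classify(x, scores):
    return 0 if x @ scores > 0 else 1
```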
The method of moments has become an increasingly popular estimator in unsupervised machine learning, with applications in topic models (Anandkumar et al. 2014), sequence models (Hsu, Kakade, and Zhang 2012), and more elaborate linguistic structures (Cohen et al. 2014). Of particular relevance are anchor word techniques for learning latent topic models (Arora, Ge, and Moitra 2012). In these methods, each topic is defined first by a few keywords, which are assumed to be generated only from a single topic. From these anchor words and co-occurrence statistics, the topic-word probabilities can be recovered. A key difference is that the strong anchor word assumption is not required in this work: none of the words are assumed to be perfectly predictive of either label. We require only the much weaker assumption that words in a lexicon tend to co-occur less frequently with words in the opposite lexicon.

Conclusion

Lexicon-based classification is a popular heuristic that has not previously been analyzed from a machine learning perspective. This analysis yields two techniques for improving unsupervised binary classification: a method-of-moments estimator for word predictiveness, and a Bayesian adjustment for repeated counts of the same word. The method-of-moments estimator yields substantially better performance than conventional lexicon-based classification, without requiring any additional annotation effort. Future work will consider the generalization to multi-class classification, and, more ambitiously, the extension to multi-word units.

Acknowledgment

This research was supported by the National Institutes of Health under award number R01GM112697-01, and by the Air Force Office of Scientific Research.

References

Anandkumar, A.; Ge, R.; Hsu, D.; Kakade, S. M.; and Telgarsky, M. 2014. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research 15(1):2773–2832.

Arora, S.; Ge, R.; and Moitra, A. 2012. Learning topic models: going beyond SVD. In FOCS, 1–10.

Bhatia, P.; Ji, Y.; and Eisenstein, J. 2015. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).

Blitzer, J.; Dredze, M.; and Pereira, F. 2007. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the Association for Computational Linguistics (ACL), 440–447.

Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1–122.

Cohen, S. B.; Stratos, K.; Collins, M.; Foster, D. P.; and Ungar, L. 2014. Spectral learning of latent-variable PCFGs: Algorithms and sample complexity. Journal of Machine Learning Research 15:2399–2449.

Hatzivassiloglou, V., and McKeown, K. R. 1997. Predicting the semantic orientation of adjectives. In Proceedings of the Association for Computational Linguistics (ACL), 174–181.

Hsu, D.; Kakade, S. M.; and Zhang, T. 2012. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences 78(5):1460–1480.

Kiritchenko, S.; Zhu, X.; and Mohammad, S. M. 2014. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research 50:723–762.

Laver, M., and Garry, J. 2000. Estimating policy positions from political texts. American Journal of Political Science 619–634.

Liu, B. 2015. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press.

Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts, C. 2011. Learning word vectors for sentiment analysis. In Proceedings of the Association for Computational Linguistics (ACL).

Madsen, R. E.; Kauchak, D.; and Elkan, C. 2005. Modeling word burstiness using the Dirichlet distribution.
In Proceedings of the 22nd International Conference on Machine Learning, 545–552. ACM.

Minka, T. 2012. Estimating a Dirichlet distribution. http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf.

Molina-González, M. D.; Martínez-Cámara, E.; Martín-Valdivia, M.-T.; and Perea-Ortega, J. M. 2013. Semantic orientation for polarity classification in Spanish reviews. Expert Systems with Applications 40(18):7250–7257.

Pang, B., and Lee, L. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the Association for Computational Linguistics (ACL), 271–278.

Pang, B., and Lee, L. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2):1–135.

Pang, B.; Lee, L.; and Vaithyanathan, S. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), 79–86.

Polanyi, L., and Zaenen, A. 2006. Contextual valence shifters. In Computing Attitude and Affect in Text: Theory and Applications. Springer.

Qiu, G.; Liu, B.; Bu, J.; and Chen, C. 2011. Opinion word expansion and target extraction through double propagation. Computational Linguistics 37(1):9–27.

Ribeiro, F. N.; Araújo, M.; Gonçalves, P.; Gonçalves, M. A.; and Benevenuto, F. 2016. SentiBench: a benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Science 5(1):1–29.

Settles, B. 2011. Closing the loop: Fast, interactive semi-supervised annotation with queries on features and instances. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1467–1478. Association for Computational Linguistics.

Somasundaran, S.; Wiebe, J.; and Ruppenhofer, J. 2008. Discourse level opinion interpretation. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, 801–808. Association for Computational Linguistics.

Taboada, M.; Brooke, J.; Tofiloski, M.; Voll, K.; and Stede, M. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics 37(2):267–307.

Tausczik, Y. R., and Pennebaker, J. W. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29(1):24–54.

Turney, P. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the Association for Computational Linguistics (ACL), 417–424.

Vilares, D.; Alonso, M. A.; and Gómez-Rodríguez, C. 2015. A syntactic approach for opinion mining on Spanish reviews. Natural Language Engineering 21:139–163.

Wilson, T.; Wiebe, J.; and Hoffmann, P. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), 347–354.