Unsupervised Learning for Lexicon-Based Classification

Jacob Eisenstein
Georgia Institute of Technology
801 Atlantic Drive NW, Atlanta, Georgia 30318

Abstract

In lexicon-based classification, documents are assigned labels by comparing the number of words that appear from two opposed lexicons, such as positive and negative sentiment. Creating such word lists is often easier than labeling instances, and they can be debugged by non-experts if classification performance is unsatisfactory. However, there is little analysis or justification of this classification heuristic. This paper describes a set of assumptions that can be used to derive a probabilistic justification for lexicon-based classification, as well as an analysis of its expected accuracy. One key assumption behind lexicon-based classification is that all words in each lexicon are equally predictive. This is rarely true in practice, which is why lexicon-based approaches are usually outperformed by supervised classifiers that learn distinct weights on each word from labeled instances. This paper shows that it is possible to learn such weights without labeled data, by leveraging co-occurrence statistics across the lexicons. This offers the best of both worlds: light supervision in the form of lexicons, and data-driven classification with higher accuracy than traditional word-counting heuristics.

Introduction

Lexicon-based classification refers to a classification rule in which documents are assigned labels based on the count of words from lexicons associated with each label (Taboada et al. 2011). For example, suppose that we have opposed labels $Y \in \{0, 1\}$ and associated lexicons $\mathcal{W}_0$ and $\mathcal{W}_1$. Then for a document with a vector of word counts $x$, the lexicon-based decision rule is,

$$\sum_{i \in \mathcal{W}_0} x_i \;\gtrless\; \sum_{j \in \mathcal{W}_1} x_j, \qquad (1)$$

where the operator $\gtrless$ indicates a decision rule: choose $\hat{y} = 0$ if the left side is larger, and $\hat{y} = 1$ if the right side is larger. Put simply, the rule is to select the label whose lexicon matches the most word tokens.

Lexicon-based classification is widely used in industry and academia, with applications ranging from sentiment classification and opinion mining (Pang and Lee 2008; Liu 2015) to the psychological and ideological analysis of texts (Laver and Garry 2000; Tausczik and Pennebaker 2010). The popularity of this approach can be explained by its relative simplicity and ease of use: for domain experts, creating lexicons is intuitive, and, in comparison with labeling instances, it may offer a faster path towards a reasonably accurate classifier (Settles 2011). Furthermore, classification errors can be iteratively debugged by refining the lexicons.

However, from a machine learning perspective, there are a number of drawbacks to lexicon-based classification. First, while intuitively reasonable, lexicon-based classification lacks theoretical justification: it is not clear what conditions are necessary for it to work. Second, the lexicons may be incomplete, even for designers with strong substantive intuitions. Third, lexicon-based classification assigns an equal weight to each word, but some words may be more strongly predictive than others.[1] Fourth, lexicon-based classification ignores multi-word phenomena, such as negation (e.g., not so good) and discourse (e.g., the movie would be watchable if it had better acting).

[1] Some lexicons attach coarse-grained predefined weights to each word. For example, the OpinionFinder subjectivity lexicon labels words as strongly or weakly subjective (Wilson, Wiebe, and Hoffmann 2005). This poses an additional burden on the lexicon creator.
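To make the counting rule concrete, the following is a minimal sketch of rule (1) in Python. The toy lexicons and whitespace tokenizer are placeholders for illustration only, not the resources or preprocessing used in the experiments reported below.

```python
# Minimal sketch of the lexicon-counting rule in (1): label 0 wins if its
# lexicon matches more tokens, label 1 otherwise. Lexicons and tokenizer
# are toy placeholders, not the resources used in the experiments.
W0 = {"good", "great", "enjoyable"}   # hypothetical label-0 (positive) lexicon
W1 = {"bad", "boring", "awful"}       # hypothetical label-1 (negative) lexicon

def lexicon_classify(text):
    tokens = text.lower().split()
    m0 = sum(1 for tok in tokens if tok in W0)  # tokens matching lexicon 0
    m1 = sum(1 for tok in tokens if tok in W1)  # tokens matching lexicon 1
    return 0 if m0 >= m1 else 1                 # ties broken arbitrarily

print(lexicon_classify("a great and enjoyable film , never boring"))  # -> 0
```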
Supervised classification systems, which are trained on labeled examples, tend to outperform lexicon-based classifiers, even without accounting for multi-word phenomena (Liu 2015; Pang and Lee 2008). Several researchers have proposed methods for lexicon expansion, automatically growing lexicons from an initial seed set (Hatzivassiloglou and McKeown 1997; Qiu et al. 2011). There is also work on handling multi-word phenomena such as negation (Wilson, Wiebe, and Hoffmann 2005; Polanyi and Zaenen 2006) and discourse (Somasundaran, Wiebe, and Ruppenhofer 2008; Bhatia, Ji, and Eisenstein 2015). However, the theoretical foundations of lexicon-based classification remain poorly understood, and we lack principled means for automatically assigning weights to lexicon items without resorting to labeled instances.

This paper elaborates a set of assumptions under which lexicon-based classification is equivalent to Naïve Bayes classification. I then derive expected error rates under these assumptions. These expected error rates are not matched by observations on real data, suggesting that the underlying assumptions are invalid. Of key importance is the assumption that each lexicon item is equally predictive. To relax this assumption, I derive a principled method for estimating word probabilities under each label, using a method-of-moments estimator on cross-lexical co-occurrence counts. Overall, this paper makes the following contributions:

- justifying lexicon-based classification as a special case of multinomial Naïve Bayes;
- mathematically analyzing this model to compute the expected performance of lexicon-based classifiers;
- extending the model to justify a popular variant of lexicon-based classification, which incorporates word presence rather than raw counts;
- deriving a method-of-moments estimator for the parameters of this model, enabling lexicon-based classification with unique weights per word, without labeled data;
- empirically demonstrating that this classifier outperforms lexicon-based classification and alternative approaches.

Lexicon-Based Classification as Naïve Bayes

I begin by showing how the lexicon-based classification rule in (1) can be derived as a special case of Naïve Bayes classification. Suppose we have a prior probability $P_Y$ for the label $Y$, and a likelihood function $P_{X|Y}$, where $X$ is a random variable corresponding to a vector of word counts. The conditional label probability can be computed by Bayesian inversion,

$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{\sum_{y'} P(x \mid y')\,P(y')}. \qquad (2)$$

Assuming that the costs for each type of misclassification error are identical, the minimum Bayes risk classification rule is,

$$\log \Pr(Y = 0) + \log P(x \mid Y = 0) \;\gtrless\; \log \Pr(Y = 1) + \log P(x \mid Y = 1), \qquad (3)$$

moving to the log domain for simplicity of notation. I now show that lexicon-based classification can be justified under this decision rule, given a set of assumptions about the probability distributions.

Let us introduce some assumptions about the likelihood function $P_{X|Y}$. The random variable $X$ is defined over vectors of counts, so a natural choice for the form of this likelihood is the multinomial distribution, corresponding to a multinomial Naïve Bayes classifier.
For a specific vector of counts $X = x$, write $P(x \mid y) = P_{\text{multinomial}}(x; \theta_y, N)$, where $\theta_y$ is a probability vector associated with label $y$, $N = \sum_{i=1}^{V} x_i$ is the total count of tokens in $x$, and $x_i$ is the count of word $i \in \{1, 2, \ldots, V\}$. The multinomial likelihood is proportional to a product of likelihoods of categorical variables corresponding to individual words (tokens),

$$\Pr(W = i \mid Y = y; \theta) = \theta_{y,i}, \qquad (4)$$

where the random variable $W$ corresponds to a single token, whose probability of being word type $i$ is equal to $\theta_{y,i}$ in a document with label $y$. The multinomial log-likelihood can be written as,

$$\log P(x \mid y) = \log P_{\text{multinomial}}(x; \theta_y, N) = K(x) + \sum_{i=1}^{V} x_i \log \Pr(W = i \mid Y = y; \theta) = K(x) + \sum_{i=1}^{V} x_i \log \theta_{y,i}, \qquad (5)$$

where $K(x)$ is a function of $x$ that is constant in $y$.

The first necessary assumption about the likelihood function is that the lexicons are complete: words that are in neither lexicon have identical probability under both labels. Formally, for any word $i \notin \mathcal{W}_0 \cup \mathcal{W}_1$, we assume,

$$\Pr(W = i \mid Y = 0) = \Pr(W = i \mid Y = 1), \qquad (6)$$

which implies that these words are irrelevant to the classification boundary. Next, we must assume that each in-lexicon word is equally predictive. Specifically, for words in lexicon $\mathcal{W}_y$,

$$\frac{\Pr(W = i \mid Y = y)}{\Pr(W = i \mid Y = \bar{y})} = \frac{1 + \gamma}{1 - \gamma}, \qquad (7)$$

where $\bar{y}$ is the opposite label from $y$. The parameter $\gamma$ controls the predictiveness of the lexicon: for example, if $\gamma = 0.5$ in a sentiment classification problem, this would indicate that words in the positive sentiment lexicon are three times more likely to appear in documents with positive sentiment than in documents with negative sentiment, and vice versa. The word atrocious might be less likely overall than good, but still three times more likely in the negative class than in the positive class. In the limit, $\gamma = 0$ implies that the lexicons do not distinguish the classes at all, and $\gamma = 1$ implies that the lexicons distinguish the classes perfectly, so that the observation of a single in-lexicon word would completely determine the document label.

The conditions enumerated in (6) and (7) are ensured by the following definition,

$$\theta_{y,i} = \begin{cases} (1 + \gamma)\,\mu_i, & i \in \mathcal{W}_y \\ (1 - \gamma)\,\mu_i, & i \in \mathcal{W}_{\bar{y}} \\ \mu_i, & i \notin \mathcal{W}_y \cup \mathcal{W}_{\bar{y}}, \end{cases} \qquad (8)$$

where $\bar{y}$ is the opposite label from $y$, and $\mu$ is a vector of baseline probabilities, which are independent of the label. Because the probability vectors $\theta_0$ and $\theta_1$ must each sum to one, we require an assumption of equal coverage,

$$\sum_{i \in \mathcal{W}_0} \mu_i = \sum_{j \in \mathcal{W}_1} \mu_j. \qquad (9)$$

Finally, assume that the labels have equal prior likelihood, $\Pr(Y = 0) = \Pr(Y = 1)$. It is trivial to relax this assumption by adding a constant term to one side of the decision rule in (1).

With these assumptions in hand, it is now possible to simplify the decision rule in (3). Thanks to the assumption of equal prior probability, we can drop the priors $P(Y)$, so that the decision rule is a comparison of the likelihoods,

$$\log P(x \mid Y = 0) \;\gtrless\; \log P(x \mid Y = 1) \qquad (10)$$
$$K(x) + \sum_i x_i \log \theta_{0,i} \;\gtrless\; K(x) + \sum_i x_i \log \theta_{1,i}. \qquad (11)$$

Canceling $K(x)$ and applying the definition from (8),

$$\sum_{i \in \mathcal{W}_0} x_i \log\big((1 + \gamma)\mu_i\big) + \sum_{i \in \mathcal{W}_1} x_i \log\big((1 - \gamma)\mu_i\big) \;\gtrless\; \sum_{i \in \mathcal{W}_0} x_i \log\big((1 - \gamma)\mu_i\big) + \sum_{i \in \mathcal{W}_1} x_i \log\big((1 + \gamma)\mu_i\big). \qquad (12)$$

The $\mu_i$ terms cancel after distributing the log, leaving,

$$\sum_{i \in \mathcal{W}_0} x_i \log \frac{1 + \gamma}{1 - \gamma} \;\gtrless\; \sum_{i \in \mathcal{W}_1} x_i \log \frac{1 + \gamma}{1 - \gamma}. \qquad (13)$$

For any $\gamma \in (0, 1)$, the term $\log \frac{1 + \gamma}{1 - \gamma}$ is a finite and positive constant. Therefore, (13) is identical to the counting-based classification rule in (1). In other words, lexicon-based classification is minimum Bayes risk classification in a multinomial probability model, under the assumptions of equal prior likelihood, lexicon completeness, equal predictiveness of words, and equal coverage.
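This equivalence can also be checked numerically: with $\theta$ constructed as in (8), the log-likelihood comparison in (11) and the counting rule in (1) select the same label on any document, up to ties. The sketch below uses arbitrary illustrative values for the lexicons, $\mu$, and $\gamma$; it is a sanity check of the algebra, not part of the proposed method.

```python
import numpy as np

# Numerical check of the equivalence: with theta built as in (8), the
# Naive Bayes comparison (11) agrees with the raw counting rule (1).
# Vocabulary indices 0-1 form W0, indices 2-3 form W1, the rest are
# out-of-lexicon; mu and gamma are arbitrary illustrative values.
W0, W1 = [0, 1], [2, 3]
mu = np.array([0.02, 0.01, 0.02, 0.01, 0.50, 0.44])   # equal coverage over W0 and W1
gamma = 0.5

theta = {0: mu.copy(), 1: mu.copy()}
theta[0][W0] *= 1 + gamma; theta[0][W1] *= 1 - gamma  # definition (8) for y = 0
theta[1][W1] *= 1 + gamma; theta[1][W0] *= 1 - gamma  # definition (8) for y = 1

rng = np.random.default_rng(0)
for _ in range(1000):
    y = int(rng.integers(2))
    p = theta[y] / theta[y].sum()                      # guard against rounding error
    x = rng.multinomial(30, p)                         # a random 30-token document
    nb = int(np.argmax([x @ np.log(theta[0]), x @ np.log(theta[1])]))  # rule (11)
    counting = 0 if x[W0].sum() > x[W1].sum() else 1                   # rule (1)
    assert nb == counting or x[W0].sum() == x[W1].sum()                # agree except on ties
```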
Analysis of Lexicon-Based Classification

One advantage of deriving a formal foundation for lexicon-based classification is that it is possible to analyze its expected performance. For a label $y$, let us write the count of in-lexicon words as $m_y = \sum_{i \in \mathcal{W}_y} x_i$, and the count of opposite-lexicon words as $m_{\bar{y}} = \sum_{i \in \mathcal{W}_{\bar{y}}} x_i$. Lexicon-based classification makes a correct prediction whenever $m_y > m_{\bar{y}}$ for the correct label $y$. To assess the likelihood that $m_y > m_{\bar{y}}$, it is sufficient to compute the expectation and variance of the difference $m_y - m_{\bar{y}}$; under the central limit theorem, we can treat this difference as approximately normally distributed, and compute the probability that the difference is positive using the Gaussian cumulative distribution function (CDF).

Let us use the convenience notation

$$s_\mu \triangleq \sum_{i \in \mathcal{W}_0} \mu_i = \sum_{i \in \mathcal{W}_1} \mu_i. \qquad (14)$$

Recall that we have already taken the assumption that the sums of baseline word probabilities for the two lexicons are equal. Under the multinomial probability model, given a document with $N$ tokens, the expected counts are,

$$E[m_y] = N \sum_{i \in \mathcal{W}_y} \theta_{y,i} = N(1 + \gamma)s_\mu \qquad (15)$$
$$E[m_{\bar{y}}] = N \sum_{i \in \mathcal{W}_{\bar{y}}} \theta_{y,i} = N(1 - \gamma)s_\mu \qquad (16)$$
$$E[m_y - m_{\bar{y}}] = 2N\gamma s_\mu. \qquad (17)$$

Next we compute the variance of this margin,

$$V[m_y - m_{\bar{y}}] = V[m_y] + V[m_{\bar{y}}] - 2\,\mathrm{Cov}(m_y, m_{\bar{y}}). \qquad (18)$$

Each of these terms is the variance of a sum of counts. Under the multinomial distribution, the variance of a single count is $V[x_i] = N\theta_i(1 - \theta_i)$, and the covariance of a pair of counts is $\mathrm{Cov}(x_i, x_j) = -N\theta_i\theta_j$. The variance of the sum $m_y$ is,

$$V[m_y] = \sum_{i \in \mathcal{W}_y} N\theta_i(1 - \theta_i) - \sum_{i \in \mathcal{W}_y}\sum_{j \in \mathcal{W}_y, j \neq i} N\theta_i\theta_j = \sum_{i \in \mathcal{W}_y} \big(N\theta_i - N\theta_i^2\big) - \sum_{i \in \mathcal{W}_y}\sum_{j \in \mathcal{W}_y, j \neq i} N\theta_i\theta_j \qquad (19)$$
$$\le N \sum_{i \in \mathcal{W}_y} \theta_i = N \sum_{i \in \mathcal{W}_y} (1 + \gamma)\mu_i = N(1 + \gamma)s_\mu. \qquad (20)$$

An equivalent upper bound can be computed for the variance of the count of opposite-lexicon words,

$$V[m_{\bar{y}}] \le N(1 - \gamma)s_\mu. \qquad (21)$$

These bounds are fairly tight because the products of probabilities $\theta_i^2$ and $\theta_i\theta_j$ are nearly always small, due to the fact that most words are rare. Because the covariance $\mathrm{Cov}(m_y, m_{\bar{y}})$ is negative (and also involves a product of word probabilities), we can further bound the variance of the margin, obtaining the upper bound,

$$V[m_y - m_{\bar{y}}] \le N(1 + \gamma)s_\mu + N(1 - \gamma)s_\mu = 2Ns_\mu. \qquad (22)$$

By the central limit theorem, the margin $m_y - m_{\bar{y}}$ is approximately normally distributed, with mean $2N\gamma s_\mu$ and variance upper-bounded by $2Ns_\mu$. The probability of making a correct prediction (which occurs when $m_y > m_{\bar{y}}$) is then equal to the cumulative density of a standard normal distribution, $\Phi(z)$, where the z-score is equal to the ratio of the expectation and the standard deviation,

$$z = \frac{E[m_y - m_{\bar{y}}]}{\sqrt{V[m_y - m_{\bar{y}}]}} \ge \frac{2N\gamma s_\mu}{\sqrt{2Ns_\mu}} = \gamma\sqrt{2Ns_\mu}. \qquad (23)$$

Note that by upper-bounding the variance, we obtain a lower bound on the z-score, and thus a lower bound on the expected accuracy. According to this approximation, accuracy is expected to increase with the predictiveness $\gamma$, the document length $N$, and the lexicon coverage $s_\mu$. This helps to explain a dilemma in lexicon design: as more words are added, the coverage increases, but the average predictiveness of each word decreases (assuming the most predictive words are added first). Thus, increasing the size of a lexicon by adding marginal words may not improve performance.

The analysis also predicts that longer documents should be easier to classify. This is because the expected size of the gap $m_y - m_{\bar{y}}$ grows with $N$, while its standard deviation grows only with $\sqrt{N}$. This prediction can be tested empirically, and on all four datasets considered in this paper, it is false: longer documents are harder to classify accurately. This is a clue that the underlying assumptions are not valid.
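The lower bound on expected accuracy implied by (23), $\Phi(\gamma\sqrt{2Ns_\mu})$, is easy to tabulate. The sketch below uses arbitrary illustrative settings of $\gamma$ and $s_\mu$, not values estimated from any of the datasets.

```python
from math import sqrt
from scipy.stats import norm

def expected_accuracy(gamma, N, s_mu):
    """Lower bound on expected accuracy from (23): Phi(gamma * sqrt(2 N s_mu))."""
    return norm.cdf(gamma * sqrt(2 * N * s_mu))

# Illustrative values only: gamma = 0.25, lexicon coverage s_mu = 0.05.
for N in (10, 50, 200, 1000):
    print(N, round(expected_accuracy(0.25, N, 0.05), 3))
```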
The decreased accuracy for especially long reviews may be due to these reviews being more complex, perhaps requiring modeling of the discourse structure (Somasundaran, Wiebe, and Ruppenhofer 2008).

Justifying the Word-Appearance Heuristic

An alternative heuristic to lexicon-based classification is to consider only the presence of each word type, and not its count. This corresponds to the decision rule,

$$\sum_{i \in \mathcal{W}_0} \delta(x_i > 0) \;\gtrless\; \sum_{j \in \mathcal{W}_1} \delta(x_j > 0), \qquad (24)$$

where $\delta(\cdot)$ is a delta function that returns one if the Boolean condition is true, and zero otherwise. In the context of supervised classification, Pang, Lee, and Vaithyanathan (2002) find that word presence is a more predictive feature than word frequency. By ignoring repeated mentions of the same word, heuristic (24) emphasizes the diversity of ways in which a document covers a lexicon, and is more robust to document-specific idiosyncrasies, such as a review of The Joy Luck Club, which might include the positive words joy and luck many times even if the review is negative.

The word-appearance heuristic can also be explained in the framework defined above. The multinomial likelihood $P_{X|Y}$ can be replaced by a Dirichlet compound multinomial (DCM) distribution, also known as a multivariate Polya distribution (Madsen, Kauchak, and Elkan 2005). This distribution is written $P_{\text{dcm}}(x; \alpha_y)$, where $\alpha_y$ is a vector of parameters associated with label $y$, with $\alpha_{y,i} > 0$ for all $i \in \{1, 2, \ldots, V\}$. The DCM is a compound distribution because it treats the parameter of the multinomial as a latent variable to be marginalized out,

$$P_{\text{dcm}}(x; \alpha_y) = \int_\nu P_{\text{multinomial}}(x \mid \nu)\, P_{\text{Dirichlet}}(\nu \mid \alpha_y)\, d\nu. \qquad (25)$$

Intuitively, one can think of the DCM distribution as encoding a model in which each document has its own multinomial distribution over words; this document-specific distribution is itself drawn from a prior that depends on the class label $y$.

Suppose we set the DCM parameter $\alpha = \tau\theta$, with $\theta$ as defined in (8). The constant $\tau > 0$ is then the concentration of the distribution: as $\tau$ grows, the probability distribution over $\nu$ is more closely concentrated around the prior expectation $\theta$. Because $\sum_{i=1}^{V} \theta_i = 1$, the likelihood function under this model is,

$$P_{\text{dcm}}(x \mid y) = \frac{\Gamma(\tau)}{\Gamma(N + \tau)} \prod_{i=1}^{V} \frac{\Gamma(x_i + \tau\theta_{y,i})}{\Gamma(\tau\theta_{y,i})}, \qquad (26)$$

where $\Gamma(\cdot)$ is the gamma function. Minimum Bayes risk classification in this model implies the decision rule:

$$\sum_{i \in \mathcal{W}_0} \log \frac{r_{\text{in}}(x_i)}{r_{\text{out}}(x_i)} \;\gtrless\; \sum_{i \in \mathcal{W}_1} \log \frac{r_{\text{in}}(x_i)}{r_{\text{out}}(x_i)} \qquad (27)$$
$$r_{\text{in}}(x_i) \triangleq \frac{\Gamma(x_i + \tau(1 + \gamma)\mu_i)}{\Gamma(\tau(1 + \gamma)\mu_i)} \qquad (28)$$
$$r_{\text{out}}(x_i) \triangleq \frac{\Gamma(x_i + \tau(1 - \gamma)\mu_i)}{\Gamma(\tau(1 - \gamma)\mu_i)}. \qquad (29)$$

As $\tau \to \infty$, the prior on $\nu$ is tightly linked to $\theta$, so that the model reduces to the multinomial defined above. Another way to see this is to apply the equality $\Gamma(x + 1) = x\Gamma(x)$ to (28) and (29) when $\tau\mu_i \gg x_i$. As $\tau \to 0$, the prior on $\nu$ becomes increasingly diffuse. Repeated counts of any word are better explained by document-specific variation from the prior than by properties of the label. This situation is shown in Figure 1, which plots the effective counts implied by the classification rule (27) for a range of values of the concentration parameter $\tau$, holding the other parameters constant ($\mu_i = 10^{-3}$, $\gamma = 0.5$). For high values of $\tau$, the effective counts track the observed counts linearly, as in the multinomial model; for low values of $\tau$, the effective counts barely increase beyond 1.

[Figure 1: Effective counts as a function of observed counts, for varying values of $\tau$ ($\gamma = 0.5$, $\mu_i = 0.001$; curves for $\tau = 100$, $1000$, $10000$, $100000$). For the datasets considered in this paper, $\tau$ usually falls in the range between 500 and 1000.]
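One way to make the notion of an effective count concrete is to divide the per-word contribution to rule (27), $\log r_{\text{in}}(x_i) - \log r_{\text{out}}(x_i)$, by the per-token multinomial weight $\log\frac{1+\gamma}{1-\gamma}$, so that it reduces to the raw count in the large-$\tau$ limit. The sketch below follows this convention (an assumption about how the figure is normalized, not a detail recoverable from the figure itself) with the same illustrative settings $\gamma = 0.5$ and $\mu_i = 10^{-3}$.

```python
import numpy as np
from scipy.special import gammaln

def effective_count(x, tau, gamma=0.5, mu=1e-3):
    """Per-word contribution to rule (27), log r_in(x) - log r_out(x),
    rescaled by log((1+gamma)/(1-gamma)) so that it reduces to the raw
    count x in the large-tau (multinomial) limit."""
    a_in, a_out = tau * (1 + gamma) * mu, tau * (1 - gamma) * mu
    contrib = (gammaln(x + a_in) - gammaln(a_in)) - (gammaln(x + a_out) - gammaln(a_out))
    return contrib / np.log((1 + gamma) / (1 - gamma))

counts = np.arange(8)
for tau in (100, 1000, 10000, 100000):
    print(tau, np.round(effective_count(counts, tau), 2))
```

For large $\tau$ the printed effective counts track the raw counts roughly linearly, while for small $\tau$ they saturate near 1, matching the behavior described for Figure 1.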
Minka (2012) presents a number of estimators for the concentration parameter $\tau$ from a corpus of text. When the label $y$ is unknown, we cannot apply these estimators directly. However, as described above, out-of-lexicon words are assumed to have identical probability under both labels. This assumption can be exploited to estimate $\tau$ exclusively from the first and second moments of these out-of-lexicon words. Analysis of the expected accuracy of this model is left to future work.

Estimating Word Predictiveness

A crucial simplification made by lexicon-based classification is that all words in each lexicon are equally predictive. In reality, words may be more or less predictive of class labels, for reasons such as sense ambiguity (e.g., well) and degree (e.g., good vs. flawless). By introducing a per-word predictiveness factor $\gamma_i$ into (8), we arrive at a model that is a restricted form of Naïve Bayes. (The restriction is that the probabilities of non-lexicon words are constrained to be identical across classes.) If labeled data were available, this model could be estimated by maximum likelihood. This section shows how to estimate the model without labeled data, using the method of moments.

First, note that the baseline probabilities $\mu_i$ can be estimated directly from counts on an unlabeled corpus; the challenge is to estimate the parameters $\gamma_i$ for all words in the two lexicons. The key intuition that makes this possible is that highly predictive words should rarely appear with words in the opposite lexicon. This idea can be formalized in terms of cross-label counts: the cross-label count $c_i$ is the co-occurrence count of word $i$ with all words in the opposite lexicon,

$$c_i \triangleq \sum_{t=1}^{T} \sum_{j \in \mathcal{W}_{\bar{y}}} x^{(t)}_i x^{(t)}_j, \qquad (30)$$

where $x^{(t)}$ is the vector of word counts for document $t$, with $t \in \{1, \ldots, T\}$. Under the multinomial model defined above, for a single document with $N$ tokens, the expected product of counts for a word pair $(i, j)$ is equal to,

$$E[x_i x_j] = E[x_i]E[x_j] + \mathrm{Cov}(x_i, x_j) = N\theta_i \cdot N\theta_j - N\theta_i\theta_j = N(N - 1)\theta_i\theta_j. \qquad (31)$$

Let us focus on the expected products of counts for cross-lexicon word pairs ($i \in \mathcal{W}_0$, $j \in \mathcal{W}_1$). The parameter $\theta$ depends on the document label $y$, as defined in (8). As a result, we have the following expectations,

$$E[x_i x_j \mid Y = 0] = N(N - 1)\,\mu_i(1 + \gamma_i)\,\mu_j(1 - \gamma_j) = N(N - 1)\,\mu_i\mu_j(1 + \gamma_i - \gamma_j - \gamma_i\gamma_j) \qquad (32)$$
$$E[x_i x_j \mid Y = 1] = N(N - 1)\,\mu_i(1 - \gamma_i)\,\mu_j(1 + \gamma_j) = N(N - 1)\,\mu_i\mu_j(1 - \gamma_i + \gamma_j - \gamma_i\gamma_j) \qquad (33)$$
$$E[x_i x_j] = P(Y = 0)E[x_i x_j \mid Y = 0] + P(Y = 1)E[x_i x_j \mid Y = 1] = N(N - 1)\,\mu_i\mu_j(1 - \gamma_i\gamma_j). \qquad (34)$$

Summing over all words $j \in \mathcal{W}_1$ and all documents $t$,

$$E[c_i] = \sum_{t=1}^{T} \sum_{j \in \mathcal{W}_1} E[x^{(t)}_i x^{(t)}_j] = \sum_{t=1}^{T} N_t(N_t - 1)\,\mu_i \sum_{j \in \mathcal{W}_1} \mu_j(1 - \gamma_i\gamma_j). \qquad (35)$$

Let us write $\gamma^{(1)}$ to indicate the vector of $\gamma_j$ parameters for all $j \in \mathcal{W}_1$, and $\gamma^{(0)}$ for all $i \in \mathcal{W}_0$. The expectation in (35) is a linear function of $\gamma_i$, and a linear function of the vector $\gamma^{(1)}$. Analogously, for all $j \in \mathcal{W}_1$, $E[c_j]$ is a linear function of $\gamma_j$ and $\gamma^{(0)}$. Our goal is to choose $\gamma$ so that the expectations $E[c_i]$ closely match the observed counts $c_i$. This can be viewed as a form of method-of-moments estimation, with the following objective,

$$\frac{1}{2}\sum_{i \in \mathcal{W}_0} (c_i - E[c_i])^2 + \frac{1}{2}\sum_{j \in \mathcal{W}_1} (c_j - E[c_j])^2, \qquad (36)$$

which can be minimized in terms of $\gamma^{(0)}$ and $\gamma^{(1)}$. However, there is an additional constraint: the probability distributions $\theta_0$ and $\theta_1$ must still sum to one. We can express this as a linear constraint on $\gamma^{(0)}$ and $\gamma^{(1)}$,

$$\mu^{(0)} \cdot \gamma^{(0)} - \mu^{(1)} \cdot \gamma^{(1)} = 0, \qquad (37)$$

where $\mu^{(y)}$ is the vector of baseline probabilities for words $i \in \mathcal{W}_y$, and $\mu^{(0)} \cdot \gamma^{(0)}$ indicates a dot product. We therefore formulate the following constrained optimization problem,

$$\min_{\gamma^{(0)}, \gamma^{(1)}} \;\frac{1}{2}\sum_{i \in \mathcal{W}_0} (c_i - E[c_i])^2 + \frac{1}{2}\sum_{j \in \mathcal{W}_1} (c_j - E[c_j])^2$$
$$\text{s.t.} \quad \mu^{(0)} \cdot \gamma^{(0)} - \mu^{(1)} \cdot \gamma^{(1)} = 0, \qquad \gamma_i \in [0, 1) \;\;\forall i \in \mathcal{W}_0 \cup \mathcal{W}_1. \qquad (38)$$

This problem can be solved by the alternating direction method of multipliers (Boyd et al. 2011). The equality constraint can be incorporated into an augmented Lagrangian,

$$\mathcal{L}_\rho(\gamma^{(0)}, \gamma^{(1)}) = \frac{1}{2}\sum_{i \in \mathcal{W}_0} (c_i - E[c_i])^2 + \frac{1}{2}\sum_{j \in \mathcal{W}_1} (c_j - E[c_j])^2 + \frac{\rho}{2}\big(\mu^{(0)} \cdot \gamma^{(0)} - \mu^{(1)} \cdot \gamma^{(1)}\big)^2, \qquad (39)$$

where $\rho > 0$ is the penalty parameter. The augmented Lagrangian is biconvex in $\gamma^{(0)}$ and $\gamma^{(1)}$, which suggests an iterative solution (Boyd et al. 2011, page 76). Specifically, we hold $\gamma^{(1)}$ fixed and solve for $\gamma^{(0)}$, subject to $\gamma_i \in [0, 1)$ for all $i \in \mathcal{W}_0$. We then solve for $\gamma^{(1)}$ under the same conditions. Finally, we update a dual variable $u$, representing the extent to which the equality constraint is violated. These updates are iterated until convergence. The unconstrained local updates to $\gamma^{(0)}$ and $\gamma^{(1)}$ can be computed by solving a system of linear equations, and the result can be projected back onto the feasible region. The penalty parameter $\rho$ is initialized at 1, and then dynamically updated based on the primal and dual residuals (Boyd et al. 2011, pages 20-21). More details are available in the online supplement and source code.[2]

[2] Online supplement: http://link.to/appendix. Source code: https://github.com/jacobeisenstein/probabilistic-lexicon-classification
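The quantities entering (36) through (39) are straightforward to compute from a document-term count matrix. The sketch below is illustrative only: the lexicon index sets, parameter values, and toy corpus are placeholders, and the full projected ADMM updates described above are omitted. It shows the cross-label counts (30), their expectations (35), and the penalized objective.

```python
import numpy as np

def cross_label_counts(X, W_same, W_opp):
    """c_i from (30): co-occurrence of each word in W_same with the opposite lexicon."""
    return (X[:, W_same] * X[:, W_opp].sum(axis=1, keepdims=True)).sum(axis=0)

def expected_cross_label_counts(X, mu, W_same, W_opp, g_same, g_opp):
    """E[c_i] from (35): sum_t N_t (N_t - 1) mu_i sum_j mu_j (1 - gamma_i gamma_j)."""
    N = X.sum(axis=1)
    scale = np.sum(N * (N - 1))
    inner = np.sum(mu[W_opp] * (1 - np.outer(g_same, g_opp)), axis=1)  # one value per word in W_same
    return scale * mu[W_same] * inner

def penalized_objective(X, mu, W0, W1, g0, g1, rho=1.0):
    """Least-squares moment-matching objective with the quadratic penalty of (39),
    omitting the ADMM dual variable."""
    c0, c1 = cross_label_counts(X, W0, W1), cross_label_counts(X, W1, W0)
    e0 = expected_cross_label_counts(X, mu, W0, W1, g0, g1)
    e1 = expected_cross_label_counts(X, mu, W1, W0, g1, g0)
    constraint = mu[W0] @ g0 - mu[W1] @ g1
    return 0.5 * np.sum((c0 - e0) ** 2) + 0.5 * np.sum((c1 - e1) ** 2) + 0.5 * rho * constraint ** 2

# Toy usage with placeholder data: 200 documents, 50 word types,
# lexicon index sets and candidate gamma values chosen arbitrarily.
rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(200, 50))
mu = X.sum(axis=0) / X.sum()                    # baseline probabilities from corpus counts
W0, W1 = np.arange(0, 5), np.arange(5, 10)
g0, g1 = np.full(5, 0.3), np.full(5, 0.3)
print(penalized_objective(X, mu, W0, W1, g0, g1))
```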
Evaluation

An empirical evaluation is performed on four datasets in two languages. All datasets involve binary classification problems, and performance is quantified by the area under the curve (AUC), a measure of classification performance that is robust to unbalanced class distributions. A perfect classifier achieves AUC = 1; in expectation, a random decision rule gives AUC = 0.5.

Datasets

The proposed method relies on co-occurrence counts, and is therefore best suited to documents containing at least a few sentences each. With this in mind, the following datasets are used in the evaluation:

- Amazon: English-language product reviews across four domains; of these reviews, 8000 are labeled and another 19677 are unlabeled (Blitzer, Dredze, and Pereira 2007).
- Cornell: 2000 English-language film reviews (version 2.0), labeled as positive or negative (Pang and Lee 2004).
- Corpus Cine: 3800 Spanish-language movie reviews, rated on a scale of one to five (Vilares, Alonso, and Gómez-Rodríguez 2015). Ratings of four or five are considered positive; ratings of one and two are considered negative. Reviews with a rating of three are excluded.
- IMDB: 50,000 English-language film reviews (Maas et al. 2011). This evaluation includes only the test set of 25,000 reviews, of which half are positive and half are negative.

Lexicons

A preliminary evaluation compared several English-language sentiment lexicons. The Liu lexicon (Liu 2015) consistently obtained the best performance on all three English-language datasets, so it was made the focus of all subsequent experiments. Ribeiro et al. (2016) also found that the Liu lexicon is one of the strongest lexicons for review analysis. For the Spanish data, the ISOL lexicon was used (Molina-González et al. 2013); it is a modified translation of the Liu lexicon.
Classifiers

The evaluation compares the following unsupervised classification strategies:

- LEXICON: basic word counting, as in decision rule (1);
- LEX-PRESENCE: counting word presence rather than frequency, as in decision rule (24);
- PROBLEX-MULT: probabilistic lexicon-based classification, as proposed in this paper, using the multinomial likelihood model;
- PROBLEX-DCM: probabilistic lexicon-based classification, using the Dirichlet compound multinomial likelihood to reduce effective counts for repeated words;
- PMI: an alternative approach, discussed in the related work, which imputes document labels from a seed set of words, and then computes sentiment scores for individual words from pointwise mutual information between the words and the imputed labels (Turney 2002). The implementation of this method is based on the description from Kiritchenko, Zhu, and Mohammad (2014), using the lexicons as the seed word sets.

As an upper bound, a supervised logistic regression classifier is also considered. This classifier is trained using five-fold cross-validation, and it is the only classifier with access to training data. For the PROBLEX-MULT and PROBLEX-DCM methods, lexicon words which co-occur with the opposite lexicon at greater than chance frequency are eliminated from the lexicon in a preprocessing step.

Results

Results are shown in Table 1. The superior performance of the logistic regression classifier confirms the principle that supervised classification is far more accurate than lexicon-based classification. Therefore, supervised classification should be preferred when labeled data is available. Nonetheless, the probabilistic lexicon-based classifiers developed in this paper (PROBLEX-MULT and PROBLEX-DCM) go a considerable way towards closing the gap, with improvements in AUC ranging from less than 1% on the Corpus Cine data to nearly 8% on the IMDB data. The PMI approach performs poorly, improving over the simpler lexicon-based classifiers on only one of the four datasets. The word-presence heuristic offers no consistent improvements, and the Bayesian adjustment to the classification rule (PROBLEX-DCM) offers only modest improvements on two of the four datasets.

Table 1: Area under the curve (AUC) for all classifiers. The best unsupervised result is shown in bold for each dataset.

                 Amazon   Cornell   Cine    IMDB
  LEXICON         .820     .765     .636    .807
  LEX-PRESENCE    .820     .770     .638    .805
  PMI             .793     .761     .638    .868
  PROBLEX-MULT    .832     .810     .644    .884
  PROBLEX-DCM     .836     .824     .645    .884
  LOGREG          .897     .914     .889    .955

Related work

Turney (2002) uses pointwise mutual information to estimate the semantic orientation of all vocabulary words from co-occurrence with a small seed set. This approach was later extended to the social media domain by using emoticons as the seed set (Kiritchenko, Zhu, and Mohammad 2014). Like the approach proposed here, the basic intuition is to leverage co-occurrence statistics to learn weights for individual words; however, PMI is a heuristic score that is not justified by a probabilistic model of the text classification problem. PMI-based classification underperforms PROBLEX-MULT and PROBLEX-DCM on all four datasets in our evaluation.
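As a rough sketch of this baseline, following the description above rather than the exact recipe of Kiritchenko, Zhu, and Mohammad (2014), the imputed-label PMI scores can be computed as follows; the smoothing constant and tie-breaking rule are illustrative choices.

```python
import numpy as np

# Rough sketch of the PMI baseline (Turney 2002): impute a document label by
# lexicon counting, then score each word by PMI(word, label 0) - PMI(word, label 1)
# under the imputed labels. X is a document-term count matrix; W0 and W1 are
# lexicon index arrays. Smoothing (eps) and tie-breaking are illustrative.
def pmi_sentiment_scores(X, W0, W1, eps=0.5):
    imputed0 = X[:, W0].sum(axis=1) > X[:, W1].sum(axis=1)   # imputed label-0 documents
    n_pos = X[imputed0].sum(axis=0) + eps      # per-word counts in imputed label-0 docs
    n_neg = X[~imputed0].sum(axis=0) + eps     # per-word counts in imputed label-1 docs
    return np.log(n_pos * n_neg.sum() / (n_neg * n_pos.sum()))  # positive leans toward label 0

def pmi_classify(x, scores):
    return 0 if x @ scores > 0 else 1
```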
The method of moments has become an increasingly popular estimator in unsupervised machine learning, with applications in topic models (Anandkumar et al. 2014), sequence models (Hsu, Kakade, and Zhang 2012), and more elaborate linguistic structures (Cohen et al. 2014). Of particular relevance are anchor word techniques for learning latent topic models (Arora, Ge, and Moitra 2012). In these methods, each topic is defined first by a few keywords, which are assumed to be generated only from a single topic. From these anchor words and co-occurrence statistics, the topic-word probabilities can be recovered. A key difference is that the strong anchor word assumption is not required in this work: none of the words are assumed to be perfectly predictive of either label. We require only the much weaker assumption that words in a lexicon tend to co-occur less frequently with words in the opposite lexicon.

Conclusion

Lexicon-based classification is a popular heuristic that has not previously been analyzed from a machine learning perspective. This analysis yields two techniques for improving unsupervised binary classification: a method-of-moments estimator for word predictiveness, and a Bayesian adjustment for repeated counts of the same word. The method-of-moments estimator yields substantially better performance than conventional lexicon-based classification, without requiring any additional annotation effort. Future work will consider the generalization to multi-class classification, and, more ambitiously, the extension to multi-word units.

Acknowledgment

This research was supported by the National Institutes of Health under award number R01GM112697-01, and by the Air Force Office of Scientific Research.

References

Anandkumar, A.; Ge, R.; Hsu, D.; Kakade, S. M.; and Telgarsky, M. 2014. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research 15(1):2773–2832.

Arora, S.; Ge, R.; and Moitra, A. 2012. Learning topic models: going beyond SVD. In FOCS, 1–10.

Bhatia, P.; Ji, Y.; and Eisenstein, J. 2015. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).

Blitzer, J.; Dredze, M.; and Pereira, F. 2007. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the Association for Computational Linguistics (ACL), 440–447.

Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1–122.

Cohen, S. B.; Stratos, K.; Collins, M.; Foster, D. P.; and Ungar, L. 2014. Spectral learning of latent-variable PCFGs: Algorithms and sample complexity. Journal of Machine Learning Research 15:2399–2449.

Hatzivassiloglou, V., and McKeown, K. R. 1997. Predicting the semantic orientation of adjectives. In Proceedings of the Association for Computational Linguistics (ACL), 174–181.

Hsu, D.; Kakade, S. M.; and Zhang, T. 2012. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences 78(5):1460–1480.

Kiritchenko, S.; Zhu, X.; and Mohammad, S. M. 2014. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research 50:723–762.

Laver, M., and Garry, J. 2000. Estimating policy positions from political texts. American Journal of Political Science 619–634.

Liu, B. 2015. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press.

Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts, C. 2011. Learning word vectors for sentiment analysis. In Proceedings of the Association for Computational Linguistics (ACL).

Madsen, R. E.; Kauchak, D.; and Elkan, C. 2005. Modeling word burstiness using the Dirichlet distribution.
In Proceedings of the 22nd International Conference on Machine Learning, 545–552. ACM.

Minka, T. 2012. Estimating a Dirichlet distribution. http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf.

Molina-González, M. D.; Martínez-Cámara, E.; Martín-Valdivia, M.-T.; and Perea-Ortega, J. M. 2013. Semantic orientation for polarity classification in Spanish reviews. Expert Systems with Applications 40(18):7250–7257.

Pang, B., and Lee, L. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the Association for Computational Linguistics (ACL), 271–278.

Pang, B., and Lee, L. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2):1–135.

Pang, B.; Lee, L.; and Vaithyanathan, S. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), 79–86.

Polanyi, L., and Zaenen, A. 2006. Contextual valence shifters. In Computing Attitude and Affect in Text: Theory and Applications. Springer.

Qiu, G.; Liu, B.; Bu, J.; and Chen, C. 2011. Opinion word expansion and target extraction through double propagation. Computational Linguistics 37(1):9–27.

Ribeiro, F. N.; Araújo, M.; Gonçalves, P.; Gonçalves, M. A.; and Benevenuto, F. 2016. SentiBench: a benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Science 5(1):1–29.

Settles, B. 2011. Closing the loop: Fast, interactive semi-supervised annotation with queries on features and instances. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1467–1478. Association for Computational Linguistics.

Somasundaran, S.; Wiebe, J.; and Ruppenhofer, J. 2008. Discourse level opinion interpretation. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, 801–808. Association for Computational Linguistics.

Taboada, M.; Brooke, J.; Tofiloski, M.; Voll, K.; and Stede, M. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics 37(2):267–307.

Tausczik, Y. R., and Pennebaker, J. W. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29(1):24–54.

Turney, P. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the Association for Computational Linguistics (ACL), 417–424.

Vilares, D.; Alonso, M. A.; and Gómez-Rodríguez, C. 2015. A syntactic approach for opinion mining on Spanish reviews. Natural Language Engineering 21:139–163.

Wilson, T.; Wiebe, J.; and Hoffmann, P. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), 347–354.