# semantic_visualization_with_neighborhood_graph_regularization__79a530e6.pdf

Journal of Artiﬁcial Intelligence Research 55 (2016) 1091-1133 Submitted 09/2015; published 04/2016

Semantic Visualization with Neighborhood Graph Regularization

Tuan M. V. Le vmtle.2012@phdis.smu.edu.sg Hady W. Lauw hadywlauw@smu.edu.sg School of Information Systems Singapore Management University 80 Stamford Road, Singapore 178902

Visualization of high-dimensional data, such as text documents, is useful to map out the similarities among various data points. In the high-dimensional space, documents are commonly represented as bags of words, with dimensionality equal to the vocabulary size. Classical approaches to document visualization directly reduce this into visualizable two or three dimensions. Recent approaches consider an intermediate representation in topic space, between word space and visualization space, which preserves the semantics by topic modeling. While aiming for a good ﬁt between the model parameters and the observed data, previous approaches have not considered the local consistency among data instances. We consider the problem of semantic visualization by jointly modeling topics and visualization on the intrinsic document manifold, modeled using a neighborhood graph. Each document has both a topic distribution and visualization coordinate. Speciﬁcally, we propose an unsupervised probabilistic model, called Semafore, which aims to preserve the manifold in the lower-dimensional spaces through a neighborhood regularization framework designed for the semantic visualization task. To validate the eﬃcacy of Semafore, our comprehensive experiments on a number of real-life text datasets of news articles and Web pages show that the proposed methods outperform the state-of-the-art baselines on objective evaluation metrics.

1. Introduction

Text documents come in various ﬂavors, such as Web pages, news articles, blog posts, emails, or messages on social media such as Twitter. While much is in English, there are increasing amounts of content in various languages as well. With the backdrop of the growth in volume, diversity, and complexity of various corpora, we need more useful tools to analyze the wealth of text content. One form of analysis which we will look into in this paper is visualization. There are diﬀerent types of visualizations, be it of the temporal or longitudinal, networked, or other natures. What we are interested in is a form of visualization where we can represent a collection of documents as coordinates on the same low-dimensional space, so as to learn of the similarities and diﬀerences among documents based on their distances on the visualization space.

Visualization of high-dimensional data is an important exploratory data analysis task, which is actively studied by various academic communities. While the HCI community is interested in the presentation of information, as well as other interface aspects (Chi, 2000), the machine learning community is interested in the quality of dimensionality reduction

c 2016 AI Access Foundation. All rights reserved.

(Van der Maaten & Hinton, 2008), i.e., how to transform the high-dimensional representation into a lower-dimensional representation that can be shown on a scatterplot. This visualization form is simple, and widely applicable across various domains.

Consider therefore the problem of visualizing documents on a scatterplot. Commonly, a document is represented as a bag of words, i.e., a vector of word counts. This highdimensional representation would be reduced into coordinates on a visualizable 2D (or 3D) space. One pioneering technique is Multidimensional Scaling (MDS) (Kruskal, 1964). The goal is to preserve the distances in the high-dimensional space in the low-dimensional embedding. When applied to documents, a visualization technique for generic high-dimensional data, e.g., MDS, may not necessarily preserve the topical semantics. Words are often ambiguous, with issues such as polysemy, when the same word carries multiple senses, and synonymy, when diﬀerent words carry the same sense. Because the dimensions in the original representation (which are words) may not accurately capture this ambiguity, this aﬀects the quality of the reduced representation (which is the visualization space) as well.

To model semantics in documents in a way that can resolve some of this ambiguity, the current popular approach is by topic modeling, such as PLSA (Hofmann, 1999) or LDA (Blei, Ng, & Jordan, 2003). Each document is associated with a probability distribution over a set of topics. Each topic is a probability distribution over words in the vocabulary. In this way, polysemous words can be separated into diﬀerent topics, and synonymous words can be grouped into the same topic.

Topic modeling itself is another form of dimensionality reduction: from word space to topic space. The word space refers to a document s original representation, which is usually a bag of words. The topic space refers to the simplex of topic distributions. A document s probability distribution over topics is eﬀectively the representation of this document in this topic space. However, a topic model by itself is not designed for visualization. While one possible visualization is to plot documents topic distributions on a simplex, a 2D visualization space could express only three topics, which is very limiting.

Given its success in modeling semantics in documents, we therefore ask the question of whether and how best to do both forms of dimensionality reductions (visualization and topic modeling) for documents. The end goal is to arrive at a visualization of documents that is consistent with both the semantic representation (topics), as well as the original representation (words). This coupling is a distinct task from topic modeling or visualization respectively, as it enables novel capabilities. For one thing, topic modeling helps to create a richer visualization, as we can now associate each coordinate on the visualization space with both topic and word distributions, providing semantics to the visualization space. For another, the tight integration potentially allows the visualization to serve as a way to explore and tune topic models, allowing users to introduce feedback (Hu, Boyd-Graber, Satinoﬀ, & Smith, 2014) to the model through a visual interface (Choo, Lee, Reddy, & Park, 2013). These capabilities support several use case scenarios. One potential use case is a document organizer system. The visualization could potentially help in assigning categories to documents, by showing how closely related documents have been labeled. Another is an augmented retrieval system. Given a query, the results may include not just relevant documents, but also other similar documents (neighbors in the visualization).

Semantic Visualization with Neighborhood Graph Regularization

1.1 Problem Statement

We refer to the task of jointly modeling topics and visualization as semantic visualization. The input is a set of documents D. For a speciﬁed number of topics Z and visualization dimensionality (assumed to be 2D, without losing any generality), the goal is to derive, for every document in D, a latent coordinate on the visualization space, and a probability distribution over the Z topics. While we focus on documents in our description, the same approach would apply to visualization of other data types for which latent factor modeling, i.e., topic model, makes sense. A straightforward way is to undergo two-step reductions. In the ﬁrst reduction, the original representation for documents are reduced into topic distributions using topic modeling. In the second reduction, documents topic distributions are further reduced into visualization coordinates. This approach may have some value compared to direct reduction from word space to visualization space. However, it is not ideal, because the disjoint reductions could mean that errors may propagate from the ﬁrst to the second reduction, and the resulting visualization may not faithfully capture the original representation. A better way to solve this problem is to join up the two reductions into a single, joint process that produces both topic distributions and visualization coordinates. This approach was ﬁrst pioneered by PLSV (Iwata, Yamada, & Ueda, 2008), which also showed that the joint approach outperformed the disjoint approach. PLSV derives the latent parameters by maximizing the likelihood of observing the documents. This goal is concerned with the error between the model and the observation. In the literature, it is found that algorithms that ensure smoothness tend to perform better at learning tasks (Zhou, Bousquet, Lal, Weston, & Sch olkopf, 2004). Smoothness concerns preserving the observed proximity between documents. This objective arises naturally from the assumption that the intrinsic geometry of the data is a low-rank, non-linear subspace within the high-dimensional space. Therefore, preserving neighborhood structure is important for learning tasks. This assumption is well-accepted in the machine learning community (Laﬀerty & Wasserman, 2007), and ﬁnds application in both supervised and unsupervised learning (Belkin & Niyogi, 2003; Zhou et al., 2004; Zhu, Ghahramani, Lafferty, et al., 2003). Recently, there is a preponderance of evidence that this assumption also applies to text data in particular (Cai, Mei, Han, & Zhai, 2008; Cai, Wang, & He, 2009; Huh & Fienberg, 2012). We therefore propose to incorporate this assumption into a new unsupervised, semantic visualization model.

1.2 Overview

We propose an unsupervised probabilistic model that jointly derives topic distributions and visualization coordinates on the intrinsic geometry of the data. Our proposed model is called Semafore, which stands for SEmantic visualization with MAni FOld REgularization. We build a neighborhood regularization framework into a semantic visualization model. The framework involves new issues to resolve, including the regularization function, and the space in which regularization should take place. The model is evaluated on a series of real-life, publicly available datasets, which are also benchmark datasets used in document classiﬁcation task. An advantage of a statistical method, such as ours, is that it is not dependent on a speciﬁc language. Two of the datasets

are in English, and one is in Brazilian Portuguese. While our model is unsupervised (class label is neither required nor used in learning), to objectively quantify the visualization quality, we leverage on the class label information. It is a common assumption that documents of the same class are expected to be neighbors on the original space (Belkin, Niyogi, & Sindhwani, 2006; Zhou et al., 2004; Zhu et al., 2003), which suggests that they should also be close on the visualization space. We investigate the eﬀectiveness of Semafore in placing documents of the same class nearby on the visualization space, and systematically compare it to existing baselines without one or more of our properties, namely: joint modeling of topic and visualization, or neighborhood regularization.

1.3 Contributions

While visualization and topic modeling are, separately, well-studied problems, the interface between the two, semantic visualization, is a relatively new problem, with very few previous work. In this work, we make the following contributions.

We propose incorporating neighborhood structure in semantic visualization. In this respect, we propose a probabilistic model Semafore, with two integrated components. One is a kernelized semantic visualization model, enabling the substitution of the kernel functions that relate visualization coordinates to topic distributions (see Section 3.3). The other is a neighborhood graph regularization framework for semantic visualization as described in Section 4.1.

Realizing the neighborhood graph regularization involves an exploration of how to incorporate the appropriate forms of the neighborhood structure. In this respect, we investigate the eﬀects of neighborhood graph construction techniques such as knearest neighbors (k-NN), ϵ-ball, and disjoint minimum spanning trees (DMST), as well as diﬀerent edge weight estimations such as heat-kernel (see Section 4.2) in the context of semantic visualization.

In Section 5, we describe the requisite learning algorithms based on maximum a posteriori (MAP) estimation using expectation-maximization (EM), in order to ﬁt the parameters for the various regularization functions and kernels that we propose.

Our ﬁnal contribution is the evaluation of Semafore s eﬀectiveness on a series of reallife, public datasets described in Section 6, which shows that Semafore outperforms existing baselines on a well-established and objective visualization metric.

In our prior work (Le & Lauw, 2014b), we proposed the problem and described the preliminary model. In this extended article, there are signiﬁcant technical changes that provide a signiﬁcantly more comprehensive discussion of the model. For instance, we now discuss the Student-t kernel, in addition to the previously introduced Gaussian kernel. Furthermore, we investigate the eﬃcacies of diﬀerent neighborhood graph constructions, including the ϵ-ball and DMST graphs, in addition to the previously introduced k NN graph. The graph weights are also enhanced through investigation of heat kernel, in addition to the simple-minded binary scheme previously. As discussed in Section 6.3, these enhancements collectively result in statistically signiﬁcant improvements over the previous model. Beyond

Semantic Visualization with Neighborhood Graph Regularization

the technical enhancements, we also provide more comprehensive model analysis and empirical validation, including richer quantitative and qualitative discussions of the visualizations and the resulting topic models, as well as a metric to measure topic interpretability based on pairwise mutual information.

2. Related Work

In this section, we discuss the diﬀerent aspects of our work, identify the related papers in the literature, and point out the key conceptual diﬀerences.

2.1 Visualization and Dimensionality Reduction

One way to perform visualization is by using a generic dimensionality reduction technique. Such techniques come in several ﬂavors, depending on the objective. Principal component analysis (PCA) (Jolliﬀe, 2005) identiﬁes the components that explain most of the variance in the data. Related to PCA is singular value decomposition (SVD) (Golub & Van Loan, 2012). Comparatively, independent component analysis (ICA) (Comon, 1994) identiﬁes the components that are independent of one another, whereas linear discriminant analysis (Fisher s LDA) (Fisher, 1936) identiﬁes the components that most discriminate between known class labels. Being generic, these techniques are more frequently applied to feature extraction, as they are not optimized for visualization. They focus more on the properties of the components (e.g., orthogonality, independence) rather than on the intrinsic relationship among data instances. Furthermore, as they are based on linear projections, they may not capture non-linearities in the data well. Another category of techniques, which is more directly related to visualization, is the embedding approach. It aims to preserve the high-dimensional similarities or diﬀerences in the low-dimensional embedding. One pioneering such work is multidimensional scaling (MDS) (Kruskal, 1964). Given a set of pairwise distances δij between data points i and j, MDS determines coordinates xi and xj respectively, such that the embedded visualization distance ||xi xj|| approximates δij as much as possible. For MDS, the distance to be preserved δij is frequently the linear distance, measuring the distance along a straight line between two points in the input space. Instead of this linear distance, Isomap (Tenenbaum, De Silva, & Langford, 2000) seeks to preserve the geodesic distance, by ﬁnding shortest paths in a graph with edges connecting neighboring data points. LLE (Roweis & Saul, 2000) seeks to preserve linear distances, but only among the neighboring points and avoiding the need to estimate pairwise distances between widely separated data points. Recently there are also works applying a similar concept to embedding but using probabilistic modeling, such as PE (Iwata, Saito, Ueda, Stromsten, Griﬃths, & Tenenbaum, 2007), SNE (Hinton & Roweis, 2002), t-SNE (Van der Maaten & Hinton, 2008), and GTM (Bishop, Svens en, & Williams, 1998). Yet others are based on semi-deﬁnite programming (Shaw & Jebara, 2007, 2009). Alternatively, several embedding techniques do not aim to preserve relationship among data instances, but rather other properties such as local minima (Kim & Torre, 2010). Importantly, all these techniques are not optimized for semantic visualization, as they do not model topics at all. The coordinates do not reﬂect any semantic meaning, other than reﬂecting the optimization objective.

There are only a few related works so far that seek to address the semantic visualization task directly. The closest previous work that does both topic modeling and visualization in a single generative process is Probabilistic Latent Semantic Visualization (PLSV) (Iwata et al., 2008), which also shows that a joint approach outperforms a separate approach. Just as PLSV builds upon the foundation of the topic modeling technique Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999) by incorporating visualization coordinates, so do we build upon the foundation of PLSV by incorporating RBF kernels (Section 3.3) and neighborhood structure (Section 4). There are also related works that share a similar objective, but do not share the same paradigm of visualization or topic modeling. For instance, LDA-SOM (Millar, Peterson, & Mendenhall, 2009) ﬁrst conducts topic modeling using Latent Dirichlet Allocation (LDA) (Blei et al., 2003), and then separately embeds the documents topic distributions on a Self-Organizing Map (SOM) (Kohonen, 1990). However, this is not a joint model, and SOM uses a diﬀerent visualization space than the Euclidean space that we are interested in. For another instance, SSE (Le & Lauw, 2014a) builds on the Spherical Admixture Model (SAM) (Reisinger, Waters, Silverthorn, & Mooney, 2010) belonging to the class of spherical topic models targeted at spherical (unit vector) reprepresentations of topics and documents, which are not directly comparable or equivalent with the simplex representation and multinomial modeling (probability distribution over words) adopted in this work as well as PLSV. By semantic visualization, we refer to the task of joining visualization and topic modeling. A related, but diﬀerent, task is topic visualization, where the objective is to visualize the topics, in terms of which keywords are dominant for each topic (Chaney & Blei, 2012; Chuang, Manning, & Heer, 2012), which topics are dominant in a corpus (Wei, Liu, Song, Pan, Zhou, Qian, Shi, Tan, & Zhang, 2010), and how topics are related to one another (Gretarsson, O donovan, Bostandjiev, H ollerer, Asuncion, Newman, & Smyth, 2012).

2.2 Topic Modeling

Topic model involves statistical modeling of text (documents and words) in order to discover some abstract concepts or topics that occur in a corpus. Beginning with latent semantic indexing (Dumais, Furnas, Landauer, Deerwester, Deerwester, et al., 1995), topic model evolves into the modern probabilistic treatments, such as Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Intuitively, a topic captures a collection of words that tend to co-occur because they describe the same concept. This has the appeal of producing highly interpretable statistical models that let users make semantic sense of the corpus. Other than text-only document corpora, topic models have also been applied to cases where links are observed in addition to text (Mc Callum, Wang, & Corrada-Emmanuel, 2007). Meanwhile, the assumption that the intrinsic geometry of the data is a non-linear low dimensional subspace within the high-dimensional space ﬁnds application in both supervised and unsupervised (Belkin & Niyogi, 2003) learning algorithms. It is especially prevalent in semi-supervised learning (Zhou et al., 2004; Zhu et al., 2003) as a way to bridge labeled and unlabeled data. Regularization as a technique to realize this assumption has a long history (Belkin et al., 2006). The speciﬁc form of the regularization function varies among

Semantic Visualization with Neighborhood Graph Regularization

applications. The study of this assumption for unsupervised topic models begins with Lap PLSI (Cai et al., 2008), which introduces regularization to PLSA (Hofmann, 1999), by minimizing the Euclidean distance between neighboring documents topic distributions. Follow-up work introduce other distance functions (Cai et al., 2009; Wu, Bu, Chen, Zhu, Zhang, Liu, Wang, & Cai, 2012). While these previous work focus on maintaining proximity of similar documents, DTM (Huh & Fienberg, 2012) adds a new criterion to also maintain the distance among diﬀerent documents. Our work is diﬀerent in that we also need to contend with the visualization aspects, and not just topic modeling.

2.3 Semantic Similarity

Other than topic models, there are alternative mechanisms to learn the semantic relationship between documents. One way is by measuring the semantic similarity among documents or words. For instance, in vector space model, documents may be represented as a term vector, and their similarity may be expressed in terms of cosine similarity (Turney, Pantel, et al., 2010). Other than word occurrences alone, there could also be additional signals of semantic similarity. For instance, working with Wikipedia corpus, the categories and links are also took into account to determine the similarity among articles (Gabrilovich & Markovitch, 2009; Ponzetto & Strube, 2007). Our work diﬀers from these in several important respects. First, our objective is not in the similarity value per se, but rather in determining lower-dimensional embedding coordinates, which would allow visualization as one application. Second, our method is based on probabilistic modeling of latent variables, akin to topic modeling, instead of operating on the vector space model representation of documents.

3. Semantic Visualization

We introduce the problem formulation for semantic visualization in Section 3.1. Our focus in this paper is on the eﬀects of the neighborhood graph structure on the semantic visualization task. We ﬁgure that the clearest way to showcase these eﬀects is to design a neighborhood preservation framework over and above an existing generative process, such as PLSV (Iwata et al., 2008), which we will review in Section 3.2. In Section 3.3, we describe an innovation over the semantic visualization model, which is an abstraction of the mapping between the topic space and the visualization space using radial basis function (RBF) kernels. This allows the exploration of various kernels, of which we identify two for further exploration. For ease of following the discussion, we include a table of notations in Table 1.

3.1 Problem

For the task of semantic visualization, the input is a corpus of documents D = {d1, . . . , d N}. Every dn is a bag of words, and wnm denotes the mth word in dn. The total number of words in dn is Mn. The objective is to learn, for each dn, a latent distribution over Z topics {P(z|dn)}Z z=1. Each topic z is associated with a parameter θz, which is a probability distribution {P(w|θz)}w W over words in the vocabulary W. The words with the highest probabilities for a given topic capture the semantic of that topic.

Notation Description dn a speciﬁc document xn latent coordinate of dn in the visualization space Mn number of words in document dn z a speciﬁc topic φz coordinate of topic z in the visualization space θz word distribution of topic z W the vocabulary (the set of words in the lexicon) N total number of documents in the corpus Z total number of topics (user-deﬁned) χ the collection of xn s for all documents Φ the collection of φz s for all topics Θ the collection of θz s for all topics Ψ the collective set of parameters {χ, Φ, Θ}

Table 1: Notations.

In semantic visualization, there is an additional objective for semantic visualization, which is to learn, for each document dn, its latent coordinate xn on a low-dimensionality visualization space. Similarly, each topic z is associated with a latent coordinate φz on the visualization space. A document dn s topic distribution is then expressed in terms of the Euclidean distance between its coordinate xn and the diﬀerent topic coordinates Φ = {φz}Z z=1. Intuitively, the closer is xn to a topic s φz, the higher is P(z|dn) or the probability of topic z for document dn. In the following sections, we systematically describe the various components of our solution. The generative process that links the latent variables (coordinates) and the words in the documents is described in Section 3.2. The speciﬁc relationship between documents and topics coordinates constitutes a speciﬁc mapping function, which we model as an RBF kernel in Section 3.3. In the following Section 4, we discuss how to incorporate neighborhood structure into semantic visualization.

3.2 Generative Process

We now describe the generative process of documents based on both topics and visualization coordinates. Below we review PLSV whose graphical model is shown in Figure 1. Our eventual complete model is a generalization of this model, involving enhancements through kernelization (Section 3.3) and neighborhood structure preservation (Section 4). The generative process is as follows:

1. For each topic z = 1, . . . , Z:

(a) Draw z s word distribution: θz Dirichlet(α)

(b) Draw z s coordinate: φz Normal(0, β 1I)

2. For each document dn, where n = 1, . . . , N:

(a) Draw dn s coordinate: xn Normal(0, γ 1I)

Semantic Visualization with Neighborhood Graph Regularization

Figure 1: Graphical model of PLSV.

(b) For each word wnm dn:

i. Draw a topic: z Multi({P(z|xn, Φ)}Z z=1) ii. Draw a word: wnm Multi(θz)

Here, α is a Dirichlet prior, I is the identity matrix, β and γ control the variance of the Normal distributions. The parameters χ = {xn}N n=1, Φ = {φz}Z z=1, Θ = {θz}Z z=1, collectively denoted as Ψ = χ, Φ, Θ , are learned from documents D based on maximum a posteriori estimation. The log likelihood function is shown in Equation 1.

z=1 P(z|xn, Φ)P(wnm|θz) (1)

We reiterate that our focus here is on incorporating neighborhood graph structure into semantic visualization. By building a neighborhood graph regularization framework into an existing generative process, i.e., PLSV, we can clearly observe that any improvement over PLSV arises from the neighborhood graph regularization. In this sense, our work is in the tradition of introducing neighborhood graph regularization to probabilistic topic modeling (Huh & Fienberg, 2012; Cai et al., 2008, 2009), where the contributions relate to the neighborhood graph regularization, rather than the generative process. That said, there is one signiﬁcant diﬀerence to PLSV, which is our ﬂexibility in allowing various kernel functions, which we will discuss next.

3.3 RBF Kernels

In the Step 2(b)i of the above generative process, the topic z of a word is drawn from the distribution {P(z|xn, Φ)}Z z=1. This distribution relates the coordinates of topics in the visualization space Φ = {φz}Z z=1 and the coordinate xn of a document dn with the document s topic distribution {P(z|dn)}Z z=1. This relationship can be formulated as a mapping problem where we want to ﬁnd a function G which maps a point in visualization space to a point in the topic space. However, the form of G cannot be known exactly because both visualization space and topic space are latent spaces and G may be diﬀerent across diﬀerent domains. Therefore, to compute the topic distributions, we need a way to approximate G. To build a function approximation of the unknown function G, we use the abstraction of Radial Basis Function (RBF) neural networks (Bishop, 1995) because feedforward multilayered RBF neural networks with one hidden layer can serve as a universal approximator

Z Ěŝ ů ď ŝ Ĩ ƚŝ

KƵƚƉƵƚ nz θ ǌсϭ

ZĂĚŝĂů ďĂƐŝƐ ĨƵŶĐƚŝŽŶƐ ǌсϭ ǌсϮ ǌс Ͳϭ ǌс

Figure 2: Topic distribution is expressed as a function of visualization coordinates using Radial Basis Function (RBF) network.

to arbitrary continuous functions (Park & Sandberg, 1991). This property provides the conﬁdence that the model would have the ability to approximate any existing relationship between visualization space and topic space with arbitrary precision. Unlike PLSV (Iwata et al., 2008) that deﬁned a speciﬁc mapping function, our approach generalizes the semantic visualization model by deﬁning the mapping problem in terms of kernelization, which admits several mapping functions within the family of RBF kernels. In our context, Radial Basis Function (Buhmann, 2000) will relate coordinate variables based on distances which deﬁnes a kernel function Λ(||xn φz||) in terms of how far a data point (e.g., xn) is from a center (e.g., φz). The kernel function Λ may take on various forms, e.g., Gaussian, multi-quadric, inverse quadratic, polyharmonic spline. To express P(z|dn) as a function of xn, we consider the normalized architecture of RBF network, with three layers as shown in Figure 2. The input layer consists of one input node (xn). The hidden layer consists of Z number of normalized RBF activation functions. Each is centered at φz and computes Λ(||xn φz||) Z z =1 Λ(||xn φz ||). The linear output layer consists of Z output nodes.

Each output node yz(xn) corresponds to P(z|dn), which is a linear combination of the RBF functions, as shown in Equation 2. Here, wz,z is the weight of inﬂuence of the RBF function of z on the P(z|dn), with the constraint Z z =1 wz,z = 1.

P(z|dn) = yz(xn) = Z z =1 wz,z Λ(||xn φz ||)

Z z =1 Λ(||xn φz ||) (2)

While Equation 2 is the general form, to instantiate a speciﬁc mapping function, we need to determine both the assignment of wz,z and the form of the function Λ. For wz,z , we will experiment with a special case wz,z = 1 when z = z and 0 otherwise. For the kernel function Λ, one variation we consider is Gaussian, which yields the function in Equation 3, where Φ refers to the collective set of φz s. Note that here we set variance of Gaussian to 1. However, its true value is not really important because a diﬀerent variance value just produces a re-scaled visualization with the scaling factor equal to that variance.

Semantic Visualization with Neighborhood Graph Regularization

P(z|dn)Gaussian = P(z|xn, Φ)Gaussian = exp( 1

2||xn φz||2) Z z =1 exp( 1

2||xn φz ||2) (3)

Another variation of Λ being considered is Student-t. This distribution is also used by t-SNE (Van der Maaten & Hinton, 2008) in the context of non-semantic, direct embedding to mitigate the eﬀects of crowding. Due to mismatched dimensionalities, the points are crunched together in the center of the visualization, which prevents gaps from forming between the clusters. Therefore, we hypothesize that using Student-t as radial basis function, which yields the function in Equation 4, can help to improve the performance of our model if crowding becomes an issue. Note that the Student-t distribution with one degree of freedom yields a radial basis function having the form similar to the inverse quadratic.

P(z|dn)Student t = P(z|xn, Φ)Student t = (1 + ||xn φz||2) 1 Z z =1(1 + ||xn φz ||2) 1 (4)

The Gaussian function (Equation 3) was also used previously in the baseline PLSV (Iwata et al., 2008) that we will compare to. Its inclusion helps to establish parity for comparative purposes, both to investigate the eﬀectiveness of the alternative Student-t kernel (described above), as well as that of the neighborhood regularization (described in the next section).

4. Neighborhood Graph Regularization Framework

There are recent works (Cai et al., 2008, 2009; Huh & Fienberg, 2012) trying to preserve the local neighborhood structure when learning low-dimensional topic representations of documents. These works assume that documents are sampled from a nonlinear low-dimensional subspace that are embedded in a high-dimensional space. Therefore, the local neighborhood structure is important for revealing the hidden topics of documents and should be preserved when learning topic representations of documents (Bai, Guo, Lan, & Cheng, 2014). In the generative process for semantic visualization described in Section 3, the document parameters are sampled independently, and may not necessarily reﬂect the underlying local neighborhood structure. We therefore seek to realize this assumption for semantic visualization. In particular, we assume that when two documents di and dj are close in the original space, then their parameters ψi and ψj of the low-rank representation are similar as well. Coupled with the kernelized semantic visualization model described in Section 3, the neighborhood preservation approach described in this section constitutes our proposed model, Semafore, which stands for SEmantic visualization with MAni FOld REgularization.

4.1 Neighborhood Regularization

The neighborhood structure can be represented by a neighborhood graph. Given a set of data points in the Euclidean space, a neighborhood graph is constructed with the input data points as vertices. By deﬁnition, edges are symmetric, i.e., ωij = ωji, and weighted. The collection of edge weights are collectively denoted as Ω = {ωij}. For the moment, we will assume that we have the neighborhood graph, and address the issue of how this neighborhood graph may be incorporated into our semantic visualiza-

tion framework. In actuality, the neighborhood graph construction itself is an important component, whose construction is described in detail in Section 4.2. One eﬀective means to incorporate a neighborhood structure into a learning model is through a regularization framework (Belkin et al., 2006). This leads to a re-design of the log-likelihood function in Equation 1 into a new regularized function L (Equation 5), where Ψ consists of the parameters (visualization coordinates and topic distributions), and D and Ω are the documents and neighborhood structure.

L(Ψ|D, Ω) = L(Ψ|D) + λ R(Ψ|Ω) (5)

The ﬁrst component L is the log-likelihood function in Equation 1, which reﬂects the ﬁt between the latent parameters Ψ and the observation D. The second component R is a regularization function, which reﬂects the consistency between the latent parameters Ψ of neighboring documents in the neighborhood structure Ω. λ is the regularization parameter, commonly found in neighborhood based algorithms (Belkin et al., 2006; Cai et al., 2008, 2009), which controls the extent of regularization (we will experiment with diﬀerent λ s in experiments).

4.1.1 Proposed Regularization Function

We now turn to the deﬁnition of the R function. The intuition is that the data points that are close in the high-dimensional space, should also be close in their low-rank representations, i.e., local consistency, also known as smoothness. One function that satisﬁes this is R+ in Equation 6. Here, F is a distance function that operates on the low-rank space. Minimizing R+ leads to minimizing the distance F(ψi, ψj) between neighbors (ωij = 1).

i,j=1;i =j ωij F(ψi, ψj) (6)

The above level of local consistency is still insuﬃcient, because it does not regulate how non-neighbors (i.e., ωij = 0) behave. For instance, it does not prevent non-neighbors from having similar low-rank representations. Another valid objective in visualization is to keep non-neighbors apart, which is satisﬁed by another objective function R in Equation 7. R is minimized when two non-neighbors di and dj (i.e., ωij = 0) are distant in their low-rank representations. The addition of 1 to F is to prevent division-by-zero error.

i,j=1;i =j;ωij=0

1 ωij F(ψi, ψj) + 1 (7)

We hypothesize that neither objective is eﬀective on its own. A more complete objective would capture the spirits of both keeping neighbors close, and keeping non-neighbors apart. Therefore, we put Equation 6 and Equation 7 together using summation and maximize the objective function as shown in Equation 8. Note that the coeﬃcient 1

2 in Equation 8 is for simplifying the formula of the derivative of R (Ψ|Ω).

R (Ψ|Ω) = 1

2(R+(Ψ|Ω) + R (Ψ|Ω)) (8)

Semantic Visualization with Neighborhood Graph Regularization

-2 -1 0 1 2 3 4

Figure 3: Example of how the same topic distribution may have diﬀerent visualization coordinates. Any points on the red line have same topic distributions.

Summation preserves the absolute magnitude of the distance, and helps to improve the visualization task by keeping non-neighbors separated on a visualizable Euclidean space. Taking the product is unsuitable, because it constrains the ratio of distances between neighbors to distances between non-neighbors. This may result in the crowding eﬀect, where many documents are clustered together, because the relative ratio may be maintained, but the absolute distances on the visualization space could be too small. Other than the proposed regularization function above, it is also possible to consider other regularization functions. For instance, we have also experimented with modifying the regularization function adapted from Discriminative Topic Model (DTM) (Huh & Fienberg, 2012), which addressed topic modeling but not semantic visualization. Note that while in the original DTM formulation, the distance function F(ψi, ψj) operates in the topic space, we adapt it for semantic visualization by redeﬁning the distance function F(ψi, ψj) so that it will operate in the visualization space instead. This modiﬁed DTM formulation is shown to underperform the proposed regularization function above (Le & Lauw, 2014b).

4.1.2 Enforcing Neighborhood Structure: Visualization vs. Topic Space

We now turn to the deﬁnition of F(ψ1, ψ2). In neighborhood-based models (Belkin et al., 2006; Cai et al., 2008, 2009), there is only one low-rank representative space. For semantic visualization, there are two: topic and visualization spaces. We look into where and how to enforce the neighborhood graph structure. At ﬁrst glance, they seem equivalent. After all, they are representations of the same documents. However, this is not necessarily the case. Consider a simple example of two topics z1 and z2 with visualization coordinates φ1 = (0, 0) and φ2 = (2, 0) respectively. Meanwhile, there are three documents {d1, d2, d3} with coordinates x1 = (1, 1), x2 = (1, 1), and x3 = (1, 1). If two documents have the same coordinates, they will also have the same topic distributions. In this example, x1 and x2 are both equidistant from φ1 and φ2, and therefore according to Equation 3, they have the same topic distribution P(z1|d1) = P(z1|d2) = 0.5, and P(z2|d1) = P(z2|d2) = 0.5. If two documents have the same topic distributions, they may not necessarily have the same coordinates. d3 also has the same

topic distribution as d1 and d2, but a diﬀerent coordinate. In fact, any coordinate of the form (1, ?) will have the same topic distribution. This example is illustrated in Figure 3. This suggests that enforcing neighborhood structure on the topic space may not necessarily lead to having data points closer on the visualization space. We postulate that regularizing the visualization space is more eﬀective. There are also advantages in computational eﬃciency to doing so, which we will describe further shortly. Therefore, we deﬁne F(ψi, ψj) as the squared Euclidean distance ||xi xj||2 between the corresponding visualization coordinates.

4.2 Neighborhood Graph

We discuss how the neighborhood graph may be approximated, which concerns the two issues of how the graph edges are deﬁned, as well as how they are weighted. The neighborhood graph is constructed in the original data space where we represent each document as a tf-idf vector (Manning, Raghavan, Sch utze, et al., 2008). We also experiment with diﬀerent vector representations, including word counts and term frequencies, and ﬁnd tf-idf to give the best results. The distance between two document vectors is measured using Euclidean distance.

4.2.1 Graph Construction

There have been research studies on the properties and methods for construction of neighborhood graphs (Zemel & Carreira-Perpi n an, 2004; Carey & Mahadevan, 2014). Since the construction of neighborhood graph is a critical step that may aﬀect the performance of various graph-based algorithms, this problem itself is a research issue of independent interest. Our scope is in exploring how some well-established graph construction techniques may apply to the case of semantic visualization. We will investigate these various graph construction methods empirically in Section 6. In the following, we brieﬂy review two categories of graph construction methods.

1. Neighborhood-based Graphs. In this formulation, edges are formed between data points that are deemed to be suﬃciently close to each other. This admits diﬀerent deﬁnitions of suﬃcient closeness . The most common deﬁnitions found in the literature include the two below.

(a) ϵ-ball: The neighborhood graph contains an edge connecting two documents di and dj, if di and dj have a distance less than a threshold ϵ.

(b) k-nearest neighbors (k-NN) graph: The neighborhood graph contains an edge connecting two documents di and dj, if di is in the set Nk(dj) of the k nearest neighbors of dj, or dj is in the set Nk(di).

ϵ-ball and k-NN both have strongly data-dependent parameters (i.e., ϵ and k) and it is not straightforward to choose the best value for these parameters. Neither guarantees that the graph would be connected. They also need to be carefully selected or tuned, as to some extent they also aﬀect the balance between the contribution of neighbors R+ and non-neighbors R to the neighborhood regularization R in Equation 8. In

Semantic Visualization with Neighborhood Graph Regularization

Appendix A, we explore empirically how these graph parameters can help to maintain this balance within the neighborhood regularization function.

ϵ-ball suﬀers from another issue that it tends to produce many edges for the points located at high-density regions, and thus has little restriction on the maximum degree of a vertex. k-NN does not suﬀer from that problem and is one of the most commonly used types of graphs.

In our subsequent development and experiments, we will experiment with both ϵ-ball and k-NN graph as there may be some variance in the performance of diﬀerent graph construction techniques for diﬀerent datasets (Hein, Audibert, & Luxburg, 2007; Ting, Huang, & Jordan, 2010; Coifman & Lafon, 2006).

2. Minimum Spanning Tree-based Graphs. While ϵ-ball and k-NN are quite sensitive to noise and sparsity, graph construction based on combining multiple minimum spanning trees can help to reduce sensitivity to noise of the output graph (Zemel & Carreira-Perpi n an, 2004). There are two variations based on this approach.

(a) Perturbed Minimum Spanning Trees (PMST): PMST builds a neighborhood graph by generating T > 1 perturbed copies of the whole dataset according to the local noise model and ﬁtting an MST to each perturbed copy. A weight eij [0, 1] will be assigned to the edge between points xi and xj equal to the average number of times that edge appears on the trees. (b) Disjoint Minimum Spanning Trees (DMST): DMST produces a neighborhood graph by ﬁnding a deterministic collection of r minimum spanning trees that satisﬁes the property that no tree in the collection uses any edge of other trees. The neighborhood graph is the union of all edges of trees and contains r(N 1) edges.

As the representative of this category, we use DMST, which is deterministic and easier to construct than PMST while showing similar eﬃcacies.

4.2.2 Graph Weighting

The next issue is how to assign weights to the edges in the neighborhood graph. In this respect, we consider two variations of edge weights.

1. Simple Minded:

1, if only if di and dj are connected, 0, otherwise. (9)

This is the simplest approach where we use binary weighting to assign the weights to the edges. However, this approach to assign uniform weights to edges can be sensitive to errors, because of the cliﬀeﬀect from 1 immediately to 0. Moreover, since the weights are not smoothed, it could result in some loss of information. We hypothesize that among the connected nodes, there may still be some diﬀerences in terms of degrees of similarity, which are expressed by their mutual distances. This motivates the second approach below.

2. Heat Kernel:

exp( ||di dj||2

τ ), if only if di and dj are connected, 0, otherwise. (10)

An alternative approach is using the Heat Kernel function (Belkin & Niyogi, 2001; Jebara, Wang, & Chang, 2009). Heat Kernel has the advantage over Simple Minded by allowing smoother weights for the edges, which helps address the issues of sensitivity and loss of information. However, while Simple Minded is not parameterized, Heat Kernel has one parameter that needs to be determined (i.e., τ). Note that for τ = , Heat Kernel degenerates into Simple Minded, i.e., the former is the more general formulation. The exact value of τ is not important in our model because it would eﬀectively be absorbed by the regularization parameter. For simplicity, we set τ = 2.

5. Model Fitting

We now discuss how the parameters of the model described in Sections 3 and 4 can be learned. One well-accepted framework to learn model parameters using maximum a posteriori (MAP) estimation is the Expectation-Maximization or EM algorithm (Dempster, Laird, & Rubin, 1977). For our model, the regularized conditional expectation of the complete-data log likelihood in MAP estimation with priors is:

z=1 P(z|n, m, ˆΨ) log P(z|xn, Φ)P(wnm|θz)

n=1 log(P(xn)) +

z=1 log(P(φz)) +

z=1 log(P(θz))

+ λ R(Ψ|Ω),

where ˆΨ is the current estimate. P(z|n, m, ˆΨ) is the class posterior probability of the nth

document and the mth word in the current estimate. P(θz) is a symmetric Dirichlet prior with parameter α for word probability θz. P(xn) and P(φz) are Gaussian priors with a zero mean and a spherical covariance for the document coordinates xn and topic coordinates φz. We set the hyper-parameters to α = 0.01, β = 0.1N and γ = 0.1Z following PLSV (Iwata et al., 2008). In the E-step, P(z|n, m, ˆΨ) is updated as follows:

P(z|n, m, ˆΨ) = P(z|ˆxn, ˆΦ)P(wnm|ˆθz) Z z =1 P(z |ˆxn, ˆΦ)P(wnm|ˆθz ) .

In the M-step, by maximizing Q(Ψ|ˆΨ) w.r.t θzw, the next estimate of word probability θzw is as follows:

θzw = N n=1 Mn m=1 I(wnm = w)P(z|n, m, ˆΨ) + α W w =1 N n=1 Mn m=1 I(wnm = w )P(z|n, m, ˆΨ) + αW ,

Semantic Visualization with Neighborhood Graph Regularization

where I(.) is the indicator function. φz and xn cannot be solved in a closed form, and are estimated by maximizing Q(Ψ|ˆΨ) using quasi-Newton (Liu & Nocedal, 1989). The computation fo the gradients of Q(Ψ|ˆΨ) w.r.t φz and xn depend on the speciﬁc kernel used (see Section 3.3).

For the Gaussian kernel, we have the following gradients:

P(z|xn, Φ) P(z|n, m, ˆΨ) (φz xn) βφz,

P(z|xn, Φ) P(z|n, m, ˆΨ) (xn φz) γxn + λ R(Ψ|Ω)

For the Student-t kernel, we have the following gradients:

2 P(z|xn, Φ) P(z|n, m, ˆΨ) (φz xn) 1 + ||xn φz||2 βφz,

2 P(z|xn, Φ) P(z|n, m, ˆΨ) (xn φz) 1 + ||xn φz||2 γxn + λ R(Ψ|Ω)

The gradient of R(Ψ|Ω) w.r.t. xn is computed depending on the form of the regularization function R(Ψ|Ω). When we use the proposed regularization function R (Ψ|Ω) described in Section 4.1.1, we have the following gradient:

xn = R (Ψ|Ω)

4ωnj(xn xj)

4(1 ωnj) (xn xj) (F(ψn, ψj) + 1)2 .

As mentioned earlier, there is an eﬃciency advantage to regularizing on the visualization space. R(Ψ|Ω) does not contain the variable φz if we do regularization on visualization space. The complexity of computing all R(Ψ|Ω)

xn is O(N2). In contrast, if we do regularization on topic space, we have to take the gradient of R(Ψ|Ω) w.r.t to φz. That contributes towards a greater complexity of O(Z2 N2) to compute all R(Ψ|Ω)

θz . Therefore, regularization on topic space would run much slower than on visualization space.

6. Experiments

The main objective of our experiments is to evaluate the eﬀectiveness of neighborhood regularization for semantic visualization model. After describing the experimental setup, we ﬁrst examine the diﬀerent design choices of the model relating to kernel, graph construction, and regularization function. Thereafter, we compare Semafore against the baseline methods that also aim to address both visualization and topic modeling, quantitatively and qualitatively, ﬁrst in terms of visualization and then in terms of topic modeling.

6.1 Experimental Setup

In this section, we give a description of benchmark datasets as well as suitable metrics that are used for evaluation.

6.1.1 Datasets

We use three real-life, publicly available datasets (Cardoso-Cachopo, 2007) for evaluation.

20News contains newsgroup articles (in English) from 20 classes.

Reuters8 contains newswire articles (in English) from 8 classes.

Cade12 contains web pages (in Brazilian Portuguese) classiﬁed into 12 classes.

These are benchmark datasets used for document classiﬁcation. While our task is fully unsupervised, the ground-truth class labels are useful for an objective evaluation. We create balanced classes by sampling ﬁfty documents from each class, following the practice in PLSV (Iwata et al., 2008). This results in, for one sample, 1000 documents for 20News, 400 for Reuters8, and 600 for Cade12. The vocabulary sizes are 5.4K for 20News, 1.9K for Reuters8, 7.6K for Cade12. As the algorithms are probabilistic, we generate ﬁve samples for each dataset. For each sample, we conduct ﬁve independent runs. Therefore, the result reported for each setting is the average over a total of 25 runs.

6.1.2 Metrics

For a suitable metric, we return to the fundamental principle that a good visualization should preserve the relationship between documents (in high-dimensional space) in the lower-dimensional visualization space. User studies, even when well-designed, could be overly subjective and may not be repeatable across diﬀerent users reliably. Therefore, for a more objective evaluation, we rely on two types of quantitative analysis:

Classiﬁcation: This evaluation relies on the ground-truth class labels found in the datasets. This is a well-established practice in many clustering and visualization works in machine learning. The basis for this evaluation is the reasonable assumption that documents of the same class are more related than documents of diﬀerent classes. Therefore a good visualization would place documents of the same class as neighbors on the visualization.

For each document dn, we hide its true class cn, and generate a prediction for its class ˆCt(n) by taking the majority class among its t-nearest neighbors, as determined by Euclidean distance on the visualization space. Classiﬁcation accuracy Classification Acc(t) is deﬁned as the fraction of documents whose predicted class ˆCt(n) matches the true class cn. More speciﬁcally, we have:

Classification Acc(t) = 1

n=1 δ( ˆCt(n) = cn),

Semantic Visualization with Neighborhood Graph Regularization

where δ is the delta function that equals 1 if the prediction matches and 0 otherwise.

The same metric is used in PLSV (Iwata et al., 2008). While accuracy is computed based on documents coordinates, the same trends will be produced if computed based on topic distributions (due to their coupling through the kernels described in Section 3.3).

Neighborhood Preservation: This evaluation does not rely on the ground-truth class labels but on the local neighborhood structure in the input data. The assumption is that a good visualization would be able to preserve the local structure in the input data as much as possible. If two documents are neighbors in the input data, they should still be neighbors in the visualization space.

For every document dn, we compute sets of t-nearest neighbors Yt(n) and Xt(n) of document dn in the input data and the visualization respectively. The neighborhood preservation accuracy Preservation Acc(t) is then deﬁned as the average fraction of the overlap size of Yt(n) and Xt(n) over the size of Yt(n) (i.e. t), where n = 1, . . . , N. More speciﬁcally, we have:

Preservation Acc(t) = 1

|Yt(n) Xt(n)|

where |Yt(n) Xt(n)| is the size of the overlap set Yt(n) Xt(n).

A similar measure can be found in the literature (Akkucuk & Carroll, 2006), where it is called the rate of agreement in local structure or agreement rate and is used to measure how well the local structure is preserved between the input data and the low dimensional embedding. It is also used for tuning the parameters of a non-linear dimensionality reduction method (Chen & Buja, 2009).

In the subsequent experiments, we let t vary in the range [5, 50] with the step size 5 and report the accuracies. Since diﬀerent methods may behave diﬀerently at diﬀerent t s, choosing a speciﬁc t for comparison may be unfair for some methods. Moreover, a method that consistently does well for diﬀerent t s would also have a smoother local structure. Therefore, when comparing various methods, we present the preservation or classiﬁcation accuracies averaged across t [5, 50], denoted Preservation Acc(Avg) and Classification Acc(Avg) respectively.

6.2 Parameter Study

In this section, we study the eﬀects of graph parameters on our model. Speciﬁcally, the parameters concern the graph construction, including the number of neighbors k in k-NN graph, the distance threshold ϵ in ϵ-ball graph, and the number of minimum spanning trees r in DMST. For each type of graph, we use the Simple Minded weight. For the following ﬁgures, the regularization function is R with λ = 10 and the number of topics Z = 20. We use neighborhood preservation accuracy Preservation Acc(t) to show the eﬀects of graph parameters because this metric does not need ground-truth class labels, which are not always available for tuning these graph parameters.

ĞƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

Ŭсϱ ŬсϭϬ ŬсϮϬ ŬсϯϬ ŬсϰϬ ŬсϱϬ ŬсϭϬϬ

ĞƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

ĞƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

ƚ Ă ϮϬEĞǁƐ ď ZĞƵƚĞƌƐϴ Đ ĂĚĞϭϮ

Figure 4: Preservation accuracy of Semafore when using k-NN graph with diﬀerent neighborhood size k for (a) 20News, (b) Reuters8, and (c) Cade12.

ĞƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

ƌсϰ ƌсϱ ƌсϲ ƌсϳ ƌсϴ ƌсϵ ƌсϭϬ

ĞƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

ĞƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

Ă ϮϬEĞǁƐ ď ZĞƵƚĞƌƐϴ Đ ĂĚĞϭϮ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Figure 5: Preservation accuracy of Semafore when using DMST graph with diﬀerent number of minimum spanning trees r for (a) 20News, (b) Reuters8, and (c) Cade12.

ĞƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

ɸсϭ ϯϮ ɸсϭ ϯϯ ɸсϭ ϯϰ ɸсϭ ϯϱ ɸсϭ ϯϲ ɸсϭ ϯϳ ɸсϭ ϯϴ ɸсϭ ϯϵ ɸсϭ ϰϬ

ĞƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

ĞƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

Ă ϮϬEĞǁƐ ď ZĞƵƚĞƌƐϴ Đ ĂĚĞϭϮ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Figure 6: Preservation accuracy of Semafore when using ϵ-ball graph with diﬀerent values of distance threshold ϵ for (a) 20News, (b) Reuters8, and (c) Cade12.

Semantic Visualization with Neighborhood Graph Regularization

In Figure 4, we show the performance of our model with diﬀerent neighborhood size k in k-NN graph for diﬀerent datasets. For every k, we vary t and plot the Preservation Acc(t). Figure 4 shows that the optimum k for 20News, Reuters8, and Cade12 is 10, 10, and 5 respectively. We compute the average accuracy Preservation Acc(Avg) and it conﬁrms that the optima are indeed at those k values. From now on, we will use k=10 for 20News and Reuters8, and k=5 for Cade12 when k-NN graph is used. For DMST graph, we plot the Preservation Acc(t) for diﬀerent number of minimum spanning trees r with diﬀerent datasets in Figure 5. It is diﬃcult to see which r is the best in the ﬁgure because the diﬀerences between them are not much. The Preservation Acc(Avg) is computed and it shows that for all three datasets, the optimum is about at r=5,6,7. Subsequently, we will use r=6 for DMST graphs for all three datasets. For ϵ-ball graph, in Figure 6 we plot the Preservation Acc(t) for diﬀerent values of ϵ in the range [1.32, 1.40]. We choose that range because ϵ=1.32 and ϵ=1.40 roughly give an average number of neighbors of 5 and 100 respectively. The Preservation Acc(Avg) shows that the optimum ϵ for 20News, Reuters8, and Cade12 is 1.34, 1.35, and 1.33 respectively.

6.3 Model Analysis

In this section, we study the various design choices involved in designing the Semafore model, before ﬁnally concluding on the eventual synthesis of design choices to be used for comparison against the baselines. To keep the discussion focused and organized, in each of the following sub-section, we vary a single design choice, in order to isolate its eﬀects. When unvaried, the model has the following setup by default: the number of topics is Z = 20, the graph construction method is k-NN, the graph weighting method is simple minded, the RBF kernel is Gaussian, and the regularization function is R with λ = 10.

6.3.1 Neighborhood Graph Construction

We investigate three graph construction methods: k-NN, ϵ-ball and DMST, which are representatives of neighborhood-based and minimum spanning tree-based methods respectively. For each graph, its parameter is tuned as shown in Section 6.2. For the regularization parameter λ, we try diﬀerent settings of λ on each dataset. It so happens that λ = 10 performs the best for all the graph construction methods across the three datasets. In Figure 7, we run Semafore with diﬀerent types of graph on the three datasets and report the Preservation Acc(Avg) at diﬀerent number of topics Z. The results show that diﬀerent types of graph behave diﬀerently with diﬀerent datasets. In 20News, ϵ-ball and DMST give our model highest performance. Since the diﬀerence between the two are not statistically signiﬁcant, we choose to use DMST for subsequent experiments on 20News. For Reuters8, since ϵ-ball outperforms the others (signiﬁcant at 0.05 level), it is going to be the default choice for subsequent experiments. For Cade12, the choice is DMST, which is slightly better than k-NN (statistically signiﬁcant for Z = 10, 40, 50).

6.3.2 Neighborhood Graph Weighting

We now compare two variations of graph weighting methods, namely: Simple Minded and Heat Kernel methods. In this experiment, we use k-NN graph with speciﬁc k s for diﬀerent

ǀĂƚŝŽŶͺ ĐĐ ǀŐͿ

ŬͲEE D d ɸͲďĂůů

ǀĂƚŝŽŶͺ ĐĐ ǀŐͿ

ǀĂƚŝŽŶͺ ĐĐ ǀŐͿ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

Ă ϮϬEĞǁƐ ď ZĞƵƚĞƌƐϴ Đ ĂĚĞϭϮ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

Figure 7: The eﬀects of diﬀerent graph construction methods on our model s performance.

ǀĂƚŝŽŶͺ ĐĐ ǀŐͿ

ŝŵƉůĞ DŝŶĚĞĚ tĞŝŐŚƚƐ ,ĞĂƚ <ĞƌŶĞů tĞŝŐŚƚƐ

ĂƚŝŽŶͺ ĐĐ ǀŐͿ

ǀĂƚŝŽŶͺ ĐĐ ǀŐͿ

Ă ϮϬEĞǁƐ ď ZĞƵƚĞƌƐϴ Đ ĂĚĞϭϮ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

Figure 8: The eﬀects of diﬀerent graph weighting schemes on our model s performance. The graph used in this experiment is k-NN graph with speciﬁc k s for diﬀerent datasets as studied in Section 6.2.

ĂƚŝŽŶͺ ĐĐ ǀŐͿ

'ĂƵƐƐŝĂŶ <ĞƌŶĞů ƚƵĚĞŶƚͲƚ <ĞƌŶĞů

ǀĂƚŝŽŶͺ ĐĐ ǀŐͿ

ǀĂƚŝŽŶͺ ĐĐ ǀŐͿ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

Ă ϮϬEĞǁƐ ď ZĞƵƚĞƌƐϴ Đ ĂĚĞϭϮ

Figure 9: The eﬀects of Gaussian and Student-t RBF kernels on our model s performance.

Semantic Visualization with Neighborhood Graph Regularization

20News Reuters8 Cade12 Regularization function R R R Graph construction DMST ϵ-ball DMST Graph weighting Heat Kernel Heat Kernel Simple Minded RBF kernel Student-t Student-t Student-t

Table 2: Synthesized Model for Each Dataset.

datasets as studied in Section 6.2. The regularization parameter λ is set to 10 after trying various settings and picking the best one. In Figure 8, we compare Simple Minded method and Heat Kernel method to see their inﬂuences on our model at diﬀerent number of topics Z. We observe that Heat Kernel is signiﬁcantly and consistently better than Simple Minded method across all the cases in 20News and Reuters8. The diﬀerence is statistically signiﬁcant at 0.01 level. One explanation is that Heat Kernel assigns smoother weights to the graph edges, and thus is more robust than Simple Minded. For Cade12, Simple Minded is slightly better, though the diﬀerences are statistically signiﬁcant at 0.05 level only for Z = 40. Subsequently, we will use Heat Kernel for 20News and Reuters8, and Simple Minded for Cade12 as part of the ﬁnal synthesis.

6.3.3 RBF Kernel

As described in Section 3.3, we express topic distributions as a function of visualization coordinates using RBF network as an abstraction. In this section, we show how diﬀerent RBF kernels aﬀect our model s performance. The two kernels we are exploring are Gaussian (Equation 3) and Student-t (Equation 4). We tune the regularization term λ for each kernel and see that the best one for the two kernels are λ = 10. Figure 9 shows the results for diﬀerent number of topics Z. Student-t kernel has a slight edge over Gaussian kernel consistently across diﬀerent number of topics. The diﬀerence is small, but is statistically signiﬁcant (at 0.05 level) in a majority of the cases (for 20News at Z = 10, 20, 30, 50, for Reuters8 at Z = 30, and for Cade12 at Z = 10, 30, 50). The slight improvement could be a sign that crowding problem does exist in the model. Student-t kernel would be even more useful when there is more extreme crowding issues, such as when the number of documents to be visualized is even larger. Subsequently, due to its slight edge, we will use Student-t as part of the ﬁnal synthesis. As we will see shortly, using Student-t within the synthesized model results in a signiﬁcant improvement overall.

6.3.4 Synthesised Semafore Model

Based on the model analysis in the preceding paragraphs, we combine the design choices into a ﬁnal synthesis model called Semafore. The synthesized model is slightly diﬀerent for diﬀerent datasets, as listed in Table 2. We now conduct another set of experiments to verify that those synthesized models would produce a noticeable improvement over the earlier version (k NN + Simple Minded + Gaussian Kernel) that appeared in our earlier work (Le & Lauw, 2014b), underlining

Ϭ Ϯϵ Ϭ ϯϬ Ϭ ϯϭ

ŬͺEE н ŝŵƉůĞ DŝŶĚĞĚ н 'ĂƵƐƐŝĂŶ <ĞƌŶĞů D d н ,ĞĂƚ <ĞƌŶĞů н ƚƵĚĞŶƚͲƚ <ĞƌŶĞů

Ϭ ϰϭ Ϭ ϰϮ Ϭ ϰϯ Ϭ ϰϰ

ŬͺEE н ŝŵƉůĞ DŝŶĚĞĚ н 'ĂƵƐƐŝĂŶ <ĞƌŶĞů ɸͲďĂůů н ,ĞĂƚ <ĞƌŶĞů н ƚƵĚĞŶƚͲƚ <ĞƌŶĞů

Ϭ Ϯϯ Ϭ Ϯϰ Ϭ Ϯϱ

ŬͺEE н ŝŵƉůĞ DŝŶĚĞĚ н 'ĂƵƐƐŝĂŶ <ĞƌŶĞů

D d н ŝŵƉůĞ DŝŶĚĞĚ н ƚƵĚĞŶƚͲƚ <ĞƌŶĞů

Ă ϮϬEĞǁƐ ď ZĞƵƚĞƌƐϴ Đ ĂĚĞϭϮ

Ϭ Ϯϲ Ϭ Ϯϳ Ϭ Ϯϴ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

WƌĞƐĞƌǀĂƚŝŽ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

Ϭ ϯϴ Ϭ ϯϵ Ϭ ϰϬ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

WƌĞƐĞƌǀĂƚŝŽ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

Ϭ Ϯ Ϭ Ϯϭ Ϭ ϮϮ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

WƌĞƐĞƌǀĂƚŝŽŶ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

Figure 10: Our synthesized models with diﬀerent properties compared to the earlier version (k NN + Simple Minded + Gaussian Kernel) that appeared in our earlier work (Le & Lauw, 2014b).

the utility of the subsequent enhancements. Figure 10 shows that this is indeed the case. Based on the standard deviations shown in the ﬁgures, the improvements are very clear in 20News and Reuters8 but not so clear in Cade12. Paired samples t-test indicate that the improvement is signiﬁcant at 0.05 level or lower in all cases, except for the cases where Z = 10, 20 in Cade12. We will use these synthesized models in the comparisons against the baseline methods in the following section.

6.4 Comparison of Visualizations

We now compare our proposed model with several baselines. First, we outline the set of comparative methods. Thereafter, we discuss quantitative evaluation (in terms of accuracy), as well as qualitative evaluation (in terms of example visualizations). Finally, we will show that the gains in visualization quality does not come at the expense of topic modeling. As semantic visualization seeks to ensure consistency between topic model and visualization, the comparison focuses on methods producing both topics and visualization coordinates which are listed in Table 3.

Semafore is our proposed method that incorporates neighborhood structure into semantic visualization.

PLSV (Iwata et al., 2008) is the state-of-the-art, representing the joint approach without neighborhood structure preservation.

PE (LDA) represents the pipeline approach involving topic modeling with LDA (Blei et al., 2003), followed by visualizing documents topic distributions with PE (Iwata et al., 2007). This pipeline is better than the LDA/MDS that appeared in our earlier work (Le & Lauw, 2014b). There are other pipeline methods, shown inferior to PLSV (Iwata et al., 2008), which are not reproduced here to avoid duplication.

Semantic Visualization with Neighborhood Graph Regularization

Visualization Topic model Joint model Neighborhood Semafore PLSV PE (LDA) t-SNE (LDA)

Table 3: Comparative Methods.

t-SNE (LDA) is another pipeline approach that ﬁrst uses LDA (Blei et al., 2003) to learn topic model and then use t-SNE (Van der Maaten & Hinton, 2008) to visualize documents topic distributions.

For completeness, we also conduct experiments for comparing our method with t-SNE and Laplacian Eigen Maps (LE) (Belkin & Niyogi, 2003) (direct visualization, without topic modeling). To keep the discussion focused, we show them in Appendix B, as we do not consider t-SNE and LE as comparative baselines because these two methods only model visualization, but not topics.

6.4.1 Accuracy

In this section, we compare our model with several baselines in terms of classiﬁcation accuracy (Figure 11) and neighborhood preservation accuracy (Figure 12). In the two ﬁgures, only the standard deviations for Semafore are shown. Classﬁcation Accuracy. Figure 11(a), 11(c) and 11(e) show the Classfication Acc(t) at diﬀerent t s for Z = 20 for 20News, Reuters8, and Cade12 respectively. At any t, the comparison shows outperformance by Semafore over the baselines consistently. All four methods show the same behavior that their performances decrease when t increases. As t increases, they may lose accuracy in predicting labels for documents near to the border of each cluster . Now, we vary the number of topics Z. In Figure 11(b), we show the performance in Classfication Acc(Avg) on 20News. Figure 11(d) and 11(f) show the same for Reuters8 and Cade12 respectively. From these ﬁgures, we draw the following observations about the comparative methods:

Semafore performs the best on all datasets across various numbers of topics (Z). Semafore beats PLSV by 25% to 51% on 20News, by 6 13% on Reuters8, and by 22 32% on Cade12. These margins of performance with respect to PLSV are statistically signiﬁcant at 0.01 signiﬁcant level or lower in all cases. This eﬀectively showcases the utility of neighborhood regularization in enhancing the quality of visualization. By preserving local consistency, Semafore achieves a good accuracy even at small number of topics (e.g., 10).

PLSV performs better than PE (LDA) and t-SNE (LDA), which shows that there is utility to having a joint, instead of separate, modeling of topics and visualization. PE (LDA) and t-SNE (LDA) are worse than PLSV because it embeds documents by using two-step reductions that optimize separately two diﬀerent objective functions.

Therefore, the errors from the previous step may propagate to the next, without an opportunity for correction. This may cause distortions in the visualization.

In some cases, PLSV, PE (LDA) and t-SNE (LDA) tend to have decreasing accuracies when the number of topics increases. This may be because when number of topic increases, the topic distributions and the word probabilities may overﬁt the data and thus the accuracy is reduced. In contrast, Semafore shows a quite stable performance across diﬀerent numbers of topics. This may be explained by the utility of neighborhood regularization, which helps to prevent overﬁtting when the number of topics increases.

Neighborhood Preservation Accuracy. While having better classiﬁcation accuracy, Semafore also preserves well the local structure of the input data in the visualization space. The Preservation Acc(t) results in Figure 12(a), 12(c) and 12(e) show that Semafore is consistently better than the other baselines in terms of neighborhood preservation across diﬀerent t s and diﬀerent datasets. In Figure 12(b), 12(d) and 12(f), we vary the number of topics Z and report the Preservation Acc(Avg) results. Semafore beats PLSV by 41% to 76% on 20News, by 24 36% on Reuters8, and by 29 45% on Cade12 in terms of neighborhood preservation accuracy. The improvements of Semafore over PLSV are statistically signiﬁcant at 0.01 signiﬁcant level or lower in all cases. The above accuracy results are based on visualization coordinates. We have also computed accuracies based on topic distributions, which have similar trends.

6.4.2 Visualizations

To provide an intuitive appreciation, we brieﬂy describe a qualitative comparison of visualizations. For each method on each dataset, a visualization is shown as a scatterplot (best seen in color). Each document has a coordinate, and is assigned a shape and color based on its class. Each topic also has a coordinate, drawn as a black, hollow circle. A legend is provided, mapping each symbol to the corresponding class label. Note that this is an illustrative, rather than a comparative discussion, as an objective evaluation should not rely on eyeballing alone. However, as we have shown the quantitative results in the preceding section, in this section, we focus on the qualitative study of the output visualizations. 20News. Figure 13 shows a visualization of 20News dataset. Semafore s Figure 13(a) shows that the diﬀerent classes are well separated. There are distinct clusters of blue squares and purple diamonds at the top for hockey and baseball classes respectively, clusters of orange triangles and pink asterisks at the bottom for cryptography and medicine, etc. Beyond individual classes, the visualization also places related classes nearby. Computerrelated classes are found on the lower left. Politics and religion are on the lower right. Comparatively, Figure 13(b) by PLSV shows crowding at the center. For instance, motorcycle (green dashes) and autos (red dashes) are mixed at the center without a good separation. Figure 13(c) by PE (LDA) is worse. PE (LDA) does not give good separation for not similar classes. It mixes autos (red dashes) and space (green circles) together at the center. Medicine (pink asterisks) is also mixed with other classes in PE (LDA) while Semafore and PLSV give a good separation for it. Figure 13(d) is visualization by t SNE (LDA). Although t-SNE (LDA) can separate well hockey (blue squares) and baseball

Semantic Visualization with Neighborhood Graph Regularization

ĂƚŝŽŶͺ ĐĐ ƚͿ

ĞŵĂĨŽƌĞ W> s W > Ϳ ƚͲ E > Ϳ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

ůĂƐƐŝĨŝĐĂƚŝŽŶͺ ĐĐ ƚͿ

ĞŵĂĨŽƌĞ W> s W > Ϳ ƚͲ E > Ϳ

Ă ϮϬEĞǁƐ sĂƌǇ ƚ ĨŽƌ с ϮϬͿ ď ϮϬEĞǁƐ sĂƌǇ Ϳ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

ůĂƐƐŝĨŝĐĂƚŝŽŶͺ ĐĐ ǀŐͿ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

ƚŝŽŶͺ ĐĐ ǀŐͿ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

ůĂƐƐŝĨŝĐĂƚŝŽŶͺ ĐĐ ƚͿ

ĞŵĂĨŽƌĞ W> s W > Ϳ ƚͲ E > Ϳ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Đ ZĞƵƚĞƌƐϴ sĂƌǇ ƚ ĨŽƌ с ϮϬͿ Ě ZĞƵƚĞƌƐϴ sĂƌǇ Ϳ

Ğ ĂĚĞϭϮ sĂƌǇ ƚ ĨŽƌ с ϮϬͿ Ĩ ĂĚĞϭϮ sĂƌǇ Ϳ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

ůĂƐƐŝĨŝĐĂƚŝŽŶͺ ĐĐ ǀŐͿ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

Figure 11: Classiﬁcation Accuracy Comparison.

Ϭ Ϯ Ϭ ϯ Ϭ ϰ Ϭ ϱ

ĂƚŝŽŶͺ ĐĐ ƚͿ

ĞŵĂĨŽƌĞ W> s W > Ϳ ƚͲ E > Ϳ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

WƌĞƐĞƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

ĞŵĂĨŽƌĞ W> s W > Ϳ ƚͲ E > Ϳ

Ă ϮϬEĞǁƐ sĂƌǇ ƚ ĨŽƌ с ϮϬͿ ď ϮϬEĞǁƐ sĂƌǇ Ϳ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

WƌĞƐĞƌǀĂƚŝŽŶͺ ĐĐ ǀŐͿ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

Ϭ Ϯ Ϭ ϯ Ϭ ϰ Ϭ ϱ

ƚŝŽŶͺ ĐĐ ǀŐͿ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

WƌĞƐĞƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

ĞŵĂĨŽƌĞ W> s W > Ϳ ƚͲ E > Ϳ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Đ ZĞƵƚĞƌƐϴ sĂƌǇ ƚ ĨŽƌ с ϮϬͿ Ě ZĞƵƚĞƌƐϴ sĂƌǇ Ϳ

Ğ ĂĚĞϭϮ sĂƌǇ ƚ ĨŽƌ с ϮϬͿ Ĩ ĂĚĞϭϮ sĂƌǇ Ϳ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

WƌĞƐĞƌǀĂƚŝŽŶͺ ĐĐ ǀŐͿ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

Figure 12: Preservation Accuracy Comparison.

Semantic Visualization with Neighborhood Graph Regularization

(purple diamonds) classes, it is not able to detect their semantic similarities (as baseball and hockey are both about sports). In addition, it still mixes documents of diﬀerent classes together at the center and on the upper right. Reuters8. Figure 14 shows the visualization outputs for Reuters8 dataset. Semafore in Figure 14(a) is better at separating the eight classes into distinct clusters. In an anticlockwise direction from the top, we have navy blue diamonds (money-fx), red dashes (interest), red squares (crude), light blue pluses (earn), green triangles (acq), purple crosses (ship), blue asterisks (grain), and ﬁnally orange circles (trade). In comparison, PLSV in Figure 14(b) shows that several classes are intermixed at the center, including red dashes (interest), orange circles (trade), and navy blue diamonds (money-fx). PE (LDA) in Figure 14(c) is also worse when it mixes diﬀerentiated classes such as red dashes (interest) and navy blue diamonds (money-fx) together. t-SNE (LDA) in Figure 14(d) seems have better cluster separation but still mix documents with diﬀerent classes together such as red squares (crude) and green triangles (acq) on the upper right. Green triangles (acq) also mix with light blue pluses (earn) on the left in the visualization by t-SNE (LDA). Cade12. Figure 15 shows the visualization outputs for Cade12. This is the most challenging dataset. Even so, Semafore in Figure 15(a) still achieves a better separation between the classes, as compared to PLSV in Figure 15(b). Particularly, Semafore gives better separation for esportes (green triangles) as well as compras-on-line (orange circles) than PLSV and PE (LDA). t-SNE (LDA) shows quite good clusters for esportes (green triangles) as well as compras-on-line (orange circles) but it also merges many diﬀerent classes together as in the clusters on the right and on the upper right.

6.5 Comparison of Topic Models

One question is whether Semafore s gain in visualization quality over the closest baseline PLSV is at the expense of the quality of its topic model. To investigate this, we will compare the topic models of Semafore and PLSV, which share a core generative process. For parity, in this comparison, we only include the joint models, whereby the visualization coordinates aﬀect the topic models as well. The metric we use to measure the quality of topic models is pairwise mutual information or PMI. It measures topic interpretability, based on coocurrence frequencies of the top words in each topic in a large external corpus. Although other metrics such as perplexity or heldout likelihood can show the generalization ability of a learned topic model on unseen test data, these traditional metrics do not capture whether topics are coherent (Chang, Gerrish, Wang, Boyd-Graber, & Blei, 2009). Therefore, in this comparison, we rely on PMI, which can measure the quality of topic words in terms of their interpretability to a human. To human subjects, interpretability is closely related to coherence (Newman, Lau, Grieser, & Baldwin, 2010), i.e., how much the top keywords in each topic are associated with each other. After an extensive study of evaluation methods for coherence, Newman et al. (2010) identify Pointwise Mutual Information (PMI) as the best measure, in terms of having the greatest correlation with human judgments. PMI is based on term cooccurrences. For a pair of words wi and wj, PMI is deﬁned as log p(wi,wj) p(wi)p(wj). For a topic, we average the pairwise PMI s among the top 10 words of

Ă ĞŵĂĨŽƌĞ ď W> s

Ăůƚ ĂƚŚĞŝƐŵ

ĐŽŵƉ ƐǇƐ ŝďŵ ƉĐ ŚĂƌĚǁĂƌĞ

ŵŝƐĐ ĨŽƌƐĂůĞ

ƌĞĐ ƐƉŽƌƚ ďĂƐĞďĂůů

ƐĐŝ ĞůĞĐƚƌŽŶŝĐƐ

ƐŽĐ ƌĞůŝŐŝŽŶ ĐŚƌŝƐƚŝĂŶ

ĐŽŵƉ ŐƌĂƉŚŝĐƐ

ĐŽŵƉ ƐǇƐ ŵĂĐ ŚĂƌĚǁĂƌĞ

ƌĞĐ ƐƉŽƌƚ ŚŽĐŬĞǇ

ƚĂůŬ ƉŽůŝƚŝĐƐ ŐƵŶƐ

ĐŽŵƉ ŽƐ ŵƐͲǁŝŶĚŽǁƐ ŵŝƐĐ

ĐŽŵƉ ǁŝŶĚŽǁƐ ǆ

ƌĞĐ ŵŽƚŽƌĐǇĐůĞƐ

ƚĂůŬ ƉŽůŝƚŝĐƐ ŵŝĚĞĂƐƚ

ƚĂůŬ ƉŽůŝƚŝĐƐ ŵŝƐĐ ƚĂůŬ ƌĞůŝŐŝŽŶ ŵŝƐĐ ƚŽƉŝĐƐ

Figure 13: Visualization of documents in 20News for number of topics Z = 20. Each point represents a document and the shape and color represent document class. Each topic is drawn as a black, hollow circle.

Semantic Visualization with Neighborhood Graph Regularization

Ă ĞŵĂĨŽƌĞ ď W> s

ŵŽŶĞǇͲĨǆ ĐƌƵĚĞ ĂĐƋ ƐŚŝƉ ŐƌĂŝŶ ƚƌĂĚĞ ĞĂƌŶ ŝŶƚĞƌĞƐƚ ƚŽƉŝĐƐ

Figure 14: Visualization of documents in Reuters8 for number of topics Z = 20. Each point represents a document and the shape and color represent document class. Each topic is drawn as a black, hollow circle.

Ă ĞŵĂĨŽƌĞ ď W> s

ϬϭͺƐĞƌǀŝĐŽƐ

ϭϭ ĐŝĞŶĐŝĂƐ

ϬϮͺƐŽĐŝĞĚĂĚĞ

ϬϲͺĞĚƵĐĂĐĂŽ

ϭϬͺŶŽƚŝĐŝĂƐ

ϬϰͺŝŶĨŽƌŵĂƚŝĐĂ

ϬϳͺŝŶƚĞƌŶĞƚ

ϬϵͺĞƐƉŽƌƚĞƐ

ϭϮͺĐŽŵƉƌĂƐͺŽŶͺůŝŶĞ

Figure 15: Visualization of documents in Cade12 for number of topics Z = 20. Each point represents a document and the shape and color represent document class. Each topic is drawn as a black, hollow circle.

Semantic Visualization with Neighborhood Graph Regularization

ĞŵĂĨŽƌĞ W> s

ĞŵĂĨŽƌĞ W> s

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

Ă ϮϬEĞǁƐ ď ZĞƵƚĞƌƐϴ

ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

EƵŵďĞƌ ŽĨ dŽƉŝĐƐ Ϳ

Figure 16: Topic Interpretability of Semafore and PLSV in terms of PMI Score (higher is better).

that topic. For a topic model, we average PMI across the topics. Intuitively, PMI is higher (better), if each topic features words that are highly correlated with one another. Key to PMI is the use of an external corpus to estimate p(wi, wj) and p(wi). Following Newman et al. (2009), we use Google Web 1T 5-gram Version 1 (Brants & Franz, 2006), a huge corpus of n-grams generated from 1 trillion word tokens. p(wi) is estimated from the frequencies of 1-grams. As recommended by Newman et al., p(wi, wj) is estimated from the frequencies of 5-grams. We obtain PMI for the English-based 20News and Reuters8, but not for Cade12 because we do not possess a large-scale n-gram corpus speciﬁcally for Brazilian Portuguese. In Figure 16, we plot the PMI score for various number of topics Z. Semafore performs better than PLSV across most of the topics settings. In Figure 16(a) for 20News, except for the case at Z = 10, all cases of Semafore s outperformance are signiﬁcant at 0.05 level or lower. In Figure 16(b) for Reuters8, all cases of Semafore s outperformance are signiﬁcant at 0.05 level or lower except for the Z = 30. These results show that Semafore improves visualization while not sacriﬁcing the topic interpretability of learned topics. For a greater appreciation of the quality of the output topic models, in Appendix C, we show several examples of topic models for Z = 20, for both Semafore and PLSV, in terms of the top keywords with the highest probabilities for each topic.

7. Conclusion

In this paper, we address the semantic visualization problem, which jointly conducts topic modeling and visualization of documents. We propose a new framework to incorporate neighborhood structure within a probabilistic semantic visualization model called Semafore. The model is carefully designed to reﬂect the context of semantic visualization, leading to a number of design choices related to the RBF kernel for mapping topic and visualiza-

tion spaces, the approximation of neighborhood graph through construction and weighting, as well as the appropriate regularization functions and spaces. Experiments on real-life datasets show that Semafore signiﬁcantly outperforms the baselines in terms of visualization quality and accuracy, while having a similar, if not slightly better topic model. This provides evidence that neighborhood structure, together with joint modeling of topics and visualization, is important for semantic visualization.

Appendix A. Balancing Contributions of Neighbors and Non-neighbors to Regularization

As mentioned in Section 4.2, the balance between the contribution of neighbors R+ and non-neighbors R to the neighborhood regularization R in Equation 8 may require careful tuning of the graph parameters (i.e., ϵ or k). For example, in the case of using k-NN graph and N total number of documents, we would have k N terms in the neighbor regularization R+, and (N k)N terms in the non-neighbor regularization R . Supposing that N increases signiﬁcantly, there might be imbalance if k were to remain unchanged. Therefore, as N changes, k should also be tuned accordingly to maintain this balance. For a simplistic point, the ratio between k N and (N k)N would remain roughly the same if both N and k grow by similar factors. In practice, we recommend tuning k carefully. We run additional experiments to validate the above argument on the 20News dataset. Our basic point is that as N changes, k can be tuned to still show signiﬁcant improvement due to the neighborhood graph regularization. The closest baseline is PLSV, both empirically in terms of classiﬁcation accuracy, as well as conceptually as PLSV shares a similar generative process but with a diﬀerent kernel and without neighborhood regularization. Hence, we compare the performance of our method Semafore (with k-NN graph, heat kernel weighting, and Student-t kernel) to PLSV on various data sizes at Z = 20 topics.

Figure 17(a) is for dataset of size N = 500, and Semafore runs with k = 10.

Figure 17(b) is for dataset of size N = 1000, and Semafore runs with k = 10.

Figure 17(c) is for dataset of size N = 5000, and Semafore runs with k = 50.

We note that there is a 10X diﬀerence between the smallest and the largest datasets. Yet the relative outperformance of Semafore over PLSV by around 15% to 20% is evident across the three datasets. This supports the case that k can be tuned to produce a positive eﬀect using neighborhood graph regularization.

Appendix B. Additional Comparisons

As mentioned in Section 6.4, for completeness, we include here additional comparisons to visualization methods that do not also aim at topic modeling. In particular, we include two methods. First, we include t-SNE (Van der Maaten & Hinton, 2008), which is also used in the composite t-SNE (LDA). Second, we include Laplacian Eigen Maps (LE) (Belkin & Niyogi, 2003), which takes as input the neighborhood graph. Figure 18 and Figure 19 show the classiﬁcation accuracy and preservation accuracy of Semafore , t-SNE and LE

Semantic Visualization with Neighborhood Graph Regularization

ƚŝŽŶͺ ĐĐ ƚͿ

ĞŵĂĨŽƌĞ W> s

ƚŝŽŶͺ ĐĐ ƚͿ

ĞŵĂĨŽƌĞ W> s

ĂƚŝŽŶͺ ĐĐ ƚͿ

ĞŵĂĨŽƌĞ W> s

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Ă ϮϬEĞǁƐ ϱϬϬ ĚŽĐƵŵĞŶƚƐͿ ď ϮϬEĞǁƐ ϭϬϬϬ ĚŽĐƵŵĞŶƚƐͿ Đ ϮϬEĞǁƐ ϱϬϬϬ ĚŽĐƵŵĞŶƚƐͿ

Figure 17: Classiﬁcation accuracy comparison on 20News with various data sizes (Z = 20).

ĨŝĐĂƚŝŽŶͺ ĐĐ ƚͿ

ĞŵĂĨŽƌĞ > ƚͲ E

ĨŝĐĂƚŝŽŶͺ ĐĐ ƚͿ

ĨŝĐĂƚŝŽŶͺ ĐĐ ƚͿ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Ă ϮϬEĞǁƐ ď ZĞƵƚĞƌƐϴ Đ ĂĚĞϭϮ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Figure 18: Classiﬁcation accuracy comparison.

ƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

ĞŵĂĨŽƌĞ > ƚͲ E

ƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

ƌǀĂƚŝŽŶͺ ĐĐ ƚͿ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Ă ϮϬEĞǁƐ ď ZĞƵƚĞƌƐϴ Đ ĂĚĞϭϮ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Ϭ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ

Figure 19: Preservation accuracy comparison.

when varying t. Semafore outperforms LE in all cases. For t-SNE, Semafore outperforms t-SNE for Reuters8. However, for 20News and Cade12, it is more diﬃcult to tell whether Semafore or t-SNE is better. t-SNE tends to have decreasing accuracy as t increases. This is expected because t-SNE is known to focus on preserving the local structure (Van der Maaten & Hinton, 2008). When t is small, we basically consider only the local structure of the visualization. When t increases, we consider the more global structure of the visualization and Semafore outperforms t-SNE signiﬁcantly. Overall, Semafore is more stable than t-SNE as t changes, which indicates that Semafore tries to balance preserving the local and global structure better than t-SNE. We emphasize that this comparison is for information purpose only, as we do not regard t-SNE and LE as comparative baselines.

Appendix C. Topic Model Examples

We showcase the topic models derived by Semafore and PLSV. For 20News, Table 4 shows the topics of Semafore, and Table 5 shows the topics of PLSV. For Reuters8, Table 6 shows the topics of Semafore, and Table 7 shows the topics of PLSV. For Cade12, Table 8 shows the topics of Semafore, and Table 9 shows the topics of PLSV. For each method, we show the list of twenty topics. For each topic, we produce the top ten words with the highest probabilities. As shown by the top words, the topics do correspond strongly to some of the classes. For example, topic s19 in Table 4 for 20News is about Christianity, which corresponds to the soc.religion.christian class. Topic s4 is about cars and motorcycles, corresponding to rec.autos and rec.motorcycles. Topic s12 is probably concerning the categories of rec.sport.baseball and/or rec.sport.hockey. Overall, we observe that the quality of topic words are comparable across the comparative methods. Note that there is no direct correspondence between the topics of diﬀerent methods (e.g., the ﬁrst topic of Semafore may not correspond to the ﬁrst topic of PLSV). Through manual inspection, we can see that there are some related topics, e.g., s4 and p6, or s12 and p7. However, the sets of topics and the set of keywords for each topic are not identical. This is borne out in the slight diﬀerence in terms of PMI scores. This qualitative study helps to show that Semafore improves the visualization quality, while still maintaining at least the same quality of topic words, if not better. This supports the conclusion reached by the quantitative comparisons in the main manuscript.

Semantic Visualization with Neighborhood Graph Regularization

Table 4: Semafore s Topic Model for 20News (for 20 topics) Topic ID Top 10 Words s0 space, system, -rcb-, book, computer, university, list, post, price, science s1 article, year, good, write, guy, well, time, head, question, leave s2 gun, law, kuwait, people, death, fbus, article, control, weapon, child s3 window, ﬁle, program, widget, application, type, will, resource, call, function s4 car, bike, speed, engine, drive, lock, turn, mile, front, change s5 will, power, place, work, rate, write, sound, lead, good, interested s6 write, article, thing, time, people, better, start, problem, will, good s7 write, time, people, friend, pay, public, article, tax, opinion, money s8 people, claim, write, system, person, moral, evidence, objective, read, state s9 image, datum, graphic, send, ﬁle, format, package, software, mail, include s10 armenian, ﬁre, jew, child, kill, start, people, turkish, door, israel s11 system, board, will, datum, time, work, tape, test, copy, command s12 game, team, year, player, win, play, will, hit, season, hockey s13 will, post, space, good, time, include, cost, option, launch, people s14 drive, card, window, appear, disk, ram, driver, memory, work, color s15 mr., president, stephanopoulo, state, group, consider, party, question, issue, press s16 write, article, well, will, thing, work, point, include, time, help s17 key, article, chip, food, write, people, government, encryption, thing, algorithm s18 price, buy, apple, computer, dealer, ﬁt, model, problem, sell, monitor s19 god, jesus, will, christian, religion, faith, truth, bible, belief, church

Table 5: PLSV s Topic Model for 20News (for 20 topics) Topic ID Top 10 Words p0 write, people, christian, belief, time, faith, god, religion, life, will p1 god, will, jesus, kuwait, atheist, church, christian, man, religion, sin p2 armenian, appear, art, turkish, tartar, 1st, village, armenia, 1.40, genocide p3 will, key, write, time, article, government, system, thing, chip, hit p4 mr., stephanopoulo, president, will, party, state, door, time, meeting, open p5 write, ﬁre, article, gun, system, -rcb-, start, people, fbus, claim p6 car, will, bike, engine, drive, well, dealer, battery, change, front p7 game, win, year, will, team, play, season, good, goal, playoﬀ p8 player, team, write, hockey, game, fan, article, year, will, guy p9 space, system, datum, will, april, nasa, security, university, computer, list p10 graphic, image, ﬁle, ftp, send, format, package, system, datum, object p11 image, datum, program, window, version, ﬁle, software, tool, support, user p12 drive, jumper, master, ndet loop, slave, rate, gun, function, crime, set p13 window, ﬁle, card, will, program, color, driver, support, disk, bit p14 people, write, state, article, law, government, country, rights, jew, will p15 write, article, thing, people, good, will, time, lot, year, day p16 work, drive, tape, scsus, problem, simm, controller, write, memory, article p17 widget, -rcb-, window, -lcb-, application, resource, set, visual, type, ﬁle p18 price, will, write, system, computer, article, apple, chip, monitor, board p19 will, vote, comp, newsgroup, suit, problem, os2, sco, post, mail

Table 6: Semafore s Topic Model for Reuters8 (for 20 topics) Topic ID Top 10 Words s0 company, pipeline, raise, crude, march, spokesman, reﬁnery, capacity, corp, post s1 pct, bank, day, stg, today, reuter, money, market, mln, bill s2 oﬀer, share, company, board, group, acquire, stock, dlr, acquisition, receive s3 exchange, currency, dollar, west, ﬁnance, baker, monetary, germany, continue, interest s4 share, reuter, dlr, mln, buy, company, corp, pay, stock, group s5 price, opec, market, bpd, oﬃcial, february, month, output, saudus, january s6 rate, bank, pct, cut, fund, prime, point, reserve, issue, lower s7 billion, foreign, import, increase, dlr, trade, economic, export, will, country s8 bank, billion, market, government, fall, stock, economy, rise, surplus, deﬁcit s9 will, company, sell, pct, vessel, operation, week, billion, shipping, unit s10 strike, port, union, spokesman, cargo, employer, worker, sector, redundancy, court s11 oil, export, dlr, industry, year, pct, future, company, report, price s12 reuter, pct, report, national, week, brazil, today, increase, pay, april s13 trade, japan, japanese, reagan, state, tariﬀ, unite, market, washington, oﬃcial s14 grain, mln, soviet, crop, tonne, year, usda, production, fall, analyst s15 trade, talk, gulf, gatt, bill, yeutter, round, reuter, call, negotiation s16 certiﬁcate, reuter, cost, government, program, agreement, agriculture, will, study, loan s17 year, oﬃcial, import, will, state, price, government, china, land, rise s18 mln, ct, loss, net, shr, dlr, proﬁt, qtr, reuter, year s19 oil, mln, will, barrel, dlr, crude, source, level, petroleum, day

Table 7: PLSV s Topic Model for Reuters8 (for 20 topics) Topic ID Top 10 Words p0 will, oil, company, reuter, industry, canada, price, shell, raise, sell p1 rate, currency, dollar, exchange, baker, west, will, bank, reuter, treasury p2 bank, pct, day, import, year, rate, export, february, expect, reuter p3 share, company, corp, oﬀer, stock, board, will, reuter, dlr, buy p4 rate, bank, pct, prime, cut, point, interest, market, lower, savings p5 market, bank, stock, price, japan, ministry, rise, oﬃcial, gulf, bond p6 reuter, pct, week, report, year, march, mark, american, commission, ﬁgure p7 mln, ct, loss, net, dlr, shr, year, proﬁt, qtr, reuter p8 mln, pct, billion, stg, dlr, reuter, market, january, revise, rise p9 billion, dlr, rate, market, surplus, currency, reserve, trading, dollar, foreign p10 oil, opec, price, bpd, pipeline, mln, crude, oﬃcial, dlr, output p11 crude, dlr, barrel, corp, capacity, reﬁnery, oil, company, oﬀer, group p12 reuter, oﬃcial, state, cut, gulf, government, today, action, force, tell p13 oil, government, indonesium, price, foreign, bank, billion, reserve, company, industry p14 certiﬁcate, company, mln, year, grain, cooperative, program, dlr, government, cost p15 year, trade, agriculture, reuter, grain, agreement, gatt, yeutter, ﬁnancial, agricultural p16 strike, port, union, spokesman, employer, brazil, cargo, worker, redundancy, sector p17 trade, japan, japanese, reagan, tariﬀ, unite, washington, state, nakasone, semiconductor p18 grain, mln, crop, tonne, soviet, year, oﬃcial, china, pct, oﬀer p19 trade, country, minister, talk, state, meeting, economic, exchange, issue, baldrige

Semantic Visualization with Neighborhood Graph Regularization

Table 8: Semafore s Topic Model for Cade12 (for 20 topics) Topic ID Top 10 Words s0 sp, aulas, tecnologia, rj, sao, area, janeiro, particulares, areas, ﬁsica s1 terra, jun, gif, busca, virtual, brasil, forum, tempo, noticias, revistas s2 trabalho, seguranca, saude, medicina, ocupacional, prevencao, ppra, pcmso, imagem, imagens s3 peixes, cade, lazer, pesca, agua, rio, praia, hotel, sao, doce s4 agar, vida, personal, ﬁsica, base, tratamento, tem, pode, sistema, trainer s5 sao, br, rio, sul, criancas, www, escola, mail, http, atendimento s6 links, page, home, fotos, pagina, dicas, download, tenis, informacoes, jogos s7 internet, informatica, acesso, mg, br, servicos, provedor, mail, revista, horizonte s8 servicos, sao, paulo, entregas, entrega, sp, cesta, express, empresa, servico s9 pesca, sp, grupo, brasil, eventos, video, mg, informacoes, turismo, danca s10 astronomia, pagina, jose, foi, bem, espaco, tem, veja, ﬁlosoﬁa, correio s11 mp, banda, musicas, rock, musica, page, letras, bandas, pagina, site s12 historia, cultura, mundo, site, page, brasil, informacoes, rs, livro, arte s13 noticias, jornal, cidade, sp, sao, regiao, demolay, ordem, rio, capitulo s14 empresas, informacoes, informacao, dados, atraves, textos, mail, equipe, unicamp, centro s15 engenharia, servicos, projetos, empresa, consultoria, quimica, instituto, pesquisa, rio, manutencao s16 site, informacoes, brasil, associacao, educacao, pagina, organizacao, centro, brasileira, direitos s17 software, web, empresa, sistemas, sistema, br, marketing, desenvolvimento, windows, dados s18 virtual, online, venda, produtos, cade, shopping, internet, loja, compras, cursos s19 futebol, informacoes, fotos, clube, historia, paulo, sao, quake, pagina, cade

Table 9: PLSV s Topic Model for Cade12 (for 20 topics) Topic ID Top 10 Words p0 engenharia, projetos, servicos, trabalho, empresa, consultoria, seguranca, sp, medicina, sao p1 sao, ong, rio, instituto, personal, educacao, organizacao, sp, paulo, ﬁns p2 sao, br, desenvolvimento, sistema, tratamento, mail, sistemas, clientes, informacoes, empresa p3 aulas, formula, quimica, particulares, informacoes, matematica, pilotos, fotos, ﬁsica, site p4 jornal, tenis, noticias, esportes, sp, informacoes, sao, esporte, fotos, links p5 musica, page, rock, bandas, links, home, pagina, musicas, music, fotos p6 pesca, demolay, sp, peixes, sao, fotos, ordem, capitulo, paulo, jitsu p7 mp, musicas, nacionais, agar, internacionais, rock, formato, site, page, pagina p8 pesquisa, tecnologia, informacoes, cade, ciencia, geograﬁa, pesquisas, area, instituto, pagina p9 site, pagina, internet, mail, clique, veja, br, pode, foi, links p10 astronomia, informacoes, cultura, site, pagina, brasil, home, page, fotos, historia p11 banda, fotos, rock, letras, page, musicas, pagina, site, home, mp p12 internet, provedor, acesso, mg, informatica, software, servicos, belo, horizonte, manutencao p13 futebol, clube, sao, paulo, campeonato, historia, informacoes, pagina, turismo, tricolor p14 noticias, terra, internet, brasil, informatica, online, jornal, virtual, servicos, busca p15 links, page, quake, home, pagina, fotos, dicas, mp, download, informacoes p16 grupo, banda, karate, pagina, page, informacoes, fotos, home, rio, historia p17 produtos, virtual, shopping, cade, venda, online, sao, rio, loja, compras p18 br, sao, informacoes, marketing, mail, empresa, internet, www, fax, site p19 vida, dia, sao, foi, terra, panico, jose, tem, planetas, grande

Akkucuk, U., & Carroll, J. D. (2006). PARAMAP vs. Isomap: a comparison of two nonlinear mapping algorithms. Journal of Classiﬁcation, 23(2), 221 254.

Bai, L., Guo, J., Lan, Y., & Cheng, X. (2014). Local Linear Matrix Factorization for Document Modeling. In Advances in Information Retrieval, pp. 398 411. Springer.

Belkin, M., & Niyogi, P. (2001). Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems (NIPS), Vol. 14, pp. 585 591.

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373 1396.

Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research (JMLR), 7, 2399 2434.

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Bishop, C. M., Svens en, M., & Williams, C. K. (1998). GTM: The generative topographic mapping. Neural Computation, 10(1), 215 234.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 3, 993 1022.

Brants, T., & Franz, A. (2006). Web 1T 5-gram Version 1. Linguistic Data Consortium, Philadelphia.

Buhmann, M. D. (2000). Radial basis functions. Acta Numerica 2000, 9.

Cai, D., Mei, Q., Han, J., & Zhai, C. (2008). Modeling hidden topics on document manifold. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM).

Cai, D., Wang, X., & He, X. (2009). Probabilistic dyadic data analysis with local and global consistency. In Proceedings of the International Conference on Machine Learning (ICML).

Cardoso-Cachopo, A. (2007). Improving Methods for Single-label Text Categorization. Ph D Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa.

Carey, C., & Mahadevan, S. (2014). Manifold Spanning Graphs. In Twenty-Eighth AAAI Conference on Artiﬁcial Intelligence.

Chaney, A. J.-B., & Blei, D. M. (2012). Visualizing Topic Models. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM).

Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems, pp. 288 296.

Chen, L., & Buja, A. (2009). Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. Journal of the American Statistical Association, 104(485), 209 219.

Semantic Visualization with Neighborhood Graph Regularization

Chi, E. H.-h. (2000). A taxonomy of visualization techniques using the data state reference model. In Proceedings of the IEEE Symposium on Information Visualization (Info Vis), pp. 69 75.

Choo, J., Lee, C., Reddy, C. K., & Park, H. (2013). UTOPIAN: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Transactions on Visualization and Computer Graphics, 19(12), 1992 2001.

Chuang, J., Manning, C. D., & Heer, J. (2012). Termite: visualization techniques for assessing textual topic models. In Proceedings of the International Working Conference on Advanced Visual Interfaces (AVI), pp. 74 77.

Coifman, R. R., & Lafon, S. (2006). Diﬀusion maps. Applied and Computational Harmonic Analysis, 21(1), 5 30.

Comon, P. (1994). Independent component analysis, a new concept?. Signal Processing, 36(3), 287 314.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1 38.

Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Deerwester, S., et al. (1995). Latent semantic indexing. In Proceedings of the Text Retrieval Conference.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179 188.

Gabrilovich, E., & Markovitch, S. (2009). Wikipedia-based semantic interpretation for natural language processing. Journal of Artiﬁcial Intelligence Research (JAIR), 34(2), 443.

Golub, G. H., & Van Loan, C. F. (2012). Matrix Computations, Vol. 3. JHU Press.

Gretarsson, B., O donovan, J., Bostandjiev, S., H ollerer, T., Asuncion, A., Newman, D., & Smyth, P. (2012). Topic Nets: Visual analysis of large text corpora with topic modeling. ACM Transactions on Intelligent Systems and Technology (TIST), 3(2), 23.

Hein, M., Audibert, J.-y., & Luxburg, U. V. (2007). Graph Laplacians and their Convergence on Random Neighborhood Graphs. In Journal of Machine Learning Research, pp. 1325 1368.

Hinton, G. E., & Roweis, S. T. (2002). Stochastic neighbor embedding. In Advances in Neural Information Processing Systems (NIPS), pp. 833 840.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 50 57.

Hu, Y., Boyd-Graber, J., Satinoﬀ, B., & Smith, A. (2014). Interactive topic modeling. Machine Learning, 95(3), 423 469.

Huh, S., & Fienberg, S. E. (2012). Discriminative topic modeling based on manifold learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(4), 20.

Iwata, T., Saito, K., Ueda, N., Stromsten, S., Griﬃths, T. L., & Tenenbaum, J. B. (2007). Parametric embedding for class visualization. Neural Computation, 19(9), 2536 2556.

Iwata, T., Yamada, T., & Ueda, N. (2008). Probabilistic latent semantic visualization: topic model for visualizing documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 363 371.

Jebara, T., Wang, J., & Chang, S.-F. (2009). Graph construction and b-matching for semisupervised learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 441 448. ACM.

Jolliﬀe, I. (2005). Principal Component Analysis. Wiley Online Library.

Kim, M., & Torre, F. (2010). Local minima embedding. In Proceedings of the International Conference on Machine Learning (ICML), pp. 527 534.

Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464 1480.

Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of ﬁt to a nonmetric hypothesis. Psychometrika, 29(1), 1 27.

Laﬀerty, J. D., & Wasserman, L. (2007). Statistical Analysis of Semi-Supervised Regression. In Advances in Neural Information Processing Systems (NIPS), pp. 801 808.

Le, T., & Lauw, H. W. (2014a). Semantic visualization for spherical representation. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1007 1016. ACM.

Le, T. M., & Lauw, H. W. (2014b). Manifold learning for jointly modeling topic and visualization. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence.

Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45, 503 528.

Manning, C. D., Raghavan, P., Sch utze, H., et al. (2008). Introduction to Information Retrieval, Vol. 1. Cambridge University Press Cambridge.

Mc Callum, A., Wang, X., & Corrada-Emmanuel, A. (2007). Topic and role discovery in social networks with experiments on enron and academic email.. Journal of Artiﬁcial Intelligence Research (JAIR), 30, 249 272.

Millar, J. R., Peterson, G. L., & Mendenhall, M. J. (2009). Document Clustering and Visualization with Latent Dirichlet Allocation and Self-Organizing Maps. In FLAIRS Conference, Vol. 21, pp. 69 74.

Newman, D., Karimi, S., & Cavedon, L. (2009). External evaluation of topic models. In Australasian Document Computing Symposium (ADCS).

Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 100 108.

Park, J., & Sandberg, I. W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3(2), 246 257.

Semantic Visualization with Neighborhood Graph Regularization

Ponzetto, S. P., & Strube, M. (2007). Knowledge derived from Wikipedia for computing semantic relatedness.. Journal of Artiﬁcial Intelligence Research (JAIR), 30, 181 212.

Reisinger, J., Waters, A., Silverthorn, B., & Mooney, R. J. (2010). Spherical topic models. In Proceedings of the International Conference on Machine Learning (ICML), pp. 903 910.

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323 2326.

Shaw, B., & Jebara, T. (2007). Minimum volume embedding. In Proceedings of the International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), pp. 460 467.

Shaw, B., & Jebara, T. (2009). Structure preserving embedding. In Proceedings of the International Conference on Machine Learning (ICML), pp. 937 944. ACM.

Tenenbaum, J. B., De Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319 2323.

Ting, D., Huang, L., & Jordan, M. I. (2010). An Analysis of the Convergence of Graph Laplacians. In Proceedings of the International Conference on Machine Learning (ICML).

Turney, P. D., Pantel, P., et al. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artiﬁcial Intelligence Research (JAIR), 37(1), 141 188.

Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9(2579-2605), 85.

Wei, F., Liu, S., Song, Y., Pan, S., Zhou, M. X., Qian, W., Shi, L., Tan, L., & Zhang, Q. (2010). Tiara: a visual exploratory text analytic system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 153 162.

Wu, H., Bu, J., Chen, C., Zhu, J., Zhang, L., Liu, H., Wang, C., & Cai, D. (2012). Locally discriminative topic modeling. Pattern Recognition, 45(1), 617 625.

Zemel, R. S., & Carreira-Perpi n an, M. A. (2004). Proximity graphs for clustering and manifold learning. In Advances in Neural Information Processing Systems (NIPS), pp. 225 232.

Zhou, D., Bousquet, O., Lal, T. N., Weston, J., & Sch olkopf, B. (2004). Learning with local and global consistency. Advances in Neural Information Processing Systems (NIPS), 16(16).

Zhu, X., Ghahramani, Z., Laﬀerty, J., et al. (2003). Semi-supervised learning using Gaussian ﬁelds and harmonic functions. In Proceedings of the International Conference on Machine Learning (ICML), Vol. 3, pp. 912 919.