# Dirichlet Simplex Nest and Geometric Inference

Mikhail Yurochkin 1 2 *, Aritra Guha 3 *, Yuekai Sun 3, XuanLong Nguyen 3

*Equal contribution. 1 IBM Research, Cambridge. 2 MIT-IBM Watson AI Lab. 3 Department of Statistics, University of Michigan. Correspondence to: Mikhail Yurochkin. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s). Code: https://github.com/moonfolk/VLAD

Abstract

We propose the Dirichlet Simplex Nest, a class of probabilistic models suitable for a variety of data types, and develop fast and provably accurate inference algorithms by accounting for the model's convex geometry and low-dimensional simplicial structure. By exploiting the connection to Voronoi tessellation and properties of the Dirichlet distribution, the proposed inference algorithm is shown to achieve consistency and strong error bound guarantees on a range of model settings and data distributions. The effectiveness of our model and the learning algorithm is demonstrated by simulations and by analyses of text and financial data.

1. Introduction

For many complex probabilistic models, especially those with latent variables, the probability distribution of interest can be represented as an element of a convex polytope in a suitable ambient space, for which model fitting may be cast as the problem of finding the extreme points of the polytope. For instance, a mixture density can be identified as a point in a convex set of distributions whose extreme points are the mixture components. In the well-known topic model (Blei et al., 2003) for text analysis, a document corresponds to a point drawn from the topic polytope, whose extreme points are the topics to be inferred. This convex geometric viewpoint provides the basis for posterior contraction analysis of topic models, as well as for developing fast geometric inference algorithms (Nguyen, 2015; Tang et al., 2014; Yurochkin & Nguyen, 2016; Yurochkin et al., 2017).

The basic topic model, the Latent Dirichlet Allocation (LDA) of Blei et al. (2003), as well as the comparable finite admixtures developed in population genetics (Pritchard et al., 2000), were originally designed for categorical data. However, there are many real-world applications in which convex geometric probabilistic modeling continues to be a sensible approach, even if the observed measurements are no longer discrete-valued but are endowed with a variety of distributions. To expand the scope of admixture modeling to a variety of data types, we propose to study the Dirichlet Simplex Nest (DSN), a class of probabilistic models that generalizes the LDA, and to develop fast and provably accurate inference algorithms by accounting for the model's convex geometry and its low-dimensional simplicial structure.

The generative process given by a DSN is simple to describe: starting from a simplex B of K vertices embedded in a high-dimensional ambient space S, one draws random points from B's relative interior according to a Dirichlet distribution. Given each such point, a data point is generated according to a suitable probability kernel F. For the general simplex nest, S can be any vector space of dimension D ≥ K − 1, while the probability kernel F can be taken to be Gaussian, Multinomial, Poisson, etc., depending on the nature of the observed data (continuous, categorical or counts, respectively).
If S is the standard probability simplex and F a Multinomial distribution over the categories, then the model reduces to the familiar LDA model of Blei et al. (2003).

Although several geometric aspects of the DSN can be found in a vast array of well-known models in the literature, they were rarely treated together. First, viewing data as noisy observations from the low-dimensional affine hull that contains B, our model shares an assumption that can be found in both classical factor analysis and non-negative matrix factorization (NMF) models (Lee & Seung, 2001), as well as the work of Anandkumar et al. (2012) and Arora et al. (2012b) arising in topic models. Second, the convex constraints (i.e., the weights of a convex combination are non-negative and sum to one) are present in all latent variable probabilistic modeling, even though the dominant computational approaches to inference, such as MCMC sampling (Griffiths & Steyvers, 2004) and variational inference (Blei et al., 2003; Hoffman et al., 2013; Kucukelbir et al., 2017), do not appear to take advantage of the underlying convex geometry.

As is the case with topic models, scalable parameter estimation is a key challenge for the Dirichlet Simplex Nest. Thus, our main contribution is a novel inference algorithm that accounts for the convex geometry and low dimensionality of the latent simplex structure endowed with a Dirichlet distribution. Starting with the original geometric technique of Yurochkin & Nguyen (2016), we present several new ideas allowing for more effective learning of asymmetric simplicial structures and of the Dirichlet's concentration parameter for the general DSN model, thereby expanding its applicability to a broad range of data distributions. We also establish statistical consistency and estimation error bounds for the proposed algorithm.

The paper proceeds as follows. Section 2 describes Dirichlet Simplex Nest models and reviews existing geometric inference techniques. Section 3 elucidates the convex geometry of the DSN via its connection to the Voronoi tessellation of simplices and the structure of the Dirichlet distribution on low-dimensional simplices. This helps motivate the proposed Voronoi Latent Admixture (VLAD) algorithm. Theoretical analysis of VLAD is given in Section 4. Section 5 presents an exhaustive comparative study on simulated and real data. We conclude with a discussion in Section 6.

2. Dirichlet Simplex Nest

We proceed to formally describe the Dirichlet Simplex Nest as a generative model. Let $\beta_1, \dots, \beta_K \in S$ be K elements in a D-dimensional vector space S, and define $B = \mathrm{Conv}(\beta_1, \dots, \beta_K)$ as their convex hull. When $K \leq D + 1$, B is a simplex in general position. Next, for each $i = 1, \dots, n$, generate a random vector $\mu_i \in B$ by taking $\mu_i := \sum_{k=1}^K \theta_{ik}\beta_k$, where the corresponding coefficient vector $\theta_i = (\theta_{i1}, \dots, \theta_{iK}) \in \Delta^{K-1}$ is generated by letting $\theta_i \sim \mathrm{Dir}_K(\alpha)$ for some concentration parameter $\alpha \in \mathbb{R}^K_+$. Now, given $\mu_i$, the data point $x_i$ is generated by $x_i \mid \mu_i \sim F(\cdot \mid \mu_i)$, where F is a given probability kernel such that $\mathbb{E}[x_i \mid \theta_i] = \mu_i$ for any $i = 1, \dots, n$.

Relation to existing models. The DSN encompasses several existing models in the literature. If we set $S := \Delta^{D-1}$ and the likelihood kernel $F(\cdot)$ to Multinomial, then we recover the LDA model (Blei et al., 2003). Other specific instances include Gaussian-Exponential (Schmidt et al., 2009) and Poisson-Gamma models (Cemgil, 2009). Estimating B is a challenging task for the general Dirichlet Simplex Nest model.
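To make the generative description concrete, the following is a minimal sketch of sampling from a DSN with a Gaussian kernel. The function name `generate_dsn`, the example vertex matrix, and the noise level `sigma` are illustrative assumptions for this sketch, not part of the paper's reference implementation.

```python
# A minimal sketch of the DSN generative process with a Gaussian kernel.
# Names and the example vertex matrix are illustrative, not the reference code.
import numpy as np

def generate_dsn(B, n, alpha, sigma=0.1, rng=None):
    """Draw n observations from a DSN whose vertices are the columns of B (D x K)."""
    rng = np.random.default_rng(rng)
    D, K = B.shape
    theta = rng.dirichlet(alpha * np.ones(K), size=n)   # theta_i ~ Dir_K(alpha), shape (n, K)
    mu = theta @ B.T                                    # mu_i = sum_k theta_ik * beta_k, inside Conv(B)
    x = mu + sigma * rng.standard_normal((n, D))        # x_i | mu_i ~ N(mu_i, sigma^2 I_D)
    return x, theta

# Example: a non-equilateral triangle in R^3 (D = K = 3), echoing the toy problem below.
B = np.array([[1.0, 0.0, 0.2],
              [0.0, 1.0, 0.1],
              [0.0, 0.0, 1.5]])
X, theta = generate_dsn(B, n=5000, alpha=2.5, sigma=0.1, rng=0)
```

Replacing the final Gaussian draw with, e.g., a Poisson or Multinomial draw with mean $\mu_i$ yields the count and categorical variants mentioned above.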
Taking the perspective of Bayesian inference, a standard MCMC implementation for the DSN is likely to be computationally inefficient. In the case of LDA, as noted in Yurochkin & Nguyen (2016), the inefficiency of posterior inference can be traced to the need for approximating the posterior distributions of the large number of latent variables representing the topic labels. With the DSN model, we bypass the representation of such latent variables by integrating them out, but doing so comes at the cost of losing conjugacy. An alternative technique is variational inference (cf. Blei et al. (2017); Paisley et al. (2014)). While very fast, this powerful method may be inaccurate in practice and does not carry a strong theoretical guarantee.

Relation to NMF and archetypal analysis. The DSN provides a probabilistic justification for these methods, which often impose an additional geometric condition on the model, known as separability, that identifies the model parameters in a way that permits efficient estimation (Donoho & Stodden, 2003; Arora et al., 2012a; Gillis & Vavasis, 2012). Separability is somewhat related to a control on the Dirichlet's concentration parameter α, obtained by setting α to be sufficiently small. The DSN allows for a probabilistic description of the nature of the separation. Moreover, by also addressing the case where α is large, DSN modeling provides an arguably more effective approach to archetypal analysis and non-negative matrix factorization for non-separable data. We remark that an approach proposed by Huang et al. (2016) also permits a more general geometric identification condition, called sufficiently scattered, but this generality comes at the expense of efficient estimation.

Geometric inference. The Geometric Dirichlet Means (GDM) algorithm of Yurochkin & Nguyen (2016) is a geometric technique for estimating the (topic) simplex B that arises in the Latent Dirichlet Allocation model. The basic idea of GDM is simple: perform K-means clustering on the n points $\mu_i$ (or their estimates) to obtain K centroids. These centroids cannot serve as good estimates of B's vertices, but they provide reasonable directions toward the vertices. Starting from the simplex's estimated centroid, the GDM constructs K line segments connecting it to the K centroids and suitably extends the rays to provide an estimate of the K vertices. The GDM method is shown to be accurate when either B is equilateral, or the Dirichlet concentration parameter α is very small, i.e., most of the points $\mu_i$ are concentrated near the vertices. The quality of the estimates deteriorates in the absence of such conditions.

The deficiency of the GDM algorithm can be attributed to several factors. First, for a general simplex, the K-means centroids and the simplex's vertices do not line up; fortunately, we will see that they may be aligned by a suitable affine transformation of the simplex structure. Second, the nature of the Dirichlet distribution on the simplex is not pro-actively exploited, including that of the parameter α. Third, since typically K ≪ D, the affine hull of B is a very low-dimensional structure, a fact not utilized by the GDM algorithm. It turns out that these shortcomings may be overcome by a careful consideration of the geometric structure of the simplex and the Dirichlet distribution. For illustration, we consider a toy problem of learning the extreme points of a simplex B, given the Gaussian data likelihood $x_i \mid \mu_i \sim \mathcal{N}(\mu_i, \sigma^2 I_D)$ and D = K = 3. The triangle is chosen to be non-equilateral and the Dirichlet concentration parameter is set to α = 2.5.
Figure 1: Toy simplex learning; n = 5000, D = 3, K = 3, α = 2.5, σ = 0.1. Panels: (a) GDM, time ≈ 1s; (b) Xray, time < 1s; (c) HMC, time ≈ 10min; (d) VLAD, time < 1s.

Figure 1a illustrates the deteriorating performance of the GDM. In Figure 1b, we also observe Xray (Kumar et al., 2013), another recent NMF algorithm, failing to solve the problem, as the aforementioned separability assumption is violated for large α. On the other hand, Figure 1c demonstrates the high accuracy of the posterior mean obtained by Hamiltonian Monte Carlo (HMC) (Neal et al., 2011; Hoffman & Gelman, 2014) implemented using Stan (Carpenter et al., 2017), albeit at the cost of 10 minutes of training time. Lastly, our new algorithm (VLAD) in Fig. 1d exhibits an accuracy comparable to that of the HMC at the run-time of the GDM algorithm.

3. Inference of the Dirichlet Simplex Nest

3.1. Simplicial Geometry

In order to motivate our algorithm, we elucidate the geometry of the DSN through the concept of the Centroidal Voronoi Tessellation (CVT) (Du et al., 1999) of a simplex B, a convex subset of the D-dimensional metric space S.

Definition 1 (Centroidal Voronoi Tessellation). Let $\Omega \subseteq S$ be an open set equipped with a distance function d and a probability density ρ. For a set of K points $c_1, \dots, c_K$, the Voronoi cell corresponding to $c_k$ is the set $V_k = \{x \in \Omega : d(x, c_k) < d(x, c_l) \text{ for any } l \neq k\}$. The collection of Voronoi cells $V_1, \dots, V_K$ is a tessellation of Ω; i.e., the cells are disjoint and $\cup_k V_k = \Omega$. If $c_1, \dots, c_K$ are also the centroids of their respective Voronoi cells, then the tessellation is a Centroidal Voronoi Tessellation.

CVTs are special: any set of K points induces a Voronoi tessellation, but these points are generally not the centroids of their associated cells. One can check that a CVT minimizes

$$J(c_1, \dots, c_K) = \sum_{k=1}^K \int_{V_k} d(x, c_k)^2 \rho(x)\,dx.$$

It is a fact that J has a unique global minimizer as long as ρ vanishes on a set of measure zero, the Voronoi cells are convex, and the distance function is convex in each argument (Du et al., 1999). Moreover, it can be seen that the centroids of the CVT of an equilateral simplex equipped with the $\mathrm{Dir}_K(\alpha)$ distribution fall on the line segments between the centroid of the simplex and the extreme points of the simplex, but this is not the case when the simplex is non-equilateral (cf. Fig. 1a).

The following lemma formalizes the aforementioned insight for a simplex B of arbitrary shape by considering a suitably modified distance function d(·,·) for the CVT. (In Fig. 1d, the blue, purple and yellow dots are the sample versions of the Voronoi cells of the CVT under the new distance function, and the corresponding centroids are in red.)

Lemma 1. Let $B \in \mathbb{R}^{D \times K}$ denote the matrix form of the simplex B and suppose it has full (column) rank. Equip B with the distance function $d(x, y)^2 = (x - y)^T (BB^T)^\dagger (x - y)$ and the probability distribution $P_B$ defined as $P_B(S) = \mathrm{Prob}(\{\theta \in \Delta^{K-1} : B\theta \in S\})$ for any $S \subseteq \mathrm{int}(B)$, where θ is distributed according to the symmetric Dirichlet density $\rho_\alpha := \mathrm{Dir}_K(\alpha)$ and $A^\dagger$ denotes a pseudoinverse of A. Then the centroids of its CVT fall on the line segments connecting the centroid of B to $\beta_1, \dots, \beta_K$.

Proof. Let $c_1, \dots, c_K$ and $V_1, \dots, V_K$ be the centroids and cells of the CVT of $\Delta^{K-1}$ equipped with the Euclidean distance and the $\mathrm{Dir}_K(\alpha)$ density $\rho_\alpha$. It suffices to verify that $Bc_1, \dots, Bc_K$ and $BV_1, \dots, BV_K$ are the centroids and cells of the CVT of $B = B\Delta^{K-1}$.
By a change of variables formula,

$$\arg\min_{Bv \,:\, v \in V_k} \frac{\int_{BV_k} \|x - Bv\|^2_{(BB^T)^\dagger}\, \rho_\alpha(B^\dagger x)\,|\det(B^\dagger)|\,dx}{\int_{BV_k} \rho_\alpha(B^\dagger x)\,|\det(B^\dagger)|\,dx}
= \arg\min_{Bv \,:\, v \in V_k} \frac{\int_{V_k} \|B\theta - Bv\|^2_{(BB^T)^\dagger}\, \rho_\alpha(\theta)\,d\theta}{\int_{V_k} \rho_\alpha(\theta)\,d\theta}
= \arg\min_{Bv \,:\, v \in V_k} \frac{\int_{V_k} \|\theta - v\|^2_2\, \rho_\alpha(\theta)\,d\theta}{\int_{V_k} \rho_\alpha(\theta)\,d\theta},$$

which we recognize as the centroids of the CVT of $\Delta^{K-1}$ under the ℓ2 metric. Since $\Delta^{K-1}$ is a standard simplex and therefore equilateral, the centroids of the CVT of an equilateral simplex fall on the line segments connecting the centroid of the simplex to its extreme points.

Lemma 1 suggests an algorithm to estimate the extreme points of B: first, estimate the centroids of the CVT of B (equipped with the scaled Euclidean norm $\|\cdot\|_{(BB^T)^\dagger}$), and then search along the rays extending from the centroid of B through the CVT centroids for the simplicial vertices.

3.2. The Voronoi Latent Admixture (VLAD) Algorithm

We first consider the noiseless problem, $F(\cdot \mid \mu) = \delta_\mu$; that is, the $x_i = \mu_i$ are observed. In this case, Lemma 1 suggests estimating the CVT centroids by scaled K-means optimization:

$$\arg\min_{c_1, \dots, c_K} \sum_{k=1}^K \sum_{x_i \in V_k} (x_i - c_k)^T (BB^T)^\dagger (x_i - c_k). \qquad (1)$$

Unfortunately, the scaled Euclidean norm $\|\cdot\|_{(BB^T)^\dagger}$ is unknown. We propose an equivalent approach that does not depend on knowledge of $BB^T$. In the noiseless case, observe that the population covariance matrix of the samples takes the form $\Sigma = BSB^T$, where S is the covariance matrix of a $\mathrm{Dir}(\alpha)$ random variable on $\Delta^{K-1}$. By the standard properties of the $\mathrm{Dir}(\alpha)$ distribution, it can be seen that $S = \frac{1}{K(K\alpha + 1)} P$, where $P = I_K - \frac{1}{K}\mathbf{1}_K\mathbf{1}_K^T$ is the centering matrix. Hence, knowledge of Σ will be sufficient because the centered data points $\bar x$ fall in $\mathrm{span}(\Sigma) = \mathrm{span}(BPB^T)$: for each $(\theta, x)$ pair,

$$\bar x := \underbrace{B\theta}_{x} - \underbrace{\tfrac{1}{K}B\mathbf{1}_K}_{\mathbb{E}[x]} = B\theta - \tfrac{1}{K}B\mathbf{1}_K(\underbrace{\mathbf{1}_K^T\theta}_{=1}) = BP\theta := B\bar\theta. \qquad (2)$$

This suggests that the centroids of the CVT may be recovered by clustering the centered data points in the $\Sigma^\dagger$-norm. This insight is formalized by the following lemma.

Lemma 2. The centroids of the CVT of the simplex B under the $(BB^T)^\dagger$-norm are given by $\{c_k^* + c_0 \mid k = 1, \dots, K\}$, where $(c_1^*, \dots, c_K^*)$ solves the minimization

$$\min_{c_1, \dots, c_K,\, V_1, \dots, V_K} \sum_{k=1}^K \int_{\bar x \in BV_k} (\bar x - c_k)^T \Sigma^\dagger (\bar x - c_k)\,\rho(x)\,dx \qquad (3)$$

and $c_0 = \int x\,\rho(x)\,dx$ is the centroid of the simplex B.

Proof. We first show that (3) is equivalent to (unscaled) K-means clustering on $\Delta^{K-1}$. Note that $\Sigma = \delta BPB^T$ for some $\delta > 0$. Without loss of generality, we restrict the $c_k$'s to $\mathrm{span}\{BPB^T\}$ and write $c_k = BPv_k$ for $v_k \in \mathbb{R}^K$, $k = 1, \dots, K$. Recalling (2) and the fact that P is a projector,

$$\sum_{k=1}^K \int_{\bar x \in BV_k} (\bar x - c_k)^T \Sigma^\dagger (\bar x - c_k)\,\rho(x)\,dx
= \sum_{k=1}^K \int_{\theta \in V_k} (\bar\theta - v_k)^T P B^T \Sigma^\dagger B P (\bar\theta - v_k)\,\rho_\alpha(\theta)\,d\theta
= (1/\delta)\sum_{k=1}^K \int_{\theta \in V_k} (\bar\theta - v_k)^T P (\bar\theta - v_k)\,\rho_\alpha(\theta)\,d\theta
= (1/\delta)\sum_{k=1}^K \int_{\theta \in V_k} \|\bar\theta - P v_k\|_2^2\,\rho_\alpha(\theta)\,d\theta. \qquad (4)$$

Since θ is distributed according to the symmetric Dirichlet $\rho_\alpha = \mathrm{Dir}(\alpha)$ on $\Delta^{K-1}$, the last equality entails that the optimal $v_k$'s are the points which represent the barycentric coordinates of the centroids of the CVT of $\Delta^{K-1}$. Thus, the optimal solution $c_k = BPv_k$ represents the centroids of the CVT of the simplex B under the $(BB^T)^\dagger$-norm (in the coordinate system centered at $c_0$).

We proceed to apply the optimization (3) to empirical data to arrive at the Voronoi Latent Admixture (VLAD) algorithm, given in Algorithm 1. We utilize the singular value decomposition (SVD) of the centered data points to simplify computation. Let $\bar X \in \mathbb{R}^{n \times D}$ be the matrix whose rows are the centered data points and let $\bar X = U\Lambda W^T$ be its SVD. With Σ replaced by its empirical version $\Sigma_n = \frac{1}{n}\bar X^T \bar X$, each term in the objective of (3) is equivalent to

$$(\bar x_i - c_k)^T \Sigma_n^\dagger (\bar x_i - c_k) = n\,(u_i - \eta_k)^T \Lambda W^T W \Lambda^{-2} W^T W \Lambda (u_i - \eta_k) = n\,\|u_i - \eta_k\|_2^2,$$

where $\bar x_i = W\Lambda u_i$ and we set $c_k = W\Lambda\eta_k$.
Thus, instead of performing scaled K-means clustering in S, it suffices to perform standard K-means in the low, (K − 1)-dimensional space. This yields a significant computational speed-up. After applying VLAD, the weights $\theta_i$ can be obtained by projecting the data points onto B and computing the barycentric coordinates of the projected points.

Algorithm 1 Voronoi Latent Admixture (VLAD)
Input: data $x_1, \dots, x_n$; K; extension parameter γ.
Output: simplex vertices $\hat\beta_1, \dots, \hat\beta_K$.
1: $\hat c_0 \leftarrow \frac{1}{n}\sum_i x_i$ {find data center}
2: $\bar x_i \leftarrow x_i - \hat c_0$, $i = 1, \dots, n$ {centering}
3: compute the top K − 1 singular factors of the centered data matrix $\bar X \in \mathbb{R}^{n \times D}$: $\bar X = U\Lambda W^T$
4: $\eta_1, \dots, \eta_K \leftarrow \text{K-means}(u_1, \dots, u_n)$, where the $u_i$'s are the rows of $U \in \mathbb{R}^{n \times (K-1)}$
5: $\hat c_k \leftarrow W\Lambda\eta_k + \hat c_0$
6: $\hat\beta_k \leftarrow \hat c_0 + \gamma(\hat c_k - \hat c_0)$

It remains to estimate the extreme points $\beta_k$ given the CVT centroids $c_k$. This task is simplified by two observations. First, the CVT centroids reside on the line segments between the centroid of the simplex B and its extreme points, per Lemma 1. Thus we merely need to estimate the ratio of the distance from an extreme point to the centroid of B to the distance from the corresponding CVT centroid to the centroid of B. Due to the symmetry of the $\mathrm{Dir}_K(\alpha)$ distribution on $\Delta^{K-1}$, this ratio is identical for all extreme points; we refer to this ratio as the extension parameter γ. Second, γ does not depend on the geometry of B, only on that of the Dirichlet distribution. Thus, γ can be easily estimated by appealing to a Monte Carlo technique on $\mathrm{Dir}_K(\alpha)$. This subroutine is summarized in Algorithm 2, provided that α is given.

Algorithm 2 Evaluating the extension parameter
1: generate $\theta_1, \dots, \theta_m \sim \mathrm{Dir}_K(\alpha)$, where m is the number of Monte Carlo samples
2: $v_1, \dots, v_K \leftarrow \text{K-means}(\theta_1, \dots, \theta_m)$
3: $\gamma \leftarrow \sqrt{K^2 - K} \,\big/\, \sum_{l=1}^K \|v_l - \tfrac{1}{K}\mathbf{1}_K\|_2$

3.3. Estimating the Dirichlet Concentration Parameter

Next, we describe how to estimate the concentration parameter α from the data by employing a moment-based approach. Recall from the previous section that there is a one-to-one mapping between α and the extension parameter γ. For each α > 0, let γ(α) > 0 denote the corresponding extension parameter and $\hat B(\gamma) \in \mathbb{R}^{D \times K}$ the estimator of B output by VLAD with extension parameter γ. In the absence of noise, the covariance matrix of the DSN model has the form $BS(\alpha)B^T$, where $S(\alpha) \in \mathbb{R}^{K \times K}$ is the covariance matrix of a $\mathrm{Dir}(\alpha)$ random variable on $\Delta^{K-1}$. This suggests we estimate α by a generalized method of moments approach:

$$\hat\alpha = \arg\min_{\alpha > 0} \big\| \hat B(\gamma(\alpha))\, S(\alpha)\, \hat B(\gamma(\alpha))^T - \hat\Sigma \big\|, \qquad (5)$$

where $\hat\Sigma = \frac{1}{n}\bar X^T \bar X$ is the sample covariance matrix. We remark that there is no need to run VLAD multiple times to evaluate the objective in (5) at multiple α-values: after VLAD is run once, we may evaluate $\hat B(\gamma(\alpha))$ for any value of α by affinely transforming the output of VLAD. Further, (5) is a scalar optimization problem, so the computational cost of solving (5) is negligible.

In the presence of noise, the covariance matrix of the DSN model no longer has the form $BS(\alpha)B^T$. We need to add a correction term to ensure a consistent estimator of $BS(\alpha)B^T$. For example, if the noise is Gaussian, a consistent estimator of $BS(\alpha)B^T$ is $\tilde\Sigma = \hat\Sigma - \hat\sigma^2 I_D$, where $\hat\sigma^2$ is an estimate of the noise variance. In the Supplement, we give consistent estimators of $BS(\alpha)B^T$ for Multinomial and Poisson noise. With a good estimator $\tilde\Sigma$ of $BS(\alpha)B^T$ in place, we replace $\hat\Sigma$ in (5) by $\tilde\Sigma$ and then solve (5) to obtain an estimate of α.
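To make Algorithms 1 and 2 concrete, the following is a minimal sketch in Python, assuming the concentration parameter α is known. It relies on numpy and scikit-learn's KMeans; the helper names `extension_parameter` and `vlad` are illustrative choices for this sketch rather than the paper's reference implementation (available at the repository linked above).

```python
# A minimal sketch of Algorithms 1 and 2 under a known alpha; illustrative, not the reference code.
import numpy as np
from sklearn.cluster import KMeans

def extension_parameter(K, alpha, m=100_000, rng=None):
    """Algorithm 2 (sketch): Monte Carlo estimate of the extension parameter gamma."""
    rng = np.random.default_rng(rng)
    theta = rng.dirichlet(alpha * np.ones(K), size=m)            # samples on the standard simplex
    v = KMeans(n_clusters=K, n_init=10, random_state=0).fit(theta).cluster_centers_
    center = np.full(K, 1.0 / K)                                  # centroid of the standard simplex
    vertex_dist = np.sqrt((K - 1) / K)                            # ||e_k - center||, same for every vertex
    centroid_dist = np.linalg.norm(v - center, axis=1).mean()     # average CVT-centroid distance
    return vertex_dist / centroid_dist

def vlad(X, K, gamma):
    """Algorithm 1 (sketch): estimate the K vertices of the DSN simplex."""
    c0 = X.mean(axis=0)                                # step 1: data center
    Xc = X - c0                                        # step 2: centering
    U, S, Wt = np.linalg.svd(Xc, full_matrices=False)  # step 3: SVD of the centered data
    U, S, Wt = U[:, :K - 1], S[:K - 1], Wt[:K - 1]     # keep the top K-1 singular factors
    eta = KMeans(n_clusters=K, n_init=10, random_state=0).fit(U).cluster_centers_  # step 4
    c = (eta * S) @ Wt + c0                            # step 5: CVT centroids back in R^D
    return c0 + gamma * (c - c0)                       # step 6: extend rays through the centroids

# Usage with data X, K vertices and a known alpha:
# beta_hat = vlad(X, K=3, gamma=extension_parameter(K=3, alpha=2.5))
```

With the output of `vlad` for a single γ, the vertices for any other γ follow by the affine rescaling $\hat\beta_k = \hat c_0 + \gamma(\hat c_k - \hat c_0)$, which is what makes the α-estimation step of Section 3.3 a cheap scalar search.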
4. Consistency and Estimation Error Bounds

In this section we establish consistency properties and error bound guarantees for the VLAD procedure. For $c = (c_1, \dots, c_K) \in \mathbb{R}^{K \times D}$, define $\phi_A : \mathbb{R}^D \times \mathbb{R}^{K \times D} \to \mathbb{R}$ as $\phi_A(x, c) = \min_{k \in \{1, \dots, K\}} \|x - c_k\|_A^2$, where A is a positive semidefinite matrix. Recall that Σ is the covariance matrix of the data generating distribution and $\Sigma_n$ its empirical counterpart. In the algorithm, we work with the best rank-(K − 1) approximation of $\Sigma_n$, which we denote by $(\Sigma_n)_K$. Let Q denote the distribution of the $\mu_i$'s. Recall that $X_i \mid \mu_i \sim F(\cdot \mid \mu_i)$. Let $\tilde P$ be the induced distribution corresponding to $\tilde X_i$, the projection of $X_i$ onto the affine space of dimension K − 1 spanned by the top K − 1 eigenvectors of Σ. We also use $\tilde P_n$ to denote the empirical distribution of the data represented by the random variables $\tilde X_i$.

Since K-means clustering is a subroutine of our algorithm, we expect at least some condition requiring that the K-means clustering routine be well-behaved in some sense. To that end we need the following standard condition on the population K-means objective (cf. Pollard (1981)).

(a.1) Pollard's regularity criterion (PRC): The Hessian matrix of the function $c \mapsto Q\phi_{BSB^T}(\cdot, c)$, evaluated at every optimizer $c^*$ of $Q\phi_{BSB^T}(\cdot, c)$, is positive definite with minimum eigenvalue $\lambda_0 > 0$.

It turns out that this is all we need for the following theorem in the noiseless setting, where $\Sigma = BSB^T = (\Sigma)_K$ has rank K − 1, and so $\tilde P = Q$ and $\tilde X_i \overset{\mathcal{L}}{=} X_i$.

Theorem 1. Consider the noiseless setting, i.e., $F(\cdot \mid \mu) = \delta_\mu$. Suppose that $B = \mathrm{Conv}(\beta_1, \dots, \beta_K)$ is the true simplex, while $(\beta_{1n}, \dots, \beta_{Kn})$ are the vertex estimates obtained by the VLAD algorithm. Moreover, assume the error due to the Monte Carlo estimate of the extension parameter is negligible. Provided that condition (a.1) holds,

$$\min_\pi \big\| (\beta_{\pi(1)n}, \dots, \beta_{\pi(K)n}) - (\beta_1, \dots, \beta_K) \big\| = O_P(n^{-1/2}),$$

where the minimization is taken over all permutations π of {1, . . . , K}.

Note that the constant corresponding to the rate $O_P(n^{-1/2})$ depends on the Hessian matrix of the function $c \mapsto \tilde P\phi_\Sigma(\cdot, c)$. The proof of Theorem 1 is in the Supplement.

In general, $F(\cdot \mid \mu)$ is not degenerate. Due to the presence of "noise" in the (K − 1)-dimensional SVD subspace, the estimates of the CVT centroids may be inconsistent, which entails inconsistency of VLAD's estimate of B. The following theorem provides an error bound in the general setting. We need a strengthening of Pollard's Regularity Criterion. Let $(\Sigma)_K$ denote the best rank-(K − 1) approximation of Σ with respect to the Frobenius norm. Assume:

(a.2) The Hessian matrix of the function $c \mapsto \tilde P\phi_{(\Sigma)_K}(\cdot, c)$, evaluated at every optimizer $c^*$ of $\tilde P\phi_{(\Sigma)_K}(\cdot, c)$, is uniformly positive definite with minimum eigenvalue $\lambda_0 > 0$, for all $(\Sigma)_K$ such that $\Sigma - BSB^T \preceq \epsilon I_D$ for some ϵ > 0.

The noise level is formalized by the following conditions:

(b) There is $\epsilon_0 > 0$ such that $\epsilon_0 I_D - \mathrm{Cov}(X \mid \theta)$ is positive semi-definite uniformly over $\theta \in \Delta^{K-1}$.

(c) There exists $M_0$ such that for all $M > M_0$, $\int_{B(M, c_0)^c} \|x - c_0\|_2^2\, g(x)\,dx \leq \tfrac{k_1}{M}$ for some universal constant $k_1$, where $B(M, c_0)$ is the ball of radius M around the population centroid $c_0$ and $g(\cdot)$ is the density of $\tilde P$ with respect to the Lebesgue measure on the (K − 1)-dimensional space spanned by the top K − 1 eigenvectors of $BSB^T + \epsilon_0 I_D$.

Theorem 2. Suppose that $B = \mathrm{Conv}(\beta_1, \dots, \beta_K)$ is the simplex formed by the extreme points of the DSN. Let $(\beta_{1n}, \dots, \beta_{Kn})$ be the corresponding extreme point estimates obtained by the VLAD algorithm.
Assume the error in the Monte Carlo estimates of the extension parameter is negligible. Provided that (a.2), (b) and (c) hold,

$$\min_\pi \big\| (\beta_{\pi(1)n}, \dots, \beta_{\pi(K)n}) - (\beta_1, \dots, \beta_K) \big\|_2 = b_{\epsilon_0} + O_P(n^{-1/2}), \qquad (6)$$

where π ranges over permutations of {1, . . . , K} and $b_{\epsilon_0}$ is a bias term induced by the noise level $\epsilon_0$ of condition (b). The constant corresponding to the rate $O_P(n^{-1/2})$ in the above theorem depends on the Hessian matrix of the function $c \mapsto \tilde P\phi_\Sigma(\cdot, c)$. The constant corresponding to the bias term $b_{\epsilon_0}$ depends on the minimum and maximum eigenvalues of the matrix $BSB^T$.

The preceding results control the error incurred by the VLAD algorithm when the concentration parameter α is known. When α is unknown, our proposed solution in Section 3.3 performs well in both simulated and real-data experiments. We do not know in theory whether the concentration parameter α is identifiable; we present empirical results in the Supplement which suggest identifiability. Assuming a condition which guarantees model identifiability, we can establish that the estimate obtained by the VLAD algorithm via (5) is in fact consistent.

Theorem 3. Assume that the function $\phi(\tilde\alpha) = \frac{\gamma(\tilde\alpha)^2}{K(K\tilde\alpha + 1)}$ is monotonically increasing in $\tilde\alpha$, where $\gamma(\tilde\alpha)$ is the extension parameter corresponding to $\tilde\alpha$. Let $\alpha_0 \in C$ be the true concentration parameter for some compact set C. Let

$$\hat\alpha_n = \arg\min_{\alpha \in C} \big\| \hat B(\gamma(\alpha))\, S(\alpha)\, \hat B(\gamma(\alpha))^T - \tilde\Sigma_n \big\|,$$

where $\tilde\Sigma_n$ is a consistent estimator of $BS(\alpha_0)B^T$. Then

$$\hat\alpha_n \xrightarrow{P} \alpha_0. \qquad (7)$$

Table 1: Baselines and their required conditions (conjugacy, knowledge of the true α, separability).

| Method | Conjugacy | True α | Separability |
| --- | --- | --- | --- |
| VLAD (this work) | | | |
| VLAD-α (this work) | | ✓ | |
| Gibbs (2004) | ✓ | ✓ | |
| Stan-HMC (2017) | | ✓ | |
| SVI (2013) | ✓ | ✓ | |
| GDM (2016) | | | ✓ |
| Recover KL (2013) | | | ✓ |
| MVES (2009) | | | |
| Xray (2013) | | | ✓ |

5. Experiments

The goal of our experimental studies is to demonstrate the applicability and efficiency of our algorithm for a number of choices of the DSN probability kernel: Gaussian, Poisson and Multinomial (i.e., LDA). We summarize all competing estimation procedures in our comparative study and their corresponding underlying assumptions in Table 1. We remark that the Gibbs sampler (Griffiths & Steyvers, 2004), the Stan implementation of No-U-Turn HMC (Hoffman & Gelman, 2014; Carpenter et al., 2017) and Stochastic Variational Inference (SVI) (Hoffman et al., 2013) may be augmented with techniques such as empirical Bayes to estimate the hyperparameter α, although this may slow down convergence. We instead allow these baselines to use the true values of α in all simulated experiments, to their advantage. When the latent simplex has general (i.e., non-equilateral) geometry, GDM (Yurochkin & Nguyen, 2016) requires α → 0 to perform well, which is akin to separability. Not all baselines are suitable for all three probability kernels: the Gibbs sampler and SVI rely on (local) conjugacy and are only applicable in the LDA scenario; Recover KL (Arora et al., 2013) is an algorithm designed for topic models that relies on a separability condition (i.e., anchor words).

In simulated experiments we consider both VLAD with the concentration parameter α estimated following our results in Section 3.3, and VLAD trained with knowledge of the true data-generating α (VLAD-α). For the real data analysis, we estimate the concentration parameter by (5) and apply VLAD to a text corpus and a stock market data set.
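As a concrete illustration of the moment-matching step (5) used when α is estimated, the sketch below reuses the hypothetical `vlad` and `extension_parameter` helpers from the earlier sketch in Section 3. It assumes a noiseless setting (or a covariance estimate that has already been noise-corrected as described in Section 3.3), and the search bounds on α are an illustrative choice.

```python
# A sketch of the generalized method of moments estimate (5) of alpha.
# Reuses the hypothetical `vlad` and `extension_parameter` helpers sketched in Section 3.
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_alpha(X, K, alpha_bounds=(1e-2, 10.0)):
    c0 = X.mean(axis=0)
    Xc = X - c0
    Sigma_hat = Xc.T @ Xc / X.shape[0]            # sample covariance (noise-correct here if needed)
    C = vlad(X, K, gamma=1.0)                     # run VLAD once: gamma = 1 returns the CVT centroids
    P = np.eye(K) - np.ones((K, K)) / K           # centering matrix

    def objective(alpha):
        gamma = extension_parameter(K, alpha, m=20_000, rng=0)
        B_hat = c0 + gamma * (C - c0)             # affinely rescale the single VLAD run, shape (K, D)
        S = P / (K * (K * alpha + 1))             # covariance of Dir_K(alpha) weights
        return np.linalg.norm(B_hat.T @ S @ B_hat - Sigma_hat)   # Frobenius norm of the mismatch

    return minimize_scalar(objective, bounds=alpha_bounds, method="bounded").x
```

Since (5) is a one-dimensional problem, a bounded scalar search of this kind is cheap relative to the rest of the pipeline.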
Figure 2: Minimum matching distance for increasing sample size n: (a) Gaussian data, (b) Poisson data, (c) categorical data.

Figure 3: Minimum matching distance for varying DSN geometry ($c_{\min}$): (a) Gaussian data, (b) Poisson data, (c) categorical data.

5.1. Comparative Simulation Studies

Convergence behavior. We investigate the convergence of the estimates of the DSN extreme points for the three likelihood kernels as the sample size increases. The hyperparameter settings are D = 500, K = 10, α = 2 (for LDA, the vocabulary size is D = 2000). To ensure non-trivial geometry of the DSN, we rescale the extreme points towards their mean by uniform random factors between 0.5 and 1. We use the minimum matching distance, a metric previously studied in the context of polytope estimation (Nguyen, 2015), to compare the quality of the fitted DSN models returned by a variety of inference algorithms. We defer additional details to the Supplement. In Fig. 2 we see that VLAD and VLAD-α significantly outperform all baselines. Further, the estimation error decreases with increased sample size, verifying the statements of Theorems 2 and 3. We note that Stan HMC may also achieve good performance; however, it is very costly to fit (e.g., 40 HMC iterations for the Poisson case with n = 30000 took 14 hours, compared to 7 seconds for VLAD), therefore we had to restrict the number of iterations, which explains its wider error bars across experiments.

Geometry of the DSN. To study the role of the geometry of the DSN, we rescale the extreme points towards their mean by uniform random factors $c_k \sim \mathrm{Unif}(c_{\min}, 1)$ for $k = 1, \dots, K$ and vary $c_{\min}$ in Fig. 3 (smaller values imply more severe skewness of the latent simplex). To isolate the effect of the geometry of the DSN, we compare to GDM combined with knowledge of the true α and extension parameter estimation using Algorithm 2 (GDM-MC). If the underlying simplex is equilateral, GDM-MC will be equivalent to VLAD-α. In Fig. 3 we see that VLAD and VLAD-α are robust to varying skewness of the DSN. On the contrary, GDM-MC is accurate only when the latent simplex becomes closer to equilateral. This experiment verifies the geometric motivation of our work: in practice we cannot expect the latent geometric structure to be equilateral, and a geometrically robust method such as VLAD is more reliable.

Varying Dirichlet prior. To complete our simulation studies, we verify the α estimation procedure proposed in Section 3.3 and analyzed in Theorem 3. It is also interesting to compare the performance of the other baselines in the larger-α scenario, which is often overlooked in the literature. In Fig. 4 (and in the previous experiments) we see that the performance gap between VLAD and VLAD-α is very small, supporting the effectiveness of our α estimation procedure across probability kernels.
Additionally, we see that higher values of α lead to degraded performance for all considered methods; however, VLAD degrades more gracefully.

Figure 4: Minimum matching distance for increasing α: (a) Gaussian data, (b) Poisson data, (c) categorical data.

Table 2: NYT topic modeling (categorical data, left) and stock data factor analysis (continuous data, right).

| Method (NYT / Stock) | Perplexity | Coherence | Time | Frobenius norm | Volume | Time |
| --- | --- | --- | --- | --- | --- | --- |
| VLAD | 1767 | 0.86 | 6 min | 0.300 | 0.14 | 1 s |
| GDM | 1777 | 0.88 | 30 min | 0.294 | 1499 | 1 s |
| Gibbs / HMC | 1520 | 0.80 | 5.3 hours | 0.299 | 1.95 | 10 min |
| Recover KL / MVES | 2365 | 0.70 | 17 min | 0.287 | 5.39·10⁹ | 3 min |
| SVI / SPA | 1669 | 0.81 | 40 min | 0.392 | 3.31·10⁷ | 1 s |

5.2. Real Data Analysis

Topic modeling. We analyze a collection of news articles from the New York Times. After preprocessing, we have 5320 unique words and 100k training documents, with 25k held out for perplexity evaluation. We also compare the semantic coherence of the topics (Newman et al., 2010). In Table 2 (left) we present results for K = 80 topics. The Gibbs sampler has the best perplexity score, but it falls behind in topic coherence. VLAD estimated α = 0.05 and has approximately the same perplexity and coherence as GDM, while being 5 times faster. VLAD identified contextually meaningful topics, as can be seen from the good coherence score and by inspecting the topics: they cover a variety of concepts, from fishing and cooking to the Enron scandal and cancer. The top 20 words for each of the VLAD topics are provided along with the code.

Stock market analysis. We collect daily variations (closing minus opening price) for 3400 days and 55 companies. We train several algorithms on data from the first 3000 days and report the average distance between the data points from the last 400 days and the fitted simplices (i.e., the Frobenius norm). This metric alone might be misleading, since stretching any simplex will always reduce the score; therefore we also report the volumes of the corresponding simplices. Results are summarized in Table 2 (right): our method (estimated α = 0.05) achieves a comparable fit in terms of the Frobenius norm with a more compact simplex. Among the factors identified by VLAD, we notice a growth component related to banks (e.g., Bank of America, Wells Fargo). Another factor suggests that the performance of fuel companies like Valero Energy and Chevron is inversely related to the performance of defense contractors (Boeing, Raytheon).

6. Summary and Discussion

The Dirichlet Simplex Nest model generalizes a number of popular models in machine learning applications, including LDA and several variants of non-negative matrix factorization (NMF). We also develop an algorithm that exploits the geometry of the DSN to perform fast and accurate inference. We demonstrate the superior statistical and computational properties of the algorithm on several real datasets and verify its accuracy through simulations. One of the key distinctions between the DSN model and NMF models is that we replace the separability assumption with a Dirichlet prior on the weights. The main benefit of this approach is that it enables us to model data that does not contain archetypal points (Cutler & Breiman, 1994).
Among the limitations of our approach is its crucial reliance on the assumption that the Dirichlet distribution is symmetric on the standard probability simplex $\Delta^{K-1}$. In theory, the algorithm breaks down when the Dirichlet distribution is asymmetric. Surprisingly, in simulations at least, we found that VLAD seems quite robust in recovering the correct directions of the extreme points, even as most existing methods break down in such situations. These findings are reported in the Supplement.

Acknowledgements. Support provided by NSF grants DMS-1830247, CAREER DMS-1351362, CNS-1409303 and a Margaret and Herman Sokol Faculty Award is gratefully acknowledged.

References

Anandkumar, A., Foster, D. P., Hsu, D. J., Kakade, S. M., and Liu, Y.-K. A spectral algorithm for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, pp. 917–925, 2012.

Arora, S., Ge, R., Kannan, R., and Moitra, A. Computing a nonnegative matrix factorization provably. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing, pp. 145–162. ACM, 2012a.

Arora, S., Ge, R., and Moitra, A. Learning topic models going beyond SVD. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pp. 1–10. IEEE, 2012b.

Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, M. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the 30th International Conference on Machine Learning, pp. 280–288, 2013.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, March 2003.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

Cemgil, A. T. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, 2009, 2009.

Chan, T.-H., Chi, C.-Y., Huang, Y.-M., and Ma, W.-K. A convex analysis-based minimum-volume enclosing simplex algorithm for hyperspectral unmixing. IEEE Transactions on Signal Processing, 57(11):4418–4432, 2009.

Cutler, A. and Breiman, L. Archetypal analysis. Technometrics, 36(4):338–347, 1994.

Donoho, D. and Stodden, V. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in Neural Information Processing Systems, 2003.

Du, Q., Faber, V., and Gunzburger, M. Centroidal Voronoi Tessellations: Applications and algorithms. SIAM Review, 41(4):637–676, 1999.

Gillis, N. and Vavasis, S. Fast and robust recursive algorithms for separable nonnegative matrix factorization. arXiv preprint arXiv:1208.1237, 2012.

Griffiths, T. L. and Steyvers, M. Finding scientific topics. PNAS, 101(suppl. 1):5228–5235, 2004.

Hoffman, M. D. and Gelman, A. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, May 2013.

Huang, K., Fu, X., and Sidiropoulos, N. D. Anchor-Free Correlated Topic Modeling: Identifiability and Algorithm. arXiv:1611.05010 [cs, stat], November 2016.
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(1):430–474, 2017.

Kumar, A., Sindhwani, V., and Kambadur, P. Fast conical hull algorithms for near-separable non-negative matrix factorization. In International Conference on Machine Learning, pp. 231–239, 2013.

Lee, D. D. and Seung, H. S. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pp. 556–562, 2001.

Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

Newman, D., Lau, J. H., Grieser, K., and Baldwin, T. Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 100–108. Association for Computational Linguistics, 2010.

Nguyen, X. Posterior contraction of the population polytope in finite admixture models. Bernoulli, 21(1):618–646, 2015.

Paisley, J. W., Blei, D. M., and Jordan, M. I. Bayesian nonnegative matrix factorization with stochastic variational inference, 2014.

Pollard, D. Strong consistency of k-means clustering. The Annals of Statistics, 9(1):135–140, 1981.

Pritchard, J. K., Stephens, M., and Donnelly, P. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.

Schmidt, M. N., Winther, O., and Hansen, L. K. Bayesian non-negative matrix factorization. In International Conference on Independent Component Analysis and Signal Separation, pp. 540–547. Springer, 2009.

Tang, J., Meng, Z., Nguyen, X., Mei, Q., and Zhang, M. Understanding the limiting factors of topic modeling via posterior contraction analysis. In Proceedings of the 31st International Conference on Machine Learning, pp. 190–198, 2014.

Yurochkin, M. and Nguyen, X. Geometric Dirichlet Means Algorithm for topic inference. In Advances in Neural Information Processing Systems, pp. 2505–2513, 2016.

Yurochkin, M., Guha, A., and Nguyen, X. Conic Scan-and-Cover algorithms for nonparametric topic modeling. In Advances in Neural Information Processing Systems, pp. 3881–3890, 2017.