# Finding k in Latent k-Polytope

Chiranjib Bhattacharyya (Department of Computer Science and Automation, IISc Bangalore, India), Ravi Kannan (Microsoft Research India Lab, Bangalore, India), Amit Kumar (Department of Computer Science and Engineering, IIT Delhi, India). Correspondence to: Chiranjib Bhattacharyya.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

The recently introduced Latent k-Polytope (LkP) encompasses several stochastic Mixed Membership models, including Topic Models. The problem of finding k, the number of extreme points of the LkP, is a fundamental challenge and includes several important open problems, such as determining the number of components in ad-mixtures. This paper addresses this challenge by introducing the Interpolative Convex Rank (ICR) of a matrix, defined as the minimum number of its columns whose convex hull is within Hausdorff distance ε of the convex hull of all the columns. The first important contribution of this paper is to show that, under standard assumptions, k equals the ICR of a subset-smoothed data matrix defined from data generated by an LkP. The second important contribution is a polynomial-time algorithm for finding k under the standard assumptions. An immediate corollary is the first polynomial-time algorithm for finding the inner dimension in Non-negative Matrix Factorisation (NMF) under assumptions which are qualitatively different from existing ones such as Separability.

1. Introduction

Latent variable models have found a large number of applications in the real world. Such models specify a generative process between the observed variables and a set of underlying unobserved variables, often called latent variables. Examples of latent variable models include clustering (e.g. k-means), finite mixture models, topic models (LDA), and stochastic block models. In such models, learning the parameters of the generative process is often intractable and remains an active area of research.

(Bhattacharyya & Kannan, 2020) suggested the Latent k-Polytope (LkP), a unifying model which includes all of the above-mentioned instances of latent variable models as special cases. LkP defines a latent polytope K on k vertices, and data is generated by first picking latent points from this polytope, denoted by the columns of a matrix P, and then perturbing them in an adversarial manner to produce data points. The perturbations can be quite large in magnitude, and the data points can lie quite far from the actual polytope K. The problem of recovering (the vertices of) K from data generated in this manner is in general hard, unless we make assumptions about K and the data matrix A, each of whose columns represents one data point. They crystallized a set of assumptions, which will be referred to in the sequel as standard assumptions, and showed that the vertices of the polytope K can be provably recovered to an ε approximation from a data matrix A satisfying the assumptions. The algorithm applies to all the special cases and results in model estimation with guaranteed approximation from a finite sample of data. However, as is usual practice in latent variable models, the value of k is assumed known. For example, most algorithms for k-means or LDA with k topics require the value of k. In this paper we rely on the geometry of the problem to find k from the data matrix A.
Straightforward attempts, such as examining the rank of A, are fruitless because most of the points in A can be very far from the actual polytope K. We use two key insights here: (i) Suppose first that we have access to the latent points matrix P instead of A. In this case the convex hull of P, CH(P), would be a close approximation of K; still, it is not clear how to identify k from CH(P). (ii) Such a strategy is in any case not practical, as P is not available. However, the eigenstructure of A can provide important clues.

Contributions: This paper addresses some of these challenges; a summary of the contributions is listed below.

- The paper introduces the notion of Interpolative Convex Rank (ICR) of a matrix and shows that k = ICR of a subset-smoothed data matrix, where k is the number of vertices of the LkP (see Theorem 1). The notion of ICR should be of independent interest.
- The paper introduces new techniques based on the hyperplane separator theorem for proving lower bounds on the ICR of a matrix (see Section 3.4).
- Under standard assumptions, the paper gives a polynomial-time algorithm for finding the correct value of the number of vertices of the polytope K (see Theorem 18). We do this using the familiar singular value thresholding method, but from a different perspective than commonly used in numerical analysis.
- We show that Non-negative Matrix Factorization (NMF) is a special case of the Latent k-Polytope. In Theorem 20 we show that the algorithm recovers the correct inner dimension for NMF under standard assumptions.

2. Preliminaries and Problem Definition

Notation: For any natural number n we denote [n] = {1, . . . , n}. The points lie in the d-dimensional Euclidean space ℜ^d. For a matrix B and index ℓ, we use B_{·,ℓ} to denote the ℓth column of B. Given a subset S of columns of B, the notation B_{·,S} refers to (1/|S|) Σ_{ℓ∈S} B_{·,ℓ}. The spectral norm of B is denoted ||B||. The 2-norm (i.e., Euclidean norm) of a vector v is denoted |v|. The rth singular value of a matrix B is denoted s_r(B). The convex hull of a set of points x_i ∈ ℜ^d, i ∈ [m], is CH({x_1, . . . , x_m}) = { Σ_{i=1}^m α_i x_i : α_i ≥ 0, Σ_{i=1}^m α_i = 1 }. We abuse notation and use CH(B), where B is a matrix, to denote the convex hull of the columns of B. An extreme point of a convex set C is any point x ∈ C that cannot be expressed as x = βx_1 + (1 − β)x_2 with 0 < β < 1, x_1 ≠ x_2, x_1, x_2 ∈ C.

Definition 1. Given two sets S, T ⊆ ℜ^d, the Hausdorff distance dist(S, T) between them is defined as dist(S, T) := max{ max_{x∈S} d(x, T), max_{y∈T} d(y, S) }, where d(z, A), for a point z and a set A ⊆ ℜ^d, is defined as min_{t∈A} |z − t|.

2.1. Review of LkP and Problem Statement

In this subsection we introduce the relevant definitions and present the problem statement.

2.1.1. Review of the Latent k-Polytope

We briefly review the Latent k-Polytope introduced in (Bhattacharyya & Kannan, 2020).

Definition 2. Let the columns of M ∈ ℜ^{d×k} be the extreme points of the Latent k-Polytope, K = CH(M). The data matrix A ∈ ℜ^{d×n} is said to be generated from an LkP with parameters M and σ > 0 if there exists a d × n matrix P, whose columns lie in K, such that σ = ||A − P||/√n. For each j ∈ [n], the column A_{·,j} of the data matrix is viewed as a perturbation of the latent point P_{·,j}. The parameter σ, referred to in the sequel as the perturbation parameter, can be thought of as a measure of the maximum mean squared perturbation in any direction.
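Definition 2 is easy to simulate, which is useful for sanity-checking the results that follow. The short Python sketch below is our own illustration, not part of the paper: the dimensions, the Dirichlet mixing weights and the Gaussian perturbation are arbitrary choices, and the definition itself allows adversarial rather than random perturbations, subject only to the spectral-norm bound.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 50, 5, 2000

# Vertices of the latent polytope K (columns of M), chosen arbitrarily here.
M = rng.random((d, k))

# Latent points: each column of P is a convex combination of the columns of M.
W = rng.dirichlet(alpha=np.ones(k) * 0.1, size=n).T   # k x n, columns sum to 1
P = M @ W

# Perturbation: Definition 2 only constrains the spectral norm ||A - P||,
# so individual columns of A may land far outside K.
E = rng.normal(size=(d, n))
A = P + 0.05 * E

# The perturbation parameter of Definition 2.
sigma = np.linalg.norm(A - P, ord=2) / np.sqrt(n)
print(f"sigma = {sigma:.4f}")
```

With the small scaling used above, σ should come out much smaller than the size of K, which is the regime the standard assumptions below are meant to capture.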
In previous work (Bhattacharyya & Kannan, 2020) the problem of learning M from A was addressed. In general this is an intractable problem, but under certain standard assumptions, stated in the next subsection, a polynomial-time algorithm was given for recovering M to an additive error. The dimension of M, i.e. the value of k, was assumed to be known.

2.1.2. Standard Assumptions

(Bhattacharyya & Kannan, 2020) defined a set of assumptions on LkP which will be referred to in the sequel as Standard Assumptions. An LkP model with parameters M, σ, and data matrix A is said to satisfy the standard assumptions if the following hold.

- Separation: Every vertex of K is far from the affine or the convex hull of the remaining vertices of K.
- Proximity: There is a parameter δ ∈ (0, 1) such that for every vertex of K there are at least δn points (columns of A) whose corresponding latent points (columns of P) are close to the vertex.
- Spectrally Bounded Perturbations: The parameter σ (= ||A − P||/√n) is much smaller than the size of K.

The first assumption is on the model M, the second is on the observed data A, and the last is on the perturbation parameter σ. In the sequel these assumptions will be stated more precisely. We prove that the standard assumptions (see Theorem 1) imply that there are k subsets of data, each of cardinality δn, whose averages nearly contain in their convex hull the averages of all δn-sized subsets of data. This observation will be leveraged to give a data-driven characterization of k. It is to be noted that the oft-used notion of separability in Non-negative Matrix Factorization (see (Gillis & Luce, 2018)) assumes that there are k actual data points whose hull nearly contains all data points. This is restrictive and does not hold even for simple cases of LkP.

2.1.3. Problem Definition

The objective of the paper is to address the problem of finding k under the above-mentioned assumptions.

Definition 3. Given A generated from an LkP and δ, find the number of extreme points of K.

Note that σ, k, P and M are not given as input parameters; only the data A is available. In (Bhattacharyya & Kannan, 2020) the emphasis was on computing M given k. In contrast, this paper addresses the problem of finding k from A under the standard assumptions.

2.2. Non-negative Matrix Factorization (NMF)

LkP is an interesting geometric alternative to several unsupervised learning paradigms such as Topic Models, Clustering and Community Detection (Bhattacharyya & Kannan, 2020). An important aim of this paper is to explore NMF as a special case of LkP. Over the last two decades NMF has attracted significant research interest. We do not review the vast literature on NMF here but point the reader to the many readable surveys (e.g. (Gillis, 2014)). However, to aid further discussion, the definition of NMF is briefly reviewed. Exact NMF seeks a factorization of the data matrix A,

A_{d×n} = M_{d×k} W_{k×n},

where n is the number of data points, d is the ambient data dimension, and all entries of M, W, A are assumed to be non-negative. Each column of M represents a feature and each column of W denotes the corresponding weights on the individual features. The parameter k is often called the inner dimension (Vavasis, 2009). An exact NMF may not exist, and the bulk of the research in NMF has been directed towards understanding the case of approximate NMF, defined as follows:

A_{d×n} = M_{d×k} W_{k×n} + N.   (Approx-NMF)

Here again W, M are assumed to have non-negative entries.
Each column of N can be thought of as a perturbation or noise. Approximate NMF is a special case of the Latent k-Polytope with the requirement that each column of W sums to 1: each data point A_{·,j} is a perturbation of the point P_{·,j} = Σ_{l=1}^{k} W_{l,j} M_{·,l} lying in the polytope K = CH(M), and the parameter σ depends on the norm of the matrix N. It is straightforward to see that if we are given a data matrix A which satisfies the standard assumptions, together with the inner dimension k, the algorithm of (Bhattacharyya & Kannan, 2020) directly applies and one can recover the features. Our goal in this paper is to recover the inner dimension for NMF.

2.3. Related Literature: Finding the Number of Vertices of a Latent Polytope

To the best of our knowledge the problem of finding the number of vertices of the latent polytope has not been studied. The problem of enumerating all the vertices of a polytope described by linear inequalities is known to be hard (Khachiyan et al., 2008). The problem described here is more complicated still, as we have access only to some perturbed points. Since LkP is a geometric construct, it may be difficult to adopt probabilistic approaches such as the Hierarchical Dirichlet Process (Teh et al., 2006), which has shown empirical success in many applications; to the best of our understanding, however, it does not provide a guarantee of finding the correct k, the number of extreme points of K, from a finite number of samples.

3. Finding the Number of Vertices of LkP from Data: An Interpolative Approach

Finding the number of vertices of K, the Latent k-Polytope of Definition 2, is in general a hard problem. Many open problems of active research, such as finding the number of topics in a topic model, are important special cases of this problem. The challenges in determining the number of vertices can be distilled into the following two issues. First, one needs to establish a suitable quantity which relates the observed data A to the columns of M. Existing attempts in special cases (e.g. the use of Hierarchical Dirichlet processes for topic models (Teh et al., 2006)) do not translate into a finite-time algorithm which guarantees recovery of the correct number of vertices. A well-formulated quantity which can yield tractable algorithmic insights under realistic assumptions is an important aim. Second, even if such a quantity is devised, the overall goal of finding a tractable algorithm is a harder challenge. In this paper we address both issues by appealing to convex geometry: we conceptualize a new notion of matrix rank (see Subsection 3.1), following which we develop a polynomial-time algorithm (see Subsection 3.2).

3.1. Interpolative Convex Rank (ICR) of a Matrix

In this section the concept of Interpolative Convex Rank (ICR) of a matrix is introduced.

Definition 4. Given a parameter ε > 0, the Interpolative Convex Rank (ICR) of a matrix C ∈ ℜ^{d×m}, denoted ICR(ε, C), is the minimum number l such that there exist columns C_{·,i_1}, . . . , C_{·,i_l} of C with dist(CH(C), CH({C_{·,i_1}, . . . , C_{·,i_l}})) ≤ ε.

In words, it is the minimum number of columns of C whose convex hull is ε-close to the convex hull of all the columns of C. A geometric insight behind the definition is that if ε = 0 then the ICR of C equals the number of extreme points of CH(C). For any nonzero ε, ICR(ε, C) is the smallest number of vertices, drawn from the columns of C, whose convex hull approximates the convex hull of all the columns of C with error at most ε.
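For small matrices, Definition 4 can be evaluated directly by brute force. The sketch below is our illustration only; the function names are ours, and the subset enumeration is exponential, so this is not the polynomial-time algorithm of Section 4. It computes point-to-hull distances with a small constrained least-squares problem and uses them for the Hausdorff distance of Definition 1.

```python
from itertools import combinations
import numpy as np
from scipy.optimize import minimize

def dist_point_to_hull(x, T):
    """Euclidean distance from point x to CH(columns of T): solves the small
    constrained problem  min ||x - T @ lam||  over  lam >= 0, sum(lam) = 1."""
    m = T.shape[1]
    lam0 = np.full(m, 1.0 / m)
    res = minimize(lambda lam: np.sum((x - T @ lam) ** 2),
                   lam0,
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}])
    return float(np.sqrt(max(res.fun, 0.0)))

def hausdorff_hulls(S, T):
    """dist(CH(S), CH(T)) in the sense of Definition 1; since d(., CH(T)) is a convex
    function, the maxima are attained at vertices, so checking columns suffices."""
    d1 = max(dist_point_to_hull(S[:, i], T) for i in range(S.shape[1]))
    d2 = max(dist_point_to_hull(T[:, j], S) for j in range(T.shape[1]))
    return max(d1, d2)

def icr(eps, C):
    """Brute-force ICR(eps, C) per Definition 4: the smallest l such that some l
    columns of C are within Hausdorff distance eps of CH(C)."""
    m = C.shape[1]
    for l in range(1, m + 1):
        for idx in combinations(range(m), l):
            if hausdorff_hulls(C, C[:, list(idx)]) <= eps:
                return l
    return m
```

On data from the generation sketch above, icr(eps, A) for a small eps is typically much larger than k, since CH(A) can be much larger than K; this is why Section 3.2 works with a subset-smoothed matrix instead.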
Such approximations have been studied with different notions of distance in Convex Geometry (Barvinok, 2014; Gruber, 1993).

One of the goals of this paper is to relate NMF to LkP. We argue that Non-Negative Rank (NNR), an often-used measure of the number of factors, may not be well suited and that ICR is a better alternative. To support this claim we first recall the definition of NNR (Vavasis, 2009); we define it through the inner dimension.

Definition 5. The non-negative rank (NNR) of a matrix is the minimum inner dimension of a non-negative factorization of the matrix.

There is no known provable method for identifying the correct inner dimension; it is mostly determined by heuristics (see Section 3 in (Gillis, 2014)). It is folklore that (under some conditions) NNR reflects, or is equal to, the correct inner dimension. However, in many cases NNR can be very different from the inner dimension. In the Supplement this is illustrated with several examples where ICR recovers the true inner dimension while NNR can be much lower. The intuitive reason for the contrast between NNR and ICR in these examples is that the interpolative property forces ICR to restrict attention to the columns of the given data matrix, whereas NNR has no such restriction and hence can choose suitable but arbitrary points outside the data matrix to yield a low value. The use of an interpolative assumption for NMF is not new and is mentioned in (Recht et al., 2012). However, those algorithms assume that the value of the inner dimension is known; using ICR to characterize the inner dimension is new.

3.2. Recovering the Number of Vertices of LkP from ICR

In this section we prove that the parameter k can be recovered from the data matrix A as the ICR of a suitably smoothed data matrix Ã obtained from A. As pointed out earlier, the raw data matrix A does not yield k, since CH(A) can be much larger than K. We now define the subset-smoothed version of A.

Definition 6. (Subset-smoothed data matrix) Let R_δ denote the collection of all subsets R ⊆ [n] with |R| = δn. The subset-smoothed data matrix Ã has |R_δ| columns, indexed by the elements of R_δ. For each R ∈ R_δ, the column Ã_{·,R} of Ã is the same as A_{·,R}, i.e., Ã_{·,R} = (1/|R|) Σ_{j∈R} A_{·,j}.

We now state the main result of this section.

Theorem 1. Assume that the input data matrix A is generated by an LkP (as in Definition 2). Let ρ > 0, δ ∈ (0, 1) and suppose the following three assumptions are satisfied:

1. Every vertex M_{·,ℓ} of K satisfies d(M_{·,ℓ}, CH({M_{·,ℓ'} : ℓ' ≠ ℓ})) ≥ 7ρ.   (1)
2. For every ℓ ∈ [k], there exists a subset S_ℓ ⊆ [n] of size δn such that |M_{·,ℓ} − P_{·,S_ℓ}| ≤ ρ/30.   (2)
3. The perturbation parameter σ is small enough that 2σ/√δ + ρ/30 ≤ ρ/3.   (3)

Then ICR(ρ/3, Ã) = k.

Proof. The proof consists of computing an upper bound (see Lemma 3) and a lower bound (see Corollary 8).
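Before turning to the formal proof, the upper-bound direction of Theorem 1 can be checked numerically on the synthetic instance generated earlier. The sketch below is illustrative only: it reuses A, P, M, n, k and rng from the generation sketch and dist_point_to_hull from the ICR sketch, and it peeks at P and M to build the subsets S_ℓ, which no algorithm in this paper does.

```python
# Monte Carlo check of the upper-bound direction of Theorem 1.
delta = 0.05
s = int(delta * n)

# S_ell: the delta*n columns whose latent points are closest to vertex M[:, ell].
anchors = []
for ell in range(k):
    closest = np.argsort(np.linalg.norm(P - M[:, [ell]], axis=0))[:s]
    anchors.append(A[:, closest].mean(axis=1))
anchors = np.column_stack(anchors)          # the k points A_{.,S_ell}

# Columns of the subset-smoothed matrix for a few random subsets R in R_delta.
for _ in range(20):
    R = rng.choice(n, size=s, replace=False)
    col = A[:, R].mean(axis=1)              # the column A_{.,R} of the smoothed matrix
    print(round(dist_point_to_hull(col, anchors), 4))
```

The printed distances should be small, reflecting that the hull of the k anchor averages nearly contains the averages of all δn-sized subsets.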
Remarks: Before proceeding to the formal details we make some remarks to convey the main intuitions. The assumptions above are re-statements of the three standard assumptions of Subsection 2.1.2. We note that the first condition (1) is much weaker than requiring that M_{·,ℓ} be far from the affine hull of the other columns of M, the assumption made in (Bhattacharyya & Kannan, 2020).

We now give some intuition behind the proof, which proceeds in two parts. In the upper bound part, we show that ICR(ρ/3, Ã) is at most k (see Lemma 3). The idea is as follows. For R ∈ R_δ, A_{·,R} and P_{·,R} are close to each other (this is where the subset-smoothing operation helps). Now, P_{·,R} lies inside K, and so can be expressed as a convex combination of the columns M_{·,ℓ} of M, each of which is close to the corresponding point A_{·,S_ℓ}, where S_ℓ is the subset defined in (2). It follows that A_{·,R} is close to the convex hull of the points {A_{·,S_ℓ} : ℓ ∈ [k]}. This exhibits a convex polytope on k vertices which is close to CH(Ã), proving the upper bound.

The lower bound part of the proof (see Corollary 8) is harder and involves new tools from Convex Geometry. Suppose there is a subset of points W whose convex hull is close to CH(Ã). It is not hard to show that the convex hull of W is also close to CH(M) (again using subset smoothing). We then show that we can partition (a slightly enlarged version of) CH(M) into k parts by cutting it with suitable hyperplanes obtained from the Separating Hyperplane Theorem, and that W must have a non-empty intersection with each of these parts. This shows that |W| must be at least k.

3.3. Upper Bound Proof

Let H denote CH(Ã) and let H' denote the convex hull of {A_{·,S_ℓ} : ℓ ∈ [k]}, where the S_ℓ ∈ R_δ are as defined in (2). We show that H is within Hausdorff distance ρ/3 of H'. Since H' ⊆ H, it suffices to show that for every point in H there is a point in H' within distance at most ρ/3. H is a convex set and d(x, H') is a convex function of x, so the maximum of d(x, H') over H is attained at a vertex of H. Hence it suffices to show that for each of the points A_{·,R}, R ∈ R_δ, we have d(A_{·,R}, H') ≤ ρ/3.

The proof proceeds by first showing that for any R ∈ R_δ, the vector A_{·,R} is close to P_{·,R}, in fact within distance at most σ/√δ. Now, any vector of the form P_{·,R} can be written as a convex combination of the vertices of K, i.e., of the vectors M_{·,ℓ}, ℓ ∈ [k]. The same argument also shows that each of the vectors M_{·,ℓ} is close to the corresponding vector A_{·,S_ℓ}. Thus A_{·,R} is close to a convex combination of the vectors A_{·,S_ℓ}, ℓ ∈ [k]. We now prove these claims formally.

Claim 2. For all S ⊆ [n] with |S| ≥ δn, |A_{·,S} − P_{·,S}| ≤ σ/√δ.

Proof. We first note that |A_{·,S} − P_{·,S}| ≤ ||A − P||/√|S|. Indeed, let u be the vector which is 1/√|S| on the coordinates corresponding to S and 0 otherwise; u is a unit vector and |A_{·,S} − P_{·,S}| = (1/√|S|) |(A − P)u| ≤ ||A − P||/√|S|. The claim now follows from the definition of σ, i.e., ||A − P|| = σ√n, together with |S| ≥ δn.

We are now ready to show the upper bound result.

Lemma 3. Under the conditions of Theorem 1, ICR(ρ/3, Ã) is at most k.

Proof. Fix an R ∈ R_δ. Since P_{·,R} ∈ K, it can be written as a convex combination Σ_{ℓ∈[k]} λ_ℓ M_{·,ℓ} of the vertices of K. Then we have

|A_{·,R} − Σ_{ℓ∈[k]} λ_ℓ A_{·,S_ℓ}| ≤ |A_{·,R} − P_{·,R}| + Σ_{ℓ∈[k]} λ_ℓ |M_{·,ℓ} − P_{·,S_ℓ}| + Σ_{ℓ∈[k]} λ_ℓ |P_{·,S_ℓ} − A_{·,S_ℓ}| ≤ 2σ/√δ + ρ/30 ≤ ρ/3,

since the first and third terms in the right-hand side of the first inequality are each at most σ/√δ by Claim 2, the second term is at most ρ/30 by (2), and the final inequality is assumption (3). This shows that ICR(ρ/3, Ã) is at most k.

Figure 1. Illustration of the proof of Theorem 5: the matrix R has four points. The lightly shaded half-space around R_{·,ℓ} denotes V_ℓ, whereas the darker shaded region around the same vertex denotes Q_ℓ.
3.4. Lower Bound Proof

We now prove the more difficult part of Theorem 1, namely that ICR(ρ/3, Ã) cannot be strictly less than k. We prove a stronger assertion: there is no set of k − 1 points in ℜ^d whose convex hull is within distance ρ/3 of H, where H = CH(Ã). Assume, for the sake of contradiction, that there exist points w_1, . . . , w_{k−1} ∈ ℜ^d such that

dist(H, CH({w_1, . . . , w_{k−1}})) ≤ ρ/3.   (4)

Recall that K = CH(M). Let W_0 denote CH({w_1, . . . , w_{k−1}}). We begin by showing that dist(K, W_0) is at most ρ. The intuition is that the Hausdorff distance between H and K is small; the argument follows along the same lines as in the upper bound proof. It then follows from (4) that dist(K, W_0) is also small. We give the complete proof of the following result in the supplementary material.

Claim 4. Let W_0 denote the convex hull of the points w_1, . . . , w_{k−1}. Then dist(K, W_0) ≤ ρ.

The desired contradiction follows easily from the above claim and the following result:

Theorem 5. Let W be a set of points such that the Hausdorff distance between the convex hull of W and that of the columns of M is at most ρ, i.e., dist(CH(W), CH(M)) ≤ ρ. Then |W| ≥ k.

The intuition behind the proof is as follows: each vertex M_{·,ℓ} of CH(M) is far from the convex hull formed by the rest of the columns of M, and hence we can find a separating hyperplane which separates M_{·,ℓ} from CH(M_{−ℓ}) by a large enough margin (see the region V_ℓ in Figure 1). We define a further refinement of this region, shown as Q_ℓ in Figure 1. The proof consists of two conceptual steps: (i) we show that the regions Q_ℓ are disjoint for distinct columns M_{·,ℓ} of M, and (ii) we argue that each of the regions Q_ℓ must contain at least one point of W. Since there are k columns in M, this implies that |W| must be at least k.

Proof. For two subsets A and B of points in ℜ^d, define their Minkowski sum A + B as {a + b : a ∈ A, b ∈ B}. For an index ℓ ∈ [k], let M_{−ℓ} denote the matrix M with column ℓ removed. Since the columns of M satisfy condition (1), we have (here B denotes the unit ball)

[M_{·,ℓ} + 7ρB] ∩ CH(M_{−ℓ}) = ∅.

Therefore the Separating Hyperplane Theorem from Convex Geometry implies that there is a unit vector v(ℓ) such that for all ℓ' ≠ ℓ,

v(ℓ) · M_{·,ℓ} > v(ℓ) · M_{·,ℓ'} + 7ρ.   (5)

For an index ℓ ∈ [k], let V_ℓ denote the set V_ℓ := {x : v(ℓ) · x > v(ℓ) · M_{·,ℓ} − 2ρ}, and define Q_ℓ := (CH(M) + ρB) ∩ V_ℓ.

Claim 6. For every ℓ ∈ [k], there is a point w ∈ W such that w ∈ Q_ℓ.

Proof. Assume for the sake of contradiction that there is an index ℓ ∈ [k] such that Q_ℓ ∩ W = ∅. Since dist(CH(W), CH(M)) ≤ ρ, we have W ⊆ CH(M) + ρB, and so it must be the case that V_ℓ ∩ W = ∅. The definition of V_ℓ implies that for every w ∈ W, v(ℓ) · w ≤ v(ℓ) · M_{·,ℓ} − 2ρ. Consequently, for every point y ∈ CH(W),

v(ℓ) · y ≤ v(ℓ) · M_{·,ℓ} − 2ρ.   (6)

Since dist(CH(M), CH(W)) ≤ ρ, there is a point y ∈ CH(W) with |y − M_{·,ℓ}| ≤ ρ, and hence v(ℓ) · y ≥ v(ℓ) · M_{·,ℓ} − ρ. But this contradicts (6), which proves the claim.

So for each ℓ ∈ [k], Q_ℓ ∩ W is non-empty. We now show that the sets Q_ℓ are disjoint, which yields the desired result.

Lemma 7. The sets Q_ℓ, ℓ ∈ [k], are mutually disjoint.

Proof. Suppose, for the sake of contradiction, that there is a point z ∈ Q_ℓ ∩ Q_{ℓ'} for distinct indices ℓ, ℓ'. Since z ∈ CH(M) + ρB, there is a point y ∈ CH(M) with |y − z| ≤ ρ. So y can be written as a convex combination Σ_{ℓ''∈[k]} α_{ℓ''} M_{·,ℓ''}. We also write y as

y = α_ℓ M_{·,ℓ} + (1 − α_ℓ) x, where x ∈ CH({M_{·,ℓ''} : ℓ'' ≠ ℓ}).   (7)

Inequality (5) implies that v(ℓ) · x < v(ℓ) · M_{·,ℓ} − 7ρ, and so v(ℓ) · y < v(ℓ) · M_{·,ℓ} − (1 − α_ℓ) 7ρ. Since v(ℓ) is a unit vector and |y − z| ≤ ρ,

v(ℓ) · z ≤ v(ℓ) · M_{·,ℓ} − (1 − α_ℓ) 7ρ + ρ.
But since z ∈ V_ℓ, the definition of V_ℓ gives v(ℓ) · z > v(ℓ) · M_{·,ℓ} − 2ρ. The last two inequalities yield (1 − α_ℓ) 7ρ < 3ρ, i.e., α_ℓ > 4/7 > 0.5. A similar argument shows that α_{ℓ'} > 0.5. But this is not possible because the quantities α_{ℓ''} are non-negative and add to 1. This proves the result.

Combining Lemma 7 and Claim 6, we see that |W| ≥ k. This proves the desired theorem.

As a corollary, we get the lower bound result:

Corollary 8. Under the assumptions of Theorem 1, ICR(ρ/3, Ã) ≥ k.

Proof. Combining Theorem 5 and Claim 4, we see that if W is any set of points with dist(CH(Ã), CH(W)) ≤ ρ/3, then |W| ≥ k. By definition, if ICR(ρ/3, Ã) = k', then there is a set of k' points whose convex hull is within Hausdorff distance ρ/3 of CH(Ã). Therefore k' ≥ k.

It is worth noting that the matrix Ã has an exponential (in n) number of columns, and so cannot be used directly by an efficient algorithm. We have given an algebraic characterization of the number of vertices of K in terms of ICR, but no polynomial-time algorithm is known for computing the ICR. In the next section we give a polynomial-time algorithm for finding the number of vertices of K.

4. Algorithm for Finding the Number of Vertices of the Latent Polytope

In this section we consider a slightly stronger set of standard assumptions than those in Theorem 1. We prove that if the input points satisfy these conditions, then we can recover the correct value of k in polynomial time. In fact, we show that there is a polynomial-time computable threshold τ such that k = max{r : s_r(A) ≥ τ}. Singular value thresholding is a well-known procedure which yields provable guarantees in many applications (Cai et al., 2010; Chatterjee, 2015; Donoho & Gavish, 2014). However, the techniques considered in the literature do not apply to the problem at hand, and hence new methods are warranted. The main difficulty in choosing such a threshold τ is that it cannot be kept at some suitable multiple of the largest singular value of the data matrix A: the matrix A may be ill-conditioned, and so the largest singular value may not give us any information about the parameter k. Instead, we write down a convex program whose output is used for deciding this threshold τ. Once we have the correct value of k, we can use ideas based on ICR to recover the actual vertices of M (see for example Algorithm 2 and Theorem 20).

Algorithm 1: Algorithm for finding the number of extreme points of K.
Input: d × n data matrix A and a parameter δ ∈ (0, 1).
1. Let opt be the optimal value of the convex program (11)-(12).
2. Let k* be the maximum value of k' ∈ [d] such that s_{k'}(A)/√n ≥ δ²·opt/8.
3. Output k*.

We now state the main result of this section. First, some definitions: for a subspace V and a point x, let proj(x, V) denote the orthogonal projection of x on V; for a subset S of points, let Null(S) denote the subspace orthogonal to the linear span of S.

Theorem 9. Suppose that the input data satisfies the following assumptions: there exists a δ ∈ (0, 1) such that

1. Every vertex M_{·,ℓ} of K satisfies |proj(M_{·,ℓ}, Null(M \ M_{·,ℓ}))| ≥ δ |M_{·,ℓ}|.   (8)
   (The left-hand side is the same as the distance of M_{·,ℓ} from the affine hull of the other columns of M.)
2. For S_ℓ := {j : |M_{·,ℓ} − P_{·,j}| ≤ 4σ/√δ}, we have |S_ℓ| ≥ δn.   (9)
3. σ ≤ δ³ min_ℓ |M_{·,ℓ}| / 20.   (10)

Further, assume that all the entries of M are non-negative. Then, given δ and the data matrix A, Algorithm 1 outputs the correct value of k.

Note that the left-hand side of condition (8) is smaller than the left-hand side of the corresponding condition (1).
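A compact Python sketch of Algorithm 1 is given below. This is our reading of the algorithm, not the authors' code: the convex program it solves is written out as (11)-(12) in the next subsection, and scipy's general-purpose SLSQP solver stands in for a dedicated convex solver, which is an implementation choice of ours.

```python
import numpy as np
from scipy.optimize import minimize

def find_k(A, delta):
    """Sketch of Algorithm 1 (constants as in Theorem 14).
    Solves  min |A x|  s.t.  sum_j x_j = 1, 0 <= x_j <= 1/(delta n),
    then thresholds the singular values of A at sqrt(n) * delta^2 * opt / 8."""
    d, n = A.shape
    ub = 1.0 / (delta * n)

    x0 = np.full(n, 1.0 / n)                      # feasible starting point
    res = minimize(lambda x: np.linalg.norm(A @ x),
                   x0,
                   bounds=[(0.0, ub)] * n,
                   constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}])
    opt = res.fun

    s = np.linalg.svd(A, compute_uv=False)        # singular values, descending
    tau = delta ** 2 * opt / 8.0
    return int(np.sum(s / np.sqrt(n) >= tau)), opt
```

On the synthetic instance from Section 2, find_k(A, 0.05) should return the planted value of k; for large n a specialised convex solver would be preferable to SLSQP.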
The following result is a corollary of Theorem 9; it shows that the ICR of the subset-smoothed data matrix Ã is equal to the number of vertices k of K. Details are given in the supplementary material.

Lemma 10. Suppose the conditions (8), (9) and (10) stated in Theorem 9 are satisfied by the input data. Then ICR(6σ/√δ, Ã) = k.

4.1. Proof of Theorem 9

We now proceed to prove Theorem 9. There are two phases in Algorithm 1. In the first phase, we estimate min_ℓ |M_{·,ℓ}| by writing a suitable convex program. We then use this estimate as a threshold for the singular values of A to determine the value of k. The convex program is as follows (recall that A is the data matrix generated from the LkP of Definition 2):

opt := min_x |Ax|   (11)
subject to Σ_{j∈[n]} x_j = 1, 0 ≤ x_j ≤ 1/(δn) for all j ∈ [n].   (12)

We now show that the optimal value of the convex program approximates min_ℓ |M_{·,ℓ}| up to a factor depending only on k.

Lemma 11. Let ρ_0 := min_ℓ |M_{·,ℓ}|, and let opt be defined as in (11). Then ρ_0/(2k) ≤ opt ≤ 2ρ_0.

Proof. The proof consists of two subclaims, each of which proves one of the above inequalities. Let x* denote the vector which satisfies opt = |Ax*|.

Claim 12. |Ax*| ≥ ρ_0/(2k).

Proof. The vector Px* is a convex combination of the columns of P. Each column of P can be expressed as a convex combination of the columns of M. It follows that Px* is a convex combination of the columns of M, i.e., it is of the form Σ_{j∈[k]} α_j M_{·,j}, where the α_j are non-negative and sum to 1. Hence there is an index j with α_j ≥ 1/k. Since all the columns of M are non-negative, Px* is component-wise at least M_{·,j}/k, and so has length at least ρ_0/k. Finally,

|Ax*| ≥ |Px*| − ||P − A|| |x*| ≥ ρ_0/k − σ/√δ ≥ ρ_0/(2k),

where the second-to-last inequality follows from the fact that x*_j ≤ 1/(δn) (so |x*| ≤ 1/√(δn)), and the last follows from (10).

Claim 13. |Ax*| ≤ 2ρ_0.

Proof. Let ℓ be such that ρ_0 = |M_{·,ℓ}|. Define a feasible solution x to the convex program as follows: x_j = 1/(δn) if j ∈ S_ℓ, and 0 otherwise (taking |S_ℓ| = δn, discarding elements if necessary). Then Ax = A_{·,S_ℓ}, and

|Ax*| ≤ |Ax| = |A_{·,S_ℓ}| ≤ |P_{·,S_ℓ}| + σ/√δ ≤ |M_{·,ℓ}| + 4σ/√δ + σ/√δ ≤ 2ρ_0,

where the first inequality is by Claim 2, the second by (9), and the last by (10).

The above two results show that ρ_0/(2k) ≤ opt ≤ 2ρ_0.

We now use opt as a threshold for the singular values of A and show that k can be recovered from this threshold. This also proves the correctness of Algorithm 1 and hence yields a provable, data-determined threshold for finding k.

Theorem 14. Define k* = max{k' ∈ [d] : s_{k'}(A)/√n ≥ δ²·opt/8}. Then k* = k.

Proof. In order to prove this result, we need to estimate s_k(A) and s_{k+1}(A). We shall do this by estimating the singular values of some related matrices. We begin by bounding the singular values of M.

Claim 15. s_k(M) ≥ δρ_0/√k, whereas s_{k+1}(M) = 0.

Proof. The second statement follows trivially since M has only k columns, and so has rank at most k. For the first statement, observe that s_k(M) = min_{x:|x|=1} |Mx|. Let x be any unit vector in ℜ^k. There must be a coordinate ℓ ∈ [k] such that |x_ℓ| ≥ 1/√k. Then

|Mx| = |Σ_{j∈[k]} x_j M_{·,j}| ≥ |x_ℓ| · |proj(M_{·,ℓ}, Null(M \ M_{·,ℓ}))| ≥ δρ_0/√k,

where the last inequality follows from condition (8).

Let S_1, . . . , S_k be the subsets of [n] guaranteed by condition (9). We assume that each of these subsets has size exactly δn (otherwise we remove some elements from them). Note that these sets are mutually disjoint, since (8) and (10) imply that for any two distinct ℓ, ℓ' ∈ [k], |M_{·,ℓ} − M_{·,ℓ'}| > 8σ/√δ. Let B = (P_{·,S_1} | P_{·,S_2} | . . . | P_{·,S_k}). We first relate s_k(B) and s_k(M).

Claim 16. s_k(B) ≥ s_k(M) − 4σ√k/√δ ≥ δρ_0/(2√k).

Proof. Condition (9) implies that every column of B − M has norm at most 4σ/√δ, and hence ||B − M|| ≤ 4σ√k/√δ.
This implies the first inequality in the claim, since (from linear algebra) s_k(B) ≥ s_k(M) − ||M − B||. The second inequality follows from Claim 15 and (10).

We now relate s_k(B) and s_k(P).

Claim 17. s_k(P) ≥ √(δn) · s_k(B) ≥ δ^{1.5} ρ_0 √n / (2√k).

Proof. We shall exhibit a subspace of dimension k such that for any unit vector x in this subspace, |Px| ≥ √(δn) · s_k(B). This proves the desired result by the Max-Min theorem of linear algebra, which states that s_k(P) = max_{V: dim(V)=k} min_{x∈V, |x|=1} |Px|. For each index ℓ ∈ [k] define a unit vector e_ℓ ∈ ℜ^n as follows: (e_ℓ)_j = 1/√(δn) if j ∈ S_ℓ, and 0 otherwise. Since the sets S_1, . . . , S_k are pairwise disjoint, the vectors e_ℓ, ℓ ∈ [k], span a subspace U of dimension k (in fact they form an orthonormal family of k vectors). Now consider any unit vector v = Σ_{j∈[k]} β_j e_j. Since the e_j are orthonormal, Σ_j β_j² = 1. Now,

Pv = Σ_{j∈[k]} β_j P e_j = √(δn) Σ_{j∈[k]} β_j P_{·,S_j} = √(δn) · By,

where y is the vector whose jth coordinate is β_j. Since y is a unit vector, |By| ≥ s_k(B). This implies the desired result (the last inequality in the claim follows from Claim 16).

Finally, we give bounds on the singular values of the data matrix A.

Lemma 18. s_k(A)/√n ≥ δ²·opt/8 > s_{k+1}(A)/√n.

Proof. Since ||P − A|| ≤ σ√n, |s_r(P)/√n − s_r(A)/√n| is at most σ for any r ∈ [d]. Also, by (10), Claim 12 and the fact that δ ≤ 1/k (the k disjoint sets S_ℓ each have size δn, so kδn ≤ n), we have σ ≤ δ³ρ_0/20 ≤ kδ³·opt/10 ≤ δ²·opt/10. Using these facts and Claim 17,

s_k(A)/√n ≥ δ^{1.5}ρ_0/(2√k) − σ ≥ δ^{1.5}·opt/(4√k) − δ²·opt/10 ≥ δ²·opt/4 − δ²·opt/10 ≥ δ²·opt/8,

where the second inequality follows from Claim 13 (so that ρ_0 ≥ opt/2) and the third from δ ≤ 1/k. Similarly, since s_{k+1}(P) = 0,

s_{k+1}(A)/√n ≤ σ ≤ δ²·opt/10 < δ²·opt/8.

This proves that k* = k, and hence that we can recover the number of vertices of the latent polytope whenever conditions (8), (9), (10) are satisfied, proving Theorem 9.

4.2. Finding a Non-negative Matrix Factorization without Knowledge of the Inner Dimension

Given A, finding an approximate non-negative factorization is intractable. In a long line of work starting from (Arora et al., 2012), several ingenious algorithms have emerged (Recht et al., 2012; Gillis & Vavasis, 2014; Rong & Zou, 2015) which, using Separability, first introduced by (Donoho & Stodden, 2003), provably recover an approximate factorization. These procedures assume that the inner dimension k is known. By imposing the assumption that every data point is approximately a non-negative combination of k data points, several algorithms (e.g. (Ambikapathi et al., 2013; Fu et al., 2015; Gillis & Luce, 2018)) can find the minimum such k by solving linear or convex programs. However, this assumption is too strong to hold for LkP and can be shown to fail in many simple instances. To the best of our knowledge there are no known algorithms which provably recover the inner dimension and the factorization under the conditions of LkP. In this section we state polynomial-time procedures which provably recover the factorization (the matrix M) along with k, the inner dimension (see equation (Approx-NMF)). Since Approx-NMF is a special case of LkP, applying Algorithm 1 recovers k. It is important to note that the standard assumptions stated earlier are different from separability. In fact, conditions (9) and (10) of Theorem 9 are analogous to assumptions which have been used for NMF, namely the nearly pure records and noise assumptions of (Bhattacharyya et al., 2016). Instead of the Well-Separatedness condition (8) of Theorem 9, we make the following Dominant Features assumption, which is similar to, but weaker than, their assumption of the same name.
Dominant Features: This stipulates that for each ℓ ∈ [k] there is a set T_ℓ of dominant features of column ℓ of M. More precisely, there exist disjoint sets T_1, T_2, . . . , T_k such that for all ℓ,

Σ_{i∈T_ℓ} M_{i,ℓ} ≥ 4 Σ_{i∈T_ℓ} Σ_{ℓ'≠ℓ} M_{i,ℓ'}.

Lemma 19. The Dominant Features assumption above implies condition (8) of Theorem 9.

The proof is based on diagonal dominance and is given in the Supplementary Material. Finally, we state the main theorem, which gives a procedure, described in Algorithm 2, for finding close approximations to the vertices of M. The algorithm is a simplified version of (Bhattacharyya & Kannan, 2020), requiring new proofs. At the outset, the algorithm finds the subspace V spanned by the top k left singular vectors of A and projects the data points onto V. It runs for k iterations and in each iteration finds an approximation to a new vertex of K. This approximation is of the form Â_{·,R} for some R ⊆ [n], |R| = δn. R is obtained as follows: we pick a random unit vector u_r in the intersection of V with the null space of the vectors found so far and set R ← argmax_{R∈R_δ} |u_r · Â_{·,R}|; this optimization can be carried out in polynomial time. Indeed, finding R is a one-dimensional problem: we project all the columns Â_{·,j} along u_r (note that Â has only n columns), and the optimal R consists of either the δn columns with the largest dot products with u_r or the δn columns with the smallest, so it can be found in polynomial time.

Theorem 20. Assuming conditions (9) and (10) and the Dominant Features assumption, Algorithm 1 finds k in polynomial time. Using this value of k, Algorithm 2 finds k points, each of which is within O(k⁴σ/δ^{1.5}) distance of a unique vertex of M.

The proof is given in the supplementary material. The above theorem yields an NMF without knowing the value of the inner dimension. A detailed discussion of the relative strengths and weaknesses of the various assumptions is beyond the scope of this paper and will be presented elsewhere. It suffices to say that the assumptions, techniques and results presented in this section are novel contributions to NMF.

Algorithm 2: Algorithm for finding approximations to the vertices of K.
Input: d × n data matrix A, a parameter δ ∈ (0, 1), and k.
Let V be the subspace of ℜ^d spanned by the top k left singular vectors of A.
Let Â_{·,j} be the projection of A_{·,j} on V, for j = 1, . . . , n.
for r = 0, . . . , k − 1 do
  u_r ← unit vector chosen uniformly at random from the subspace U_r := V ∩ Null(Â_{·,R_1}, . . . , Â_{·,R_r}).
  R_{r+1} ← argmax_{R∈R_δ} |u_r · Â_{·,R}|.
end for
Output Â_{·,R_1}, . . . , Â_{·,R_k}.
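For completeness, here is a numpy sketch of Algorithm 2 as we read it. It is illustrative only: the SVD-based construction of the subspace U_r, the rank tolerance and the variable names are our own choices, not the authors' code.

```python
import numpy as np

def find_vertices(A, delta, k, seed=0):
    """Sketch of Algorithm 2: returns k columns of the form A_hat_{.,R}
    approximating the vertices of K."""
    rng = np.random.default_rng(seed)
    d, n = A.shape
    s = max(1, int(delta * n))

    # V: span of the top-k left singular vectors of A; A_hat: data projected onto V.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    V = U[:, :k]
    A_hat = V @ (V.T @ A)

    found = []                                   # A_hat_{.,R_1}, ..., A_hat_{.,R_r}
    for _ in range(k):
        # Orthonormal basis Q of U_r = V intersected with Null(vectors found so far).
        B = V.copy()
        if found:
            F = np.column_stack(found)
            Qf, _, _ = np.linalg.svd(F, full_matrices=False)
            B -= Qf @ (Qf.T @ B)
        Ub, sb, _ = np.linalg.svd(B, full_matrices=False)
        Q = Ub[:, sb > 1e-10]

        # u_r: a uniformly random unit vector in U_r.
        u = Q @ rng.normal(size=Q.shape[1])
        u /= np.linalg.norm(u)

        # argmax_R |u . A_hat_{.,R}| over delta*n-sized subsets R is attained either by
        # the delta*n columns with the largest or with the smallest dot products with u.
        dots = u @ A_hat
        order = np.argsort(dots)
        lo, hi = order[:s], order[-s:]
        R = hi if abs(dots[hi].mean()) >= abs(dots[lo].mean()) else lo
        found.append(A_hat[:, R].mean(axis=1))

    return np.column_stack(found)
```

Combined with the earlier sketch of Algorithm 1, find_vertices(A, delta, find_k(A, delta)[0]) returns k points which, under the assumptions of Theorem 20, should each lie within O(k⁴σ/δ^{1.5}) of a distinct vertex of M.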
5. Conclusions

This paper uses convex geometry to find the number of extreme points of the Latent k-Polytope. This settles several interesting problems related to finding the number of components in Mixed Membership models.

Acknowledgement

The authors thank the reviewers for their insightful comments, which greatly helped in preparing the camera-ready version. The author CB gratefully acknowledges the support of Microsoft Research.

References

Ambikapathi, A., Chan, T., Chi, C., and Keizer, K. Hyperspectral data geometry-based estimation of number of endmembers using p-norm-based pure pixel identification algorithm. IEEE Trans. Geosci. Remote Sens., 51(5-1):2753-2769, 2013. URL https://doi.org/10.1109/TGRS.2012.2213261.

Arora, S., Ge, R., Kannan, R., and Moitra, A. Computing a nonnegative matrix factorization provably. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pp. 145-162. ACM, 2012.

Barvinok, A. Thrifty approximations of convex bodies by polytopes. International Mathematics Research Notices, 2014(16):4341-4356, 2014. doi: 10.1093/imrn/rnt078.

Bhattacharyya, C. and Kannan, R. Near-optimal sample complexity bounds for learning latent k-polytopes and applications to ad-mixtures. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, volume 119 of Proceedings of Machine Learning Research, pp. 854-863. PMLR, 2020.

Bhattacharyya, C., Goyal, N., Kannan, R., and Pani, J. Non-negative matrix factorization under heavy noise. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp. 1426-1434. JMLR.org, 2016. URL http://proceedings.mlr.press/v48/bhattacharya16.html.

Cai, J., Candès, E. J., and Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20(4):1956-1982, 2010. doi: 10.1137/080738970.

Chatterjee, S. Matrix estimation by universal singular value thresholding. Annals of Statistics, 43(1):177-214, 2015.

Donoho, D. and Gavish, M. Minimax risk of matrix denoising by singular value thresholding. Annals of Statistics, 42(6):2413-2440, 2014.

Donoho, D. and Stodden, V. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in Neural Information Processing Systems, pp. 1141-1148, 2003.

Fu, X., Ma, W., Chan, T., and Bioucas-Dias, J. M. Self-dictionary sparse regression for hyperspectral unmixing: Greedy pursuit and pure pixel search are related. IEEE J. Sel. Top. Signal Process., 9(6):1128-1141, 2015.

Gillis, N. The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines, 12, 2014.

Gillis, N. and Luce, R. A fast gradient method for nonnegative sparse regression with self-dictionary. IEEE Trans. Image Process., 27(1):24-37, 2018. URL https://doi.org/10.1109/TIP.2017.2753400.

Gillis, N. and Vavasis, S. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):698-714, April 2014. doi: 10.1109/TPAMI.2013.226.

Gruber, P. M. Aspects of approximation of convex bodies. In Handbook of Convex Geometry, volume A, pp. 319-345. North-Holland, Amsterdam, 1993.

Khachiyan, L., Boros, E., Borys, K., Elbassioni, K., and Gurvich, V. Generating all vertices of a polyhedron is hard. Discrete & Computational Geometry, 39(1):174-190, 2008. doi: 10.1007/s00454-008-9050-5.

Recht, B., Re, C., Tropp, J., and Bittorf, V. Factoring nonnegative matrices with linear programs. In Advances in Neural Information Processing Systems, pp. 1214-1222, 2012.

Rong, G. and Zou, J. Intersecting faces: Non-negative matrix factorization with new guarantees. In ICML, pp. 2295-2303, 2015.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581, 2006. URL http://www.jstor.org/stable/27639773.

Vavasis, S. A. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364-1377, 2009.