# Testing Determinantal Point Processes

Khashayar Gatmiry (MIT CSAIL, gatmiry@mit.edu) · Maryam Aliakbarpour (Univ. of Massachusetts Amherst, maryam.aliakbarpour@gmail.com) · Stefanie Jegelka (MIT CSAIL, stefje@mit.edu)

The author is also a visitor at the Simons Institute for the Theory of Computing.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

**Abstract.** Determinantal point processes (DPPs) are popular probabilistic models of diversity. In this paper, we investigate DPPs from a new perspective: property testing of distributions. Given sample access to an unknown distribution q over the subsets of a ground set, we aim to distinguish whether q is a DPP distribution or ε-far from all DPP distributions in ℓ1-distance. We propose the first algorithm for testing DPPs. Furthermore, we establish a matching lower bound on the sample complexity of DPP testing, up to logarithmic factors. This lower bound also implies a new hardness result for the problem of testing the more general class of log-submodular distributions.

## 1 Introduction

Determinantal point processes (DPPs) are a rich class of discrete probability distributions that were first studied in the context of quantum physics [54] and random matrix theory [30]. Initiated by the seminal work of Kulesza and Taskar [46], DPPs have gained a lot of attention in machine learning, due to their ability to naturally capture notions of diversity and repulsion. Moreover, they are easy to define via a similarity (kernel) matrix and, as opposed to many other probabilistic models, offer tractable exact algorithms for marginalization, conditioning and sampling [5, 42, 46, 51]. Therefore, DPPs have been explored in a wide range of applications, including video summarization [39, 38], image search [45, 2], document and timeline summarization [53], recommendation [69], feature selection in bioinformatics [9], modeling neurons [63], and matrix approximation [22, 23, 50].

A determinantal point process is a distribution over the subsets of a ground set [n] = {1, 2, . . . , n}, parameterized by a marginal kernel matrix K ∈ ℝ^{n×n} with eigenvalues in [0, 1], whose (i, j)-th entry expresses the similarity of items i and j. Specifically, the marginal probability that a set A ⊆ [n] is contained in a random J ∼ Pr_K[·] is Pr(A ⊆ J) = det(K_A), where K_A is the principal submatrix of K indexed by A. This implies Pr({i, j} ⊆ J) = det(K_{{i,j}}) = K_{i,i} K_{j,j} − K²_{i,j} for items i and j, which means similar items are less likely to co-occur in J.

Despite the wide theoretical and applied literature on DPPs, one question has not yet been addressed: given a sample of subsets, can we test whether it was generated by a DPP? This question arises, for example, when trying to decide whether a DPP may be a suitable mathematical model for a dataset at hand. To answer this question, we study DPPs from the perspective of property testing. Property testing aims to decide whether a given distribution has a property of interest, by observing as few samples as possible. In the past two decades, property testing has received a lot of attention, and questions such as testing uniformity, independence, identity to a known or an unknown given distribution, and monotonicity have been studied in this framework [18, 60]. More precisely, we ask:

How many samples from an unknown distribution are required to distinguish, with high probability, whether it is a DPP or ε-far from the class of DPPs in ℓ1-distance?

Given the rich mathematical structure of DPPs, one may hope for a tester that is exceptionally efficient. Yet, we show that testing is still not easy, and establish a lower bound of Ω(√N/ε²) on the sample size of any valid tester, where N = 2ⁿ is the size of the domain.
In fact, this lower bound applies to the broader class of log-submodular measures, and may hence be of wider interest given the popularity of submodular set functions in machine learning. Even more generally, the lower bound holds for testing any subset of log-submodular distributions that includes the uniform measure. We note that the √N dependence on the domain size is not uncommon in distribution testing, since it is required even for testing simple structures such as uniform distributions [57]. However, achieving the optimal sample complexity is nontrivial. We provide the first algorithm for testing DPPs; it uses Õ(√N/ε²) samples. This algorithm matches the lower bound and hence settles the complexity of testing DPPs, up to logarithmic factors. Moreover, we show how prior knowledge of bounds on the spectrum of K or its entries K_{i,j} can improve logarithmic factors in the sample complexity. Our approach relies on testing via learning. As a byproduct, our algorithm is the first to provably learn a DPP in ℓ1-distance, while previous learning approaches only considered parameter recovery of K [67, 17], which does not imply recovery in ℓ1-distance. In short, we make the following contributions:

- We show a lower bound of Ω(√N/ε²) on the sample complexity of testing any subset of the class of log-submodular measures that includes the uniform measure, implying the same lower bound for testing DPP distributions and strongly Rayleigh [15] measures.
- We provide the first tester for the family of DPP distributions, using Õ(√N/ε²) samples. The sample complexity is optimal with respect to ε and the domain size N, up to logarithmic factors, and does not depend on other parameters. Additional assumptions on K can improve the algorithm's complexity.
- As a byproduct of our algorithm, we give the first algorithm to learn DPP distributions in ℓ1-distance.

## 2 Related work

**Distribution testing.** Hypothesis testing is a classical tool in statistics for inference about the data and model [56, 49].
About two decades ago, the framework of distribution testing was introduced to view such statistical problems from a computational perspective [37, 13]. This framework is a branch of property testing [61], and focuses mostly on discrete distributions. Property testing analyzes the non-asymptotic performance of algorithms, i.e., for finite sample sizes. By now, distribution testing has been studied extensively for properties such as uniformity [57], identity to a known [10, 1, 25] or unknown distribution [20, 24], independence [10], monotonicity [11, 3], k-modality [21], entropy estimation [12, 70], and support size estimation [59, 68, 71]. The surveys [18, 60] provide further details.

**Testing submodularity and real stability.** Property testing also includes testing properties of functions. As opposed to distribution testing, where observed samples are given, testing functions allows an active query model: given query access to a function f : X → Y, the algorithm picks points x ∈ X and obtains values f(x). The goal is to determine, with as few queries as possible, whether f has a given property or is ε-far from it. Closest to our work in this different model is the question of testing submodularity, in Hamming distance and ℓp-distance [19, 62, 31, 14], since any DPP distribution is log-submodular. In particular, Blais and Bommireddi [14] show that testing submodularity with respect to any ℓp norm is feasible with a constant number of queries, independent of the function's domain size. The vast difference between this result and our lower bound for log-submodular distributions lies in the query model (given samples versus active queries) and demonstrates the large impact of the query model. DPPs also belong to the family of strongly Rayleigh measures [15], whose generating functions are real stable polynomials. Raghavendra et al.
[58] develop an algorithm for testing real stability of bivariate polynomials, which, if nonnegative, correspond to distributions over two items.

**Learning DPPs.** The problem of learning DPPs has been of great interest in machine learning. Unlike in testing, in learning one commonly assumes that the underlying distribution is indeed a DPP, and aims to estimate the marginal kernel K. It is well known that maximum likelihood estimation for DPPs is a highly non-concave optimization problem, conjectured to be NP-hard [17, 47]. To circumvent this difficulty, previous work imposes additional assumptions, e.g., a parametric family for K [45, 44, 2, 46, 8, 48], or low-rank structure [33, 34, 29]. A variety of optimization and sampling techniques have been used, e.g., variational methods [26, 36, 8], MCMC [2], first-order methods [46], and fixed-point algorithms [55]. Brunel et al. [17] analyze the asymptotic convergence rate of the maximum likelihood estimator. To avoid likelihood maximization, Urschel et al. [67] propose an algorithm based on the method of moments, with statistical guarantees. Its complexity is determined by the cycle sparsity of the DPP. We further discuss the implications of their result in our context in Section 4. Using similar techniques, Brunel [16] considers learning the class of signed DPPs, i.e., DPPs that allow skew-symmetry, K_{i,j} = −K_{j,i}.

## 3 Notation and definitions

Throughout the paper, we consider discrete probability distributions over subsets of a ground set [n] = {1, 2, . . . , n}, i.e., over the power set 2^[n] of size N := 2ⁿ. We refer to such distributions via their probability mass function p : 2^[n] → ℝ≥0 satisfying Σ_{S⊆[n]} p(S) = 1. For two distributions p and q, we use ℓ1(q, p) = ½ Σ_{S⊆[n]} |q(S) − p(S)| to denote their ℓ1 (total variation) distance, and χ²(q, p) = Σ_{S⊆[n]} (q(S) − p(S))²/p(S) to denote their χ²-distance.
Unlike the ℓ1-distance, the χ²-distance is only a pseudo-distance; it can be lower bounded as χ²(q, p) ≥ 4 ℓ1(q, p)² by a simple application of the Cauchy-Schwarz inequality.

**Determinantal point processes (DPPs).** A DPP is a discrete probability distribution parameterized by a positive semidefinite kernel matrix K ∈ ℝ^{n×n} with eigenvalues in [0, 1]. More precisely, the marginal probability for any set S ⊆ [n] to occur in a sampled set J is given by the determinant of the principal submatrix indexed by the rows and columns in S: Pr_{J∼K}[S ⊆ J] = det(K_S). We refer to the probability mass function of the DPP by Pr_K[J]. A simple application of the inclusion-exclusion principle reveals an expression in terms of the complement J̄ of J:

Pr_K[J] = |det(K − I_{J̄})|,  (1)

where I_{J̄} denotes the diagonal matrix with ones in the positions indexed by J̄ and zeros elsewhere.

**Distribution testing.** We mathematically define a property P to be a set of distributions. A distribution q has the property P if q ∈ P. We say two distributions p and q are ε-far from (ε-close to) each other if and only if their ℓ1-distance is at least (at most) ε. Also, q is ε-far from P if and only if it is ε-far from every distribution in P. We define the ε-far set of P to be the set of all distributions that are ε-far from P. We say an algorithm is an (ε, δ)-tester for property P if, upon receiving samples from an unknown distribution q, the following holds with probability at least 1 − δ: if q has the property P, the algorithm outputs accept; if q is ε-far from P, the algorithm outputs reject. We refer to ε and δ as the proximity parameter and confidence parameter, respectively. Note that if we have an (ε, δ)-tester for a property with confidence parameter δ < 0.5, then we can obtain an (ε, δ′)-tester for an arbitrarily small δ′ by multiplying the sample size by an extra factor of Θ(log(δ/δ′)). This amplification technique [28] is a direct implication of the Chernoff bound: we run the initial tester Θ(log(δ/δ′)) times and take the majority output as the answer.
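As a quick numerical sanity check of Eq. (1), the following sketch (variable names are ours, purely illustrative) builds a random marginal kernel with eigenvalues in (0, 1), computes every atom probability via |det(K − I_{J̄})|, and verifies that they sum to one and reproduce the marginals det(K_S):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 3

# Random marginal kernel: symmetric PSD with eigenvalues in (0, 1).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
K = Q @ np.diag(rng.uniform(0.1, 0.9, size=n)) @ Q.T

def atom_prob(K, J):
    """Pr_K[J] = |det(K - I_Jbar)|, Eq. (1)."""
    I_Jbar = np.diag([0.0 if i in J else 1.0 for i in range(K.shape[0])])
    return abs(np.linalg.det(K - I_Jbar))

subsets = [frozenset(s) for r in range(n + 1)
           for s in itertools.combinations(range(n), r)]
probs = {J: atom_prob(K, J) for J in subsets}

# The atom probabilities form a distribution ...
assert abs(sum(probs.values()) - 1.0) < 1e-10
# ... whose marginals match det(K_S), here checked for S = {0, 1}.
S = frozenset({0, 1})
idx = sorted(S)
marginal = sum(p for J, p in probs.items() if S <= J)
assert abs(marginal - np.linalg.det(K[np.ix_(idx, idx)])) < 1e-10
```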
## 4 Main results

We begin by summarizing our main results; Sections 5 and 6 provide more proof details.

**Upper bound.** Our first result is the first upper bound on the sample complexity of testing DPPs.

**Theorem 1** (Upper bound). Given samples from an unknown distribution q over 2^[n], there exists a deterministic (ε, 0.01)-tester for determining whether q is a DPP or ε-far from all DPP distributions. The tester uses O(C_{N,ε} √N/ε²) samples, with logarithmic factor C_{N,ε} = log²(N)(log(N) + log(1/ε)).

Importantly, the sample complexity of our upper bound grows as Õ(√N/ε²), which is optimal up to logarithmic factors (Theorem 2). With additional assumptions on the spectrum and entries of K, expressed as (α, β)-normal DPPs, we obtain a refined analysis.

**Definition 1.** For α ∈ [0, 0.5] and β ∈ [0, 1], a DPP with marginal kernel K is (α, β)-normal if: 1. the eigenvalues of K are in the range [α, 1 − α]; and 2. for all i, j ∈ [n]: K_{i,j} ≠ 0 ⇒ |K_{i,j}| ≥ β.

A similar notion of normal DPPs was also used in [67]. Since K has eigenvalues in [0, 1], its entries K_{i,j} are at most one in absolute value. Hence, we always assume 0 ≤ α ≤ 0.5 and 0 ≤ β ≤ 1.

**Lemma 1.** For (α, β)-normal DPPs, with knowledge of α and β, the factor in Theorem 1 becomes C′_{N,ε,α,β} = log²(N)(1 + log(1/ε) + min{log(1/α), log(1/β)}). Even more, if at least one of α or β is not too small, i.e., if α = Ω(ε² N^{−1/4}) or β = Ω(ε N^{−1/4}) holds, then C′_{N,ε,α,β} reduces to log²(N).

With a minor change in the algorithm, the bound in Lemma 1 also holds for the problem of testing whether q is an (α, β)-normal DPP or ε-far from just the class of (α, β)-normal DPPs, instead of all DPPs (Appendix F). Our approach tests DPP distributions via learning: at a high level, we learn a DPP model from the data as if the data were generated from a DPP distribution. Then, we use a new batch of data and test whether the DPP we learned appears to have generated that new batch.
More accurately, given samples from q, we pretend q is a DPP with kernel K*, and use a proper learning algorithm to estimate a kernel matrix K̂. But Urschel et al. [67] derive a lower bound on the complexity of learning K* which, in the worst case, may lead to a suboptimal sample complexity for testing. To reduce the sample complexity of learning, we do not work with a single accurate estimate K̂, but construct a set M of candidate DPPs as potential estimates for q. We show that, with only Õ(√N/ε²) samples, we can obtain a set M such that, with high probability, we can determine whether q is a DPP by testing whether q is close to any DPP in M. We prove that O(log(|M|) √N/ε²) samples suffice for this algorithm to succeed with high probability. Small-scale experiments in Appendix J validate the algorithm empirically.

**Lower bound.** Our second main result is an information-theoretic lower bound, which shows that the sample complexity of our tester in Theorem 1 is optimal up to logarithmic factors.

**Theorem 2** (Lower bound). Given ε ≤ 0.0005 and n ≥ 22, any (ε, 0.01)-tester needs at least Ω(√N/ε²) samples to distinguish whether q is a DPP or ε-far from the class of DPPs. Given α ∈ [0, 0.5], the same bound holds for distinguishing whether q is an (α, β)-normal DPP or ε-far from the class of DPPs (or ε-far from the class of (α, β)-normal DPPs).

In fact, we prove a more general result (Theorem 4): testing whether q is in any subclass of the family of log-submodular distributions that includes the uniform distribution requires Ω(√N/ε²) samples. DPPs are such a subclass [46]. A distribution f over 2^[n] is log-submodular if for every S ⊆ S′ ⊆ [n] and i ∉ S′, it holds that

log f(S′ ∪ {i}) − log f(S′) ≤ log f(S ∪ {i}) − log f(S).

Given the interest in log-submodular distributions [26, 66, 27, 40, 41], this result may be of wider interest.
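For small ground sets, the log-submodularity condition above can be verified by brute force over all pairs S ⊆ S′; a minimal sketch (function names are illustrative):

```python
import itertools
import math

def is_log_submodular(f, n, tol=1e-12):
    """Brute-force check of log f(S'∪{i}) - log f(S') <= log f(S∪{i}) - log f(S)
    for all S ⊆ S' ⊆ [n] and i ∉ S'; assumes f is strictly positive."""
    subsets = [frozenset(s) for r in range(n + 1)
               for s in itertools.combinations(range(n), r)]
    for S_big in subsets:
        for S in subsets:
            if not S <= S_big:
                continue
            for i in set(range(n)) - S_big:
                lhs = math.log(f(S_big | {i})) - math.log(f(S_big))
                rhs = math.log(f(S | {i})) - math.log(f(S))
                if lhs > rhs + tol:
                    return False
    return True

# The uniform measure is log-submodular (both sides are always zero) ...
assert is_log_submodular(lambda S: 1.0, 3)
# ... while exp(|S|^2), which rewards large sets superadditively, is not.
assert not is_log_submodular(lambda S: math.exp(len(S) ** 2), 3)
```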
Moreover, our lower bound applies to another important subclass: strongly Rayleigh measures [15], which underlie recent progress in algorithms and mathematics [35, 32, 64, 4], and in sampling for machine learning [5, 52, 51]. Our lower bound stands in stark contrast to the constant sample complexity of testing whether a given function is submodular [14], implying a wide complexity gap between access to given samples and access to an evaluation oracle (see Section 2). We prove our lower bounds by a reduction from a randomized instance of uniformity testing.

## 5 An algorithm for testing DPPs

We first construct an algorithm for testing the smaller class of (α, β)-normal DPPs, and then show how to extend this result to all DPPs via a coupling argument. Our testing algorithm relies on learning: given samples from q, we estimate a kernel K̂ from the data, and then test whether the estimated DPP has generated the observed samples.

**Algorithm 1** DPP-Tester

    1: procedure DPP-TESTER(ε, δ, sample access to q)
    2:   M ← construct the set of DPP distributions as described in Theorem 3.
    3:   for each p in M do
    4:     use robust χ²-ℓ1 testing to check whether χ²(q, p) ≤ ε²/500 or ℓ1(q, p) ≥ ε
    5:     if the tester outputs accept then
    6:       return accept
    7:   return reject

The magnitude of any entry K̂_{i,j} can be estimated from the marginals for S = {i, j}, {i}, and {j}, since Pr_K[{i, j} ⊆ J] = K_{i,i} K_{j,j} − K²_{i,j} = Pr_K[i ∈ J] · Pr_K[j ∈ J] − K²_{i,j}. But determining the signs is more challenging. Urschel et al. [67] estimate signs via higher-order moments that are harder to estimate, and it is not clear whether the resulting K̂ yields a sufficiently accurate estimate of the distribution to obtain an optimal sample complexity for testing. Hence, instead, we construct a set M of candidate DPPs such that, with high probability, there is a p ∈ M that is close to q if and only if q is a DPP. Our tester, Algorithm 1, tests closeness to M by individually testing closeness of each candidate in M.

**Constructing M.**
The DPPs in M arise from variations of an estimate of K*, obtained with Õ(√N/ε²) samples. Via the above strategy, we first estimate the magnitude |K*_{i,j}| of each matrix entry. Separating the case K*_{i,j} = 0, one can compute confidence intervals for this estimate around +|K̂_{i,j}| and −|K̂_{i,j}|. We then pick candidate entries from these confidence intervals, such that at least one is close to the true K*_{i,j}. The candidate matrices K̃ are obtained by all possible combinations of candidate entries. Since these are not necessarily valid marginal kernels, we project them onto the set of positive semidefinite matrices with eigenvalues in [0, 1]. Then, M is the set of all DPPs parameterized by these projected candidate matrices Π(K̃). Its cardinality is given in Theorem 3 and, as an explicit function of N and ε, in Appendix H. If q is a DPP with kernel K*, then, by construction, our candidates contain a K̃ close to K*. The projection of K̃ remains close to K* in Frobenius distance. We show that this closeness of the matrices implies closeness of the corresponding distributions q and p = Pr_{Π(K̃)}[·] in ℓ1-distance: ℓ1(q, p) = O(ε). Conversely, if q is ε-far from being a DPP, then it is, by definition, ε-far from M, which is a subset of all DPPs.

**Testing M.** To test whether q is close to M, a first idea is to perform robust ℓ1 identity testing, i.e., for every p ∈ M, test whether ℓ1(q, p) ≥ ε or ℓ1(q, p) = O(ε). But M can contain the uniform distribution, and it is known that robust ℓ1 uniformity testing needs Ω(N/log N) samples [68], as opposed to the optimal √N dependence. Hence, instead, we use a combination of χ² and ℓ1 distances for testing, and test χ²(q, p) = O(ε²) versus ℓ1(q, p) ≥ ε. This is possible with fewer samples [1]. To apply this robust χ²-ℓ1 identity testing (described in Section 5.1), we must prove that, with high probability, there is a p in M with χ²(q, p) = O(ε²) if and only if q is a DPP. Theorem 3, proved in Appendix A, asserts this result if q is an (α, β)-normal DPP.
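The magnitude-estimation and projection steps above can be sketched as follows. This is an illustrative simplification (function names are ours): it branches only over sign patterns of the off-diagonal entries, standing in for the finer grid of confidence-interval candidates used in the paper.

```python
import itertools
import numpy as np

def project_kernel(K_tilde):
    """Frobenius-nearest symmetric matrix with eigenvalues in [0, 1]."""
    w, V = np.linalg.eigh((K_tilde + K_tilde.T) / 2)
    return V @ np.diag(np.clip(w, 0.0, 1.0)) @ V.T

def candidate_kernels(single, pair, n):
    """single[i] ~ Pr[i in J], pair[(i,j)] ~ Pr[{i,j} in J] (empirical marginals).
    Uses K_ij^2 = Pr[i]Pr[j] - Pr[{i,j}] for magnitudes, then tries all signs."""
    mag = np.diag([float(single[i]) for i in range(n)])
    for i, j in itertools.combinations(range(n), 2):
        k2 = max(single[i] * single[j] - pair[(i, j)], 0.0)
        mag[i, j] = mag[j, i] = np.sqrt(k2)
    offdiag = list(itertools.combinations(range(n), 2))
    for signs in itertools.product([1.0, -1.0], repeat=len(offdiag)):
        K = mag.copy()
        for s, (i, j) in zip(signs, offdiag):
            K[i, j] = K[j, i] = s * mag[i, j]
        yield project_kernel(K)

# With exact marginals of K = [[0.5, 0.2], [0.2, 0.5]], one candidate recovers K.
K_true = np.array([[0.5, 0.2], [0.2, 0.5]])
single = {0: 0.5, 1: 0.5}
pair = {(0, 1): np.linalg.det(K_true)}      # Pr[{0,1} in J] = det(K_{0,1})
assert any(np.allclose(C, K_true, atol=1e-8)
           for C in candidate_kernels(single, pair, 2))
```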
The χ²-closeness guarantee is stronger than its ℓ1 counterpart, since 4 ℓ1(q, p)² ≤ χ²(q, p). To prove Theorem 3, we need to control the distance between the atom probabilities of q and p. We analyze these atom probabilities, which are given by determinants, via a lower bound on the smallest singular value σ_n of the matrices in the family {K − I_{J̄} : J ⊆ [n]}.

**Lemma 2.** If the kernel matrix K has all eigenvalues in [α, 1 − α], then, for every J ⊆ [n]: σ_n(K − I_{J̄}) ≥ α(1 − α)/n.

Lemma 2 is proved in Appendix B. In Theorem 3, we observe m = ⌈(ln(1/δ) + 1) √N/ε²⌉ samples from q, and use the parameter ξ := ⌈200 n² ε^{−1} min{2γ/β, γ/α}⌉, with γ := √(log(n)/N) + 1.

**Theorem 3.** Let q be an (α, β)-normal DPP distribution with marginal kernel K*. Given the parameters defined above, suppose we have m samples from q. Then, one can generate a set M of DPP distributions of cardinality |M| = (2ξ + 1)^{n²}, with ξ defined as above, such that, with probability at least 1 − δ, there is a distribution p ∈ M with χ²(q, p) ≤ ε²/500.

### 5.1 Correctness of the testing algorithm for (α, β)-normal DPPs

Next, we show that the resulting testing algorithm succeeds with high probability; this finishes the proof of Lemma 1. For simplicity, we set the confidence parameter in Algorithm 1 to δ = 0.01. In this case, DPP-Tester aims to output accept if q is an (α, β)-normal DPP, and reject if q is ε-far from all DPPs, in both cases with probability at least 0.99. To finish the proof of the claimed sample complexity, we need to argue that DPP-Tester succeeds with high probability, i.e., that with high probability all of the identity tests, one for each p ∈ M, succeed. The algorithm uses robust χ²-ℓ1 identity testing [1] to test χ²(q, p) ≤ ε²/500 versus ℓ1(q, p) ≥ ε. In our framework, the χ²-ℓ1 identity tester works as follows. It uses a Poissonization trick that simplifies the analysis: given the average sample size m, the tester first samples m′ ∼ Poisson(m), then obtains m′ samples from q.
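For intuition, such a Poissonized χ²-type identity test can be sketched as follows (a schematic with illustrative names; the statistic and threshold follow the description in the text, not the exact constants of [1]):

```python
from collections import Counter
import numpy as np

def chi2_l1_identity_test(sample_from_q, p, m, eps):
    """Sketch of a Poissonized chi^2-vs-l1 identity test in the style of
    Acharya et al. [1]; p maps each atom J to p(J). Illustrative only."""
    rng = np.random.default_rng(0)
    m_prime = rng.poisson(m)                   # Poissonization: atom counts
    counts = Counter(sample_from_q(m_prime))   # become independent Poissons
    N = len(p)                                 # domain size (N = 2^n for DPPs)
    Z = sum(((counts[J] - m * pJ) ** 2 - counts[J]) / (m * pJ)
            for J, pJ in p.items() if pJ >= eps / (50 * N))  # heavy atoms only
    return "accept" if Z <= m * eps ** 2 / 10 else "reject"

# Sampling from p itself keeps the statistic near zero, so the test accepts.
atoms = list(range(8))
p = {J: 1 / 8 for J in atoms}
rng = np.random.default_rng(1)
draw = lambda k: rng.choice(atoms, size=k)
assert chi2_l1_identity_test(draw, p, m=20000, eps=0.5) == "accept"
```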
For each p ∈ M, it then computes the statistic

Z(m) = Σ_{J ⊆ [n] : p(J) ≥ ε/(50N)} [(N(J) − m·p(J))² − N(J)] / (m·p(J)),  (3)

where N(J) is the number of samples equal to the set J, and compares Z(m) with the threshold C = mε²/10. Acharya et al. [1] show that for m = Ω(√N/ε²), Z(m) concentrates around its mean, which is strictly below C if p satisfies χ²(q, p) ≤ ε²/500, and strictly above C if ℓ1(q, p) ≥ ε. Let E1 be the event that all these robust tests, for every p ∈ M, simultaneously answer correctly. To make sure that E1 happens with high probability, we use amplification (Section 3): while we use the same set of samples to test against every p ∈ M, we multiply the sample size by Θ(log(|M|)) to be confident that each test answers correctly with probability at least 1 − O(|M|^{−1}). A union bound then implies that E1 happens with arbitrarily large constant probability. Theorem 3 states that, indeed, with Õ(√N/ε²) samples, if q is an (α, β)-normal DPP, then M contains a distribution p such that χ²(q, p) ≤ ε²/500, with high probability. We call this event E2.

DPP-Tester succeeds on the event E1 ∩ E2: if q is an (α, β)-normal DPP, then at least one χ²-ℓ1 test accepts its p and consequently the algorithm accepts q as a DPP. Conversely, if q is ε-far from all DPPs, then ℓ1(q, p) ≥ ε for every p ∈ M, so all the χ²-ℓ1 tests reject simultaneously and DPP-Tester rejects q as well. With a union bound on the events E1^c and E2^c, it follows that E1 ∩ E2 happens with arbitrarily large constant probability too, independent of whether q is a DPP or not. Adding the sample complexities for generating M and for the χ²-ℓ1 tests, and observing that log(|M|) = O(log²(N)(1 + log(1/ε) + min{log(1/α), log(1/β)})), completes the proof of Lemma 1.

### 5.2 Extension to general DPPs

Next, we generalize our testing result from (α, β)-normal DPPs to general DPPs, proving the general sample complexity in Theorem 1.
The key idea is that, if some eigenvalue of K* is very close to zero or one, we couple the process of sampling from K* with sampling from another kernel Π_z(K*) whose eigenvalues are bounded away from zero and one, i.e., which parameterizes a (z, 0)-normal DPP. This coupling enables us to test (z, 0)-normal DPPs instead, by tolerating an extra failure probability, and to transfer the above analysis for (α, β)-normal DPPs. We state our coupling argument in the following lemma, proved in Appendix D.

**Lemma 3.** For a value z ∈ [0, 1], denote by Π_z(K) the projection of a marginal kernel K onto the convex set {A ∈ S⁺_n | z·I ⪯ A ⪯ (1 − z)·I}, where S⁺_n is the set of positive semidefinite n × n matrices. For z = δ/(2mn), consider the following stochastic processes: 1. draw m i.i.d. samples {J^(t)}_{t=1}^m from Pr_K[·]; 2. draw m i.i.d. samples {J̃^(t)}_{t=1}^m from Pr_{Π_z(K)}[·]. There exists a coupling between (1) and (2) under which {J^(t)}_{t=1}^m = {J̃^(t)}_{t=1}^m with probability at least 1 − δ.

We can use this coupling argument as follows. Suppose the constant c1 is such that c1 C′_{N,ε,α,β} √N/ε² samples suffice for DPP-Tester to output the correct answer for testing (α, β)-normal DPPs with probability at least 0.995; such a constant exists, as we just proved. Now, we show that with m = c2 C_{N,ε} √N/ε² samples, for a large enough constant c2, we obtain a tester for the set of all DPPs. To this end, we use the parameter setting of our algorithm for (z, 0)-normal DPPs, where z = 0.005/(2mn) is a function of c2, ε, and N. One can readily see that c2 can be picked large enough, as a function of c1 alone, such that m ≥ c1 C′_{N,ε,z,0} √N/ε². This way, by the definition of c1, the algorithm can test (z, 0)-normal DPPs with success probability 0.995. So, if q is a (z, 0)-normal DPP, or if it is ε-far from all DPPs, then the algorithm answers correctly with probability at least 0.995. It remains to check what happens when q is a DPP with kernel K* that is not (z, 0)-normal.
Indeed, DPP-Tester successfully decides this case too: by our coupling, the product distributions Pr^(m)_{K*}[·] and Pr^(m)_{Π_z(K*)}[·] over the space of data sets have ℓ1-distance at most 0.005, so we have

Pr^(m)_{K*}[Acceptance Region] ≥ Pr^(m)_{Π_z(K*)}[Acceptance Region] − 0.005 ≥ 0.995 − 0.005 = 0.99,

where the last inequality follows from the fact that Π_z(K*) parameterizes a (z, 0)-normal DPP. Hence, for such c2, DPP-Tester with c2 C_{N,ε} √N/ε² samples tests all DPPs with probability 0.99, which completes the proof of Theorem 1.

**Learning DPPs.** Our tester implicitly provides a method to learn a DPP q in ℓ1-distance: the χ²-ℓ1 tester can only accept candidate DPPs p ∈ M for which either χ²(q, p) ≤ ε²/500 or ℓ1(q, p) < ε. Since ℓ1(q, p) ≤ ½√(χ²(q, p)) < ε, any such p is a DPP within ℓ1-distance ε of the underlying distribution q. If q is a DPP, we find such a p with high probability.

## 6 Lower bound

Next, we establish the lower bound in Theorem 2 for testing DPPs, which implies that the sample complexity of DPP-Tester is tight up to logarithmic factors. In fact, our lower bound is more general: it applies to the problem of testing any subset C of the larger class of log-submodular distributions, whenever C includes the uniform measure:

**Theorem 4.** Let C be any subset of the class of log-submodular distributions that contains the uniform measure. For ε ≤ 0.0005 and n ≥ 22, given sample access to a distribution q over subsets of [n], any (ε, 0.01)-tester that checks whether q ∈ C or q is ε-far from all log-submodular distributions requires Ω(√N/ε²) samples.

One may also wish to test whether q is ε-far only from the distributions in C. A tester for this question, however, would correctly return reject for any q that is ε-far from the set of all log-submodular distributions, and can hence distinguish the cases in Theorem 4 too. Hence, the lower bound extends to this question. Theorem 2 is simply a consequence of Theorem 4: we may set C to be the set of all DPPs, or of all (α, β)-normal DPPs.
Both classes include the uniform distribution over 2^[n], which is an (α, β)-normal DPP with marginal kernel K = I/2, where I is the n × n identity matrix. By the same argument, the lower bound applies to distinguishing (α, β)-normal DPPs from the ε-far set of all DPPs for α ∈ [0, 0.5].

**Proof of Theorem 4.** To prove Theorem 4, we construct a hard uniformity testing problem that can be decided by our desired tester for C. In particular, we construct a family F such that it is hard to distinguish between the uniform measure and a randomly selected distribution h from F. While the uniform measure is in C, we will show that h is far from the set of log-submodular distributions with high probability. Hence, a tester for C can, with high probability, correctly decide between F and the uniform measure. We obtain F by randomly perturbing the atom probabilities of the uniform measure over 2^[n] by ±ε₀/N, with ε₀ = cε for a sufficiently large constant c (specified in the appendix). More concretely, for every vector r ∈ {±1}^N whose entries are indexed by the subsets S ⊆ [n], we define the distribution h_r ∈ F as

for all S ⊆ [n]: h_r(S) ∝ h̄_r(S) := (1 + r_S ε₀)/N,

where h̄_r is the corresponding unnormalized measure. We assume that h_r is selected from F uniformly at random, i.e., each entry r_S is a Rademacher random variable independent of the others. In particular, it is known that distinguishing such a random h_r from the uniform distribution requires Ω(√N/ε₀²) samples [24, 57]. To reduce this uniformity testing problem to our testing problem for C, and thereby obtain the lower bound Ω(√N/ε²) for the sample complexity of our problem, it remains to prove that h_r is ε-far from the class of log-submodular distributions with high probability. Hence, Lemma 4 finishes the proof.

**Lemma 4.** For ε ≤ 0.0005, n ≥ 22 and c = 1024, a distribution h_r drawn uniformly from F is ε-far from all log-submodular distributions with probability at least 0.99.

**Proof sketch for Lemma 4.**
We fix an arbitrary log-submodular distribution f and first show that (1) the ℓ1-distance ℓ1(f, h̄_r) between f and the unnormalized measure h̄_r is large with high probability, independent of f (we define the ℓ1-distance of general measures as for probability measures). Then, (2) we show that if ℓ1(f, h̄_r) is large, ℓ1(f, h_r) is also large. To address (1), we define a family S_r of subsets that, as we prove, satisfies:

(P1) With high probability, S_r has cardinality at least N/64.

(P2) For every S ∈ S_r, there is a contribution of at least ε₀/(8N) to ℓ1(f, h̄_r) from the term

V_S := ½|h̄_r(S) − f(S)| + ½|h̄_r(S ∪ {1}) − f(S ∪ {1})| + ½|h̄_r(S ∪ {2}) − f(S ∪ {2})| + ½|h̄_r(S ∪ {1, 2}) − f(S ∪ {1, 2})|.

Together, the above properties imply that ℓ1(h̄_r, f) ≥ (N/64) · (ε₀/(8N)) = ε₀/512. We define the important family S_r as

S_r := {S ⊆ [n] \ {1, 2} | r_{S∪{1,2}} = 1, r_{S∪{2}} = −1, r_{S∪{1}} = −1}.

Property (P1) follows from a Chernoff bound for the indicator random variables 1{S ∈ S_r} over all S ⊆ [n] \ {1, 2}, where 1{·} is the indicator function. For proving Property (P2), we distinguish two cases, depending on the ratio f(S ∪ {1, 2})/f(S ∪ {2}); one of these cases relies on the definition of log-submodularity. Finally, to show that (2) a large ℓ1(f, h̄_r) implies a large ℓ1(f, h_r), we control the normalization constant Σ_{S⊆[n]} h̄_r(S). The full proof may be found in Appendix C.

## 7 Discussion

In this paper, we initiate the study of distribution testing for DPPs. Our lower bound of Ω(√N/ε²), where N = 2ⁿ is the domain size, shows that, despite the rich mathematical structure of DPPs, testing whether q is a DPP or ε-far from it has a sample complexity similar to uniformity testing. This bound extends to related distributions that have gained interest in machine learning, namely log-submodular distributions and strongly Rayleigh measures.
Our algorithm DPP-Tester demonstrates that this lower bound is tight for DPPs, via an almost matching upper bound of Õ(√N/ε²). One may wonder what changes when using the moment-based learning algorithm from [67] instead of the learner from Section 5 (which yields the optimal testing sample complexity). With [67], we obtain a single estimate K̂_new for K*, apply a single robust χ²-ℓ1 test against Pr_{K̂_new}[·], and return its output. The resulting algorithm, DPP-Tester2, exhibits a statistical-computational tradeoff: since it performs only one test, it gains in running time, but its sample complexity can be larger. Theorem 5, proved in Appendix G, states upper bounds that are no longer logarithmic in α and β, and larger than Õ(√N/ε²):

**Theorem 5.** To test against the class of (α, β)-normal DPPs, DPP-Tester2 needs Õ(n⁴ log(n)/(ε² α² β²) + (4/β)^{2ℓ} log(n) + √N/ε²) samples, and runs in time O(N n³ + n⁶ + m n²), where m is the number of input samples and ℓ is the cycle sparsity² of the graph corresponding to the non-zero entries of K*.

Assuming a constant cycle sparsity may improve the sample complexity, but our lower bound still applies even under assumptions on cycle sparsity. While the results in this paper focus on the sample complexity for general DPPs, it is an interesting avenue of future work to study whether additional structural assumptions, or a widening to strongly log-concave measures [6, 7], can lead to further statistical and computational benefits or tradeoffs.

## Broader Impact

Due to their ability to model negative dependencies and repulsion, DPPs have become a popular tool for modeling diversity in subset selection tasks. However, they are not the only models of negative dependence, and sometimes the decision to use DPPs may be based solely on their computational efficiency. If the true data distribution is far from being a DPP, the resulting approximation error may potentially induce biases. Our work poses the question of testing whether given data actually comes from a DPP.
Being able to test for such a model fit can help avoid the biases from approximation error. Our work provides an initial theoretical understanding of the DPP testing problem. Our results settle the general sample complexity, and open avenues for further work to improve complexity over the general baseline by identifying additional mathematical structure that may exist in the data.

Acknowledgments

This work was supported by NSF CAREER award 1553284, NSF BIGDATA award IIS-1741341, NSF awards CCF-1733808 and CCF-1934846, NSF BIGDATA award IIS-1741137, and the MIT-IBM Watson AI Lab and Research Collaboration Agreement No. W1771646, Fintech@CSAIL. We thank Behrooz Tahmasebi, Ankur Moitra and Marwa El Halabi for helpful discussions and comments.

References

[1] Jayadev Acharya, Constantinos Daskalakis, and Gautam Kamath. Optimal testing for properties of distributions. In Advances in Neural Information Processing Systems (NIPS), pages 3591–3599, 2015.

[2] Raja Hafiz Affandi, Emily Fox, Ryan Adams, and Ben Taskar. Learning the parameters of determinantal point process kernels. In Int. Conference on Machine Learning (ICML), pages 1224–1232, 2014.

[3] Maryam Aliakbarpour, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, and Anak Yodpinyanee. Towards testing monotonicity of distributions over general posets. In Conference on Learning Theory (COLT), pages 34–82, 2019.

[4] N. Anari and S. Oveis Gharan. The Kadison-Singer problem for strongly Rayleigh measures and applications to asymmetric TSP. In IEEE Symposium on Foundations of Computer Science (FOCS), 2015.

[5] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte Carlo Markov Chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In Conference on Learning Theory (COLT), 2016.

[6] Nima Anari, Shayan Oveis Gharan, and Cynthia Vinzant. Log-concave polynomials, entropy, and a deterministic approximation algorithm for counting bases of matroids.
In IEEE Symposium on Foundations of Computer Science (FOCS), 2018.

[7] Nima Anari, Kuikui Liu, Shayan Oveis Gharan, and Cynthia Vinzant. Log-concave polynomials II: High-dimensional walks and an FPRAS for counting bases of a matroid. In Symposium on Theory of Computing (STOC), 2019.

²The cycle sparsity of a graph is the smallest ℓ such that the cycles of length at most ℓ constitute a basis for the cycle space of the graph.

[8] Rémi Bardenet and Michalis Titsias RC AUEB. Inference for determinantal point processes without spectral knowledge. In Advances in Neural Information Processing Systems (NIPS), pages 3393–3401, 2015.

[9] Nematollah Kayhan Batmanghelich, Gerald Quon, Alex Kulesza, Manolis Kellis, Polina Golland, and Luke Bornn. Diversifying sparsity using variational determinantal point processes. arXiv preprint arXiv:1411.6307, 2014.

[10] Tugkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 442–451, 2001.

[11] Tugkan Batu, Ravi Kumar, and Ronitt Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In Symposium on Theory of Computing (STOC), pages 381–390, 2004.

[12] Tugkan Batu, Sanjoy Dasgupta, Ravi Kumar, and Ronitt Rubinfeld. The complexity of approximating the entropy. SIAM Journal on Computing, 35(1):132–150, 2005.

[13] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing closeness of discrete distributions. Journal of the ACM, 60(1):4:1–4:25, 2013.

[14] Eric Blais and Abhinav Bommireddi. Testing submodularity and other properties of valuation functions. arXiv preprint arXiv:1611.07879, 2016.

[15] J. Borcea, P. Brändén, and T.M. Liggett. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, 22:521–567, 2009.

[16] Victor-Emmanuel Brunel.
Learning signed determinantal point processes through the principal minor assignment problem. In Advances in Neural Information Processing Systems (NIPS), pages 7365–7374, 2018.

[17] Victor-Emmanuel Brunel, Ankur Moitra, Philippe Rigollet, and John Urschel. Rates of estimation for determinantal point processes. In Conference on Learning Theory (COLT), volume 65 of Proceedings of Machine Learning Research, pages 343–345. PMLR, 2017.

[18] Clément L. Canonne. A survey on distribution testing: Your data is big. But is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22:63, 2015.

[19] Deeparnab Chakrabarty and Zhiyi Huang. Testing coverage functions. In International Colloquium on Automata, Languages, and Programming, pages 170–181. Springer, 2012.

[20] Siu-on Chan, Ilias Diakonikolas, Paul Valiant, and Gregory Valiant. Optimal algorithms for testing closeness of discrete distributions. In SIAM-ACM Symposium on Discrete Algorithms (SODA), pages 1193–1203, 2014.

[21] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning k-modal distributions via testing. Theory of Computing, 10(20):535–570, 2014. doi:10.4086/toc.2014.v010a020. URL http://www.theoryofcomputing.org/articles/v010a020.

[22] Michał Dereziński and Michael W. Mahoney. Determinantal Point Processes in Randomized Numerical Linear Algebra. arXiv e-prints, art. arXiv:2005.03185, May 2020.

[23] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang. Matrix approximation and projective clustering via volume sampling. Symposium on Theory of Computing (STOC), 2:225–247, 2006.

[24] Ilias Diakonikolas and Daniel M. Kane. A new approach for testing properties of discrete distributions. In FOCS, pages 685–694, 2016.

[25] Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Sample-optimal identity testing with high probability. In International Colloquium on Automata, Languages, and Programming. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
[26] Josip Djolonga and Andreas Krause. From MAP to marginals: Variational inference in Bayesian submodular models. In Advances in Neural Information Processing Systems (NIPS), pages 244–252, 2014.

[27] Josip Djolonga, Stefanie Jegelka, and Andreas Krause. Provable variational inference for constrained log-submodular models. In Advances in Neural Information Processing Systems (NIPS), 2018.

[28] D. Dubhashi and Alessandro Panconesi. Concentration of measure for the analysis of randomised algorithms. Draft manuscript, http://www.brics.dk/ale/papers.html, 1998.

[29] Christophe Dupuy and Francis Bach. Learning determinantal point processes in sublinear time. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 244–257, 2018.

[30] Freeman J. Dyson. Statistical theory of the energy levels of complex systems. I. Journal of Mathematical Physics, 3(1):140–156, 1962.

[31] Vitaly Feldman and Jan Vondrak. Optimal bounds on approximation of submodular and XOS functions by juntas. SIAM Journal on Computing, 45(3):1129–1170, 2016.

[32] A. Frieze, N. Goyal, L. Rademacher, and S. Vempala. Expanders via random spanning trees. SIAM Journal on Computing, 43(2):497–513, 2014.

[33] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Bayesian low-rank determinantal point processes. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 349–356. ACM, 2016.

[34] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Low-rank factorization of determinantal point processes. In Proc. AAAI Conference on Artificial Intelligence, 2017.

[35] S. Oveis Gharan, A. Saberi, and M. Singh. A randomized rounding approach to the traveling salesman problem. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 550–559, 2011.

[36] Jennifer A. Gillenwater, Alex Kulesza, Emily Fox, and Ben Taskar. Expectation-maximization for learning determinantal point processes.
In Advances in Neural Information Processing Systems (NIPS), pages 3149–3157, 2014.

[37] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. In Studies in Complexity and Cryptography. Miscellanea on the Interplay between Randomness and Computation, pages 68–75. Springer, 2011.

[38] Boqing Gong, Wei-Lun Chao, Kristen Grauman, and Fei Sha. Diverse sequential subset selection for supervised video summarization. In Advances in Neural Information Processing Systems (NIPS), pages 2069–2077. Curran Associates, Inc., 2014.

[39] Boqing Gong, Wei-Lun Chao, Kristen Grauman, and Fei Sha. Diverse sequential subset selection for supervised video summarization. In Advances in Neural Information Processing Systems (NIPS), pages 2069–2077, 2014.

[40] Alkis Gotovos, S. Hamed Hassani, and Andreas Krause. Sampling from probabilistic submodular models. In Advances in Neural Information Processing Systems (NIPS), December 2015.

[41] Alkis Gotovos, Hamed Hassani, Andreas Krause, and Stefanie Jegelka. Discrete sampling using semigradient-based product mixtures. In Uncertainty in Artificial Intelligence (UAI), August 2018.

[42] J. Ben Hough, Manjunath Krishnapur, Yuval Peres, and Bálint Virág. Determinantal processes and independence. Probability Surveys, 3:206–229, 2006.

[43] Ilse C.F. Ipsen and Rizwana Rehman. Perturbation bounds for determinants and characteristic polynomials. SIAM Journal on Matrix Analysis and Applications, 30(2):762–776, 2008.

[44] Alex Kulesza and Ben Taskar. Learning determinantal point processes. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI ’11, pages 419–427, Arlington, Virginia, USA, 2011. AUAI Press. ISBN 9780974903972.

[45] Alex Kulesza and Ben Taskar. k-DPPs: Fixed-size determinantal point processes. In Int. Conference on Machine Learning (ICML), pages 1193–1200, 2011.

[46] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning.
Foundations and Trends in Machine Learning, 5(2–3):123–286, 2012.

[47] John A. Kulesza. Learning with Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2012.

[48] Frédéric Lavancier, Jesper Møller, and Ege Rubak. Determinantal point process models and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4):853–877, 2015.

[49] Erich L. Lehmann and Joseph P. Romano. Testing statistical hypotheses. Springer Texts in Statistics. Springer, 2005.

[50] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast DPP sampling for Nyström with application to kernel methods. In Int. Conference on Machine Learning (ICML), 2016.

[51] Chengtao Li, Suvrit Sra, and Stefanie Jegelka. Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling. In Advances in Neural Information Processing Systems (NIPS), 2016.

[52] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Polynomial time algorithms for dual volume sampling. In Advances in Neural Information Processing Systems (NIPS), 2017.

[53] Hui Lin and Jeff Bilmes. Learning mixtures of submodular shells with application to document summarization. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI ’12, pages 479–490. AUAI Press, 2012. ISBN 9780974903989.

[54] Odile Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975.

[55] Zelda Mariet and Suvrit Sra. Fixed-point algorithms for learning determinantal point processes. In Int. Conference on Machine Learning (ICML), pages 2389–2397, 2015.

[56] Jerzy Neyman and Egon Sharpe Pearson. IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694–706):289–337, 1933.

[57] Liam Paninski.
A coincidence-based test for uniformity given very sparsely-sampled discrete data. IEEE Transactions on Information Theory, 54:4750–4755, 2008.

[58] Prasad Raghavendra, Nick Ryder, and Nikhil Srivastava. Real stability testing. In Innovations in Theoretical Computer Science, 2017.

[59] Sofya Raskhodnikova, Dana Ron, Amir Shpilka, and Adam Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM Journal on Computing, 39(3):813–842, 2009.

[60] Ronitt Rubinfeld. Taming big probability distributions. XRDS, 19(1):24–28, 2012.

[61] Ronitt Rubinfeld and Madhu Sudan. Robust characterizations of polynomials with applications to program testing. SIAM Journal on Computing, 25(2):252–271, 1996.

[62] Comandur Seshadhri and Jan Vondrák. Is submodularity testable? Algorithmica, 69(1):1–25, 2014.

[63] Jasper Snoek, Richard Zemel, and Ryan P. Adams. A determinantal point process latent variable model for inhibition in neural spiking data. In Advances in Neural Information Processing Systems (NIPS), pages 1932–1940, 2013.

[64] D.A. Spielman and N. Srivastava. Graph sparsification by effective resistances. SIAM J. Comput., 40(6):1913–1926, 2011.

[65] Terence Tao. Topics in random matrix theory, volume 132. American Mathematical Soc., 2012.

[66] Sebastian Tschiatschek, Josip Djolonga, and Andreas Krause. Learning probabilistic submodular diversity models via noise contrastive estimation. In Proc. Int. Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

[67] John Urschel, Victor-Emmanuel Brunel, Ankur Moitra, and Philippe Rigollet. Learning determinantal point processes with moments and cycles. In Int. Conference on Machine Learning (ICML), pages 3511–3520. JMLR.org, 2017.

[68] Gregory Valiant and Paul Valiant. Estimating the unseen: Improved estimators for entropy and other properties. Journal of the ACM, 64(6):37:1–37:41, 2017.

[69] Mark Wilhelm, Ajith Ramanathan, Alexander Bonomo, Sagar Jain, Ed H. Chi, and Jennifer Gillenwater.
Practical diversified recommendations on YouTube with determinantal point processes. In ACM International Conference on Information and Knowledge Management (CIKM), 2018.

[70] Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.

[71] Yihong Wu and Pengkun Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics, 47(2):857–883, 2019.