# Online MAP Inference of Determinantal Point Processes

Aditya Bhaskara, School of Computing, University of Utah, bhaskaraaditya@gmail.com
Amin Karbasi, School of Engineering & Applied Science, Yale University, amin.karbasi@yale.edu
Silvio Lattanzi, Google Research Zürich, silviol@google.com
Morteza Zadimoghaddam, Google Research Cambridge, zadim@google.com

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

In this paper, we provide an efficient approximation algorithm for finding the most likely configuration (MAP) of size $k$ for Determinantal Point Processes (DPP) in the online setting, where the data points arrive in an arbitrary order and the algorithm cannot discard the selected elements from its local memory. Given a tolerance for additive error $\epsilon$, our ONLINE-DPP algorithm achieves a $k^{O(k)}$ multiplicative approximation guarantee with an additive error $\epsilon$, using a memory footprint independent of the size of the data stream. We note that the exponential dependence on $k$ in the approximation factor is unavoidable even in the offline setting. Our result readily implies a streaming algorithm with an improved memory bound compared to existing results.

1 Introduction

Probabilistic modeling of data, along with complex inference techniques, has become an important ingredient of the modern machine learning toolbox. To introduce structure in such models, such as diversity, sparsity, or non-iid samples, while ensuring computational tractability, we typically need to provide fast sampling and computationally efficient inference. Determinantal point processes (DPP) are elegant probabilistic models of repulsion that admit such criteria, namely, efficient sampling, marginalization, and conditioning [Kulesza and Taskar, 2012a,b]. They were first introduced by [Macchi, 1975] in quantum physics to model negative interactions among particles. In recent years, DPPs have found numerous applications in machine learning that rely on diverse subset selection, such as different forms of data summarization [Mirzasoleiman et al., 2013, Kulesza and Taskar, 2012b, Feldman et al., 2018, Gong et al., 2014], multi-label classification [Xie et al., 2017], and recommender systems [Lee et al., 2017, Qin and Zhu, 2013], to name a few.

To define a DPP more precisely, let us assume that we have a set of $n$ vectors $\mathcal{V} = \{v_1, v_2, \ldots, v_n\}$, each of dimension $d$. Define $V = [v_1; v_2; \ldots; v_n]$ to be a matrix of size $n \times d$ (row $i$ consists of vector $v_i$) and construct a positive semi-definite kernel $L = VV^T$. The entry $(i, j)$ of the kernel $L$ encodes the similarity between two vectors $v_i$ and $v_j$ as the inner product between them, namely, $L_{i,j} = \langle v_i, v_j \rangle$. The essential characteristic of a DPP is that the inclusion of an item makes the inclusion of similar items less likely. More specifically, a DPP assigns the probability of sampling a set of vectors indexed by $S \subseteq \mathcal{V}$ as follows:

$$\Pr(S) \propto \det(V_S V_S^T) = \mathrm{vol}^2(S), \qquad (1)$$

where $V_S$ is the submatrix of $V$ whose rows are the vectors in $S$ and $\mathrm{vol}(S)$ denotes the volume of the parallelepiped formed by the vectors in $S$. Note that when $v_i$ and $v_j$ are similar (under the inner product similarity measure), their vectors are relatively non-orthogonal, and therefore, sets that include both of them have smaller volume. This way, DPPs incentivize diversity and encode negative correlations among elements. Note that the normalization factor in (1) can be computed explicitly and is equal to $\det(L + I)$.
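As a concrete illustration of (1), the unnormalized probability of a subset is just the Gram determinant of its rows. The following is our own minimal sketch (the function names are ours, not from the paper):

```python
import numpy as np

def squared_volume(V_S):
    """det(V_S V_S^T) = vol^2 of the parallelepiped spanned by the rows of V_S."""
    return float(np.linalg.det(V_S @ V_S.T))

def dpp_probability(V, S):
    """DPP probability of the index set S; the normalizer is det(L + I)."""
    L = V @ V.T                              # kernel L_{i,j} = <v_i, v_j>
    Z = np.linalg.det(L + np.eye(len(V)))
    return squared_volume(V[S]) / Z

# Two nearly parallel rows yield a much smaller probability than two orthogonal ones.
V = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
print(dpp_probability(V, [0, 1]))   # small: rows 0 and 1 are similar
print(dpp_probability(V, [0, 2]))   # larger: rows 0 and 2 are orthogonal
```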
A fundamental problem associated with any probabilistic model, and in particular a DPP, is to find the most likely, or maximum a posteriori (MAP), configuration [Gillenwater et al., 2012]:

$$\max_{S \in \mathcal{I}} \Pr(S) = \mathrm{vol}^2(S), \qquad (2)$$

where $\mathcal{I}$ indicates some feasibility constraints, most commonly a size constraint $|S| \le k$. We denote the optimum value of problem (2) by OPT. It is known that finding a solution to (2) is NP-hard, even if we tolerate an approximation factor of $c^k$ for some $c > 1$ [Summa et al., 2015]. Using elegant optimization techniques, the works of [Nikolov, 2015, Nikolov and Singh, 2016, Ebrahimi et al., 2017] and others have developed algorithms with an approximation ratio close to $e^k$, nearly matching the lower bounds. However, all of these algorithms require the entire dataset to be in memory, and have impractical running times.

Motivated by large-scale applications, recent work [Indyk et al., 2018, Mahabadi et al., 2019] has studied algorithms in sublinear models such as data streaming. Here, data points arrive one after another and the algorithm needs to maintain a near-optimal solution, while (a) using only a sublinear amount of memory and (b) having small per-point processing time. Very recently, using previous work on coresets, [Mahabadi et al., 2020] showed that for any $\epsilon > 0$, there exists a streaming algorithm that uses memory $O(\tfrac{1}{\epsilon}\, n^{\epsilon} k d)$, while achieving an approximation factor of $O(k)^{(k/\epsilon)}$ for the objective (2). This yields a trade-off between the space complexity and the approximation factor.

The focus of our work is the online model, which is even more restrictive than streaming. As in streaming, our goal is to process points quickly as they arrive while maintaining small space. Additionally, we have the restriction that once a point is added to memory, it remains in memory for the remainder of the algorithm. The online model is preferred in applications where we need to maintain a solution at every point in time and where we want the solution to be consistent and stable, i.e., changes to the solution occur rarely and only when really necessary [Cohen-Addad et al., 2019, Jaghargh et al., 2019, Lattanzi and Vassilvitskii, 2017, Bhaskara et al., 2019].

Our goal is to develop algorithms whose space and time complexity per vector are independent of $n$, while achieving an approximation ratio competitive with the best known streaming algorithms. In general, this is impossible: consider a simple example where we receive groups of $k$ vectors, with each group consisting of orthogonal vectors of geometrically increasing length (factor $\beta$). Since we do not know the length of the stream, any algorithm with an approximation factor better than $\beta^k$ must select all the points.¹ Any competitive algorithm must thus make some (possibly mild) assumptions on the full dataset. Our contributions are as follows:

- A simple online variant of local search (denoted ONLINE-LS). We prove that this algorithm achieves an approximation ratio of $k^{O(k)}$, matching the best known streaming algorithms. However, the space complexity depends (logarithmically) on $\Lambda$, an appropriately defined condition number parameter. (See Theorem 3.1.)
- (Main contribution) An algorithm ONLINE-DPP that avoids the condition number dependence, but incurs a small additive error. Given a parameter $\epsilon$, the algorithm keeps only $\mathrm{poly}(k, \log(1/\epsilon))$ vectors, while incurring a multiplicative approximation factor of $(k + \log(1/\epsilon))^{O(k)}$ along with an additive error $\epsilon$. (See Theorem 3.2.)
- Experiments demonstrating the efficiency of online methods for MAP inference.
We demonstrate how even the simple algorithm ONLINE-LS finds solutions that compete favorably with offline algorithms (that store the entire dataset in memory). We also bound the amount of recourse in our algorithms, i.e., the number of times the solution changes through the course of the stream. This number is independent of $n$ in both algorithms.

¹The argument needs slightly more care to be made formal; see [Mahabadi et al., 2020] for $\Omega(n)$ space lower bounds in the streaming setting.

1.1 Related work

The problem of finding the MAP configuration of a DPP has been studied extensively, and has also been referred to by other names such as sub-determinant maximization and D-optimal experiment design. Much of the work can be divided into a few strands, which we discuss briefly.

Submodular maximization. It is known that the set function $f(S) = \log \det(V_S V_S^T)$ is a submodular function. Therefore, a series of previous works applied submodular maximization algorithms such as greedy addition [Chen et al., 2018, Badanidiyuru et al., 2014, Kulesza and Taskar, 2012b], soft-max relaxation [Gillenwater et al., 2012], and the multi-linear extension [Hassani et al., 2019] in order to find the MAP configuration. Interesting results are also known for maximizing determinantal functions with approximate submodularity [Bian et al., 2017] and for maximizing submodular functions whose value can be negative [Harshaw et al., 2019]. However, there are two drawbacks with such methods. First, even though $f(S)$ is submodular, it might be negative, i.e., when $\det(V_S V_S^T) < 1$. Almost all submodular maximization algorithms assume non-negativity of the objective function in order to provide a constant factor approximation guarantee [Buchbinder and Feldman, 2017]. Second, an $\alpha$-approximation guarantee for a non-negative and non-monotone submodular function $f$ may only provide an $\mathrm{OPT}^{1-\alpha}$ approximation guarantee for problem (2), where OPT is the volume of the optimum solution. Such approximation guarantees may be much worse than multiplicative guarantees.

Coresets. Given a large dataset, a coreset is a subset with the property that solving an optimization problem on the subset yields a good solution to the problem on the full dataset [Agarwal et al., 2005]. It is a powerful paradigm in the design of sublinear algorithms [Indyk et al., 2014, Mirrokni and Zadimoghaddam, 2015]; for problem (2), Indyk et al. [2018] showed the construction of coresets of size $O(k \log k)$ that yield an approximation guarantee of $O(k)^k$. A more practical construction was proposed in Mahabadi et al. [2019], using simple greedy and local search coreset algorithms, with slightly weaker guarantees. A standard application of such coresets in the online setting implies a $k^{O(k)}$ approximation guarantee at the cost of having the space complexity depend on $\sqrt{n}$. Using similar ideas, [Mahabadi et al., 2020] very recently gave a one-pass streaming algorithm for (2) with a trade-off between the space complexity and approximation, as discussed earlier. Our results yield bounds independent of $n$, even in the stronger online model.

In parallel to MAP inference, there has been a surge of interest in fast and efficient sampling from submodular point processes [Rebeschini and Karbasi, 2015, Gotovos et al., 2015]. In particular, spectral algorithms have been recently developed for generating samples from DPPs [Derezinski et al., 2019] and more generally from strongly Rayleigh measures [Anari et al., 2016].
Such techniques have also been generalized to other negative dependence probabilistic models and more complex constraints [Li et al., 2015, Celis et al., 2016, Mariet et al., 2018].

2 Preliminaries

We consider a stream of $d$-dimensional vectors $\mathcal{V}$ whose elements $v_1, v_2, \ldots, v_n$ arrive in order. Unless otherwise specified, the algorithm is not aware of $n$, the size of the data stream. Throughout the paper, $k$ will always denote the number of vectors we wish to output as the solution. Following Mahabadi et al. [2019], we sometimes refer to a set of vectors in $\mathbb{R}^d$ as a point set and we call the elements either points or vectors.

In this paper, we are interested in solving problem (2) subject to a cardinality constraint of $k$. Let $S \subseteq \mathbb{R}^d$ be a set of vectors of size $k$. For any set of points $P$, we use $\mathrm{OPT}_k(P)$ to denote

$$\mathrm{OPT}_k(P) = \max_{S \subseteq P,\, |S| = k} \det(V_S V_S^T) = (\mathrm{vol}(S))^2,$$

where as before $V_S$ is the $k \times d$ matrix whose rows are the vectors of $S$. When $k$ is clear from the context, we sometimes drop the subscript and write $\mathrm{OPT}(P)$.

For a set of vectors $S$, $\mathrm{span}(S)$ denotes the linear subspace spanned by $S$. We denote the set of all $k$-dimensional linear subspaces of $\mathbb{R}^d$ by $\mathcal{H}_k$. Finally, for a point $p$ and a subspace $H$, we use $d(p, H)$ to denote the (closest) Euclidean distance of $p$ from $H$. For a set of points $P$, the directional height (as defined in [Mahabadi et al., 2019]) of $P$ with respect to $H$ is defined as $\max_{p \in P} d(p, H)$. A core-set for the $k$-directional height is defined as follows.

Definition 2.1 (Mahabadi et al. [2019]; core-set for $k$-directional height). Given a point set $P$, a subset $C \subseteq P$ is called an $\alpha$-approximate core-set for $k$-directional height if for all $H \in \mathcal{H}_{k-1}$,
$$\max_{p \in C} d(p, H) \ge \frac{1}{\alpha} \max_{p \in P} d(p, H).$$
We denote an $\alpha$-approximate coreset of a set $P$ by $c(P)$ (clearly, $P$ can have multiple core-sets).

3 Online volume maximization

We now present our algorithmic results for online DPP. The first algorithm (Section 3.1) is very simple to implement, but suffers a dependence on an appropriately defined condition number. We then (Section 3.2) give an algorithm that avoids this dependence, but has an additive error in its guarantee.

3.1 Online addition with a stash

We use the following notation for stating the result. Let $\mathrm{vol}_{\mathrm{first}}(\mathcal{V})$ be the volume of the first parallelepiped with non-zero volume that can be formed in the stream (e.g., if the first $k$ vectors in the stream are linearly independent, then $\mathrm{vol}_{\mathrm{first}} = \mathrm{vol}(v_1, v_2, \ldots, v_k)$). We give a simple algorithm and prove the following.

Theorem 3.1. Let $\mathcal{V} = \{v_1, v_2, \ldots\}$ be a stream of vectors that arrive in an online model. Algorithm 1 (ONLINE-LS) provides the following guarantees:

1. At any point, it returns a solution with a $k^{O(k)}$ approximation to the $k$-volume maximization problem so far.
2. At any point in time, the algorithm keeps in memory at most $O(k + \log \Lambda)$ vectors, where $\Lambda$ is the ratio $\mathrm{OPT}_k(\mathcal{V})/\mathrm{vol}_{\mathrm{first}}(\mathcal{V})$. Also, the solution is selected from within those vectors. Furthermore, vectors are never deleted from the algorithm's memory.
3. The running time of each step is $O((k + k \log^2 \Lambda)\, T_{\mathrm{vol}}(k))$, where $T_{\mathrm{vol}}(k)$ is the time it takes to compute the volume of a set of $k$ columns.² The amortized running time of each step is $O((k + \tfrac{k \log^2 \Lambda}{n})\, T_{\mathrm{vol}}(k))$, where $n$ is the total number of columns in the whole stream. For $n \ge \log^2 \Lambda$, the amortized computation in each step is only $O(k)$ volume computations.
4. The total amount of recourse (the number of changes to the solution during the entire stream) is independent of $n$: every accepted swap increases $\mathrm{vol}(S)$ by a factor of at least $(1+\epsilon)$, so the number of swaps is bounded in terms of $k$ and $\Lambda$ alone.

²By considering the vectors in order and using the base-times-height formula, the volume can be computed using $O(k^2)$ inner product computations, which is $O(k^2 d)$ time.
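To make $T_{\mathrm{vol}}(k)$ and the footnote above concrete, here is one way to compute $\mathrm{vol}(S)$ by the base-times-height formula. This is our own sketch (function name ours); it uses $O(k^2)$ inner products as described:

```python
import numpy as np

def volume(vectors):
    """vol(S) via base-times-height: the product, over the vectors in order, of the
    distance of each vector to the span of the previous ones (Gram-Schmidt)."""
    basis = []                              # orthonormal basis of the span of the prefix
    vol = 1.0
    for v in vectors:
        r = v.astype(float).copy()
        for b in basis:                     # O(k) inner products per vector -> O(k^2) total
            r -= np.dot(r, b) * b
        height = np.linalg.norm(r)
        vol *= height                       # base-times-height update
        if height > 1e-12:
            basis.append(r / height)
    return vol

# vol^2 agrees with the Gram determinant det(V_S V_S^T).
S = [np.array([1.0, 0.0, 0.0]), np.array([0.5, 0.5, 0.0]), np.array([0.1, 0.2, 0.9])]
V_S = np.stack(S)
assert np.isclose(volume(S) ** 2, np.linalg.det(V_S @ V_S.T))
```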
The dependence on $\Lambda$ is undesirable because $\mathrm{vol}_{\mathrm{first}}$ can be arbitrarily small (all it takes is one set of badly conditioned vectors). We show how to overcome this issue in Section 3.2.

Outline. Algorithm 1 maintains a candidate solution $S$. Upon seeing a new column $v_t$, if we can replace some $v_i \in S$ with $v_t$ and increase $\mathrm{vol}(S)$ by a factor $(1 + \epsilon)$, we perform the swap. But instead of throwing $v_i$ away, we store it in a stash $T$. Every time we perform a swap and update the solution $S$, we try to use the columns in the stash to improve the solution $S$. Formally, we check whether swapping any column $v_\ell \in S$ with some $v_m \in T$ improves the volume of $S$ by a factor of $1 + \epsilon$. If such a pair exists, we perform the swap (and move $v_\ell$ to the stash). We stop when no such swap is possible.

Why a stash? It is natural to ask why we need to keep the removed vectors in a stash; can they later become useful to the optimal solution? It turns out that this is the case. In the supplement, Section E, we show a simple example with $k = 2$ where no bounded approximation factor is possible if vectors are removed permanently. Very roughly, the vectors in this example, $v_1, v_2, \ldots, v_r$, make slowly increasing angles with the $x$ axis, and the lengths increase slightly in every step. This makes local search (without a stash) always maintain the two most recent vectors as the solution. In the end, the optimal solution turns out to be $\{v_1, v_r\}$, as they have the largest angle. Making this formal requires some more work and is deferred to Section E.

Our proof of Theorem 3.1 combines ideas from [Mahabadi et al., 2019] with properties of our stash-based algorithm. The details are deferred to the supplement, Section A.

Algorithm 1 Local Search with Stash (ONLINE-LS)
1: Input: A sequence of arriving columns $v_t$, and a constant $\epsilon$ (set to 1 for simplicity)
2: Output: A subset $S$ of $k$ columns at any time step
3: $S, T \leftarrow \emptyset$
4: for each newly arrived column $v_t$ do
5:   if $|S| < k$ and $v_t \notin \mathrm{span}(S)$ then
6:     Add $v_t$ to $S$
7:   else
8:     if $\exists\, v_i \in S : \mathrm{vol}(S \cup \{v_t\} \setminus \{v_i\}) > (1 + \epsilon)\,\mathrm{vol}(S)$ then
9:       $S \leftarrow S \cup \{v_t\} \setminus \{v_i\}$
10:      Add $v_i$ to $T$
11:      while $\exists\, v_\ell \in S,\ v_m \in T : \mathrm{vol}(S \cup \{v_m\} \setminus \{v_\ell\}) > (1 + \epsilon)\,\mathrm{vol}(S)$ do
12:        $S \leftarrow S \cup \{v_m\} \setminus \{v_\ell\}$
13:        Add $v_\ell$ to $T$
14: Return $S$

The approximation bound of $k^{\Theta(k)}$ turns out to be optimal for the algorithm. We thank the anonymous reviewer for providing the following example: suppose the stream consists of $e_1, e_2, \ldots, e_k$ (the standard basis vectors), followed by $h_1, h_2, \ldots, h_k$, where the $h_i$ are the vectors of the Hadamard basis (scaled so that the entries are $\pm 1$). Then the algorithm will not add any of the $h_i$ to the solution (because replacing one of the $e_i$ with some $h_j$ does not help). But choosing the latter $k$ vectors gives a volume of $(\sqrt{k})^k$, as opposed to the $1$ achieved by the algorithm.
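The following Python sketch mirrors Algorithm 1. It is our own simplification (it recomputes volumes from scratch rather than maintaining them incrementally, and the function names are ours), not the authors' implementation:

```python
import numpy as np

def vol(S):
    """vol(S) = sqrt(det(V_S V_S^T)), the volume of the parallelepiped spanned by S."""
    if not S:
        return 1.0
    M = np.stack(S)
    return float(np.sqrt(max(np.linalg.det(M @ M.T), 0.0)))

def online_ls(stream, k, eps=1.0):
    """Local Search with Stash (sketch of Algorithm 1): after each arriving vector,
    yield the current solution S; T is the stash of swapped-out vectors."""
    S, T = [], []
    for v in stream:
        if len(S) < k and vol(S + [v]) > 1e-12:          # v is (numerically) outside span(S)
            S.append(v)
        else:
            # swap v in if some swap improves the volume by a (1 + eps) factor
            gains = [vol(S[:i] + S[i + 1:] + [v]) for i in range(len(S))]
            if gains and max(gains) > (1 + eps) * vol(S):
                i = int(np.argmax(gains))
                T.append(S[i])
                S = S[:i] + S[i + 1:] + [v]
                # keep swapping stashed vectors back in while a (1 + eps)-improving swap exists
                improved = True
                while improved:
                    improved = False
                    for m, u in enumerate(T):
                        for j in range(len(S)):
                            candidate = S[:j] + S[j + 1:] + [u]
                            if vol(candidate) > (1 + eps) * vol(S):
                                T[m] = S[j]              # displaced vector goes to the stash
                                S = candidate
                                improved = True
                                break
                        if improved:
                            break
        yield list(S)
```

With $\epsilon = 1$ (as in the pseudocode), a swap is accepted only when it doubles the volume, which is what keeps the number of swaps, and hence the recourse, small.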
3.2 Additive error algorithm

Normalization. For simplicity of exposition, we assume below that all the columns in $\mathcal{V}$ have $\|v_i\| \le 1$. This assumption can be removed at the expense of replacing $\log(1/\epsilon)$ in our space and approximation bounds (in Theorem 3.2) with $\log(\max_i \|v_i\| / \epsilon)$. The algorithm is required to know an upper bound on $\log(\max_i \|v_i\| / \epsilon)$. Note that since it is a logarithmic term, even a very coarse approximation to $\max_i \|v_i\|$ suffices. Details can be found in the supplement, Section C.

The main result is the following:

Theorem 3.2. Given a parameter $\epsilon > 0$, Algorithm 2 (ONLINE-DPP) has the following guarantees:

1. At any point, the algorithm returns a solution $S$ that satisfies $(k + \log(1/\epsilon))^{O(k)}\,\mathrm{vol}(S) + \epsilon \ge \mathrm{OPT}(\mathcal{V})$.
2. The space complexity is $O\!\left(k^3 \log^2 k + k \log^2 \tfrac{1}{\epsilon}\right)$ vectors.
3. The running time of each step is $O(L \cdot T_{\mathrm{ridge}}(kL) + k^3 \log k \cdot L^2 \cdot T_{\mathrm{vol}}(k))$. The parameter $L = O(\log(1/\epsilon) + k \log k)$ is defined as $\log(1/\gamma)$ (using (3)). The function $T_{\mathrm{ridge}}(kL)$ is the time to compute the ridge leverage score when we currently have $kL$ columns. The function $T_{\mathrm{vol}}(k)$ is the time it takes to compute the volume of a parallelepiped formed by $k$ vectors.
4. The total amount of recourse (the number of changes to the solution $S$ during the entire stream) is $O(\log(1/\epsilon) + k \log k + k \log\log(1/\epsilon))$.

Remark. The space complexity is measured in terms of the number of vectors. By definition, each vector takes at most $d$ words, but it can be much less if the columns are sparse. Further, as noted in Mahabadi et al. [2020], volume maximization can also be performed after applying a random projection in certain parameter regimes. Also, we note that $T_{\mathrm{ridge}}$ has been extensively studied: the ridge leverage score can be computed in time nearly linear in the number of non-zeroes of the vectors in $S_\delta$ (see [Cohen et al., 2017, Alaoui and Mahoney, 2015] and references therein).³

³To compute the volume of $k$ columns, $T_{\mathrm{vol}}(k)$, one considers the columns in an arbitrary order and finds the distance of each column to the span of the previous ones. This can be done by maintaining an orthonormal basis of the prefixes of this sequence of columns. So for the $i$-th column ($2 \le i \le k$), its inner products with the previous $i - 1$ basis vectors are needed, and the volume computation reduces to $O(k^2)$ inner product computations.

The following definition extends Definition 2.1 to incorporate an additive term.

Definition 3.3. For a set $\mathcal{V}$ of vectors, we say that $S$ is an $(\alpha, \gamma)$-core-set for $k$-directional height if for every subspace $H$ of $\mathbb{R}^d$ with dimension $k - 1$, we have
$$\max_{u \in S} d(u, H) \ge \frac{1}{\alpha} \max_{u \in \mathcal{V}} d(u, H), \quad \text{whenever} \quad \max_{u \in \mathcal{V}} d(u, H) \ge \gamma.$$

Thus the definition does not yield any guarantee when the maximum distance to a subspace is $< \gamma$. We also define a parameter $\gamma$ and a set $D$ that we use throughout:
$$\gamma = \frac{\epsilon}{(64k)^{2k}}; \qquad D = \{2^i \gamma : i = -1, 0, 1, \ldots, \text{ and } 2^i \gamma \le 1\}. \qquad (3)$$

Algorithm 2 Online Coreset for Additive Error Approximation (ONLINE-DPP)
1: Input: An additive error parameter $\epsilon$, and a sequence of vectors $\mathcal{V} = v_1, v_2, \ldots$
2: Output: Core-set $S$
3: Define $\gamma$ using (3)
4: For all $\delta \in D$, set $S_\delta = \emptyset$ and $M_\delta = \delta^2 I_d$
5: for each newly arrived vector $v_t$ do
6:   for each $\delta \in D$ do
7:     if $v_t^T (M_\delta)^{-1} v_t > 2$ and $|S_\delta| < 4k \log(4/\delta)$ then
8:       Add $v_t$ to $S_\delta$
9:       Update $M_\delta \leftarrow M_\delta + v_t v_t^T$
10: Return $S = \cup_{\delta \in D} S_\delta$

Algorithm outline. The key objects in our algorithm are the sets $S_\delta$, one for every $\delta \in D$, and the corresponding matrices $M_\delta$. These sets are all initialized to be empty. Once a vector $v_t$ arrives, for every $\delta$, the algorithm checks whether the following condition holds: $v_t^T (M_\delta)^{-1} v_t > 2$. We refer to this as the $M_\delta$ condition (the left-hand side is known as the ridge leverage score [Cohen et al., 2017]). If so, it adds $v_t$ to $S_\delta$. This is done as long as $|S_\delta|$ is small. Once $|S_\delta| \ge 4k \log(4/\delta)$, no more elements are added to $S_\delta$ (this check is done independently for every $\delta$).
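A minimal sketch of this core-set maintenance loop, using a naive linear solve in place of a fast ridge-leverage-score routine (the function name and the explicit `gamma` argument are ours; the paper sets $\gamma$ via (3)):

```python
import numpy as np

def online_dpp_coreset(stream, k, gamma, d):
    """Sketch of the core-set maintenance of Algorithm 2 (ONLINE-DPP).
    A vector v joins S_delta while |S_delta| < 4k*log(4/delta) and its ridge leverage
    score v^T (M_delta)^{-1} v exceeds 2, where M_delta = delta^2 I + sum of u u^T over S_delta."""
    # D = {2^i * gamma : i = -1, 0, 1, ... and 2^i * gamma <= 1}
    deltas = []
    delta = gamma / 2
    while delta <= 1:
        deltas.append(delta)
        delta *= 2
    S = {dl: [] for dl in deltas}
    M = {dl: dl ** 2 * np.eye(d) for dl in deltas}        # M_delta = delta^2 I_d
    for v in stream:
        for dl in deltas:
            if len(S[dl]) < 4 * k * np.log(4 / dl) and v @ np.linalg.solve(M[dl], v) > 2:
                S[dl].append(v)                           # add v_t to S_delta
                M[dl] += np.outer(v, v)                   # M_delta <- M_delta + v_t v_t^T
        yield [u for dl in deltas for u in S[dl]]         # core-set S = union of the S_delta
```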
Algorithm 2 shows how to maintain a core-set $S$ which is guaranteed to contain an approximately optimal solution. Since our overall goal is to maintain a near-optimal solution at every time step, we also keep a solution $\hat{S} \subseteq S$. Every time we add an element to $S$, we check whether $\hat{S}$ should be updated. For this, we can use any of the volume maximization algorithms in the literature. For concreteness, we use the local search algorithm of [Mahabadi et al., 2019]. Every time we add a vector to the core-set $S$, we run the local search algorithm and let $S'$ be its solution. If $\mathrm{vol}(S') \ge 2\,\mathrm{vol}(\hat{S})$, we set $\hat{S}$ to be $S'$. Based on the approximation guarantee of local search [Mahabadi et al., 2019],
$$\mathrm{OPT}(S) \le (3k)^{2k}\,\mathrm{vol}(S') = k^{O(k)}\,\mathrm{vol}(S'). \qquad (4)$$
Outputting $\hat{S}$ as the current solution provides all the guarantees of Theorem 3.2, except that it may change in too many time steps. To upper bound the recourse (the number of updates to the solution), we output the first $k$ vectors in the stream as the current solution until $\mathrm{vol}(S')$ reaches $\epsilon/((3k)^{2k} \alpha^{k})$, where $\alpha$ is the core-set factor from Theorem 3.4. After this point, we output $\hat{S}$ as the solution.

Space complexity. By limiting the size of each $S_\delta$ to $4k \log(4/\delta)$, we obtain a simple bound on the overall space. The number of choices of $\delta$ is $O(\log(1/\gamma))$, and each $\delta \ge \gamma/2$. Thus the space complexity is at most
$$O(\log(1/\gamma)) \cdot 4k \log(8/\gamma) = O\!\left(k \log^2(1/\gamma)\right). \qquad (5)$$

The main technical result is the following:

Theorem 3.4. The set $S = \cup_{\delta \in D} S_\delta$ returned by the algorithm is an $(\alpha, \gamma)$-core-set for $k$-directional height, where $\alpha = 8\sqrt{k \log(8/\gamma)}$.

Outline of the proof of Theorem 3.2. The space and time complexity analyses are relatively straightforward, using the bounds on the $S_\delta$ along with (5). The tricky part is to show the approximation guarantee. Theorem 3.4 shows that $S = \cup_\delta S_\delta$ is an $(\alpha, \gamma)$-core-set for $k$-directional height. So a natural approach is to start with the true optimum solution $\{u_1, u_2, \ldots, u_k\}$ and show iteratively that there exist replacements in $S$. This would be easy if we had core-sets where the additive error $\gamma$ (see Definition 3.3) is zero. When $\gamma \ne 0$, we need to show that the directional height in each step is bounded from below. This requires a more careful inductive argument, which crucially uses the choice of the parameter $\gamma$ from (3). The full details of the proof are deferred to the supplement, Section B.1.

The main step now is to prove Theorem 3.4. We do this by showing structural properties of the sets $S_\delta$. The following packing lemma plays a key role in the argument.

Lemma 3.5. Suppose $v_1, v_2, \ldots, v_m \in \mathbb{R}^q$ are vectors of length at most 1. Let $V_i$ denote the matrix with columns $v_1, v_2, \ldots, v_i$. Suppose that the vectors satisfy $v_{i+1}^T (\delta^2 I_q + V_i V_i^T)^{-1} v_{i+1} > 1$ for all $i$. Then we have $m \le 4q \log(2/\delta)$.

The proof is an application of the matrix determinant lemma and Cauchy–Schwarz; it is deferred to the supplement (Section B.1.1). The next lemma gives a clean interpretation of the $M_\delta$ condition.

Lemma 3.6. Let $U$ be a matrix with columns $u_1, u_2, \ldots, u_m \in \mathbb{R}^d$, and let $\delta > 0$. For any $v \in \mathbb{R}^d$, we have the following:
1. For any $c > 0$, if $v^T (\delta^2 I_d + UU^T)^{-1} v \le c$, then we can write $v = Ux + \delta z$, where $\|x\|^2 + \|z\|^2 \le c$.
2. Suppose $v^T (\delta^2 I_d + UU^T)^{-1} v > 2$. Then we cannot express $v = Ux + z'$ for any choice of $x, z'$ that satisfy $\|x\| \le 1$ and $\|z'\| \le \delta$.

Once again, we defer the proof to the supplement (Section B.2). We next outline our main result.

Outline of proof of Theorem 3.4. We need to show that for any $v_i \in \mathcal{V}$ and any $(k-1)$-dimensional subspace $H$ of $\mathbb{R}^d$, if $d(v_i, H) \ge \gamma$, then there exists a $v \in S$ such that $d(v, H) \ge (1/\alpha)\, d(v_i, H)$. Let us fix any such $v_i$ and subspace $H$. We show that, in fact, some element of $S_\delta$ satisfies this property, where $\delta$ is the unique value in $D$ satisfying $2\delta \le d(v_i, H) < 4\delta$. Note that by our choice of parameters, there is always exactly one such $\delta \in D$. In what follows, let us fix this value of $\delta$, and let $\alpha = 8\sqrt{k \log(8/\gamma)}$ be the core-set approximation factor we are aiming to prove in Theorem 3.4.
Since all values in $D$, including $\delta$, are at least $\gamma/2$, we also have $\log(4/\delta) \le \log(8/\gamma)$. So it suffices to prove that there exists a $u \in S_\delta$ such that
$$d(u, H)^2 \ge \frac{\delta^2}{32 k \log(4/\delta)}.$$
Now, consider the set $S_\delta$ at the $i$-th time step of the algorithm (when we see $v_i$). If $v_i$ gets added to $S_\delta$, there is nothing to prove. Otherwise, either the $M_\delta$ condition was not met, or the size of $S_\delta$ is already at the threshold of $4k \log(4/\delta)$. Let us consider the two cases separately and show the following.

Claim 1. For $\delta$ as above, suppose $|S_\delta| < 4k \log(4/\delta)$ at the time that $v_i$ arrives, and suppose that $v_i^T (M_\delta)^{-1} v_i \le 2$. Then there exists a $u \in S_\delta$ such that $d(u, H)^2 \ge \frac{\delta^2}{4k \log(4/\delta)}$.

Claim 2. For some $\delta \in (0, 1/2)$, suppose $|S_\delta| = 4k \log(4/\delta)$. Then for any $(k-1)$-dimensional subspace $H$, there exists $u \in S_\delta$ such that $d(u, H)^2 > \frac{\delta^2}{8|S_\delta|}$.

Claim 2 is interesting as it shows that some point in $S_\delta$ has a high enough distance to every $(k-1)$-dimensional subspace. It is the main step where we use the packing lemma stated above. The full proofs of the claims are in the supplement, Section B.3. These imply that if $|S_\delta|$ was already $\ge 4k \log(4/\delta)$, it does not matter that $v_i$ was not added to $S_\delta$: for any $H$, we already have $\max_{u \in S_\delta} d(u, H) \ge (1/\alpha) \cdot 4\delta \ge (1/\alpha)\, d(v_i, H)$. This completes the proof of the theorem.

4 Experiments

In this section we compare the experimental performance of our Algorithm 1 (ONLINE-LS) with:

- ONLINEGREEDY, an online greedy algorithm that, upon processing a new row, adds it to the solution if swapping it with some row in the solution increases the volume;
- GREEDY, the classic offline greedy algorithm that performs $k$ passes over the entire dataset and in every pass adds to the current solution the row that increases the volume the most.

All our experiments have been carried out on a standard desktop computer, and all the experiments presented in this section are fully deterministic. In our experiments we consider three standard datasets: the Spambase dataset [Dua and Graff, 2017], the Statlog (or Shuttle) dataset [Dua and Graff, 2017], and the Pen-Based Recognition dataset [Dua and Graff, 2017]. All the datasets contain only integer and real values; the Spambase dataset contains 4601 instances of 57 dimensions, the Statlog dataset contains 58000 instances of 9 dimensions, and the Pen-Based Recognition dataset contains 10992 instances of 16 dimensions. In all our experiments we normalized the instances so that their norms are smaller than 1.

Quality, consistency and amount of computation. In Figure 1 we show a comparison of the three algorithms on the Spambase dataset and the Statlog dataset for $k = 8$ and $\epsilon = 0.1$.⁴ For the offline GREEDY algorithm we report only the quality of the solution, because the running times are incomparable. For ONLINE-LS and ONLINEGREEDY we report the number of volume computations as a system-independent proxy for the running time, and the number of swaps in the solution during the execution of the algorithm to capture the consistency of the algorithm. In all plots for ONLINE-LS and ONLINEGREEDY we show the evolution of the solution quality, the number of volume computations, and the number of swaps as a function of the number of columns analyzed so far. For GREEDY we show the final quality of the solution. Interestingly, both ONLINE-LS and ONLINEGREEDY recover a solution whose quality is comparable with that of the offline algorithm. Furthermore, both algorithms execute a similar number of volume computations, but ONLINE-LS executes significantly fewer swaps and is therefore more consistent than ONLINEGREEDY.
Figure 1: Performance of the algorithms on the Shuttle ((a),(b),(c)) and Spambase ((d),(e),(f)) datasets for $k = 8$ and $\epsilon = 0.1$. In the figures we report the quality of the solution ((a),(d)), the number of volume computations ((b),(e)), and the number of swaps ((c),(f)) as a function of the number of rows processed so far.

⁴Experiments for different $k$ and $\epsilon$ and for the Pen-Based Recognition dataset are similar and postponed to the supplementary material.

Dependency on $\epsilon$. We now turn our attention to the performance of our algorithm ONLINE-LS as a function of $\epsilon$. In particular, for the Shuttle dataset we report how the number of swaps and the quality of the solution change as $\epsilon$ changes (experiments on other datasets are available in the supplementary material). Interestingly, we can see in Figure 2 that both the number of swaps and the quality of the solution decrease smoothly as $\epsilon$ increases. The number of volume computations is very close for all values of $\epsilon$. Finally, we note that in our experiments we also consider a variation of our algorithm that does not use a stash $T$. Interestingly, we notice that this algorithm has performance very close to ONLINE-LS.

Figure 2: Performance of the algorithms on the Shuttle dataset for different values of $\epsilon$ and $k = 8$ (panels: solution value, number of volume computations, and number of swaps, for $\epsilon \in \{0.05, 0.1, 0.3, 0.5\}$).

5 Conclusion

In this paper, we developed online algorithms for finding the most likely configuration of size $k$ for DPPs. The algorithms process data points as they arrive, with small processing time per point and a small total memory footprint. Our main contribution, ONLINE-DPP, achieves a $k^{O(k)}$ multiplicative approximation guarantee with an additive error $\epsilon$, using memory independent of the size of the data stream.

Acknowledgments and Disclosure of Funding

Aditya Bhaskara is partially supported by NSF (CCF-2008688) and by a Google Faculty Research Award. Amin Karbasi is partially supported by NSF (IIS-1845032), ONR (N00014-19-1-2406), AFOSR (FA9550-18-1-0160), and TATA Sons Private Limited.

Broader Impact

In this paper, we aim to address a fundamental problem in many aspects of data science: how to select a representative and diverse subset of data points. An elegant and intuitive way to score the diversity of a subset is through a determinantal point process, where diversity is measured via the geometric embedding of the data points. Our paper provides a rigorous and scalable method for maximizing diversity in such probabilistic models.

References

Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan.
Geometric approximation via coresets. In Combinatorial and Computational Geometry, MSRI, pages 1–30. University Press, 2005.

Ahmed Alaoui and Michael W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 775–783. Curran Associates, Inc., 2015.

Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In Conference on Learning Theory, pages 103–115, 2016.

Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. Streaming submodular maximization: Massive data summarization on the fly. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 671–680, 2014.

Aditya Bhaskara, Silvio Lattanzi, Sergei Vassilvitskii, and Morteza Zadimoghaddam. Residual based sampling for online low rank approximation. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 1596–1614. IEEE, 2019.

Andrew An Bian, Joachim M. Buhmann, Andreas Krause, and Sebastian Tschiatschek. Guarantees for greedy maximization of non-submodular functions with applications. arXiv preprint arXiv:1703.02100, 2017.

Niv Buchbinder and Moran Feldman. Submodular functions maximization problems. Handbook of Approximation Algorithms and Metaheuristics, 1:753–788, 2017.

L. Elisa Celis, Amit Deshpande, Tarun Kathuria, Damian Straszak, and Nisheeth K. Vishnoi. On the complexity of constrained determinantal point processes. arXiv preprint arXiv:1608.00554, 2016.

Laming Chen, Guoxin Zhang, and Eric Zhou. Fast greedy MAP inference for determinantal point process to improve recommendation diversity. In Advances in Neural Information Processing Systems, 2018.

Michael B. Cohen, Cameron Musco, and Christopher Musco. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '17, pages 1758–1777, USA, 2017. Society for Industrial and Applied Mathematics.

Vincent Cohen-Addad, Niklas Oskar D. Hjuler, Nikos Parotsidis, David Saulpic, and Chris Schwiegelshohn. Fully dynamic consistent facility location. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3255–3265. Curran Associates, Inc., 2019.

Michal Derezinski, Daniele Calandriello, and Michal Valko. Exact sampling of determinantal point processes with sublinear time preprocessing. In Advances in Neural Information Processing Systems, pages 11542–11554, 2019.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

J. B. Ebrahimi, D. Straszak, and N. K. Vishnoi. Subdeterminant maximization via nonconvex relaxations and anti-concentration. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 1020–1031, October 2017. doi: 10.1109/FOCS.2017.98.

Moran Feldman, Amin Karbasi, and Ehsan Kazemi. Do less, get more: Streaming submodular maximization with subsampling. In Advances in Neural Information Processing Systems, pages 732–742, 2018.

Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Near-optimal MAP inference for determinantal point processes. In Advances in Neural Information Processing Systems, pages 2735–2743, 2012.
Boqing Gong, Wei-Lun Chao, Kristen Grauman, and Fei Sha. Diverse sequential subset selection for supervised video summarization. In Advances in Neural Information Processing Systems, pages 2069–2077, 2014.

Alkis Gotovos, Hamed Hassani, and Andreas Krause. Sampling from probabilistic submodular models. In Advances in Neural Information Processing Systems, pages 1945–1953, 2015.

Christopher Harshaw, Moran Feldman, Justin Ward, and Amin Karbasi. Submodular maximization beyond non-negativity: Guarantees, fast algorithms, and applications. arXiv preprint arXiv:1904.09354, 2019.

Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Zebang Shen. Stochastic conditional gradient++. arXiv preprint arXiv:1902.06992, 2019.

Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, and Vahab S. Mirrokni. Composable core-sets for diversity and coverage maximization. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 100–108, 2014.

Piotr Indyk, Sepideh Mahabadi, Shayan Oveis Gharan, and Alireza Rezaei. Composable core-sets for determinant maximization problems via spectral spanners. arXiv preprint arXiv:1807.11648, 2018.

Mohammad Reza Karimi Jaghargh, Andreas Krause, Silvio Lattanzi, and Sergei Vassilvitskii. Consistent online optimization: Convex and submodular. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2241–2250, 2019.

Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083, 2012a.

Alex Kulesza and Ben Taskar. Learning determinantal point processes. arXiv preprint arXiv:1202.3738, 2012b.

Silvio Lattanzi and Sergei Vassilvitskii. Consistent k-clustering. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1975–1984. JMLR.org, 2017.

Sang-Chul Lee, Sang-Wook Kim, Sunju Park, and Dong-Kyu Chae. A single-step approach to recommendation diversification. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 809–810, 2017.

Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Efficient sampling for k-determinantal point processes. arXiv preprint arXiv:1509.01618, 2015.

Odile Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 1975.

Sepideh Mahabadi, Piotr Indyk, Shayan Oveis Gharan, and Alireza Rezaei. Composable core-sets for determinant maximization: A simple near-optimal algorithm. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4254–4263, Long Beach, California, USA, 2019. PMLR. URL http://proceedings.mlr.press/v97/mahabadi19a.html.

Sepideh Mahabadi, Ilya Razenshteyn, David P. Woodruff, and Samson Zhou. Non-adaptive adaptive sampling on turnstile streams. arXiv preprint arXiv:2004.10969, 2020.

Zelda E. Mariet, Suvrit Sra, and Stefanie Jegelka. Exponentiated strongly Rayleigh distributions. In Advances in Neural Information Processing Systems, pages 4459–4469, 2018.

Vahab Mirrokni and Morteza Zadimoghaddam. Randomized composable core-sets for distributed submodular maximization. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 153–162, 2015.

Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data.
In Advances in Neural Information Processing Systems, pages 2049–2057, 2013.

Aleksandar Nikolov. Randomized rounding for the largest simplex problem. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC '15, pages 861–870, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450335362. doi: 10.1145/2746539.2746628. URL https://doi.org/10.1145/2746539.2746628.

Aleksandar Nikolov and Mohit Singh. Maximizing determinants under partition constraints. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, STOC '16, pages 192–201, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450341325. doi: 10.1145/2897518.2897649. URL https://doi.org/10.1145/2897518.2897649.

Lijing Qin and Xiaoyan Zhu. Promoting diversity in recommendation by entropy regularizer. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.

Patrick Rebeschini and Amin Karbasi. Fast mixing for discrete point processes. In Conference on Learning Theory, pages 1480–1500, 2015.

Marco Di Summa, Friedrich Eisenbrand, Yuri Faenza, and Carsten Moldenhauer. On largest volume simplices and sub-determinants. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '15, pages 315–323, USA, 2015. Society for Industrial and Applied Mathematics.

Pengtao Xie, Ruslan Salakhutdinov, Luntian Mou, and Eric P. Xing. Deep determinantal point process for large-scale multi-label classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 473–482, 2017.