Journal of Machine Learning Research 20 (2019) 1-7. Submitted 3/19; Revised 8/19; Published 10/19.

DPPy: DPP Sampling with Python

Guillaume Gautier g.gautier@inria.fr
Guillermo Polito guillermo.polito@univ-lille.fr
Rémi Bardenet remi.bardenet@gmail.com
Michal Valko valkom@deepmind.com

Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, 59651 Villeneuve d'Ascq, France
Inria Lille-Nord Europe, 40 avenue Halley, 59650 Villeneuve d'Ascq, France
DeepMind Paris, 14 Rue de Londres, 75009 Paris, France

Editor: Andreas Mueller

Abstract

Determinantal point processes (DPPs) are specific probability distributions over clouds of points that are used as models and computational tools across physics, probability, statistics, and more recently machine learning. Sampling from DPPs is a challenge, and we therefore present DPPy, a Python toolbox that gathers known exact and approximate sampling algorithms for both finite and continuous DPPs. The project is hosted on GitHub (github.com/guilgautier/DPPy) and comes with extensive documentation (dppy.readthedocs.io).

Keywords: determinantal point processes, sampling, MCMC, random matrices, Python

1. Introduction

Determinantal point processes (DPPs) are distributions over configurations of points that encode diversity through a kernel function K. They were introduced by Macchi (1975) as models for beams of fermions, and they have since found applications in fields as diverse as probability (Soshnikov, 2000; König, 2004; Hough et al., 2006), statistical physics (Pathria & Beale, 2011), Monte Carlo methods (Bardenet & Hardy, 2019), spatial statistics (Lavancier et al., 2012), and machine learning (ML, Kulesza & Taskar, 2012). In ML, DPPs are mainly used as models for diverse sets of items, as in recommendation (Kathuria et al., 2016; Gartrell et al., 2016) or text summarization (Dupuy & Bach, 2018).
Consequently, MLers mostly use finite DPPs, which are distributions over subsets of a finite ground set of cardinality M, parametrized by an M × M kernel matrix K. Routine inference tasks such as normalization, marginalization, or sampling have complexity O(M³) (Gillenwater, 2014). As for other kernel methods, when M is large, this O(M³) cost is a bottleneck.

In terms of software, the R library spatstat of Baddeley & Turner (2005), a general-purpose toolbox for spatial point processes, includes sampling and learning of continuous DPPs with stationary kernels, as described by Lavancier et al. (2012). Complementarily, we propose DPPy, a turnkey Python implementation of known general algorithms to sample finite DPPs. We also include algorithms for non-stationary continuous DPPs, e.g., related to random covariance matrices or Monte Carlo methods, which are also of interest to MLers. The DPPy project, hosted on GitHub (github.com/guilgautier/DPPy), is already being used by the cross-disciplinary DPP community (Burt et al., 2019; Kammoun, 2018; Poulson, 2019; Dereziński et al., 2019; Gautier et al., 2019). We use Travis (travis-ci.com/guilgautier/DPPy) for continuous integration and Coveralls (coveralls.io/github/guilgautier/DPPy) for test coverage. Through ReadTheDocs (dppy.readthedocs.io), we provide extensive documentation, which covers the essential mathematical background and showcases the key properties of DPPs through DPPy objects and associated methods. DPPy thus also serves as a tutorial on DPPs.

© 2019 Guillaume Gautier, Guillermo Polito, Rémi Bardenet, and Michal Valko. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/19-179.html.

2. Definitions

A point process is a random subset of points X = {X_1, …, X_N} ⊂ 𝕏, where the number of points N is itself random.
We further add to the definition that N should be almost surely finite and that all points in a sample are distinct. Given a reference measure µ on 𝕏, a point process is usually characterized by its k-correlation functions ρ_k for all k, defined by

\[
\mathbb{P}\left[\text{one point of the process in each ball } B(x_i, \mathrm{d}x_i),\ i = 1, \dots, k\right]
= \rho_k(x_1, \dots, x_k) \prod_{i=1}^{k} \mu(\mathrm{d}x_i),
\]

see Møller & Waagepetersen (2004, Section 4). The functions ρ_k describe the interaction among points in X by quantifying the co-occurrence of points at a set of locations.

A point process X on (𝕏, µ) parametrized by a kernel K : 𝕏 × 𝕏 → ℂ is said to be determinantal, denoted X ∼ DPP(K), if its k-correlation functions satisfy

\[
\rho_k(x_1, \dots, x_k) = \det\left[K(x_i, x_j)\right]_{i,j=1}^{k}, \quad \forall k \geq 1.
\]

In ML, most DPPs are in the finite setting, where 𝕏 = {1, …, M} and µ = ∑_{i=1}^{M} δ_i. In this context, the kernel function becomes an M × M matrix K, and the correlation functions refer to inclusion probabilities. DPPs are thus often defined as X ∼ DPP(K) if

\[
\mathbb{P}[S \subseteq X] = \det K_S, \quad \forall S \subseteq \mathbb{X}, \tag{1}
\]

where K_S denotes the submatrix of K formed by the rows and columns indexed by S. The kernel matrix K is commonly assumed to be real-symmetric, in which case the existence and uniqueness of the DPP in Equation 1 is equivalent to the eigenvalues of K lying in [0, 1]. The result also holds for general Hermitian kernel functions K under additional assumptions (Soshnikov, 2000, Theorem 3). We note that there are also DPPs with nonsymmetric kernels (Borodin et al., 2010; Gartrell et al., 2019).

Oftentimes, ML practitioners favor a more flexible definition of a DPP in terms of a likelihood kernel L, which only requires L ⪰ 0:

\[
\mathbb{P}[X = S] = \frac{\det L_S}{\det[I + L]}. \tag{2}
\]

This avoids defining a correlation kernel 0 ⪯ K ⪯ I. Yet, the L parametrization makes Equation 1 less interpretable, and it does not cover important cases such as fixed-size DPPs, which correspond to projection kernels K.
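The link between the two parametrizations can be made concrete in NumPy: the correlation kernel is recovered from the likelihood kernel through the standard identity K = L(I + L)⁻¹, and Equation 2 indeed defines a probability distribution over subsets. The following is an illustrative sketch on a toy kernel, not DPPy code:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy likelihood kernel L >= 0 on a ground set of size M (illustrative values).
M = 4
A = rng.standard_normal((M, M))
L = A @ A.T  # positive semidefinite by construction

# Correlation kernel through the standard identity K = L (I + L)^{-1};
# its eigenvalues lambda / (1 + lambda) automatically lie in [0, 1).
K = L @ np.linalg.inv(np.eye(M) + L)
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)
assert np.all(np.linalg.eigvalsh(K) <= 1)

# Equation (2) defines a probability distribution: the likelihoods of all
# 2^M subsets S must sum to one.
Z = np.linalg.det(np.eye(M) + L)
total = sum(
    np.linalg.det(L[np.ix_(S, S)]) / Z
    for r in range(M + 1)
    for S in map(list, itertools.combinations(range(M), r))
)
assert abs(total - 1) < 1e-6
```

The sum-to-one check exploits the identity ∑_S det L_S = det(I + L), which is exactly the normalization constant in Equation 2.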
Kulesza & Taskar (2012, Section 5) countered that with k-DPPs, which can be understood as DPPs parametrized by a likelihood kernel, further conditioned to have exactly k elements. However, in general, k-DPPs are not DPPs.

The main interest of DPPs in ML is that they model diversity while remaining tractable. Compared to independent sampling with the same marginals, Equation 1 entails

\[
\mathbb{P}[\{i, j\} \subseteq X] = K_{ii} K_{jj} - K_{ij} K_{ji}
= \mathbb{P}[\{i\} \subseteq X]\, \mathbb{P}[\{j\} \subseteq X] - K_{ij}^2. \tag{3}
\]

In particular, the larger K_ij², the less likely items i and j are to co-occur. If K_ij models the similarity between items i and j, DPPs are thus random diverse sets of elements. Most point processes that encode diversity are not tractable, in the sense that efficient algorithms to sample, marginalize, or compute normalization constants are not available. However, DPPs are amenable to these tasks with polynomial complexity (Gillenwater, 2014). Next, we focus on the sampling task, which is the core of DPPy.

3. Sampling determinantal point processes

We henceforth assume that K is real-symmetric and satisfies suitable conditions (Soshnikov, 2000, Theorem 3), so that its spectral decomposition is available:

\[
K(x, y) = \sum_{i \geq 1} \lambda_i \phi_i(x) \phi_i(y),
\quad \text{with} \quad
\int_{\mathbb{X}} \phi_i(x) \phi_j(x)\, \mu(\mathrm{d}x) = \delta_{ij}.
\]

Note that, in the finite case, the spectral theorem is enough to eigendecompose K. Hough et al. (2006, Theorem 7) proved that sampling DPP(K) can be done in two steps:

1. draw B_i ∼ Ber(λ_i) independently and denote {i_1, …, i_N} = {i : B_i = 1},
2. sample from the DPP with kernel K̃(x, y) = ∑_{n=1}^{N} φ_{i_n}(x) φ_{i_n}(y).

In other words, all DPPs are mixtures of projection DPPs, i.e., DPPs parametrized by an orthogonal projection kernel. In a nutshell, Step 1 selects a component of the mixture and Step 2 generates a sample of the projection DPP(K̃). Hough et al. (2006, Algorithm 18) provide a generic projection DPP sampler, which we briefly describe. First, the projection DPP with kernel K̃ has exactly N = rank K̃ points, µ-almost surely.
Then, to each x ∈ 𝕏, associate the feature vector Φ(x) ≜ (φ_{i_1}(x), …, φ_{i_N}(x)), so that K̃(x, y) = Φ(x)ᵀΦ(y). The chain rule is applied to sample (X_1, …, X_N) with probability distribution

\[
\frac{1}{N!} \det\left[\tilde{K}(x_p, x_n)\right]_{p,n=1}^{N} \prod_{n=1}^{N} \mu(\mathrm{d}x_n)
= \frac{\|\Phi(x_1)\|^2}{N}\, \mu(\mathrm{d}x_1)
\prod_{n=2}^{N} \frac{\operatorname{distance}^2\!\left(\Phi(x_n), \operatorname{span}\{\Phi(x_p)\}_{p=1}^{n-1}\right)}{N - (n - 1)}\, \mu(\mathrm{d}x_n), \tag{4}
\]

and the sequential aspect can then be discarded to get a valid sample {X_1, …, X_N} ∼ DPP(K̃).

A few remarks are in order. First, the LHS of Equation 4 defines an exchangeable probability distribution. Second, the successive ratios that appear in the RHS are the normalized conditional densities (w.r.t. µ) that drive the chain rule, and the associated normalizing constants are independent of the previous points. The numerators can be written as ratios of two determinants and further expanded with Woodbury's formula; they can be identified as the incremental posterior variances in Gaussian process regression with kernel K̃ (Rasmussen & Williams, 2006, Equation 2.26). Third, the chain rule expressed in Equation 4 has a strong Gram-Schmidt flavor, since it actually comes from a recursive application of the base × height formula. In the end, DPPs favor configurations of points whose feature vectors Φ(x_1), …, Φ(x_N) span a large volume, which is another way of understanding repulsiveness.

The previous sampling scheme is exact and generic but, except for projection kernels, it requires the eigendecomposition of the underlying kernel. In the finite setting, this corresponds to an initial O(M³) eigendecomposition cost, followed by an average complexity of order O(M E[|X|]²) (see, e.g., Gillenwater, 2014; Tremblay et al., 2018). Besides, there exist alternative exact samplers. Poulson (2019) and Launay et al. (2018) use an O(M³) Cholesky-based chain rule on sets; each item in turn is decided to be excluded from or included in the sample. Dereziński et al.
(2019) first sample an intermediate distribution and correct the bias by thinning the intermediate sample (of size smaller than M) using a carefully designed DPP. In certain regimes, this procedure may be more practical, with an overall O(M poly(E[|X|]) polylog(M)) cost. In the continuous case, sampling exactly each conditional that appears in the right-hand side of Equation 4 can be done using rejection sampling with a tailored proposal, see, e.g., Gautier et al. (2019).

In applications where the cost of exact sampling becomes a bottleneck, users rely on approximate sampling. Research has focused mainly on kernel approximation (Affandi et al., 2013) and MCMC samplers (Anari et al., 2016; Li et al., 2016; Gautier et al., 2017).

However, specific DPPs admit efficient exact samplers that do not rely on Equation 4, e.g., uniform spanning trees (UST, Propp & Wilson, 1998, Figure 1(c)) or eigenvalues of random matrices. For instance, a β-ensemble is a set of N points of ℝ with joint density

\[
\frac{1}{Z_{N,\beta}} \prod_{i < j} |x_i - x_j|^{\beta} \prod_{n=1}^{N} \omega(x_n), \quad \beta > 0.
\]

For some choices of the weight function ω, the β-ensemble can be sampled by computing the eigenvalues of simple tridiagonal (Dumitriu & Edelman, 2002) or quindiagonal random matrices (Killip & Nenciu, 2004). In particular, (β = 2)-ensembles correspond to projection DPPs (König, 2004). They are therefore examples of continuous DPPs that can be sampled exactly in O(N²) time, without rejection. Some of these ensembles are of direct interest to MLers. The Laguerre ensemble, for instance, takes ω to be a Gamma pdf, and corresponds to the eigenvalues of the empirical covariance matrix of i.i.d. Gaussian vectors, see Figure 1(b). Finally, we mention that DPPy also features an exact sampler for the multivariate extension of the Jacobi ensemble, which has been central in recent results on faster-than-Monte Carlo numerical integration (Bardenet & Hardy, 2019; Gautier et al., 2019; Mazoyer et al., 2019).
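In the finite setting, the two-step scheme of Hough et al. (Bernoulli selection of eigenvectors, then the chain rule of Equation 4 with squared residual norms) fits in a few lines of NumPy. The following is a didactic sketch under our own naming, not DPPy's implementation, which handles the numerics more carefully:

```python
import numpy as np

def sample_dpp(K, rng):
    """Exact sampler for a finite DPP(K) with real-symmetric K, 0 <= K <= I.

    Step 1: keep eigenvector i with probability lambda_i (mixture component).
    Step 2: chain rule for the projection DPP; each new item is drawn with
    probability proportional to the squared distance of its feature vector
    to the span of the already selected ones (cf. Equation 4).
    """
    eigvals, eigvecs = np.linalg.eigh(K)
    keep = rng.random(len(eigvals)) < eigvals
    Phi = eigvecs[:, keep]        # row x is the feature vector Phi(x)
    N = Phi.shape[1]              # the projection DPP has exactly N points
    sample = []
    norms2 = np.sum(Phi**2, axis=1)
    for _ in range(N):
        probs = np.clip(norms2, 0.0, None)
        probs /= probs.sum()
        x = rng.choice(len(probs), p=probs)
        sample.append(x)
        # Gram-Schmidt update: remove the chosen direction from all features.
        u = Phi[x] / np.linalg.norm(Phi[x])
        Phi = Phi - np.outer(Phi @ u, u)
        norms2 = np.sum(Phi**2, axis=1)
        norms2[sample] = 0.0      # selected items cannot be drawn twice
    return sample
```

With a projection kernel K (all eigenvalues 0 or 1), Step 1 is essentially deterministic, and the sample size equals rank K, matching the fixed-size behavior discussed above.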
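The random-matrix route can likewise be sketched with NumPy/SciPy. Below is a hedged illustration with our own function names: the "full model" for the (β = 1) Laguerre ensemble as covariance eigenvalues, and the tridiagonal model of Dumitriu & Edelman (2002), shown here for the Hermite β-ensemble (the Laguerre case uses bidiagonal factors whose product is tridiagonal):

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def laguerre_full_model(p, n, rng):
    """'Full model': eigenvalues of the empirical covariance matrix of n
    i.i.d. standard Gaussian vectors in dimension p (beta = 1 Laguerre).
    Costs O(p^3) for the dense eigendecomposition."""
    G = rng.standard_normal((p, n))
    return np.linalg.eigvalsh(G @ G.T / n)

def hermite_banded_model(N, beta, rng):
    """Tridiagonal model of Dumitriu & Edelman (2002) for the Hermite
    beta-ensemble, valid for any beta > 0: O(N^2) overall."""
    diag = rng.standard_normal(N)  # N(0, 1) entries on the diagonal
    # chi-distributed off-diagonal, with beta*(N-1), ..., beta degrees of freedom
    offdiag = np.sqrt(rng.chisquare(beta * np.arange(N - 1, 0, -1))) / np.sqrt(2)
    return eigh_tridiagonal(diag, offdiag, eigvals_only=True)

rng = np.random.default_rng(0)
# As p, n grow with p/n fixed, the histogram of these eigenvalues
# approaches the Marcenko-Pastur distribution (cf. Figure 1(b)).
mp_eigs = laguerre_full_model(50, 200, rng)
hermite_eigs = hermite_banded_model(200, 2, rng)
```

DPPy's .sample_full_model() and .sample_banded_model() play the roles of these two routines for the Laguerre ensemble.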
4. The DPPy toolbox

DPPy handles Python objects that fit the natural definition of the corresponding DPPs; see also the documentation (dppy.readthedocs.io) and the corresponding Jupyter notebooks, which showcase DPPy objects. For example, FiniteDPP(kernel_type="correlation", **{"K": K}) instantiates a finite DPP(K). Its two main methods, .sample_exact() and .sample_mcmc(), implement the different exact samplers and current state-of-the-art MCMC samplers. To sample k-DPPs, the additional .sample_exact_k_dpp() and .sample_mcmc_k_dpp() methods are available. A Laguerre β-ensemble is instantiated as LaguerreEnsemble(beta=2). When β ∈ {1, 2, 4}, it can be sampled with .sample_full_model(), as the eigenvalues of a random covariance matrix, or, more efficiently and for any β > 0, using a tridiagonal matrix with .sample_banded_model(). Samples can be displayed via .plot() or .hist() to construct the empirical distribution that converges to the Marčenko-Pastur distribution, see Figure 1(b).

DPPy can readily serve as research and teaching support. DPPy is also ready for other contributors to add content and enlarge its scope, e.g., with procedures for learning kernels.

Figure 1: Some displays available in DPPy. (a) 2D Jacobi ensemble; (b) (β = 2)-Laguerre ensemble; (c) K kernel of a UST.

Acknowledgments

We acknowledge funding by the European CHIST-ERA project DELTA, the French Ministry of Higher Education and Research, the Nord-Pas-de-Calais Regional Council, the Inria and Otto-von-Guericke-Universität Magdeburg associated team north-European project Allocate, and the French National Research Agency project BoB (ANR-16-CE23-0003).

References

Affandi, R. H., Kulesza, A., Fox, E. B., and Taskar, B. Nyström Approximation for Large Scale Determinantal Processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.

Anari, N., Gharan, S. O., and Rezaei, A. Monte Carlo Markov Chain Algorithms for Sampling Strongly Rayleigh Distributions and Determinantal Point Processes.
In Conference on Learning Theory (COLT), 2016. arXiv:1602.05242.

Baddeley, A. and Turner, R. spatstat: An R Package for Analyzing Spatial Point Patterns. Journal of Statistical Software, 2005.

Bardenet, R. and Hardy, A. Monte Carlo with Determinantal Point Processes. Annals of Applied Probability, in press, 2019. arXiv:1605.00361.

Borodin, A., Diaconis, P., and Fulman, J. On adding a list of numbers (and other one-dependent determinantal processes). Bulletin of the American Mathematical Society, 2010. arXiv:0904.3740.

Burt, D., Rasmussen, C. E., and Wilk, M. V. D. Rates of Convergence for Sparse Variational Gaussian Process Regression. In International Conference on Machine Learning (ICML), 2019. arXiv:1903.03571.

Dereziński, M., Calandriello, D., and Valko, M. Exact sampling of determinantal point processes with sublinear time preprocessing. In Advances in Neural Information Processing Systems (NeurIPS), 2019. arXiv:1905.13476.

Dumitriu, I. and Edelman, A. Matrix Models for Beta Ensembles. Journal of Mathematical Physics, 2002. arXiv:math-ph/0206043.

Dupuy, C. and Bach, F. Learning Determinantal Point Processes in Sublinear Time. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2018. arXiv:1610.05925.

Gartrell, M., Paquet, U., and Koenigstein, N. Low-Rank Factorization of Determinantal Point Processes for Recommendation. In AAAI Conference on Artificial Intelligence, 2016. arXiv:1602.05436.

Gartrell, M., Brunel, V.-E., Dohmatob, E., and Krichene, S. Learning Nonsymmetric Determinantal Point Processes. ArXiv e-prints, 2019. arXiv:1905.12962.

Gautier, G., Bardenet, R., and Valko, M. Zonotope hit-and-run for efficient sampling from projection DPPs. In International Conference on Machine Learning (ICML), 2017. arXiv:1705.10498.

Gautier, G., Bardenet, R., and Valko, M. On two ways to use determinantal point processes for Monte Carlo integration.
In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Gillenwater, J. Approximate inference for determinantal point processes. PhD thesis, University of Pennsylvania, 2014.

Hough, J. B., Krishnapur, M., Peres, Y., and Virág, B. Determinantal Processes and Independence. Probability Surveys, 2006. arXiv:math/0503110.

Kammoun, M. S. Monotonous subsequences and the descent process of invariant random permutations. Electronic Journal of Probability, 2018. arXiv:1805.05253.

Kathuria, T., Deshpande, A., and Kohli, P. Batched Gaussian Process Bandit Optimization via Determinantal Point Processes. In Advances in Neural Information Processing Systems (NIPS), 2016. arXiv:1611.04088.

Killip, R. and Nenciu, I. Matrix models for circular ensembles. International Mathematics Research Notices, 2004. arXiv:math/0410034.

König, W. Orthogonal polynomial ensembles in probability theory. Probability Surveys, 2004. arXiv:math/0403090.

Kulesza, A. and Taskar, B. Determinantal Point Processes for Machine Learning. Foundations and Trends in Machine Learning, 2012. arXiv:1207.6083.

Launay, C., Galerne, B., and Desolneux, A. Exact Sampling of Determinantal Point Processes without Eigendecomposition. ArXiv e-prints, 2018. arXiv:1802.08429.

Lavancier, F., Møller, J., and Rubak, E. Determinantal point process models and statistical inference: Extended version. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 2012. arXiv:1205.4818.

Li, C., Jegelka, S., and Sra, S. Fast Mixing Markov Chains for Strongly Rayleigh Measures, DPPs, and Constrained Sampling. In Advances in Neural Information Processing Systems (NIPS), 2016. arXiv:1608.01008.

Macchi, O. The coincidence approach to stochastic point processes. Advances in Applied Probability, 1975.

Mazoyer, A., Coeurjolly, J.-F., and Amblard, P.-O. Projections of determinantal point processes. ArXiv e-prints, 2019. arXiv:1901.02099.

Møller, J. and Waagepetersen, R. P.
Statistical inference and simulation for spatial point processes. Chapman & Hall/CRC, 2004.

Pathria, R. K. and Beale, P. D. Statistical Mechanics. Academic Press, 2011.

Poulson, J. High-performance sampling of generic Determinantal Point Processes. ArXiv e-prints, 2019. arXiv:1905.00165.

Propp, J. G. and Wilson, D. B. How to Get a Perfectly Random Sample from a Generic Markov Chain and Generate a Random Spanning Tree of a Directed Graph. Journal of Algorithms, 1998.

Rasmussen, C. E. and Williams, C. K. I. Gaussian processes for machine learning. MIT Press, 2006.

Soshnikov, A. Determinantal random point fields. Russian Mathematical Surveys, 2000. arXiv:math/0002099.

Tremblay, N., Barthelme, S., and Amblard, P.-O. Optimized Algorithms to Sample Determinantal Point Processes. ArXiv e-prints, 2018. arXiv:1802.08471.