# Learning Nonsymmetric Determinantal Point Processes

Mike Gartrell, Criteo AI Lab, m.gartrell@criteo.com
Victor-Emmanuel Brunel, ENSAE ParisTech, victor.emmanuel.brunel@ensae.fr
Elvis Dohmatob, Criteo AI Lab, e.dohmatob@criteo.com
Syrine Krichene, Criteo AI Lab (currently at Google), syrinekrichene@google.com

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Determinantal point processes (DPPs) have attracted substantial attention as an elegant probabilistic model that captures the balance between quality and diversity within sets. DPPs are conventionally parameterized by a positive semi-definite kernel matrix, and this symmetric kernel encodes only repulsive interactions between items. These so-called symmetric DPPs have significant expressive power, and have been successfully applied to a variety of machine learning tasks, including recommendation systems, information retrieval, and automatic summarization, among many others. Efficient algorithms for learning symmetric DPPs and sampling from these models have been reasonably well studied. However, relatively little attention has been given to nonsymmetric DPPs, which relax the symmetric constraint on the kernel. Nonsymmetric DPPs allow for both repulsive and attractive item interactions, which can significantly improve modeling power, resulting in a model that may be a better fit for some applications. We present a method that enables a tractable algorithm, based on maximum likelihood estimation, for learning nonsymmetric DPPs from data composed of observed subsets. Our method imposes a particular decomposition of the nonsymmetric kernel that enables such tractable learning algorithms, which we analyze both theoretically and experimentally. We evaluate our model on synthetic and real-world datasets, demonstrating improved predictive performance compared to symmetric DPPs, which have previously shown strong performance on modeling tasks associated with these datasets.

1 Introduction

Determinantal point processes (DPPs) have attracted growing attention from the machine learning community as an elegant probabilistic model for the relationships between items within observed subsets drawn from a large collection of items. DPPs have been well studied for their theoretical properties [1, 4, 9, 13, 18, 20, 21], and have been applied to numerous machine learning applications, including document summarization [7, 24], recommender systems [11], object retrieval [1], sensor placement [17], information retrieval [19], and minibatch selection [29]. Efficient algorithms for DPP learning [10, 12, 14, 25, 26] and sampling [2, 22, 27] have been reasonably well studied.

DPPs are conventionally parameterized by a positive semi-definite (PSD) kernel matrix, and due to this symmetric kernel, they are able to encode only repulsive interactions between items. Despite this limitation, symmetric DPPs have significant expressive power, and have proven effective in the aforementioned applications. However, the ability to encode only repulsive interactions, or negative correlations between pairs of items, does have important limitations in some settings. For example, consider the case of a recommender system for a shopping website, where the task is to provide good recommendations for items to complete a user's shopping basket prior to checkout.
For models that can only encode negative correlations, such as the symmetric DPP, it is impossible to directly encode positive interactions between items; e.g., that a purchased basket containing a video game console is more likely to also contain a game controller. One way to resolve this limitation is to consider nonsymmetric DPPs, which relax the symmetric constraint on the kernel. Nonsymmetric DPPs allow the model to encode both repulsive and attractive item interactions, which can significantly improve modeling power. With one notable exception [5], little attention has been given to nonsymmetric DPPs within the machine learning community. We present a method for learning fully nonsymmetric DPP kernels from data composed of observed subsets, where we leverage a low-rank decomposition of the nonsymmetric kernel that enables a tractable learning algorithm based on maximum likelihood estimation (MLE).

Contributions. Our work makes the following contributions:

- We present a decomposition of the nonsymmetric DPP kernel that enables a tractable MLE-based learning algorithm. To the best of our knowledge, this is the first MLE-based learning algorithm for nonsymmetric DPPs.
- We present a general framework for the theoretical analysis of the properties of the maximum likelihood estimator for a somewhat restricted class of nonsymmetric DPPs, which shows that this estimator has particular statistical guarantees regarding consistency.
- Through an extensive experimental evaluation on several synthetic and real-world datasets, we highlight the significant improvements in modeling power that nonsymmetric DPPs provide in comparison to symmetric DPPs. We see that nonsymmetric DPPs are more effective at recovering correlation structure within data, particularly for data that contains large disjoint collections of items.

Unlike previous work on signed DPPs [5], our work does not make the very limiting assumption that the correlation kernel of the DPP is symmetric in the absolute values of its entries. This gives our model much more flexibility. Moreover, our learning algorithm, based on maximum likelihood estimation, allows us to leverage a low-rank assumption on the kernel, while the method of moments used in [5] does not seem to. Finally, the learning algorithm in [5] has computational complexity of $O(M^6)$, where $M$ is the size of the ground set (e.g., the item catalog), making it computationally infeasible for most practical scenarios. In contrast, our learning algorithm has a substantially lower time complexity of $O(M^3)$, which allows our approach to be used on many real-world datasets.

2 Background

A DPP models a distribution over subsets of a finite ground set $Y$ that is parametrized by a matrix $L \in \mathbb{R}^{|Y| \times |Y|}$, such that for any $J \subseteq Y$,

$$\Pr(J) \propto \det(L_J), \qquad (1)$$

where $L_J = [L_{ij}]_{i,j \in J}$ is the submatrix of $L$ indexed by $J$. The normalization constant for Eq. (1) follows from the observation that $\sum_{J \subseteq Y} \det(L_J) = \det(L + I)$, so we have, for all $J \subseteq Y$,

$$P_L(J) = \frac{\det(L_J)}{\det(L + I)}. \qquad (2)$$

Without loss of generality, we will assume that $Y = \{1, 2, \ldots, M\}$, which we also denote by $[M]$, where $M \geq 1$ is the cardinality of $Y$. It is common to assume that $L$ is a positive semi-definite matrix in order to ensure that $P_L$ defines a probability distribution on the power set of $[M]$ [20]. More generally, any matrix $L$ whose principal minors $\det(L_J)$, $J \subseteq [M]$, are nonnegative is admissible to define a probability distribution as in (2) [5]; such matrices are called P0-matrices.
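To make Eq. (2) and the P0 condition concrete, here is a minimal NumPy sketch (not the authors' implementation; the toy kernel below is an assumption) that evaluates $P_L(J)$ and brute-force checks the principal minors of a small kernel:

```python
# Minimal sketch of the L-ensemble probability in Eq. (2); illustrative only.
import itertools
import numpy as np

def is_P0(L, tol=1e-10):
    """Brute-force check that every principal minor det(L_J) is nonnegative.
    Exponential in M, so only suitable for tiny toy kernels."""
    M = L.shape[0]
    for r in range(1, M + 1):
        for J in itertools.combinations(range(M), r):
            if np.linalg.det(L[np.ix_(J, J)]) < -tol:
                return False
    return True

def dpp_probability(L, J):
    """P_L(J) = det(L_J) / det(L + I), with det(L_{}) = 1 by convention."""
    M = L.shape[0]
    numerator = np.linalg.det(L[np.ix_(J, J)]) if len(J) > 0 else 1.0
    return numerator / np.linalg.det(L + np.eye(M))

# Toy nonsymmetric kernel: a PSD part plus a skew-symmetric perturbation.
S = np.array([[1.0, 0.2], [0.2, 1.0]])    # symmetric PSD
A = np.array([[0.0, 0.5], [-0.5, 0.0]])   # skew-symmetric
L = S + A
assert is_P0(L)
print(dpp_probability(L, [0, 1]))  # probability of observing the subset {0, 1}
```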
Recall that any matrix $L$ can be decomposed uniquely as the sum of a symmetric matrix $S$ and a skew-symmetric matrix $A$; namely, $S = (L + L^\top)/2$ and $A = (L - L^\top)/2$. The following lemma gives a simple sufficient condition on $S$ for $L$ to be a P0-matrix.

Lemma 1. Let $L \in \mathbb{R}^{M \times M}$ be an arbitrary matrix. If $L + L^\top$ is PSD, then $L$ is a P0-matrix.

An important consequence is that a matrix of the form $D + A$, where $D$ is diagonal with positive diagonal entries and $A$ is skew-symmetric, is a P0-matrix. Such a matrix would only capture nonnegative correlations, as explained in the next section.

2.1 Capturing Positive and Negative Correlations

When DPPs are used to model real data, they are often formulated in terms of the $L$ matrix as described above, called an L-ensemble. However, DPPs can alternatively be represented in terms of the $M \times M$ matrix $K$, where $K = I - (L + I)^{-1}$. Using the $K$ representation,

$$\Pr(J \subseteq Y) = \det(K_J), \qquad (3)$$

where $Y$ is a random subset drawn from $P_L$. $K$ is called the marginal kernel; since here we are defining marginal probabilities that need not sum to 1, no normalization constant is needed.

DPPs are conventionally parameterized by a PSD $K$ or $L$ matrix, which is symmetric. However, $K$ and $L$ need not be symmetric. As shown in [5], $K$ is admissible if and only if $L$ is a P0-matrix, that is, all of its principal minors are nonnegative. The class of P0-matrices is much larger, and allows us to accommodate nonsymmetric $K$ and $L$ matrices. To enforce the P0 constraint on $L$ during learning, we impose the decomposition of $L$ described in Section 4. Since, as a consequence of Lemma 1, the sum of a PSD matrix and a skew-symmetric matrix is a P0-matrix, this decomposition allows us to support nonsymmetric kernels while ensuring that $L$ is a P0-matrix. As we will see in the following, there are significant advantages to accommodating nonsymmetric kernels in terms of modeling power.

As shown in [20], the eigenvalues of $K$ are bounded above by one, while $L$ need only be a PSD or P0 matrix. Furthermore, $K$ gives the marginal probabilities of subsets, while $L$ directly models the atomic probabilities of each observed subset of $Y$. For these reasons, most work on learning DPPs from data uses the $L$ representation of a DPP.

If $J = \{i\}$ is a singleton set, then $\Pr(i \in Y) = K_{ii}$: the diagonal entries of $K$ directly correspond to the marginal inclusion probabilities for each element of $Y$. If $J = \{i, j\}$ is a set containing two elements, then we have

$$\Pr(i, j \in Y) = \begin{vmatrix} K_{ii} & K_{ij} \\ K_{ji} & K_{jj} \end{vmatrix} = K_{ii} K_{jj} - K_{ij} K_{ji}. \qquad (4)$$

Therefore, the off-diagonal elements determine the correlations between pairs of items; that is, $\mathrm{cov}(\mathbb{1}_{i \in Y}, \mathbb{1}_{j \in Y}) = -K_{ij} K_{ji}$.

For a symmetric $K$, the signs and magnitudes of $K_{ij}$ and $K_{ji}$ are the same, resulting in $\mathrm{cov}(\mathbb{1}_{i \in Y}, \mathbb{1}_{j \in Y}) = -K_{ij}^2 \leq 0$. We see that in this case, the off-diagonal elements represent negative correlations between pairs of items, where a larger magnitude of $K_{ij}$ leads to a lower probability of $i$ and $j$ co-occurring, while a smaller magnitude of $K_{ij}$ indicates a higher co-occurrence probability. If $K_{ij} = 0$, then there is no correlation between this pair of items. Since the $-K_{ij}^2$ term is always nonpositive, the symmetric model is able to capture only nonpositive correlations between items. In fact, symmetric DPPs induce a strong negative dependence between items, called negative association [3].

For a nonsymmetric $K$, the signs and magnitudes of $K_{ij}$ and $K_{ji}$ may differ, which can result in $\mathrm{cov}(\mathbb{1}_{i \in Y}, \mathbb{1}_{j \in Y}) = -K_{ij} K_{ji} \geq 0$. In this case, the off-diagonal elements represent positive correlations between pairs of items, where a larger value of $-K_{ij} K_{ji}$ leads to a higher probability of $i$ and $j$ co-occurring, while a smaller value indicates a lower co-occurrence probability. Of course, the signs of the off-diagonal elements for some pairs $(i, j)$ may be the same in a nonsymmetric $K$, which allows the model to also capture negative correlations. Therefore, a nonsymmetric $K$ can capture both negative and positive correlations between pairs of items.
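The sign flip described above is easy to verify numerically. The following sketch (illustrative only; the two toy kernels are assumptions, not from the paper) computes $K = I - (L + I)^{-1}$ and the pairwise covariance $-K_{ij} K_{ji}$ for a symmetric and a nonsymmetric kernel:

```python
# Minimal sketch of the marginal kernel K and pairwise correlations (Eqs. 3-4).
import numpy as np

def marginal_kernel(L):
    """K = I - (L + I)^{-1}; det(K_J) gives marginal inclusion probabilities."""
    M = L.shape[0]
    return np.eye(M) - np.linalg.inv(L + np.eye(M))

def pair_covariance(K, i, j):
    """cov(1_{i in Y}, 1_{j in Y}) = Pr(i,j in Y) - Pr(i in Y) Pr(j in Y) = -K_ij K_ji."""
    return -K[i, j] * K[j, i]

L_sym = np.array([[1.0, 0.6], [0.6, 1.0]])       # symmetric kernel
L_nonsym = np.array([[1.0, 0.6], [-0.6, 1.0]])   # nonsymmetric: off-diagonal signs differ

for name, L in [("symmetric", L_sym), ("nonsymmetric", L_nonsym)]:
    K = marginal_kernel(L)
    print(name, "cov(i, j) =", pair_covariance(K, 0, 1))
# The symmetric kernel yields a nonpositive covariance (-K_ij^2, repulsion),
# while this nonsymmetric kernel yields a nonnegative covariance (attraction).
```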
3 General guarantees in maximum likelihood estimation for DPPs

In this section we define the log-likelihood function and we study the Fisher information of the model. The Fisher information controls whether the maximum likelihood estimator, computed on $n$ i.i.d. samples, will be $\sqrt{n}$-consistent. When the matrix $L$ is not invertible (i.e., if it is only a P0-matrix and not a P-matrix), the support of $P_L$, defined as the collection of all subsets $J \subseteq [M]$ such that $P_L(J) \neq 0$, depends on $L$, and the Fisher information will not be defined in general. Hence, we will assume, in this section, that $L$ is invertible and that we only maximize the log-likelihood over classes of invertible matrices $L$.

Consider a subset $\Omega$ of the set of all P-matrices of size $M$. Given a collection of $n$ observed subsets $\{Y_1, \ldots, Y_n\}$ composed of items from $Y = [M]$, our learning task is to fit a DPP kernel $L$ based on this data. For all $L \in \Omega$, the log-likelihood is defined as

$$\hat f_n(L) = \frac{1}{n} \sum_{i=1}^{n} \log P_L(Y_i) = \sum_{J \subseteq [M]} \hat p_J \log P_L(J) = \sum_{J \subseteq [M]} \hat p_J \log \det(L_J) - \log \det(L + I), \qquad (5)$$

where $\hat p_J$ is the proportion of observed samples that equal $J$. Now, assume that $Y_1, \ldots, Y_n$ are i.i.d. copies of a DPP with kernel $L^* \in \Omega$. For all $L \in \Omega$, the population log-likelihood is defined as the expectation of $\hat f_n(L)$, i.e.,

$$f(L) = \mathbb{E}[\log P_L(Y_1)] = \sum_{J \subseteq [M]} p^*_J \log \det(L_J) - \log \det(L + I), \qquad (6)$$

where $p^*_J = \mathbb{E}[\hat p_J] = P_{L^*}(J)$. The maximum likelihood estimator (MLE) is defined as a maximizer $\hat L$ of $\hat f_n(L)$ over the parameter space $\Omega$. Since $\hat L$ can be viewed as a perturbed version of $L^*$, it is convenient to introduce the space $\mathcal{H}$, defined as the linear subspace of $\mathbb{R}^{M \times M}$ spanned by $\Omega$, and to define the successive derivatives of $\hat f_n$ and $f$ as multilinear forms on $\mathcal{H}$. As we will see later on, the complexity of the model can be captured in the size of the space $\mathcal{H}$. The following lemma provides a few examples. We say that a matrix $L$ is a signed matrix if for all $i \neq j$, $L_{i,j} = \varepsilon_{i,j} L_{j,i}$ for some $\varepsilon_{i,j} \in \{-1, 1\}$.

Lemma 2.
1. If $\Omega$ is the set of all positive definite matrices, it is easy to see that $\mathcal{H}$ is the set of all symmetric matrices.
2. If $\Omega$ is the set of all P-matrices, then $\mathcal{H} = \mathbb{R}^{M \times M}$.
3. If $\Omega$ is the collection of signed P-matrices, then $\mathcal{H} = \mathbb{R}^{M \times M}$.
4. If $\Omega$ is the set of P-matrices of the form $S + A$, where $S$ is a symmetric matrix and $A$ is a skew-symmetric matrix (i.e., $A^\top = -A$), then $\mathcal{H} = \mathbb{R}^{M \times M}$.
5. If $\Omega$ is the set of signed P-matrices with known sign pattern (i.e., there exist $(\varepsilon_{i,j})_{i>j} \in \{-1, 1\}$ such that for all $L \in \Omega$ and all $i > j$, $L_{j,i} = \varepsilon_{i,j} L_{i,j}$), then $\mathcal{H}$ is the collection of all signed matrices with that same sign pattern. In particular, if $\Omega$ is the set of all P-matrices of the form $D + A$ where $D$ is diagonal and $A$ is skew-symmetric, then $\mathcal{H}$ is the collection of all matrices that are the sum of a diagonal and a skew-symmetric matrix.

It is easy to see that the population log-likelihood $f$ is infinitely many times differentiable on the relative interior of $\Omega$, and that for all $L$ in the relative interior of $\Omega$ and all $H \in \mathcal{H}$,

$$df(L)(H) = \sum_{J \subseteq [M]} p^*_J \,\mathrm{tr}\big((L_J)^{-1} H_J\big) - \mathrm{tr}\big((L + I)^{-1} H\big) \qquad (7)$$

and

$$d^2 f(L)(H, H) = -\sum_{J \subseteq [M]} p^*_J \,\mathrm{tr}\big(\big((L_J)^{-1} H_J\big)^2\big) + \mathrm{tr}\big(\big((L + I)^{-1} H\big)^2\big). \qquad (8)$$

Hence, we have the following theorem.
The case of symmetric kernels is studied in [6], and the following result is a straightforward extension to arbitrary parameter spaces. For completeness, we include the proof in the appendix. For a set $\Omega \subseteq \mathbb{R}^{N \times N}$, we call the relative interior of $\Omega$ its interior in the linear space spanned by $\Omega$.

Theorem 1. Let $\Omega$ be a set of P-matrices and let $L^*$ be in the relative interior of $\Omega$. Then, for all $H \in \mathcal{H}$, $df(L^*)(H) = 0$. Moreover, the Fisher information is the negative Hessian of $f$ at $L^*$ and is given by

$$-d^2 f(L^*)(H, H) = \mathrm{Var}\left[\mathrm{tr}\big((L^*_Y)^{-1} H_Y\big)\right], \qquad (9)$$

where $Y$ is a DPP with kernel $L^*$. It follows that the Fisher information is positive definite if and only if any $H \in \mathcal{H}$ that verifies

$$\mathrm{tr}\big((L^*_J)^{-1} H_J\big) = 0, \quad \forall J \subseteq [M], \qquad (10)$$

must be $H = 0$.

When $\Omega$ is the space of symmetric and positive definite kernels, the Fisher information is definite if and only if $L^*$ is irreducible, i.e., it is not block-diagonal up to a permutation of its rows and columns [6]. In that case, it is shown that the MLE learns $L^*$ at the rate $n^{-1/2}$. In general, this property fails, and even irreducible kernels can induce a singular Fisher information.

Lemma 3. Let $\Omega$ be a subset of P-matrices.
1. Let $L^* \in \Omega$ and $H \in \mathcal{H}$ satisfy (10). Then, for all $i \in [M]$, $H_{i,i} = 0$.
2. Let $i, j \in [M]$ with $i \neq j$. Let $L^* \in \Omega$ be such that $L^*_{i,j} \neq 0$, and let $\mathcal{H}$ satisfy the following property: $\exists \varepsilon \neq 0$ such that $\forall H \in \mathcal{H}$, $H_{j,i} = \varepsilon H_{i,j}$. Then, if $H \in \mathcal{H}$ satisfies (10), $H_{i,j} = H_{j,i} = 0$.
3. Let $L^* \in \Omega$ be block diagonal. Then, any $H \in \mathcal{H}$ supported outside of the diagonal blocks of $L^*$ satisfies (10).

In particular, this lemma implies that if $\Omega$ is a class of signed P-matrices with prescribed sign pattern (i.e., $L_{i,j} = \varepsilon_{i,j} L_{j,i}$ for all $i \neq j$ and all $L \in \Omega$, where the $\varepsilon_{i,j}$'s are $\pm 1$ and do not depend on $L$), then if $L^*$ lies in the relative interior of $\Omega$ and has no zero entries, the Fisher information is definite. In the symmetric case, it is shown in [6] that the only matrices $H$ satisfying (10) must be supported off the diagonal blocks of $L^*$, i.e., the third part of Lemma 3 is an equivalence. In the appendix, we provide a few very simple counterexamples that show that this equivalence is no longer valid in the nonsymmetric case.

4 Learning nonsymmetric DPPs

To add support for positive correlations to the DPP, we consider nonsymmetric $L$ matrices. In particular, our approach incorporates a skew-symmetric perturbation of a PSD matrix. Recall that any matrix $L$ can be uniquely decomposed as $L = S + A$, where $S$ is symmetric and $A$ is skew-symmetric. We impose a decomposition on $A$ as $A = BC^\top - CB^\top$, where $B$ and $C$ are low-rank $M \times D'$ matrices, and we use a low-rank factorization of $S$, $S = VV^\top$, where $V$ is a low-rank $M \times D$ matrix, as described in [12]. This also allows us to enforce that $S$ is PSD and hence, by Lemma 1, that $L$ is a P0-matrix.

We define a regularization term, $R(V, B, C)$, as

$$R(V, B, C) = -\alpha \sum_{i=1}^{M} \frac{1}{\lambda_i} \|v_i\|_2^2 - \beta \sum_{i=1}^{M} \frac{1}{\lambda_i} \|b_i\|_2^2 - \gamma \sum_{i=1}^{M} \frac{1}{\lambda_i} \|c_i\|_2^2, \qquad (11)$$

where $\lambda_i$ counts the number of occurrences of item $i$ in the training set, $v_i$, $b_i$, and $c_i$ are the corresponding row vectors of $V$, $B$, and $C$, respectively, and $\alpha, \beta, \gamma > 0$ are tunable hyperparameters. This regularization formulation is similar to that proposed in [12].

From the above, we have the full formulation of the regularized log-likelihood of our model:

$$\phi(V, B, C) = \sum_{i=1}^{n} \log \det\big(V_{Y_i} V_{Y_i}^\top + (B_{Y_i} C_{Y_i}^\top - C_{Y_i} B_{Y_i}^\top)\big) - n \log \det\big(VV^\top + (BC^\top - CB^\top) + I\big) + R(V, B, C), \qquad (12)$$

where $V_{Y_i}$, $B_{Y_i}$, and $C_{Y_i}$ denote the rows of $V$, $B$, and $C$ indexed by the observed subset $Y_i$.

The computational complexity of evaluating Eq. (12) is dominated by computing the determinant in the second term (the normalization constant), which is $O(M^3)$. Furthermore, since $\frac{\partial}{\partial L_{ij}} \log \det(L) = \mathrm{tr}\big(L^{-1} \frac{\partial L}{\partial L_{ij}}\big)$, the computational complexity of computing the gradient of Eq. (12) during learning is dominated by computing the matrix inverse in the gradient of the second term, $(L + I)^{-1}$, which is $O(M^3)$.
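For concreteness, the following PyTorch sketch implements an objective of the form of Eq. (12) under the decomposition $L = VV^\top + (BC^\top - CB^\top)$. It is an illustrative reimplementation, not the authors' released code; the hyperparameter names (alpha, beta, gamma, eps), the exact form of the regularizer, and the toy data are assumptions.

```python
# Illustrative sketch of the regularized nonsymmetric low-rank DPP objective.
import torch

def nonsym_kernel(V, B, C):
    """L = V V^T + (B C^T - C B^T): PSD part plus a skew-symmetric perturbation."""
    return V @ V.T + B @ C.T - C @ B.T

def regularized_log_likelihood(V, B, C, baskets, item_counts,
                               alpha=0.0, beta=0.0, gamma=0.0, eps=1e-5):
    M = V.shape[0]
    L = nonsym_kernel(V, B, C)
    log_lik = torch.tensor(0.0)
    for Y in baskets:                                   # each basket: list of item indices
        L_Y = L[Y][:, Y] + eps * torch.eye(len(Y))      # small eps*I correction for stability
        log_lik = log_lik + torch.logdet(L_Y)
    log_lik = log_lik - len(baskets) * torch.logdet(L + torch.eye(M))
    # Popularity-weighted regularization (a reconstruction of R(V, B, C); exact form is an assumption).
    inv_counts = 1.0 / item_counts
    reg = -(alpha * (inv_counts * V.pow(2).sum(dim=1)).sum()
            + beta * (inv_counts * B.pow(2).sum(dim=1)).sum()
            + gamma * (inv_counts * C.pow(2).sum(dim=1)).sum())
    return log_lik + reg

# Toy usage: M = 5 items, D = 2, D' = 2, two observed baskets, trained with Adam.
torch.manual_seed(0)
M, D, Dp = 5, 2, 2
V = torch.randn(M, D, requires_grad=True)
B = torch.randn(M, Dp, requires_grad=True)
C = torch.randn(M, Dp, requires_grad=True)
baskets = [[0, 1, 2], [1, 3]]
item_counts = torch.tensor([1.0, 2.0, 1.0, 1.0, 1.0])
opt = torch.optim.Adam([V, B, C], lr=0.05)
for _ in range(100):
    opt.zero_grad()
    loss = -regularized_log_likelihood(V, B, C, baskets, item_counts, alpha=0.1)
    loss.backward()
    opt.step()
```

Note that both $\log\det(L + I)$ and its gradient act on the full $M \times M$ kernel, which is the source of the $O(M^3)$ cost discussed above.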
Therefore, we see that the low-rank decomposition of the kernel in our nonsymmetric model does not afford any improvement over a full-rank model in terms of computational complexity. However, our low-rank decomposition does provide a savings in terms of the memory required to store model parameters, since our low-rank model has space complexity $O(MD + 2MD')$, while a full-rank version of this nonsymmetric model has space complexity $O(M^2 + 2M^2)$. When $D \ll M$ and $D' \ll M$, which is typical in many settings, this results in a significant space savings.

5 Experiments

We run extensive experiments on several synthetic and real-world datasets. Since the focus of our work is on improving DPP modeling power and comparing nonsymmetric and symmetric DPPs, we use the standard symmetric low-rank DPP as the baseline model for our experiments.

Preventing numerical instabilities. The first term on the right side of Eq. (12) will be singular whenever $|Y_i| > D$, where $Y_i$ is an observed subset. Therefore, to address this in practice, we set $D$ to the size of the largest subset observed in the data, as explained in [12]. Furthermore, the first term on the right side of Eq. (12) may be singular even when $|Y_i| \leq D$. In this case, we know that we are not at a maximum, since the value of the objective becomes $-\infty$. Numerically, to prevent such singularities, our implementation adds a small $\epsilon I$ correction to each $L_{Y_i}$ when optimizing Eq. (12) (we set $\epsilon = 10^{-5}$ in our experiments).

5.1 Datasets

We perform next-item prediction and AUC-based classification experiments on two real-world datasets composed of purchased shopping baskets, as well as on synthetic data:

1. Amazon Baby Registries: This public dataset consists of 111,006 registries or "baskets" of baby products, and has been used in prior work on DPP learning [11, 14, 25]. The registries are collected from 15 different categories, such as "apparel", "diapers", etc., and the items in each category are disjoint. We evaluate our models on the popular apparel category. We also perform an evaluation on a dataset composed of the three most popular categories: apparel, diaper, and feeding. We construct this dataset, composed of three large disjoint categories of items with a catalog of 100 items in each category, to highlight the differences in how nonsymmetric and symmetric DPPs model data. In particular, we will see that the nonsymmetric DPP uses positive correlations to capture item co-occurrences within baskets, while negative correlations are used to capture disjoint pairs of items. In contrast, since symmetric DPPs can only represent negative correlations, they must attempt to capture both co-occurring items and disjoint items using only negative correlations.

2. UK Retail: This is a public dataset [8] that contains 25,898 baskets drawn from a catalog of 4,070 items. The dataset contains transactions from a non-store online retail company that primarily sells unique all-occasion gifts, and many customers are wholesalers. We omit all baskets with more than 100 items, which allows us to use a low-rank factorization of the symmetric DPP ($D = 100$) that scales well in training and prediction time, while also keeping memory consumption for model parameters at a manageable level.

3. Synthetic data: We also perform an evaluation on synthetically generated data.
Our data generator allows us to explicitly control the item catalog size, the distribution of set sizes, and the item co-occurrence distribution. By controlling these parameters, we are able to empirically study how the nonsymmetric and symmetric models behave for data with a specified correlation structure.

5.2 Experimental setup and metrics

Next-item prediction involves identifying the best item to add to a subset of selected items (e.g., basket completion), and is the primary prediction task we evaluate. We compute a next-item prediction for a basket $J$ by conditioning the DPP on the event that all items in $J$ are observed. As described in [13], we compute this conditional kernel, $L^J$, as $L^J = L_{\bar J} - L_{\bar J, J} L_J^{-1} L_{J, \bar J}$, where $\bar J = Y \setminus J$, $L_{\bar J}$ is the restriction of $L$ to the rows and columns indexed by $\bar J$, and $L_{\bar J, J}$ consists of the $\bar J$ rows and $J$ columns of $L$. The computational complexity of this operation is dominated by the three matrix multiplications, which is $O(M^2 |J|)$.

We compare the performance of all methods using a standard recommender system metric: mean percentile rank (MPR). An MPR of 50 is equivalent to random selection; an MPR of 100 indicates that the model perfectly predicts the held-out item. MPR is a recall-based metric which we use to evaluate the model's predictive power by measuring how well it predicts the next item in a basket; it is a standard choice for recommender systems [15, 23]. See Appendix C for a formal description of how the MPR metric is computed.

We evaluate the discriminative power of each model using the AUC metric. For this task, we generate a set of negative subsets uniformly at random. For each positive subset $J^+$ in the test set, we generate a negative subset $J^-$ of the same length by drawing $|J^+|$ samples uniformly at random, ensuring that the same item is not drawn more than once for a subset. We compute the AUC for the model on these positive and negative subsets, where the score for each subset is the log-likelihood that the model assigns to the subset. This task measures the ability of the model to discriminate between observed positive subsets (ground-truth subsets) and randomly generated subsets.

For all experiments, a random selection of 80% of the baskets is used for training, and the remaining 20% is used for testing. We use a small held-out validation set for tracking convergence and tuning hyperparameters. Convergence is reached during training when the relative change in validation log-likelihood is below a pre-determined threshold, which is set identically for all models. We implement our models using PyTorch (our code is available at https://github.com/cgartrel/nonsymmetric-DPP-learning), and use the Adam [16] optimization algorithm to train our models.
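As an illustration of the next-item prediction step, the sketch below computes the conditional kernel $L^J$ via the Schur complement given above and ranks the remaining items. The random toy kernel is an assumption, and scoring candidates by the diagonal of the conditional kernel (which equals $\det(L_{J \cup \{i\}})/\det(L_J)$) is one natural choice, not necessarily the paper's exact ranking procedure.

```python
# Minimal sketch of next-item scoring via the conditional kernel L^J.
import numpy as np

def conditional_kernel(L, J):
    """Schur complement L^J = L_{Jbar} - L_{Jbar,J} L_J^{-1} L_{J,Jbar}."""
    M = L.shape[0]
    J = list(J)
    Jbar = [i for i in range(M) if i not in J]
    L_J = L[np.ix_(J, J)]
    L_bJ = L[np.ix_(Jbar, J)]     # rows Jbar, columns J
    L_Jb = L[np.ix_(J, Jbar)]     # rows J, columns Jbar
    L_bb = L[np.ix_(Jbar, Jbar)]
    return Jbar, L_bb - L_bJ @ np.linalg.solve(L_J, L_Jb)

def next_item_scores(L, J):
    """Score each candidate item i by (L^J)_{ii} = det(L_{J + {i}}) / det(L_J)."""
    Jbar, L_cond = conditional_kernel(L, J)
    return dict(zip(Jbar, np.diag(L_cond)))

# Toy usage with a random nonsymmetric P0 kernel (PSD + skew-symmetric parts).
rng = np.random.default_rng(0)
V = rng.normal(size=(6, 3)); B = rng.normal(size=(6, 2)); C = rng.normal(size=(6, 2))
L = V @ V.T + B @ C.T - C @ B.T
scores = next_item_scores(L, J=[0, 2])
print(sorted(scores, key=scores.get, reverse=True))  # candidate items, best first
```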
5.3 Results on synthetic datasets

We run a series of synthetic experiments to examine the differences between nonsymmetric and symmetric DPPs. In all of these experiments, we define an oracle that controls the generative process for the data. The oracle uses a deterministic policy to generate a dataset composed of positive baskets (items that co-occur) and negative baskets (items that do not co-occur). This generative policy defines the expected normalized determinant, $\det(K_J)$, for each pair of items, and a threshold that limits the maximum determinantal volume for a positive basket and the minimum volume for a negative basket. This threshold is used to compute AUC results for this set of positives and negatives. Note that the negative sets are used only during evaluation.

For each experiment, in Figures 1, 2, and 3, we plot a transformed version of the learned $K$ matrices for the nonsymmetric and symmetric models, where each element $(i, j)$ of this matrix is re-weighted by $\det(K_{\{i,j\}})$ for the corresponding pair. For each plotted transformation of $K$, a magenta element corresponds to a negative correlation, which will tend to result in the model predicting that the corresponding pair is a negative pair. Black and cyan elements correspond to smaller and larger positive correlations, respectively, for the nonsymmetric model, and to very small negative correlations for the symmetric model; the model will tend to predict that the corresponding pair is positive in these cases. We perform the AUC-based evaluation for each pair $\{i, j\}$ by comparing the $\det(K_{\{i,j\}})$ predicted by the model with the ground-truth determinantal volume provided by the oracle; this task is equivalent to performing basket completion for the pair. In Figures 1, 2, and 3, we also show the prediction error for each pair, where cyan corresponds to low error and magenta corresponds to high error.

Recovering positive examples for low-sparsity data. In this experiment we aim to show that the nonsymmetric model is just as capable as the symmetric model at learning negative correlations when trained on data containing few negative correlations and many positive correlations. We choose a setting where the symmetric model performs well. We construct a dataset that contains no large disjoint collections of items, with 100 baskets of size six and a catalog of 100 items. To reduce the impact of negative correlations between items, we use a categorical distribution, with nonuniform event probabilities, for sampling the items that populate each basket, with a large coverage of possible item pairs. This ensures few negative correlations, since there is a low probability that two items will never co-occur. For the nonsymmetric DPP, the oracle expects the model to predict a low negative correlation, or a positive correlation, for a pair of products that have a high co-occurrence probability in the data. The results of this experiment are shown in Figure 1. We see from the plots of the transformed $K$ matrices that both the nonsymmetric and symmetric models recover approximately the same structure, resulting in similar error plots and a similar predictive AUC of approximately 0.8 for both models.

Figure 1: Results for synthetic experiment showing model recovery of structure of positive examples for low-sparsity data.

Recovering negative examples for high-sparsity data. We construct a more challenging scenario for this experiment, which reveals an important limitation of the symmetric DPP. The symmetric DPP requires a relatively high density of observed item pairs (positive pairs) in order to learn the negative structure of the data that describes items that do not co-occur. During learning, the DPP will maximize determinantal volumes for positive pairs, while the $\det(L + I)$ normalization constant maintains a representation of the global volume of the parameter space for the entire item catalog. For a high density of observed positive pairs, increasing the volume allocated to positive pairs will result in a decrease in the volume assigned to many negative pairs, in order to maintain approximately the same global volume represented by the normalization constant. For a low density of positive pairs, the model will not allocate low volumes to many negative pairs. This phenomenon affects both the nonsymmetric and symmetric models.
Therefore, the difference in each model's ability to capture negative structure within a low-density region of positive pairs can be explained in terms of how each model maximizes determinantal volumes using positive and negative correlations. In the case of the symmetric DPP, the model can increase determinantal volumes by using smaller negative correlations, resulting in off-diagonal $K_{ij} = K_{ji}$ values that approach zero. As these off-diagonal parameters approach zero, this behavior has the side effect of also increasing the determinantal volumes of subsets within disjoint groups, since these volumes are also affected by these small parameter values. In contrast, the nonsymmetric model behaves differently; determinantal volumes can be maximized by switching the signs of the off-diagonal entries of $K$ and increasing the magnitude of these parameters, rather than reducing the values of these parameters to near zero. This behavior allows the model to assign higher volumes to positive pairs than to negative pairs within disjoint groups in many cases, thus allowing the nonsymmetric model to recover disjoint structure.

In our experiment, the oracle controls the sparsity of the data by setting the number of disjoint groups of items; positive pairs within each disjoint group are generated uniformly at random, in order to focus on the effect of disjoint groups. For the AUC evaluation, negative baskets are constructed so that they contain items from at least two different disjoint groups. When constructing our dataset, we set the number of disjoint groups to 14, with 100 baskets of size six and a catalog of 100 items.

The results of our experiment are shown in Figure 2. We see from the error plot that the symmetric model cannot effectively learn the structure of the data, leading to high error in many areas, including within the disjoint blocks; the symmetric model provides an AUC of 0.5 as a result. In contrast, the nonsymmetric model is able to approximately recover the block structure, resulting in an AUC of 0.7.

Figure 2: Results for synthetic experiment showing model recovery of structure of negative examples for high-sparsity data. 14 disjoint groups are used for data generation.

Recovering positive examples for data that mixes disjoint sparsity with popularity-based positive structure. For our final synthetic experiment, we construct a scenario that combines aspects of our two previous experiments. In this experiment, we consider three disjoint groups. For each disjoint group we use a categorical distribution with nonuniform event probabilities for sampling items within baskets, which induces a positive correlation structure within each group. Therefore, the oracle will expect to see a high negative correlation for disjoint pairs, compared to all other non-disjoint pairs within a particular disjoint group. For items with a high co-occurrence probability, we expect the symmetric DPP to recover a near-zero negative correlation, and the nonsymmetric DPP to recover a positive correlation. Furthermore, we expect both the nonsymmetric and symmetric models to recover higher marginal probabilities, or $K_{ii}$ values, for more popular items. The determinantal volumes for positive pairs containing popular items will thus tend to be larger than the volumes of negative pairs.
Therefore, for baskets containing popular items, we expect that both the nonsymmetric and symmetric models will be able to easily discriminate between positive and negative baskets. When constructing positive baskets, popular items are sampled with high probability, proportional to their popularity. We therefore expect that both models will be able to recover some signal about the correlation structure of the data within each disjoint group, resulting in a predictive AUC higher than 0.5, since the popularity-based positive correlation structure within each group allows the model to recover some structure about correlations among item pairs within each group. However, we expect that the nonsymmetric model will provide better predictive performance than the symmetric model, since its properties enable recovery of disjoint structure (as discussed previously). We see the expected results in Figure 3, which are further confirmed by the predictive AUC results: 0.7 for the symmetric model, and 0.75 for the nonsymmetric model.

Figure 3: Results for synthetic experiment showing model recovery of positive structure for data with popularity-based positive examples and disjoint groups. Three disjoint sets and popularity-based weighted random generation are used for the positive examples.

5.4 Results on real-world datasets

To examine how the nonsymmetric model behaves when trained on a real-world dataset with clear disjoint structure, we first train and evaluate the model on the three-category Amazon baby registry dataset. This dataset is composed of three disjoint categories of items, where each disjoint category contains 100 items and approximately 10,000 baskets. Given the structure of this dataset, with a small item catalog for each category and a large number of baskets relative to the size of the catalog, we would expect a relatively high density of positive pairwise item correlations within each category. Furthermore, since each category is disjoint, we would expect the model to recover a low density of positive correlations between pairs of items that are disjoint, since these items do not co-occur within observed baskets.

We see the experimental results for this three-category dataset in Figure 4. As expected, positive correlations dominate within each category, e.g., within category 1, the model encodes 80.5% of the pairwise interactions as positive correlations. For pairwise interactions between items in two disjoint categories, we see that negative correlations dominate, e.g., between C1 and C2, the model encodes 97.2% of the pairwise interactions as negative correlations (or equivalently, 2.8% as positive interactions).
|    | C1    | C2    | C3    |
|----|-------|-------|-------|
| C1 | 80.5% | 2.8%  | 3.7%  |
| C2 | 2.8%  | 71.6% | 4.5%  |
| C3 | 3.7%  | 4.5%  | 97.8% |

Figure 4: Percentage of positive pairwise correlations encoded by the nonsymmetric DPP when trained on the three-category Amazon baby registry dataset, as a fraction of all possible pairwise correlations. Category $n$ is denoted by C$n$.

Table 1 shows the results of our performance evaluation on the Amazon and UK datasets. Compared to the symmetric DPP, we see that the nonsymmetric DPP provides moderate to large improvements on both the MPR and AUC metrics for all datasets. In particular, we see a substantial improvement on the three-category Amazon dataset, providing further evidence that the nonsymmetric DPP is far more effective than the symmetric DPP at recovering the structure of data that contains large disjoint components.

| Metric | Amazon: Apparel (Sym DPP) | Amazon: Apparel (Nonsym DPP) | Amazon: 3-category (Sym DPP) | Amazon: 3-category (Nonsym DPP) | UK Retail (Sym DPP) | UK Retail (Nonsym DPP) |
|--------|---------------------------|------------------------------|------------------------------|---------------------------------|---------------------|------------------------|
| MPR    | 77.42 ± 1.12 | **80.32 ± 0.75** | 60.61 ± 0.94 | **75.09 ± 0.85** | 76.79 ± 0.60 | **79.45 ± 0.57** |
| AUC    | 0.66 ± 0.01  | **0.73 ± 0.01**  | 0.70 ± 0.01  | **0.79 ± 0.01**  | 0.57 ± 0.001 | **0.65 ± 0.01**  |

Table 1: MPR and AUC results for the Amazon apparel, Amazon three-category (apparel + diaper + feeding), and UK Retail datasets. Results show mean and 95% confidence estimates obtained using bootstrapping. Bold values indicate improvement over the symmetric low-rank DPP outside of the confidence interval. We use $D = 30$, $\alpha = 0$ for both Amazon datasets; $D' = 100$ for the Amazon three-category dataset; $D' = 30$ for the Amazon apparel dataset; $D = 100$, $D' = 20$, $\alpha = 1$ for the UK dataset; and $\beta = \gamma = 0$ for all datasets.

6 Conclusion

By leveraging a low-rank decomposition of the nonsymmetric DPP kernel, we have introduced a tractable MLE-based algorithm for learning nonsymmetric DPPs from data. To the best of our knowledge, this is the first MLE-based learning algorithm for nonsymmetric DPPs. A general framework for the theoretical analysis of the properties of the maximum likelihood estimator for a somewhat restricted class of nonsymmetric DPPs reveals that this estimator has certain statistical guarantees regarding its consistency. While symmetric DPPs are limited to capturing only repulsive item interactions, nonsymmetric DPPs allow for both repulsive and attractive item interactions, which lead to fundamental changes in model behavior. Through an extensive experimental evaluation on several synthetic and real-world datasets, we have demonstrated that nonsymmetric DPPs can provide significant improvements in modeling power and predictive performance compared to symmetric DPPs. We believe that our contributions open the door to an array of future work on nonsymmetric DPPs, including an investigation of sampling algorithms, reductions in computational complexity for learning, and further theoretical understanding of the properties of the model.

References

[1] R. Affandi, E. Fox, R. Adams, and B. Taskar. Learning the parameters of determinantal point process kernels. In ICML, 2014.
[2] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 103–115. PMLR, 2016.
[3] Julius Borcea, Petter Brändén, and Thomas Liggett. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, 22(2):521–567, 2009.
[4] Alexei Borodin. Determinantal point processes. arXiv:0911.1153, 2009.
[5] Victor-Emmanuel Brunel. Learning signed determinantal point processes through the principal minor assignment problem. In NeurIPS, pages 7365–7374, 2018.
[6] Victor-Emmanuel Brunel, Ankur Moitra, Philippe Rigollet, and John Urschel. Rates of estimation for determinantal point processes. In Conference on Learning Theory, pages 343–345, 2017.
[7] Wei-Lun Chao, Boqing Gong, Kristen Grauman, and Fei Sha. Large-margin determinantal point processes. In Uncertainty in Artificial Intelligence (UAI), 2015.
[8] D. Chen. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing and Customer Strategy Management, 19(3), August 2012.
[9] Laurent Decreusefond, Ian Flint, Nicolas Privault, and Giovanni Luca Torrisi. Determinantal point processes, 2015.
[10] Christophe Dupuy and Francis Bach. Learning determinantal point processes in sublinear time.
[11] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Bayesian low-rank determinantal point processes. In RecSys. ACM, 2016.
[12] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Low-rank factorization of determinantal point processes. In AAAI, 2017.
[13] J. Gillenwater. Approximate Inference for Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2014.
[14] J. Gillenwater, A. Kulesza, E. Fox, and B. Taskar. Expectation-maximization for learning determinantal point processes. In NIPS, 2014.
[15] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.
[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[17] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. JMLR, 9:235–284, 2008.
[18] A. Kulesza. Learning with Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2013.
[19] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In ICML, 2011.
[20] A. Kulesza and B. Taskar. Determinantal Point Processes for Machine Learning, volume 5. Foundations and Trends in Machine Learning, 2012.
[21] Frédéric Lavancier, Jesper Møller, and Ege Rubak. Determinantal point process models and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4):853–877, 2015.
[22] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast DPP sampling for Nyström with application to kernel methods. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2061–2070. PMLR, 2016.
[23] Yanen Li, Jia Hu, ChengXiang Zhai, and Ye Chen. Improving one-class collaborative filtering by incorporating rich user information. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, 2010.
[24] H. Lin and J. Bilmes. Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), 2012.
[25] Zelda Mariet and Suvrit Sra. Fixed-point algorithms for learning determinantal point processes. In ICML, 2015.
[26] Zelda Mariet and Suvrit Sra. Kronecker determinantal point processes. In NIPS, 2016.
[27] Patrick Rebeschini and Amin Karbasi. Fast mixing for discrete point processes. In Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1480–1500. PMLR, 2015.
[28] Michael J. Tsatsomeros. Generating and detecting matrices with positive principal minors. In Focus on Computational Neurobiology, pages 115–132. Nova Science Publishers, Inc., 2004.
[29] Cheng Zhang, Hedvig Kjellström, and Stephan Mandt. Stochastic learning on imbalanced data: Determinantal point processes for mini-batch diversification. CoRR, abs/1705.00607, 2017.