# dynamic_determinantal_point_processes__278b5b39.pdf

Dynamic Determinantal Point Processes

Takayuki Osogami, Rudy Raymond IBM Research AI

Tomoyuki Shirai Institute of Mathematics for Industry Kyushu University

Akshay Goel Graduate School of Mathematics Kyushu University

Takanori Maehara RIKEN Center for Advanced Intelligence Project

The determinantal point process (DPP) has been receiving increasing attention in machine learning as a generative model of subsets consisting of relevant and diverse items. Recently, there has been a signiﬁcant progress in developing efﬁcient algorithms for learning the kernel matrix that characterizes a DPP. Here, we propose a dynamic DPP, which is a DPP whose kernel can change over time, and develop efﬁcient learning algorithms for the dynamic DPP. In the dynamic DPP, the kernel depends on the subsets selected in the past, but we assume a particular structure in the dependency to allow efﬁcient learning. We also assume that the kernel has a low rank and exploit a recently proposed learning algorithm for the DPP with low-rank factorization, but also show that its bottleneck computation can be reduced from O(M 2 K) time to O(M K2) time, where M is the number of items under consideration, and K is the rank of the kernel, which can be set smaller than M by orders of magnitude.

Introduction

A determinantal point process (DPP) deﬁnes a probability distribution over the subsets of items in a given ground set in a way that the subsets consisting of relevant and diverse items have high probabilities (Macchi 1975; Kulesza and Taskar 2012; Soshnikov 2000; Shirai and Takahashi 2003a; 2003b). The DPP is receiving increasing attention in machine learning because of its broad applicability to recommendation of products (Gillenwater et al. 2014; Gartrell, Paquet, and Koenigstein 2017), summarization of documents or videos (Gong et al. 2014), modeling neural spiking (Snoek, Zemel, and Adams 2013), and other areas where we want to select a subset of relevant and diverse items. In some of these applications, it is important to take into account the dependency across time (Affandi, Kulesza, and Fox 2012; Gong et al. 2014; Qiao et al. 2016; Snoek, Zemel, and Adams 2013). For example, a customer who purchased a particular subset of products last week might want to purchase another particular subset of products that are otherwise unneeded this week. In this paper, we study a DPP whose distribution can change over time depending on the subsets that have been selected.

Copyright c 2018, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

A key challenge in the applications of a DPP is in the high computational complexity that is needed for learning the probability distribution from data. Because the loglikelihood of a DPP is generally non-concave with respect to its parameters, and maximum likelihood estimation of a DPP is conjectured to be NP-hard (Kulesza 2012), the standard approach in the literature is to ﬁnd a local maxima via gradient-based methods. However, even one iteration of gradient ascent can be computationally prohibitive. Speciﬁcally, a DPP can be speciﬁed by an M M kernel matrix, where M is the size of the ground set. An iteration of gradient ascent involves basic matrix-operations such as inversion, multiplication, and determinant, and a naive approach would require O(M 3) time. Recently, Gartrell et al. (2017) have proposed an efﬁcient learning algorithm for DPPs, where they assume that the kernel has a low rank and leverage its low-rank factorization. The per iteration complexity of their gradientbased approach is O(T κ3 + K M 2), where K is the rank of the kernel, T is the number of the subsets in training data, and κ is the maximum size of those subset. This significantly improves upon learning algorithms for full-rank kernels, which require O(T κ3 +M 3) (Mariet and Sra 2015) or more (Gillenwater et al. 2014). We extend the algorithm by Gartrell et al. to the DPP whose kernel can change over time, which we refer to as a Dynamic DPP. We design the Dynamic DPP in a way that the dynamic kernel not only allows low-rank factorization but also has a particular parametrization that enables efﬁcient learning through (stochastic) gradient-based approaches. Speciﬁcally, our dynamic kernel can be represented by a matrix that is a linear function of the statistics of previously selected subsets. The deﬁnition of the Dynamic DPP is our ﬁrst contribution. In developing a learning algorithm for the Dynamic DPP, we make three improvements to the algorithm by Gartrell et al. First, we use a dual representation of the kernel to reduce the computational complexity of a bottleneck from O(K M 2) to O(K2 M). Second, we replace an expression involving the inverse of a matrix that might not be convertible with a slightly simpler expression with pseudoinverse. Third, we represent our algorithm in vector-matrix operations. When our algorithm for the Dynamic DPP is specialized for the (standard) DPP, its per step complex-

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

ity is O(T κ2 K + K2 M). This can be further reduced to O(T κ3+K2 M) by using inverse instead of pseudo-inverse or setting K = O(κ). Our algorithm thus has smaller per step complexity than the best known algorithms (Gartrell, Paquet, and Koenigstein 2017; Mariet and Sra 2016), as we discuss further in the following. The per iteration complexity of our algorithm for the Dynamic DPP is O(T (M D K+ K2 M)), where D a hyperparameter of the Dynamic DPP and represents the number of statistics of previously selected subsets that the dynamic kernel takes into account. However, we also propose an algorithm whose per iteration complexity is O(T (M D K + D2 K2) + K2 M), where we iteratively use rank-one updates to recursively compute matrix inversions. These efﬁcient learning algorithms for the Dynamic DPP and the DPP constitute our second contribution.

Related Work

Our proposed learning algorithm extends and improves upon the learning algorithm for the low-rank DPP by Gartrell et al. (Gartrell, Paquet, and Koenigstein 2017). The lowrank DPP has been enhanced into a Bayesian low-rank DPP with a Hamiltonian Monte Carlo (HMC) approach in (Gartrell, Paquet, and Koenigstein 2016). However, the per iteration complexity of the Bayesian low-rank DPP remains O(T κ3 + K M 2) time, because it relies on the form of the gradient derived for the low-rank DPP. Our techniques can also be used with an HMC approach. Another approach complementary to the low-rank DPP is a Kronecker DPP, which assumes that the kernel is a Kronecker product (Mariet and Sra 2016). The learning algorithm for the Kronecker DPP in Mariet and Sra (2016) has per iteration complexity of O(T κ3 + M 2) time. Although low-rank DPP and Kronecker DPP assume different structures, and their learning algorithms (for ﬁnding local maxima) are not directly comparable, this paper shows that the per iteration complexity of the low-rank DPP can be made smaller than the Kronecker DPP. The prior work has studied various forms of DPPs that change over time (Affandi, Kulesza, and Fox 2012; Gong et al. 2014; Qiao et al. 2016; Snoek, Zemel, and Adams 2013). However, the prior work does not fully learn the dynamic kernel as we discuss in the following. Affandi et al. (2012) study a Markov DPP, where the pair of the subsets selected at two consecutive steps forms a DPP. They use a heuristic method (that generally does not maximize an objective function) to learn only the quality measures of the DPP kernel, assuming that the similarity metric is ﬁxed and known. Notice that the kernel of a DPP can be decomposed into a quality vector of size M and a similarity matrix of size M M (Kulesza and Taskar 2012). Gong et al. (2014) study a sequential DPP, which is similar to the Markov DPP but assumes that the ground set changes over time. The proposed learning algorithm is similar to the one for the low-rank DPP but considers a mapping from features of items to the low-rank matrix. They learn this static mapping, while the ground set is dynamic. On the other hand, we assume that the ground set is static and learn a dynamic kernel.

Qiao et al. (2016) study a time-varying DPP, where the kernel changes over time but only slightly (i.e., updated by a low-rank matrix) at a time. They propose an efﬁcient sampling method for the time-varying DPP. However, they assume that their time-varying kernel is given and do not study any learning algorithm. Our Dynamic DPP has a low-rank update and can beneﬁt from their sampling algorithm. Snoek et al. (2013) apply a DPP with time-varying kernel to neural spiking data. They assume that the quality measure is time-varying depending on external stimuli. They learn the time-varying kernel with a relatively na ıve approach. Due to the computational complexity, their application is limited to M = 31 neurons. The Dynamic DPP is auto-regressive, and its particular parametrization is motivated by the dynamic Boltzmann machine (Dy BM) (Osogami and Otsuka 2015b; 2015a; Osogami 2017), which makes the parameters of the Boltzmann machine depend on historical patterns. Similar to the Dy BM and vector auto regression (L utkepohl 2005), our dynamic kernel is linear in its parameters and allows effective optimization via gradient-based approaches.

Dynamic Determinantal Point Processes We study a determinantal point process (DPP) whose kernel can vary over time. Let L(t) be the kernel of the DPP at time t. We assume that the ground set stays unchanged over time and let M be the size of the ground set. Then, for each t, L(t) is an M M positive semi-deﬁnite matrix indexed by the items in the ground set. The probability that a subset X is selected from a ground set at time t is given by

Pt(X) = det(L(t)X ) det(L(t) + I), (1)

where L(t)X denotes the principal submatrix of L(t) indexed by the items in X. Similar to Gartrell et al. (2017), we use a low-rank factorization of L(t):

L(t) = V(t) V(t) , (2) where V(t) is an M K matrix for K M. Then, by Sylvester s determinant identity, we can work with the dual representation:

Pt(X) = det(L(t)X ) det(C(t) + I), (3)

where C(t) is a K K matrix and can be computed from V(t) in O(K2 M) time:

C(t) V(t) V(t). (4) We deﬁne the Dynamic DPP in a way that V(t) can depend on the subset selected by time t 1 in the following speciﬁc manner. Let x(s) denote an indicator column-vector of length M that represents the subset selected (sampled) at time s. Speciﬁcally, the i-th element of x(s) is 1 iff the i-th item is selected (is in the selected subset) at time s. For each t, we then let

d=1 x(t d) w(d) , (5)

where B is an M K matrix, and w(d) is a column-vector of length K. The Dynamic DPP thus has parameters Θ (B, w(1), . . . , w(D)). If no items are selected during the period from t D to t 1, we have L(t) = B B , which may be understood as the baseline kernel. When D = 0, we have V(t) = B for any t, and the Dynamic DPP is reduced to the (static) DPP. With our parametrization, the kernel can be represented as follows:

L(t) = B B +

d=1 B w(d) x(t d)

d=1 x(t d) w(d) B

d =1 x(t d) w(d) w(d ) x(t d ) . (6)

To shed light on this parametrization, let Bm,: be the mth row of B (i.e., a row-vector of length K) for each m. Observe that the (i, j)-th element of L(t) is given by

Li,j(t) = Bi,: (Bj,:) +

d=1 Bi,: w(d) xj(t d)

d=1 xi(t d) w(d) (Bj,:)

d =1 xi(t d) w(d) w(d ) xj(t d ), (7)

where xi(t) denotes the i-th element of x(t). Notice that Li,j(t) represents the product of the relevance of item i, the relevance of j, and the similarity between the two items. Recall that x(t) is binary, and 1 represents that the corresponding item is selected at time t. If neither i nor j are selected during the period from t D to t 1, we have Li,j(t) = Bi,: (Bj,:) . If j is selected at t d, then Li,j(t) is increased by Bi,: w(d). If i is selected at t d, then Li,j(t) is increased by w(d) (Bj,:) . If i is selected at t d and j is selected at t d , then Li,j(t) is increased by w(d) w(d ). Hence, i and j become more or less likely to be selected individually or together, when one or both of these items have been selected in the last D steps. The precise impact is given by the parameters w(1), . . . , w(D), and B. We will learn those parameters in a way that it best explains a given sequence of subsets.

Learning Dynamic DPPs

We seek to learn Θ in a way that it maximizes the loglikelihood of given series of subsets. The training dataset can consist of a set of sequences of subsets. For notational simplicity, we assume a single sequence of subsets in this section, but extension to multiple sequences is trivial. Let (X1, . . . , XT ) be the training dataset, where Xt is the subset selected at time t.

The log-likelihood of the training dataset is

f(Θ) log P((X1, . . . , XT )) =

t=1 log Pt(Xt | X t 1),

where Pt(Xt | X t 1) denotes the conditional probability of selecting Xt at time t given that X t 1 (X1, . . . , Xt 1) are selected by time t 1. We deﬁne

ft(Θ) log Pt(Xt | X t 1), (9)

so that f(Θ) =

t ft(Θ) and f(Θ) =

t ft(Θ), where the gradient is with respect to Θ, and the summation is from t = 1 to t = T. Similar to Equation (5) of Gartrell et al. (2017), we can represent ft(Θ) as follows:

ft(Θ) = log det(L(t)Xt) log det(C(t) + I) (10)

= tr (L(t)Xt) 1 L(t)Xt tr (C(t) + I) 1 C(t) ,

where, for matrices Y and Z, one should understand tr(Y Z) as the matrix, indexed by Θ, whose entry is tr(Y Z

θ ) for θ Θ. We will also use tr(Y SZ) to denote the matrix whose entry is tr(Y Z

θ ) for θ S for a matrix S consisting of a subset of the parameters in Θ. Unfortunately, the inverse in the ﬁrst term of (11) might not exist, which was not addressed in Gartrell et al. (2017). Here, we derive an alternative expression using a pseudo-inverse. For a matrix Y, the partial derivative of log det(Y Y ) with respect to θ can be represented as

log det(Y Y )

θ log det(Y Y )

θ 2 (Y+)i,j (12)

θ Y+ , (13)

where Y+ denotes the pseudo-inverse of Y. Because L(t)Xt = V(t)Xt,: (V(t)Xt,:) , we can write (10) as

ft(Θ) = 2 tr V(t)Xt,: V(t)Xt,: +

tr C(t) + I 1 C(t) . (14)

The following theorem shows speciﬁc forms of the gradients of ft(Θ) with respect to each parameter of Θ = (B, w(1), . . . , w(D)). The proof of the theorem is postponed to the appendix.

Theorem 1. Consider a determinantal point process with a dynamic kernel L(t):

Pt(Xt | X t 1) = det(L(t)Xt) det(C(t) + I), (15)

Algorithm 1 A learning algorithm for Dynamic DPPs.

1: input: X1, . . . , XT 2: Initialize B, W 3: repeat 4: for t = 1 to T do 5: J [T t + 1, T t + D] 6: V B + X[ : , J ] W

7: R V (V V + I) 1

8: A V[Xt, : ]+

9: ΔB(t) R 10: ΔB(t)[Xt, : ] ΔB[Xt, : ] + A

11: ΔW(t) A X[Xt, J ] R X[ : , J ] 12: end for 13: B B + η T t=1 ΔB(t)

14: W W + η T t=1 ΔW(t) 15: until a stopping condition is met 16: return: B, W

where L(t) = V(t) V(t) , C(t) = V(t) V(t),

d=1 x(t d) w(d) (16)

and L(t)Xt denotes the principal submatrix of L(t) indexed by the items in Xt. Then the gradient of

ft(Θ) log Pt(Xt | X t 1) (17)

is given by

BXt,:ft(Θ) = 2 A 2 V(t) A

B Xt,:ft(Θ) = 2 V(t) A Xt (19)

w(δ)ft(Θ) = 2 A x(t δ)Xt 2 A V(t) x(t δ) (20)

for δ [1, D], where BXt,: denotes the submatrix consisting of rows of B indexed by the items in Xt, V(t)Xt,: is deﬁned analogously, Xt denote the items not in Xt,

A V(t)Xt,: + , and A C(t) + I 1 . (21)

Theorem 1 leads to a gradient-based learning algorithm shown in Algorithm 1. Here, we deﬁne W to be the K D matrix consisting of w( ) and X to be the M (T + D 1) matrix consisting of x( ), except x(T), in the reverse order:

W (w(1), . . . , w(D)) (22) X (x(T 1), . . . , x(1), 0, . . . , 0). (23)

The whole X may be prepared in the beginning, or its submatrices may be constructed when they are needed. For clarity, we use the notation such as B[Xt, : ] to denote the submatrix consisting of rows of B indexed by the items in Xt. Algorithm 1 computes the gradients given in Theorem 1 in the for-loop and applies them in Steps 13-14 to update B and W. For simplicity, we omit the factor of 2, which is common in (18)-(20). Hence, η in Steps 13-14 may be understood as the learning rate multiplied by 2.

In each repetition of the for-loop, J deﬁnes the set of columns of X to be used in that repetition. Notice that

X[ : , J ] = (x(t 1), . . . , x(t D)). (24)

Then V in Step 6 corresponds to V(t), and R in Step 7 corresponds to V(t) A. The Dynamic DPP has hyperparameters K and D that needs to be set appropriately with validation. When the size of a subset X is greater than K, the probability Pt(X) must be zero. We therefore need to set K at least as large as the maximum size κ of the subsets in the training data. We thus assume K κ. Notice that one could also work with the probability of the complement of the selected subset. Then we can set K M κ, where κ is the minimum size of the selected subset (i.e.. M κ is the maximum size of the complement of the selected subset). Algorithm 1 has the following computational complexity per iteration of the repeat-loop: Theorem 2. Each iteration of Algorithm 1 takes O(T (M D K + K2 M)) time.

Proof. Step 6 involves a multiplication of an M D matrix and a D K matrix, taking O(M D K) time. Step 7 involves a multiplication of a K M matrix and an M K matrix, an inversion of a K K matrix, and a multiplication of an M K matrix and a K K matrix, taking O(K2 M) time. Step 8 involves a pseudo-inverse of a κ K matrix, where κ is the size of Xt. Step 8 thus takes O(κ2 K) time. Step 11 involves a multiplication of a K κ matrix and a κ D matrix as well as a multiplication of a K M matrix and an M D matrix, taking O(K M D) time. The theorem now follows because we need to set K [κ, M].

One can also consider a learning algorithm based on stochastic gradients for Dynamic DPPs. Such an algorithm has the following computational complexity per step: Corollary 1. A stochastic update for the Dynamic DPP takes O(M D K + K2 M) time. When D = 0 in Algorithm 1, we have a learning algorithm for the (static) DPP (see Algorithm 2). Because the kernel is static, we can place Step 4 of Algorithm 2 out of the for-loop, unlike the R in Step 7 of Algorithm 1 that needs to be computed for each t. As a result, Algorithm 2 has the following computational complexity per iteration. Theorem 3. Each iteration of Algorithm 2 takes O(T κ2 K + K2 M) time, where κ is the maximum size of Xt for t [1, T].

Proof. Step 4 involves a multiplication of a K M matrix and an M K matrix, inversion of a K K matrix, and a multiplication of an M K matrix and a K K matrix, taking O(K2 M) time. Step 6 involves a pseudo-inverse of an at most κ K matrix, taking O(κ2 K) time.

Algorithm 2 thus has a smaller per-iteration computational complexity than the O(T κ3 + K M 2) time algorithm for the low-rank DPP in Gartrell et al. (2017) as long as K = O(κ). However, notice that the O(κ2 K) cost stems from pseudo-inverse. The use of inverse as in Gartrell et

Algorithm 2 A learning algorithm for DPPs.

1: input: X1, . . . , XT 2: Initialize B 3: repeat 4: ΔB T B (B B + I) 1

5: for t = 1 to T do 6: ΔB[Xt, : ] ΔB[Xt, : ] + (B[Xt, : ]+)

7: end for 8: B B + η ΔB 9: until a stopping condition is met 10: return: B

al. (2017) would reduce the computational complexity to O(T κ3 + K2 M). Also, observe that Algorithm 2 is fully expressed in basic matrix operations and simpler than the one in Gartrell et al. (2017).

Rank-one Updates for Faster Computation When D = o(K), the bottleneck of Algorithm 1 is on the computation of A(t) (I + C(t)) 1, which needs to be computed for every t. This is in contrast to Algorithm 2 for DPPs, where the kernel is independent of t, and the corresponding quantity needs to be computed only once (Step 4). Recall that computation of A(t) requires O(M K2) time for computing C(t) and O(K3) time for inverting (I + C(t)). In this section, we will show that we can reduce the computational cost of this bottleneck when D = o(

M) by exploiting the fact that V(t + 1) differs from V(t) only by a small amount. Speciﬁcally, observe from (5) that V(t + 1) can be computed from V(t) recursively:

V(t + 1) = V(t) +

x(t + 1 d) x(t d) w(d) .

Letting δ(t, d) x(t+1 d) x(t d), we can then write I + C(t + 1) recursively:

I + C(t + 1) = I + C(t) +

d=1 V(t) δ(t, d) w(d)

d=1 w(d) δ(t, d) V(t)

d =1 w(d) δ(t, d) δ(t, d ) w(d) .

Namely, once we have the value of A(t), the value of A(t + 1) can be obtained in O(D2 K2) time, avoiding computation from scratch, by combining A(t) with rank-one updates suggested by the following Sherman-Morrison lemma: Lemma 1 (Sherman-Morrison). Let Y be an invertible K K matrix. Let u and v be column vectors of dimension K. Then, Y + u v 1 = Y 1 Y 1 u v Y 1

1 + v Y 1 u .

Algorithm 3 Step 7 of Algorithm 1 with rank-one updates.

1: if t = 1 then 2: A(t) (V V + I) 1

3: else 4: Compute A(t + 1) from A(t) by applying the rankone update for D2 + 2 D times 5: end if 6: R V A(t)

This lemma implies that Y + u v can be inverted in O(K2) time if Y 1 is known. Step 7 of Algorithm 1 can thus be modiﬁed into Algorithm 3, which leads to the following computational complexity: Theorem 4. Each iteration of Algorithm 1 takes O(T (M D K + K2 D2) + K2 M) time when Step 7 is replaced with Algorithm 3. A particularly attractive case from computational perspectives is when D = 1. Then we only need three rank-one updates to compute A(t + 1) from A(t). There are, however, other cases where one can enjoy this reduced computational cost. For example, when w w(1) = w(2) = . . . = w(D), we can write

V(t + 1) = V(t) + (x(t) x(t D)) w . (27)

In this case, V(t) depends on x(s) for s < t only through D d=1 x(t d), which represents the number of selection during the last D steps for each item. Another example is when w(d) = λ w(d 1) for each d for some 0 < λ < 1. With w = w(1), we then have

V(t + 1) = V(t) + y w , (28)

y x(t) (1 λ)

d=1 λd 1x(t d) λD 1x(t D).

In this case, the effect of the selection in previous steps diminishes by a factor of λ at every time step. After we learn the parameters of a Dynamic DPP, we would want to compute a sequence of the probabilities given by (3) for a given sequence of subsets. This would involve computing a sequence of the determinants. In particular, the denominator of (3) is the determinant of C(t), which needs to be computed from the M K matrix V(t):

det I + C(t) = det I + V(t) V(t) . (30)

Thus, the computational time for each step of t is O(K2 M). The following matrix determinant lemma, however, shows that this determinant can be updated in O(D2 K2) time, similar to the computation of A(t + 1) from A(t). Lemma 2 (Matrix determinant). Let Y be an invertible K K matrix. Let u and v be column vectors of dimension K. Then we have

det Y + uv = 1 + v Y 1u det (Y) . (31)

Numerical Experiments Here, we apply the proposed learning algorithms to four music datasets (JSB Chorales, Nottingham, Piano-midi, and Muse Data), which have been used in Boulanger Lewandowski et al. (2012). Each dataset consists of a set of sequences of chords. A chord is a subset of notes (items) that are selected from the ground set of M = 88 notes1, corresponding to the keys of the piano. A sequence of the chords forms a tune from a particular category of music associated with each dataset. The dataset is divided into training, validation, and test data. In the experiment, we ﬁrst apply Algorithm 2 to learn the B of a DPP, which will be used as the baseline. We then apply Algorithm 1 to learn the B and W of a Dynamic DPP, where we use the B trained for the DPP as the initial values of B. In Algorithm 1 and Algorithm 2, we choose the step size η via a relatively simple approach of backtracking line search, loosely following Armijo (1966)2. We stop the iteration when no step size signiﬁcantly increases the objective function or when parameters are updated 100 times. Because the training data consists of multiple sequences, the for-loop in each of the algorithms is now over all of the steps in all of the sequences in the training data. Figure 1 shows the log-likelihood given by the trained models on test data. The horizontal axis represents the lag D in Equation (5), and D = 0 corresponds to the DPP (baseline). Each curve in the ﬁgure shows the results with a particular value of rank K as indicated in the legend, where K0 is the maximum number of notes that constitute a chord (i.e., maximum size of selected subsets) in training or validation data for each dataset. For JSB Chorales, the choice of K = K0 = 4 results in low log-likelihood of below 10.0 for any D under consideration and is not shown in the ﬁgure (K = 4 K0 is shown instead). Note that JSB Chorales has a particularly small value of K0 = 4, compared to the other datasets. We can observe that the Dynamic DPP (D > 0) consistently outperforms the baseline for any K and D, which suggests the soundness of our learning algorithm for the Dynamic DPP. Particularly signiﬁcant improvement is observed when we increase D = 0 to D = 1. Increasing K from K = K0 does not signiﬁcantly increase log-likelihood. This means that the low-rank assumption can speed up learning without signiﬁcantly sacriﬁcing the quality of learning. In fact, we ﬁnd that estimated kernels tend to have K0 dominating singular values. We implemented our learning algorithms with Python and measured the computational time on a machine running

1We ignore the notes that appear neither in training data nor in validation data. The effective size of the ground set is thus reduced from M = 88 to M = 51 for JSB Chorales, M = 57 for Nottingham, and M = 78 for Muse Data (Piano-midi uses all M = 88 notes). This preprocess reduces the computational complexity, but we ﬁnd qualitatively similar results without the preprocess. 2Speciﬁcally, we search the largest step size η that increases the objective value from the candidate step sizes that can be represented as η = 10 n for a nonnegative integer n [nmin, 9], where nmin = 3. When the largest step size is used to update parameters, we decrease nmin by one to allow larger step sizes.

Ubuntu 16.04 with 4.0 GHz Intel Core i7-6700K CPU and 48 GB Memory. For example, for the JSB Chorales dataset, the average time per iteration for learning B and W of a Dynamic DPP with D = 3 (from Line 4 to Line 14 of Algorithm 1) varies from 2.3 seconds for K = K0 = 4 to 2.8 seconds for K = 3 K0 = 12. The corresponding average time per iteration for learning B of a DPP (from Line 4 to Line 8 of Algorithm 2) varies from 1.3 seconds for K = K0 to 1.6 seconds for K = 3 K0.

Conclusion We have proposed the Dynamic DPP, where the kernel of a DPP can change over time in an auto-regressive manner. The particular structure of the Dynamic DPP allows learning via (stochastic) gradient-based methods. A potential bottleneck of such a learning algorithm is in inverting the dynamic kernel at every step. We have shown, however, that the use of dual representation and rank-one updates can substantially reduce the computational complexity. In this paper, we have primarily provided theoretical support on the proposed approach. Although we have also demonstrated soundness and validity of the proposed algorithms through numerical experiments, we have not yet investigated their full capabilities. Important future work is in advancing practical aspects of the proposed approach and in customizing it for particular applications.

Appendix: Proof of Theorem 1 We prove the theorem by establishing the following two lemmas, which respectively derive computationally convenient form of the two terms of (14). These two lemmas immediately imply Theorem 1. Lemma 3. Let

V(t)Xt,: BXt,: +

d=1 x(t d)Xt w(d) , (32)

and let Xt denote the items not in Xt. Then we have

tr BXt,:V(t)Xt,: (V(t)Xt,:)+ = ((V(t)Xt,:)+) (33)

tr B Xt,:V(t)Xt,: (V(t)Xt,:)+ = 0 (34)

tr w(δ)V(t)Xt,: (V(t)Xt,:)+ = (V(t)Xt,:)+ x(t δ)Xt (35)

for δ [1, D]. Lemma 4. Let C(t) = V(t) V(t), where

d=1 x(t d) w(d) , (36)

and let A C(t) + I 1, where I is the K K identity matrix. Then we have

tr C(t) + I 1 BC(t) = 2 V(t) A (37)

tr C(t) + I 1 w(δ)C(t) = 2 A V(t) x(t δ) (38)

for δ [1, D].

Test log likelihood

K = 2 K0 K = 3 K0 K = 4 K0

(a) JSB Chorales (K0 = 4)

Test log likelihood

K = K0 K = 2 K0 K = 3 K0

(b) Nottingham (K0 = 9)

Test log likelihood

K = K0 K = 2 K0 K = 3 K0

(c) Piano-midi (K0 = 12)

Test log likelihood

K = K0 K = 2 K0 K = 3 K0

(d) Muse Data (K0 = 15)

Figure 1: Log-likelihood given by trained Dynamic DPPs on test data for the four music datasets. The rank K is varied across panels. The hyper-parameter D is varied along the horizontal axis, and D = 0 corresponds to the DPPs (baseline). The rank K is set as indicated in the legend, where K0 is the maximum number of notes that constitute a chord in training or validation data for each dataset.

Proof of Lemma 3 Let B denote the |Xt| K submatrix of B, where each row of B corresponds to the row of B indexed by an item in Xt. Let x(s) denote the subvector of x(s), where each element of x(s) corresponds to the element of x(s) indexed by an item in Xt. Namely, B BXt,: (39) x(s) x(s)Xt. (40)

Then we have

V(t)Xt,: = B +

d x(t d) w(d) . (41)

It is easy to see that

Bi,j V(t)Xt,: = ei,j (42)

where ei,j is matrix of order |Xt| K whose entries are zero except the (i, j)-th entry, which is one. Therefore, letting A (V(t)Xt,:)+, we have

tr Bi,j V(t)Xt,: A = tr ei,j A = Aj,i = ( A )i,j.

Hence we have

tr BV(t)Xt,: A = A . (44)

Likewise, we can obtain the gradient with respect to w(δ) by observing that

w(δ)j V(t)Xt,: = x(t δ) e j , (45)

where ej is the indicator column vector of length K with only j-th entry being one. Hence, we have

tr w(δ)j V(t)Xt,: A = tr x(t δ) e j A (46)

= tr e j A x(t δ) (47)

Therefore, we establish

tr w(δ)V(t)Xt,: A = A x(t δ). (49)

Proof of Lemma 4

To prove lemma 4, we will use the following lemma:

Lemma 5. Let Y be a K K symmetric matrix, and let Z be a M K matrix. Then

tr(Y (Z Z)) = 2tr((ZY) Z). (50)

tr(Y (Z Z)) = tr(Y ( Z) Z + Y Z Z) (51)

= tr(( Z) Z Y + (Z Y ) Z) (52)

= tr((Z Y) Z + (Z Y ) Z) (53)

= 2 tr((Z Y) Z). (54)

First observe that

Bi,j V(t) = ei,j, (55)

where ei,j is matrix of order M K whose entries are zero except the (i, j)-th entry, which is one. Because A is symmetric, we can use Lemma 5 to show

tr(A Bi,j(V(t) V(t))) = 2 tr((V(t) A) Bi,j V(t)) (56)

= 2 tr((V(t) A) ei,j) (57)

= 2 ((V(t) A) )j,i (58) = 2 (V(t) A)i,j, (59)

where the last equality holds because A is symmetric. Therefore, we have

tr(A B(V(t) V)) = 2 V(t) A (60)

Likewise, we have

w(δ)j V(t) = x(t δ) e j , (61)

where ej is the indicator column vector of length K whose only jth entry is 1. Again, because A is symmetric, we can use Lemma 5 to show

tr(A w(δ)j(V V)) = 2 tr((V(t) A) w(δ)j V(t)) (62)

= 2 tr((V(t) A) x(t δ)e j ) (63)

= 2 ((V(t) A) x(t δ))j (64)

= 2 (A V(t) x(t δ))j, (65)

where the last equality holds because A is symmetric. Therefore, we establish

tr(A w(δ)(V(t) V)) = 2 A V(t) x(t δ). (66)

Acknowledgments T. O. and R. R. are supported by JST CREST Grant Number JPMJCR1304, Japan. A. G. is fully supported by JICAFriendship Scholarship. T. S. is partially supported by JSPS Grant-in-Aid (26287019, 16H06338).

References Affandi, R. H.; Kulesza, A.; and Fox, E. 2012. Markov determinantal point processes. In Proceedings of the 28th Conference on Uncertainty in Artiﬁcial Intelligence, 26 35. Armijo, L. 1966. Minimization of functions having Lipschitz continuous ﬁrst partial derivatives. Paciﬁc Journal of Mathematics 16(1):1 3. Boulanger-Lewandowski, N.; Bengio, Y.; and Vincent, P. 2012. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), 1159 1166. Gartrell, M.; Paquet, U.; and Koenigstein, N. 2016. Bayesian low-rank determinantal point processes. In Proceedings of the 10th ACM Conference on Recommender Systems, 349 356. Gartrell, M.; Paquet, U.; and Koenigstein, N. 2017. Lowrank factorization of determinantal point processes. In Proceedings of the Thirty-First AAAI Conference on Artiﬁcial Intelligence (AAAI-17), 1912 1918. Gillenwater, J. A.; Kulesza, A.; Fox, E.; and Taskar, B. 2014. Expectation-maximization for learning determinantal point processes. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27. Curran Associates, Inc. 3149 3157. Gong, B.; Chao, W.-L.; Grauman, K.; and Sha, F. 2014. Diverse sequential subset selection for supervised video summarization. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27. Curran Associates, Inc. 2069 2077. Kulesza, A., and Taskar, B. 2012. Determinantal Point Processes for Machine Learning. Hanover, MA, USA: Now Publishers Inc. Kulesza, J. A. 2012. Learning with determinantal point processes. Ph.D. Dissertation, University of Pennsylvania.

L utkepohl, H. 2005. New Introduction to Multiple Time Series Analysis. Springer-Verlag Berlin Heidelberg. Macchi, O. 1975. The coincidence approach to stochastic point processes. Advances in Applied Probability 7(1):83 122. Mariet, Z., and Sra, S. 2015. Fixed-point algorithms for learning determinantal point process. In Proceedings of the 32nd International Conference on Machine Learning, 2389 2397. Mariet, Z., and Sra, S. 2016. Kronecker determinantal point processes. In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems 29. Curran Associates, Inc. 2694 2702. Osogami, T., and Otsuka, M. 2015a. Learning dynamic Boltzmann machines with spike-timing dependent plasticity. Technical Report RT0967, IBM Research. Osogami, T., and Otsuka, M. 2015b. Seven neurons memorizing sequences of alphabetical images via spike-timing dependent plasticity. Scientiﬁc Reports 5:14149. Osogami, T. 2017. Boltzmann machines for time-series. Technical Report RT0980, IBM Research - Tokyo. Qiao, M.; Xu, R. Y. D.; Bian, W.; and Tao, D. 2016. Fast sampling for time-varying determinantal point process. ACM Transactions on Knowledge Discovery from Data 11(Article No. 8). Shirai, T., and Takahashi, Y. 2003a. Random point ﬁelds associated with certain Fredholm determinants. I. Fermion, Poisson and boson point processes. Journal of Functional Analysis 205(2):414 463. Shirai, T., and Takahashi, Y. 2003b. Random point ﬁelds associated with certain Fredholm determinants. II. Fermion shifts and their ergodic and Gibbs properties. The Annals of Probability 31(3):1533 1564. Snoek, J.; Zemel, R.; and Adams, R. P. 2013. A determinantal point process latent variable model for inhibition in neural spiking data. In Burges, C. J. C.; Bottou, L.; Welling, M.; Ghahramani, Z.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 26. Curran Associates, Inc. 1932 1940. Soshnikov, A. 2000. Determinantal random point ﬁelds. Rossi ıskaya Akademiya Nauk. Moskovskoe Matematicheskoe Obshchestvo. Uspekhi Matematicheskikh Nauk 55(5(335)):107 160.