# dynamic_tensor_recommender_systems__a1eb481f.pdf

Journal of Machine Learning Research 22 (2021) 1-35 Submitted 9/19;Revised 12/20; Published 2/21

Dynamic Tensor Recommender Systems

Yanqing Zhang zhangyanqing@ynu.edu.cn Department of Statistics Yunnan University Kunming, 650504, China Xuan Bi xbi@umn.edu Carlson School of Management University of Minnesota Minneapolis, MN, 55455-0438, USA Niansheng Tang nstang@ynu.edu.cn Department of Statistics Yunnan University Kunming, 650504, China Annie Qu aqu2@uci.edu Department of Statistics University of California Irvine, CA, 92697-3425, USA

Editor: Animashree Anandkumar

Recommender systems have been extensively used by the entertainment industry, business marketing and the biomedical industry. In addition to its capacity of providing preference-based recommendations as an unsupervised learning methodology, it has been also proven useful in sales forecasting, product introduction and other production related businesses. Since some consumers and companies need a recommendation or prediction for future budget, labor and supply chain coordination, dynamic recommender systems for precise forecasting have become extremely necessary. In this article, we propose a new recommendation method, namely the dynamic tensor recommender system (DTRS), which aims particularly at forecasting future recommendation. The proposed method utilizes a tensor-valued function of time to integrate time and contextual information, and creates a time-varying coeﬃcient model for temporal tensor factorization through a polynomial spline approximation. Major advantages of the proposed method include competitive future recommendation predictions and eﬀective prediction interval estimations. In theory, we establish the convergence rate of the proposed tensor factorization and asymptotic normality of the spline coeﬃcient estimator. The proposed method is applied to simulations, IRI marketing data and Last.fm data. Numerical studies demonstrate that the proposed method outperforms existing methods in terms of future time forecasting.

Keywords: Contextual information, Dynamic recommender systems, Polynomial spline approximation, Prediction interval, Product sales forecasting

c 2021 Yanqing Zhang, Xuan Bi, Niansheng Tang and Annie Qu.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/19-792.html.

Zhang, Bi, Tang and Qu

1. Introduction

Recommender systems (RS) are widely used in our daily lives, such as for selecting movies, restaurants, news articles, or online shopping. As one of the information ﬁltering techniques, RS can help users to ﬁnd interesting items through combining several information sources, e.g., users ratings and purchasing histories, item proﬁles and sales volumes, time, location, and companion or promotion strategies. Particularly, incorporating time is useful in RS since users purchase behaviors are dynamic and often highly dependent on seasonal and time factors, and business sectors also rely on dynamic recommendations to track users changing purchase interests over time. Thus, it is essential to capture information related to time and develop time-dependent RS, and we refer this as dynamic RS (DRS). However, developing competitive DRS brings new challenges. First, since data are streaming in over time and are time-dependent, general RS methods which are not capable of capturing time-dependency features may have reduced recommendation accuracy. Second, forecasting future recommendations accurately is also a great challenge for DRS due to the complexity of changing users interests. For example, users might like to watch news on weekdays, but watch movies on weekends. A shoe store sells more sandals in summer and more snow boots in winter. It is important to borrow information from historical data in developing trends. Many RS methods are not designed to capture trends and predict future recommendations. In addition, as data are streaming in over time, future recommendations could involve new users or new items, whose information is not available from historical data. This is also a common problem encountered in RS, referred as the cold start problem. General RS approaches include content-based ﬁltering and collaborative ﬁltering (CF). Traditionally, content-based ﬁltering methods recommend similar types of items by matching a user s preferred item proﬁle with current item s proﬁle (e.g., Salter and Antonopoulos, 2006; Son and Kim, 2017). In contrast, CF methods recommend items by predicting item ratings for the active user based on ratings from other similar users (e.g., Herlocker et al., 2004; Luo et al., 2012). On the basis of CF methods, research work related to DRS have been developed in recent years (e.g., Koren, 2009; Gultekin and Paisley, 2014; Yu et al., 2016; Wu et al., 2017; Guo et al., 2018; Xiong et al., 2010; Rafailidis and Nanopoulos, 2014; Bi et al., 2018; Wu et al., 2019). However, most of these methods can only make recommendations for observed discrete time points, and are not designed for future recommendation prediction on unobserved time points. Liao et al. (2018) constructed dynamic tensors by means of combining tensors in tensor stream. Song et al. (2019) used temporal matrix factorization to construct temporal recommender model assuming that users current interests are transformed from the previous time step with a Markov property. Liu and Ye (2020) proposed a dynamic three-way granularity recommendation based on matrix factorization. However, neither of these methods can handle higher-orders tensors. Moreover, these methods cannot make recommendations for future time points. To make future recommendations, Yu et al. (2016) developed a CF method incorporating a time series model, and Wu et al. (2017, 2019) proposed CF methods incorporating long short-term memory modeling, but they cannot deal with new users, items or contextual variables. Xiong et al. (2010) used a Bayesian estimation procedure with a time-dependent constraint to estimate DRS for new users and items but cannot deal with new time points. Bi et al. (2018) created an additional layer of nested latent factors for new time points,

Dynamic Tensor Recommender Systems

users and items. However, Xiong et al. (2010) and Bi et al. (2018) can only estimate the components of a tensor at ﬁxed time points instead of at any time point in a continuous time interval. In addition, for forecasting at future time points, their methods may involve an increasing number of parameters if time is treated as an additional tensor mode, which could be computationally costly.

Currently, there are several dynamic recommender systems based on neural network approaches. For example, Ko et al. (2016) used Gated Recurrent Units (GRUs) to build collaborative sequence model. Devooght and Bersini (2017) utilized a long short-term memory (LSTM) method to address changes in the interests of a user. Wei et al. (2017) utilized the stacked denoising autoencoder (SDAEs) to extract features of items. Livne et al. (2019) applied a LSTM encoder-decoder network on sequences of contextual information. However, none of these methods are able to accommodate contextual information, and solve the cold-start problem simultaneously. Some of these methods may have obvious hysteresis in forecasting, which could inﬂuence the accuracy of recommendation.

In this article, we propose a tensor-valued function of time for estimating the DRS and build a new time-varying coeﬃcient model based on tensor canonical polyadic decomposition (CPD) framework; namely, the dynamic tensor recommender system (DTRS). Speciﬁcally, we introduce a tensor-valued function of time with each mode corresponding to user, item or a contextual variable, where each component of the tensor is a function of time and has intra-cluster correlation. In the CPD framework, we build a time-varying coeﬃcient model incorporating group information of time points, users, items and contexts. We approximate each coeﬃcient function by a polynomial spline and employ group factors to explore homogeneous group eﬀects. We adopt the weighted least square approach to incorporate intra-cluster correlation for more eﬃcient estimation. In addition, we construct the prediction intervals of estimators of tensor components to forecast the conﬁdence range of predicted values. In theory, we establish the convergence rate of the proposed tensor factorization and the asymptotic property of the spline parametric estimator.

The proposed method has two signiﬁcant contributions. First, it can eﬀectively provide recommendations for an entire future interval as opposed to a series of limited time points. This is because the proposed method integrates time dependency feature to the dynamic recommender systems using the time-varying coeﬃcient model in tensor factorization to capture dynamic trends of recommender systems. In addition, the proposed method can achieve accurate forecasts for long time period through the spline extrapolation technique. Furthermore, the proposed subgroup factors extract homogeneous information from the same group to provide recommendation forecasting for future time points, and consequently solves the cold start problem.

Second, we establish the asymptotic distribution of the proposed estimators in that statistical inferences such as prediction interval can be formulated. In practice, it is desirable to know the upper and lower bounds for predictions, e.g., the highest possible cost, or the future sales volumes or revenues in the worst case scenario. However, existing methods on prediction intervals are mostly univariate or multivariate time series, and the prediction intervals for user-item-context interactions under a tensor framework have not been developed. In contrast, our approach allows prediction intervals for each element of a tensor-valued function, which provides a more complete picture of the dynamic recommender system over

Zhang, Bi, Tang and Qu

time. Our numerical studies also demonstrate that the proposed approach provides eﬀective prediction interval estimators.

The remainder of the paper is organized as follows. Section 2 introduces the notation and background on tensor and tensor factorization. Section 3 presents the proposed method and its implementation. Theoretical properties are derived in Section 4. Section 5 presents simulation studies to assess the performance of the proposed approach. In Section 6, we apply the proposed method to the IRI marketing data and Last.fm data. Concluding remarks and discussion are provided in Section 7.

2. Notation and Background

In this section, we introduce some notation and the background of the tensor and classical DRSs. Throughout this article, we use blackboard capital letters for sets, e.g., T, I, small letters for scalars, e.g., x, y R, bold small letters for vectors, e.g., x, y Rn, bold capital letters for matrices, e.g., X, Y Rn1 n2, and Euler script fonts for tensors, e.g., X, Y Rn1 n2 nd (d > 2).

A dth-order tensor is an array with d dimensions (d > 2), which is an extension of a matrix to higher order. Here d represents the tensor s order. We denote the component (i1, i2, , id) of a dth-order tensor Y by yi1i2 id, where ik = 1, 2, . . . , nk, and k is called a mode of the tensor (k = 1, 2, . . . , d ). In particular, a tensor Y is called a rank-one tensor if it can be written as Y = p1 p2 pd, where the symbol represents the vector outer product, and pk = (pk 1, pk 2, , pk nk) is a nk-dimensional latent factor corresponding to the kth mode. That is, each component of the tensor is the product of the corresponding vector components: yi1i2 id = p1 i1p2 i2 pd id.

Figure 1: Illustration of factorizations of a matrix and a third-order tensor. (a) factorization of a matrix into r rank-1 matrices, (b) CPD of a third-order tensor into r rank-1 tensors.

Dynamic Tensor Recommender Systems

The canonical polyadic decomposition (CPD) is commonly adopted in tensor decomposition, which decomposes a tensor as a sum of r rank-one tensors. That is:

j=1 p1 j p2 j pd j,

where pk j = (pk 1j, , pk nkj) is a nk-dimensional latent factor corresponding to the kth mode for k = 1, . . . , d; j = 1, . . . , r. Equivalently, each component of Y is

j=1 p1 i1jp2 i2j pd idj.

The CPD can be considered to be a higher-order generalization of matrix factorisation. Figure 1 illustrates a matrix factorization of a matrix and a CPD of a third-order tensor. An extensive review of tensors and other forms of tensor decomposition are discussed in Kolda and Bader (2009). Let Pk = (pk 1, pk 2, . . . , pk r)nk r and θ = {P1, P2, , Pd}. We can estimate θ via minimizing a loss function (e.g., L2 loss). However, the non-convexity of the loss function could impose computational complexity due to numerical instability or even non-convergence (de Silva and Lim, 2008; Frolov and Oseledets, 2017). A common approach to alleviate the non-convexity problem is to introduce regularization. That is, an objective function with regularization as the following:

L(θ|Y) = Q(Y, θ) + J(θ),

where Q is a loss function and J is a penalty function, such as L2, L1 or L0 penalties, or a fused Lasso. Specially, the optimization problem solves θ = arg min L(θ|Y), where θ deﬁnes an optimal set of model parameters. In the case of squared loss function with an L2-penalty, the objective function is

(i1,i2,...,id) Ω (yi1i2 id

j=1 p1 i1jp2 i2j pd idj)2 + λ

k=1 Pk 2 F ,

where F represents the Frobenius norm, and Ω= {(i1, i2, . . . , id) : yi1i2...id is observed} is a set of indices corresponding to the observed components. Notice that, in the context of RS, the set Ωmay not contain all indices of the tensor components and could be a small fraction of the entire tensor size, since the majority of the tensor components could be missing. Major algorithms for implementing the optimization problem include the cyclic coordinate descent algorithm, the stochastic gradient descent method and the maximum block improvement algorithm (Chen et al., 2012). Following the tensor techniques, the classical DRSs can incorporate time as an additional mode of a tensor, that is, Y Rn1 n2 nd T , where the last mode is a time mode at ﬁxed time points {t1, t2, , t T }. The classical DRSs use CPD to obtain component estimators, that is,

j=1 p1 i1jp2 i2j pd idjqtj for t = t1, t2, , t T ,

Zhang, Bi, Tang and Qu

Figure 2: Third-order tensor-valued process.

where q j = (qt1j, , qt T j) is a T-dimensional latent factor corresponding to the time mode. However, the classical DRSs can only estimate the values yi1i2 idt at ﬁxed time points t. If one needs to estimate the values yi1i2 idt for t (ti, ti+1), where i {1, 2, , T}, the classical DRSs are not applicable. Moreover, if one needs to forecast the values yi1i2 idt for t > t T , the classical DRSs need to extend the time mode to future time points. However, this involves an increasing number of parameters over time which could be computationally infeasible. In addition, the classical methods only focus on the estimations of the tensor components but do not provide statistical inference, e.g., the estimation of prediction intervals. In practice, providing the upper and lower bounds of predictions are also crucial in decision making. In the following, we pursue an alternative approach to solve this problem.

3. The Proposed Method

3.1 General Methodology

In this subsection, we develop the methodology for the proposed DTRS method. Speciﬁcally, we adopt the ideas of time-varying coeﬃcient model framework to generalize the CPD to capture the trends of the DRS, and classify time points into subgroups to infer new time point trends through existing time points of the same group. We consider a dth-order tensor-valued function Y(t) Rn1 n1 ... nd, where the value at time t is a d-dimensional array. The tensor set Y = {Y(t) : t T} is the corresponding stochastic process deﬁned on a compact interval T. Without loss of generality, let T be a closed interval [0, 1]. Figure 2 illustrates an example of a tensor-valued process with d = 3. In the DRS, the tensor-valued process could be the rating or sale volume of items or products from users or stores given contexts. We assume that time points can be categorized into diﬀerent subgroups, where time points of the same group have common information. For example, in our numerical studies, time points in the same month from the twelve months of each year are categorized in the same group. In addition to time, we also categorize subjects from other modes into subgroups if they share similar characteristics, for example, stores of the same market and products of the same product category. Given the subgroup labels, we assume that each component of Y(t) can be estimated:

yi1i2 id(t)

j=1 hj(t)p1 i1jp2 i2j pd idj + g(t)q1 i1q2 i2 qd id, (1)

where pk ikj and qk ik are the jth latent factor and the subgroup factor for the ikth subject from the kth mode, respectively. Here, hj(t) is a trend function of time for j, and g(t) = Pmd+1 e=1 ge(t)I(t se), where I(t se) is an indicator function and assigns the interval se on

Dynamic Tensor Recommender Systems

the eth subgroup, ge(t) is a trend function corresponding to the eth subgroup, and md+1 is the number of subgroups for time. We have qk ik = qk i k = qk (ek) if the ikth and i kth subjects

are from the ekth subgroup (ek = 1, 2, . . . , mk), where qk (ek) is the subgroup factor associated with the ekth subgroup, and mk is the number of subgroups for the kth mode. We denote the set of observed time points for the component yi1i2 id(t) by Ti1i2...id, and the number of components of this set by |Ti1i2...id|. Let yi1i2...id = {yi1i2 id(t)}t Ti1i2...id. We assume that the covariance matrix is cov(yi1i2...id) = Σ0 i1i2...id, typically not an identity matrix due to the intra-cluster correlation arising from repeated observed data. Equation (1) adopts the idea of varying-coeﬃcient models to create a CPD for tensor data. Varying-coeﬃcient models are a useful tool to explore dynamic patterns, and have been applied to modeling and predicting longitudinal, functional and time series data (Huang and Shen, 2004; Fan and Zhang, 2008). Based on the varying-coeﬃcient models, through the equations (1), we can obtain estimators of the component of tensor-value function at any time points in a continuous time interval (e.g., t (a, b)) instead of at ﬁxed time points as in the DRS approaches (e.g., Xiong et al., 2010; Bi et al., 2018). The ﬁrst part of equation (1) is an individual-level factor model which takes into account the heterogeneity of subjects and trend of time, and the time-varying coeﬃcients hj(t) (j = 1, . . . , r) reﬂect the dynamic features. The second part of equation (1) is a subgroup-level factor model to capture common features from the same subgroups, where the subgroup factors can accommodate new subjects from any mode at future time points, and the g(t) allows time variables to follow a subgroup function of time such that we can predict future time points via borrowing information from existing time points of the same group. To capture these trend functions, we adopt the polynomial splines to approximate hj(t) and ge(t). Let {νji}a N i=1 be interior knots within T, and Υj be a partition of T with a N knots, that is Υj = {0 = νj0 < νj1 < < νja N < νja N+1 = 1} for j = 1, 2, . . . , d. The polynomial splines of an order κ+1 are functions with κ-degree of polynomials on intervals [νji 1, νji) for i = 1, 2, . . . , a N and [νja N , νja N+1], and have κ 1 continuous derivatives globally. Denote a spline bases vector of the space of such spline functions as Bj(t) = (Bj1(t), . . . , Bj M(t)) , where M = a N + κ + 1 as the number of spline bases. The function hj(t) (j = 1, 2, . . . , d) can be approximated by

ˆhj(t) = PM i=1 αji Bji(t) = α j Bj(t),

where αj = (αj1, αj2, . . . , αj M) is a coeﬃcient vector. Spline functions can be B-spline or truncated polynomial functions. For example, for the truncated polynomial function, Bj(t) = (1, t, . . . , tκ, (t νj1)κ +, . . . , (t νja N )κ +) , and the (t ν)+ is t ν if t > ν and 0 otherwise. Similarly, let {ωei}a N i=1 be interior knots within T, Γe = {0 = ωe0 < ωe1 < < ωea N < ωea N+1 = 1}, and Ae(t) = (Ae1(t), . . . , Ae M(t)) be a vector of spline bases for e = 1, 2, . . . , md+1. The ge(t) can be approximated by

i=1 βei Aei(t) = β e Ae(t),

Zhang, Bi, Tang and Qu

where βe = (βe1, βe2, . . . , βe M) . Based on equation (1), the prediction can be obtain as follows

ˆyi1i2 id(t) =

j=1 ˆhj(t)p1 i1jp2 i2j pd idj + ˆg(t)q1 i1q2 i2 qd id, (2)

where ˆg(t) = Pmd+1 e=1 ˆge(t)I(t se). The equation (2) can capture trends of the DRS suﬃciently through the polynomial spline approximations of time-varying coeﬃcient functions. In addition, since the spline approximation is computationally fast (Xue and Yang, 2006), the equation (2) can achieve the spline estimates of the coeﬃcients eﬃciently, and this is especially advantageous in estimating high-dimensional parameters in RS. In contrast to these approaches like Xiong et al. (2010) and Bi et al. (2018), equation (2) can achieve forecasting at any future time points without requiring an increasing number of parameters over time. Note that the proposed method does not require the same number of knots and the same degree polynomial for either trend functions. In order to reduce the computational cost, we ﬁxed the same numbers of knots and the same degree polynomial. We can also adopt diﬀerent number of knots or diﬀerent degree polynomial for diﬀerent trend functions g(t) and hj(t) respectively, or apply existing methods (Van Loock et al., 2011; Yuan et al., 2013; Dung and Tjahjowidodo, 2017) to identify the number of knots. Due to the intra-cluster correlation, it is important to incorporate intra-cluster correlation into RS. However, in practice, the covariance matrix Σ0 i1i2...id is often unknown. We adopt an invertible working covariance matrix, denoted as Σi1i2...id, to take into account the intra-cluster correlation. Let P = (P1 , , Pd ), q = (q(1) , , q(d) ) , α = (α 1, . . . , α r) , β = (β 1, . . . , β md+1) , and γ = (α , β ) , where Pk = (pk 1, . . . , pk r), pk j = (pk 1j, , pk nkj) , q(k) = (qk (1), . . . , qk (mk)) , and k = 1, . . . , d. Deﬁne θ = {P, q, γ} as parameters of interest. Considering the intra-cluster correlation and non-convexity problem, we deﬁne the following weighted penalized objective function:

(i1,i2, ,id) Ω (yi1i2...id byi1i2...id) Σ 1 i1i2...id(yi1i2...id byi1i2...id)+λ( P 2 F + q 2 2+ γ 2 2),

(3) where λ is the penalized parameter, Ω= {(i1, i2, . . . , id) : yi1i2 id(t) is observed at some t}, 2 is the Euclidean norm, and byi1i2...id = {ˆyi1i2 id(t)}t Ti1i2...id is a |Ti1i2...id| 1 vector. The matrix Σi1i2...id is an approximation of the true covariance Σ0 i1i2...id, and can be

modeled as Σi1i2...id = V1/2 i1i2...id Ri1i2...id V1/2 i1i2...id, where Vi1i2...id is a diagonal matrix of the marginal variance of yi1i2...id, and Ri1i2...id is a working correlation matrix for yi1i2...id. Some commonly used working correlation structures include independence, exchangeable, and ﬁrst-order autoregressive process (AR-1), among others. Given a working correlation structure, the working correlation matrix depends on fewer nuisance parameters which can be estimated by the residual-based moment method (Liang and Zeger, 1986). The proposed method is robust to the misspeciﬁcation of correlation structure as indicated by our numerical examples.

3.2 Parameter Estimation

In this subsection, we discuss parameter estimation by minimizing (3). Let pk ik = (pk ik1, , pk ikr) and Ωk ik = {(i1, . . . , ik, . . . , id) : yi1 ik id(t) is observed at some t given ik} be the

Dynamic Tensor Recommender Systems

set of indices with the ﬁxed kth mode index ik, where the corresponding components are observed at some time points. We assume that the number of observations for each time subgroup se is larger or equal than 2 for e = 1, . . . , md+1, and the number of observations for each subgroup ek from the kth mode is larger or equal than 2 for ek = 1, . . . , mk; k = 1, . . . , d. The partial derivatives of the objective function (3) have explicit forms with respect to the individual factors, the subgroup factors and the spline coeﬃcients, which makes it feasible to apply the blockwise coordinate descent approach (BCD). That is, for ik = 1, . . . , nk and k = 1, . . . , d,

bpk ik = arg min pk ik

(yi1 ik id byi1 ik id) Σ 1 i1 ik id(yi1 ik id byi1 ik id) + λ pk ik 2 2, (4)

bq(k) = arg min q(k)

Ω (yi1i2 id byi1i2 id) Σ 1 i1i2 id(yi1i2 id byi1i2 id) + λ q(k) 2 2, (5)

bα = arg min α

Ω (yi1i2 id byi1i2 id) Σ 1 i1i2 id(yi1i2 id byi1i2 id) + λ α 2 2, (6)

bβ = arg min β

Ω (yi1i2 id byi1i2 id) Σ 1 i1i2 id(yi1i2 id byi1i2 id) + λ β 2 2. (7)

In fact, the estimation procedure of bpk ik in (4) is a ridge regression, and does not require knowing pk i k for i k = ik. Thus, parallel computation is applicable to calculate bpk 1, . . . , bpk nk 1 and bpk nk eﬃciently. The minimization of L(θ|Y) can be done cyclically through estimating P, q, α and β. Notice that Ω= nk ik=1Ωk ik, and it is possible that Ωk ik is empty for certain ik s, that is, there is no observation on the subject ik. Under this circumstance, the individual factor of the ik subject is assigned as pk ik = 0, and the predicted values may degenerate to the subgroup-level factor model by utilizing information from members of the same subgroup.

3.3 Implementation

In the following, we discuss several implementation issues. To solve the objective function (3), we incorporate the maximum block improvement (MBI) strategy (Chen et al., 2012) into the BCD algorithm cyclically as in Bi et al. (2018). The MBI has two advantages over traditional cyclic BCD algorithms. First, it has a good algorithmic property which guarantees convergence to a stationary point, whereas traditional BCDs may end up with certain points where the criterion function ceases to decrease (Chen et al., 2012). Second, the MBI has the capability of choosing descending directions and hence has the possibility to discover shortcuts , which may reduce the computational time signiﬁcantly. Let bθl be an estimator of θ at the lth iteration, θa be a subset of θ, θc be the complementary set of θa, and bθ a be the attempted update of θa. The improvement of the bθ a is deﬁned as

Jˆθ a = 1 L(bθ a, bθ c l 1|Y)

L(bθl 1|Y) . (8)

We summarize the implementation of the speciﬁcal algorithm as follows.

Zhang, Bi, Tang and Qu

Algorithm Implementation Algorithm

1: (Initialization) Input all observed yi1i2 id(t) s, the number of factors r, tuning parameter λ, initial value θ0 and a stopping criterion ε = 10 4.

2: (Individual factors update) At the lth iteration, estimate {P1, P2, , Pd, α}.

(i) For each Pk, solve (4) through parallel computing and obtain b Pk . Then calculate Jb Pk through (8).

(ii) For α, solve (6) and obtain bα . Then calculate Jˆα through (8).

(iii) Assign b Pk l b Pk , if Jb Pk = max{Jb P1 , Jb P2 , , Jb Pd , Jˆα }. bα(l) bα , if Jˆα = max{Jb P1 , Jb P2 , , Jb Pd , Jˆα }.

3: (Subgroup factors update) At the lth iteration, estimate {q(1), q(2), , q(d), β}.

(i) For every q(k), solve (5) and obtain bq(k) . Then calculate Jbq(k) through (8).

(ii) For β, solve (7) and obtain bβ . Then calculate Jˆβ through (8).

(iii) Assign bq(k) l bq(k) , if Jbq(k) = max{Jbq(1) , Jbq(2) , , Jbq(d) , Jˆβ }. bβ(l) bβ , if Jˆβ = max{Jbq(1) , Jbq(2) , , Jbq(d) , Jˆβ }.

4: (Stopping Criterion) Stop if max{Jb P1 , Jb P2 , , Jb Pd , Jˆα , Jbq(1) , , Jbq(d) , Jˆβ } < ε.

Set the ﬁnal estimator bθ = bθl. Otherwise set l l + 1 and go to step 2.

To select tuning parameter λ, we search the one from grid points minimizing the root mean square error on the validation set, deﬁned as [P (i1,...,id,t) Γ{yi1...id(t) ˆyi1...id(t)}2/|Γ|]1/2, where Γ is the set of indices and times of observed data. We choose the number of individual latent factors r such that it is suﬃciently large and leads to stable estimation. In general, the r is no smaller than the theoretical rank of the tensor in order to represent subjects latent features suﬃciently well, but not so large as to over-burden the computational cost. An appropriate selection of the knot sequence is important to eﬃciently implement the proposed method. In practice, knot locations are usually chosen to be equally-spaced over the range of data or placed at evenly-spaced quantiles of data. Since there are highdimensional factor parameters, for simplicity we set the number of knots to be the integer part of N1/(2κ+3), where N = |Ω| and κ is the degree of polynomials. One can also choose other methods to select the number of knots such as the AIC or BIC procedures (Xue and Yang, 2006). The degree of polynomials κ is commonly chosen as 1, 2, or 3. In our numerical study, we set κ = 2 and adopt truncated polynomial bases. One can also use diﬀerent degrees and spline bases for diﬀerent time-varying coeﬃcients. Another important issue is in selection of contextual variables as tensor modes. In practice, the chosen number of contexts is often pre-speciﬁed based on domain knowledge. A contextual variable can be considered an additional tensor mode of a higher-order tensor if users and items behaviors are distinctive under diﬀerent values of the contextual variable.

Dynamic Tensor Recommender Systems

On the one hand, a higher-order tensor with more contextual variables allows higher-order interactions and hence provides more accurate estimation. On the other hand, a higherorder tensor entails more complex and intensive computation, and may lead to overﬁtting. It is not suggested to assign too many contextual variables as additional tensor modes, which remains open to discussion regarding the number of contextual variables. In our numerical studies, promotion strategies are incorporated as a contextual variable, since users and items behaviors are distinctive under diﬀerent promotion strategies. In general practice, however, we assume that the order of a tensor can be determined based on prior knowledge.

4. Theoretical Properties

In this section, we provide asymptotic properties for the proposed method and the estimation of prediction intervals. Speciﬁcally, we establish the convergence rate of the proposed tensor factorization and the asymptotic normality of the spline coeﬃcient estimator. Following asymptotic normality, we can also construct the estimation of the prediction interval of the component. Note that identiﬁability is critical for tensor representation. We ﬁrst present the suﬃcient conditions to ensure identiﬁability of the proposed tensor modeling as follows.

Proposition 1 If Pd k=1 Kk 2r + d + 1 holds, minimizers of L(P, q, α, β|Y) in P, q, α and β given ﬁxed spline bases are unique up to permutation almost surely, where Kk is the Kruskal rank of (Pk, qk), and qk = (qk 1, qk 2, , qk nk) .

Proposition 1 shows that the proposed tensor modeling is identiﬁable up to permutation almost surely. To address permutation indeterminacy, we could align the factors according to a descending order of the ﬁrst row of mode-1 factor matrix P1, that is, p1 11 p1 12 p1 1r, following the method in Zhang et al. (2014). The rearrangement can be implemented during or after the proposed algorithm, since it does not aﬀect the estimation procedure. In the rest of Section 4, we assume that the parameters are identiﬁable. Let ui1i2...id = {(p1 i11p2 i21 pd id1), (p1 i12p2 i22, pd id2), , (p1 i1rp2 i2r pd idr), (q1 i1q2 i2 qd id)} , U Rn1 ... nd (r+1) consist of ui1i2...id, f(t) = {h1(t), h2(t), . . . , hr(t), g(t)} , Fi1i2...id R|Ti1i2...id| (r+1) be the matrix consisting of f(t) for all t Ti1i2 id. Considering random errors based on the equation (1), we denote yi1i2 id(t) as yi1i2 id(t) = f(t) ui1i2...id+εi1i2...id(t) for t Ti1i2 id, where εi1i2...id(t) is a random error with mean zero and ﬁnite variance. Let εi1i2...id = {εi1i2...id(t)}t Ti1i2...id be a |Ti1i2...id| 1 vector. We have cov(εi1i2...id) = cov(yi1i2...id) = Σ0 i1i2...id. Thus, the corresponding vector form is

yi1i2...id = Fi1i2...idui1i2...id + εi1i2...id,

Let J(U) be a non-negative penalty function of U. The overall criterion given hj( ) and g( ) is redeﬁned as

(i1,i2, ,id) Ω (yi1i2...id Fi1i2...idui1i2...id) Σ 1 i1i2...id(yi1i2...id Fi1i2...idui1i2...id)+λJ(U)

(9) for U S, where S is the parameter space for U.

Zhang, Bi, Tang and Qu

Based on the proposed method, byi1i2...id can be rewritten as byi1i2...id = Wi1i2...idγ, where Wi1i2...id = (Xi1i2...id1, , Xi1i2...idr, Zi1i2...id1, , Zi1i2...idmd+1), Xi1i2...idj = ui1i2...idj Bi1i2...idj, Zi1i2...ide = ui1i2...id(r+1)Ai1i2...ide, in which Bi1i2...idj = {Bj(t) }t Ti1i2...id R|Ti1i2...id| M,

and Ai1i2...ide = {I(t se)Ae(t) }t Ti1i2...id R|Ti1i2...id| M for j = 1, 2, . . . , r, e = 1, 2, . . . , md+1. By the approximation theory (de Boor, 2001), there exists a constant C > 0, the spline functions ehj(t) = α 0j Bj(t) and ege(t) = β 0e Ae(t) such that supt T |hj(t) ehj(t)|

Ca ξ N and supt T |ge(t) ege(t)| Ca ξ N for any j = 1, . . . , r, e = 1, . . . , md+1. Denote γ0 = (α 0, β 0) , and let N = |Ω| be the number of components of the set Ω, λmin{ } and λmax{ } be the smallest and largest eigenvalues of any symmetric matrix, respectively. We require the following regularity conditions to establish the asymptotic properties.

(C1) The functions hj( ) and ge( ) are ξth-order continuously diﬀerential for some ξ 2, all j = 1, . . . , d, and e = 1, . . . , md+1. The density function of design points t is absolutely continuous and bounded away from zero and inﬁnity on a compact support T.

(C2) The knots sequences Υj and Γe are quasi-uniform for j = 1, . . . , d and e = 1, . . . , md+1; that is, there exists a constant c > 0, such that

max j=1,...,d maxi=0,...,a N (νji+1 νji)

mini=0,...,a N (νji+1 νji) c, and max e=1,...,md+1 maxi=0,...,a N (ωei+1 ωei)

mini=0,...,a N (ωei+1 ωei) c.

(C3) There exist positive constants σ2 1 and σ2 2 such that the covariance matrix Σ0 i1i2...id of random error εi1...id satisﬁes that σ2 1 λmin{Σ0 i1i2...id} λmax{Σ0 i1i2...id} σ2 2.

(C4) There exist some positive constants c1 and c2 such that c1 λmin{Σ 1 i1i2...idΣ0 i1i2...id} λmax{Σ 1 i1i2...idΣ0 i1i2...id} c2.

(C5) Tmax = max(i1, ,id) Ω{|Ti1 id|} = op(Nτ), Tmin = min(i1, ,id) Ω{|Ti1 id|} = op(Nυ) for 0 τ/2 < υ τ < 1, and λ = op(1).

Conditions (C1)-(C3) are standard in the polynomial spline framework. Similar conditions are also presented in Huang (2003) and Claeskens et al. (2009). In particular, condition (C1) imposes a smoothness condition of trend functions and a mild condition on time density, and guarantees that the observation time points are randomly scattered. Condition (C2) indicates that the adjacent distances among the knot sequence are comparable. Condition (C3) implies that the eigenvalues of random errors are bounded. Condition (C4) implies that the diﬀerence between the working covariance and true covariance matrices is bounded. Condition (C5) implies that the number of the observed time points grows as the number of the observed components of the tensor increases, to ensure the convergence of the proposed tensor factorization. The following theorem establishes the convergence rate for the proposed tensor factorization.

Theorem 2 Under conditions (C1)-(C5), if the penalty function J(U) has bounded ﬁrst and second derivatives at true parameter U0, as N , on a δ-ball centered at U0 for some δ > 0, there exists a minimizer b U of (9) such that X

(i1,i2, ,id) Ω Fi1i2 id(bui1i2 id u0i1i2 id) 2 2/N = Op(N 1+2(τ υ)).

Dynamic Tensor Recommender Systems

Theorem 2 provides the convergence rate of the proposed method given trend functions. When τ = υ, that is, Tmax and Tmin have the same order, the convergence rate of the estimator b U reaches the optimal rate N 1/2. Meanwhile, if the order of Tmax is

N faster than that of Tmin, that is, τ υ = 0.5, then b U will not converge to the true U0. This implies that to guarantee consistency of the tensor factorization, one should collect suﬃcient observations even for the least popular user-item-context combinations. In the following theorem, we establish the asymptotic property of the spline coeﬃcient estimator.

Theorem 3 Under conditions (C1)-(C5), if lim N a N log a N/N = 0 and lim N a ξ N Nτ

= 0, then for any vector c whose components are not all zero, the parametric estimator bγ by (6) and (7) satisﬁes

c (bγ γ0)var{c (bγ γ0)} 1/2 L N(0, 1),

where var{c (bγ γ0)} = c Ψ 1ΦΨ 1c = Op(a NN 1+τ 2υ), Ψ = P (i1,...,id) ΩW i1...idΣ 1 i1...id

Wi1...id, and Φ = P (i1,...,id) ΩW i1...idΣ 1 i1...idΣ0 i1...idΣ 1 i1...id Wi1...id.

Theorem 3 establishes the asymptotic normality of the spline coeﬃcient estimator. The convergence rate of the spline coeﬃcient estimator is Op(a NN 1+τ 2υ). If Tmax and Tmin have the same order, var{c (bγ γ0)} = Op(a N/N1+υ), and similar results can be found in Huang et al. (2004). The asymptotic variance in Theorem 3 depends on the working covariance matrix and the true covariance matrix. When the working covariance matrices are equal to the true covariance matrices, the asymptotic variance of the proposed estimator reaches the minimum in the sense of Lower order and the proposed estimator is asymptotic eﬃcient. More importantly, the result of Theorem 3 is the key foundation for constructing prediction intervals. First, we derive the standard error for the spline parametric estimates given a ﬁxed λ using the sandwich covariance formula d Cov(bγ) = ( bΨ+λI) 1 bΦ( bΨ+λI) 1, where bΨ = P (i1,i2,...,id) Ωc W i1i2...idΣ 1 i1i2...id c Wi1i2...id, bΦ = P (i1,i2,...,id) Ω{c W i1i2...idΣ 1 i1i2...id(yi1i2...id c Wi1i2...idbγ)} 2, operation is the vector operation a 2 = aa , and I is an identity matrix. Since ˆyi1i2...id(t) = bw i1i2...idtbγ, and bwi1i2...idt is the tth column of estimator c W i1i2...id, a 100(1 σ)% prediction interval (Chatﬁeld, 1993) of ˆyi1i2...id(t) is

ˆyi1i2...id(t) φσ/2 q

var{ei1i2...id(t)}, (10)

where φσ/2 is the 100(1 σ)th percentile of the standard normal distribution, and the var{ei1i2...id(t)} is the variance of the prediction error and can be estimated as:

c var{ei1i2...id(t)} = bw i1i2...idt d Cov(bγ)bwi1i2...idt + c var{εi1i2...id(t)}. (11)

The ﬁrst term in equation (11) is due to estimation error, and the second term can be estimated by the mean squared error on training data.

Zhang, Bi, Tang and Qu

5. Simulation Studies

In this section, we perform simulation studies to compare the proposed method (DTRS) with competing methods, including Bayesian probabilistic tensor factorization (BPTF, Xiong et al., 2010) and the recommendation engine of multilayers (REM, Bi et al., 2018). We assess forecasting performance via examining the root mean square error (RMSE) and the mean absolute error (MAE), where the RMSE is deﬁned as [P (i1,...,id,t) Γ{yi1...id(t) ˆyi1...id(t)}2/|Γ|]1/2, the MAE is deﬁned as P (i1,...,id,t) Γ |yi1...id(t) ˆyi1...id(t)|/|Γ|, and Γ is the set of indices and time of observed data. Moreover, we evaluate the coverage probability of the prediction interval estimated by the proposed method with 95% nominal coverage probability (PICP) .

5.1 Simple Tensor Function

In the simulation, we consider a third-order tensor function of time with user, context and item modes. We set the numbers of users, contexts and items to n1 = 100, n2 = 9, and n3 = 100, respectively. We assume that users, contexts, items and time points are from m1 = 10, m2 = 3, m3 = 10 and m4 = 4 subgroups, respectively. Users, contexts, items and time points are evenly assigned to each subgroup. The number of latent factors is set as r = 3. We generate tensor functions at time points t U(0, 1) by generating its components as yi1i2i3(t) = Pr j=1 hj(t)p1 i1jp2 i2jp3 i3j +g(t)q1 i1q2 i2q3 i3 +εi1i2i3(t) for ik = 1, . . . , nk, k = 1, 2, 3, where the latent factors pk ik N(0, Ir), trend functions h1(t) = sin(0.3πt), h2(t) = 8t(1 t) 1 and h3(t) = cos(0.2πt) + 1. To distinguish diﬀerent subgroups, we set the subgroup factors as a simple sequence, where q1 (e1) = 1+0.4e1, q2 (e2) = 1.2+0.6e2 and q3 (e3) = 0.4+0.2e3 for ek = 1, . . . , mk and k = 1, 2, 3. The function g(t) = Pm4 e=1 ge(t)I(t se), where g1(t) = 2t 1, g2(t) = 8(t 0.5)3, g3(t) = sin(0.1πt) + cos(πt), and g4(t) = 5 exp(t)+10. The error εi1i2i3 = (εi1i2i3(t1), . . . , εi1i2i3(t T )) follows a multivariate normal distribution with mean 0 and a common marginal variance 1, and the correlation structure is either independence or AR-1 with correlation ρ = 0.85. In each simulation, we consider the number of time points as T = T1 + T2, where the tensor data in the ﬁrst T1 = 12 time points are set as the training data, and the tensor data in the last T2 time points are used as the testing data. For evaluating the forecasting performance at future time points, we consider T2 = 8 or 12. Considering the missing case, we generate n1n2n3T(1 πm) components out of the tensor functions, where πm is the missing percentage and set as 80%. Furthermore, we use πcs = 30% to represent the proportion of new items in the testing data unavailable from the training set. To illustrate the eﬀect of incorporating intra-cluster correlation on estimation eﬃciency, we compare the estimation eﬃciency of the proposed methods using diﬀerent working correlation structures: independent or AR-1, denoted as DTRSin and DTRSar, respectively. According to Xiong et al. (2010) and Bi et al. (2018), BPTF and REM methods model fourth-order tensor with user, context, item and time modes. For all methods, we assume that the subgroup structure and the number of latent factors are known. For REM and the proposed methods, the tuning parameter λ is pre-selected from grid points ranging from 0 to 20. The validation set is the data from the last four time points of the training set. For BPTF, we keep the remaining parameters by their default choices and obtain a forecast

Dynamic Tensor Recommender Systems

Table 1: Average RMSE and MAE of all approaches. The PICP is the average coverage probability of the 95% prediction interval. The RMSE, MAE and PICP are provided with standard error based on 100 simulations in each parenthesis.

True structure: Independent AR Method T2 = 8 T2 = 12 T2 = 8 T2 = 12 DTRSin RMSE 1.570(0.196) 1.660(0.389) 1.597(0.192) 1.707(0.524) MAE 1.092(0.091) 1.132(0.160) 1.115(0.091) 1.160(0.208) PICP 0.949(0.015) 0.953(0.017) 0.946(0.017) 0.952(0.018) DTRSar RMSE 1.625(0.244) 1.696(0.286) 1.576(0.190) 1.632(0.200) MAE 1.133(0.118) 1.170(0.159) 1.099(0.085) 1.130(0.102) PICP 0.943(0.019) 0.947(0.021) 0.947(0.015) 0.949(0.018) REM RMSE 2.502(0.322) 2.494(0.307) 2.498(0.304) 2.494(0.305) MAE 1.654(0.178) 1.640(0.170) 1.650(0.166) 1.643(0.172) PICP BPTFbayes RMSE 2.675(0.742) 2.930(0.965) 2.724(0.863) 3.181(1.148) MAE 1.810(0.427) 1.958(0.547) 1.826(0.495) 2.104(0.654) PICP BPTFbasic RMSE 2.142(0.221) 2.145(0.211) 2.136(0.222) 2.144(0.206) MAE 1.446(0.116) 1.454(0.115) 1.441(0.117) 1.453(0.111) PICP BPTFdouble RMSE 2.388(0.319) 2.665(0.356) 2.405(0.319) 2.642(0.380) MAE 1.598(0.190) 1.774(0.217) 1.611(0.191) 1.755(0.232) PICP

via sampling the factor matrix of time from the time posterior distribution, denoted as BPTFbayes. Following the forecasting technique of Araujo et al. (2019), we also consider BPTF incorporating basic exponential smoothing (Holt s method, Holt, 2004) and double exponential smoothing (Holt-Winters method, Holt, 2004; Winters, 1960). That is, we ﬁrst use BPTF to estimate the factor matrices of user, item, context and time in the training data, and then forecast the factor matrix of time at given time points of the testing data via basic exponential smoothing or double exponential smoothing, denoted as BPTFbasic and BPTFdouble, respectively. All methods are replicated by 100 simulation runs. Table 1 provides the estimation results of all methods. We observe that the proposed method has better performance when the working correlation structure is the same as the true correlation structure. When the true correlation structure is independence, the DTRSin has smaller RMSE and MAE than the DTRSar, with more than 2.17% improvement. Similarly, when the true correlation structure is AR-1, the DTRSar outperforms the DTRSin. Moreover, the PICPs of the DTRS method are close to 0.95, which implies that the proposed method provides accurate prediction intervals, whereas the competing methods are not able to provide such prediction intervals. For the performance of forecasting, we observe that the DTRSin and DTRSar outperform other methods across all settings. Speciﬁcally, both DTRS methods improve the RMSE and MAE of the REM by more than 40%, and

Zhang, Bi, Tang and Qu

Figure 3: Box plots of the MAE for forecasting values with 8 and 12 time points and the true independent correlation.

those of the BPTFbayes, BPTFbasic and BPTFdouble by more than 60%, 24%, and 41%, respectively. In this setting, the BPTFbasic performs better than the BPTFdouble. This is probably because the basic exponential smoothing method is more applicable in forecasting time series with no clear trend or seasonal pattern, whereas the double exponential smoothing method performs better when a trend is present (Holt, 2004). Although the BPTFbasic and BPTFdouble perform better than the BPTFbayes, the proposed method is still able to beat the best of the BPTF variations. This indicates that the proposed method provides more accurate forecasting compared to other methods. To illustrate the speciﬁc performance for forecasting at each time point, we calculate the MAE at each time point and provide box plots for the MAE in Figures 3-4. We observe that the performance of the proposed method is relatively robust against time in all settings. The MAEs of both DTRS methods at any time point are the lowest. The proposed method outperforms other methods, especially for long-term forecasting, indicating that it can handle both short-term and long-term forecasting accurately.

5.2 Long Forecasting Time Period

In this simulation study, we evaluate the performance of the proposed method with a vast set of users and items and time period. Moreover, we also report the average computational time (Com Time) in seconds for each method based on 50 repetitions. All experiments are implemented using Window 10 with 1.99 GHz Intel Core i7 Processor and 16 GB memory. We consider a third-order tensor function of time with user, context and item modes. We set the numbers of users, contexts and items to n1 = 1000, n2 = 30, n3 = 10000, respectively. The number of time points is T = T1+T2 with the number of the training time points T1 = 80 and the number of the testing time points T2 = 8, 12, 24, 32. Other setting is similar to the setting in Section 5.1. We consider that the error εi1i2i3 = (εi1i2i3(t1), . . . , εi1i2i3(t T ))

Dynamic Tensor Recommender Systems

Figure 4: Box plots of the MAE for forecasting values with 8 and 12 time points and the true AR-1 correlation.

follows a multivariate normal distribution with mean 0 and the independent structure. We perform the proposed method and the compering methods as in Section 5.1. The simulation results are summarized in Table 2.

Table 2 shows that the proposed method outperforms other methods across all settings. Speciﬁcally, both DTRS methods improve the RMSE and MAE of the REM by more than 17%, and those of the BPTFbayes, BPTFbasic and BPTFdouble by more than 39%, 31% and 39%, respectively. Moreover, the proposed method is robust against longer forecasting time periods comparing with other methods. Speciﬁcally, we observe that the REM is more eﬃcient computationally than the proposed method but the accuracy of the REM is lower than the proposed method. This is probably due to that the REM treats time as a mode of tensor, with more parameters involved as the time increases, and leads to more errors of estimation. Due to the demand of Gibbs sampling, the BPTFbayes, BPTFbasic and BPTFdouble require longer computational time and larger menory storage. In additional, as the time increases, the BPTFbayes needs to involve more parameters in time factors which leads to low computational eﬃciency and low accuracy in forecasting. Although the BPTFbasic and BPTFdouble do not have increasing number of parameters as the time increases due to the exponential smoothing method, the Gibbs sampling is still time-consuming. Reducing the number of Gibbs samples may lead to less computational time but with less accurate predictions. For our method, the DTRSar requires to estimate the covariance matrix, and the DTRSar requires more computational time than the DTRSin. Nevertheless, the DTRSar has higher accuracy than other competing methods in terms of RMSE and MAE. Overall, the proposed method performs well in term of computational eﬃciency and forecasting accuracy.

Zhang, Bi, Tang and Qu

Table 2: Average RMSE and MAE of all approaches for diﬀerent forecasting time horizons. The PICP and Com Time are the average coverage probability of the 95% prediction interval and average computational time in seconds, respectively. The RMSE, MAE and PICP are provided with standard error based on 50 simulations in each parenthesis.

Method T2 = 8 T2 = 12 T2 = 24 T2 = 32 DTRSin RMSE 1.549(0.244) 1.574(0.325) 1.633(0.407) 1.682(0.458) MAE 1.088(0.148) 1.107(0.173) 1.147(0.218) 1.178(0.247) PICP 0.930(0.024) 0.932(0.026) 0.935(0.029) 0.941(0.026) Com Time 634.5s 634.7s 635.4s 635.9s DTRSar RMSE 1.574(0.343) 1.597(0.377) 1.663(0.510) 1.727(0.675) MAE 1.106(0.185) 1.121(0.208) 1.164(0.274) 1.202(0.343) PICP 0.929(0.020) 0.931(0.023) 0.933(0.031) 0.938(0.028) Com Time 766.5s 766.8s 767.6s 768.1s REM RMSE 1.855(0.410) 2.127(0.560) 2.268(0.469) 2.041(0.640) MAE 1.321(0.245) 1.459(0.303) 1.528(0.280) 1.432(0.355) PICP Com Time 159.1s 168.5s 184.5s 163.5s BPTFbayes RMSE 2.331(0.772) 2.332(0.753) 2.323(0.680) 2.618(0.617) MAE 1.634(0.453) 1.629(0.439) 1.618(0.387) 1.793(0.369) PICP Com Time 772.0s 777.2s 792.5s 802.4s BPTFbasic RMSE 2.104(0.658) 2.106(0.671) 2.198(0.717) 2.641(0.791) MAE 1.503(0.448) 1.503(0.454) 1.558(0.471) 1.789(0.495) PICP Com Time 756.1s 761.0s 775.5s 784.1s BPTFdouble RMSE 2.203(0.704) 2.224(0.734) 2.367(0.834) 2.803(0.977) MAE 1.562(0.470) 1.572(0.485) 1.651(0.530) 1.879(0.581) PICP Com Time 738.8s 743.8s 759.5s 769.3s

Dynamic Tensor Recommender Systems

Table 3: Average RMSE and MAE of the proposed method DTRSin under 0%, 10% and 30% cluster misspeciﬁcation rate (Mis. rate). The PICP is the average coverage probability of the 95% prediction interval. The RMSE, MAE and PICP are provided with standard error based on 50 simulations in each parenthesis.

Method Mis. rate T2 = 8 T2 = 12 T2 = 24 T2 = 32 DTRSin 0% RMSE 1.549(0.244) 1.574(0.325) 1.633(0.407) 1.682(0.458) MAE 1.088(0.148) 1.107(0.173) 1.147(0.218) 1.178(0.247) PICP 0.930(0.024) 0.932(0.026) 0.935(0.029) 0.941(0.026) 10% RMSE 1.594(0.350) 1.612(0.357) 1.658(0.395) 1.703(0.433) MAE 1.109(0.184) 1.124(0.194) 1.159(0.222) 1.190(0.247) PICP 0.935(0.019) 0.938(0.020) 0.942(0.024) 0.947(0.022) 30% RMSE 1.604(0.386) 1.617(0.387) 1.697(0.502) 1.771(0.609) MAE 1.120(0.221) 1.129(0.221) 1.179(0.278) 1.226(0.330) PICP 0.935(0.021) 0.939(0.021) 0.943(0.024) 0.947(0.023) REM 0% RMSE 1.855(0.410) 2.127(0.560) 2.268(0.469) 2.041(0.640) MAE 1.321(0.245) 1.459(0.303) 1.528(0.280) 1.432(0.355) 10% RMSE 2.443(0.405) 2.390(0.384) 2.401(0.574) 2.383(0.432) MAE 1.661(0.243) 1.619(0.211) 1.612(0.344) 1.617(0.259) 30% RMSE 2.443(0.437) 2.337(0.314) 2.360(0.392) 2.268(0.382) MAE 1.676(0.294) 1.597(0.193) 1.595(0.245) 1.536(0.219)

5.3 Robustness under Cluster Misspeciﬁcation

In this simulation study, we study the robustness of the proposed method when the clusters are misspeciﬁed.

We follow the same data-generating process as in Section 5.2, but allow the cluster assignment to be misspeciﬁed. Speciﬁcally, we misassign users, contexts and items to adjacent clusters with 0%, 10% and 30% chance. The Adjacent clusters are deﬁned as the clusters with the closest group eﬀects. This deﬁnition of adjacent clusters reﬂects the real-data situation (e.g., a facial tissue might be misassigned as paper towels, but less likely to be misassigned as yogurt). We also compare the proposed method with the REM method which also consider the subgroup information.

The simulation results based on 50 replications are summarized in Table 3. Table 3 shows that in general the proposed method is robust against the misspeciﬁcation of cluster. In comparison with the results when 0% of the cluster members are misclassiﬁed, the proposed method is more robust than the REM method in all settings under 10% and 30% cluster misspeciﬁcation rate. For example, the proposed method under the 10% misspeciﬁcation rate is 2.9% worse than the proposed method without misspeciﬁcation in terms of the RMSE under T2 = 8; and is 3.6% worse than one without misspeciﬁcation under T2 = 8. However, the REM method under the 10% and 30% misspeciﬁcation rates is 31.7% worse than the REM method without misspeciﬁcation in terms of the RMSE under T2 = 8.

Zhang, Bi, Tang and Qu

6. Empirical Examples

6.1 IRI Marketing Data

In this section, we focus on sales data at drug stores from the IRI Marketing Data (Bronnenberg et al., 2008) to illustrate the performance of the proposed method. The original IRI data is an immense collection of consumer panel data and store sales at grocery stores, drug stores and mass-market stores over the years 2001-2011. The store sales data contain weekly product sales volumes, pricing, and promotion data for all items from 31 product categories sold in 50 U.S. markets. These markets are geographic units deﬁned typically as an agglomeration of counties, usually covering a major metropolitan areas (e.g., Chicago, IL) but sometimes covering just part of a region (e.g., New England). A detailed description of an early version of the data is available in Bronnenberg et al. (2008). To illustrate the proposed method, we choose sales data at drug stores collected from 2001 to 2011, where there are sales volume records, recorded times, promotion strategies, 43,631 product IDs, and 471 drug store IDs. These drug stores are from 50 markets across the United States. The products include items sold from these stores during the 11-year period, and are from 31 product categories, including hot dogs, household cleaners, margarine/butter blends, mayonnaise, milk, coﬀee, cigarettes, photography supplies, paper towels, frozen pizza, toilet tissue, yogurt, beer/ale/alcoholic cider, blades, cold cereal, carbonated beverages, diapers, deodorant, facial tissue, frozen dinners/entrees, laundry detergent, peanut butter, razors, mustard and ketchup, sugar substitutes, spaghetti/Italian sauce, soup, shampoo, salty snacks, toothpaste, and toothbrush. Moreover, various advertising and promotions strategies are imposed on these products to attract consumers. The promotions strategies have 30 types which are combinations of 5 advertisement features, 3 types of merchandise display, and an indicator on whether the product has a price reduction of more than 5%. The goal of our study is to predict the future sales volumes of each product from each store given each promotion strategy based on historical sales data. Through this prediction procedure, we are able to estimate future purchases, evaluate the inﬂuence of promotion strategy for product sales, and potentially recommend the most proﬁtable products to store managers, so the company can make wiser decisions on marketing strategies and inventory planning. For the IRI marketing data, a personalized suggestion refers to the recommendation of potentially proﬁtable products to store managers. Statistically this can be viewed as predicting future sales volumes of each product from each store based on historical sales data. There are abundant literature on product recommender systems including, but not limited to, Giering (2008), Xiong et al. (2010) and Yu et al. (2016). In these works, similar to the proposed IRI data analysis, recommendations of products to stores are also considered. For considering the trend of product sales, we aggregate the weekly data into monthly data according to the record time information so that the data contain more than 79.2 million sales records for 132 months from the beginning of 2001 to the end of 2011. For the proposed method, we classify stores, products, observed time points and promotion strategies into subgroups based on their markets, product categories, month of the year and whether a price reduction is applied, respectively. Table 4 shows the summary statistics of the data. According to the proposed method, the data can be reframed into monthly third-order tensors by store, product and promotion.

Dynamic Tensor Recommender Systems

Table 4: Summary statistics of the monthly IRI marketing data.

The number of types The number of subgroups Store 471 50 Promotion 30 2 Product 43,631 31 Month 132 12 Sales record 79,243,289

Figure 5: An illustration for monthly sale tensors at the tth month.

Figure 5 provides an illustration of the reframed sale records to clarify the proposed method. According to the given structure of monthly third-order tensors, the total number of sale records could be up to 471 30 43631 132 8.1 billion. Although the observed records are more than 79.2 million, there are still a large number of store-promotion-product-month combinations which are associated with unknown sales volumes. The sales data have a 99.9% missing rate and are highly sparse, which renders a particular challenge for recommender systems and forecasting. For comparison, we implement and report the performances of the proposed methods with diﬀerent working correlation matrices and the competing methods as in Section 5. For all methods, we select the number of latent factors r ranging from 3 to 30. For the REM method and the proposed methods, we select a tuning parameter λ from 1 to 29. For the BPTF methods, we use the default values of the remaining parameters. For selecting the above parameters, we set the data from the beginning of 2001 to the end of 2009 (i.e., the ﬁrst 108 months) as the training set and the data from the beginning of 2010 to the end of 2010 as the validation set, and then tune these parameters through minimizing the root mean square error on the validation set. We randomly sample 80% of the data from 2001 to 2010 as a training set and 20% of the data from the entire year of 2011 as the testing set. The random sampling is replicated 50 times. Table 5 shows the forecasting results produced by each method. The results of the DTRSar are similar to ones of the DTRSin and are omitted. From Table 5, we observe that the DTRSin method achieves coverage probabilities of the prediction interval close to 95%, implying that the proposed method can estimate the prediction interval accurately, whereas the competing methods cannot provide such prediction intervals. In addition, the

Zhang, Bi, Tang and Qu

Table 5: The RMSE and MAE of the forecasting sale volumes in 2011 from ﬁve methods. The PICP is the average coverage probability of the 95% prediction interval. The RMSE, MAE and PICP are provided with standard error based on 50 experiments in each parenthesis. The RRMSE and RMAE show the relative improvement ratios of the DTRSin method over others in terms of the RMSE and MAE.

Method RMSE RRMSE MAE RMAE PICP DTRSin 11.284(0.536) 3.790(0.058) 0.967(0.001) REM 13.425(1.458) 18.97% 4.072(0.261) 7.44% BPTFbayes 15.792(1.746) 39.95% 4.276(0.171) 12.82% BPTFbasic 12.736(0.385) 12.87% 3.838(0.058) 1.27% BPTFdouble 12.732(0.388) 12.83% 3.835(0.057) 1.19%

Figure 6: Box plots of the RMSE and MAE for forecasting values at 12 time points.

DTRSin method has the lowest RMSE and MAE. For example, DTRSin improves on the RMSE and MAE of the BPTFbayes by 39.95% and 12.82%, and of the REM by 18.97% and 7.44%, respectively. The BPTFbasic and BPTFdouble perform better than the BPTFbayes, while the BPTFdouble performs better than the BPTFbasic. However, the proposed method still outperforms the BPTFbasic and BPTFdouble.

To illustrate the speciﬁc performance for forecasting at each time point, we calculate the RMSE and MAE at each time point and provide box plots for the RMSE and MAE in Figure 6. We observe that the performances of the DTRSin, BPTFbasic and BPTFdouble are more robust than those of the REM and BPTFbayes. However, the RMSEs of the DTRSin method at each time point are still lower than those of the BPTFbasic and BPTFdouble methods. The proposed method outperforms other methods for forecasting at each time point, and can deal with long-term forecasting accurately.

Dynamic Tensor Recommender Systems

6.2 Last.fm Data

In this section, we analyze the Lastfm-1K dataset collected by Last.fm API (Celma, 2010) to evaluate the performance of the proposed method. The dataset is available at http: //ocelma.net/Music Recommendation Dataset/lastfm-1K.html and has been widely used in music recommendation experiments. The Lastfm data includes the listening history of 992 users and songs played daily, recorded by quadruples with user, timestamp, artist and song information, where the users proﬁles contain gender, age, country and signup, and artists contain 107,528 artists with ID and 69,420 without ID. A detailed description of the Lastfm-1K dataset is available at http://ocelma.net/Music Recommendation Dataset/ lastfm-1K.html. We extract the user-artist-song playcount tensor-valued function with each monthly time point based on the quadruples records. The goal of our study is to predict the future playcount of each song given each artist for each user. Through this prediction procedure, we are able to estimate future listening habit for each user, so that to recommend interesting songs to each user. To evaluate the performance of the proposed method, we consider a sub-dataset with 100 users randomly, where there are 7,490 artists, 32,287 songs and 53 months from February of 2005 to June of 2009. We classify users, song, time stamp and artists into subgroups based on users gender, a song s artist, month of the year, and whether the artist has ID, respectively. Although the observed records have been 356,786, there are still a large number of user-artist-song-month combinations which are associated with unknown playcount. The data have a high missing rate and are highly sparse. Similar to Section 6.1, we implement and report the performances of the proposed method and the competing methods. We randomly sample 80% of the data from February of 2005 to May of 2008 (i.e., the ﬁrst 40 months) as a training set and 20% of the data from June of 2008 to June of 2009 (i.e., the last 13 months) as the testing set. The random sampling is replicated 50 times. Table 6 shows the forecasting results produced by each method and the computational time. Table 6 indicates that the DTRSin method achieves coverage probabilities of the prediction interval close to 95%, which supports that the proposed method can obtain accurate prediction interval estimator, whereas the competing methods cannot provide such prediction interval. Table 6 also shows that the proposed method achieves the best performance. Speciﬁcally, the proposed method improves the compering methods by at least 8.02% with respect to the RMSE, and by at least 10.05% with respect to the MAE, and still achieves the smallest standard error with a reasonable computational time.

7. Discussion

In this article, we propose a new dynamic tensor recommender system which incorporates time information through a tensor-valued function. A unique contribution of our method is that it can estimate recommendation accurately given any time point in continuous time intervals. Technically, the proposed method builds a time-value tensor decomposition model and borrows group information from existing time points of the same group for higher forecasting accuracy. Moreover, the proposed method utilizes the polynomial spline method and the weighted least squared method to incorporate time-dependency and intra-cluster correlation into the DRS. The spline extrapolation enables our method to achieve both

Zhang, Bi, Tang and Qu

Table 6: The RMSE and MAE of the forecasting the playcount of songs from ﬁve methods. The PICP and Com Time are the average coverage probability of the 95% prediction interval and the average computational time in seconds. The RMSE, MAE and PICP are provided with standard error based on 50 experiments in each parenthesis. The RRMSE and RMAE show the relative improvement ratios of the DTRSin method over others in terms of the RMSE and MAE.

Method RMSE RRMSE MAE RMAE PICP Com Time DTRSin 12.113(1.836) 1.940(0.058) 0.931(0.012) 92.8s REM 16.085(2.375) 32.79% 3.228(0.717) 66.39% 197.0s BPTFbayes 13.084(1.966) 8.02% 2.139(0.080) 10.26% 84.3s BPTFbasic 13.117(2.014) 8.29% 2.137(0.082) 10.15% 86.2s BPTFdouble 13.116(2.015) 8.28% 2.135(0.083) 10.05% 82.5s

short-term and long-term forecasts accurately, as conﬁrmed in the numerical studies. In addition, the proposed method is able to provide pointwise prediction intervals based on the established asymptotic property, while existing recommender systems are not equipped with prediction intervals. In theory, we demonstrate that the proposed decomposition achieves asymptotic consistency on prediction and the spline coeﬃcient estimators have asymptotic normality. In addition, we show the numerical advantages of our method compared to existing methods, including the analysis of IRI marketing data.

Acknowledgments

We would like to thank the action editors and referees for insightful comments and suggestions which improve the article signiﬁcantly. We would like to acknowledge support for this project from the National Science Foundation Grants (DMS1821198, DMS1613190 and DMS1952402), National Natural Science Foundation of P.R. China (11731011, 11671349 and 12001479), and Natural Science Foundation of Yunnan Province of China (2019FD068). We would like to thank IRI for making the data available. All estimates and analysis in this paper based on data provided by IRI are by the authors and not by the IRI.

Dynamic Tensor Recommender Systems

Appendix A.

In this appendix we prove the theorems from Section 4.

Lemma 4 Predicted values given by (2) with ﬁxed spline bases are invariant with respect to scaling and permutation indeterminacies.

Proof According to (2), the tensor function Y(t) is represented as:

j=1 ˆhj(t)p1 j p2 j pd j + ˆg(t)q1 q2 qd.

Given ﬁxed spline bases {Bji( )}M i=1, hj(t) is uniquely conﬁrmed by coeﬃcients αj for j = 1, . . . , r, and g(t) is similar. Thus, the determinacy of coeﬃcients is equivalent to the determinacy of functional factors, so we discuss the indeterminacy of functional factors. Scaling indeterminacy refers to non-uniqueness with respect to a scale change of pk j and qk, and of each function hj(t) and g(t) for k = 1, . . . , d; j = 1, . . . , r. That is, for πkl and δl, k = 1, . . . , d, l = 1, . . . , r + 1, we have epk j = πkjpk j, ehj(t) = δjhj(t), j = 1, . . . , r, eqk = πkr+1qk, and eg(t) = δr+1g(t) such that δl Qd k=1 πkl = 1 for l = 1, . . . , r + 1. Thus, we know that Pr j=1 hj(t)p1 j p2 j pd j + g(t)q1 q2 qd = Pr j=1 ehj(t)ep1 j ep2 j epd j + eg(t)eq1 eq2 eqd. Let h(t) = {h1(t), h2(t), . . . , hr(t)} . Permutation indeterminacy refers to an arbitrary r r permutation matrix Π such that Pr j=1 hj(t)p1 j p2 j pd j + g(t)q1 q2 qd .= [[h(t); P1, P2, , Pd]] + g(t)q1 q2 qd = [[Πh(t); P1Π, P2Π, , PdΠ]] + g(t)q1 q2 qd. These imply the invariance of equation (2) with respect to scaling and permutation indeterminacies. The proof of the Lemma is completed.

Proof of Proposition 1. Based on the deﬁnition of Kruskal rank, the Kruskal rank of the (r + 1) 1 matrix (h1(t), h2(t), . . . , hr(t), g(t)) is one. Based on Theorem 3 in Sidiropoulos and Bro (2000), the suﬃcient condition of identiﬁability of (Pk, qk) up to permutation and scaling of columns is Pd k=1 Kk + 1 2(r + 1) + (d + 1) 1, that is, Pd k=1 Kk 2r + d + 1. Under the suﬃcient condition, if there exist two minimizers (e P, eq, eα, eβ) and (b P, bq, bα, bβ) of L( |Y), then (e P, eq, eα, eβ) and (b P, bq, bα, bβ) are identical with the exception of scaling and permutation. By Lemma 4, the byi1i2 id s provided by (e P, eq, eα, eβ) and (b P, bq, bα, bβ) are identical. Thus, L(e P, eq, eα, eβ|Y) = L(b P, bq, bα, bβ|Y) implies that Pd k=1(Pr j=1 epk j 2 2 + eqk 2 2) + Pr j=1 eαj 2 + Pmd+1 e=1 eβe 2

= Pd k=1(Pr j=1 bpk j 2 2 + bqk 2 2) + Pr j=1 bαj 2 + Pmd+1 e=1 bβe 2. (12)

Suppose there exist some k1, k2 = 1, . . . , d, k1 = k2, such that bpk1 j = νjepk1 j , bpk2 j = epk2 j /νj, bqk1 = τeqk1, bqk2 = eqk2/τ for positive constants τ, νj, j = 1, . . . , r. We have

j=1 ( bpk1 j 2 2+ bpk2 j 2 2)+ bqk1 2 2+ bqk2 2 2 =

j=1 (ν2 j epk1 j 2 2+ epk2 j 2 2/ν2 j )+τ 2 eqk1 2 2+ eqk2 2 2/τ 2.

Zhang, Bi, Tang and Qu

Then (12) implies that τ = 1 and νj almost surely, j = 1, . . . , r. Similarly, suppose there exist some k1 = 1, . . . , d, such that bpk1 j = νjepk1 j , bαj = eαj/νj, bqk1 = τeqk1, bβe = eβe/τ for positive constants τ, νj, j = 1, . . . , r; e = 1, . . . , md+1. We have

Pr j=1( bpk1 j 2 2 + bαj 2) + bqk1 2 2 + Pmd+1 e=1 bβe 2

= Pr j=1(ν2 j epk1 j 2 2 + eαj 2/ν2 j ) + τ 2 eqk1 2 2 + Pmd+1 e=1 eβe 2/τ 2.

Then (12) implies that τ = 1 and νj almost surely, j = 1, . . . , r. Thus, (e P, eq, eα, eβ) and (b P, bq, bα, bβ) are identical almost surely with the exception of permutation.

Proof of Theorem 2. The estimator bui1 id is obtained by solving (i1 id)L(U|Y) = 0, where (i1 id) represents the ﬁrst derivatives with respect to the vector ui1 id. By Taylor expansion, we have

bui1 id u0i1 id = { 2 (i1 id)L(U|Y)|u i1 id} 1 (i1 id)L(U|Y)|u0i1 id = {F i1i2...idΣ 1 i1i2...id Fi1i2...id + λ 2 (i1 id)J(U)|u i1 id} 1

{F i1i2...idΣ 1 i1i2...idεi1i2...id λ (i1 id)J(U)|u0i1 id}

where u i1 id is between bui1 id and u0i1 id, and 2 (i1 id) represents the second derivatives with respect to the vector ui1 id. Since λ = op(1) and J(U) have bounded ﬁrst and second derivatives at true parameter U0, λ 2 (i1 id)J(U)|u i1 id = op(1) and λ (i1 id)J(U)|u0i1 id = op(1). Under conditions (C1), (C3) and (C4), we have

I1 F = F i1i2...id(Σ 1 i1i2...idΣ0 i1i2...id)(Σ0 i1i2...id) 1Fi1i2...id + λ 2 (i1 id)J(U)|u i1 id F

c1σ 2 2 F i1i2...id Fi1i2...id + λ 2 (i1 id)J(U)|u i1 id F

c1σ 2 2 mint Ti1...id{|Ti1...id|}{ 1 |Ti1...id| P t Ti1...id f(t)f(t) }

+λ 2 (i1 id)J(U)|u i1 id F

I2 F c2σ 2 1 P t Ti1...id f(t)εi1...idt λ (i1 id)J(U)|u0i1 id F

Cc2σ 2 1 |Ti1...id|{ 1 |Ti1...id| P t Ti1...id εi1...idt} λ (i1 id)J(U)|u0i1 id F

|Ti1...id|1/2 T 1/2 max,

where a b and b a mean a/b is bounded, Tmin = mint Ti1...id{|Ti1...id|} and Tmax =

maxt Ti1...id{|Ti1...id|}. Thus, under condition (C5), we have bui1 id u0i1 id 2 T 1/2 max/Tmin

Dynamic Tensor Recommender Systems

Nτ/2 υ. Under condition (C1) and (C5), we have

1 N P (i1, ,id) Ω Fi1 id(bui1 id u0i1 id) 2 2 = 1 N P (i1, ,id) Ω(bui1 id u0i1 id) P t Ti1...id f(t)f(t) (bui1 id u0i1 id)

C maxt Ti1...id{|Ti1...id|} 1

N P (i1, ,id) Ω(bui1 id u0i1 id) (bui1 id u0i1 id)

T 2 max NT 2 min N 1+2(τ υ).

The proof of Theorem 2 is completed.

We can currently use a convenient basis system in our technical arguments and the results also hold true for other basis choices of the same function space. The B-spline and truncated polynomial basis functions span the same set of spline functions (de Boor, 2001), thus we use B-splines as the convenient basis system in our proofs. The B-splines have the following properties (de Boor, 2001): Bk(t) 0, PM k=1 Bk(t) = 1, t T, C1

M PM k=1 φ2 kdt R

T(PM k=1 φk Bk(t))2 C2

M PM k=1 φ2 k, C1 and C2 are constant and φk R.

Lemma 5 Under Conditions (C1)-(C5), we have

1 N P (i1,...,id) Ω(c Wi1...id Wi1...id) Σ 1 i1...id(c Wi1...id Wi1...id) = Op(N 1+2(τ υ)),

1 N P (i1,...,id) Ω(c Wi1...id Wi1...id) Σ 1 i1...id Wi1...id = Op(N 1+3τ/2 υ),

1 N P (i1,...,id) Ω(c Wi1...id Wi1...id) Σ 1 i1...idεi1...id = Op(N 1+τ υ).

Proof By Theorem 2 and the non-negative bounded properties of the B-spline basis functions (de Boor, 2001), we have for j, l = 1, . . . , r; e, k = 1, . . . , md+1,

1 N P (i1,...,id) Ω( b Xi1...idj Xi1...idj) Σ 1 i1...id( b Xi1...idl Xi1...idl)

= 1 N P (i1,...,id) Ω(bui1...idj ui1...idj)(bui1...idl ui1...idl)B i1...idjΣ 1 i1...id Bi1...idl

c2σ 2 1 Tmax 1

N P (i1,...,id) Ω(bui1...idj ui1...idj)(bui1...idl ui1...idl)

{ 1 |Ti1...id| P t Ti1...id Bj(t) Bl(t)}

= Op(N 1+2(τ υ)),

1 N P (i1,...,id) Ω( b Xi1...idj Xi1...idj) Σ 1 i1...id(b Zi1...ide Zi1...ide)

c2σ 2 1 1 N P (i1,...,id) Ω(bui1...idj ui1...idj)(bui1...id(r+1) ui1...id(r+1))B i1...idj Ai1...ide

= Op(N 1+2(τ υ)),

1 N P (i1,...,id) Ω(b Zi1...idk Zi1...idk) Σ 1 i1...id(b Zi1...ide Zi1...ide)

c2σ 2 1 1 N P (i1,...,id) Ω(bui1...id(r+1) ui1...id(r+1))2A i1...idk Ai1...ide = Op(N 1+2(τ υ)).

Zhang, Bi, Tang and Qu

By deﬁnition, we have

(i1,...,id) Ω (c Wi1...id Wi1...id) Σ 1 i1...id(c Wi1...id Wi1...id) = Op(N 1+2(τ υ)).

Under conditions (C3)-(C5), we have j, l = 1, . . . , r; e, k = 1, . . . , md+1,

1 N P (i1,...,id) Ω( b Xi1...idj Xi1...idj) Σ 1 i1...id Xi1...idl

c2σ 2 1 1 N P (i1,...,id) Ω(bui1...idj ui1...idj)ui1...idl B i1...idj Bi1...idl

c2σ 2 1 Tmax 1

N P (i1,...,id) Ω(bui1...idj ui1...idj)ui1...idl{ 1 |Ti1...id| P t Ti1...id Bj(t) Bl(t)}

= Op(N 1+3τ/2 υ),

1 N P (i1,...,id) Ω( b Xi1...idj Xi1...idj) Σ 1 i1...idεi1...id

c2σ 2 1 1 N P (i1,...,id) Ω(bui1...idj ui1...idj)B i1...idjεi1...id

CT 1/2 max 1

N P (i1,...,id) Ω(bui1...idj ui1...idj){ 1 |Ti1...id|1/2 P t Ti1...id εi1...idt}

= Op(N 1+τ υ),

1 N P (i1,...,id) Ω( b Xi1...idj Xi1...idj) Σ 1 i1...id Zi1...ide = Op(N 1+3τ/2 υ),

1 N P (i1,...,id) Ω(b Zi1...ide Zi1...ide) Σ 1 i1...id Xi1...idj = Op(N 1+3τ/2 υ),

1 N P (i1,...,id) Ω(b Zi1...ids Zi1...ids) Σ 1 i1...id Zi1...ide = Op(N 1+3τ/2 υ),

1 N P (i1,...,id) Ω(b Zi1...ide Zi1...ide) Σ 1 i1...idεi1...id = Op(N 1+τ υ).

By deﬁnition, we can obtain

(i1,...,id) Ω (c Wi1...id Wi1...id) Σ 1 i1...id Wi1...id = Op(N 1+3τ/2 υ),

(i1,...,id) Ω (c Wi1...id Wi1...id) Σ 1 i1...idεi1...id = Op(N 1+τ υ).

Proof of Theorem 3 Based on the criterion function (3), we can obtain the estimator of the coeﬃcient as follows:

(i1,i2,...,id) Ω

c W i1i2...idΣ 1 i1i2...id c Wi1i2...id + λI) 1 X

(i1,i2,...,id) Ω

c W i1i2...idΣ 1 i1i2...idyi1i2...id.

Thus, we have

bγ γ0 = P (i1,...,id) Ωc W i1...idΣ 1 i1i2...id c Wi1...id/N + λI/N 1

n P (i1,...,id) Ωc W i1...idΣ 1 i1i2...id(yi1...id c Wi1...idγ0)/N λγ0/N o .

Dynamic Tensor Recommender Systems

By Lemma 5, we have

1 N P (i1,...,id) Ωc W i1...idΣ 1 i1i2...id c Wi1...id

= 1 N P (i1,...,id) ΩW i1...idΣ 1 i1i2...id Wi1...id

N P (i1,...,id) Ω(c Wi1...id Wi1...id) Σ 1 i1i2...id(c Wi1...id Wi1...id)

N P (i1,...,id) Ω(c Wi1...id Wi1...id) Σ 1 i1i2...id Wi1...id

N P (i1,...,id) ΩW i1...idΣ 1 i1i2...id(c Wi1...id Wi1...id)

= 1 N P (i1,...,id) ΩW i1...idΣ 1 i1i2...id Wi1...id + Op(N 1+3τ/2 υ)

1 N P (i1,...,id) Ωc W i1...idΣ 1 i1i2...id(yi1...id c Wi1...idγ0)

= 1 N P (i1,...,id) ΩW i1...idΣ 1 i1i2...id(yi1...id Wi1...idγ0)

N P (i1,...,id) Ω(c Wi1...id Wi1...id) Σ 1 i1i2...id(c Wi1...id Wi1...id)γ0

N P (i1,...,id) Ω(c Wi1...id Wi1...id) Σ 1 i1i2...id(yi1...id Wi1...idγ0)

N P (i1,...,id) ΩW i1...idΣ 1 i1i2...id(c Wi1...id Wi1...id)γ0

= 1 N P (i1,...,id) ΩW i1...idΣ 1 i1i2...id(yi1...id Wi1...idγ0)

N P (i1,...,id) Ω(c Wi1...id Wi1...id) Σ 1 i1i2...id(yi1...id Wi1...idγ0) + Op(N 1+3τ/2 υ).

According to de Boor (2001), supt T |hj(t) ehj(t)| Ca ξ N and supt T |ge(t) ege(t)| Ca ξ N for each j = 1, . . . , r and each e = 1, . . . , md+1, where ehj(t) = α 0j Bj(t) and ege(t) = β 0e Ae(t). Let eyi1...id = Wi1...idγ0 = Pr j=1 ui1...idjehi1...idj + ui1...id(r+1)egi1...id,

yi1...id = Pr j=1 ui1...idjhi1...idj + ui1...id(r+1)gi1...id, where ehi1...idj = Bi1i2...idjα0j, egi1...id = Pmd+1 e=1 Ai1i2...ideβ0e, and hi1...idj and gi1...id consist of hj(t) and g(t) for all t Ti1...id, respectively. That is, yi1...id = yi1...id +εi1...id. Similar to the proof of lemma 5, we can obtain that

(i1,...,id) Ω W i1...idΣ 1 i1i2...id{

j=1 ui1...idj(hi1...idj ehi1...idj)} = Op(a ξ N Nτ 1),

(i1,...,id) Ω W i1...idΣ 1 i1i2...id(gi1...id egi1...id)ui1...id(r+1) = Op(a ξ N Nτ 1).

Zhang, Bi, Tang and Qu

Therefore, we have

1 N P (i1,...,id) ΩW i1...idΣ 1 i1i2...id(yi1...id Wi1...idγ0)

= 1 N P (i1,...,id) ΩW i1...idΣ 1 i1i2...id(yi1...id yi1...id)

N P (i1,...,id) ΩW i1...idΣ 1 i1i2...id(yi1...id eyi1...id)

= 1 N P (i1,...,id) ΩW i1...idΣ 1 i1i2...idεi1...id

N P (i1,...,id) ΩW i1...idΣ 1 i1i2...id{Pr j=1 ui1...idj(hi1...idj ehi1...idj)}

N P (i1,...,id) ΩW i1...idΣ 1 i1i2...id(gi1...id egi1...id)ui1...id(r+1)

= 1 N P (i1,...,id) ΩW i1...idΣ 1 i1i2...idεi1...id + op(a ξ N Nτ 1).

Similarly, we can obtain

1 N P (i1,...,id) Ω(c Wi1...id Wi1...id) Σ 1 i1i2...id(yi1...id Wi1...idγ0)

= 1 N P (i1,...,id) Ω(c Wi1...id Wi1...id) Σ 1 i1i2...idεi1...id

N P (i1,...,id) Ω(c Wi1...id Wi1...id) Σ 1 i1i2...id{Pr j=1 ui1...idj(hi1...idj ehi1...idj)}

N P (i1,...,id) Ω(c Wi1...id Wi1...id) Σ 1 i1i2...id(gi1...id egi1...id)ui1...id(r+1)

= Op(N 1+τ υ) + Op(a ξ N N 1+3τ/2 υ).

Thus, we have

1 N P (i1,...,id) Ωc W i1...idΣ 1 i1i2...id(yi1...id c Wi1...idγ0)

= 1 N P (i1,...,id) ΩW i1...idΣ 1 i1i2...idεi1...id + op(N 1+τ υ) + op(a ξ N Nτ 1).

Since λ = op(1), we have N 1λI = op(N 1) and N 1λγ0 = op(N 1). Then, we have

1 N P (i1,...,id) Ωc W i1...idΣ 1 i1i2...id c Wi1...id + 1

= 1 N P (i1,...,id) ΩW i1...idΣ 1 i1i2...id Wi1...id + op(N 1+3τ/2 υ),

1 N P (i1,...,id) Ωc W i1...idΣ 1 i1i2...id(yi1...id c Wi1...idγ0) 1

= 1 N P (i1,...,id) ΩW i1...idΣ 1 i1i2...idεi1...id + op(N 1+τ υ) + op(a ξ N Nτ 1).

For any vector c whose components are not all zero, we have

= P (i1,...,id) Ωc (P (i1,...,id) ΩW i1...idΣ 1 i1i2...id Wi1...id) 1W i1...idΣ 1 i1i2...idεi1...id + op(1)

= P (i1,...,id) Ωai1...idζi1...id + op(1),

where ζi1...id are independent with mean zero and variance one given {Wi1...id, (i1, . . . , id) Ω}, and

a2 i1...id = c (P (i1,...,id) ΩW i1...idΣ 1 i1i2...id Wi1...id) 1W i1...idΣ 1 i1i2...idΣ0 i1...idΣ 1 i1i2...id Wi1...id

(P (i1,...,id) ΩW i1...idΣ 1 i1i2...id Wi1...id) 1c.

Dynamic Tensor Recommender Systems

It follows easily by checking the Lindeberg condition that if

max(i1,...,id) Ωa2 i1...id P (i1,...,id) Ωa2 i1...id p 0, (13)

then P (i1,...,id) Ωai1...idζi1...id/ q P (i1,...,id) Ωa2 i1...id is asymptotically N(0, 1). We only need

to show that (13) holds. Based on the properties of the B-splines and conditions (C3)-(C4), we have, for any φ = (φ 1, . . . , φ r+md+1) with φj RM,

φ W i1...idΣ 1 i1i2...idΣ0 i1...idΣ 1 i1i2...id Wi1...idφ

c2 2σ 2 1 P t Ti1...id(φ wi1...idt)2

= c2 2σ 2 1 P t Ti1...id{Pr j=1 ui1i2...idjφ j Bj(t) + ui1i2...id(r+1) Pmd+1 e=1 φ r+e Ae(t)I(t se)}2

cc2 2σ 2 1 P t Ti1...id[(Pr j=1 u2 i1i2...idj) Pr j=1{φ j Bj(t)}2 + u2 i1i2...id(r+1) Pmd+1 e=1 {φ r+e Ae(t)}2]

(Pr j=1 u2 i1i2...idj) Pr j=1 P t Ti1...id{φ j Bj(t)}2 + u2 i1i2...id(r+1) Pmd+1 e=1 P t Ti1...id{φ r+e Ae(t)}2

(Pr j=1 u2 i1i2...idj) Pr j=1 C2

M |φj|2 + u2 i1i2...id(r+1) Pmd+1 e=1 C2

(Pr+1 j=1 u2 i1i2...idj)C2

By Lemmas A.1 and A.2 in Huang et al. (2004), we have

φ (P (i1,...,id) ΩW i1...idΣ 1 i1i2...idΣ0 i1...idΣ 1 i1i2...id Wi1...id)φ

c2 1σ 2 2 P (i1,...,id) Ωφ W i1...id Wi1...idφ

c2 1σ 2 2 NTmin[ 1

N P (i1,...,id) Ω 1 |Ti1...id| P t Ti1...id{Pr j=1 ui1i2...idjφ j Bj(t)

+ui1i2...id(r+1) Pmd+1 e=1 φ r+e Ae(t)I(t se)}2]

c2 1σ 2 2 NTmin[ 1

N P (i1,...,id) Ω 1 |Ti1...id| P t Ti1...id{Pr j=1 ui1i2...idjφ j Bj(t)

+ mine{ui1i2...id(r+1)φ r+e Ae(t)}}2]

Thus, we obtain that

max(i1,...,id) Ωφ W i1...idΣ 1 i1i2...idΣ0 i1...idΣ 1 i1i2...id Wi1...idφ

φ (P (i1,...,id) ΩW i1...idΣ 1 i1i2...idΣ0 i1...idΣ 1 i1i2...id Wi1...id)φ 1 NTmin .

Hence, (13) holds. Then we have c (bγ γ0)var{c (bγ γ0)} 1/2 L N(0, 1), where var{c (bγ γ0)} = c Ψ 1ΦΨ 1c, in which Φ = P (i1,...,id) ΩW i1...idΣ 1 i1i2...idΣ0 i1...idΣ 1 i1i2...id Wi1...id, and Ψ = P (i1,...,id) ΩW i1...idΣ 1 i1i2...id Wi1...id.

Zhang, Bi, Tang and Qu

For any φ = (φ 1, . . . , φ r+md+1) with φj RM, under conditions (C3)-(C4), we have

φ (P (i1,...,id) ΩW i1...idΣ 1 i1i2...idΣ0 i1...idΣ 1 i1i2...id Wi1...id)φ

c2 2σ 2 1 P (i1,...,id) Ωφ W i1...id Wi1...idφ

c2 2σ 2 1 NTmax[ 1

N P (i1,...,id) Ω 1 |Ti1...id| P t Ti1...id{Pr j=1 ui1i2...idjφ j Bj(t)

+ui1i2...id(r+1) Pmd+1 e=1 φ r+e Ae(t)I(t se)}2]

and φ (P (i1,...,id) ΩW i1...idΣ 1 i1i2...id Wi1...id)φ

c1σ 2 2 NTmin[ 1

N P (i1,...,id) Ω 1 |Ti1...id| P t Ti1...id(φ wi1...idt)2]

Since M = a N + κ + 1, we have

var{c (bγ γ0)} = c (P (i1,...,id) ΩW i1...idΣ 1 i1i2...id Wi1...id) 1

(P (i1,...,id) ΩW i1...idΣ 1 i1i2...idΣ0 i1...idΣ 1 i1i2...id Wi1...id)

(P (i1,...,id) ΩW i1...idΣ 1 i1i2...id Wi1...id) 1c

= c Ψ 1ΦΨ 1c

NT 2 min a NTmax

NT 2 min a NN 1+τ 2υ,

where Ψ = X

(i1,i2,...,id) Ω W i1i2...idΣ 1 i1i2...id Wi1i2...id,

(i1,i2,...,id) Ω W i1i2...idΣ 1 i1i2...idΣ0 i1i2...idΣ 1 i1i2...id Wi1i2...id.

The proof of the theorem is complete.

Miguel Araujo, Pedro Ribeiro, Hyun Ah Song, and Christos Faloutsos. Tensorcast: Forecasting and mining with coupled tensors. Knowledge and Information Systems, 59(3): 497 522, 2019.

Xuan Bi, Annie Qu, and Xiaotong Shen. Multilayer tensor factorization with applications to recommender systems. The Annals of Statistics, 46(6B):3308 3333, 2018.

Bart J. Bronnenberg, Michael W. Kruger, and Carl F. Mela. Database paper the iri marketing data set. Marketing Science, 27(4):745 748, 2008.

O. Celma. Music Recommendation and Discovery in the Long Tail. Springer, 2010.

Dynamic Tensor Recommender Systems

Chris Chatﬁeld. Calculating interval forecasts. Journal of Business & Economic Statistics, 11(2):121 135, 1993.

Bilian Chen, Simai He, Zhening Li, and Shuzhong Zhang. Maximum block improvement and polynomial optimization. SIAM Journal on Optimization, 22(1):87 107, 2012.

Gerda Claeskens, Tatyana Krivobokova, and Jean D. Opsomer. Asymptotic properties of penalized spline estimators. Biometrika, 96(3):529 544, 2009.

Carl de Boor. A Practical Guide to Splines; rev. ed. Springer, Berlin, 2001.

Vin de Silva and Lek-Heng Lim. Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM Journal on Matrix Analysis and Applications, 30(3):1084 1127, 2008.

Robin Devooght and Hugues Bersini. Long and short-term recommendations with recurrent neural networks. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, UMAP 17, page 13 21, New York, NY, USA, 2017.

Van Than Dung and Tegoeh Tjahjowidodo. A direct method to solve optimal knots of b-spline curves: An application for non-uniform b-spline curves ﬁtting. PLOS ONE, 12 (3):1 24, 2017.

Jianqing Fan and Wenyang Zhang. Statistical methods with varying coeﬃcient models. Statistics and Its Interface, 1(1):179 195, 2008.

Evgeny Frolov and Ivan Oseledets. Tensor methods and recommender systems. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(3):e1201, 2017.

Michael Giering. Retail sales prediction and item recommendations using customer demographics at store level. SIGKDD Explorations Newsletter, 10(2):84 89, 2008.

San Gultekin and John Paisley. A collaborative kalman ﬁlter for time-evolving dyadic processes. In Proceedings of the 2014 IEEE International Conference on Data Mining, pages 140 149, Washington, DC, 2014.

Guibing Guo, Feida Zhu, Shilin Qu, and Xingwei Wang. Pccf: Periodic and continual temporal co-factorization for recommender systems. Information Sciences, 436-437:56 73, 2018.

Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. Evaluating collaborative ﬁltering recommender systems. ACM Transactions on Information Systems, 22(1):5 53, 2004.

Charles C. Holt. Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting, 20(1):5 10, 2004.

Jianhua Z. Huang. Local asymptotics for polynomial spline regression. The Annals of Statistics, 31(5):1600 1635, 2003.

Zhang, Bi, Tang and Qu

Jianhua Z. Huang and Haipeng Shen. Functional coeﬃcient regression models for nonlinear time series: A polynomial spline approach. Scandinavian Journal of Statistics, 31 (4):515 534, 2004.

Jianhua Z. Huang, Colin O. Wu, and Lan Zhou. Polynomial spline estimation and inference for varying coeﬃcient models with longitudinal data. Statistica Sinica, 14(3):763 788, 2004.

Young-Jun Ko, Lucas Maystre, and Matthias Grossglauser. Collaborative recurrent neural networks for dynamic recommender systems. volume 63 of Proceedings of Machine Learning Research, pages 366 381, The University of Waikato, Hamilton, New Zealand, 16 18 Nov 2016.

Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455 500, 2009.

Yehuda Koren. Collaborative ﬁltering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 447 456, New York, NY, 2009.

Kung Yee Liang and Scott Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 73(1):13 22, 1986.

Jinzhi Liao, Jiuyang Tang, Xiang Zhao, and Haichuan Shang. Improving poi recommendation via dynamic tensor completion. Scientiﬁc Programming, pages 1 11, 2018.

Dun Liu and Xiaoqing Ye. A matrix factorization based dynamic granularity recommendation with three-way decisions. Knowledge-Based Systems, 191:105243, 2020.

Amit Livne, Moshe Unger, Bracha Shapira, and Lior Rokach. Deep context-aware recommender system utilizing sequential latent context. Co RR, abs/1909.03999, 2019. URL http://arxiv.org/abs/1909.03999.

Xin Luo, Yunni Xia, and Qingsheng Zhu. Incremental collaborative ﬁltering recommender based on regularized matrix factorization. Knowledge-Based Systems, 27:271 280, 2012.

Dimitrios Rafailidis and Alexandros Nanopoulos. Modeling the dynamics of user preferences in coupled tensor factorization. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 321 324, New York, NY, 2014.

James Salter and Nick Antonopoulos. Cinemascreen recommender agent: combining collaborative and content-based ﬁltering. IEEE Intelligent Systems, 21(1):35 41, 2006.

Nicholas D. Sidiropoulos and Rasmus Bro. On the uniqueness of multilinear decomposition of n-way arrays. Journal of Chemometrics, 14(3):229 239, 2000.

Jieun Son and Seoung Bum Kim. Content-based ﬁltering for recommendation systems using multiattribute networks. Expert Systems with Applications, 89:404 412, 2017.

Dandan Song, Zhifan Li, Mingming Jiang, Lifei Qin, and Lejian Liao. A novel temporal and topic-aware recommender model. World Wide Web, 22(5):2105 2127, 2019.

Dynamic Tensor Recommender Systems

W. Van Loock, G. Pipeleers, J. De Schutter, and J. Swevers. A convex optimization approach to curve ﬁtting with b-splines. IFAC Proceedings Volumes, 44(1):2290 2295, 2011.

Jian Wei, Jianhua He, Kai Chen, Yi Zhou, and Zuoyin Tang. Collaborative ﬁltering and deep learning based recommendation system for cold start items. Expert Systems with Applications, 69:29 39, 2017.

Peter R. Winters. Forecasting sales by exponentially weighted moving averages. Management Science, 6(3):324 342, 1960.

Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J. Smola, and How Jing. Recurrent recommender networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 495 503, New York, NY, 2017.

Xian Wu, Baoxu Shi, Yuxiao Dong, Chao Huang, and Nitesh V. Chawla. Neural tensor factorization for temporal interaction learning. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 537 545, New York, NY, 2019.

Liang Xiong, Xi Chen, Tzu-Kuo Huang, JeﬀSchneider, and Jaime G Carbonell. Temporal collaborative ﬁltering with bayesian probabilistic tensor factorization. In Proceedings of the 2010 SIAM International Conference on Data Mining, pages 211 222, 2010.

Lan Xue and Lijian Yang. Additive coeﬃcient modeling via polynomial spline. Statistica Sinica, 16(4):1423 1446, 2006.

Hsiang-Fu Yu, Nikhil Rao, and Inderjit S Dhillon. Temporal regularized matrix factorization for high-dimensional time series prediction. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29, pages 847 855. Curran Associates, Inc., 2016.

Yuan Yuan, Nan Chen, and Shiyu Zhou. Adaptive b-spline knot selection using multiresolution basis set. IIE Transactions, 45(12):1263 1277, 2013.

Chenyi Zhang, Ke Wang, Hongkun Yu, Jianling Sun, and Ee-Peng Lim. Latent factor transition for dynamic collaborative ﬁltering. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 452 460, 2014.