# Latent Sparse Modeling of Longitudinal Multi-Dimensional Data

Ko-Shin Chen,¹ Tingyang Xu,² Jinbo Bi¹
¹Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA (ko-shin.chen@uconn.edu, jinbo.bi@uconn.edu)
²Tencent AI Lab, Shenzhen, China (tingyangxu@tencent.com)

*This work was done while T. Xu was with the Department of Computer Science and Engineering at the University of Connecticut.*

## Abstract

We propose a tensor-based approach to analyze multi-dimensional data describing sample subjects. It simultaneously discovers patterns in features and reveals past temporal points that have impact on current outcomes. The model coefficient, a $K$-mode tensor, is decomposed into a summation of $K$ tensors of the same dimension. To accomplish feature selection, we introduce the tensor latent $L_{F,1}$ norm as a grouped penalty in our formulation. Furthermore, the proposed model takes into account within-subject correlations by developing a tensor-based quadratic inference function. We provide an asymptotic analysis of our model when the sample size approaches infinity. To solve the corresponding optimization problem, we develop a linearized block coordinate descent algorithm and prove its convergence for a fixed sample size. Computational results on synthetic datasets and real-life fMRI and EEG problems demonstrate the superior performance of the proposed approach over existing techniques.

## Introduction

In this paper we introduce a tensor-based quadratic inference function (Tensor QIF) machine learning model that can be used to analyze longitudinal data and select features efficiently. Longitudinal data consist of repeated observations of sample subjects during a given time period. They appear in a variety of areas, from finance (Arnold, Liu, and Abe 2007; Sela and Simonoff 2012) to scientific research (Arnold, Liu, and Abe 2007; Lozano et al. 2009; Wang, Zhou, and Qu 2012), health care, and medicine (Bi et al. 2013; Fowler and Christakis 2008; Stappenbeck and Fromme 2010). One notable feature of longitudinal data is the repeated measurement within each subject. The observed responses are therefore generally dependent, and the longitudinal correlation among different outcomes must be considered to obtain correct predictions.

There are several extended generalized linear models that can be applied to time-dependent data under different assumptions. Diggle et al. (2002) provide a comprehensive overview of such models. For fitting marginal models, the generalized estimating equation (GEE) (Liang and Zeger 1986) and the quadratic inference function (QIF) (Qu and Li 2006) are common statistical approaches. They are generally more accurate than classic regression analyses that assume independently and identically distributed (i.i.d.) samples. In the GEE model, the correlation structure of the outcomes is presumed and a so-called working correlation matrix, $R$, is specified. In practice, however, the true correlation is often unknown, and a GEE model with a misspecified working correlation matrix no longer yields optimal estimation of the coefficients (Crowder 1995). In addition, the inverse of $R$ is required, which may cause poor estimation when $R$ is high-dimensional (Qu and Lindsay 2003). To overcome these disadvantages, Qu, Lindsay, and Li (2000) suggest the QIF model, in which $R^{-1}$ is approximated by a linear combination of several basis matrices.
This method ensures that the estimator always exists and does not require estimating nuisance parameters associated with the correlations. For feature selection, penalized GEE (Fu 2003) and penalized QIF (Bai, Fung, and Zhu 2009) have been proposed.

In this work, we study the lagged effect of covariates on outcomes. In many studies, it is necessary and insightful to model simultaneously the correlation among outcomes and the lagged effects of covariates, the so-called Granger causality (Granger 1980). For example, Shen et al. (2014) point out that evidence of brain disease may appear in functional magnetic resonance imaging (fMRI) at an early stage, before clear symptoms are identified. Recent graphical Granger models such as (Arnold, Liu, and Abe 2007; Lozano et al. 2009) ignore the temporal correlations. Xu, Sun, and Bi (2015) model such correlation through the GEE method, but their model only applies to datasets with one spatial dimension. Our goal is to develop a new penalized QIF method in a tensor setting to model temporal prediction.

Tensor regression has been shown to be powerful in learning complex feature structures from multi-dimensional data, and many tensor techniques have been developed and applied to a broad range of applications (Hoff 2015; Zhou, Li, and Zhu 2013). However, when it comes to feature selection (e.g., sparse tensor decomposition), most existing methods either assume i.i.d. samples, or assume correlated samples but do not model temporal additive effects.

We propose a new learning formulation that constructs a tensor-based predictive model as a function of covariates, not only from the current observation but also from multiple previous consecutive observations. Simultaneously, the model determines the temporal contingency and the most influential features along each dimension of the tensor data. Given that a data sample is characterized by a tensor, the coefficients in our additive model also form a $K$-way tensor. To select features, we decompose the $K$-way coefficient tensor into a summation of $K$ sparse $K$-way tensors, as shown in Figure 1. Each component tensor exhibits sparsity along one particular direction, and we impose a different block-wise least absolute shrinkage and selection operator (LASSO) penalty on each component. We use a linearized block coordinate descent algorithm via a proximal map (Beck and Teboulle 2009; Xu and Yin 2017) to efficiently solve the optimization problem. This approach leads to $K$ sub-problems that share the same structure. We validate the effectiveness of the proposed method in simulations and in the analysis of real-life fMRI and EEG datasets.

Figure 1: Case for $K = 3$: a 3-way tensor is decomposed into a summation of three 3-way tensors so that each component tensor is sparse along a particular direction.

The rest of this paper is organized as follows. We first briefly review the GEE and QIF methods and then introduce our proposed formulation, Tensor QIF, in the Method section, followed by an Asymptotic Analysis section. An optimization algorithm for solving the formulation is described in the Algorithm section, where we also prove convergence and the recovery of the feature support. Experimental results are included and discussed in the Empirical Evaluation section, followed by a Conclusion section.

## Method

### Notations

We represent a $K$-way tensor as $\mathcal{A} \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_K}$, which contains $N = \prod_{k=1}^{K} d_k$ elements.
The inner product of two tensors $\mathcal{A}$ and $\mathcal{B}$ is given by $\langle \mathcal{A}, \mathcal{B} \rangle = \mathrm{vect}(\mathcal{A})^{\top} \mathrm{vect}(\mathcal{B})$, where $\mathrm{vect}(\cdot)$ denotes the column-major vectorization of a tensor. The Frobenius norm of a tensor $\mathcal{A}$ is defined by $\|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$. The $j$-th sub-tensor of $\mathcal{A}$ along mode $k$ is obtained by fixing the $k$-th index at $j$, i.e. $\mathcal{A}^{(j)}_{(k)} = \mathcal{A}(i_1, i_2, \ldots, i_k = j, i_{k+1}, \ldots, i_K)$; note that $\mathcal{A}^{(j)}_{(k)}$ is a $(K-1)$-way tensor. A mode-$k$ fiber of $\mathcal{A}$ is the $d_k$-dimensional vector obtained by fixing all indices of $\mathcal{A}$ except the $k$-th one. The mode-$k$ unfolding of $\mathcal{A}$ is the matrix $A_{(k)} \in \mathbb{R}^{d_k \times N/d_k}$ formed by concatenating all $N/d_k$ mode-$k$ fibers as its columns. The operator $[\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_m]$ creates a $(K+1)$-way tensor by concatenating $m$ $K$-way tensors $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_m$ of the same dimension.

### Generalized Linear Models of a Tensor

Because our model is concerned with tensor regression and classification, we first introduce a basic tensor formulation in which the objective function consists of two parts: a loss function $l$ and a regularizer. Let $(\mathcal{X}_i, y_i)_{1 \le i \le m}$ be a dataset, where $\mathcal{X}_i \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_K}$ is a covariate tensor and $y_i \in \mathbb{R}$ (resp. $\{\pm 1\}$) is the corresponding outcome for regression (resp. classification). We consider the linear model

$$\min_{\mathcal{W}} \; \sum_{i=1}^{m} l\big(y_i, \langle \mathcal{X}_i, \mathcal{W} \rangle\big) + \lambda \|\mathcal{W}\|_{(*)}, \tag{1}$$

where $\lambda \ge 0$ is the regularization parameter and $\|\cdot\|_{(*)}$ is a certain tensor norm. The elements of the tensor $\mathcal{W}$ are the model coefficients to be fitted. In the study of low-rank tensor decompositions, the overlapped/latent tensor trace norm (Wimalawarne, Tomioka, and Sugiyama 2016) or the Schatten norm (Tomioka and Suzuki 2013) is widely applied in (1). Although these latent tensor norms facilitate the search for a low-rank tensor solution, they cannot enforce sparsity and thus cannot select the most relevant features. In this paper, we focus on sparsity and feature selection by imposing a regularization condition that forces entire slices of the coefficient tensor to zero. In other words, our model selects nonzero slices in each direction of the tensor $\mathcal{W}$. We hence introduce the latent $L_{F,1}$ norm defined by

$$\|\mathcal{W}\|_{l\text{-}L_{F,1}} := \inf_{\sum_{k=1}^{K} \mathcal{W}_k = \mathcal{W}} \; \sum_{k=1}^{K} \lambda_k \sum_{j=1}^{d_k} \big\|(\mathcal{W}_k)^{(j)}_{(k)}\big\|_F, \tag{2}$$

where the $\lambda_k$'s are nonnegative constants. One can easily verify that Eq. (2) satisfies all required norm properties.

There are various settings for the loss function $l$ depending on the specific learning task. When the dataset is assumed to be i.i.d., two simple and commonly used choices are the squared loss $l(y_i, \langle \mathcal{X}_i, \mathcal{W} \rangle) = (y_i - \langle \mathcal{X}_i, \mathcal{W} \rangle)^2$ for regression and the logistic loss $l(y_i, \langle \mathcal{X}_i, \mathcal{W} \rangle) = \log(1 + \exp(-y_i \langle \mathcal{X}_i, \mathcal{W} \rangle))$ for classification. A more general family, the generalized linear model (GLM), has been used under an exponential-family assumption on the dependent variable; this family includes both the squared loss and the logistic loss. To deal with correlated samples, the GLM has been further extended from point estimation to variance estimation, which leads to more complicated formulas such as GEE or QIF. Between these two, QIF is more effective, as discussed earlier. In this paper, we use the QIF setting to analyze additive effects in longitudinal datasets. The complete formula for $l$ in our model is given in the next section.
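To make the notation concrete, the following short NumPy sketch (ours, for illustration only; the function names are not from the paper) evaluates the mode-$k$ slices and the latent $L_{F,1}$ penalty of Eq. (2) for a given decomposition $\mathcal{W} = \sum_{k} \mathcal{W}_k$.

```python
# Illustrative sketch (not the authors' code): mode-k slices and the latent
# L_{F,1} penalty of Eq. (2) evaluated at a given decomposition W = sum_k W_k.
import numpy as np

def mode_k_slice(A, k, j):
    """The j-th sub-tensor of A along mode k, a (K-1)-way tensor."""
    return np.take(A, indices=j, axis=k)

def latent_LF1_penalty(W_components, lambdas):
    """sum_k lambda_k * sum_j ||(W_k)^{(j)}_{(k)}||_F for the given components."""
    total = 0.0
    for k, (Wk, lam) in enumerate(zip(W_components, lambdas)):
        for j in range(Wk.shape[k]):
            total += lam * np.linalg.norm(mode_k_slice(Wk, k, j))
    return total

# Example: K = 3 components of a 5 x 4 x 3 coefficient tensor.
rng = np.random.default_rng(0)
W_parts = [rng.normal(size=(5, 4, 3)) for _ in range(3)]
print(latent_LF1_penalty(W_parts, lambdas=[0.3, 0.3, 0.3]))
```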
### The Proposed QIF Formulation

Let $\mathcal{X}^{(i)}_t \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_{K-1}}$ be a $(K-1)$-way tensor representing the covariates measured for subject $i$ at time $t$, and denote by $y^{(i)}_t$ the outcome of subject $i$ at time $t$. We assume that $y^{(i)}_t$ depends not only on the current record $\mathcal{X}^{(i)}_t$ but also on the previous $\tau$ records $\mathcal{X}^{(i)}_{t-1}, \mathcal{X}^{(i)}_{t-2}, \ldots, \mathcal{X}^{(i)}_{t-\tau}$. Hence we may view a sample at a particular time $t$ as a pair $(\mathcal{X}^{(i;t)}, y^{(i)}_t)$, where $\mathcal{X}^{(i;t)}$ is a $K$-way tensor concatenating all considered records:

$$\mathcal{X}^{(i;t)} := \big[\mathcal{X}^{(i)}_t, \mathcal{X}^{(i)}_{t-1}, \mathcal{X}^{(i)}_{t-2}, \ldots, \mathcal{X}^{(i)}_{t-\tau}\big].$$

Suppose there are $T$ measurement times in total for each subject $i$. In order to have enough previous observations, the index $t$ of $\mathcal{X}^{(i;t)}$ starts from $\tau + 1$, so there are $n := T - \tau$ training examples per subject. In the graphical Granger model, the relation between $\mathcal{X}^{(i;t)}$ and $y^{(i)}_t$ is given by

$$y^{(i)}_t = \langle \mathcal{X}^{(i;t)}, \mathcal{W} \rangle \tag{3}$$

for some tensor coefficient $\mathcal{W} \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_{K-1} \times d_K}$, where $d_K = \tau + 1$. We denote by $N := \prod_{k=1}^{K} d_k$ the number of elements in $\mathcal{W}$. However, the training examples in (3) are assumed to be i.i.d., which does not fit the intrinsic property of our dataset. In our case, consecutive examples share overlapping records (e.g., $\mathcal{X}^{(i;t)}$ and $\mathcal{X}^{(i;t+1)}$ share the records $\mathcal{X}^{(i)}_t, \mathcal{X}^{(i)}_{t-1}, \ldots, \mathcal{X}^{(i)}_{t-\tau+1}$), and outcomes such as $y^{(i)}_t$ and $y^{(i)}_{t-1}$ are correlated. Hence, in this paper we adapt the QIF model, which together with GEE belongs to the GLM family.

There are two essential ingredients in a GLM: a link function and a variance function. The link function describes the relation between a linear predictor $\eta$ and the mean (expectation) of an outcome $y$; the variance function tells how the variance of an outcome $y$ depends on its mean. In our formulation, these can be expressed by

$$\mu^{(i)}_t := E[y^{(i)}_t] = h^{-1}(\eta^{(i)}_t), \qquad \mathrm{var}(y^{(i)}_t) = V(\mu^{(i)}_t), \tag{4}$$

where $h$ is a link function determined by a presumed exponential-family distribution on $y_t$, $V$ is a variance function, and

$$\eta^{(i)}_t = \langle \mathcal{X}^{(i;t)}, \mathcal{W} \rangle \tag{5}$$

is the linear predictor. Let $y^{(i)} := (y^{(i)}_{\tau+1}, \ldots, y^{(i)}_{\tau+n})^{\top}$ be an $n$-dimensional column vector. In GEE models, the covariance matrix $\Sigma^{(i)}$ of $y^{(i)}$ is modeled by

$$\Sigma^{(i)} := \big(A^{(i)}\big)^{1/2} R(\alpha) \big(A^{(i)}\big)^{1/2}. \tag{6}$$

Here $R(\alpha)$ is the working correlation matrix, and $A^{(i)}$ is an $n \times n$ diagonal matrix with $V(\mu^{(i)}_{\tau+j})$ as the $j$-th diagonal element. The matrix $\Sigma^{(i)}$ equals $\mathrm{cov}(y^{(i)})$ if $R(\alpha)$ is the true correlation structure for $y^{(i)}$ (Liang and Zeger 1986). The model coefficients are then obtained by solving the score equation from the quasi-likelihood analysis, which in our setting becomes

$$\sum_{i=1}^{m} \big(D^{(i)}\big)^{\top} \big(A^{(i)}\big)^{-1/2} R^{-1}(\alpha) \big(A^{(i)}\big)^{-1/2} s^{(i)} = 0. \tag{7}$$

Here $s^{(i)} = y^{(i)} - \mu^{(i)}$ and $\mu^{(i)} = (\mu^{(i)}_{\tau+1}, \ldots, \mu^{(i)}_{\tau+n})^{\top}$, which depends on $\mathcal{W}$ (see (4) and (5)). The $n \times N$ matrix $D^{(i)}$ is given by $D^{(i)} = \partial \mu^{(i)} / \partial w$, where $w = \mathrm{vect}(\mathcal{W})$ and $D^{(i)}_{ab} = \partial (\mu^{(i)})_a / \partial (w)_b$.

In the more advanced QIF method, the working correlation no longer needs to be pre-specified as in GEE, which can be very inaccurate. Rather, one directly models

$$R^{-1}(\alpha) \approx \sum_{j=1}^{d} a_j M_j, \tag{8}$$

where the $M_j$'s are known $n \times n$ matrices characterizing various basic correlation structures and the $a_j$'s are unknown parameters. For example, an AR-1 correlation can be expressed as $R^{-1}(\alpha) = a_1 M_1 + a_2 M_2$, where $M_1$ is the identity and $M_2$ satisfies $(M_2)_{i,j} = 1$ if $|i - j| = 1$ and $(M_2)_{i,j} = 0$ otherwise. Instead of solving for the $a_j$'s associated with (7), we formulate our optimization problem via the so-called extended score, obtained by substituting (8) for $R^{-1}(\alpha)$ in (7):

$$\bar{g}_m(\mathcal{W}) := \frac{1}{m} \sum_{i=1}^{m} g^{(i)}(\mathcal{W}), \qquad g^{(i)}(\mathcal{W}) = \begin{pmatrix} \big(D^{(i)}\big)^{\top} \big(A^{(i)}\big)^{-1/2} M_1 \big(A^{(i)}\big)^{-1/2} s^{(i)} \\ \vdots \\ \big(D^{(i)}\big)^{\top} \big(A^{(i)}\big)^{-1/2} M_d \big(A^{(i)}\big)^{-1/2} s^{(i)} \end{pmatrix}. \tag{9}$$

We may view each $g^{(i)}(\mathcal{W})$ as a random vector $g(\mathcal{X}, s, \mathcal{W})$ evaluated at the data $\{s^{(i)}, \mathcal{X}^{(i)} = (\mathcal{X}^{(i;\tau+1)}, \ldots, \mathcal{X}^{(i;\tau+n)})\}$. The vector $\bar{g}_m(\mathcal{W})$ in (9) is an $Nd$-dimensional column vector.
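As a concrete illustration of Eq. (9), the sketch below (ours; the identity-link simplification and the function names are our assumptions, not the paper's code) builds the AR-1 basis matrices $M_1, M_2$ and stacks one subject's extended score for a linear model, where $D^{(i)}$ is the $n \times N$ matrix whose rows are $\mathrm{vect}(\mathcal{X}^{(i;t)})$ and $A^{(i)}$ reduces to the identity.

```python
# Illustrative sketch (identity link, A^{(i)} = I): AR-1 basis matrices and the
# stacked extended score g^{(i)}(W) of Eq. (9) for a single subject.
import numpy as np

def ar1_basis(n):
    """M_1 = identity; M_2 has ones on the first off-diagonals, zeros elsewhere."""
    return [np.eye(n), np.eye(n, k=1) + np.eye(n, k=-1)]

def extended_score(D, y, w, basis):
    """Stack D^T M_j s over the basis matrices, with s = y - D w."""
    s = y - D @ w                                   # residual s^{(i)} = y^{(i)} - mu^{(i)}
    return np.concatenate([D.T @ (M @ s) for M in basis])

# Toy example: n = 6 within-subject examples, N = 10 coefficients, d = 2 bases.
rng = np.random.default_rng(0)
D = rng.normal(size=(6, 10))                        # rows are vect(X^{(i;t)})
w = rng.normal(size=10)
y = D @ w + rng.normal(scale=0.1, size=6)
print(extended_score(D, y, w, ar1_basis(6)).shape)  # (N*d,) = (20,)
```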
In fact, substituting (8) into (7) yields a linear combination of the row blocks of $\bar{g}_m(\mathcal{W})$. Since $\bar{g}_m(\mathcal{W})$ has a larger dimension than $\mathcal{W}$, we cannot estimate $\mathcal{W}$ by simply solving $\bar{g}_m(\mathcal{W}) = 0$. Adapting the ideas of (Qu and Li 2006) and (Qu, Lindsay, and Li 2000), we obtain $\mathcal{W}$ by minimizing the weighted length of $\bar{g}_m(\mathcal{W})$:

$$\min_{\mathcal{W}} \; Q_m(\mathcal{W}) := m\, \bar{g}_m(\mathcal{W})^{\top}\, C_m^{-1}(\mathcal{W})\, \bar{g}_m(\mathcal{W}), \tag{10}$$

where

$$C_m(\mathcal{W}) = \frac{1}{m} \sum_{i=1}^{m} g^{(i)}(\mathcal{W})\, g^{(i)}(\mathcal{W})^{\top} \tag{11}$$

estimates the covariance matrix of $\bar{g}_m$. The use of $C_m$ leads to an efficient model (Hansen 1982), because the calculation of $C_m$, a direct estimate of the covariance, allows us to omit the step of estimating the $a_j$'s. In our Tensor QIF model, the loss function is $l(\mathcal{W}) = Q_m(\mathcal{W})$ and the regularization term is given by (2). More precisely, we solve the following optimization problem:

$$\min_{\mathcal{W}_1, \mathcal{W}_2, \ldots, \mathcal{W}_K} \; Q_m(\mathcal{W}) + \sum_{k=1}^{K} \lambda_k \sum_{j=1}^{d_k} \big\|(\mathcal{W}_k)^{(j)}_{(k)}\big\|_F, \tag{12}$$

where each $\mathcal{W}_k \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_K}$ and the final coefficient tensor is

$$\mathcal{W} = \sum_{k=1}^{K} \mathcal{W}_k. \tag{13}$$
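The following sketch (ours, illustrative) assembles the pieces of Eqs. (9)-(11): given the per-subject extended scores $g^{(i)}(\mathcal{W})$ (e.g. computed as in the previous sketch), it averages them, forms the sample covariance $C_m$, and evaluates $Q_m(\mathcal{W})$. The small ridge term is our practical safeguard to keep $C_m$ invertible and is not part of the paper's formulation.

```python
# Illustrative sketch of the QIF loss Q_m(W) in Eq. (10), given the m extended
# scores g^{(i)}(W) as vectors of length N*d.
import numpy as np

def qif_objective(scores, ridge=1e-8):
    """Q_m = m * g_bar^T C_m^{-1} g_bar with C_m the sample covariance, Eq. (11)."""
    m = len(scores)
    g_bar = np.mean(scores, axis=0)                  # averaged extended score, Eq. (9)
    C_m = sum(np.outer(g, g) for g in scores) / m    # Eq. (11)
    C_m = C_m + ridge * np.eye(C_m.shape[0])         # numerical safeguard (ours)
    return m * g_bar @ np.linalg.solve(C_m, g_bar)   # Eq. (10)

# Toy usage: m = 30 subjects, extended scores of dimension N*d = 20.
rng = np.random.default_rng(0)
print(qif_objective([rng.normal(size=20) for _ in range(30)]))
```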
## Asymptotic Analysis

In this section we establish the asymptotic normality of our Tensor QIF model as $m$ approaches infinity. Below, $A \xrightarrow{p} B$ means that $A$ converges to $B$ in probability, i.e. $P(|A - B| > \varepsilon) \to 0$ for any $\varepsilon > 0$, and $F \xrightarrow{d} G$ means that $F$ converges to $G$ in distribution, i.e. the distribution function of $F$ converges to that of $G$. We first rescale the objective function in (12):

$$\min_{\mathcal{W}_1, \ldots, \mathcal{W}_K} \; \bar{Q}_m(\mathcal{W}) + \frac{1}{m} \sum_{k=1}^{K} \lambda_k \sum_{j=1}^{d_k} \big\|(\mathcal{W}_k)^{(j)}_{(k)}\big\|_F, \tag{14}$$

where $\bar{Q}_m = \bar{g}_m^{\top} C_m^{-1} \bar{g}_m$. We require the following regularity conditions on the random vector $g$:

1. There exists a unique $\mathcal{W}^*$ that satisfies the mean-zero model assumption, i.e. $E[g(\mathcal{W}^*)] = 0$.
2. The data $\{\mathcal{X}^{(i)}, s^{(i)}\}$ are i.i.d. and the parameter space $\Omega := \Omega_1 \times \Omega_2 \times \cdots \times \Omega_K$ is compact.
3. $\mathcal{W}^*$ has a unique decomposition $\mathcal{W}^* = \sum_{k=1}^{K} \mathcal{W}^*_k$ such that, for each $k$, $\mathcal{W}^*_k$ is an interior point of $\Omega_k$.
4. Let $w = \mathrm{vect}(\mathcal{W})$. For all $\mathcal{W} \in \Omega$, $\|g(\mathcal{W})\, g(\mathcal{W})^{\top}\|_F \le d_1(\mathcal{X}, s)$ and $\|\nabla_w g(\mathcal{W})\|_F \le d_2(\mathcal{X}, s)$ for some $d_1, d_2$ such that $E[d_1(\mathcal{X}, s)]$ and $E[d_2(\mathcal{X}, s)]$ are finite.

Under these regularity conditions, we have:

**Theorem 1.** Let the $\lambda_k$'s be fixed constants and let $\hat{\mathcal{W}}_m := \sum_{k=1}^{K} \hat{\mathcal{W}}_{k;m}$ be the estimator obtained by minimizing (14) subject to (13). Then as $m \to \infty$,

$$\hat{\mathcal{W}}_m \xrightarrow{p} \mathcal{W}^*, \tag{15}$$

$$\sqrt{m}\; \mathrm{vect}\big(\hat{\mathcal{W}}_m - \mathcal{W}^*\big) \xrightarrow{d} N\big(0, \, (J_0^{\top} C_0^{-1} J_0)^{-1}\big), \tag{16}$$

where $C_0 = C(\mathcal{W}^*)$ and $J_0 = J(\mathcal{W}^*)$.

The proof of the theorem is based on a uniform convergence result for stochastic functions (Newey and McFadden 1994). A complete proof is given in the supplemental material.

## Algorithm

In this section, we provide an algorithm to solve the optimization problem (12), followed by a convergence result. Since the sample size $m$ is fixed throughout this section, we drop the subscript $m$ in (12) and write $Q_m$ as $Q$. We first give the notations used in our algorithm:

- $\Phi = (\mathcal{W}_1, \ldots, \mathcal{W}_K)$; $\mathcal{W}(\Phi) = \sum_{k=1}^{K} \mathcal{W}_k$.
- $F(\Phi) = Q(\mathcal{W}(\Phi)) + R(\Phi)$.
- $\Phi^{(r)} = (\mathcal{W}^{(r)}_1, \ldots, \mathcal{W}^{(r)}_K)$; $\mathcal{W}^{(r)} = \mathcal{W}(\Phi^{(r)})$.

### Optimization Algorithm

We develop a linearized block coordinate descent algorithm with the following iterative procedure to find the optimal $\hat{\Phi}$ in (12). Denote the iterates at the $r$-th iteration by $\Phi^{(r)}$. For a point $\Phi = (\mathcal{W}_1, \ldots, \mathcal{W}_K)$, let

$$R(\Phi) := \sum_{k=1}^{K} \lambda_k \sum_{j=1}^{d_k} \big\|(\mathcal{W}_k)^{(j)}_{(k)}\big\|_F. \tag{17}$$

Assume $\nabla_{\mathcal{W}} Q(\mathcal{W})$ is Lipschitz continuous with Lipschitz modulus $L_Q$. The following $P_L(\Phi, \tilde{\Phi})$ is a linearized proximal map for the non-smooth regularizer $R$:

$$P_L(\Phi, \tilde{\Phi}) := Q(\tilde{\mathcal{W}}) + R(\Phi) + \sum_{k=1}^{K} \big\langle \mathcal{W}_k - \tilde{\mathcal{W}}_k,\; \nabla_{\mathcal{W}} Q(\tilde{\mathcal{W}}) \big\rangle + \frac{KL}{2} \sum_{k=1}^{K} \big\|\mathcal{W}_k - \tilde{\mathcal{W}}_k\big\|_F^2, \tag{18}$$

where $\tilde{\mathcal{W}} = \mathcal{W}(\tilde{\Phi})$ and $L \ge L_Q$ is a fixed constant. Note that

$$\frac{L}{2} \big\|\mathcal{W} - \tilde{\mathcal{W}}\big\|_F^2 \le \frac{KL}{2} \sum_{k=1}^{K} \big\|\mathcal{W}_k - \tilde{\mathcal{W}}_k\big\|_F^2. \tag{19}$$

The inequality (19) and the Lipschitz continuity of $\nabla_{\mathcal{W}} Q$ imply that, for all $L \ge L_Q$,

$$F(\Phi) \le P_L(\Phi, \tilde{\Phi}) \quad \text{for all } \Phi \text{ and } \tilde{\Phi}. \tag{20}$$

At the $r$-th iteration, we update $\Phi^{(r+1)}$ by solving the following optimization problem:

$$\min_{\Phi} \; \sum_{k=1}^{K} \Big[ \big\langle \nabla_{\mathcal{W}} Q^{(r)},\; \mathcal{W}_k - \mathcal{W}^{(r)}_k \big\rangle + \frac{KL}{2} \big\|\mathcal{W}_k - \mathcal{W}^{(r)}_k\big\|_F^2 \Big] + R(\Phi), \tag{21}$$

where $\nabla_{\mathcal{W}} Q^{(r)} = \nabla_{\mathcal{W}} Q(\mathcal{W}^{(r)})$. Since $R(\Phi)$ given in (17) is separable among the $\mathcal{W}_k$'s, we can decompose problem (21) into the following $K$ separate subproblems:

$$\min_{\mathcal{W}_k} \; \big\langle \nabla_{\mathcal{W}} Q^{(r)},\; \mathcal{W}_k - \mathcal{W}^{(r)}_k \big\rangle + \frac{KL}{2} \big\|\mathcal{W}_k - \mathcal{W}^{(r)}_k\big\|_F^2 + \lambda_k \sum_{j=1}^{d_k} \big\|(\mathcal{W}_k)^{(j)}_{(k)}\big\|_F, \quad k \in \{1, \ldots, K\}. \tag{22}$$

Since the subproblems share the same structure, we may fix $k$ and solve (22) to find the best $\mathcal{W}_k$, which is equivalent to

$$\min_{\mathcal{W}_k} \; \frac{1}{2} \Big\|\mathcal{W}_k - \Big(\mathcal{W}^{(r)}_k - \frac{1}{KL} \nabla_{\mathcal{W}} Q^{(r)}\Big)\Big\|_F^2 + \frac{\lambda_k}{KL} \sum_{j=1}^{d_k} \big\|(\mathcal{W}_k)^{(j)}_{(k)}\big\|_F. \tag{23}$$

Problem (23) has a closed-form solution $\mathcal{W}^{(r+1)}_k$, each of whose sub-tensors is

$$\big(\mathcal{W}^{(r+1)}_k\big)^{(j)}_{(k)} = \max\Big(0,\; 1 - \frac{\lambda_k}{KL\, \|(\mathcal{P}^{(r)})^{(j)}_{(k)}\|_F}\Big)\, \big(\mathcal{P}^{(r)}\big)^{(j)}_{(k)}, \tag{24}$$

where $\mathcal{P}^{(r)} := \mathcal{W}^{(r)}_k - \frac{1}{KL} \nabla_{\mathcal{W}} Q^{(r)}$ (a small code sketch of this update appears at the end of this section). In fact, from the optimality conditions, $\mathcal{W}^{(r+1)}_k$ satisfies

$$\nabla_{\mathcal{W}} Q^{(r)} + KL\big(\mathcal{W}^{(r+1)}_k - \mathcal{W}^{(r)}_k\big) + \lambda_k \mathcal{A}_k\big(\mathcal{W}^{(r+1)}_k\big) = 0 \tag{25}$$

for all $r \ge 1$ and $1 \le k \le K$. Here $\mathcal{A}_k(\mathcal{W})$ is a subgradient of $\sum_{j=1}^{d_k} \|(\mathcal{W})^{(j)}_{(k)}\|_F$. The calculation of the Lipschitz modulus $L_Q$ can be computationally expensive. We therefore follow an argument similar to (Xu, Sun, and Bi 2015) to find a proper approximation $\tilde{L} \ge L_Q$ and use $\tilde{L}$ as $L$ in all of our computations. Algorithm 1 summarizes the steps for finding the optimal $\hat{\mathcal{W}}_k$'s.

**Algorithm 1** Search for the optimal $\hat{\Phi}$
Input: $\mathcal{X}$, $y$, $L$, $\lambda_k$. Output: $\hat{\Phi} = (\hat{\mathcal{W}}_1, \ldots, \hat{\mathcal{W}}_K)$.
1. $r = 0$: compute $L$ and initialize $\mathcal{W}^{(0)}_k$ for $1 \le k \le K$.
2. Obtain $\Phi^{(r+1)} = (\mathcal{W}^{(r+1)}_1, \ldots, \mathcal{W}^{(r+1)}_K)$ by solving (23) for each fixed $1 \le k \le K$.
3. $r = r + 1$. Repeat steps 2 and 3 until convergence.

### Convergence Analysis

In this section, we prove that the sequence $\{\Phi^{(r)}\}_{r \ge 0}$ generated by Algorithm 1 converges to a global optimal solution $\hat{\Phi}$ at a rate of $O(1/r)$ if the initial point $\Phi^{(0)}$ is located in a convex neighborhood of $\hat{\Phi}$. Loader and Pilla (2007) indicate that the function $Q(\mathcal{W})$ is not globally convex in general, so standard convergence arguments such as (Beck and Teboulle 2009) cannot be applied directly. Furthermore, with the latent approach (13), we have to carefully split or combine inequalities at certain points. All of this makes the proof of convergence nontrivial.

Let $\hat{\Phi} = (\hat{\mathcal{W}}_1, \ldots, \hat{\mathcal{W}}_K)$ be a global minimizer of $F(\Phi)$ and let $\Omega = \Omega_1 \times \cdots \times \Omega_K$ be a neighborhood of $\hat{\Phi}$ such that $\Pi(\Omega) := \{\mathcal{W}(\Phi) : \Phi \in \Omega\}$ is convex and $Q(\mathcal{W})$ is a convex function on $\Pi(\Omega)$. Assume $\Phi^{(0)}$ satisfies

$$D(\Phi^{(0)}) := \sum_{k=1}^{K} \big\|\mathcal{W}^{(0)}_k - \hat{\mathcal{W}}_k\big\|_F^2 < \frac{1}{K} \Big(\mathrm{dist}\big(\partial \Pi(\Omega), \hat{\mathcal{W}}\big)\Big)^2. \tag{26}$$

Then we have the following convergence result.

**Theorem 2.** Let $\Phi^{(n)}$ be the tuple of tensors generated by Algorithm 1 at the $n$-th iteration. Then for any $n \ge 1$,

$$F(\Phi^{(n)}) - F(\hat{\Phi}) \le \frac{KL \sum_{k=1}^{K} \|\mathcal{W}^{(0)}_k - \hat{\mathcal{W}}_k\|_F^2}{2n}. \tag{27}$$

To prove the theorem, we first show that if $\Phi^{(r)}$ satisfies (26) at the $r$-th iteration, then $\Phi^{(r+1)}$ also satisfies (26). This ensures that the entire sequence $\{\mathcal{W}(\Phi^{(n)})\}_{n \ge 0}$ generated by Algorithm 1 lies in $\Pi(\Omega)$, where $Q$ is convex. Thus the convexity inequality is always valid and Theorem 2 is established. Details are provided in the supplemental material.

### Group Support: Values of the $\lambda_k$'s and $L$

In this section we focus on the linear model in which

$$\eta^{(i)}_t = \Big\langle \mathcal{X}^{(i;t)}, \sum_{k=1}^{K} \mathcal{W}_k \Big\rangle, \qquad y^{(i)}_t = \Big\langle \mathcal{X}^{(i;t)}, \sum_{k=1}^{K} \mathcal{W}^*_k \Big\rangle + s^{(i)}_t$$

for some true tensor coefficient $\mathcal{W}^*$, where $\tau < t \le T$. Let $\mathcal{D} := \nabla_{\mathcal{W}} Q(\mathcal{W}^*)$. Motivated by the algorithm, we consider the following optimization problem for a fixed $k$:

$$\min_{\mathcal{W}_k} \; \frac{1}{2} \big\|\mathcal{W}_k - \mathcal{W}^*_k + \mathcal{D}\big\|_F^2 + \lambda_k \sum_{j=1}^{d_k} \big\|(\mathcal{W}_k)^{(j)}_{(k)}\big\|_F. \tag{28}$$

Our goal is to estimate the group support of $\mathcal{W}^*_k$, i.e. to obtain the subset $S^*_k \subseteq \{1, 2, \ldots, d_k\}$ such that $(\mathcal{W}^*_k)^{(j)}_{(k)} \ne 0$ if and only if $j \in S^*_k$. The KKT conditions for solutions of (28) yield:

**Theorem 3.** Assume $\lambda_k \ge 2 \max_{1 \le j \le d_k} \|\mathcal{D}^{(j)}_{(k)}\|_F$. Then (28) has a solution $\hat{\mathcal{W}}_k$ such that

$$\hat{S}_k := \{j : (\hat{\mathcal{W}}_k)^{(j)}_{(k)} \ne 0\} \subseteq S^*_k. \tag{29}$$

Furthermore, $\hat{S}_k = S^*_k$ if $\lambda_k < \frac{KL}{2} \min_{j \in S^*_k} \|(\mathcal{W}^*_k)^{(j)}_{(k)}\|_F$.

Theorem 3 is proved in the supplemental material.
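To close this section, here is a small sketch (ours, illustrative; function names are not from the paper) of the block update that Algorithm 1 performs for a fixed mode $k$: form $\mathcal{P}^{(r)} = \mathcal{W}^{(r)}_k - \frac{1}{KL}\nabla_{\mathcal{W}}Q^{(r)}$ and apply the group soft-thresholding of Eq. (24) to each mode-$k$ slice.

```python
# Illustrative sketch of the closed-form block update in Eq. (24):
# group soft-thresholding of the mode-k slices of P = W_k^{(r)} - grad_Q / (K*L).
import numpy as np

def prox_mode_k(P, k, lam, KL):
    """Shrink every mode-k slice of P by the group-lasso threshold lam / KL."""
    W_new = np.zeros_like(P)
    for j in range(P.shape[k]):
        slice_j = np.take(P, indices=j, axis=k)
        norm_j = np.linalg.norm(slice_j)
        scale = max(0.0, 1.0 - lam / (KL * norm_j)) if norm_j > 0 else 0.0
        idx = [slice(None)] * P.ndim
        idx[k] = j                      # write the shrunken slice back along mode k
        W_new[tuple(idx)] = scale * slice_j
    return W_new

def block_update(W_k, grad_Q, k, lam, K, L):
    """One step of Eqs. (23)-(24) for mode k."""
    P = W_k - grad_Q / (K * L)
    return prox_mode_k(P, k, lam, K * L)
```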
## Empirical Evaluation

In this section we present results on both synthetic and real-life fMRI/EEG examples. We test the efficiency and effectiveness of the proposed Tensor QIF method against existing methods. The datasets containing continuous responses are examined by the following methods: Tensor QIF, Linear Regression (LR), the Least Absolute Shrinkage and Selection Operator (LASSO), QIF (Qu, Lindsay, and Li 2000), PQIF (Bai, Fung, and Zhu 2009), GEE (Liang and Zeger 1986), PGEE, and Graphical Granger Modeling (Lozano et al. 2009). For GEE and PGEE, the presumed correlation is set to the first-order autoregressive structure AR(1), i.e. $\mathrm{corr}(y^{(i)}_t, y^{(i)}_{t'}) = \alpha^{|t - t'|}$ for some $0 < \alpha < 1$. The coefficient of determination, $R^2$, is employed to evaluate the performance of these predictive models: an $R^2$ of 1 indicates a perfect fit, while an $R^2$ of 0 indicates that the model does not fit the data.

### Synthetic Data

We constructed synthetic datasets containing 150 subjects with 20 time points per subject. The data $\mathcal{X}^{(i)}_t$ at each time $t$ is a matrix whose size varies over $\{5 \times 5, 10 \times 10, 15 \times 15\}$. We generate the $\mathcal{X}^{(i)}_t$'s from the normal distribution $N(0, 2^2)$ and set $\tau = 4$. In this setting, the coefficient $\mathcal{W}$ is a 3-way tensor, i.e. $K = 3$. Let $\mathcal{W} = \mathcal{W}_1 + \mathcal{W}_2 + \mathcal{W}_3$ be the decomposition. We generate $\mathcal{W}_1$, $\mathcal{W}_2$, and $\mathcal{W}_3$ such that each $\mathcal{W}_k$ simulates the latent pattern along mode $k$: $\mathcal{W}_1$ has patterns (i.e. nonzero values) in the 1st and 3rd features along mode 1; $\mathcal{W}_2$ has patterns in the 2nd and 3rd features along mode 2; and $\mathcal{W}_3$ selects lagged patterns at the 1st, 3rd, and 5th lagged time points. All nonzero entries of the $\mathcal{W}_k$'s are drawn from the uniform distribution $U(0, 3^2)$. We further add the residual $s^{(i)}$ and a $\sin(t)$ term to the mean model in order to generate the outcome variable $y^{(i)}$ for each subject $i$. The residuals $s^{(i)}$ are generated from a multivariate normal distribution with an AR(1) correlation structure at $\alpha = 0.6$. The outcome $y^{(i)}_t$ is then computed as

$$y^{(i)}_t = \big\langle \mathcal{X}^{(i;t)},\; \mathcal{W}_1 + \mathcal{W}_2 + \mathcal{W}_3 \big\rangle + s^{(i)}_t + \sin(t).$$

In our experiments, the $\lambda$'s are tuned to $\lambda_1 = \lambda_2 = \lambda_3 = 0.3$ by cross validation within the training set. We randomly select 80% of the subjects for training and use the rest for testing.
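For concreteness, the following sketch (ours; it reflects our reading of the setup above, and the time indexing and generation details are assumptions rather than the authors' code) produces one such dataset for the $5 \times 5$ case.

```python
# Illustrative sketch of the synthetic-data setup described above (our reading):
# tensor linear model plus AR(1)-correlated residuals and a sin(t) trend.
import numpy as np

rng = np.random.default_rng(0)
d1, d2, tau, T, n_subjects, alpha = 5, 5, 4, 20, 150, 0.6

# Sparse ground-truth components with the stated nonzero slices; U(0, 3^2) entries.
W1 = np.zeros((d1, d2, tau + 1)); W1[[0, 2], :, :] = rng.uniform(0, 9, (2, d2, tau + 1))
W2 = np.zeros((d1, d2, tau + 1)); W2[:, [1, 2], :] = rng.uniform(0, 9, (d1, 2, tau + 1))
W3 = np.zeros((d1, d2, tau + 1)); W3[:, :, [0, 2, 4]] = rng.uniform(0, 9, (d1, d2, 3))
W = W1 + W2 + W3

# AR(1) covariance for the n = T - tau within-subject residuals.
idx = np.arange(T - tau)
Sigma = alpha ** np.abs(idx[:, None] - idx[None, :])

X, Y = [], []
for i in range(n_subjects):
    records = rng.normal(0, 2, size=(T, d1, d2))             # X_t^{(i)}, t = 1..T
    s = rng.multivariate_normal(np.zeros(T - tau), Sigma)     # correlated residuals
    for j, t in enumerate(range(tau, T)):                     # usable times tau+1..T
        X_it = np.stack([records[t - l] for l in range(tau + 1)], axis=-1)
        X.append(X_it)
        Y.append(float(np.sum(X_it * W)) + s[j] + np.sin(t + 1))
```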
Table 1: Comparison of $R^2$ values achieved by our method and the other methods on the synthetic datasets with different tensor sizes and on the real-life fMRI dataset.

| $d_1 \times d_2 \times (\tau + 1)$ | LR | LASSO | QIF | PQIF | GEE | PGEE | Granger | Tensor QIF |
|---|---|---|---|---|---|---|---|---|
| $5 \times 5 \times 5$ | 0.097 | 0.167 | 0.234 | 0.245 | 0.097 | 0.149 | 0.684 | 0.986 |
| $10 \times 10 \times 5$ | 0.053 | 0.149 | 0.188 | 0.316 | 0.056 | 0.137 | 0.581 | 0.983 |
| $15 \times 15 \times 5$ | 0.025 | 0.137 | 0.158 | 0.173 | 0.079 | 0.117 | 0.435 | 0.976 |
| fMRI | 0.007 | 0.027 | 0.032 | 0.033 | - | - | 0.075 | 0.259 |

Figure 2: (Left) The model coefficients obtained by Tensor QIF on synthetic data. (Right) Feature groups (slices) selected by Tensor QIF for predicting MMSE scores.

Table 1 provides the $R^2$ comparison between Tensor QIF and the other seven methods on the three synthetic datasets of different sizes. The proposed Tensor QIF outperforms the traditional regression methods in all comparison scenarios in terms of predictive accuracy. LR and LASSO perform poorly, which might be because the number of records is relatively small compared to the total number of features in the data tensor. Among the marginal models, the QIF-based methods perform better than the baseline methods because they can efficiently estimate the coefficients even when the correlation matrix is misspecified, while the GEE-based methods perform worse because they misspecify the correlation matrix. The Graphical Granger Modeling shows relatively high performance because it models the effects from lagged time points; however, its performance suffers from the high correlations within subjects.

In Figure 2 (Left) we plot the mode-1 unfolded matrices resolved by Tensor QIF on the dataset of size $10 \times 10 \times 5$. Red (blue) indicates that the corresponding features are positive (negative) predictors of the response variable; features in white are not selected. All sub-tensors (matrices) of $\mathcal{W}_1$, $\mathcal{W}_2$, and $\mathcal{W}_3$ capture the designed structure of the synthetic data, which explains why the proposed method achieves an $R^2$ of around 0.98. Figure 3 illustrates the result from the Granger model; its tensor coefficients are reshaped to align with the matrices in Figure 2. It clearly shows the wrong selections at the first lagged time point, which are due to the correlations within subjects.

Figure 3: The model coefficients resolved from the Granger model (Lozano et al. 2009) on synthetic data.

### fMRI Data

Functional magnetic resonance imaging (fMRI) is a functional neuroimaging procedure using MRI technology that measures brain activity by detecting associated changes in blood flow. The fMRI data used in this experiment were collected by the Alzheimer's Disease Neuroimaging Initiative (ADNI).¹ We cleaned the fMRI data by filtering out incomplete or low-quality observations. After data cleaning, the dataset includes 147 subjects diagnosed with mild cognitive impairment (MCI) from 2009 to 2016. We use each participant's first fMRI scan as the baseline and the other fMRI scans from the 6th, 12th, 18th, and 24th months of the study. There are 67 brain areas and 4 properties of the brain cortex² in our model: CV (Cortical Volume), SA (Surface Area), TA (Thickness Average), and TS (Thickness Standard Deviation). These records naturally form a 3-way tensor with one dimension for the brain areas, one for the properties, and one along the temporal line. Our Tensor QIF keeps this tensor form rather than squashing the data into a vector, which could destroy the proximity structure.

¹ http://adni.loni.usc.edu/
² http://adni.bitbucket.io/reference/ucsffresfr.html

The outcome used in this experiment is the mini-mental state examination (MMSE) score, quantified by a 30-point questionnaire that is used extensively in clinical and research settings to measure cognitive impairment. At each time point, the MMSE score is evaluated from the participants' answers to the questionnaire. We use 20% of the subjects for testing. The lag variable is set to $\tau = 2$. The parameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ were tuned by two-fold cross validation: the training records were split into halves, one used to build a model with a chosen parameter value from a range of 1 to 20 with a step size of 0.1, and the other used to test the resulting model. We chose the parameter values that gave the best two-fold cross validation performance.

As shown in Table 1, our method achieves the best predictions. Moreover, our approach is able to select patterns along three dimensions: among the features, among the brain areas, and among the different lagged months. The $\lambda$'s were chosen as $\lambda_1 = 6$, $\lambda_2 = 20$, and $\lambda_3 = 24$. As Figure 2 (Right) shows, the structural damage starting 6 months earlier plays a major role in the development of AD, and larger means and standard deviations of the cortical thickness imply a higher risk of AD. The proposed model selects 14 out of 68 brain areas that affect the MMSE score.
According to the selected brain areas, the data at the Cuneus and Transverse Temporal areas on both sides, the right Inferior Parietal area, and others might be important for predicting cognitive impairment.

### EEG Data

Human memory function can be assayed in real time by electroencephalographic (EEG) recording. However, the clinical utility of this method depends on the reliable determination of functionally and diagnostically relevant features. Approaches capable of modeling non-stationary signals, such as the proposed method, have been explored as a way to synthesize large arrays of EEG data, because the EEG record can be more precisely characterized by a 3-way tensor representing processing stages, spatial locations, and frequency bands as individual dimensions. Schizophrenia patients (SZ, n = 40) and healthy control participants (HC, n = 20) completed an EEG Sternberg task. The EEG was analyzed to extract 5 frequency components (delta, theta, alpha, beta, gamma) at 4 processing stages (baseline, encoding, retention, retrieval) and 12 scalp sites representing the central midline and bi-lateral frontal and temporal regions. The proposed and comparison methods were applied to the resulting 240 features (forming a $5 \times 4 \times 12$ tensor) to classify correct (-1) vs. incorrect (+1) responses on a trial-by-trial basis. In this approach, the proposed method guided the respective selection of the spectral frequency, temporal (processing stages), and spatial (electrode sites) dimensions most related to trial performance. The correlations among processing stages were also estimated by the proposed method. Separate models were constructed for the SZ and HC samples for comparison of common and disparate feature patterns across the dimensions.

Figure 4: Rows, columns and slices selected by Tensor QIF for SZ (the top panel) and HC (the bottom panel). The figure is also provided in the supplemental material.

For each of the SZ and HC datasets, 1/5 of the records were randomly chosen from every subject to form the test data, and the rest of the records were used for training. The hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ in our approach and the single parameter in GEE/PGEE were tuned by two-fold cross validation within the training data. We chose the parameter values that gave the best two-fold cross validation performance, which were $\lambda_1 = 7.5$, $\lambda_2 = 5.5$, $\lambda_3 = 7.4$ for SZ and $\lambda_1 = 3.3$, $\lambda_2 = 2.1$, $\lambda_3 = 3.1$ for HC. As shown in Figure 4, in both groups task performance is most dependent on encoding- and retrieval-stage activity, with higher encoding activity uniformly, and lower retrieval activity generally, associated with better task performance across electrode sites. This pattern appears most prominently in central alpha activity (Figure 4, blue border), consistent with the findings in (Xu, Sun, and Bi 2015). The groups differed in two main ways: (1) centroparietal theta, beta, and gamma during encoding and retention predicted higher accuracy in HC (Figure 4, red border), and (2) delta activity across stages and electrodes (Figure 4, green border) predicted lower accuracy in SZ. The experimental results give much clearer details of the working electrode sites and spectral frequencies compared to the results in (Johannesen et al. 2016). The proposed method outperforms the GEE and SVM solutions according to AUC values (HC: 55.5%; SZ: 58.8%, versus the best AUC of 53% from the other methods). This is because the proposed method enables interpretation and summary across all dimensions, which is not possible for classifiers based on single vectors.
## Conclusion

We have proposed a new learning formulation called Tensor QIF to analyze longitudinal data. It takes data matrices or tensors as inputs and makes predictions. The proposed method can simultaneously determine the temporal contingency and the influential features from the observations along different modes without breaking the problem into multiple models. The tensor coefficient is computed as the summation of $K$ component tensors so that each reflects the selection along a particular mode. Asymptotic analysis shows that the proposed formulation finds the true coefficients as the sample size approaches infinity. Moreover, the related optimization problem can be efficiently solved by a linearized block coordinate descent algorithm with a sublinear convergence rate. Empirical studies on both synthetic and real-life problems demonstrate the superior performance of the proposed method.

## Acknowledgements

This work was partially supported by NSF grants DBI-1356655, CCF-1514357, and IIS-1718738, as well as NIH grants R01DA037349 and K02DA043063 to Jinbo Bi.

## References

Arnold, A.; Liu, Y.; and Abe, N. 2007. Temporal causal modeling with graphical Granger methods. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07, 66–75. New York, NY, USA: ACM.

Bai, Y.; Fung, W. K.; and Zhu, Z. Y. 2009. Penalized quadratic inference functions for single-index models with longitudinal data. Journal of Multivariate Analysis 100(1):152–161.

Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183–202.

Bi, J.; Sun, J.; Wu, Y.; Tennen, H.; and Armeli, S. 2013. A machine learning approach to college drinking prediction and risk factor identification. ACM Trans. Intell. Syst. Technol. 4(4):72:1–72:24.

Crowder, M. 1995. On the use of a working correlation matrix in using generalised linear models for repeated measures. Biometrika 82(2):407–410.

Diggle, P.; Heagerty, P.; Liang, K.-Y.; and Zeger, S. 2002. Analysis of Longitudinal Data. Oxford University Press.

Fowler, J. H., and Christakis, N. A. 2008. Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the Framingham Heart Study. BMJ 337.

Fu, W. J. 2003. Penalized estimating equations. Biometrics 59(1):126–132.

Granger, C. 1980. Testing for causality: A personal viewpoint. Journal of Economic Dynamics and Control 2(1):329–352.

Hansen, L. P. 1982. Large sample properties of generalized method of moments estimators. Econometrica 50(4):1029–1054.

Hoff, P. D. 2015. Multilinear tensor regression for longitudinal relational data. Ann. Appl. Stat. 9(3):1169–1193.

Johannesen, J. K.; Bi, J.; Jiang, R.; Kenney, J. G.; and Chen, C.-M. A. 2016. Machine learning identification of EEG features predicting working memory performance in schizophrenia and healthy adults. Neuropsychiatric Electrophysiology 2(1):3.

Liang, K.-Y., and Zeger, S. L. 1986. Longitudinal data analysis using generalised estimating equations. Biometrika 73(1):13–22.

Loader, C., and Pilla, R. S. 2007. Iteratively reweighted generalized least squares for estimation and testing with correlated data: An inference function framework. Journal of Computational and Graphical Statistics 16(4):925–945.

Lozano, A. C.; Abe, N.; Liu, Y.; and Rosset, S. 2009. Grouped graphical Granger modeling methods for temporal causal modeling. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, 577–586.
New York, NY, USA: ACM.

Newey, W. K., and McFadden, D. 1994. Chapter 36: Large sample estimation and hypothesis testing. Handbook of Econometrics 4:2111–2245.

Qu, A., and Li, R. 2006. Quadratic inference functions for varying-coefficient models with longitudinal data. Biometrics 62(2):379–391.

Qu, A., and Lindsay, B. G. 2003. Building adaptive estimating equations when inverse of covariance estimation is difficult. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(1):127–142.

Qu, A.; Lindsay, B. G.; and Li, B. 2000. Improving generalised estimating equations using quadratic inference functions. Biometrika 87(4):823–836.

Sela, R. J., and Simonoff, J. S. 2012. RE-EM trees: a data mining approach for longitudinal and clustered data. Machine Learning 86(2):169–207.

Shen, L.; Thompson, P. M.; Potkin, S. G.; Bertram, L.; Farrer, L. A.; Foroud, T. M.; Green, R. C.; Hu, X.; Huentelman, M. J.; Kim, S.; Kauwe, J. S. K.; Li, Q.; Liu, E.; Macciardi, F.; Moore, J. H.; Munsie, L.; Nho, K.; Ramanan, V. K.; Risacher, S. L.; Stone, D. J.; Swaminathan, S.; Toga, A. W.; Weiner, M. W.; and Saykin, A. J. 2014. Genetic analysis of quantitative phenotypes in AD and MCI: imaging, cognition and biomarkers. Brain Imaging and Behavior 8(2):183–207.

Stappenbeck, C. A., and Fromme, K. 2010. A longitudinal investigation of heavy drinking and physical dating violence in men and women. Addictive Behaviors 35(5):479–485.

Tomioka, R., and Suzuki, T. 2013. Convex tensor decomposition via structured Schatten norm regularization. In Burges, C.; Bottou, L.; Welling, M.; Ghahramani, Z.; and Weinberger, K., eds., Advances in Neural Information Processing Systems 26, 1331–1339.

Wang, L.; Zhou, J.; and Qu, A. 2012. Penalized generalized estimating equations for high-dimensional longitudinal data analysis. Biometrics 68(2):353–360.

Wimalawarne, K.; Tomioka, R.; and Sugiyama, M. 2016. Theoretical and experimental analyses of tensor-based regression and classification. Neural Comput. 28(4):686–715.

Xu, Y., and Yin, W. 2017. A globally convergent algorithm for nonconvex optimization based on block coordinate update. Journal of Scientific Computing 72(2):700–734.

Xu, T.; Sun, J.; and Bi, J. 2015. Longitudinal LASSO: Jointly learning features and temporal contingency for outcome prediction. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, 1345–1354. New York, NY, USA: ACM.

Zhou, H.; Li, L.; and Zhu, H. 2013. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association 108(502):540–552.