# bayesian_tensor_regression__3eb9789c.pdf

Journal of Machine Learning Research 18 (2017) 1-31 Submitted 7/16; Published 8/17

Bayesian Tensor Regression

Rajarshi Guhaniyogi rguhaniy@ucsc.edu Department of Applied Mathematics & Statistics University of California Santa Cruz, CA 95064, USA

Shaan Qamar siqamar@gmail.com Google Inc. Mountain View, CA 94043, USA

David B. Dunson dunson@duke.edu Department of Statistical Science Duke University Durham, NC 27708-0251, USA

Editor: Robert Mc Culloch

We propose a Bayesian approach to regression with a scalar response on vector and tensor covariates. Vectorization of the tensor prior to analysis fails to exploit the structure, often leading to poor estimation and predictive performance. We introduce a novel class of multiway shrinkage priors for tensor coeﬃcients in the regression setting and present posterior consistency results under mild conditions. A computationally eﬃcient Markov chain Monte Carlo algorithm is developed for posterior computation. Simulation studies illustrate substantial gains over existing tensor regression methods in terms of estimation and parameter inference. Our approach is further illustrated in a neuroimaging application.

Keywords: Multiway Shrinkage Prior, Magnetic Resonance Imaging (MRI), Parafac Decomposition, Posterior Consistency, Tensor Regression

1. Introduction

In many application areas, it is common to collect predictors that are structured as a multiway array or tensor. For example, the elements of this tensor may correspond to voxels in a brain image (Lindquist, 2008; Lazar, 2008; Hinrichs et al., 2009; Ryali et al., 2010). Existing approaches for quantifying associations between an outcome and such tensor predictors mostly fall within two groups. The ﬁrst approach assesses the association between each cell (for brain images referred to as voxel) and the response independently, providing a p-value map (Lazar, 2008). The p-values can be adjusted for multiple comparisons to identify signiﬁcant sub-regions of the tensor. Although this approach is widely used and appealing in its simplicity, clearly such independent screening approaches have key disadvantages relative to methods that take into account the joint impact of the overall

. These authors contributed equally

c 2017 Rajarshi Guhaniyogi and Shaan Qamar and David B. Dunson.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-362.html.

Guhaniyogi et al.

tensor simultaneously. Unfortunately, the literature on simultaneous analysis approaches is sparse. One naive approach is to simply vectorize the tensor and then use existing methods for high-dimensional regression. Such vectorization fails to preserve spatial structure, making it more diﬃcult to learn low-dimensional relationships with the response. Eﬃcient learning is of critical importance, as the sample size is typically massively smaller than the total number of cells. Alternative approaches within the regression framework include functional regression and two stage approaches. The former views the tensor as a discretization of a continuous functional predictor. Most of the literature on functional predictors focuses on 1D functions; Reiss and Ogden (2010) consider the 2D case, but substantial challenges arise in extensions to 3D due to dimensionality and collinearity among cells. Recently Wang et al. (2014) considered 3D regularized functional regression with Haar wavelet basis. The article is essentially frequentist in nature with simulation studies showing only the mean squared error and the percentage of correctly identiﬁed zero and nonzero elements. Additionally, the article reveals that functional regression is largely aﬀected by the choice of proper basis functions. The second set of approaches, i.e. Two stage approaches ﬁrst conduct a dimension reduction step, commonly using PCA, and then ﬁt a model using lower dimensional predictors (Caﬀo et al., 2010). A clear disadvantage of such approaches is that the main principal components driving variability in the random tensor may have relatively limited impact on the response variable. Potentially, supervised PCA could be used, but it is not clear how to implement such an approach in 3D or higher dimensions. Zhou et al. (2013) propose extending generalized linear regression to include a tensor structured parameter corresponding to the measured tensor predictor. To circumvent diﬃculties with extensions to higher order tensor predictors, they impose additional structure on the tensor parameter, supposing it decomposes as a rank-R parafac sum (see Section 2.1). This massively reduces the eﬀective number of parameters to be estimated. They develop a penalized likelihood approach where adaptive lasso penalties are be imposed on individual margins of the parafac decomposition, focusing on good point estimation for the tensor parameter. However, their method relies heavily on cross-validation for selecting tuning parameters which are sensitive to the tensor dimension, the signal-to-noise ratio (degree of sparsity) and the parafac rank. Given that there is no automatic selection procedure for the tuning parameters provided in Zhou et al. (2013), they have to be fed manually by the end user which is problematic for an unknown tensor regression problem. Of practical interest is a self calibrating procedure which adapts the complexity of the model to the data. We propose a principled method to eﬀectively shrink unimportant cell coeﬃcients to zero while maintaining accuracy in estimating important cell coeﬃcients. Our framework gives rise to the automatic selection of tuning parameters, with carefully constructed shrinkage priors that naturally induce sparsity within and across components in the tensor factorization of the tensor coeﬃcient for optimal region selection. In addition, the need for valid measures of uncertainty on parameter (predictive) estimates is crucial, especially in settings with low or moderate sample sizes, which naturally motivates our Bayesian approach. Recently Suzuki (2015) proposed a Bayesian tensor regression approach with naive Gaussian prior on the components of the tensor factorization of the tensor coeﬃcients. In contrast, our proposed prior on the tensor coeﬃcient is more sophisticated in the sense that it imparts shrinkage in three ways: at a global level, at a local level of

Bayesian Tensor Regression

individual parameters, and also provides shrinkage towards low rank decomposition of the tensor coeﬃcient. Similarly, Bayesian tensor regression framework proposed in Goldsmith et al. (2014) uses binary indicators to determine whether a cell in the tensor predictor is predictive of the response. For a tensor predictor with 30 30 30 cells, such an approach requires to update 27000 binary indicators in each MCMC iteration and is deemed unsatisfactory due to mixing issues and poor inference.

Our approach diﬀers from image reconstruction literature as we do not model the distribution of the tensor X (Qiu, 2007). There is a considerable recent literature on frequentist tensor modeling in which one typically encounters time series (generally to study social networks or images evolving over time) with response at every time point is an array/tensor (Gerard and Hoﬀ, 2015; Hoﬀet al., 2015). There is also a Bayesian literature that facilitates joint modeling of a large number of unordered categorical variables (Zhou et al., 2015). Our framework is fundamentally diﬀerent from these approaches in the sense that these are all unsupervised tensor modeling approach while we propose a framework for supervised tensor regression. To the best of our knowledge, we are the ﬁrst to propose a novel multiway shrinkage prior in Bayesian tensor regression framework (with scalar response on a tensor predictor) that accommodates shrinkage of the tensor coeﬃcient for the appropriate identiﬁcation of important cells in the tensor predictor. Besides, we oﬀer strong posterior consistency results on Bayesian tensor regression framework with multiway shrinkage prior.

Remainder of the manuscript evolves as follows. In Section 2, we propose the basic framework of the tensor regression model with a scalar response, vector predictors and a tensor predictor. Section 3 characterizes desirable criteria for a multiway shrinkage prior and proposes a novel multiway shrinkage prior on the tensor coeﬃcient. Sections 4 and 5 provide theoretical results on the convergence of posterior distribution under the mutiway shrinkage prior and details on posterior computation respectively. Various simulation studies with 2D and 3D tensor predictors are presented in Sections 6 and 7 respectively to study eﬀectiveness of the Bayesian tensor regression under various degrees of sparsity and signal strength. Section 8 is devoted to a real brain connectome data analysis using the proposed Bayesian tensor regression model along with its competitors. The manuscripts ends with a discussion.

2. Tensor Regression

This section provides details on the tensor regression model.

2.1 Basic Notation

Let β1 = (β11, . . . , β1p1) and β2 = (β21, . . . , β2p2) be vectors of length p1 and p2, respectively. The vector outer product β1 β2 is a p1 p2 matrix with (i, j)-th entry β1i β2j. A D-way outer product between vectors βj = (βj1, . . . , βjpj), 1 j D, is a p1 p D multi-dimensional array denoted B = β1 β2 βD with entries (B)i1,...,i D = QD j=1 βjij. Deﬁne a vec(B) operator as stacking elements of this D-way tensor into a column vector of length QD j=1 pj. From the deﬁnition of outer products, it is easy to see that vec(β1 β2 βD) = βD β1. As a higher order generalization of matrix singular value decomposition, Tucker decomposition of a D-way tensor B D j=1ℜpj is

Guhaniyogi et al.

often considered. The Tucker decomposition (Kolda and Bader, 2009) can be expressed as

r D=1 λr1,...,r Dβ(r1) 1 β(r2) 2 β(r D) D (1)

where β(rj) j is a pj dimensional vector, 1 j D, and Λ = (λr1,...,r D)R1,...,RD r1,...,r D=1 is referred

to as the core tensor. If one considers {β(rj) j ; 1 rj Rj, 1 j D} as factor loadings and λr1,...,r D to be the corresponding coeﬃcients, then the Tucker decomposition may be thought of as a multiway analogue to factor modeling. A rank-R parafac decomposition emerges as a special case of Tucker decomposition 1 when R1 = R2 = = RD = R and λr1,...,r D = I(r1 = r2 = = r D). In particular, B D j=1ℜpj assumes a rank-R parafac decomposition if

r=1 β(r) 1 β(r) D (2)

where β(r) j , 1 j D and 1 r R are the pj dimensional margins . The parafac decomposition is more widely used due to its relative simplicity.

2.2 Model Framework

Let y Y denotes a response variable, with z X ℜp and X D j=1ℜpj scalar and tensor predictors, respectively. We consider a tensor regression model having a general form

y f α + z γ + X, B , σ , X, B = vec(X) vec(B), (3)

where f(µ, σ) is a family of distributions having location µ and scale σ, γ is a p 1 coeﬃcient for scalar preditors and B D j=1ℜpj is the tensor parameter corresponding to measured tensor predictor X. We focus more speciﬁcally on the Gaussian linear model case with

y = α + z γ + X, B + ϵ, ϵ N(0, σ2). (4)

The coeﬃcient tensor B has QD j=1 pj elements, necessitating substantial dimensionality reduction. A rank-1 parafac decomposition assumes B = β1 βD and vec(B) = βD β1. This reduces to modeling y = α + z γ + β 1Xβ2 when D = 2, corresponding to the bilinear model considered in Hung and Wang (2013). Since only the single parameter vector βj captures signal along the jth dimension, a rank-1 assumption severely limits ﬂexibility, ruling out interactions among dimensions. Following Zhou et al. (2013), we use a more ﬂexible rank-R parafac decomposition for B = PR r=1 β(r) 1 β(r) D introduced in (2) with β(r) j ℜpj, 1 j D, and 1 r R. Expression (4) then becomes

y = α + z γ + D X,

r=1 β(r) 1 β(r) D E + ϵ

= α + z γ + X

(i1,...,i D) (X)i1,...,i D(B)i1,...,i D + ϵ (5)

Bayesian Tensor Regression

where voxel (X)i1,...,i D of the tensor predictor has corresponding parameter

(B)i1,...,i D =

j=1 β(r) j,ij, (i1, . . . , i D) VB = D j=1{1, . . . , pj}. (6)

The model is therefore nonlinear in the parameters deﬁning B. A hierarchical speciﬁcation is completed by placing priors over unknown model parameters. While placing priors over α and γ is straightforward, Section 3.2 focuses on speciﬁcation of the prior over tensor parameters which is nontrivial and one of the main contributions of this work. Under the assumed rank-R parafac decomposition for B, model (5) requires estimating p+2+R PD j=1 pj as opposed to p+2+QD j=1 pj parameters for the unstructured vectorized (saturated) model. As we are interested in identifying geometric sub-regions of the tensor across which coeﬃcients are not close to zero, with the remaining elements being very close to zero, one wonders whether such dramatic dimension reduction retains suﬃcient ﬂexibility. Finally, we would like to accurately estimate coeﬃcient values in these subregions. Consistent with our theoretical analysis in Section 4, extensive simulation studies in Section 6 conﬁrm our ability to accomplish these goals.

2.3 Model Identiﬁability

From model (5) it is clear that only voxel-level coeﬃcients are identiﬁed and not the individual tensor margins deﬁning their product-sum given in (6). In the tensor setting, identiﬁability restrictions are understood in light of the following indeterminacies:

1. Scale indeterminacy: for each r = 1, . . . , R, deﬁne λr = (λ1r, . . . , λDr) such that QD j=1 λjr = 1. Then replacing β(r) j by λjrβ(r) j leaves the tensor parameter B unaltered.

2. Permutation indeterminacy: PR r=1 D j=1β(r) j = PR r=1 D j=1β(P(r)) j for any permutation

P( ) of {1, 2, . . . , R}. In particular, this implies that D j=1β(r) j are not identiﬁable for r = 1, . . . , R.

3. Orthogonal transformation indeterminacy (D = 2 only): for any orthonormal matrix O, one has (β(r) 1 O) (β(r) 2 O) = β(r) 1 β(r) 2 .

For D > 2, imposing the following (D 1)R constraints ensures identiﬁability of the margin parameters comprising the rank-R parafac decomposition:

β(r) j,1 = 1, 1 j < D, 1 r R, and β(1) D,1 > > β(R) D,1. (7)

In the context of Bayesian tensor regression, our emphasis is on accurate estimation and inference on tensor parameter B, and on state-of-the-art predictive performance. Importantly, B is always identiﬁable even if the tensor margins β(r) j s are not, hence we avoid imposing identiﬁability restrictions on the latter. As is evident from our simulation studies, non-identiﬁability of β(r) j does not appear to cause convergence issues and in-fact, simpliﬁes the design of eﬃcient computational algorithms.

Guhaniyogi et al.

3. Multiway Shrinkage Priors

This Section outlines the novel multiway shrinkage prior on the tensor coeﬃcient.

3.1 Vector Shrinkage Priors

There has been recent interest in high-dimensional regression with vector predictors, choosing priors which shrink small coeﬃcients towards zero while minimizing shrinkage of large coeﬃcients. Many of these priors can be expressed as a global-local (GL) scale mixtures (Polson and Scott, 2012) with

θj N(0, ψjτ), ψj g, τ h, (8)

where (θ1, . . . , θp) is a coeﬃcient vector, τ is a global scale and ψj is a local-scale. When g is a mixture of two components, with one concentrated near zero and the other away from zero, a spike and slab prior is obtained. Many other choices of g and h have been considered. Although the GL family is widely used and versatile, Bhattacharya et al. (2015) note advantages in drawing the local scales jointly. In particular, they propose to let

θj DE( |φjτ), (φ1, ..., φp) Dirichlet(a, . . . , a), τ h.

where DE( ) denote the double-exponential distribution. For small a and large p, the Dirichlet(a, . . . , a) prior has the property of favoring many values close to zero with a few much larger values, but with P j φj = 1. Though we draw motivation from literature on vector shrinkage priors, our goal of proposing a shrinkage prior on tensor parameter B is fundamentally more challenging as discussed in forthcoming sections.

3.2 Multiway Priors

We propose a new class of multiway shrinkage priors in the generalized linear model setting with tensor valued predictors. Assuming tensor parameter B admits a rank-R parafac decomposition, model (5) results in cell-level coeﬃcients that are a nonlinear function of the corresponding tensor margin parameters (see (6)). Moreover, this implies simultaneous shrinkage on each of the QD j=1 pj cell coeﬃcients as imposed by the prior over R PD j=1 pj parameters. This necessitates careful prior speciﬁcation on the tensor margins {β(r) j ; 1 j D, 1 r R} such that the induced cell-level prior has adequate tails so as to prevent over shrinkage. There are a number of desirable characteristics for a multiway prior on the tensor margins. The proposed multiway shrinkage prior must have a structure that facilitates eﬃcient and reliable model ﬁtting. In addition, it is important to ensure that

1. For each r = 1, . . . , R, β(r) 1,i1, . . . , β(r) D,i D and β(r) 1,k1, . . . , β(r) D,k D are equal in distribution, for any (i1, . . . , i D), (k1, . . . , k D) VB VB and (i1, . . . , i D) = (k1, . . . , k D). This is to ensure that (B)i1,...,i D and (B)k1,...,k D have the same distribution apriori.

2. Shrinkage towards a low rank decomposition, with the model adapting to the complexity and signal in the data, eﬀectively deleting unnecessary dimensions.

3. The prior should favor recovery of contiguous geometric subregions of the tensor across which the cell observations are predictive of the response.

Bayesian Tensor Regression

3.3 The Multiway Dirichlet GDP Prior

There are many ways of specifying priors over tensor margins β(r) j to satisfy the criteria listed. We propose a particular choice called the multiway Dirichlet generalized double Pareto (M-DGDP) prior. This prior induces shrinkage across components in an exchangeable way, with global scale τ Ga(aτ, bτ) adjusted in each component as τr = φrτ for r = 1, . . . , R, where Φ = (φ1, . . . , φR) Dirichlet(α1, . . . , αR) encourages shrinkage towards lower ranks in the assumed parafac decomposition. In addition, W jr = diag(wjr,1, . . . , wjr,pj), j = 1, . . . , D and r = 1, . . . , R are margin-speciﬁc scale parameters for each component. The hierarchical margin-level prior is given by

β(r) j N 0, (φrτ)W jr , wjr,k Exp(λ2 jr/2), λjr Ga(aλ, bλ). (9)

Collapsing over element-speciﬁc scales, notice that β(r) j,k|λjr, φr, τ iid DE(λjr/ φrτ), 1 k pj. Prior (9) induces a GDP prior on the individual margin coeﬃcients which in turn has the form of an adaptive Lasso penalty (Armagan et al., 2013a). Flexibility in estimating Br = {β(r) j ; 1 j D} is accommodated by modeling within-margin heterogeneity via element-speciﬁc scaling wjr,k. Common rate parameter λjr shares information between margin elements, encouraging shrinkage at the local scale. The framework adpoted by Zhou et al. (2013) for Frequentist tensor regression starts by assuming the true parafac rank and relying on shrinkage through a global parameter for estimating the tensor coeﬃcient. In contrast, M-DGDP prior proposes joint shrinkage on the global and local component parameters to achieve improved inference and estimation. Though rank-selection is not our purview, our prior also accomodates dimension reduction by favoring low-rank factorizations as discussed below.

4. Posterior Consistency for Tensor Regression

This Section details out theoretical properties of the tensor regression framework.

4.1 Notation and Framework

We establish convergence results for tensor regression model (5) under the simplifying assumptions that the intercept is omitted by centering the response and the error variance is σ2 = 1. Since our main focus is on the tensor coeﬃcient, we assume coeﬃcients for ordinary scalar covariates to be known. Without loss of generality, we assume γ = (0, . . . , 0). We consider an asymptotic setting in which the dimensions of the tensor grow with n. This paradigm attempts to capture the fact that tensor dimension Q j pj,n is typically substantially larger than sample size. This creates theoretical challenges, related to (but distinct from) those faced in showing posterior consistency for high dimensional regression (Armagan et al., 2013b) and multiway contingency tables (Zhou et al., 2015). Suppose the data generating model comes from the same class of models where the ﬁtted model belongs to, i.e., having true tensor parameter B0 n D j=1ℜpj,n, error variance σ2 0 = 1. We also assume that the true tensor coeﬃcient B0 n generating the data admits a

Guhaniyogi et al.

rank-R PARAFAC decomposition as below

r=1 β0(r) 1,n β0(r) D,n, β0(r) j,n = (β0(r) j,n,1, . . . , β0(r) j,n,pj,n) ℜpj,n.

In addition, deﬁne F n, F 0 n ℜR PD j=1 pj,n as the vectorized parameters:

F n = vec β(1) 1,n, , β(R) 1,n , , β(1) D,n, , β(R) D,n

F 0 n = vec β0(1) 1,n , , β0(R) 1,n , , β0(1) D,n, , β0(R) D,n .

Deﬁne a Kulback-Leibler (KL) neighborhood around the true tensor B0 n as

i=1 KL(f(yi|B0 n), f(yi|Bn)) < ϵ

Denote KL(f(yi|B0 n), f(yi|Bn)) as KLi. The KL-distance between N(µ1, σ2 1) and N(µ2, σ2 2) is log(σ2/σ1)+ σ2 1+(µ1 µ2)2 /2σ2 2 1

2, so it follows KLi = KL N Xi, Bn , 1 , N Xi, B0 n , 1

2 Xi, B0 n Xi, Bn 2. Hence, a KL neighborhood of radius ϵ around B0 n can be reexpressed as Bn = Bn : 1 2n Pn i=1 Xi, B0 n Xi, Bn 2 < ϵ . Further, let πn and Πn denote prior and posterior densities with n observations, respectively, and

Bcn f(yn|Bn)πn(F n) R f(yn|Bn)πn(F n) ,

with yn = (y1, . . . , yn) and f(yn|Bn) is the density of yn under model (5). Posterior consistency is established by showing that

Πn (Bc n) 0 under B0 n a.s. as n . (10)

4.2 Main Result

Our main theorem is that (10) holds under a simple suﬃcient condition on the prior and the tensor predictors.

Theorem 1 Let ζn = n 1+ρ3

2 (ρ3 > 0), Mn = 1

i=1 ||Xi||2 2. Given Lemma 6 in Appendix

A, for any ϵ > 0, Πn(Bn : 1

n Pn i=1 KLi > ϵ) 0 a.s. under B0 n, for the prior πn(Bn) that satisﬁes

Bn : ||Bn B0 n||2 < 2η 3Mnζn

> exp( dn), for all large n (11)

for any d > 0 and η < ϵ 32 d. That is, the model is posterior consistent when (11) holds.

Lemma 6 in Appendix A veriﬁes the existence of exponentially-consistent tests. The proof of the Lemma and Theorem are provided in the Appendix A. The proposed multiway shrinkage prior satisﬁes (11) and hence leads to posterior consistency under the following Theorem.

Bayesian Tensor Regression

Theorem 2 For ﬁxed constants H1, H2, M1, ρ1 and ρ2 > 0, the M-DGDP prior (9) on Bn satisﬁes (11), i.e. yields posterior consistency under conditions:

(a) H1nρ1 < Mn < H2nρ2

(b) supl=1,...,pj,n |β0(r) j,n,l| < M1 < , for all j = 1, . . . , D, r = 1, . . . , R

(c) PD j=1 pj,n log(pj,n) = o(n).

Remark 3 Condition (a) in Theorem 2 gives an upper and lower bound on the sum of the Frobenius norms of tensor predictors. In tensor predictors with {0, 1} entries (often observed for white/grey matter f MRI data), this condition simply imposes a restriction on the minimum and maximum number of 1s in a tensor predictor as a function of the sample size n. Condition (b) is mild, assuming the supremum of all entries in the tensor margins are bounded. Finally, condition (c) in Theorem 2 requires that P j=D pj,n grows sub-linearly with sample size n. However, note that the number of cells QD j=1 pj,n in the tensor can grow at a rate much faster than the sample size n; hence, the modeling framework allows large tensor covariates even for moderate sample sizes. Of course, condition (c) trivially holds when the dimension of the tensor is kept ﬁxed as n grows.

Remark 4 Our Section on posterior consistency is based on the assumption that γ = (0, . . . , 0) and error variance 1, however the results are trivially extendable to cases with unknown γ and σ2. In such a generalization, it is important to note that one would need to assume the dim(γ) is ﬁxed and not growing with n.

4.3 Prior Hyper-parameter Elicitation

The marginal distribution of cell coeﬃcients (6) under the proposed M-DGDP prior (9) is not available in closed-form. To assess how a shrinkage prior on the margins induces prior on cell coeﬃcients of the tensor, we turn to an expression for the cell-level variance:

var(Bi1,...,i D) = E var R X

j=1 β(r) j,id | W , Λ, Φ, τ

r=1 φD r Eτ{τ D} EΛ ,r

= Γ(α0 + D)

Γ(α0) b D τ (2Cλ)D EΦ

The following Lemma provides lower and upper bounds on the variance that can be useful for elicitation of default hyperparameter values.

Lemma 5 Under M-DGDP shrinkage prior (9) and for D > 1, if α1 = = αR = c/R, c N+, with constants Cλ = b2 λ/ (aλ 1)(aλ 2) , aλ > 2, Aτ = exp((D2 3D)/2), then the cell-level variance is bounded below by RαD 1 (2Cλ/bτ)D and above by Aτ(2Cλ/bτ)D exp(α1 RD).

Guhaniyogi et al.

G G G G G G G

(a) α = (0.2, 0.2, 0.2)

(b) α = (0.3, 0.3, 0.3)

(c) α = (0.5, 0.5, 0.5)

Figure 1: Visualization of points in the S2 probability simplex for 500 independent realizations of Dirichlet(α). As α 0, points increasingly tend to concentrate around vertices of the SR 1 simplex. This notion of sparsity is made precise in Yang and Dunson (2014).

Hyperparameters of the Dirichlet component in multiway prior (9) play a key role in controlling dimensionality of the model, with smaller values favoring more componentspeciﬁc scales τr 0, thus eﬀectively collapsing on a low-rank tensor factorization. Figure 1 plots realizations from the Dirichlet distribution when R = 3 for diﬀerent concentration parameters2 α. A discrete uniform prior is placed on α over a grid, A. By default, grid values are chosen to be 10 equally spaced values in [R D, R 0.10], letting the data tune this parameter according to the degree of sparisty present. Armagan et al. (2013a) study various choices of (aλ, ζ = bλ/aλ) that lead to desirable shrinkage properties, such as Cauchy-like tails for β(r) j,k while retaining Laplace-like shrinkage near zero. Empirical results from simulation studies across a variety of settings in Section 6 reveal no strong sensitivity to choices for hyper-parameters aλ, bλ. From Lemma 5, setting aλ = 3 and bλ = 2D aλ avoids overly narrow variance of the induced prior on tensor elements, Bi1,...,i D. Table 1 provides various quantiles of the induced prior on these elements under these default hyperparameter settings as a function of the parafac rank-R and tensor dimension D.

5. Posterior Computation and Model Fitting

Letting y ℜdenote a response, and z ℜp, X D j=1ℜpj predictors, we let

y|γ, B, σ N z γ + X, B , σ2

r=1 Br, Br = β(r) 1 β(r) D

σ2 πσ, γ πγ, β(r) j πβ.

2. For simplicity we assume α1 = = αR = α.

Bayesian Tensor Regression

R 5% 25% 50% 75% 95%

1 0.001 0.011 0.057 0.254 1.729 5 0.004 0.040 0.164 0.595 3.332 10 0.005 0.058 0.237 0.852 4.635

1 0.000 0.001 0.010 0.072 0.917 5 0.000 0.009 0.061 0.341 3.382 10 0.001 0.017 0.111 0.608 5.996

Table 1: Percentiles for |Bi1,...,i D| under the M-DGDP prior with default aλ = 3, bλ =

2D aλ, bτ = αR1/D (v = 1) and α = 1/R. Statistics are displayed as the parafac rank-R vary and dimension D of the tensor vary.

The noise variance is modeled using a conjugate inverse-gamma prior, σ2 IG(v/2, vs2 0/2), with v = 2 and s2 0 chosen by default so Pr(σ2 1) = 0.95 assuming a centered and scaled response. Regression coeﬃcients are given a conjugate normal prior γ N(0, σ2Σ0γ) and the tensor predictor is normalized over all cells to have mean zero and variance 1, allowing one to assume default values for hyper-parameters in the proposed multiway prior.

5.1 Posterior Computation

The proposed multiway prior (9) leads to Gibbs sampling scheme for most parameters of the tensor regression model (12). We rely on marginalization and blocking to reduce autocorrelation for β(r) j , wjr; 1 j D, 1 r R , (Φ, τ), (γ, σ) , drawing in sequence from [α, Φ, τ|B, W], [B, W|Φ, τ, γ, σ, y] and [γ, σ|B, y] as follows:

(1) Sample [α, Φ, τ|B, W ] compositionally as [α|B, W ][Φ, τ|α, B, W ]:

(a) Sample from the conditional distribution of Dirichlet concentration parameter [α|B, W ] via griddy-Gibbs: form a reference set by drawing M samples from [Φ, τ|α, B, W ] for each α A. Set wj,l = π(B|α, Φl, τl, W )π(Φl, τl|α), 1 l M, p(α|B, W ) = π(α) PM l=1 wj,l/M, and Pr(α = αj| ) = p(αj|B, W )/ P α A p(α|B, W ).

(b) Sample component-speciﬁc scales as [Φ, τ|α , B, W ] = [Φ|B, W ][τ|Φ, B, W ]; deﬁne p0 = PD j=1 pj, and recall aτ = PR r=1 αr = Rα and bτ = α(R/v)1/D (see Section 3.3), then

draw ψr gi G(α p0/2, 2bτ, 2Cr), Cr = PD j=1 β(r) T j W 1 jr β(r) j , and set φr = ψr/ PR l=1 ψl in parallel for 1 r R (see Appendix A for deﬁnition of gi G ) draw τ gi G(aτ Rp0/2, 2bτ, 2 PR r=1 Dr), Dr = Cr/φr.

(2) Sample from (β(r) j , wjr, λjr); 1 j D, 1 r R |Φ, τ, γ, σ, y using a back-ﬁtting procedure to produce a sequence of draws from the margin-level conditional distributions across components. For r = 1, . . . , R and j = 1, . . . , D, sample from conditional distribution [(β(r) j , wjr, λjr)|β(r) j, B r, Φ, τ, γ, σ, y], where β(r) j = {β(r) l , l = j} and B r = B \ Br;

(a) draw [wjr, λjr|β(r) j , φr, τ] = [wjr|λjr, β(r) j , φr, τ][λjr|β(r) j , φr, τ]:

Guhaniyogi et al.

draw λjr Ga aλ + pj, bλ + ||β(r) j ||1/ φrτ ; and

draw wjr,k gi G 1

2, λ2 jr, β2 (r) j,k /(φrτ) independently for 1 k pj

(b) draw β(r) j N(µjr, Σjr): deﬁne h(r) i,j,k = Pp1,...,p D d1=1,...,d D=1 I(dj = k) xd1,...,d D Q l =j β(r) l,il ,

H(r) i,j = (h(r) i,j,1, . . . , h(r) i,j,pj) , yi = yi z iγ P l =r Xi, Bl for 1 i n; then

Σjr = H(r) T j H(r) j /σ2 + W 1 jr /(φrτ) 1, µjr = Σjr H(r) j y/σ2

(3) Sample [γ, σ|B, y] = [γ|σ, y][σ2| y]; deﬁne yi = yi Xi, B for 1 i n, then

(a) draw σ2 IG(aσ, bσ), aσ = (n + v)/2, bσ = vs2 0 + || y||2 2 y T Zµγ /2

(b) draw γ N µγ, σ2Σγ , Σγ = ZT Z + Σ 1 0γ 1, µγ = ΣγZT y.

6. Simulation Studies

To illustrate ﬁnite-sample performance of the proposed multiway priors, we show results from a simulation study with various dimensionality (p, R) and deﬁne b = max |B0 i1,...,i D| as the maximum signal size. Throughout, set pj = p, true error variance σ2 0 = 1 and b = 1 for convenience. In addition, we set the true vector coeﬃcient γ0 = (0, . . . , 0) and focus exclusively on inference for tensor parameter B. The following simulated setups are considered:

1. Generated tensor: We construct tensor parameters having rank R0 = {3, 5} with p = {64, 100} and D = 2.

2. Ready made tensor: We use three tensor (2D) images without generating them from a parafac decomposition with known rank.

Five replicated datasets with n = 1000 are generated according to (12) with xi1,...,i D N(0, 1). The tensor parameters considered are shown in Figure 2, where the magnitude of the non-zero cells is b = 1. Examples are chosen to demonstrate recovery of cell-level coeﬃcients across varying degrees of complexity (dimension, parafac rank) and sparsity (% of non-zero cells; see Figure 2). The performance of our method with M-DGDP prior (9) (BTR) is compared with (i) frequentist tensor regression with penalization (FTR)(Zhou et al., 2013); and (ii) Lasso (on the vectorized tensor predictor). Comparisons are based on (a) cell mean squared estimation error (true non-zero, true zero, and overall); and (b) frequentist coverage (and length) of 95% credible intervals. By default, BTR uses R = 10 as an upper bound on the tensor parafac rank, minimizing eﬀects of extra dimensions automatically and concentrating on a lower rank coeﬃcient tensor as MCMC proceeds. We ran FTR with various choices of R and found equivalent performance for R between 3 to 15, thus the default value is set to R = 10 to ensure comparability with the BTR ﬁtting. MCMC for BTR was run for 1300 iterations, with a 300 iteration burn-in and remaining samples thinned by 5. The latter was chosen to keep runtime between BTR and FTR similar for 3D simulation studies in Section 7. There, the total runtime using non-optimized R code on an x86 64 Intel(R) Core(TM) i7-3770 is between 6.2 - 7.5 hours. In simulation studies, the tuning parameter in FTR is selected

Bayesian Tensor Regression

R3-ex R5-ex Shapes Eagle Palmtree Horse

|cell0| > 0 BTR 0.0230.00 0.0210.00 0.2430.01 0.2260.02 0.3160.01 0.2780.01 FTR 0.0350.00 0.0300.00 0.4150.03 0.3540.03 0.4350.02 0.3910.03 Lasso 0.6280.02 0.8220.03 0.6190.07 0.6650.03 0.6980.03 0.8880.01

|cell0| = 0 BTR 0.0110.00 0.0140.00 0.0710.00 0.0850.00 0.1000.01 0.1370.00 FTR 0.0220.00 0.0200.00 0.1270.02 0.1630.03 0.1590.00 0.2150.02 Lasso 0.0900.00 0.0980.02 0.0810.01 0.0970.00 0.0940.01 0.1550.02

Overall BTR 0.013 0.015 0.093 0.102 0.131 0.172 FTR 0.023 0.021 0.164 0.184 0.196 0.257 Lasso 0.187 0.288 0.179 0.204 0.217 0.407

Table 2: Comparison of cell estimation as measured by root mean squared error (RMSE) for the six 2D tensor images portrayed in Figure 2. Results from FTR (Zhou et al., 2013) use R = 10. For BTR, R = 10 is used as an upper bound to the tensor parafac rank. Subscript shows the standard error over a few replicated simulations.

over a grid of values to minimize RMSE for the tensor predictor3. In real applications, cross validation instead would be used to select the tuning parameter that results in lowest hold-out predictive RMSE for FTR. Assuming 10-fold cross validation were used over a vector of 20 tuning parameters, FTR would have a runtime of approximately 8 hours. It also needs to be mentioned that the convergence of parameters in BTR is extremely rapid with an average eﬀective sample size (ESS) 600 over 1000 iterations.

Cell-level RMSE reported in Table 2 demonstrates that our method (BTR) consistently out performs FTR. When the tensor parameter has a low-rank parafac decomposition ( R3ex and R5-ex ), BTR and FTR perform best, with BTR having lower RMSE on both true zero and non-zero cells. This validates empirically prior (9) along with our suggested default hyper-parameter choices in Section 3. In particular, the tensor coeﬃcient in BTR has three diﬀerent types of shrinkage parameters: global, local and shrinkage across ranks. Such a careful construction of shrinkage prior on B adapts to varying degrees of sparsity, shrinking many tensor coeﬃcients close to zero while accurately estimating nonzero cells. FTR shrinkage being dependent on only local parameters suﬀers in terms of both inferential and predictive performances.

Table 3 demonstrate that BTR yields 95% credible intervals with good frequentist coverage across each of the simulated settings, both overall as well as on the true non-zero coeﬃcients. Our method is one of the ﬁrst to oﬀer uncertainty quantiﬁcation for tensor valued predictors.Finally, Table 4 provides evidence of the robustness of our method to increasing predictor dimension using two of the simulated examples. In both cases, RMSE for FTR worsens considerably on the true zero coeﬃcients. For the true nonzero cells, RMSE increases for both methods as the margin dimension increases; however on a relative basis, FTR worsens considerably more, while on an absolute scale, BTR remains the clear winner.

3. To choose initial values, a preliminary analysis was run with a coarsened 16 16 image.

Guhaniyogi et al.

10 20 30 40 50 60

10 20 30 40 50 60

(a) R3-ex (7.0%)

10 20 30 40 50 60

10 20 30 40 50 60

(b) R5-ex (11.0%)

10 20 30 40 50 60

10 20 30 40 50 60

(c) Shapes (6.8%)

10 20 30 40 50 60

10 20 30 40 50 60

(d) Eagle (7.4%)

10 20 30 40 50 60

10 20 30 40 50 60

(e) Palmtree (8.0%)

10 20 30 40 50 60

10 20 30 40 50 60

(f) Horse (18.6%)

Figure 2: Simulated data with 64 64 2D tensor images (p = 64, D = 2). Row 1: The ﬁrst two images (from left) have a rank-3 and rank-5 parafac decomposition; the third image is regular , although does not have a low-rank parafac decomposition. Row 2: All three images are irregular, and do not have a low-rank parafac decomposition. Sparsity (% non-zero cells) are displayed in sub-captions.

R3-ex R5-ex Shapes Eagle Palmtree Horse |cell0| > 0 coverage 0.9860.02 0.9460.02 0.7470.01 0.7310.04 0.6770.04 0.7950.02

Overall coverage 0.9950.01 0.9700.01 0.9650.00 0.9400.02 0.9480.02 0.9270.01 length 0.0660.01 0.0610.01 0.2900.00 0.3010.03 0.4100.03 0.5660.02

Table 3: Row 1: Average coverage of 95% posterior credible intervals for all the cells of B for which the true cell coeﬃcient is nonzero. Row 2: Average coverage of 95% posterior credible intervals for all the cells of B. Row 3: Average length of 95% posterior credible intervals for all the cells of B. Subscripts show standard errors over replicated simulations.

Bayesian Tensor Regression

10 20 30 40 50 60

10 20 30 40 50 60

10 20 30 40 50 60

10 20 30 40 50 60

10 20 30 40 50 60

10 20 30 40 50 60

10 20 30 40 50 60

10 20 30 40 50 60

10 20 30 40 50 60

10 20 30 40 50 60

0.5 0.3 0.1 0.1 0.3 0.5 0.7 0.9 1.1 1.3

(e) Palmtree

10 20 30 40 50 60

10 20 30 40 50 60

0.6 0.4 0.2 0 0.2 0.4 0.6 0.8 1 1.2 1.4

Figure 3: Recovered images for the 64 64 2D tensor images in Figure 2 using our proposed BTR method. Here, R = 10 is used as an upper bound to the tensor parafac rank.

|cell0| R5-ex Shapes 64 100 64 100

coverage > 0 0.9460.02 0.9910.01 0.7470.01 0.5900.06 length > 0 0.0610.01 0.0690.01 0.2900.00 0.2470.01 rmse > 0 0.0210.00 0.0320.01 0.2430.01 0.3200.03 rmse = 0 0.0140.00 0.0140.00 0.0710.00 0.0630.00

FTR rmse > 0 0.0300.00 0.3690.06 0.4150.03 0.5860.14 rmse = 0 0.0200.00 0.1110.02 0.1270.02 0.1350.02

Table 4: Sensitivity analysis of cell estimation error (RMSE) as the tensor dimension increases; here pj = p {64, 100} for the 2D tensor images R5-ex and Shapes .

Guhaniyogi et al.

7. Simulated response with a real 3D brain image

We analyze data containing 3D MRI images for 550 adolescents, with information such as age and sex available. Age and sex are treated as ordinary scalar covariates while 3D MRI images act as tensor covariates. Let X denote a 30 30 30 3D MRI image, Z1 be the age and Z2 be the sex of an individual. The response is simulated using y N Z γ + X, B0 , σ2 , where Z denotes (Z1, Z2) , γ R2 and B0 R30 30 30.

We assume the true B0 is a rank 2 tensor, with B0 = a1 a2 a3 + b1 b2 b3. Initialization and standardization of predictors follow exactly as prescribed in Section 5. By varying ai s and bi s, the following cases with varying degrees of sparsity in the tensor parameter B0 are considered: Case 1: b1 = b2 = (0, . . . , 0, sin((1 : 15) π/4)), b3 = (sin((1 : 10) π/4), 0, . . . , 0), a1 = (0, . . . , 0, sin((1 : 10) π/4)), a2 = (0, . . . , 0, cos((1 : 15) π/4)), a3 = (sin((1 : 15) π/4), 0, . . . , 0). Case 2: b1 = b2 = (0, . . . , 0, sin((1 : 15) π/6)), b3 = (sin((1 : 20) π/6), 0, . . . , 0), a1 = (0, . . . , 0, sin((1 : 15) π/4)), a2 = (0, . . . , 0, cos((1 : 10) π/6)), a3 = (sin((1 : 15) π/6), 0, . . . , 0). Case 3: b1 = b2 = (0, . . . , 0, sin((1 : 20) π/6)), b3 = (sin((1 : 20) π/6), 0, . . . , 0), a1 = (0, . . . , 0, sin((1 : 10) π/4)), a2 = (0, . . . , 0, cos((1 : 20) π/4)), a3 = (sin((1 : 20) π/6), 0, . . . , 0).

We implement BTR, FTR, and Lasso on the vectorized tensor. As before, we present results for FTR with R = 10 (See additional discussion in Section 6 on FTR default setup). Point estimates for coeﬃcients corresponding to age and sex covariates are provided in Table 6. Table 5 summarizes RMSEs for the estimated tensor coeﬃcients for each method. BTR shows at least a 15% improvement over FTR on simulated cases considered. Evidently BTR tends to outperform FTR and vectorized lasso by a greater margin in less sparse settings as well. Importantly, every parameter in BTR is auto-tuned, while the Tensor Reg toolbox used for FTR (Zhou et al., 2013) requires calibrating tuning parameter values speciﬁc to each setting. Note that rather than using cross validation, tuning parameters in these experiments were chosen to provide the lowest possible (most optimistic) RMSE for the tensor coeﬃcient. FTR ﬁxes R = 10 based on ﬁndings discussed in Section 6 while BTR sets R = 10 as an upper bound, concentrating on a lower dimension parafac rank via adaptive shrinkage. While vectorized lasso and FTR do not come equipped with parameter uncertainty estimates, Table 7 demonstrates how BTR consistently provides over 95% coverage across examples with varying degrees of sparsity.

Finally, Table 8 provides a measure of mixing eﬃciency for a single MCMC run in each of the simulated cases considered (post burn-in over the remaining 1000 MCMC samples). All reported RMSE and coverage statistics were computed over these draws as well (thinning by 5 as previously discussed in Section 6).

8. Brain Connectome Data Analysis

Brain connectome data are known by neuroscientists to have low signal-to-noise ratio, and eﬀective modeling is often hindered as the sample size is often very limited compared to the

Bayesian Tensor Regression

Case 1 Case 2 Case 3

|cell0| > 0 BTR 0.39 0.30 0.34 FTR 0.46 0.41 0.43 Lasso 0.46 0.42 0.44

|cell0| = 0 BTR 0.04 0.14 0.10 FTR 0.00 0.00 0.00 Lasso 0.01 0.03 0.02

Overall BTR 0.13 0.20 0.17 FTR 0.15 0.22 0.18 Lasso 0.15 0.23 0.18

Table 5: Comparison of cell estimation as measured by root mean squared error (RMSE) for the coeﬃcients in case 1,2, 3 corresponding to 3D tensor images. Results from both BTR and FTR (Zhou et al., 2013) use R = 10.

Case 1 Case 2 Case 3

γ1 (truth = 0.5) BTR 0.57 0.54 0.33 FTR 0.46 0.85 0.95

γ2 (truth = 2.0) BTR 2.00 2.04 1.86 FTR 1.87 0.22 3.30

Table 6: Point estimates for age and sex coeﬃcients under BTR and FTR Zhou et al. (2013). True parameter values are also provided.

Guhaniyogi et al.

Case 1 Case 2 Case 3 Coverage 0.98 0.96 0.99 Length 0.54 0.87 2.16

Table 7: Length and coverage of 95% credible intervals for BTR. Values are reported as averages over all voxels of the tensor coeﬃcient.

τ 2 Scalar predictor γi (Ave.) Tensor predictor βijk (Ave.)

Case 1 (88% sparsity) lag-2 0.04 0.34 0.45 lag-4 0.01 0.11 0.22

Case 2 (82% sparsity) lag-2 0.08 0.33 0.46 lag-4 0.03 0.12 0.23

Case 3 (70% sparsity) lag-2 0.09 0.43 0.53 lag-4 0.02 0.30 0.40

Table 8: MCMC autocorrelation of the proposed BTR method on data studies of Section 7 generated using 3D brain MRI scans.

number of cells in the tensor predictor. In this setting, developing well calibrated predictive models is thus of key importance. To investigate the performance of competing methods outside the class of f MRI brain image data, we present an analysis using a brain connectome dataset on structural connectivity. Data are extracted from diﬀusion tensor imaging (DTI) and consist of estimates of the number of ﬁbers connecting pairs of brain regions for 109 individuals. For each individual, brain connections among 70 brain regions (following desikan atlas) are encoded by a 70 70 weighted adjacency matrix. The (i, j)-th oﬀ-diagonal entry in the adjacency matrix is the estimated number of ﬁber tracts connecting the i-th and j-th brain region. The data also provides 10 clinical covariates for every individual, including sex, age, openness, agreeableness and conscientiousness. The focus of this study is on developing a predictive model with Creativity Composite Index (CCI) as a response ﬁtted against clinical covariates and a tensor covariate (i.e., the weighted adjacency matrix). Implementing FTR on this data using the Tensor Reg package was attempted, however, functions in the toolbox require n > R p. In this example, because n = 109 and p = 70, it is only possible therefore to ﬁt FTR with R = 1, which has previously been found to perform poorly by Zhou et al. (2013). We therefore compare our proposed method (BTR) to the vectorized Lasso on the basis of their predictive performance. To assess the predictive performance, the sample of n = 109 individuals are divided into 10 folds. Both vectorized lasso and BTR are ﬁtted on 9 folds as training data and the remaining fold as the hold out sample. This is carried out for each of the 10 folds and predictive inferences are obtained for both vectorized lasso and BTR. Table 9 reports the root mean squared error (RMSE) and correlation between observed and predicted responses, here average is over the 10 crossvalidated folds. For reference, average RMSE of the null model is 10.03 with a standard deviation of 2.40 across the

Bayesian Tensor Regression

Method avg(RMSE) sd(RMSE) avg(cov.) sd(cov.) avg(cor.) sd(cor.) Lasso 9.21 2.18 63% 20% 0.31 0.11 BTR 9.03 1.64 91% 10% 0.32 0.13

Table 9: mean and standard deviation of RMSEs and cor(yobs, ypred) of Lasso and BTR over 10 folds of the data. It also provides mean and standard deviation of coverages of Lasso and BTR over 10 folds of the data

folds. Given the very high degree of sparsity in the connectome adjacency matrix, it is not surprising that the Lasso is competitive to BTR. However, note that BTR detects this signal with far fewer eﬀective parameters as compared to vectorized lasso. Finally, we measure coverage of 95% predictive intervals for all competitors. The latter is of course a byproduct of our fully Bayesian approach (BTR), while for the Lasso we use a two-staged approach. First we estimate the regression coeﬃcients and subsequently construct approximate 95% predictive intervals based on the normal response-model centered on the predictive mean with variance equal to the estimated residual variance.

9. Discussion

This work develops a novel class of prior distributions on tensor valued predictors which substantially reduces dimensionality relative to vectorizing, providing a multiway analogue of vector shrinkage priors, and enabling high dimensional region selection. The prior on tensor coeﬃcient constructed here imparts shrinkage of the tensor components at global and local levels, while also encouraging shrinkage towards low rank tensor decomposition. In contrast, existing penalization framework on the tensor coeﬃcient shrinks only at the global level. Strong theoretical results are proved for the proposed class of multiway shrinkage priors and a computationally eﬃcient MCMC algorithm is developed in the regression setting. We plan to extend methods developed here to settings where the measured response for each subject is binary (e.g., indicator of a heath outcome) or multivariate, i.e., y = (y1, ..., yd). Also, the current framework of Bayesian tensor regression ﬁxes rank R of the PARAFAC decomposition at reasonably large value. It might be of interest to learn the PARAFAC rank R by adding a discrete prior distribution on R.

In various longitudinal studies, monitoring the evolution in the predictor response relationship (i.e., changes to the scalar and tensor parameters) is of fundamental interest. One application involves subjects receiving various treatments (e.g., chemotherapy), with tensor valued predictors corresponding to m RI (f MRI) scans obtained at regular intervals over a period of time. In such settings it is of crucial importance to monitor the progression of the disease in response to the treatment being administered. We plan to extend our method to such settings.

Guhaniyogi et al.

Acknowledgement

The authors would like to thank Joshua T. Vogelstein from Johns Hopkins university for gracefully allowing us to utilize their brain connectome dataset in the real data analysis of our proposed BTR method.

MCMC algorithm

The following derivations concern the M-DGDP prior (9) and the sampling algorithm outlined in Section 5.1.

For step (1b) Recall from Section 3.3 that τ Ga(aτ, bτ) and Φ Dirichlet(α1, . . . , αR) and denote p0 = PD j=1 pj. Then,

π(Φ|B, ω) π(Φ) Z

0 π(B|ω, Φ, τ)π(τ)dτ

r=1 φαr 1 r i Z

h (τφr) p0/2 exp 1 τφr

j=1 ||βjr||2/(2ωjr) i τ aτ 1 exp( bττ)dτ

0 τ aτ R p0

τφr bτ(τφr) dτ

with Cr = Pd j=1 ||βjr||2/(2ωjr). When aτ = PR r=1 αr, this contains the kernel of a generalized inverse Gaussian (g IG) distribution for (τφr). Recall: X f X(x) = gi G(p, a, b) xp 1 exp( (ax + b/x)/2). Following Lemma 9 in the Appendix B, for independent random variable Tr fr on (0, ), the joint density of {φr = Tr/ P r T r : r = 1, . . . , R} has support on SR 1. In particular,

f(φ1, . . . , φR 1) = Z

0 t R 1 R Y

r=1 fr(φrt) dt, φR = 1 X

Substituting fr(x) x δr exp( Cr/x) exp( bτx) in the above expression yields

f(φ1, . . . , φR 1) Z

0 τ R 1 R Y

r=1 (φrτ) δr exp Cr (φrτ) bτ(φrτ) dτ

r=1 φ δri Z

0 τ R P r δr 1 R Y

r=1 exp Cr (φrτ) bτ(φrτ) dτ.

Matching exponents between this expression and the preceding one implies (1) aτ R(p0/2) 1 = R P r δr 1, and (2) δr = 1 + p0/2 αr. Then,

aτ = R(1 + p0/2) X

r δr = R(1 + p0/2) (R + Rp0/2 X

as previously noted. Hence, draws from [Φ|α, B, W ] are obtained by sampling Tr fr = gi G(αr p0/2, 2bτ, 2Cr) independently for r = 1, . . . , R, and renormalizing.

Bayesian Tensor Regression

Proof of lemma 5

Proof Using priors deﬁned in (9), one has Cλ = Eλ(1/λ2) = b2 λ (aλ 1)(aλ 2) for any aλ > 2. In addition, the following inequalities are useful to bound the latter quantity:

If α1 = c/R, c N+, Γ(α0+D)/Γ(α0) = α0(α0+1) (α0+D 1). Using the fact that log(x+1) x, x 0, one has log(α0)+ +log(α0+D 1) α0D 1+PD 2 k=1 D 2 k. Then αD 0 Γ(α0 + D)/Γ(α0) Aτ exp(α0D) where Aτ = exp( 1 + PD 2 k=1 D 2 k) = exp (D2 3D)/2 , D 2.

Let ||x||r denote the Lrth norm. Trivially, ||Φ||D D 1; in addition, by H older s in-

equality, for any x ℜk and 0 < r < p, one has ||x||p k 1 r 1

p ||x||r. In our setting, D 2. Taking r = 1 in the latter yields ||Φ||D D R (D 1).

Recall α0 = PR r=1 αr = α1R. This leads to the lower and upper bounds for the prior voxel-level variance:

var(Bi1,...,i D) (2Cλ)D (α1R)DR (D 1)/b D τ = (2Cλ)D αD 1 R/b D τ var(Bi1,...,i D) Aτ(2Cλ)D exp(α1 RD)/b D τ .

Consistency proofs

The proof of Theorem 1 relies in part on the existence of exponentially consistent tests.

Deﬁnition An exponentially consistent sequence of test functions Φn = I(yn Cn) for testing H0 : Bn = B0 n vs. H1 : Bn = B0 n satisﬁes

EB0 n(Φn) c1 exp( b1n), sup Bn Bcn EBn(1 Φn) c2 exp( b2n)

for some c1, c2, b1, b2 > 0.

Lemma 6 There exist an exponentially consistent sequence of tests Φn for testing H0 : Bn = B0 n vs. H1 : Bn = B0 n.

Proof We begin by stating that Pn i=1 yi Xi, B0 n 2 χ2 n under B0 n. We choose the

critical region of the test Φn as Cn = n Bn : 1

n Pn i=1 yi Xi, B0 n 2 > ϵ/4 o . Note that

EB0 n(Φn) = PB0 n

yi Xi, B0 n 2 > nϵ/4 exp nϵ

, for large n,

where the last line follows by simplifying Lemma 1 in Laurent and Massart (2000).

Guhaniyogi et al.

Now we will use the fact that

i=1 (yi Xi, B0 n )2

i=1 (yi Xi, Bn )2 + 1

Xi, Bn B0 n 2 + 2

i=1 (yi Xi, Bn ) ( Xi, Bn B0 n)

Xi, Bn B0 n 2 + 1

i=1 KLi + 2

i=1 (yi Xi, Bn ) Xi, Bn B0 n .

Note that, under Bn,

i=1 (yi Xi, Bn ) Xi, Bn B0 n N(0, 4

so that, 2 n Pn i=1(yi Xi, Bn ) Xi, Bn B0 n = q

4 n Pn i=1 KLi Z n, where Z N(0, 1). Thus,

sup Bn Bcn EBn(1 Φn) = sup Bn Bcn PBn

i=1 (yi Xi, B0 n )2 ϵ/4

sup Bn Bcn PBn

i=1 KLi Z n + 1

i=1 (yi Xi, Bn )2

sup Bn Bcn PBn

i=1 KLi Z n + 1

i=1 (yi Xi, Bn )2

sup Bn Bcn PBn

4 Pn i=1 KLi

i=1 (yi Xi, Bn )2

4 Pn i=1 KLi

1 2n Pn i=1 KLi

. Using this fact we have

sup Bn Bcn EBn(1 Φn)

sup Bn Bc n PBn

4 Pn i=1 KLi

i=1 (yi Xi, Bn )2

+ sup Bn Bc n PBn (Tn)

sup Bn Bcn PBn

i=1 KLi ϵ/4

i=1 (yi Xi, Bn )2

+ sup Bn Bcn PBn

i=1 (yi Xi, Bn )2

+ PBn χ2 1 nϵ

Bayesian Tensor Regression

where the last line requires an application of Lemma 1 in Laurent and Massart (2000).

Theorem 1 Proof Under Lemma 6 one has

Bcn f(yn|Bn)πn(F n) R f(yn|Bn)πn(F n) =

Bcn f(yn|Bn) f(yn|B0 n)πn(F n) R f(yn|Bn)

f(yn|B0 n)πn(F n) = N

D Φn + (1 Φn)N

Note that we have

PB0 n (Φn > exp( b1n/2)) EB0 n (Φn) exp(b1n/2) c1 exp( b1n/2).

Therefore P n=1 PB0 n (Φn > exp( b1n/2)) < . Using Borel-Cantelli lemma PB0 n (Φn > exp( b1n/2)i.o.) = 0. It follows that

Φn 0 a.s. (13)

In addition, we have

EB0 n((1 Φn)N) = Z (1 Φn) Z

f(yn|Bn) f(yn|B0 n)πn(F n)f(yn|B0 n)

Z (1 Φn)f(yn|Bn)πn(F n)

sup Bn Bcn EBn(1 Φn) c2 exp( b2n).

Using a similar technique as above, PB0 n ((1 Φn)N exp(nb2/2) > exp( nb2/4)i.o.) = 0 so

exp(bn)(1 Φn)N 0 a.s.. (14)

By Lemma 6 and (13)-(14) it is enough to show that M = exp( bn) R f(yn|Bn)

f(yn|B0 n)πn(F n)

for some b b = ϵ 256. We choose b = b. Consider the set Hn = n Bn : 1

n log h f(yn|B0 n) f(yn|Bn) i < η o , for some η which is chosen later.

M exp( bn) Z

n log f(yn|B0 n) f(yn|Bn)

exp(( b η)n)πn(Hn).

1 n log f(yn|Bn)

i=1 (yi Xi, Bn )2 + 1

i=1 (yi Xi, B0 n )2 #

Guhaniyogi et al.

Let yn = (y1, ..., yn) , Hn = ( X1, Bn , . . . , Xn, Bn ) and H0 n = X1, B0 n , . . . , Xn, B0 n . Then

n ||yn H0 n||2 + ||yn Hn||2 < 2η

2||yn H0 n|| ||yn Hn|| ||yn H0 n|| + ||yn Hn|| ||yn H0 n|| 2 < 2η

2||yn H0 n||||H0 n Hn|| + ||Hn H0 n||2 < 2η

n||H0 n Hn|| < 2η

3ζn , ||yn H0 n||2 < ζ2 n

where A1n = n 1 n||Hn H0 n|| < 2η

o , A2n = ||yn H0 n||2 < ζ2 n .

We will show that PB0 n(A2n) = 1 for all large n. Assume ζn = n(1+ρ3)/2, ρ3 > 0 so that ζ2 n > 8n for all large n. Then,

PB0 n(A 2n) = PB0 n(χ2 n > ζ2 n) exp( ζ2 n/2).

Therefore, using Borel-Cantelli lemma PB0 n(A 2n i.o.) = 0. Hence PB0 n(A2n) = 1 for all

large n. It is enough to bound πn(A1n). Let Mn = 1

q Pn i=1 ||Xi||2 2. Now use the fact that

1 n||Hn H0 n|| = 1

q Pn i=1( Xi, Bn B0 n )2 1 n

q Pn i=1 ||Xi||2 2 ||Bn B0 n||2 to conclude ||Bn B0 n||2 < 2η 3Mnζn

By (11) one has πn(A1n) πn ||Bn B0 n||2 < 2η 3Mnζn

exp( dn) and hence M

exp ( b η d)n as n proving the result.

Theorem 2 Proof Deﬁne g : R R s.t.

g(κ) = RκD + κD 1 D X

r=1 ||β0(r) j,n ||2 + + κ

l =j ||β0(r) l,n ||2.

Let κn > 0 be s.t. g(κn) = 2η 3Mnζn . Note that by Decarte s rule of sign, the equation

g(κ) 2η 3Mnζn = 0 has a unique positive root. Further

1 κn < 1 + max i=1,...,D

3 P j1 = =ji PR r=1

l=1 ||β0(r) jl,n||2

κn < 1 + max

2η 3Mnζn R, max i=1,...,D

P j1 = =ji PR r=1

l=1 ||β0(r) jl,n||2

Bayesian Tensor Regression

by Lemma 8 in Appendix B.

Using Lemma 7 in Appendix B, it is easy to see that

n ||β(r) j,n β0(r) j,n ||2 κn, j = 1, ..., D; r = 1, ..., R o ||Bn B0 n||2 < 2η 3Mnζn

Using (15), πn ||Bn B0 n||2 < 2η 3Mnζn

πn n ||β(r) j,n β0(r) j,n ||2 κn, j = 1, ..., D; r = 1, ..., R o . Note that

πn n ||β(r) j,n β0(r) j,n ||2 κn, j = 1, ..., D; r = 1, ..., R o |{wjr,l}pj,n l=1 , {λjr}D,R 1 j,r=1 , {φr}R 1 r=1 , τ

r=1 πn ||β(r) j,n β0(r) j,n ||2 κn|{wjr,l}pj,n l=1 , {λjr}D,R 1 j,r=1 , {φr}R 1 r=1 , τ

Therefore, it is enough to bound πn(||β(r) j,n β0(r) j,n || κn, j = 1, ..., D; r = 1, ..., R). For j = 1, ..., D, r = 1, ..., R,

πn(||β(r) j,n β0(r) j,n || κn|{wjr,l}pj,n l=1 , λjr, {φr}R 1 r=1 , τ)

|β(r) j,n,l β0(r) j,n,l| κn pj,n |{wjr,l}pj,n l=1 , λjr, {φr}R 1 r=1 , τ

2κn p2pj,nπwjr,lφrτ

|β0(r) j,n,l|2 + κ2 n/pj,n wjr,lφrτ

where the last step follows from the fact that R b a e x2/2dx e (a2+b2)/2(b a). Thus,

πn(||β(r) j,n β0(r) j,n || κn|λjr, {φr}R 1 r=1 , τ)

= E h πn(||β(r) j,n β0(r) j,n || κn|{wjr,l}pj,n l=1 , λjr, {φr}R 1 r=1 , τ) i

!pj,n pj,n Y

1 wjr,l exp

|β0(r) j,n,l|2 + κ2 n/pj,n wjr,lφrτ

2κnλ2 jr 2 p

!pj,n pj,n Y

1 wjr,l exp

|β0(r) j,n,l|2 + κ2 n/pj,n wjr,lφrτ λ2 jrwjr,l

Guhaniyogi et al.

Use the change of variable 1 wjr,l = zjr,l and the normalizing constant from the inverse Gaussian density to deduce

1 wjr,l exp

|β0(r) j,n,l|2 + κ2 n/pj,n wjr,lφrτ λ2 jrwjr,l

z3 jr,l exp

(|β0(r) j,n,l|2 + κ2 n/pj,n)

φrτ zjr,l λ2 jr 2zjr,l

2 |β0(r) j,n,l|2 + κ2n/pj,n

(19) can be written as

πn(||β(r) j,n β0(r) j,n || κn|λjr, {φr}R 1 r=1 , τ)

2κnλ2 jr 2 p

!pj,n pj,n Y

2 |β0(r) j,n,l|2 + κ2n/pj,n

2 |β0(r) j,n,l|2 + κ2n/pj,n

πn(||β(r) j,n β0(r) j,n || κn|{φr}R 1 r=1 , τ)

!pj,n baλ,r λ,r Γ(aλ,r)

λjr λpj,n+aλ,r 1 jr exp

2 |β0(r) j,n,l|2 + κ2n/pj,n

!pj,n baλ,r λ,r Γ(aλ,r) Γ(pj,n + aλ,r)

2 |β0(r) j,n,l|2+κ2n/pj,n

2κn 2bλ,r p

!pj,n 1 Γ(aλ,r) Γ(pj,n + aλ,r)

2 |β0(r) j,n,l|2+κ2n/pj,n

bλ,r φrτ + 1

pj,n+aλ,r .

Bayesian Tensor Regression

The ﬁnal expression as in the above yields

πn(||β(r) j,n β0(r) j,n || κn, j = 1, ..., D, r = 1, ..., R|{φr}R 1 r=1 , τ)

2κn 2bλ,r p

!pj,n 1 Γ(aλ,r)λpj,n+aλ,r 1 j,r Γ(pj,n + aλ,r)

2 |β0(r) j,n,l|2+κ2n/pj,n

bλ,r φrτ + 1

We will now use the fact that for φr 1,

2 |β0(r) j,n,l|2+κ2n/pj,n

bλ,r φrτ + 1

pj,n+aλ,r 1

2 |β0(r) j,n,l|2+κ2n/pj,n

bλ,r φrτ + 1 τφr

pj,n+aλ,r Iτ [0,1].

This inequality is critical to provide a lower bound on πn(||β(r) j,n β0(r) j,n || κn, j = 1, ..., D, r = 1, ..., R) as following

πn(||β(r) j,n β0(r) j,n || κn, j = 1, ..., D, r = 1, ..., R)

λλ1 2 Γ(Ra) Γ(λ1)Γ(a)R

κn pj,nbλ,r

pj,n Γ(pj,n + aλ,r)

τ τ λ1 R PD j=1 pj,n

2 1 exp( λ2τ)

QR r=1 φa 1 r QR r=1 φ

PD j=1 pj,n

2 |β0(r) j,n,l|2+κ2n/pj,n

bλ,r φrτ + 1

pj,n+aλ,r dφdτ

λλ1 2 Γ(Ra) Γ(λ1)Γ(a)R

κn pj,nbλ,r

pj,n Γ(pj,n + aλ,r)

2 |β0(r) j,n,l|2+κ2n/pj,n

τ=0 τ λ1+PR r=1 aλ,r D

2 1 exp( τλ2)dτ Z

r=1 φ a+aλ,r D

= λλ1 2 Γ(Ra) Γ(λ1)Γ(a)R

κn pj,nbλ,r

pj,n Γ(pj,n + aλ,r)

2 |β0(r) j,n,l|2+κ2n/pj,n

(λ1 + PR r=1 aλ,r D

QR r=1 Γ(a + aλ,r D

2 PR r=1 aλ,r) .

Guhaniyogi et al.

Denote C6 = λλ1 2 Γ(Ra) Γ(λ1)[Γ(a)]R exp( λ2) (λ1+PR r=1 aλ,r D

QR r=1[Γ(a+aλ,r D

2 )] Γ(Ra+PR r=1 aλ,r D

2 ) . Then the above expression gives us

log ||Bn B0 n||2 < 2η 3Mnζn

log(κn) + 1

2 log(pj,n) + log(bλ,r) + log(Γ(aλ,r)

j=1 log(Γ(pn,j + aλ,r)) +

r=1 (pj,n + aλ,r) log

2 |β0(r) j,n,l|2 + κ2n/pj,n

Using (16) and assumption (b), it is easy to see that 1 κn < G5nρ2+ ρ3+1

2 QD j=1 pj,n for a con-

stant G5 > 0 for all large n. Therefore, PD j=1 PR r=1 pj,n h log 1 κn

2 log(pj,n) + log(bλ,r) + log(Γ(aλ,r) i =

o(n). Also, PR r=1 PD j=1 log(Γ(pj,n + aλ,r))] PD j=1(pj,n + aλ,r) log(pj,n + aλ,r) = o(n), by

assumption (c). Finally, PD j=1 PR r=1(pj,n + aλ,r) log h Ppj,n l=1

2 |β0(r) j,n,l|2+κ2n/pj,n

bλ,r + 1 i = o(n),

by assumptions (b) and (c). Thus, log πn(Bn : ||Bn B0 n||2 < 2η 3Mnζn ) < dn for all d > 0, for all large n. This proves the result.

This Section contains additional Lemmas relevant to the article.

Lemma 7 Suppose T = T1 TD and F = F1 FD are two rank one tensors of same dimension. Then

T F = (T1 F1) (TD FD) +

I1 I2=1:D,|I1|=l,|I2|=D l γ1 γD,

where γj = Fj if j I2; = Tj Fj if j I1.

Proof We will show it by induction. If D = 2 then,

T F = T1 T2 F1 F2 = (T1 F1) T2 + F1 T2 F1 F2 = (T1 F1) (T2 F2) + (T1 F1) F2 + F1 (T2 F2).

Bayesian Tensor Regression

Assume the result to hold for D 1. For D,

T1 TD F1 FD = (T1 F1) T2 TD + F1 [T2 TD F2 FD]

= (T1 F1) [(T2 F2) (TD FD) + F2 FD+

I1 I2,|I1|=l,|I2|=D 1 l γ2 γD]+

F1 [(T2 F2) (TD FD) +

I1 I2,|I1|=l,|I2|=D 1 l γ2 γD]

= (T1 F1) (TD FD) +

I1 I2,|I1|=l,|I2|=D l γ1 γD].

Hence proved.

Lemma 8 Let x be a real root of the polynomial P(x) = akxk +ak 1xk 1 + +a1x a0. Then 1/|x | < 1 + maxi=1,...,k ai

Proof Consider the polynomial P1(ζ) = ζk a1 a0

. By making a change

of variable with ζ = 1

x, we obtain

= akxk + + a1x a0

Note that P1 1

x = 0 is solved by x = x . Therefore, P1(ζ) = 0 is solved by ζ = 1 x . The result follows by using Cauchy bound on the roots of a polynomial.

Lemma 9 Suppose T1, . . . , Tm are independent random variables with Tj having density fj supported in (0, ). Let φj = Tj Pm l=1 Tm . Then the joint density of (φ1, . . . , φm 1) has a joint

density supported on the simplex Sm 1 and is given by

f(φ1, . . . , φm 1) = Z

t=0 tm 1 m Y

l=1 fj(φjt)dt,

where φm = 1 Pm 1 l=1 φl.

Proof This result is well known in the theory of normalized random measures (Kruijer et al., 2010).

Guhaniyogi et al.

Artin Armagan, David B Dunson, and Jaeyong Lee. Generalized double pareto shrinkage. Statistica Sinica, 23(1):119 143, 2013a.

Artin Armagan, David B Dunson, Jaeyong Lee, Waheed U Bajwa, and Nate Strawn. Posterior consistency in linear models under shrinkage priors. Biometrika, 100(4):1011 1018, 2013b.

Anirban Bhattacharya, Debdeep Pati, Natesh S Pillai, and David B Dunson. Dirichlet laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110 (512):1479 1490, 2015.

Brian S Caﬀo, Ciprian M Crainiceanu, Guillermo Verduzco, Suresh Joel, Stewart H Mostofsky, Susan Spear Bassett, and James J Pekar. Two-stage decompositions for the analysis of functional connectivity for fmri with application to alzheimer s disease risk. Neuro Image, 51(3):1140 1149, 2010.

David Gerard and Peter Hoﬀ. Adaptive higher-order spectral estimators. ar Xiv preprint ar Xiv:1505.02114, 2015.

JeﬀGoldsmith, Lei Huang, and Ciprian M Crainiceanu. Smooth scalar-on-image regression via spatial Bayesian variable selection. Journal of Computational and Graphical Statistics, 23(1):46 64, 2014.

Chris Hinrichs, Vikas Singh, Lopamudra Mukherjee, Guofan Xu, Moo K Chung, Sterling C Johnson, and Alzheimer s Disease Neuroimaging Initiative. Spatially augmented lpboosting for ad classiﬁcation with evaluations on the adni dataset. Neuroimage, 48(1):138 149, 2009.

Peter D Hoﬀet al. Multilinear tensor regression for longitudinal relational data. The Annals of Applied Statistics, 9(3):1169 1193, 2015.

Hung Hung and Chen-Chien Wang. Matrix variate logistic regression model with application to eeg data. Biostatistics, 14(1):189 202, 2013.

Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455 500, 2009.

Willem Kruijer, Judith Rousseau, and Aad Van Der Vaart. Adaptive Bayesian density estimation with location-scale mixtures. Electronic Journal of Statistics, 4:1225 1257, 2010.

Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302 1338, 2000.

Nicole Lazar. The Statistical Analysis of Functional MRI Data. Springer Science & Business Media, 2008.

Martin A Lindquist. The statistical analysis of fmri data. Statistical Science, 23(4):439 464, 2008.

Bayesian Tensor Regression

Nicholas G Polson and James G Scott. Local shrinkage rules, l evy processes and regularized regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):287 311, 2012.

Peihua Qiu. Jump surface estimation, edge detection, and image restoration. Journal of the American Statistical Association, 102(478):745 756, 2007.

Philip T Reiss and R Todd Ogden. Functional generalized linear models with images as predictors. Biometrics, 66(1):61 69, 2010.

Srikanth Ryali, Kaustubh Supekar, Daniel A Abrams, and Vinod Menon. Sparse logistic regression for whole-brain classiﬁcation of fmri data. Neuro Image, 51(2):752 764, 2010.

Taiji Suzuki. Convergence rate of Bayesian tensor estimatior and its minimax optimality. In Proceedings of the 32nd International Conference on Machine Learning (Lille, 2015), pages 1273 1282, 2015.

Xuejing Wang, Bin Nan, Ji Zhu, and Robert Koeppe. Regularized 3d functional regression for brain image data via Haar wavelets. The annals of applied statistics, 8(2):1045, 2014.

Yun Yang and David B Dunson. Minimax optimal Bayesian aggregation. ar Xiv preprint ar Xiv:1403.1345, 2014.

Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540 552, 2013.

Jing Zhou, Anirban Bhattacharya, Amy H Herring, and David B Dunson. Bayesian factorizations of big sparse tensors. Journal of the American Statistical Association, 110(512): 1562 1576, 2015.