# bayesian_tensor_regression__3eb9789c.pdf Journal of Machine Learning Research 18 (2017) 1-31 Submitted 7/16; Published 8/17 Bayesian Tensor Regression Rajarshi Guhaniyogi rguhaniy@ucsc.edu Department of Applied Mathematics & Statistics University of California Santa Cruz, CA 95064, USA Shaan Qamar siqamar@gmail.com Google Inc. Mountain View, CA 94043, USA David B. Dunson dunson@duke.edu Department of Statistical Science Duke University Durham, NC 27708-0251, USA Editor: Robert Mc Culloch We propose a Bayesian approach to regression with a scalar response on vector and tensor covariates. Vectorization of the tensor prior to analysis fails to exploit the structure, often leading to poor estimation and predictive performance. We introduce a novel class of multiway shrinkage priors for tensor coefficients in the regression setting and present posterior consistency results under mild conditions. A computationally efficient Markov chain Monte Carlo algorithm is developed for posterior computation. Simulation studies illustrate substantial gains over existing tensor regression methods in terms of estimation and parameter inference. Our approach is further illustrated in a neuroimaging application. Keywords: Multiway Shrinkage Prior, Magnetic Resonance Imaging (MRI), Parafac Decomposition, Posterior Consistency, Tensor Regression 1. Introduction In many application areas, it is common to collect predictors that are structured as a multiway array or tensor. For example, the elements of this tensor may correspond to voxels in a brain image (Lindquist, 2008; Lazar, 2008; Hinrichs et al., 2009; Ryali et al., 2010). Existing approaches for quantifying associations between an outcome and such tensor predictors mostly fall within two groups. The first approach assesses the association between each cell (for brain images referred to as voxel) and the response independently, providing a p-value map (Lazar, 2008). The p-values can be adjusted for multiple comparisons to identify significant sub-regions of the tensor. Although this approach is widely used and appealing in its simplicity, clearly such independent screening approaches have key disadvantages relative to methods that take into account the joint impact of the overall . These authors contributed equally c 2017 Rajarshi Guhaniyogi and Shaan Qamar and David B. Dunson. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-362.html. Guhaniyogi et al. tensor simultaneously. Unfortunately, the literature on simultaneous analysis approaches is sparse. One naive approach is to simply vectorize the tensor and then use existing methods for high-dimensional regression. Such vectorization fails to preserve spatial structure, making it more difficult to learn low-dimensional relationships with the response. Efficient learning is of critical importance, as the sample size is typically massively smaller than the total number of cells. Alternative approaches within the regression framework include functional regression and two stage approaches. The former views the tensor as a discretization of a continuous functional predictor. Most of the literature on functional predictors focuses on 1D functions; Reiss and Ogden (2010) consider the 2D case, but substantial challenges arise in extensions to 3D due to dimensionality and collinearity among cells. Recently Wang et al. (2014) considered 3D regularized functional regression with Haar wavelet basis. The article is essentially frequentist in nature with simulation studies showing only the mean squared error and the percentage of correctly identified zero and nonzero elements. Additionally, the article reveals that functional regression is largely affected by the choice of proper basis functions. The second set of approaches, i.e. Two stage approaches first conduct a dimension reduction step, commonly using PCA, and then fit a model using lower dimensional predictors (Caffo et al., 2010). A clear disadvantage of such approaches is that the main principal components driving variability in the random tensor may have relatively limited impact on the response variable. Potentially, supervised PCA could be used, but it is not clear how to implement such an approach in 3D or higher dimensions. Zhou et al. (2013) propose extending generalized linear regression to include a tensor structured parameter corresponding to the measured tensor predictor. To circumvent difficulties with extensions to higher order tensor predictors, they impose additional structure on the tensor parameter, supposing it decomposes as a rank-R parafac sum (see Section 2.1). This massively reduces the effective number of parameters to be estimated. They develop a penalized likelihood approach where adaptive lasso penalties are be imposed on individual margins of the parafac decomposition, focusing on good point estimation for the tensor parameter. However, their method relies heavily on cross-validation for selecting tuning parameters which are sensitive to the tensor dimension, the signal-to-noise ratio (degree of sparsity) and the parafac rank. Given that there is no automatic selection procedure for the tuning parameters provided in Zhou et al. (2013), they have to be fed manually by the end user which is problematic for an unknown tensor regression problem. Of practical interest is a self calibrating procedure which adapts the complexity of the model to the data. We propose a principled method to effectively shrink unimportant cell coefficients to zero while maintaining accuracy in estimating important cell coefficients. Our framework gives rise to the automatic selection of tuning parameters, with carefully constructed shrinkage priors that naturally induce sparsity within and across components in the tensor factorization of the tensor coefficient for optimal region selection. In addition, the need for valid measures of uncertainty on parameter (predictive) estimates is crucial, especially in settings with low or moderate sample sizes, which naturally motivates our Bayesian approach. Recently Suzuki (2015) proposed a Bayesian tensor regression approach with naive Gaussian prior on the components of the tensor factorization of the tensor coefficients. In contrast, our proposed prior on the tensor coefficient is more sophisticated in the sense that it imparts shrinkage in three ways: at a global level, at a local level of Bayesian Tensor Regression individual parameters, and also provides shrinkage towards low rank decomposition of the tensor coefficient. Similarly, Bayesian tensor regression framework proposed in Goldsmith et al. (2014) uses binary indicators to determine whether a cell in the tensor predictor is predictive of the response. For a tensor predictor with 30 30 30 cells, such an approach requires to update 27000 binary indicators in each MCMC iteration and is deemed unsatisfactory due to mixing issues and poor inference. Our approach differs from image reconstruction literature as we do not model the distribution of the tensor X (Qiu, 2007). There is a considerable recent literature on frequentist tensor modeling in which one typically encounters time series (generally to study social networks or images evolving over time) with response at every time point is an array/tensor (Gerard and Hoff, 2015; Hoffet al., 2015). There is also a Bayesian literature that facilitates joint modeling of a large number of unordered categorical variables (Zhou et al., 2015). Our framework is fundamentally different from these approaches in the sense that these are all unsupervised tensor modeling approach while we propose a framework for supervised tensor regression. To the best of our knowledge, we are the first to propose a novel multiway shrinkage prior in Bayesian tensor regression framework (with scalar response on a tensor predictor) that accommodates shrinkage of the tensor coefficient for the appropriate identification of important cells in the tensor predictor. Besides, we offer strong posterior consistency results on Bayesian tensor regression framework with multiway shrinkage prior. Remainder of the manuscript evolves as follows. In Section 2, we propose the basic framework of the tensor regression model with a scalar response, vector predictors and a tensor predictor. Section 3 characterizes desirable criteria for a multiway shrinkage prior and proposes a novel multiway shrinkage prior on the tensor coefficient. Sections 4 and 5 provide theoretical results on the convergence of posterior distribution under the mutiway shrinkage prior and details on posterior computation respectively. Various simulation studies with 2D and 3D tensor predictors are presented in Sections 6 and 7 respectively to study effectiveness of the Bayesian tensor regression under various degrees of sparsity and signal strength. Section 8 is devoted to a real brain connectome data analysis using the proposed Bayesian tensor regression model along with its competitors. The manuscripts ends with a discussion. 2. Tensor Regression This section provides details on the tensor regression model. 2.1 Basic Notation Let β1 = (β11, . . . , β1p1) and β2 = (β21, . . . , β2p2) be vectors of length p1 and p2, respectively. The vector outer product β1 β2 is a p1 p2 matrix with (i, j)-th entry β1i β2j. A D-way outer product between vectors βj = (βj1, . . . , βjpj), 1 j D, is a p1 p D multi-dimensional array denoted B = β1 β2 βD with entries (B)i1,...,i D = QD j=1 βjij. Define a vec(B) operator as stacking elements of this D-way tensor into a column vector of length QD j=1 pj. From the definition of outer products, it is easy to see that vec(β1 β2 βD) = βD β1. As a higher order generalization of matrix singular value decomposition, Tucker decomposition of a D-way tensor B D j=1ℜpj is Guhaniyogi et al. often considered. The Tucker decomposition (Kolda and Bader, 2009) can be expressed as r D=1 λr1,...,r Dβ(r1) 1 β(r2) 2 β(r D) D (1) where β(rj) j is a pj dimensional vector, 1 j D, and Λ = (λr1,...,r D)R1,...,RD r1,...,r D=1 is referred to as the core tensor. If one considers {β(rj) j ; 1 rj Rj, 1 j D} as factor loadings and λr1,...,r D to be the corresponding coefficients, then the Tucker decomposition may be thought of as a multiway analogue to factor modeling. A rank-R parafac decomposition emerges as a special case of Tucker decomposition 1 when R1 = R2 = = RD = R and λr1,...,r D = I(r1 = r2 = = r D). In particular, B D j=1ℜpj assumes a rank-R parafac decomposition if r=1 β(r) 1 β(r) D (2) where β(r) j , 1 j D and 1 r R are the pj dimensional margins . The parafac decomposition is more widely used due to its relative simplicity. 2.2 Model Framework Let y Y denotes a response variable, with z X ℜp and X D j=1ℜpj scalar and tensor predictors, respectively. We consider a tensor regression model having a general form y f α + z γ + X, B , σ , X, B = vec(X) vec(B), (3) where f(µ, σ) is a family of distributions having location µ and scale σ, γ is a p 1 coefficient for scalar preditors and B D j=1ℜpj is the tensor parameter corresponding to measured tensor predictor X. We focus more specifically on the Gaussian linear model case with y = α + z γ + X, B + ϵ, ϵ N(0, σ2). (4) The coefficient tensor B has QD j=1 pj elements, necessitating substantial dimensionality reduction. A rank-1 parafac decomposition assumes B = β1 βD and vec(B) = βD β1. This reduces to modeling y = α + z γ + β 1Xβ2 when D = 2, corresponding to the bilinear model considered in Hung and Wang (2013). Since only the single parameter vector βj captures signal along the jth dimension, a rank-1 assumption severely limits flexibility, ruling out interactions among dimensions. Following Zhou et al. (2013), we use a more flexible rank-R parafac decomposition for B = PR r=1 β(r) 1 β(r) D introduced in (2) with β(r) j ℜpj, 1 j D, and 1 r R. Expression (4) then becomes y = α + z γ + D X, r=1 β(r) 1 β(r) D E + ϵ = α + z γ + X (i1,...,i D) (X)i1,...,i D(B)i1,...,i D + ϵ (5) Bayesian Tensor Regression where voxel (X)i1,...,i D of the tensor predictor has corresponding parameter (B)i1,...,i D = j=1 β(r) j,ij, (i1, . . . , i D) VB = D j=1{1, . . . , pj}. (6) The model is therefore nonlinear in the parameters defining B. A hierarchical specification is completed by placing priors over unknown model parameters. While placing priors over α and γ is straightforward, Section 3.2 focuses on specification of the prior over tensor parameters which is nontrivial and one of the main contributions of this work. Under the assumed rank-R parafac decomposition for B, model (5) requires estimating p+2+R PD j=1 pj as opposed to p+2+QD j=1 pj parameters for the unstructured vectorized (saturated) model. As we are interested in identifying geometric sub-regions of the tensor across which coefficients are not close to zero, with the remaining elements being very close to zero, one wonders whether such dramatic dimension reduction retains sufficient flexibility. Finally, we would like to accurately estimate coefficient values in these subregions. Consistent with our theoretical analysis in Section 4, extensive simulation studies in Section 6 confirm our ability to accomplish these goals. 2.3 Model Identifiability From model (5) it is clear that only voxel-level coefficients are identified and not the individual tensor margins defining their product-sum given in (6). In the tensor setting, identifiability restrictions are understood in light of the following indeterminacies: 1. Scale indeterminacy: for each r = 1, . . . , R, define λr = (λ1r, . . . , λDr) such that QD j=1 λjr = 1. Then replacing β(r) j by λjrβ(r) j leaves the tensor parameter B unaltered. 2. Permutation indeterminacy: PR r=1 D j=1β(r) j = PR r=1 D j=1β(P(r)) j for any permutation P( ) of {1, 2, . . . , R}. In particular, this implies that D j=1β(r) j are not identifiable for r = 1, . . . , R. 3. Orthogonal transformation indeterminacy (D = 2 only): for any orthonormal matrix O, one has (β(r) 1 O) (β(r) 2 O) = β(r) 1 β(r) 2 . For D > 2, imposing the following (D 1)R constraints ensures identifiability of the margin parameters comprising the rank-R parafac decomposition: β(r) j,1 = 1, 1 j < D, 1 r R, and β(1) D,1 > > β(R) D,1. (7) In the context of Bayesian tensor regression, our emphasis is on accurate estimation and inference on tensor parameter B, and on state-of-the-art predictive performance. Importantly, B is always identifiable even if the tensor margins β(r) j s are not, hence we avoid imposing identifiability restrictions on the latter. As is evident from our simulation studies, non-identifiability of β(r) j does not appear to cause convergence issues and in-fact, simplifies the design of efficient computational algorithms. Guhaniyogi et al. 3. Multiway Shrinkage Priors This Section outlines the novel multiway shrinkage prior on the tensor coefficient. 3.1 Vector Shrinkage Priors There has been recent interest in high-dimensional regression with vector predictors, choosing priors which shrink small coefficients towards zero while minimizing shrinkage of large coefficients. Many of these priors can be expressed as a global-local (GL) scale mixtures (Polson and Scott, 2012) with θj N(0, ψjτ), ψj g, τ h, (8) where (θ1, . . . , θp) is a coefficient vector, τ is a global scale and ψj is a local-scale. When g is a mixture of two components, with one concentrated near zero and the other away from zero, a spike and slab prior is obtained. Many other choices of g and h have been considered. Although the GL family is widely used and versatile, Bhattacharya et al. (2015) note advantages in drawing the local scales jointly. In particular, they propose to let θj DE( |φjτ), (φ1, ..., φp) Dirichlet(a, . . . , a), τ h. where DE( ) denote the double-exponential distribution. For small a and large p, the Dirichlet(a, . . . , a) prior has the property of favoring many values close to zero with a few much larger values, but with P j φj = 1. Though we draw motivation from literature on vector shrinkage priors, our goal of proposing a shrinkage prior on tensor parameter B is fundamentally more challenging as discussed in forthcoming sections. 3.2 Multiway Priors We propose a new class of multiway shrinkage priors in the generalized linear model setting with tensor valued predictors. Assuming tensor parameter B admits a rank-R parafac decomposition, model (5) results in cell-level coefficients that are a nonlinear function of the corresponding tensor margin parameters (see (6)). Moreover, this implies simultaneous shrinkage on each of the QD j=1 pj cell coefficients as imposed by the prior over R PD j=1 pj parameters. This necessitates careful prior specification on the tensor margins {β(r) j ; 1 j D, 1 r R} such that the induced cell-level prior has adequate tails so as to prevent over shrinkage. There are a number of desirable characteristics for a multiway prior on the tensor margins. The proposed multiway shrinkage prior must have a structure that facilitates efficient and reliable model fitting. In addition, it is important to ensure that 1. For each r = 1, . . . , R, β(r) 1,i1, . . . , β(r) D,i D and β(r) 1,k1, . . . , β(r) D,k D are equal in distribution, for any (i1, . . . , i D), (k1, . . . , k D) VB VB and (i1, . . . , i D) = (k1, . . . , k D). This is to ensure that (B)i1,...,i D and (B)k1,...,k D have the same distribution apriori. 2. Shrinkage towards a low rank decomposition, with the model adapting to the complexity and signal in the data, effectively deleting unnecessary dimensions. 3. The prior should favor recovery of contiguous geometric subregions of the tensor across which the cell observations are predictive of the response. Bayesian Tensor Regression 3.3 The Multiway Dirichlet GDP Prior There are many ways of specifying priors over tensor margins β(r) j to satisfy the criteria listed. We propose a particular choice called the multiway Dirichlet generalized double Pareto (M-DGDP) prior. This prior induces shrinkage across components in an exchangeable way, with global scale τ Ga(aτ, bτ) adjusted in each component as τr = φrτ for r = 1, . . . , R, where Φ = (φ1, . . . , φR) Dirichlet(α1, . . . , αR) encourages shrinkage towards lower ranks in the assumed parafac decomposition. In addition, W jr = diag(wjr,1, . . . , wjr,pj), j = 1, . . . , D and r = 1, . . . , R are margin-specific scale parameters for each component. The hierarchical margin-level prior is given by β(r) j N 0, (φrτ)W jr , wjr,k Exp(λ2 jr/2), λjr Ga(aλ, bλ). (9) Collapsing over element-specific scales, notice that β(r) j,k|λjr, φr, τ iid DE(λjr/ φrτ), 1 k pj. Prior (9) induces a GDP prior on the individual margin coefficients which in turn has the form of an adaptive Lasso penalty (Armagan et al., 2013a). Flexibility in estimating Br = {β(r) j ; 1 j D} is accommodated by modeling within-margin heterogeneity via element-specific scaling wjr,k. Common rate parameter λjr shares information between margin elements, encouraging shrinkage at the local scale. The framework adpoted by Zhou et al. (2013) for Frequentist tensor regression starts by assuming the true parafac rank and relying on shrinkage through a global parameter for estimating the tensor coefficient. In contrast, M-DGDP prior proposes joint shrinkage on the global and local component parameters to achieve improved inference and estimation. Though rank-selection is not our purview, our prior also accomodates dimension reduction by favoring low-rank factorizations as discussed below. 4. Posterior Consistency for Tensor Regression This Section details out theoretical properties of the tensor regression framework. 4.1 Notation and Framework We establish convergence results for tensor regression model (5) under the simplifying assumptions that the intercept is omitted by centering the response and the error variance is σ2 = 1. Since our main focus is on the tensor coefficient, we assume coefficients for ordinary scalar covariates to be known. Without loss of generality, we assume γ = (0, . . . , 0). We consider an asymptotic setting in which the dimensions of the tensor grow with n. This paradigm attempts to capture the fact that tensor dimension Q j pj,n is typically substantially larger than sample size. This creates theoretical challenges, related to (but distinct from) those faced in showing posterior consistency for high dimensional regression (Armagan et al., 2013b) and multiway contingency tables (Zhou et al., 2015). Suppose the data generating model comes from the same class of models where the fitted model belongs to, i.e., having true tensor parameter B0 n D j=1ℜpj,n, error variance σ2 0 = 1. We also assume that the true tensor coefficient B0 n generating the data admits a Guhaniyogi et al. rank-R PARAFAC decomposition as below r=1 β0(r) 1,n β0(r) D,n, β0(r) j,n = (β0(r) j,n,1, . . . , β0(r) j,n,pj,n) ℜpj,n. In addition, define F n, F 0 n ℜR PD j=1 pj,n as the vectorized parameters: F n = vec β(1) 1,n, , β(R) 1,n , , β(1) D,n, , β(R) D,n F 0 n = vec β0(1) 1,n , , β0(R) 1,n , , β0(1) D,n, , β0(R) D,n . Define a Kulback-Leibler (KL) neighborhood around the true tensor B0 n as i=1 KL(f(yi|B0 n), f(yi|Bn)) < ϵ Denote KL(f(yi|B0 n), f(yi|Bn)) as KLi. The KL-distance between N(µ1, σ2 1) and N(µ2, σ2 2) is log(σ2/σ1)+ σ2 1+(µ1 µ2)2 /2σ2 2 1 2, so it follows KLi = KL N Xi, Bn , 1 , N Xi, B0 n , 1 2 Xi, B0 n Xi, Bn 2. Hence, a KL neighborhood of radius ϵ around B0 n can be reexpressed as Bn = Bn : 1 2n Pn i=1 Xi, B0 n Xi, Bn 2 < ϵ . Further, let πn and Πn denote prior and posterior densities with n observations, respectively, and Bcn f(yn|Bn)πn(F n) R f(yn|Bn)πn(F n) , with yn = (y1, . . . , yn) and f(yn|Bn) is the density of yn under model (5). Posterior consistency is established by showing that Πn (Bc n) 0 under B0 n a.s. as n . (10) 4.2 Main Result Our main theorem is that (10) holds under a simple sufficient condition on the prior and the tensor predictors. Theorem 1 Let ζn = n 1+ρ3 2 (ρ3 > 0), Mn = 1 i=1 ||Xi||2 2. Given Lemma 6 in Appendix A, for any ϵ > 0, Πn(Bn : 1 n Pn i=1 KLi > ϵ) 0 a.s. under B0 n, for the prior πn(Bn) that satisfies Bn : ||Bn B0 n||2 < 2η 3Mnζn > exp( dn), for all large n (11) for any d > 0 and η < ϵ 32 d. That is, the model is posterior consistent when (11) holds. Lemma 6 in Appendix A verifies the existence of exponentially-consistent tests. The proof of the Lemma and Theorem are provided in the Appendix A. The proposed multiway shrinkage prior satisfies (11) and hence leads to posterior consistency under the following Theorem. Bayesian Tensor Regression Theorem 2 For fixed constants H1, H2, M1, ρ1 and ρ2 > 0, the M-DGDP prior (9) on Bn satisfies (11), i.e. yields posterior consistency under conditions: (a) H1nρ1 < Mn < H2nρ2 (b) supl=1,...,pj,n |β0(r) j,n,l| < M1 < , for all j = 1, . . . , D, r = 1, . . . , R (c) PD j=1 pj,n log(pj,n) = o(n). Remark 3 Condition (a) in Theorem 2 gives an upper and lower bound on the sum of the Frobenius norms of tensor predictors. In tensor predictors with {0, 1} entries (often observed for white/grey matter f MRI data), this condition simply imposes a restriction on the minimum and maximum number of 1s in a tensor predictor as a function of the sample size n. Condition (b) is mild, assuming the supremum of all entries in the tensor margins are bounded. Finally, condition (c) in Theorem 2 requires that P j=D pj,n grows sub-linearly with sample size n. However, note that the number of cells QD j=1 pj,n in the tensor can grow at a rate much faster than the sample size n; hence, the modeling framework allows large tensor covariates even for moderate sample sizes. Of course, condition (c) trivially holds when the dimension of the tensor is kept fixed as n grows. Remark 4 Our Section on posterior consistency is based on the assumption that γ = (0, . . . , 0) and error variance 1, however the results are trivially extendable to cases with unknown γ and σ2. In such a generalization, it is important to note that one would need to assume the dim(γ) is fixed and not growing with n. 4.3 Prior Hyper-parameter Elicitation The marginal distribution of cell coefficients (6) under the proposed M-DGDP prior (9) is not available in closed-form. To assess how a shrinkage prior on the margins induces prior on cell coefficients of the tensor, we turn to an expression for the cell-level variance: var(Bi1,...,i D) = E var R X j=1 β(r) j,id | W , Λ, Φ, τ r=1 φD r Eτ{τ D} EΛ ,r = Γ(α0 + D) Γ(α0) b D τ (2Cλ)D EΦ The following Lemma provides lower and upper bounds on the variance that can be useful for elicitation of default hyperparameter values. Lemma 5 Under M-DGDP shrinkage prior (9) and for D > 1, if α1 = = αR = c/R, c N+, with constants Cλ = b2 λ/ (aλ 1)(aλ 2) , aλ > 2, Aτ = exp((D2 3D)/2), then the cell-level variance is bounded below by RαD 1 (2Cλ/bτ)D and above by Aτ(2Cλ/bτ)D exp(α1 RD). Guhaniyogi et al. G G G G G G G (a) α = (0.2, 0.2, 0.2) (b) α = (0.3, 0.3, 0.3) (c) α = (0.5, 0.5, 0.5) Figure 1: Visualization of points in the S2 probability simplex for 500 independent realizations of Dirichlet(α). As α 0, points increasingly tend to concentrate around vertices of the SR 1 simplex. This notion of sparsity is made precise in Yang and Dunson (2014). Hyperparameters of the Dirichlet component in multiway prior (9) play a key role in controlling dimensionality of the model, with smaller values favoring more componentspecific scales τr 0, thus effectively collapsing on a low-rank tensor factorization. Figure 1 plots realizations from the Dirichlet distribution when R = 3 for different concentration parameters2 α. A discrete uniform prior is placed on α over a grid, A. By default, grid values are chosen to be 10 equally spaced values in [R D, R 0.10], letting the data tune this parameter according to the degree of sparisty present. Armagan et al. (2013a) study various choices of (aλ, ζ = bλ/aλ) that lead to desirable shrinkage properties, such as Cauchy-like tails for β(r) j,k while retaining Laplace-like shrinkage near zero. Empirical results from simulation studies across a variety of settings in Section 6 reveal no strong sensitivity to choices for hyper-parameters aλ, bλ. From Lemma 5, setting aλ = 3 and bλ = 2D aλ avoids overly narrow variance of the induced prior on tensor elements, Bi1,...,i D. Table 1 provides various quantiles of the induced prior on these elements under these default hyperparameter settings as a function of the parafac rank-R and tensor dimension D. 5. Posterior Computation and Model Fitting Letting y ℜdenote a response, and z ℜp, X D j=1ℜpj predictors, we let y|γ, B, σ N z γ + X, B , σ2 r=1 Br, Br = β(r) 1 β(r) D σ2 πσ, γ πγ, β(r) j πβ. 2. For simplicity we assume α1 = = αR = α. Bayesian Tensor Regression R 5% 25% 50% 75% 95% 1 0.001 0.011 0.057 0.254 1.729 5 0.004 0.040 0.164 0.595 3.332 10 0.005 0.058 0.237 0.852 4.635 1 0.000 0.001 0.010 0.072 0.917 5 0.000 0.009 0.061 0.341 3.382 10 0.001 0.017 0.111 0.608 5.996 Table 1: Percentiles for |Bi1,...,i D| under the M-DGDP prior with default aλ = 3, bλ = 2D aλ, bτ = αR1/D (v = 1) and α = 1/R. Statistics are displayed as the parafac rank-R vary and dimension D of the tensor vary. The noise variance is modeled using a conjugate inverse-gamma prior, σ2 IG(v/2, vs2 0/2), with v = 2 and s2 0 chosen by default so Pr(σ2 1) = 0.95 assuming a centered and scaled response. Regression coefficients are given a conjugate normal prior γ N(0, σ2Σ0γ) and the tensor predictor is normalized over all cells to have mean zero and variance 1, allowing one to assume default values for hyper-parameters in the proposed multiway prior. 5.1 Posterior Computation The proposed multiway prior (9) leads to Gibbs sampling scheme for most parameters of the tensor regression model (12). We rely on marginalization and blocking to reduce autocorrelation for β(r) j , wjr; 1 j D, 1 r R , (Φ, τ), (γ, σ) , drawing in sequence from [α, Φ, τ|B, W], [B, W|Φ, τ, γ, σ, y] and [γ, σ|B, y] as follows: (1) Sample [α, Φ, τ|B, W ] compositionally as [α|B, W ][Φ, τ|α, B, W ]: (a) Sample from the conditional distribution of Dirichlet concentration parameter [α|B, W ] via griddy-Gibbs: form a reference set by drawing M samples from [Φ, τ|α, B, W ] for each α A. Set wj,l = π(B|α, Φl, τl, W )π(Φl, τl|α), 1 l M, p(α|B, W ) = π(α) PM l=1 wj,l/M, and Pr(α = αj| ) = p(αj|B, W )/ P α A p(α|B, W ). (b) Sample component-specific scales as [Φ, τ|α , B, W ] = [Φ|B, W ][τ|Φ, B, W ]; define p0 = PD j=1 pj, and recall aτ = PR r=1 αr = Rα and bτ = α(R/v)1/D (see Section 3.3), then draw ψr gi G(α p0/2, 2bτ, 2Cr), Cr = PD j=1 β(r) T j W 1 jr β(r) j , and set φr = ψr/ PR l=1 ψl in parallel for 1 r R (see Appendix A for definition of gi G ) draw τ gi G(aτ Rp0/2, 2bτ, 2 PR r=1 Dr), Dr = Cr/φr. (2) Sample from (β(r) j , wjr, λjr); 1 j D, 1 r R |Φ, τ, γ, σ, y using a back-fitting procedure to produce a sequence of draws from the margin-level conditional distributions across components. For r = 1, . . . , R and j = 1, . . . , D, sample from conditional distribution [(β(r) j , wjr, λjr)|β(r) j, B r, Φ, τ, γ, σ, y], where β(r) j = {β(r) l , l = j} and B r = B \ Br; (a) draw [wjr, λjr|β(r) j , φr, τ] = [wjr|λjr, β(r) j , φr, τ][λjr|β(r) j , φr, τ]: Guhaniyogi et al. draw λjr Ga aλ + pj, bλ + ||β(r) j ||1/ φrτ ; and draw wjr,k gi G 1 2, λ2 jr, β2 (r) j,k /(φrτ) independently for 1 k pj (b) draw β(r) j N(µjr, Σjr): define h(r) i,j,k = Pp1,...,p D d1=1,...,d D=1 I(dj = k) xd1,...,d D Q l =j β(r) l,il , H(r) i,j = (h(r) i,j,1, . . . , h(r) i,j,pj) , yi = yi z iγ P l =r Xi, Bl for 1 i n; then Σjr = H(r) T j H(r) j /σ2 + W 1 jr /(φrτ) 1, µjr = Σjr H(r) j y/σ2 (3) Sample [γ, σ|B, y] = [γ|σ, y][σ2| y]; define yi = yi Xi, B for 1 i n, then (a) draw σ2 IG(aσ, bσ), aσ = (n + v)/2, bσ = vs2 0 + || y||2 2 y T Zµγ /2 (b) draw γ N µγ, σ2Σγ , Σγ = ZT Z + Σ 1 0γ 1, µγ = ΣγZT y. 6. Simulation Studies To illustrate finite-sample performance of the proposed multiway priors, we show results from a simulation study with various dimensionality (p, R) and define b = max |B0 i1,...,i D| as the maximum signal size. Throughout, set pj = p, true error variance σ2 0 = 1 and b = 1 for convenience. In addition, we set the true vector coefficient γ0 = (0, . . . , 0) and focus exclusively on inference for tensor parameter B. The following simulated setups are considered: 1. Generated tensor: We construct tensor parameters having rank R0 = {3, 5} with p = {64, 100} and D = 2. 2. Ready made tensor: We use three tensor (2D) images without generating them from a parafac decomposition with known rank. Five replicated datasets with n = 1000 are generated according to (12) with xi1,...,i D N(0, 1). The tensor parameters considered are shown in Figure 2, where the magnitude of the non-zero cells is b = 1. Examples are chosen to demonstrate recovery of cell-level coefficients across varying degrees of complexity (dimension, parafac rank) and sparsity (% of non-zero cells; see Figure 2). The performance of our method with M-DGDP prior (9) (BTR) is compared with (i) frequentist tensor regression with penalization (FTR)(Zhou et al., 2013); and (ii) Lasso (on the vectorized tensor predictor). Comparisons are based on (a) cell mean squared estimation error (true non-zero, true zero, and overall); and (b) frequentist coverage (and length) of 95% credible intervals. By default, BTR uses R = 10 as an upper bound on the tensor parafac rank, minimizing effects of extra dimensions automatically and concentrating on a lower rank coefficient tensor as MCMC proceeds. We ran FTR with various choices of R and found equivalent performance for R between 3 to 15, thus the default value is set to R = 10 to ensure comparability with the BTR fitting. MCMC for BTR was run for 1300 iterations, with a 300 iteration burn-in and remaining samples thinned by 5. The latter was chosen to keep runtime between BTR and FTR similar for 3D simulation studies in Section 7. There, the total runtime using non-optimized R code on an x86 64 Intel(R) Core(TM) i7-3770 is between 6.2 - 7.5 hours. In simulation studies, the tuning parameter in FTR is selected Bayesian Tensor Regression R3-ex R5-ex Shapes Eagle Palmtree Horse |cell0| > 0 BTR 0.0230.00 0.0210.00 0.2430.01 0.2260.02 0.3160.01 0.2780.01 FTR 0.0350.00 0.0300.00 0.4150.03 0.3540.03 0.4350.02 0.3910.03 Lasso 0.6280.02 0.8220.03 0.6190.07 0.6650.03 0.6980.03 0.8880.01 |cell0| = 0 BTR 0.0110.00 0.0140.00 0.0710.00 0.0850.00 0.1000.01 0.1370.00 FTR 0.0220.00 0.0200.00 0.1270.02 0.1630.03 0.1590.00 0.2150.02 Lasso 0.0900.00 0.0980.02 0.0810.01 0.0970.00 0.0940.01 0.1550.02 Overall BTR 0.013 0.015 0.093 0.102 0.131 0.172 FTR 0.023 0.021 0.164 0.184 0.196 0.257 Lasso 0.187 0.288 0.179 0.204 0.217 0.407 Table 2: Comparison of cell estimation as measured by root mean squared error (RMSE) for the six 2D tensor images portrayed in Figure 2. Results from FTR (Zhou et al., 2013) use R = 10. For BTR, R = 10 is used as an upper bound to the tensor parafac rank. Subscript shows the standard error over a few replicated simulations. over a grid of values to minimize RMSE for the tensor predictor3. In real applications, cross validation instead would be used to select the tuning parameter that results in lowest hold-out predictive RMSE for FTR. Assuming 10-fold cross validation were used over a vector of 20 tuning parameters, FTR would have a runtime of approximately 8 hours. It also needs to be mentioned that the convergence of parameters in BTR is extremely rapid with an average effective sample size (ESS) 600 over 1000 iterations. Cell-level RMSE reported in Table 2 demonstrates that our method (BTR) consistently out performs FTR. When the tensor parameter has a low-rank parafac decomposition ( R3ex and R5-ex ), BTR and FTR perform best, with BTR having lower RMSE on both true zero and non-zero cells. This validates empirically prior (9) along with our suggested default hyper-parameter choices in Section 3. In particular, the tensor coefficient in BTR has three different types of shrinkage parameters: global, local and shrinkage across ranks. Such a careful construction of shrinkage prior on B adapts to varying degrees of sparsity, shrinking many tensor coefficients close to zero while accurately estimating nonzero cells. FTR shrinkage being dependent on only local parameters suffers in terms of both inferential and predictive performances. Table 3 demonstrate that BTR yields 95% credible intervals with good frequentist coverage across each of the simulated settings, both overall as well as on the true non-zero coefficients. Our method is one of the first to offer uncertainty quantification for tensor valued predictors.Finally, Table 4 provides evidence of the robustness of our method to increasing predictor dimension using two of the simulated examples. In both cases, RMSE for FTR worsens considerably on the true zero coefficients. For the true nonzero cells, RMSE increases for both methods as the margin dimension increases; however on a relative basis, FTR worsens considerably more, while on an absolute scale, BTR remains the clear winner. 3. To choose initial values, a preliminary analysis was run with a coarsened 16 16 image. Guhaniyogi et al. 10 20 30 40 50 60 10 20 30 40 50 60 (a) R3-ex (7.0%) 10 20 30 40 50 60 10 20 30 40 50 60 (b) R5-ex (11.0%) 10 20 30 40 50 60 10 20 30 40 50 60 (c) Shapes (6.8%) 10 20 30 40 50 60 10 20 30 40 50 60 (d) Eagle (7.4%) 10 20 30 40 50 60 10 20 30 40 50 60 (e) Palmtree (8.0%) 10 20 30 40 50 60 10 20 30 40 50 60 (f) Horse (18.6%) Figure 2: Simulated data with 64 64 2D tensor images (p = 64, D = 2). Row 1: The first two images (from left) have a rank-3 and rank-5 parafac decomposition; the third image is regular , although does not have a low-rank parafac decomposition. Row 2: All three images are irregular, and do not have a low-rank parafac decomposition. Sparsity (% non-zero cells) are displayed in sub-captions. R3-ex R5-ex Shapes Eagle Palmtree Horse |cell0| > 0 coverage 0.9860.02 0.9460.02 0.7470.01 0.7310.04 0.6770.04 0.7950.02 Overall coverage 0.9950.01 0.9700.01 0.9650.00 0.9400.02 0.9480.02 0.9270.01 length 0.0660.01 0.0610.01 0.2900.00 0.3010.03 0.4100.03 0.5660.02 Table 3: Row 1: Average coverage of 95% posterior credible intervals for all the cells of B for which the true cell coefficient is nonzero. Row 2: Average coverage of 95% posterior credible intervals for all the cells of B. Row 3: Average length of 95% posterior credible intervals for all the cells of B. Subscripts show standard errors over replicated simulations. Bayesian Tensor Regression 10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 60 0.5 0.3 0.1 0.1 0.3 0.5 0.7 0.9 1.1 1.3 (e) Palmtree 10 20 30 40 50 60 10 20 30 40 50 60 0.6 0.4 0.2 0 0.2 0.4 0.6 0.8 1 1.2 1.4 Figure 3: Recovered images for the 64 64 2D tensor images in Figure 2 using our proposed BTR method. Here, R = 10 is used as an upper bound to the tensor parafac rank. |cell0| R5-ex Shapes 64 100 64 100 coverage > 0 0.9460.02 0.9910.01 0.7470.01 0.5900.06 length > 0 0.0610.01 0.0690.01 0.2900.00 0.2470.01 rmse > 0 0.0210.00 0.0320.01 0.2430.01 0.3200.03 rmse = 0 0.0140.00 0.0140.00 0.0710.00 0.0630.00 FTR rmse > 0 0.0300.00 0.3690.06 0.4150.03 0.5860.14 rmse = 0 0.0200.00 0.1110.02 0.1270.02 0.1350.02 Table 4: Sensitivity analysis of cell estimation error (RMSE) as the tensor dimension increases; here pj = p {64, 100} for the 2D tensor images R5-ex and Shapes . Guhaniyogi et al. 7. Simulated response with a real 3D brain image We analyze data containing 3D MRI images for 550 adolescents, with information such as age and sex available. Age and sex are treated as ordinary scalar covariates while 3D MRI images act as tensor covariates. Let X denote a 30 30 30 3D MRI image, Z1 be the age and Z2 be the sex of an individual. The response is simulated using y N Z γ + X, B0 , σ2 , where Z denotes (Z1, Z2) , γ R2 and B0 R30 30 30. We assume the true B0 is a rank 2 tensor, with B0 = a1 a2 a3 + b1 b2 b3. Initialization and standardization of predictors follow exactly as prescribed in Section 5. By varying ai s and bi s, the following cases with varying degrees of sparsity in the tensor parameter B0 are considered: Case 1: b1 = b2 = (0, . . . , 0, sin((1 : 15) π/4)), b3 = (sin((1 : 10) π/4), 0, . . . , 0), a1 = (0, . . . , 0, sin((1 : 10) π/4)), a2 = (0, . . . , 0, cos((1 : 15) π/4)), a3 = (sin((1 : 15) π/4), 0, . . . , 0). Case 2: b1 = b2 = (0, . . . , 0, sin((1 : 15) π/6)), b3 = (sin((1 : 20) π/6), 0, . . . , 0), a1 = (0, . . . , 0, sin((1 : 15) π/4)), a2 = (0, . . . , 0, cos((1 : 10) π/6)), a3 = (sin((1 : 15) π/6), 0, . . . , 0). Case 3: b1 = b2 = (0, . . . , 0, sin((1 : 20) π/6)), b3 = (sin((1 : 20) π/6), 0, . . . , 0), a1 = (0, . . . , 0, sin((1 : 10) π/4)), a2 = (0, . . . , 0, cos((1 : 20) π/4)), a3 = (sin((1 : 20) π/6), 0, . . . , 0). We implement BTR, FTR, and Lasso on the vectorized tensor. As before, we present results for FTR with R = 10 (See additional discussion in Section 6 on FTR default setup). Point estimates for coefficients corresponding to age and sex covariates are provided in Table 6. Table 5 summarizes RMSEs for the estimated tensor coefficients for each method. BTR shows at least a 15% improvement over FTR on simulated cases considered. Evidently BTR tends to outperform FTR and vectorized lasso by a greater margin in less sparse settings as well. Importantly, every parameter in BTR is auto-tuned, while the Tensor Reg toolbox used for FTR (Zhou et al., 2013) requires calibrating tuning parameter values specific to each setting. Note that rather than using cross validation, tuning parameters in these experiments were chosen to provide the lowest possible (most optimistic) RMSE for the tensor coefficient. FTR fixes R = 10 based on findings discussed in Section 6 while BTR sets R = 10 as an upper bound, concentrating on a lower dimension parafac rank via adaptive shrinkage. While vectorized lasso and FTR do not come equipped with parameter uncertainty estimates, Table 7 demonstrates how BTR consistently provides over 95% coverage across examples with varying degrees of sparsity. Finally, Table 8 provides a measure of mixing efficiency for a single MCMC run in each of the simulated cases considered (post burn-in over the remaining 1000 MCMC samples). All reported RMSE and coverage statistics were computed over these draws as well (thinning by 5 as previously discussed in Section 6). 8. Brain Connectome Data Analysis Brain connectome data are known by neuroscientists to have low signal-to-noise ratio, and effective modeling is often hindered as the sample size is often very limited compared to the Bayesian Tensor Regression Case 1 Case 2 Case 3 |cell0| > 0 BTR 0.39 0.30 0.34 FTR 0.46 0.41 0.43 Lasso 0.46 0.42 0.44 |cell0| = 0 BTR 0.04 0.14 0.10 FTR 0.00 0.00 0.00 Lasso 0.01 0.03 0.02 Overall BTR 0.13 0.20 0.17 FTR 0.15 0.22 0.18 Lasso 0.15 0.23 0.18 Table 5: Comparison of cell estimation as measured by root mean squared error (RMSE) for the coefficients in case 1,2, 3 corresponding to 3D tensor images. Results from both BTR and FTR (Zhou et al., 2013) use R = 10. Case 1 Case 2 Case 3 γ1 (truth = 0.5) BTR 0.57 0.54 0.33 FTR 0.46 0.85 0.95 γ2 (truth = 2.0) BTR 2.00 2.04 1.86 FTR 1.87 0.22 3.30 Table 6: Point estimates for age and sex coefficients under BTR and FTR Zhou et al. (2013). True parameter values are also provided. Guhaniyogi et al. Case 1 Case 2 Case 3 Coverage 0.98 0.96 0.99 Length 0.54 0.87 2.16 Table 7: Length and coverage of 95% credible intervals for BTR. Values are reported as averages over all voxels of the tensor coefficient. τ 2 Scalar predictor γi (Ave.) Tensor predictor βijk (Ave.) Case 1 (88% sparsity) lag-2 0.04 0.34 0.45 lag-4 0.01 0.11 0.22 Case 2 (82% sparsity) lag-2 0.08 0.33 0.46 lag-4 0.03 0.12 0.23 Case 3 (70% sparsity) lag-2 0.09 0.43 0.53 lag-4 0.02 0.30 0.40 Table 8: MCMC autocorrelation of the proposed BTR method on data studies of Section 7 generated using 3D brain MRI scans. number of cells in the tensor predictor. In this setting, developing well calibrated predictive models is thus of key importance. To investigate the performance of competing methods outside the class of f MRI brain image data, we present an analysis using a brain connectome dataset on structural connectivity. Data are extracted from diffusion tensor imaging (DTI) and consist of estimates of the number of fibers connecting pairs of brain regions for 109 individuals. For each individual, brain connections among 70 brain regions (following desikan atlas) are encoded by a 70 70 weighted adjacency matrix. The (i, j)-th off-diagonal entry in the adjacency matrix is the estimated number of fiber tracts connecting the i-th and j-th brain region. The data also provides 10 clinical covariates for every individual, including sex, age, openness, agreeableness and conscientiousness. The focus of this study is on developing a predictive model with Creativity Composite Index (CCI) as a response fitted against clinical covariates and a tensor covariate (i.e., the weighted adjacency matrix). Implementing FTR on this data using the Tensor Reg package was attempted, however, functions in the toolbox require n > R p. In this example, because n = 109 and p = 70, it is only possible therefore to fit FTR with R = 1, which has previously been found to perform poorly by Zhou et al. (2013). We therefore compare our proposed method (BTR) to the vectorized Lasso on the basis of their predictive performance. To assess the predictive performance, the sample of n = 109 individuals are divided into 10 folds. Both vectorized lasso and BTR are fitted on 9 folds as training data and the remaining fold as the hold out sample. This is carried out for each of the 10 folds and predictive inferences are obtained for both vectorized lasso and BTR. Table 9 reports the root mean squared error (RMSE) and correlation between observed and predicted responses, here average is over the 10 crossvalidated folds. For reference, average RMSE of the null model is 10.03 with a standard deviation of 2.40 across the Bayesian Tensor Regression Method avg(RMSE) sd(RMSE) avg(cov.) sd(cov.) avg(cor.) sd(cor.) Lasso 9.21 2.18 63% 20% 0.31 0.11 BTR 9.03 1.64 91% 10% 0.32 0.13 Table 9: mean and standard deviation of RMSEs and cor(yobs, ypred) of Lasso and BTR over 10 folds of the data. It also provides mean and standard deviation of coverages of Lasso and BTR over 10 folds of the data folds. Given the very high degree of sparsity in the connectome adjacency matrix, it is not surprising that the Lasso is competitive to BTR. However, note that BTR detects this signal with far fewer effective parameters as compared to vectorized lasso. Finally, we measure coverage of 95% predictive intervals for all competitors. The latter is of course a byproduct of our fully Bayesian approach (BTR), while for the Lasso we use a two-staged approach. First we estimate the regression coefficients and subsequently construct approximate 95% predictive intervals based on the normal response-model centered on the predictive mean with variance equal to the estimated residual variance. 9. Discussion This work develops a novel class of prior distributions on tensor valued predictors which substantially reduces dimensionality relative to vectorizing, providing a multiway analogue of vector shrinkage priors, and enabling high dimensional region selection. The prior on tensor coefficient constructed here imparts shrinkage of the tensor components at global and local levels, while also encouraging shrinkage towards low rank tensor decomposition. In contrast, existing penalization framework on the tensor coefficient shrinks only at the global level. Strong theoretical results are proved for the proposed class of multiway shrinkage priors and a computationally efficient MCMC algorithm is developed in the regression setting. We plan to extend methods developed here to settings where the measured response for each subject is binary (e.g., indicator of a heath outcome) or multivariate, i.e., y = (y1, ..., yd). Also, the current framework of Bayesian tensor regression fixes rank R of the PARAFAC decomposition at reasonably large value. It might be of interest to learn the PARAFAC rank R by adding a discrete prior distribution on R. In various longitudinal studies, monitoring the evolution in the predictor response relationship (i.e., changes to the scalar and tensor parameters) is of fundamental interest. One application involves subjects receiving various treatments (e.g., chemotherapy), with tensor valued predictors corresponding to m RI (f MRI) scans obtained at regular intervals over a period of time. In such settings it is of crucial importance to monitor the progression of the disease in response to the treatment being administered. We plan to extend our method to such settings. Guhaniyogi et al. Acknowledgement The authors would like to thank Joshua T. Vogelstein from Johns Hopkins university for gracefully allowing us to utilize their brain connectome dataset in the real data analysis of our proposed BTR method. MCMC algorithm The following derivations concern the M-DGDP prior (9) and the sampling algorithm outlined in Section 5.1. For step (1b) Recall from Section 3.3 that τ Ga(aτ, bτ) and Φ Dirichlet(α1, . . . , αR) and denote p0 = PD j=1 pj. Then, π(Φ|B, ω) π(Φ) Z 0 π(B|ω, Φ, τ)π(τ)dτ r=1 φαr 1 r i Z h (τφr) p0/2 exp 1 τφr j=1 ||βjr||2/(2ωjr) i τ aτ 1 exp( bττ)dτ 0 τ aτ R p0 τφr bτ(τφr) dτ with Cr = Pd j=1 ||βjr||2/(2ωjr). When aτ = PR r=1 αr, this contains the kernel of a generalized inverse Gaussian (g IG) distribution for (τφr). Recall: X f X(x) = gi G(p, a, b) xp 1 exp( (ax + b/x)/2). Following Lemma 9 in the Appendix B, for independent random variable Tr fr on (0, ), the joint density of {φr = Tr/ P r T r : r = 1, . . . , R} has support on SR 1. In particular, f(φ1, . . . , φR 1) = Z 0 t R 1 R Y r=1 fr(φrt) dt, φR = 1 X Substituting fr(x) x δr exp( Cr/x) exp( bτx) in the above expression yields f(φ1, . . . , φR 1) Z 0 τ R 1 R Y r=1 (φrτ) δr exp Cr (φrτ) bτ(φrτ) dτ r=1 φ δri Z 0 τ R P r δr 1 R Y r=1 exp Cr (φrτ) bτ(φrτ) dτ. Matching exponents between this expression and the preceding one implies (1) aτ R(p0/2) 1 = R P r δr 1, and (2) δr = 1 + p0/2 αr. Then, aτ = R(1 + p0/2) X r δr = R(1 + p0/2) (R + Rp0/2 X as previously noted. Hence, draws from [Φ|α, B, W ] are obtained by sampling Tr fr = gi G(αr p0/2, 2bτ, 2Cr) independently for r = 1, . . . , R, and renormalizing. Bayesian Tensor Regression Proof of lemma 5 Proof Using priors defined in (9), one has Cλ = Eλ(1/λ2) = b2 λ (aλ 1)(aλ 2) for any aλ > 2. In addition, the following inequalities are useful to bound the latter quantity: If α1 = c/R, c N+, Γ(α0+D)/Γ(α0) = α0(α0+1) (α0+D 1). Using the fact that log(x+1) x, x 0, one has log(α0)+ +log(α0+D 1) α0D 1+PD 2 k=1 D 2 k. Then αD 0 Γ(α0 + D)/Γ(α0) Aτ exp(α0D) where Aτ = exp( 1 + PD 2 k=1 D 2 k) = exp (D2 3D)/2 , D 2. Let ||x||r denote the Lrth norm. Trivially, ||Φ||D D 1; in addition, by H older s in- equality, for any x ℜk and 0 < r < p, one has ||x||p k 1 r 1 p ||x||r. In our setting, D 2. Taking r = 1 in the latter yields ||Φ||D D R (D 1). Recall α0 = PR r=1 αr = α1R. This leads to the lower and upper bounds for the prior voxel-level variance: var(Bi1,...,i D) (2Cλ)D (α1R)DR (D 1)/b D τ = (2Cλ)D αD 1 R/b D τ var(Bi1,...,i D) Aτ(2Cλ)D exp(α1 RD)/b D τ . Consistency proofs The proof of Theorem 1 relies in part on the existence of exponentially consistent tests. Definition An exponentially consistent sequence of test functions Φn = I(yn Cn) for testing H0 : Bn = B0 n vs. H1 : Bn = B0 n satisfies EB0 n(Φn) c1 exp( b1n), sup Bn Bcn EBn(1 Φn) c2 exp( b2n) for some c1, c2, b1, b2 > 0. Lemma 6 There exist an exponentially consistent sequence of tests Φn for testing H0 : Bn = B0 n vs. H1 : Bn = B0 n. Proof We begin by stating that Pn i=1 yi Xi, B0 n 2 χ2 n under B0 n. We choose the critical region of the test Φn as Cn = n Bn : 1 n Pn i=1 yi Xi, B0 n 2 > ϵ/4 o . Note that EB0 n(Φn) = PB0 n yi Xi, B0 n 2 > nϵ/4 exp nϵ , for large n, where the last line follows by simplifying Lemma 1 in Laurent and Massart (2000). Guhaniyogi et al. Now we will use the fact that i=1 (yi Xi, B0 n )2 i=1 (yi Xi, Bn )2 + 1 Xi, Bn B0 n 2 + 2 i=1 (yi Xi, Bn ) ( Xi, Bn B0 n) Xi, Bn B0 n 2 + 1 i=1 KLi + 2 i=1 (yi Xi, Bn ) Xi, Bn B0 n . Note that, under Bn, i=1 (yi Xi, Bn ) Xi, Bn B0 n N(0, 4 so that, 2 n Pn i=1(yi Xi, Bn ) Xi, Bn B0 n = q 4 n Pn i=1 KLi Z n, where Z N(0, 1). Thus, sup Bn Bcn EBn(1 Φn) = sup Bn Bcn PBn i=1 (yi Xi, B0 n )2 ϵ/4 sup Bn Bcn PBn i=1 KLi Z n + 1 i=1 (yi Xi, Bn )2 sup Bn Bcn PBn i=1 KLi Z n + 1 i=1 (yi Xi, Bn )2 sup Bn Bcn PBn 4 Pn i=1 KLi i=1 (yi Xi, Bn )2 4 Pn i=1 KLi 1 2n Pn i=1 KLi . Using this fact we have sup Bn Bcn EBn(1 Φn) sup Bn Bc n PBn 4 Pn i=1 KLi i=1 (yi Xi, Bn )2 + sup Bn Bc n PBn (Tn) sup Bn Bcn PBn i=1 KLi ϵ/4 i=1 (yi Xi, Bn )2 + sup Bn Bcn PBn i=1 (yi Xi, Bn )2 + PBn χ2 1 nϵ Bayesian Tensor Regression where the last line requires an application of Lemma 1 in Laurent and Massart (2000). Theorem 1 Proof Under Lemma 6 one has Bcn f(yn|Bn)πn(F n) R f(yn|Bn)πn(F n) = Bcn f(yn|Bn) f(yn|B0 n)πn(F n) R f(yn|Bn) f(yn|B0 n)πn(F n) = N D Φn + (1 Φn)N Note that we have PB0 n (Φn > exp( b1n/2)) EB0 n (Φn) exp(b1n/2) c1 exp( b1n/2). Therefore P n=1 PB0 n (Φn > exp( b1n/2)) < . Using Borel-Cantelli lemma PB0 n (Φn > exp( b1n/2)i.o.) = 0. It follows that Φn 0 a.s. (13) In addition, we have EB0 n((1 Φn)N) = Z (1 Φn) Z f(yn|Bn) f(yn|B0 n)πn(F n)f(yn|B0 n) Z (1 Φn)f(yn|Bn)πn(F n) sup Bn Bcn EBn(1 Φn) c2 exp( b2n). Using a similar technique as above, PB0 n ((1 Φn)N exp(nb2/2) > exp( nb2/4)i.o.) = 0 so exp(bn)(1 Φn)N 0 a.s.. (14) By Lemma 6 and (13)-(14) it is enough to show that M = exp( bn) R f(yn|Bn) f(yn|B0 n)πn(F n) for some b b = ϵ 256. We choose b = b. Consider the set Hn = n Bn : 1 n log h f(yn|B0 n) f(yn|Bn) i < η o , for some η which is chosen later. M exp( bn) Z n log f(yn|B0 n) f(yn|Bn) exp(( b η)n)πn(Hn). 1 n log f(yn|Bn) i=1 (yi Xi, Bn )2 + 1 i=1 (yi Xi, B0 n )2 # Guhaniyogi et al. Let yn = (y1, ..., yn) , Hn = ( X1, Bn , . . . , Xn, Bn ) and H0 n = X1, B0 n , . . . , Xn, B0 n . Then n ||yn H0 n||2 + ||yn Hn||2 < 2η 2||yn H0 n|| ||yn Hn|| ||yn H0 n|| + ||yn Hn|| ||yn H0 n|| 2 < 2η 2||yn H0 n||||H0 n Hn|| + ||Hn H0 n||2 < 2η n||H0 n Hn|| < 2η 3ζn , ||yn H0 n||2 < ζ2 n where A1n = n 1 n||Hn H0 n|| < 2η o , A2n = ||yn H0 n||2 < ζ2 n . We will show that PB0 n(A2n) = 1 for all large n. Assume ζn = n(1+ρ3)/2, ρ3 > 0 so that ζ2 n > 8n for all large n. Then, PB0 n(A 2n) = PB0 n(χ2 n > ζ2 n) exp( ζ2 n/2). Therefore, using Borel-Cantelli lemma PB0 n(A 2n i.o.) = 0. Hence PB0 n(A2n) = 1 for all large n. It is enough to bound πn(A1n). Let Mn = 1 q Pn i=1 ||Xi||2 2. Now use the fact that 1 n||Hn H0 n|| = 1 q Pn i=1( Xi, Bn B0 n )2 1 n q Pn i=1 ||Xi||2 2 ||Bn B0 n||2 to conclude ||Bn B0 n||2 < 2η 3Mnζn By (11) one has πn(A1n) πn ||Bn B0 n||2 < 2η 3Mnζn exp( dn) and hence M exp ( b η d)n as n proving the result. Theorem 2 Proof Define g : R R s.t. g(κ) = RκD + κD 1 D X r=1 ||β0(r) j,n ||2 + + κ l =j ||β0(r) l,n ||2. Let κn > 0 be s.t. g(κn) = 2η 3Mnζn . Note that by Decarte s rule of sign, the equation g(κ) 2η 3Mnζn = 0 has a unique positive root. Further 1 κn < 1 + max i=1,...,D 3 P j1 = =ji PR r=1 l=1 ||β0(r) jl,n||2 κn < 1 + max 2η 3Mnζn R, max i=1,...,D P j1 = =ji PR r=1 l=1 ||β0(r) jl,n||2 Bayesian Tensor Regression by Lemma 8 in Appendix B. Using Lemma 7 in Appendix B, it is easy to see that n ||β(r) j,n β0(r) j,n ||2 κn, j = 1, ..., D; r = 1, ..., R o ||Bn B0 n||2 < 2η 3Mnζn Using (15), πn ||Bn B0 n||2 < 2η 3Mnζn πn n ||β(r) j,n β0(r) j,n ||2 κn, j = 1, ..., D; r = 1, ..., R o . Note that πn n ||β(r) j,n β0(r) j,n ||2 κn, j = 1, ..., D; r = 1, ..., R o |{wjr,l}pj,n l=1 , {λjr}D,R 1 j,r=1 , {φr}R 1 r=1 , τ r=1 πn ||β(r) j,n β0(r) j,n ||2 κn|{wjr,l}pj,n l=1 , {λjr}D,R 1 j,r=1 , {φr}R 1 r=1 , τ Therefore, it is enough to bound πn(||β(r) j,n β0(r) j,n || κn, j = 1, ..., D; r = 1, ..., R). For j = 1, ..., D, r = 1, ..., R, πn(||β(r) j,n β0(r) j,n || κn|{wjr,l}pj,n l=1 , λjr, {φr}R 1 r=1 , τ) |β(r) j,n,l β0(r) j,n,l| κn pj,n |{wjr,l}pj,n l=1 , λjr, {φr}R 1 r=1 , τ 2κn p2pj,nπwjr,lφrτ |β0(r) j,n,l|2 + κ2 n/pj,n wjr,lφrτ where the last step follows from the fact that R b a e x2/2dx e (a2+b2)/2(b a). Thus, πn(||β(r) j,n β0(r) j,n || κn|λjr, {φr}R 1 r=1 , τ) = E h πn(||β(r) j,n β0(r) j,n || κn|{wjr,l}pj,n l=1 , λjr, {φr}R 1 r=1 , τ) i !pj,n pj,n Y 1 wjr,l exp |β0(r) j,n,l|2 + κ2 n/pj,n wjr,lφrτ 2κnλ2 jr 2 p !pj,n pj,n Y 1 wjr,l exp |β0(r) j,n,l|2 + κ2 n/pj,n wjr,lφrτ λ2 jrwjr,l Guhaniyogi et al. Use the change of variable 1 wjr,l = zjr,l and the normalizing constant from the inverse Gaussian density to deduce 1 wjr,l exp |β0(r) j,n,l|2 + κ2 n/pj,n wjr,lφrτ λ2 jrwjr,l z3 jr,l exp (|β0(r) j,n,l|2 + κ2 n/pj,n) φrτ zjr,l λ2 jr 2zjr,l 2 |β0(r) j,n,l|2 + κ2n/pj,n (19) can be written as πn(||β(r) j,n β0(r) j,n || κn|λjr, {φr}R 1 r=1 , τ) 2κnλ2 jr 2 p !pj,n pj,n Y 2 |β0(r) j,n,l|2 + κ2n/pj,n 2 |β0(r) j,n,l|2 + κ2n/pj,n πn(||β(r) j,n β0(r) j,n || κn|{φr}R 1 r=1 , τ) !pj,n baλ,r λ,r Γ(aλ,r) λjr λpj,n+aλ,r 1 jr exp 2 |β0(r) j,n,l|2 + κ2n/pj,n !pj,n baλ,r λ,r Γ(aλ,r) Γ(pj,n + aλ,r) 2 |β0(r) j,n,l|2+κ2n/pj,n 2κn 2bλ,r p !pj,n 1 Γ(aλ,r) Γ(pj,n + aλ,r) 2 |β0(r) j,n,l|2+κ2n/pj,n bλ,r φrτ + 1 pj,n+aλ,r . Bayesian Tensor Regression The final expression as in the above yields πn(||β(r) j,n β0(r) j,n || κn, j = 1, ..., D, r = 1, ..., R|{φr}R 1 r=1 , τ) 2κn 2bλ,r p !pj,n 1 Γ(aλ,r)λpj,n+aλ,r 1 j,r Γ(pj,n + aλ,r) 2 |β0(r) j,n,l|2+κ2n/pj,n bλ,r φrτ + 1 We will now use the fact that for φr 1, 2 |β0(r) j,n,l|2+κ2n/pj,n bλ,r φrτ + 1 pj,n+aλ,r 1 2 |β0(r) j,n,l|2+κ2n/pj,n bλ,r φrτ + 1 τφr pj,n+aλ,r Iτ [0,1]. This inequality is critical to provide a lower bound on πn(||β(r) j,n β0(r) j,n || κn, j = 1, ..., D, r = 1, ..., R) as following πn(||β(r) j,n β0(r) j,n || κn, j = 1, ..., D, r = 1, ..., R) λλ1 2 Γ(Ra) Γ(λ1)Γ(a)R κn pj,nbλ,r pj,n Γ(pj,n + aλ,r) τ τ λ1 R PD j=1 pj,n 2 1 exp( λ2τ) QR r=1 φa 1 r QR r=1 φ PD j=1 pj,n 2 |β0(r) j,n,l|2+κ2n/pj,n bλ,r φrτ + 1 pj,n+aλ,r dφdτ λλ1 2 Γ(Ra) Γ(λ1)Γ(a)R κn pj,nbλ,r pj,n Γ(pj,n + aλ,r) 2 |β0(r) j,n,l|2+κ2n/pj,n τ=0 τ λ1+PR r=1 aλ,r D 2 1 exp( τλ2)dτ Z r=1 φ a+aλ,r D = λλ1 2 Γ(Ra) Γ(λ1)Γ(a)R κn pj,nbλ,r pj,n Γ(pj,n + aλ,r) 2 |β0(r) j,n,l|2+κ2n/pj,n (λ1 + PR r=1 aλ,r D QR r=1 Γ(a + aλ,r D 2 PR r=1 aλ,r) . Guhaniyogi et al. Denote C6 = λλ1 2 Γ(Ra) Γ(λ1)[Γ(a)]R exp( λ2) (λ1+PR r=1 aλ,r D QR r=1[Γ(a+aλ,r D 2 )] Γ(Ra+PR r=1 aλ,r D 2 ) . Then the above expression gives us log ||Bn B0 n||2 < 2η 3Mnζn log(κn) + 1 2 log(pj,n) + log(bλ,r) + log(Γ(aλ,r) j=1 log(Γ(pn,j + aλ,r)) + r=1 (pj,n + aλ,r) log 2 |β0(r) j,n,l|2 + κ2n/pj,n Using (16) and assumption (b), it is easy to see that 1 κn < G5nρ2+ ρ3+1 2 QD j=1 pj,n for a con- stant G5 > 0 for all large n. Therefore, PD j=1 PR r=1 pj,n h log 1 κn 2 log(pj,n) + log(bλ,r) + log(Γ(aλ,r) i = o(n). Also, PR r=1 PD j=1 log(Γ(pj,n + aλ,r))] PD j=1(pj,n + aλ,r) log(pj,n + aλ,r) = o(n), by assumption (c). Finally, PD j=1 PR r=1(pj,n + aλ,r) log h Ppj,n l=1 2 |β0(r) j,n,l|2+κ2n/pj,n bλ,r + 1 i = o(n), by assumptions (b) and (c). Thus, log πn(Bn : ||Bn B0 n||2 < 2η 3Mnζn ) < dn for all d > 0, for all large n. This proves the result. This Section contains additional Lemmas relevant to the article. Lemma 7 Suppose T = T1 TD and F = F1 FD are two rank one tensors of same dimension. Then T F = (T1 F1) (TD FD) + I1 I2=1:D,|I1|=l,|I2|=D l γ1 γD, where γj = Fj if j I2; = Tj Fj if j I1. Proof We will show it by induction. If D = 2 then, T F = T1 T2 F1 F2 = (T1 F1) T2 + F1 T2 F1 F2 = (T1 F1) (T2 F2) + (T1 F1) F2 + F1 (T2 F2). Bayesian Tensor Regression Assume the result to hold for D 1. For D, T1 TD F1 FD = (T1 F1) T2 TD + F1 [T2 TD F2 FD] = (T1 F1) [(T2 F2) (TD FD) + F2 FD+ I1 I2,|I1|=l,|I2|=D 1 l γ2 γD]+ F1 [(T2 F2) (TD FD) + I1 I2,|I1|=l,|I2|=D 1 l γ2 γD] = (T1 F1) (TD FD) + I1 I2,|I1|=l,|I2|=D l γ1 γD]. Hence proved. Lemma 8 Let x be a real root of the polynomial P(x) = akxk +ak 1xk 1 + +a1x a0. Then 1/|x | < 1 + maxi=1,...,k ai Proof Consider the polynomial P1(ζ) = ζk a1 a0 . By making a change of variable with ζ = 1 x, we obtain = akxk + + a1x a0 Note that P1 1 x = 0 is solved by x = x . Therefore, P1(ζ) = 0 is solved by ζ = 1 x . The result follows by using Cauchy bound on the roots of a polynomial. Lemma 9 Suppose T1, . . . , Tm are independent random variables with Tj having density fj supported in (0, ). Let φj = Tj Pm l=1 Tm . Then the joint density of (φ1, . . . , φm 1) has a joint density supported on the simplex Sm 1 and is given by f(φ1, . . . , φm 1) = Z t=0 tm 1 m Y l=1 fj(φjt)dt, where φm = 1 Pm 1 l=1 φl. Proof This result is well known in the theory of normalized random measures (Kruijer et al., 2010). Guhaniyogi et al. Artin Armagan, David B Dunson, and Jaeyong Lee. Generalized double pareto shrinkage. Statistica Sinica, 23(1):119 143, 2013a. Artin Armagan, David B Dunson, Jaeyong Lee, Waheed U Bajwa, and Nate Strawn. Posterior consistency in linear models under shrinkage priors. Biometrika, 100(4):1011 1018, 2013b. Anirban Bhattacharya, Debdeep Pati, Natesh S Pillai, and David B Dunson. Dirichlet laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110 (512):1479 1490, 2015. Brian S Caffo, Ciprian M Crainiceanu, Guillermo Verduzco, Suresh Joel, Stewart H Mostofsky, Susan Spear Bassett, and James J Pekar. Two-stage decompositions for the analysis of functional connectivity for fmri with application to alzheimer s disease risk. Neuro Image, 51(3):1140 1149, 2010. David Gerard and Peter Hoff. Adaptive higher-order spectral estimators. ar Xiv preprint ar Xiv:1505.02114, 2015. JeffGoldsmith, Lei Huang, and Ciprian M Crainiceanu. Smooth scalar-on-image regression via spatial Bayesian variable selection. Journal of Computational and Graphical Statistics, 23(1):46 64, 2014. Chris Hinrichs, Vikas Singh, Lopamudra Mukherjee, Guofan Xu, Moo K Chung, Sterling C Johnson, and Alzheimer s Disease Neuroimaging Initiative. Spatially augmented lpboosting for ad classification with evaluations on the adni dataset. Neuroimage, 48(1):138 149, 2009. Peter D Hoffet al. Multilinear tensor regression for longitudinal relational data. The Annals of Applied Statistics, 9(3):1169 1193, 2015. Hung Hung and Chen-Chien Wang. Matrix variate logistic regression model with application to eeg data. Biostatistics, 14(1):189 202, 2013. Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455 500, 2009. Willem Kruijer, Judith Rousseau, and Aad Van Der Vaart. Adaptive Bayesian density estimation with location-scale mixtures. Electronic Journal of Statistics, 4:1225 1257, 2010. Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302 1338, 2000. Nicole Lazar. The Statistical Analysis of Functional MRI Data. Springer Science & Business Media, 2008. Martin A Lindquist. The statistical analysis of fmri data. Statistical Science, 23(4):439 464, 2008. Bayesian Tensor Regression Nicholas G Polson and James G Scott. Local shrinkage rules, l evy processes and regularized regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):287 311, 2012. Peihua Qiu. Jump surface estimation, edge detection, and image restoration. Journal of the American Statistical Association, 102(478):745 756, 2007. Philip T Reiss and R Todd Ogden. Functional generalized linear models with images as predictors. Biometrics, 66(1):61 69, 2010. Srikanth Ryali, Kaustubh Supekar, Daniel A Abrams, and Vinod Menon. Sparse logistic regression for whole-brain classification of fmri data. Neuro Image, 51(2):752 764, 2010. Taiji Suzuki. Convergence rate of Bayesian tensor estimatior and its minimax optimality. In Proceedings of the 32nd International Conference on Machine Learning (Lille, 2015), pages 1273 1282, 2015. Xuejing Wang, Bin Nan, Ji Zhu, and Robert Koeppe. Regularized 3d functional regression for brain image data via Haar wavelets. The annals of applied statistics, 8(2):1045, 2014. Yun Yang and David B Dunson. Minimax optimal Bayesian aggregation. ar Xiv preprint ar Xiv:1403.1345, 2014. Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540 552, 2013. Jing Zhou, Anirban Bhattacharya, Amy H Herring, and David B Dunson. Bayesian factorizations of big sparse tensors. Journal of the American Statistical Association, 110(512): 1562 1576, 2015.