Journal of Machine Learning Research 22 (2021) 1-53. Submitted 5/20; Revised 7/21; Published 9/21.

Soft Tensor Regression

Georgia Papadogeorgou (gpapadogeorgou@ufl.edu), Department of Statistics, University of Florida, Gainesville, FL 32611-8545, USA
Zhengwu Zhang (zhengwu_zhang@unc.edu), Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-3260, USA
David B. Dunson (dunson@duke.edu), Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA

Editor: Edo Airoldi

Abstract

Statistical methods relating tensor predictors to scalar outcomes in a regression model generally vectorize the tensor predictor and estimate the coefficients of its entries employing some form of regularization, use summaries of the tensor covariate, or use a low-dimensional approximation of the coefficient tensor. However, low-rank approximations of the coefficient tensor can suffer if the true rank is not small. We propose a tensor regression framework which assumes a soft version of the parallel factors (PARAFAC) approximation. In contrast to the classic PARAFAC, where each entry of the coefficient tensor is the sum of products of row-specific contributions across the tensor modes, the soft tensor regression (Softer) framework allows the row-specific contributions to vary around an overall mean. We follow a Bayesian approach to inference, and show that softening the PARAFAC increases model flexibility, leads to improved estimation of coefficient tensors, more accurate identification of important predictor entries, and more precise predictions, even for a low approximation rank. From a theoretical perspective, we show that employing Softer leads to a weakly consistent posterior distribution of the coefficient tensor, irrespective of the true or approximation tensor rank, a result that is not true when employing the classic PARAFAC for tensor regression.
In the context of our motivating application, we adapt Softer to symmetric and semi-symmetric tensor predictors and analyze the relationship between brain network characteristics and human traits.

Keywords: adjacency matrix, Bayesian, brain connectomics, graph data, latent factors, low rank, network data, PARAFAC, tensor regression

© 2021 Georgia Papadogeorgou, Zhengwu Zhang, David B. Dunson. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-476.html.

1. Introduction

In many applications, data naturally have an array or tensor structure. When the tensor includes the same variable across two of its modes, it is often referred to as a network. Network dependence is often summarized via an adjacency matrix or tensor. For example, data might correspond to an $R \times R \times p$ array containing $p$ features measuring the strength of connections between an individual's $R$ brain regions. In tensor data analysis, interest often lies in characterizing the relationship between a tensor predictor and a scalar outcome within a regression framework. Estimation of such regression models most often requires some type of parameter regularization or dimensionality reduction, since the number of entries of the tensor predictor is larger than the sample size. In this paper, we propose a soft tensor regression (Softer) framework for estimating a high-dimensional linear regression model with a tensor predictor and scalar outcome. Softer accommodates the predictor's tensor structure by basing the coefficient tensor estimation on the parallel factors approximation, similarly to other approaches in the literature. However, in contrast to previously developed methodology, Softer adaptively expands away from its low-rank mean to adequately capture and flexibly estimate more complex coefficient tensors.
Softer's deviations from the underlying low-rank, tensor-based structure are interpretable as variability in the tensor's row-specific contributions.

1.1 Tensor Regression in the Literature

Generally, statistical approaches to tensor regression fall under the following categories: they estimate the coefficients corresponding to each tensor entry with entry-specific penalization, regress the scalar outcome on low-dimensional summaries of the tensor predictor, or estimate a coefficient tensor assuming a low-rank approximation. A simple approach to tensor regression vectorizes the tensor predictor and fits a regression model of the outcome on the tensor's entries while performing some form of variable selection or regularization. Examples include Cox and Savoy (2003) and Craddock et al. (2009), who employed support vector classifiers to predict categorical outcomes based on the participants' brain activation or connectivity patterns. Other examples in neuroscience include Mitchell et al. (2004); Haynes and Rees (2005); O'Toole et al. (2005); Polyn et al. (2005) and Richiardi et al. (2011) (see Norman et al. (2006) for a review). However, this regression approach to handling tensor predictors is, at the least, unattractive, since it fails to account for the intrinsic array structure of the predictor, effectively flattening it prior to the analysis. Alternatively, dimensionality reduction can be performed directly on the tensor predictor, reducing it to low-dimensional summaries. In such approaches, the expectation is that these summaries capture all essential information while decreasing the number of parameters to be estimated. For example, Zhang et al. (2019) and Zhai and Li (2019) use principal component analysis to extract information on the participants' structural and functional brain connectivity, and use these principal components to study the relationship between brain network connections and outcomes within a classic regression framework.
However, this approach could suffer due to its unsupervised nature, which selects principal components without examining their relationship to the outcome. Moreover, the performance of the low-dimensional summaries is highly dependent on the number and choice of those summaries, and the interpretation of the estimated coefficients might not be straightforward. Ginestet et al. (2017) and Durante and Dunson (2018) developed hypothesis tests for differences in the brain connectivity distribution among subgroups of individuals, and employed them to study the relationship between categorical outcomes and binary network measurements. Even though related, such approaches do not address our interest in building regression models with tensor predictors.

An attractive approach to tensor regression performs dimension reduction on the coefficient tensor. Generally, these approaches exploit a tensor's Tucker decomposition (Tucker, 1966) and its restriction known as the parallel factors (PARAFAC) or canonical decomposition. According to the PARAFAC, a tensor is the sum of $D$ rank-1 tensors, and each entry can be written as the sum of $D$ products of row-specific elements. The minimum value of $D$ for which that holds is referred to as the tensor's rank. Note that the word row along a tensor mode is used here to represent rows in the classic matrix sense (slices of the tensor along the first mode), columns (slices of the tensor along the second mode), or slices along higher modes. Within the frequentist paradigm, Hung and Wang (2013) suggested a bi-linear logistic regression model in the presence of a matrix predictor. For a tensor predictor, Zhou et al. (2013) and Li et al. (2018) exploited the PARAFAC and Tucker decompositions, respectively, and proposed low-rank approximations to the coefficient tensor. Guhaniyogi et al. (2017) proposed a related Bayesian tensor regression approach for estimating the coefficient tensor.
Even though these approaches perform well for prediction in these high-dimensional tensor settings, they are bound by the approximation rank, in the sense that they cannot exactly capture a true coefficient tensor whose rank exceeds the rank used in the approximation. Moreover, these approaches are not directly applicable for identifying important connections. In this direction, Wang et al. (2014) imposed sparsity in the coefficient tensor of a multi-linear logistic regression model by penalizing the PARAFAC contributions. Wang et al. (2018) proposed a related approach to identify small brain subgraphs that are predictive of an individual's cognitive abilities. Further, Guha and Rodriguez (2021) assumed a PARAFAC decomposition of the mean of the coefficient tensor and used a spike-and-slab prior distribution to identify brain regions whose connections are predictive of an individual's creativity index.

Low-rank approximations to the coefficient tensor of a linear tensor regression model provide a supervised approach to estimating the relationship between a tensor predictor and a scalar outcome. However, such approximations can lead to a poorly estimated coefficient tensor, misidentification of important connections, and inaccurate predictions if the true rank of the coefficient tensor is not small. As we will illustrate in Section 3, this performance issue arises due to the inflexibility of the PARAFAC approximation, which specifies that each row has a fixed contribution to all coefficient entries that involve it, leading to an overly rectangular or block structure of the estimated coefficient tensor. If the true coefficient tensor does not exhibit such a block structure, a large number of components $D$ might be necessary in order to adequately approximate it. Due to this rigid structure, we refer to the PARAFAC approximation to estimating the coefficient tensor as the hard PARAFAC.
Recently, there have been efforts to relax the hard PARAFAC structure employing non-parametric methodology using kernels, Gaussian processes, or neural networks (Signoretto et al., 2013; Suzuki et al., 2016; Kanagawa et al., 2016; Imaizumi and Hayashi, 2016; Maruhashi et al., 2018). Even though these methods are very promising, we focus on regression models which include the tensor predictor linearly, and on flexibly estimating the linear functional. The ideas of PARAFAC softening presented here could be extended to non-linear settings and to situations outside the scope of tensor regression.

1.2 Our Contribution

With our main focus being improved estimation and inference over model coefficients, we address the inflexibility of the hard PARAFAC in the linear tensor regression setting. Towards this goal, we propose a hierarchical modeling approach to estimate the coefficient tensor. Similarly to the hard PARAFAC, each entry of the coefficient tensor is the sum of products of row-specific contributions. However, our model specification allows a row's contribution to the coefficients that involve it to be entry-specific. Then, the mean of the entry-specific contributions along a tensor row can be thought of as the row's overall importance, and the entry-specific deviations as small variations in the row's importance when interacting with the rows of the other tensor modes. Allowing the row contributions to vary by entry leads to a softening of the hard structure in the PARAFAC approximation, and for this reason, we refer to it as the soft PARAFAC. We refer to the tensor regression model that uses the soft PARAFAC for estimation of the coefficient tensor as Soft Tensor Regression (Softer). We follow a fully Bayesian approach to inference, which allows for straightforward uncertainty quantification in the coefficient estimates and predictions.
By studying the induced prior distribution on the coefficient tensor, we choose sensible values for the hyperparameters. Importantly, the flexible structure of the soft PARAFAC allows Softer to capture any true coefficient tensor without increasing the base rank, in contrast to existing models that use strictly low-rank approximations. We show this explicitly by proving that the imposed prior structure on the coefficient tensor has full support on a large class including true tensors of any rank, and that its posterior distribution is consistent for any approximation rank used. We also illustrate it in simulations, where we show that the performance of Softer is robust to the choice of the approximation rank. Due to its increased flexibility, Softer performs better than strictly low-rank models in identifying important entries of the tensor predictor. We extend Softer to generalized linear models and symmetric tensors, widening its applicability.

We use the soft tensor regression framework in a study of the relationship between brain structural connectomics and human traits for participants in the Human Connectome Project (HCP; Van Essen et al. (2013)). In the last few years, the HCP has played a very important role in expanding our understanding of the human brain by providing a database of anatomical and functional connections and individual demographics and traits on over a thousand healthy subjects. Data availability and increased sample sizes have allowed researchers across various fields to develop and implement new tools in order to analyze these complex and rich data (see Cole et al. (2014); McDonough and Nashiro (2014); Smith et al. (2015); Riccelli et al. (2017); Croxson et al. (2018) among many others).
Using data from the HCP, exploiting state-of-the-art connectomics processing pipelines (Zhang et al., 2018), and within an adaptation of the supervised Softer framework for symmetric tensor predictors, we investigate the relationship between structural brain connection characteristics and a collection of continuous and binary human traits.

2. Tensor Regression

In this section, we introduce some useful notation and the regression of scalar outcomes on a tensor predictor.

2.1 Useful Notation

Let $a \in \mathbb{R}^{p_1}$ and $b \in \mathbb{R}^{p_2}$. Then $a \circ b \in \mathbb{R}^{p_1 \times p_2}$ is used to represent the outer product of $a$ and $b$, with dimension $p_1 \times p_2$ and entries $[a \circ b]_{ij} = a_i b_j$. Similarly, for vectors $a_k \in \mathbb{R}^{p_k}$, $k = 1, 2, \dots, K$, the outer product $a_1 \circ a_2 \circ \cdots \circ a_K$ is a $K$-mode tensor $A$ of dimensions $p_1, p_2, \dots, p_K$ and entries $A_{j_1 j_2 \dots j_K} = \prod_{k=1}^K a_{k,j_k}$. For two tensors $A_1, A_2$ of the same dimensions, we use $A_1 \odot A_2$ to represent the Hadamard product, defined as the element-wise product of the two tensors. Further, we use $\langle A_1, A_2 \rangle_F$ to represent the Frobenius inner product, which is the sum of the elements of $A_1 \odot A_2$. When the tensors are vectors (1-mode), the Frobenius inner product is the classic dot product. For a $K$-mode tensor $A$ of dimensions $p_1, p_2, \dots, p_K$, we use the $j$th slice of $A$ along mode $k$ to refer to the $(K-1)$-mode tensor $G$ with dimensions $p_1, \dots, p_{k-1}, p_{k+1}, \dots, p_K$ and entries $G_{j_1 \dots j_{k-1} j_{k+1} \dots j_K} = A_{j_1 \dots j_{k-1} j j_{k+1} \dots j_K}$. For example, the $j$th slice of a $p_1 \times p_2$ matrix along mode 1 is the matrix's $j$th row. As a result, we refer to slice-specific quantities as row-specific, even when that slice is not along mode 1. For example, for a $p_1 \times p_2$ matrix, the mean entry of the $j$th row along mode 2 is the mean of the $j$th column. Remembering that we use row to refer to slices (and not necessarily to rows in the classic matrix sense) will be useful when discussing the hard PARAFAC in Section 3 and when introducing the soft PARAFAC in Section 4.
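As a concrete illustration of this notation, the outer product, Hadamard product, and Frobenius inner product can be sketched in a few lines of Python (a minimal example using NumPy; the dimensions and values are arbitrary):

```python
import numpy as np

# Outer product of two vectors: [a o b]_{ij} = a_i * b_j
a = np.array([1.0, 2.0])          # a in R^2
b = np.array([3.0, 4.0, 5.0])     # b in R^3
A = np.multiply.outer(a, b)       # a 2 x 3 matrix

# A K-mode outer product a_1 o a_2 o ... o a_K via repeated np.multiply.outer
def outer_product(vectors):
    T = vectors[0]
    for v in vectors[1:]:
        T = np.multiply.outer(T, v)
    return T

# Hadamard (element-wise) product and Frobenius inner product
A1 = np.arange(6.0).reshape(2, 3)
A2 = np.ones((2, 3))
hadamard = A1 * A2
frobenius = np.sum(A1 * A2)       # <A1, A2>_F; for vectors this is the dot product
```

For 1-mode tensors (vectors), `np.sum(a * a)` coincides with the dot product `a @ a`, matching the remark above.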
2.2 Regression of Scalar Outcome on Tensor Predictor

Let $Y_i$ be a continuous outcome, $C_i = (C_{i1}, C_{i2}, \dots, C_{ip})^T$ scalar covariates, and $X_i$ a $K$-mode tensor of dimensions $p_1, p_2, \dots, p_K$ with entries $[X_i]_{j_1 j_2 \dots j_K} = X_{i,j_1 j_2 \dots j_K}$, for unit $i = 1, 2, \dots, N$. Even though our model is presented here for continuous outcomes, the relationship between tensor predictors and binary or categorical outcomes can be similarly evaluated by considering an appropriate link function, as we do in Section 6. We study the relationship between the outcome and the scalar and tensor predictors by assuming a linear model
$$Y_i = \mu + C_i^T \delta + \sum_{j_1=1}^{p_1} \sum_{j_2=1}^{p_2} \cdots \sum_{j_K=1}^{p_K} X_{i,j_1 j_2 \dots j_K} \beta_{j_1 j_2 \dots j_K} + \epsilon_i, \quad \epsilon_i \sim N(0, \tau^2), \tag{1}$$
where $\delta \in \mathbb{R}^p$ and $\beta_{j_1 j_2 \dots j_K} \in \mathbb{R}$. Alternatively, organizing all coefficients $\beta_{j_1 j_2 \dots j_K}$ in a tensor $B$ of equal dimensions to $X$ and $(j_1 j_2 \dots j_K)$ entry equal to $\beta_{j_1 j_2 \dots j_K}$, the same model can be written as
$$Y_i = \mu + C_i^T \delta + \langle X_i, B \rangle_F + \epsilon_i. \tag{2}$$
Since the coefficient tensor $B$ includes $\prod_{k=1}^K p_k$ coefficients, it is infeasible to estimate it without some form of regularization or additional structure. Penalization or variable selection approaches based on the vectorization of the tensor predictor are implemented directly on the model of Equation 1, ignoring the predictor's tensor structure. Alternatively, an approach to account for the predictor's inherent structure is to assume a low-rank approximation to $B$ based on the PARAFAC decomposition, discussed in Section 3.

3. Tensor Regression Using the Hard PARAFAC Approximation

Under the PARAFAC decomposition, a tensor $B \in \mathbb{R}^{p_1 \times p_2 \times \cdots \times p_K}$ can be written as
$$B = \sum_{d=1}^{D} \beta_1^{(d)} \circ \beta_2^{(d)} \circ \cdots \circ \beta_K^{(d)} \tag{3}$$
for some integer $D$ and $\beta_k^{(d)} \in \mathbb{R}^{p_k}$. The minimum value of $D$ for which $B$ equals its representation in Equation 3 is referred to as its rank. For matrices (2-mode tensors), this decomposition is equivalent to the singular value decomposition, and $D$ is the matrix rank. We refer to the $d$th term in the sum as the PARAFAC's $d$th component.
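Before turning to how the PARAFAC is used for estimation, the regression model in Equations 1 and 2 can be simulated directly; the following sketch (with arbitrary dimensions and a hypothetical random coefficient tensor) generates outcomes using the Frobenius inner product form of Equation 2 via `einsum`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 2-mode tensor (matrix) predictor per unit
p1, p2, p, N = 4, 5, 2, 10
X = rng.normal(size=(N, p1, p2))   # tensor predictors X_i
C = rng.normal(size=(N, p))        # scalar covariates C_i
B = rng.normal(size=(p1, p2))      # coefficient tensor B
mu, delta, tau = 0.5, rng.normal(size=p), 0.1

# Equation 2: Y_i = mu + C_i^T delta + <X_i, B>_F + eps_i
eps = rng.normal(scale=tau, size=N)
Y = mu + C @ delta + np.einsum('ijk,jk->i', X, B) + eps
```

The `einsum` call is a vectorized form of the double sum in Equation 1; for each unit it multiplies $X_i$ and $B$ element-wise and sums, which is exactly $\langle X_i, B \rangle_F$.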
The tensor PARAFAC decomposition leads to a natural approximation of the coefficient tensor in Equation 2 by assuming that it is in the form of Equation 3 for some small value of $D$, potentially much smaller than its true rank. Then, the $\prod_{k=1}^K p_k$ coefficients in $B$ are approximated using $D \sum_{k=1}^K p_k$ parameters, leading to a large reduction in the number of quantities to be estimated. However, this reduction in the number of parameters might come at a substantial price if the rank $D$ used in the approximation is smaller than the tensor's true rank. According to Equation 3, the $(j_1 j_2 \dots j_K)$ entry of $B$ is equal to
$$B_{j_1 j_2 \dots j_K} = \sum_{d=1}^{D} \beta_{1,j_1}^{(d)} \beta_{2,j_2}^{(d)} \cdots \beta_{K,j_K}^{(d)}. \tag{4}$$
In turn, Equation 4 implies that row $j_k$ along mode $k$ has fixed importance, expressed as fixed row contributions $\beta_{k,j_k}^{(d)}$, to all coefficient entries $B_{j_1 j_2 \dots j_K}$ that include $j_k$, irrespective of the remaining indices. We refer to $\beta_{k,j_k}^{(d)}$ as the $d$th $j_k$-row contribution along mode $k$. This is best illustrated by considering a rank-1 2-mode tensor (matrix) $B = \beta_1 \circ \beta_2$ for vectors $\beta_1 \in \mathbb{R}^{p_1}$ and $\beta_2 \in \mathbb{R}^{p_2}$. Then, $B_{j_1 j_2} = \beta_{1,j_1} \beta_{2,j_2}$, and the same entry $\beta_{1,j_1}$ is used in $B_{j_1 j_2}$ irrespective of $j_2$. This gives rise to a rectangular structure in $B$ in which a row's importance, $\beta_{1,j_1}$, is fixed across all columns (and similarly for $\beta_{2,j_2}$). We further illustrate this in Figure 1a, where we plot $\beta_1 \circ \beta_2$ for randomly generated vectors $\beta_1, \beta_2 \in \{0,1\}^{100}$. It is evident from Figure 1a that rank-1 matrices are organized in a rectangular structure where rows and columns are either uniformly excluded or not. Even though the generated vectors are binary for ease of illustration, the rectangular structure persists even when $\beta_1, \beta_2$ include non-binary entries. The rectangular structure observed in rank-1 tensors indicates that a rank-1 ($D = 1$) approximation to the coefficient tensor could be quite limiting. Generally, a rank-$D$ approximation for $D > 1$ is employed to estimate the coefficient tensor.
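A short sketch illustrates the parameter reduction and the rectangular structure of rank-1 terms described above (binary factor vectors are used purely for illustration, mirroring Figure 1a; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p1, p2, D = 30, 20, 3

# Rank-1 building blocks beta_1^(d), beta_2^(d), binary for illustration
betas1 = [rng.integers(0, 2, p1).astype(float) for _ in range(D)]
betas2 = [rng.integers(0, 2, p2).astype(float) for _ in range(D)]

# Equation 3: B is the sum of D rank-1 (outer product) terms
B = sum(np.outer(b1, b2) for b1, b2 in zip(betas1, betas2))

# Parameter counts: full coefficient tensor vs. rank-D PARAFAC
full_params = p1 * p2              # prod_k p_k = 600
parafac_params = D * (p1 + p2)     # D * sum_k p_k = 150

# Rectangular structure: a rank-1 term reuses the same row contribution
# betas1[0][j1] in every column, so its matrix rank is at most 1
rank1_term = np.outer(betas1[0], betas2[0])
```

Each entry of `B` matches the sum of products in Equation 4, and the gap between `full_params` and `parafac_params` grows quickly with the mode dimensions.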
Figure 1b shows a matrix $B$ of rank $D = 3$, summing over three rank-1 tensors like the one in Figure 1a. The rank-3 tensor alleviates but does not annihilate the rectangular structure observed previously. This is most obvious in Figure 1c, where the rows and columns of Figure 1b are re-ordered according to their mean entry. In Appendix B we further demonstrate the inflexibility of the hard PARAFAC's block structure.

Figure 1: Illustration of the Hard PARAFAC Inflexibility. Panel (a) shows a rank-1 tensor of the form $B = \beta_1 \circ \beta_2$, for vectors $\beta_1, \beta_2 \in \{0,1\}^{100}$. Panel (b) shows a rank-3 matrix that is the sum of three rank-1 tensors like the one in panel (a). Panel (c) shows the same rank-3 matrix with rows and columns reordered according to their mean entry.

The said block structure is also evident in the work by Zhou et al. (2013); Guhaniyogi et al. (2017) and Li et al. (2018), where they simulated data based on binary coefficient matrices. When these matrices represent combinations of rectangles (such as squares or crosses), the approximation performed well in estimating the true coefficient tensor. However, in situations where the true coefficient tensor was irregular, an increase in the rank was necessary in order to vaguely approximate the truth.

4. Soft Tensor Regression

Our approach proceeds by increasing the number of parameters in the regression model in Equation 2 and subsequently imposing sufficient structure to ensure model regularization and adaptability, simultaneously. It borrows from the low-rank structure of the hard PARAFAC, but it is more flexible than the rigid form illustrated in Equation 4 and Figure 1, and can adaptively expand away from low rank. We introduce tensors $B_k^{(d)}$ of equal dimensions to $B$ and write
$$B = \sum_{d=1}^{D} B_1^{(d)} \odot B_2^{(d)} \odot \cdots \odot B_K^{(d)}. \tag{5}$$
From Equation 5, the coefficient with indices $j = (j_1 j_2 \dots j_K)$ is written as the sum of $D$ products of $K$ parameters,
$$B_j = \sum_{d=1}^{D} \beta_{1,j}^{(d)} \beta_{2,j}^{(d)} \cdots \beta_{K,j}^{(d)},$$
where $\beta_{k,j}^{(d)}$ is the $j$th entry of the tensor $B_k^{(d)}$. For reasons that will become apparent later, the parameters $\beta_{k,j}^{(d)}$ are referred to as the $j_k$th row-specific contributions along mode $k$. Note that these row-specific contributions are allowed to depend on all indices $j$. For unrestricted $B_k^{(d)}$s, Equation 5 does not impose any restrictions, and any tensor $B$ can be written in this form (take $D = 1$, $B_1^{(1)} = B$, and $B_k^{(1)}$ a tensor of ones, for all $k > 1$).

Writing the coefficient tensor as in Equation 5 might appear counter-productive at first, since the already high-dimensional problem of estimating the $\prod_{k=1}^K p_k$ parameters in $B$ is translated to an even higher-dimensional problem with $DK \prod_{k=1}^K p_k$ parameters on the right-hand side of Equation 5. However, as we will see, that is not a problem if adequate structure is imposed on the tensors $B_k^{(d)}$. In fact, in Section 4.1 we show that the hard PARAFAC can be written in this form for carefully designed tensors $B_k^{(d)}$, which shows that sufficient structure on the tensors $B_k^{(d)}$ can lead to drastic dimensionality reduction. In Section 4.2, we present our approach, which is based on imposing structure on the tensors $B_k^{(d)}$ in a careful manner such that it simultaneously allows for low-dimensional and flexible estimation of $B$. We refer to the resulting characterization as the soft PARAFAC of $B$.

4.1 Representation of the Hard PARAFAC Motivating the Soft PARAFAC

As shown in Equation 4, the hard PARAFAC row-specific contributions to each entry of the coefficient tensor are fixed across the remaining indices. Hence, the hard PARAFAC can be written in the form of Equation 5 by specifying tensors $B_k^{(d)}$ of which any two entries with the same $j_k$ index are equal: $[B_k^{(d)}]_{j_1 \dots j_{k-1} j_k j_{k+1} \dots j_K} = [B_k^{(d)}]_{j'_1 \dots j'_{k-1} j_k j'_{k+1} \dots j'_K}$. This structure on the tensors $B_k^{(d)}$ can be visualized as $p_k$ constant slices along mode $k$, representing the fixed row-specific contributions to all coefficient entries that involve the corresponding row.
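To make the constant-slice representation concrete, the following sketch verifies for a single component ($D = 1$, $K = 2$) that tensors $B_k^{(d)}$ with constant slices recover the hard PARAFAC term through their Hadamard product (a minimal illustration with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
p1, p2 = 4, 3

# Hard PARAFAC row contributions (D = 1 for simplicity)
beta1 = rng.normal(size=p1)   # mode-1 row contributions
beta2 = rng.normal(size=p2)   # mode-2 row contributions

# Equation 5 tensors with constant slices along each mode:
# B1 has constant rows (mode-1 slices); B2 has constant columns.
B1 = np.repeat(beta1[:, None], p2, axis=1)   # row j1 is constant beta1[j1]
B2 = np.repeat(beta2[None, :], p1, axis=0)   # column j2 is constant beta2[j2]

# Hadamard product of the full tensors recovers the rank-1 PARAFAC term
B_soft_form = B1 * B2
B_hard = np.outer(beta1, beta2)
```

The equality `B_soft_form == B_hard` (up to floating point) shows that the hard PARAFAC is indeed a special, fully constrained case of the representation in Equation 5.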
This structure is illustrated in Figure 2a for a 4-by-3 coefficient matrix. As an example, the contribution of row 2 along mode 1 is constant ($\beta_{1,(2,1)} = \beta_{1,(2,2)} = \beta_{1,(2,3)}$), and the same is true for the contribution of row 1 along mode 2 ($\beta_{2,(1,1)} = \beta_{2,(2,1)} = \beta_{2,(3,1)} = \beta_{2,(4,1)}$). This demonstrates that the hard PARAFAC is one example of structure that can be imposed on the $B_k^{(d)}$s in order to approximate $B$. The connection between Equation 5 and the hard PARAFAC is the reason why we refer to $\beta_{k,j}^{(d)}$ as row-specific contributions along mode $k$. However, the hard PARAFAC structure is quite strict, and it imposes equalities across the $p_k$ slices of $B_k^{(d)}$. Since the hard PARAFAC can only capture coefficient tensors of rank up to $D$, the hard PARAFAC's rigid structure on the tensors $B_k^{(d)}$ can severely limit the flexibility of the model to capture a true coefficient tensor $B$ of higher rank.

4.2 The Soft PARAFAC

The soft PARAFAC builds upon the hard PARAFAC's low-rank structure, while providing additional flexibility by introducing entry-specific variability in the row contributions. Instead of forcing all entries of $B_k^{(d)}$ with the same index $j_k$ to be equal to each other, the soft PARAFAC assumes that the entries of $B_k^{(d)}$ are centered around a $j_k$-specific value, $\gamma_{k,j_k}^{(d)}$. Specifically, the soft PARAFAC specifies that, for all $k = 1, 2, \dots, K$, $j_k = 1, 2, \dots, p_k$, and $d = 1, 2, \dots, D$,
$$\beta_{k,j}^{(d)} \sim N(\gamma_{k,j_k}^{(d)}, \sigma_k^2 \zeta^{(d)}), \tag{6}$$
for some $\gamma_{k,j_k}^{(d)} \in \mathbb{R}$ and $\sigma_k^2, \zeta^{(d)} > 0$, where $j$ is the $(j_1 j_2 \dots j_k \dots j_K)$ entry of the tensor.

Figure 2: Row-specific Contributions for the Hard and Soft PARAFAC. Left (a), hard PARAFAC: the contributions are fixed across remaining indices. Right (b), soft PARAFAC: the contributions of a row and column are entry-specific and centered around an overall mean.

The connection between the soft and hard PARAFAC is evident by noticing that the mean values $\gamma$ depend only on the index $j_k$, and not on the remaining indices in $j$. In fact, the $\gamma$ parameters resemble the $j_k$-specific entries in the hard PARAFAC, and they specify that Softer is based on an underlying $\gamma$-based rank-$D$ PARAFAC:
$$E[B_j \mid \Gamma, S, Z] = \sum_{d=1}^{D} \gamma_{1,j_1}^{(d)} \gamma_{2,j_2}^{(d)} \cdots \gamma_{K,j_K}^{(d)},$$
where $\Gamma, S, Z$ are the collections of the $\gamma$, $\sigma$, and $\zeta$ parameters, respectively. At the same time, Equation 6 allows variation within the mode-$k$ slices of $B_k^{(d)}$ by considering them as random effects centered around an overall mean. This implies that row $j_k$'s importance is allowed to be entry-specific, leading to a softening of the hard PARAFAC structure. The soft PARAFAC is illustrated in Figure 2b. Here, the row contributions are centered around a common value (a value resembling the row contribution according to the hard PARAFAC) but are entry-specific. For example, $\beta_{1,(2,1)}$ is similar but not equal to $\beta_{1,(2,2)}$ and $\beta_{1,(2,3)}$. The entry-specific contributions deviate from the baseline according to a mode-specific parameter, $\sigma_k^2$, and a component-specific parameter, $\zeta^{(d)}$. As we will discuss later, the inclusion of $\zeta^{(d)}$ in the variance forces a larger amount of shrinkage on the entry-specific importance for components $d$ that have limited overall importance. For $\sigma_k^2 \zeta^{(d)} = 0$, the soft PARAFAC reverts back to the hard PARAFAC, with row-specific contributions fixed at $\gamma_{k,j_k}^{(d)}$. However, larger values of $\sigma_k^2 \zeta^{(d)}$ allow for a PARAFAC-based approximation that deviates from its hard underlying structure and can be used to represent any true tensor $B$. This is further illustrated in Figure 3, where $\gamma_1, \gamma_2 \in \{0,1\}^{64}$, and entry-specific contributions are generated according to Equation 6 with $\sigma_k^2 \zeta^{(d)} \in \{0, 0.05, 0.1, 0.2\}$. The soft PARAFAC resembles a structured matrix, though higher values of the conditional variance lead to further deviations from a low-rank structure. The structure imposed by the soft PARAFAC has an interesting interpretation.
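The generative mechanism behind Figure 3 can be sketched as follows: baseline contributions define a rank-1 matrix, and entry-specific contributions drawn as in Equation 6 soften it (a minimal illustration with a hypothetical variance value; with zero variance the hard PARAFAC is recovered exactly):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 16

# Baseline (hard) row contributions, binary for illustration as in Figure 3
gamma1 = rng.integers(0, 2, p).astype(float)
gamma2 = rng.integers(0, 2, p).astype(float)

def soft_parafac_matrix(gamma1, gamma2, var, rng):
    # Equation 6 with D = 1: entry-specific contributions vary around the
    # baseline, beta_{k,j} ~ N(gamma_{k,j_k}, var) with var = sigma_k^2 * zeta
    B1 = rng.normal(loc=gamma1[:, None], scale=np.sqrt(var),
                    size=(len(gamma1), len(gamma2)))
    B2 = rng.normal(loc=gamma2[None, :], scale=np.sqrt(var),
                    size=(len(gamma1), len(gamma2)))
    return B1 * B2

B_hard = soft_parafac_matrix(gamma1, gamma2, 0.0, rng)   # reverts to hard PARAFAC
B_soft = soft_parafac_matrix(gamma1, gamma2, 0.1, rng)   # softened version
```

With `var = 0.0` the draws collapse to their means and the output equals the rank-1 matrix $\gamma_1 \circ \gamma_2$; positive variance produces the structured-but-not-low-rank matrices shown in Figure 3.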
The parameters $\gamma_{k,j_k}^{(d)}$ represent the baseline importance of row $j_k$ along the tensor's $k$th mode.

Figure 3: Soft PARAFAC Matrices. The plot on the left shows a rank-1 binary matrix $\gamma_1 \circ \gamma_2$ for $\gamma_1, \gamma_2 \in \{0,1\}^{64}$. The remaining matrices are generated according to Equation 6 for variance equal to 0.05, 0.1, and 0.2.

In contrast to the hard PARAFAC, the soft PARAFAC allows for structured, tensor-based deviations of a row's importance, by allowing row $j_k$'s contribution to manifest differently based on the rows of the other modes that participate with it in a coefficient entry, $j \setminus \{j_k\}$, through $\beta_{k,j}^{(d)}$. This interpretation of the soft PARAFAC structure is coherent in network settings like the one in our brain connectomics study, where we expect a brain region to have some baseline value for its connections, but the magnitude of this importance might slightly vary depending on the other region with which these connections are made. In this sense, defining deviations from the hard PARAFAC through deviations in the row-specific contributions as specified in Equation 6 represents a tensor-based relaxation of the hard PARAFAC structure, which is itself tensor-based.

4.3 Bayesian Inference in the Soft Tensor Regression Framework

Softer is placed within the Bayesian paradigm, which allows for straightforward uncertainty quantification. We consider the structure on $B_k^{(d)}$ expressed in Equation 6 as part of the prior specification on the model parameters of Equation 2. Since the $\gamma_{k,j_k}^{(d)}$ are the key building blocks for the mean of $B$ representing the underlying hard PARAFAC, we borrow from Guhaniyogi et al. (2017) and specify
$$\gamma_{k,j_k}^{(d)} \sim N(0, \tau_\gamma \zeta^{(d)} w_{k,j_k}^{(d)}), \quad \tau_\gamma \sim \Gamma(a_\tau, b_\tau), \quad w_{k,j_k}^{(d)} \sim \mathrm{Exp}\big((\lambda_k^{(d)})^2 / 2\big), \quad \lambda_k^{(d)} \sim \Gamma(a_\lambda, b_\lambda),$$
$$\zeta \sim \mathrm{Dirichlet}(\alpha/D, \alpha/D, \dots, \alpha/D),$$
where $\zeta = (\zeta^{(1)}, \zeta^{(2)}, \dots, \zeta^{(D)})$. Therefore, the parameters $\gamma_{k,j_k}^{(d)}$ vary around 0 with variance that depends on an overall parameter $\tau_\gamma$, and component- and row-specific parameters $\zeta^{(d)}$ and $w_{k,j_k}^{(d)}$.
As discussed in Guhaniyogi et al. (2017), the row-specific components $w_{k,j_k}^{(d)}$ lead to an adaptive Lasso-type penalty on $\gamma_{k,j_k}^{(d)}$ (Armagan et al., 2013), and $\gamma_{k,j_k}^{(d)} \mid \tau_\gamma, \zeta^{(d)}, \lambda_k^{(d)}$ follows a double exponential (Laplace) distribution centered at 0 with scale $\sqrt{\tau_\gamma \zeta^{(d)}} / \lambda_k^{(d)}$ (Park and Casella, 2008). The component-specific variance parameter $\zeta^{(d)}$ in the prior of $\gamma_{k,j_k}^{(d)}$ encourages only a subset of the $D$ components to contribute substantially to the tensor's PARAFAC approximation, since parameters $\gamma_{k,j_k}^{(d)}$ for $d$ with small $\zeta^{(d)}$ are shrunk towards zero. We include $\zeta^{(d)}$ in the conditional variance of $\beta_{k,j}^{(d)}$ in Equation 6 to ensure that shrinkage of the baseline row contributions $\gamma_{k,j_k}^{(d)}$ is accompanied by a shrinkage of the corresponding entry-specific contributions $\beta_{k,j}^{(d)}$ towards zero, and that a reduction in the variance of $\gamma_{k,j_k}^{(d)}$ is not overcompensated by an increase in the conditional variance of $\beta_{k,j}^{(d)}$. We assume normal prior distributions on the intercept and scalar covariates' coefficients, $(\mu, \delta) \sim N(0, \Sigma_0)$, an inverse gamma prior on the residual variance, $\tau^2 \sim IG(a_\tau, b_\tau)$, and gamma priors on the mode-specific variance components, $\sigma_k^2 \sim \Gamma(a_\sigma, b_\sigma)$. Specific choices for the hyperparameter values are discussed in Section 4.4.

4.4 Choosing Hyperparameters to Achieve Desirable Characteristics of the Induced Prior

The prior distribution on the coefficient tensor $B$, $\pi_B$, is induced by our prior specification on the remaining parameters. The choice of hyperparameters can have a large effect on model performance, and the use of diffuse, non-informative priors can perform poorly in some situations (Gelman et al., 2008). For that reason, and in order to assist default choices of hyperparameters leading to weakly informative prior distributions with interpretable flexibility, we study the properties of the induced prior on $B$. We do so in the following way.
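For intuition, one draw from the prior hierarchy on the baseline contributions can be sketched as below ($K = 2$, with hypothetical hyperparameter values; note that NumPy's gamma and exponential samplers are parameterized by scale, so the rates above are inverted):

```python
import numpy as np

rng = np.random.default_rng(4)

# A minimal sketch of one draw from the prior hierarchy (K = 2);
# hyperparameter values are illustrative, not the paper's defaults.
def draw_gamma_contributions(p1, p2, D, a_tau=3.0, b_tau=1.0,
                             a_lam=3.0, b_lam=1.0, alpha=1.0, rng=rng):
    tau_g = rng.gamma(a_tau, 1.0 / b_tau)                    # global scale tau_gamma
    zeta = rng.dirichlet(np.full(D, alpha / D))              # component weights zeta
    gammas = []
    for d in range(D):
        lam = rng.gamma(a_lam, 1.0 / b_lam, size=2)          # lambda_k^(d), one per mode
        w1 = rng.exponential(2.0 / lam[0] ** 2, size=p1)     # w ~ Exp(rate = lam^2 / 2)
        w2 = rng.exponential(2.0 / lam[1] ** 2, size=p2)
        g1 = rng.normal(0.0, np.sqrt(tau_g * zeta[d] * w1))  # gamma_{1,j1}^(d)
        g2 = rng.normal(0.0, np.sqrt(tau_g * zeta[d] * w2))  # gamma_{2,j2}^(d)
        gammas.append((g1, g2))
    # Underlying hard-PARAFAC mean of B given the gamma draws
    return sum(np.outer(g1, g2) for g1, g2 in gammas), zeta

B_mean, zeta = draw_gamma_contributions(p1=5, p2=4, D=3)
```

The returned matrix is the underlying rank-$D$ mean $E[B \mid \Gamma, S, Z]$; sampling the $\beta$'s around it as in Equation 6 would then yield a draw of the full soft-PARAFAC coefficient tensor.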
First, in Proposition 1 we provide expressions for the induced prior expectation, variance, and covariance of the entries in $B$. These expressions illustrate the importance of certain hyperparameters in how the soft PARAFAC transitions away from its underlying, low-rank, hard version. Then, in Proposition 2 we provide default values for hyperparameters for a standardized 2-mode tensor predictor such that, a priori, $\mathrm{Var}(B_j) = V$, and the proportion of the variance that arises due to the proposed PARAFAC softening is equal to $AV$. Finally, in Section 4.5 we study statistical properties of the induced prior on $B$, and we show full support over a large class of coefficient tensors irrespective of the base rank $D$ used, which leads to consistent posterior distributions. All proofs are in Appendix A.

Proposition 1 For $j, j' \in \prod_{k=1}^K \{1, 2, \dots, p_k\}$ such that $j \neq j'$, we have that $E(B_j) = 0$, $\mathrm{Cov}(B_j, B_{j'}) = 0$, and for $a_\lambda > 2$,
$$\mathrm{Var}(B_j) = D \prod_{i=0}^{K-1} \frac{\alpha/D + i}{\alpha + i} \sum_{l=0}^{K} \binom{K}{l} \rho_l \left( \frac{2 b_\lambda^2}{b_\tau (a_\lambda - 1)(a_\lambda - 2)} \right)^l \left( \frac{a_\sigma}{b_\sigma} \right)^{K-l},$$
where $\rho_0 = 1$ and $\rho_l = a_\tau (a_\tau + 1) \cdots (a_\tau + l - 1)$ for $l \geq 1$.

Remark 1 The hyperparameters of the softening variance, $a_\sigma, b_\sigma$. Remember that $\sigma_k^2$ is the parameter driving the PARAFAC softening by allowing row-specific contributions to vary. From Proposition 1, it is evident that the prior of $\sigma_k^2$ influences the first two moments of $B_j$ only through its mean, $a_\sigma / b_\sigma$, with a higher prior expectation of $\sigma_k^2$ leading to a higher prior variance of $B_j$. Therefore, prior elicitation for $a_\sigma, b_\sigma$ can be based on the ratio $a_\sigma / b_\sigma$.

Remark 2 Variance of coefficient entries for the hard PARAFAC. For $\sigma_k^2 = 0$, the prior variance of the coefficient tensor entries is equal to the prior variance under the hard PARAFAC,
$$\mathrm{Var}_{\mathrm{hard}}(B_j) = D \prod_{i=0}^{K-1} \frac{\alpha/D + i}{\alpha + i} \, \rho_K \left( \frac{2 b_\lambda^2}{b_\tau (a_\lambda - 1)(a_\lambda - 2)} \right)^K.$$
Comparing the variance of $B_j$ based on the soft and hard PARAFAC quantifies the amount of additional flexibility that is provided by the PARAFAC softening, expressed as
$$AV = \frac{\mathrm{Var}(B_j) - \mathrm{Var}_{\mathrm{hard}}(B_j)}{\mathrm{Var}(B_j)} \in [0, 1).$$
We refer to this quantity as the additional variance. Motivated by Figure 3, hyperparameters should be chosen such that the induced prior assigns more weight to coefficient tensors that resemble low-rank factorizations, and achieves a sufficiently, but not overly, large variance for the regression coefficients. Proposition 2 provides such conditions on the hyperparameters for matrix predictors (K = 2). For a tensor predictor with K > 2, analogous conditions on the hyperparameters can be derived by following similar steps.

Proposition 2 For a matrix predictor, target variance $V \in (0, \infty)$ and target additional variance $AV \in [0, 1)$, if the hyperparameters satisfy $a_\lambda > 2$,
$$\frac{2 b_\lambda^2}{(a_\lambda - 1)(a_\lambda - 2)} = b_\tau \sqrt{\frac{V(1 - AV)}{a_\tau\, C\,(a_\tau + 1)}}, \qquad (7)$$
$$\frac{a_\sigma}{b_\sigma} = \sqrt{\frac{a_\tau V(1 - AV)}{C(a_\tau + 1)}} \left( \sqrt{1 + \frac{(a_\tau + 1)\, AV}{a_\tau (1 - AV)}} - 1 \right), \qquad (8)$$
where $C = (\alpha/D + 1)/(\alpha + 1)$, then we have that, a priori, $\mathrm{Var}(B_j) = V$ and the additional variance equals $AV$.

Proposition 2 is used in our simulations and in the study of brain connectomics to choose hyperparameters such that, a priori, $\mathrm{Var}(B_j) = 1$ and $AV = 10\%$, assuming a tensor predictor with standardized entries. Specifically, we set $a_\tau = 3$, $a_\sigma = 0.5$ and calculate the values of $b_\tau, b_\sigma$ for which $V = 1$ and $AV = 10\%$. These values correspond to $b_\tau \approx 6.3\sqrt{C}$ and $b_\sigma \approx 8.5\sqrt{C}$. We specify $a_\sigma = 0.5 < 1$ to encourage, a priori, smaller values of $\sigma^2_k$. Throughout, we use $\alpha = 1$ and $D = 3$, unless stated otherwise. For the hyperparameters controlling the underlying hard PARAFAC, we specify $a_\lambda = 3$ and $b_\lambda = \sqrt[2K]{a_\lambda}$. Lastly, assuming centered and scaled outcome and scalar covariates, we specify $(\mu, \delta^T)^T \sim N(0, I_{p+1})$, and residual variance $\tau^2 \sim IG(2, 0.35)$, which specifies $P(\tau^2 < 1) \approx 0.99$.

Remark 3 Interplay between variance hyperparameters. From Equation 8, we see that the prior mean of $\sigma^2_k$ depends on the target variance and the proportion of that variance that is attributable to the PARAFAC softening, and does not depend on the remaining hyperparameters (considering that $a_\tau/(a_\tau + 1) \approx 1$).
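As a concrete check of these defaults, the calculation in Proposition 2 can be scripted directly. The sketch below follows our reading of Equations 7 and 8 under a rate parameterization of the gamma priors; the function name is ours. With V = 1, AV = 10%, aτ = 3, aσ = 0.5, α = 1 and D = 3, it reproduces bτ ≈ 6.3√C and bσ ≈ 8.5√C.

```python
import math

def softer_hyperparams(V=1.0, AV=0.10, a_tau=3.0, a_sig=0.5,
                       a_lam=3.0, alpha=1.0, D=3, K=2):
    """Solve Equations 7-8 (matrix predictor, K = 2) for b_tau and b_sig
    given target variance V and target additional variance AV."""
    C = (alpha / D + 1) / (alpha + 1)                        # as in Proposition 2
    b_lam = a_lam ** (1 / (2 * K))                           # b_lambda = 2K-th root of a_lambda
    w_mean = 2 * b_lam ** 2 / ((a_lam - 1) * (a_lam - 2))    # prior mean of w
    # Equation 7, solved for b_tau.
    b_tau = w_mean * math.sqrt(a_tau * C * (a_tau + 1) / (V * (1 - AV)))
    # Equation 8 gives the prior mean a_sig / b_sig of sigma_k^2.
    m = math.sqrt(a_tau * V * (1 - AV) / (C * (a_tau + 1)))
    ratio = m * (math.sqrt(1 + (a_tau + 1) * AV / (a_tau * (1 - AV))) - 1)
    b_sig = a_sig / ratio
    return C, b_tau, b_sig

C, b_tau, b_sig = softer_hyperparams()
print(round(b_tau / math.sqrt(C), 1), round(b_sig / math.sqrt(C), 1))  # 6.3 8.5
```

Under these defaults the two scale hyperparameters depend on α and D only through the common factor √C, consistent with the separation of hard and soft variance components discussed next.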
Furthermore, Equation 7 only includes hyperparameters which drive the underlying hard PARAFAC, and it depends on $V$ and $AV$ only through $V(1 - AV)$, which expresses the prior variability in $B$ that is not attributed to the PARAFAC softening. Therefore, in Softer there is a clear and desirable separation between the hard and soft PARAFAC variance hyperparameters. Lastly, since $2b_\lambda^2 / \{(a_\lambda - 1)(a_\lambda - 2)\}$ is the prior mean of $w^{(d)}_{k,j_k}$, Equation 7 illustrates the interplay between the two components in the variance $V(1 - AV)$ of the underlying hard PARAFAC structure: when the prior mean of $\tau_\gamma$ increases, the prior mean of $w^{(d)}_{k,j_k}$ has to decrease in order to maintain the variance due to the underlying PARAFAC at $V(1 - AV)$.

4.5 Posterior Consistency and Softer's Robustness to the Underlying Rank

In this section, we focus on the choice of the rank D for Softer. We argue that results based on Softer are likely robust to small changes in the choice of the underlying rank. To support this claim, we first provide two intuition-based arguments, and then a theoretical one. Finally, Softer's robustness to the choice of D is empirically illustrated in simulated examples in Section 5. When employing the hard PARAFAC, Guhaniyogi et al. (2017) recommended using D = 10 for a predictor of dimension 64 × 64. The reason is that the prior on ζ permits some flexibility in the number of components that contribute to the coefficient matrix approximation, in that (a) if the matrix can be well-approximated by a rank lower than D, the prior leads to a reduction in the approximation's effective rank, and (b) if all D components are useful in estimation, all of them acquire sufficient weight. Since Softer employs the same prior on ζ, it also allows for sparsity in the effective components of the underlying hard PARAFAC: if the true coefficient tensor is of rank lower than D, higher-order components will be heavily shrunk towards zero, and softening will be minimal.
This claim is further supported in the simulations of Section 5.1, where we illustrate that Softer reverts to the hard PARAFAC when the underlying low-rank structure is true. However, if the true coefficient tensor is not of low rank, Softer can expand away from a low-rank structure in two ways: it can use additional components in the underlying hard PARAFAC, or it can soften away from the rigid underlying low-rank structure. The deviations based on the PARAFAC softening can effectively capture components corresponding to singular values of any magnitude. In Figure 4 we illustrate the range of singular values that would be accommodated when expanding away from a rank-$D_1$ hard PARAFAC approximation by (1) increasing the hard PARAFAC rank, and (2) softening the PARAFAC. Increasing the hard PARAFAC rank would include components corresponding to some small singular values, but softening the PARAFAC accommodates deviations from the underlying rank-$D_1$ structure across all singular values. Since Softer based on an underlying rank $D$ can capture these arbitrary deviations from a rank-$D$ structure, increasing the rank $D$ in Softer's underlying PARAFAC is not expected to drastically alter the results, making Softer robust to the choice of rank $D$. This intuition is empirically evaluated in Section 5.2.

Figure 4: Hypothetical histogram of the singular values of a matrix. A rank $D_1$ PARAFAC approximation incorporates the components which correspond to the $D_1$ largest singular values. Increasing the rank to $D_2$ allows for $D_2 - D_1$ additional components, whereas Softer allows for deviations corresponding to any singular value.

Softer's robustness to the choice of the underlying rank $D$ and its ability to capture any true coefficient tensor is also evident in the following theoretical results.
In Proposition 3 we show that the prior on $B$, $\pi_B$, has full $L_\infty$ prior support, in that it assigns positive prior weight to any $\epsilon$-sized neighborhood of any true coefficient tensor $B_0$. This indicates that, even if the value of $D$ is lower than the coefficient tensor's true rank, Softer assigns positive prior weight to a neighborhood of the true tensor. Full prior support is a key element for establishing sufficient flexibility of a Bayesian procedure. In turn, the posterior distribution of the coefficient tensor is consistent, irrespective of the true coefficient tensor's rank or the rank used in Softer's underlying PARAFAC.

Proposition 3 (Full prior support) Let $\epsilon > 0$. Then $\pi_B\big(B^\epsilon_{B_0}\big) > 0$, where $B^\epsilon_{B_0} = \{B : \max_j |B_{0,j} - B_j| < \epsilon\}$.

We assume that the true data generating model is Equation 2 with true coefficient tensor $B_0$, and that the tensor predictor $X$ has bounded entries. Since our interest is in estimating $B_0$, we assume that $\tau^2 = 1$, $\mu = 0$ and $\delta = 0$ are known. Then, we have the following.

Proposition 4 The posterior distribution of $B$ is weakly consistent for $B_0$.

These results indicate that softening the PARAFAC allows us to capture any true coefficient tensor. Since they hold irrespective of the choice of the underlying rank, they also indicate that Softer is robust to the choice of rank $D$, at least asymptotically.

4.6 Approximating the Posterior Distribution of the Coefficient Tensor

We approximate the posterior distribution of $B$ using Markov chain Monte Carlo (MCMC). An MCMC scheme where most parameters are updated using Gibbs sampling is shown in Appendix C. We found this approach to be sufficiently efficient when the sample size is larger than, or of similar order to, the number of parameters. However, in very high-dimensional settings where $p \gg n$, mixing and convergence were slow under reasonable time constraints.
For that reason, and in order to provide a sampling approach that performs well across n, p settings, we instead rely on Hamiltonian Monte Carlo (HMC) implemented in Stan (Carpenter et al., 2017) and its R interface (Stan Development Team, 2018) to acquire samples from the posterior distribution. HMC is designed to improve mixing relative to Gibbs sampling by employing simultaneous updates, and by relying on gradients calculated with automatic differentiation to obtain efficient proposals. Using Stan, we employ the No-U-Turn sampler (NUTS) algorithm (Hoffman and Gelman, 2014), which automatically tunes the HMC parameters to achieve a target acceptance rate (the default step size adaptation parameters were used). If MCMC convergence is slow, one could increase the NUTS parameter δ in RStan from its default of 0.8 to a value closer to 1. MCMC convergence was assessed based on visual inspection of traceplots across chains with different starting values, and on the potential scale reduction factor (Gelman and Rubin, 1992) for the regression coefficients µ, δ, B and the residual variance τ². Note that the remaining parameters are not individually identifiable: the underlying PARAFAC parameters are themselves non-identifiable, and therefore the corresponding softening parameters are also non-identifiable. In simulations, we considered a model that restricts the underlying PARAFAC parameters to make them identifiable (using the constraints discussed in Section 2.3 of Guhaniyogi et al. (2017)). We found that the original and constrained models had identical estimation performance, but that the MCMC for the unconstrained model including non-identifiable parameters required a smaller number of iterations to converge than the constrained model that uses the identifiable underlying PARAFAC.

5. Simulations

To illustrate the performance of Softer and compare it against alternatives, we simulated data under various scenarios.
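The shared data-generating recipe of these simulations can be sketched in a few lines. The dimensions, the iid N(0, 1) predictor entries, and the residual variance τ² = 0.5 follow the description below; the diagonal-style true coefficient matrix is only an illustrative stand-in for the coefficient images used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, tau2 = 32, 400, 0.5            # predictor dimension, sample size, residual variance

# Illustrative full-rank, sparse "diagonal"-style truth (a stand-in for the
# squares/feet/dog/diagonal coefficient images used in the simulations).
B_true = np.diag(rng.choice([-1.0, 1.0], size=p))

X = rng.standard_normal((n, p, p))                        # iid N(0, 1) predictor entries
signal = np.tensordot(X, B_true, axes=([1, 2], [0, 1]))   # <X_i, B> for each unit i
y = signal + rng.normal(0.0, np.sqrt(tau2), size=n)       # outcome from the linear model
```

Each simulated data set consists of the tensor predictors X and the scalar outcomes y; only the true coefficient matrix and the dimensions change across scenarios.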
In one set of simulations (Section 5.1), we considered a tensor predictor of dimension 32 × 32, corresponding coefficient tensors ranging from close to low-rank to full-rank and with different degrees of sparsity, and a sample size equal to 400. In another set of simulations (Section 5.2), we considered a tensor predictor of dimension 20 × 20 and corresponding coefficient tensors of rank 3, 5, 7, 10 and 20, in order to investigate the performance of Softer relative to the hard PARAFAC as the true coefficient tensor increasingly deviates from low-rank form, and for various choices of the algorithmic rank D. The sample size in this setting was 200. In all scenarios, the predictor's entries were drawn independently from a N(0, 1) distribution, and the outcome was generated from a model of the form of Equation 2 with true residual variance τ² = 0.5. We considered the following methods: (a) Softer, (b) the Bayesian hard PARAFAC approach of Guhaniyogi et al. (2017), and (c) estimating the coefficient tensor by vectorizing the predictor and performing Lasso. We considered these two competing approaches because they represent the two extremes of how much prioritization is given to the predictor's array structure (the hard PARAFAC directly depends on it, the Lasso completely ignores it), whereas Softer is designed to exploit the predictor's structure while allowing deviations from it. We also considered (d) the Bayesian tensor regression approach of Spencer (2020), which is based on the Tucker decomposition, a generalization of the PARAFAC decomposition. Methods were evaluated in terms of how well they estimated the true coefficient tensor B by calculating (1) the entry-specific bias and mean squared error of the posterior mean (for the Bayesian approaches) and the penalized likelihood estimate (for the Lasso), and (2) the frequentist coverage of the 95% credible intervals.
In order to evaluate the methods' performance in accurately identifying important entries (entries with non-zero coefficients), we calculated the methods' (3a) sensitivity (percentage of important entries that were identified), (3b) specificity (percentage of non-important entries that were correctly deemed non-important), (3c) false positive rate (percentage of identified entries that are truly not important), and (3d) false negative rate (percentage of non-identified entries that are important). For the Bayesian methods, an entry was flagged as important if its corresponding 95% credible interval did not overlap with zero. Hierarchical Bayesian models have been shown to automatically adjust for multiple testing error (Scott and Berger, 2010; Müller et al., 2006). Confidence intervals and entry selection for the Lasso were not considered. We also evaluated the models' predictive performance using (4) the predictive mean squared error, defined as the mean of the squared difference between the true outcome and the predictive mean over 1,000 new data points. Lastly, (5) we compared the hard PARAFAC and Softer in terms of their computational time, for various ranks D. Additional simulations are shown in Appendix D and are summarized below, where applicable.

5.1 Simulation Results for Tensor Predictor of Dimension 32 × 32

The scenarios we considered represent settings of varying complexity and sparsity. The first column of Figure 5 shows the true coefficient tensors (squares, feet, dog, diagonal). The squares coefficient matrix is used as a scenario where the true coefficient matrix is rectangular, but not low-rank, and not sparse. The next two scenarios represent situations where the underlying structure is not of low-rank form, but could potentially be approximated by a low-rank matrix up to a certain degree, hence representing scenarios somewhat favorable to the hard PARAFAC.
In the last case, the diagonal coefficient matrix represents a sparse, full-rank coefficient matrix without a network structure, a scenario that is favorable for the Lasso but is expected to be difficult for the hard PARAFAC. Therefore, the scenarios considered here span a wide range of rank and sparsity specifications, and we investigate Softer's ability to use a low-rank structure when such structure is useful (feet, dog) and to expand away from it when it is not (squares, diagonal). Even though none of these coefficient tensors is low-rank, we consider a scenario with a low-rank coefficient matrix in the Appendix, and we discuss it briefly below. The remaining columns of Figure 5 show the average posterior mean or penalized estimate across simulated data sets. Results for Softer and the hard PARAFAC correspond to D = 3, though the methods were also run with rank D = 7 (those results are shown in the Appendix and discussed below). In the squares, feet and dog scenarios, the hard PARAFAC performs decently in providing a low-rank approximation to the true coefficient matrix. However, certain coefficient entries are estimated poorly in order to conform to its rectangular, low-rank structure. This is most evident in the squares scenario, where the non-zero coefficients are obscured in order to fit a low-rank form. Results for the approach based on the Tucker decomposition are worse, with estimated coefficient matrices that deviate further from the truth. In the diagonal scenario, the hard PARAFAC and Tucker approaches completely miss the diagonal structure and estimate (on average) a coefficient matrix that is very close to zero. In contrast, the Lasso performs best in the sparse diagonal scenario and, on average, estimates the correct coefficient matrix structure.
However, in the squares, dog and feet settings, the Lasso underestimates the coefficient matrix, since it relies on an assumed sparsity that does not hold and does not borrow any information across coefficient matrix entries. In all situations, Softer closely identifies the structure of the underlying coefficient matrix, providing a compromise between tensor-based and unstructured estimation and having small biases across all simulated scenarios (average bias also reported in Table 1).

Figure 5: True and average estimated coefficient matrix based on Softer, hard PARAFAC, Tucker regression, and Lasso.

In the squares, feet and dog scenarios, Softer bases estimation on the low-rank structure of the underlying hard PARAFAC, but expands from it to better describe the true coefficient matrix's details. At the same time, Softer also performs well in the diagonal scenario, where the true coefficient tensor is full-rank. Therefore, the strength of Softer lies in its ability to use the low-rank structure of the PARAFAC when necessary, and to diverge from it when needed. In Table 1, we report the methods' bias and root mean squared error (rMSE), predictive mean squared error, and frequentist coverage of the 95% credible intervals. The conclusions remain unchanged: Softer performs similarly to the hard PARAFAC when the underlying low-rank structure is close to true, and has the ability to diverge from it and estimate a coefficient tensor that is not low-rank in the other scenarios. This is evident from an average coverage of the 95% posterior credible intervals of 94.7% in the diagonal scenario. In terms of predictive ability, the pattern observed for the bias and mean squared error persists, with Softer having the smallest mean squared predictive error in the squares, feet and dog scenarios, and the Lasso in the diagonal scenario, followed closely by Softer.
Across all scenarios, the approach based on the Tucker decomposition performs worst among the Bayesian methods in terms of both estimation and predictive accuracy.

                                     Softer   PARAFAC   Tucker    Lasso
squares   Truly zero      bias        0.003    0.005     0.013    0.012
                          rMSE        0.033    0.05      0.05     0.143
                          coverage    99.7%    98.3%     99.8%      --
          Truly non-zero  bias        0.084    0.106     0.136    0.501
                          rMSE        0.11     0.148     0.166    0.601
                          coverage    80.1%    68.4%     90.7%      --
          Prediction MSE              5.05     8.99     12.16    111.8
feet      Truly zero      bias        0.037    0.046     0.057    0.016
                          rMSE        0.092    0.109     0.102    0.198
                          coverage    96.9%    94.2%     95.5%      --
          Truly non-zero  bias        0.116    0.138     0.175    0.43
                          rMSE        0.184    0.21      0.226    0.558
                          coverage    89.2%    80%       84.8%      --
          Prediction MSE             31.9     41.8      50.6     264.6
dog       Truly zero      bias        0.047    0.059     0.083    0.029
                          rMSE        0.111    0.129     0.129    0.151
                          coverage    98.0%    92.8%     95.5%      --
          Truly non-zero  bias        0.129    0.159     0.214    0.350
                          rMSE        0.205    0.230     0.261    0.435
                          coverage    88.2%    76.9%     78.5%      --
          Prediction MSE             35.0     45.4      61.7     138.4
diagonal  Truly zero      bias        0.002    0.004     0.002   <0.001
                          rMSE        0.02     0.051     0.024    0.009
                          coverage    100%     100%      100%       --
          Truly non-zero  bias        0.111    0.899     0.954    0.07
                          rMSE        0.126    0.906     0.955    0.084
                          coverage    94.7%    3%        0.8%       --
          Prediction MSE              1.41    29.7      30.6       0.81

Table 1: Simulation results. Average bias, root mean squared error, and frequentist coverage of 95% credible intervals among truly zero and truly non-zero coefficient entries, and predictive mean squared error, for Softer (with D = 3), the hard PARAFAC (with D = 3), Tucker regression, and Lasso, in the simulation scenario with a tensor predictor of dimensions 32 × 32 and sample size n = 400. Bold text is used for the approach performing best in each scenario and for each metric; if no entry is bold, no conclusive argument can be made.

Table 2 shows the performance of Softer, the hard PARAFAC, and Tucker regression for identifying the important entries of the tensor predictor (entries with non-zero coefficients).
Perfect performance would imply specificity and sensitivity equal to 100, and false positive and negative rates (FPR, FNR) equal to 0. The methods perform comparably in terms of specificity, sensitivity, and FNR, except in the diagonal scenario, where the sensitivity of the hard PARAFAC and Tucker approaches is dramatically lower. The big difference, however, is in the FPR. Even though Softer's FPR is at times higher than 5%, it remains at much lower levels than that of the hard PARAFAC and Tucker approaches, for which the FPR is on average over 10% in the dog scenario and approximately 30% in the diagonal scenario. These results indicate that identifying important entries based on the hard PARAFAC can lead to overly high false discovery rates, whereas Softer alleviates this issue. In Appendix D.1 we investigate the cases where Softer and the hard PARAFAC return contradictory results about an entry's importance. We illustrate that the entries identified as important by the PARAFAC but not by Softer are most often entries with small coefficient values, flagged in an effort to fit the coefficient matrix into a low-rank form. Appendix D shows additional simulation results. Appendix D.2 shows results for alternative coefficient matrices, including a coefficient matrix of rank 3, a scenario favorable to the hard PARAFAC. There, we see that Softer collapses to the underlying low-rank structure when such a structure is true. These simulation results are in agreement with the first intuition-based argument of Section 4.5 regarding the choice of rank D for Softer. Appendix D.3 shows results for a subset of the coefficient matrices and for sample size n = 200.
             Sensitivity   Specificity   FPR                FNR
squares
  Softer         100          99.7        0.9 (0, 2.5)       0
  PARAFAC        100          98.3        4.7 (1.2, 8.7)     0
  Tucker         100          99.9        0.3 (0, 0.4)       0
feet
  Softer          64.5        96.9        2.9 (1.7, 4.1)    36.3
  PARAFAC         68.4        94.1        5.2 (3.3, 7.1)    34.3
  Tucker          63.6        95.5        4.4 (3.3, 5.5)    37.2
dog
  Softer          52.9        96.7        5.2 (2.7, 8.1)    34.7
  PARAFAC         63.1        90.1       12.4 (8.4, 15.9)   30.9
  Tucker          46.3        93.1       11.5 (5.2, 17.6)   38.6
diagonal
  Softer         100         100          0 (0, 0)           0
  PARAFAC          3         100         28.8 (0, 70)        3
  Tucker         < 1         100         33.3 (10, 50)       3.1

Table 2: Methods' performance in identifying important entries. For sensitivity, specificity and false negative rate (FNR), results are shown as averages across simulated data sets (× 100); for the false positive rate (FPR), as average (10th, 90th percentile) (× 100). The average FPR is taken over simulated data sets for which at least one entry was identified as important. Most coefficients in the dog simulation were non-zero; results are presented considering coefficients smaller than 0.05 as effectively zero.

Simulations with a smaller n-to-p ratio show that Softer performs comparably to the hard PARAFAC in the dog and feet scenarios, and has substantially smaller bias and rMSE for the truly non-zero coefficients in the diagonal scenario. The most notable conclusion is that the Softer results are closer to the hard PARAFAC results when the sample size is small. This indicates that the data inform the amount of PARAFAC softening, which depends on the sample size. Finally, Appendix D.4 shows results for Softer and the hard PARAFAC with D = 7. Even though the hard PARAFAC improves substantially with the higher rank, Softer performs almost identically for rank 3 or 7, illustrating its robustness to the choice of the underlying rank. This last conclusion is further investigated in Section 5.2.
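The entry-selection metrics reported in Table 2 follow directly from their definitions in the text. Below is a minimal sketch; the function name and the toy numbers are ours, not from the paper.

```python
import numpy as np

def selection_metrics(ci_lower, ci_upper, important):
    """Flag an entry when its 95% credible interval excludes zero, then
    compute sensitivity, specificity, FPR (share of flagged entries that
    are truly unimportant), and FNR (share of unflagged entries that are
    truly important), each as a percentage."""
    flagged = (np.asarray(ci_lower) > 0) | (np.asarray(ci_upper) < 0)
    imp = np.asarray(important, dtype=bool)
    sens = 100 * np.mean(flagged[imp])
    spec = 100 * np.mean(~flagged[~imp])
    fpr = 100 * np.mean(~imp[flagged]) if flagged.any() else float("nan")
    fnr = 100 * np.mean(imp[~flagged]) if (~flagged).any() else float("nan")
    return tuple(float(v) for v in (sens, spec, fpr, fnr))

# Toy example: four entries, two truly important, one of which is missed.
lo = np.array([0.1, -0.2, -0.5, -0.1])
hi = np.array([0.5,  0.3, -0.1,  0.2])
print(selection_metrics(lo, hi, [True, True, False, False]))  # (50.0, 50.0, 50.0, 50.0)
```

Note that, with these definitions, the FNR is computed over the non-identified entries rather than over the important ones, so it need not equal 100 minus the sensitivity.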
5.2 Simulation Results for Coefficient Tensor of Increasing Rank

We evaluated the performance of the hard and soft PARAFAC tensor regression for various values of the algorithmic rank D and of the coefficient tensor's true rank. We considered a tensor predictor of dimension 20 × 20, true coefficient matrices of rank 3, 5, 7, 10 and 20, and the hard PARAFAC and Softer with D ∈ {1, 3, 5, 7, 10}. For every value of the true rank, we generated 100 data sets. In Figure 6, we show the average across entries of the coefficient matrix of the absolute bias and mean squared error, as well as the predictive mean squared error. For illustrative simplicity, we include a subset of the results (Softer with D = 3, and the hard PARAFAC with D = 3 and D = 5); the complete results are included in the Appendix and discussed below. When the true rank of the coefficient matrix is 3, the three approaches perform similarly. This indicates that both the hard PARAFAC with D = 5 and Softer are able to revert to lower ranks when this is appropriate. For true rank equal to 5 or 7, the hard PARAFAC with D = 5 slightly outperforms Softer. However, for true rank greater than 7, Softer based on a rank-3 underlying structure performs best, both in estimation and in prediction. These results indicate that, in realistic situations where the coefficient tensor is not of low-rank form, Softer with a low rank can capture the coefficient tensor's complex structure more accurately than the hard PARAFAC.

Figure 6: Average absolute bias, estimation mean squared error and predictive mean squared error (y-axis) for a tensor predictor of dimensions 20 × 20 and true coefficient matrices of increasing rank (x-axis). Results are shown for the hard PARAFAC with D = 3 and D = 5, and for Softer with D = 3.

In Appendix D.5, we show the performance of Softer and the hard PARAFAC across all values of D considered, D ∈ {1, 3, 5, 7, 10}.
The conclusions about the methods' relative performance remain unchanged: even though Softer and the hard PARAFAC perform similarly for coefficient tensors with small true rank, Softer with D > 1 outperforms the hard PARAFAC in terms of both estimation and prediction when the coefficient tensor is of high rank. These results illustrate that the performance of the hard PARAFAC depends heavily on the choice of rank D. In contrast, Softer's performance is strikingly similar across all values of the algorithmic rank, supporting the second intuition-based argument on Softer's robustness in Section 4.5. Finally, Table 3 shows the average time across simulated data sets for 15,000 MCMC iterations for the two methods. As expected, the computational time for both approaches increases with D, whereas the true rank of the coefficient tensor does not seem to play a role. Softer generally requires two to three times as much time as the hard PARAFAC for the same value of D, which might seem computationally prohibitive at first. However, our results in Figure 6 and Appendix D.5 show that Softer requires a smaller value of D in order to perform similarly to, or better than, the hard PARAFAC in terms of both estimation and prediction. Therefore, Softer's computational burden can be drastically alleviated by fitting the method with a value of D that is substantially lower than the value of D needed for the hard PARAFAC. In Section 7, we also discuss frequentist counterparts to our model that might further reduce the computational load.

6. Estimating the Relationship Between Brain Connectomics and Human Traits

Data from the Human Connectome Project (HCP) contain information on about 1,200 healthy young adults, including age, gender, various brain imaging data, and a collection of measures assessing cognition, personality, substance intake and so on (referred to as traits).
                        True rank of coefficient tensor
                          3     5     7    10    20
Softer         D = 1     28    29    30    37    39
               D = 3    166   110   101   104   117
               D = 5    181   149   158   159   152
               D = 7    212   189   205   216   220
               D = 10   275   236   255   288   259
Hard PARAFAC   D = 1     16    16    15    15    15
               D = 3     30    39    46    35    36
               D = 5     69    54    57    53    56
               D = 7    100    90    82    72    76
               D = 10   104   117   108   104   102

Table 3: Computational time (in minutes) for Softer and the hard PARAFAC.

We are interested in studying the brain structural connectome, referring to anatomical connections of brain regions via white matter fiber tracts. The white matter fiber tracts can be indirectly inferred from diffusion MRI data. Two brain regions are considered connected if there is at least one fiber tract running between them. However, there can be thousands of fiber tracts connecting a pair of regions. Properties of the white matter tracts in a connection, such as the number of tracts and the patterns of entering the regions, might be informative about an individual's traits. Using data from the HCP and the soft tensor regression framework, we investigate the relationship between different connectome descriptors and human traits. Structural connectivity data were extracted using state-of-the-art pipelines in Zhang et al. (2018). In total, about 20 connectome descriptors (adjacency matrices) describing different aspects of white matter fiber tract connections were generated (see Zhang et al. (2018) for more information on the extracted descriptors). Each adjacency matrix has dimension 68 × 68, representing the connection patterns among R = 68 regions. The 68 regions were defined using the Desikan-Killiany atlas (Desikan et al., 2006).
Of the 20 extracted connectome features, we consider two in this analysis: (a) count, describing the number of streamlines, and (b) connected surface area (CSA), describing the area covered by small circles at the intersections of fiber tracts and brain regions, since these are the most predictive features according to the results in Zhang et al. (2019). We examine the relationship between these descriptors of structural brain connections and 15 traits, covering domains such as cognition, motor, substance intake, psychiatric and life function, emotion, personality and health. The full list of outcomes we analyze is presented in Table E.1 and includes both binary and continuous traits. For binary traits, a logistic link function is assumed.

6.1 Adapting Softer for (Semi-)Symmetric Brain Connectomics Analysis

The nature of the brain connectivity data implies that the R × R-dimensional tensor predictor including a specific connectivity feature among R regions of interest (ROIs) is symmetric, and the diagonal elements can be ignored since self-loops are not considered. Further, considering p features simultaneously would lead to an R × R × p tensor predictor which is semi-symmetric (symmetric along its first two modes). The (semi-)symmetry of the predictor allows us to slightly modify Softer and reduce the number of parameters by imposing that the estimated coefficient tensor B is also (semi-)symmetric. We provide the technical details for the (semi-)symmetric Softer in Appendix F.

6.2 Analyses of the Brain Connectomics Data

For the purpose of this paper, we investigate the relationship between features of brain connections and human traits by regressing each outcome on each of the two predictors (count and CSA) separately. Even though analyzing the relationship between the traits and multiple features simultaneously is possible, we avoid doing so here for simplicity.
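The symmetry restriction of Section 6.1 amounts to the two tensor modes that index brain regions sharing their factors. The following is only an illustrative sketch of the resulting symmetric base structure, with signed component weights introduced by us so that the matrix is not restricted to be positive semi-definite; the exact symmetric construction, including the softening, is in Appendix F.

```python
import numpy as np

rng = np.random.default_rng(2)
R, D = 68, 6                      # ROIs (Desikan-Killiany atlas) and base rank

# One factor matrix serves both region modes, so the baseline coefficient
# matrix is symmetric by construction.
G = rng.standard_normal((R, D))
w = rng.choice([-1.0, 1.0], size=D)
B = sum(w[d] * np.outer(G[:, d], G[:, d]) for d in range(D))
np.fill_diagonal(B, 0.0)          # diagonal ignored: self-loops are not considered

assert np.allclose(B, B.T)        # symmetric, matching the connectome predictor
```

Sharing one factor matrix across the two region modes roughly halves the number of underlying PARAFAC parameters relative to the unconstrained model.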
We analyze the data employing the following methods: (1) symmetric Softer with D = 6, (2) the hard PARAFAC approach of Guhaniyogi et al. (2017), which does not impose symmetry of the coefficient tensor, with D = 10, and (3) Lasso on the vectorized lower triangular part of the tensor predictor. Since publicly available code for the hard PARAFAC approach does not accommodate non-continuous outcomes, we only consider it when predicting continuous outcomes. We compare the methods in terms of their predictive performance. For each approach, we estimate the out-of-sample prediction error by performing 15-fold cross-validation, fitting the method on 90% of the data and predicting the outcome on the remaining 10%. In the case of Softer and the hard PARAFAC, we also investigate the presence of specific brain connections that are important in predicting any of the outcomes by checking whether the 95% posterior credible intervals of their coefficients include zero. Additional results based on Softer for a different choice of baseline rank, or when symmetry is ignored, are included in Appendix E and summarized below.

6.3 Using Features of Brain Connections for Predicting Human Traits

For continuous outcomes, the methods' predictive performance was evaluated by calculating the percentage of the marginal variance explained by the model, defined as 1 − (CV MSE)/(marginal variance). For binary outcomes, we used the model's estimated linear predictor to estimate the optimal cutoff for classification based on Youden's index (Youden, 1950), and calculated the average percentage of correctly classified observations in the held-out data. Figure 7 shows the results for the three approaches considered, and for each feature separately. For most outcomes, one of the two features appeared to be most predictive of the outcome across approaches.
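The two evaluation criteria just described can be sketched as follows. The helper names are ours, and the cutoff search simply maximizes Youden's J over the observed scores.

```python
import numpy as np

def variance_explained(y_true, y_pred):
    """Continuous outcomes: [1 - CV-MSE / marginal variance] * 100."""
    mse = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    return float(100 * (1 - mse / np.var(y_true)))

def youden_cutoff(score, label):
    """Binary outcomes: cutoff on the linear predictor maximizing
    Youden's J = sensitivity + specificity - 1 (Youden, 1950)."""
    score, label = np.asarray(score), np.asarray(label)
    best_c, best_j = None, -np.inf
    for c in np.sort(score):
        pred = score >= c
        j = np.mean(pred[label == 1]) + np.mean(~pred[label == 0]) - 1
        if j > best_j:
            best_c, best_j = float(c), float(j)
    return best_c

print(variance_explained([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]))  # 100.0
```

Held-out observations with scores at or above the selected cutoff are classified as positive, and the share of correct classifications is averaged across the cross-validation folds.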
For example, the count of streamlines was more predictive than CSA of an individual's anger level (Ang Hostil), independently of the modeling approach used. By examining the methods' predictive performance, it is evident that features of brain connectomics are, in some cases, highly predictive of outcomes. Specifically, over 30% of the variance in an individual's strength level, and over 10% of the variance in endurance, reading comprehension, and picture vocabulary ability, can be explained by the count or CSA of streamlines of their brain connections. No single approach outperformed the others in prediction across all features and outcomes. However, approaches that accommodate the network structure perform better than Lasso in most situations. One example is Softer's performance relative to Lasso when predicting individuals' previous depressive episode (Depressive Ep): Lasso performs worse than a random classifier, whereas Softer has over 90% accuracy. Even when the number of observations is less than 300 (difficulty quitting tobacco, TB DSM Difficulty Quitting), Softer performs only slightly worse than Lasso. For continuous outcomes, Softer and the hard PARAFAC perform comparably. As we saw in the simulations of Section 5 and Appendix D.3, the similar predictive performance of Softer and the hard PARAFAC could be due to the limited sample size, which forces Softer to rely heavily on the underlying low-rank structure for estimation, essentially reverting to the hard PARAFAC. The low signal in predicting some outcomes implies low power in identifying pairs of brain regions whose connection features are important. In fact, the 95% credible intervals for all coefficients based on the hard PARAFAC overlapped with zero.
In contrast, Softer identified seven important connections: five of them were for predicting whether an individual has had a depressive episode (three using count of streamlines as the predictor, and two using CSA), one in predicting an individual's strength, and one in predicting the Variable Short Penn Line Orientation test (VSPLOT).

Figure 7: Top: Percentage of outcome variance explained by the tensor predictor for continuous outcomes, calculated as [1 − MSE/(marginal variance)] × 100. Bottom: Average percentage of units correctly classified for binary outcomes. Results are presented using a different color for each method, and a different line type for each feature of brain connections.

The identified connections are listed in Table 4 and agree with the literature in neuroscience. All identified connections in predicting a depressive episode involve the parahippocampal, which is the posterior limit of the amygdala and hippocampus and is located in the temporal lobe, and ROIs located in the frontal lobe (paracentral, lateral orbitofrontal, pars orbitalis). Dysfunction of the parahippocampal (as well as the amygdala and hippocampus) has been identified in various studies as an important factor in major depression and emotion-related memory observed in depression (Mayberg, 2003; Seminowicz et al., 2004; LaBar and Cabeza, 2006; Zeng et al., 2012).
Further, dysregulation of the pathways between the frontal and temporal lobes has been identified as predictive of depression (Mayberg, 1994; Steingard et al., 2002), even when explicitly focusing on the cortical regions Softer identified as important (Liao et al., 2013). Two of the three connections were identified as important irrespective of the tensor predictor used (count or CSA).

Outcome             Feature  ROI 1                                   ROI 2
Depressive Episode  Count    (lh) Parahippocampal                    (rh) Lateral Orbitofrontal
                             (lh) Parahippocampal                    (lh) Paracentral
                             (lh) Parahippocampal                    (lh) Pars Orbitalis
                    CSA      (lh) Parahippocampal                    (rh) Lateral Orbitofrontal
                             (lh) Parahippocampal                    (lh) Paracentral
Strength            Count    (rh) Precuneus                          (rh) Superior Parietal
VSPLOT              CSA      (lh) Banks of Superior Temporal Sulcus  (lh) Rostral Anterior Cingulate

Table 4: Brain connections with important features in predicting human traits.

Even though these predictors are related, they describe different aspects of brain connectivity. Therefore, Softer identified the same connections as important based on two separate measures of brain connectivity. The identified connection in predicting strength involves the precuneus and superior parietal regions in the parietal lobe. Precuneus connectivity has been associated with a variety of human functions, including motor-related traits (Cavanna and Trimble, 2006; Wenderoth et al., 2005; Simon et al., 2002), and the parietal lobe in general is believed to control humans' motor system (Fogassi and Luppino, 2005). Appendix E.2 includes results from the symmetric Softer using a smaller rank (D = 3), and from Softer using the same rank as the results in this section (D = 6) but ignoring the known symmetry of the predictor. All three versions of Softer perform similarly in terms of prediction, with potentially slightly lower predictive power for the symmetric Softer with rank 3. The entries identified as important by Softer with D = 3 are similar, though not identical, to the ones in Table 4.
When symmetry is not accounted for, Softer does not identify any important connections. Finally, in Appendix E.3 we investigate the reproducibility of identified connections across random subsamples including 90% of the observations. We find that for strength, when connections are identified in a subsample, they generally include the ones in Table 4. Moreover, when predicting whether an individual has had a depressive episode, any of the connections identified across any subsample and for either predictor involved at least one of the Parahippocampal (in the left hemisphere) or Lateral Orbitofrontal (in the right hemisphere) ROIs, implying that the importance of these two regions is strikingly reproducible.

7. Discussion

In this paper, we considered modeling a scalar outcome as a function of a tensor predictor within a regression framework. Estimation of regression models in high dimensions is generally based on some type of assumed sparsity of the true underlying model: sparsity directly on covariates, or latent sparsity by assuming a low-dimensional structure. When the assumed sparsity does not hold, the model's predictive ability and estimation can suffer. Our approach is positioned within the class of latent sparsity, since it exploits a low-dimensional underlying structure. We focused on adequately relaxing the assumed structure by softening the low-dimensional PARAFAC approximation and allowing for interpretable deviations of row-specific contributions. We show that softening the PARAFAC leads to improved estimation of coefficient tensors, better performance in identifying important entries, consistent estimation irrespective of the underlying rank used for estimation, and more accurate predictions. The approach is applicable to both continuous and binary outcomes, and was adapted to (semi-)symmetric tensor predictors, which are common in settings where the predictor is measured on a network of nodes.
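As a concrete note on symmetric predictors: when the tensor is a network adjacency matrix, unstructured competitors such as the Lasso comparison above operate on its vectorized lower triangle, one feature per undirected connection. A minimal sketch (hypothetical helper, not the paper's code):

```python
import numpy as np

def lower_triangle_features(adjacency):
    """Vectorize the strictly lower triangular part of a symmetric
    adjacency matrix: one feature per undirected connection."""
    A = np.asarray(adjacency)
    rows, cols = np.tril_indices(A.shape[0], k=-1)
    return A[rows, cols]

# A symmetric 4-node network has 4 * 3 / 2 = 6 undirected connections.
A = np.array([[0, 1, 2, 3],
              [1, 0, 4, 5],
              [2, 4, 0, 6],
              [3, 5, 6, 0]])
features = lower_triangle_features(A)
print(features)  # one entry per connection
```

Structured approaches like Softer instead keep the R × R (or R × R × p) array intact, which is what allows the symmetry to be exploited rather than discarded.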
Softer was used to study the relationship between brain connectomics and human traits, and identified several important connections in predicting depression. Combining the two types of assumed sparsity for low-rank and sparse matrix estimation has received some attention in the literature, especially in machine learning with matrix data. Candès et al. (2011), Waters et al. (2011) and Zhou and Tao (2011) decomposed matrices as the sum of a low-rank and a sparse component. Zhang et al. (2016) employed such a decomposition in anomaly detection by studying the Mahalanobis distance between the observed data and the low-rank component. Richard et al. (2012) developed a penalization-based approach to estimating matrices that are simultaneously sparse and low-rank by adopting one type of penalty for sparsity and one for rank. All these approaches have been formulated as optimization problems, and algorithms are generally based on iterative procedures. Within a Bayesian regression framework, Guha and Rodriguez (2021) combined the two types of sparsity and proposed a network-based spike and slab prior on the nodes' importance. Under that model, a node is either active or inactive, and active nodes are expected to contribute to the outcome based on a low-rank coefficient tensor. In that sense, this approach has important commonalities with estimating a coefficient tensor that is simultaneously sparse and low-rank. Even though we find that approach to be promising, we find that node selection in itself can be too restrictive in some settings, and future work could incorporate hierarchical or parallel node and entry selection. Softer has similarities to, but also differences from, the methods discussed above. On one hand, Softer provides a relaxation of an assumed low-rank form. However, this relaxation (or softening) is not sparse in any way, and every element of the tensor is allowed to deviate from the low-rank structure.
We find that an exciting line of research would combine low-rank and sparse approaches while allowing for sufficient flexibility to deviate from both of them. Important questions remain on the interplay between entry selection and the assumed structure on the coefficient tensor. We illustrated in simulations that tensor regression based on the hard PARAFAC can perform poorly for entry selection. Future work could focus on studying the properties of multiplicity control in structured settings, and on forming principled variable selection approaches with desirable properties within the context of structured data. Even though we showed that the posterior distribution of the coefficient tensor based on Softer is consistent irrespective of the tensor's true rank or the algorithmic rank used, additional theoretical results would help illuminate Softer's performance. For example, it would be interesting to understand how Softer performs under the asymptotic regime where the tensor predictor is allowed to grow with the sample size. Furthermore, even though our simulation results indicate that Softer reverts back to the hard PARAFAC when the underlying low-rank structure is true, future approaches to soft regression can investigate whether the soft and hard approaches asymptotically converge if the hard structure is true. If this result were derived, it would imply a new form of robustness: robustness in terms of the method chosen, soft or hard. Finally, a criticism of Bayesian methods is often their computational burden. One way forward in alleviating this issue for Softer is to re-formulate our model within the frequentist paradigm.
For example, one could impose a penalty term on the magnitude of the $\gamma^{(d)}_{k,j_k}$ parameters using $\mathrm{penalty}_\gamma^{(d)} = \sum_k \sum_{j_k} (\gamma^{(d)}_{k,j_k})^2$, substitute the softening structure in Equation 6 with a penalty term such as $\mathrm{penalty}_\beta^{(d)} = \sum_k \sum_{j} (\beta^{(d)}_{k,j} - \gamma^{(d)}_{k,j_k})^2$, and maximize the penalized likelihood with appropriate tuning parameters. One might need to link the tuning parameter for $\mathrm{penalty}_\gamma^{(d)}$ with that of $\mathrm{penalty}_\beta^{(d)}$ to ensure that penalization of one does not lead to overcompensation from the other (similarly to our discussion in Section 4.3 for the inclusion of $\zeta^{(d)}$ in the distribution for $\beta^{(d)}_{k,j}$). Such L2-norm penalties with a continuous outcome would imply that the maximization problem is convex, and optimization procedures can be developed relatively easily. Even though this reformulation would not reduce the number of parameters, it is expected to require far fewer iterations than an MCMC procedure. Alternative norms could also be considered to impose more sparsity, such as using a group-Lasso penalty instead of $\mathrm{penalty}_\beta^{(d)}$.

Acknowledgments

This work was supported by grant R01MH118927 from the National Institutes of Health (NIH) and grant R01-ES027498-01A1 from the National Institute of Environmental Health Sciences (NIEHS) of the NIH.

Appendix A. Proofs

In this appendix, we provide proofs of all propositions.

Proof of Proposition 1

Expectation. We use $S$, $Z$ and $W$ to denote the collections of $\sigma^2_k$, $\zeta^{(d)}$ and $w^{(d)}_{k,j_k}$ over $k$, $d$, and $(k, j_k, d)$, respectively. We start by noting that
$$\beta^{(d)}_{k,j} \mid \sigma^2_k, \zeta^{(d)}, \tau_\gamma, w^{(d)}_{k,j_k} \sim N\big(0,\; \sigma^2_k \zeta^{(d)} + \tau_\gamma \zeta^{(d)} w^{(d)}_{k,j_k}\big),$$
and, if $(k, j_k, d) \neq (k', j'_k, d')$,
$$\beta^{(d)}_{k,j} \perp \beta^{(d')}_{k',j'} \mid S, Z, W, \tau_\gamma. \quad \text{(A.1)}$$
Note that $\beta^{(d)}_{k,j}$ is not independent of $\beta^{(d)}_{k,j'}$ conditional on $(S, Z, W, \tau_\gamma)$ when $j_k = j'_k$, due to their shared dependence on $\gamma^{(d)}_{k,j_k}$.
Then,
$$E(B_j \mid S, Z, W, \tau_\gamma) = E\Big(\sum_{d=1}^D \prod_{k=1}^K \beta^{(d)}_{k,j} \,\Big|\, S, Z, W, \tau_\gamma\Big) = \sum_{d=1}^D \prod_{k=1}^K E\big(\beta^{(d)}_{k,j} \mid S, Z, W, \tau_\gamma\big) = 0.$$
So, a priori, all elements of the coefficient tensor have mean 0, $E(B_j) = 0$.

Variance. Furthermore, we have
$$\mathrm{Var}(B_j) = E\big\{\mathrm{Var}\big(B_j \mid S, Z, W, \tau_\gamma\big)\big\} = E\Big\{\mathrm{Var}\Big(\sum_{d=1}^D \prod_{k=1}^K \beta^{(d)}_{k,j} \,\Big|\, S, Z, W, \tau_\gamma\Big)\Big\}.$$
Since the $\beta^{(d)}_{k,j}$ are conditionally independent across $d$, the products $\prod_{k=1}^K \beta^{(d)}_{k,j}$ are also conditionally independent across $d$. Moreover, the terms of the product $\prod_{k=1}^K \beta^{(d)}_{k,j}$ are independent across $k$ and are mean-zero random variables, implying that the $\prod_{k=1}^K \beta^{(d)}_{k,j}$ are mean-zero variables. Note here that two independent mean-zero random variables $A, B$ satisfy $\mathrm{Var}(AB) = \mathrm{Var}(A)\mathrm{Var}(B)$. Then,
$$\mathrm{Var}(B_j) = E\Big\{\sum_{d=1}^D \prod_{k=1}^K \mathrm{Var}\big(\beta^{(d)}_{k,j} \mid S, Z, W, \tau_\gamma\big)\Big\} = E\Big\{\sum_{d=1}^D \prod_{k=1}^K \zeta^{(d)}\big(\sigma^2_k + \tau_\gamma w^{(d)}_{k,j_k}\big)\Big\} = E_Z\Big\{\sum_{d=1}^D (\zeta^{(d)})^K\Big\}\, E_{S,W,\tau_\gamma}\Big\{\prod_{k=1}^K \big(\sigma^2_k + \tau_\gamma w^{(d)}_{k,j_k}\big)\Big\},$$
where in the last equation we used that, a priori, $Z \perp (S, W, \tau_\gamma)$ to write $E_{S,W,\tau_\gamma \mid Z}$ as $E_{S,W,\tau_\gamma}$, and separate the two expectations. However, the $\sigma^2_k + \tau_\gamma w^{(d)}_{k,j_k}$ are not independent of each other for different values of $k$, since they all involve the same parameter $\tau_\gamma$. We overcome this difficulty in calculating the expectation of the product by writing $\prod_{k=1}^K (\sigma^2_k + \tau_\gamma w^{(d)}_{k,j_k}) = \sum_{l=0}^K c_l \tau_\gamma^l$, where
$$c_l = \sum_{\mathcal{K} \subseteq \{1, 2, \ldots, K\} : |\mathcal{K}| = l} \;\; \prod_{k \in \mathcal{K}} w^{(d)}_{k,j_k} \prod_{k \notin \mathcal{K}} \sigma^2_k.$$
So, for every power $\tau_\gamma^l$, $l \in \{0, 1, \ldots, K\}$, the corresponding coefficient is a sum of all terms involving $l$ distinct $w$'s and $K - l$ distinct $\sigma^2$'s. For example, for $K = 2$, $c_1 = w^{(d)}_{1,j_1}\sigma^2_2 + \sigma^2_1 w^{(d)}_{2,j_2}$. Writing the product in this way separates the terms $(w^{(d)}_{k,j_k}, \sigma^2_k)$ from $\tau_\gamma$, which are a priori independent. Then,
$$\mathrm{Var}(B_j) = E_Z\Big\{\sum_{d=1}^D (\zeta^{(d)})^K\Big\}\, E_{S,W,\tau_\gamma}\Big\{\sum_{l=0}^K c_l \tau_\gamma^l\Big\} = E_Z\Big\{\sum_{d=1}^D (\zeta^{(d)})^K\Big\} \Big\{\sum_{l=0}^K E(\tau_\gamma^l)\, E_{S,W}(c_l)\Big\}.$$
We continue by studying $E_{S,W}(c_l) = \sum_{\mathcal{K}: |\mathcal{K}| = l} E_{S,W}\big(\prod_{k \in \mathcal{K}} w^{(d)}_{k,j_k} \prod_{k \notin \mathcal{K}} \sigma^2_k\big)$.
Note that since all parameters $\{w^{(d)}_{k,j_k}, \sigma^2_k\}_k$ for fixed $j_k$ are a priori independent (any dependence in the $w^{(d)}_{k,j_k}$ exists across $j_k$ of the same mode due to the common value $\lambda^{(d)}_k$),
$$E_{S,W}(c_l) = \sum_{\mathcal{K}: |\mathcal{K}| = l}\; \prod_{k \in \mathcal{K}} E_W\big(w^{(d)}_{k,j_k}\big) \prod_{k \notin \mathcal{K}} E_S\big(\sigma^2_k\big).$$
Now, note that $E(\sigma^2_k) = a_\sigma / b_\sigma$ and
$$E_W\big(w^{(d)}_{k,j_k}\big) = E_\Lambda\big\{E_{W \mid \Lambda}\big[w^{(d)}_{k,j_k}\big]\big\} = 2 E_\Lambda\big\{(\lambda^{(d)}_k)^{-2}\big\}.$$
Since $\lambda^{(d)}_k \sim \Gamma(a_\lambda, b_\lambda)$, $1/\lambda^{(d)}_k \sim IG(a_\lambda, b_\lambda)$, we have that
$$E\big\{(1/\lambda^{(d)}_k)^2\big\} = \mathrm{Var}\big(1/\lambda^{(d)}_k\big) + E^2\big(1/\lambda^{(d)}_k\big) = \frac{b_\lambda^2}{(a_\lambda - 1)(a_\lambda - 2)}, \quad a_\lambda > 2.$$
Putting this together, we have that, for $a_\lambda > 2$,
$$E_{S,W}(c_l) = \binom{K}{l} \Big\{\frac{2 b_\lambda^2}{(a_\lambda - 1)(a_\lambda - 2)}\Big\}^l \Big(\frac{a_\sigma}{b_\sigma}\Big)^{K - l}.$$
Further, since $\tau_\gamma \sim \Gamma(a_\tau, b_\tau)$, we have that
$$E(\tau_\gamma^l) = \frac{b_\tau^{a_\tau}}{\Gamma(a_\tau)} \int \tau_\gamma^{a_\tau + l - 1} \exp\{-b_\tau \tau_\gamma\}\, d\tau_\gamma = \frac{\Gamma(a_\tau + l)}{\Gamma(a_\tau)\, b_\tau^l} = \frac{\rho_l}{b_\tau^l},$$
for $\rho_l = 1$ if $l = 0$, and $\rho_l = a_\tau(a_\tau + 1) \cdots (a_\tau + l - 1)$ if $l \geq 1$. Lastly, since $\zeta \sim \mathrm{Dir}(\alpha/D, \alpha/D, \ldots, \alpha/D)$, we have that $\zeta^{(d)} \sim \mathrm{Beta}(\alpha/D, (D-1)\alpha/D)$, with $E\{(\zeta^{(d)})^K\}$ available in closed form. Combining all of these, we can write the prior variance for entries $B_j$ of the coefficient tensor as
$$\mathrm{Var}(B_j) = E_Z\Big[\sum_{d=1}^D (\zeta^{(d)})^K\Big] \sum_{l=0}^K \binom{K}{l} \rho_l \Big\{\frac{2 b_\lambda^2}{b_\tau(a_\lambda - 1)(a_\lambda - 2)}\Big\}^l \Big(\frac{a_\sigma}{b_\sigma}\Big)^{K - l}.$$

Covariance. Since $E(B_j \mid S, Z, W, \tau_\gamma) = 0$, we have that $\mathrm{Cov}(B_j, B_{j'}) = E\big\{\mathrm{Cov}(B_j, B_{j'} \mid S, Z, W, \tau_\gamma)\big\}$. Remember from Equation A.1 that, when at least one of $k, j_k, d$ are different, $\beta^{(d)}_{k,j} \perp \beta^{(d')}_{k',j'} \mid S, Z, W, \tau_\gamma$. However, that is not true when $(k, j_k, d) = (k', j'_k, d')$, even if $j' \neq j$. We write
$$E\big\{\mathrm{Cov}(B_j, B_{j'} \mid S, Z, W, \tau_\gamma)\big\} = E\Big\{\mathrm{Cov}\Big(\sum_{d=1}^D \prod_{k=1}^K \beta^{(d)}_{k,j},\; \sum_{d'=1}^D \prod_{k=1}^K \beta^{(d')}_{k,j'} \,\Big|\, S, Z, W, \tau_\gamma\Big)\Big\} = \sum_{d,d'=1}^D E\Big\{\mathrm{Cov}\Big(\prod_{k=1}^K \beta^{(d)}_{k,j},\; \prod_{k=1}^K \beta^{(d')}_{k,j'} \,\Big|\, S, Z, W, \tau_\gamma\Big)\Big\}.$$
Also,
$$\mathrm{Cov}\Big(\prod_{k=1}^K \beta^{(d)}_{k,j},\; \prod_{k=1}^K \beta^{(d')}_{k,j'} \,\Big|\, S, Z, W, \tau_\gamma\Big) = E\Big(\prod_{k=1}^K \beta^{(d)}_{k,j} \beta^{(d')}_{k,j'} \,\Big|\, S, Z, W, \tau_\gamma\Big) - E\Big(\prod_{k=1}^K \beta^{(d)}_{k,j} \,\Big|\, S, Z, W, \tau_\gamma\Big) E\Big(\prod_{k=1}^K \beta^{(d')}_{k,j'} \,\Big|\, S, Z, W, \tau_\gamma\Big) = E\Big(\prod_{k=1}^K \beta^{(d)}_{k,j} \beta^{(d')}_{k,j'} \,\Big|\, S, Z, W, \tau_\gamma\Big),$$
where the last equation holds because the $\beta^{(d)}_{k,j}$ are independent of each other across $k$ conditional on $S, Z, W, \tau_\gamma$ and have mean zero.
Furthermore, since the $\beta^{(d)}_{k,j}$ are conditionally independent across $d$, we have that for $d \neq d'$, $\mathrm{Cov}\big(\prod_{k=1}^K \beta^{(d)}_{k,j}, \prod_{k=1}^K \beta^{(d')}_{k,j'} \mid S, Z, W, \tau_\gamma\big) = 0$. So we only need to study the conditional covariance for $d = d'$. For $\Gamma$ representing the set of all $\gamma^{(d)}_{k,j_k}$, we write
$$E\Big(\prod_{k=1}^K \beta^{(d)}_{k,j} \beta^{(d)}_{k,j'} \,\Big|\, S, Z, W, \tau_\gamma\Big) = E\Big\{E\Big(\prod_{k=1}^K \beta^{(d)}_{k,j} \beta^{(d)}_{k,j'} \,\Big|\, \Gamma, S, Z, W, \tau_\gamma\Big) \,\Big|\, S, Z, W, \tau_\gamma\Big\}.$$
Conditional on $\Gamma, S, Z, W, \tau_\gamma$, and as long as $j \neq j'$, the $\beta^{(d)}_{k,j}$ are independent across all indices, even if they have all of $k, j_k, d$ in common, leading to
$$E\Big(\prod_{k=1}^K \beta^{(d)}_{k,j} \beta^{(d)}_{k,j'} \,\Big|\, S, Z, W, \tau_\gamma\Big) = E\Big\{\prod_{k=1}^K E\big(\beta^{(d)}_{k,j} \mid \Gamma, S, Z, W, \tau_\gamma\big)\, E\big(\beta^{(d)}_{k,j'} \mid \Gamma, S, Z, W, \tau_\gamma\big) \,\Big|\, S, Z, W, \tau_\gamma\Big\} = E\Big(\prod_{k=1}^K \gamma^{(d)}_{k,j_k} \gamma^{(d)}_{k,j'_k} \,\Big|\, S, Z, W, \tau_\gamma\Big) = \prod_{k=1}^K E\big(\gamma^{(d)}_{k,j_k} \gamma^{(d)}_{k,j'_k} \mid S, Z, W, \tau_\gamma\big) = \prod_{k: j_k = j'_k} E\big((\gamma^{(d)}_{k,j_k})^2 \mid S, Z, W, \tau_\gamma\big) \prod_{k: j_k \neq j'_k} E\big(\gamma^{(d)}_{k,j_k} \mid S, Z, W, \tau_\gamma\big)\, E\big(\gamma^{(d)}_{k,j'_k} \mid S, Z, W, \tau_\gamma\big),$$
where the first equality holds because $j \neq j'$, the third equality holds because the $\gamma^{(d)}_{k,j_k}$ are conditionally independent across $k$, and the fourth equality holds because they are conditionally independent across $j_k$.

Proof of Proposition 2

We want the prior variance of each entry to equal a pre-specified value, $\mathrm{Var}(B_j) = V$, and the proportion of additional variance to equal a pre-specified value $AV$. The second target is achieved when $\mathrm{Var}(B_j)/\mathrm{Var}_{hard}(B_j) = (1 - AV)^{-1}$. Since $a_\sigma/b_\sigma$ is the quantity driving the soft PARAFAC's additional variability, we use this condition to acquire a form for $a_\sigma/b_\sigma$ as a function of the remaining hyperparameters. Writing $C = E_Z\big[\sum_{d=1}^D (\zeta^{(d)})^K\big]$ and using the form of $\mathrm{Var}(B_j)$ derived in Proposition 1 with $K = 2$,
$$\frac{\mathrm{Var}(B_j)}{\mathrm{Var}_{hard}(B_j)} = \frac{\sum_{l=0}^2 \binom{2}{l} \frac{\rho_l}{b_\tau^l} \big\{\frac{2 b_\lambda^2}{(a_\lambda-1)(a_\lambda-2)}\big\}^l \big(\frac{a_\sigma}{b_\sigma}\big)^{2-l}}{\frac{\rho_2}{b_\tau^2} \big\{\frac{2 b_\lambda^2}{(a_\lambda-1)(a_\lambda-2)}\big\}^2} = \frac{1}{a_\tau(a_\tau+1)} \Big[\frac{a_\sigma}{b_\sigma} \Big\{\frac{2 b_\lambda^2}{b_\tau(a_\lambda-1)(a_\lambda-2)}\Big\}^{-1}\Big]^2 + \frac{2}{a_\tau+1} \Big[\frac{a_\sigma}{b_\sigma} \Big\{\frac{2 b_\lambda^2}{b_\tau(a_\lambda-1)(a_\lambda-2)}\Big\}^{-1}\Big] + 1.$$
Therefore, in order for $\mathrm{Var}(B_j)/\mathrm{Var}_{hard}(B_j) = (1 - AV)^{-1}$, $a_\sigma/b_\sigma$ is the solution to a second degree polynomial. We calculate the positive root of this polynomial. The discriminant is
$$\frac{4}{(a_\tau+1)^2} \Big\{\frac{2 b_\lambda^2}{b_\tau(a_\lambda-1)(a_\lambda-2)}\Big\}^{-2} - \frac{4}{a_\tau(a_\tau+1)} \Big\{\frac{2 b_\lambda^2}{b_\tau(a_\lambda-1)(a_\lambda-2)}\Big\}^{-2} \Big\{1 - (1 - AV)^{-1}\Big\} = \frac{4}{(a_\tau+1)^2} \Big\{\frac{2 b_\lambda^2}{b_\tau(a_\lambda-1)(a_\lambda-2)}\Big\}^{-2} \Big[1 - \frac{a_\tau+1}{a_\tau}\Big\{1 - (1 - AV)^{-1}\Big\}\Big] > 0.$$
Since $b_\sigma$ is positive, we have that $a_\sigma/b_\sigma$ is equal to the positive root, namely
$$\frac{a_\sigma}{b_\sigma} = a_\tau \Big\{\frac{2 b_\lambda^2}{b_\tau(a_\lambda-1)(a_\lambda-2)}\Big\} \Big[\sqrt{1 - \frac{a_\tau+1}{a_\tau}\big\{1 - (1-AV)^{-1}\big\}} - 1\Big]. \quad \text{(A.2)}$$
Denoting $\xi = 1 - \frac{a_\tau+1}{a_\tau}\big\{1 - (1-AV)^{-1}\big\}$ and substituting the form of $a_\sigma/b_\sigma$ in $\mathrm{Var}(B_j)$, we get
$$\mathrm{Var}(B_j) = C \Big\{\frac{2 b_\lambda^2}{b_\tau(a_\lambda-1)(a_\lambda-2)}\Big\}^2 a_\tau^2 \Big[\big(\sqrt{\xi} - 1\big)^2 + 2\big(\sqrt{\xi} - 1\big) + 1 + \frac{1}{a_\tau}\Big] = C \Big\{\frac{2 b_\lambda^2}{b_\tau(a_\lambda-1)(a_\lambda-2)}\Big\}^2 a_\tau^2 \Big(\xi + \frac{1}{a_\tau}\Big) = C \Big\{\frac{2 b_\lambda^2}{b_\tau(a_\lambda-1)(a_\lambda-2)}\Big\}^2 \frac{a_\tau(a_\tau + 1)}{1 - AV}.$$
Setting $\mathrm{Var}(B_j) = V$ and solving gives
$$\frac{2 b_\lambda^2}{(a_\lambda - 1)(a_\lambda - 2)} = b_\tau \sqrt{\frac{V(1 - AV)}{C\, a_\tau (a_\tau + 1)}}. \quad \text{(A.3)}$$
Substituting Equation A.3 back into Equation A.2, we have that
$$\frac{a_\sigma}{b_\sigma} = \sqrt{\frac{V (1 - AV)\, a_\tau}{C (a_\tau + 1)}} \Big[\sqrt{1 - \frac{a_\tau+1}{a_\tau}\big\{1 - (1-AV)^{-1}\big\}} - 1\Big].$$

Proof of Proposition 3

Start by noting that
$$\pi_B\big(B^\epsilon_{B^0}\big) = E_{\Gamma, S, Z}\Big[p\Big(B : \max_j |B^0_j - B_j| < \epsilon \,\Big|\, \Gamma, S, Z\Big)\Big],$$
where $\Gamma, S, Z$ are as defined in the proof of Proposition 1. Then, take $\epsilon' = \sqrt[K]{\epsilon / (2(D-1))}$ and write
$$p\Big(\max_j |B^0_j - B_j| < \epsilon \,\Big|\, \Gamma, S, Z\Big) \geq p\Big(\max_j |B^0_j - B_j| < \epsilon,\; |\beta^{(d)}_{k,j}| < \epsilon' \text{ for all } k, j, \text{ and } d \geq 2 \,\Big|\, \Gamma, S, Z\Big).$$
Conditional on $\Gamma, S, Z$, the $\beta^{(d)}_{k,j}$ are independent normal variables with positive weight in an $\epsilon'$-neighborhood of 0, implying that $p(|\beta^{(d)}_{k,j}| < \epsilon' \text{ for all } k, j, \text{ and } d \geq 2 \mid \Gamma, S, Z) > 0$. Remember from Equation 5 that $B = \sum_{d=1}^D B^{(d)}_1 \circ B^{(d)}_2 \circ \cdots \circ B^{(d)}_K$, and denote $B^{(d)} = B^{(d)}_1 \circ B^{(d)}_2 \circ \cdots \circ B^{(d)}_K$. Then $B_j = \sum_{d=1}^D B^{(d)}_j$. Note that
$$p\Big(\max_j |B^0_j - B_j| < \epsilon,\; |\beta^{(d)}_{k,j}| < \epsilon' \text{ for all } k, j, d \geq 2 \,\Big|\, \Gamma, S, Z\Big) \geq p\Big(\max_j |B^0_j - B^{(1)}_j| < \epsilon/2,\; |\beta^{(d)}_{k,j}| < \epsilon' \text{ for all } k, j, d \geq 2 \,\Big|\, \Gamma, S, Z\Big) = p\Big(\max_j |B^0_j - B^{(1)}_j| < \epsilon/2 \,\Big|\, \Gamma, S, Z\Big)\, p\Big(|\beta^{(d)}_{k,j}| < \epsilon' \text{ for all } k, j, d \geq 2 \,\Big|\, \Gamma, S, Z\Big),$$
where the equality holds because the entries of $B^{(1)}$ are independent of all $\beta^{(d)}_{k,j}$ for $d \geq 2$ conditional on $\Gamma, S, Z$, and the inequality holds because $|B^0_j - B^{(1)}_j| < \epsilon/2$ and $|\beta^{(d)}_{k,j}| < \epsilon'$ for $d \geq 2$ imply that
$$|B^0_j - B_j| \leq |B^0_j - B^{(1)}_j| + |B^{(2)}_j| + \cdots + |B^{(D)}_j| < \epsilon/2 + (D - 1)(\epsilon')^K = \epsilon.$$
Since all of the $\beta^{(1)}_{k,j}$ are independent conditional on $\Gamma, S, Z$, we have that
$$p\Big(\max_j |B^0_j - B^{(1)}_j| < \epsilon/2 \,\Big|\, \Gamma, S, Z\Big) = \prod_j p\Big(|B^0_j - B^{(1)}_j| < \epsilon/2 \,\Big|\, \Gamma, S, Z\Big) > 0,$$
since all entries $B^{(1)}_j$ are products of draws from $K$ normal distributions and therefore assign positive weight to all of $\mathbb{R}$, including the $\epsilon/2$ neighborhood of $B^0_j$. Putting all of this together, we have the desired result that $\pi_B(B^\epsilon_{B^0}) > 0$.

Proof of Proposition 4

We show that for any $\epsilon > 0$, there exists $\epsilon' > 0$ such that
$$\Big\{B : \max_j |B^0_j - B_j| < \epsilon'\Big\} \subseteq \Big\{B : KL(B^0, B) < \epsilon\Big\},$$
where $KL(B^0, B) = \int \log\big\{\phi(y; B^0)/\phi(y; B)\big\}\, \phi(y; B^0)\, dy$, and $\phi(y; B)$ is the density of a normal distribution with coefficient tensor $B$ and variance 1. If we show this, consistency follows from Schwartz (1965). Assume that there exists $M$ such that $|X_j| < M$ for all $j$ with probability 1. For two normal distributions with means $\langle X, B^0 \rangle_F$ and $\langle X, B \rangle_F$ respectively, we have that
$$KL(B^0, B) = \frac{1}{2}\big(\langle X, B^0 \rangle_F - \langle X, B \rangle_F\big)^2 = \frac{1}{2}\Big[\sum_j \big(B^0_j - B_j\big) X_j\Big]^2 \leq \frac{1}{2}\Big[\sum_j \big(B^0_j - B_j\big)^2\Big]\Big[\sum_j X_j^2\Big].$$
Take $\epsilon' = \sqrt{2\epsilon} / \big(M\, p_1 p_2 \cdots p_K\big)$ and consider $B \in B^{\epsilon'}_{B^0}$. We will show that $B$ satisfies $KL(B^0, B) < \epsilon$, completing the proof. Note first that
$$\sum_j \big(B^0_j - B_j\big)^2 \leq p_1 p_2 \cdots p_K \max_j \big(B^0_j - B_j\big)^2 < p_1 p_2 \cdots p_K (\epsilon')^2 = \frac{2\epsilon}{M^2\, p_1 p_2 \cdots p_K},$$
and since $|X_j| \leq M$ we have that $\sum_j X_j^2 \leq M^2 p_1 p_2 \cdots p_K$. Putting these results together,
$$KL(B^0, B) < \frac{1}{2} \cdot \frac{2\epsilon}{M^2\, p_1 p_2 \cdots p_K} \cdot M^2\, p_1 p_2 \cdots p_K = \epsilon.$$

Appendix B. Hard PARAFAC Error in Estimating the True Matrix

Due to the block structure and subsequent inflexibility of the hard PARAFAC approximation, a large number of components D might be required in order to adequately approximate a coefficient tensor B. To further demonstrate this, we considered a coefficient matrix B whose entries $B_{ij}$ are centered around (but are not equal to) the entries of a rank-1 matrix of the form $\beta_1 \circ \beta_2$. Therefore, even though the matrix has a somewhat rectangular structure, it is not exactly in that form. Using the singular value decomposition (which is

Figure B.1: Distribution of errors based on the hard PARAFAC with increasing rank.
The true matrix B resembles, but is not exactly equal to, a rank-1 tensor, and estimation is based on the singular value decomposition using $D \in \{3, 10, 20, 50\}$ factors.

the PARAFAC analog for matrices), we considered the quality of the approximation based on D factors, for various values of D. Figure B.1 shows histograms of the difference of the true entries in B from the approximated ones. Even for D = 20, substantial error remains in estimating the matrix B.

Appendix C. Alternative Sampling from the Posterior Distribution

The full set of parameters is $\theta = \{\mu, \delta, \tau^2, \beta^{(d)}_{k,j}, \gamma^{(d)}_{k,j_k}, \sigma^2_k, \zeta^{(d)}, w^{(d)}_{k,j_k}, \lambda^{(d)}_k, \tau^2_\gamma$, for all $d, k, j_k, j\}$. We use the notation $\cdot \mid -$ and $\cdot \mid -, y$ to denote conditioning on the data and all other parameters, and on the data and all other parameters but $y$, accordingly. Then, our MCMC updates are:

- $(\mu, \delta) \mid - \sim N(\mu^*, \Sigma^*)$, for $\Sigma^* = \big(\Sigma_0^{-1} + \widetilde{C}^T \widetilde{C} / \tau^2\big)^{-1}$ and $\mu^* = \Sigma^* \widetilde{C}^T R_B / \tau^2$, where $\widetilde{C}$ is the $N \times (p+1)$ matrix with $i$th row equal to $(1, C_i)$, and $R_B = (Y_1 - \langle X_1, B \rangle_F, \ldots, Y_N - \langle X_N, B \rangle_F)^T$ is the vector of residuals of the outcome on the tensor predictor.
- $\tau^2 \mid - \sim IG\big(a_\tau + N/2,\; b_\tau + \sum_{i=1}^N (Y_i - \mu - C_i^T \delta - \langle X_i, B \rangle_F)^2 / 2\big)$.
- $\sigma^2_k \mid - \sim \mathrm{giG}(p^*, a^*, b^*)$, for $p^* = a_\sigma - D \prod_{k=1}^K p_k / 2$, $a^* = 2 b_\sigma$, and $b^* = \sum_{d,j} (\beta^{(d)}_{k,j} - \gamma^{(d)}_{k,j_k})^2 / \zeta^{(d)}$. As a reminder, $X \sim \mathrm{giG}(p, a, b)$ if $p(x) \propto x^{p-1} \exp\{-(a x + b/x)/2\}$.
- $\gamma^{(d)}_{k,j_k} \mid - \sim N(\mu^*, \sigma^{2*})$, for $\sigma^{2*} = \big\{(\tau_\gamma w^{(d)}_{k,j_k} \zeta^{(d)})^{-1} + \prod_{l \neq k} p_l / (\sigma^2_k \zeta^{(d)})\big\}^{-1}$ and $\mu^* = \sigma^{2*} \big\{\sum_{j} \beta^{(d)}_{k,j} / (\sigma^2_k \zeta^{(d)})\big\}$, where the sum is over all multi-indices $j$ whose $k$th element equals $j_k$.
- $\tau_\gamma \mid - \sim \mathrm{giG}\big(a_\tau - D \sum_k p_k / 2,\; 2 b_\tau,\; \sum_{d,k,j_k} (\gamma^{(d)}_{k,j_k})^2 / (\zeta^{(d)} w^{(d)}_{k,j_k})\big)$.
- $w^{(d)}_{k,j_k} \mid - \sim \mathrm{giG}\big(1/2,\; (\lambda^{(d)}_k)^2,\; (\gamma^{(d)}_{k,j_k})^2 / (\tau_\gamma \zeta^{(d)})\big)$.
- $[\lambda^{(d)}_k \mid -, w^{(d)}_{k,j_k}, \text{all } j_k] \sim \Gamma\big(a_\lambda + p_k,\; b_\lambda + \sum_{j_k} |\gamma^{(d)}_{k,j_k}| / (\tau_\gamma \zeta^{(d)})\big)$. Therefore, $\lambda^{(d)}_k$ is updated conditional on all parameters excluding all $w^{(d)}_{k,j_k}$, $j_k = 1, 2, \ldots, p_k$. Its distribution can be acquired by noting that $\gamma^{(d)}_{k,j_k} \mid \tau_\gamma, \zeta^{(d)}, \lambda^{(d)}_k \sim \mathrm{DE}(\mu = 0, b = \tau_\gamma \zeta^{(d)} / \lambda^{(d)}_k)$ (Park and Casella, 2008), where DE stands for the double exponential or Laplace distribution.
- for each $k = 1, 2, \ldots, K$, $d = 1, 2, \ldots$
, $D$, and $j_k = 1, 2, \ldots, p_k$, we use $B^{(d)}_{k,j_k}$ to denote the $j_k$th slice of tensor $B^{(d)}_k$ along mode $k$, which is a $(K-1)$-mode tensor. Then, $\mathrm{vec}(B^{(d)}_{k,j_k}) \mid - \sim N(\mu^*, \Sigma^*)$, for $\Sigma^* = \big(\Sigma_\pi^{-1} + \sum_{i=1}^N \Psi_i \Psi_i^T / \tau^2\big)^{-1}$ and $\mu^* = \Sigma^* \big(\Sigma_\pi^{-1} \mu_\pi + \sum_{i=1}^N \Psi_i R_{i,\Psi} / \tau^2\big)$, where $\Sigma_\pi$ is a diagonal matrix of dimension $\prod_{k' \neq k} p_{k'}$ with repeated entry $\sigma^2_k \zeta^{(d)}$, $\mu_\pi$ is a constant vector of length $\prod_{k' \neq k} p_{k'}$ with entry $\gamma^{(d)}_{k,j_k}$, $\Psi_i = \mathrm{vec}\big(B^{(d)}_1 \circ \cdots \circ B^{(d)}_{k-1} \circ B^{(d)}_{k+1} \circ \cdots \circ B^{(d)}_K\; X_i\big)$, and $R_{i,\Psi} = Y_i - \mu - C_i^T \delta - \big\langle X_i, \sum_{r \neq d} B^{(r)}_1 \circ B^{(r)}_2 \circ \cdots \circ B^{(r)}_K \big\rangle_F$ is the residual excluding component $d$ of the coefficient tensor.
- for each $d = 1, 2, \ldots, D$, we update $\zeta^{(d)}$ from its full conditional, and then ensure that $\zeta$ sums to 1 by dividing all its entries by $\sum_{d=1}^D \zeta^{(d)}$. The $\zeta^{(d)}$ update is from $\zeta^{(d)} \mid - \sim \mathrm{giG}(p^*, a^*, b^*)$, where $p^* = \alpha/D - \big(K \prod_k p_k + \sum_k p_k\big)/2$, $a^* = 0$, and $b^* = \sum_{k,j} (\beta^{(d)}_{k,j} - \gamma^{(d)}_{k,j_k})^2 / \sigma^2_k + \sum_{k,j_k} (\gamma^{(d)}_{k,j_k})^2 / (\tau_\gamma w^{(d)}_{k,j_k})$.

Appendix D. Additional Simulation Results

D.1 Comparing Softer and PARAFAC in Identifying Important Entries of the Tensor Predictor

In Section 5.1 we presented an evaluation of the relative performance of Softer and the hard PARAFAC in identifying important entries. There, we showed that Softer has a significantly lower false positive rate, indicating that the two methods systematically disagree on the entries of the tensor predictor they identify as important. In order to study their disagreement, for each entry of the tensor predictor we calculate the percentage of data sets for which Softer or hard PARAFAC identifies the entry as important while the other does not. We plot the results in Figure D.1 as a function of the entry's true coefficient. We see that the entries that Softer identifies as important while hard PARAFAC does not occur uniformly over the entries' true coefficients. In contrast, when hard PARAFAC identifies an entry as important and Softer does not, the coefficient of that entry is more likely to be small or zero in reality.
When further investigating this behavior of the hard PARAFAC, we identified that the entries it identifies as significant in disagreement with Softer are most often the ones that contribute to the coefficient tensor's block structure. This is evident in Figure D.2, where we see that the entries with high identification by PARAFAC in contrast to Softer are the ones at the boundary of the truly non-zero entries.

Figure D.1: Percentage of simulated data sets in which an entry with the coefficient shown on the x-axis was identified as important by the stated method but not by the other, in the feet (left) and dog (right) scenarios respectively.

Figure D.2: Percentage of data sets in which an entry was identified as important by hard PARAFAC and not by Softer, in the feet (left) and dog (right) scenarios.

D.2 Simulation Results with Alternative Coefficient Tensors

For a tensor predictor of dimension 32 × 32, we also considered three alternative coefficient matrices. The constant squares scenario represents a rank-3, sparse scenario. Both the hard PARAFAC and the Lasso are expected to perform well in this situation, and we are interested in investigating Softer's performance when softening is not necessary. The varying feet and dog scenarios are similar to the feet and dog scenarios of the main text, but with non-zero entries varying between 0.5 and 1. Figure D.3 shows the true and average estimated coefficient matrices in these additional simulations. Even though we plot the true, Softer, and hard PARAFAC matrices using a common scale, we plot the expected coefficients employing Lasso using a different color scale.
That is because we want to show that Lasso gets the correct structure, on average,

Figure D.3: Simulation results for alternative coefficient matrices. True and average estimated coefficient matrix based on Softer, hard PARAFAC, and Lasso ((a) Truth, (b) Softer, (c) PARAFAC, (d) Lasso). The color scale is the same for the truth and the Softer and hard PARAFAC approaches, but different for the Lasso, because the average estimated coefficients for the Lasso are much smaller.

but largely underestimates coefficients due to the assumption of sparsity. Further, Table D.1 reports the average absolute bias, root mean squared error, and 95% coverage of the truly zero and truly non-zero coefficients, and the prediction mean squared error. When the true underlying hard PARAFAC structure is correct, Softer is able to revert back to its hard version, as is evident from the simulation results for the constant squares coefficient matrix. Further, Softer performs better than the hard PARAFAC for the varying feet and varying dog scenarios. In all three scenarios, Softer has the best out-of-sample predictive ability.

D.3 Simulation Results for 32 × 32 Tensor Predictor and Sample Size n = 200

Simulation results in this section represent a subset of the scenarios in Section 5.1 (dog, feet, diagonal) but for sample size n = 200. The general conclusions from Section 5.1 remain even when considering a smaller sample size. Figure D.4 shows a plot similar to the one in Figure 5, including the true coefficient matrices and the average posterior mean or penalized estimate across data sets. Again, the color scale of the results is the same for Softer and hard PARAFAC, but is different for the Lasso. Using different scales facilitates illustration of the underlying structure the methods estimate, even though different methods estimate coefficients of different magnitude.
For example, Lasso captures the feet structure correctly,

                                      Softer    PARAFAC   Lasso
constant   Truly zero       bias      0.0016    0.0014    0.0013
squares                     rMSE      0.016     0.015     0.016
                            coverage  99.2%     99.7%     -
           Truly non-zero   bias      0.0211    0.017     0.073
                            rMSE      0.06      0.076     0.093
                            coverage  90.9%     79.6%     -
           Prediction MSE             0.79      1.25      1.32
varying    Truly zero       bias      0.06      0.076     0.014
feet                        rMSE      0.123     0.145     0.169
                            coverage  96.4%     90.7%     -
           Truly non-zero   bias      0.165     0.205     0.569
                            rMSE      0.244     0.279     0.661
                            coverage  87.7%     73.8%     -
           Prediction MSE             40.79     55.19     194
varying    Truly zero       bias      0.071     0.091     0.013
dog                         rMSE      0.159     0.176     0.152
                            coverage  98.3%     92.7%     -
           Truly non-zero   bias      0.263     0.321     0.554
                            rMSE      0.358     0.398     0.644
                            coverage  81.2%     63.2%     -
           Prediction MSE             67.6      85.3      159.2

Table D.1: Average bias, root mean squared error, frequentist coverage of 95% credible intervals among truly zero and truly non-zero coefficient entries, and predictive mean squared error for Softer, hard PARAFAC and Lasso for the simulation scenario with tensor predictor of dimensions 32 × 32 and sample size n = 400. Bold text is used for the approach performing best in each scenario and for each metric.

but non-zero coefficients are greatly underestimated at around 0.1 (instead of 1). In contrast, in the truly sparse, diagonal scenario, Lasso estimates non-zero coefficients at about 0.8, whereas Softer estimates them to be close to 0.3, and hard PARAFAC near 0.06. One of the main conclusions is that Softer deviates less from the hard PARAFAC when the (n, p) ratio is small, which is evident from the more rectangular structure in the mean coefficient matrix for the dog scenario, and the stronger shrinkage of the non-zero coefficients in the diagonal scenario. Further, Table D.2 shows the average absolute bias and root mean squared error for estimating the coefficient matrix entries, and the prediction mean squared error.
Figure D.4: True and average estimated coefficient matrix for Softer, hard PARAFAC, and Lasso ((a) Truth, (b) Softer, (c) PARAFAC, (d) Lasso). Note that the true matrices might be shown at a different color scale than the estimated ones.

                            Softer                 PARAFAC                Lasso
feet      Truly zero        0.06 (0.14)            0.065 (0.15)           0.01 (0.118)
          Truly non-zero    0.216 (0.332)          0.229 (0.328)          0.535 (0.585)
          Prediction        95.4 (76.6, 111.1)     94 (78.8, 107.5)       312 (299, 323)
dog       Truly zero        0.042 (0.155)          0.087 (0.164)          0.017 (0.155)
          Truly non-zero    0.171 (0.264)          0.174 (0.254)          0.248 (0.305)
          Prediction        110.6 (99.6, 121.1)    101.2 (94, 107.8)      177.3 (170.9, 183.4)
diagonal  Truly zero        0.005 (0.054)          0.003 (0.038)          0.002 (0.02)
          Truly non-zero    0.753 (0.773)          0.945 (0.948)          0.149 (0.181)
          Prediction        22.76 (21.04, 24.29)   31.08 (30.1, 31.78)    2.00 (1.57, 2.32)

Table D.2: Mean bias and rMSE among truly zero and truly non-zero coefficient entries (presented as bias (rMSE)), and average and IQR of the predictive mean squared error (presented as average (IQR)), for tensor predictor of dimensions 32 × 32 and n = 200. Bold text is used for the approach performing best in each scenario.

D.4 Simulation Results for Alternative Rank of the Hard PARAFAC

In order to investigate the reliance of the hard PARAFAC and Softer on the rank of the PARAFAC approximation, we considered the simulation scenarios of Section 5.1 for D = 7. Results are shown in Table D.3. (Simulations comparing the performance of Softer and the hard PARAFAC for varying D under an alternative scenario are included in Appendix D.5.) Comparing Table D.3 to the results in Section 5.1, we see that the performance of Softer is remarkably unaltered when D = 3 or D = 7. Perhaps the only difference is in the dog simulations, where bias and mean squared error for the truly zero coefficients are slightly increased when D = 7. This indicates that Softer is robust to the specification of the rank of the underlying hard PARAFAC.
In contrast, the hard PARAFAC approach shows improvements in performance when D = 7 compared to D = 3. This is evident when examining the bias and rMSE for the coefficient entries in the feet, dog, and diagonal scenarios.

                                      Softer    PARAFAC
squares    Truly zero       bias      0.003     0.005
                            rMSE      0.034     0.049
                            coverage  99.5%     98.6%
           Truly non-zero   bias      0.085     0.104
                            rMSE      0.111     0.146
                            coverage  79.6%     70.5%
           Prediction MSE             5.17      8.76
feet       Truly zero       bias      0.035     0.041
                            rMSE      0.092     0.104
                            coverage  97.3%     96.7%
           Truly non-zero   bias      0.112     0.128
                            rMSE      0.181     0.199
                            coverage  89.2%     85.2%
           Prediction MSE             30.69     37.15
dog        Truly zero       bias      0.079     0.078
                            rMSE      0.138     0.138
                            coverage  92.8%     94.7%
           Truly non-zero   bias      0.084     0.095
                            rMSE      0.157     0.166
                            coverage  92.4%     90.3%
           Prediction MSE             32.49     36.68
diagonal   Truly zero       bias      0.002     0.005
                            rMSE      0.02      0.06
                            coverage  100%      99.9%
           Truly non-zero   bias      0.113     0.852
                            rMSE      0.128     0.861
                            coverage  93.3%     4.1%
           Prediction MSE             1.43      28.09

Table D.3: Simulation results for Softer and hard PARAFAC with D = 7.

Specifically, increasing the rank of the hard PARAFAC from D = 3 to D = 7 leads to reductions of up to 15% in absolute bias and 10% in rMSE. When examining the mean estimated coefficient matrices (not shown), we see that the improvements in estimation are in picking up the toes (in the feet scenario) and the eyes (in the dog scenario). However, the improvements in performance are quite minor. We suspect that the reason is that the singular values of the true coefficient matrices decrease slowly after the first three, indicating that adding a few additional components does not drive much of the approximation. Relatedly, it is likely that the Dirichlet prior on ζ in Guhaniyogi et al. (2017), along with a prior on the Dirichlet parameter α, reduces the approximation rank to effective values smaller than 7. Despite the improvements in performance for the hard PARAFAC when the rank is increased, Softer with either rank (D = 3 or 7) outperforms the hard PARAFAC.
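The singular-value argument above can be illustrated with a small numeric sketch (a hypothetical near-rank-1 matrix in the spirit of Appendix B, not the paper's simulation code): truncating the SVD, the matrix analog of the PARAFAC, at rank D leaves substantial entrywise error when the residual singular values decay slowly.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 32 x 32 matrix centered around a rank-1 outer product, plus small
# entrywise noise, so its singular values decay slowly after the first.
b1, b2 = rng.normal(size=32), rng.normal(size=32)
B = np.outer(b1, b2) + 0.1 * rng.normal(size=(32, 32))

U, s, Vt = np.linalg.svd(B)
errors = {}
for D in (1, 3, 7, 20, 32):
    # Best rank-D approximation (Eckart-Young): the matrix analog of a
    # hard PARAFAC approximation with D components.
    approx = U[:, :D] @ np.diag(s[:D]) @ Vt[:D, :]
    errors[D] = np.abs(B - approx).max()
    print(f"rank {D:2d}: max absolute entrywise error {errors[D]:.4f}")
```

In this toy setting, moving from rank 3 to rank 7 recovers only a small part of the remaining structure, consistent with the minor gains for the hard PARAFAC reported above.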
We suspect that the reason is that Softer allows for deviations from the underlying PARAFAC with D = 3 across any singular value (see the argument corresponding to Figure 4).

D.5 Comparing Softer and the Hard PARAFAC when Varying the Underlying Rank D and the True Rank of the Coefficient Tensor

Here we provide additional simulation results from Section 5.2 for more choices of the algorithmic rank D of the hard PARAFAC and Softer. Figure D.5 shows the results. Of course, as the true rank of the coefficient tensor increases, the performance of both methods deteriorates, irrespective of the algorithmic rank used. That is expected, since the problem becomes harder with more parameters required to estimate the coefficient matrix and a sample size equal to 200. In agreement with the results in Section 5.2, the performance of Softer is generally better than that of the hard PARAFAC, especially when the true rank is larger than the value of D used. Next, we compare how the same method performs for different values of D. The performance of the hard PARAFAC with higher values of D is consistently better than the same method's performance for lower values of D. In contrast, Softer's performance is strikingly similar across most values of D, illustrated by essentially indistinguishable lines, especially for larger values of the true rank of the coefficient tensor. These results again illustrate that the results from Softer are robust to the choice of D.

Figure D.5: Bias, MSE and predictive MSE (rows) for varying rank of the true coefficient matrix (horizontal axis) for the hard PARAFAC and Softer (columns) with various choices of the algorithmic rank D.

Appendix E. Additional Information on our Brain Connectomics Study

E.1 Outcome Information for our Brain Connectomics Study

Table E.1 shows the full list of outcomes we considered in our analysis. Both binary and continuous outcomes are considered. Information includes the category of trait.
In the Type column, C and B denote continuous and binary outcomes, respectively.

Category: Cognition
  Read Eng Age Adj            Age adjusted reading score                        1,065  C  107.1 (14.8)
  Pic Vocab Age Adj           Age adjusted vocabulary comprehension             1,065  C  109.4 (15.2)
  VSPLOT TC                   Spatial Orientation (Variable Short Penn Line Orientation Test)  1,062  C  15 (4.44)
Category: Motor
  Endurance Age Adj           Age Adjusted Endurance                            1,063  C  108.1 (13.9)
  Strength Age Adj            Age Adjusted Strength                             1,064  C  103.6 (20.1)
Category: Substance Intake
  Alc 12 Drinks Per Day       Drinks per day                                    1,010  C  2.3 (1.57)
  Alc Hvy Frq 5plus           Frequency of 5+ drinks                            1,011  C  3 (1.44)
  TB DSM Difficulty Quitting  Tobacco difficulty quitting                       280    B  74.6%
  Mj Ab Dep                   Marijuana Dependence                              1,064  B  9.3%
Category: Psychiatric and Life function
  ASR Intn T                  Achenbach Adult Self-Report (Internalizing Symptoms)  1,062  C  48.5 (10.7)
  ASR Oth Raw                 Achenbach Adult Self-Report (Other problems)      1,062  C  9.1 (4.6)
  Depressive Ep               Has the participant experienced a diagnosed DSM-IV major depressive episode over his/her lifetime  1,035  B  9.2%
Category: Emotion
  Ang Hostil Unadj            NIH Toolbox Anger and Affect Survey (Attitudes of Hostility)  1,064  C  50.5 (8.58)
Category: Personality
  NEOFAC Agreeableness        Big Five trait: Agreeableness                     1,063  C  32 (4.93)
Category: Health
  BMI                         Body Mass Index                                   1,064  C  26.4 (5.1)

Table E.1: List of outcomes with description and descriptive statistics (name, description, number of observations, type, and mean (SD)).

E.2 Additional Results

Figure E.1 shows the predictive power of (1) symmetric Softer with D = 3, (2) symmetric Softer with D = 6, (3) standard Softer ignoring symmetry with D = 6, (4) the hard PARAFAC, and (5) Lasso. Results from (2), (4), and (5) are also shown in Figure 7. Comparing results from (1) and (2), we see that increasing the rank from D = 3 to D = 6 improved predictions for a subset of outcomes.
Ignoring the symmetry of the tensor predictor performed sometimes better and sometimes worse than accounting for symmetry directly in Softer, showing that the two approaches perform comparably for prediction (comparing (2) and (3)).

Figure E.1: Additional study results showing predicted variance explained and percentage of correctly classified held-out data based on a larger set of approaches (symmetric Softer with D = 3 and D = 6, standard Softer with D = 6, the hard PARAFAC of Guhaniyogi et al. (2017), and Lasso, using count or CSA as the tensor predictor).

The two versions of Softer with D = 3 and D = 6 returned related results for the identified important entries. Table E.2 shows the connections identified based on Softer with D = 3. When comparing the identified entries between symmetric Softer with rank 3 and 6, we see that the important connection for strength in Table 4 is also identified by Softer with D = 3, using either count or CSA as the tensor predictor. However, Softer with D = 3 identified a larger number of important connections for predicting strength than Softer with D = 6, though all of the connections involved the Precuneus region. The 3 connections identified by Softer with D = 3 using CSA were among the 5 connections identified by the same method using count as the predictor, an indication of stability in the identified brain connections across different features of brain connections. These results indicate the interplay between the larger number of parameters that need to be estimated when the rank is increased and the additional flexibility that a higher rank provides to estimation.
No important connections were identified for having had a depressive episode when using Softer with D = 3. The identified connection for predicting VSPLOT includes the Banks of the Superior Temporal Sulcus as one of the two regions for both D = 3 and D = 6, though in a different hemisphere and in combination with a different region. It is also promising that 2 of the 3 connections identified by Softer with D = 3 for the CSA tensor predictor and endurance as the outcome correspond to the same pair of regions, though in the opposite hemisphere.

E.3 Reproducibility of Identified Connections Across Subsamples

We fit Softer with D = 6 on 15 randomly chosen subsamples of our data, each including 90% of the observations. Since the sample size is smaller in the subsamples, we expect our power to detect important connections to also decrease. We investigate in how many of the subsamples we identify the connections marked as important in the full data set (with the same method), shown in Table 4.

Outcome    Feature  ROI 1                                    ROI 2
Strength   Count    (rh) Precuneus                           (rh) Superior Parietal
                    (lh) Superior Frontal                    (rh) Caudal Middle Frontal
                    (rh) Isthmus of cingulate gyrus          (rh) Pericalcarine
           CSA      (rh) Precuneus                           (rh) Superior Parietal
                    (lh) Superior Frontal                    (rh) Caudal Middle Frontal
VSPLOT     CSA      (rh) Banks of Superior Temporal Sulcus   (lh) Superior Frontal
Endurance  CSA      (lh) Cuneus                              (lh) Pericalcarine
                    (lh) Cuneus                              (rh) Pericalcarine
                    (lh) Insula                              (rh) Inferior Parietal

Table E.2: Identified connections based on Softer with D = 3. Some of these connections are related to those identified based on Softer with D = 6, and some are also identified based on an alternative feature, outcome, or hemisphere.

Strength. Using the count of streamlines as the tensor predictor, Softer with D = 6 identified one important connection, between the ROIs Precuneus and Superior Parietal, both in the right hemisphere.
The same connection was also identified in 10 out of 15 subsamples, while Softer did not identify any important connection in 4 of the subsamples. These results indicate that the identified connection for strength based on the count of streamlines stated in Table 4 is reproducible across subsamples.

Depressive Episode. Out of the 15 subsamples, Softer with D = 6 identified important connections in 6 subsamples when using the count of streamlines, and in 7 subsamples when using CSA as the tensor predictor. The connection between the Parahippocampal in the left hemisphere and the Lateral Orbitofrontal in the right hemisphere was identified in all 13 instances with identified connections, based on either predictor (and in agreement with the results in Table 4). What's more, every connection identified across any subsample and for either predictor involved at least one of these two regions, implying that the importance of these two regions in predicting whether an individual has had a depressive episode is reproducible.

VSPLOT. Using CSA as the tensor predictor, Softer with D = 6 identified no connections in any of the subsamples. This might imply that the connection in Table 4 is not reproducible, or that the reduced sample size in the subsamples does not allow us to identify it.

Appendix F. Symmetric and Semi-Symmetric Soft Tensor Regression

F.1 Softer for Symmetric 2-mode Predictor

We start again from the model in Equation 2, $Y_i = \mu + C_i^\top \delta + \langle X_i, B \rangle_F + \epsilon_i$. However, now $X_i$ is an $R \times R$ symmetric matrix with ignorable diagonal elements. This means that we can think of $B$ as a real symmetric matrix with ignorable diagonal elements.

F.1.1 Eigenvalue decomposition of real symmetric matrices

We can still approximate $B$ in the same way as in the PARAFAC (SVD) by writing $B = \sum_{d=1}^D \gamma_1^{(d)} \circ \gamma_2^{(d)}$ for some $D$ large enough and $\gamma_1^{(d)}, \gamma_2^{(d)} \in \mathbb{R}^R$. However, this would not enforce that $B$ is symmetric, since $B_{j_1 j_2} = \sum_{d=1}^D \gamma_{1,j_1}^{(d)} \gamma_{2,j_2}^{(d)} \neq \sum_{d=1}^D \gamma_{1,j_2}^{(d)} \gamma_{2,j_1}^{(d)} = B_{j_2 j_1}$.
This implies that the entries of $B$ would only be identifiable up to $B_{j_1 j_2} + B_{j_2 j_1}$. Since $B$ is a real symmetric matrix, it is diagonalizable and has an eigenvalue decomposition. Therefore, we can think of approximating $B$ using
$$B = \sum_{d=1}^D \xi^{(d)} \gamma^{(d)} \circ \gamma^{(d)}, \qquad (F.1)$$
for sufficiently large $D$, where $\gamma^{(d)} \in \mathbb{R}^R$ and $\xi^{(d)} \in \mathbb{R}$. Note that the vectors $\gamma^{(d)}$ here resemble the ones in the PARAFAC decomposition, but they are the same across the two tensor modes (matrix rows and columns). The main difference between the eigenvalue-based approximation in Equation F.1 and the PARAFAC-based approximation is the inclusion of the parameters $\xi^{(d)}$. Here, the $\xi^{(d)}$ are necessary in order to have an eigenvalue decomposition employing vectors with real entries. In fact, Equation F.1 without the $\xi^{(d)}$ can only approximate positive semi-definite symmetric matrices. To see this, take a vector $v \in \mathbb{R}^R$. Then,
$$v^\top \left( \sum_{d=1}^D \gamma^{(d)} \circ \gamma^{(d)} \right) v = v^\top \sum_{d=1}^D \gamma^{(d)} \gamma^{(d)\top} v = \sum_{d=1}^D v^\top \gamma^{(d)} \left( v^\top \gamma^{(d)} \right)^\top = \sum_{d=1}^D \left( v^\top \gamma^{(d)} \right)^2 \geq 0.$$

F.1.2 Soft eigenvalue-based tensor regression

In the case of tensor predictors without symmetries, Softer was built on the PARAFAC (the multi-mode equivalent of the SVD) approximation of the coefficient tensor. Instead, for symmetric matrices, Softer is based on the eigenvalue decomposition while still allowing for deviations in the row (and column)-specific contributions. However, these deviations also have to account for the symmetric nature of the tensor predictor. Write $B = \sum_{d=1}^D \xi^{(d)} B_1^{(d)} \circ B_2^{(d)}$ similarly to Equation 5, and assume prior distributions on all parameters as in Section 4.3. Note that the parameters $\gamma^{(d)}$ are not forced to be of norm 1 (as in the classic eigenvalue decomposition), and they can have any magnitude. This allows us to restrict the parameters $\xi^{(d)}$ to be in $\{-1, 1\}$, and base shrinkage of unnecessary ranks on shrinkage of the vectors $\gamma^{(d)}$. Therefore, we assume a Bernoulli(0.5) distribution over $\{-1, 1\}$ for the parameters $\xi^{(d)}$.
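The role of the signs $\xi^{(d)}$ in Equation F.1 can be verified numerically. The following sketch (in Python/NumPy for illustration only; all variable names are our own) checks that a sum of outer products $\gamma^{(d)} \circ \gamma^{(d)}$ without signs is always positive semi-definite, while including $\xi^{(d)} \in \{-1, 1\}$ recovers an arbitrary (indefinite) real symmetric matrix exactly from its eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
R, D = 5, 5

# Without the signs xi, a sum of outer products gamma gamma^T is always
# positive semi-definite: v^T B v = sum_d (v^T gamma^(d))^2 >= 0.
gammas = [rng.normal(size=R) for _ in range(D)]
B_psd = sum(np.outer(g, g) for g in gammas)
assert np.all(np.linalg.eigvalsh(B_psd) >= -1e-10)

# With xi^(d) in {-1, 1} (Equation F.1), any real symmetric matrix is
# representable: take gamma^(d) = sqrt(|lambda_d|) v_d from the
# eigendecomposition and xi^(d) = sign(lambda_d).
Q, _ = np.linalg.qr(rng.normal(size=(R, R)))         # random orthogonal basis
A = Q @ np.diag([3.0, 1.0, 0.7, -0.5, -2.0]) @ Q.T   # indefinite symmetric matrix
lam, V = np.linalg.eigh(A)                           # A = V diag(lam) V^T
xi = np.sign(lam)                                    # xi^(d) in {-1, 1}
gam = V * np.sqrt(np.abs(lam))                       # scale each eigenvector
B = sum(xi[d] * np.outer(gam[:, d], gam[:, d]) for d in range(R))
assert np.allclose(B, A)                             # exact reconstruction
```

A rank-$D$ truncation of this construction, keeping the $D$ largest $|\lambda_d|$, gives the low-rank approximation that the soft version then expands around.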
However, even though the symmetry of the underlying decomposition is enforced through $\gamma^{(d)}$, we need to ensure that it is also enforced when writing $B = \sum_{d=1}^D \xi^{(d)} B_1^{(d)} \circ B_2^{(d)}$ using $B_1^{(d)}, B_2^{(d)}$. Note that the Softer framework assumes that the entries $\beta_{1,j_1 j_2}^{(d)}$ are centered around $\gamma_{j_1}^{(d)}$, and similarly that the entries $\beta_{2,j_1 j_2}^{(d)}$ are centered around $\gamma_{j_2}^{(d)}$. However, doing so does not necessarily lead to a symmetric matrix $B$, since
$$B_{j_1 j_2} = \sum_{d=1}^D \xi^{(d)} \beta_{1,j_1 j_2}^{(d)} \beta_{2,j_1 j_2}^{(d)} \neq \sum_{d=1}^D \xi^{(d)} \beta_{1,j_2 j_1}^{(d)} \beta_{2,j_2 j_1}^{(d)} = B_{j_2 j_1}.$$
We enforce the symmetry of $B$ by focusing only on the lower-triangular part. Softer for a symmetric matrix predictor specifies row $i$'s contributions to the entries $B_{ij}$, $\beta_{1,ij}^{(d)}$, as centered around $\gamma_i^{(d)}$ only for $i > j$. Then, for $j > i$, we set $\beta_{1,ij}^{(d)} = \beta_{1,ji}^{(d)}$. Similarly, column $i$'s contributions to the entries $B_{ji}$, $\beta_{2,ji}^{(d)}$, are centered around $\gamma_i^{(d)}$ only for $i < j$, and for $i > j$ we set $\beta_{2,ji}^{(d)} = \beta_{2,ij}^{(d)}$. An equivalent way to enforce the symmetry of $B$ is to allow all entries in $B_1^{(d)}$ to have the same form as in Softer, and force $B_2^{(d)} = (B_1^{(d)})^\top$.

F.1.3 Note on implementation using RStan

Note that RStan cannot directly handle discrete parameters such as the $\xi^{(d)}$. The most common approach to discrete parameters is to specify the likelihood with these parameters integrated out. However, this approach is not easily applicable in our setting, since the $\xi^{(d)}$ are entangled in the likelihood through their role in the coefficient matrix $B$. For that reason, we take an alternative approach: we assume that the $\xi^{(d)}$ are continuous and specify a mixture-of-normals distribution on each of them: $\xi^{(d)} \sim 0.5\, N(-1, 0.001) + 0.5\, N(1, 0.001)$. Since the parameters $\xi^{(d)}$ are not directly of interest, and shrinkage of the contribution of component $d$ in a rank-$D$ decomposition is achieved through the prior on $\gamma^{(d)}$, we expect that this approach will closely resemble results from a specification that defines $\xi^{(d)}$ to be binary, taking values in $\{-1, 1\}$ with a Bernoulli(0.5) distribution.
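To illustrate why this continuous relaxation behaves like the binary specification, the following sketch (Python/NumPy rather than Stan, purely for illustration; the sampling function is ours, not part of the paper's implementation) draws from the mixture $0.5\, N(-1, 0.001) + 0.5\, N(1, 0.001)$ and checks that the draws concentrate tightly around $\pm 1$ with both signs equally likely.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_xi(n):
    """Draw from the relaxation 0.5 N(-1, 0.001) + 0.5 N(1, 0.001).

    Here 0.001 is the variance, so the standard deviation is about 0.032.
    """
    centers = rng.choice([-1.0, 1.0], size=n)        # Bernoulli(0.5) over {-1, 1}
    return centers + rng.normal(scale=np.sqrt(0.001), size=n)

xi = sample_xi(100_000)
# Draws concentrate tightly around -1 and 1, mimicking the binary spec,
# with both signs (approximately) equally likely.
assert np.all(np.abs(np.abs(xi) - 1.0) < 0.3)
assert abs(np.mean(xi > 0) - 0.5) < 0.01
```

Because the mixture components are so narrow, a posterior sample of $\xi^{(d)}$ effectively selects a sign for component $d$ while keeping the parameter space continuous for Hamiltonian Monte Carlo.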
F.2 Softer for Semi-Symmetric 3-mode Tensor

In brain connectomics, and specifically in our study of features of brain connections and their relationship to traits, tensor predictors are often of dimensions $R \times R \times p$ and are semi-symmetric. Semi-symmetry means that the predictor $X$ is symmetric along its first two modes: $X_{j_1 j_2 j_3} = X_{j_2 j_1 j_3}$. An example of such a tensor includes $R$ brain regions along the first two modes and $p$ features of brain connection characteristics along its third mode. When these features are symmetric (the feature of the connection from region $i$ to region $j$ is the same as the feature of the connection from region $j$ to region $i$), the tensor predictor is semi-symmetric. In such cases, the standard Softer approach could be applied, but the entries of $B$ would be identifiable only up to $B_{j_1 j_2 j_3} + B_{j_2 j_1 j_3}$. In order to account for the semi-symmetry in $X$, we can enforce the same type of semi-symmetry in $B$ by adopting a PARAFAC-eigenvalue decomposition hybrid. Specifically, assume that $B$ is a 3-mode semi-symmetric coefficient tensor corresponding to the semi-symmetric predictor $X$. Then, for sufficiently large $D$, $\gamma^{(d)} \in \mathbb{R}^R$ and $\rho^{(d)} \in \mathbb{R}^p$, we can write
$$B = \sum_{d=1}^D \gamma^{(d)} \circ \gamma^{(d)} \circ \rho^{(d)}. \qquad (F.2)$$
This leads to a natural approximation for $B$ for some value of $D$ potentially smaller than the true one. Softer for a semi-symmetric tensor predictor builds on Equation F.2 while allowing for deviations in the row-specific contributions along the three modes. We achieve this by specifying $B = \sum_{d=1}^D B_1^{(d)} \circ B_2^{(d)} \circ B_3^{(d)}$, where the $B_k^{(d)}$ are tensors of dimensions $R \times R \times p$. The structure and specification of the $B_k^{(d)}$ are as in Section 4.3, with small changes to account for the semi-symmetric structure in $X$ and to ensure that the estimated coefficient tensor is also semi-symmetric. Note that the $(j_1, j_2, j_3)$ entry of $B$ is equal to
$$B_{j_1 j_2 j_3} = \sum_{d=1}^D \beta_{1,j_1 j_2 j_3}^{(d)} \beta_{2,j_1 j_2 j_3}^{(d)} \beta_{3,j_1 j_2 j_3}^{(d)},$$
and we want $B_{j_1 j_2 j_3} = B_{j_2 j_1 j_3}$.
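The hybrid decomposition in Equation F.2 is semi-symmetric by construction, since the first two modes share the same vector $\gamma^{(d)}$. A short check (Python/NumPy for illustration; names are ours) builds a rank-$D$ tensor of this form and verifies the semi-symmetry along the first two modes:

```python
import numpy as np

rng = np.random.default_rng(2)
R, p, D = 6, 3, 4

# Build B = sum_d gamma^(d) o gamma^(d) o rho^(d)  (Equation F.2);
# np.einsum forms the 3-way outer product of the three vectors.
gammas = rng.normal(size=(D, R))
rhos = rng.normal(size=(D, p))
B = sum(np.einsum('i,j,k->ijk', gammas[d], gammas[d], rhos[d])
        for d in range(D))

# Semi-symmetry along the first two modes: B[j1, j2, j3] == B[j2, j1, j3].
assert np.allclose(B, np.transpose(B, (1, 0, 2)))
```

Unlike the symmetric 2-mode case, no sign parameters $\xi^{(d)}$ are needed here, since the entries of $\rho^{(d)}$ can be negative and absorb the sign of each component.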
Borrowing from the symmetric case, we allow all row-specific contributions along mode 1, $\beta_{1,j_1 j_2 j_3}^{(d)}$, to vary around the corresponding entry in the decomposition in Equation F.2, $\gamma_{j_1}^{(d)}$, and set $B_{2,\cdot \cdot j_3}^{(d)} = (B_{1,\cdot \cdot j_3}^{(d)})^\top$. Further, we allow the entries $\beta_{3,j_1 j_2 j_3}^{(d)}$ to vary around $\rho_{j_3}^{(d)}$ for $j_1 < j_2$, and set $\beta_{3,j_1 j_2 j_3}^{(d)} = \beta_{3,j_2 j_1 j_3}^{(d)}$ when $j_1 > j_2$. Doing so ensures that $B_{j_1 j_2 j_3} = B_{j_2 j_1 j_3}$.

References

Artin Armagan, David B. Dunson, and Jaeyong Lee. Generalized double Pareto shrinkage. Statistica Sinica, 23(1):119–143, 2013.

Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):Article 11, 2011.

Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus A. Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

Andrea E. Cavanna and Michael R. Trimble. The precuneus: a review of its functional anatomy and behavioural correlates. Brain, 129(3):564–583, 2006.

Michael W. Cole, Danielle S. Bassett, Jonathan D. Power, Todd S. Braver, and Steven E. Petersen. Intrinsic and task-evoked network architectures of the human brain. Neuron, 83(1):238–251, 2014.

David D. Cox and Robert L. Savoy. Functional magnetic resonance imaging (fMRI) "brain reading": Detecting and classifying distributed patterns of fMRI activity in human visual cortex. NeuroImage, 19(2):261–270, 2003.

R. Cameron Craddock, Paul E. Holtzheimer, Xiaoping P. Hu, and Helen S. Mayberg. Disease state prediction from resting state functional connectivity. Magnetic Resonance in Medicine, 62(6):1619–1628, 2009.

Paula L. Croxson, Stephanie J. Forkel, Leonardo Cerliani, and Michel Thiebaut de Schotten. Structural variability across the primate brain: A cross-species comparison. Cerebral Cortex, 28(11):3829–3841, 2018.

Rahul S. Desikan, Florent Ségonne, Bruce Fischl, Brian T. Quinn, Bradford C.
Dickerson, Deborah Blacker, Randy L. Buckner, Anders M. Dale, R. Paul Maguire, Bradley T. Hyman, Marilyn S. Albert, and Ronald J. Killiany. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage, 31:968–980, 2006.

Daniele Durante and David B. Dunson. Bayesian inference and testing of group differences in brain networks. Bayesian Analysis, 13(1):29–58, 2018.

Leonardo Fogassi and Giuseppe Luppino. Motor functions of the parietal lobe. Current Opinion in Neurobiology, 15(6):626–631, 2005.

Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–511, 1992.

Andrew Gelman, Aleks Jakulin, Maria Grazia Pittau, and Yu-Sung Su. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4):1360–1383, 2008.

Cedric E. Ginestet, Jun Li, Prakash Balachandran, Steven Rosenberg, and Eric D. Kolaczyk. Hypothesis testing for network data in functional neuroimaging. The Annals of Applied Statistics, 11(2):725–750, 2017.

Sharmistha Guha and Abel Rodriguez. Bayesian regression with undirected network predictors with an application to brain connectome data. Journal of the American Statistical Association, 116(534):581–593, 2021. doi: 10.1080/01621459.2020.1772079.

Rajarshi Guhaniyogi, Shaan Qamar, and David B. Dunson. Bayesian tensor regression. Journal of Machine Learning Research, 18:1–31, 2017.

John-Dylan Haynes and Geraint Rees. Predicting the orientation of invisible stimuli from activity in human primary visual cortex. Nature Neuroscience, 8(5):686–691, 2005.

Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593–1623, 2014.

Hung Hung and Chen-Chien Wang.
Matrix variate logistic regression model with application to EEG data. Biostatistics, 14(1):189–202, 2013. doi: 10.1093/biostatistics/kxs023.

Masaaki Imaizumi and Kohei Hayashi. Doubly decomposing nonparametric tensor regression. In 33rd International Conference on Machine Learning, ICML 2016, volume 2, pages 1136–1149, 2016.

Heishiro Kanagawa, Taiji Suzuki, Hayato Kobayashi, Nobuyuki Shimizu, and Yukihiro Tagami. Gaussian process nonparametric tensor estimator and its minimax optimality. In 33rd International Conference on Machine Learning, ICML 2016, volume 4, pages 2452–2473, 2016.

Kevin S. LaBar and Roberto Cabeza. Cognitive neuroscience of emotional memory. Nature Reviews Neuroscience, 7(1):54–64, 2006.

Xiaoshan Li, Da Xu, Hua Zhou, and Lexin Li. Tucker tensor regression and neuroimaging analysis. Statistics in Biosciences, 10(3):520–545, 2018.

Yi Liao, Xiaoqi Huang, Qizhu Wu, Chuang Yang, Weihong Kuang, Mingying Du, Su Lui, Qiang Yue, Raymond C. K. Chan, Graham J. Kemp, and Qiyong Gong. Is depression a disconnection syndrome? Meta-analysis of diffusion tensor imaging studies in patients with MDD. Journal of Psychiatry & Neuroscience, 38(1):49–56, 2013.

Koji Maruhashi, Masaru Todoriki, Takuya Ohwa, Keisuke Goto, Yu Hasegawa, Hiroya Inakoshi, and Hirokazu Anai. Learning multi-way relations via tensor decomposition with neural networks. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pages 3770–3777, 2018.

Helen S. Mayberg. Frontal lobe dysfunction in secondary depression. The Journal of Neuropsychiatry and Clinical Neurosciences, 6(4):428–442, 1994.

Helen S. Mayberg. Modulating dysfunctional limbic-cortical circuits in depression: towards development of brain-based algorithms for diagnosis and optimised treatment. British Medical Bulletin, 65(1):193–207, 2003.

Ian M. McDonough and Kaoru Nashiro.
Network complexity as a measure of information processing across resting-state networks: evidence from the Human Connectome Project. Frontiers in Human Neuroscience, 8:409, 2014.

Tom M. Mitchell, Rebecca Hutchinson, Radu S. Niculescu, Francisco Pereira, Xuerui Wang, Marcel Just, Sharlene Newman, Nada Lavrač, Hiroshi Motoda, and Tom Fawcett. Learning to decode cognitive states from brain images. Machine Learning, 57(1-2):145–175, 2004.

Peter Müller, Giovanni Parmigiani, and Kenneth Rice. FDR and Bayesian multiple comparisons rules. Technical report, 2006.

Kenneth A. Norman, Sean M. Polyn, Greg J. Detre, and James V. Haxby. Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences, 10(9):424–430, 2006.

Alice J. O'Toole, Fang Jiang, Hervé Abdi, and James V. Haxby. Partially distributed representations of objects and faces in ventral temporal cortex. Journal of Cognitive Neuroscience, 17(4):580–590, 2005.

Trevor Park and George Casella. The Bayesian Lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.

Sean M. Polyn, Vaidehi S. Natu, Jonathan D. Cohen, and Kenneth A. Norman. Category-specific cortical activity precedes retrieval during memory search. Science, 310(5756):1963–1966, 2005.

Roberta Riccelli, Nicola Toschi, Salvatore Nigro, Antonio Terracciano, and Luca Passamonti. Surface-based morphometry reveals the neuroanatomical basis of the five-factor model of personality. Social Cognitive and Affective Neuroscience, 12(4):671–684, 2017.

Emile Richard, Pierre-André Savalle, and Nicolas Vayatis. Estimation of simultaneously sparse and low rank matrices. In Proceedings of the 29th International Conference on Machine Learning, 2012.

Jonas Richiardi, Hamdi Eryilmaz, Sophie Schwartz, Patrik Vuilleumier, and Dimitri Van De Ville. Decoding brain states from fMRI connectivity graphs. NeuroImage, 56(2):616–626, 2011.

Lorraine Schwartz. On Bayes procedures.
Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 4(1):10–26, 1965.

James G. Scott and James O. Berger. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics, 38(5):2587–2619, 2010.

D. A. Seminowicz, H. S. Mayberg, A. R. McIntosh, K. Goldapple, S. Kennedy, Z. Segal, and S. Rafi-Tari. Limbic-frontal circuitry in major depression: a path modeling metanalysis. NeuroImage, 22(1):409–418, 2004.

Marco Signoretto, Lieven De Lathauwer, and Johan A. K. Suykens. Learning tensors in reproducing kernel Hilbert spaces with multilinear spectral penalties. Technical report, 2013.

Olivier Simon, Jean-François Mangin, Laurent Cohen, Denis Le Bihan, and Stanislas Dehaene. Topographical layout of hand, eye, calculation, and language-related areas in the human parietal lobe. Neuron, 33(3):475–487, 2002.

Stephen M. Smith, Thomas E. Nichols, Diego Vidaurre, Anderson M. Winkler, Timothy E. J. Behrens, Matthew F. Glasser, Kamil Ugurbil, Deanna M. Barch, David C. Van Essen, and Karla L. Miller. A positive-negative mode of population covariation links brain connectivity, demographics and behavior. Nature Neuroscience, 18(11), 2015.

Daniel Spencer. Inference and uncertainty quantification for high-dimensional tensor regression with tensor decompositions and Bayesian methods. PhD thesis, 2020. URL https://escholarship.org/uc/item/9b3072j1.

Stan Development Team. RStan: the R interface to Stan, 2018. URL http://mc-stan.org.

Ronald J. Steingard, Perry F. Renshaw, John Hennen, Mara Lenox, Christina Bonella Cintron, Ashley D. Young, Daniel F. Connor, Trang H. Au, and Deborah A. Yurgelun-Todd. Smaller frontal lobe white matter volumes in depressed adolescents. Biological Psychiatry, 52(5):413–417, 2002.

Taiji Suzuki, Heishiro Kanagawa, Hayato Kobayashi, Nobuyuki Shimizu, and Yukihiro Tagami. Minimax optimal alternating minimization for kernel nonparametric tensor learning.
In Advances in Neural Information Processing Systems, pages 3790–3798, 2016.

Ledyard Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.

David C. Van Essen, Stephen M. Smith, Deanna M. Barch, Timothy E. J. Behrens, and Essa Yacoub. The WU-Minn Human Connectome Project: An overview. NeuroImage, 80:62–79, 2013.

Lu Wang, Zhengwu Zhang, and David Dunson. Symmetric bilinear regression for signal subgraph estimation. IEEE Transactions on Signal Processing, PP(c):1, 2018.

Xiang Wang, David Sontag, and Fei Wang. Unsupervised learning of disease progression models. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 85–94, 2014. doi: 10.1145/2623330.2623754.

Andrew E. Waters, Aswin C. Sankaranarayanan, and Richard G. Baraniuk. SpaRCS: Recovering low-rank and sparse matrices from compressive measurements. In Advances in Neural Information Processing Systems, pages 1089–1097, 2011.

Nicole Wenderoth, Filiep Debaere, Stefan Sunaert, and Stephan P. Swinnen. The role of anterior cingulate cortex and precuneus in the coordination of motor behaviour. European Journal of Neuroscience, 22(1):235–246, 2005.

W. J. Youden. Index for rating diagnostic tests. Cancer, 3(1):32–35, 1950.

Ling-Li Zeng, Hui Shen, Li Liu, Lubin Wang, Baojuan Li, Peng Fang, Zongtan Zhou, Yaming Li, and Dewen Hu. Identifying major depression using whole-brain functional connectivity: a multivariate pattern analysis. Brain, 135:1498–1507, 2012.

Jian Zhai and Ke Li. Predicting brain age based on spatial and temporal features of human brain functional networks. Frontiers in Human Neuroscience, 13:Article 62, 2019.

Yuxiang Zhang, Bo Du, Liangpei Zhang, and Shugen Wang. A low-rank and sparse matrix decomposition-based Mahalanobis distance method for hyperspectral anomaly detection. IEEE Transactions on Geoscience and Remote Sensing, 54(3):1376–1389, 2016.
Zhengwu Zhang, Maxime Descoteaux, Jingwen Zhang, Gabriel Girard, Maxime Chamberland, David Dunson, Anuj Srivastava, and Hongtu Zhu. Mapping population-based structural connectomes. NeuroImage, 172:130–145, 2018.

Zhengwu Zhang, Genevera I. Allen, Hongtu Zhu, and David Dunson. Tensor network factorizations: Relationships between brain structural connectomes and traits. NeuroImage, 197:330–343, 2019.

Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552, 2013.

Tianyi Zhou and Dacheng Tao. GoDec: Randomized low-rank & sparse matrix decomposition in noisy case. In Proceedings of the 28th International Conference on Machine Learning, 2011.