# Statistical mechanics of low-rank tensor decomposition

Jonathan Kadmon, Department of Applied Physics, Stanford University. kadmonj@stanford.edu

Surya Ganguli, Department of Applied Physics, Stanford University and Google Brain, Mountain View, CA. sganguli@stanford.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Often, large, high-dimensional datasets collected across multiple modalities can be organized as a higher-order tensor. Low-rank tensor decomposition then arises as a powerful and widely used tool to discover simple low-dimensional structures underlying such data. However, we currently lack a theoretical understanding of the algorithmic behavior of low-rank tensor decompositions. We derive Bayesian approximate message passing (AMP) algorithms for recovering arbitrarily shaped low-rank tensors buried within noise, and we employ dynamic mean field theory to precisely characterize their performance. Our theory reveals the existence of phase transitions between easy, hard and impossible inference regimes, and displays an excellent match with simulations. Moreover, it reveals several qualitative surprises compared to the behavior of symmetric, cubic tensor decomposition. Finally, we compare our AMP algorithm to the most commonly used algorithm, alternating least squares (ALS), and demonstrate that AMP significantly outperforms ALS in the presence of noise.

1 Introduction

The ability to take noisy, complex data structures and decompose them into smaller, interpretable components in an unsupervised manner is essential to many fields, from machine learning and signal processing [1, 2] to neuroscience [3]. In datasets that can be organized as an order-2 data matrix, many popular unsupervised structure discovery algorithms, like PCA, ICA, SVD or other spectral methods, can be unified under the rubric of low-rank matrix decomposition. More complex data consisting of measurements across multiple modalities can be organized as higher-dimensional data arrays, or higher-order tensors. Often, one can find simple structures in such data by approximating the data tensor as a sum of rank-1 tensors. Such decompositions are known by the names of rank decomposition, CANDECOMP/PARAFAC, or CP decomposition (see [4] for an extensive review).

The most widely used algorithm to perform rank decomposition is alternating least squares (ALS) [5, 6], which uses convex optimization techniques on different slices of the tensor. However, a major disadvantage of ALS is that it does not perform well in the presence of highly noisy measurements. Moreover, its theoretical properties are not well understood. Here we derive and analyze an approximate message passing (AMP) algorithm for optimal Bayesian recovery of arbitrarily shaped, high-order, low-rank tensors buried in noise. As a result, we obtain an AMP algorithm that both outperforms ALS and admits an analytic theory of its performance limits.

AMP algorithms have a long history dating back to early work on the statistical physics of perceptron learning [7, 8] (see [9] for a review). The term AMP was coined by Donoho, Maleki and Montanari in their work on compressed sensing [10] (see also [11, 12, 13, 14, 15, 16] for replica approaches to compressed sensing and high dimensional regression). AMP approximates belief propagation in graphical models, and a rigorous analysis of AMP was carried out in [17].
For a rank-one matrix estimation problem, AMP was first introduced and analyzed in [18]. This framework has been extended in a beautiful body of work by Krzakala, Zdeborová and collaborators to various low-rank matrix factorization problems [19, 20, 21, 22]. Recently, low-rank tensor decomposition through AMP was studied in [21], but that analysis was limited to symmetric tensors, which are then necessarily cubic in shape. In [23], a similar approach was used to extend the analysis of order-2 tensors (matrices) to order-3 tensors, which can potentially be further extended to higher orders.

However, tensors that occur naturally in the wild are almost never cubic in shape, nor are they symmetric. The reason is that the p different modes of an order-p tensor correspond to measurements across very different modalities, resulting in very different numbers of dimensions across modes, yielding highly irregularly shaped, non-cubic tensors with no symmetry properties. For example, in EEG studies three different tensor modes could correspond to time, spatial scale, and electrodes [24]. In fMRI studies the modes could span channels, time, and patients [25]. In neurophysiological measurements they could span neurons, time, and conditions [26] or neurons, time, and trials [3]. In studies of visual cortex, modes could span neurons, time and stimuli [27]. Thus, given that tensors in the wild are almost never cubic, nor symmetric, to bridge the gap between theory and experiment, we go beyond prior work to derive and analyze Bayes-optimal AMP algorithms for arbitrarily shaped, high-order, low-rank tensor decomposition with different priors for different tensor modes, reflecting their different measurement types.

We find that the low-rank decomposition problem admits two phase transitions separating three qualitatively different inference regimes: (1) the easy regime at low noise, where AMP works; (2) the hard regime at intermediate noise, where AMP fails but the ground-truth tensor is still possible to recover, if not in a computationally tractable manner; and (3) the impossible regime at high noise, where it is believed no algorithm can recover the ground-truth low-rank tensor.

From a theoretical perspective, our analysis reveals several surprises relative to the analysis of symmetric cubic tensors in [21]. First, for symmetric tensors, it was shown that the easy inference regime cannot exist unless the prior over the low-rank factor has non-zero mean. In contrast, for non-symmetric tensors, one tensor mode can have zero mean without destroying the existence of the easy regime, as long as the other modes have non-zero mean. Furthermore, we find that in the space of all possible tensor shapes, the hard regime has the largest width along the noise axis when the shape is cubic, thereby indicating that tensor shape can have a strong effect on inference performance, and that cubic tensors have highly non-generic properties in the space of all possible tensor shapes.

Before continuing, we note some connections to the statistical mechanics literature. Indeed, AMP is closely related to the TAP equations and the cavity method [28, 29] in glassy spin systems. Furthermore, the posterior distribution of noisy tensor factorization is equivalent to that of p-spin magnetic systems [30], as we show below in section 2.2. For Bayes-optimal inference, the phase space of the problem is reduced to the Nishimori line [31]. This ensures that the system does not exhibit replica-symmetry breaking.
Working in the Bayes-optimal setting thus significantly simplifies the statistical analysis of the model. Furthermore, it allows theoretical insights into the inference phase transitions, as we shall see below. In practice, for many applications the prior or underlying rank of the tensor is not known a priori. The algorithms we present here can also be applied in a non-Bayes-optimal setting, where the parametric form of the prior cannot be determined. In that case, the theoretical asymptotics we describe here may not hold. However, approximately Bayes-optimal settings can be recovered through parameter learning using expectation-maximization algorithms [32]. We discuss these consequences in section 4. Importantly, the connection to the statistical physics of magnetic systems allows the adaptation of many tools and intuitions developed extensively in the past few decades; see e.g. [33]. We discuss more connections to statistical mechanics as we proceed below.

2 Low rank decomposition using approximate message passing

In the following we define the low-rank tensor decomposition problem and present a derivation of AMP algorithms designed to solve this problem, as well as a dynamical mean field theory analysis of their performance. A full account of the derivations can be found in the supplementary material.

2.1 Low-rank tensor decomposition

Consider a general tensor $Y$ of order $p$, whose components are given by a set of $p$ indices, $Y_{i_1,i_2,\ldots,i_p}$. Each index $i_\alpha$ is associated with a specific mode of the tensor. The dimension of mode $\alpha$ is $N_\alpha$, so the index $i_\alpha$ ranges from $1$ to $N_\alpha$. If $N_\alpha = N$ for all $\alpha$ then the tensor is said to be cubic. Otherwise we define $N$ as the geometric mean of all dimensions, $N = (\prod_{\alpha=1}^p N_\alpha)^{1/p}$, and denote $n_\alpha \equiv N_\alpha/N$ so that $\prod_{\alpha=1}^p n_\alpha = 1$. We employ the shorthand notation $Y_{i_1,i_2,\ldots,i_p} \equiv Y_a$, where $a = \{i_1,\ldots,i_p\}$ is a set of $p$ numbers indicating a specific element of $Y$. A rank-1 tensor of order $p$ is the outer product of $p$ vectors (order-1 tensors), $\bigotimes_{1\le\alpha\le p} x_\alpha$, where $x_\alpha \in \mathbb{R}^{N_\alpha}$. A rank-$r$ tensor of order $p$ has a special structure that allows it to be decomposed into a sum of $r$ rank-1 tensors, each of order $p$. The goal of the rank decomposition is to find all $x^\rho_\alpha \in \mathbb{R}^{N_\alpha}$, for $\alpha = 1,\ldots,p$ and $\rho = 1,\ldots,r$, given a tensor $Y$ of order $p$ and rank $r$. In the following, we will use $x_{\alpha i} \in \mathbb{R}^r$ to denote the vector of values at each entry of the tensor, spanning the $r$ rank-1 components. In a low-rank decomposition it is assumed that $r < N$. In noisy low-rank decomposition, individual elements $Y_a$ are noisy measurements of a low-rank tensor [Figure 1.A]. A comprehensive review on tensor decomposition can be found in [4].

We state the problem of low-rank noisy tensor decomposition as follows. Given a rank-$r$ tensor

$$w_a = N^{-(p-1)/2} \sum_{\rho=1}^{r} \prod_{\alpha=1}^{p} x^{\rho}_{\alpha i}, \qquad (1)$$

we would like to find all the underlying factors $x^\rho_{\alpha i}$. We note that we have used the shorthand notation $i = i_\alpha$ to refer to the index $i_\alpha$, which ranges from 1 to $N_\alpha$, i.e. the dimensionality of mode $\alpha$ of the tensor. Now consider a noisy measurement of the rank-$r$ tensor $w$ given by

$$Y_a = w_a + \sqrt{\Delta}\,\epsilon_a, \qquad (2)$$

where $\epsilon$ is a random noise tensor of the same shape as $w$ whose elements are distributed i.i.d. according to a standard normal distribution, yielding a total noise variance $\Delta$ of $O(1)$ [Fig. 1.A]. The underlying factors $x^\rho_{\alpha i}$ are sampled i.i.d. from a prior distribution $P_\alpha(x)$ that may vary between the modes $\alpha$. This model is a generalization of the spiked-tensor models studied in [21, 34]. We study the problem in the thermodynamic limit where $N \to \infty$ while $r, n_\alpha \sim O(1)$. In that limit, the mean-field theory we derive below becomes exact.
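To make the generative model concrete, the following minimal sketch (Python/NumPy; not the authors' released code, with illustrative function and parameter names) draws factors i.i.d. from mode-wise Gaussian priors and assembles a noisy observation according to eqs. (1)-(2). Any other prior can be substituted for the Gaussian draw.

```python
# A minimal sketch of the spiked-tensor generative model of eqs. (1)-(2):
# a rank-r, order-p tensor with mode dimensions N_alpha, buried in Gaussian
# noise of variance delta. Factor entries are drawn i.i.d. from mode-wise
# Gaussian priors N(mu_alpha, sigma_alpha^2).
import numpy as np
import string

def spiked_tensor(dims, rank, delta, mus, sigmas, seed=0):
    """Return (Y, factors): noisy observation and ground-truth factors of shape (N_alpha, rank)."""
    rng = np.random.default_rng(seed)
    p = len(dims)
    N = np.prod(dims) ** (1.0 / p)                                # geometric-mean dimension
    factors = [rng.normal(mus[a], sigmas[a], size=(dims[a], rank)) for a in range(p)]
    # w_a = N^{-(p-1)/2} sum_rho prod_alpha x^rho_{alpha, i_alpha}   [eq. (1)]
    letters = string.ascii_lowercase[:p]
    subscripts = ",".join(f"{l}r" for l in letters) + "->" + letters
    w = np.einsum(subscripts, *factors) / N ** ((p - 1) / 2.0)
    Y = w + np.sqrt(delta) * rng.standard_normal(dims)            # eq. (2)
    return Y, factors

# Example: an order-3, rank-1 tensor of shape 200 x 400 x 100 at noise level delta = 0.5.
Y, xs = spiked_tensor(dims=(200, 400, 100), rank=1, delta=0.5, mus=[0.2] * 3, sigmas=[1.0] * 3)
```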
The achievable performance in the decomposition problem depends on the signal-to-noise ratio (SNR) between the underlying low-rank tensor (the signal) and the noise variance $\Delta$. In eq. (1) we have scaled the SNR (signal variance divided by noise variance) with $N$ so that the SNR is proportional to the ratio between the $O(N)$ unknowns and the $N^p$ measurements, making the inference problem neither trivially easy nor always impossible. From a statistical physics perspective, this same scaling ensures that the posterior distribution over the factors given the data corresponds to a Boltzmann distribution whose Hamiltonian has extensive energy proportional to $N$, which is necessary for nontrivial phase transitions to occur.

2.2 Tensor decomposition as a Bayesian inference problem

In Bayesian inference, one wants to compute properties of the posterior distribution

$$P(w|Y) = \frac{1}{Z(Y)} \prod_{\alpha, i, \rho} P_\alpha(x^\rho_{\alpha i}) \prod_a P_{\mathrm{out}}(Y_a | w_a). \qquad (3)$$

Figure 1: Low-rank decomposition of an order-3 spiked tensor. (A) The observation tensor $Y$ is a sum of $r$ rank-1 tensors and a noise tensor $\epsilon$ with variance $\Delta$. (B) Factor graph for the decomposition of an order-3 tensor. (C) Incoming messages into each variable node arrive from variable nodes connected to the adjacent factor nodes. (D) Each node receives $N^{p-2}$ messages from each of the other variable nodes in the graph.

Here $P_{\mathrm{out}}(Y_a|w_a)$ is an element-wise output channel that introduces independent noise into individual measurements. For additive white Gaussian noise, the output channel in eq. (3) is given by

$$\log P_{\mathrm{out}}(Y_a|w_a) = g(Y_a, w_a) = -\frac{1}{2\Delta}(Y_a - w_a)^2 - \frac{1}{2}\log 2\pi\Delta, \qquad (4)$$

where $g(\cdot,\cdot)$ is a quadratic cost function. The denominator $Z(Y)$ in (3) is a normalization factor, or the partition function in statistical physics. In a Bayes-optimal setting, the priors $P_\alpha(x)$, as well as the rank $r$ and the noise level $\Delta$, are known.

The channel universality property [19] states that for low-rank decomposition problems, $r \ll N$, any output channel is equivalent to simple additive white Gaussian noise, as defined in eq. (2). Briefly, the output channel can be developed as a power series in $w_a$. For low-rank estimation problems we have $w_a \ll 1$ [eq. (1)], and we can keep only the leading terms in the expansion. One can show that the remaining terms are equivalent to random Gaussian noise, with variance equal to the inverse of the Fisher information of the channel [see supplementary material for further details]. Thus, non-additive and non-Gaussian measurement noise at the level of individual elements can be replaced with an effective additive Gaussian noise, making the theory developed here much more generally applicable to diverse noise scenarios.

The motivation behind the analysis below is the observation that the posterior (3), with the quadratic cost function (4), is equivalent to a Boltzmann distribution of a magnetic system at equilibrium, where $x_{\alpha i} \in \mathbb{R}^r$ can be thought of as the $r$-dimensional vectors of a spherical (xy)-spin model [30].
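As a concrete illustration of this energy-based view, the sketch below (our own helper, assuming Gaussian priors and the Gaussian output channel of eq. (4)) evaluates the unnormalized log-posterior of eq. (3), i.e. minus the Hamiltonian of the equivalent magnetic system, up to additive constants.

```python
# A minimal sketch of the unnormalized log-posterior of eqs. (3)-(4) for candidate
# factors `xs` (a list of (N_alpha, r) arrays), assuming Gaussian priors
# N(mu_alpha, sigma_alpha^2) and additive Gaussian noise of variance `delta`.
import numpy as np
import string

def quadratic_cost(Y, w, delta):
    """Sum over elements of g(Y_a, w_a) = -(Y_a - w_a)^2 / (2 delta) - 0.5 log(2 pi delta)  [eq. (4)]."""
    return np.sum(-(Y - w) ** 2 / (2.0 * delta) - 0.5 * np.log(2.0 * np.pi * delta))

def log_posterior(xs, Y, delta, mus, sigmas):
    """log P(w|Y) up to the normalization Z(Y): output channel plus Gaussian log-priors  [eq. (3)]."""
    p = len(xs)
    N = np.prod([x.shape[0] for x in xs]) ** (1.0 / p)            # geometric-mean dimension
    letters = string.ascii_lowercase[:p]
    w = np.einsum(",".join(f"{l}r" for l in letters) + "->" + letters, *xs) / N ** ((p - 1) / 2.0)
    log_prior = sum(np.sum(-(x - mu) ** 2 / (2.0 * s ** 2)) for x, mu, s in zip(xs, mus, sigmas))
    return quadratic_cost(Y, w, delta) + log_prior                # minus the Hamiltonian, up to constants
```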
2.3 Approximate message passing on factor graphs

To solve the problem of low-rank decomposition we frame it as a graphical model with an underlying bipartite factor graph. The variable nodes in the graph represent the $rN\sum_\alpha n_\alpha$ unknowns $x^\rho_{\alpha i}$, and the $N^p$ factor nodes correspond to the measurements $Y_a$. The edges in the graph connect a factor node $Y_a$ to the variable nodes in its neighbourhood $\partial a$ [Figure 1.B]. More precisely, for each factor node $a = \{i_1, i_2, \ldots, i_p\}$, the set of variable nodes in the neighbourhood $\partial a$ is precisely $\{x_{1 i_1}, x_{2 i_2}, \ldots, x_{p i_p}\}$, where each $x_{\alpha i_\alpha} \in \mathbb{R}^r$. Again, in the following we will use the shorthand notation $x_{\alpha i}$ for $x_{\alpha i_\alpha}$.

The state of a variable node is defined as the marginal probability distribution $\eta_{\alpha i}(x)$ over the $r$ components of the vector $x_{\alpha i} \in \mathbb{R}^r$. The estimators $\hat{x}_{\alpha i} \in \mathbb{R}^r$ for the values of the factors $x_{\alpha i}$ are given by the means of the marginal distributions $\eta_{\alpha i}(x)$. In the approximate message passing framework, the state of each node (also known as a belief), $\eta_{\alpha i}(x)$, is transmitted to all other variable nodes via its adjacent factor nodes [Fig. 1.C]. The state of each node is then updated by marginalizing over all the incoming messages, weighted by the cost function and observations at the factor nodes they passed through on the way in:

$$\eta_{\alpha i}(x) = \frac{P_\alpha(x)}{Z_{\alpha i}} \prod_{a \in \partial(\alpha i)} \mathrm{Tr}_{x_{\beta j}} \prod_{\beta j \in a \setminus \alpha i} \eta_{\beta j}(x_{\beta j})\, e^{g(Y_a, w_a)}. \qquad (5)$$

Here $P_\alpha(x)$ is the prior for each factor $x_{\alpha i}$ associated with mode $\alpha$, and $Z_{\alpha i} = \int dx\, \eta_{\alpha i}(x)$ is the partition function for normalization. The first product in (5) spans all factor nodes adjacent to variable node $\alpha i$. The second product is over all variable nodes adjacent to each of those factor nodes, excluding the target node $\alpha i$. The trace $\mathrm{Tr}_{x_{\beta j}}$ denotes the marginalization of the cost function $g(Y_a, w_a)$ over all incoming distributions. The mean of the marginalized posterior at node $\alpha i$ is given by

$$\hat{x}_{\alpha i} = \int dx\, \eta_{\alpha i}(x)\, x \;\in \mathbb{R}^r, \qquad (6)$$

and its covariance is

$$\hat{\sigma}^2_{\alpha i} = \int dx\, \eta_{\alpha i}(x)\, x x^T - \hat{x}_{\alpha i} \hat{x}^T_{\alpha i} \;\in \mathbb{R}^{r \times r}. \qquad (7)$$

Eq. (5) defines an iterative process for updating the beliefs in the network. In what follows, we use mean-field arguments to derive iterative equations for the means and covariances of these beliefs in (6)-(7). This is possible given the assumption that incoming messages into each node are probabilistically independent. Independence is a good assumption when short loops in the underlying graphical model can be neglected. One way this can occur is if the factor graph is sparse [35, 36]. Such graphs can be approximated by a directed acyclic graph; in statistical physics this is known as the Bethe approximation [37]. Alternatively, in low-rank tensor decomposition, the statistical independence of incoming messages originates from weak pairwise interactions that scale as $w \sim N^{-(p-1)/2}$. Loops correspond to higher-order interaction terms, which become negligible in the thermodynamic limit [17, 33]. Exploiting these weak interactions, we construct an accurate mean-field theory for AMP.

Each node $\alpha i$ receives $N^{p-2}$ messages from every node $\beta j$ with $\beta \neq \alpha$, through all the factor nodes that are connected to both nodes, $\{Y_b \mid b \ni \alpha i, \beta j\}$ [Fig. 1.D]. Under the independence assumption of incoming messages, we can use the central limit theorem to express the state of node $\alpha i$ in (5) as

$$\eta_{\alpha i}(x) = \frac{P_\alpha(x)}{Z_\alpha(A_{\alpha i}, u_{\alpha i})} \prod_{\beta j \neq \alpha i} \exp\!\left( -\tfrac{1}{2} x^T A_{\beta j} x + u^T_{\beta j} x \right), \qquad (8)$$

where $A^{-1}_{\beta j} u_{\beta j}$ and $A^{-1}_{\beta j}$ are the mean and covariance of the local incoming messages, respectively. The distribution is normalized by the partition function

$$Z_\alpha(A, u) = \int dx\, P_\alpha(x) \exp\!\left( u^T x - \tfrac{1}{2} x^T A x \right). \qquad (9)$$

The mean and covariance of the distribution, eqs. (6) and (7), are the moments of the partition function,

$$\hat{x}_{\alpha i} = \partial_{u_{\alpha i}} \log Z_\alpha, \qquad \hat{\sigma}_{\alpha i} = \partial^2_{u_{\alpha i}\, u^T_{\alpha i}} \log Z_\alpha. \qquad (10)$$

Finally, by expanding $g(Y_a, w_a)$ in eq. (5) to quadratic order in $w$, and averaging over the posterior, one can find self-consistent equations for $A_{\alpha i}$ and $u_{\alpha i}$ in terms of $\hat{x}_{\alpha i}$ and $Y$ [see supplemental material for details].
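For a Gaussian prior the moments in (10) are available in closed form. The sketch below (our notation; it assumes the convention of eq. (9) with the factor of 1/2) implements the resulting scalar input function for rank r = 1, the building block reused in the AMP and state-evolution sketches further below. For other priors (Bernoulli, Gauss-Bernoulli, and so on) the same two moments can be obtained analytically or by one-dimensional quadrature.

```python
# A minimal sketch of the input function defined by eqs. (9)-(10) for a Gaussian prior
# P_alpha(x) = N(mu, sigma2) and rank r = 1, assuming the convention
# Z_alpha(A, u) = int dx P_alpha(x) exp(u x - A x^2 / 2). The tilted distribution is
# then itself Gaussian, so both moments are available in closed form.
import numpy as np

def gaussian_input_moments(A, u, mu, sigma2):
    """Return (xhat, var): mean and variance of P_alpha(x) exp(u x - A x^2 / 2) / Z_alpha  [eq. (10)]."""
    precision = 1.0 / sigma2 + A           # precision of the tilted Gaussian
    xhat = (mu / sigma2 + u) / precision   # xhat = d/du log Z_alpha
    var = 1.0 / precision                  # var  = d^2/du^2 log Z_alpha
    return xhat, var

# Example: the estimate of a factor entry given local fields u with A = 2.
print(gaussian_input_moments(A=2.0, u=np.array([0.5, -1.0]), mu=0.2, sigma2=1.0))
```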
2.4 AMP algorithms

Using equations (10) and the self-consistent equations for $A_{\alpha i}$ and $u_{\alpha i}$, we construct an iterative algorithm whose dynamics converges to the solution of the self-consistent equations [see supplemental material for details]. The resulting update equations for the parameters are

$$u^t_{\alpha i} = \frac{1}{\Delta}\Bigg[ \frac{1}{N^{(p-1)/2}} \sum_{a \in \partial(\alpha i)} Y_a \prod_{(\beta, j) \in a \setminus (\alpha, i)}^{\odot} \hat{x}^t_{\beta j} \;-\; \hat{x}^{t-1}_{\alpha i} \odot \sum_{\beta \neq \alpha} \Sigma^t_\beta\, D^{t,t-1}_{\alpha\beta} \Bigg] \qquad (11)$$

$$A^t_\alpha = \frac{1}{\Delta} \prod_{\beta \neq \alpha}^{\odot} \left( \frac{1}{N} \sum_{j=1}^{N_\beta} \hat{x}^t_{\beta j}\, \hat{x}^{t\,T}_{\beta j} \right) \qquad (12)$$

$$\hat{x}^{t+1}_{\alpha i} = \partial_{u^t_{\alpha i}} \log Z_\alpha(A^t_\alpha, u^t_{\alpha i}) \qquad (13)$$

$$\hat{\sigma}^{t+1}_{\alpha i} = \partial^2_{u^t_{\alpha i}\, u^{t\,T}_{\alpha i}} \log Z_\alpha(A^t_\alpha, u^t_{\alpha i}). \qquad (14)$$

The second term on the RHS of (11) is given by

$$D^{t,t-1}_{\alpha\beta} = \prod_{\gamma \neq \alpha, \beta}^{\odot} \frac{1}{N} \sum_{k} \hat{x}^t_{\gamma k}\, \hat{x}^{t-1,\,T}_{\gamma k}, \qquad \Sigma^t_\alpha = \frac{1}{N} \sum_i \hat{\sigma}^t_{\alpha i}. \qquad (15)$$

This term originates from the exclusion of the target node $\alpha i$ from the product in equations (5) and (8). In statistical physics it corresponds to an Onsager reaction term, due to the removal of the node yielding a cavity field [38]. In the above, the notations $\odot$ and $\prod^{\odot}$ denote component-wise multiplication between two and between multiple tensors, respectively.

Note that in the derivation of the iterative update equations above, we have implicitly used the assumption that we are in the Bayes-optimal regime, which simplifies eqs. (11)-(14) [see supplementary material for details]. The AMP algorithms can be derived without the assumption of Bayes-optimality, resulting in a slightly more complicated set of algorithms [see supplementary material for details]. However, the further analytic treatment that is the focus of the current work, and the derivation of the dynamic mean-field theory which we present below, apply in the Bayes-optimal regime, where there is no replica-symmetry breaking and the estimators are self-averaging.

Once the update equations converge, the estimates for the factors $x_{\alpha i}$ and their covariances are given by the fixed-point values of equations (13) and (14), respectively. A statistical treatment of the convergence in typical settings is presented in the following section.
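To make the update equations concrete, here is a minimal sketch (not the authors' released tensorAMP implementation) of the Bayes-optimal AMP iteration (11)-(14) for the simplest case p = 3, r = 1 with Gaussian priors. The parallel update schedule, the optional damping, and the placement of the N-normalizations in the Onsager term follow the reconstruction of eqs. (11), (12) and (15) above and should be read as assumptions of this sketch.

```python
# A minimal sketch of the Bayes-optimal AMP iteration (11)-(14) for p = 3, r = 1 and
# Gaussian priors N(mu_alpha, sigma2_alpha); not the authors' released implementation.
# The N-normalizations in the Onsager correction follow eq. (15) as reconstructed above.
import numpy as np

def amp_order3_rank1(Y, delta, mus, sigma2s, n_iter=100, damping=0.0, seed=0):
    rng = np.random.default_rng(seed)
    dims = Y.shape
    N = np.prod(dims) ** (1.0 / 3.0)                              # geometric-mean dimension
    xh = [1e-3 * rng.standard_normal(d) for d in dims]            # \hat{x}^t (uninformative start)
    xh_prev = [np.zeros(d) for d in dims]                         # \hat{x}^{t-1}
    var = [np.full(d, s2) for d, s2 in zip(dims, sigma2s)]        # \hat{sigma}^t
    data_term = [lambda: np.einsum('ijk,j,k->i', Y, xh[1], xh[2]),
                 lambda: np.einsum('ijk,i,k->j', Y, xh[0], xh[2]),
                 lambda: np.einsum('ijk,i,j->k', Y, xh[0], xh[1])]
    for _ in range(n_iter):
        S = [np.sum(x * x) / N for x in xh]                       # N^{-1} sum_j xhat^2            [eq. (12)]
        V = [np.sum(v) / N for v in var]                          # Sigma_alpha                    [eq. (15)]
        D = [np.sum(x * xp) / N for x, xp in zip(xh, xh_prev)]    # N^{-1} sum_k xhat^t xhat^{t-1} [eq. (15)]
        new_xh, new_var = [], []
        for a in range(3):
            b, c = [m for m in range(3) if m != a]
            A = S[b] * S[c] / delta                                       # eq. (12)
            onsager = xh_prev[a] * (V[b] * D[c] + V[c] * D[b])            # cavity (Onsager) correction
            u = (data_term[a]() / N - onsager) / delta                    # eq. (11)
            precision = 1.0 / sigma2s[a] + A                              # Gaussian-prior input function
            xa = (mus[a] / sigma2s[a] + u) / precision                    # eq. (13)
            new_xh.append((1.0 - damping) * xa + damping * xh[a])
            new_var.append(np.full_like(xa, 1.0 / precision))             # eq. (14)
        xh_prev, xh, var = xh, new_xh, new_var
    return xh, var

# Usage with the generative sketch above:
#   Y, xs = spiked_tensor((200, 400, 100), rank=1, delta=0.5, mus=[0.2]*3, sigmas=[1.0]*3)
#   xh, _ = amp_order3_rank1(Y, delta=0.5, mus=[0.2]*3, sigma2s=[1.0]*3)
#   m = [xh[a] @ xs[a][:, 0] / len(xh[a]) for a in range(3)]   # per-mode overlaps, cf. eq. (16) below
```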
2.5 Dynamic mean-field theory

To study the performance of the algorithm defined by eqs. (11)-(14), we use another mean-field approximation that estimates the evolution of the inference error. As before, the mean-field description becomes exact in the thermodynamic limit. We begin by defining order parameters that measure the correlation of the estimators $\hat{x}^t_{\alpha i}$ with the ground-truth values $x_{\alpha i}$ for each mode $\alpha$ of the tensor,

$$M^t_\alpha = \frac{1}{n_\alpha N} \sum_{i=1}^{N_\alpha} \hat{x}^t_{\alpha i}\, x^T_{\alpha i} \;\in \mathbb{R}^{r \times r}. \qquad (16)$$

Technically, the algorithm is permutation invariant, so one should not expect the high correlation values to necessarily appear on the diagonal of $M^t_\alpha$. In the following, we derive an update equation for $M^t_\alpha$, which will describe the performance of the algorithm across iterations.

An important property of Bayes-optimal inference is that there is no statistical difference between functions operating on the ground-truth values or on values sampled uniformly from the posterior distribution. In statistical physics this property is known as one of the Nishimori conditions [31]. These conditions allow us to derive a simple equation for the update of the order parameter (16). For example, from (12) one easily finds that in Bayes-optimal settings $A^t_\alpha = \tilde{M}^t_\alpha$. Furthermore, averaging the expression for $u_{\alpha i}$ over the posterior, we find that [supplemental material]

$$\mathbb{E}_{P(w|Y)}\!\left[ u^t_{\alpha i} \right] = \tilde{M}^t_\alpha\, x_{\alpha i}, \qquad (17)$$

where

$$\tilde{M}^t_\alpha = \frac{1}{\Delta\, n_\alpha} \prod_{\beta \neq \alpha}^{\odot} M^t_\beta. \qquad (18)$$

Similarly, the covariance matrix of $u_{\alpha i}$ under the posterior is

$$\mathrm{COV}_{P(w|Y)}\!\left[ u^t_{\alpha i} \right] = \tilde{M}^t_\alpha. \qquad (19)$$

Finally, using eq. (13) for the estimation of $\hat{x}_{\alpha i}$ and the definition of $M^t_\alpha$ in (16), we find a dynamical equation for the evolution of the order parameters $M^t_\alpha$:

$$M^{t+1}_\alpha = \mathbb{E}_{P_\alpha(x),\, z}\!\left[ f_\alpha\!\left( \tilde{M}^t_\alpha,\; \tilde{M}^t_\alpha x_{\alpha i} + \sqrt{\tilde{M}^t_\alpha}\, z \right) x^T_{\alpha i} \right], \qquad (20)$$

where $f_\alpha \equiv \partial_u \log Z_\alpha(A, u)$ is the estimator of $\hat{x}^{t+1}_{\alpha i}$ from (13). The average in (20) is over the prior $P_\alpha(x)$ and over the standard Gaussian variables $z \in \mathbb{R}^r$. The average over $z$ represents fluctuations around the mean $\tilde{M}^t_\alpha x_{\alpha i}$ in (17), due to the covariance $\tilde{M}^t_\alpha$ in (19). Finally, the performance of the algorithm is given by the fixed point of (20),

$$M^*_\alpha = \mathbb{E}_{P_\alpha(x),\, z}\!\left[ f_\alpha\!\left( \tilde{M}^*_\alpha,\; \tilde{M}^*_\alpha x_{\alpha i} + \sqrt{\tilde{M}^*_\alpha}\, z \right) x^T_{\alpha i} \right], \qquad \text{where } \tilde{M}^*_\alpha = \frac{1}{\Delta\, n_\alpha} \prod_{\beta \neq \alpha}^{\odot} M^*_\beta. \qquad (21)$$

As we will see below, the inference error can be calculated from the fixed-point order parameters $M^*_\alpha$ in a straightforward manner.

3 Phase transitions in generic low-rank tensor decomposition

The dynamics of $M^t_\alpha$ depend on the SNR via the noise level $\Delta$. To study this dependence, we solve equations (20) and (21) with specific priors. Below we present the solution using Gaussian priors. In the supplementary material we also solve for Bernoulli and Gauss-Bernoulli distributions, and discuss mixed cases where each mode of the tensor is sampled from a different prior. Given our choice of scaling in (1), we expect phase transitions at $O(1)$ values of $\Delta$, separating three regimes where inference is: (1) easy at small $\Delta$; (2) hard at intermediate $\Delta$; and (3) impossible at large $\Delta$. For simplicity we focus on the case of rank $r = 1$, where the order parameters $M^t_\alpha$ in (16) become scalars, which we denote $m^t_\alpha$.

3.1 Solution with Gaussian priors

We study the case where the $x_{\alpha i}$ are sampled from normal distributions with mode-dependent mean and variance, $P_\alpha(x) = \mathcal{N}(\mu_\alpha, \sigma^2_\alpha)$. The mean-field update equation (20) can then be written as

$$m^{t+1}_\alpha = \frac{\mu^2_\alpha / \sigma^2_\alpha + (\sigma^2_\alpha + \mu^2_\alpha)\, \tilde{m}^t_\alpha}{\sigma^{-2}_\alpha + \tilde{m}^t_\alpha}, \qquad (22)$$

where $\tilde{m}^t_\alpha \equiv (\Delta\, n_\alpha)^{-1} \prod_{\beta \neq \alpha} m^t_\beta$, as in (18). We define the average inference error across all modes as

$$\mathrm{MSE} = \frac{1}{p} \sum_\alpha \left[ 1 - \frac{m^*_\alpha}{\sigma^2_\alpha}\left(1 + \frac{\mu^2_\alpha}{\sigma^2_\alpha}\right)^{-1} \right], \qquad (23)$$

where $m^*_\alpha$ is the fixed point of eq. (22). Though we focus here on the $r = 1$ case for simplicity, the theory is equally applicable to higher-rank tensors.

Solutions to the theory in (22) and (23) are plotted in Fig. 2.A together with numerical simulations of the algorithm (11)-(14) for order-3 tensors generated randomly according to (2). The theory and simulations match perfectly.

Figure 2: Phase transitions in the inference of arbitrary order-3 rank-1 tensors. (A) MSE at the fixed point plotted against the noise level $\Delta$. The shaded blue area marks the bi-stability regime $\Delta_{\mathrm{alg}} < \Delta < \Delta_{\mathrm{dyn}}$. Solid (dashed) lines show theoretically predicted MSE for random (informed) initializations. Points are obtained from numerical simulations [$\sigma_\alpha = 1$, $\mu_1 = \mu_2 = 0.1$ (blue), $\mu_3 = 0.3$ (orange), $N = 500$, $n_\alpha = 1$]. (B) Contours of $\Delta_{\mathrm{dyn}}$ as a function of the two non-zero means of modes $\alpha = 1, 2$ [$\mu_3 = 0$, $\sigma_\alpha = 1$]. As either of the two non-zero means increases, $\Delta_{\mathrm{dyn}}$ increases with them, reflecting a delay in the onset of the impossible regime as the noise level increases. Importantly, the transition $\Delta_{\mathrm{dyn}}$ is finite even when only one prior has non-zero mean. (C) Same as (B) but for $\Delta_{\mathrm{alg}}$. Again, as either mean increases, $\Delta_{\mathrm{alg}}$ increases also, reflecting an increase in the range of noise levels over which inference is easy. The algorithmic phase transition is finite when at most one prior has zero mean. (D) Lower and upper transition points $\Delta_{\mathrm{alg}}$ (blue) and $\Delta_{\mathrm{dyn}}$ (orange) as a function of tensor shape. The ratios between the mode dimensions are $n_\alpha = \{1, n_x, 1/n_x\}$. The width of the bi-stable, or hard, inference regime is widest at the cubic point where $n_x = 1$.
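The recursion (22) and the error measure (23) are straightforward to iterate numerically. The sketch below (rank r = 1, Gaussian priors; the placement of the shape factors n_alpha follows the reconstruction of eq. (18) above and is an assumption of this sketch) returns the fixed point reached from a given initialization, which is all that is needed to reproduce the random- versus informed-initialization curves of Fig. 2.A.

```python
# A minimal sketch of the dynamic mean-field (state-evolution) recursion (22) and
# the MSE of eq. (23) for rank r = 1 and Gaussian priors.
import numpy as np

def state_evolution(delta, mus, sigma2s, ns, m0, n_iter=2000, tol=1e-12):
    """Iterate eq. (22) from initial overlaps m0 and return the fixed point m*_alpha."""
    mus, sigma2s, ns = map(np.asarray, (mus, sigma2s, ns))
    m = np.asarray(m0, dtype=float).copy()
    for _ in range(n_iter):
        # \tilde{m}_alpha = (delta * n_alpha)^{-1} prod_{beta != alpha} m_beta   [eq. (18)]
        mtilde = np.array([np.prod(np.delete(m, a)) for a in range(len(m))]) / (delta * ns)
        m_new = (mus ** 2 / sigma2s + (sigma2s + mus ** 2) * mtilde) / (1.0 / sigma2s + mtilde)
        if np.max(np.abs(m_new - m)) < tol:
            return m_new
        m = m_new
    return m

def mse(m, mus, sigma2s):
    """Average inference error across modes, eq. (23)."""
    mus, sigma2s, m = map(np.asarray, (mus, sigma2s, m))
    return float(np.mean(1.0 - m / (sigma2s + mus ** 2)))

# Random (uninformative) vs. informed initializations, as in Fig. 2.A
# [sigma_alpha = 1, mu_1 = mu_2 = 0.1, mu_3 = 0.3, cubic shape n_alpha = 1]:
mus, sig2, ns = [0.1, 0.1, 0.3], [1.0] * 3, [1.0] * 3
m_rand = state_evolution(0.5, mus, sig2, ns, m0=[mu ** 2 for mu in mus])
m_inf = state_evolution(0.5, mus, sig2, ns, m0=[s + mu ** 2 for s, mu in zip(sig2, mus)])
print(mse(m_rand, mus, sig2), mse(m_inf, mus, sig2))
```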
The AMP dynamics for general tensor decomposition are qualitatively similar to those of rank-1 symmetric matrix and tensor decompositions, despite the fact that such symmetric objects possess only one mode. As a consequence, the space of order parameters for those problems is only one-dimensional; in contrast, for the general case we consider here it is p-dimensional. Indeed, the p = 3 order parameters are all simultaneously and correctly predicted by our theory.

For low levels of noise, the iterative dynamics converge to a stable fixed point of (22) with low MSE. As the noise increases beyond a bifurcation point $\Delta_{\mathrm{alg}}$, a second stable fixed point emerges, with $m^*_\alpha \ll 1$ and MSE $\approx 1$. Above this point AMP may not converge to the true factors. The basins of attraction of the two stable fixed points are separated by a $(p-1)$-dimensional sub-manifold in the p-dimensional order-parameter space of $m_\alpha$. If the initial values $x^0_{\alpha i}$ have sufficiently high overlap with the true factors $x_{\alpha i}$, then the AMP dynamics will converge to the low-error fixed point; we refer to this as the informative initialization, as it requires prior knowledge about the true structure. For uninformative initializations, the dynamics will converge to the high-error fixed point almost surely in the thermodynamic limit. At a higher level of noise, $\Delta_{\mathrm{dyn}}$, another pitchfork bifurcation occurs and the high-error fixed point becomes the only stable point. With noise levels above $\Delta_{\mathrm{dyn}}$, the dynamic mean-field equations will always converge to a high-error fixed point. In this regime AMP cannot overcome the high noise and inference is impossible.

From eq. (22), it can easily be checked that if the prior means $\mu_\alpha = 0\ \forall\alpha$, then the high-error fixed point with $m_\alpha = 0$ is stable for any finite $\Delta$. This implies that $\Delta_{\mathrm{alg}} = 0$, and there is no easy regime for inference, so AMP with uninformed initialization will never find the true solution. This difficulty was previously noted for the low-rank decomposition of symmetric tensors [21], and it was further shown there that the prior mean must be non-zero for the existence of an easy inference regime. However, for general tensors there is higher flexibility; one mode $\alpha$ can have a zero mean without destroying the existence of an easy regime. To show this we solved (22) with different prior means for different modes, and we plot the phase boundaries in Fig. 2.B-C. For the p = 3 case, $\Delta_{\mathrm{dyn}}$ is finite even if two of the priors have zero mean. Interestingly, the algorithmic transition $\Delta_{\mathrm{alg}}$ is finite if at most one prior has zero mean. Thus, the general tensor decomposition case is qualitatively different from the symmetric case, in that an easy regime can exist even when a tensor mode has zero mean.
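The phase boundaries discussed above can be traced numerically by scanning the noise level with the state-evolution sketch that follows eq. (23): Delta_alg is estimated as the largest noise at which the uninformative initialization still reaches the low-MSE branch, and Delta_dyn as the largest noise at which the informed initialization does. The sketch below reuses state_evolution and mse from that earlier block; the MSE threshold of 0.5 separating the two branches is an arbitrary choice of ours.

```python
# A minimal sketch of locating the two transitions by scanning the noise level,
# reusing state_evolution and mse from the previous sketch.
import numpy as np

def transition_points(mus, sigma2s, ns, deltas, threshold=0.5):
    """Estimate (Delta_alg, Delta_dyn) on an increasing grid of noise levels `deltas`."""
    mus, sigma2s = np.asarray(mus), np.asarray(sigma2s)
    delta_alg, delta_dyn = 0.0, 0.0
    for delta in deltas:
        m_rand = state_evolution(delta, mus, sigma2s, ns, m0=mus ** 2 + 1e-6)    # uninformative start
        m_inf = state_evolution(delta, mus, sigma2s, ns, m0=sigma2s + mus ** 2)  # informed start
        if mse(m_rand, mus, sigma2s) < threshold:
            delta_alg = delta      # AMP from a random start still succeeds here
        if mse(m_inf, mus, sigma2s) < threshold:
            delta_dyn = delta      # recovery is still possible from an informed start
    return delta_alg, delta_dyn

# Example: trace the dependence on the prior means, as in Fig. 2.B-C (cubic shape, one zero mean).
deltas = np.linspace(0.05, 2.0, 80)
print(transition_points(mus=[0.2, 0.2, 0.0], sigma2s=[1.0] * 3, ns=[1.0] * 3, deltas=deltas))
```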
3.2 Non-cubic tensors

The shape of the tensor, defined by the different mode dimensions $n_\alpha$, has an interesting effect on the phase-transition boundaries, which can be studied using (22). In Figure 2.D the two transitions, $\Delta_{\mathrm{alg}}$ and $\Delta_{\mathrm{dyn}}$, are plotted as a function of the shape of the tensor. Over the space of all possible tensor shapes, the boundary between the hard and impossible regimes, $\Delta_{\mathrm{dyn}}$, is maximized, or pushed furthest to the right in Fig. 2.D, when the shape takes the special cubic form where all dimensions are equal, $n_\alpha = 1\ \forall\alpha$. This diminished size of the impossible regime at the cubic point can be understood by noting that the cubic tensor has the highest ratio between the number of observed data points, $N^p$, and the number of unknowns, $rN\sum_\alpha n_\alpha$ (at fixed $N$, $\sum_\alpha n_\alpha$ is minimized when all $n_\alpha = 1$).

Interestingly, the algorithmic transition is lowest at this point. This means that although the ratio of observations to unknowns is the highest, algorithms may not converge, as the width of the hard regime is maximized. To explain this observation, we note that in (18) the noise enters each mode only through a mode-dependent effective level obtained by rescaling $\Delta$ with the shape factor $n_\alpha$. It follows that for non-cubic tensors the worst-case effective noise across modes will necessarily be higher than in the cubic case. As a consequence, moving from cubic to non-cubic tensors lowers the minimum noise level $\Delta_{\mathrm{alg}}$ at which the uninformative solution is stable, thereby extending the hard regime to the left in Fig. 2.D.

4 Bayesian AMP compared to maximum a-posteriori (MAP) methods

We now compare the performance of AMP to one of the most commonly used algorithms in practice, namely alternating least squares (ALS) [4]. ALS is motivated by the observation that optimizing one mode while holding the rest fixed is a simple least-squares subproblem [6, 5]. Typically, ALS performs well at low noise levels, but here we explore how well it compares to AMP at high noise levels, in the scaling regime defined by (1) and (2), where inference can be non-trivial.

In Fig. 3 we compare the performance of ALS with that of AMP on the same underlying large ($N = 500$) tensors with varying amounts of noise. First, we note that ALS does not exhibit a sharp phase transition, but rather a smooth cross-over between solvable and unsolvable regimes. Second, the robustness of ALS to noise is much lower than that of AMP. This difference becomes more substantial as the size of the tensors, $N$, is increased [data not shown].

Figure 3: Comparing AMP and ALS. Left: the percentage out of 50 simulations that converged to the low-error solution as a function of the noise $\Delta$. Right: MSE averaged over the 50 simulations. In both panels, the vertical dashed line is the theoretical prediction of the algorithmic transition point $\Delta_{\mathrm{alg}}$ [$p = 3$, $r = 1$, $\sigma_\alpha = 1$, $\mu_\alpha = 0.2$, $n_\alpha = \{1, 8, 1/8\}$, $N = 500$].

One can understand the difference in performance by noting that ALS is like a MAP estimator, while Bayesian AMP attempts to find the minimum mean squared error (MMSE) solution. AMP does so by marginalizing probabilities at every node. Thus AMP is expected to produce better inferences when the posterior distribution is rough and dominated by noise. From a statistical physics perspective, ALS is a zero-temperature method, and so it is subject to replica symmetry breaking. AMP, on the other hand, is Bayes-optimal and thus operates at the Nishimori temperature [31]. At this temperature the system does not exhibit replica symmetry breaking, and the true global ground state can be found in the easy regime, when $\Delta < \Delta_{\mathrm{alg}}$.
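For reference, a minimal rank-1, order-3 ALS iteration is sketched below (not the exact protocol behind Fig. 3, whose initialization and convergence criteria we do not reproduce here). Each step solves in closed form the least-squares problem for one mode with the other two held fixed, and no prior information is used.

```python
# A minimal sketch of rank-1 alternating least squares for an order-3 tensor,
# for comparison with AMP. Each update is the closed-form least-squares solution
# for one mode given the other two.
import numpy as np

def als_order3_rank1(Y, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    a, b, c = (rng.standard_normal(d) for d in Y.shape)
    for _ in range(n_iter):
        a = np.einsum('ijk,j,k->i', Y, b, c) / ((b @ b) * (c @ c))
        b = np.einsum('ijk,i,k->j', Y, a, c) / ((a @ a) * (c @ c))
        c = np.einsum('ijk,i,j->k', Y, a, b) / ((a @ a) * (b @ b))
    return a, b, c

# The recovered factors estimate the rank-1 component of Y only up to the
# N^{-(p-1)/2} scaling in eq. (1) and the usual scale ambiguity between modes,
# so comparisons with the ground truth are made after normalizing each factor.
```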
In summary, our work partially bridges the gap between theory and practice by creating new AMP algorithms that can flexibly assign different priors to different modes of a high-order tensor, thereby enabling AMP to handle the arbitrarily shaped high-order tensors that actually occur in the wild. Moreover, our theoretical analysis reveals interesting new phenomena governing how irregular tensor shapes can strongly affect inference performance and the positions of phase boundaries, and highlights the special, non-generic properties of cubic tensors. Finally, we hope the superior performance of our flexible AMP algorithms relative to ALS will promote the adoption of AMP in the wild. Code to reproduce all simulations presented in this paper is available at https://github.com/ganguli-lab/tensorAMP.

Acknowledgments

We thank Alex Williams for useful discussions. We thank the Center for Theory of Deep Learning at the Hebrew University (J.K.), and the Burroughs-Wellcome, McKnight, James S. McDonnell, and Simons Foundations, and the Office of Naval Research and the National Institutes of Health (S.G.) for support.

References

[1] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773-2832, 2014.

[2] Nicholas D Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E Papalexakis, and Christos Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551-3582, 2017.

[3] Alex H. Williams, Tony Hyun Kim, Forea Wang, Saurabh Vyas, Stephen I. Ryu, Krishna V. Shenoy, Mark Schnitzer, Tamara G. Kolda, and Surya Ganguli. Unsupervised discovery of demixed, low-dimensional neural dynamics across multiple timescales through tensor component analysis. Neuron, 98(6):1099-1115.e8, 2018.

[4] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.

[5] J Douglas Carroll and Jih-Jie Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart-Young decomposition. Psychometrika, 35(3):283-319, 1970.

[6] Richard A Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics, 1970.

[7] Marc Mézard. The space of interactions in neural networks: Gardner's computation with the cavity method. Journal of Physics A: Mathematical and General, 22(12):2181, 1989.

[8] Yoshiyuki Kabashima and Shinsuke Uda. A BP-based algorithm for performing Bayesian inference in large perceptron-type networks. In International Conference on Algorithmic Learning Theory, pages 479-493. Springer, 2004.

[9] Madhu Advani, Subhaneil Lahiri, and Surya Ganguli. Statistical mechanics of complex neural systems and high dimensional data. Journal of Statistical Mechanics: Theory and Experiment, 2013(03):P03014, 2013.

[10] David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914-18919, 2009.

[11] Yoshiyuki Kabashima, Tadashi Wadayama, and Toshiyuki Tanaka. A typical reconstruction limit for compressed sensing based on lp-norm minimization. Journal of Statistical Mechanics: Theory and Experiment, 2009(09):L09003, 2009.

[12] Sundeep Rangan, Vivek Goyal, and Alyson K Fletcher. Asymptotic analysis of MAP estimation via the replica method and compressed sensing. In Advances in Neural Information Processing Systems, pages 1545-1553, 2009.

[13] Surya Ganguli and Haim Sompolinsky. Statistical mechanics of compressed sensing. Physical Review Letters, 104(18):188701, 2010.

[14] Surya Ganguli and Haim Sompolinsky. Short-term memory in neuronal networks through dynamical compressed sensing. In Advances in Neural Information Processing Systems, pages 667-675, 2010.

[15] Madhu Advani and Surya Ganguli. Statistical mechanics of optimal convex inference in high dimensions. Physical Review X, 6(3):031034, 2016.
[16] Madhu Advani and Surya Ganguli. An equivalence between high dimensional Bayes optimal inference and M-estimation. In Advances in Neural Information Processing Systems, pages 3378-3386, 2016.

[17] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764-785, 2011.

[18] Sundeep Rangan and Alyson K Fletcher. Iterative estimation of constrained rank-one matrices in noise. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pages 1246-1250. IEEE, 2012.

[19] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. MMSE of probabilistic low-rank matrix estimation: Universality with respect to the output channel. In Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, pages 680-687. IEEE, 2015.

[20] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. Constrained low-rank matrix estimation: Phase transitions, approximate message passing and applications. Journal of Statistical Mechanics: Theory and Experiment, 2017(7):073403, 2017.

[21] Thibault Lesieur, Léo Miolane, Marc Lelarge, Florent Krzakala, and Lenka Zdeborová. Statistical and computational phase transitions in spiked tensor estimation. In Information Theory (ISIT), 2017 IEEE International Symposium on, pages 511-515. IEEE, 2017.

[22] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. Phase transitions in sparse PCA. In Information Theory (ISIT), 2015 IEEE International Symposium on, pages 1635-1639. IEEE, 2015.

[23] Jean Barbier, Nicolas Macris, and Léo Miolane. The layered structure of tensor estimation and its mutual information. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1056-1063, 2017.

[24] Evrim Acar, Canan Aykut-Bingol, Haluk Bingol, Rasmus Bro, and Bülent Yener. Multiway analysis of epilepsy tensors. Bioinformatics, 23(13):i10-i18, 2007.

[25] Borbála Hunyadi, Patrick Dupont, Wim Van Paesschen, and Sabine Van Huffel. Tensor decompositions and data fusion in epileptic electroencephalography and functional magnetic resonance imaging data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(1), 2017.

[26] Jeffrey S Seely, Matthew T Kaufman, Stephen I Ryu, Krishna V Shenoy, John P Cunningham, and Mark M Churchland. Tensor analysis reveals distinct population structure that parallels the different computational roles of areas M1 and V1. PLoS Computational Biology, 12(11):e1005164, 2016.

[27] Neil C Rabinowitz, Robbe L Goris, Marlene Cohen, and Eero P Simoncelli. Attention stabilizes the shared gain of V4 populations. eLife, 4, 2015.

[28] David J Thouless, Philip W Anderson, and Robert G Palmer. Solution of 'solvable model of a spin glass'. Philosophical Magazine, 35(3):593-601, 1977.

[29] A Crisanti and H-J Sommers. Thouless-Anderson-Palmer approach to the spherical p-spin spin glass model. Journal de Physique I, 5(7):805-813, 1995.

[30] Andrea Crisanti and H-J Sommers. The spherical p-spin interaction spin glass model: the statics. Zeitschrift für Physik B Condensed Matter, 87(3):341-354, 1992.

[31] Hidetoshi Nishimori. Statistical Physics of Spin Glasses and Information Processing: An Introduction, volume 111. Clarendon Press, 2001.

[32] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1-38, 1977.
[33] Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: Thresholds and algorithms. Advances in Physics, 65(5):453-552, 2016.

[34] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. In Advances in Neural Information Processing Systems, pages 2897-2905, 2014.

[35] Jonathan S Yedidia, William T Freeman, and Yair Weiss. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. Advances in Neural Information Processing Systems, 13, 2001.

[36] Marc Mézard and Andrea Montanari. Information, Physics, and Computation. Oxford University Press, 2009.

[37] Hans A Bethe. Statistical theory of superlattices. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 150(871):552-575, 1935.

[38] Marc Mézard and Giorgio Parisi. Replicas and optimization. Journal de Physique Lettres, 46(17):771-778, 1985.