# Tensor Balancing on Statistical Manifold

Mahito Sugiyama 1 2, Hiroyuki Nakahara 3, Koji Tsuda 4 5 6

We solve tensor balancing, rescaling an $N$th order nonnegative tensor by multiplying $N$ tensors of order $N-1$ so that every fiber sums to one. This generalizes a fundamental process of matrix balancing used to compare matrices in a wide range of applications from biology to economics. We present an efficient balancing algorithm with quadratic convergence using Newton's method and show in numerical experiments that the proposed algorithm is several orders of magnitude faster than existing ones. To theoretically prove the correctness of the algorithm, we model tensors as probability distributions in a statistical manifold and realize tensor balancing as projection onto a submanifold. The key to our algorithm is that the gradient of the manifold, used as a Jacobian matrix in Newton's method, can be analytically obtained using the Möbius inversion formula, the heart of combinatorial mathematics. Our model is not limited to tensor balancing but has wide applicability, as it includes various statistical and machine learning models such as weighted DAGs and Boltzmann machines.

1. Introduction

Matrix balancing is the problem of rescaling a given square nonnegative matrix $A \in \mathbb{R}^{n \times n}_{\geq 0}$ to a doubly stochastic matrix $RAS$, where every row and column sums to one, by multiplying two diagonal matrices $R$ and $S$. This is a fundamental process for analyzing and comparing matrices in a wide range of applications, including input-output analysis in economics, called the RAS approach (Parikh, 1979; Miller & Blair, 2009; Lahr & de Mesnard, 2004), seat assignments in elections (Balinski, 2008; Akartunalı & Knight, 2016), Hi-C data analysis (Rao et al., 2014; Wu & Michor, 2016), the Sudoku puzzle (Moon et al., 2009), and the optimal transportation problem (Cuturi, 2013; Frogner et al., 2015; Solomon et al., 2015). An excellent review of this theory and its applications is given by Idel (2016).

1 National Institute of Informatics 2 JST PRESTO 3 RIKEN Brain Science Institute 4 Graduate School of Frontier Sciences, The University of Tokyo 5 RIKEN AIP 6 NIMS. Correspondence to: Mahito Sugiyama.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Figure 1. Overview of our approach: a given tensor $A$ is modeled as a probability distribution $P$ in the statistical manifold $\mathcal{S}$ (a dually flat Riemannian manifold); tensor balancing is realized as projection of $P$ onto the submanifold $\mathcal{S}(\beta)$ of multistochastic tensors, in which every fiber sums to 1, and the projected distribution $P_\beta$ gives the balanced tensor.

The standard matrix balancing algorithm is the Sinkhorn-Knopp algorithm (Sinkhorn, 1964; Sinkhorn & Knopp, 1967; Marshall & Olkin, 1968; Knight, 2008), a special case of Bregman's balancing method (Lamond & Stewart, 1981), which iterates rescaling of each row and column until convergence. The algorithm is widely used in the above applications due to its simple implementation and theoretically guaranteed convergence. However, the algorithm converges only linearly (Soules, 1991), which is prohibitively slow for recently emerging large and sparse matrices. Although Livne & Golub (2004) and Knight & Ruiz (2013) tried to achieve faster convergence by approximating each step of Newton's method, the exact Newton's method with quadratic convergence has not been intensively studied yet.
Another open problem is tensor balancing, a generalization of balancing from matrices to higher-order multidimensional arrays, or tensors. The task is to rescale an $N$th order nonnegative tensor to a multistochastic tensor, in which every fiber sums to one, by multiplying $N$ tensors of order $N-1$. There are some results about mathematical properties of multistochastic tensors (Cui et al., 2014; Chang et al., 2016; Ahmed et al., 2003). However, no tensor balancing algorithm with guaranteed convergence that transforms a given tensor into a multistochastic tensor has been known until now.

Here we show that Newton's method with quadratic convergence can be applied to tensor balancing while avoiding solving a linear system on the full tensor. Our strategy is to realize matrix and tensor balancing as projection onto a dually flat Riemannian submanifold (Figure 1), which is a statistical manifold and known to be the essential structure for probability distributions in information geometry (Amari, 2016). Using a partially ordered outcome space, we generalize the log-linear model (Agresti, 2012) used to model higher-order combinations of binary variables (Amari, 2001; Ganmor et al., 2011; Nakahara & Amari, 2002; Nakahara et al., 2003), which allows us to model tensors as probability distributions in the statistical manifold. The remarkable property of our model is that the gradient of the manifold can be analytically computed using the Möbius inversion formula (Rota, 1964), the heart of combinatorial mathematics (Ito, 1993), which enables us to directly obtain the Jacobian matrix in Newton's method. Moreover, we show that $(n-1)^N$ of the $n^N$ entries of a tensor are invariant with respect to one of the two coordinate systems of the statistical manifold. Thus the number of equations in Newton's method is $O(n^{N-1})$.

The remainder of this paper is organized as follows: We begin with a low-level description of our matrix balancing algorithm in Section 2 and demonstrate its efficiency in numerical experiments in Section 3. To guarantee the correctness of the algorithm and extend it to tensor balancing, we provide theoretical analysis in Section 4. In Section 4.1, we introduce a generalized log-linear model associated with a partially ordered outcome space, followed by introducing the dually flat Riemannian structure in Section 4.2. In Section 4.3, we show how to use Newton's method to compute the projection of a probability distribution onto a submanifold. Finally, we formulate the matrix and tensor balancing problems in Section 5 and summarize our contributions in Section 6.

2. The Matrix Balancing Algorithm

Given a nonnegative square matrix $A = (a_{ij}) \in \mathbb{R}^{n \times n}_{\geq 0}$, the task of matrix balancing is to find $r, s \in \mathbb{R}^n$ that satisfy

\[ (RAS)\mathbf{1} = \mathbf{1}, \qquad (RAS)^{\top}\mathbf{1} = \mathbf{1}, \tag{1} \]

where $R = \mathrm{diag}(r)$ and $S = \mathrm{diag}(s)$. The balanced matrix $A' = RAS$ is called doubly stochastic, in which each entry $a'_{ij} = a_{ij} r_i s_j$ and all the rows and columns sum to one. The most popular algorithm is the Sinkhorn-Knopp algorithm, which repeats updating $r$ and $s$ as $r = 1/(As)$ and $s = 1/(A^{\top} r)$. We write $[n] = \{1, 2, \dots, n\}$ hereafter.

In our algorithm, instead of directly updating $r$ and $s$, we update two parameters $\theta$ and $\eta$ defined as

\[ \log p_{ij} = \sum_{i' \leq i} \sum_{j' \leq j} \theta_{i'j'}, \qquad \eta_{ij} = \sum_{i' \geq i} \sum_{j' \geq j} p_{i'j'} \tag{2} \]

for each $i, j \in [n]$, where we normalize entries as $p_{ij} = a_{ij} / \sum_{ij} a_{ij}$ so that $\sum_{ij} p_{ij} = 1$. We assume for simplicity that each entry is strictly larger than zero; this assumption will be removed in Section 5.

Figure 2. Matrix balancing with two parameters $\theta$ and $\eta$: for an $n \times n$ matrix $(a_{ij})$, the balancing constraints are placed on the first row $\eta_{1j}$ and the first column $\eta_{i1}$, while $\theta_{ij}$ with $i, j \neq 1$ is kept fixed.
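To make the coordinate transformations in Equation (2) concrete, here is a minimal C++/Eigen sketch (our own illustration with hypothetical function names, assuming a strictly positive matrix and 0-based indices; it is not the authors' released code): $\eta$ is obtained by suffix sums, $\theta$ by Möbius inversion of $\log p$, which on this grid poset reduces to a second-order finite difference, and the inverse map by prefix sums.

```cpp
#include <Eigen/Dense>

using Eigen::MatrixXd;

// eta(i, j) = sum of p(i', j') over all i' >= i and j' >= j (Equation (2)).
MatrixXd eta_from_p(const MatrixXd& p) {
  MatrixXd eta = p;
  for (int i = static_cast<int>(p.rows()) - 2; i >= 0; --i) eta.row(i) += eta.row(i + 1);
  for (int j = static_cast<int>(p.cols()) - 2; j >= 0; --j) eta.col(j) += eta.col(j + 1);
  return eta;
}

// theta by Moebius inversion of log p on the grid poset, i.e., the
// second-order finite difference: theta(i, j) =
// log p(i, j) - log p(i-1, j) - log p(i, j-1) + log p(i-1, j-1),
// with out-of-range terms dropped.
MatrixXd theta_from_p(const MatrixXd& p) {
  const MatrixXd L = p.array().log().matrix();
  MatrixXd th = L;
  for (int i = 0; i < L.rows(); ++i)
    for (int j = 0; j < L.cols(); ++j) {
      if (i > 0)          th(i, j) -= L(i - 1, j);
      if (j > 0)          th(i, j) -= L(i, j - 1);
      if (i > 0 && j > 0) th(i, j) += L(i - 1, j - 1);
    }
  return th;
}

// Inverse map: log p(i, j) = sum of theta(i', j') over i' <= i and j' <= j.
MatrixXd p_from_theta(const MatrixXd& th) {
  MatrixXd L = th;
  for (int i = 1; i < L.rows(); ++i) L.row(i) += L.row(i - 1);
  for (int j = 1; j < L.cols(); ++j) L.col(j) += L.col(j - 1);
  return L.array().exp().matrix();
}
```

Round-tripping p_from_theta(theta_from_p(p)) returns p, since the prefix sums invert the finite difference.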
The key to our approach is that we update $\theta^{(t)}_{ij}$ with $i = 1$ or $j = 1$ by Newton's method at each iteration $t = 1, 2, \dots$ while fixing $\theta_{ij}$ with $i, j \neq 1$, so that $\eta^{(t)}_{ij}$ satisfies the following condition (Figure 2):

\[ \eta^{(t)}_{i1} = \frac{n - i + 1}{n}, \qquad \eta^{(t)}_{1j} = \frac{n - j + 1}{n}. \]

Note that the rows and columns sum not to $1$ but to $1/n$ due to the normalization. The update formula is described as

\[ \begin{pmatrix} \theta^{(t+1)}_{11} \\ \vdots \\ \theta^{(t+1)}_{1n} \\ \theta^{(t+1)}_{21} \\ \vdots \\ \theta^{(t+1)}_{n1} \end{pmatrix} = \begin{pmatrix} \theta^{(t)}_{11} \\ \vdots \\ \theta^{(t)}_{1n} \\ \theta^{(t)}_{21} \\ \vdots \\ \theta^{(t)}_{n1} \end{pmatrix} - J^{-1} \begin{pmatrix} \eta^{(t)}_{11} - (n - 1 + 1)/n \\ \vdots \\ \eta^{(t)}_{1n} - (n - n + 1)/n \\ \eta^{(t)}_{21} - (n - 2 + 1)/n \\ \vdots \\ \eta^{(t)}_{n1} - (n - n + 1)/n \end{pmatrix}, \tag{3} \]

where $J$ is the Jacobian matrix given as

\[ J_{(ij)(i'j')} = \frac{\partial \eta^{(t)}_{ij}}{\partial \theta^{(t)}_{i'j'}} = \eta^{(t)}_{\max\{i,i'\} \max\{j,j'\}} - n^2 \eta^{(t)}_{ij} \eta^{(t)}_{i'j'}, \tag{4} \]

which is derived from our theoretical result in Theorem 3. Since $J$ is a $(2n - 1) \times (2n - 1)$ matrix, the time complexity of each update is $O(n^3)$, which is needed to compute the inverse of $J$. After updating to $\theta^{(t+1)}_{ij}$, we can compute $p^{(t+1)}_{ij}$ and $\eta^{(t+1)}_{ij}$ by Equation (2). Since this update does not ensure the condition $\sum_{ij} p^{(t+1)}_{ij} = 1$, we again update $\theta^{(t+1)}_{11}$ as

\[ \theta^{(t+1)}_{11} \leftarrow \theta^{(t+1)}_{11} - \log \sum_{ij} p^{(t+1)}_{ij} \]

and recompute $p^{(t+1)}_{ij}$ and $\eta^{(t+1)}_{ij}$ for each $i, j \in [n]$. By iterating the above update process in Equation (3) until convergence, $A' = (a'_{ij})$ with $a'_{ij} = n p_{ij}$ becomes doubly stochastic.
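The step above can then be sketched as follows; this is our own illustration under stated assumptions (strictly positive input, 0-based indices, and the helpers eta_from_p, theta_from_p, and p_from_theta from the sketch in Section 2), not the authors' released implementation linked in the footnote below.

```cpp
#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// One Newton update (Equation (3)) over the 2n - 1 parameters theta_{1j} and
// theta_{i1}. Targets are eta_{i1} = (n-i+1)/n and eta_{1j} = (n-j+1)/n
// (1-based), i.e., (n - max(i, j))/n here because one 0-based index is 0.
void newton_step(MatrixXd& p) {
  const int n = static_cast<int>(p.rows()), m = 2 * n - 1;
  MatrixXd th = theta_from_p(p);
  const MatrixXd eta = eta_from_p(p);

  std::vector<std::pair<int, int>> idx;  // parameter order as in Equation (3)
  for (int j = 0; j < n; ++j) idx.emplace_back(0, j);
  for (int i = 1; i < n; ++i) idx.emplace_back(i, 0);

  VectorXd residual(m);
  MatrixXd J(m, m);
  for (int a = 0; a < m; ++a) {
    const int i = idx[a].first, j = idx[a].second;
    residual(a) = eta(i, j) - static_cast<double>(n - std::max(i, j)) / n;
    for (int b = 0; b < m; ++b) {
      const int k = idx[b].first, l = idx[b].second;
      // Jacobian entry from Equation (4).
      J(a, b) = eta(std::max(i, k), std::max(j, l))
                - static_cast<double>(n) * n * eta(i, j) * eta(k, l);
    }
  }

  const VectorXd delta = J.partialPivLu().solve(residual);
  for (int a = 0; a < m; ++a) th(idx[a].first, idx[a].second) -= delta(a);

  // Renormalize through theta_{11} so that the entries sum to one again.
  p = p_from_theta(th);
  th(0, 0) -= std::log(p.sum());
  p = p_from_theta(th);
}
```

Iterating newton_step until the residual norm drops below the tolerance and returning $a'_{ij} = n p_{ij}$ yields the doubly stochastic matrix.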
3. Numerical Experiments

We evaluate the efficiency of our algorithm against two prominent balancing methods: the standard Sinkhorn-Knopp algorithm (Sinkhorn, 1964) and the state-of-the-art algorithm BNEWT (Knight & Ruiz, 2013), which uses Newton's-method-like iterations with conjugate gradients. All experiments were conducted on Amazon Linux AMI release 2016.09 with a single core of a 2.3 GHz Intel Xeon CPU E5-2686 v4 and 256 GB of memory. All methods were implemented in C++ with the Eigen library and compiled with gcc 4.8.3.¹ We carefully implemented BNEWT by directly translating the MATLAB code provided in (Knight & Ruiz, 2013) into C++ with the Eigen library for fair comparison, and used the default parameters. We measured the residual of a matrix $A' = (a'_{ij})$ by the squared norm $\lVert (A'\mathbf{1} - \mathbf{1},\, A'^{\top}\mathbf{1} - \mathbf{1}) \rVert^2$, where each entry $a'_{ij}$ is obtained as $n p_{ij}$ in our algorithm, and ran each of the three algorithms until the residual fell below the tolerance threshold $10^{-6}$.

¹ An implementation of algorithms for matrices and third order tensors is available at: https://github.com/mahito-sugiyama/newton-balancing

Hessenberg Matrix. The first set of experiments used a Hessenberg matrix, which has been a standard benchmark for matrix balancing (Parlett & Landis, 1982; Knight & Ruiz, 2013). Each entry of an $n \times n$ Hessenberg matrix $H_n = (h_{ij})$ is given as $h_{ij} = 0$ if $j < i - 1$ and $h_{ij} = 1$ otherwise. We varied the size $n$ from 10 to 5,000 and measured the running time (in seconds) and the number of iterations of each method.

Figure 3. Results on Hessenberg matrices (running time in seconds and number of iterations versus $n$). The BNEWT algorithm (green) failed to converge for $n \geq 200$.

Figure 4. Convergence graph on $H_{20}$ (residual versus number of iterations for Sinkhorn and Newton).

Results are plotted in Figure 3. Our balancing algorithm with Newton's method (plotted in blue in the figures) is clearly the fastest: It is three to five orders of magnitude faster than the standard Sinkhorn-Knopp algorithm (plotted in red). Although the BNEWT algorithm (plotted in green) is competitive if $n$ is small, it suddenly fails to converge whenever $n \geq 200$, which is consistent with results in the original paper (Knight & Ruiz, 2013), where there is no result for the setting $n \geq 200$ on the same matrix. Moreover, our method converges in around 10 to 20 steps, which is about three and seven orders of magnitude fewer than BNEWT and Sinkhorn-Knopp, respectively, at $n = 100$. To see the behavior of the rate of convergence in detail, we plot the convergence graph in Figure 4 for $n = 20$, where we observe the slow convergence rate of the Sinkhorn-Knopp algorithm and the unstable convergence of the BNEWT algorithm, which contrasts with our quick convergence.

Trefethen Matrix. Next, we collected a set of Trefethen matrices from a collection website², which are sparse nonnegative matrices with prime numbers on the diagonal. Results are plotted in Figure 5, where we observe the same trend as before: Our algorithm is the fastest, about four orders of magnitude faster than the Sinkhorn-Knopp algorithm. Note that larger matrices with $n > 300$ do not have total support, which is a necessary condition for matrix balancing (Knight & Ruiz, 2013), while the BNEWT algorithm fails to converge if $n = 200$ or $n = 300$.

Figure 5. Results on Trefethen matrices (running time in seconds and number of iterations). The BNEWT algorithm (green) failed to converge for $n \geq 200$.

² http://www.cise.ufl.edu/research/sparse/matrices/

4. Theoretical Analysis

In the following, we provide theoretical support for our algorithm by formulating the problem as projection within a statistical manifold, in which a matrix corresponds to an element, that is, a probability distribution, in the manifold. We show that a balanced matrix forms a submanifold and that matrix balancing is projection of a given distribution onto the submanifold, where the Jacobian matrix in Equation (4) is derived from the gradient of the manifold.

4.1. Formulation

We introduce our log-linear probabilistic model, where the outcome space is a partially ordered set, or a poset (Gierz et al., 2003). We prepare basic notation and the key mathematical tool for posets, the Möbius inversion formula, followed by formulating the log-linear model.

4.1.1. Möbius Inversion

A poset $(S, \leq)$, a set of elements $S$ together with a partial order $\leq$ on $S$, is a fundamental structured space in computer science. A partial order is a relation between elements in $S$ that satisfies the following three properties: for all $x, y, z \in S$, (1) $x \leq x$ (reflexivity), (2) $x \leq y$ and $y \leq x$ imply $x = y$ (antisymmetry), and (3) $x \leq y$ and $y \leq z$ imply $x \leq z$ (transitivity). In what follows, $S$ is always finite and includes the least element (bottom) $\bot \in S$; that is, $\bot \leq x$ for all $x \in S$. We denote $S \setminus \{\bot\}$ by $S^+$.

Rota (1964) introduced the Möbius inversion formula on posets by generalizing the inclusion-exclusion principle. Let $\zeta \colon S \times S \to \{0, 1\}$ be the zeta function defined as

\[ \zeta(s, x) = \begin{cases} 1 & \text{if } s \leq x, \\ 0 & \text{otherwise.} \end{cases} \]

The Möbius function $\mu \colon S \times S \to \mathbb{Z}$ satisfies $\zeta \mu = I$ and is inductively defined as

\[ \mu(x, y) = \begin{cases} 1 & \text{if } x = y, \\ -\sum_{x \leq s < y} \mu(x, s) & \text{if } x < y, \\ 0 & \text{otherwise.} \end{cases} \]
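As a concrete illustration of this inductive definition (our own sketch; the function name moebius and the encoding of $\leq$ as a boolean matrix are ours), the following self-contained program computes $\mu$ directly from the zeta function and, for the Boolean lattice on two elements, prints $\mu(\bot, \top) = 1$, reproducing the inclusion-exclusion signs.

```cpp
#include <cstdio>
#include <vector>

// Moebius function of a finite poset, computed from the inductive definition:
// mu(x, x) = 1 and mu(x, y) = -sum over all s with x <= s < y of mu(x, s).
// leq[a][b] encodes the zeta function, i.e., whether a <= b.
int moebius(int x, int y, const std::vector<std::vector<bool>>& leq) {
  if (!leq[x][y]) return 0;  // mu vanishes when x is not below y
  if (x == y) return 1;
  int sum = 0;
  for (int s = 0; s < static_cast<int>(leq.size()); ++s)
    if (leq[x][s] && leq[s][y] && s != y)  // all s with x <= s < y
      sum -= moebius(x, s, leq);
  return sum;
}

int main() {
  // Boolean lattice on two points: 0 = bottom, 1 = {a}, 2 = {b}, 3 = top.
  const std::vector<std::vector<bool>> leq = {
      {1, 1, 1, 1}, {0, 1, 0, 1}, {0, 0, 1, 1}, {0, 0, 0, 1}};
  std::printf("mu(bottom, top) = %d\n", moebius(0, 3, leq));  // prints 1
}
```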
For example, a Boltzmann machine with a set of vertices $V$ and a set of edges $E$ corresponds to the submanifold $\mathcal{S}_B = \{ P \in \mathcal{S} \mid \theta_P(x) = 0 \text{ if } |x| > 2 \text{ or } x \notin E \}$, with $S = 2^V$. Let $\hat{P}$ be an empirical distribution estimated from a given dataset. The learned model is the $m$-projection of the empirical distribution $\hat{P}$ onto $\mathcal{S}_B$, where the resulting distribution $P_\beta$ is given as

\[ \begin{cases} \theta_{P_\beta}(x) = 0 & \text{if } |x| > 2 \text{ or } x \notin E, \\ \eta_{P_\beta}(x) = \eta_{\hat{P}}(x) & \text{if } |x| = 1 \text{ or } x \in E. \end{cases} \]

4.3.2. Computation

Here we show how to compute the projection of a given probability distribution. Newton's method can be used to efficiently compute the projected distribution $P_\beta$ by iteratively updating from $P^{(0)}_\beta = P$, producing $P^{(1)}_\beta, P^{(2)}_\beta, \dots$ until convergence to $P_\beta$.

Let us start with the $m$-projection, initializing $P^{(0)}_\beta = P$. In each iteration $t$, we update $\eta^{(t)}_{P_\beta}(x)$ for all $x \in \mathrm{dom}(\beta)$ so that $\theta^{(t)}_{P_\beta}(x)$ approaches $\beta(x)$, while fixing $\eta^{(t)}_{P_\beta}(x) = \eta_P(x)$ for all $x \in S^+ \setminus \mathrm{dom}(\beta)$, which is possible due to the orthogonality of $\theta$ and $\eta$. Using Newton's method, $\eta^{(t+1)}_{P_\beta}(x)$ should satisfy

\[ \bigl( \theta^{(t)}_{P_\beta}(x) - \beta(x) \bigr) + \sum_{y \in \mathrm{dom}(\beta)} J_{xy} \bigl( \eta^{(t+1)}_{P_\beta}(y) - \eta^{(t)}_{P_\beta}(y) \bigr) = 0 \]

for every $x \in \mathrm{dom}(\beta)$, where $J_{xy}$ is an entry of the $|\mathrm{dom}(\beta)| \times |\mathrm{dom}(\beta)|$ Jacobian matrix $J$, given as

\[ J_{xy} = \frac{\partial \theta^{(t)}_{P_\beta}(x)}{\partial \eta^{(t)}_{P_\beta}(y)} = \sum_{s \in S} \mu(s, x)\, \mu(s, y)\, p^{(t)}_\beta(s)^{-1} \]

from Theorem 3. Therefore, we have the update formula for all $x \in \mathrm{dom}(\beta)$ as

\[ \eta^{(t+1)}_{P_\beta}(x) = \eta^{(t)}_{P_\beta}(x) - \sum_{y \in \mathrm{dom}(\beta)} (J^{-1})_{xy} \bigl( \theta^{(t)}_{P_\beta}(y) - \beta(y) \bigr). \]

In $e$-projection, we instead update $\theta^{(t)}_{P_\beta}(x)$ for $x \in \mathrm{dom}(\beta)$ so that $\eta^{(t)}_{P_\beta}(x)$ approaches $\beta(x)$, while fixing $\theta^{(t)}_{P_\beta}(x) = \theta_P(x)$ for all $x \in S^+ \setminus \mathrm{dom}(\beta)$. To ensure $\eta^{(t)}_{P_\beta}(\bot) = 1$, we add $\bot$ to $\mathrm{dom}(\beta)$ and set $\beta(\bot) = 1$. We update $\theta^{(t)}_{P_\beta}(x)$ at each step $t$ as

\[ \theta^{(t+1)}_{P_\beta}(x) = \theta^{(t)}_{P_\beta}(x) - \sum_{y \in \mathrm{dom}(\beta)} (J'^{-1})_{xy} \bigl( \eta^{(t)}_{P_\beta}(y) - \beta(y) \bigr), \]

\[ J'_{xy} = \frac{\partial \eta^{(t)}_{P_\beta}(x)}{\partial \theta^{(t)}_{P_\beta}(y)} = \sum_{s \in S} \zeta(x, s)\, \zeta(y, s)\, p^{(t)}_\beta(s) - |S|\, \eta^{(t)}_{P_\beta}(x)\, \eta^{(t)}_{P_\beta}(y). \]

In this case, we also need to update $\theta^{(t)}_{P_\beta}(\bot)$, as it is not guaranteed to be fixed. Let us define

\[ \tilde{p}^{(t+1)}_\beta(x) = p^{(t)}_\beta(x) \prod_{s \in \mathrm{dom}(\beta) \setminus \{\bot\}} \left( \frac{\exp\bigl( \theta^{(t+1)}_{P_\beta}(s) \bigr)}{\exp\bigl( \theta^{(t)}_{P_\beta}(s) \bigr)} \right)^{\zeta(s, x)}. \]

Since we have

\[ p^{(t+1)}_\beta(x) = \frac{\exp\bigl( \theta^{(t+1)}_{P_\beta}(\bot) \bigr)}{\exp\bigl( \theta^{(t)}_{P_\beta}(\bot) \bigr)}\, \tilde{p}^{(t+1)}_\beta(x), \]

it follows that

\[ \theta^{(t+1)}_{P_\beta}(\bot) = \theta^{(t)}_{P_\beta}(\bot) - \log \Bigl( \exp\bigl( \theta^{(t)}_{P_\beta}(\bot) \bigr) + \sum_{x \in S^+} \tilde{p}^{(t+1)}_\beta(x) \Bigr). \]

The time complexity of each iteration is $O(|\mathrm{dom}(\beta)|^3)$, which is required to compute the inverse of the Jacobian matrix.

Global convergence of the projection algorithm is always guaranteed by the convexity of the submanifold $\mathcal{S}(\beta)$ defined in Equation (14). Since $\mathcal{S}(\beta)$ is always convex with respect to the $\theta$- and $\eta$-coordinates, it is straightforward to see that our $e$-projection is an instance of the Bregman algorithm onto a convex region, which is well known to always converge to the global solution (Censor & Lent, 1981).
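As a sanity check that we add here (not a passage from the original), the $e$-projection Jacobian $J'$ specializes to the matrix formula of Equation (4) on the model of Section 2, where $S$ is the full grid $[n] \times [n]$ (so $|S| = n^2$) and every pair of elements has a join: for $x = (i, j)$ and $y = (i', j')$ we have $\zeta(x, s)\,\zeta(y, s) = 1$ exactly when $s \geq (\max\{i, i'\}, \max\{j, j'\})$, hence

\[ \sum_{s \in S} \zeta(x, s)\, \zeta(y, s)\, p^{(t)}_\beta(s) = \eta^{(t)}_{\max\{i,i'\} \max\{j,j'\}}, \qquad J'_{xy} = \eta^{(t)}_{\max\{i,i'\} \max\{j,j'\}} - n^2\, \eta^{(t)}_{ij}\, \eta^{(t)}_{i'j'}, \]

which is exactly the Jacobian used in Equation (4).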
5. Balancing Matrices and Tensors

Now we are ready to solve the problem of matrix and tensor balancing as projection on a dually flat manifold.

5.1. Matrix Balancing

Recall that the task of matrix balancing is to find $r, s \in \mathbb{R}^n$ that satisfy $(RAS)\mathbf{1} = \mathbf{1}$ and $(RAS)^{\top}\mathbf{1} = \mathbf{1}$ with $R = \mathrm{diag}(r)$ and $S = \mathrm{diag}(s)$ for a given nonnegative square matrix $A = (a_{ij}) \in \mathbb{R}^{n \times n}_{\geq 0}$. Let us define $S$ as

\[ S = \{ (i, j) \mid i, j \in [n] \text{ and } a_{ij} \neq 0 \}, \tag{15} \]

where we remove zero entries from the outcome space $S$, as our formulation cannot treat zero probabilities, and give each probability as $p((i, j)) = a_{ij} / \sum_{ij} a_{ij}$. The partial order $\leq$ on $S$ is naturally introduced as

\[ x = (i, j) \leq y = (k, l) \iff i \leq k \text{ and } j \leq l, \tag{16} \]

resulting in $\bot = (1, 1)$. In addition, we define $\iota_{k,m}$ for each $k \in [n]$ and $m \in \{1, 2\}$ such that

\[ \iota_{k,m} = \min\{\, x = (i_1, i_2) \in S \mid i_m = k \,\}, \]

where the minimum is taken with respect to the order $\leq$. If $\iota_{k,m}$ does not exist, we simply remove the entire $k$th row if $m = 1$ or the $k$th column if $m = 2$ from $A$. Then we permute the rows and columns of $A$ so that the condition

\[ \iota_{1,m} \leq \iota_{2,m} \leq \dots \leq \iota_{n,m} \tag{17} \]

is satisfied for each $m \in \{1, 2\}$, which is possible for any matrix. Since we have

\[ \eta(\iota_{k,m}) - \eta(\iota_{k+1,m}) = \begin{cases} \sum_{j=1}^{n} p((k, j)) & \text{if } m = 1, \\ \sum_{i=1}^{n} p((i, k)) & \text{if } m = 2 \end{cases} \]

if the condition (17) is satisfied, the probability distribution is balanced if

\[ \eta(\iota_{k,m}) = \frac{n - k + 1}{n} \]

for all $k \in [n]$ and $m \in \{1, 2\}$. Therefore, we obtain the following result.

Matrix balancing as e-projection: Given a matrix $A \in \mathbb{R}^{n \times n}_{\geq 0}$ with its normalized probability distribution $P \in \mathcal{S}$ such that $p((i, j)) = a_{ij} / \sum_{ij} a_{ij}$. Define the poset $(S, \leq)$ by Equations (15) and (16), and let $\mathcal{S}(\beta)$ be the submanifold of $\mathcal{S}$ such that

\[ \mathcal{S}(\beta) = \{ P \in \mathcal{S} \mid \eta_P(x) = \beta(x) \text{ for all } x \in \mathrm{dom}(\beta) \}, \]

where the function $\beta$ is given as

\[ \mathrm{dom}(\beta) = \{ \iota_{k,m} \in S \mid k \in [n],\ m \in \{1, 2\} \}, \qquad \beta(\iota_{k,m}) = \frac{n - k + 1}{n}. \]

Matrix balancing is the $e$-projection of $P$ onto the submanifold $\mathcal{S}(\beta)$; that is, the balanced matrix $(RAS)/n$ is the distribution $P_\beta$ such that

\[ \begin{cases} \theta_{P_\beta}(x) = \theta_P(x) & \text{if } x \in S^+ \setminus \mathrm{dom}(\beta), \\ \eta_{P_\beta}(x) = \beta(x) & \text{if } x \in \mathrm{dom}(\beta), \end{cases} \]

which is unique and always exists in $\mathcal{S}$, thanks to its dually flat structure. Moreover, the two balancing vectors $r$ and $s$ are obtained as

\[ \exp\Bigl( \sum_{k=1}^{i} \bigl( \theta_{P_\beta}(\iota_{k,m}) - \theta_P(\iota_{k,m}) \bigr) \Bigr) = \begin{cases} r_i & \text{if } m = 1, \\ s_i & \text{if } m = 2 \end{cases} \]

for every $i \in [n]$, with $r$ rescaled as $r \leftarrow r\, n / \sum_{ij} a_{ij}$.

5.2. Tensor Balancing

Next, we generalize our approach from matrices to tensors. For an $N$th order tensor $A = (a_{i_1 i_2 \dots i_N}) \in \mathbb{R}^{n_1 \times n_2 \times \dots \times n_N}$ and a vector $b \in \mathbb{R}^{n_m}$, the $m$-mode product of $A$ and $b$ is defined as

\[ (A \times_m b)_{i_1 \dots i_{m-1} i_{m+1} \dots i_N} = \sum_{i_m = 1}^{n_m} a_{i_1 i_2 \dots i_N}\, b_{i_m}. \]

We define tensor balancing as follows: Given a tensor $A \in \mathbb{R}^{n_1 \times n_2 \times \dots \times n_N}$ with $n_1 = \dots = n_N = n$, find $N$ tensors $R^1, R^2, \dots, R^N$ of order $N - 1$ such that

\[ A' \times_m \mathbf{1} = \mathbf{1}\ \bigl( \in \mathbb{R}^{n_1 \times \dots \times n_{m-1} \times n_{m+1} \times \dots \times n_N} \bigr) \tag{18} \]

for all $m \in [N]$, i.e., $\sum_{i_m = 1}^{n} a'_{i_1 i_2 \dots i_N} = 1$, where each entry $a'_{i_1 i_2 \dots i_N}$ of the balanced tensor $A'$ is given as

\[ a'_{i_1 i_2 \dots i_N} = a_{i_1 i_2 \dots i_N} \prod_{m \in [N]} R^m_{i_1 \dots i_{m-1} i_{m+1} \dots i_N}. \]

A tensor $A'$ that satisfies Equation (18) is called multistochastic (Cui et al., 2014). Note that this is exactly the same as the matrix balancing problem if $N = 2$.

It is straightforward to extend matrix balancing to tensor balancing as $e$-projection onto a submanifold. Given a tensor $A \in \mathbb{R}^{n_1 \times n_2 \times \dots \times n_N}$, its normalized probability distribution $P$ is such that

\[ p(x) = \frac{a_{i_1 i_2 \dots i_N}}{\sum_{j_1 j_2 \dots j_N} a_{j_1 j_2 \dots j_N}} \tag{19} \]

for all $x = (i_1, i_2, \dots, i_N)$. The objective is to obtain $P_\beta$ such that

\[ \sum_{i_m = 1}^{n} p_\beta((i_1, \dots, i_N)) = \frac{1}{n^{N-1}} \]

for all $m \in [N]$ and $i_1, \dots, i_N \in [n]$. In the same way as matrix balancing, we define $S$ as

\[ S = \bigl\{ (i_1, i_2, \dots, i_N) \in [n]^N \mid a_{i_1 i_2 \dots i_N} \neq 0 \bigr\} \]

with zero entries removed, and the partial order $\leq$ as

\[ x = (i_1, \dots, i_N) \leq y = (j_1, \dots, j_N) \iff i_m \leq j_m \text{ for all } m \in [N]. \]

In addition, we introduce $\iota_{k,m}$ as

\[ \iota_{k,m} = \min\{\, x = (i_1, i_2, \dots, i_N) \in S \mid i_m = k \,\} \]

and require the condition in Equation (17).

Tensor balancing as e-projection: Given a tensor $A \in \mathbb{R}^{n_1 \times n_2 \times \dots \times n_N}$ with its normalized probability distribution $P \in \mathcal{S}$ given in Equation (19). The submanifold $\mathcal{S}(\beta)$ of multistochastic tensors is given as

\[ \mathcal{S}(\beta) = \{ P \in \mathcal{S} \mid \eta_P(x) = \beta(x) \text{ for all } x \in \mathrm{dom}(\beta) \}, \]

where the domain of the function $\beta$ is given as

\[ \mathrm{dom}(\beta) = \{\, \iota_{k,m} \mid k \in [n],\ m \in [N] \,\} \]

and each value is described using the zeta function as

\[ \beta(\iota_{k,m}) = \sum_{l \in [n]} \zeta(\iota_{k,m}, \iota_{l,m})\, \frac{1}{n^{N-1}}. \]
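To make this definition of $\beta$ concrete, consider the special case of a strictly positive tensor (a worked example we add for illustration): then $S = [n]^N$, $\iota_{k,m}$ is the point with $i_m = k$ and all other indices equal to $1$, and $\iota_{k,m} \leq \iota_{l,m}$ holds exactly when $k \leq l$, so

\[ \beta(\iota_{k,m}) = \sum_{l \in [n]} \zeta(\iota_{k,m}, \iota_{l,m})\, \frac{1}{n^{N-1}} = \frac{n - k + 1}{n^{N-1}}, \]

which for $N = 2$ reduces to the balancing target $\eta(\iota_{k,m}) = (n - k + 1)/n$ used in Sections 2 and 5.1.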
Tensor balancing is the $e$-projection of $P$ onto the submanifold $\mathcal{S}(\beta)$; that is, the multistochastic tensor is the distribution $P_\beta$ such that

\[ \begin{cases} \theta_{P_\beta}(x) = \theta_P(x) & \text{if } x \in S^+ \setminus \mathrm{dom}(\beta), \\ \eta_{P_\beta}(x) = \beta(x) & \text{if } x \in \mathrm{dom}(\beta), \end{cases} \]

which is unique and always exists in $\mathcal{S}$, thanks to its dually flat structure. Moreover, each balancing tensor $R^m$ is obtained as

\[ R^m_{i_1 \dots i_{m-1} i_{m+1} \dots i_N} = \exp\Bigl( \sum_{k=1}^{i_{m'}} \bigl( \theta_{P_\beta}(\iota_{k,m'}) - \theta_P(\iota_{k,m'}) \bigr) \Bigr), \qquad m' = (m \bmod N) + 1, \]

for every $m \in [N]$, with $R^1$ rescaled as $R^1 \leftarrow R^1\, n^{N-1} / \sum_{j_1 \dots j_N} a_{j_1 \dots j_N}$ to recover a multistochastic tensor. Our result means that the $e$-projection algorithm based on Newton's method proposed in Section 4.3 converges to the unique balanced tensor whenever $\mathcal{S}(\beta) \neq \emptyset$ holds.

6. Conclusion

In this paper, we have solved the open problem of tensor balancing and presented an efficient balancing algorithm using Newton's method. Our algorithm converges quadratically, while the popular Sinkhorn-Knopp algorithm converges only linearly. We have examined the efficiency of our algorithm in numerical experiments on matrix balancing and showed that the proposed algorithm is several orders of magnitude faster than the existing approaches.

We have analyzed the theory behind the algorithm and proved that balancing is $e$-projection in a special type of statistical manifold, in particular, a dually flat Riemannian manifold studied in information geometry. Our key finding is that the gradient of the manifold, equivalent to the Riemannian metric or the Fisher information matrix, can be analytically obtained using the Möbius inversion formula.

Our information geometric formulation can model several machine learning applications such as statistical analysis on a DAG structure. Thus, we can perform efficient learning as projection using information of the gradient of manifolds by reformulating such models, which we will study in future work.

Acknowledgements

The authors sincerely thank Marco Cuturi for his valuable comments. This work was supported by JSPS KAKENHI Grant Numbers JP16K16115, JP16H02870 (MS), JP26120732 and JP16H06570 (HN). The research of K.T. was supported by JST CREST JPMJCR1502, RIKEN PostK, KAKENHI Nanostructure and KAKENHI JP15H05711.

References

Agresti, A. Categorical Data Analysis. Wiley, 3rd edition, 2012.

Ahmed, M., De Loera, J., and Hemmecke, R. Polyhedral Cones of Magic Cubes and Squares, volume 25 of Algorithms and Combinatorics, pp. 25–41. Springer, 2003.

Akartunalı, K. and Knight, P. A. Network models and biproportional rounding for fair seat allocations in the UK elections. Annals of Operations Research, pp. 1–19, 2016.

Amari, S. Information geometry on hierarchy of probability distributions. IEEE Transactions on Information Theory, 47(5):1701–1711, 2001.

Amari, S. Information geometry and its applications: Convex function and dually flat manifold. In Nielsen, F. (ed.), Emerging Trends in Visual Computing: LIX Fall Colloquium, ETVC 2008, Revised Invited Papers, pp. 75–102. Springer, 2009.

Amari, S. Information geometry of positive measures and positive-definite matrices: Decomposable dually flat structure. Entropy, 16(4):2131–2145, 2014.

Amari, S. Information Geometry and Its Applications. Springer, 2016.

Balinski, M. Fair majority voting (or how to eliminate gerrymandering). American Mathematical Monthly, 115(2):97–113, 2008.

Censor, Y. and Lent, A. An iterative row-action method for interval convex programming. Journal of Optimization Theory and Applications, 34(3):321–353, 1981.

Chang, H., Paksoy, V. E., and Zhang, F. Polytopes of stochastic tensors. Annals of Functional Analysis, 7(3):386–393, 2016.

Cui, L.-B., Li, W., and Ng, M. K. Birkhoff–von Neumann theorem for multistochastic tensors. SIAM Journal on Matrix Analysis and Applications, 35(3):956–973, 2014.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pp. 2292–2300, 2013.

Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T. A. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems 28, pp. 2053–2061, 2015.

Ganmor, E., Segev, R., and Schneidman, E. Sparse low-order interaction network underlies a highly correlated and learnable neural population code. Proceedings of the National Academy of Sciences, 108(23):9679–9684, 2011.
Gierz, G., Hofmann, K. H., Keimel, K., Lawson, J. D., Mislove, M., and Scott, D. S. Continuous Lattices and Domains. Cambridge University Press, 2003.

Idel, M. A review of matrix scaling and Sinkhorn's normal form for matrices and positive maps. arXiv:1609.06349, 2016.

Ito, K. (ed.). Encyclopedic Dictionary of Mathematics. The MIT Press, 2nd edition, 1993.

Knight, P. A. The Sinkhorn–Knopp algorithm: Convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008.

Knight, P. A. and Ruiz, D. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis, 33(3):1029–1047, 2013.

Lahr, M. and de Mesnard, L. Biproportional techniques in input-output analysis: Table updating and structural analysis. Economic Systems Research, 16(2):115–134, 2004.

Lamond, B. and Stewart, N. F. Bregman's balancing method. Transportation Research Part B: Methodological, 15(4):239–248, 1981.

Livne, O. E. and Golub, G. H. Scaling by binormalization. Numerical Algorithms, 35(1):97–120, 2004.

Marshall, A. W. and Olkin, I. Scaling of matrices to achieve specified row and column sums. Numerische Mathematik, 12(1):83–90, 1968.

Miller, R. E. and Blair, P. D. Input-Output Analysis: Foundations and Extensions. Cambridge University Press, 2nd edition, 2009.

Moon, T. K., Gunther, J. H., and Kupin, J. J. Sinkhorn solves Sudoku. IEEE Transactions on Information Theory, 55(4):1741–1746, 2009.

Nakahara, H. and Amari, S. Information-geometric measure for neural spikes. Neural Computation, 14(10):2269–2316, 2002.

Nakahara, H., Nishimura, S., Inoue, M., Hori, G., and Amari, S. Gene interaction in DNA microarray data is decomposed by information geometric measure. Bioinformatics, 19(9):1124–1131, 2003.

Nakahara, H., Amari, S., and Richmond, B. J. A comparison of descriptive models of a single spike train by information-geometric measure. Neural Computation, 18(3):545–568, 2006.

Parikh, A. Forecasts of input-output matrices using the R.A.S. method. The Review of Economics and Statistics, 61(3):477–481, 1979.

Parlett, B. N. and Landis, T. L. Methods for scaling to doubly stochastic form. Linear Algebra and its Applications, 48:53–79, 1982.

Rao, S. S. P., Huntley, M. H., Durand, N. C., Stamenova, E. K., Bochkov, I. D., Robinson, J. T., Sanborn, A. L., Machol, I., Omer, A. D., Lander, E. S., and Aiden, E. L. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159(7):1665–1680, 2014.

Rota, G.-C. On the foundations of combinatorial theory I: Theory of Möbius functions. Z. Wahrscheinlichkeitstheorie, 2:340–368, 1964.

Sinkhorn, R. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879, 1964.

Sinkhorn, R. and Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.

Solomon, J., de Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., and Guibas, L. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4):66:1–66:11, 2015.

Soules, G. W. The rate of convergence of Sinkhorn balancing. Linear Algebra and its Applications, 150:3–40, 1991.

Sugiyama, M., Nakahara, H., and Tsuda, K. Information decomposition on structured space. In 2016 IEEE International Symposium on Information Theory, pp. 575–579, 2016.

Wu, H.-J. and Michor, F. A computational strategy to adjust for copy number in tumor Hi-C data.
Bioinformatics, 32(24):3695–3701, 2016.