# kopa_automated_kronecker_product_approximation__591ca74f.pdf

Journal of Machine Learning Research 23 (2022) 1-44 Submitted 8/20; Revised 7/22; Published 8/22

Ko PA: Automated Kronecker Product Approximation

Chencheng Cai chencheng.cai@temple.edu Department of Statistics, Operations and Data Science Fox School of Business, Temple University Philadelphia, PA 19122, USA

Rong Chen rongchen@stat.rutgers.edu Department of Statistics Rutgers University Piscataway, NJ 08854, USA

Han Xiao hxiao@stat.rutgers.edu Department of Statistics Rutgers University Piscataway, NJ 08854, USA

Editor: David Wipf

We consider the problem of matrix approximation and denoising induced by the Kronecker product decomposition. Speciﬁcally, we propose to approximate a given matrix by the sum of a few Kronecker products of matrices, which we refer to as the Kronecker product approximation (Ko PA). Because the Kronecker product is an extensions of the outer product from vectors to matrices, Ko PA extends the low rank matrix approximation, and includes it as a special case. Comparing with the latter, Ko PA also oﬀers a greater ﬂexibility, since it allows the user to choose the conﬁguration, which are the dimensions of the two smaller matrices forming the Kronecker product. On the other hand, the conﬁguration to be used is usually unknown, and needs to be determined from the data in order to achieve the optimal balance between accuracy and parsimony. We propose to use extended information criteria to select the conﬁguration. Under the paradigm of high dimensional analysis, we show that the proposed procedure is able to select the true conﬁguration with probability tending to one, under suitable conditions on the signal-to-noise ratio. We demonstrate the superiority of Ko PA over the low rank approximations through numerical studies, and several benchmark image examples.

Keywords: Information Criterion, Kronecker Product, Low Rank Approximation, Matrix Decomposition, Random Matrix

1. Introduction

Observations that are matrix/tensor valued have been commonly seen in various scientiﬁc ﬁelds and social studies. In recent years, advances in technology have made high dimensional matrix/tensor type data possible and more and more prevalent. Examples include high resolution images in face recognition and motion detection (Turk and Pentland, 1991; Bruce and Young, 1986; Parkhi et al., 2015), brain images through f MRI (Belliveau et al., 1991; Maldjian et al., 2003), adjacent matrices of social networks of millions of nodes (Goldenberg

c 2022 Chencheng Cai, Rong Chen and Han Xiao.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/20-931.html.

Cai, Chen and Xiao

et al., 2010), the covariance matrix of thousands of stock returns (Ng et al., 1992; Fan et al., 2011), the import/export network among hundreds of countries (Chen et al., 2022), etc. Due to the high dimensionality of the data, it is often useful and preferred to store, compress, represent, or summarize the matrices/tensors through low dimensional structures. In particular, low rank approximations of matrices have been ubiquitous. Finding a low rank approximation of a given matrix is closely related to the singular value decomposition (SVD), and the connection was revealed as early as Eckart and Young (1936). SVD has proven extremely useful in matrix completion (Cand es and Recht, 2009; Candes and Plan, 2010; Cai et al., 2010), community detection (Le et al., 2016), image denoising (Guo et al., 2015), among many others. In this paper, we investigate matrix approximations induced by the Kronecker product. Since the Kronecker product is an extension of the outer product, we call the proposed method Ko PA (Kronecker outer Product Approximation). Kronecker product is an operation on two matrices which generalizes the outer product from vectors to matrices. Specifically, the Kronecker product of a p q matrix A = (aij) and a p q matrix B = (bij), denoted by A B, is deﬁned as a (pp ) (qq ) matrix which takes the form of a block matrix. In A B, there are pq blocks of size p q , where the (i, j)-th block is the scalar product aij B. We refer the readers to Horn and Johnson (1991) and Van Loan and Pitsianis (1993) for overviews of the properties and computations of the Kronecker product. Kronecker product has also found wide applications in signal processing, image restoration and quantum computing, etc. For example, in the statistical modeling of a multi-input multi-output (MIMO) channel communication system, Werner et al. (2008) modeled the covariance matrix of channel signals as the Kronecker product of the transmit covariance matrix and the receive covariance matrix. In compressed sensing, Duarte and Baraniuk (2012) utilized Kronecker products to provide a sparse basis for high-dimensional signals. In image restoration, Kamm and Nagy (1998) considered the blurring operator as a Kronecker product of two smaller matrices. In quantum computing, Kaye et al. (2007) represented the joint state of quantum bits as a Kronecker product of their individual states. In SVD, a matrix is represented as the sum of rank one matrices, where each of them is represented as the outer product of the left singular vector and the corresponding right singular vector (after the transpose). Similarly, the Kronecker Product Decomposition (KPD) of a (pp ) (qq ) matrix C is deﬁned as

where d = min{pq, p q }, and Ak and Bk are p q and p q respectively. In the deﬁnition of the KPD, the dimensions of Ak and Bk have to be speciﬁed, which (in this case, p q and p q ) we refer to as the conﬁguration of the KPD. Further constraints on Ak and Bk are necessary to make the decomposition well deﬁned and unique, but we will defer the exact deﬁnition of KPD to Section 2. Since the Kronecker product is an extension of the vector outer product, so is KPD of SVD. In particular, if p = 1, and q = 1, then Ak and Bk are column and row vectors respectively, and the KPD, under this particular conﬁguration, becomes the SVD. Similar to rank-one approximation, the best matrix approximation given by a Kronecker product is formulated as ﬁnding the closest Kronecker product under the Frobenius norm.

Ko PA: Automated Kronecker Product Approximation

This was introduced in the matrix computation literature as the nearest Kronecker product (NKP) problem in Van Loan and Pitsianis (1993), who also demonstrated its equivalence to the best rank one approximation and therefore also to the SVD, after a proper rearrangement of the matrix entries. Such an equivalence is also maintained if one seeks the best approximation of a given matrix by the sum of K Kronecker products of the same conﬁguration, PK k=1 Ak Bk. Despite of its connection to SVD, ﬁnding a best Kronecker approximation also involves a pre-step: determining the conﬁgurations of the Kronecker products, i.e., determining the dimensions of Ak and Bk. One of our major contributions in this paper is on the selection of the conﬁguration based on an information criterion. Although the conﬁguration selection poses new challenges, KPD also provides a framework that is more ﬂexible than SVD. Here we use the cameraman s image, a benchmark in image analysis, to illustrate the potential advantage of KPD over SVD. The left panel in Figure 1 is the 512 512 pixel image of a cameraman in gray scale. The middle panel shows the best rank-1 approximation of the original image given by the leading term of SVD. The rank-1 approximation explains 45.63% of the total variation of the original image with 1023 parameters. The right panel in Figure 1 displays the image obtained by the nearest Kronecker product of conﬁguration (16 32) (32 16). With the same number of parameters as the rank-1 approximation, this nearest Kronecker product approximation explains 77.55% of the variance of the original image.

Figure 1: (Left) Original cameraman s image; (Middle) SVD approximation; (Right) KPD approximation

We will revisit the cameraman s image in Section 6 with a more detailed analysis. We notice here that the superiority of Ko PA over low rank approximation in representing images is partially due to the similarity of local blocks in the image. In this regard it is related to the patch based denoising methods (Dabov et al., 2007; Chatterjee and Milanfar, 2011) in the ﬁeld of image processing, which explore the recurrence of similar local pattern throughout the image. However, we have a substantially distinct focus in this paper. One of our main objectives is to devise a formal procedure to determine the conﬁguration, or the patch size , from the data, which is usually chosen in an ad hoc manner in patch based methods. We introduce a statistical model to characterize the image generating mechanism, and propose to use information criteria to select the conﬁguration. Practically it implies an emphasis on the balance between the complexity (number of model parameters) and accuracy (closeness to the original image). Furthermore, the Ko PA framework and

Cai, Chen and Xiao

the model selection also has potential applications in high dimensional panel time series, large network analysis, recommending systems, and other matrix-type data analysis. For example, in modeling dense networks (Leskovec et al., 2010), the adjacency matrix can be represented by a Kronecker product A B, where A and B correspond to the interand inner-community structures respectively. As a second example, the Ko PA may as well replace the low rank approximation in the synchronization problem (Chen and Chen, 2008; Singer, 2011) to identify the groups/clusters of the individuals, at the same time of denoising the distance matrix. It is worth mentioning that Ko PA can also be used to speed up the computation. If the transition matrix of a Markov Chain can be represented as one or a sum of a few Kronecker products, then the state update can be calculated more eﬃciently (Dayar, 2012). Ko PA plays its role in guiding the choice of the Kronecker product approximation of the transition matrix.

In this paper, we focus on the model

Y = λA B + σ PQE,

where (P, Q) is the dimension of the matrix Y , E is a standard Gaussian ensemble consisting of IID standard normal entries, λ > 0 and σ > 0 indicate the strength of signal and noise respectively. We consider the matrix denoising problem which aims to recover the Kronecker product λA B from the noisy observation Y . Here the conﬁguration of the Kronecker product, i.e. the dimensions of A and B, is to be determined from the data. We propose to use information criteria (which include AIC and BIC as special cases) to select the conﬁguration, and prove its consistency under some conditions on the signal-to-noise ratio. The consistency of the conﬁguration selection is established for both deterministic and random A and B, under the paradigm of high dimensional analysis, where the dimension of Y diverges to inﬁnity.

The rest of the paper is organized as follows. In Section 2, we give the precise deﬁnition of the KPD, and introduce the model, with a review of some of their basic properties. In Section 3, we propose the information criteria for selecting the conﬁguration of the Kronecker product. We investigate and establish the consistency of the proposed selection procedure in Section 4. Extension to the multi-term Kronecker product models is discussed in Section 5. In Section 6, we carry out extensive simulations to assess the performance of our method, and demonstrate its superiority over the SVD approach. We also present a detailed analysis of the cameraman s image.

Notations: Throughout this paper, for a vector v, v denotes its Euclidean norm. And for a matrix M, M F = p

tr(M M) and M S = max u =1 Mu denote its Frobenius norm and spectral norm respectively. For any two real numbers a and b, a b and a b stand for min{a, b} and max{a, b} respectively. For any number x, x+ denotes the positive part x 0 = max{x, 0}.

Ko PA: Automated Kronecker Product Approximation

2. Kronecker Product Model

2.1 Kronecker Product Decomposition

We ﬁrst repeat the deﬁnition of the Kronecker product of a p q matrix A and a p q

matrix B, which is given by

a1,1B a1,2B a1,q B a2,1B a2,2B a2,q B ... ... ... ap,1B ap,2B ap,q B

Let C be a (pp ) (qq ) real matrix, its Kronecker Product Decomposition (KPD) of conﬁguration (p, q, p , q ) is deﬁned as

k=1 λk Ak Bk. (1)

where d = min{pq, p q }, each Ak is a p q matrix with Frobenius norm Ak F = 1, each Bk is a p q matrix with Bk F = 1, and λ1 λ2 λd 0. The matrices Ak are mutually orthogonal in the sense that tr(Ak A l) = 0 for 1 k < l d, and so are the matrices Bk. The best way to see that the KPD is a valid deﬁnition is through its connection with the SVD, after a proper rearrangement of the elements of C, as demonstrated in Van Loan and Pitsianis (1993). Denote by vec( ) the vectorization of a matrix by stacking its rows. If A = (aij) is a p q matrix, then

vec(A) := [a1,1, . . . , a1,q, . . . , ap,1, . . . , ap,q] .

If B = (bij) is a p q matrix, then vec(A)[vec(B)] is a (pq) (p q ) matrix containing the same set of elements as the Kronecker product A B, but in diﬀerent positions. We deﬁne the rearrangement operator R to represent this relationship. Write the matrix C as a p q array of blocks of the same block size p q , and denote by Cp ,q i,j the (i, j)-th block, where 1 i p, 1 j q. The operator R maps the matrix C to

Rp,q[C] = h vec(Cp ,q 1,1 ), . . . , vec(Cp ,q 1,q ), . . . , vec(Cp ,q p,1 ), . . . , vec(Cp ,q p,q ) i , (2)

When applied to a Kronecker product A B, it holds that

Rp,q[A B] = vec(A)[vec(B)] . (3)

In view of (2) and (3), we see that the KPD in (1) corresponds to the SVD of the rearranged matrix Rp,q[C], and the conditions imposed on Ak and Bk are derived from the properties of the singular vectors. Here, we note that the rearrangement operator R is conﬁguration dependent, which we emphasize by explicitly specifying the dimension of Ak (in this case, p and q) in the subscript of R, see (2) and (3). When there is no ambiguity, the subscript of R may be omitted for notational simplicity. According to the deﬁnition, the mapping Rp,q : Rpp qq Rpq p q is an isomorphism since it is linear and bijective. In addition, since the order of elements does not change the Frobenius norm, the mapping R is also isometric under Frobenius norm.

Cai, Chen and Xiao

2.2 Kronecker Product Model

We consider the model where the observed P Q matrix Y is a noisy version of an unknown Kronecker product Y = λA B + σ PQE. (4)

To resolve the obvious unidentiﬁability regarding A and B, we require

A F = B F = 1, (5)

so that λ > 0 indicates the strength of the signal part. Note that under (5), A and B are identiﬁed up to a sign change given their dimensions. To further identify A and B, it is common to assume the largest non-zero element (in absolute value) of one of them is positive. We assume that the noise matrix E has IID stand normal entries, and consequently the strength of the noise is controlled by σ > 0. The dimensions of A and B correspond to the integer factorization of the dimension of Y . For convenience, we assume throughout this article that the dimension of the observed matrix Y in (4) is 2M 2N with M, N N. As a result, the dimension of A must be of the form 2m0 2n0, where 0 m0 M and 0 n0 N, and the corresponding dimension of B is 2m 0 2n 0, where m 0 = M m0 and n 0 = N n0. Therefore, we can simply use the pair (m0, n0) to denote the conﬁguration of the Kronecker product in (4). An implicit advantage of this assumption lies in the fact that if two conﬁgurations (m, n) and (m , n ) are diﬀerent, then the number of rows of A under one conﬁgurations divides the one under the other, and similarly for the numbers of columns, and for B. For example, if m m , then the number of rows of A under the former conﬁguration, which is 2m, divides the number of rows 2m under the latter. This fact leads to a more elegant treatment of the theoretical analysis in Section 4. For image analysis, assuming the dimension to be powers of 2 seems rather reasonable. On the other hand, for other applications where the dimension of the observed matrix are not powers of 2, one can transform the matrix to fulﬁll the assumption. For example, one can super-sample the matrix to increase the dimension to the closest powers of 2, or augment the matrix by padding zeros. The methodology proposed in this paper can be applied to any integer numbers P and Q with more than two factors. We will consider two mechanisms for the signal part λA B.

Deterministic Scheme. We assume that λ, A and B are deterministic, satisfying (5). We deﬁne the following signal-to-noise ratio to measure the signal strength

λA B 2 F E σE/2(M+N)/2 2 F = λ2

Random Scheme. Assume that λ, A and B are random and independent with E. Although A and B are stochastic, we assume that they have been rescaled so that (5) is fulﬁlled. In this case the signal-to-noise ratio is deﬁned as

E λA B 2 F E σE/2(M+N)/2 2 F = E λ2

Remark 1. We distinguish between these two schemes to account for the diﬀerent assumptions on data generating mechanism. In the random scheme, the observed matrix data is

Ko PA: Automated Kronecker Product Approximation

assumed to be randomly chosen from a (super-)population of matrices with an ad-hoc prior, which for example can be chosen as the Kronecker product of two independent Gaussian random matrices. Under the random scheme assumption, ill-behaved matrices arise with negligible probabilities under the prior. Similar assumptions have been used in factor analysis and random eﬀects models. The deterministic scheme incorporates arbitrary matrices. Additional assumptions need to be imposed to exclude extreme cases for which the proposed model selection would fail.

2.3 Estimation with a Known Conﬁguration

Suppose we want to estimate A and B based on a given conﬁguration (m, n), that is, the dimensions of A and B are 2m 2n and 2m 2n respectively. Again we use m = M m and n = N n to ease the notation when M and N are known. To estimate A and B in (4) from the observed matrix Y , we solve the minimization problem

min λ,A,B Y λA B 2 F , subject to A F = B F = 1. (6)

Since we have assumed that the noise matrix contains IID standard normal entries, (6) is also equivalent to the MLE. This optimization problem has been formulated as the nearest Kronecker product (NKP) problem in the matrix computation literature (Van Loan and Pitsianis, 1993), and solved through the SVD after rearrangement. According to Section 2.1, after applying the rearrangement operator, the cost function in (6) is equivalent to

Y λA B 2 F = R[Y ] λvec(A)[vec(B)] 2 F .

We note that the rearrangement operator R deﬁned in (2) depends on the conﬁguration of the block matrix, and in the current case, on the conﬁguration (m, n). Let R[Y ] = Pd k=1 λkukv k be the SVD of the rearranged matrix Rm,n[Y ], where λ1 λd 0 are the singular values in decreasing order, uk and vk are the corresponding left and right singular vectors and d = 2m+n 2m +n . The estimators for model (4) are given by

ˆλ = λ1 = R[Y ] S, ˆ A = vec 1(u1), ˆ B = vec 1(v1), ˆσ2 = Y 2 F ˆλ2, (7)

where vec 1 is the inverse operation of vec( ) that restores a vector back into a matrix of proper dimensions. We examine a few special cases of the conﬁguration (m, n). When (m, n) = (0, 0) or (m, n) = (M, N), the nearest Kronecker product approximation of Y is always itself. For instance, if m = n = 0, the estimators are

ˆλ = Y F , ˆ A = 1, ˆ B = ˆλ 1Y , ˆσ2 = 0.

These two conﬁgurations are obviously over-ﬁtting, and we shall exclude them in the subsequent analysis. When (m, n) = (0, N) or (m, n) = (M, 0), the nearest Kronecker product approximation of Y is the same as the rank-1 approximation of Y without rearrangement. When the true conﬁguration used to generate Y is chosen, that is (m, n) = (m0, n0), the problem is equivalent to denoising a perturbed rank-1 matrix, since

Rm0,n0[Y ] = λvec(A)vec(B) + σ 2(M+N)/2 Rm0,n0[E], (8)

Cai, Chen and Xiao

where the rearranged noise matrix Rm0,n0[E] is still a standard Gaussian ensemble. Therefore λ, A and B can be recovered consistently when σ Rm0,n0[E] S = op(λ 2(M+N)/2). Details will be discussed in Section 4.

3. Conﬁguration Determination through an Information Criterion

Our primary goal is to recover the Kronecker product λA B from Y , based on model (4). It depends on the conﬁguration of the Kronecker product, which is typically unknown. We propose to use the information criterion based procedure to select the conﬁguration. Recall that the dimension of Y is 2M 2N. If the dimension of A is 2m 2n, then the dimension of B must be 2m 2n , where m = M m and n = N n. Therefore, the conﬁguration can be indexed by the pair (m, n), which takes value from the Cartesian product set {0, . . . , M} {0, . . . , N}. For any given conﬁguration (m, n), the estimation procedure in Section 2.3 leads to the corresponding estimators ˆλ, ˆ A and ˆ B. Denote the estimated Kronecker product by ˆY (m,n) = ˆλ ˆ A ˆ B. Note that all of ˆλ, ˆ A and ˆ B depend implicitly on the conﬁguration (m, n) used in estimation, and should be written as ˆλ = ˆλ(m,n) etc. However, we will suppress the conﬁguration index from the notation for simplicity, whenever its meaning is clear in the context. Under the assumption that the noise matrix E is a standard Gaussian ensemble, we deﬁne the information criterion as

ICκ(m, n) = 2M+N ln Y ˆY (m,n) 2 F + κη, (9)

where η = 2m+n + 2m +n is the number of parameters involved in the Kronecker product of the conﬁguration (m, n), and κ 0 controls the penalty on the model complexity. The information criterion (9) can be viewed as an extended version of the BIC. Similar proposals have been introduced by Chen and Chen (2008) and Foygel and Drton (2010) in the linear regression and graphical models setting, respectively. The information criterion (9) reduces to the log mean square error when κ = 0, and corresponds to the Akaike information criterion (AIC) (Akaike, 1998) when κ = 2, and the Bayesian information criterion (BIC) (Schwarz, 1978) when κ = ln 2M+N = (M + N) ln 2. Remark 2. Strictly speaking, the number of parameters involved in the Kronecker product λA B should be 2m+n +2m +n 1 because of the constraints (5). Since it does not aﬀect the selection procedure to be introduced in (10), we will use η = 2m+n+2m +n for simplicity.

The information criterion (9) can be calculated for all conﬁgurations, and the one corresponding to the smallest value of (9) will be selected, based on which the estimation procedure in Section 2.3 proceeds. In other words, the selected conﬁguration ( ˆm, ˆn) is obtained through ( ˆm, ˆn) = arg min (m,n) C ICκ(m, n), (10)

where C is the set of all candidate conﬁgurations. As discussed in Section 2.3, when m = n = 0 or (m, n) = (M, N), it holds that ˆY = Y , and the information criterion (9) will be , no matter what value κ takes. Therefore, these two conﬁgurations should be excluded in model selection and we use

C := {0, . . . , M} {0, . . . , N} \ {(0, 0), (M, N)},

Ko PA: Automated Kronecker Product Approximation

as the set of candidate conﬁgurations in (10). Note that the set {0, . . . , M} {0, . . . , N} forms a rectangle lattice in Z2, and (m, n) = (0, 0) and (m, n) = (M, N) are the bottom left and top right corner of the lattice. Therefore, we sometimes intuitively refer to these two conﬁgurations as the corner cases in the sequel. Furthermore, deﬁne W as the set of all wrong conﬁgurations W := C \ {(m0, n0)}.

We now provide a heuristic argument to show how the selection procedure (10) is able to select the true conﬁguration (m0, n0). We will leave some technical results aside, and only highlight the essential idea. Precise statements and their rigorous analysis will be presented in Section 4. For simplicity, assume that λ, σ and κ are ﬁxed constants. Also assume that both (m0 + n0) and (m 0 + n 0) diverge, so that the number of parameters η0 = 2m0+n0 + 2m 0+n 0 is of a smaller magnitude than 2M+N. According to (7), for a given conﬁguration (m, n), Rm,n[ ˆY ] equals the ﬁrst SVD component of Rm,n[Y ], and it follows that Y ˆY 2 F = Y 2 F ˆY 2 F = Y 2 F ˆλ2, and the information criterion (9) can be rewritten as

ICκ(m, n) = 2M+N ln( Y 2 F ˆλ2) + κη. (11)

For the true conﬁguration (m, n) = (m0, n0), the rearranged matrix Rm0,n0[Y ] takes the form (8), where the ﬁrst term is a rank-1 matrix of spectral norm λ, and the noise term has a spectral norm of the order O(2 (m0+n0)/2 + 2 (m 0+n 0)/2) (details given in Section 4), which is negligible relative to λ, under the assumption m0 +n0 1, m 0 +n 0 1. So under the true conﬁguration, ˆλ λ. On the other hand, the number of parameters η0 = o(2M+N), making the penalty term much smaller than the log likelihood in (9). To summarize,

ICκ(m0, n0) 2M+N ln h λA B + σ 2 (M+N)/2E 2 F λ2i 2M+N ln σ2.

For a wrong conﬁguration (m, n) W that is close to the true one, the spectrum norm Rm,n[E] S and the number of parameters η are still negligible. However, the estimated coeﬃcient ˆλ is smaller than λ since

ˆλ = Rm,n[Y ] S Rm,n[λA B] S < λ.

Let us assume that Rm,n[λA B] S φλ for some 0 < φ < 1, which implies that for the wrong conﬁguration (m, n),

ICκ(m, n) 2M+N ln h λA B + σ2 (M+N)/2E 2 F ˆλ2i

2M+N ln h σ2 (M+N)/2E 2 F + λ2 φ2λ2i

2M+N ln σ2 1 + (1 φ2)λ2

Therefore, the information criterion (9) is in favor of the true conﬁguration over a wrong but close-to-truth one, and the two quantities are separated by

ICκ(m, n) ICκ(m0, n0) 2M+N ln[1 + (1 φ2)λ2/σ2].

Cai, Chen and Xiao

On the other hand, for a wrong conﬁguration (m, n) W that is close to the corner conﬁguration (0, 0) or (M, N), the singular value Rm,n[E] S can be as large as 1/2, making the separation between ICκ(m, n) and ICκ(m0, n0) by the log likelihood not guaranteed, i.e. it can happen that ˆλ > λ under the wrong conﬁguration. But at the same time the number of parameters η is also approximately 2M+N, so ICκ(m, n) receives a heavy penalty, which once again makes it greater than ICκ(m0, n0). In summary, the trade-oﬀbetween log likelihood and model complexity plays its role here, as expected. Wrong but close-to-truth conﬁgurations involve similar numbers of parameters as the true one, but lead to much smaller likelihoods. On the other hand, a close-to-corner conﬁguration may yield a ˆY closer to the original Y , but requires much more parameters to do so. The true conﬁguration can thus be selected because it reaches the optimal balance between the the likelihood and model complexity. In the preceding discussion we have assumed several convenient conditions to simplify the heuristic arguments and to signify the essential idea. In particular, by assuming that λ is a positive constant, the signal strength in model (4) is quite strong. In Section 4 we will make eﬀort to establish the model selection consistency under minimal conditions.

Remark 3. We remark that the optimization (10) is an exhaustive search over the candidate set C. The information function ICκ(m, n) is not necessarily a uni-modal function in general, though it is likely to be uni-modal when both A and B are Gaussian random matrices. As an extreme example, if A takes the form A = C D, then both A B and C (D B) are feasible conﬁgurations, and the ICκ function is bi-modal. The local algorithms can be trapped at any of them, but the exhaustive search is then able to choose the better one which involves less number of parameters. On the other hand, in our speciﬁc settings, the matrix is of dimensions 2M 2N. Therefore the total number of candidate conﬁgurations would be MN 2 = log2(2M) log2(2N) 2, which is the product of the logarithms of the dimensions of the observed matrix Y , much smaller compared to the size of Y . For more general dimensions P and Q, the conﬁgurations are chosen from the set of all divisors of P and Q. Unless P and Q are highly composite numbers (which are rare according to the number theory), the numbers of their divisors are usually much smaller. Finally, it is possible to develop more advanced searching algorithms. There are two challenges: (1) the IC function is not necessarily convex and (2) the search space in discrete. We leave the investigation of such development to future research.

4. Theoretical Results

In this section we provide a theoretical guarantee of the conﬁguration selection procedure proposed in Section 3, by establishing its asymptotic consistency. Throughout this section all our discussion will be based on model (4).

4.1 Assumptions and Estimation Consistency under Known Conﬁguration

We ﬁrst introduce the assumptions of the theoretical analysis. Recall that for model (4), (m0, n0) denotes the true conﬁguration, i.e. the matrices A and B are of dimensions 2m0 2n0 and 2m 0 2n 0 respectively. For the asymptotic analysis, we make the following

Ko PA: Automated Kronecker Product Approximation

assumption on the sizes of A and B, which follows the paradigm of high dimensional analysis.

Assumption 1 (Assumption on Dimension). Consider model (4). Assume that as M + N , the true conﬁguration (m0, n0) satisﬁes

m0 + n0 ln ln(MN) , m 0 + n 0 ln ln(MN) ,

where m 0 = M m0 and n 0 = N n0.

The condition entails that the numbers of entries in A and B will need to diverge to inﬁnity, and so is that of Y . It is also ensured that the true conﬁguration cannot stay too close to the corners. We remark that this will be the only condition on the sizes of the involved matrices. In particular, we do not require all of m0, n0, m 0, n 0 to go to inﬁnity. Consequently, the low rank approximation (when (m0, n0) = (M, 0) or (m0, n0) = (0, N)) is also covered by the Ko PA framework and our analysis as a special case. The number of parameters involved in the Kronecker product λA B is η0 = 2m0+n0 + 2m 0+n 0. It is a much smaller number than 2M 2N, the number of elements in Y . Hence Assumption 1 implies a signiﬁcant dimension reduction. We also make the following assumption on the error matrix E.

Assumption 2 (Assumption on Noise). Consider model (4). Assume that E is a standard Gaussian ensemble, i.e. with IID standard normal entries.

We conclude this subsection with the convergence rates of the estimators ˆλ , ˆ A and ˆ B, given by the estimation procedure in Section 2.3 under the true conﬁguration. Since the error matrix E has IID standard normal entries, according to Vershynin (2010), the expectation of the largest singular value of the rearranged error matrix Rm0,n0[E] is bounded by s0 = 2(m0+n0)/2 + 2(m 0+n 0)/2.

Theorem 1. Let ˆλ, ˆ A and ˆ B be the estimators obtained under the true conﬁguration, as given in (7). Suppose Assumptions 1 and 2 hold, then for the deterministic scheme of model (4), we have

, min s= 1 ˆ A s A 2 F = Op

, min s= 1 ˆ B s B 2 F = Op

where r0 = s0 2(M+N)/2 = 2 (m0+n0)/2 + 2 (m 0+n 0)/2.

4.2 Consistency of Conﬁguration Selection

To study the consistency of the conﬁguration selection proposed in Section 3, we need assumptions on the signal-to-noise ratio. We choose to present model (4) with both λ and σ so that it is able to account for any actual data generating mechanism. On the other hand, the mathematical properties would only depend on the ratio λ/σ. The strength

Cai, Chen and Xiao

of the signal also depends on the contrast between true and wrong conﬁgurations. If a conﬁguration (m, n) W is used for the estimation, Y is rearranged as

Rm,n[Y ] = λRm,n[A B] + σ2 (M+N)/2Rm,n[E]. (12)

Ignoring the noise term, only the ﬁrst singular value component of Rm,n[A B] (multiplied by λ) is expected to enter ˆY . When the true conﬁguration is used, Rm,n[A B] is a rank-1 matrix, and its leading singular value equals 1 (recall that we have assumed that A F = B F = 1). On the other hand, if a wrong conﬁguration is used, then Rm,n[A B] is no longer rank-1, and its leading singular value should be smaller than 1. Deﬁne

φ := max (m,n) W Rm,n[A B] S. (13)

The quantity φ characterize how much of the signal A B can be captured by a wrong conﬁguration, and it always holds that 0 < φ 1, so we also introduce

ψ2 := 1 φ2, (14)

and call it the representation gap. Note that 0 ψ2 < 1, and the larger ψ2 is, the easier it is to separate true and wrong conﬁgurations. The following assumption shows the interplay between the representation gap ψ2 and the signal-to-noise ratio λ/σ.

Assumption 3 (Representation Gap). For model (4), assume that A and B are deterministic matrices, and

lim M+N 2(M+N)/2

2(m0+n0)/2 + 2(m 0+n 0)/2 (λ/σ) ψ = , (15)

and lim M+N 2(M+N)/4 (λ/σ) ψ2 = . (16)

In both (15) and (16), the signal-to-noise ratio and the representation gap ψ2 can diminish to zero, as long as they do not converge to zero too fast. In this sense, Assumption 3 is very ﬂexible by requiring only very weak signal strength. Remark 4. We have deﬁned φ as the maximum over W, the set of all wrong conﬁgurations. In fact, if we let φm,n := Rm,n[A B] S, and ψ2 m,n = 1 φ2 m,n, then Assumption 3 can also be given through ψ2 m,n instead of an uniform lower bound ψ2, leading to a weaker version of the assumption. On the other hand, as will be shown in Section 4.3, if A and B are randomly generated according to the Random Scheme, then indeed all ψ2 m,n are larger than or around 1/2 with an overwhelming probability. This is suggesting that using the lower bound ψ2 in Assumption 3 for the deterministic scheme is still reasonable. Therefore, we do not spell out the detailed version of Assumption 3 using ψ2 m,n, but present it in the current simple form. Remark 5. Notions similar to the representation gap appear as key parameters in many other problems. For example, in variable selection of linear regression problems,the representation gap would be the smallest absolute non-zero coeﬃcient in the model. In matrix rank determination problems or factor models, the representation gap would be the eigengap, or the smallest nonzero singular value. The following theorem quantiﬁes the separation of the information criterion (9) between the true and wrong conﬁgurations.

Ko PA: Automated Kronecker Product Approximation

Theorem 2. Consider model (4), and assume Assumptions 1, 2, 3. If

κ 2 ln 2, and κ = o 2M+N ln(1 + (λ/σ)2ψ2)

2m0+n0 + 2m 0+n 0

min (m,n) W E[ICκ(m, n)] E[ICκ(m0, n0)] 2M+N ln[1 + (λ/σ)2ψ2] (1 + o(1)).

To be precise, we note that for a sequence of numbers {ak}, the statement ak o(1) is understood as max{ ak, 0} = o(1). According to Assumptions 3, (λ/σ)2ψ2 2 (M+N)/2, so Theorem 2 shows that the separation of the information criterion is at least of the order 2(M+N)/2. Remark 6. The ﬁrst condition in (17) ensures that the penalty on the number of parameters is large enough to exclude conﬁgurations close to (0, 0) and (M, N). The second condition in (17) is imposed so that the contribution from the penalty term under the true conﬁguration is dominated by the representation gap. The exact formula of the diﬀerence in expected information criterion is given by (36) in Appendix. Next theorem establishes the consistency of (9). We need to deﬁne the symbol : for two sequences of positive numbers {ak} and {bk}, ak bk is deﬁned as lim infk ak/bk > 0.

Theorem 3. Assume the same conditions of Theorem 2, then

P ICκ(m0, n0) < min (m,n) W ICκ(m, n) 1 exp C2M+N + ln(MN) ,

where the constant C, depending on λ/σ and ψ, is of order

C(λ/σ, ψ) (α1/3 1)

with α = 1+(λ/σ)2ψ2. In particular, the preceding convergence rate implies the consistency of the conﬁguration selection, i.e.

lim M+N P ICκ(m0, n0) < min (m,n) W ICκ(m, n) = 1. (18)

Remark 7. In Assumption 3, we focus on the minimal signal-to-noise ratio and representation gap. On the other hand, if they are large enough such that lim inf(λ/σ)2ψ2 1/2, then the condition κ 2 ln 2 can be dropped from Theorem 2 and Theorem 3, which will continue to hold if we set κ = 0 in (9). In other words, if the signal strength and the representation gap are suﬃciently large, one can simply use mean squared error to select the conﬁguration. Remark 8. The normality assumption can be extended to any other distribution G as long as the concentration inequality of Rm,n[E] S (under any conﬁguration (m, n)) is available. There is no substantial diﬀerence in the analysis except that the threshold for signal-to-noise ratio may vary under diﬀerent noise distributions. We carry out simulation experiments in Section 6.1.2 to demonstrate the performance of ICκ for the conﬁguration selection under normality. We also include additional simulations results under diﬀerent noise distributions in Appendix H.

Cai, Chen and Xiao

4.3 Model Selection under Random Scheme

In this section we consider the consistency of the model selection under the random scheme (20). First of all, similar convergence rates as Theorem 1 can be obtained under the random scheme.

Corollary 1. Assume Assumptions 1 and 2. If A and B are generated according to the random scheme (20), then the conclusion of Theorem 1 continue to hold.

If a conﬁguration (m, n) W is used, then the estimation procedure given in Section 2.3 rearranges Y as (12). In Section 4.2 for the deterministic scheme, we introduce φ as the upper bound of Rm,n[A B] S over all wrong conﬁgurations. For the random scheme, it turns out this upper bound and hence the representation gap ψ, depending on A and B, is also random. We introduce the following random version of Assumption 3.

Assumption 4 (Representation Gap). Assuem in model (4), λ, A and B are random and independent with E. Assume there exist two sequences of positive numbers {λ0} and {ψ0} satisfying (15) and (16) (by replacing λ and ψ therein), such that

lim sup M+N E[λ2/λ2 0] < , lim sup M+N E[ψ2/ψ2 0] < ,

and for any constant c > 0

lim M+N MN P λ2/λ2 0 < 1 c = lim M+N MN P ψ2/ψ2 0 < 1 c = 0. (19)

With Assumption 4, Theorem 2 and 3 continue to hold under the random scheme, as asserted by the next theorem.

Theorem 4. Consider model (4) with random λ, A and B. Under Assumptions 1, 2 and 4, it holds that

min (m,n) W E[ICκ(m, n)] E[ICκ(m0, n0)] 2M+N ln[1 + (λ0/σ)2ψ2 0] (1 + o(1)).

Furthermore, the consistency (18) holds.

Assumption 4 is formulated to single out the minimal condition required for the consistency under the random scheme. There is no speciﬁc distributional assumptions imposed on A and B. In the rest of this section, we demonstrate that how it can be satisﬁed under normality.

Example 1. Consider model (4). Suppose that

λA B = λ0 A B 2(M+N)/2 , (20)

where λ0 is deterministic, and A and B are independent, and both consisting of IID standard normal entries. In order to fulﬁll the identiﬁability condition (5), we let A = A/ A F , B = B/ B F , and λ = λ0 A F B F /2(M+N)/2. Also assume that A and B are both independent with E. For this example, the signal-to-noise ratio becomes

E λA B 2 F E σE/2(M+N)/2 2 F = λ2 0 σ2 .

Ko PA: Automated Kronecker Product Approximation

Recall that φ is deﬁned as the upper bound of Rm,n[A B] S over all wrong conﬁgurations. Only when the true conﬁgurations (m0, n0) is used, the rearrangement Rm0,n0[A B] has the simple structure of a rank-1 matrix. Under a wrong conﬁguration Rm,n[A B] no longer takes any special form. Nevertheless, the following lemma characterizes how the spectral norm of Rm,n[A B] depends on further rearrangements of both A and B. It is a property of the Kronecker products and the KPD (1), so we present it in the general form, without referring to any true conﬁguration.

Lemma 1. Let A be a 2m 2n matrix and B be a 2m 2n matrix. Then for any m , n Z, 0 m M, 0 n N,

Rm ,n [A B] S = Rm m ,n n [A] S R(m m)+,(n n)+[B] S

Applying Lemma 1 to Example 1 leads to the following corollary.

Corollary 2. For Example 1, under Assumption 1, it holds that

max (m,n) W Rm,n[A B] S = 1

And as a consequence, Assumption 4 holds with the λ0 in (20) and ψ2 0 = 1/2.

5. Multi-term Kronecker Product Models

In this section, we extend the one-term Kronecker product model in (4) to the following K-term Ko PA model.

k=1 λk Ak Bk + σ 2(M+N)/2 E, (21)

where λ1 > λ2 > > λK > 0 and Ak R2m0 2n0, Bk R2m 0 2n 0, k = 1, , K satisfy the following orthonormal condition:

tr(Ak A l) = tr(Bk B l) = δkl :=

( 1 if k = l, 0 if k = l.

The orthonormal condition and the assumption λ1 > λ2 > > λk > 0 implies the identiﬁability: Ak and Bk are identiﬁed up to sign changes, see Section 2.1. Note that the K terms in model (21) have the same conﬁguration (m0, n0). Therefore, although multiple terms are present, there is only one conﬁguration to be determined.

Remark 9. For the multi-term model (21), both the conﬁguration (m0, n0) and the number of terms K need to be determined from the data. We propose to select the conﬁguration ﬁrst. Once the conﬁguration is selected as ( ˆm, ˆn), the rearranged R ˆm,ˆn(Y ) becomes the sum of a rank K matrix and a noise matrix, and the determination of K turns into the rank selection based on R ˆm,ˆn(Y ), and existing methods in low rank approximation (Bai, 2003; Ahn and Horenstein, 2013) can be applied. In this paper, we follow this two-step procedure and focus on the choice of the conﬁguration for model (21). We also remark that it is of

Cai, Chen and Xiao

interest to study the joint selection of the conﬁguration and the number of terms, which we shall leave for future study.

To select the conﬁguration, we propose to use the same procedure in Section 3, that is, for any candidate conﬁguration (m, n) C, although Y is generated from the multi-term model (21), we nonetheless still calculate the information criterion (9) by ﬁtting the one-term Kronecker product model (4) to Y . This approach avoids the need of the determination of the number of Kronecker product terms when seeking the correct conﬁguration. It allows the separation of the two. Similar to Eq.(13), for each term Ak Bk, deﬁne

φk = max (m,n) W Rm,n[Ak Bk] S. (22)

Under the true conﬁguration, we have P k λk Rm0,n0(Ak Bk) S = λ1 given the orthonormality assumption. For any wrong conﬁguration, we have the following upper bound

max (m,n) W

k=1 λk Ak Bk

k=1 λk max (m,n) W Rm,n[Ak Bk] S =

k=1 λkφk =: λ1 φ,

where φ := λ 1 1 PK k=1 λkφk. When φ < 1, there exists a strict representation gap

ψ2 = 1 φ2 (23)

between the true conﬁguration and wrong conﬁgurations. Therefore, one can identify the true conﬁguration by minimizing the information criterion (11) over C. The theoretical results in Section 4 can be adapted immediately. Corollary 3 is a direct extension of Theorems 2 and 3. We skip the proof here.

Corollary 3 (Extension of Theorem 3). If ψ2 > 0, Theorem 2 and Theorem 3 continue to hold for the multi-term model (21), by replacing the signal λ and the representation gap ψ with λ1 and ψ respectively.

The bound φ is obtained by direct applications of the triangular inequality and may not be sharp. The resulted condition on the representation gap in (23) is therefore very strong.On the other hand, since φ is only attained when the singular spaces of Rm,n[Ak Bk] are the same for some wrong conﬁguration (m, n), one can further sharpen the upper bound for max(m,n) W Rm,n[P k λk Ak Bk] S when the singular spaces of Rm,n[P k λk Ak Bk] s are not identical, leading to a larger representation gap. And the more diﬀerent these singular spaces are, the larger the gap is. We will discuss a reﬁned result in the next subsection for the two-term models.

5.1 Analysis of the Two-Term Model

When K = 2, the multi-term model (21) becomes the two-term model

Y = λ1A1 B1 + λ2A2 B2 + σ 2(M+N)/2 E. (24)

Ko PA: Automated Kronecker Product Approximation

Again, we use the same conﬁguration selection procedure in Section 3, that is, we calculate the information criterion (9) by ﬁtting the one-term Kronecker product model (4) to Y . In this case, the estimated ˆλ used in the information criterion (11) is

ˆλ = Rm,n[Y ] S = λ1Rm,n[A1 B2] + λ2Rm,n[A2 B2] + σ2 (M+N)/2Rm,n[E] S. (25)

Note that under the true conﬁguration, we have ˆλ λ1. To bound ˆλ under wrong conﬁgurations, we deﬁne

φ1 = max (m,n) W Rm,n[A1 B1] S, φ2 = max (m,n) W Rm,n[A2 B2] S,

and the representation gaps

ψ2 1 := 1 φ2 1, ψ2 2 := 1 φ2 2.

Even though vec(A1) and vec(A2) are orthogonal according to the model assumption, the column spaces of Rm,n[A1 B1] and Rm,n[A2 B2] are not necessarily orthogonal. In the worst case when Rm,n[A1 B1] and Rm,n[A2 B2] have the same column space and the same row space, then ˆλ in (25) can be close to λ1φ1 + λ2φ2, which may exceed λ1. Therefore, we need to bound the distance between the column (and row) spaces of Rm,n[A1 B1] and Rm,n[A2 B2]. For this purpose, we make use of the principal angles between linear subspaces. Speciﬁcally, if M1 and M2 are two matrices of the same number of rows, the smallest principal angle between their column spaces, denote by Θ(M1, M2), is deﬁned as

cos Θ(M1, M2) = sup u1 =0,u2 =0

u 1M 1M2u2 M1u1 M2u2 .

We ﬁrst discuss the deterministic scheme, where Ak and Bk are non-random. In Assumption 5, θc and θr are lower bounds of the smallest possible principal angles between the column spaces and the row spaces of the two rearranged components, respectively.

Assumption 5. There exist 0 < ξ < 1 such that

max (m,n) WA cos Θ(Rm,n[A1 B1], Rm,n[A2 B2]) ξ,

and max (m,n) WB cos Θ([Rm,n[A1 B1]] , [Rm,n[A2 B2]] ) ξ,

WA = {(m, n) W : m + n m + n }, WB = {(m, n) W : m + n < m + n }.

Remark 10. The assumption may look unintuitive at ﬁrst sight, since it might be thought that the matrices Ai Bi, after the rearrangement under wrong conﬁgurations, are in general full rank. This is, however, not true, in view of Lemma 1, most easily seen when the wrong conﬁguration (m, n) is nested with the true one (m0, n0) in the sense m n0 and n n0. On the other hand, the conditions in Assumption 5 are given separately over

Cai, Chen and Xiao

WA and WB. In each of them, the matrices involved have more rows than columns, and the condition is on the corresponding column spaces. The following lemma provides an upper bound of the spectral norm of a sum of two matrices. It utilizes the principal angles between the column and row spaces to make the bound sharper than the one given by the triangular inequality. Assumption 5 enables us to apply Lemma 2 to bound ˆλ in (25).

Lemma 2. Suppose M1 and M2 are two matrices of the same dimension. Let M1 S = µ, M2 S = ν. Denote the principle angles between the column spaces and the row spaces as θ = Θ(M1, M2), η = Θ(M 1, M 2), respectively. Then

M1 + M2 2 S Λ2(µ, ν, θ, η),

Λ2(µ, ν, θ, η) = 1

(µ2 + ν2 + 2µν cos θ cos η)2 4µ2ν2 sin2 θ sin2 η

+ µ2 + ν2 + 2µν cos θ cos η .

Similar to Assumption 3, we assume the signal strengths λ1, λ2 and the noise level σ satisfy the following assumption.

Assumption 6. For model (24), we assume that λk and the matrices Ak, Bk, k = 1, 2 are deterministic and

lim M+N 2M+N

2m+n + 2m +n λ2 1ψ2 1 λ2 2φ2 2 2λ1λ2φ1φ2ξ σ2 + λ2 2 = (26)

lim M+N 2(M+N)/4 λ2 1ψ2 1 λ2 2φ2 2 2λ1λ2φ1φ2ξ (λ1 + λ2)σ = . (27)

The conditions (26) and (27) correspond to (15) and (16) in the one-term model. Speciﬁcally, when λ2 = 0, the two-term model reduces to one-term case, and Assumption 6 reduces to Assumption 3 as well. The main result for the two-term model is stated in Theorem 5.

Theorem 5. Consider the two-term model (24), where λk and the matrices Ak and Bk are deterministic. Suppose Assumptions 1, 2, 5 and 6 hold. If κ satisﬁes

κ 2 ln 2 and κ = o 2M+Nα 2m0+n0 + 2M+N m0 n0

then min (m,n) W E[ICκ(m, n)] E[ICκ(m0, n0)] 2M+Nα(1 + op(1)),

α = ln 1 + λ2 1ψ2 1 λ2 2φ2 2 2λ1λ2φ1φ2ξ σ2 + λ2 2

Furthermore, the consistency (18) continues to hold.

Ko PA: Automated Kronecker Product Approximation

Similar to Theorem 2, we have shown that for the two-term model, the information criterion obtained by ﬁtting a one-term model can still separate the true and wrong conﬁgurations with a gap of the order O(2M+Nα). On the other hand, comparing with Assumption 3, Theorem 5 depends on Assumption 6, which requires not only the signal-to-noise ratio (λ1/σ), but also the relative strength of the two terms (λ1/λ2) to be large enough. Comparing the two term model (24) with the one term model (i.e. λ2 = 0), we note that the information criterion gap α in Theorem 5 is smaller than the one given by Theorem 2. This phenomenon can be intuitively explained through (28). On one hand, λ2 2 contributes to the noise term when extracting the ﬁrst KPD component, since λ2 2 + σ2 appears in the denominator in (28). On the other hand, over-ﬁtting due to the second Kronecker product reduces Y ˆY 2 F under the wrong conﬁguration, which is quantiﬁed by λ2 2φ2 2+2λ1λ2φ1φ2ξ in the numerator of (28). Similar to Example 1, we consider the following example of the two term model under normality.

Example 2. Consider the two term model (24). Suppose that

λk Ak Bk = λk0 Ak Bk/2(M+N)/2, k = 1, 2,

where all of the ﬁve matrices Ak and Bk and E are independent, and each consisting of IID standard normal entries. To translate it back into the form of (24), we let Ak = Ak/ Ak F , Bk = Bk/ Bk F , and λk = λk0 Ak F Bk F /2(M+N)/2.

For Example 2, it turns out that with probabilities tending to one, ξ is close to 0 and the representation gaps ψ2 1 and ψ2 2 are close to 1/2 (due to Corollary 2). As an immediate consequence, Theorem 5 yields a information criterion gap of the size

α = ln 1 + λ2 10 λ2 20 2(σ2 + λ2 20)

However, by a reﬁned analysis of Assumption 5 under the normality of Example 2, we are able to prove the following improved result.

Corollary 4. Consider Example 2. Under Assumptions 1 and 2, Theorem 5 holds with the information criterion gap

α = ln 1 + λ2 10 2(σ2 + λ2 20)

6. Examples

We illustrate the performance of the estimation and conﬁguration selection procedure through simulation studies in Section 6.1, and image examples in Section 6.2.

6.1 Simulations

We design two simulation studies: the ﬁrst one on the performance of the estimation procedure introduced in Section 2.3, and the second one on the conﬁguration selection proposed in Section 3. Many implications of the theoretical results in Section 4 surface from the outcomes of the numerical studies.

Cai, Chen and Xiao

6.1.1 Estimation with known configuration

We ﬁrst consider the performance of the estimators of λ, A and B given in (7), when the true conﬁguration (m0, n0) is known. Throughout this subsection the simulations are based on model (4) with m0 = 5, n0 = 5, M = 10, N = 10 and σ = 1. The model (4) after the rearrangement under the true conﬁguration becomes

Rm0,n0[Y ] = λvec(A)vec(B) + σ2 (M+N)/2Rm0,n0[E],

where vec(A) R2m0+n0, vec(B) R2m 0+n 0 are unit vectors. Without loss of generality, set vec(A) = (1, 0, . . . , 0) , vec(B) = (1, 0, . . . , 0) . In this experiment, the noise level is ﬁxed at σ = 1, so the signal-to-noise ratio is controlled by λ, which takes values from the set {e1, e2, . . . , e16}. For each value of λ, we calculate the errors of the corresponding estimators ˆλ, ˆ A and ˆ B by

!2 and ln ˆ A A 2 F + ln ˆ B B 2 F .

The errors based on 20 repetitions are reported in Figure 2.

Figure 2: Boxplots for errors in ˆλ, ˆ A and ˆ B against the signal-to-noise ratio.

Figure 2 displays an interesting linear pattern, that is, as the signal-to-noise ratio in-

creases, ln ˆλ

λ 1 2 is approximately linear against ln λ with a slope around 2, and so is

the error ln( ˆ A A 2 F ˆ B B 2 F ) for the matrix estimators. We note that this pattern is consistent with Theorem 1, which asserts that

ˆλ λ 1 = Op

and ˆ A A F ˆ B B F = Op

since r0 deﬁned in Theorem 1 remains a constant here as we vary the signal strength λ in the simulation.

6.1.2 Configuration Selection

We now demonstrate the performance of the information criterion based procedure for selecting the conﬁguration. Two criteria will be considered: MSE (when κ = 0) and AIC

Ko PA: Automated Kronecker Product Approximation

(when κ = 2). Corresponding to the oneand multi-term models considered in Sections 4 and 5, we carry out two experiments respectively.

Experiment 1: One-term Ko PA model

The simulation is based on model (4). Two conﬁgurations are considered: (i) M = N = 9, m0 = 4, n0 = 4, and (ii) M = N = 10, m0 = 5, n0 = 4. Similar to Section 6.1.1, the noise level is ﬁxed at σ = 1, so the signal-to-noise ratio is controlled by λ. To control the representation gap ψ2, we construct the matrices A and B as follows:

where vec(Di), i = 1, 2, 3, 4 are independent random unit vectors such that vec(D1) and vec(D2) are orthogonal, and so are vec(D3) and vec(D4). In the experiment, ﬁve values of ϕ2 are considered: ϕ2 {0.1, 0.2, 0.3, 0.4, 0.5}. We remark that the construction above controls the representation gaps for conﬁgurations (1, 0) and (m0 +1, n0) at ϕ2 exactly, and the representation gaps for conﬁgurations with m + n {1, M + N 1} (close to trivial conﬁgurations) or |m m0| + |n n0| = 1 (close to the true conﬁguration) at roughly 0.5. Consequently, when ϕ2 = 0.1, 0.2, 0.3, 0.4, the overall representation gap ψ2 is at the desired level ϕ2 with high probabilities. But when ϕ2 = 0.5, the representation gap ψ2 can be slightly smaller than 0.5. In Figure 3, we plot the empirical frequencies of the correct conﬁguration selection, out of 100 repetitions, against the signal-to-noise ratio λ/σ. Note that the x-axis scale in Subﬁgures 3a and 3b is diﬀerent from that in 3c and 3d. The performances of both MSE (κ = 0) and AIC (κ = 2) are illustrated. BIC (κ = (M + N) ln 2) has a very similar performance to AIC, and is not reported here. For extremely weak signal-to-noise ratio λ 0.03, neither of MSE and AIC is able to select the true conﬁguration with a high probability, for both conﬁgurations. This does not contradict with Theorem 3. When the signal is very weak, larger dimensions of the observed matrix Y are required for the consistency. As the signal-to-noise ratio increases from 0.01 to 0.13, the probability that the true conﬁguration is selected increases gradually and eventually gets very close to one for AIC as shown in Figures 3a and 3b. We also note that the performance gets better as the representation gap ψ2 increases. These observations are echoing Theorem 2, which shows that AIC (with κ = 2 > 2 ln 2) only requires a minimal condition (λ/σ)2ψ2 > 0 to achieve the consistency, and the separation gap of AIC is a monotone function of (λ/σ)2ψ2. On the other hand, the performance of MSE exhibits a phase transition: it only starts to select the true conﬁguration with a decent probability when the signal-to-noise ratio λ/σ exceed a certain threshold. The theoretical asymptotic threshold for MSE is λ/σ p

1/(2ψ2) as discussed in Remark 7. For ψ2 {0.5, 0.4, 0.3, 0.2, 0.1} used in this simulation, the corresponding thresholds for λ/σ are {1, 1.12, 1.29, 1.58, 2.24}, which can be clearly visualized in Figures 3c and 3d. Comparing Figures 3a with Figures 3b, we observe that the empirical frequency curve increases from 0 to 100 much faster when the matrices are larger. This is consistent with

Cai, Chen and Xiao

(a) M = N = 9, AIC

(b) M = N = 10, AIC

(c) M = N = 9, MSE

(d) M = N = 10, MSE

Figure 3: The empirical frequencies of the correct conﬁguration selection out of 100 repetitions.

Theorem 2, which shows that the probability of correct conﬁguration selection approaches 1 exponentially fast.

Experiment 2: Two-term Ko PA model

In the second experiment, we consider the two-term Ko PA model in (24) where Ak and Bk are generated under the random scheme in Example 2 such that ψ2 1 1/2 and ψ2 2 1/2. According to Theorem 5, besides the signal-to-noise ratio λ1/σ, the relative strength of the second term λ2/λ1 (for the random scheme adopted in this experiment, see Corollary 4) aﬀects the conﬁguration selection as well.

In this simulation, we ﬁx the conﬁgurations to M = N = 9, (m0, n0) = (4, 4) and consider four diﬀerent relative strengths of the second term λ2 2/λ2 1 {0.3, 0.4, 0.5, 0.6}. Similar to Experiment 1, we report the empirical frequencies of correct conﬁgurations selection of MSE and AIC, out of 100 repetitions, as a function of the signal-to-noise ratio λ1/σ in Figure 4.

Figure 4a shows that the performance of AIC is in-sensitive to the ratio λ2 2/λ2 1 over the experimented range. To the contrary, it is seen from Figure 4b that MSE performs better when the ratio λ2 2/λ2 1 gets smaller, which is consistent with Corollary 4.

Ko PA: Automated Kronecker Product Approximation

(a) M = N = 9, AIC

(b) M = N = 9, MSE

Figure 4: The empirical frequencies of the correct conﬁguration selection out of 100 repetitions, for the two-term model.

6.2 Analysis of Image Examples

6.2.1 The cameraman s image

In this section we revisit and analyze the cameraman image introduced in Section 1. The original image, denoted by Y0, has 512 512 pixels. Each entry of Y0 is a real number between 0 and 1, where 0 codes black and 1 indicates white. The grayscale cameraman image Y0 is displayed in Figure 1. Our analysis will be based on the de-meaned version Y of the original image Y0. We demonstrate how well the image Y can be approximated by a Kronecker product or the sum of a few Kronecker products, and make comparisons with the low rank approximations given by SVD. We ﬁrst consider the conﬁguration selection by MSE, AIC and BIC on the original image Y . Figure 5 plots the heat maps for the information criterion ICκ(m, n) for all candidate conﬁgurations in the set

C = {(m, n) : 0 m, n 9} \ {(0, 0), (9, 9)},

where the top-left and bottom-right corners are always excluded from the consideration. Since darker cells correspond to smaller values of the information criteria, we see that MSE and AIC select the conﬁguration (8, 9), and BIC selects (6, 7). We also observe an overall pattern in Figure 5: conﬁgurations with larger (m, n) values are more preferable than those with smaller (m, n). Note that the Kronecker product does not commute, and with conﬁguration (m, n) the product is a 2m 2n block matrix, each block of the size 29 m 29 n. Real images usually show the locality of pixels in the sense that nearby pixels tend to have similar colors. Therefore, it can be understood that larger values of m and n are preferred, since they are better suited to capture the locality. Actually, for the cameraman s image, the conﬁguration (8, 9) accounts for 99.50% of the total variation of Y . The penalty on the number of parameters in AIC is not strong enough to oﬀset the closer approximation given by the conﬁguration (8, 9). With a stronger penalty term, BIC selects a conﬁguration that is closer to the center of the conﬁguration space, involving a much smaller number of parameters.

Cai, Chen and Xiao

Figure 5: Information Criteria for the cameraman s image. (Left) MSE (Mid) AIC (Right) BIC. Darker color corresponds to lower IC value.

From the perspective of image compressing, Ko PA is more ﬂexible than the low rank approximation, by allowing a choice of the conﬁguration, and hence a choice of the compression rate. To compare their performances, we use the ratio ˆY 2 F / Y 2 F to measure how close the approximation ˆY is to the original image Y . In Figure 6, these ratios are plotted against the numbers of parameters for the KPD, marked by + on the solid line. Since the number of parameters involved in the Kronecker product with conﬁguration (m, n) is η = 2m+n + 2M+N m n, the conﬁgurations {(m, n) : m + n = c} for any given 0 < c < M + N have the same number of parameters. Among these conﬁgurations, we only plot the one with the largest ˆY 2 F / Y 2 F . On the other hand, each cross stands for a rank-k approximation of Y , where its value on the horizontal axis is the number of parameters

j=1 (2M + 2N 2j + 1) for k = 1, . . . , 2M N.

According to Figure 6, there always exists a one-term Kronecker product which provides a better approximation of the original cameraman s image than the best low rank approximation involving the same number of parameters. We also consider denoising the images corrupted by additive Gaussian white noise

Yσ = Y + σE,

where E is a matrix with IID standard normal entries. We experiment with three levels of corruption: σ = 0.1, 0.2, 0.3. Examples of the corrupted images with diﬀerent σ are shown in Figure 7 with the values rescaled to [0, 1] for plotting. For the corrupted images, the information criteria ICκ(m, n) are calculated, and the corresponding heat maps are plotted in Figure 8. With added noise, AIC and BIC tend to select conﬁgurations in the middle of the conﬁguration space. Now we consider multi-term Kronecker approximation. Following the discussion in Section 5, for each of three corrupted images Yσ, we use the conﬁguration selected by BIC in Figure 8. Speciﬁcally, conﬁgurations (6, 6), (5, 6) and (5, 5) are selected when σ = 0.1, 0.2 and 0.3, respectively. A two-term Kronecker product model (24) is then ﬁtted under the selected conﬁguration. The ﬁtted images are plotted in the upper panel of Figure 9. Each

Ko PA: Automated Kronecker Product Approximation

Figure 6: Percentage of variance explained against the number of parameters, for Ko PA of all conﬁgurations, and for low rank approximations of all ranks.

Figure 7: Noisy cameraman s images. (Left) σ = 0.1, (Mid) σ = 0.2, (Right) σ = 0.3.

of them is compared with the image obtained by the low rank approximation involving a similar number of parameters as the two-term Ko PA. From Figure 9, it is quite evident that the details can easily be recognized from the images reconstructed by the two-term Ko PA, but can hardly be perceived in those given by the low rank approximation. Finally, we examine the reconstruction error deﬁned by

Y ˆY 2 F Y 2 F ,

where Y is the original image and ˆY is the one reconstructed from Yσ. For each of the three noisy images, we continue to use the conﬁguration selected by BIC. With ﬁxed conﬁgurations, we keep increasing the number of terms in the Ko PA until Yσ is fully ﬁtted, and plot the corresponding reconstruction error against the number of parameters in Figure 10. It has the familiar U shape, showing the trade-oﬀbetween estimation bias and variation. A similar curve is given for the low rank approximations exhausting all possible ranks. From Figure 10, it is seen that the multi-term Ko PA constantly outperforms the low rank approximation at any given number of parameters. Furthermore, the minimum

Cai, Chen and Xiao

MSE AIC BIC

Figure 8: Heat maps for three diﬀerent information criteria for the camera s images with diﬀerent noise levels. Darker color means lower IC value.

Ko PA: Automated Kronecker Product Approximation

σ = 0.1 σ = 0.2 σ = 0.3

Figure 9: The ﬁtted image given by multi-term Ko PA, and the SVD approximation with similar number of parameters.

Figure 10: Reconstruction error against the number of parameters for Ko PA and low rank approximations. The three panels from left to right correspond to σ = 0.1, σ = 0.2 and σ = 0.3 respectively.

reconstruction error that Ko PA can reach is always smaller than that given by the low rank approximation.

6.2.2 More images

To assess the performance of Ko PA model in image denoising, we repeat the experiment in Section 6.2.1 to a larger set of test images. The 10 test images printed in Figure 11 are collected from Image Processing Place1 and The Waterloo image Repository2. Each

1. http://www.imageprocessingplace.com/root ﬁles V3/image databases.htm 2. http://links.uwaterloo.ca/Repository.html

Cai, Chen and Xiao

Figure 11: List of test images.

image SVD Ko PA m SVD m Ko PA TVR boat 0.4709 0.1757 0.0853 0.0613 0.0356 cameraman 0.5446 0.1337 0.0644 0.0399 0.0294 goldhill 0.4632 0.1391 0.0759 0.0568 0.0363 jetplane 0.7347 0.1853 0.0866 0.0596 0.0302 lake 0.5425 0.1287 0.0825 0.0539 0.0308 livingroom 0.6747 0.2055 0.0995 0.0811 0.0589 mandril 0.6949 0.3557 0.1471 0.0889 0.0739 peppers 0.7394 0.1075 0.0734 0.0445 0.0224 pirate 0.7746 0.1533 0.1018 0.0686 0.0413 walkbridge 0.6617 0.2085 0.1263 0.0925 0.0593

Table 1: Reconstruction errors of one-term SVD, one-term Ko PA, multi-term SVD(m SVD), multi-term Ko PA(m Ko PA) and total variation regularization (TVR) on the ten test images.

of the 10 test images is a 512 512 gray-scaled matrix, same as the cameraman s image. We corrupt the test image with additive Gaussian noise, whose amplitude is 0.5 times the standard deviation of all its pixel values:

Yσ = Y + 0.5 std(Y ) E.

We compare ﬁve methods of denoising these images: one-term SVD and Ko PA models, multi-term SVD and Ko PA models, image denoising algorithm through total variation regularization (Chambolle, 2004). Since determining the number of terms in multi-term models is beyond the scope of this article, the numbers of terms in the multi-term models are chosen to minimize the reconstruction error. The performance of the ﬁve approaches on the ten images are reported in Table 1. For each image, the conﬁguration of the Ko PA is selected by BIC (κ = 18 ln 2). From Table 1, the Ko PA-based methods outperform SVD-based approaches, which is not surprising as SVD corresponds to a special conﬁguration in Ko PA models. On the other hand,

Ko PA: Automated Kronecker Product Approximation

the image denoising based on Ko PA (and multi-term Ko PA) is close to the TVR (total variation regularization) method but the latter does have a superior performance. We note that Ko PA and TVR are not directly comparable. Image is a special type of matrix data, whose entries usually possess certain continuity in values. TVR fully utilizes this continuity by imposing regularization on the total variation while SVD and Ko PA do not. The diﬀerence can be seen from Figure 12 as well. The TVR can recover the smooth region (the mandrill s nose) well, while the multi-term Ko PA model has more details in nonsmooth regions (the mandrill s fur and beard). Finally we remark that the performance of Ko PA approach on image analysis can possibly be improved by adding a similar penalty term on the smoothness of B.

Figure 12: (left) The mandrill image, (mid) recovered images from multi-term Ko PA model and (right) total variation regularization.

7. Conclusion and Discussions

In this article, we propose to use the Kronecker product approximation as an alternative of the low rank approximation of large matrices. Comparing with the low rank approximation, Ko PA is more ﬂexible because any conﬁguration of the Kronecker product can potentially be used, leading to diﬀerent levels of approximation and compression. To select the conﬁguration, we propose to use the extended information criterion, which includes MSE, AIC and BIC as special cases. We establish the asymptotic consistency of the conﬁguration selection procedure, and use an example with a random Kronecker product to illustrate how the technical assumptions are fulﬁlled. Extension to the multi-term Kronecker product model is also investigated. Both simulations and analysis of image examples demonstrate that Ko PA can be superior over the low rank approximations in the sense that it can give a closer approximation of the original matrix/image with a higher compression rate. We conclude with a discussion of future directions. First of all, the Kronecker product model (4) is not permutation-invariant. In other words, after a permutation of columns and rows, the signal from the matrix Y may or may not be a Kronecker product. When the columns and rows have an order in nature as in image data and in spatial-temporal data, it is not an issue. But in general, especially when the data is allowed to be shuﬄed, a pre-processing step for ordering rows and columns should be investigated before conducting Ko PA analysis. Another extension is to consider a multi-term model, where each term can have its own conﬁguration. This approach certainly allows a greater ﬂexibility, but also poses new challenges not only on the conﬁguration and order selections, but also on the

Cai, Chen and Xiao

estimation and algorithms as well. It would be ideal if a natural and interpretable procedure for the estimation, order determination, and conﬁguration selection and be developed, with theoretical guarantees. Lastly, cross validation may also be used for conﬁguration selection. Note that we are working with one single observation (the matrix Y ). It is possible to randomly remove one or a set of elements of Y , and evaluate the performance of a conﬁguration based on the prediction accuracy of these elements. This approach requires a matrix completion procedure based on Kronecker Product approximation, which can be done based on low rank matrix completion procedure on the re-arranged matrix. We are currently studying such a procedure, though its theoretical properties requires further investigation.

Acknowledgments

Chen s research is supported in part by National Science Foundation grants DMS-1737857, IIS-1741390, CCF-1934924, DMS-2027855 and DMS-2052949. Xiao s research is supported in part by National Science Foundation grant DMS-1454817, DMS-2027855, DMS-2052949 and a research grant from NEC Labs America. The authors thank an Associate Editor and two referees for their insightful comments that have led to signiﬁcant improvement of the manuscript.

Ko PA: Automated Kronecker Product Approximation

Appendix Appendix A. Proof of Theorem 1 and Corollary 1

Without loss of generality, we assume σ = 1. Noticing that

ˆλ = Rm0,n0[Y ] S = λvec(A)vec(B) + σ2 (M+N)/2Rm0,n0[E] S,

by triangular inequality, we have ˆλ λvec(A)vec(B) S σ2 (M+N)/2 Rm0,n0[E] S,

where λvec(A)vec(B) S = λ. The following bound for Rm0,n0[E] S can be obtained using the concentration inequality from Vershynin (2010),

P( Rm0,n0[E] S 2(m0+n0)/2 + 2(M+N m0 n0)/2 + t) e t2/2.

Therefore, Rm0,n0[E] S = s0 + Op(1) and

|ˆλ λ| 2 (M+N)/2(s0 + Op(1)) = r0 + Op(2 (M+N)/2),

which yields ˆλ λ = Op(r0). The bounds for ˆ A and ˆ B corresponds to the error bounds in estimating the left and right singular vectors of Rm0,n0[Y ], which is a direct consequence of the analysis in Wedin (1972) by observing that

min s= 1 ˆ A s A 2 F = min s= 1 vec( ˆ A) svec(A) 2 2 = 2 sin2 Θ(vec( ˆ A), vec(A)).

A sharper bound is provided in Cai and Zhang (2018). Since above analysis holds for any ﬁxed value of λ, Corollary 1 follows immediately.

Appendix Appendix B. Proof of Theorem 2

We ﬁrst show and prove several technical lemmas.

Lemma 3. Suppose an > 0, an 0 and xn = Op(1) is a sequence of continuous random variables with density functions pn satisfying

(i) E|xn| C for some constant C for every n,

(ii) 1 + anxn > 0 almost surely,

(iii) a 2 n supx 1/(2an) pn(x) 0,

then we have E ln (1 + anxn) = O (an) .

Proof. Let pn(xn) be the density function of xn. For the positive part, we have

0 ln(1 + ant)pn(t)dt Z +

0 antpn(t)dt an E|xn| Can.

Cai, Chen and Xiao

For the negative part, we have

1/an ln(1 + ant)pn(t)dt

= Z 1/(2an)

1/an ln(1 + ant)pn(t)dt + Z 0

1/(2an) ln(1 + ant)pn(t)dt

sup t 1/(2an) pn(t)

# Z 1/(2an)

1/an ln(1 + ant)dt + Z 0

1/(2an) 2antpn(t)dt

2an sup t< 1/(2an) pn(t) + 2an

o(an) 2Can.

Hence, E ln(1 + anxn) = E+ + E = O(an).

The conditions in Lemma 3 are easy to verify in the subsequent proofs. Condition (ii) ensures the logarithm is well-deﬁned on the whole support. Condition (i) is satisﬁed when xn converges in mean to a random variable x with ﬁnite expectation. Condition (iii) is controlling the left tails of the densities, and is easily fulﬁlled if they are exponential.

Lemma 4. Let X be an arbitrary P Q real matrix with P Q and E be a P Q matrix with IID standard Gaussian entries. Then we have

E X + E 2 S X 2 S + (

Q)2 + 4 X S

Q) + 2 =: U2.

Furthermore, the departure from the expectation is sub-Gaussian such that for any positive t, we have P[ X + E S U + t] e t2/2.

Proof. Without loss of generality, we assume X = [X1, X2], where X1 RP P is a diagonal matrix and X2 RP (Q P) is zero. Such a form of X can always be achieved by multiplying X and E from left and right by orthogonal matrices, without changing the distribution of E. Similarly, we partition E into [E1, E2] with E1 RP P and E2 RP (Q P). Then

X + E 2 S = sup u RP , u =1 u (X + E)(X + E) u

= sup u RP , u =1 u XX u + u EEu + 2u XE u

X 2 S + E 2 S + 2 X S E1 S

According to Vershynin (2010), we have E E1 S 2

Q + t] e t2/2.

Ko PA: Automated Kronecker Product Approximation

E E 2 S = Z

t=0 P[ E S > t]2tdt (

Hence, we have

E X + E 2 S X 2 S + (

Q)2 + 4 X S

Q) + 2 =: U2.

Since for any ﬁxed X, X +E S is a function of E with Lipschitz norm 1, by concentration inequality, for any positive t, we have

P[ X + E S U + t] e t2/2.

We rewrite the information criterion as

ICκ(m, n) = D h ln Y ˆY (m,n) 2 F + κr2 m,n 2κD 1/2i ,

where D = 2M+N and rm,n = 2 (m+n)/2 + 2 (m +n )/2. The constant term 2κD 1/2 is irrelevant to the conﬁguration (m, n) and is therefore ignored in subsequent proofs. Without loss of generality, we deﬁne the following expected information criterion

EICκ(m, n) = D h E ln Y ˆY (m,n) 2 F + κr2 m,n i

for simplicity. The diﬀerence in expected information criterion between wrong conﬁgurations and the true conﬁguration is of central interest, so we deﬁne

EICκ(m, n) = EICκ(m, n) EICκ(m0, n0)

Under the true conﬁguration (m0, n0), we have

E Y ˆY (m,n) 2 F E Y λA B 2 F = σ2D 1E E 2 F = σ2.

Therefore, we have

EICκ(m0, n0) D h ln E Y ˆY (m,n) 2 F + κr2 0 i D ln σ2 + κr2 0 , (29)

where r0 = rm0,n0. Deﬁne

ˆλ(m,n) := Rm,n[Y ] S = λRm,n[A B] + σD 1/2Rm,n[E] S. (30)

To calculate the information criterion for wrong conﬁgurations, we use the following equality

Y ˆY (m,n) 2 F = Y 2 F h ˆλ(m,n)i2 .

Cai, Chen and Xiao

Notice that

Y 2 F = λA B 2 F + σ2D 1 E 2 F + 2λσD 1/2tr[(A B)E ],

where λA B 2 F = λ2, σ2D 1 E 2 F = σ2(1 + Op(D 1/2)) and tr[(A B)E ] follows a standard normal distribution. We have

Y 2 F = λ2 + σ2 + R1, (31)

where R1 = Op (σ2 + λσ)D 1/2 .

For wrong conﬁgurations (m, n) W, without loss of generality, we assume m + n (M + N)/2. According to Lemma 4, we have the upper bound for (30):

[ˆλ(m,n)]2 λ2φ2 + σ2r2 m,n + 4λφσ2(m+n)/2D 1/2 + Op((λσ + σ2)D 1/2)

λ2φ2 + σ2r2 m,n + 4λσD 1/4 + Op((λσ + σ2)D 1/2). (32)

Y ˆY (m,n) 2 S λ2(1 φ2) + σ2(1 r2 m,n) 4λσD 1/4 + Op((λσ + σ2)D 1/2).

The last two terms are minor terms by Assumption 3. Therefore,

EICκ(m, n) D

ln(λ2ψ2 + σ2(1 r2 m,n)) O

Here Lemma 3 is applied since the stochastic term in (32) has an exponential tail bound. Notice that EICκ(m, n) in (33) is either a monotone increasing function or a uni-modal function of r2 m,n on [1/2, 4D1/2]. Therefore, the minimum of the right hand side of (33) is obtained on the boundary. When r2 m,n = 1/2, (33) becomes

EICκ(m, n) D

ln(λ2ψ2 + σ2/2) O

When r2 m,n = 4D 1/2, (33) becomes

EICκ(m, n) D

ln(λ2ψ2 + σ2) O

In conclusion, for any wrong conﬁguration (m, n) W, we have

EICκ(m, n) D

α = ln 1 + λ2ψ2

Ko PA: Automated Kronecker Product Approximation

When κ 2 ln 2, α takes the ﬁrst value in the preceding equation. The assumptions imposed in Theorem 2 ensure the leading term α in (36) dominates other terms so that the minimum of EIC over the wrong conﬁgurations is strictly positive.

We now address Remark 7. It turns out possible to use only the MSE to select the conﬁguration, which corresponds to κ = 0. It requires a stronger signal-to-noise ratio λ2ψ2/σ2 > 1/2 so that the leading term α in (36) is positive, and hence Theorem 2 continues to hold.

Remark 11. Note that the upper bound used in (32) is quite conservative, because the maximums of φ and 2(m+n)/2 over W are taken separately. It leads to a simple form of Assumption 3, which is actually not as optimal as possible. If we deﬁne φ(m,n) = Rm,n[A B] S, then the condition (16) in Assumption 3 can be relaxed to

lim M+N inf (m,n) W(2(m+n)/2 + 2(m +n )/2) λ

σ 1 [φ(m,n)]2

However, in the main text we choose to introduce the concept of representation gap and present a simple version of Assumption 3.

Appendix Appendix C. Proof of Theorem 3

We begin with the tail bounds for E 2 F . According to the tail bounds for χ2 random variable given in Laurent and Massart (2000), it holds that for any t > 0,

P h D 1 E 2 F > 1 +

2D 1/2t + D 1t2i e t2/2, (37)

P h D 1 E 2 F < 1

2D 1/2t i e t2/2, (38)

where D = 2M+N. Therefore, at the true conﬁguration (m0, n0), we have

P h Y ˆY (m0,n0) 2 F > σ2 +

2σ2D 1/2t + σ2D 1t2i

P h σD 1/2E 2 F > σ2 +

2σ2D 1/2t + σ2D 1t2i

e t2/2. (39)

Noticing that Y 2 F = λ2 + σ2D 1 E 2 F + 2λσD 1/2Z,

where Z = tr[(A B)E ] is a standard Gaussian random variable, by (38) we have

P h Y 2 F < λ2 + σ2 (

2σ2 + 2λσ)D 1/2t i 2e t2/2. (40)

Now we consider the tail bound for ˆλ(m,n) of wrong conﬁgurations. According to Lemma 4, we have the tail bound for ˆλ(m,n) as

P[ˆλ(m,n) U + σD 1/2t] e t2/2, (41)

Cai, Chen and Xiao

U2 = λ2φ2 + σ2r2 m,n + 4λφσ2(m+n)/2D 1/2 +

2πσ2rm,n D 1/2 + 2σ2D 1 < (λ + σ)2.

Let α = ln(1 + (λ/σ)2ψ2) be the positive gap constant. We have

P [ICκ(m0, n0) > EICκ(m0, n0) + Dα/3]

=P h Y ˆY (m0,n0) 2 F > σ2eα/3i

exp c2 1D/2 , (42)

where c2 1 = eα/3 1.

For any (m, n) W, it holds that

P [ICκ(m, n) < EICκ(m0, n0) + Dα/3]

P [ICκ(m, n) < EICκ(m, n) Dα/3]

P h Y 2 F ˆλ2 < λ2 + σ2 λ2φ2 2h i

P Y 2 F < λ2 + σ2 h + P h ˆλ2 > U2 + h i

2 exp c2 2D/2 + exp c2 3D/2 (43)

where we use (40) and (41) to obtain (43),

1 e α/3 (λ2(1 φ2) + σ2), c2 = h

and c3 is the solution of σ2c2 3 + 2(λ + σ)σc3 = h.

We conclude that

P ICκ(m0, n0) min (m,n) W ICκ(m, n)

(m,n) W P[ICκ(m0, n0) ICκ(m, n)]

P[ICκ(m0, n0) EICκ(m0, n0) + Dα/3]

+ P[ICκ(m, n) EICκ(m0, n0) + Dα/3]

4(M + 1)(N + 1) exp c2D/2 0, (44)

where c = min{c1, c2, c3}. By calculating the orders of c1, c2, c3, it holds that

Ko PA: Automated Kronecker Product Approximation

Speciﬁcally, if α 0 (or equivalently, (λ/σ)2ψ2 0), we have

σ2 ψ2 (λ2/σ2)2

(1 + λ/σ)2 ψ4

The right hand side is much greater than ln(MN), under Assumptions 1 and 3.

Appendix Appendix D. Proof of Theorem 4

The proof is very similar to the proofs of Theorem 2 and Theorem 3, so we only point out the major steps, but omit the details. Condition (19) implies that λ2 = λ2 0(1 + op(1)) and ψ2 = ψ2 0(1 + op(1)). The proof of Theorem 2 follows immediately by replacing λ2 and ψ2

with the deterministic values λ2 0 and ψ2 0, except that an op(λ2 0 + ψ2 0) term is added to (30). Since the additional stochastic term is negligible and has ﬁnite expectation, Theorem 2 continues to hold. The consistency follows same lines as those of Theorem 3 except that the deviations λ2 λ2 0 and ψ2 ψ2 0 should be incorporated into (44). Speciﬁcally, Assumption 4 implies that for any small constant δ, with probability larger than 1 o(1/(MN)), we have λ2 λ2 0(1 δ) and ψ2 ψ2 0(1 δ). Proof of Theorem 3 follows immediately by replacing λ2 and ψ2 with λ2 0(1 δ) and ψ2 0(1 δ). The following probability of exceptions should be added to (44).

(M + 1)(N + 1) P[λ2 < λ2 0(1 δ)] + P[ψ2 < ψ2 0(1 δ)] = o(1),

which does not aﬀect consistency but may reduce the convergence rate.

Appendix Appendix E. Proof of Lemma 1 and Corollary 2

Consider the complete Kronecker product decomposition of A with respect to the conﬁguration (m m , n n , (m m )+, (n n )+):

i=1 µi Ci Di, (45)

where I = 2m m +n n 2(m m )++(n n )+, µ1 µ2 µI are the coeﬃcients in decreasing order. Ci and Di satisfy

Ci, Cj = Di, Dj = δi,j, (46)

where δi,j is the Kronecker delta function such that δi,j = 1 if and only if i = j and δi,j = 0 otherwise, and A, B := tr[A B] is the trace inner product. Notice that the decomposition in (45) corresponds to the singular value decomposition for Rm m ,n n [A]. Therefore, the singular values µ1, . . . , µI are uniquely identiﬁable and the components Ci, Di are identiﬁable if the singular values are distinct. In particular,

µ1 = Rm m ,n n [A] S.

Similarly, the KPD of B with the conﬁguration ((m m)+, (n n)+, M m m , N n n ) is given by

j=1 νj Fj Gj,

Cai, Chen and Xiao

where J = 2(m m)++(n n)+ 2M+N m m n n and

ν1 = R(m m)+,(n n)+[B] S.

With the two KPD of A and B, we can rewrite A B as

i=1 µi Ci Di

j=1 νj Fj Gj

j=1 µiνj Ci Di Fj Gj.

Notice that the Kronecker product satisﬁes distributive law and associative law. The matrix Di is 2(m m )+ 2(n n )+ and the matrix Fj is 2(m m)+ 2(n n)+. For all possible values of m, m , n, n , either one of Di and Fj is a scalar, or they are both vectors; and for both cases Di Fj = Fj Di. Therefore,

j=1 µiνj Ci Fj Di Gj =

j=1 µiνj Pij Qij, (47)

where Pij := Ci Fj, Qij := Di Gj.

Notice that Pij is a 2m 2n matrix and Qij is a 2M m 2N n matrix. Therefore, (47) is a KPD of A B indexed by (i, j) with respect to the Kronecker conﬁguration (m , n , M m , N n ) as long as Pij and Qij satisfy the orthonormal condition in (46). In fact,

Pij, Pkl = tr[P ij Pkl]

= tr[(Ci Fj) (Dk Gl)]

= tr[(C i Dk) (F j Gl)]

= tr[C i Dk]tr[F j Gl]

= δi,jδk,l,

and similar results hold for Qij. It follows that

Rm ,n [A B] S = max i,j µiνj = µ1ν1 = Rm m ,n n [A] S R(m m)+,(n n)+[B] S,

and the proof of Lemma 1 is complete.

Now we consider Corollary 2. When A and B are generated as in Example 1, we have

Rm m ,n n [ A] S 2(m m +n n )/2 + 2((m m )++(n n )+)/2 + Op(1),

R(m m)+,(n n)+[ B] S 2((m m)++(n n)+)/2 + 2(M+N m m n n )/2 + Op(1),

A F B F = 2(M+N)/2(1 + Op(r0)).

Rm ,n [A B] S = Rm ,n [ A B] S A F B F 2 (m +n )/2 + 2 (M+N m n )/2

+ 2 (|m m |+|n n |)/2 + 2 (M+N |m m | |n n |)/2 + op(1).

Ko PA: Automated Kronecker Product Approximation

The maximum of the right hand side is obtained when |m m | + |n n | = 1, or m + n {1, M + N 1}, for which

Rm ,n [A B] S 1/

Furthermore, it is straightforward to verify that the upper bound is attained when m +n {1, M + N 1}, which leads to Corollary 2.

Appendix Appendix F. Proof of Lemma 2

We ﬁrst prove the following technical lemma.

Lemma 5. Let U, V be two vector subspaces of Rn with Θ(U, V ) = θ [0, π/2], where Θ(U, V ) denotes the smallest principal angle between U and V . Suppose w Rn is a unit vector and PUw = cos α,

for some α [0, π/2], where PU denotes the orthogonal projection to the space U. Then it holds that

( cos(θ α) if α θ, 1 if α > θ.

u = PUw PUw ,

then u = 1 and u U. Let {u1, u2, . . . , un} be an orthogonal basis of Rn such that u1 = u. For any vector v V , we have

v w = v n X

= v u1u 1w +

i=2 v uiu iw

= cos η cos α + sin η sin α

= cos(η α),

where v u1 = cos η. The proof is complete by noting that cos η = v u1 cos θ.

We now prove Lemma 2. Proof of Lemma 2. Recall that M1 and M2 are of the same dimension. We consider the maximization of (M1 + M2)u 2 over all unit vectors u. First write

(M1 + M2)u 2 = M1u + M2u 2

= M1PM 1u + M2PM 2u 2

= M1PM 1u 2 + M2PM 2u 2 + 2(M1PM 1u) M2PM 2u,

Cai, Chen and Xiao

where PM denotes the projection matrix to the column space of M. Since M1 S = µ and M2 S = ν, we have

M1PM 1u 2 µ2 PM 1u 2 and M2PM 2u 2 ν2 PM 2u 2.

Since M1PM 1u span(M1) and M2PM 2u span(M2), it holds that

(M1PM 1u) M2PM 2u cos θµν PM 1u PM 2u .

It follows that

(M1 + M2)u 2 µ2 PM 1u 2 + ν2 PM 2u 2 + 2µν PM 1u PM 2u cos θ.

Suppose PM 1u = cos α for some α [0, π/2]. If α > η, then PM 2u 1. The right hand side of the preceding inequality attains its maximum when PM 1u = cos η and PM 2u = 1. Hence, we only consider the case α η, which implies that PM 2u cos(η α), and

(M1 + M2)u 2 µ2 cos2 α + ν2 cos2(η α) + 2µν cos θ cos α cos(η α).

µ2 cos2 α + ν2 cos2(η α) + 2µν cos θ cos α cos(η α)

2µ2(1 + cos 2α) + 1

2ν2(1 + cos(2η 2α)) + µν cos θ[cos η + cos(η 2α)]

2(µ2 + ν2 + 2µν cos θ cos η)

2ν2 cos(2η) + µν cos θ cos η cos(2α) + 1

2ν2 sin(2η) + µν cos θ sin η sin(2α)

2(µ2 + ν2 + 2µν cos θ cos η)

2ν2 cos(2η) + µν cos θ cos η 2 + 1

2ν2 sin(2η) + µν cos θ sin η 2

µ2 + ν2 + 2µν cos θ cos η + q

(µ2 + ν2 + 2µν cos θ cos η)2 4µ2ν2 sin2 θ sin2 η .

The proof is complete.

Appendix Appendix G. Proofs of Theorem 5 and Corollary 4

The proof of Theorem 5 is similar to the proofs of Theorem 2 and Theorem 3, so we only point out the main steps here and omit the details. Following the same argument as in the proof of Theorem 2, the expected information criteria of the true conﬁguration is

EICκ(m0, n0) = D ln λ2 2 + σ2 + κr2 0 .

For a wrong conﬁguration (m, n) W, ˆλ(m,n) is obtained by

ˆλ(m,n) = λ1R[A1 B1] + λ2R[A2 B2] + σD 1/2R[E] S.

Ko PA: Automated Kronecker Product Approximation

According to Lemma 2 and Assumption 5, we have

λ1R[A1 B1] + λ2R[A2 B2] 2 S λ2 1φ2 1 + λ2 2φ2 2 + 2λ1λ2φ1φ2ξ < (λ1 + λ2)2. (48)

By Lemma 4, we have

[ˆλ(m,n)]2 λ2 1φ2 1 + λ2 2φ2 2 + 2λ1λ2φ1φ2ξ + σ2r2 m,n

+ O((λ1 + λ2)σD 1/4) + Op (λ1 + λ2 + σ)σD 1/2 . (49)

With (49) replacing (32), the rest of the proof follows the same line of the proof of Theorem 2. The proof of consistency is same as in the proof of Theorem 3 except that the formula of ˆλ(m,n) in (49) is used in (43).

We now prove Corollary 4. When model (24) is generated under the random scheme in Example 2, we only consider the wrong conﬁguration close to the true conﬁguration. It can be veriﬁed that the separation EIC(m, n) is larger at other conﬁgurations. Consider (m, n) such that |m0 m| + |n0 n| = 1. Then from Corollary 2, we have

2 + Op(r0), φ2 = 1

2 + Op(r0).

Now consider the principle angles between R[A1 B1] and R[A2 B2 as in Lemma 2, We have cos θ = Op(2 (m+n)), cos η = Op(2 (m +n )).

By Lemma 2, (48) can be revised to

λ1R[A1 B1] + λ2R[A2 B2] 2 S λ2 1 2 + Op(λ2 1r0).

Corollary 4 follows immediately.

Appendix Appendix H. Additional Simulation with Diﬀerent Noise Distributions

In this section, we examine the performance of conﬁguration selection under diﬀerent distributions of the noise matrix E. We replicate the simulation in Experiment 1 in Section 6.1.2 with M = N = 9, replacing the the normal distribution of the noise by (1) Unif[ 1, 1] and (2) Student s t4 distribution with degrees of freedom 4, both normalized to have unit variance. The uniform distribution is an example of the sub-Gaussian case, whose concentration inequality of the spectral norm is provided by, for example, Proposition 2.4 of Rudelson and Vershynin (2010). The t4 is an example of tails heavier than the Gaussian distribution. We plot the number of correct conﬁguration selections out of 100 replications for diﬀerent noise distributions in Figure 13. There is no substantial diﬀerence between Gaussian and uniform cases. The t4 noise appears to require higher signal-to-noise ratios than Gaussian noise due to its heavier tail. But the phase transition of correctly selecting the conﬁguration continues to exist.

Cai, Chen and Xiao

(a) AIC, Gaussian

(b) AIC, Uniform

(d) MSE, Gaussian

(e) MSE, Uniform

Figure 13: The empirical frequencies of the correct conﬁguration selection out of 100 repetitions under AIC and MSE for diﬀerent noise distributions with dimensions M = N = 9.

Seung C. Ahn and Alex R. Horenstein. Eigenvalue ratio test for the number of factors. Econometrica, 81(3):1203 1227, 2013.

Hirotogu Akaike. Information Theory and an Extension of the Maximum Likelihood Principle, pages 199 213. Springer New York, New York, NY, 1998.

Jushan Bai. Inferential theory for factor models of large dimensions. Econometrica, 71(1): 135 171, 2003.

J.W. Belliveau, D.N. Kennedy, R.C. Mc Kinstry, B.R. Buchbinder, R.M. Weisskoﬀ, M.S. Cohen, J.M. Vevea, T.J. Brady, and B.R. Rosen. Functional mapping of the human visual cortex by magnetic resonance imaging. Science, 254(5032):716 719, 1991.

Vicki Bruce and Andy Young. Understanding face recognition. British journal of psychology, 77(3):305 327, 1986.

Jian-Feng Cai, Emmanuel J Cand es, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on optimization, 20(4):1956 1982, 2010.

T. Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60 89, 2018.

Emmanuel J. Candes and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925 936, 2010.

Emmanuel J. Cand es and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717, 2009.

Ko PA: Automated Kronecker Product Approximation

Antonin Chambolle. An algorithm for total variation minimization and applications. Journal of Mathematical imaging and vision, 20(1-2):89 97, 2004.

Priyam Chatterjee and Peyman Milanfar. Patch-based near-optimal image denoising. IEEE Transactions on Image Processing, 21(4):1635 1649, 2011.

Jiahua Chen and Zehua Chen. Extended bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759 771, 2008.

Rong Chen, Dan Yang, and Cun-Hui Zhang. Factor models for high-dimensional tensor time series. Journal of the American Statistical Association, 117(537):94 116, 2022.

Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-d transform-domain collaborative ﬁltering. IEEE Transactions on image processing, 16(8):2080 2095, 2007.

Tugrul Dayar. Analyzing Markov chains using Kronecker products: theory and applications. Springer Science & Business Media, 2012.

Marco F. Duarte and Richard G. Baraniuk. Kronecker compressive sensing. IEEE Transactions on Image Processing, 21(2):494 504, Feb 2012.

Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211 218, 9 1936.

Jianqing Fan, Yuan Liao, and Martina Mincheva. High dimensional covariance matrix estimation in approximate factor models. Annals of statistics, 39(6):3320, 2011.

Rina Foygel and Mathias Drton. Extended bayesian information criteria for gaussian graphical models. In Advances in neural information processing systems, pages 604 612, 2010.

Anna Goldenberg, Alice X. Zheng, Stephen E. Fienberg, and Edoardo M. Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, 2(2):129 233, 2010.

Qiang Guo, Caiming Zhang, Yunfeng Zhang, and Hui Liu. An eﬃcient SVD-based method for image denoising. IEEE transactions on Circuits and Systems for Video Technology, 26(5):868 880, 2015.

Roger A. Horn and Charles R. Johnson. Topics in matrix analysis. Cambridge University Press, Cambridge, 37:39, 1991.

Julie Kamm and James G. Nagy. Kronecker product and SVD approximations in image restoration. Linear Algebra and its Applications, 284(1):177 192, 1998. International Linear Algebra Society (ILAS) Symposium on Fast Algorithms for Control, Signals and Image Processing.

Phillip Kaye, Raymond Laﬂamme, and Michele Mosca. An introduction to quantum computing. Oxford University Press, 2007.

Cai, Chen and Xiao

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302 1338, 2000.

Can M. Le, Elizaveta Levina, and Roman Vershynin. Optimization via low-rank approximation for community detection in networks. The Annals of Statistics, 44(1):373 400, 2016.

Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research, 11(Feb):985 1042, 2010.

Joseph A. Maldjian, Paul J. Laurienti, Robert A. Kraft, and Jonathan H. Burdette. An automated method for neuroanatomic and cytoarchitectonic atlas-based interrogation of f MRI data sets. Neuroimage, 19(3):1233 1239, 2003.

Victor Ng, Robert F. Engle, and Michael Rothschild. A multi-dynamic-factor model for stock returns. Journal of Econometrics, 52(1-2):245 266, 1992.

Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In BMVC 2015 - Proceedings of the British Machine Vision Conference 2015, pages 1 12. British Machine Vision Association, 2015.

Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathematicians 2010 (ICM 2010) (In 4 Volumes) Vol. I: Plenary Lectures and Ceremonies Vols. II IV: Invited Lectures, pages 1576 1602. World Scientiﬁc, 2010.

Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2): 461 464, 03 1978.

Amit Singer. Angular synchronization by eigenvectors and semideﬁnite programming. Applied and computational harmonic analysis, 30(1):20 36, 2011.

Matthew A. Turk and Alex P. Pentland. Face recognition using eigenfaces. In Proceedings. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 586 591. IEEE, 1991.

Charles F. Van Loan and Nikos Pitsianis. Approximation with kronecker products. In Linear algebra for large scale and real-time applications, pages 293 314. Springer, 1993.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, 2010.

Per Ake Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99 111, 1972.

Karl Werner, Magnus Jansson, and Petre Stoica. On estimation of covariance matrices with kronecker product structure. IEEE Transactions on Signal Processing, 56(2):478 491, Feb 2008.