Journal of Machine Learning Research 23 (2022) 1-44. Submitted 8/20; Revised 7/22; Published 8/22.

KoPA: Automated Kronecker Product Approximation

Chencheng Cai (chencheng.cai@temple.edu), Department of Statistics, Operations and Data Science, Fox School of Business, Temple University, Philadelphia, PA 19122, USA

Rong Chen (rongchen@stat.rutgers.edu), Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA

Han Xiao (hxiao@stat.rutgers.edu), Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA

Editor: David Wipf

Abstract

We consider the problem of matrix approximation and denoising induced by the Kronecker product decomposition. Specifically, we propose to approximate a given matrix by the sum of a few Kronecker products of matrices, which we refer to as the Kronecker product approximation (KoPA). Because the Kronecker product is an extension of the outer product from vectors to matrices, KoPA extends the low rank matrix approximation, and includes it as a special case. Compared with the latter, KoPA also offers greater flexibility, since it allows the user to choose the configuration, namely the dimensions of the two smaller matrices forming the Kronecker product. On the other hand, the configuration to be used is usually unknown, and needs to be determined from the data in order to achieve the optimal balance between accuracy and parsimony. We propose to use extended information criteria to select the configuration. Under the paradigm of high dimensional analysis, we show that the proposed procedure is able to select the true configuration with probability tending to one, under suitable conditions on the signal-to-noise ratio. We demonstrate the superiority of KoPA over the low rank approximations through numerical studies, and several benchmark image examples.

Keywords: Information Criterion, Kronecker Product, Low Rank Approximation, Matrix Decomposition, Random Matrix

© 2022 Chencheng Cai, Rong Chen and Han Xiao. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/20-931.html.

1. Introduction

Observations that are matrix/tensor valued are commonly seen in various scientific fields and social studies. In recent years, advances in technology have made high dimensional matrix/tensor type data possible and increasingly prevalent. Examples include high resolution images in face recognition and motion detection (Turk and Pentland, 1991; Bruce and Young, 1986; Parkhi et al., 2015), brain images through fMRI (Belliveau et al., 1991; Maldjian et al., 2003), adjacency matrices of social networks of millions of nodes (Goldenberg et al., 2010), the covariance matrix of thousands of stock returns (Ng et al., 1992; Fan et al., 2011), the import/export network among hundreds of countries (Chen et al., 2022), etc. Due to the high dimensionality of the data, it is often useful and preferred to store, compress, represent, or summarize the matrices/tensors through low dimensional structures. In particular, low rank approximations of matrices have been ubiquitous. Finding a low rank approximation of a given matrix is closely related to the singular value decomposition (SVD), and the connection was revealed as early as Eckart and Young (1936).
SVD has proven extremely useful in matrix completion (Candès and Recht, 2009; Candès and Plan, 2010; Cai et al., 2010), community detection (Le et al., 2016), image denoising (Guo et al., 2015), among many others.

In this paper, we investigate matrix approximations induced by the Kronecker product. Since the Kronecker product is an extension of the outer product, we call the proposed method KoPA (Kronecker outer Product Approximation). The Kronecker product is an operation on two matrices which generalizes the outer product from vectors to matrices. Specifically, the Kronecker product of a $p \times q$ matrix $A = (a_{ij})$ and a $p' \times q'$ matrix $B = (b_{ij})$, denoted by $A \otimes B$, is defined as a $(pp') \times (qq')$ matrix which takes the form of a block matrix. In $A \otimes B$, there are $pq$ blocks of size $p' \times q'$, where the $(i,j)$-th block is the scalar product $a_{ij}B$. We refer the readers to Horn and Johnson (1991) and Van Loan and Pitsianis (1993) for overviews of the properties and computations of the Kronecker product. The Kronecker product has also found wide applications in signal processing, image restoration, quantum computing, etc. For example, in the statistical modeling of a multi-input multi-output (MIMO) channel communication system, Werner et al. (2008) modeled the covariance matrix of channel signals as the Kronecker product of the transmit covariance matrix and the receive covariance matrix. In compressed sensing, Duarte and Baraniuk (2012) utilized Kronecker products to provide a sparse basis for high-dimensional signals. In image restoration, Kamm and Nagy (1998) considered the blurring operator as a Kronecker product of two smaller matrices. In quantum computing, Kaye et al. (2007) represented the joint state of quantum bits as a Kronecker product of their individual states.

In the SVD, a matrix is represented as the sum of rank one matrices, each of which is the outer product of a left singular vector and the corresponding right singular vector (after the transpose). Similarly, the Kronecker Product Decomposition (KPD) of a $(pp') \times (qq')$ matrix $C$ is defined as

$C = \sum_{k=1}^{d} \lambda_k\, A_k \otimes B_k,$

where $d = \min\{pq,\, p'q'\}$, and the $A_k$ and $B_k$ are $p \times q$ and $p' \times q'$ respectively. In the definition of the KPD, the dimensions of $A_k$ and $B_k$ have to be specified, which (in this case, $p \times q$ and $p' \times q'$) we refer to as the configuration of the KPD. Further constraints on $A_k$ and $B_k$ are necessary to make the decomposition well defined and unique, but we defer the exact definition of the KPD to Section 2. Since the Kronecker product is an extension of the vector outer product, the KPD is likewise an extension of the SVD. In particular, if $q = 1$ and $p' = 1$, then the $A_k$ and $B_k$ are column and row vectors respectively, and the KPD, under this particular configuration, becomes the SVD. Similar to the rank-one approximation, the best matrix approximation given by a Kronecker product is formulated as finding the closest Kronecker product under the Frobenius norm. This was introduced in the matrix computation literature as the nearest Kronecker product (NKP) problem in Van Loan and Pitsianis (1993), who also demonstrated its equivalence to the best rank one approximation, and therefore also to the SVD, after a proper rearrangement of the matrix entries. Such an equivalence is also maintained if one seeks the best approximation of a given matrix by the sum of $K$ Kronecker products of the same configuration, $\sum_{k=1}^{K} A_k \otimes B_k$.
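The block structure described above is easy to check numerically. The following minimal sketch (in Python with NumPy; the dimensions are arbitrary choices for illustration) verifies that the $(i,j)$-th block of $A \otimes B$ is $a_{ij}B$.

```python
# Minimal check of the block structure of the Kronecker product:
# for a p x q matrix A and a p' x q' matrix B, A (x) B is a
# (pp') x (qq') block matrix whose (i, j)-th block is a_ij * B.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))   # p = 2,  q = 3
B = rng.standard_normal((4, 5))   # p' = 4, q' = 5
C = np.kron(A, B)                 # shape (2*4, 3*5) = (8, 15)

# The (i, j)-th p' x q' block of C equals a_ij * B.
i, j = 1, 2
block = C[i*4:(i+1)*4, j*5:(j+1)*5]
assert np.allclose(block, A[i, j] * B)
```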
Despite its connection to the SVD, finding the best Kronecker product approximation also involves a pre-step: determining the configuration of the Kronecker products, i.e., determining the dimensions of $A_k$ and $B_k$. One of our major contributions in this paper is the selection of the configuration based on an information criterion.

Although the configuration selection poses new challenges, the KPD also provides a framework that is more flexible than the SVD. Here we use the cameraman's image, a benchmark in image analysis, to illustrate the potential advantage of the KPD over the SVD. The left panel in Figure 1 is the $512 \times 512$ pixel image of a cameraman in gray scale. The middle panel shows the best rank-1 approximation of the original image given by the leading term of the SVD. The rank-1 approximation explains 45.63% of the total variation of the original image with 1023 parameters. The right panel in Figure 1 displays the image obtained by the nearest Kronecker product of configuration $(16 \times 32) \otimes (32 \times 16)$. With the same number of parameters as the rank-1 approximation, this nearest Kronecker product approximation explains 77.55% of the variance of the original image.

Figure 1: (Left) Original cameraman's image; (Middle) SVD approximation; (Right) KPD approximation.

We will revisit the cameraman's image in Section 6 with a more detailed analysis. We note here that the superiority of KoPA over the low rank approximation in representing images is partially due to the similarity of local blocks in the image. In this regard it is related to the patch based denoising methods (Dabov et al., 2007; Chatterjee and Milanfar, 2011) in the field of image processing, which explore the recurrence of similar local patterns throughout the image. However, we have a substantially distinct focus in this paper. One of our main objectives is to devise a formal procedure to determine the configuration, or the "patch size", from the data, which is usually chosen in an ad hoc manner in patch based methods. We introduce a statistical model to characterize the image generating mechanism, and propose to use information criteria to select the configuration. Practically it implies an emphasis on the balance between complexity (number of model parameters) and accuracy (closeness to the original image). Furthermore, the KoPA framework and the model selection also have potential applications in high dimensional panel time series, large network analysis, recommender systems, and other matrix-type data analysis. For example, in modeling dense networks (Leskovec et al., 2010), the adjacency matrix can be represented by a Kronecker product $A \otimes B$, where $A$ and $B$ correspond to the inter- and intra-community structures respectively. As a second example, KoPA may as well replace the low rank approximation in the synchronization problem (Chen and Chen, 2008; Singer, 2011) to identify the groups/clusters of the individuals, while at the same time denoising the distance matrix. It is worth mentioning that KoPA can also be used to speed up computation. If the transition matrix of a Markov chain can be represented as one or a sum of a few Kronecker products, then the state update can be calculated more efficiently (Dayar, 2012). KoPA plays its role in guiding the choice of the Kronecker product approximation of the transition matrix.
In this paper, we focus on the model $Y = \lambda A \otimes B + \frac{\sigma}{\sqrt{PQ}} E$, where $P \times Q$ is the dimension of the matrix $Y$, $E$ is a standard Gaussian ensemble consisting of IID standard normal entries, and $\lambda > 0$ and $\sigma > 0$ indicate the strength of the signal and noise respectively. We consider the matrix denoising problem which aims to recover the Kronecker product $\lambda A \otimes B$ from the noisy observation $Y$. Here the configuration of the Kronecker product, i.e. the dimensions of $A$ and $B$, is to be determined from the data. We propose to use information criteria (which include AIC and BIC as special cases) to select the configuration, and prove its consistency under some conditions on the signal-to-noise ratio. The consistency of the configuration selection is established for both deterministic and random $A$ and $B$, under the paradigm of high dimensional analysis, where the dimension of $Y$ diverges to infinity.

The rest of the paper is organized as follows. In Section 2, we give the precise definition of the KPD, and introduce the model, with a review of some of their basic properties. In Section 3, we propose the information criteria for selecting the configuration of the Kronecker product. We investigate and establish the consistency of the proposed selection procedure in Section 4. Extension to the multi-term Kronecker product models is discussed in Section 5. In Section 6, we carry out extensive simulations to assess the performance of our method, and demonstrate its superiority over the SVD approach. We also present a detailed analysis of the cameraman's image.

Notations: Throughout this paper, for a vector $v$, $\|v\|$ denotes its Euclidean norm. For a matrix $M$, $\|M\|_F = \sqrt{\operatorname{tr}(M^\top M)}$ and $\|M\|_S = \max_{\|u\|=1} \|Mu\|$ denote its Frobenius norm and spectral norm respectively. For any two real numbers $a$ and $b$, $a \wedge b$ and $a \vee b$ stand for $\min\{a, b\}$ and $\max\{a, b\}$ respectively. For any number $x$, $x_+$ denotes the positive part $x \vee 0 = \max\{x, 0\}$.

2. Kronecker Product Model

2.1 Kronecker Product Decomposition

We first repeat the definition of the Kronecker product of a $p \times q$ matrix $A$ and a $p' \times q'$ matrix $B$, which is given by

$A \otimes B = \begin{pmatrix} a_{1,1}B & a_{1,2}B & \cdots & a_{1,q}B \\ a_{2,1}B & a_{2,2}B & \cdots & a_{2,q}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{p,1}B & a_{p,2}B & \cdots & a_{p,q}B \end{pmatrix}.$

Let $C$ be a $(pp') \times (qq')$ real matrix. Its Kronecker Product Decomposition (KPD) of configuration $(p, q, p', q')$ is defined as

$C = \sum_{k=1}^{d} \lambda_k\, A_k \otimes B_k, \qquad (1)$

where $d = \min\{pq,\, p'q'\}$, each $A_k$ is a $p \times q$ matrix with Frobenius norm $\|A_k\|_F = 1$, each $B_k$ is a $p' \times q'$ matrix with $\|B_k\|_F = 1$, and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$. The matrices $A_k$ are mutually orthogonal in the sense that $\operatorname{tr}(A_k^\top A_l) = 0$ for $1 \le k < l \le d$, and so are the matrices $B_k$. The best way to see that the KPD is a valid definition is through its connection with the SVD, after a proper rearrangement of the elements of $C$, as demonstrated in Van Loan and Pitsianis (1993). Denote by $\operatorname{vec}(\cdot)$ the vectorization of a matrix by stacking its rows. If $A = (a_{ij})$ is a $p \times q$ matrix, then $\operatorname{vec}(A) := [a_{1,1}, \ldots, a_{1,q}, \ldots, a_{p,1}, \ldots, a_{p,q}]^\top$. If $B = (b_{ij})$ is a $p' \times q'$ matrix, then $\operatorname{vec}(A)[\operatorname{vec}(B)]^\top$ is a $(pq) \times (p'q')$ matrix containing the same set of elements as the Kronecker product $A \otimes B$, but in different positions. We define the rearrangement operator $\mathcal{R}$ to represent this relationship. Write the matrix $C$ as a $p \times q$ array of blocks of the same block size $p' \times q'$, and denote by $C^{p',q'}_{i,j}$ the $(i,j)$-th block, where $1 \le i \le p$, $1 \le j \le q$. The operator $\mathcal{R}$ maps the matrix $C$ to

$\mathcal{R}_{p,q}[C] = \left[ \operatorname{vec}(C^{p',q'}_{1,1}), \ldots, \operatorname{vec}(C^{p',q'}_{1,q}), \ldots, \operatorname{vec}(C^{p',q'}_{p,1}), \ldots, \operatorname{vec}(C^{p',q'}_{p,q}) \right]^\top. \qquad (2)$
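The operator $\mathcal{R}_{p,q}$ of (2) is straightforward to implement. Below is a sketch, assuming the row-major $\operatorname{vec}(\cdot)$ used in the text (the function name `rearrange` is our own); it also checks numerically the Kronecker identity (3) stated below, and the Frobenius-norm isometry.

```python
# A sketch of the rearrangement operator R_{p,q} of (2): it maps a
# (p*pp) x (q*qq) matrix to a (p*q) x (pp*qq) matrix whose rows are the
# row-major vectorizations of the pp x qq blocks, taken block-row by
# block-row, so that R_{p,q}[A (x) B] = vec(A) vec(B)^T.
import numpy as np

def rearrange(C, p, q, pp, qq):
    """R_{p,q}[C] for a (p*pp) x (q*qq) matrix C with block size pp x qq."""
    rows = []
    for i in range(p):
        for j in range(q):
            block = C[i*pp:(i+1)*pp, j*qq:(j+1)*qq]
            rows.append(block.reshape(-1))      # vec of the (i, j)-th block
    return np.vstack(rows)                      # shape (p*q, pp*qq)

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((4, 5))
R = rearrange(np.kron(A, B), 2, 3, 4, 5)
assert np.allclose(R, np.outer(A.reshape(-1), B.reshape(-1)))  # identity (3)
# R preserves the Frobenius norm (isometry).
assert np.isclose(np.linalg.norm(R), np.linalg.norm(np.kron(A, B)))
```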
When applied to a Kronecker product $A \otimes B$, it holds that

$\mathcal{R}_{p,q}[A \otimes B] = \operatorname{vec}(A)[\operatorname{vec}(B)]^\top. \qquad (3)$

In view of (2) and (3), we see that the KPD in (1) corresponds to the SVD of the rearranged matrix $\mathcal{R}_{p,q}[C]$, and the conditions imposed on $A_k$ and $B_k$ are derived from the properties of the singular vectors. Here, we note that the rearrangement operator $\mathcal{R}$ is configuration dependent, which we emphasize by explicitly specifying the dimensions of $A_k$ (in this case, $p$ and $q$) in the subscript of $\mathcal{R}$; see (2) and (3). When there is no ambiguity, the subscript of $\mathcal{R}$ may be omitted for notational simplicity. According to the definition, the mapping $\mathcal{R}_{p,q} : \mathbb{R}^{pp' \times qq'} \to \mathbb{R}^{pq \times p'q'}$ is an isomorphism since it is linear and bijective. In addition, since the order of elements does not change the Frobenius norm, the mapping $\mathcal{R}$ is also an isometry under the Frobenius norm.

2.2 Kronecker Product Model

We consider the model where the observed $P \times Q$ matrix $Y$ is a noisy version of an unknown Kronecker product,

$Y = \lambda A \otimes B + \frac{\sigma}{\sqrt{PQ}}\, E. \qquad (4)$

To resolve the obvious unidentifiability regarding $A$ and $B$, we require

$\|A\|_F = \|B\|_F = 1, \qquad (5)$

so that $\lambda > 0$ indicates the strength of the signal part. Note that under (5), $A$ and $B$ are identified up to a sign change given their dimensions. To further identify $A$ and $B$, it is common to assume that the largest non-zero element (in absolute value) of one of them is positive. We assume that the noise matrix $E$ has IID standard normal entries, and consequently the strength of the noise is controlled by $\sigma > 0$.

The dimensions of $A$ and $B$ correspond to an integer factorization of the dimensions of $Y$. For convenience, we assume throughout this article that the dimension of the observed matrix $Y$ in (4) is $2^M \times 2^N$ with $M, N \in \mathbb{N}$. As a result, the dimension of $A$ must be of the form $2^{m_0} \times 2^{n_0}$, where $0 \le m_0 \le M$ and $0 \le n_0 \le N$, and the corresponding dimension of $B$ is $2^{m_0'} \times 2^{n_0'}$, where $m_0' = M - m_0$ and $n_0' = N - n_0$. Therefore, we can simply use the pair $(m_0, n_0)$ to denote the configuration of the Kronecker product in (4). An implicit advantage of this assumption lies in the fact that if two configurations $(m, n)$ and $(m', n')$ are different, then the number of rows of $A$ under one configuration divides the one under the other, and similarly for the numbers of columns, and for $B$. For example, if $m \le m'$, then the number of rows of $A$ under the former configuration, which is $2^m$, divides the number of rows $2^{m'}$ under the latter. This fact leads to a more elegant treatment of the theoretical analysis in Section 4. For image analysis, assuming the dimensions to be powers of 2 seems rather reasonable. On the other hand, for other applications where the dimensions of the observed matrix are not powers of 2, one can transform the matrix to fulfill the assumption. For example, one can super-sample the matrix to increase the dimension to the closest powers of 2, or augment the matrix by padding zeros. The methodology proposed in this paper can be applied to any integers $P$ and $Q$ with more than two factors.

We will consider two mechanisms for the signal part $\lambda A \otimes B$.

Deterministic Scheme. We assume that $\lambda$, $A$ and $B$ are deterministic, satisfying (5). We define the following signal-to-noise ratio to measure the signal strength:

$\frac{\|\lambda A \otimes B\|_F^2}{E\|\sigma E / 2^{(M+N)/2}\|_F^2} = \frac{\lambda^2}{\sigma^2}.$

Random Scheme. Assume that $\lambda$, $A$ and $B$ are random and independent of $E$. Although $A$ and $B$ are stochastic, we assume that they have been rescaled so that (5) is fulfilled. In this case the signal-to-noise ratio is defined as

$\frac{E\|\lambda A \otimes B\|_F^2}{E\|\sigma E / 2^{(M+N)/2}\|_F^2} = \frac{E\lambda^2}{\sigma^2}.$
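As a quick numerical sanity check of this normalization (a simulation of ours under the stated model, not from the paper): since $E\|E\|_F^2 = 2^{M+N}$ for a standard Gaussian ensemble, the denominator in both ratios is exactly $\sigma^2$.

```python
# The expected squared Frobenius norm of sigma * E / 2^((M+N)/2) is
# sigma^2, so the signal-to-noise ratio of model (4) is lambda^2/sigma^2.
import numpy as np

rng = np.random.default_rng(2)
M, N, sigma = 6, 6, 0.5
vals = [np.linalg.norm(sigma * rng.standard_normal((2**M, 2**N))
                       / 2**((M + N) / 2), 'fro')**2 for _ in range(200)]
print(np.mean(vals))   # concentrates around sigma^2 = 0.25
```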
Remark 1. We distinguish between these two schemes to account for different assumptions on the data generating mechanism. In the random scheme, the observed matrix data is assumed to be randomly drawn from a (super-)population of matrices with an ad hoc prior, which for example can be chosen as the Kronecker product of two independent Gaussian random matrices. Under the random scheme assumption, ill-behaved matrices arise with negligible probabilities under the prior. Similar assumptions have been used in factor analysis and random effects models. The deterministic scheme incorporates arbitrary matrices. Additional assumptions need to be imposed to exclude extreme cases for which the proposed model selection would fail.

2.3 Estimation with a Known Configuration

Suppose we want to estimate $A$ and $B$ based on a given configuration $(m, n)$, that is, the dimensions of $A$ and $B$ are $2^m \times 2^n$ and $2^{m'} \times 2^{n'}$ respectively. Again we use $m' = M - m$ and $n' = N - n$ to ease the notation when $M$ and $N$ are known. To estimate $A$ and $B$ in (4) from the observed matrix $Y$, we solve the minimization problem

$\min_{\lambda, A, B} \|Y - \lambda A \otimes B\|_F^2, \quad \text{subject to } \|A\|_F = \|B\|_F = 1. \qquad (6)$

Since we have assumed that the noise matrix contains IID standard normal entries, (6) is also equivalent to the MLE. This optimization problem has been formulated as the nearest Kronecker product (NKP) problem in the matrix computation literature (Van Loan and Pitsianis, 1993), and solved through the SVD after rearrangement. According to Section 2.1, after applying the rearrangement operator, the cost function in (6) is equivalent to

$\|Y - \lambda A \otimes B\|_F^2 = \|\mathcal{R}[Y] - \lambda \operatorname{vec}(A)[\operatorname{vec}(B)]^\top\|_F^2.$

We note that the rearrangement operator $\mathcal{R}$ defined in (2) depends on the configuration of the block matrix, and in the current case, on the configuration $(m, n)$. Let $\mathcal{R}[Y] = \sum_{k=1}^{d} \tilde{\lambda}_k u_k v_k^\top$ be the SVD of the rearranged matrix $\mathcal{R}_{m,n}[Y]$, where $\tilde{\lambda}_1 \ge \cdots \ge \tilde{\lambda}_d \ge 0$ are the singular values in decreasing order, $u_k$ and $v_k$ are the corresponding left and right singular vectors, and $d = 2^{m+n} \wedge 2^{m'+n'}$. The estimators for model (4) are given by

$\hat{\lambda} = \tilde{\lambda}_1 = \|\mathcal{R}[Y]\|_S, \quad \hat{A} = \operatorname{vec}^{-1}(u_1), \quad \hat{B} = \operatorname{vec}^{-1}(v_1), \quad \hat{\sigma}^2 = \|Y\|_F^2 - \hat{\lambda}^2, \qquad (7)$

where $\operatorname{vec}^{-1}$ is the inverse operation of $\operatorname{vec}(\cdot)$ that restores a vector back into a matrix of proper dimensions.

We examine a few special cases of the configuration $(m, n)$. When $(m, n) = (0, 0)$ or $(m, n) = (M, N)$, the nearest Kronecker product approximation of $Y$ is always $Y$ itself. For instance, if $m = n = 0$, the estimators are $\hat{\lambda} = \|Y\|_F$, $\hat{A} = 1$, $\hat{B} = \hat{\lambda}^{-1} Y$, $\hat{\sigma}^2 = 0$. These two configurations are obviously over-fitting, and we shall exclude them in the subsequent analysis. When $(m, n) = (0, N)$ or $(m, n) = (M, 0)$, the nearest Kronecker product approximation of $Y$ is the same as the rank-1 approximation of $Y$ without rearrangement. When the true configuration used to generate $Y$ is chosen, that is $(m, n) = (m_0, n_0)$, the problem is equivalent to denoising a perturbed rank-1 matrix, since

$\mathcal{R}_{m_0,n_0}[Y] = \lambda \operatorname{vec}(A)[\operatorname{vec}(B)]^\top + \frac{\sigma}{2^{(M+N)/2}}\, \mathcal{R}_{m_0,n_0}[E], \qquad (8)$

where the rearranged noise matrix $\mathcal{R}_{m_0,n_0}[E]$ is still a standard Gaussian ensemble. Therefore $\lambda$, $A$ and $B$ can be recovered consistently when $\sigma \|\mathcal{R}_{m_0,n_0}[E]\|_S = o_p(\lambda\, 2^{(M+N)/2})$. Details will be discussed in Section 4.

3. Configuration Determination through an Information Criterion

Our primary goal is to recover the Kronecker product $\lambda A \otimes B$ from $Y$, based on model (4). It depends on the configuration of the Kronecker product, which is typically unknown. We propose an information criterion based procedure to select the configuration.
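Both the estimation in (7) and the selection procedure developed below reduce to a single SVD per candidate configuration. A minimal sketch of the estimators in (7), reusing the `rearrange` helper from the earlier sketch (the function name `kopa_fit` is our own):

```python
# One-term KoPA estimate under a given configuration (m, n): the nearest
# Kronecker product is recovered from the leading SVD component of the
# rearranged matrix, as in (7).
import numpy as np

def kopa_fit(Y, m, n, M, N):
    p, q = 2**m, 2**n                 # dimensions of A
    pp, qq = 2**(M - m), 2**(N - n)   # dimensions of B
    R = rearrange(Y, p, q, pp, qq)    # shape (2^(m+n), 2^(m'+n'))
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    lam_hat = s[0]                                 # = ||R[Y]||_S
    A_hat = U[:, 0].reshape(p, q)                  # vec^{-1}(u_1)
    B_hat = Vt[0, :].reshape(pp, qq)               # vec^{-1}(v_1)
    sigma2_hat = np.linalg.norm(Y, 'fro')**2 - lam_hat**2
    return lam_hat, A_hat, B_hat, sigma2_hat
```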
Recall that the dimension of $Y$ is $2^M \times 2^N$. If the dimension of $A$ is $2^m \times 2^n$, then the dimension of $B$ must be $2^{m'} \times 2^{n'}$, where $m' = M - m$ and $n' = N - n$. Therefore, the configuration can be indexed by the pair $(m, n)$, which takes values from the Cartesian product set $\{0, \ldots, M\} \times \{0, \ldots, N\}$. For any given configuration $(m, n)$, the estimation procedure in Section 2.3 leads to the corresponding estimators $\hat{\lambda}$, $\hat{A}$ and $\hat{B}$. Denote the estimated Kronecker product by $\hat{Y}^{(m,n)} = \hat{\lambda} \hat{A} \otimes \hat{B}$. Note that all of $\hat{\lambda}$, $\hat{A}$ and $\hat{B}$ depend implicitly on the configuration $(m, n)$ used in estimation, and should be written as $\hat{\lambda} = \hat{\lambda}^{(m,n)}$ etc. However, we will suppress the configuration index from the notation for simplicity, whenever its meaning is clear from the context. Under the assumption that the noise matrix $E$ is a standard Gaussian ensemble, we define the information criterion as

$\mathrm{IC}_\kappa(m, n) = 2^{M+N} \ln \|Y - \hat{Y}^{(m,n)}\|_F^2 + \kappa \eta, \qquad (9)$

where $\eta = 2^{m+n} + 2^{m'+n'}$ is the number of parameters involved in the Kronecker product of the configuration $(m, n)$, and $\kappa \ge 0$ controls the penalty on the model complexity. The information criterion (9) can be viewed as an extended version of the BIC. Similar proposals have been introduced by Chen and Chen (2008) and Foygel and Drton (2010) in the linear regression and graphical models settings, respectively. The information criterion (9) reduces to the log mean squared error when $\kappa = 0$, and corresponds to the Akaike information criterion (AIC) (Akaike, 1998) when $\kappa = 2$, and the Bayesian information criterion (BIC) (Schwarz, 1978) when $\kappa = \ln 2^{M+N} = (M + N) \ln 2$.

Remark 2. Strictly speaking, the number of parameters involved in the Kronecker product $\lambda A \otimes B$ should be $2^{m+n} + 2^{m'+n'} - 1$ because of the constraints (5). Since this does not affect the selection procedure to be introduced in (10), we use $\eta = 2^{m+n} + 2^{m'+n'}$ for simplicity.

The information criterion (9) can be calculated for all configurations, and the one corresponding to the smallest value of (9) will be selected, based on which the estimation procedure in Section 2.3 proceeds. In other words, the selected configuration $(\hat{m}, \hat{n})$ is obtained through

$(\hat{m}, \hat{n}) = \arg\min_{(m,n) \in \mathcal{C}} \mathrm{IC}_\kappa(m, n), \qquad (10)$

where $\mathcal{C}$ is the set of all candidate configurations. As discussed in Section 2.3, when $m = n = 0$ or $(m, n) = (M, N)$, it holds that $\hat{Y} = Y$, and the information criterion (9) will be $-\infty$, no matter what value $\kappa$ takes. Therefore, these two configurations should be excluded in model selection, and we use

$\mathcal{C} := \{0, \ldots, M\} \times \{0, \ldots, N\} \setminus \{(0, 0), (M, N)\}$

as the set of candidate configurations in (10). Note that the set $\{0, \ldots, M\} \times \{0, \ldots, N\}$ forms a rectangular lattice in $\mathbb{Z}^2$, and $(m, n) = (0, 0)$ and $(m, n) = (M, N)$ are the bottom left and top right corners of the lattice. Therefore, we sometimes intuitively refer to these two configurations as the corner cases in the sequel. Furthermore, define $\mathcal{W}$ as the set of all wrong configurations, $\mathcal{W} := \mathcal{C} \setminus \{(m_0, n_0)\}$.

We now provide a heuristic argument to show how the selection procedure (10) is able to select the true configuration $(m_0, n_0)$. We will leave some technical results aside, and only highlight the essential idea. Precise statements and their rigorous analysis will be presented in Section 4. For simplicity, assume that $\lambda$, $\sigma$ and $\kappa$ are fixed constants. Also assume that both $(m_0 + n_0)$ and $(m_0' + n_0')$ diverge, so that the number of parameters $\eta_0 = 2^{m_0+n_0} + 2^{m_0'+n_0'}$ is of a smaller magnitude than $2^{M+N}$.
According to (7), for a given configuration $(m, n)$, $\mathcal{R}_{m,n}[\hat{Y}]$ equals the first SVD component of $\mathcal{R}_{m,n}[Y]$, and it follows that $\|Y - \hat{Y}\|_F^2 = \|Y\|_F^2 - \|\hat{Y}\|_F^2 = \|Y\|_F^2 - \hat{\lambda}^2$, so the information criterion (9) can be rewritten as

$\mathrm{IC}_\kappa(m, n) = 2^{M+N} \ln(\|Y\|_F^2 - \hat{\lambda}^2) + \kappa \eta. \qquad (11)$

For the true configuration $(m, n) = (m_0, n_0)$, the rearranged matrix $\mathcal{R}_{m_0,n_0}[Y]$ takes the form (8), where the first term is a rank-1 matrix of spectral norm $\lambda$, and the noise term has a spectral norm of the order $O(2^{-(m_0+n_0)/2} + 2^{-(m_0'+n_0')/2})$ (details given in Section 4), which is negligible relative to $\lambda$ under the assumption $m_0 + n_0 \gg 1$, $m_0' + n_0' \gg 1$. So under the true configuration, $\hat{\lambda} \approx \lambda$. On the other hand, the number of parameters $\eta_0 = o(2^{M+N})$, making the penalty term much smaller than the log likelihood term in (9). To summarize,

$\mathrm{IC}_\kappa(m_0, n_0) \approx 2^{M+N} \ln\left[ \|\lambda A \otimes B + \sigma 2^{-(M+N)/2} E\|_F^2 - \lambda^2 \right] \approx 2^{M+N} \ln \sigma^2.$

For a wrong configuration $(m, n) \in \mathcal{W}$ that is close to the true one, the spectral norm $\|\mathcal{R}_{m,n}[E]\|_S$ and the number of parameters $\eta$ are still negligible. However, the estimated coefficient $\hat{\lambda}$ is smaller than $\lambda$, since $\hat{\lambda} = \|\mathcal{R}_{m,n}[Y]\|_S \approx \|\mathcal{R}_{m,n}[\lambda A \otimes B]\|_S < \lambda$. Let us assume that $\|\mathcal{R}_{m,n}[\lambda A \otimes B]\|_S \le \phi \lambda$ for some $0 < \phi < 1$, which implies that for the wrong configuration $(m, n)$,

$\mathrm{IC}_\kappa(m, n) \approx 2^{M+N} \ln\left[ \|\lambda A \otimes B + \sigma 2^{-(M+N)/2} E\|_F^2 - \hat{\lambda}^2 \right] \gtrsim 2^{M+N} \ln\left[ \|\sigma 2^{-(M+N)/2} E\|_F^2 + \lambda^2 - \phi^2 \lambda^2 \right] \approx 2^{M+N} \ln\left[ \sigma^2 \left( 1 + \frac{(1 - \phi^2)\lambda^2}{\sigma^2} \right) \right].$

Therefore, the information criterion (9) is in favor of the true configuration over a wrong but close-to-truth one, and the two quantities are separated by

$\mathrm{IC}_\kappa(m, n) - \mathrm{IC}_\kappa(m_0, n_0) \gtrsim 2^{M+N} \ln\left[ 1 + (1 - \phi^2)\lambda^2/\sigma^2 \right].$

On the other hand, for a wrong configuration $(m, n) \in \mathcal{W}$ that is close to the corner configurations $(0, 0)$ or $(M, N)$, the spectral norm $\|\mathcal{R}_{m,n}[E]\|_S$ can be as large as $2^{(M+N)/2}$, so the separation between $\mathrm{IC}_\kappa(m, n)$ and $\mathrm{IC}_\kappa(m_0, n_0)$ by the log likelihood term is not guaranteed; i.e., it can happen that $\hat{\lambda} > \lambda$ under the wrong configuration. But at the same time the number of parameters $\eta$ is also approximately $2^{M+N}$, so $\mathrm{IC}_\kappa(m, n)$ receives a heavy penalty, which once again makes it greater than $\mathrm{IC}_\kappa(m_0, n_0)$.

In summary, the trade-off between log likelihood and model complexity plays its role here, as expected. Wrong but close-to-truth configurations involve similar numbers of parameters as the true one, but lead to much smaller likelihoods. On the other hand, a close-to-corner configuration may yield a $\hat{Y}$ closer to the original $Y$, but requires many more parameters to do so. The true configuration can thus be selected because it reaches the optimal balance between the likelihood and the model complexity. In the preceding discussion we have assumed several convenient conditions to simplify the heuristic arguments and to signify the essential idea. In particular, by assuming that $\lambda$ is a positive constant, the signal strength in model (4) is quite strong. In Section 4 we will make an effort to establish the model selection consistency under minimal conditions.
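Concretely, the selection (10) is an exhaustive search over $\mathcal{C}$. A sketch, reusing `kopa_fit` from the earlier sketch:

```python
# Evaluate IC_kappa(m, n) of (9) over the candidate set C and return the
# minimizer. kappa = 2 gives AIC; kappa = (M + N) * ln 2 gives BIC.
import numpy as np

def select_configuration(Y, M, N, kappa=2.0):
    best, best_ic = None, np.inf
    for m in range(M + 1):
        for n in range(N + 1):
            if (m, n) in [(0, 0), (M, N)]:
                continue                   # corner cases are excluded
            lam_hat, A_hat, B_hat, _ = kopa_fit(Y, m, n, M, N)
            rss = np.linalg.norm(Y - lam_hat * np.kron(A_hat, B_hat),
                                 'fro')**2
            eta = 2**(m + n) + 2**(M - m + N - n)    # parameter count
            ic = 2**(M + N) * np.log(rss) + kappa * eta
            if ic < best_ic:
                best, best_ic = (m, n), ic
    return best, best_ic
```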
Remark 3. We remark that the optimization (10) is an exhaustive search over the candidate set $\mathcal{C}$. The information function $\mathrm{IC}_\kappa(m, n)$ is not necessarily a uni-modal function in general, though it is likely to be uni-modal when both $A$ and $B$ are Gaussian random matrices. As an extreme example, if $A$ takes the form $A = C \otimes D$, then both $A \otimes B$ and $C \otimes (D \otimes B)$ are feasible configurations, and the $\mathrm{IC}_\kappa$ function is bi-modal. Local algorithms can be trapped at either of them, but the exhaustive search is able to choose the better one, which involves a smaller number of parameters. On the other hand, in our specific setting, the matrix is of dimensions $2^M \times 2^N$. Therefore the total number of candidate configurations is $MN - 2 = \log_2(2^M)\log_2(2^N) - 2$, which is the product of the logarithms of the dimensions of the observed matrix $Y$, and is much smaller than the size of $Y$. For more general dimensions $P$ and $Q$, the configurations are chosen from the set of all divisors of $P$ and $Q$. Unless $P$ and $Q$ are highly composite numbers (which are rare according to number theory), the numbers of their divisors are usually much smaller. Finally, it is possible to develop more advanced search algorithms. There are two challenges: (1) the IC function is not necessarily convex, and (2) the search space is discrete. We leave the investigation of such developments to future research.

4. Theoretical Results

In this section we provide a theoretical guarantee for the configuration selection procedure proposed in Section 3, by establishing its asymptotic consistency. Throughout this section our discussion is based on model (4).

4.1 Assumptions and Estimation Consistency under Known Configuration

We first introduce the assumptions of the theoretical analysis. Recall that for model (4), $(m_0, n_0)$ denotes the true configuration, i.e. the matrices $A$ and $B$ are of dimensions $2^{m_0} \times 2^{n_0}$ and $2^{m_0'} \times 2^{n_0'}$ respectively. For the asymptotic analysis, we make the following assumption on the sizes of $A$ and $B$, which follows the paradigm of high dimensional analysis.

Assumption 1 (Assumption on Dimension). Consider model (4). Assume that as $M + N \to \infty$, the true configuration $(m_0, n_0)$ satisfies

$m_0 + n_0 - \ln\ln(MN) \to \infty, \qquad m_0' + n_0' - \ln\ln(MN) \to \infty,$

where $m_0' = M - m_0$ and $n_0' = N - n_0$.

The condition entails that the numbers of entries in $A$ and $B$ need to diverge to infinity, and so does that of $Y$. It also ensures that the true configuration cannot stay too close to the corners. We remark that this is the only condition on the sizes of the involved matrices. In particular, we do not require all of $m_0, n_0, m_0', n_0'$ to go to infinity. Consequently, the low rank approximation (when $(m_0, n_0) = (M, 0)$ or $(m_0, n_0) = (0, N)$) is also covered by the KoPA framework and our analysis as a special case. The number of parameters involved in the Kronecker product $\lambda A \otimes B$ is $\eta_0 = 2^{m_0+n_0} + 2^{m_0'+n_0'}$. This is a much smaller number than $2^M \cdot 2^N$, the number of elements in $Y$. Hence Assumption 1 implies a significant dimension reduction.

We also make the following assumption on the error matrix $E$.

Assumption 2 (Assumption on Noise). Consider model (4). Assume that $E$ is a standard Gaussian ensemble, i.e. with IID standard normal entries.

We conclude this subsection with the convergence rates of the estimators $\hat{\lambda}$, $\hat{A}$ and $\hat{B}$, given by the estimation procedure in Section 2.3 under the true configuration. Since the error matrix $E$ has IID standard normal entries, according to Vershynin (2010), the expectation of the largest singular value of the rearranged error matrix $\mathcal{R}_{m_0,n_0}[E]$ is bounded by $s_0 = 2^{(m_0+n_0)/2} + 2^{(m_0'+n_0')/2}$.

Theorem 1. Let $\hat{\lambda}$, $\hat{A}$ and $\hat{B}$ be the estimators obtained under the true configuration, as given in (7). Suppose Assumptions 1 and 2 hold. Then for the deterministic scheme of model (4), we have

$\left| \frac{\hat{\lambda}}{\lambda} - 1 \right| = O_p\!\left( \frac{r_0 \sigma}{\lambda} \right), \qquad \min_{s = \pm 1} \|\hat{A} - sA\|_F^2 = O_p\!\left( \frac{r_0^2 \sigma^2}{\lambda^2} \right), \qquad \min_{s = \pm 1} \|\hat{B} - sB\|_F^2 = O_p\!\left( \frac{r_0^2 \sigma^2}{\lambda^2} \right),$

where $r_0 = s_0\, 2^{-(M+N)/2} = 2^{-(m_0+n_0)/2} + 2^{-(m_0'+n_0')/2}$.

4.2 Consistency of Configuration Selection

To study the consistency of the configuration selection proposed in Section 3, we need assumptions on the signal-to-noise ratio.
We choose to present model (4) with both $\lambda$ and $\sigma$ so that it is able to account for any actual data generating mechanism. On the other hand, the mathematical properties only depend on the ratio $\lambda/\sigma$. The strength of the signal also depends on the contrast between true and wrong configurations. If a configuration $(m, n) \in \mathcal{W}$ is used for the estimation, $Y$ is rearranged as

$\mathcal{R}_{m,n}[Y] = \lambda \mathcal{R}_{m,n}[A \otimes B] + \sigma 2^{-(M+N)/2} \mathcal{R}_{m,n}[E]. \qquad (12)$

Ignoring the noise term, only the first singular value component of $\mathcal{R}_{m,n}[A \otimes B]$ (multiplied by $\lambda$) is expected to enter $\hat{Y}$. When the true configuration is used, $\mathcal{R}_{m,n}[A \otimes B]$ is a rank-1 matrix, and its leading singular value equals 1 (recall that we have assumed $\|A\|_F = \|B\|_F = 1$). On the other hand, if a wrong configuration is used, then $\mathcal{R}_{m,n}[A \otimes B]$ is no longer rank-1, and its leading singular value should be smaller than 1. Define

$\phi := \max_{(m,n) \in \mathcal{W}} \|\mathcal{R}_{m,n}[A \otimes B]\|_S. \qquad (13)$

The quantity $\phi$ characterizes how much of the signal $A \otimes B$ can be captured by a wrong configuration, and it always holds that $0 < \phi \le 1$. We also introduce

$\psi^2 := 1 - \phi^2, \qquad (14)$

and call it the representation gap. Note that $0 \le \psi^2 < 1$, and the larger $\psi^2$ is, the easier it is to separate true and wrong configurations. The following assumption shows the interplay between the representation gap $\psi^2$ and the signal-to-noise ratio $\lambda/\sigma$.

Assumption 3 (Representation Gap). For model (4), assume that $A$ and $B$ are deterministic matrices, and

$\lim_{M+N \to \infty} \frac{2^{(M+N)/2}}{2^{(m_0+n_0)/2} + 2^{(m_0'+n_0')/2}}\, (\lambda/\sigma)\, \psi = \infty, \qquad (15)$

and

$\lim_{M+N \to \infty} 2^{(M+N)/4}\, (\lambda/\sigma)\, \psi^2 = \infty. \qquad (16)$

In both (15) and (16), the signal-to-noise ratio and the representation gap $\psi^2$ can diminish to zero, as long as they do not converge to zero too fast. In this sense, Assumption 3 is very flexible, requiring only very weak signal strength.

Remark 4. We have defined $\phi$ as the maximum over $\mathcal{W}$, the set of all wrong configurations. In fact, if we let $\phi_{m,n} := \|\mathcal{R}_{m,n}[A \otimes B]\|_S$ and $\psi^2_{m,n} := 1 - \phi^2_{m,n}$, then Assumption 3 can also be given through $\psi^2_{m,n}$ instead of a uniform lower bound $\psi^2$, leading to a weaker version of the assumption. On the other hand, as will be shown in Section 4.3, if $A$ and $B$ are randomly generated according to the random scheme, then indeed all $\psi^2_{m,n}$ are larger than or around 1/2 with an overwhelming probability. This suggests that using the lower bound $\psi^2$ in Assumption 3 for the deterministic scheme is still reasonable. Therefore, we do not spell out the detailed version of Assumption 3 using $\psi^2_{m,n}$, but present it in the current simple form.

Remark 5. Notions similar to the representation gap appear as key parameters in many other problems. For example, in variable selection for linear regression, the analogue of the representation gap is the smallest absolute non-zero coefficient in the model. In matrix rank determination problems or factor models, the analogue is the eigen-gap, or the smallest nonzero singular value.

The following theorem quantifies the separation of the information criterion (9) between the true and wrong configurations.

Theorem 2. Consider model (4), and assume Assumptions 1, 2 and 3. If

$\kappa \ge 2 \ln 2 \quad \text{and} \quad \kappa = o\!\left( \frac{2^{M+N} \ln(1 + (\lambda/\sigma)^2 \psi^2)}{2^{m_0+n_0} + 2^{m_0'+n_0'}} \right), \qquad (17)$

then

$\min_{(m,n) \in \mathcal{W}} E[\mathrm{IC}_\kappa(m, n)] - E[\mathrm{IC}_\kappa(m_0, n_0)] \ge 2^{M+N} \ln[1 + (\lambda/\sigma)^2 \psi^2]\, (1 + o(1)).$

To be precise, we note that for a sequence of numbers $\{a_k\}$, the statement $a_k \ge o(1)$ is understood as $\max\{-a_k, 0\} = o(1)$. According to Assumption 3, $(\lambda/\sigma)^2 \psi^2 \gg 2^{-(M+N)/2}$, so Theorem 2 shows that the separation of the information criterion is at least of the order $2^{(M+N)/2}$.
Remark 6. The first condition in (17) ensures that the penalty on the number of parameters is large enough to exclude configurations close to $(0, 0)$ and $(M, N)$. The second condition in (17) is imposed so that the contribution from the penalty term under the true configuration is dominated by the representation gap. The exact formula for the difference in the expected information criterion is given by (36) in the Appendix.

The next theorem establishes the consistency of (9). We need to define the symbol $\gtrsim$: for two sequences of positive numbers $\{a_k\}$ and $\{b_k\}$, $a_k \gtrsim b_k$ means $\liminf_k a_k/b_k > 0$.

Theorem 3. Assume the same conditions of Theorem 2. Then

$P\!\left( \mathrm{IC}_\kappa(m_0, n_0) < \min_{(m,n) \in \mathcal{W}} \mathrm{IC}_\kappa(m, n) \right) \ge 1 - \exp\!\left( -C\, 2^{M+N} + \ln(MN) \right),$

where the constant $C$, depending on $\lambda/\sigma$ and $\psi$, is of the order $C(\lambda/\sigma, \psi) \gtrsim (\alpha^{1/3} - 1)$ with $\alpha = 1 + (\lambda/\sigma)^2 \psi^2$. In particular, the preceding convergence rate implies the consistency of the configuration selection, i.e.

$\lim_{M+N \to \infty} P\!\left( \mathrm{IC}_\kappa(m_0, n_0) < \min_{(m,n) \in \mathcal{W}} \mathrm{IC}_\kappa(m, n) \right) = 1. \qquad (18)$

Remark 7. In Assumption 3, we focus on the minimal signal-to-noise ratio and representation gap. On the other hand, if they are large enough such that $\liminf (\lambda/\sigma)^2 \psi^2 \ge 1/2$, then the condition $\kappa \ge 2 \ln 2$ can be dropped from Theorems 2 and 3, which continue to hold if we set $\kappa = 0$ in (9). In other words, if the signal strength and the representation gap are sufficiently large, one can simply use the mean squared error to select the configuration.

Remark 8. The normality assumption can be extended to any other distribution $G$ as long as a concentration inequality for $\|\mathcal{R}_{m,n}[E]\|_S$ (under any configuration $(m, n)$) is available. There is no substantial difference in the analysis, except that the threshold for the signal-to-noise ratio may vary under different noise distributions. We carry out simulation experiments in Section 6.1.2 to demonstrate the performance of $\mathrm{IC}_\kappa$ for the configuration selection under normality. We also include additional simulation results under different noise distributions in Appendix H.

4.3 Model Selection under the Random Scheme

In this section we consider the consistency of the model selection under the random scheme (20). First of all, convergence rates similar to Theorem 1 can be obtained under the random scheme.

Corollary 1. Assume Assumptions 1 and 2. If $A$ and $B$ are generated according to the random scheme (20), then the conclusions of Theorem 1 continue to hold.

If a configuration $(m, n) \in \mathcal{W}$ is used, then the estimation procedure given in Section 2.3 rearranges $Y$ as in (12). In Section 4.2, for the deterministic scheme, we introduced $\phi$ as the upper bound of $\|\mathcal{R}_{m,n}[A \otimes B]\|_S$ over all wrong configurations. For the random scheme, this upper bound, and hence the representation gap $\psi$, depends on $A$ and $B$, and is therefore also random. We introduce the following random version of Assumption 3.

Assumption 4 (Representation Gap). Assume in model (4) that $\lambda$, $A$ and $B$ are random and independent of $E$. Assume there exist two sequences of positive numbers $\{\lambda_0\}$ and $\{\psi_0\}$ satisfying (15) and (16) (with $\lambda$ and $\psi$ therein replaced), such that

$\limsup_{M+N \to \infty} E[\lambda^2/\lambda_0^2] < \infty, \qquad \limsup_{M+N \to \infty} E[\psi^2/\psi_0^2] < \infty,$

and for any constant $c > 0$,

$\lim_{M+N \to \infty} MN\, P\!\left( \lambda^2/\lambda_0^2 < 1 - c \right) = \lim_{M+N \to \infty} MN\, P\!\left( \psi^2/\psi_0^2 < 1 - c \right) = 0. \qquad (19)$

With Assumption 4, Theorems 2 and 3 continue to hold under the random scheme, as asserted by the next theorem.

Theorem 4. Consider model (4) with random $\lambda$, $A$ and $B$. Under Assumptions 1, 2 and 4, it holds that

$\min_{(m,n) \in \mathcal{W}} E[\mathrm{IC}_\kappa(m, n)] - E[\mathrm{IC}_\kappa(m_0, n_0)] \ge 2^{M+N} \ln[1 + (\lambda_0/\sigma)^2 \psi_0^2]\, (1 + o(1)).$

Furthermore, the consistency (18) holds.
Assumption 4 is formulated to single out the minimal condition required for the consistency under the random scheme. There are no specific distributional assumptions imposed on $A$ and $B$. In the rest of this section, we demonstrate how it can be satisfied under normality.

Example 1. Consider model (4). Suppose that

$\lambda A \otimes B = \lambda_0\, \frac{\tilde{A} \otimes \tilde{B}}{2^{(M+N)/2}}, \qquad (20)$

where $\lambda_0$ is deterministic, and $\tilde{A}$ and $\tilde{B}$ are independent, both consisting of IID standard normal entries. In order to fulfill the identifiability condition (5), we let $A = \tilde{A}/\|\tilde{A}\|_F$, $B = \tilde{B}/\|\tilde{B}\|_F$, and $\lambda = \lambda_0 \|\tilde{A}\|_F \|\tilde{B}\|_F / 2^{(M+N)/2}$. Also assume that $\tilde{A}$ and $\tilde{B}$ are both independent of $E$. For this example, the signal-to-noise ratio becomes

$\frac{E\|\lambda A \otimes B\|_F^2}{E\|\sigma E / 2^{(M+N)/2}\|_F^2} = \frac{\lambda_0^2}{\sigma^2}.$

Recall that $\phi$ is defined as the upper bound of $\|\mathcal{R}_{m,n}[A \otimes B]\|_S$ over all wrong configurations. Only when the true configuration $(m_0, n_0)$ is used does the rearrangement $\mathcal{R}_{m_0,n_0}[A \otimes B]$ have the simple structure of a rank-1 matrix. Under a wrong configuration, $\mathcal{R}_{m,n}[A \otimes B]$ no longer takes any special form. Nevertheless, the following lemma characterizes how the spectral norm of $\mathcal{R}_{m,n}[A \otimes B]$ depends on further rearrangements of both $A$ and $B$. It is a property of Kronecker products and the KPD (1), so we present it in the general form, without referring to any true configuration.

Lemma 1. Let $A$ be a $2^m \times 2^n$ matrix and $B$ be a $2^{m'} \times 2^{n'}$ matrix. Then for any $m'', n'' \in \mathbb{Z}$ with $0 \le m'' \le M$ and $0 \le n'' \le N$,

$\|\mathcal{R}_{m'',n''}[A \otimes B]\|_S = \|\mathcal{R}_{m \wedge m'',\, n \wedge n''}[A]\|_S \cdot \|\mathcal{R}_{(m''-m)_+,\, (n''-n)_+}[B]\|_S.$

Applying Lemma 1 to Example 1 leads to the following corollary.

Corollary 2. For Example 1, under Assumption 1, it holds that

$\max_{(m,n) \in \mathcal{W}} \|\mathcal{R}_{m,n}[A \otimes B]\|_S = \frac{1}{\sqrt{2}}\, (1 + o_p(1)).$

As a consequence, Assumption 4 holds with the $\lambda_0$ in (20) and $\psi_0^2 = 1/2$.

5. Multi-term Kronecker Product Models

In this section, we extend the one-term Kronecker product model in (4) to the following $K$-term KoPA model:

$Y = \sum_{k=1}^{K} \lambda_k\, A_k \otimes B_k + \frac{\sigma}{2^{(M+N)/2}}\, E, \qquad (21)$

where $\lambda_1 > \lambda_2 > \cdots > \lambda_K > 0$ and $A_k \in \mathbb{R}^{2^{m_0} \times 2^{n_0}}$, $B_k \in \mathbb{R}^{2^{m_0'} \times 2^{n_0'}}$, $k = 1, \ldots, K$, satisfy the following orthonormal condition:

$\operatorname{tr}(A_k^\top A_l) = \operatorname{tr}(B_k^\top B_l) = \delta_{kl} := \begin{cases} 1 & \text{if } k = l, \\ 0 & \text{if } k \ne l. \end{cases}$

The orthonormal condition and the assumption $\lambda_1 > \lambda_2 > \cdots > \lambda_K > 0$ imply identifiability: $A_k$ and $B_k$ are identified up to sign changes, see Section 2.1. Note that the $K$ terms in model (21) have the same configuration $(m_0, n_0)$. Therefore, although multiple terms are present, there is only one configuration to be determined.

Remark 9. For the multi-term model (21), both the configuration $(m_0, n_0)$ and the number of terms $K$ need to be determined from the data. We propose to select the configuration first. Once the configuration is selected as $(\hat{m}, \hat{n})$, the rearranged $\mathcal{R}_{\hat{m},\hat{n}}(Y)$ becomes the sum of a rank-$K$ matrix and a noise matrix, and the determination of $K$ turns into a rank selection based on $\mathcal{R}_{\hat{m},\hat{n}}(Y)$, for which existing methods in low rank approximation (Bai, 2003; Ahn and Horenstein, 2013) can be applied, as shown in the sketch below. In this paper, we follow this two-step procedure and focus on the choice of the configuration for model (21). We also remark that it is of interest to study the joint selection of the configuration and the number of terms, which we shall leave for future study.
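As a sketch of the second step in Remark 9, the number of terms $K$ can be chosen, for example, by the eigenvalue-ratio idea of Ahn and Horenstein (2013) applied to the singular values of the rearranged matrix. The specific ratio rule below is an illustrative choice of ours, not the paper's prescription; it reuses the `rearrange` helper from earlier sketches.

```python
# Illustrative eigenvalue-ratio choice of K after the configuration
# (m, n) has been selected: pick the k maximizing s_k / s_{k+1} among
# the singular values of the rearranged matrix.
import numpy as np

def select_K(Y, m, n, M, N, K_max=10):
    R = rearrange(Y, 2**m, 2**n, 2**(M - m), 2**(N - n))
    s = np.linalg.svd(R, compute_uv=False)
    K_max = min(K_max, len(s) - 1)
    ratios = s[:K_max] / s[1:K_max + 1]   # s_k / s_{k+1}, k = 1..K_max
    return int(np.argmax(ratios)) + 1     # K maximizing the ratio
```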
This approach avoids the need of the determination of the number of Kronecker product terms when seeking the correct configuration. It allows the separation of the two. Similar to Eq.(13), for each term Ak Bk, define φk = max (m,n) W Rm,n[Ak Bk] S. (22) Under the true configuration, we have P k λk Rm0,n0(Ak Bk) S = λ1 given the orthonormality assumption. For any wrong configuration, we have the following upper bound max (m,n) W k=1 λk Ak Bk k=1 λk max (m,n) W Rm,n[Ak Bk] S = k=1 λkφk =: λ1 φ, where φ := λ 1 1 PK k=1 λkφk. When φ < 1, there exists a strict representation gap ψ2 = 1 φ2 (23) between the true configuration and wrong configurations. Therefore, one can identify the true configuration by minimizing the information criterion (11) over C. The theoretical results in Section 4 can be adapted immediately. Corollary 3 is a direct extension of Theorems 2 and 3. We skip the proof here. Corollary 3 (Extension of Theorem 3). If ψ2 > 0, Theorem 2 and Theorem 3 continue to hold for the multi-term model (21), by replacing the signal λ and the representation gap ψ with λ1 and ψ respectively. The bound φ is obtained by direct applications of the triangular inequality and may not be sharp. The resulted condition on the representation gap in (23) is therefore very strong.On the other hand, since φ is only attained when the singular spaces of Rm,n[Ak Bk] are the same for some wrong configuration (m, n), one can further sharpen the upper bound for max(m,n) W Rm,n[P k λk Ak Bk] S when the singular spaces of Rm,n[P k λk Ak Bk] s are not identical, leading to a larger representation gap. And the more different these singular spaces are, the larger the gap is. We will discuss a refined result in the next subsection for the two-term models. 5.1 Analysis of the Two-Term Model When K = 2, the multi-term model (21) becomes the two-term model Y = λ1A1 B1 + λ2A2 B2 + σ 2(M+N)/2 E. (24) Ko PA: Automated Kronecker Product Approximation Again, we use the same configuration selection procedure in Section 3, that is, we calculate the information criterion (9) by fitting the one-term Kronecker product model (4) to Y . In this case, the estimated ˆλ used in the information criterion (11) is ˆλ = Rm,n[Y ] S = λ1Rm,n[A1 B2] + λ2Rm,n[A2 B2] + σ2 (M+N)/2Rm,n[E] S. (25) Note that under the true configuration, we have ˆλ λ1. To bound ˆλ under wrong configurations, we define φ1 = max (m,n) W Rm,n[A1 B1] S, φ2 = max (m,n) W Rm,n[A2 B2] S, and the representation gaps ψ2 1 := 1 φ2 1, ψ2 2 := 1 φ2 2. Even though vec(A1) and vec(A2) are orthogonal according to the model assumption, the column spaces of Rm,n[A1 B1] and Rm,n[A2 B2] are not necessarily orthogonal. In the worst case when Rm,n[A1 B1] and Rm,n[A2 B2] have the same column space and the same row space, then ˆλ in (25) can be close to λ1φ1 + λ2φ2, which may exceed λ1. Therefore, we need to bound the distance between the column (and row) spaces of Rm,n[A1 B1] and Rm,n[A2 B2]. For this purpose, we make use of the principal angles between linear subspaces. Specifically, if M1 and M2 are two matrices of the same number of rows, the smallest principal angle between their column spaces, denote by Θ(M1, M2), is defined as cos Θ(M1, M2) = sup u1 =0,u2 =0 u 1M 1M2u2 M1u1 M2u2 . We first discuss the deterministic scheme, where Ak and Bk are non-random. In Assumption 5, θc and θr are lower bounds of the smallest possible principal angles between the column spaces and the row spaces of the two rearranged components, respectively. Assumption 5. 
Assumption 5. There exists $0 < \xi < 1$ such that

$\max_{(m,n) \in \mathcal{W}_A} \cos \Theta\!\left( \mathcal{R}_{m,n}[A_1 \otimes B_1],\ \mathcal{R}_{m,n}[A_2 \otimes B_2] \right) \le \xi,$

and

$\max_{(m,n) \in \mathcal{W}_B} \cos \Theta\!\left( [\mathcal{R}_{m,n}[A_1 \otimes B_1]]^\top,\ [\mathcal{R}_{m,n}[A_2 \otimes B_2]]^\top \right) \le \xi,$

where $\mathcal{W}_A = \{(m, n) \in \mathcal{W} : m + n \ge m' + n'\}$ and $\mathcal{W}_B = \{(m, n) \in \mathcal{W} : m + n < m' + n'\}$.

Remark 10. The assumption may look unintuitive at first sight, since it might be thought that the matrices $A_i \otimes B_i$, after the rearrangement under wrong configurations, are in general of full rank. This is, however, not true, in view of Lemma 1; it is most easily seen when the wrong configuration $(m, n)$ is nested within the true one $(m_0, n_0)$ in the sense $m \le m_0$ and $n \le n_0$. On the other hand, the conditions in Assumption 5 are given separately over $\mathcal{W}_A$ and $\mathcal{W}_B$. In each of them, the matrices involved have more rows than columns, and the condition is on the corresponding column spaces.

The following lemma provides an upper bound on the spectral norm of a sum of two matrices. It utilizes the principal angles between the column and row spaces to make the bound sharper than the one given by the triangle inequality. Assumption 5 enables us to apply Lemma 2 to bound $\hat{\lambda}$ in (25).

Lemma 2. Suppose $M_1$ and $M_2$ are two matrices of the same dimension. Let $\|M_1\|_S = \mu$ and $\|M_2\|_S = \nu$. Denote the principal angles between the column spaces and the row spaces by $\theta = \Theta(M_1, M_2)$ and $\eta = \Theta(M_1^\top, M_2^\top)$, respectively. Then $\|M_1 + M_2\|_S^2 \le \Lambda^2(\mu, \nu, \theta, \eta)$, where

$\Lambda^2(\mu, \nu, \theta, \eta) = \frac{1}{2}\left[ \sqrt{(\mu^2 + \nu^2 + 2\mu\nu \cos\theta \cos\eta)^2 - 4\mu^2\nu^2 \sin^2\theta \sin^2\eta} + \mu^2 + \nu^2 + 2\mu\nu \cos\theta \cos\eta \right].$

Similar to Assumption 3, we assume the signal strengths $\lambda_1, \lambda_2$ and the noise level $\sigma$ satisfy the following assumption.

Assumption 6. For model (24), we assume that the $\lambda_k$ and the matrices $A_k, B_k$, $k = 1, 2$, are deterministic, and

$\lim_{M+N \to \infty} \frac{2^{M+N}}{2^{m_0+n_0} + 2^{m_0'+n_0'}} \cdot \frac{\lambda_1^2 \psi_1^2 - \lambda_2^2 \phi_2^2 - 2\lambda_1\lambda_2\phi_1\phi_2\xi}{\sigma^2 + \lambda_2^2} = \infty, \qquad (26)$

$\lim_{M+N \to \infty} 2^{(M+N)/4} \cdot \frac{\lambda_1^2 \psi_1^2 - \lambda_2^2 \phi_2^2 - 2\lambda_1\lambda_2\phi_1\phi_2\xi}{(\lambda_1 + \lambda_2)\sigma} = \infty. \qquad (27)$

The conditions (26) and (27) correspond to (15) and (16) in the one-term model. Specifically, when $\lambda_2 = 0$, the two-term model reduces to the one-term case, and Assumption 6 reduces to Assumption 3 as well. The main result for the two-term model is stated in Theorem 5.

Theorem 5. Consider the two-term model (24), where the $\lambda_k$ and the matrices $A_k$ and $B_k$ are deterministic. Suppose Assumptions 1, 2, 5 and 6 hold. If $\kappa$ satisfies $\kappa \ge 2 \ln 2$ and

$\kappa = o\!\left( \frac{2^{M+N} \alpha}{2^{m_0+n_0} + 2^{M+N-m_0-n_0}} \right),$

then

$\min_{(m,n) \in \mathcal{W}} E[\mathrm{IC}_\kappa(m, n)] - E[\mathrm{IC}_\kappa(m_0, n_0)] \ge 2^{M+N} \alpha\, (1 + o(1)),$

where

$\alpha = \ln\!\left( 1 + \frac{\lambda_1^2 \psi_1^2 - \lambda_2^2 \phi_2^2 - 2\lambda_1\lambda_2\phi_1\phi_2\xi}{\sigma^2 + \lambda_2^2} \right). \qquad (28)$

Furthermore, the consistency (18) continues to hold.

Similar to Theorem 2, we have shown that for the two-term model, the information criterion obtained by fitting a one-term model can still separate the true and wrong configurations, with a gap of the order $O(2^{M+N} \alpha)$. On the other hand, compared with Assumption 3, Theorem 5 depends on Assumption 6, which requires not only the signal-to-noise ratio ($\lambda_1/\sigma$), but also the relative strength of the two terms ($\lambda_1/\lambda_2$), to be large enough. Comparing the two-term model (24) with the one-term model (i.e. $\lambda_2 = 0$), we note that the information criterion gap $\alpha$ in Theorem 5 is smaller than the one given by Theorem 2. This phenomenon can be intuitively explained through (28). On the one hand, $\lambda_2^2$ contributes to the noise when extracting the first KPD component, since $\lambda_2^2 + \sigma^2$ appears in the denominator in (28). On the other hand, over-fitting due to the second Kronecker product reduces $\|Y - \hat{Y}\|_F^2$ under the wrong configuration, which is quantified by $\lambda_2^2 \phi_2^2 + 2\lambda_1\lambda_2\phi_1\phi_2\xi$ in the numerator of (28).
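To illustrate how Lemma 2 sharpens the triangle inequality, the following sketch (assuming the form of $\Lambda^2$ reconstructed above) compares the actual $\|M_1 + M_2\|_S^2$, the Lemma 2 bound, and the triangle-inequality bound $(\mu + \nu)^2$ for random low-rank matrices; it reuses `cos_smallest_principal_angle` from the earlier sketch.

```python
# For random rank-2 matrices the principal angles are non-trivial, and
# the Lemma 2 bound is typically far below (mu + nu)^2.
import numpy as np

def Lambda2(mu, nu, theta, eta):
    c = mu**2 + nu**2 + 2 * mu * nu * np.cos(theta) * np.cos(eta)
    disc = c**2 - 4 * mu**2 * nu**2 * np.sin(theta)**2 * np.sin(eta)**2
    return 0.5 * (np.sqrt(max(disc, 0.0)) + c)

rng = np.random.default_rng(3)
U1, V1 = rng.standard_normal((20, 2)), rng.standard_normal((2, 20))
U2, V2 = rng.standard_normal((20, 2)), rng.standard_normal((2, 20))
M1, M2 = U1 @ V1, U2 @ V2                       # two rank-2 matrices
mu, nu = np.linalg.norm(M1, 2), np.linalg.norm(M2, 2)
theta = np.arccos(min(cos_smallest_principal_angle(U1, U2), 1.0))
eta = np.arccos(min(cos_smallest_principal_angle(V1.T, V2.T), 1.0))
print(np.linalg.norm(M1 + M2, 2)**2)   # actual ||M1 + M2||_S^2
print(Lambda2(mu, nu, theta, eta))     # Lemma 2 bound (should be >=)
print((mu + nu)**2)                    # triangle-inequality bound
```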
Similar to Example 1, we consider the following example of the two-term model under normality.

Example 2. Consider the two-term model (24). Suppose that

$\lambda_k A_k \otimes B_k = \lambda_{k0}\, \tilde{A}_k \otimes \tilde{B}_k / 2^{(M+N)/2}, \qquad k = 1, 2,$

where all of the five matrices $\tilde{A}_k$, $\tilde{B}_k$ and $E$ are independent, each consisting of IID standard normal entries. To translate it back into the form of (24), we let $A_k = \tilde{A}_k/\|\tilde{A}_k\|_F$, $B_k = \tilde{B}_k/\|\tilde{B}_k\|_F$, and $\lambda_k = \lambda_{k0} \|\tilde{A}_k\|_F \|\tilde{B}_k\|_F / 2^{(M+N)/2}$.

For Example 2, it turns out that with probability tending to one, $\xi$ is close to 0 and the representation gaps $\psi_1^2$ and $\psi_2^2$ are close to 1/2 (due to Corollary 2). As an immediate consequence, Theorem 5 yields an information criterion gap of the size

$\alpha = \ln\!\left( 1 + \frac{\lambda_{10}^2 - \lambda_{20}^2}{2(\sigma^2 + \lambda_{20}^2)} \right).$

However, by a refined analysis of Assumption 5 under the normality of Example 2, we are able to prove the following improved result.

Corollary 4. Consider Example 2. Under Assumptions 1 and 2, Theorem 5 holds with the information criterion gap

$\alpha = \ln\!\left( 1 + \frac{\lambda_{10}^2}{2(\sigma^2 + \lambda_{20}^2)} \right).$

6. Examples

We illustrate the performance of the estimation and configuration selection procedures through simulation studies in Section 6.1, and image examples in Section 6.2.

6.1 Simulations

We design two simulation studies: the first one on the performance of the estimation procedure introduced in Section 2.3, and the second one on the configuration selection proposed in Section 3. Many implications of the theoretical results in Section 4 surface from the outcomes of the numerical studies.

6.1.1 Estimation with Known Configuration

We first consider the performance of the estimators of $\lambda$, $A$ and $B$ given in (7), when the true configuration $(m_0, n_0)$ is known. Throughout this subsection the simulations are based on model (4) with $m_0 = 5$, $n_0 = 5$, $M = 10$, $N = 10$ and $\sigma = 1$. The model (4) after the rearrangement under the true configuration becomes

$\mathcal{R}_{m_0,n_0}[Y] = \lambda \operatorname{vec}(A)[\operatorname{vec}(B)]^\top + \sigma 2^{-(M+N)/2} \mathcal{R}_{m_0,n_0}[E],$

where $\operatorname{vec}(A) \in \mathbb{R}^{2^{m_0+n_0}}$ and $\operatorname{vec}(B) \in \mathbb{R}^{2^{m_0'+n_0'}}$ are unit vectors. Without loss of generality, set $\operatorname{vec}(A) = (1, 0, \ldots, 0)^\top$ and $\operatorname{vec}(B) = (1, 0, \ldots, 0)^\top$. In this experiment, the noise level is fixed at $\sigma = 1$, so the signal-to-noise ratio is controlled by $\lambda$, which takes values from the set $\{e^1, e^2, \ldots, e^{16}\}$. For each value of $\lambda$, we calculate the errors of the corresponding estimators $\hat{\lambda}$, $\hat{A}$ and $\hat{B}$ by

$\ln\!\left( \frac{\hat{\lambda}}{\lambda} - 1 \right)^2 \quad \text{and} \quad \ln \|\hat{A} - A\|_F^2 + \ln \|\hat{B} - B\|_F^2.$

The errors based on 20 repetitions are reported in Figure 2.

Figure 2: Boxplots of the errors in $\hat{\lambda}$, $\hat{A}$ and $\hat{B}$ against the signal-to-noise ratio.

Figure 2 displays an interesting linear pattern: as the signal-to-noise ratio increases, $\ln(\hat{\lambda}/\lambda - 1)^2$ is approximately linear in $\ln \lambda$ with a slope around $-2$, and so is the error $\ln(\|\hat{A} - A\|_F^2\, \|\hat{B} - B\|_F^2)$ for the matrix estimators. We note that this pattern is consistent with Theorem 1, which asserts that $|\hat{\lambda}/\lambda - 1| = O_p(r_0\sigma/\lambda)$ and $\|\hat{A} - A\|_F \|\hat{B} - B\|_F = O_p(r_0^2\sigma^2/\lambda^2)$, since $r_0$ defined in Theorem 1 remains a constant as we vary the signal strength $\lambda$ in the simulation.

6.1.2 Configuration Selection

We now demonstrate the performance of the information criterion based procedure for selecting the configuration. Two criteria will be considered: MSE (when $\kappa = 0$) and AIC (when $\kappa = 2$). Corresponding to the one- and multi-term models considered in Sections 4 and 5, we carry out two experiments respectively.

Experiment 1: One-term KoPA model. The simulation is based on model (4). Two configurations are considered: (i) $M = N = 9$, $m_0 = 4$, $n_0 = 4$, and (ii) $M = N = 10$, $m_0 = 5$, $n_0 = 4$.
Similar to Section 6.1.1, the noise level is fixed at $\sigma = 1$, so the signal-to-noise ratio is controlled by $\lambda$. To control the representation gap $\psi^2$, we construct the matrices $A$ and $B$ from four matrices $D_1, \ldots, D_4$ through a mixing parameter $\varphi$, where $\operatorname{vec}(D_i)$, $i = 1, 2, 3, 4$, are independent random unit vectors such that $\operatorname{vec}(D_1)$ and $\operatorname{vec}(D_2)$ are orthogonal, and so are $\operatorname{vec}(D_3)$ and $\operatorname{vec}(D_4)$. In the experiment, five values of $\varphi^2$ are considered: $\varphi^2 \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$. We remark that this construction controls the representation gaps for configurations $(1, 0)$ and $(m_0 + 1, n_0)$ at exactly $\varphi^2$, and the representation gaps for configurations with $m + n \in \{1, M + N - 1\}$ (close to the trivial configurations) or $|m - m_0| + |n - n_0| = 1$ (close to the true configuration) at roughly 0.5. Consequently, when $\varphi^2 = 0.1, 0.2, 0.3, 0.4$, the overall representation gap $\psi^2$ is at the desired level $\varphi^2$ with high probability. But when $\varphi^2 = 0.5$, the representation gap $\psi^2$ can be slightly smaller than 0.5.

In Figure 3, we plot the empirical frequencies of correct configuration selection, out of 100 repetitions, against the signal-to-noise ratio $\lambda/\sigma$. Note that the x-axis scale in Subfigures 3a and 3b is different from that in 3c and 3d. The performances of both MSE ($\kappa = 0$) and AIC ($\kappa = 2$) are illustrated. BIC ($\kappa = (M + N) \ln 2$) has a very similar performance to AIC, and is not reported here. For extremely weak signal-to-noise ratios, $\lambda \le 0.03$, neither MSE nor AIC is able to select the true configuration with a high probability, for either configuration. This does not contradict Theorem 3: when the signal is very weak, larger dimensions of the observed matrix $Y$ are required for the consistency. As the signal-to-noise ratio increases from 0.01 to 0.13, the probability that the true configuration is selected increases gradually and eventually gets very close to one for AIC, as shown in Figures 3a and 3b. We also note that the performance gets better as the representation gap $\psi^2$ increases. These observations echo Theorem 2, which shows that AIC (with $\kappa = 2 > 2 \ln 2$) only requires the minimal condition $(\lambda/\sigma)^2 \psi^2 > 0$ to achieve consistency, and the separation gap of AIC is a monotone function of $(\lambda/\sigma)^2 \psi^2$.

On the other hand, the performance of MSE exhibits a phase transition: it only starts to select the true configuration with a decent probability when the signal-to-noise ratio $\lambda/\sigma$ exceeds a certain threshold. The theoretical asymptotic threshold for MSE is $\lambda/\sigma \ge \sqrt{1/(2\psi^2)}$, as discussed in Remark 7. For the values $\psi^2 \in \{0.5, 0.4, 0.3, 0.2, 0.1\}$ used in this simulation, the corresponding thresholds for $\lambda/\sigma$ are $\{1, 1.12, 1.29, 1.58, 2.24\}$, which can be clearly visualized in Figures 3c and 3d. Comparing Figure 3a with Figure 3b, we observe that the empirical frequency curve increases from 0 to 100 much faster when the matrices are larger. This is consistent with Theorem 2, which shows that the probability of correct configuration selection approaches 1 exponentially fast.

Figure 3: The empirical frequencies of correct configuration selection out of 100 repetitions. (a) $M = N = 9$, AIC; (b) $M = N = 10$, AIC; (c) $M = N = 9$, MSE; (d) $M = N = 10$, MSE.

Experiment 2: Two-term KoPA model. In the second experiment, we consider the two-term KoPA model in (24), where the $A_k$ and $B_k$ are generated under the random scheme in Example 2 such that $\psi_1^2 \approx 1/2$ and $\psi_2^2 \approx 1/2$.
According to Theorem 5, besides the signal-to-noise ratio $\lambda_1/\sigma$, the relative strength of the second term $\lambda_2/\lambda_1$ (for the random scheme adopted in this experiment, see Corollary 4) affects the configuration selection as well. In this simulation, we fix the configurations to $M = N = 9$, $(m_0, n_0) = (4, 4)$, and consider four different relative strengths of the second term, $\lambda_2^2/\lambda_1^2 \in \{0.3, 0.4, 0.5, 0.6\}$. Similar to Experiment 1, we report the empirical frequencies of correct configuration selection by MSE and AIC, out of 100 repetitions, as a function of the signal-to-noise ratio $\lambda_1/\sigma$ in Figure 4. Figure 4a shows that the performance of AIC is insensitive to the ratio $\lambda_2^2/\lambda_1^2$ over the experimented range. On the contrary, it is seen from Figure 4b that MSE performs better when the ratio $\lambda_2^2/\lambda_1^2$ gets smaller, which is consistent with Corollary 4.

Figure 4: The empirical frequencies of correct configuration selection out of 100 repetitions, for the two-term model. (a) $M = N = 9$, AIC; (b) $M = N = 9$, MSE.

6.2 Analysis of Image Examples

6.2.1 The Cameraman's Image

In this section we revisit and analyze the cameraman's image introduced in Section 1. The original image, denoted by $Y_0$, has $512 \times 512$ pixels. Each entry of $Y_0$ is a real number between 0 and 1, where 0 codes black and 1 indicates white. The grayscale cameraman image $Y_0$ is displayed in Figure 1. Our analysis will be based on the de-meaned version $Y$ of the original image $Y_0$. We demonstrate how well the image $Y$ can be approximated by a Kronecker product or the sum of a few Kronecker products, and make comparisons with the low rank approximations given by the SVD.

We first consider the configuration selection by MSE, AIC and BIC on the original image $Y$. Figure 5 plots the heat maps of the information criterion $\mathrm{IC}_\kappa(m, n)$ for all candidate configurations in the set $\mathcal{C} = \{(m, n) : 0 \le m, n \le 9\} \setminus \{(0, 0), (9, 9)\}$, where the two corner configurations are always excluded from consideration. Since darker cells correspond to smaller values of the information criteria, we see that MSE and AIC select the configuration $(8, 9)$, and BIC selects $(6, 7)$. We also observe an overall pattern in Figure 5: configurations with larger $(m, n)$ values are preferred over those with smaller $(m, n)$. Note that the Kronecker product does not commute, and with configuration $(m, n)$ the product is a $2^m \times 2^n$ block matrix, with each block of size $2^{9-m} \times 2^{9-n}$. Real images usually exhibit locality of pixels, in the sense that nearby pixels tend to have similar colors. Therefore, it can be understood that larger values of $m$ and $n$ are preferred, since they are better suited to capture the locality. Indeed, for the cameraman's image, the configuration $(8, 9)$ accounts for 99.50% of the total variation of $Y$. The penalty on the number of parameters in AIC is not strong enough to offset the closer approximation given by the configuration $(8, 9)$. With a stronger penalty term, BIC selects a configuration that is closer to the center of the configuration space, involving a much smaller number of parameters.

Figure 5: Information criteria for the cameraman's image. (Left) MSE; (Mid) AIC; (Right) BIC. Darker color corresponds to lower IC value.

From the perspective of image compression, KoPA is more flexible than the low rank approximation, by allowing a choice of the configuration, and hence a choice of the compression rate.
To compare their performances, we use the ratio $\|\hat Y\|_F^2 / \|Y\|_F^2$ to measure how close the approximation $\hat Y$ is to the original image $Y$. In Figure 6, these ratios are plotted against the numbers of parameters for the KPD, marked by "+" on the solid line. Since the number of parameters involved in the Kronecker product with configuration $(m, n)$ is $\eta = 2^{m+n} + 2^{M+N-m-n}$, the configurations $\{(m, n): m+n = c\}$ for any given $0 < c < M+N$ share the same number of parameters; among them, we only plot the one with the largest $\|\hat Y\|_F^2/\|Y\|_F^2$. On the other hand, each cross stands for a rank-$k$ approximation of $Y$, whose value on the horizontal axis is the number of parameters $\sum_{j=1}^{k}(2^M + 2^N - 2j + 1)$, for $k = 1, \ldots, 2^{M \wedge N}$. According to Figure 6, there always exists a one-term Kronecker product which provides a better approximation of the original cameraman's image than the best low rank approximation involving the same number of parameters.

Figure 6: Percentage of variance explained against the number of parameters, for KoPA of all configurations, and for low rank approximations of all ranks.

We also consider denoising images corrupted by additive Gaussian white noise, $Y_\sigma = Y + \sigma E$, where $E$ is a matrix with IID standard normal entries. We experiment with three levels of corruption: $\sigma = 0.1, 0.2, 0.3$. Examples of the corrupted images with different $\sigma$ are shown in Figure 7, with the values rescaled to $[0, 1]$ for plotting.

Figure 7: Noisy cameraman's images. (Left) σ = 0.1; (Mid) σ = 0.2; (Right) σ = 0.3.

For the corrupted images, the information criteria $\mathrm{IC}_\kappa(m, n)$ are calculated, and the corresponding heat maps are plotted in Figure 8. With added noise, AIC and BIC tend to select configurations in the middle of the configuration space.

Figure 8: Heat maps of the three information criteria (MSE, AIC, BIC) for the cameraman's images with different noise levels. Darker color means lower IC value.

Now we consider the multi-term Kronecker approximation. Following the discussion in Section 5, for each of the three corrupted images $Y_\sigma$ we use the configuration selected by BIC in Figure 8: specifically, the configurations $(6, 6)$, $(5, 6)$ and $(5, 5)$ are selected when $\sigma = 0.1$, $0.2$ and $0.3$, respectively. A two-term Kronecker product model (24) is then fitted under the selected configuration. The fitted images are plotted in the upper panel of Figure 9. Each of them is compared with the image obtained by the low rank approximation involving a similar number of parameters as the two-term KoPA. From Figure 9, it is quite evident that the details can easily be recognized in the images reconstructed by the two-term KoPA, but can hardly be perceived in those given by the low rank approximation.

Figure 9: The fitted images given by the multi-term KoPA, and the SVD approximations with similar numbers of parameters, for σ = 0.1, σ = 0.2 and σ = 0.3.

Finally, we examine the reconstruction error defined by $\|Y - \hat Y\|_F^2 / \|Y\|_F^2$, where $Y$ is the original image and $\hat Y$ is the one reconstructed from $Y_\sigma$. For each of the three noisy images, we continue to use the configuration selected by BIC. With the configuration fixed, we keep increasing the number of terms in the KoPA until $Y_\sigma$ is fully fitted, and plot the corresponding reconstruction error against the number of parameters in Figure 10. It has the familiar U shape, showing the trade-off between estimation bias and variance. A similar curve is given for the low rank approximations, exhausting all possible ranks. From Figure 10, it is seen that the multi-term KoPA consistently outperforms the low rank approximation at any given number of parameters. Furthermore, the minimum reconstruction error that KoPA can reach is always smaller than that given by the low rank approximation.

Figure 10: Reconstruction error against the number of parameters for KoPA and low rank approximations. The three panels from left to right correspond to σ = 0.1, σ = 0.2 and σ = 0.3, respectively.
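At a fixed configuration, the multi-term fits above are rank-$K$ truncations of the rearranged matrix. A minimal sketch, again reusing the hypothetical `rearrange` helper; sweeping $K$ produces reconstruction-error curves of the kind shown in Figure 10.

```python
import numpy as np

def kopa_k_terms(Y, m, n, M, N, K):
    # K-term KoPA at configuration (m, n): rank-K truncation of the
    # rearranged matrix, mapped back term by term (K at most the smaller
    # dimension of the rearranged matrix)
    U, s, Vt = np.linalg.svd(rearrange(Y, m, n, M, N), full_matrices=False)
    Yhat = np.zeros_like(Y)
    for k in range(K):
        A_k = U[:, k].reshape(2**m, 2**n)
        B_k = Vt[k].reshape(2**(M - m), 2**(N - n))
        Yhat += s[k] * np.kron(A_k, B_k)
    return Yhat

# reconstruction error curve for, e.g., sigma = 0.1 and BIC's choice (6, 6):
# errs = [np.linalg.norm(Y - kopa_k_terms(Y_sigma, 6, 6, 9, 9, K))**2
#         / np.linalg.norm(Y)**2 for K in range(1, 51)]
```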
6.2.2 More images

To assess the performance of the KoPA model in image denoising, we repeat the experiment of Section 6.2.1 on a larger set of test images. The 10 test images, printed in Figure 11, are collected from Image Processing Place1 and the Waterloo Image Repository2. Each of the 10 test images is a $512 \times 512$ gray-scale matrix, the same as the cameraman's image. We corrupt each test image with additive Gaussian noise whose amplitude is 0.5 times the standard deviation of all its pixel values: $Y_\sigma = Y + 0.5\,\mathrm{std}(Y)\,E$. We compare five methods of denoising these images: the one-term SVD and KoPA models, the multi-term SVD and KoPA models, and the image denoising algorithm based on total variation regularization (Chambolle, 2004). Since determining the number of terms in multi-term models is beyond the scope of this article, the numbers of terms in the multi-term models are chosen to minimize the reconstruction error. The performance of the five approaches on the ten images is reported in Table 1. For each image, the configuration of the KoPA is selected by BIC ($\kappa = 18\ln 2$).

1. http://www.imageprocessingplace.com/root_files_V3/image_databases.htm
2. http://links.uwaterloo.ca/Repository.html

Figure 11: List of test images.

image       | SVD    | KoPA   | mSVD   | mKoPA  | TVR
boat        | 0.4709 | 0.1757 | 0.0853 | 0.0613 | 0.0356
cameraman   | 0.5446 | 0.1337 | 0.0644 | 0.0399 | 0.0294
goldhill    | 0.4632 | 0.1391 | 0.0759 | 0.0568 | 0.0363
jetplane    | 0.7347 | 0.1853 | 0.0866 | 0.0596 | 0.0302
lake        | 0.5425 | 0.1287 | 0.0825 | 0.0539 | 0.0308
livingroom  | 0.6747 | 0.2055 | 0.0995 | 0.0811 | 0.0589
mandril     | 0.6949 | 0.3557 | 0.1471 | 0.0889 | 0.0739
peppers     | 0.7394 | 0.1075 | 0.0734 | 0.0445 | 0.0224
pirate      | 0.7746 | 0.1533 | 0.1018 | 0.0686 | 0.0413
walkbridge  | 0.6617 | 0.2085 | 0.1263 | 0.0925 | 0.0593

Table 1: Reconstruction errors of the one-term SVD, one-term KoPA, multi-term SVD (mSVD), multi-term KoPA (mKoPA) and total variation regularization (TVR) on the ten test images.

From Table 1, the KoPA-based methods outperform the SVD-based approaches, which is not surprising since SVD corresponds to a special configuration in the KoPA models. On the other hand, the image denoising based on KoPA (and multi-term KoPA) is close to the TVR method, though the latter does have a superior performance. We note that KoPA and TVR are not directly comparable. An image is a special type of matrix data, whose entries usually possess a certain continuity in values. TVR fully utilizes this continuity by imposing regularization on the total variation, while SVD and KoPA do not. The difference can be seen from Figure 12 as well: TVR recovers the smooth region (the mandrill's nose) well, while the multi-term KoPA model retains more details in the non-smooth regions (the mandrill's fur and beard). Finally, we remark that the performance of the KoPA approach on image analysis can possibly be improved by adding a similar penalty term on the smoothness of $B$.
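The corruption scheme and the BIC selection for this experiment can be sketched as follows. The sketch assumes an image array `img` in $[0, 1]$ and the hypothetical helpers `ic` and `cands` from the earlier sketches; the names are ours.

```python
import numpy as np

def corrupt(img, rng, scale=0.5):
    # additive Gaussian noise with amplitude 0.5 * std of the pixel values
    return img + scale * img.std() * rng.standard_normal(img.shape)

# Y_sigma = corrupt(img - img.mean(), np.random.default_rng(0))
# BIC selection over the candidate set, kappa = 18 log 2 since D = 2^18:
# m, n = min(cands, key=lambda c: ic(Y_sigma, c[0], c[1], 9, 9, 18 * np.log(2)))
```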
Figure 12: (Left) The mandrill image; (Mid) the image recovered by the multi-term KoPA model; (Right) the image recovered by total variation regularization.

7. Conclusion and Discussions

In this article, we propose to use the Kronecker product approximation as an alternative to the low rank approximation of large matrices. Compared with the low rank approximation, KoPA is more flexible, because any configuration of the Kronecker product can potentially be used, leading to different levels of approximation and compression. To select the configuration, we propose to use the extended information criterion, which includes MSE, AIC and BIC as special cases. We establish the asymptotic consistency of the configuration selection procedure, and use an example with a random Kronecker product to illustrate how the technical assumptions are fulfilled. The extension to the multi-term Kronecker product model is also investigated. Both the simulations and the analysis of image examples demonstrate that KoPA can be superior to the low rank approximation, in the sense that it can give a closer approximation of the original matrix/image with a higher compression rate.

We conclude with a discussion of future directions. First of all, the Kronecker product model (4) is not permutation-invariant. In other words, after a permutation of columns and rows, the signal part of the matrix $Y$ may or may not remain a Kronecker product. When the columns and rows have a natural order, as in image data and in spatial-temporal data, this is not an issue. But in general, especially when the data are allowed to be shuffled, a pre-processing step that orders the rows and columns should be investigated before conducting the KoPA analysis. Another extension is to consider a multi-term model in which each term can have its own configuration. This approach certainly allows greater flexibility, but also poses new challenges, not only for the configuration and order selection, but for the estimation and algorithms as well. It would be ideal if a natural and interpretable procedure for the estimation, order determination, and configuration selection could be developed, with theoretical guarantees. Lastly, cross validation may also be used for configuration selection. Note that we are working with one single observation (the matrix $Y$). It is possible to randomly remove one or a set of elements of $Y$, and to evaluate the performance of a configuration based on the prediction accuracy for these elements. This approach requires a matrix completion procedure based on the Kronecker product approximation, which can be built on a low rank matrix completion procedure applied to the rearranged matrix. We are currently studying such a procedure, though its theoretical properties require further investigation.

Acknowledgments

Chen's research is supported in part by National Science Foundation grants DMS-1737857, IIS-1741390, CCF-1934924, DMS-2027855 and DMS-2052949. Xiao's research is supported in part by National Science Foundation grants DMS-1454817, DMS-2027855 and DMS-2052949, and by a research grant from NEC Labs America. The authors thank an Associate Editor and two referees for their insightful comments, which have led to significant improvement of the manuscript.

Appendix A. Proof of Theorem 1 and Corollary 1

Without loss of generality, we assume $\sigma = 1$.
Noticing that
$$\hat\lambda = \|\mathcal{R}_{m_0,n_0}[Y]\|_S = \big\|\lambda\,\mathrm{vec}(A)\mathrm{vec}(B)^\top + \sigma 2^{-(M+N)/2}\mathcal{R}_{m_0,n_0}[E]\big\|_S,$$
by the triangle inequality, we have
$$\big|\hat\lambda - \|\lambda\,\mathrm{vec}(A)\mathrm{vec}(B)^\top\|_S\big| \leq \sigma 2^{-(M+N)/2}\|\mathcal{R}_{m_0,n_0}[E]\|_S,$$
where $\|\lambda\,\mathrm{vec}(A)\mathrm{vec}(B)^\top\|_S = \lambda$. The following bound for $\|\mathcal{R}_{m_0,n_0}[E]\|_S$ can be obtained from the concentration inequality in Vershynin (2010):
$$P\big(\|\mathcal{R}_{m_0,n_0}[E]\|_S \geq 2^{(m_0+n_0)/2} + 2^{(M+N-m_0-n_0)/2} + t\big) \leq e^{-t^2/2}.$$
Therefore, $\|\mathcal{R}_{m_0,n_0}[E]\|_S = s_0 + O_p(1)$, where $s_0 = 2^{(m_0+n_0)/2} + 2^{(M+N-m_0-n_0)/2}$, and
$$|\hat\lambda - \lambda| \leq 2^{-(M+N)/2}\big(s_0 + O_p(1)\big) = r_0 + O_p(2^{-(M+N)/2}),$$
which yields $\hat\lambda - \lambda = O_p(r_0)$. The bounds for $\hat A$ and $\hat B$ correspond to the error bounds in estimating the left and right singular vectors of $\mathcal{R}_{m_0,n_0}[Y]$, which are a direct consequence of the analysis in Wedin (1972), by observing that
$$\min_{s = \pm 1}\|s\hat A - A\|_F^2 = \min_{s = \pm 1}\|s\,\mathrm{vec}(\hat A) - \mathrm{vec}(A)\|_2^2 = 2\sin^2\Theta\big(\mathrm{vec}(\hat A), \mathrm{vec}(A)\big).$$
A sharper bound is provided in Cai and Zhang (2018). Since the above analysis holds for any fixed value of $\lambda$, Corollary 1 follows immediately.

Appendix B. Proof of Theorem 2

We first state and prove several technical lemmas.

Lemma 3. Suppose $a_n > 0$, $a_n \to 0$, and $x_n = O_p(1)$ is a sequence of continuous random variables with density functions $p_n$ satisfying (i) $E|x_n| \leq C$ for some constant $C$ for every $n$; (ii) $1 + a_nx_n > 0$ almost surely; (iii) $a_n^{-2}\sup_{x \leq -1/(2a_n)} p_n(x) \to 0$. Then we have $E\ln(1 + a_nx_n) = O(a_n)$.

Proof. For the positive part, we have
$$0 \leq \int_0^{\infty} \ln(1 + a_nt)\,p_n(t)\,dt \leq \int_0^{\infty} a_nt\,p_n(t)\,dt \leq a_nE|x_n| \leq Ca_n.$$
For the negative part, we have
$$\Big|\int_{-1/a_n}^{0} \ln(1 + a_nt)\,p_n(t)\,dt\Big| \leq \Big[\sup_{t \leq -1/(2a_n)} p_n(t)\Big]\int_{-1/a_n}^{-1/(2a_n)} |\ln(1 + a_nt)|\,dt + \int_{-1/(2a_n)}^{0} 2a_n|t|\,p_n(t)\,dt \leq a_n^{-1}\sup_{t \leq -1/(2a_n)} p_n(t) + 2a_nE|x_n| = o(a_n) + 2Ca_n.$$
Hence, $E\ln(1 + a_nx_n) = E_+ + E_- = O(a_n)$.

The conditions in Lemma 3 are easy to verify in the subsequent proofs. Condition (ii) ensures that the logarithm is well defined on the whole support. Condition (i) is satisfied when $x_n$ converges in mean to a random variable with finite expectation. Condition (iii) controls the left tails of the densities, and is easily fulfilled when the tails are exponentially thin.

Lemma 4. Let $X$ be an arbitrary $P \times Q$ real matrix with $P \leq Q$, and let $E$ be a $P \times Q$ matrix with IID standard Gaussian entries. Then we have
$$E\|X + E\|_S^2 \leq \|X\|_S^2 + (\sqrt{P} + \sqrt{Q})^2 + 4\sqrt{P}\,\|X\|_S + \sqrt{2\pi}(\sqrt{P} + \sqrt{Q}) + 2 =: U^2.$$
Furthermore, the departure from the expectation is sub-Gaussian: for any positive $t$,
$$P\big(\|X + E\|_S \geq U + t\big) \leq e^{-t^2/2}.$$

Proof. Without loss of generality, we assume $X = [X_1, X_2]$, where $X_1 \in \mathbb{R}^{P\times P}$ is a diagonal matrix and $X_2 \in \mathbb{R}^{P\times(Q-P)}$ is zero. Such a form of $X$ can always be achieved by multiplying $X$ and $E$ from the left and right by orthogonal matrices, without changing the distribution of $E$. Partition $E = [E_1, E_2]$ accordingly, with $E_1 \in \mathbb{R}^{P\times P}$ and $E_2 \in \mathbb{R}^{P\times(Q-P)}$. Then
$$\|X + E\|_S^2 = \sup_{u \in \mathbb{R}^P,\,\|u\| = 1} u^\top(X + E)(X + E)^\top u = \sup_{\|u\| = 1}\big(u^\top XX^\top u + u^\top EE^\top u + 2u^\top XE^\top u\big) \leq \|X\|_S^2 + \|E\|_S^2 + 2\|X\|_S\|E_1\|_S.$$
According to Vershynin (2010), we have $E\|E_1\|_S \leq 2\sqrt{P}$ and $P(\|E\|_S \geq \sqrt{P} + \sqrt{Q} + t) \leq e^{-t^2/2}$, so that
$$E\|E\|_S^2 = \int_0^{\infty} P\big(\|E\|_S > t\big)\,2t\,dt \leq (\sqrt{P} + \sqrt{Q})^2 + \sqrt{2\pi}(\sqrt{P} + \sqrt{Q}) + 2.$$
Hence, $E\|X + E\|_S^2 \leq U^2$. Since for any fixed $X$, $\|X + E\|_S$ is a function of $E$ with Lipschitz norm 1, the Gaussian concentration inequality gives, for any positive $t$, $P(\|X + E\|_S \geq U + t) \leq e^{-t^2/2}$.
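The Gaussian spectral-norm facts used above are easy to verify numerically. The following Monte Carlo sketch is ours, with illustrative dimensions matching a rearranged noise matrix at a configuration with $m + n = 5$ and $M + N = 18$:

```python
import numpy as np

# E ||E||_S <= sqrt(P) + sqrt(Q) for a P x Q standard Gaussian matrix,
# with sub-Gaussian fluctuations (Vershynin, 2010); quick empirical check
rng = np.random.default_rng(1)
P, Q = 2**5, 2**13
norms = [np.linalg.norm(rng.standard_normal((P, Q)), 2) for _ in range(100)]
print(np.mean(norms), np.sqrt(P) + np.sqrt(Q))  # the mean sits below the bound
```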
We rewrite the information criterion as
$$\mathrm{IC}_\kappa(m, n) = D\big[\ln\|Y - \hat Y^{(m,n)}\|_F^2 + \kappa r_{m,n}^2 - 2\kappa D^{-1/2}\big],$$
where $D = 2^{M+N}$ and $r_{m,n} = 2^{-(m+n)/2} + 2^{-(m'+n')/2}$. The constant term $2\kappa D^{-1/2}$ does not depend on the configuration $(m, n)$, and is therefore ignored in the subsequent proofs. For simplicity, we define the corresponding expected information criterion
$$\mathrm{EIC}_\kappa(m, n) = D\big[E\ln\|Y - \hat Y^{(m,n)}\|_F^2 + \kappa r_{m,n}^2\big].$$
The difference in the expected information criterion between a wrong configuration and the true configuration is of central interest, so we define
$$\Delta\mathrm{EIC}_\kappa(m, n) = \mathrm{EIC}_\kappa(m, n) - \mathrm{EIC}_\kappa(m_0, n_0).$$
Under the true configuration $(m_0, n_0)$, we have
$$E\|Y - \hat Y^{(m_0,n_0)}\|_F^2 \leq E\|Y - \lambda A\otimes B\|_F^2 = \sigma^2 D^{-1}E\|E\|_F^2 = \sigma^2.$$
Therefore,
$$\mathrm{EIC}_\kappa(m_0, n_0) \leq D\big[\ln E\|Y - \hat Y^{(m_0,n_0)}\|_F^2 + \kappa r_0^2\big] \leq D\big(\ln\sigma^2 + \kappa r_0^2\big), \quad (29)$$
where $r_0 = r_{m_0,n_0}$. Define
$$\hat\lambda^{(m,n)} := \|\mathcal{R}_{m,n}[Y]\|_S = \big\|\lambda\mathcal{R}_{m,n}[A\otimes B] + \sigma D^{-1/2}\mathcal{R}_{m,n}[E]\big\|_S. \quad (30)$$
To calculate the information criterion for the wrong configurations, we use the equality
$$\|Y - \hat Y^{(m,n)}\|_F^2 = \|Y\|_F^2 - \big[\hat\lambda^{(m,n)}\big]^2.$$
Notice that $\|Y\|_F^2 = \|\lambda A\otimes B\|_F^2 + \sigma^2D^{-1}\|E\|_F^2 + 2\lambda\sigma D^{-1/2}\mathrm{tr}[(A\otimes B)E^\top]$, where $\|\lambda A\otimes B\|_F^2 = \lambda^2$, $\sigma^2D^{-1}\|E\|_F^2 = \sigma^2(1 + O_p(D^{-1/2}))$, and $\mathrm{tr}[(A\otimes B)E^\top]$ follows a standard normal distribution. We therefore have
$$\|Y\|_F^2 = \lambda^2 + \sigma^2 + R_1, \quad (31)$$
where $R_1 = O_p\big((\sigma^2 + \lambda\sigma)D^{-1/2}\big)$. For a wrong configuration $(m, n) \in \mathcal{W}$, without loss of generality we assume $m + n \leq (M+N)/2$. According to Lemma 4, we have the upper bound for (30):
$$\big[\hat\lambda^{(m,n)}\big]^2 \leq \lambda^2\phi^2 + \sigma^2r_{m,n}^2 + 4\lambda\phi\sigma 2^{(m+n)/2}D^{-1/2} + O_p\big((\lambda\sigma + \sigma^2)D^{-1/2}\big) \leq \lambda^2\phi^2 + \sigma^2r_{m,n}^2 + 4\lambda\sigma D^{-1/4} + O_p\big((\lambda\sigma + \sigma^2)D^{-1/2}\big). \quad (32)$$
It follows that
$$\|Y - \hat Y^{(m,n)}\|_F^2 \geq \lambda^2(1 - \phi^2) + \sigma^2(1 - r_{m,n}^2) - 4\lambda\sigma D^{-1/4} + O_p\big((\lambda\sigma + \sigma^2)D^{-1/2}\big),$$
where the last two terms are of smaller order by Assumption 3. Therefore,
$$\Delta\mathrm{EIC}_\kappa(m, n) \geq D\big[\ln\big(\lambda^2\psi^2 + \sigma^2(1 - r_{m,n}^2)\big) - \ln\sigma^2 + \kappa\big(r_{m,n}^2 - r_0^2\big)\big] - o(D). \quad (33)$$
Here Lemma 3 is applied, since the stochastic term in (32) has an exponential tail bound. Notice that the right hand side of (33) is either a monotone increasing function or a uni-modal function of $r_{m,n}^2$ on $[4D^{-1/2}, 1/2]$, so its minimum is attained on the boundary. When $r_{m,n}^2 = 1/2$, (33) becomes
$$\Delta\mathrm{EIC}_\kappa(m, n) \geq D\big[\ln\big(\tfrac12 + \lambda^2\psi^2/\sigma^2\big) + \kappa/2\big] - o(D), \quad (34)$$
and when $r_{m,n}^2 = 4D^{-1/2}$, (33) becomes
$$\Delta\mathrm{EIC}_\kappa(m, n) \geq D\ln\big(1 + \lambda^2\psi^2/\sigma^2\big) - o(D), \quad (35)$$
where the vanishing terms involving $\kappa r_0^2$ and $\kappa D^{-1/2}$ are absorbed into the remainders. In conclusion, for any wrong configuration $(m, n) \in \mathcal{W}$, we have
$$\Delta\mathrm{EIC}_\kappa(m, n) \geq D\alpha - o(D), \qquad \alpha = \min\Big\{\ln\big(1 + \lambda^2\psi^2/\sigma^2\big),\ \ln\big(\tfrac12 + \lambda^2\psi^2/\sigma^2\big) + \kappa/2\Big\}. \quad (36)$$
When $\kappa \geq 2\ln 2$, $\alpha$ takes the first value in the preceding equation. The assumptions imposed in Theorem 2 ensure that the leading term $D\alpha$ in (36) dominates the other terms, so that the minimum of $\Delta\mathrm{EIC}$ over the wrong configurations is strictly positive.

We now address Remark 7. It is possible to use the MSE alone to select the configuration, which corresponds to $\kappa = 0$. This requires the stronger signal-to-noise condition $\lambda^2\psi^2/\sigma^2 > 1/2$, so that the leading constant $\alpha$ in (36) remains positive, and Theorem 2 continues to hold.

Remark 11. Note that the upper bound used in (32) is quite conservative, because the maxima of $\phi$ and $2^{(m+n)/2}$ over $\mathcal{W}$ are taken separately. It leads to a simple form of Assumption 3, which is not as sharp as possible. If we define $\phi_{(m,n)} = \|\mathcal{R}_{m,n}[A\otimes B]\|_S$, then the condition (16) in Assumption 3 can be relaxed to
$$\lim_{M+N\to\infty}\ \inf_{(m,n)\in\mathcal{W}}\ \big(2^{(m+n)/2} + 2^{(m'+n')/2}\big)\,\frac{\lambda}{\sigma}\,\big(1 - [\phi_{(m,n)}]^2\big) = \infty.$$
However, in the main text we choose to introduce the concept of the representation gap and to present a simpler version of Assumption 3.
Appendix C. Proof of Theorem 3

We begin with tail bounds for $\|E\|_F^2$. According to the tail bounds for $\chi^2$ random variables given in Laurent and Massart (2000), for any $t > 0$,
$$P\big(D^{-1}\|E\|_F^2 > 1 + 2D^{-1/2}t + D^{-1}t^2\big) \leq e^{-t^2/2}, \quad (37)$$
$$P\big(D^{-1}\|E\|_F^2 < 1 - 2D^{-1/2}t\big) \leq e^{-t^2/2}, \quad (38)$$
where $D = 2^{M+N}$. Therefore, at the true configuration $(m_0, n_0)$, we have
$$P\big(\|Y - \hat Y^{(m_0,n_0)}\|_F^2 > \sigma^2 + 2\sigma^2D^{-1/2}t + \sigma^2D^{-1}t^2\big) \leq P\big(\|\sigma D^{-1/2}E\|_F^2 > \sigma^2 + 2\sigma^2D^{-1/2}t + \sigma^2D^{-1}t^2\big) \leq e^{-t^2/2}. \quad (39)$$
Noticing that $\|Y\|_F^2 = \lambda^2 + \sigma^2D^{-1}\|E\|_F^2 + 2\lambda\sigma D^{-1/2}Z$, where $Z = \mathrm{tr}[(A\otimes B)E^\top]$ is a standard Gaussian random variable, by (38) we have
$$P\big(\|Y\|_F^2 < \lambda^2 + \sigma^2 - (2\sigma^2 + 2\lambda\sigma)D^{-1/2}t\big) \leq 2e^{-t^2/2}. \quad (40)$$
Now we consider the tail bound for $\hat\lambda^{(m,n)}$ at the wrong configurations. According to Lemma 4, we have
$$P\big(\hat\lambda^{(m,n)} \geq U + \sigma D^{-1/2}t\big) \leq e^{-t^2/2}, \quad (41)$$
where
$$U^2 = \lambda^2\phi^2 + \sigma^2r_{m,n}^2 + 4\lambda\phi\sigma 2^{(m+n)/2}D^{-1/2} + \sqrt{2\pi}\,\sigma^2r_{m,n}D^{-1/2} + 2\sigma^2D^{-1} < (\lambda + \sigma)^2.$$
Let $\alpha = \ln\big(1 + (\lambda/\sigma)^2\psi^2\big)$ be the positive gap constant. We have
$$P\big[\mathrm{IC}_\kappa(m_0, n_0) > \mathrm{EIC}_\kappa(m_0, n_0) + D\alpha/3\big] = P\big[\|Y - \hat Y^{(m_0,n_0)}\|_F^2 > \sigma^2e^{\alpha/3}\big] \leq \exp\big(-c_1^2D/2\big), \quad (42)$$
where $c_1^2 = e^{\alpha/3} - 1$. For any $(m, n) \in \mathcal{W}$, it holds that
$$P\big[\mathrm{IC}_\kappa(m, n) < \mathrm{EIC}_\kappa(m_0, n_0) + D\alpha/3\big] \leq P\big[\mathrm{IC}_\kappa(m, n) < \mathrm{EIC}_\kappa(m, n) - D\alpha/3\big] \leq P\big[\|Y\|_F^2 - \hat\lambda^2 < \lambda^2 + \sigma^2 - \lambda^2\phi^2 - 2h\big] \leq P\big[\|Y\|_F^2 < \lambda^2 + \sigma^2 - h\big] + P\big[\hat\lambda^2 > U^2 + h\big] \leq 2\exp\big(-c_2^2D/2\big) + \exp\big(-c_3^2D/2\big), \quad (43)$$
where we use (40) and (41) to obtain (43), $h = \tfrac12\big(1 - e^{-\alpha/3}\big)\big(\lambda^2(1 - \phi^2) + \sigma^2\big)$, $c_2 = h/(2\sigma^2 + 2\lambda\sigma)$, and $c_3$ is the solution of $\sigma^2c_3^2 + 2(\lambda + \sigma)\sigma c_3 = h$. We conclude that
$$P\Big[\mathrm{IC}_\kappa(m_0, n_0) \geq \min_{(m,n)\in\mathcal{W}} \mathrm{IC}_\kappa(m, n)\Big] \leq \sum_{(m,n)\in\mathcal{W}} P\big[\mathrm{IC}_\kappa(m_0, n_0) \geq \mathrm{IC}_\kappa(m, n)\big] \leq \sum_{(m,n)\in\mathcal{W}}\Big(P\big[\mathrm{IC}_\kappa(m_0, n_0) \geq \mathrm{EIC}_\kappa(m_0, n_0) + D\alpha/3\big] + P\big[\mathrm{IC}_\kappa(m, n) \leq \mathrm{EIC}_\kappa(m_0, n_0) + D\alpha/3\big]\Big) \leq 4(M+1)(N+1)\exp\big(-c^2D/2\big) \to 0, \quad (44)$$
where $c = \min\{c_1, c_2, c_3\}$. By calculating the orders of $c_1$, $c_2$ and $c_3$, it holds that, if $\alpha \to 0$ (or equivalently, $(\lambda/\sigma)^2\psi^2 \to 0$),
$$c^2 \asymp \frac{(\lambda/\sigma)^4\,\psi^4}{(1 + \lambda/\sigma)^2},$$
so that $c^2D$ is much greater than $\ln(MN)$ under Assumptions 1 and 3.

Appendix D. Proof of Theorem 4

The proof is very similar to the proofs of Theorems 2 and 3, so we only point out the major steps and omit the details. Condition (19) implies that $\lambda^2 = \lambda_0^2(1 + o_p(1))$ and $\psi^2 = \psi_0^2(1 + o_p(1))$. The proof of Theorem 2 goes through after replacing $\lambda^2$ and $\psi^2$ with the deterministic values $\lambda_0^2$ and $\psi_0^2$, except that an $o_p(\lambda_0^2 + \psi_0^2)$ term is added to (30). Since the additional stochastic term is negligible and has finite expectation, Theorem 2 continues to hold. The consistency follows the same lines as the proof of Theorem 3, except that the deviations $\lambda^2 - \lambda_0^2$ and $\psi^2 - \psi_0^2$ have to be incorporated into (44). Specifically, Assumption 4 implies that for any small constant $\delta > 0$, with probability larger than $1 - o(1/(MN))$, we have $\lambda^2 \geq \lambda_0^2(1 - \delta)$ and $\psi^2 \geq \psi_0^2(1 - \delta)$. The proof of Theorem 3 then goes through with $\lambda^2$ and $\psi^2$ replaced by $\lambda_0^2(1 - \delta)$ and $\psi_0^2(1 - \delta)$, once the following probability of exceptions is added to (44):
$$(M+1)(N+1)\big[P\big(\lambda^2 < \lambda_0^2(1 - \delta)\big) + P\big(\psi^2 < \psi_0^2(1 - \delta)\big)\big] = o(1),$$
which does not affect the consistency, but may reduce the convergence rate.

Appendix E. Proof of Lemma 1 and Corollary 2

Consider the complete Kronecker product decomposition of $A$ with respect to the configuration $(m\wedge m', n\wedge n', (m - m')_+, (n - n')_+)$:
$$A = \sum_{i=1}^{I} \mu_i\, C_i \otimes D_i, \quad (45)$$
where $I = 2^{(m\wedge m') + (n\wedge n')} \wedge 2^{(m-m')_+ + (n-n')_+}$ and $\mu_1 \geq \mu_2 \geq \cdots \geq \mu_I$ are the coefficients in decreasing order. The $C_i$ and $D_i$ satisfy
$$\langle C_i, C_j\rangle = \langle D_i, D_j\rangle = \delta_{i,j}, \quad (46)$$
where $\delta_{i,j}$ is the Kronecker delta, equal to one if and only if $i = j$ and zero otherwise, and $\langle A, B\rangle := \mathrm{tr}[A^\top B]$ is the trace inner product. Notice that the decomposition (45) corresponds to the singular value decomposition of $\mathcal{R}_{m\wedge m', n\wedge n'}[A]$. Therefore the singular values $\mu_1, \ldots, \mu_I$ are uniquely identifiable, and the components $C_i$, $D_i$ are identifiable when the singular values are distinct. In particular, $\mu_1 = \|\mathcal{R}_{m\wedge m', n\wedge n'}[A]\|_S$. Similarly, the KPD of $B$ with respect to the configuration $((m' - m)_+, (n' - n)_+, M - m\vee m', N - n\vee n')$ is given by
$$B = \sum_{j=1}^{J} \nu_j\, F_j \otimes G_j,$$
where $J = 2^{(m'-m)_+ + (n'-n)_+} \wedge 2^{(M - m\vee m') + (N - n\vee n')}$ and $\nu_1 = \|\mathcal{R}_{(m'-m)_+, (n'-n)_+}[B]\|_S$.
With the two KPDs of $A$ and $B$, we can rewrite $A \otimes B$ as
$$A \otimes B = \Big(\sum_{i=1}^{I}\mu_i\,C_i\otimes D_i\Big) \otimes \Big(\sum_{j=1}^{J}\nu_j\,F_j\otimes G_j\Big) = \sum_{i=1}^{I}\sum_{j=1}^{J}\mu_i\nu_j\,C_i\otimes D_i\otimes F_j\otimes G_j,$$
since the Kronecker product satisfies the distributive and associative laws. The matrix $D_i$ is $2^{(m-m')_+} \times 2^{(n-n')_+}$, and the matrix $F_j$ is $2^{(m'-m)_+} \times 2^{(n'-n)_+}$. For all possible values of $m, m', n, n'$, either one of $D_i$ and $F_j$ is a scalar, or they are both vectors; in both cases $D_i \otimes F_j = F_j \otimes D_i$. Therefore,
$$A \otimes B = \sum_{i=1}^{I}\sum_{j=1}^{J}\mu_i\nu_j\,(C_i\otimes F_j)\otimes(D_i\otimes G_j) = \sum_{i=1}^{I}\sum_{j=1}^{J}\mu_i\nu_j\,P_{ij}\otimes Q_{ij}, \quad (47)$$
where $P_{ij} := C_i\otimes F_j$ and $Q_{ij} := D_i\otimes G_j$. Notice that $P_{ij}$ is a $2^{m'} \times 2^{n'}$ matrix and $Q_{ij}$ is a $2^{M-m'} \times 2^{N-n'}$ matrix. Therefore, (47) is a KPD of $A\otimes B$, indexed by $(i, j)$, with respect to the configuration $(m', n', M-m', N-n')$, as long as the $P_{ij}$ and $Q_{ij}$ satisfy the orthonormality condition (46). In fact,
$$\langle P_{ij}, P_{kl}\rangle = \mathrm{tr}[P_{ij}^\top P_{kl}] = \mathrm{tr}\big[(C_i\otimes F_j)^\top(C_k\otimes F_l)\big] = \mathrm{tr}\big[(C_i^\top C_k)\otimes(F_j^\top F_l)\big] = \mathrm{tr}[C_i^\top C_k]\,\mathrm{tr}[F_j^\top F_l] = \delta_{i,k}\delta_{j,l},$$
and similar results hold for the $Q_{ij}$. It follows that
$$\|\mathcal{R}_{m',n'}[A\otimes B]\|_S = \max_{i,j}\mu_i\nu_j = \mu_1\nu_1 = \|\mathcal{R}_{m\wedge m',n\wedge n'}[A]\|_S\,\|\mathcal{R}_{(m'-m)_+,(n'-n)_+}[B]\|_S,$$
and the proof of Lemma 1 is complete.

Now we consider Corollary 2. When $A$ and $B$ are generated as in Example 1, we have
$$\|\mathcal{R}_{m\wedge m',n\wedge n'}[\tilde A]\|_S \leq 2^{((m\wedge m')+(n\wedge n'))/2} + 2^{((m-m')_+ + (n-n')_+)/2} + O_p(1),$$
$$\|\mathcal{R}_{(m'-m)_+,(n'-n)_+}[\tilde B]\|_S \leq 2^{((m'-m)_+ + (n'-n)_+)/2} + 2^{(M+N-m\vee m'-n\vee n')/2} + O_p(1),$$
$$\|\tilde A\|_F\,\|\tilde B\|_F = 2^{(M+N)/2}\big(1 + O_p(r_0)\big).$$
Consequently,
$$\|\mathcal{R}_{m',n'}[A\otimes B]\|_S = \frac{\|\mathcal{R}_{m',n'}[\tilde A\otimes \tilde B]\|_S}{\|\tilde A\|_F\,\|\tilde B\|_F} \leq 2^{-(m'+n')/2} + 2^{-(M+N-m'-n')/2} + 2^{-(|m-m'|+|n-n'|)/2} + 2^{-(M+N-|m-m'|-|n-n'|)/2} + o_p(1).$$
The maximum of the right hand side is attained when $|m-m'| + |n-n'| = 1$ or $m'+n' \in \{1, M+N-1\}$, for which $\|\mathcal{R}_{m',n'}[A\otimes B]\|_S \leq 1/\sqrt{2} + o_p(1)$. Furthermore, it is straightforward to verify that the upper bound is attained when $m'+n' \in \{1, M+N-1\}$, which leads to Corollary 2.

Appendix F. Proof of Lemma 2

We first prove the following technical lemma.

Lemma 5. Let $U, V$ be two vector subspaces of $\mathbb{R}^n$ with $\Theta(U, V) = \theta \in [0, \pi/2]$, where $\Theta(U, V)$ denotes the smallest principal angle between $U$ and $V$. Suppose $w \in \mathbb{R}^n$ is a unit vector with $\|P_Uw\| = \cos\alpha$ for some $\alpha \in [0, \pi/2]$, where $P_U$ denotes the orthogonal projection onto $U$. Then it holds that
$$\sup_{v\in V,\,\|v\|=1} v^\top w \leq \begin{cases}\cos(\theta - \alpha) & \text{if } \alpha \leq \theta,\\ 1 & \text{if } \alpha > \theta.\end{cases}$$

Proof. Let $u = P_Uw/\|P_Uw\|$, so that $\|u\| = 1$ and $u \in U$. Let $\{u_1, u_2, \ldots, u_n\}$ be an orthonormal basis of $\mathbb{R}^n$ with $u_1 = u$. For any unit vector $v \in V$, we have
$$v^\top w = v^\top\sum_{i=1}^n u_iu_i^\top w = v^\top u_1u_1^\top w + \sum_{i=2}^n v^\top u_iu_i^\top w \leq \cos\eta\cos\alpha + \sin\eta\sin\alpha = \cos(\eta - \alpha),$$
where $v^\top u_1 = \cos\eta$. The proof is complete by noting that $\cos\eta = v^\top u_1 \leq \cos\theta$.

We now prove Lemma 2.

Proof of Lemma 2. Recall that $M_1$ and $M_2$ are of the same dimension. We consider the maximization of $\|(M_1 + M_2)u\|^2$ over all unit vectors $u$. First write
$$\|(M_1 + M_2)u\|^2 = \|M_1u + M_2u\|^2 = \|M_1P_{M_1}u + M_2P_{M_2}u\|^2 = \|M_1P_{M_1}u\|^2 + \|M_2P_{M_2}u\|^2 + 2(M_1P_{M_1}u)^\top M_2P_{M_2}u,$$
where $P_M$ denotes the orthogonal projection onto the row space of $M$. Since $\|M_1\|_S = \mu$ and $\|M_2\|_S = \nu$, we have $\|M_1P_{M_1}u\|^2 \leq \mu^2\|P_{M_1}u\|^2$ and $\|M_2P_{M_2}u\|^2 \leq \nu^2\|P_{M_2}u\|^2$. Since $M_1P_{M_1}u \in \mathrm{span}(M_1)$ and $M_2P_{M_2}u \in \mathrm{span}(M_2)$, it holds that
$$(M_1P_{M_1}u)^\top M_2P_{M_2}u \leq \mu\nu\cos\theta\,\|P_{M_1}u\|\,\|P_{M_2}u\|.$$
It follows that
$$\|(M_1 + M_2)u\|^2 \leq \mu^2\|P_{M_1}u\|^2 + \nu^2\|P_{M_2}u\|^2 + 2\mu\nu\cos\theta\,\|P_{M_1}u\|\,\|P_{M_2}u\|.$$
Suppose $\|P_{M_1}u\| = \cos\alpha$ for some $\alpha \in [0, \pi/2]$. If $\alpha > \eta$, then $\|P_{M_2}u\| \leq 1$, and the right hand side of the preceding inequality attains its maximum when $\|P_{M_1}u\| = \cos\eta$ and $\|P_{M_2}u\| = 1$. Hence we only need to consider the case $\alpha \leq \eta$, which by Lemma 5 implies $\|P_{M_2}u\| \leq \cos(\eta - \alpha)$, and
$$\|(M_1 + M_2)u\|^2 \leq \mu^2\cos^2\alpha + \nu^2\cos^2(\eta - \alpha) + 2\mu\nu\cos\theta\cos\alpha\cos(\eta - \alpha).$$
Using the double-angle identities,
$$\mu^2\cos^2\alpha + \nu^2\cos^2(\eta-\alpha) + 2\mu\nu\cos\theta\cos\alpha\cos(\eta-\alpha) = \tfrac12\mu^2(1 + \cos 2\alpha) + \tfrac12\nu^2\big(1 + \cos(2\eta - 2\alpha)\big) + \mu\nu\cos\theta\,\big[\cos\eta + \cos(\eta - 2\alpha)\big]$$
$$= \tfrac12\big(\mu^2 + \nu^2 + 2\mu\nu\cos\theta\cos\eta\big) + \Big[\tfrac12\mu^2 + \tfrac12\nu^2\cos 2\eta + \mu\nu\cos\theta\cos\eta\Big]\cos 2\alpha + \Big[\tfrac12\nu^2\sin 2\eta + \mu\nu\cos\theta\sin\eta\Big]\sin 2\alpha$$
$$\leq \tfrac12\big(\mu^2 + \nu^2 + 2\mu\nu\cos\theta\cos\eta\big) + \sqrt{\Big[\tfrac12\mu^2 + \tfrac12\nu^2\cos 2\eta + \mu\nu\cos\theta\cos\eta\Big]^2 + \Big[\tfrac12\nu^2\sin 2\eta + \mu\nu\cos\theta\sin\eta\Big]^2}$$
$$= \tfrac12\Big(\mu^2 + \nu^2 + 2\mu\nu\cos\theta\cos\eta + \sqrt{\big(\mu^2 + \nu^2 + 2\mu\nu\cos\theta\cos\eta\big)^2 - 4\mu^2\nu^2\sin^2\theta\sin^2\eta}\Big).$$
The proof is complete.

Appendix G. Proofs of Theorem 5 and Corollary 4

The proof of Theorem 5 is similar to the proofs of Theorems 2 and 3, so we only point out the main steps and omit the details. Following the same argument as in the proof of Theorem 2, the expected information criterion at the true configuration is
$$\mathrm{EIC}_\kappa(m_0, n_0) = D\big[\ln(\lambda_2^2 + \sigma^2) + \kappa r_0^2\big].$$
For a wrong configuration $(m, n) \in \mathcal{W}$, $\hat\lambda^{(m,n)}$ is given by
$$\hat\lambda^{(m,n)} = \big\|\lambda_1\mathcal{R}[A_1\otimes B_1] + \lambda_2\mathcal{R}[A_2\otimes B_2] + \sigma D^{-1/2}\mathcal{R}[E]\big\|_S.$$
According to Lemma 2 and Assumption 5, we have
$$\big\|\lambda_1\mathcal{R}[A_1\otimes B_1] + \lambda_2\mathcal{R}[A_2\otimes B_2]\big\|_S^2 \leq \lambda_1^2\phi_1^2 + \lambda_2^2\phi_2^2 + 2\lambda_1\lambda_2\phi_1\phi_2\xi < (\lambda_1 + \lambda_2)^2. \quad (48)$$
By Lemma 4, we have
$$\big[\hat\lambda^{(m,n)}\big]^2 \leq \lambda_1^2\phi_1^2 + \lambda_2^2\phi_2^2 + 2\lambda_1\lambda_2\phi_1\phi_2\xi + \sigma^2r_{m,n}^2 + O\big((\lambda_1+\lambda_2)\sigma D^{-1/4}\big) + O_p\big((\lambda_1+\lambda_2+\sigma)\sigma D^{-1/2}\big). \quad (49)$$
With (49) replacing (32), the rest of the argument follows the same lines as the proof of Theorem 2. The proof of consistency is the same as that of Theorem 3, except that the formula for $\hat\lambda^{(m,n)}$ in (49) is used in (43).

We now prove Corollary 4. When model (24) is generated under the random scheme in Example 2, we only need to consider the wrong configurations close to the true configuration; it can be verified that the separation $\Delta\mathrm{EIC}(m, n)$ is larger at the other configurations. Consider $(m, n)$ such that $|m_0 - m| + |n_0 - n| = 1$. Then from Corollary 2, we have
$$\phi_1^2 = \tfrac12 + O_p(r_0), \qquad \phi_2^2 = \tfrac12 + O_p(r_0).$$
Now consider the principal angles between $\mathcal{R}[A_1\otimes B_1]$ and $\mathcal{R}[A_2\otimes B_2]$ as in Lemma 2. We have $\cos\theta = O_p(2^{-(m+n)})$ and $\cos\eta = O_p(2^{-(m'+n')})$. By Lemma 2, (48) can be revised to
$$\big\|\lambda_1\mathcal{R}[A_1\otimes B_1] + \lambda_2\mathcal{R}[A_2\otimes B_2]\big\|_S^2 \leq \tfrac{\lambda_1^2}{2} + O_p(\lambda_1^2r_0).$$
Corollary 4 follows immediately.

Appendix H. Additional Simulation with Different Noise Distributions

In this section, we examine the performance of the configuration selection under different distributions of the noise matrix $E$. We replicate the simulation of Experiment 1 in Section 6.1.2 with $M = N = 9$, replacing the normal distribution of the noise by (1) the uniform distribution on $[-1, 1]$ and (2) Student's $t$ distribution with 4 degrees of freedom, both normalized to have unit variance. The uniform distribution is an example of the sub-Gaussian case, for which a concentration inequality for the spectral norm is provided by, for example, Proposition 2.4 of Rudelson and Vershynin (2010). The $t_4$ distribution is an example of tails heavier than the Gaussian. We plot the number of correct configuration selections out of 100 replications for the different noise distributions in Figure 13. There is no substantial difference between the Gaussian and uniform cases. The $t_4$ noise appears to require a higher signal-to-noise ratio than the Gaussian noise, due to its heavier tails, but the phase transition in correctly selecting the configuration continues to exist.

Figure 13: The empirical frequencies of the correct configuration selection out of 100 repetitions under AIC and MSE, for different noise distributions, with dimensions M = N = 9. (a) AIC, Gaussian; (b) AIC, Uniform; (d) MSE, Gaussian; (e) MSE, Uniform.
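For reference, the two alternative noise distributions, normalized to unit variance as in the text, can be generated as follows (a minimal sketch; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
M = N = 9
# Unif[-1, 1] has variance 1/3; Student's t_4 has variance 4/(4 - 2) = 2
E_unif = np.sqrt(3) * rng.uniform(-1, 1, size=(2**M, 2**N))
E_t4 = rng.standard_t(df=4, size=(2**M, 2**N)) / np.sqrt(2)
```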
References

Seung C. Ahn and Alex R. Horenstein. Eigenvalue ratio test for the number of factors. Econometrica, 81(3):1203–1227, 2013.

Hirotogu Akaike. Information Theory and an Extension of the Maximum Likelihood Principle, pages 199–213. Springer New York, New York, NY, 1998.

Jushan Bai. Inferential theory for factor models of large dimensions. Econometrica, 71(1):135–171, 2003.

J.W. Belliveau, D.N. Kennedy, R.C. McKinstry, B.R. Buchbinder, R.M. Weisskoff, M.S. Cohen, J.M. Vevea, T.J. Brady, and B.R. Rosen. Functional mapping of the human visual cortex by magnetic resonance imaging. Science, 254(5032):716–719, 1991.

Vicki Bruce and Andy Young. Understanding face recognition. British Journal of Psychology, 77(3):305–327, 1986.

Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

T. Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60–89, 2018.

Emmanuel J. Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.

Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

Antonin Chambolle. An algorithm for total variation minimization and applications. Journal of Mathematical Imaging and Vision, 20(1-2):89–97, 2004.

Priyam Chatterjee and Peyman Milanfar. Patch-based near-optimal image denoising. IEEE Transactions on Image Processing, 21(4):1635–1649, 2011.

Jiahua Chen and Zehua Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771, 2008.

Rong Chen, Dan Yang, and Cun-Hui Zhang. Factor models for high-dimensional tensor time series. Journal of the American Statistical Association, 117(537):94–116, 2022.

Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.

Tugrul Dayar. Analyzing Markov Chains Using Kronecker Products: Theory and Applications. Springer Science & Business Media, 2012.

Marco F. Duarte and Richard G. Baraniuk. Kronecker compressive sensing. IEEE Transactions on Image Processing, 21(2):494–504, 2012.

Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.

Jianqing Fan, Yuan Liao, and Martina Mincheva. High dimensional covariance matrix estimation in approximate factor models. The Annals of Statistics, 39(6):3320, 2011.

Rina Foygel and Mathias Drton. Extended Bayesian information criteria for Gaussian graphical models. In Advances in Neural Information Processing Systems, pages 604–612, 2010.

Anna Goldenberg, Alice X. Zheng, Stephen E. Fienberg, and Edoardo M. Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, 2(2):129–233, 2010.

Qiang Guo, Caiming Zhang, Yunfeng Zhang, and Hui Liu. An efficient SVD-based method for image denoising. IEEE Transactions on Circuits and Systems for Video Technology, 26(5):868–880, 2015.

Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cambridge University Press, Cambridge, 1991.

Julie Kamm and James G. Nagy. Kronecker product and SVD approximations in image restoration. Linear Algebra and its Applications, 284(1):177–192, 1998. International Linear Algebra Society (ILAS) Symposium on Fast Algorithms for Control, Signals and Image Processing.
Phillip Kaye, Raymond Laflamme, and Michele Mosca. An Introduction to Quantum Computing. Oxford University Press, 2007.

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.

Can M. Le, Elizaveta Levina, and Roman Vershynin. Optimization via low-rank approximation for community detection in networks. The Annals of Statistics, 44(1):373–400, 2016.

Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research, 11(Feb):985–1042, 2010.

Joseph A. Maldjian, Paul J. Laurienti, Robert A. Kraft, and Jonathan H. Burdette. An automated method for neuroanatomic and cytoarchitectonic atlas-based interrogation of fMRI data sets. NeuroImage, 19(3):1233–1239, 2003.

Victor Ng, Robert F. Engle, and Michael Rothschild. A multi-dynamic-factor model for stock returns. Journal of Econometrics, 52(1-2):245–266, 1992.

Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In BMVC 2015 - Proceedings of the British Machine Vision Conference 2015, pages 1–12. British Machine Vision Association, 2015.

Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathematicians 2010 (ICM 2010), pages 1576–1602. World Scientific, 2010.

Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

Amit Singer. Angular synchronization by eigenvectors and semidefinite programming. Applied and Computational Harmonic Analysis, 30(1):20–36, 2011.

Matthew A. Turk and Alex P. Pentland. Face recognition using eigenfaces. In Proceedings of the 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 586–591. IEEE, 1991.

Charles F. Van Loan and Nikos Pitsianis. Approximation with Kronecker products. In Linear Algebra for Large Scale and Real-Time Applications, pages 293–314. Springer, 1993.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, 2010.

Per-Åke Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972.

Karl Werner, Magnus Jansson, and Petre Stoica. On estimation of covariance matrices with Kronecker product structure. IEEE Transactions on Signal Processing, 56(2):478–491, 2008.