Convolutional Imputation of Matrix Networks

Qingyun Sun*1, Mengyuan Yan*2, David Donoho3, Stephen Boyd2

*Equal contribution. 1Department of Mathematics, Stanford University, California, USA. 2Department of Electrical Engineering, Stanford University, California, USA. 3Department of Statistics, Stanford University, California, USA. Correspondence to: Qingyun Sun. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

A matrix network is a family of matrices whose relatedness is modeled by a weighted graph. We consider the task of completing a partially observed matrix network under a novel sampling scheme in which a fraction of the matrices may be completely unobserved. How can we recover the entire matrix network from incomplete observations? This mathematical problem arises in many applications, including medical imaging and social networks. To recover the matrix network, we propose the structural assumption that the matrices have a low-rank graph Fourier transform. We formulate a convex optimization problem and prove an exact recovery guarantee for it. Furthermore, we numerically characterize the exact recovery regime for varying rank and sampling rate and discover a new phase transition phenomenon. We then give an iterative imputation algorithm that efficiently solves the optimization problem and completes large-scale matrix networks. We demonstrate the algorithm on a variety of applications, including MRI and the Facebook user network.

1. Introduction

In machine learning and social network problems, information is often encoded in matrix form. User profiles in social networks can be embedded into feature matrices; item profiles in recommendation systems can also be modeled as matrices. Many medical imaging modalities, such as MRI and CT, also represent data as a stack of images. These matrices have underlying connections, which can come from spatial or temporal proximity, observed similarities between the items being described, and so on. A weighted graph can be built to represent the connections between matrices. Therefore we propose matrix networks as a general model for data representation. A matrix network is defined by a weighted graph whose nodes are matrices.

Figure 1: (a) Original MRI image frames (oracle). (b) Our sampled and corrupted observation. (c) Recovered image frames using nuclear norm minimization on individual frames. (d) Recovered image frames using our convolutional imputation algorithm.

Due to the limitations of data acquisition processes, sometimes we can only observe a subset of entries from each data matrix, and the fraction of observed entries may vary from matrix to matrix. In many real problems, a subset of matrices can be completely unobserved, leaving no information for ordinary matrix completion methods to recover them. To our knowledge, we are the first to examine this novel sampling scheme. As an example, in the MRI image sequence of figure 1(a), we sample the entries of each frame i.i.d. Bernoulli with p = 0.2, and 2 out of 88 frames are completely unobserved, as shown in figure 1(b). If we perform matrix completion by nuclear norm minimization on
individual frames, we are not able to recover the completely unobserved matrices (figure 1(c)). When we build a network on the image frames, in this case a one-dimensional chain representing the sequence, and assume that the matrices are low-rank after the graph Fourier transform, we are able to recover the missing frames, as shown in figure 1(d).

Figure 2: An example of a matrix network on the Facebook social graph. Each node on the graph represents a matrix.

The ability to recover all matrices from partial observations, and especially to infer matrices that are totally unobserved, is crucial to many applications such as the cold start problem in networks. As illustrated in figure 2, new items or users in a network, for which little information is available, need to aggregate information from the network to obtain an initial estimate of their feature matrices in order to support inference and decisions. Since we model the matrices as nodes on a graph, information from other matrices makes it possible to recover the missing ones. To use such information, we make the structural assumption that the matrix network is low-rank in spectral space, i.e., the matrix network is the graph convolution of two low-rank matrix networks.

Figure 3: Spectral-space singular value distribution of the MRI scan, supporting the spectral low-rank assumption.

In the MRI example, we verify the spectral low-rank assumption in figure 3. For all the matrices after the graph Fourier transform, the singular values quickly decrease to almost zero, demonstrating that they are in fact low-rank.

We make the following major contributions in this paper:
- We define a novel modeling framework for collections of matrices using matrix networks.
- We propose a new method to complete a stack of related matrices.
- We provide a mathematically solid exact recovery guarantee and numerically characterize the precise success regime.
- We give a convolutional imputation algorithm to efficiently complete large-scale matrix networks.

2. Related work

Low-rank matrix recovery is an important field of research. The soft-impute algorithm, proposed in (25; 26) and improved in (36; 48), is an iterative method for solving large-scale matrix completion problems; it inspired our imputation algorithm. There is also a long line of work building theoretical tools to analyze recovery guarantees for matrix completion (9; 7; 10; 8; 15; 22; 32; 31). Beyond matrix completion, Gross analyzed the problem of efficiently recovering a low-rank matrix from a fraction of observations in any basis (24). These works informed our exact recovery analysis. Low-rank matrix recovery can be viewed as a noncommutative analog of compressed sensing, with the sparse vector replaced by a low-rank matrix. In compressed sensing, recovery of a sparse vector with a block diagonal sensing matrix was studied in the recent work (37), which demonstrated a phase transition different from the well-known phase transition for classical Gaussian/Fourier sensing matrices established by a series of works including (16; 3; 39). In our low-rank matrix recovery problem, the novel sampling scheme also corresponds to a block diagonal operator, and we likewise demonstrate a new phase transition phenomenon. Tensors can be viewed as matrix networks when the network structure is ignored: tensor completion coincides with matrix network completion when the adjacency matrix is diagonal, so that the graph consists of isolated points with no edges and the graph eigenbasis is the coordinate basis.
Several works on tensor completion define a nuclear norm for tensors as a linear combination of the nuclear norms of its unfoldings (21; 34; 19; 50; 49). Besides the common CP and Tucker decompositions of tensors, recent work (51; 30; 35) defined the t-product using convolution operators between tensor fibers, which is close to the convolution of matrix networks using the discrete Fourier transform matrix, and applied the method to indoor localization. Departing from previous work, we consider a new sampling scheme in which some matrices are completely unobserved, and the undersampling ratio can be highly unbalanced among the observed matrices. This sampling scheme has not been considered before, yet it is very natural under the matrix network model. Under an incoherent eigenbasis, we can recover a completely unobserved matrix because its information is well spread across all the spectral matrices with low-rank structure.

Networks are an important modeling framework for relations and interactions (29; 20; 52; 54; 53). Graph-Laplacian-based regularization has been used in semi-supervised learning (18; 1; 2; 38), in PCA (44), and in low-rank matrix recovery (41; 23). In (41; 23; 44) a regularization term is used for a single matrix, where both the row vectors and the column vectors of the matrix are assumed to be connected by graphs. The notion of the graph Fourier transform is rooted in spectral graph theory (11) and is the cornerstone of graph harmonic analysis (13; 12). The coherence of the graph Fourier transform is studied in (40; 46); examples of large graphs with low-coherence (non-local) eigenvectors include different classes of random graphs (14; 17; 47) and non-random regular graphs (5). The graph Fourier transform and graph convolution are widely used in data analysis and machine learning, for example in (4; 55; 28; 43; 45). Recent advances in convolutional neural networks (6; 27) use this idea to extend neural networks from Euclidean grids to graphs.

3. Mathematical definitions

Matrix network. First, consider a weighted graph G with N nodes and an adjacency matrix $W \in \mathbb{R}^{N \times N}$, where $W_{ij}$ is the weight on the edge between nodes i and j. In the following, we use J to denote the set of nodes of the graph. We define a matrix network by augmenting this weighted graph G with a matrix-valued function A on the node set: the function A maps each node i of the graph to a matrix A(i) of size $m \times n$. We define an L2 norm $\|\cdot\|_2$ on the matrix network whose square is the sum of squares of all entries in all matrices of the network, and we define the sum of nuclear norms $\|\cdot\|_{*,1}$ as $\|A\|_{*,1} = \sum_{i=1}^{N} \|A(i)\|_*$.

Graph Fourier transform. The graph Fourier transform is an analog of the discrete Fourier transform. For a weighted undirected graph G with adjacency matrix W, the normalized graph Laplacian is defined as $L = I - D^{-1/2} W D^{-1/2}$, where D is the diagonal degree matrix with entries $D_{ii} = \sum_j W_{ij}$. The graph Fourier transform matrix U is defined by $UL = EU$, where E is the diagonal matrix of the eigenvalues of L. Here U is a unitary $N \times N$ matrix whose row vectors are the eigenvectors of L. We rank the eigenvalues in descending order and, for simplicity, identify the k-th eigenvalue with its index k. For a matrix network A, we define its graph Fourier transform $\hat A = UA$ as a stack of N matrices in the spectral space of the graph. Each matrix is a linear combination of the matrices on the graph, weighted by the graph Fourier basis:
$$\hat A(k) = \sum_{i \in J} U(k, i)\, A(i).$$

Intuitively, if we view the matrix network A as a set of $m \times n$ scalar functions on the graph, the graph Fourier transform of the matrix network applies the graph Fourier transform to each scalar function individually. In tensor notation, the elements of A are $A(i, a, b)$, and the graph Fourier transform can be represented by the Kronecker product $U \otimes I$: after reordering coordinates it is block diagonal, with each block equal to the $N \times N$ matrix U and with $mn$ such blocks, one per scalar function. We remark that the discrete Fourier transform is a special case of the graph Fourier transform: when the graph is a periodic grid, L is the discrete Laplacian matrix, and the eigenvectors are exactly the basis vectors of the discrete Fourier transform, namely sine and cosine functions of different frequencies.

We define the graph Fourier coherence as $\nu(U) = \max_{k,s} |U_{k,s}|$, following (40; 46). We know that $\nu(U) \in [\tfrac{1}{\sqrt{N}}, 1]$. When $\nu(U)$ is close to $\tfrac{1}{\sqrt{N}}$, the eigenvectors are non-local; examples include the discrete Fourier transform, different classes of random graphs (14; 17; 47), and non-random regular graphs (5). When $\nu(U)$ is close to 1, certain eigenvectors may be highly localized, especially when the graph has vertices whose degrees are significantly higher or lower than the average degree, say in a star-like tree graph, or when the graph has many triangles, as discussed in (42). We will show in the following section that graphs with low coherence (close to $\tfrac{1}{\sqrt{N}}$) are preferred for the imputation problem.

Convolution of matrix networks. We extend the definition of convolution to matrix networks. For two matrix networks X, Y on the same graph, we define their convolution through
$$\widehat{(X * Y)}(k) = \hat X(k)\, \hat Y(k),$$
so $\widehat{X * Y}$ is a stack of matrices in which each matrix is the matrix product of $\hat X(k)$ and $\hat Y(k)$. Convolution on a graph is defined as multiplication in the spectral space, generalizing the convolution theorem, since it is not clear how to define convolution directly in the original space.

4. Completion problem with missing matrices

Imagine that we observe a few entries $\Omega(i)$ of each matrix A(i). We define the sampling rates as $p_i = |\Omega(i)|/(mn)$. The projection operator $P_\Omega$ projects the full matrix network onto our partial observation by retaining only the entries in the set $\Omega = \bigcup_i \Omega(i)$.

The sampling rate can vary from matrix to matrix. The main novel sampling scheme we include here is that a subset of matrices may be completely unobserved, namely $p_i = 0$. This sampling scheme has hardly been discussed in depth in the literature. The difficulty lies in the fact that a fully unobserved matrix carries no information of its own for recovery, so we must leverage the information from the other observed matrices. To focus on the essence of this difficulty, it is worth considering the extreme sampling scheme in which each matrix is either fully observed or fully missing, which we call node undersampling.

To recover missing entries, we need structural assumptions about the matrix network A. We propose the assumption that A can be well approximated by the convolution $X * Y$ of two matrix networks X, Y of size $m \times r$ and $r \times n$, for some r much smaller than m and n. We will show that under this assumption, accurate completion is possible even if a significant fraction of the matrices are completely unobserved.
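To make these definitions concrete, the following is a minimal numpy sketch (illustrative only, not the authors' code) that builds the normalized Laplacian of a chain graph, forms the graph Fourier basis U, applies the transform to a stack of matrices, and applies an observation mask in the spirit of $P_\Omega$, including one completely unobserved node; all function and variable names here are our own.

```python
# Minimal numpy sketch of the definitions above (illustrative, not the authors' code).
import numpy as np

def graph_fourier_basis(W):
    """Return the GFT matrix U whose rows are eigenvectors of the normalized Laplacian."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))   # guard against isolated nodes
    L = np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)
    order = np.argsort(eigvals)[::-1]            # the paper ranks eigenvalues in descending order
    return eigvecs[:, order].T                   # rows of U are eigenvectors, so U L = E U

def gft(U, A):
    """Graph Fourier transform of a matrix network: A_hat(k) = sum_i U[k, i] A(i)."""
    return np.tensordot(U, A, axes=([1], [0]))   # A has shape (N, m, n)

def igft(U, A_hat):
    """Inverse transform; U is real orthogonal here, so U^{-1} = U.T."""
    return np.tensordot(U.T, A_hat, axes=([1], [0]))

# Example: a length-N chain graph and a random matrix network with node undersampling.
N, m, n = 8, 5, 4
W = np.zeros((N, N))
for i in range(N - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
U = graph_fourier_basis(W)
A = np.random.randn(N, m, n)
mask = np.random.rand(N, m, n) < 0.2             # P_Omega keeps roughly 20% of entries per matrix
mask[0] = False                                  # node 0 is completely unobserved
A_obs = np.where(mask, A, 0.0)                   # observed matrix network P_Omega(A)
assert np.allclose(igft(U, gft(U, A)), A)        # the transform is invertible
```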
We formulate the completion problem as follows. Let $A^0 = X^0 * Y^0$ be the ground-truth matrix network of $m \times n$ matrices, where $X^0$ and $Y^0$ are matrix networks of $m \times r$ and $r \times n$ matrices on the same graph. After the graph Fourier transform, the $\hat A^0(k)$ are rank-r matrices. Our observations are $A_\Omega = P_\Omega(A) = P_\Omega(A^0 + W)$, where each entry of W is sampled i.i.d. from $N(0, \sigma^2/n)$.

We first consider the noiseless setting, $\sigma = 0$. We can consider the following nuclear norm minimization problem, a convex relaxation of the rank minimization problem:
$$\text{minimize}_{\hat M}\ \ \|\hat M\|_{*,1} \quad \text{subject to} \quad A_\Omega = P_\Omega(U^{-1} \hat M).$$
As an extension that includes noise, we can consider the convex optimization problem in Lagrange form with regularization parameters $\lambda_k$:
$$L_\lambda(\hat M) = \tfrac{1}{2}\, \|A_\Omega - P_\Omega U^{-1} \hat M\|_2^2 + \sum_{k=1}^{N} \lambda_k \|\hat M(k)\|_*.$$
We can also consider the bi-convex formulation, which minimizes the objective
$$L_\lambda(X, Y) = \|A_\Omega - P_\Omega(X * Y)\|_2^2 + \sum_{k=1}^{N} \lambda_k \bigl(\|\hat X(k)\|_2^2 + \|\hat Y(k)\|_2^2\bigr).$$
This formulation is non-convex, but it is computationally efficient in large-scale applications. One remark: when we choose the regularization parameters $\lambda_k$ to be $E_k$, the eigenvalues of the graph Laplacian L, and view X as an $N \times (nr)$ matrix, then
$$\sum_{k=1}^{N} E_k \|\hat X(k)\|_2^2 = \mathrm{Tr}(X^* U^* E U X) = \mathrm{Tr}(X^* L X),$$
so our regularizer is related to the graph Laplacian regularizer of (41; 23; 44).

5. Exact recovery guarantee

Let us now analyze the theoretical question: what conditions are needed on the non-uniform sampling $\Omega$ and on the rank of $\hat A$ so that our algorithm is guaranteed to perform accurate recovery with high probability? We focus on the noiseless case, $\sigma = 0$, and for simplicity of the results we assume $m = n$. We first prove that one sufficient condition is that the average sampling rate $p = \frac{1}{N}\sum_{i=1}^{N} p_i = |\Omega|/(Nn^2)$ is greater than $O(\frac{r}{n}\log^2(nN))$. It is worth pointing out that the condition involves only the average sampling rate, and therefore includes the interesting case in which a subset of matrices is completely unobserved.

Analysis of the exact recovery guarantee. The matrix incoherence condition is a standard assumption for low-rank matrix recovery problems. Let the SVD of $\hat A(k)$ be $\hat A(k) = V_1(k) E(k) V_2^*(k)$. We define $P_1$ and $P_2$ as the direct sums of the projection matrices $P_1(k) = V_1(k)V_1^*(k)$ and $P_2(k) = V_2(k)V_2^*(k)$. We define the subspace T as the direct sum of the subspaces T(k), where T(k) is spanned by the column vectors of $V_1(k)$ and $V_2(k)$. The projection onto T is
$$P_T(\hat M) = V_1 V_1^* \hat M + \hat M V_2 V_2^* - V_1 V_1^* \hat M V_2 V_2^*,$$
and its complement is $P_{T^\perp} = I - P_T$. We define $\mathrm{sign}(\hat A(k)) = V_1(k)V_2^*(k)$ as the sign matrix of the singular values of $\hat A$.

In matrix completion, for an r-dimensional subspace of $\mathbb{R}^n$ spanned by V with orthogonal projection $P_V$, the coherence $\mu(V)$ is defined as $\frac{n}{r} \max_i \|P_V e_i\|^2$. Here we introduce an averaged coherence.

Definition 5.1 For the graph Fourier transform U, let the column vectors of U be $u_k$. We define the incoherence condition with coherence $\mu$ for the stack of spectral subspaces $V_1(k)$, $V_2(k)$ as
$$\max\Bigl\{\sum_{k=1}^{N} \|u_k\|_\infty^2\, \mu(V_1(k)),\ \sum_{k=1}^{N} \|u_k\|_\infty^2\, \mu(V_2(k))\Bigr\} = \mu.$$
The coherence of the graph Fourier transform U is defined as $\nu(U) = \max_k \|u_k\|_\infty$. We remark that
$$\mu \le \nu(U)^2 \max\Bigl\{\sum_{k=1}^{N} \mu(V_1(k)),\ \sum_{k=1}^{N} \mu(V_2(k))\Bigr\} \le \nu(U)^2\, N \max_{k=1,\dots,N} \max\{\mu(V_1(k)), \mu(V_2(k))\}.$$
In the following, we show that the sampling rate threshold is proportional to $\mu$, which is upper bounded by $\nu(U)^2 N \max_{k} \max\{\mu(V_1(k)), \mu(V_2(k))\}$.
This upper bound suggests that the imputation problem prefers a low-coherence graph, for which $\nu(U)$ is close to $\frac{1}{\sqrt{N}}$.

Theorem 1 Assume that A is a matrix network on a graph G whose graph Fourier transform $\hat A(k)$ is a sequence of matrices, each of rank at most r, and that $\hat A$ satisfies the incoherence condition with coherence $\mu$. We observe a matrix network $A_\Omega$ on the graph G, where the entries of node i are sampled with probability $p_i$; we define the average sampling rate $p = \frac{1}{N}\sum_{i=1}^{N} p_i = |\Omega|/(Nn^2)$ and the rescaled sampling operator $R = \frac{1}{p} P_\Omega U^{-1}$. Then, for any sampling probability distribution $\{p_i\}$, as long as the average sampling rate satisfies
$$p > C\mu\, \frac{r}{n}\, \log^2(Nn)$$
for some constant C, the solution to the optimization problem
$$\text{minimize}_{\hat M}\ \ \|\hat M\|_{*,1} \quad \text{subject to} \quad A_\Omega = P_\Omega(U^{-1}\hat M)$$
is unique and equals $\hat A$ with probability $1 - (Nn)^{-\gamma}$, where $\gamma = \log(Nn)/16$.

Proof sketch. The full proof is given in the supplementary material; we sketch its steps here. For any $\hat M \ne \hat A$, define $\Delta = \hat M - \hat A$. We want to show that either $R\Delta \ne 0$, in which case $\hat M$ is infeasible, or $\|\hat A + \Delta\|_{*,1} > \|\hat A\|_{*,1}$. We define the inner product $\langle \hat M_1, \hat M_2\rangle = \sum_k \langle \hat M_1(k), \hat M_2(k)\rangle$, so that $\|\hat A\|_{*,1} = \langle \mathrm{sign}(\hat A), \hat A\rangle$, and decompose $\Delta = P_T\Delta + P_{T^\perp}\Delta = \Delta_T + \Delta_{T^\perp}$.

For $R\Delta = 0$, we show that $\|\hat A + \Delta\|_{*,1} \ge \|\hat A\|_{*,1} + \langle \mathrm{sign}(\hat A) + \mathrm{sign}(\Delta_{T^\perp}), \Delta\rangle$, so we want to estimate $s = \langle \mathrm{sign}(\hat A) + \mathrm{sign}(\Delta_{T^\perp}), \Delta\rangle$. Since $R\Delta = 0$, $\Delta$ is orthogonal to the range of R. We construct a dual certificate $K$ in the range of R such that, for $k = 3 + \frac{1}{2}\log_2(r) + \log_2(n) + \log_2(N)$, with probability $1 - (Nn)^{-\gamma}$,
$$\|P_T(K) - \mathrm{sign}(\hat A)\|_2 \le \bigl(\tfrac{1}{2}\bigr)^k, \qquad \|P_{T^\perp}(K)\| \le \tfrac{1}{2}.$$
Given the existence of the dual certificate, since $\langle K, \Delta\rangle = 0$ we have $s = \langle \mathrm{sign}(\hat A) + \mathrm{sign}(\Delta_{T^\perp}) - K, \Delta\rangle$, which we can break down as
$$s = \langle \mathrm{sign}(\Delta_{T^\perp}) - P_{T^\perp}(K), \Delta_{T^\perp}\rangle + \langle \mathrm{sign}(\hat A) - P_T(K), \Delta_T\rangle.$$
With probability $1 - (Nn)^{-\gamma}$, the first term is at least $\tfrac{1}{2}\|\Delta_{T^\perp}\|_{*,1}$ and the second is at least $-(\tfrac{1}{2})^k \|\Delta_T\|_2$. We can also show that, for all $\Delta$ orthogonal to the range of R, with probability $1 - (Nn)^{-\gamma}$, $\|\Delta_T\|_2 < \sqrt{2nN}\, \|\Delta_{T^\perp}\|_2$. Combining these bounds with the choice of k, the negative term is dominated, so $s > 0$ whenever $\Delta_{T^\perp} \ne 0$. Therefore, if $\hat M$ is a minimizer we must have $\Delta_{T^\perp} = 0$, since otherwise $\|\hat A + \Delta\|_{*,1} > \|\hat A\|_{*,1}$. Since $\|\Delta_T\|_2$ is bounded by $\sqrt{2nN}\|\Delta_{T^\perp}\|_2$, we also have $\Delta_T = 0$, hence $\Delta = 0$. Therefore $\hat M = \hat A$ is the unique minimizer, which ends the proof.

We now add remarks on some of the important technical steps. The high-probability propositions rely on a concentration result. Since $E(P_T R P_T) = P_T$, we control the deviation probability $P[\|P_T - P_T R P_T\| > t]$ via the operator Bernstein inequality (see Theorem 6 of (24)); using the condition $p = C\mu\frac{r}{n}\log^2(Nn)$ and taking $t = 1/2$, with probability $1 - (nN)^{-\gamma}$, where $\gamma = \log(Nn)/16$, we have $\|P_T - P_T R P_T\| < 1/2$.

We construct the dual certificate via the "golfing" method invented in (24). We decompose $\Omega$ as the union of k subsets $\Omega_t$, in which each entry is sampled independently with probability $p_t = 1 - (1-p)^{1/k}$, and define $R_t = \frac{1}{p_t} P_{\Omega_t} U^{-1}$. Set $H_0 = \mathrm{sign}(\hat A)$, $K_t = \sum_{j=1}^{t} R_j H_{j-1}$, and $H_t = \mathrm{sign}(\hat A) - P_T K_t$. The dual certificate is then $K = K_k$. Using the operator Bernstein concentration inequality, we can verify the two conditions. For the first condition,
$$\|P_T(K) - \mathrm{sign}(\hat A)\|_2 = \|H_k\|_2, \qquad \|H_t\|_2 = \|(P_T - P_T R_t P_T) H_{t-1}\|_2 \le \tfrac{1}{2}\|H_{t-1}\|_2,$$
so $\|H_k\|_2 \le (\tfrac{1}{2})^k \|\mathrm{sign}(\hat A)\|_2$. For the second condition, we apply the operator Bernstein inequality again with a sequence $t_j = 1/(4\sqrt{r})$, so that $\|P_{T^\perp} R_j H_{j-1}\| \le t_j \|H_{j-1}\|_2$; since $\|H_j\|_2 \le \sqrt{r}\, 2^{-j}$, we get
$$\|P_{T^\perp}(K)\| \le \sum_{j=1}^{k} t_j \|H_{j-1}\|_2 \le \tfrac{1}{4}\sum_{j=1}^{k} 2^{-(j-1)} < \tfrac{1}{2}.$$

6. Convolutional imputation algorithm

We now propose a convolutional imputation algorithm that efficiently finds the minimizer of the optimization problem for a sequence of regularization parameters.
Iterative imputation algorithm. The vanilla version of our imputation algorithm iteratively performs imputation, $A^{\mathrm{impute}} = P_\Omega(A) + P_{\Omega^\perp}(A^{\mathrm{est}})$, followed by singular value soft-thresholding of $\hat A^{\mathrm{impute}}$, to solve the nuclear norm regularization problem. In the following, we denote the singular value soft-threshold by $S_\lambda(\hat A) = V_1(\Sigma - \lambda I)_+ V_2^*$, where $(\cdot)_+$ is the projection onto the positive semidefinite cone and $\hat A = V_1 \Sigma V_2^*$ is the singular value decomposition.

Iterative Imputation:
Input: $P_\Omega(A)$. Initialization: $A^{\mathrm{est}}_0 = 0$, $t = 0$.
For $\lambda^1 > \lambda^2 > \dots > \lambda^C$, where $\lambda^j = (\lambda^j_k)$, $k = 1, \dots, N$, do:
  Repeat:
    $A^{\mathrm{impute}} = P_\Omega(A) + P_{\Omega^\perp}(A^{\mathrm{est}}_t)$.
    $\hat A^{\mathrm{impute}} = U A^{\mathrm{impute}}$.
    $\hat A^{\mathrm{est}}_{t+1}(k) = S_{\lambda^j_k}(\hat A^{\mathrm{impute}}(k))$ for each k.
    $A^{\mathrm{est}}_{t+1} = U^{-1}\hat A^{\mathrm{est}}_{t+1}$; $t = t + 1$.
  Until $\|A^{\mathrm{est}}_t - A^{\mathrm{est}}_{t-1}\|_2 / \|A^{\mathrm{est}}_{t-1}\|_2 < \epsilon$. Assign $A^{\lambda^j} = A^{\mathrm{est}}_t$.
End for.
Output: the sequence of solutions $A^{\lambda^1}, \dots, A^{\lambda^C}$.

In the vanilla imputation algorithm, computing the full SVD at each iteration is very expensive for large matrices. For efficiency, we can instead use alternating ridge regression to compute a reduced-rank SVD. Due to limited space, we omit the detailed algorithm description here.

Regularization path. The sequence of regularization parameters is chosen such that $\lambda^1_k > \lambda^2_k > \dots > \lambda^C_k$ for each k. The solution for each run with $\lambda^s$ is a warm start for the next run with $\lambda^{s+1}$. Our recommended choice is to set $\lambda^1_k$ to the largest singular value of $\hat A^{\mathrm{impute}}(k)$ and to decay $\lambda^s$ at a constant rate, $\lambda^{s+1} = c\lambda^s$.

Convergence. Our algorithm is a natural extension of soft-impute (25), which is a special case of the proximal gradient algorithm for nuclear norm minimization, as demonstrated by (48), so convergence of the algorithm is guaranteed. Here we show, by an elegant argument, that the solution of our imputation algorithm converges asymptotically to a minimizer of the objective $L_\lambda(\hat M)$: each step of the algorithm minimizes a surrogate
$$Q_\lambda(\hat M \mid \hat M^{\mathrm{old}}) = \|A_\Omega + P_{\Omega^\perp} U^{-1}\hat M^{\mathrm{old}} - U^{-1}\hat M\|_2^2 + \sum_{k=1}^{N} \lambda_k \|\hat M(k)\|_*.$$

Theorem 2 The imputation algorithm produces a sequence of iterates $\hat M^t_\lambda$, where $\hat M^{t+1}_\lambda = \mathrm{argmin}\, Q_\lambda(\hat M \mid \hat M^t_\lambda)$ minimizes the successive optimization objective. This sequence of iterates converges to a minimizer $\hat M_\lambda$ of $L_\lambda(\hat M)$.

We give the proof of the convergence theorem in the appendix. The main idea is to show that $Q_\lambda$ decreases after every iteration, that $\hat M^t_\lambda$ is a Cauchy sequence, and that its limit point is a stationary point of $L_\lambda$.

Computational complexity. We now analyze the computational complexity of the imputation algorithm. The cost of the graph Fourier transform on the matrix network is $O(mnN^2)$; when the graph is a periodic lattice, the fast Fourier transform (FFT) reduces it to $O(mnN\log N)$. The cost of the SVDs for singular value soft-thresholding is $O(\min(mn^2, m^2n)N)$; replacing the SVD with alternating ridge regression reduces this to $O(r^2 nN)$. The cost of each iteration is the sum of the two parts, and the total cost is that times the number of iteration steps.

7. Experimental results

Numerical verification of exact recovery. To focus on the essential difficulty of the problem, we study the noiseless, node-sampling setting. In each imputation experiment, we first generate a stack of low-rank matrices in the spectral space, $\hat A(k) = X^0(k)^T Y^0(k)$, with i.i.d. Gaussian random matrices $X^0(k), Y^0(k) \in \mathbb{R}^{r \times n}$. We also generate a random graph G. Then we compute the matrix network A by the inverse graph Fourier transform and obtain our observation by node undersampling of A.
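For concreteness, the following is a minimal numpy sketch of one vanilla imputation sweep on this kind of synthetic input, run at a single fixed regularization level. It reuses the gft/igft helpers from the earlier snippet, uses our own names throughout, and omits the regularization path and the alternating ridge regression acceleration; it is an illustration under these simplifications, not the authors' implementation.

```python
# Sketch of one vanilla imputation sweep at a fixed regularization level (illustrative only).
import numpy as np

def gft(U, A):        # graph Fourier transform, as in the earlier sketch
    return np.tensordot(U, A, axes=([1], [0]))

def igft(U, A_hat):   # inverse transform (U is real orthogonal here)
    return np.tensordot(U.T, A_hat, axes=([1], [0]))

def svt(M, lam):
    """Singular value soft-threshold S_lambda(M) = V1 (Sigma - lam I)_+ V2^*."""
    V1, s, V2h = np.linalg.svd(M, full_matrices=False)
    return (V1 * np.maximum(s - lam, 0.0)) @ V2h

def impute(A_obs, mask, U, lam, n_iters=200, tol=1e-6):
    """Fill unobserved entries with the current estimate, then soft-threshold in spectral space."""
    A_est = np.zeros_like(A_obs)
    for _ in range(n_iters):
        A_imp = np.where(mask, A_obs, A_est)          # P_Omega(A) + P_Omega_perp(A_est)
        A_hat = gft(U, A_imp)                          # to spectral space
        A_hat_new = np.stack([svt(A_hat[k], lam) for k in range(A_hat.shape[0])])
        A_new = igft(U, A_hat_new)                     # back to node space
        if np.linalg.norm(A_new - A_est) <= tol * max(np.linalg.norm(A_est), 1.0):
            return A_new
        A_est = A_new
    return A_est
```

In the full algorithm, impute would be called along a decreasing sequence of regularization levels, warm-starting each run from the previous solution.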
Then we send the observed matrices and the graph G to the imputation algorithm to obtain the solution $\hat M$. We measure the relative mean square error (rMSE) $\|\hat M - \hat A\| / \|\hat A\|$. We set $(n, N) = (50, 100)$ in all our experiments and vary the undersampling ratio p and the rank r. For each parameter pair $(p, r)$, we repeat the experiment multiple times and compute the success rate of exact recovery.

Figure 4: Upper: the rMSE for different combinations of undersampling ratio p and rank r/n. The transition between successful recovery (rMSE near $10^{-5}$) and failure (rMSE approximately p) is very sharp. Lower: the phase transition diagram for varying undersampling ratio and rank. We repeat the experiment multiple times for each parameter combination and plot the success rate of exact recovery (rMSE < 0.001).

The upper panel of figure 4 shows the rMSE when the graph is a one-dimensional chain of length N. When r/n is large and p is small, the rMSE is approximately equal to the undersampling ratio p, which means the optimization failed to recover the ground truth matrices. On the opposite side, when r/n is small and p is large, the rMSE is very small, indicating that we have successfully recovered the missing matrices. The transition between the two regions is very sharp. We also show the success rate in the lower panel of figure 4, which demonstrates a phase transition.

Feature matrices on the Facebook network. We take the ego networks from the SNAP Facebook dataset (33). The combined network forms a connected graph with 4039 nodes and 88234 edges; all edges have equal weights. The feature matrices on the nodes are generated by randomly drawing $X(k), Y(k) \in \mathbb{C}^{1 \times 50}$ in the spectral domain and taking the inverse graph Fourier transform, $A = U^{-1}(X(k)^* Y(k))$. The observation is generated by sampling $N_{\mathrm{obs}}$ matrices at sampling rate p and adding i.i.d. Gaussian noise with mean 0 and variance $\sigma^2/50$ to all observed entries. Here $N_{\mathrm{obs}} < N = 4039$, and the remaining matrices are completely unobserved.

Table 1: Average MSE of missing / (partially) observed matrices.

              N_obs = 0.2N, p = 1    N_obs = 0.4N, p = 0.5    N_obs = 0.9N, p = 2/9
  σ = 0         0.116 / 0.000          0.100 / 0.038            0.088 / 0.061
  σ = 0.1       0.363 / 0.495          0.348 / 0.411            0.339 / 0.365

We run our iterative imputation algorithm to recover A from this observation with varying parameters $N_{\mathrm{obs}}$, p, and $\sigma$, and compute the MSE between our estimate and the ground truth. The results are summarized in Table 1. When there is no additive noise, we can recover all the matrices very well even with only 20% of the entries observed across the matrix network; this holds both for node undersampling and for more uniform undersampling. When there is additive noise, the MSE between the reconstruction and the ground truth grows proportionally.

MRI completion. We use a cardiac MRI scan dataset for the completion task. The stack of MRI images scans through a human torso. Of the 88 frames, 2 are completely missing, and the entries of the remaining frames are sampled i.i.d. from a Bernoulli distribution with p = 0.2. Our completion result is shown in figure 1 on the first page as the motivating example. We compare with a baseline method that solves a tensor completion problem using nuclear norm minimization. The relative MSE for all frames is plotted in figure 5.
The baseline method fails on the missing frames and significantly underperforms the convolutional imputation method.

SPECT completion. We impute a cardiac SPECT scan dataset. The SPECT scan captures the periodic movement of a heart, and we take a temporal sequence at a fixed spatial slice. The sequence has 36 frames, capturing 4 periods of heartbeats. 4 consecutive frames out of the 36 are missing, and the entries of the other frames are sampled i.i.d. from a Bernoulli distribution with p = 0.2. We recover the whole image stack from the observations and compare our method with two baseline methods. The first baseline assumes each individual frame is low-rank and minimizes the sum of nuclear norms. The second baseline adds the graph regularizer from (41; 23; 44) on top of the low-rank assumption on each frame. Minimizing the sum of nuclear norms fails to recover the completely missing frames. Our algorithm also performs better than tensor completion with the graph regularizer on the SPECT scan, since in the spectral domain we can use the periodicity to aggregate information, whereas the graph regularizer only propagates information between neighbors. This is demonstrated in figure 6: the first row shows the ground truth, and the second row overlays the ground truth (red channel) with the completion result of our convolutional imputation algorithm (green channel). The third row overlays the ground truth with the completion result of tensor completion with the graph regularizer. The completion result of our algorithm matches the ground truth very well, while the result of tensor completion with the graph regularizer is biased towards the average of neighboring frames, showing red and green rings at the edges.

Figure 5: Comparison of relative MSE for all frames of the MRI scan; the baseline joint matrix completion method fails on the missing frames and significantly underperforms the convolutional imputation method.

Figure 6: Visualization of the first 9 frames of the SPECT sequence. The frames in pink shadow are missing in the observation.

Figure 7: Comparison of relative MSE for all frames of the SPECT scan; the missing frames are indexed 3 to 5.

A quantitative comparison on the SPECT scan completion is given in figure 7. The relative MSE between our imputation algorithm's reconstruction and the ground truth is significantly smaller than that of the baselines. It is worth pointing out that our method's recovery performance on the missing frames is comparable to that on the partially observed frames, while the first baseline completely fails on the missing frames and the second baseline performs significantly worse there.

8. Discussion

In practice, when given a tensor or a stack of matrices, there are two ways to formulate it as a matrix network. One is to use knowledge of physical or geometric relations to determine the graph naturally: the graph of the matrix network is given in the Facebook example, and it is naturally constructed as a one-dimensional equal-weighted chain for the MRI and SPECT datasets, based on the nature of those datasets. The other is to construct the graph with an explicit constructive method. Finding a graph with a good graph Fourier transform relies on problem structure and domain knowledge. One suggested universal approach is to construct a lattice or a d-regular graph and assign each edge a weight given by some distance metric between the two matrices; for example, the distance metric could be computed using Gaussian kernels.
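As a sketch of this constructive suggestion, the following builds a graph by connecting each matrix to a few nearest neighbors (a simple stand-in for a lattice or d-regular structure) and weighting the edges with a Gaussian kernel on pairwise distances. The neighbor count, the bandwidth heuristic, and the use of full matrices to compute distances are our own simplifying assumptions; in practice the distances would come from whatever side information or observed entries are available.

```python
# Illustrative Gaussian-kernel graph construction for a stack of matrices (not from the paper).
import numpy as np

def build_gaussian_knn_graph(A, num_neighbors=4, bandwidth=None):
    """A: stack of matrices of shape (N, m, n). Returns a symmetric weighted adjacency W."""
    N = A.shape[0]
    flat = A.reshape(N, -1)
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=2)   # pairwise distances
    if bandwidth is None:
        bandwidth = np.median(dists[dists > 0])                            # heuristic kernel scale
    W = np.zeros((N, N))
    for i in range(N):
        neighbors = np.argsort(dists[i])[1:num_neighbors + 1]              # skip the node itself
        W[i, neighbors] = np.exp(-dists[i, neighbors] ** 2 / (2 * bandwidth ** 2))
    return np.maximum(W, W.T)                                              # symmetrize
```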
We suggest that the coherence $\mu$ defined earlier can be used as a criterion for how good a graph Fourier transform is. From the bound on $\mu$ in terms of the coherence of the graph Fourier transform and the maximum coherence over all spectral matrices, we know that we want to search for graphs with low coherence. This leads to an interesting dictionary learning problem in which we want to learn a unitary dictionary to serve as the graph Fourier transform.

To conclude, treating a series of related matrices as a matrix network is a useful modeling framework, since a matrix network supports operations such as the graph Fourier transform and convolution. This framework allows us to complete the matrices even when some of them are completely unobserved, using the spectral low-rank structural assumption. We provided an exact recovery guarantee and discovered a new phase transition phenomenon for the completion algorithm.

References

[1] Rie Kubota Ando and Tong Zhang. Learning on graph with Laplacian regularization. Advances in Neural Information Processing Systems, 19:25, 2007.
[2] Konstantin Avrachenkov, Pavel Chebotarev, and Alexey Mishenin. Semi-supervised learning with regularized Laplacian. Optimization Methods and Software, 32(2):222–236, 2017.
[3] Mohsen Bayati, Marc Lelarge, and Andrea Montanari. Universality in polytope phase transitions and iterative algorithms. Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pages 1643–1647, 2012.
[4] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pages 585–591, 2002.
[5] Shimon Brooks and Elon Lindenstrauss. Non-localization of eigenfunctions on large regular graphs. Israel Journal of Mathematics, 193(1):1–14, 2013.
[6] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[7] Jian-Feng Cai, Emmanuel Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[8] Emmanuel Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
[9] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.
[10] Emmanuel Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[11] Fan R. K. Chung. Spectral Graph Theory. Number 92. American Mathematical Society, 1997.
[12] Ronald R. Coifman, Stephane Lafon, Ann B. Lee, Mauro Maggioni, Boaz Nadler, Frederick Warner, and Steven W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7426–7431, 2005.
[13] Ronald R. Coifman and Mauro Maggioni. Diffusion wavelets. Applied and Computational Harmonic Analysis, 21(1):53–94, 2006.
[14] Yael Dekel, James R. Lee, and Nathan Linial. Eigenvectors of random graphs: Nodal domains. Random Structures & Algorithms, 39(1):39–58, 2011.
[15] David Donoho, Matan Gavish, et al. Minimax risk of matrix denoising by singular value thresholding. The Annals of Statistics, 42(6):2413–2440, 2014.
[16] David Donoho and Jared Tanner.
Counting faces of randomly projected polytopes when the projection radically lowers dimension. Journal of the American Mathematical Society, 22(1):1–53, 2009.
[17] Ioana Dumitriu, Soumik Pal, et al. Sparse regular random graphs: spectral density and eigenvectors. The Annals of Probability, 40(5):2197–2235, 2012.
[18] Ahmed El Alaoui, Xiang Cheng, Aaditya Ramdas, Martin J. Wainwright, and Michael I. Jordan. Asymptotic behavior of lp-based Laplacian regularization in semi-supervised learning. 29th Annual Conference on Learning Theory, pages 879–906, 2016.
[19] Marko Filipović and Ante Jukić. Tucker factorization with missing data with application to low-rank tensor completion. Multidimensional Systems and Signal Processing, 26(3):677–692, 2015.
[20] Jacob Fox, Tim Roughgarden, C. Seshadhri, Fan Wei, and Nicole Wein. Finding cliques in social networks: A new distribution-free model. arXiv preprint arXiv:1804.07431, 2018.
[21] Silvia Gandy, Benjamin Recht, and Isao Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010, 2011.
[22] Matan Gavish and David L. Donoho. The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory, 60(8):5040–5053, 2014.
[23] Pere Giménez-Febrer and Alba Pages-Zamora. Matrix completion of noisy graph signals via proximal gradient minimization.
[24] David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.
[25] T. Hastie and R. Mazumder. softImpute: Matrix completion via iterative soft-thresholded SVD. R package version 1, 2015.
[26] Trevor Hastie, Rahul Mazumder, Jason D. Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16(1):3367–3402, 2015.
[27] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
[28] Wei Hu, Gene Cheung, Antonio Ortega, and Oscar C. Au. Multiresolution graph Fourier transform for compression of piecewise smooth images. IEEE Transactions on Image Processing, 24(1):419–433, 2015.
[29] M. O. Jackson. Social and Economic Networks. Princeton University Press, 2008.
[30] Eric Kernfeld, Misha Kilmer, and Shuchin Aeron. Tensor-tensor products with invertible linear transforms. Linear Algebra and its Applications, 485:545–570, 2015.
[31] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
[32] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11(Jul):2057–2078, 2010.
[33] Jure Leskovec and Julian J. Mcauley. Learning to discover social circles in ego networks. Advances in Neural Information Processing Systems, pages 539–547, 2012.
[34] Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220, 2013.
[35] Xiao-Yang Liu, Shuchin Aeron, Vaneet Aggarwal, Xiaodong Wang, and Min-You Wu. Adaptive sampling of RF fingerprints for fine-grained indoor localization. IEEE Transactions on Mobile Computing, 15(10):2411–2423, 2016.
[36] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani.
Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug):2287–2322, 2010.
[37] Hatef Monajemi and David Donoho. Sparsity/undersampling tradeoffs in anisotropic undersampling, with applications in MR imaging/spectroscopy. arXiv preprint arXiv:1702.03062, 2017.
[38] Boaz Nadler, Nathan Srebro, and Xueyuan Zhou. Semi-supervised learning with the graph Laplacian: The limit of infinite unlabelled data. Advances in Neural Information Processing Systems, 21, 2009.
[39] Samet Oymak and Babak Hassibi. The proportional mean decomposition: a bridge between the Gaussian and Bernoulli ensembles. Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 3322–3326, 2015.
[40] Nathanael Perraudin, Benjamin Ricaud, David Shuman, and Pierre Vandergheynst. Global and local uncertainty principles for signals on graphs. arXiv preprint arXiv:1603.03030, 2016.
[41] Nikhil Rao, Hsiang-Fu Yu, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Collaborative filtering with graph information: Consistency and scalable methods. Advances in Neural Information Processing Systems, pages 2107–2115, 2015.
[42] Naoki Saito and Ernest Woei. On the phase transition phenomenon of graph Laplacian eigenfunctions on trees (recent development and scientific applications in wavelet analysis). 2011.
[43] Aliaksei Sandryhaila and José M. F. Moura. Discrete signal processing on graphs: Graph Fourier transform. ICASSP, pages 6167–6170, 2013.
[44] Nauman Shahid, Nathanael Perraudin, Gilles Puy, and Pierre Vandergheynst. Compressive PCA for low-rank matrices on graphs. IEEE Transactions on Signal and Information Processing over Networks, 2016.
[45] David I. Shuman, Sunil K. Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.
[46] David I. Shuman, Benjamin Ricaud, and Pierre Vandergheynst. Vertex-frequency analysis on graphs. Applied and Computational Harmonic Analysis, 40(2):260–291, 2016.
[47] Linh V. Tran, Van H. Vu, and Ke Wang. Sparse random graphs: Eigenvalues and eigenvectors. Random Structures & Algorithms, 42(1):110–134, 2013.
[48] Quanming Yao and James T. Kwok. Accelerated inexact soft-impute for fast large-scale matrix completion. Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[49] Xiaoqin Zhang, Di Wang, Zhengyuan Zhou, and Yi Ma. Simultaneous rectification and alignment via robust recovery of low-rank tensors. In Advances in Neural Information Processing Systems, pages 1637–1645, 2013.
[50] Xiaoqin Zhang, Zhengyuan Zhou, Di Wang, and Yi Ma. Hybrid singular value thresholding for tensor completion. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
[51] Zemin Zhang and Shuchin Aeron. Exact tensor completion using t-SVD. IEEE Transactions on Signal Processing, 65(6):1511–1526, 2016.
[52] Zhengyuan Zhou, Nicholas Bambos, and Peter Glynn. Dynamics on linear influence network games under stochastic environments. In International Conference on Decision and Game Theory for Security, pages 114–126. Springer, 2016.
[53] Zhengyuan Zhou, Nicholas Bambos, and Peter Glynn. Deterministic and stochastic wireless network games: Equilibrium, dynamics and price of anarchy. Operations Research, 2018.
[54] Zhengyuan Zhou, Benjamin Yolken, Reiko Ann Miura-Ko, and Nicholas Bambos.
A game-theoretical formulation of influence networks. In American Control Conference (ACC), 2016, pages 3802–3807. IEEE, 2016.
[55] Xiaofan Zhu and Michael Rabbat. Approximating signals supported on graphs. Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 3921–3924, 2012.