# Partial Trace Regression and Low-Rank Kraus Decomposition

Hachem Kadri (1), Stéphane Ayache (1), Riikka Huusari (2), Alain Rakotomamonjy (3,4), Liva Ralaivola (4)

(1) Aix-Marseille University, CNRS, LIS, Marseille, France. (2) Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland. (3) Université Rouen Normandie, LITIS, Rouen, France. (4) Criteo AI Lab, Paris, France. Correspondence to: Hachem Kadri.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract. The trace regression model, a direct extension of the well-studied linear regression model, allows one to map matrices to real-valued outputs. We here introduce an even more general model, namely the partial-trace regression model, a family of linear mappings from matrix-valued inputs to matrix-valued outputs; this model subsumes the trace regression model and thus the linear regression model. Borrowing tools from quantum information theory, where partial trace operators have been extensively studied, we propose a framework for learning partial trace regression models from data by taking advantage of the so-called low-rank Kraus representation of completely positive maps. We show the relevance of our framework with synthetic and real-world experiments conducted for both i) matrix-to-matrix regression and ii) positive semidefinite matrix completion, two tasks which can be formulated as partial trace regression problems.

1. Introduction

Trace regression model. The trace regression model or, in short, trace regression, is a generalization of the well-known linear regression model to the case where input data are matrices instead of vectors (Rohde & Tsybakov, 2011; Koltchinskii et al., 2011; Slawski et al., 2015), with the output still being real-valued. This model assumes, for the pair of covariates $(X, y)$, the following relation between the matrix-valued random variable $X$ and the real-valued random variable $y$:

$$y = \mathrm{tr}(B_*^\top X) + \epsilon, \qquad (1)$$

where $\mathrm{tr}(\cdot)$ denotes the trace, $B_*$ is an unknown matrix of regression coefficients, and $\epsilon$ is random noise. This model has been used beyond mere matrix regression for problems such as phase retrieval (Candes et al., 2013), quantum state tomography (Gross et al., 2010), and matrix completion (Klopp, 2011).

Figure 1. Illustration of the partial trace operation. Applied to the $m \times m$ blocks of a $qm \times qm$ matrix, the partial trace outputs a $q \times q$ matrix.

Given a sample $S = \{(X_i, y_i)\}_{i=1}^{\ell}$, where $X_i$ is a $p_1 \times p_2$ matrix, $y_i \in \mathbb{R}$, and each $(X_i, y_i)$ is assumed to be distributed as $(X, y)$, the training task associated with the statistical model (1) is to find a matrix $\hat{B}$ that is a proxy for $B_*$. To this end, Koltchinskii et al. (2011) and Fan et al. (2019) proposed to compute an estimate $\hat{B}$ of $B_*$ as the solution of the regularized least squares problem

$$\hat{B} = \arg\min_{B} \sum_{i=1}^{\ell} \big(y_i - \mathrm{tr}(B^\top X_i)\big)^2 + \lambda \|B\|_1, \qquad (2)$$

where $\|\cdot\|_1$ is the trace norm (or nuclear norm), which promotes a low-rank $\hat{B}$, a key feature for the authors to establish bounds on the deviation of $\hat{B}$ from $B_*$. Slawski et al. (2015) and Koltchinskii & Xia (2015) considered the particular case where $p := p_1 = p_2$ and $B_*$ is assumed to belong to $S^+_p$, the cone of positive semidefinite matrices of order $p$, and showed that guarantees on the deviation of $\hat{B}$ from $B_*$ continue to hold when $\hat{B}$ is computed as

$$\hat{B} = \arg\min_{B \in S^+_p} \sum_{i=1}^{\ell} \big(y_i - \mathrm{tr}(B X_i)\big)^2. \qquad (3)$$

Here, the norm regularization of (2) is no longer present; it is replaced by the explicit restriction that $\hat{B}$ lie in $S^+_p$ (as $B_*$ does). This setting is tied to the learning of completely positive maps developed hereafter.
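For concreteness, the following minimal NumPy sketch (ours, not from the paper) illustrates model (1) on synthetic data: since $\mathrm{tr}(B^\top X) = \langle \mathrm{vec}(B), \mathrm{vec}(X)\rangle$, the unregularized estimator is plain least squares on vectorized inputs, and the nuclear-norm penalty of (2) or the PSD constraint of (3) would be added on top of this. All sizes and variable names below are illustrative assumptions.

```python
import numpy as np

# Synthetic trace regression data: y_i = tr(B_*^T X_i) + noise (assumed sizes).
rng = np.random.default_rng(0)
p1, p2, n = 8, 6, 500
B_star = rng.standard_normal((p1, p2))
Xs = rng.standard_normal((n, p1, p2))
ys = np.einsum('ij,nij->n', B_star, Xs) + 0.01 * rng.standard_normal(n)

# tr(B^T X) = <vec(B), vec(X)>, so the unregularized fit is ordinary least squares.
design = Xs.reshape(n, -1)                       # row i is vec(X_i)
B_hat = np.linalg.lstsq(design, ys, rcond=None)[0].reshape(p1, p2)
print(np.linalg.norm(B_hat - B_star) / np.linalg.norm(B_star))  # small relative error
```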
Partial trace regression model. Here, we propose the partial trace regression model, which generalizes the trace regression model to the case where both inputs and outputs are matrices; we thus go a step further than works assuming either matrix-variate inputs (Zhou & Li, 2014; Ding & Cook, 2014; Slawski et al., 2015; Luo et al., 2015) or matrix/tensor-variate outputs (Kong et al., 2019; Li & Zhang, 2017; Rabusseau & Kadri, 2016). Key to our work is the so-called partial trace, explained in the following section and depicted in Figure 1. This novel matrix-to-matrix regression model is of interest for several application tasks. For instance, in Brain-Computer Interfaces (BCI), covariance matrices are frequently used as features for representing the mental state of a subject (Barachant et al., 2011; Congedo et al., 2013), and those covariance matrices need to be transformed (Zanini et al., 2018) into other covariance matrices to be discriminative for some BCI tasks. Similarly, in Computer Vision and especially in the subfield of 3D shape retrieval, covariance matrices are of interest as descriptors (Guo et al., 2018; Hariri et al., 2017), while there is surging interest in deep learning methods defining trainable layers with covariance matrices as input and output (Huang & Van Gool, 2017; Brooks et al., 2019).

Contributions. We make the following contributions. i) We introduce the partial trace regression model, a family of linear predictors from matrix-valued inputs to matrix-valued outputs; this model encompasses previously known regression models, including the trace regression model and thus the linear regression model. ii) Borrowing concepts from quantum information theory, where partial trace operators have been extensively studied, we propose a framework for learning a partial trace regression model from data; we take advantage of the low-rank Kraus representation of completely positive maps to cast learning as an optimization problem that we can handle efficiently. iii) We provide statistical guarantees for the model learned under our framework, thanks to a provable estimate of the pseudo-dimension of the class of functions we consider. iv) Finally, we show the relevance of the proposed framework for the tasks of matrix-to-matrix regression and positive semidefinite matrix completion, both of which are amenable to a partial trace regression formulation; our empirical results show that the partial trace regression model yields good performance, demonstrating wide applicability and effectiveness.

2. Partial Trace Regression

Here, we introduce the partial trace regression model to encode linear mappings from matrix-valued spaces to matrix-valued spaces. We specialize this setting to completely positive maps, and show the optimization problem to which learning with the partial trace regression model translates, together with a generalization error bound for the associated learned model.

Table 1. Notational conventions used in the paper.

| Symbol | Meaning |
|---|---|
| $i, j, m, n, p, q$ | integers |
| $\alpha, \beta, \gamma, \ldots$ | real numbers |
| $\mathcal{X}, \mathcal{Y}, \mathcal{H}, \ldots$ | vector spaces¹ |
| $x, y, k, \ldots$ | vectors (or functions) |
| $X, Y, K, \ldots$ | matrices (or operators) |
| $\mathbf{X}, \mathbf{Y}, \mathbf{K}, \ldots$ | block matrices |
| $\Phi, \Lambda, \Gamma, \ldots$ | linear maps on matrices |
| $\top$ | transpose |
In addition, we show how the problem of (block) positive semidefinite matrix completion can be cast as a partial trace regression problem.

2.1. Notations, Block Matrices and Partial Trace

Notations. Our notational conventions are summarized in Table 1. For $n, m \in \mathbb{N}$, $M_{n\times m} = M_{n\times m}(\mathbb{R})$ denotes the space of all $n \times m$ real matrices. If $n = m$, we write $M_n$ instead of $M_{n\times n}$. If $M$ is a matrix, $M_{ij}$ denotes its $(i,j)$-th entry. For $M \in M_n$, $M \succeq 0$ indicates that $M$ is positive semidefinite (PSD); we may equivalently write $M \in S^+_n$. Throughout, $\{(X_i, Y_i)\}_{i=1}^{l}$ denotes a training sample of $l$ examples, with each $(X_i, Y_i)$ assumed to be drawn i.i.d. from a fixed but unknown distribution on $\mathcal{X} \times \mathcal{Y}$ where, from now on, $\mathcal{X} := M_p$ and $\mathcal{Y} := M_q$.

Block matrices. We make extensive use of block matrices, i.e., matrices that can be partitioned into submatrices of the same dimension. If $\mathbf{M} \in M_{nm}$, the number of block partitions of $\mathbf{M}$ directly depends on the number of factors of $n$ and $m$; to uniquely identify the partition we are working with, we always consider an $n \times n$ block partition, where $n$ will be clear from context; the number of rows and columns of the matrix at hand will thus be a multiple of $n$. The set $M_n(M_m)$ then denotes the space of $n \times n$ block matrices $\mathbf{M} = [[M_{ij}]]$ whose $(i,j)$ entry is an element of $M_m$.²

¹ We also use standard notations such as $\mathbb{R}^n$ and $M_n$.
² The space $M_n(M_m)$ is isomorphic to $M_n \otimes M_m$.

Partial trace operators. The partial trace, extensively studied and used in quantum computing (see, e.g., Rieffel & Polak 2011, Chapter 10), generalizes the trace operation to block matrices. The definition we work with is the following.

Definition 1 (Partial trace, see, e.g., Bhatia, 2003). The partial trace operator, denoted $\mathrm{tr}_m(\cdot)$, is the linear map from $M_q(M_m)$ to $M_q$ defined by $\mathrm{tr}_m(\mathbf{M}) = \big[\mathrm{tr}(M_{ij})\big]_{i,j=1,\ldots,q}$.

In other words, given a block matrix $\mathbf{M}$ of size $qm \times qm$, the partial trace is obtained by computing the trace of each block of size $m \times m$ of $\mathbf{M}$, as depicted in Figure 1. Note that in the particular case $q = 1$, the partial trace is the usual trace.

Remark 1 (Alternative partial trace). Another way of generalizing the trace operator to block matrices is the so-called block trace (Filipiak et al., 2018), which sums the diagonal blocks of a block matrix. We do not use it here.

2.2. The Partial Trace Regression Model

We are now set to define the partial trace regression model.

Definition 2 (Partial trace regression model). The partial trace regression model assumes, for a matrix-valued covariate pair $(X, Y)$ with $X$ taking values in $M_p$ and $Y$ taking values in $M_q$:

$$Y = \mathrm{tr}_m(A_* X B_*^\top) + \epsilon, \qquad (4)$$

where $A_*, B_* \in M_{qm \times p}$ are the unknown parameters of the model and $\epsilon$ is some matrix-valued random noise.

This assumes a stochastic linear relation between the input $X$ and the corresponding output $Y$ which, given an i.i.d. training sample $\{(X_i, Y_i)\}_{i=1}^{l}$ drawn according to (4), points to the learning of a linear mapping $\hat\Phi : M_p \to M_q$ of the form

$$\hat\Phi(X) = \mathrm{tr}_m(\hat{A} X \hat{B}^\top), \qquad (5)$$

where $\hat{A}, \hat{B} \in M_{qm \times p}$ are the parameters to be estimated. When $q = 1$, we observe that $\mathrm{tr}_m(A_* X B_*^\top) = \mathrm{tr}(A_* X B_*^\top) = \mathrm{tr}(B_*^\top A_* X)$, which is exactly the trace regression model (1), with the regression matrix parametrized as $A_*^\top B_*$.
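As a quick numerical illustration of Definition 1 and Figure 1 (a sketch of ours, not part of the paper), the snippet below computes $\mathrm{tr}_m$ of a random block matrix and checks that the case $q = 1$ reduces to the ordinary trace.

```python
import numpy as np

def partial_trace(M, q, m):
    """tr_m: map a (q*m x q*m) block matrix to the q x q matrix of its block traces."""
    return np.array([[np.trace(M[i*m:(i+1)*m, j*m:(j+1)*m]) for j in range(q)]
                     for i in range(q)])

rng = np.random.default_rng(0)
q, m = 3, 4
M = rng.standard_normal((q * m, q * m))
print(partial_trace(M, q, m).shape)                               # (3, 3)
assert np.isclose(partial_trace(M, 1, q * m)[0, 0], np.trace(M))  # q = 1: usual trace
```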
We now turn to the question of how to estimate the matrix parameters of the partial trace regression model while, as was done in (2) and (3), imposing some structure on the estimated parameters, so that learning comes with statistical guarantees. As we will see, our solution to this problem takes inspiration from the fields of quantum information and quantum computing, and amounts to using the Kraus representation of completely positive maps.

2.3. Completely Positive Maps, Kraus Decomposition

The space $L(M_p, M_q)$ of linear maps from $M_p$ to $M_q$ is a real vector space that has been thoroughly studied in mathematics, physics, and more specifically quantum computation and information (Bhatia, 2009; Nielsen & Chuang, 2000; Størmer, 2012; Watrous, 2018). The operators from $L(M_p, M_q)$ with special properties are the so-called completely positive maps, a family that builds upon the notion of positive maps.

Definition 3 (Positive maps, Bhatia, 2009). A linear map $\Phi \in L(M_p, M_q)$ is positive if $\Phi(M) \in S^+_q$ for all $M \in S^+_p$.

To define completely positive maps, we deviate slightly from the block matrix structure advocated before and consider block matrices from $M_m(M_p)$ and $M_m(M_q)$.

Definition 4 (Completely positive maps, Bhatia, 2009). $\Phi \in L(M_p, M_q)$ is $m$-positive if the associated map $\Phi_m \in L(M_m(M_p), M_m(M_q))$, which computes the $(i,j)$-th block of $\Phi_m(\mathbf{M})$ as $\Phi(M_{ij})$, is positive. $\Phi$ is completely positive if it is $m$-positive for every $m \geq 1$.

The following theorem lays out the connection between partial trace regression models and completely positive maps.

Theorem 1 (Stinespring representation, Watrous, 2018). Let $\Phi \in L(M_p, M_q)$. $\Phi$ can be written as $\Phi(X) = \mathrm{tr}_m(A X A^\top)$ for some $m \in \mathbb{N}$ and $A \in M_{qm \times p}$ if and only if $\Phi$ is completely positive.

This invites us to solve the partial trace learning problem by looking for a map $\hat\Phi \in L(M_p, M_q)$ of the form

$$\hat\Phi(X) = \mathrm{tr}_m(\hat{A} X \hat{A}^\top), \qquad (6)$$

where now, in comparison to the more general model (5), the operator $\hat\Phi$ to be estimated is a completely positive map that depends on a single matrix parameter $\hat{A}$. Restricting ourselves to such maps might seem limiting, but i) this is nothing but the partial trace version of the PSD-constrained trace regression model (3), which allows us to establish statistical guarantees, ii) the resulting optimization problem can take advantage of the Kraus decomposition of completely positive maps (see below), and iii) empirical performance is not impaired by this modeling choice.

Now that we have decided to focus on learning completely positive maps, we may introduce the last ingredient of our model, the Kraus representation.

Theorem 2 (Kraus representation, Bhatia, 2009). Let $\Phi \in L(M_p, M_q)$ be a completely positive linear map. Then there exist $A_j \in M_{q \times p}$, $1 \leq j \leq r$, with $r \leq pq$, such that

$$\Phi(X) = \sum_{j=1}^{r} A_j X A_j^\top \quad \text{for all } X \in M_p. \qquad (7)$$

The matrices $A_j$ are called Kraus operators and $r$ is the Kraus rank of $\Phi$. With such a decomposition, learning a completely positive map $\hat\Phi$ reduces to finding Kraus operators $A_j$ for some fixed hyperparameter $r$, where small values of $r$ correspond to a low-rank Kraus decomposition (Aubrun, 2009; Lancien & Winter, 2017) and favor computational efficiency and statistical guarantees. Given a sample $\{(X_i, Y_i)\}_{i=1}^{l}$, the training problem can now be written as

$$\arg\min_{A_j \in M_{q \times p}} \sum_{i=1}^{l} \ell\Big(Y_i, \sum_{j=1}^{r} A_j X_i A_j^\top\Big), \qquad (8)$$

where $\ell$ is a loss function. The loss we use in our experiments is the square loss $\ell(Y, \hat{Y}) = \|Y - \hat{Y}\|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm. When $\ell$ is the square loss and $q = 1$, problem (8) reduces to the PSD-constrained trace regression problem (3), with the regression matrix parametrized as $\sum_{j=1}^{r} A_j^\top A_j$.

Remark 2. Kraus and Stinespring representations both fully characterize completely positive maps. It has been shown that for a Kraus representation of rank $r$ there exists a Stinespring representation with dimension $m$ equal to $r$ (Watrous, 2018, Theorem 2.22). The Kraus representation is, however, more computationally friendly than the Stinespring representation: it has a simpler form and is easier to store and manipulate, since there is no need to build the matrix $A$ of size $qm \times p$. It also allows us to derive a generalization bound for partial trace regression, as we will see later.
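To make Theorems 1 and 2 concrete, here is a small NumPy sketch of ours (not from the paper) that evaluates a random Kraus map (7), checks that it sends a PSD input to a PSD output, and verifies that it coincides with one particular Stinespring form (6) with m = r, obtained by interleaving the Kraus operators into a single qm x p matrix; this interleaved construction is our own illustration choice.

```python
import numpy as np

def partial_trace(M, q, m):
    """tr_m of a (q*m x q*m) block matrix: the q x q matrix of block traces."""
    return np.array([[np.trace(M[i*m:(i+1)*m, j*m:(j+1)*m]) for j in range(q)]
                     for i in range(q)])

rng = np.random.default_rng(0)
p, q, r = 5, 3, 2
kraus = [rng.standard_normal((q, p)) for _ in range(r)]    # A_j in M_{q x p}
X = rng.standard_normal((p, p)); X = X @ X.T               # a PSD input

phi_kraus = sum(A @ X @ A.T for A in kraus)                # Kraus form (7)

# One Stinespring form (6) with m = r: row k of A_stine is row k // r of Kraus
# operator number k % r, so tr_m(A_stine X A_stine^T) sums the A_j X A_j^T.
A_stine = np.zeros((q * r, p))
for j, A in enumerate(kraus):
    A_stine[j::r, :] = A
phi_stine = partial_trace(A_stine @ X @ A_stine.T, q, r)

assert np.allclose(phi_kraus, phi_stine)                   # (6) and (7) agree
assert np.all(np.linalg.eigvalsh(phi_kraus) > -1e-9)       # output is PSD
```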
Optimization. Assuming that the loss function $\ell(\cdot,\cdot)$ is convex in its second argument, the resulting objective function is nonetheless non-convex with respect to the $A_j$. If we further assume that the loss is coercive and differentiable, then the learning problem admits a solution that is potentially a local minimizer. In practice, several classical approaches can be applied to solve this problem. We have, for instance, tested a block-coordinate descent algorithm (Luo & Tseng, 1992) that optimizes one $A_j$ at a time. However, given current algorithmic tools, we opted to solve the learning problem (8) using automatic differentiation (Baydin et al., 2017) and stochastic gradient descent, since the model can easily be implemented as a sum of products of matrices. This has the advantage of being efficient and allows one to leverage efficient computational hardware such as GPUs.

Note that at this point, although not backed by theory, we can consider multiple layers of mappings by composing several maps $\{\Phi_k\}$. This way of composing extends the BiMap layer introduced by Huang & Van Gool (2017), which limits their models to rank-1 Kraus decompositions. Interestingly, they also proposed a ReLU-like layer for PSD matrices that is applicable to our work as well. For two layers, this boils down to

$$\Phi_2 \circ \Phi_1(X) = \sum_{j_2=1}^{r_2} A^{(2)}_{j_2}\, \Gamma\Big(\sum_{j_1=1}^{r_1} A^{(1)}_{j_1} X A^{(1)\top}_{j_1}\Big) A^{(2)\top}_{j_2},$$

where $\Gamma$ is a nonlinear activation that preserves positive semidefiniteness. We also investigate this direction in our experiments.
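The paper's implementation relies on Keras/TensorFlow (see Section 3); purely as an illustration of how problem (8) can be handed to automatic differentiation and a gradient-based optimizer, here is a minimal PyTorch sketch with synthetic data. The class name, sizes, noise level and number of epochs are our own assumptions, not the paper's setup; a stacked (two-layer) variant would compose two such modules with a PSD-preserving nonlinearity in between.

```python
import torch

class KrausMap(torch.nn.Module):
    """Completely positive map Phi(X) = sum_j A_j X A_j^T with Kraus rank r."""
    def __init__(self, p, q, r):
        super().__init__()
        self.A = torch.nn.Parameter(0.1 * torch.randn(r, q, p))

    def forward(self, X):                                   # X: (batch, p, p)
        # Batched sum_j A_j X A_j^T, giving a (batch, q, q) output.
        return torch.einsum('jqp,npk,jlk->nql', self.A, X, self.A)

torch.manual_seed(0)
p, q, r, n = 20, 10, 5, 1000
Xs = torch.randn(n, p, p); Xs = Xs @ Xs.transpose(1, 2)     # PSD inputs
with torch.no_grad():                                       # synthetic targets from a random map
    Ys = KrausMap(p, q, r)(Xs) + 0.01 * torch.randn(n, q, q)

model = KrausMap(p, q, r)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)         # autodiff + Adam
for epoch in range(100):
    opt.zero_grad()
    loss = ((model(Xs) - Ys) ** 2).sum(dim=(1, 2)).mean()   # square (Frobenius) loss in (8)
    loss.backward()
    opt.step()
```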
Generalization. We now examine the generalization properties of partial trace regression via low-rank Kraus decomposition. Specifically, using the notion of pseudo-dimension, we provide an upper bound on the excess risk for the function class $F$ of completely positive maps with low Kraus rank, i.e., $F = \{\Phi : M_p \to M_q : \Phi \text{ is completely positive and its Kraus rank is equal to } r\}$. Recall that the expected loss of any hypothesis $h \in F$ is defined by $R(h) = \mathbb{E}_{(X,Y)}\big[\ell(Y, h(X))\big]$ and its empirical loss by $\hat{R}(h) = \frac{1}{l}\sum_{i=1}^{l}\ell(Y_i, h(X_i))$. The analysis presented here follows the lines of Srebro (2004), in which generalization bounds were derived for low-rank matrix factorization (see also Rabusseau & Kadri (2016), where similar results were obtained for low-rank tensor regression). In order to apply known results on pseudo-dimension, we consider the class of real-valued functions $\tilde{F}$ with domain $M_p \times [q] \times [q]$, where $[q] := \{1, \ldots, q\}$, defined by

$$\tilde{F} = \Big\{(X, s, t) \mapsto \Phi(X)_{st} \ : \ \Phi(X) = \sum_{j=1}^{r} A_j X A_j^\top\Big\}.$$

Lemma 3. The pseudo-dimension of the real-valued function class $\tilde{F}$ is upper bounded by $pqr\,\log\big(\tfrac{8epq}{r}\big)$.

We can now invoke standard generalization error bounds in terms of the pseudo-dimension (Mohri et al., 2018, Theorem 10.6) to obtain:

Theorem 4. Let $\ell : M_q \times M_q \to \mathbb{R}$ be a loss function satisfying $\ell(Y, Y') = \frac{1}{q^2}\sum_{s,t}\ell'(Y_{st}, Y'_{st})$ for some loss function $\ell' : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ bounded by $\gamma$. Then for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of a sample of size $l$, the following inequality holds for all $h \in F$:

$$R(h) \leq \hat{R}(h) + \gamma\sqrt{\frac{pqr\,\log\big(\tfrac{8epq}{r}\big)\,\log\big(\tfrac{l}{pqr}\big)}{l}} + \gamma\sqrt{\frac{\log\tfrac{1}{\delta}}{l}}.$$

The proofs of Lemma 3 and Theorem 4 are provided in the Supplementary Material. Theorem 4 shows that the Kraus rank $r$ plays the role of a regularization parameter. Our generalization bound suggests a trade-off between reducing the empirical error, which may require a more complex hypothesis set (large $r$), and controlling the complexity of the hypothesis set, which may increase the empirical error (small $r$).

2.4. Application to PSD Matrix Completion

Our partial trace model is designed to address the problem of matrix-to-matrix regression. We now show how it can also be applied to the problem of matrix completion. We start by recalling how the matrix completion problem fits into the standard trace regression model. Let $B_* \in M_m$ be a matrix whose entries $B_{*ij}$ are given only for $(i, j) \in \Omega$. Low-rank matrix completion can be addressed by solving the following optimization problem:

$$\arg\min_{B} \|P_\Omega(B) - P_\Omega(B_*)\|^2, \quad \text{s.t. } \mathrm{rank}(B) = r, \qquad (9)$$

where $P_\Omega(B)_{ij} = B_{ij}$ if $(i, j) \in \Omega$, and $0$ otherwise. This problem can be cast as a trace regression problem by considering $y_{ij} = P_\Omega(B_*)_{ij}$ and $X_{ij} = E_{ij}$, where $E_{ij}$, $1 \leq i, j \leq m$, are the matrix units of $M_m$. Indeed, it is easy to see that in this case problem (9) is equivalent to

$$\arg\min_{B} \sum_{(i,j) \in \Omega} \big(y_{ij} - \mathrm{tr}(B X_{ij})\big)^2, \quad \text{s.t. } \mathrm{rank}(B) = r. \qquad (10)$$

Since partial trace regression is a generalization of the trace regression model, one can ask what type of matrix completion problems can be viewed as partial trace regression models. The answer to this question is given by the following theorem.

Theorem 5 (Hiai & Petz, 2014, Theorem 2.49). Let $\Phi : M_p \to M_q$ be a linear mapping. Then the following conditions are equivalent:
1. $\Phi$ is completely positive.
2. The block matrix $\mathbf{M} \in M_p(M_q)$ defined by

$$M_{ij} = \Phi(E_{ij}), \quad 1 \leq i, j \leq p, \qquad (11)$$

is positive, where $E_{ij}$ are the matrix units of $M_p$.

Theorem 5 makes the connection between PSD block-decomposable matrices and completely positive maps. Our partial trace regression formulation is based on learning completely positive maps via low-rank Kraus decompositions, and can thus be applied to the problem of PSD matrix completion. The most straightforward application of Theorem 5 to PSD matrix completion is to consider the case where the matrix is block-structured with missing blocks. This boils down to solving the following optimization problem:

$$\arg\min_{A_k \in M_{q \times p}} \sum_{(i,j) \in \Omega} \|Y_{ij} - \Phi(X_{ij})\|_F^2, \quad \text{s.t. } \Phi(\cdot) = \sum_{k=1}^{r} A_k (\cdot) A_k^\top,$$

where $\|\cdot\|_F$ is the Frobenius norm, $Y_{ij}$ are the observed blocks of the matrix (indexed by $\Omega$) and $X_{ij}$ are the corresponding matrix units. A completely positive map $\Phi$ can thus be learned by our approach from the available blocks and then used to predict the missing blocks. This is a natural way to take into account local structure in the matrix and thereby improve completion performance. The approach is also applicable when no information about the block decomposability of the data matrix to be completed is available. In this case, the size of the blocks $q$ and the number of blocks $p$ can be viewed as hyperparameters of the model and can be tuned to fit the data. Note that when $q = 1$, our method reduces to standard trace regression-based matrix completion.
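As an illustration of the block-completion formulation above (a sketch of ours with hypothetical helper names, not the paper's code), the snippet below builds the training pairs (X_ij, Y_ij) = (E_ij, observed block) and fills in every block of the matrix with a learned completely positive map `phi`, for instance one trained as in the earlier PyTorch sketch.

```python
import numpy as np

def matrix_unit(p, i, j):
    """E_ij, the (i, j) matrix unit of M_p."""
    E = np.zeros((p, p)); E[i, j] = 1.0
    return E

def completion_pairs(observed_blocks, p):
    """Training pairs (E_ij, Y_ij) from the observed q x q blocks, keyed by (i, j)."""
    return [(matrix_unit(p, i, j), Y) for (i, j), Y in observed_blocks.items()]

def complete(phi, observed_blocks, p, q):
    """Rebuild the (p*q x p*q) matrix: keep observed blocks, predict missing ones with phi."""
    B = np.zeros((p * q, p * q))
    for i in range(p):
        for j in range(p):
            block = observed_blocks.get((i, j))
            if block is None:
                block = phi(matrix_unit(p, i, j))    # phi: M_p -> M_q, learned from the pairs
            B[i*q:(i+1)*q, j*q:(j+1)*q] = block
    return B
```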
3. Experiments

In this section we turn to evaluating our proposed partial trace regression (PTR) method. We conduct experiments on two tasks where partial trace regression can help, both in a simulated setting and on real data. In all the experiments, the PTR model is implemented in a Keras/TensorFlow framework and learned with Adam with the default learning rate (0.001) for 100 epochs. The batch size is typically 16. Our code is available at https://github.com/Stef-hub/partial_trace_kraus.

3.1. PSD to PSD Matrix Regression

We first describe experiments where the learning problem is naturally described as learning a mapping between two PSD matrices, first with simulated data and then applied to learning covariance matrices for Brain-Computer Interfaces.

3.1.1. EXPERIMENTS ON SIMULATED DATA

Our first goal is to show the ability of our model to accurately recover mappings conforming to its assumptions. We randomly draw a set of matrices $X_i$ and Kraus operators $A_j$, and build the matrices $Y_i$ using Equation (7), for various Kraus ranks $r$ and sizes $p$, $q$ of the input and output spaces, respectively. We train the model with 100, 1000 and 10000 samples on two simulated datasets with Kraus ranks 5 and 20, and show the results for the maps $20 \times 20 \to 10 \times 10$ and $100 \times 100 \to 40 \times 40$ in Figure 2. While 100 samples are not enough to obtain a good estimate, with more data PTR is able to accurately recover the model with the correct rank.

Figure 2. PSD-to-PSD predictive performance (mean squared error) of PTR as a function of its Kraus rank, on simulated data from the map $20 \times 20 \to 10 \times 10$ (top) and the map $100 \times 100 \to 40 \times 40$ (bottom).

We also compare, in this setting, our PTR to trace regression and to two baselines commonly used in multiple-output regression tasks: multivariate regression, where no rank assumption is made ($y = x^\top B + \epsilon$), and reduced-rank multivariate regression ($y = x^\top B + \epsilon$ with a low-rank constraint on $B$). For these two methods, $B$ is a matrix mapping the input $x$ to the output $y$, which are the vectorizations of the matrix-valued input $X$ and the matrix-valued output $Y$, respectively. We also compare PTR to three tensor-to-tensor regression methods: higher-order partial least squares (HOPLS) (Zhao et al., 2012), and the tensor train and Tucker tensor neural networks (NN) (Novikov et al., 2015; Cao & Rabusseau, 2017). All models are trained using 10000 simulated examples from the map $20 \times 20 \to 10 \times 10$ generated from a model with true Kraus rank 5. The results on 1000 test samples are shown in Table 2, where we display the best performance over ranks from 1 to 100 (when applicable) in terms of mean squared error (MSE). For the (reduced-rank) multivariate regression experiments we had to vectorize the matrices, thus removing some of the relevant structure of the output data. We note that reduced-rank multivariate regression performs worse than multivariate regression, since its rank is not related to the Kraus rank. In this experiment, PTR performs similarly to the tensor train NN and significantly better than all the other methods.

Table 2. Comparison of various regression models on the PSD-to-PSD task on simulated data with map $20 \times 20 \to 10 \times 10$ and true model Kraus rank 5. We report the best performance among tested model ranks from 1 to 100.
| Model | MSE |
|---|---|
| Partial Trace Regression | 0.001 ± 0.0008 |
| Trace Regression | 0.028 ± 0.0144 |
| Multivariate Regression | 0.058 ± 0.0134 |
| Reduced-rank Multivariate Regression | 0.245 ± 0.1023 |
| HOPLS | 1.602 ± 0.0011 |
| Tensor Train NN | 0.001 ± 0.0009 |
| Tucker NN | 0.595 ± 0.0252 |

Note that, in contrast to PTR, the tensor train NN does not preserve the PSD property. We also ran these experiments with fewer examples (100 instead of 10000). In this case PTR performs better than the tensor train NN (0.007 ± 0.013 for PTR versus 0.662 ± 0.364 for the tensor train NN) and again better than the other baselines.

3.1.2. MAPPING COVARIANCE MATRICES FOR DOMAIN ADAPTATION

In some applications, such as 3D shape recognition or Brain-Computer Interfaces (Barachant et al., 2011; Tabia et al., 2014), features take the form of covariance matrices, and algorithms taking into account the specific geometry of these objects are needed. For instance, in BCI, the "minimum distance to the mean (in the Riemannian sense)" classifier has been shown to be highly efficient for categorizing motor imagery tasks. For these tasks, distribution shifts usually occur between sessions of the same subject using the BCI. In such situations, one solution is to consider an optimal transport (OT) mapping of the covariance matrices from the source to the target domain (the different sessions) (Yair et al., 2019; Courty et al., 2016). Here, our goal is to learn such a covariance matrix mapping and to perform classification using the covariance matrices from one session (the source domain) as training data and those of the other session (the target domain) as test data. For learning the mapping, we consider only matrices from the training session. We adopt the experimental setting described by Yair et al. (2019) for generating the covariance matrices. From the optimal transport-based mapping obtained in an unsupervised way from the training session, we have pairs of input-output matrices of size $22 \times 22$ from which we want to learn our partial trace regressor. While introducing noise into the classification process, the benefit of such a regression function is to allow out-of-sample domain adaptation, as in Perrot et al. (2016). This means that we do not need to solve an optimal transport problem every time a new sample (or a batch of new samples) becomes available. In practice, we separate all the matrices (about 270) from the training session into a training/test split and use the training examples (230 samples) for learning the covariance mapping. For evaluating the quality of the learned mapping, we compare the classification performance of a minimum-distance-to-the-mean classifier in three situations:

- No adaptation between domains is performed; the training session data are used as is. This method is denoted No Adapt.
- All the matrices from the training session are mapped using the OT mapping. This is a full adaptation approach, denoted Full Adapt.
- Our approach, denoted OoSAdapt (for Out-of-Sample Adaptation), uses the 230 covariance matrices mapped using OT and the remaining matrices mapped using our partial trace regression model.

Table 3. Accuracy of a "minimum distance to the mean" classifier on the domain adaptation BCI task. We report results for the same subjects as in (Yair et al., 2019).

| Subject | No Adapt | Full Adapt | OoSAdapt |
|---|---|---|---|
| 1 | 73.66 | 73.66 | 72.24 |
| 3 | 65.93 | 72.89 | 68.86 |
| 7 | 53.42 | 64.62 | 59.20 |
| 8 | 73.80 | 75.27 | 73.06 |
| 9 | 73.10 | 75.00 | 76.89 |
For our approach, we report classification accuracy for a model of rank 20 and depth 1, trained using the Adam optimizer with a learning rate of 0.001 for 500 iterations. We also tested several other hyperparameters (rank and depth) without much variation in the average performance. Classification accuracy over the test set (the matrices from the second session) is reported in Table 3. We first note that, for all subjects, domain adaptation helps improve performance. Using the estimated mapping instead of the true one for projecting source covariance matrices into the target domain, we expect a loss of performance with respect to the Full Adapt method. Indeed, we observe that for Subjects 1 and 8 we incur small losses. For the other subjects, our method improves performance compared to no domain adaptation while allowing for out-of-sample classification. Interestingly, for Subject 9, using the estimated covariance matrices performs slightly better than using the exact ones.

3.2. PSD Matrix Completion

We now describe our experiments in the matrix completion setting, first with illustrative examples on simulated data, then more comprehensively in the setting of multi-view kernel matrix completion.

3.2.1. EXPERIMENTS ON SIMULATED DATA

We consider the problem of matrix completion in two settings: filling in fully missing blocks, and filling in individually missing values in a matrix. We perform our experiments on full-rank PSD matrices of size $28 \times 28$. We show the results on block completion in Figure 3, where we have trained our partial trace regression model (without stacking) with Kraus ranks 1, 5, 30 and 100. While Kraus rank 1 is not enough to retrieve the missing blocks, rank 5 already gives reasonable results, and ranks 30 and 100 are able to infer diagonal values that were totally missing from the training blocks. Completion performance in terms of mean squared error is reported in Table 4, showing that our PTR method compares favorably against trace regression and tensor train neural networks.

Figure 3. Completion performance on a simulated $28 \times 28$ matrix, with $p = 7$ and $q = 4$. Top left, middle and right: original matrix, original matrix with missing values, our result with a rank-1 Kraus decomposition. Bottom left, middle and right: our results with Kraus ranks 5, 30 and 100.

Table 4. Comparison of partial trace regression, trace regression and tensor train neural networks on (block) PSD matrix completion of a simulated $28 \times 28$ matrix with missing blocks.

| Model | MSE |
|---|---|
| Partial Trace Regression | 0.572 ± 0.019 |
| Trace Regression | 2.996 ± 1.078 |
| Tensor Train NN | 3.942 ± 1.463 |

Figures 4 and 5 illustrate the more traditional completion task where 35% and 85% (respectively) of the values in the matrix are missing, independently of the block structure. We fix $p$ and $q$ to 7 and 4, respectively, and investigate the effect that stacking the models has on completion performance. Note that $p$ and $q$ may be chosen via cross-validation.

Figure 4. Completion performance on a simulated $28 \times 28$ matrix, with $p = 7$ and $q = 4$ (35% missing values). Top row: target matrix and target matrix with missing data. Bottom row: our completion results with depth 1 (left) and depth 2 (right).

Figure 5. Completion performance on a simulated $28 \times 28$ matrix, with $p = 7$ and $q = 4$ (85% missing values). Top row: target matrix and target matrix with missing data. Bottom row: our completion results with depth 1 (left) and depth 2 (right).
With only a little missing data (Figure 4) there is very little difference between the results obtained with the one-layer and two-layer models. However, we observe that in the more difficult case (Figure 5), stacking partial trace regression models significantly improves the reconstruction performance.

3.2.2. SIMILARITY MATRIX COMPLETION

For our last set of experiments, we evaluate our partial trace regression model on the task of matrix completion for a multi-view dataset where each view contains missing samples. This scenario occurs in many fields where the data come from various sensors of different types producing heterogeneous view data (Rivero et al., 2017). Following Huusari et al. (2019), we consider the multiple features digits dataset, available online.³ The task is to classify digits (0 to 9) from six descriptions extracted from the images, such as Fourier coefficients or Karhunen-Loève coefficients. We selected 20 samples from each of the 10 classes, and computed RBF kernels for views with data samples in $\mathbb{R}^d$ and Chi² kernels for views with data samples in $\mathbb{Z}^d$, resulting in six $200 \times 200$ kernel matrices. We then randomly introduced missing samples within views, leading to missing values (rows/columns) in the kernel matrices. We vary the level of total missing samples in the whole dataset from 10% to 90%, taking care that every sample is observed in at least one view and that every view has observed samples.

³ https://archive.ics.uci.edu/ml/datasets/Multiple+Features

We first measure matrix completion performance in terms of reconstruction quality by computing $\sum_v \|K_v - \hat{K}_v\|_F$, with $K_v$ the original kernel matrix for view $v$ and $\hat{K}_v$ the completed one. We then analyse the success of our method in the classification task by learning an SVM classifier from a simple average combination of the input kernels. Here we compare our partial trace regression method to two very simple matrix completion baselines, namely zero and mean imputation, as well as to the more relevant CVKT method presented in Huusari et al. (2019). We perform the matrix completion with our algorithm using three block-structure configurations: $(p, q) = (20, 10)$, $(p, q) = (10, 20)$, and finally $(p, q) = (200, 1)$, which corresponds to trace regression. We consider depths 1 and 2 for our partial trace regression model, as well as Kraus ranks 50, 100 and 200.

Figure 6. Sum of the matrix completion errors for trace regression (TR) and our partial trace regression (PTR), over the views of the multiple features digits dataset, as a function of the amount of missing samples, for model depths 1 and 2.

Figure 6 shows the sum of matrix completion errors $\sum_v \|K_v - \hat{K}_v\|_F$ for our method and the trace regression method in various configurations. For depth 2, partial trace regression clearly outperforms the simpler trace regression ($(p, q) = (200, 1)$). With model depth 1, all the methods perform similarly. It might be that the real data considered in this experiment are more complex and nonlinear than our model assumes, so that stacking our models is useful for the performance gain.

Figure 7. SVM accuracy results on the digits dataset as functions of the amount of missing data samples. Left: classification accuracies with kernel matrices completed by the compared methods, and with "full" kernel matrices for reference. Middle: results of our method with respect to the depth of the model. Right: results of our method with respect to the assumed Kraus rank.
However, traditional trace regression does not seem able to capture the important aspects of these data, even in the stacked setting. Figure 7 shows in the left panel the SVM classification accuracies obtained by using the kernel matrices from the various completion methods, with detailed results focusing on our method in the middle and right panels. For these experiments, we used in the SVMs the kernel matrices giving the lowest completion errors. We observe that our model provides excellent kernel completion, since the classification accuracy is close to the performance obtained with no missing data. The stacked model seems able to accurately capture the data distribution, giving rise to very good classification performance. The middle panel confirms the observations from Figure 6: the depth of our model plays a crucial part in its performance, with depth 2 outperforming depth 1 in almost every case, except for the choice $(p, q) = (200, 1)$ corresponding to trace regression. The Kraus rank does not have a strong effect on classification performance (right panel), which justifies the usage of our method in a low-rank setting.

4. Conclusion

In this paper, we introduced the partial trace regression model, a family of linear predictors from matrix-valued inputs to matrix-valued outputs that generalizes previously proposed models such as trace regression and linear regression. We proposed a novel framework for estimating the partial trace regression model from data by learning low-rank Kraus decompositions of completely positive maps, and derived an upper bound on the generalization error. Our empirical study with synthetic and real-world datasets shows the promising performance of our proposed approach on the tasks of matrix-to-matrix regression and positive semidefinite matrix completion.

Acknowledgements

This work has been funded by the French National Research Agency (ANR) project QuantML (grant number ANR-19-CE23-0011). Part of this work was performed using computing resources of CRIANN (Normandy, France). The work by RH has in part been funded by Academy of Finland grants 334790 (MAGITICS) and 310107 (MACOME).

References

Aubrun, G. On almost randomizing channels with a short Kraus decomposition. Communications in Mathematical Physics, 288(3):1103–1116, 2009.

Barachant, A., Bonnet, S., Congedo, M., and Jutten, C. Multiclass brain-computer interface classification by Riemannian geometry. IEEE Transactions on Biomedical Engineering, 59(4):920–928, 2011.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research, 18(1):5595–5637, 2017.

Bhatia, R. Partial traces and entropy inequalities. Linear Algebra and its Applications, 370:125–132, 2003.

Bhatia, R. Positive Definite Matrices, volume 24. Princeton University Press, 2009.

Brooks, D. A., Schwander, O., Barbaresco, F., Schneider, J., and Cord, M. Riemannian batch normalization for SPD neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 15463–15474, 2019.

Candes, E. J., Strohmer, T., and Voroninski, V. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241–1274, 2013.

Cao, X. and Rabusseau, G. Tensor regression networks with various low-rank tensor approximations. arXiv preprint arXiv:1712.09520, 2017.
Congedo, M., Barachant, A., and Andreev, A. A new generation of brain-computer interface based on Riemannian geometry. arXiv preprint arXiv:1310.8115, 2013.

Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2016.

Ding, S. and Cook, R. D. Dimension folding PCA and PFC for matrix-valued predictors. Statistica Sinica, 24(1):463–492, 2014.

Fan, J., Gong, W., and Zhu, Z. Generalized high-dimensional trace regression via nuclear norm regularization. Journal of Econometrics, 212(1):177–202, 2019.

Filipiak, K., Klein, D., and Vojtková, E. The properties of partial trace and block trace operators of partitioned matrices. Electronic Journal of Linear Algebra, 33(1):3–15, 2018.

Gross, D., Liu, Y.-K., Flammia, S. T., Becker, S., and Eisert, J. Quantum state tomography via compressed sensing. Physical Review Letters, 105(15):150401, 2010.

Guo, Y., Wang, F., and Xin, J. Point-wise saliency detection on 3D point clouds via covariance descriptors. The Visual Computer, 34(10):1325–1338, 2018.

Hariri, W., Tabia, H., Farah, N., Benouareth, A., and Declercq, D. 3D facial expression recognition using kernel methods on Riemannian manifold. Engineering Applications of Artificial Intelligence, 64:25–32, 2017.

Hiai, F. and Petz, D. Introduction to Matrix Analysis and Applications. Springer Science & Business Media, 2014.

Huang, Z. and Van Gool, L. A Riemannian network for SPD matrix learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Huusari, R., Capponi, C., Villoutreix, P., and Kadri, H. Kernel transfer over multiple views for missing data completion, 2019.

Klopp, O. Rank penalized estimators for high-dimensional matrices. Electronic Journal of Statistics, 5:1161–1183, 2011.

Koltchinskii, V. and Xia, D. Optimal estimation of low rank density matrices. Journal of Machine Learning Research, 16(53):1757–1792, 2015.

Koltchinskii, V., Lounici, K., and Tsybakov, A. B. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329, 2011.

Kong, D., An, B., Zhang, J., and Zhu, H. L2RM: Low-rank linear regression models for high-dimensional matrix responses. Journal of the American Statistical Association, pp. 1–47, 2019.

Lancien, C. and Winter, A. Approximating quantum channels by completely positive maps with small Kraus rank. arXiv preprint arXiv:1711.00697, 2017.

Li, L. and Zhang, X. Parsimonious tensor response regression. Journal of the American Statistical Association, 112(519):1131–1146, 2017.

Luo, L., Xie, Y., Zhang, Z., and Li, W.-J. Support matrix machines. In ICML, pp. 938–947, 2015.

Luo, Z.-Q. and Tseng, P. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2018.

Nielsen, M. A. and Chuang, I. L. Quantum Computation and Quantum Information. Cambridge University Press, 2000.

Novikov, A., Podoprikhin, D., Osokin, A., and Vetrov, D. P. Tensorizing neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 442–450, 2015.

Perrot, M., Courty, N., Flamary, R., and Habrard, A. Mapping estimation for discrete optimal transport. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

Rabusseau, G. and Kadri, H. Low-rank regression with tensor responses. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1867–1875, 2016.
Rieffel, E. G. and Polak, W. H. Quantum Computing: A Gentle Introduction. MIT Press, 2011.

Rivero, R., Lemence, R., and Kato, T. Mutual kernel matrix completion. IEICE Transactions on Information and Systems, 100(8):1844–1851, 2017.

Rohde, A. and Tsybakov, A. B. Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887–930, 2011.

Slawski, M., Li, P., and Hein, M. Regularization-free estimation in trace regression with symmetric positive semidefinite matrices. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2782–2790, 2015.

Srebro, N. Learning with Matrix Factorizations. PhD thesis, MIT, 2004.

Størmer, E. Positive Linear Maps of Operator Algebras. Springer Science & Business Media, 2012.

Tabia, H., Laga, H., Picard, D., and Gosselin, P.-H. Covariance descriptors for 3D shape matching and retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4185–4192, 2014.

Watrous, J. The Theory of Quantum Information. Cambridge University Press, 2018.

Yair, O., Dietrich, F., Talmon, R., and Kevrekidis, I. G. Optimal transport on the manifold of SPD matrices for domain adaptation. arXiv preprint arXiv:1906.00616, 2019.

Zanini, P., Congedo, M., Jutten, C., Said, S., and Berthoumieu, Y. Transfer learning: A Riemannian geometry framework with applications to brain-computer interfaces. IEEE Transactions on Biomedical Engineering, 65(5):1107–1116, 2018.

Zhao, Q., Caiafa, C. F., Mandic, D. P., Chao, Z. C., Nagasaka, Y., Fujii, N., Zhang, L., and Cichocki, A. Higher order partial least squares (HOPLS): a generalized multilinear regression method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1660–1673, 2012.

Zhou, H. and Li, L. Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):463–483, 2014.