# collective_matrix_completion__8fd42950.pdf

Journal of Machine Learning Research 20 (2019) 1-43 Submitted 7/18; Revised 10/19; Published 10/19

Collective Matrix Completion

Mokhtar Z. Alaya mokhtarzahdi.alaya@gmail.com Modal X, UPL, Univ Paris Nanterre, F92000 Nanterre France

Olga Klopp kloppolga@math.cnrs.fr ESSEC Business School and CREST, F95021 Cergy France

Editor: Xiaotong Shen

Matrix completion aims to reconstruct a data matrix based on observations of a small number of its entries. Usually in matrix completion a single matrix is considered, which can be, for example, a rating matrix in recommendation system. However, in practical situations, data is often obtained from multiple sources which results in a collection of matrices rather than a single one. In this work, we consider the problem of collective matrix completion with multiple and heterogeneous matrices, which can be count, binary, continuous, etc. We ﬁrst investigate the setting where, for each source, the matrix entries are sampled from an exponential family distribution. Then, we relax the assumption of exponential family distribution for the noise. In this setting, we do not assume any speciﬁc model for the observations. The estimation procedures are based on minimizing the sum of a goodness-of-ﬁt term and the nuclear norm penalization of the whole collective matrix. We prove that the proposed estimators achieve fast rates of convergence under the two considered settings and we corroborate our results with numerical experiments. Keywords: High-dimensional prediction; Exponential families; Low-rank matrix estimation; Nuclear norm minimization; Low-rank optimization;

1. Introduction

Completing large-scale matrices has recently attracted great interest in machine learning and data mining since it appears in a wide spectrum of applications such as recommender systems (Koren et al., 2009; Bobadilla et al., 2013), collaborative ﬁltering (Netﬂix challenge) (Goldberg et al., 1992; Rennie and Srebro, 2005), sensor network localization (So and Ye, 2005; Drineas et al., 2006; Oh et al., 2010), system identiﬁcation (Liu and Vandenberghe, 2009), image processing (Hu et al., 2013), among many others. The basic principle of matrix completion consists in recovering all the entries of an unknown data matrix from incomplete and noisy observations of its entries. To address the high-dimensionality in matrix completion problem, statistical inference based on low-rank constraint is now an ubiquitous technique for recovering the underlying data matrix. Thus, matrix completion can be formulated as minimizing the rank of the matrix given a random sample of its entries. However, this rank minimization problem is in general NP-hard due to the combinatorial nature of the rank function (Fazel et al., 2001; Fazel, 2002). To alleviate this problem and make it tractable, convex relaxation strategies

c 2019 Mokhtar Z. Alaya and Olga Klopp.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/18-483.html.

Alaya and Klopp

were proposed, e.g., the nuclear norm relaxation (Srebro et al., 2005; Candes and Tao, 2010; Recht et al., 2010; Negahban and Wainwright, 2011; Klopp, 2014) or the max-norm relaxation (Cai and Zhou, 2016). Among those surrogate approximations, nuclear norm, which is deﬁned as the sum of the singular values of the matrix or the ℓ1-norm of its spectrum, is probably the most widely used penalty for low-rank matrix estimation, since it is the tightest convex lower bound of the rank (Fazel et al., 2001).

Motivations. Classical matrix completion focus on a single matrix, whereas in practical situations data is often obtained from a collection of matrices that may cover multiple and heterogeneous sources. For example, in e-commerce users express their feedback for diﬀerent items such as books, movies, music, etc. In social networks like Facebook and Twitter users often share their opinions and interests on a variety of topics (politics, social events, health). In this examples, informations from multiple sources can be viewed as a collection of matrices coupled through a common set of users.

Rather than exploiting user preference data from each source independently, it may be beneﬁcial to leverage all the available user data provided by various sources in order to generate more encompassing user models (Cantador et al., 2015). For instance, some recommender system runs into the so-called cold-start problem (Lam et al., 2008). A user is new or cold in a source when he has few to none rated items. Such user may have a rating history in auxiliary sources and we can use his proﬁle in the auxiliary sources to recommend relevant items in the target source. For example, a user s favorite movie genres may be derived from his favorite book genres. Therefore, this shared structure among the sources can be useful to get better predictions (Singh and Gordon, 2008; Bouchard et al., 2013; Gunasekar et al., 2016).

More generally speaking, collective matrix completion ﬁnds a natural application in the problem of recommender system with side information. In this problem, in addition to the conventional user-item matrix, it is assumed that we have side information about each user (Chiang et al., 2015; Jain and Dhillon, 2013; Fithian and Mazumder, 2018; Agarwal et al., 2011). For example, in blog recommendation task, we may have access to user generated content (images, tags and text) or user activity (e.g., likes and reblogs). Such side information may be used to improve the quality of recommendation of blogs of interest (Shin and Lee, 2015).

Based on the type of available side information, various methods for recommender systems with side information have been proposed. It can be user generated content (Armentano et al., 2013; Hannon et al., 2010), user/item proﬁle or attribute (Agarwal et al., 2011), social network (Jamali and Ester, 2010; Ma et al., 2011) and context information (Natarajan et al., 2013). A very interesting surveys of the state-of-the-art methods can be found in (Fithian and Mazumder, 2018; Natarajan et al., 2013).

On the other hand, our framework includes the model of Mixed Data Frames with missing observations (Pag es, 2014; Udell et al., 2016). Here matrices collect categorical, numerical and count observations. They appear in numerous applications including in ecology, patient records in health care (Gunasekar et al., 2016), quantitative gene expression values (Natarajan and Dhillon, 2014; Zitnik and Zupan, 2014, 2015), and also in recommender systems and survey data.

Collective Matrix Completion

Main contributions and related literature. In this paper, we extend the theory of low-rank matrix completion to a collection of multiple and heterogeneous matrices. We ﬁrst consider general matrix completion setting where we assume that for each matrix its entries are sampled from natural exponential distributions (Lehmann and Casella, 1998). In this setting, we may have Gaussian distribution for continuous data; Bernoulli for binary data; Poisson for count-data, etc. In a second part, we relax the assumption of exponential family distribution for the noise and we do not assume any speciﬁc model for the observations. This approach is more popular and widely used in machine learning. The proposed estimation procedure is based on minimizing the sum of a goodness-of-ﬁt term and the nuclear norm penalization of the whole collective matrix. The key challenge in our analysis is to use joint low-rank structure and our algorithm is far from the trivial one which consists in estimating each source matrix separately. We provide theoretical guarantees on our estimation method and show that the collective approach provides faster rate of convergences. We further corroborate our theoretical ﬁndings through simulated experiments.

Previous works on collective matrix completion are mainly based on matrix factorization (Srebro et al., 2005). In a nutshell, this approach ﬁts the target matrix as the product of two low-rank matrices. Matrix factorization gives rise to non-convex optimization problems and its theoretical understanding is quite limited. For example, Singh and Gordon (2008) proposed the collective matrix factorization that jointly factorizes multiple matrices sharing latent factors. As in our setting, each matrix can have a diﬀerent value type and error distribution. In Singh and Gordon (2008), the authors use Bregman divergences to measure the error and extend standard alternating projection algorithms to this setting. They consider a quite general setting which includes as a particular case the nuclear norm penalization approach that we study in the present paper. They do not provide any theoretical guarantee. A Bayesian model for collective matrix factorization was proposed in Singh and Gordon (2010). Horii et al. (2014) and Xu et al. (2016) also consider collective matrix factorization and investigate the strength of the relation among the source matrices. Their estimation procedure is based on penalization by the sum of the nuclear norms of the sources. The convex formulation for collective matrix factorization was proposed in Bouchard et al. (2013) where the authors consider a general situation when the set of matrices do not necessarily have a common set of rows/columns. When this is the case, the estimator proposed in Bouchard et al. (2013) is quite similar to ours. Their algorithm is based on the iterative Singular Value Thresholding and the authors conduct empirical evaluations of this approach on two real data sets.

Most of the previous papers focus on the algorithmic side without providing theoretical guarantees for the collective approach. One exception is the paper by Gunasekar et al. (2015) where the authors prove consistency of the estimate under two observation models: noise-free and additive noise models. Their estimation procedure is based on minimizing the least squares loss penalized by the nuclear norm. To prove the consistency of their estimator, Gunasekar et al. (2015) assume that all the source matrices share the same low-rank factor. They consider the uniform sampling scheme for the observations (see Assumptions 1 and 4 in Gunasekar et al. (2015)). Uniform sampling is an usual assumption in matrix completion literature (see, e.g., (Candes and Tao, 2010; Cand es and Recht, 2009; Davenport et al., 2014)). This assumption is restrictive in many applications such as recommendations

Alaya and Klopp

systems. The theoretical analysis in the present paper is carried out for general sampling distributions. Similar to our setting, matrix completion with side information explores the available user data provided by various sources. For instance Jain and Dhillon (2013) and Xu et al. (2013) introduce the so-called Inductive Matrix Completion (IMC). It models side information as knowledge of feature spaces. They show that if the features are perfect (e.,g., see Deﬁnition 1 in Chiang et al. (2018) for perfect side information), the sample complexity can be reduced. More precisely, in works on matrix completion with side information, it is usually assumed that one has partially observed low-rank matrix of interest M Rd1 d2

and, additionally, one has access to two matrices of features A Rd1 r1 and B Rd2 r2

where each row of A (or B) denotes the feature of the i-th row (or column) entity of M, ri < di for i = 1, 2 and M = AZBT . The main diﬀerence with our setting is that, here, A and B are assumed to be fully observed while our model allows also missing observations for the set of features. The perfect side information assumption is strong and hard to meet in practice. Chiang et al. (2015) relaxed it by assuming that the side information may be noisy (not perfect). In this approach, referred as Dirty IMC, they assume that the unknown matrix is modeled as M = AZBT + N where the residual matrix N models imperfections and noise in the features. Several works consider matrix completion with side information. For example, Chiang et al. (2015) proposes a method based on penalization by the sum of the nuclear norms of M and of each feature. Our method is based on the penalization by the nuclear norm of the whole matrix built of the matrix M and the features A and B. In Jain and Dhillon (2013), the authors study the problem of low-rank matrix estimation using rank one measurements. In the noise-free setting, they assume that all the features are known and that the matrices of features are incoherent. The method proposed in Jain and Dhillon (2013) is based on non-convex matrix factorization. In Fithian and Mazumder (2018), the authors consider a general framework for reduced-rank modeling of matrix-valued data. They use a generalized weighted nuclear norm penalty where the matrix is multiplied by positive semideﬁnite matrices P and Q which depend on the matrix of features. In Agarwal et al. (2011), the authors introduce a per-item user covariate logistic regression model augmenting with user-speciﬁc random eﬀects. Their approach is based on a multilevel hierarchical model. In the case of the heterogeneous data coming from diﬀerent sources, these approaches can be applied for recovering each source separately. In contrast, our approach aims at collecting all the available information in a single matrix which results in faster rates of convergence. On the other hand, popular algorithms for matrix completion with side information, such as Maxide in Xu et al. (2013) and Alt Min in Jain and Dhillon (2013), are based on the least square loss which could be not suitable for data coming from non-Gaussian distributions. If we consider a single matrix, our model includes as particular case 1-bit matrix completion and, more generally, matrix completion with exponential family noise. 1-bit matrix completion was ﬁrst studied in Davenport et al. (2014), where the observed entries are assumed to be sampled uniformly at random. This problem was also studied among others by (Cai and Zhou, 2013; Klopp et al., 2015; Alquier et al., 2017). Matrix completion with exponential family noise (for a single matrix) was previously considered in Lafond (2015) and Gunasekar et al. (2014). In these papers authors assume sampling with replacement

Collective Matrix Completion

where there can be multiple observations for the same entry. In the present paper, we consider more natural setting for matrix completion where each entry may be observed at most once. Our result improves the known results on 1-bit matrix completion and on matrix completion with exponential family noise. In particular, we obtain exact minimax optimal rate of convergence for 1-bit matrix completion and matrix completion with exponential noise which was known up to a logarithmic factor (for more details see Remark 9 in Section 3).

Organization of the paper. The remainder of the paper is organized as follow. In Section 1.1, we introduce basic notation and deﬁnitions. Section 2 sets up the formalism for the collective matrix completion. In Section 3, we investigate the exponential family noise model. In Section 4, we study distribution-free setup and we provide the upper bound on the excess risk. To verify the theoretical ﬁndings, we corroborate our results with numerical experiments in Section 5, where we present an eﬃcient iterative algorithm that solves the maximum likelihood approximately. The proofs of the main results and key technical lemmas are postponed to the appendices.

1.1. Preliminaries

For the reader s convenience, we provide a brief summary of the standard notation and the deﬁnitions that will be frequently used throughout the paper.

Notation. For any positive integer m, we use [m] to denote {1, . . . , m}. We use capital bold symbols such as X, Y , A, to denote matrices. For a matrix A, we denote its (i, j)-th

entry by Aij. As usual, let A F = q P i,j A2 ij be the Frobenius norm and let A =

maxi,j |Aij| denote the elementwise ℓ -norm. Additionally, A stands for the nuclear norm (trace norm), that is A = P i σi(A) where σ1(A) σ2(A) are singular values of A, and A = σ1(A) to denote the operator norm. The inner product between two matrices is denoted by A, B = tr(A B) = P ij Aij Bij, where tr( ) denotes the trace of a matrix. We write Ψ the subdiﬀerential mapping of a convex functional Ψ. Given two real numbers a and b, we write a b = max(a, b) and a b = min(a, b). The symbols P and E denote generic probability and expectation operators whose distribution is determined from the context. The notation c will be used to denote positive constant, that might change from one instance to the other.

Deﬁnition 1 A distribution of a random variable X is said to belong to the natural exponential family, if its probability density function characterized by the parameter η is given by: X|η fh,G(x|η) = h(x) exp ηx G(η) ,

where h is a nonnegative function, called the base measure function, which is independent of the parameter η. The function G(η) is strictly convex, and is called the log-partition function, or the cumulant function. This function uniquely deﬁnes a particular member distribution of the exponential family, and can be computed as: G(η) = log R h(x) exp(ηx)dx .

If G is smooth enough, we have that E[X] = G (η) and Var[X] = G (η), where G

stands for the derivative of G. The exponential family encompasses a wide large of standard distributions such as:

Alaya and Klopp

Normal, N(µ, σ2) (known σ), is typically used to model continuous data, with natural parameter η = µ

σ2 and G(η) = σ2

Gamma, Γ(λ, α) (known α), is often used to model positive valued continuous data, with natural parameter η = λ and G(η) = α log( η).

Negative binomial, NB(p, r) (known r), is a popular distribution to model overdispersed count data, whose variance is larger than their mean, with natural parameter η = log(1 p) and G(η) = r log(1 exp(η)).

Binomial, B(p, N) (known N), is used to model number of successes in N trials, with natural parameter η = log( p 1 p) (logit function) and G(η) = N log(1 + exp(η)).

Poisson, P(λ), is used to model count data, with natural parameter η = log(λ) and G(η) = exp(η).

Exponential, chi-squared, Rayleigh, Bernoulli and geometric distributions are special cases of the above ﬁve distributions.

Deﬁnition 2 Let S be a closed convex subset of Rm and Φ : S dom(Φ) R a continuously-diﬀerentiable and strictly convex function. The Bregman divergence associated with Φ (Bregman, 1967; Censor and Zenios, 1997) dΦ : S S [0, ) is deﬁned as dΦ(x, y) = Φ(x) Φ(y) x y, Φ(y) ,

where Φ(y) represents the gradient vector of Φ evaluated at y.

The value of the Bregman divergence dΦ(x, y) can be viewed as the diﬀerence between the value of Φ at x and the ﬁrst Taylor expansion of Φ around y evaluated at point x. For exponential family distributions, the Bregman divergence corresponds to the Kullback Leibler divergence (Banerjee et al., 2005) with Φ = G.

2. Collective matrix completion

Assume that we observe a collection of matrices X = (X1, . . . , XV ). In this collection components Xv Rdu dv have a common set of rows. This common set of rows corresponds, for example, to a common set of users in a recommendation system. The set of columns of each matrix Xv corresponds to a diﬀerent type of entity. In the case of recommender system it can be books, ﬁlms, video game, etc. Then, the entries of each matrix Xv corresponds to the user s rankings for this particular type of products. We assume that the distribution of each matrix Xv depends on the matrix of parameters M v. This distribution can be diﬀerent for diﬀerent v. For instance, we can have binary observations for one matrix Xv1 with entries which correspond, for example, to like/dislike labels for a certain type of products, multinomial for another matrix Xv2 with ranking going from 1 to 5 and Gaussian for a third matrix Xv3. As it happens in many applications, we assume that for each matrix Xv we observe only a small subset of its entries. We consider the following model: for v [V ] and (i, j) [du] [dv], let Bv ij be independent Bernoulli random variables with parameter πv ij.

Collective Matrix Completion

We suppose that Bv ij are independent from Xv ij. Then, we observe Y v ij = Bv ij Xv ij. We can think of the Bv ij as masked variables. If Bv ij = 1, we observe the corresponding entry of Xv, and when Bv ij = 0, we have a missing observation. In the simplest situation each coeﬃcient is observed with the same probability, i.e. for every v [V ] and (i, j) [du] [dv], πv ij = π. In many practical applications, this assumption is not realistic. For example, for a recommendation system, some users are more active than others and some items are more popular than others and thus rated more frequently. Hence, the sampling distribution is in fact non-uniform. In the present paper, we consider general sampling model where we only assume that each entry is observed with a positive probability:

Assumption 1 Assume that there exists a positive constant 0 < p < 1 such that

min v [V ] min (i,j) [du] [dv] πv ij p.

Let Π denotes the joint distribution of the Bernoulli variables Bv ij : (i, j) [du] [dv], v [V ] . For any matrix A Rdu D where D = P v [V ] dv, we deﬁne the weighted Frobenius norm

A 2 Π,F = X

(i,j) [du] [dv] πv ij(Av ij)2.

Assumption 1 implies A 2 Π,F p A 2 F . For each v [V ] let us denote πv i = Pdv j=1 πv ij and

πv j = Pdu i=1 πv ij. Note we can easily get an estimations of πv i and πv j using the empirical frequencies: c πv i = X

j [dv] Bv ij and d πv j = X

i [du] Bv ij.

Let πi = P v [V ] πv i , π j = maxv [V ] πv j, and µ be an upper bound of its maximum, that is max (i,j) [du] [dv](πi , π j) µ. (1)

3. Exponential family noise

In this section we assume that for each v distribution of Xv belongs to the exponential family, that is

Xv ij|Mv ij fhv,Gv(Xv ij|Mv ij) = hv(Xv ij) exp Xv ij Mv ij Gv(Mv ij) .

We denote M = (M 1, . . . , M V ) and let γ be an upper bound on the sup-norm of M, that is γ = |γ1| |γ2|, where γ1 Mv ij γ2 for every v [V ] and (i, j) [du] [dv]. Hereafter, we denote by C (γ) = W Rdu D : W γ , the ℓ -norm ball with radius γ in the space Rdu D. We need the following assumptions on densities fhv,Gv:

Alaya and Klopp

Assumption 2 For each v [V ], we assume that the function Gv( ) is twice diﬀerentiable and there exits two constants L2 γ, U2 γ satisfying:

sup η [ γ 1

K ] (Gv) (η) U2 γ, (2)

and inf η [ γ 1

K ] (Gv) (η) L2 γ, (3)

for some K > 0.

The ﬁrst statement, (2), in Assumption 2 ensures that the distributions of Xv ij have uniformly bounded variances and sub-exponential tails (see Lemma 28 in Appendix C). The second one, (3), is the strong convexity condition satisﬁed by the log-partition function Gv. This assumption is satisﬁed for most standard distributions presented in the previous section. In Table 1, we list the corresponding constants in Assumption 2.

Model (Gv) (η) (Gv) (η) L2 γ U2 γ

Normal σ2η σ2 σ2 σ2

Binomial Neη 1+eη Neη (1+eη)2 Ne (γ+ 1

4 Gamma (if γ1γ2 > 0) α

η α η2 α (γ+ 1

K )2 α (|γ1| |γ2|)2

Negative binomial reη 1 eη reη (1 eη)2 re (γ+ 1

K ))2 re(γ+ 1

K )2 Poisson eη eη e (γ+ 1

Table 1: Examples of the corresponding constants L2 γ and U2 γ from Assumption 2.

3.1. Estimation procedure

To estimate the collection of matrices of parameters M = (M 1, . . . , M V ), we use penalized negative log-likelihood. Let W Rdu D, we divide it in V blocks W v Rdu dv: W = (W 1, . . . , W V ). Given observations Y = (Y 1, . . . , Y V ), we write the negative log-likelihood as LY(W) = 1

(i,j) [du] [dv] Bv ij Y v ij W v ij Gv(W v ij) .

The nuclear norm penalized estimator c M of M is deﬁned as follows:

c M = ( c M 1, . . . , c M V ) = argmin W C (γ) LY(W) + λ W , (4)

where λ is a positive regularization parameter that balances the trade-oﬀbetween model ﬁt and privileging a low-rank solution. Namely, for large value of λ the rank of the estimator c M is expected to be small. Let the collection of matrices (Ev 11, . . . , Ev dudv) form the canonical basis in the space of matrices of size du dv. The entry of (Ev ij) is 0 everywhere except for the (i, j)-th entry

Collective Matrix Completion

where it equals to 1. For (εv ij), an i.i.d Rademacher sequence, we deﬁne ΣR = (Σ1 R, . . . , ΣV R) where for all v [V ]

Σv R = 1 du D

(i,j) [du] [dv] εv ij Bv ij Ev ij.

We now state the main result concerning the recovery of M. Theorem 3 gives a general upper bound on the estimation error of c M deﬁned by (4). Its proof is postponed in Appendix A.1.

Theorem 3 Assume that Assumptions 1 and 2 hold, and λ 2 LY(M) . Then, with probability exceeding 1 4/(du + D) we have

1 du D c M M 2 Π,F c

p max n du D rank(M) λ2

L4γ + γ2(E[ ΣR ])2 , γ2 log(du + D)

where c is a numerical constant.

Using Assumption 1, Theorem 3 implies the following bound on the estimation error measured in normalized Frobenius norm.

Corollary 4 Under assumptions of Theorem 3 and with probability exceeding 1 4/(du+D), we have

1 du D c M M 2 F c

p2 max n du D rank(M) λ2

L4γ + γ2(E[ ΣR ])2 , γ2 log(du + D)

In order to get a bound in a closed form we need to obtain a suitable upper bounds on E[ ΣR ] and on LY(M) with high probability. Therefore we use the following two lemmas.

Lemma 5 There exists an absolute constant c such that

log(du D) du D

Lemma 6 Let Assumption 2 holds. Then, there exists an absolute constant c such that, with probability at least 1 4/(du + D), we have

(Uγ K) µ + (log(du D))3/2

The proofs of Lemmas 5 and 6 are postponed to Appendices A.2 and A.3. Recall that the condition on λ in Theorem 3 is that λ 2 LY(M) . Using Lemma 6, we can choose

(Uγ K) µ + (log(du D))3/2

With this choice of λ, we obtain the following theorem:

Alaya and Klopp

Theorem 7 Let Assumptions 1 and 2 be satisﬁed. Then, with probability exceeding 1 4/(du + D) we have

1 du D c M M 2 Π,F c rank(M)

γ2 + (Uγ K)2

µ + log3(du D) ,

1 du D c M M 2 F c rank(M)

γ2 + (Uγ K)2

µ + log3(du D) ,

where c is an absolute constant.

Remark 8 Note that the rate of convergence in Theorem 7 has the following dominant term: 1 du D c M M 2 F rank(M)µ

where the symbol means that the inequality holds up to a multiplicative constant. If we assume that the sampling distribution is close to the uniform one, that is that there exists positive constants c1 and c2 such that for every v [V ] and (i, j) [du] [dv] we have c1p πv ij c2p, then Theorem 7 yields

1 du D c M M 2 F rank(M)

If we complete each matrix separately, the error will be of the order PV v=1 rank(M v)/p(du D). As rank(M) PV v=1 rank(M v), the rate of convergence achieved by our estimator is faster compared to the penalization by the sum-nuclear-norm. In order to get a small estimation error, p should be larger than rank(M)/(du D). We denote n = P v [V ] P (i,j) [du] [dv] πv ij, the expected number of observations. Then, we get the following condition on n:

n c rank(M)(du D).

Remark 9 In 1-bit matrix completion (Davenport et al., 2014; Klopp et al., 2015; Alquier et al., 2017), instead of observing the actual entries of the unknown matrix M Rd D, for a random subset of its entries Ωwe observe {Yij {+1, 1} : (i, j) Ω}, where Yij = 1 with probability f(Mij) for some link-function f. In Davenport et al. (2014) the parameter M is estimated by minimizing the negative log-likelihood under the constraints M γ and M γ

rd D for some r > 0. Under the assumption that rank(M) r, the authors prove that 1 d D c M M 2 F cγ

where cγ is a constant depending on γ (see Theorem 1 in Davenport et al. (2014)). A similar result using max-norm minimization was obtained in Cai and Zhou (2013). In (Klopp et al.,

Collective Matrix Completion

2015) the authors prove a faster rate. Their upper bound (see Corollary 2 in Klopp et al. (2015)) is given by

1 d D c M M 2 F cγ rank(M)(d D) log(d D)

In the particular case of 1-bit matrix completion for a single matrix under uniform sampling scheme, Theorem 7 implies the following bound:

1 d D c M M 2 F cγ rank(M)(d D)

which improves (6) by a logarithmic factor. Furthermore, Klopp et al. (2015) provide rank(M)(d D)/n as the lower bound for 1-bit matrix completion (see Theorem 3 in Klopp et al. (2015)). So our result answers the important theoretical question what is the exact minimax rate of convergence for 1-bit matrix completion which was previously known up to a logarithmic factor. In a more general setting of matrix completion with exponential family noise, the minimax optimal rate of convergence was also known only up to logarithmic factor (see Lafond (2015)). Our result provides the exact minimax optimal rate in this more general setting too. It is easy to see, by inspection of the proof of the lower bound in Lafond (2015), that the upper bound provided by Theorem 7 is optimal for the collective matrix completion.

Remark 10 Note that our estimation method is based on the minimization of the nuclearnorm of the whole collective matrix M. Another possibility is to penalize by the sum of the nuclear norms P v [V ] M v (see, e.g., Klopp et al. (2015)). This approach consists in estimating each component matrix independently.

4. General losses

In the previous section we assume that the link functions Gv are known. This assumption is not realistic in many applications. In this section we relax this assumption in the sense that we do not assume any speciﬁc model for the observations. Recall that our observations are a collection of partially observed matrices Y v = (Bv ij Xv i,j) Rdu dv for v = 1, . . . , V and Xv = (Xv ij) Rdu dv. We are interested in the problem of prediction of the entries of the collective matrix X = (X1, . . . , XV ). We consider the risk of estimating Xv with a loss function ℓv, which measures the discrepancy between the predicted and actual value with respect to the given observations. We focus on non-negative convex loss functions that are Lipschitz:

Assumption 3 (Lipschitz loss function) For every v [V ], we assume that the loss function ℓv(y, ) is ρv-Lipschitz in its second argument: |ℓv(y, x) ℓv(y, x )| ρv|x x |.

Some examples of the loss functions that are 1-Lipschitz are: hinge loss ℓ(y, y ) = max(0, 1 yy ), logistic loss ℓ(y, y ) = log(1 + exp( yy )), and quantile regression loss ℓ(y, y ) = ℓτ(y y) where τ (0, 1) and ℓτ(z) = z(τ 1(z 0)).

Alaya and Klopp

For a matrix M = (M 1, . . . , M V ) Rdu D, we deﬁne the empirical risk as

RY(M) = 1 du D

(i,j) [du] [dv] Bv ijℓv(Y v ij, Mv ij).

We deﬁne the oracle as: M = M 1, . . . , M V = argmin Q C (γ) R(Q) (7)

where R(Q) = E[RY(Q)]. Here the expectation is taken over the joint distribution of {(Y v ij, Bv ij) : (i, j) [du] [dv] and v [V ]}. We use machine learning approach and will

provide an estimator c M that predicts almost as well as M. Thus we will consider excess

risk R( c M) R( M). By construction, the excess risk is always positive. For a tuning parameter Λ > 0, the nuclear norm penalized estimator c M is deﬁned as

c M argmin Q C (γ)

RY(Q) + Λ Q . (8)

We next turn to the assumption needed to establish an upper bound on the performance of the estimator c M deﬁned in (8).

Assumption 4 Assume that there exists a constant ς > 0 such that for every Q C (γ), we have R(Q) R( M) ς

du D Q M 2 Π,F .

This assumption has been extensively studied in the learning theory literature (Mendelson, 2008; Zhang, 2004; Bartlett et al., 2004; Alquier et al., 2017; Elsener and van de Geer, 2018), and it is called Bernstein condition. It is satisﬁed in various cases of loss function (Alquier et al., 2017) and it ensures a suﬃcient convexity of the risk around the oracle deﬁned in (7). Note that when the loss function ℓv is strongly convex, the risk function inherits this property and automatically satisﬁes the margin condition. In other cases, this condition requires strong assumptions on the distribution of the observations, for instance for hinge loss or quantile loss (see Section 6 in Alquier et al. (2017)). The following result gives an upper bound on the excess risk of the estimator c M.

Theorem 11 Let Assumptions 1, 3 and 4 hold and set ρ = maxv [V ] ρv. Suppose that

Λ 2 sup{ G : G RY( M)}. Then, with probability at least 1 4/(du + D), we have

R( c M) R( M) c

p max n rank( M)du D ρ3/2p

γ/ς(E[ ΣR ])2 + Λ2

, ργ + ρ3/2p

log(du + D) du D

Theorem 11 gives a general upper bound on the prediction error of the estimator c M. Its proof is presented in Appendix A.4. In order to get a bound in a closed form we need to

obtain a suitable upper bounds on sup{ G : G (RY( M))} with high probability.

Collective Matrix Completion

Lemma 12 Let Assumption 3 holds. Then, there exists an absolute constant c such that, with probability at least 1 4/(du + D), we have

for all G RY( M).

The proof of Lemma 12 is given in Appendix A.5. Using Lemma 12 , we can choose

and with this choice of Λ and Lemma 5, we obtain the following theorem:

Theorem 13 Let Assumptions 1, 3 and 4 hold. Then, we have

R( c M) R( M) c

p rank( M)(ρ2 + ρ3/2p

γ/ς)(µ + log(du D))

with probability at least 1 4/(du + D).

Using Assumption 4, we get the following corollary:

Corollary 14 With probability at least 1 4/(du + D), we have

1 du D c M M 2 F c

p2ς rank( M)(ρ2 + ρ3/2p

γ/ς)(µ + log(du D))

1-bit matrix completion. In 1-bit matrix completion with logistic (resp. hinge) loss, the Bernstein assumption is satisﬁed with ς = 1/(4e2γ) (resp. ς = 2τ, for some τ that

veriﬁes | Mv ij 1/2| τ, v [V ], (i, j) [du] [dv]). More details for these constants can be found in Propositions 6.1 and 6.3 in Alquier et al. (2017). Then, the excess risk with respect to these two losses under the uniform sampling is given by:

Corollary 15 With probability at least 1 4/(du + D), we have

R( c M) R( M) c rank( M) p(du D).

These results are obtained without a logarithmic factor, and it improves the ones given in Theorems 4.2 and 4.4 in Alquier et al. (2017). The natural loss in this context is the 0/1 loss which is often replaced by the hinge or the logistic loss. We assume without loss of generality that γ = 1, since the Bayes classiﬁer has its entries in [ 1, 1], and we deﬁne the classiﬁcation excess risk by:

R0/1(M) = 1 du D

(i,j) [du] [dv] πv ij P[Xv ij = sign(Mv ij)],

for all M Rdu D. Using Theorem 2.1 in Zhang (2004), we have

R0/1( c M) R0/1( M) c

v u u t rank( M) p(du D).

Alaya and Klopp

5. Numerical experiments

In this section, we ﬁrst provide algorithmic details of the numerical procedure for solving the problem (4), then we conduct experiments on synthetic data to further illustrate the theoretical results of the collective matrix completion.

5.1. Algorithm

The collective matrix completion problem (4) is a semideﬁnite program (SDP), since it is a nuclear norm minimization problem with a convex feasible domain (Fazel et al., 2001; Srebro et al., 2005). We may solve it, for example, via the interior-point method (Liu and Vandenberghe, 2010). However, SDP solvers can handle a moderate dimensions, thus such formulation is not scalable due to the storage and computation complexity in lowrank matrix completion tasks. In the following, we present an algorithm that solves the problem (4) approximately and in a more eﬃcient way than solving it as SDP.

Proximal Gradient. Problem (4) can be solved by ﬁrst-order optimization methods such as proximal gradient (PG) which has been popularly used for optimizations problems of the form of (4) (Beck and Teboulle, 2009; Nesterov, 2013; Parikh and Boyd, 2014; Ji and Ye, 2009b; Mazumder et al., 2010; Yao and Kwok, 2015). When LY has L-Lipschitz continuous gradient, that is LY(W) LY(Q) F L W Q F , the PG generates a sequence of estimates {Wt} as

Wt+1 = argmin W LY(W) + (W Wt) LY(Wt) + L

2 W Wt 2 F + λ W

L (Zt), where Zt = Wt 1

L LY(Wt) (9)

and for any convex function Ψ : Rdu D 7 R, the associated proximal operator at W Rdu D is deﬁned as

proxΨ(W) = argmin 1

2 W Q 2 F + Ψ(Q) : Q Rdu D .

The proximal operator of the nuclear norm at W Rdu D corresponds to the singular value thresholding (SVT) operator of W (Cai. et al., 2010). That is, assuming a singular value decomposition W = UΣV , where U Rdu r, V RD r have orthonormal columns, Σ = (σ1, . . . , σr), with σ1 σr > 0 and r = rank(W), we have

SVTλ/L(W) = Udiag((σ1 λ/L)+, . . . , (σr λ/L)+)V , (10)

where (a)+ = max(a, 0). Although PG can be implemented easily, it converges slowly when the Lipschitz constant L is large. In such scenarios, the rate is O(1/T), where T is the number of iterations (Parikh and Boyd, 2014). Nevertheless, it can be accelerated by replacing Zt in (9) with

Qt = (1 + θt)Wt θt Wt 1, Zt = Qt η LY(Qt). (11)

Several choices for θt can be used. The resultant accelerated proximal gradient (APG) (see Algorithm 1) converges with the optimal O(1/T 2) rate (Nesterov, 2013; Ji and Ye, 2009a).

Collective Matrix Completion

Algorithm 1: APG for Collective Matrix Completion

1. initialize: W0 = W1 = Y, and α0 = α1 = 1.

2. for t = 1, . . . , T do

3. Qt = Wt + αt 1 1

αt (Wt Wt 1);

4. Wt+1 = SVT λ

5. αt+1 = 1

4α2 t + 1 + 1);

6. return WT+1.

Approximate SVT (Yao and Kwok, 2015). To compute Wt+1 in the proximal step (SVT) in Algorithm 1, we need ﬁrst perform SVD of Zt given in (11). In general, obtaining the SVD of du D matrix Zt requires O((du D)du D) operations, because its most expensive steps are computing matrix-vector multiplications. Since the computation of the proximal operator of the nuclear norm given in (10) does not require to do the full SVD, only a few singular values of Zt which are larger than λ/L are needed. Assume that there are ˆk such singular values. As Wt converges to a low-rank solution W , ˆk will be small during iterating. The power method (Halko et al., 2011) at Algorithm 2 is a simple and eﬃcient to capture subspace spanned by top-k singular vectors for ˆk k. Additionally, the power method also allows warm-start, which is particularly useful because the iterative nature of APG algorithm. Once an approximation Q is found, we have SVTλ/L(Zt) = QSVTλ/L(Q Zt) (see Proposition 3.1 in Yao and Kwok (2015)). We therefore reduce the time complexity on SVT from O((du D)du D) to O(ˆkdu D) which is much cheaper.

Algorithm 2: Power Method: Power Method(Z, R, ϵ)

1. input: Z Rdu D, initial R RD k for warm-start, tolerance δ;

2. initialize W1 = ZR;

3. for t = 1, 2, . . . , do

4. Qt+1 = QR(Wt);// QR denotes the QR factorization

5. Wt+1 = Z(Z Qt+1);

6. if Qt+1Q t+1 Qt Q t F δ then break;

7. return Qt+1.

Algorithm 3 shows how to approximate SVTλ/L(Zt). Let the target (exact) rank-k SVD of Zt be UkΣk V k . Step 1 ﬁrst approximates Uk by the power method. In steps 2 to 5, a less expensive SVTλ/L(Q Zt) is obtained from (10). Finally, SVTλ/L(Zt) is recovered. Hereafter, we denote the objective function in (4) by Fλ(W), that is Fλ(W) = LY(W) + λ W , for any W C (γ). Recall that the gradient of the likelihood LY is written as

(i,j) [du] [dv] Bv ij(Y v ij (Gv) (W v ij))Ev ij.

Alaya and Klopp

Algorithm 3: Approximate SVT: Approx-SVT(Z, R, λ, δ)

1. input: Z Rdu D, R RD k, thresholds λ and δ;

2. Q = Power Method(Z, R, δ);

3. [U, Σ, V] = SVD(Q Z);

4. U = {ui|σi > λ};

5. V = {vi|σi > λ};

6. Σ = max(Σ λI, 0); // (I denotes the identity matrix)

7. return QU, Σ, V.

By Assumption 2, we have for any W, Q Rdu D

LY(W) LY(Q) 2 F = 1 (du D)2 X

(i,j) [du] [dv] {Bv ij((Gv) (W v ij) (Gv) (Qv ij))}2

U2 γ (du D)2 W Q 2 F .

This yields that LY has L-Lipschitz continuous gradient with L = Uγ/(du D) 1. In the following algorithm and the experimental setup, we choose to work with L = 1.

Penalized Likelihood Accelerated Inexact Soft Impute (PLAIS-Impute). We present here the main algorithm in this paper, referred to as PLAIS-Impute, which is tailored to solving our collective matrix completion problem. The PLAIS-Impute is an adaption of the AIS-Impute algorithm in Yao and Kwok (2015) to the penalized likelihood completion problems. Note that AIS-Impute is an accelerated proximal gradient algorithm with further speed up based on approximate SVD. However, it is dedicated only to squareloss goodness-of-ﬁtting. The PLAIS-Impute is summarized in Algorithm 4. The core steps are 10-12, where an approximate SVT is performed. Steps 10 and 11 use the column space of the last iterations (Vt and Vt 1) to warm-start the power method. For further speed up, a continuation strategy is employed in which λt is initialized to a large value and then decreases gradually. The algorithm is restarted (at the step 14) if the objective function Fλ starts to increase. As AIS-Impute, PLAIS-Impute shares both low-iteration complexity and fast O(1/T 2) convergence rate (see Theorem 3.4 in Yao and Kwok (2015)).

5.2. Synthetic datasets

Software. The implementation of Algorithm 4 for the nuclear norm penalized estimator (4) was done in MATLAB R2017b on a desktop computer with mac OS system, Intel i7 Core 3.5 GHz CPU and 16GB of RAM. For fast computation of SVD and sparse matrix computations, the experiments call an external package called PROPACK (Larsen, 1998) implemented in C and Fortran. The code that generates all ﬁgures given below is available from https://github.com/mzalaya/collectivemc.

Experimental setup. In our experiments we focus on square matrices. We set the number of the source matrices V = 3, then, for each v {1, 2, 3}, the low-rank ground truth parameter matrices M v Rd dv are created with sizes d {3000, 6000, 9000} and

Collective Matrix Completion

Algorithm 4: PLAIS-Impute for Collective Matrix Completion

1. input: observed collective matrix Y, parameter λ, decay parameter ν (0, 1), tolerance ε;

2. [U0, λ0, V0] = rank-1 SVD(Y);

3. initialize c = 1, δ0 = Y F , W0 = W1 = λ0U0V 0 ;

4. for t = 1, . . . , T do

5. δt = νtδ0;

6. λt = νt(λ0 λ) + λ;

7. θt = (c 1)/(c + 2);

8. Qt = (1 + θt)Wt θt Wt 1;

9. Zt = LY(Qt));

10. Vt 1 = Vt 1 Vt(V t Vt 1);

11. Rt = QR([Vt, Vt 1]);

12. [Ut+1, Σt+1, Vt+1] = Approx-SVT(Zt, Rt, λt, δt);

13. if Fλ(Ut+1Σt+1V t+1) > Fλ(UtΣt V t ) then c = 1;

15. if |Fλ(Ut+1Σt+1V t+1) Fλ(UtΣt V t )| ε then break;

16. return WT+1.

dv {1000, 2000, 3000} (hence d = D = P3 v=1 dv). Each source matrix M v is constructed as M v = Lv Rv where Lv Rd rv and Rv Rdv rv. This gives a random matrix of rank at most rv. The parameter rv is set to {5, 10, 15}. A fraction of the entries of M v is removed uniformly at random with probability p [0, 1]. Then, the matrices M v are scaled so that M v = γ = 1.

For M 1, the elements of L1 and R1 are sampled i.i.d. from the normal distribution N(0.5, 1). For M 2, the entries of L2 and R2 are i.i.d. according to Poisson distribution with parameter 0.5. Finally, for M 3, the entries of L3 and R3 are i.i.d. sampled from Bernoulli distribution with parameter 0.5. The collective matrix M is constructed by concatenation of the three sources M 1, M 2 and M 3, namely M = (M 1, M 2, M 3). All the details of these experiments are given in Table 2.

The details of our experiments are summarized in Figures 1 and 2. In Figure 1, we plot the convergence of the objective function Fλ versus time in the three experiments. Note that PLAIS-Impute inherits the speed of AIS-Impute as it does not require performing SVD and it has both low per-iteration cost and fast convergence rate. In Figure 1, we plot also the convergence of the objective function Fλ versus log(λ) in the three experiments. The regularization parameter in the PLAIS-Impute is initialized to a large value and decreased gradually. In Figure 2, we illustrate a learning rank curve obtained by PLAIS-Impute, where the green color corresponds to the input rank and the cyan color to the recovered rank of the collective matrix M.

Alaya and Klopp

M 1 M 2 M 3 M (Gaussian) (Poisson) (Bernoulli) (Collective)

exp.1 dimension 3000 1000 3000 1000 3000 1000 3000 3000

rank 5 5 5 unknown

exp.2 dimension 6000 2000 6000 2000 6000 2000 6000 6000

rank 10 10 10 unknown

exp.3 dimension 9000 3000 9000 3000 9000 3000 9000 9000

rank 15 15 15 unknown

Table 2: Details of the synthetic data in the three experiments.

0 10 20 30 Times (seconds)

Collective Gaussian Poisson Bernoulli

0 50 100 150 200 Times (seconds)

Collective Gaussian Poisson Bernoulli

0 100 200 300 400 500 600 Times (seconds)

Collective Gaussian Poisson Bernoulli

5 0 5 10 15 20 log(λ)

Collective Gaussian Poisson Bernoulli

5 0 5 10 15 20 log(λ)

Collective Gaussian Poisson Bernoulli

5 0 5 10 15 20 log(λ)

Collective Gaussian Poisson Bernoulli

Figure 1: Convergence of the objective function Fλ in problem (4) versus time (top) and versus log(λ) (bottom) in the three experiments with p = 0.6; left for exp.1; middle for exp.2; right for exp.3. Note that the objective functions for Gaussian, Poisson and Bernoulli distributions are calculated separately by the algorithm.

Collective Matrix Completion

0 5 10 15 Iterations

Learned ranks

Ranks In Ranks Out

0 5 10 15 Iterations

Learned ranks

Ranks In Ranks Out

0 5 10 15 20 Iterations

Learned ranks

Ranks In Ranks Out

Figure 2: Learning ranks curve versus iterations in the three experiments with p = 0.6; left for exp.1; middle for exp.2; right for exp.3. We initialize the algorithm by setting a rank r0 = 5r where r {5, 10, 15}. The green color corresponds to the input rank while the cyan to the recovered rank of the collective matrix at each iteration. As can be seen, the two ranks gradually converge to the ﬁnal recovered rank.

Evaluation. In our experiments, the PLAIS-Impute algorithm terminates when the absolute diﬀerence in the cost function values between two consecutive iterations is less than ϵ = 10 6. We set the regularization parameter λ LY(M) as given by Theorem 3. Note that in step 12 of PLAIS-Impute, the threshold in SVT is given by λt (deﬁned in step 6), which is decreasing from one iteration to another. This allows to tune the ﬁrst regularization parameter λ in the program (4). We randomly sample 80% of the observed entries for training, and the rest for testing. In order to measure the the accuracy of our estimator, we employ the relative error (as, e.g., in Cai. et al. (2010); Davenport et al. (2014); Cai and Zhou (2013)) which is widely used metric in matrix completion and is deﬁned by

RE(c W , W ) = c W W o F

where c W is the recovered matrix and W o is the original full data matrix. We run the PLAIS-Impute algorithm in each experiment by varying the percentage of known entries p from 0 to 1. In Figure 3, we plot the relative errors as a functions of p. We observe in Figure 3 that the relative errors are decaying with p. Note that for each v {1, 2, 3}, the estimator d M v is calculated separately using the same program (4). The results shown in Figure 3 conﬁrm that collective matrix completion approach outperforms the approach that consists in estimating each component source independently.

Cold-Start problem. To simulate cold-start scenarios, we choose one of the source matrices M v to be cold by increasing its sparsity. More precisely, we proceed in the following way: we extract vector of known entries of the chosen matrix and we set the ﬁrst 1/5 fraction of its entries to be equal to 0. We denote the obtained matrix by M v cold and the collective matrix by Mv cold. In exp.1, exp.2 and exp.3, we increase the sparsity of M 1, M 2, and M 3, respectively. Hence, we get the cold collective matrices M1 cold = (M 1 cold, M 2, M 3), M2 cold = (M 1, M 2 cold, M 3), and M3 cold = (M 1, M 2, M 3 cold).

Alaya and Klopp

0.2 0.4 0.6 0.8 1.0 Percentage of known entries

Relative errors

Collective Gaussian Poisson Bernoulli

0.2 0.4 0.6 0.8 1.0 Percentage of known entries

Relative errors

Collective Gaussian Poisson Bernoulli

0.2 0.4 0.6 0.8 1.0 Percentage of known entries

Relative errors

Collective Gaussian Poisson Bernoulli

Figure 3: Performance on the synthetic data in terms of relative errors between the target and the estimator matrices as a function of the percentage of known entries p from 0 to 1.

We run 10 times the PLAIS-Impute algorithm for recovering the source M v cold and the collective Mv cold for each v = 1, 2, 3. We denote by \ M vcomp the estimator of M v cold obtained by running the PLAIS-Impute algorithm only for this component. Analogously, we denote \ M v collect the estimator of M v cold obtained by extracting the v-th source of the collective estimator \ Mv cold.

In Figure 4, we report the relative errors RE( \ M vcomp, M v cold) and RE( \ M v collect, M v cold) in the three experiments. We see that, the collective matrix completion approach compensates the lack of informations in the cold source matrix. Therefore, this shared structure among the sources is useful to get better predictions.

d = 3000 d = 6000 d = 9000 Dimensions

Relative errors

RE(c M1 collect, M 1 cold)

RE(c M1 comp, M 1 cold)

RE(c M2 collect, M 2 cold)

RE(c M2 comp, M 2 cold)

RE(c M3 collect, M 3 cold)

RE(c M3 comp, M 3 cold)

Figure 4: Relative errors over a set of 10 randomly generated datasets according to the cold-start scenarios (with the black lines representing the standard deviation) between the target and the estimator matrices.

Collective Matrix Completion

6. Conclusion

This paper studies the problem of recovering a low-rank matrix when the data are collected from multiple and heterogeneous source matrices. We ﬁrst consider the setting where, for each source, the matrix entries are sampled from an exponential family distribution. We then relax this assumption. The proposed estimators are based on minimizing the sum of a goodness-of-ﬁt term and the nuclear norm penalization of the whole collective matrix. Allowing for non-uniform sampling, we establish upper bounds on the prediction risk of our estimator. As a by-product of our results, we provide exact minimax optimal rate of convergence for 1-bit matrix completion which previously was known upto a logarithmic factor. We present the proximal algorithm PLAIS-Impute to solve the corresponding convex programs. The empirical study provides evidence of the eﬃciency of the collective matrix completion approach in the case of joint low-rank structure compared to estimate each source matrices separately.

Acknowledgment

We would like to thank the Associated Editor and the two anonymous Referees for extremely valuable comments and remarks that helped us greatly to improve the paper. This work was supported by grants from DIM Math Innov Rgion Ile-de-France https: //www.dim-mathinnov.fr

Alaya and Klopp

Appendix A. Proofs

We provide proofs of the main results, Theorems 3 and 11, in this section. The proofs of a few technical lemmas including Lemmas 5, 6 and 12 are also given. Before that, we recall some basic facts about matrices.

Basic facts about matrices. The singular value decomposition (SVD) of A has the form A = Prank(A) l=1 σl(A)ul(A)v l (A) with orthonormal vectors u1(A), . . . , urank(A)(A), orthonormal vectors v1(A), . . . , vrank(A)(A), and real numbers σ1(A) σrank(A)(A) > 0 (the singular values of A). Let (S1(A), S2(A)) be the pair of linear vectors spaces, where S1(A) is the linear span space of {u1(A), . . . , urank(A)(A)}, and S2(A) is the linear span space of {v1(A), . . . , vrank(A)(A)}. We denote by S j (A) the orthogonal complements of Sj(A), for j = 1, 2 and by PS the projector on the linear subspace S of Rn or Rm. For two matrices A and B, we set P A(B) = PS 1 (A)BPS 2 (A) and PA(B) = B P A(B). Since PA(B) = PS1(A)B + PS 1 (A)BPS2(A), and rank(PSj(A)B) rank(A), we have that rank(PA(B)) 2 rank(A). (12)

It is easy to see that for two matrices A and B (Klopp, 2014)

A B PA(A B) P A(A B) . (13)

Finally, we recall the well-known trace duality property: for all A, B Rn m, we have

| A, B | B A .

A.1. Proof of Theorem 3

First, noting that c M is optimal and M is feasible for the convex optimization problem (4), we thus have the basic inequality that

(i,j) [du] [dv] Bv ij Gv( ˆ Mv ij) Y v ij ˆ Mv ij + λ c M

(i,j) [du] [dv] Bv ij Gv(Mv ij) Y v ij Mv ij + λ M .

(i,j) [du] [dv] Bv ij Gv( ˆ Mv ij) Gv(Mv ij) Y v ij ˆ Mv ij Mv ij λ( M c M ).

Using the Bregman divergence associated to each Gv, we get

(i,j) [du] [dv] Bv ijd Gv( ˆ Mv ij, Mv ij)

λ( M c M ) 1 du D

(i,j) [du] [dv] Bv ij (Gv) (Mv ij) Y v ij ˆ Mv ij Mv ij .

Collective Matrix Completion

Therefore, using the duality between and , we arrive at

(i,j) [du] [dv] Bv ijd Gv( ˆ Mv ij, Mv ij) λ( M c M ) LY(M), c M M

λ( M c M ) + LY(M) c M M .

Besides, using the assumption λ 2 LY(M) and inequality (13) lead to

(i,j) [du] [dv] Bv ijd Gv( ˆ Mv ij, Mv ij) 3λ

2 PM c M M .

Since PA(B) p

2 rank(A) B F for any two matrices A and B, we obtain

(i,j) [du] [dv] Bv ijd Gv( ˆ Mv ij, Mv ij) 3λ

2 rank(M) c M M F . (14)

Now, Assumption 2 implies that the Bregman divergence satisﬁes L2 γ(x y)2 2dv G(x, y) U2 γ(x y)2, then we get

2 Y( c M, M) 2

(i,j) [du] [dv] Bv ijd Gv( ˆ Mv ij, Mv ij), (15)

where 2 Y( c M, M) = 1 du D

(i,j) [du] [dv] Bv ij( ˆ Mv ij Mv ij)2.

Combining (14) and (15), we arrive at

2 Y( c M, M) 3λ

2 rank(M) c M M F . (16)

Let us now deﬁne the threshold β = 946γ2 log(du+D)

pdu D and distinguish the two following cases that allows us to obtain an upper bound for the estimation error: Case 1: if (du D) 1 c M M 2 Π,F < β, then the statement of Theorem 3 is true.

Case 2: it remains to consider the case (du D) 1 c M M 2 Π,F β. Lemma 18 in Ap-

pendix B.1 implies c M M F 1 4

2 rank(M) c M M , then we obtain

32 rank(M) c M M F .

This leads to c M C β, 32 rank(M) , where the set

C (β, r) = W C (γ) : M W r W M F and (du D) 1 W M 2 Π,F β .

Alaya and Klopp

Using Lemma 19 in Appendix B.1, we have

2 Y( c M, M) W M 2 Π,F 2du D 44536 rank(M)γ2(E[ ΣR ])2 5567γ2

du Dp . (18)

Together (18) and (16) imply

1 2du D c M M 2 Π,F 3λ

2 rank(M) c M M F

+ 44536 rank(M)γ2(E[ ΣR ])2 + 5567γ2

p L4γ rank(M) + 1 4du D c M M 2 Π,F

+ 44536 rank(M)γ2(E[ ΣR ])2 + 5567γ2

1 4du D c M M 2 Π,F 18λ2du D

p L4γ rank(M)

+ 44536p 1du D rank(M)γ2(E[ ΣR ])2 + 5567γ2

1 du D c M M 2 Π,F p 1 max du D rank(M) c1λ2

L4γ + c2γ2(E[ ΣR ])2 , c3γ2

where c1, c2 and c3 are numerical constants. This concludes the proof of Theorem 3.

A.2. Proof of Lemma 5

We use the following result:

Proposition 16 (Corollary 3.3 in Bandeira and van Handel (2016)) Let W be the n m rectangular matrix whose entries Wij are independent centered bounded random variables. Then there exists a universal constant c such that

κ1 κ2 + κ p

where we have deﬁned

κ1 = max i [n]

j [m] E[W 2 i,j], κ2 = max j [m]

i [n] E[W 2 i,j], and κ = max (i,j) [n] [m] |Wij|.

We apply Proposition 16 to ΣR = 1 du D P v [V ] P (i,j) [du] [dv] εv ij Bv ij Ev ij. We compute

κ1 = 1 du D max i [du]

j [dv] E[(εv ij)2(Bv ij)2] = 1 du D max i [du]

j [dv] πv ij

= 1 du D max i [du] pπi ,

Collective Matrix Completion

κ2 = 1 du D max v [V ] max j [dv]

i [du] E[(εv ij)2(Bv ij)2] = 1 du D max v [V ] max j [dv]

i [du] πv ij

1 du D max j [dv]

i [du] πv ij

1 du D max j [dv] pπ j,

and κ = 1 du D maxv [V ] max(i,j) [du] [dv] |εv ij Bij| 1 du D. Using inequality (1), we have κ1 µ du D and κ2 µ du D. Then, κ1 κ2 µ du D, which establishes Lemma 5.

A.3. Proof of Lemma 6

We write LY(M) = 1 du D P v [V ] P (i,j) [du] [dv] Hv ij Ev ij, with Hv ij = Bv ij Xv ij (Gv) (Mv ij) . For a truncation level T > 0 to be chosen, we decompose LY(M) = Σ1 + Σ2, where

(i,j) [du] [dv]

Hv ij1((Xv ij E[Xv ij]) T) E Hv ij1((Xv ij E[Xv ij]) T) Ev ij,

(i,j) [du] [dv]

Hv ij1((Xv ij E[Xv ij])>T) E Hv ij1((Xv ij E[Xv ij])>T) Ev ij,

then, the triangular inequality implies LY(M) Σ1 + Σ2 . Then, the proof is divided on two steps: Step 1: control of Σ1 . In order to control Σ1 , we use the following bound on the spectral norms of random matrices. It is obtained by extension to rectangular matrices via self-adjoint dilation of Corollary 3.12 and Remark 3.13 in Bandeira and van Handel (2016).

Proposition 17 (Bandeira and van Handel, 2016) Let W be the n m rectangular matrix whose entries Wij are independent centered bounded random variables. Then, for any 0 ϵ 1/2 there exists a universal constant cϵ such that for every x 0,

2(1 + ϵ)(κ1 κ2) + x (n m) exp x2

where κ1, κ2, and κ are deﬁned as in Proposition 16.

We apply Proposition 17 to Σ1. We compute

κ1 = 1 du D max i [du]

j [dv] E h Hv ij1((Xv ij E[Xv ij]) T) E Hv ij1((Xv ij E[Xv ij]) T) 2i .

Besides, we have

E h Hv ij1((Xv ij E[Xv ij]) T) E Hv ij1((Xv ij E[Xv ij]) T) 2i E (Hv ij)21((Xv ij E[Xv ij]) T) ,

Alaya and Klopp

E (Hv ij)21((Xv ij E[Xv ij]) T) = E (Bv ij)2 Xv ij E[Xv ij])21((Xv ij E[Xv ij]) T)

πv ij Var[Xv ij]

= πv ij(Gv) (Mv ij).

By Assumption 2, we obtain E (Hv ij)21((Xv ij E[Xv ij]) T) πv ij U2 γ for all v [V ], (i, j) [du] [dv]. Then,

du D max i [du]

j [dv] πv ij Uγ

du D max i [du]

du D max j [dv]

i [du] πv ij Uγ

du D max j [dv] pπ j Uγ µ

It yields, κ1 κ2 Uγ µ

du D . Moreover, we have E Hv ij1((Xv ij E[Xv ij]) T) T, which entails

κ 2T du D. By choosing ϵ = 1/2 in Proposition 17, we obtain, with probability at least

1 4(du D)e x2,

Σ1 3Uγ 2µ + 2 c1/2x T

Therefore, by setting x = p

2 log(du + D), we get with probability at least 1 4/(du + D),

Σ1 3Uγ 2µ + 2 c1/2 p

2 log(du + D)T du D . (19)

Step 2: control of Σ2 . To control Σ2 , we use Chebyshev s inequality, that is

P Σ2 E[ Σ2 ] + x Var[ Σ2 ]

x2 , for all x > 0.

We start by estimating E[ Σ2 ]. We use the fact that E[ Σ2 ] E[ Σ2 F ]:

E Σ2 2 F = 1 (du D)2 X

(i,j) [du] [dv] E Hv ij1((Xv ij E[Xv ij])>T) E Hv ij1((Xv ij E[Xv ij])>T) 2

1 (du D)2 X

(i,j) [du] [dv] E (Hv ij)21((Xv ij E[Xv ij])>T)

1 (du D)2 X

(i,j) [du] [dv] πv ij E (Xv ij E[Xv ij])21((Xv ij E[Xv ij])>T)

1 (du D)2 X

(i,j) [du] [dv] πv ij q

E (Xv ij E[Xv ij])4 q

P ((Xv ij E[Xv ij]) > T) .

Collective Matrix Completion

By Lemma 28, we have that Xv ij E[Xv ij] is an (Uγ, K)-sub-exponential random variable for every v [V ] and (i, j) [du] [dv]. It yields, using (2) in Theorem 26, that

E (Xv ij E[Xv ij])p cpp Xv ij p ψ1, for every p 1,

and by (1) in Theorem 26

P |Xv ij E[Xv ij]| > T exp 1 T cse Xv ij ψ1

where c and cse are absolute constants. Consequently,

E Σ2 2 F c (du D)2 X

(i,j) [du] [dv] πv ij q

exp 1 T cse Xv ij ψ1

c (du D)2 X

(i,j) [du] [dv] (Uγ K)2πv ij

exp T cse K

We choose T = T := 4cse(Uγ K) log(du D). It yields,

E Σ2 2 F c (du D)2 1 (du D)2 X

(i,j) [du] [dv] (Uγ K)2πv ij

(du D)2 1 (du D)2 X

j [dv] πv ij

(du D)2 1 (du D)2 (du D)µ

(du D)2du D.

Using the fact that x 7 x is concave, we obtain

E[ Σ2 ] E[ Σ2 F ] q

c(Uγ K)2µ (du D)2du D c(Uγ K) µ

du D du D . (20)

Let us now control the variance of Σ2 . We have immediately, using (20),

Var[ Σ2 ] E[ Σ2 2] E Σ2 2 F c(Uγ K)2µ

(du D)2du D.

By Chebyshev s inequality and using (20), we have, with probability at least 1 4/(du+D),

Σ2 c(Uγ K) µ

du D du D + c(Uγ K) µ

du D c(Uγ K) µ

du D . (21)

Finally, combining (19) and (21), we obtain, with probability at least 1 4/(du + D),

LY(M) 3Uγ 2µ + 8(Uγ K)cse q

2c1/2 log(du + D) log(du D) + c(Uγ K) µ

Alaya and Klopp

(Uγ K) µ + (log(du D))3/2

where c is an absolute constant. This ﬁnishes the proof of Lemma 6.

A.4. Proof of Theorem 11

We start the proof with the following inequality using the fact that c M is the minimizer of the objective function in problem (8)

0 (RY( c M) + Λ c M ) + (RY( M) + Λ M ).

Then, by adding R( c M) R( M) 0, we obtain

R( c M) R( M) RY( c M) RY( M) R( c M) R( M) + Λ M c M .

(13) implies A B PA(A B) and we get

R( c M) R( M) RY( c M) RY( M) R( c M) R( M) + Λ P

RY( c M) RY( M) + R( c M) R( M) (22)

2 rank( M) c M M F .

Let us now deﬁne the threshold ν = 32 1+e

3ρ/ςγ ργ log(du+D) 3pdu D and distinguish the two following cases that allows us to obtain an upper bound for the prediction error:

Case 1: if R( c M) R( M) < ν, then the statement of Theorem 11 is true.

Case 2: it remains to consider the case R( c M) R( M) ν. Lemma 21 implies

32 rank( M) c M M F ,

then c M Q(ν, 32 rank( M)) where

Q(ν, r) = Q C (γ) : Q M r Q M F and R(Q) R( M) ν .

Using Lemma 22, we have

R( c M) R( M) RY( c M) RY( M)

R( c M) R( M) 2 + c rank( M)ρ2ς 1(E[ ΣR ])2

(1/4e) + (1 1/

3ρ/4ςγ . (23)

Collective Matrix Completion

Now, plugging (23) in (22), we get

R( c M) R( M) c rank( M)ρ2ς 1(E[ ΣR ])2

(1/4e) + (1 1/

3ρ/4ςγ + 2Λ q

2 rank( M) c M M F ,

where c = 1024. Then using the fact that for any a, b R, and ϵ > 0, we have 2ab a2/(2ϵ) + 2ϵb2, we get for ϵ = pς/4

R( c M) R( M) cdu Dp 1 rank( M)ρ2ς 1(E[ ΣR ])2

(1/4e) + (1 1/

+ Λ2du D(pς/4) 1 rank( M) + pς 2du D c M M 2 F

cdu Dp 1 rank( M)ρ2ς 1(E[ ΣR ])2

(1/4e) + (1 1/

+ Λ2du D(pς/4) 1 rank( M) + ς 2du D c M M 2 Π,F .

Using Assumption 4, we obtain

R( c M) R( M) 2cdu Dp 1 rank( M)ρ2ς 1(E[ ΣR ])2

(1/4e) + (1 1/

3ρ/4ςγ + 8Λ2du D(pς) 1 rank( M)

(pς) 1 rank( M)du D ρ2(E[ ΣR ])2

(1/4e) + (1 1/

3ρ/4ςγ + 8Λ2 .

This ﬁnishes the proof of Theorem 11.

A.5. Proof of Lemma 12

By the nonnegative factor and the sum properties of subdiﬀerential calculus (Boyd and Vandenberghe, 2004), we write

RY( M) = G = 1 du D

(i,j) [du] [dv] Bv ij Gv ij Ev ij : Gv ij ℓv(Y v ij, Mv ij)

Recall that the sudiﬀerential of ℓv(Y v ij, Mv ij) at the point Mv ij is deﬁned as

ℓv(Y v ij, Mv ij) = {Gv ij : ℓv(Y v ij, Qv ij) ℓv(Y v ij, Mv ij) + Gv ij(Qv ij Mv ij)}.

Thanks to Assumption 3, we have, for all Gv ij ℓv(Y v ij, Mv ij)

|Gv ij(Qv ij Mv ij)| |ℓv(Y v ij, Qv ij) ℓv(Y v ij, Mv ij)| ρv|Qv ij Mv ij|,

In particular, with Qv ij = Mv ij for all v [V ] and (i, j) [du] [dv], we get |Gv ij| ρv. Then, any subgradient G of RY has entries bounded by ρ/(du D) (recall ρ = maxv [V ] ρv). By a

Alaya and Klopp

triangular inequality and the convexity of , we have

G G E[G] + E[G]

G E[G] + E[ G ],

for any subgradient G of RY. On the one hand, we use the fact that E[ G ] E[ G F ] q

E[ G 2 F ]. Using (1), we have

E[ G 2 F ] 1 (du D)2 X

(i,j) [du] [dv] ρ2 v E[Bv ij]

(i,j) [du] [dv] πv ij

ρ2µ (du D)2 .

Now we apply Proposition 17 to G E[G]. Taking into account (1), we upper bound the constants κ1, κ2 and κ as follows:

κ1 = 1 du D max i [du]

j [dv] E[(Bv ij Gv ij E[Bv ij Gv ij])2]

2ρ du D max i [du]

j [dv] πv ij

κ2 = 1 du D max v [V ] max j [dv]

i [du] E[(Bv ij Gv ij E[Bv ij Gv ij])2]

2ρ du D max v [V ] max j [dv]

i [du] πv ij

and κ = 1 du D maxv [V ] max(i,j) [du] [dv] |Bv ij Gv ij E[Bv ij Gv ij]| 2ρ du D. Now, choose ϵ = 1/2

in Proposition 17, then we obtain, with probability at least 1 4(du D)e x2,

G E[G] 6ρ 2µ + 2ρ c1/2x

du D . (24)

Setting x = p

2 log(du + D) in (24), we get with probability at least 1 4/(du + D),

2)ρ µ + 2ρ c1/2 p

2 log(du + D) du D , (25)

for any subgradient G of RY( M).

Collective Matrix Completion

Appendix B. Technical Lemmas

In this section, we provide several technical lemmas, which are used for proving our main results.

B.1. Useful lemmas for the proof of Theorem 3

Lemma 18 Let A, B C (γ). Assume that λ 2 LY(B) , and LY(A) + λ A LY(B) + λ B . Then,

(i) P B (A B) 3 PB(A B) ,

(ii) A B 4 p

2 rank(B) A B F .

Proof We have LY(B) LY(A) λ( A B ). (13) implies

LY(B) LY(A) λ P B (A B) PB(A B) .

Moreover, by convexity of LY( ) and the duality between and we obtain

LY(B) LY(A) LY(B), B A LY(B) B A λ

P B (A B) PB(A B) + 1

Using the triangle inequality, we get

P B (A B) 3 PB(A B) ,

which proves (i). To prove (ii), note that PB(A) p

2 rank(B) A F , and (i) imply

2 rank(B) A B) F .

Lemma 19 Let β = 946γ2 log(du+D)

pdu D . Then, for all W C (β, r),

2 Y(W, M) (du D) 1 W M 2 Π,F ] (du D) 1 W M 2 Π,F 2 +1392rγ2(E[ ΣR ])2+5567γ2

with probability at least 1 4/(du + D).

Proof We use a standard peeling argument. For any α > 1 and 0 < η < 1/2α, we deﬁne

κ = 1 1/(2α) η

128γ2r(E[ ΣR ])2 + 512γ2

Alaya and Klopp

and we consider the event

W = W C (β, r) : 2 Y(W, M) (du D) 1 W M 2 Π,F > (du D) 1 W M 2 Π,F 2 +κ .

For s N , set

Rs = n W C (β, r) : αs 1β (du D) 1 W M 2 Π,F αsβ o .

If the event W holds for some matrix W C (β, r), then W belongs to some Rs and 2 Y(W, M) (du D) 1 W M 2 Π,F (du D) 1 W M 2 Π,F 2 + κ

For θ β consider the following set of matrices

C (β, r, θ) = n W C (β, r) : (du D) 1 W M 2 Π,F θ o ,

and the following event

Ws = W C (β, r, θ) : 2 Y(W, M) (du D) 1 W M 2 Π,F 1

2ααsβ + κ .

Note that W Ws implies that W C (β, r, αsβ). Then, we get W s Ws. Thus, it is enough to estimate the probability of the simpler event Ws and then apply a the union bound. Such an estimation is given by the following lemma:

Lemma 20 Let

Zθ = sup W C (β,r,θ)

2 Y(W, M) (du D) 1 W M 2 Π,F .

Then, we have

2α + κ 4 exp pdu Dη2θ

The proof of Lemma 20 follows along the same lines of Lemma 10 in Klopp (2015). We now apply an union bound argument combined to Lemma 20, we get

P[W ] P[ s=1Ws] 4

s=1 exp pdu Dη2αsβ

s=1 exp pdu Dη2β log α

4 exp pdu Dη2β log α

1 exp pdu Dη2β log α

By choosing α = e, η = 1/4e and β as stated we get the desired result.

Collective Matrix Completion

B.2. Useful lemmas for the proof of Theorem 11

Lemma 21 Suppose Λ 2 sup{ G : G RY( M)}. Then

2 rank( M) c M M F .

Proof For any subgradient G of RY( M), we have RY( c M) RY( M) + G, c M M .

Then, the deﬁnition of the estimator c M, entails RY( M) RY( c M) Λ( c M M ),

hence G, M c M Λ( c M M ). The duality between and yields

Λ( c M M ) G M c M Λ

then c M M 1

2 M c M . Now, (13) implies

M ( M c M) P

M( M c M) + 1

2 M c M 3 P

M( M c M) .

Therefore M c M 4 P

M( M c M) . Since P

M( M c M) q

2 rank( M) M

c M F , we establish the proof of Lemma 21.

Lemma 22 Let

ν = 32 1 + e p

3ρ/ςγ ργ log(du + D) 3pdu D ,

then, with probability at least 1 4/(du+D), the following holds uniformly over Q Q(ν, r) RY(Q) RY( M) R(Q) R( M)

R(Q) R( M) 2 + 16

(1/4e) + (1 1/

3ρ/4ςγ rρ2(pς) 1(E[ ΣR ])2.

Proof The proof is based on the peeling argument. For any δ > 1 and 0 < ϑ < 1/2δ, deﬁne

ζ = 16r(pς) 1ρ2(E[ ΣR ])2

3ρ/4ςγ ϑ + p

3ρ/4ςγϑ , (27)

and we consider the event

A = Q Q(ν, r) : RY(Q) RY( M) R(Q) R( M) > R(Q) R( M) 2 + ζ .

For l N , we deﬁne the sequence of subsets

Jl = n Q Q(ν, r) : δl 1ν R(Q) R( M) δlν o .

Alaya and Klopp

If the event A holds for some matrix Q Q(ν, r), then Q belongs to some Jl and

RY(Q) RY( M) R(Q) R( M) > R(Q) R( M) 2 + ζ

For θ ν, consider the following set of matrices

Q(ν, r, θ) = n Q Q(ν, r) : R(Q) R( M) θ o ,

and the following event

Al = Q Q(ν, r, θ) : RY(Q) RY( M) R(Q) R( M) 1

2δδlν + ζ .

Note that Q Jl implies that Q Q(ν, r, δlν). Then, we get A l Al. Thus, it is enough to estimate the probability of the simpler event Al and then apply a the union bound. Such an estimation is given in Lemma 23, where we derive a concentration inequality for the following supremum of process:

Ξθ = sup Q Q(ν,r,θ)

RY(Q) RY( M) R(Q) R( M)

We now apply an union bound argument combined to Lemma 23, we get

P[A ] P[ l=1Al]

l=1 exp 3du Dϑδlν

l=1 exp 3du Dϑ log(δ)ν

exp 3du Dϑ log(δ)ν

1 exp 3du Dϑ log(δ)ν

where se used the elementary inequality that us = es log(u) s log(u). By choosing δ = e, ϑ = 1/4e and ν as stated we get the desired result.

Lemma 23 One has

P h Ξθ 1 + δ r 3ρ

2δ + ζ i exp 3du Dϑθ

Proof The proof of this lemma is based on Bousquet s concentration theorem:

Collective Matrix Completion

Theorem 24 (Bousquet, 2002) (see also Corollary 16.1 in van de Geer (2016)) Let F be a class of real-valued functions. Let T1, . . . , TN be independent random variables such that E[f(Ti)] = 0 and |f(Ti)| ξ for all i = 1, . . . , N and for all f F. Introduce Z = supf F 1

N PN i=1 f(Ti) E[f(Ti)] . Assume further that

i=1 sup f F E f2(Ti) M2.

Then we have for all t > 0

P Z 2E[Z] + M

We start by bounding the expectation

E[Ξθ] = E h sup Q Q(ν,r,θ)

RY(Q) RY( M) R(Q) R( M) i

= E h sup Q Q(ν,r,θ)

RY(Q) RY( M) E RY(Q) RY( M) i

= E h sup Q Q(ν,r,θ)

(i,j) [du] [dv] Bv ij ℓv(Y v ij, Qv ij) ℓv(Y v ij, Mv ij)

E Bv ij ℓv(Y v ij, Qv ij) ℓv(Y v ij, Mv ij) i

2E h sup Q Q(ν,r,θ)

(i,j) [du] [dv] εv ij Bv ij ℓv(Y v ij, Qv ij) ℓv(Y v ij, Mv ij) i

4ρE h sup Q Q(ν,r,θ)

(i,j) [du] [dv] εv ij Bv ij(Qv ij Mv ij) i

4ρE h sup Q Q(ν,r,θ)

4ρE h ΣR sup Q Q(ν,r,θ) Q M i ,

where the ﬁrst inequality follows from symmetrization of expectations theorem of van der Vaart and Wellener, the second from contraction principle of Ledoux and Talagrand (see Theorems 14.3 and 14.4 in B uhlmann and van de Geer (2011)), and the third from du-

ality between nuclear and operator norms. We have Q Q(ν, r, θ) then Q M r Q M F and using Assumption 4, we have Q M q

r(pς) 1 R(Q) R( M) p

r(pς) 1θ. Then, E[Ξθ] 4 p

r(pς) 1θρE[ ΣR ].

For the upper bound ξ in Theorem 24, we have that

ℓv(Y v ij, Qv ij) ℓv(Y v ij, Mv ij) ρv|Qv ij Mv ij 2ρvγ 2ργ.

Alaya and Klopp

Now we compute M in Theorem 24. Thanks to Assumption 4, we have

(i,j) [du] [dv] E Bv ij ℓv(Y v ij, Qv ij) ℓv(Y v ij, Mv ij) 2

(i,j) [du] [dv] (ρv)2E Bv ij(Qv ij Mv ij)2]

du D Q M 2 Π,F

ς (R(Q) R( M))

Then, Bousquet s theorem implies that for all t > 0,

P h Ξθ 2E[Ξθ] +

2ρ2θt ςdu D + 8ργt

Taking t = 3du Dϑθ

8ργ , we obtain

P h Ξθ 8γ p

r(pς) 1θρE[ ΣR ] + r 3ρ

4ςγ ϑ + ϑ θ i exp 3du Dϑθ

Using the fact that for any a, b R, and ϵ > 0, 2ab a2/ϵ + ϵb2, we get (for ϵ = 1/2δ + p

3ρ/4ςγ ϑ + p

3ρϑ/4ςγ ), we get

r(pς) 1θρE[ ΣR ] + s

ςγ + ϑ θ 16r(pς) 1ρ2(E[ ΣR ])2

3ρ 4ςγ ϑ + 1

16r(pς) 1ρ2(E[ ΣR ])2

3ρ 4ςγ ϑ + 1 + δ r 3ρ

Using (28), we get P h Ξθ 1 + δ q

3ρςγ θ 2δ + ζ i exp 3du Dϑθ

8ργ . This ﬁnishes the proof of Lemma 23.

Appendix C. Sub-exponential random variables

The material here is taken from R.Vershynin (2010).

Deﬁnition 25 A random variable X is sub-exponential with parameters (ω, b) if for all t such that |t| 1/b,

E exp t(X E[X]) exp t2ω2

Collective Matrix Completion

When b = 0, we interpret 1/0 as being the same as , it follows immediately from this deﬁnition that any sub-Gaussian random variable is also sub-exponential. There are also a variety of other conditions equivalent to sub-exponentiality, which we relate by deﬁning the sub-exponential norm of random variable. In particular, we deﬁne the sub-exponential norm (sometimes known as the ψ1-Orlicz in the literature) as

X ψ1 := sup q 1

1 q (E[|Xq|])1/q.

Then we have the following lemma which provides several equivalent characterizations of sub-exponential random variables.

Theorem 26 (Equivalence of sub-exponential properties (R.Vershynin, 2010)) Let X be a random variable and ω > 0 be a constant. Then, the following properties are all equivalent with suitable numerical constants Ki > 0, i = 1, . . . , 4, that are diﬀerent from each other by at most an absolute constant c, meaning that if one statement (i) holds with parameter Ki, then the statement (j) holds with parameter Kj c Ki.

(1) sub-exponential tails: P[|X| > t] exp 1 t ωK1 , for all t 0.

(2) sub-exponential moments: (E[|Xq|])1/q K2ωq, for all q 1.

(3) existence of moment generating function (Mgf): E exp X

Note that in each of the statements of Theorem 26, we may replace ω by X ψ1 and, up to absolute constant factors, X ψ1 is the smallest possible number in these inequalities.

Lemma 27 (Mgf of sub-exponential random variables (R.Vershynin, 2010)) Let X be a centered subexponential random variable. Then, for t such that |t| c/ X ψ1, one has

E[exp(t X)] exp(Ct2 X 2 ψ1)

where C, c > 0 are absolute constants.

Lemma 28 For all v [V ] and (i, j) [du] [dv], the random variable Xv i,j is a subexponential with parameters (Uγ, K), where K is deﬁned in Assumption 2. Moreover, we have that Xv i,j ψ1 = c(Uγ K) for some absolute constant c.

Proof Let t such that |t| 1/K, then

E[exp t(Xv ij E[Xv ij]) ]

= e t(Gv) (Mv ij) Z

R hv(x) exp (t + Mv ij)x Gv(Mv ij) dx

= e Gv(t+Mv ij) Gv(Mv ij) t(Gv) (Mv ij) Z

R hv(x) exp (t + Mv ij)x Gv(t + Mv ij) dx

= e Gv(t+Mv ij) Gv(Mv ij) t(Gv) (Mv ij),

where we used in the last inequality the fact that that R

R hv(x) exp (t + Mv ij)x Gv(t + Mv ij) dx = R

R fhv,Gv(Xv i,j|t + Mv ij)dx = 1. Therefore, an ordinary Taylor series expansion

Alaya and Klopp

of Gv implies that there exists tγ,K [ γ 1

K ] such that Gv(t + Mv ij) Gv(Mv ij) t(Gv) (Mv ij) = (t2/2)(Gv) (t2 γ,K). By Assumption 2, we obtain

E[exp t(Xv ij E[Xv ij]) ] exp t2U2 γ 2

Using Lemma 27, we get Xv i,j ψ1 = c(Uγ K) for some absolute constant c. This proves Lemma 28.

D. Agarwal, L. Zhang, and R. Mazumder. Modeling itemitem similarities for personalized recommendations on yahoo! front page. Ann. Appl. Stat., 5(3):1839 1875, 2011.

P. Alquier, V. Cottet, and G. Lecu e. Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. ar Xiv:1702.01402, 2017.

M.G. Armentano, D. Godoy, and A. A. Amandi. Followee recommendation based on text analysis of micro-blogging activity. Information Systems, 38(8):1116 1127, 2013. ISSN 0306-4379.

A. S. Bandeira and R. van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. Ann. Probab., 44(4):2479 2506, 2016.

A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with bregman divergences. J. Mach. Learn. Res., 6:1705 1749, 2005.

P. L. Bartlett, M. I. Jordan, and J. D. Mcauliﬀe. Large margin classiﬁers: Convex loss, low noise, and convergence rates. In S. Thrun, L. K. Saul, and B. Sch olkopf, editors, Advances in Neural Information Processing Systems 16, pages 1173 1180. MIT Press, 2004.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183 202, 2009.

J. Bobadilla, F. Ortega, A. Hernando, and A. Guti errez. Recommender systems survey. Know.-Based Syst., 46:109 132, 2013.

G. Bouchard, D. Yin, and S. Guo. Convex collective matrix factorization. In AISTATS, volume 31 of JMLR Workshop and Conference Proceedings, pages 144 152. JMLR.org, 2013.

O. Bousquet. A bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique, 334(6):495 500, 2002.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

Collective Matrix Completion

L.M. Bregman. The relaxation method of ﬁnding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200 217, 1967.

P. B uhlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer Berlin Heidelberg, 2011.

J.-F. Cai., E. J. Cands, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956 1982, 2010.

T. Cai and W. X. Zhou. A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res., 14(1):3619 3647, 2013.

T. Tony Cai and Wen-Xin Zhou. Matrix completion via max-norm constrained optimization. Electron. J. Statist., 10(1):1493 1525, 2016.

E. J. Cand es and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

E. J. Candes and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053 2080, 2010.

I. Cantador, I. Fern andez-Tob ıas, Shlomo Berkovsky, and P. Cremonesi. Cross-Domain Recommender Systems, pages 919 959. Springer US, Boston, MA, 2015.

Y. Censor and S.A. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, USA, 1997.

K.-Y. Chiang, C.-J. Hsieh, and I. S. Dhillon. Matrix completion with noisy side information. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS 15, pages 3447 3455, Cambridge, MA, USA, 2015. MIT Press.

K.-Y. Chiang, I. S. Dhillon, and C.-J. Hsieh. Using side information to reliably learn lowrank matrices from missing and corrupted observations. Journal of Machine Learning Research, 19(76):1 35, 2018.

M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189, 2014.

P. Drineas, A. Javed, M. Magdon-Ismail, G. Pandurangant, R. Virrankoski, and A. Savvides. Distance matrix reconstruction from incomplete distance information for sensor network localization. In 2006 3rd Annual IEEE Communications Society on Sensor and Ad Hoc Communications and Networks, volume 2, pages 536 544, 2006.

A. Elsener and S. van de Geer. Robust low-rank matrix estimation. Ann. Statist., 46(6B): 3481 3509, 12 2018.

M. Fazel. Matrix Rank Minimization with Applications. Ph D thesis, Stanford University, 2002.

Alaya and Klopp

M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the 2001 American Control Conference. (Cat. No.01CH37148), volume 6, pages 4734 4739 vol.6, 2001.

W. Fithian and R. Mazumder. Flexible low-rank statistical modeling with missing data and side information. Statist. Sci., 33(2):238 260, 2018.

D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative ﬁltering to weave an information tapestry. Commun. ACM, 35(12):61 70, 1992.

S. Gunasekar, P. Ravikumar, and J. Ghosh. Exponential family matrix completion under structural constraints. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML 14, pages II 1917 II 1925. JMLR.org, 2014.

S. Gunasekar, M. Yamada, D. Yin, and Y. Chang. Consistent collective matrix completion under joint low rank structure. In AISTATS, 2015.

S. Gunasekar, J. C. Ho, J. Ghosh, S. Kreml, A. N. Kho, J. C Denny, B. A. Malin, and J. Sun. Phenotyping using structured collective matrix factorization of multi source ehr data. preprint ar Xiv:1609.04466, 2016.

N. Halko, P. Martinsson, and J. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2): 217 288, 2011.

J. Hannon, M. Bennett, and B. Smyth. Recommending twitter users to follow using content and collaborative ﬁltering approaches. In Proceedings of the Fourth ACM Conference on Recommender Systems, Rec Sys 10, pages 199 206, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-906-0.

S. Horii, T. Matsushima, and S. Hirasawa. A note on the correlated multiple matrix completion based on the convex optimization method. In 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 1618 1623, 2014.

Y. Hu, D. Zhang, J. Ye, X. Li, and X. He. Fast and accurate matrix completion via truncated nuclear norm regularization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2117 2130, 2013.

P. Jain and I. S. Dhillon. Provable inductive matrix completion. Co RR, abs/1306.0626, 2013.

M. Jamali and M. Ester. A matrix factorization technique with trust propagation for recommendation in social networks. In Proceedings of the Fourth ACM Conference on Recommender Systems, Rec Sys 10, pages 135 142, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-906-0.

S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 09, pages 457 464, New York, NY, USA, 2009a. ACM.

Collective Matrix Completion

S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 09, pages 457 464, New York, NY, USA, 2009b. ACM.

O. Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1):282 303, 2014.

O. Klopp. Matrix completion by singular value thresholding: Sharp bounds. Electron. J. Statist., 9(2):2348 2369, 2015.

O. Klopp, J. Lafond, E. Moulines, and J. Salmon. Adaptive multinomial matrix completion. Electron. J. Statist., 9(2):2950 2975, 2015.

Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30 37, 2009.

J. Lafond. Low rank matrix completion with exponential family noise. In Peter Grnwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1224 1243, Paris, France, 2015. PMLR.

X. N. Lam, T. Vu, T. D. Le, and A. D. Duong. Addressing cold-start problem in recommendation systems. In Proceedings of the 2Nd International Conference on Ubiquitous Information Management and Communication, ICUIMC 08, pages 208 211, New York, NY, USA, 2008. ACM.

R. M. Larsen. Lanczos bidiagonalization with partial reorthogonalization, 1998.

E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer-Verlag, New York, NY, USA, second edition, 1998.

Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identiﬁcation. SIAM J. Matrix Anal. Appl., 31(3):1235 1256, 2009.

Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identiﬁcation. SIAM Journal on Matrix Analysis and Applications, 31(3):1235 1256, 2010.

H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regularization. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM 11, pages 287 296, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0493-1.

R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res., 11:2287 2322, 2010.

S. Mendelson. Obtaining fast error rates in nonconvex situations. Journal of Complexity, 24(3):380 397, 2008.

N. Natarajan and I. S. Dhillon. Inductive matrix completion for predicting genedisease associations. Bioinformatics, 30(12):i60 i68, 06 2014. ISSN 1367-4803.

Alaya and Klopp

N. Natarajan, D. Shin, and I. S. Dhillon. Which app will you use next?: Collaborative ﬁltering with interactional context. In Proceedings of the 7th ACM Conference on Recommender Systems, Rec Sys 13, pages 201 208, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2409-0.

S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist., 39(2):1069 1097, 2011.

Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125 161, 2013.

S. Oh, A. Montanari, and A. Karbasi. Sensor network localization from local connectivity: Performance analysis for the mds-map algorithm. In 2010 IEEE Information Theory Workshop on Information Theory (ITW 2010, Cairo), pages 1 5, 2010.

J. Pag es. Multiple Factor Analysis by Example Using R. Chapman & Hall/CRC The R Series. Taylor & Francis, 2014. ISBN 9781482205473.

N. Parikh and S. Boyd. Proximal algorithms. Found. Trends Optim., 1(3):127 239, 2014.

B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev., 52(3):471 501, 2010.

J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22Nd International Conference on Machine Learning, ICML 05, pages 713 719, New York, NY, USA, 2005. ACM.

R.Vershynin. Introduction to the non-asymptotic analysis of random matrices. Co RR, abs/1011.3027, 2010.

S. Shin, D.and Cetintas and I. S. Lee, K.-C.and Dhillon. Tumblr blog recommendation with boosted inductive matrix completion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM 15, pages 203 212, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3794-6.

A. P. Singh and G. J. Gordon. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 08, pages 650 658, New York, NY, USA, 2008. ACM.

A. P. Singh and G. J. Gordon. A bayesian matrix factorization model for relational data. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artiﬁcial Intelligence, UAI 10, pages 556 563, Arlington, Virginia, United States, 2010. AUAI Press.

A. M.-C. So and Y. Ye. Theory of semideﬁnite programming for sensor network localization. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 05, pages 405 414, Philadelphia, PA, USA, 2005. Society for Industrial and Applied Mathematics.

N. Srebro, J. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization, 2005.

Collective Matrix Completion

M. Udell, C. Horn, R. Zadeh, and S. Boyd. Generalized low rank models. Found. Trends Mach. Learn., 9(1):1 118, June 2016. ISSN 1935-8237.

S. van de Geer. Estimation and Testing Under Sparsity: Ecole d Et e de Probabilit es de Saint Flour XLV 2015. Lecture Notes in Mathematics. Springer International Publishing, 2016.

L. Xu, Z. Chen, Q. Zhou, E. Chen, N. J. Yuan, and X. Xie. Aligned matrix completion: Integrating consistency and independency in multiple domains. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 529 538, 2016.

M. Xu, R. Jin, and Z.-H. Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS 13, pages 2301 2309, USA, 2013. Curran Associates Inc.

Q. Yao and J. T. Kwok. Accelerated inexact soft-impute for fast large-scale matrix completion. In Proceedings of the 24th International Conference on Artiﬁcial Intelligence, IJCAI 15, pages 4002 4008. AAAI Press, 2015.

T. Zhang. Statistical behavior and consistency of classiﬁcation methods based on convex risk minimization. Ann. Statist., 32(1):56 85, 2004.

M. Zitnik and B. Zupan. Matrix factorization-based data fusion for gene function prediction in baker s yeast and slime mold. Paciﬁc Symposium on Biocomputing. Paciﬁc Symposium on Biocomputing, pages 400 11, 2014.

M. Zitnik and B. Zupan. Data fusion by matrix factorization. IEEE Transactions on Pattern Analysis & Machine Intelligence, 37(1):41 53, 2015. ISSN 0162-8828.