# Fine-grained Generalisation Analysis of Inductive Matrix Completion

Antoine Ledent (TU Kaiserslautern, ledent@cs.uni-kl.de), Rodrigo Alves (TU Kaiserslautern, alves@cs.uni-kl.de), Yunwen Lei (University of Birmingham, y.lei@bham.ac.uk), Marius Kloft (TU Kaiserslautern, kloft@cs.uni-kl.de)

In this paper, we bridge the gap between the state-of-the-art theoretical results for matrix completion with the nuclear norm and their equivalent in inductive matrix completion: (1) In the distribution-free setting, we prove sample complexity bounds improving the previously best rate of $rd^2$ to $d^{3/2}\sqrt{r}\log(d)$, where $d$ is the dimension of the side information and $r$ is the rank. (2) We introduce the (smoothed) adjusted trace-norm minimization strategy, an inductive analogue of the weighted trace norm, for which we show guarantees of the order $O(dr\log(d))$ under arbitrary sampling. In the inductive case, a similar rate was previously achieved only under uniform sampling and for exact recovery. Both our results align with the state of the art in the particular case of standard (non-inductive) matrix completion, where they are known to be tight up to log terms. Experiments further confirm that our strategy outperforms standard inductive matrix completion on various synthetic datasets and real problems, justifying its place as an important tool in the arsenal of methods for matrix completion using side information.

1 Introduction

Matrix completion (MC) is the machine learning problem of recovering the missing entries of a partially observed matrix. It is the go-to approach in various application domains such as recommender systems [1, 2] and social network analysis [3, 4, 5]. The SoftImpute algorithm [6, 7] is among the most popular MC methods. It solves the following convex problem encouraging low-rank solutions:

$$ \min_{Z \in \mathbb{R}^{m \times n}} \; \frac{1}{2}\,\|P_\Omega(Z - G)\|_{\mathrm{Fr}}^2 \;+\; \lambda\,\|Z\|_*, \qquad (1) $$

where $P_\Omega$ denotes the projection on the set $\Omega$ of observed entries, $G$ is the ground truth matrix, and $\|\cdot\|_*$ denotes the nuclear norm (i.e., the sum of the matrix's singular values).

Besides the incomplete matrix, additional information may be available in applications such as movie recommendation or drug interaction prediction [8, 9, 10, 11]. For instance, in movie recommendation one may have access to the movies' genres, their synopses, the gender and occupation of the users, or a friendship network between the users. Inductive matrix completion (IMC) [11, 12, 13, 14] exploits such side information. It assumes that the side information is summarized in matrices $X \in \mathbb{R}^{m \times d_1}$ and $Y \in \mathbb{R}^{n \times d_2}$, with the row vectors representing the users and items, respectively. IMC then optimizes the following objective function:

$$ \min_{M \in \mathbb{R}^{d_1 \times d_2}} \; \frac{1}{2}\,\|P_\Omega(XMY^\top - G)\|_{\mathrm{Fr}}^2 \;+\; \lambda\,\|M\|_*. \qquad (2) $$

This model has been used in many domains also beyond movie recommendation [8, 10, 15].

In this paper, we contribute to a better theoretical understanding of IMC and related methods in the approximate recovery case. In this setting we obtain guarantees in terms of a bound on the expected loss which decreases with the number of samples. Our best results concern the distribution-free case, meaning that our bounds are valid for any sampling distribution. This is in sharp contrast to the vast areas of literature where one assumes the distribution is uniform [16, 17, 18].
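To make objectives (1) and (2) concrete, here is a minimal numpy sketch of proximal-gradient (soft-thresholded SVD) iterations for both problems. This is an illustration only: the step sizes, regularisation strengths and iteration counts are arbitrary choices of ours, and the function names (`svt`, `soft_impute`, `imc`) are hypothetical, not taken from the paper or from any released code.

```python
import numpy as np

def svt(A, tau):
    """Singular-value soft-thresholding: the proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft_impute(G, mask, lam, n_iters=200):
    """Proximal-gradient iterations for (1): 0.5*||P_Omega(Z - G)||_Fr^2 + lam*||Z||_*."""
    Z = np.zeros_like(G)
    for _ in range(n_iters):
        grad = mask * (Z - G)        # gradient of the smooth part (Lipschitz constant 1)
        Z = svt(Z - grad, lam)       # unit gradient step, then nuclear-norm prox
    return Z

def imc(G, mask, X, Y, lam, lr=1e-3, n_iters=500):
    """Proximal-gradient iterations for (2): 0.5*||P_Omega(X M Y^T - G)||_Fr^2 + lam*||M||_*."""
    M = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iters):
        R = mask * (X @ M @ Y.T - G)             # residual restricted to the observed entries
        M = svt(M - lr * (X.T @ R @ Y), lr * lam)  # gradient w.r.t. M, then prox
    return M
```

The two iterations differ only in which variable carries the nuclear-norm penalty ($Z$ itself versus the core matrix $M$), which is precisely the distinction the analysis in this paper revolves around.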
Our analysis leads to substantial gains compared to the state-of-the-art results [19, 20, 21], as we obtain near-optimal bounds in situations where the state-of-the-art bounds are vacuous, as explained below. Although for uniform sampling, near-tight exact recovery bounds of $O(rd\log(d)\log(n))$ exist¹ for IMC [16, 17], the approximate recovery case (especially in a distribution-free setting) is far less understood. The state-of-the-art distribution-free results for IMC were proved in [19, 20] (and in [21] for a kernel formulation of IMC) and, expressed in terms of generalisation error bounds, scale as

$$ O\!\left(\frac{\bar{x}\,\bar{y}\,\mathcal{M}}{\sqrt{N}}\right), \qquad (3) $$

where $\bar{x} := \|X^\top\|_{2,\infty} = \max_i \|X_{i,\cdot}\|_2$ is the maximum norm of a left side information vector (row of $X$), $N$ is the number of available samples, and $\bar{y} := \|Y^\top\|_{2,\infty} = \max_j \|Y_{j,\cdot}\|_2$ is the maximum norm of a right side information vector (row of $Y$). This implies that reaching a given loss threshold $\epsilon$ requires $O(\bar{x}^2\bar{y}^2\mathcal{M}^2/\epsilon^2)$ entries, where $\mathcal{M}$ is a bound on the nuclear norm of $M$. In this case, we say that the sample complexity is $O(\bar{x}^2\bar{y}^2\mathcal{M}^2)$.

To understand how those bounds scale with the matrix dimensions, consider the simple case where $X$ and $Y$ are made up of blocks of identity matrices. In that case, we have $\bar{x} = \bar{y} = 1$, yielding a sample complexity of $O(\mathcal{M}^2)$. Since $\|M\|_*^2 \lesssim d^2 r$, this yields a bound of order $rd^2$. Such bounds have a remarkable property: they do not depend on the size $n$ of the matrix and instead depend only on the size $d$ of the side information. This means that they capture the fact that valuable information can be extracted even for users and items for which no ratings are observed. On the other hand, these bounds have a strong dependence on the size $d$ of the side information. As an illustration, consider that they are vacuous when $X = I$ and $Y = I$, since the required number of entries $O(rd^2) = O(rn^2)$ then grows faster than the total number of entries $n^2$. This is despite the fact that in that situation, distribution-free bounds for standard matrix completion yield a sample complexity of $O(n^{3/2}\sqrt{r})$ for the standard regulariser [22] and $O(nr\log(n))$ for a modified regulariser (the smoothed weighted trace norm from [23]). Thus, these existing distribution-free IMC bounds are very far from tight. In fact, they are only meaningful when the size of the side information is negligible compared to the general scale of the problem, which is a significant limitation in terms of the elegance of the theory (mismatch with MC bounds, separate proof techniques for separate regimes) and in practice (real-life side information could be very high-dimensional, especially if it is extracted from a neural network [24] or from a wide variety of different sources). To reinforce that point, note that any side information with a strong cluster structure² would exhibit similar failings to the above-mentioned identity side information case.

In this work, we bridge the gap between the state of the art in matrix completion and inductive matrix completion with the trace norm by providing distribution-free bounds for IMC which combine both of the following advantages: (1) no dependence on the size of the original matrices, and (2) a more refined dependence on the size of the side information: the dependence on $d$ in our bounds is almost the same as the dependence on $n$ (the size of the matrix) in the state-of-the-art MC results. More precisely, our first contribution is to provide a bound of order $O(d^{3/2}\sqrt{r}\log(d))$ for the standard regulariser (2).
The proof builds on techniques from [22, 25], but is substantially more involved due to the complicated dependence structure generated by the side information. As our second contribution, we construct analogues of the ideas of [23, 26] for the IMC setting: we begin by showing a bound of order $O(rd\log(d))$ for a class of distributions with certain uniformity assumptions (our "uniform inductive marginals"), and then design a new "adjusted trace norm regulariser" for the problem (2) with similar properties to the weighted trace norm [26, 23] in MC. Instead of simply renormalising rows and columns of $M$ as in previous work, our method requires rescaling the core matrix $M$ along data-dependent orientations that capture the interplay between the sampling distribution and the side information matrices $X, Y$.

Our contributions are summarised as follows.

1. We provide distribution-free generalisation bounds for the inductive matrix completion model (2) (assuming a fixed upper bound on the nuclear norm) which scale like $O(d^{3/2}\sqrt{r}\log(d))$, where $r$ is a soft relaxation of the rank.
2. In the case of uniform or approximately uniform sampling, we provide a bound of order $O(rd\log(d))$ for approximate recovery.
3. We introduce a modified version of the IMC objective (2), which we refer to as adjusted trace norm regularisation (ATR). An empirical version, E-ATR, is also introduced, and both achieve bounds of order $O(rd\log(d))$ in the distribution-free setting.
4. We experimentally demonstrate on synthetic data that our adjusted regulariser outperforms the standard IMC objective (2) in many situations.
5. We incorporate our method into a model involving a non-inductive term and evaluate it on real-life datasets, demonstrating substantially improved performance.

¹ With some orthogonality assumptions on the side information.
² Where the users and items are approximately split into communities; see also Appendix A.

This paper is organized as follows. In Section 2 we review some related work. In Section 3 we introduce our main results. Finally, in Section 4 we present our experimental results.

2 Related work

In both MC and IMC, the existing literature consists of several main branches differing in their main assumptions: exact recovery versus approximate recovery, and uniform sampling versus distribution-free bounds. In exact recovery, the matrix is assumed deterministic, and we want to recover its missing entries exactly [17, 16, 27, 28]. In approximate recovery, the matrix is assumed noisy, and we want to recover its missing entries only approximately, within some interval around their expectation [19, 20, 21, 18, 29]. Approximate recovery theory is typically expressed in terms of uniform generalisation bounds over a function class defined by a matrix-norm constraint. Assuming that the entries are sampled from a specific distribution (e.g., uniform), one can typically achieve much faster rates than with distribution-free theory, which must hold regardless of the distribution. The typical sample complexity of standard MC under uniform sampling is $O(nr\log^2(n))$ for exact recovery (proved in the series of breakthrough papers [27, 28, 30]) and $O(nr\log(n))$ for approximate recovery [23]. In [31, 32], an improved rate of $nr\log(n)\log(r)$ (for exact recovery) was shown. The most closely related papers to ours are [22] and [23], which both work only on standard matrix completion without side information.
In [22], a bound of order $O(n^{3/2}\sqrt{r})$ was obtained in the distribution-free setting for matrix completion with the trace norm, whilst in [23], rates of $O(rn\log(n))$ are shown for sampling with uniform marginals and for a smoothed version of the weighted trace norm regulariser in the distribution-free case. We almost perfectly extend most of the results from both papers to the inductive case, which requires many technical modifications.

Within the IMC framework, the closest works are those which also deal with approximate recovery in the non-uniform sampling case: [21, 33, 19, 20]. Their bounds, presented in many different contexts, translate to sample complexities of type $O(rd^2)$. Other celebrated works in the theoretical study of IMC include [16] and [17], which showed rates of order $d^2r^3\log(d)$ and $rd\log(d)\log(n)$ respectively for exact recovery with uniform sampling, together with other important contributions (see appendix). In the case of exact recovery, the rate of $rd\log(d)\log(n)$ was obtained only under the assumption that the side information matrices have orthonormal columns. Some bounds use a completely different regulariser (such as the max norm) to achieve better rates [34, 35]; these works also do not involve side information.

In Tables 1 and 2, we summarize state-of-the-art (s.o.t.a.) results in both MC and IMC. Note the problem of exact recovery in the distribution-free case is ill-defined (hence the N/As in our tables). In approximate recovery bounds, we omit a factor of $1/\epsilon^2$ (where $\epsilon$ is the tolerance threshold in terms of expected loss), as this factor is present in all approximate recovery bounds³. In exact recovery bounds, the rate is the order of magnitude of the threshold past which exact recovery occurs with high probability.

³ To our best knowledge, all results show a decline in population expected loss of the order of $\sqrt{1/N}$, where $N$ is the sample size.

Table 1: Matrix completion results (trace norm-based only)

| MC      | Unif. Sampling                                              | Distr.-free            | Weighted version  |
| Exact   | $nr\log^2(n)$ [27, 28, 30]; $nr\log(n)\log(r)$ [31, 32]     | N/A                    | N/A               |
| Approx. | $nr\log(n)$ [23, 22]                                        | $n^{3/2}\sqrt{r}$ [22] | $rn\log(n)$ [23]  |

Table 2: Inductive matrix completion results (trace norm-based only)

| IMC                | Unif. Sampling                                          | Distr.-free                    | Weighted version  |
| Exact              | $rd\log(d)\log(n)$ [17, 18]; $d^2r^3\log(d)$ [16]       | N/A                            | N/A               |
| Approx. (s.o.t.a.) | $rd^2$ [21, 33, 19]                                     | $rd^2$ [21, 33, 19]            | None              |
| Approx. (ours)     | $rd\log(d)$ (Ours)                                      | $d^{3/2}\sqrt{r}\log(d)$ (Ours) | $rd\log(d)$ (Ours) |

Other related works include IMCNF [19, 20], which proposed the following model:

$$ \min_{M, Z} \; \sum_{(i,j)\in\Omega} \big|G_{i,j} - (XMY^\top + Z)_{i,j}\big|^2 \;+\; \lambda_1\|M\|_* \;+\; \lambda_2\|Z\|_*, \qquad (4) $$

where $\lambda_1, \lambda_2$ are regularisation parameters, $G_{i,j}$ denotes the observed entries, and the predictors take the form $XMY^\top + Z$. This model relies on the cross-validated hyperparameters $\lambda_1, \lambda_2$ to balance the importance of the side information. The authors also showed results based on a combination of a bound for the inductive term $XMY^\top$ and a bound for the non-inductive term $Z$. The non-inductive terms in the bounds are similar to [22], whilst the bounds for the inductive term are proved from scratch and have also later appeared in a different form in [21, 33] together with a kernel formulation of IMC. In Subsection 4.2 we combine our framework with this strategy to reach competitive results.

In [36], the authors introduce a model consisting of a sum of mutually orthogonal IMC terms, together with an explicit optimization strategy, in the specific case where the available side information consists of partitions of the users and items into communities.
In [37], the authors further extend the model to learn the community membership functions together with the ground truth matrix, based only on the sampled entries. The case of a single IMC term where the side information is in the form of a community partition is useful to develop intuition into the equivalent roles of $d_1, d_2$ in our bounds versus $m, n$ in MC bounds. Whilst generalization bounds were proved in [36] with a similar scaling as our bound from Theorem 3.1 (and in particular are better than the state-of-the-art IMC bounds if applied to this situation), they only apply to the specific case of community side information. In this work (Theorem 3.1) we achieve the first IMC bounds which cover the whole range of possible side information matrices $X, Y$, whilst providing the correct scaling (up to log terms) in the case of identity or community side information. Community side information has also been studied in other discrete contexts where individual behaviour is assumed to be a noisy realisation of community side information [38, 39].

Another related work is [18], which introduces a joint model that imposes a nuclear norm-based constraint on both $M$ and $XMY^\top$ through a modification of the objective. The authors prove bounds for their method which match the state of the art in IMC [17, 19] and MC [22] when the side information is perfect and useless, respectively. The dependence on the side information is better in our case. Further discussion of that paper is included in the appendix. Of course, there are also many other works which propose modified optimization problems for the recommender systems task through other rank-sparsity-inducing regularisers [35, 34, 40], and even exploit other ground-truth structure besides the low-rank property [41, 42].

3 Main results

Notation: We observe $N$ entries of a ground truth matrix $G \in \mathbb{R}^{m\times n}$, which are sampled i.i.d. (with replacement) through an arbitrary distribution $p$: we draw $(i,j) \in \{1,\ldots,m\}\times\{1,\ldots,n\}$ with probability $p_{i,j}$, where $\sum_{i,j} p_{i,j} = 1$. The sampled entries $\xi_1, \xi_2, \ldots, \xi_N \in \{1,\ldots,m\}\times\{1,\ldots,n\}$ form a multiset $\Omega$. Our setting allows for the observations to be noisy with a different noise distribution for each entry, but purely for notational convenience we often write as if there were no noise when no ambiguity is possible. When written explicitly, the noise is denoted by $\zeta$. For a function $f:\mathbb{R}\to\mathbb{R}$ we will write $\sum_{(i,j)\in\Omega} f(G_{i,j})$ for the sum of the images of the observations, counted as many times as necessary⁴. We assume we are given side information matrices $X\in\mathbb{R}^{m\times d_1}$ and $Y\in\mathbb{R}^{n\times d_2}$. The maximum $L^2$ norm of a row of $X$ (resp. $Y$) is denoted by $\bar{x}$ (resp. $\bar{y}$). The minimums are denoted by $\underline{x}$ and $\underline{y}$, respectively. The row vectors of $X$ (resp. $Y$) are also written $x_i$ for $i\le m$ (resp. $y_j$ for $j\le n$). For any matrices $A, B$, $A \preceq B$ means that $B - A$ is positive semi-definite, $\|A\|$ denotes the spectral norm of $A$ and $\|A\|_*$ denotes the nuclear norm of $A$. We have one fixed loss function $l$ used throughout the paper, which is both Lipschitz with constant $\ell$ and bounded by $b$. For convenience we also frequently write $d$ instead of $\max(d_1, d_2)$. In the appendix, we provide a complete table of notations (Table K.1) that includes all notations introduced throughout the paper.

We now present our results, starting with the distribution-free bound for the standard regulariser, then moving on to the improved bounds under uniform sampling, and finally to our adjusted trace norm regulariser and the theoretical improvements it provides.
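Before doing so, here is a small, self-contained numpy illustration of the sampling model and notation above: community-style (block-indicator) side information as discussed in the introduction, a low-rank core matrix, a non-uniform sampling law, and the quantities $\bar{x}$, $\bar{y}$, $d$. The sizes, the rank and the particular power-type law are arbitrary choices for illustration and are not the construction used in the experiments of Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)

m = n = 200          # matrix size
d1 = d2 = 20         # side-information dimension (here: number of user/item communities)
r = 4                # rank of the core matrix M

# Community (block-indicator) side information: each user/item belongs to one community.
X = np.eye(d1)[rng.integers(0, d1, size=m)]      # m x d1, rows are indicator vectors
Y = np.eye(d2)[rng.integers(0, d2, size=n)]      # n x d2
M_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
G = X @ M_star @ Y.T                             # noiseless ground truth, block-constant

# Maximum row norms x-bar, y-bar (both equal 1 here since rows are indicators), and d.
x_bar = np.linalg.norm(X, axis=1).max()
y_bar = np.linalg.norm(Y, axis=1).max()
d = max(d1, d2)

# An arbitrary non-uniform distribution p over entries (a simple power law for illustration),
# and N i.i.d. draws with replacement forming the multiset Omega.
Lambda = 1.0
p = np.outer(np.arange(1, m + 1.0) ** -Lambda, np.arange(1, n + 1.0) ** -Lambda)
p /= p.sum()
N = 5 * d * r
idx = rng.choice(m * n, size=N, replace=True, p=p.ravel())
Omega = np.stack(np.unravel_index(idx, (m, n)), axis=1)   # N sampled (i, j) index pairs
```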
3.1 Distribution-free guarantees for the standard IMC objective

For a constant $\mathcal{M}\in\mathbb{R}$, we define the function class
$$ \mathcal{F}_{\mathcal{M}} \;=\; \big\{\, XMY^\top \;:\; \|M\|_* \le \mathcal{M} \,\big\}, $$
which contains all predictors $XMY^\top$ where $M$ has its nuclear norm bounded by $\mathcal{M}$. Our first main result is a uniform generalisation bound for the loss minimiser within this function class. Below we use the shorthand $l(A)$ (resp. $\hat{l}_S(A)$, or even $l_S(A)$) for $\mathbb{E}_{(i,j)\sim p}\,l(A_{i,j}, G_{i,j}+\zeta_{i,j})$ (resp. $\sum_{(i,j)\in\Omega} l(A_{i,j}, G_{i,j}+\zeta_{i,j})/N$), the overall expected (resp. empirical) loss associated to a matrix $A\in\mathbb{R}^{m\times n}$. In particular, in the noiseless setting, $\inf_{Z\in\mathcal{F}_{\mathcal{M}}} l(Z) = 0$ as long as $G\in\mathcal{F}_{\mathcal{M}}$.

Theorem 3.1. Fix any target matrix $G$ and distribution $p$. Define $\hat{Z}_S = \arg\min\{\hat{l}_S(Z) : Z\in\mathcal{F}_{\mathcal{M}}\}$. For any $\delta\in(0,1)$, with probability (w.p.) $\ge 1-\delta$ over the draw of the training set $\Omega$ we have
$$ l(\hat{Z}_S) \;\le\; \inf_{Z\in\mathcal{F}_{\mathcal{M}}} l(Z) \;+\; C\,[\cdots], $$
where $C$ is a universal constant, $b$ is a bound on the loss, $\ell$ is the Lipschitz constant of the loss $l$, and $\log\!\big(N\,[\,20\,\mathcal{M}^2\,\ell\,\sqrt{d}\,\bar{x}^2\bar{y}^2\,]/b + 1\big)$ is a logarithmic quantity. Furthermore, in expectation over the training set we have:
$$ \mathbb{E}\,l(\hat{Z}_S) \;\le\; \inf_{Z\in\mathcal{F}_{\mathcal{M}}} l(Z) \;+\; C\,[\cdots]. $$

The proof is provided in Appendix A. Assuming that $\ell, b$ are treated as constants, the above bound on the generalisation gap $l(\hat{Z}_S) - \inf_{Z\in\mathcal{F}_{\mathcal{M}}} l(Z)$ scales roughly like $\sqrt{\tfrac{\bar{x}\,\bar{y}\,\mathcal{M}\,\sqrt{d}\,\log(\bar{x}\bar{y}\mathcal{M})}{N}}$. If we further think of the maximum entry of the core matrix $M$ as bounded by a constant, $\mathcal{M}$ scales like $\sqrt{d_1 d_2 r}$, where $r$ is the rank of $M$. Assuming the rescaling is also set so that $\bar{x}, \bar{y}$ are constants, the above yields a sample complexity of $\tfrac{d^{3/2}\sqrt{r}\,\log(d)}{\epsilon^2}$, where $\epsilon$ is the tolerance threshold. Indeed, the $\sqrt{\log(N)}$ term can be treated via the following simple observation: if $N \ge \Theta\log(\Theta)$ and $\Theta$ is sufficiently large, then
$$ \frac{N}{\log(N)} \;\ge\; \frac{\Theta\log(\Theta)}{\log(\Theta) + \log(\log(\Theta))} \;\ge\; \frac{\Theta\log(\Theta)}{2\log(\Theta)} \;\ge\; \Theta/2. $$

⁴ More rigorously, the observations are i.i.d. of the form $(\xi_o, \bar{\xi}_o)$ with $\xi_o\in\{1,\ldots,m\}\times\{1,\ldots,n\}$ and $\bar{\xi}_o\in\mathbb{R}$, and we write $\sum_{o=1}^N f(\bar{\xi}_o)$ instead of $\sum_{(i,j)\in\Omega} f(G_{i,j})$; it should be assumed that the "ground truth" values $G$ are defined so as to minimize $\mathbb{E}\,l(G_\xi, \bar{\xi})$ for our loss function $l$ over the joint distribution of $(\xi, \bar{\xi})$.

Remark on the proof technique: The proof of the result in [22] relies on a lemma of Latała (Lemma A.1) from [43] for random matrices with i.i.d. entries and an elegant decomposition of the entries into two groups: (1) entries that have been sampled many times, and (2) entries that have not been sampled too often. On group 1, the partial sums of the Rademacher variables concentrate trivially, whilst on group 2, the entries are well spread out and Lemma A.1 limits the spectral norm similarly to the uniform case. The proof is about carefully balancing those two contributions. In our inductive situation, using the same split can only yield bounds of the type (3), which are well known and vacuous when the side information is of comparable size to the matrix. Our key idea to fix this issue is that instead of distinguishing frequently and less frequently sampled entries, we split between high and low energy orientations corresponding to pairs $(X_{\cdot,u}, Y_{\cdot,v})$ of columns of the side information matrices. To achieve this aim, we use the rotational invariance of the trace operator and equivalently express the Rademacher averages in inductive space ($\mathbb{R}^{d_1\times d_2}$). However, the entries of the resulting matrix are certainly not independent, which makes it impossible to apply the concentration results from [43]. Instead, we must rely again on the matrix Bernstein inequality (F.4).
Obtaining a covariance structure that is amenable to application of this result requires performing an iterative procedure involving a series of distribution-dependent rotational transformations of the side information and other estimates at each step.

3.2 Generalisation bounds for the trace norm regulariser under a uniformity assumption

We now move to our second main contribution, which is a broad generalisation of most of the results of [23] to the inductive case. In this direction, we begin with a result for approximate recovery in inductive matrix completion with the standard nuclear norm regulariser. Although this first result (proved in Appendix B) is original to the best of our knowledge, it is not surprising, since a similar result is known in the exact recovery case. However, it is an excellent way to introduce notation which will be necessary in the rest of the paper.

Proposition 3.1. Let us write $\mathcal{F}_{\mathcal{M}}$ for the function class corresponding to matrices of the form $XMY^\top$ with $\|M\|_* \le \mathcal{M}$. Let $M_S = \arg\min_{\|M\|_*\le\mathcal{M}} \sum_{\xi\in\Omega} l\big((XMY^\top)_\xi, G_\xi+\zeta_\xi\big)$ be the trained matrix $M$ and $M^* = \arg\min_{\|M\|_*\le\mathcal{M}} \mathbb{E}\,l\big((XMY^\top)_\xi, G_\xi+\zeta_\xi\big)$ be the optimal $M$ when $M$ is restricted by $\|M\|_*\le\mathcal{M}$. Write also $Z_S = XM_SY^\top$ and $Z^* = XM^*Y^\top$. Write
$$ K := \max\left[\,\sqrt{\frac{m\,\|Y\|_{\mathrm{Fr}}^2}{n}},\;\sqrt{\frac{n\,\|X\|_{\mathrm{Fr}}^2}{m}}\,\right]. $$
Under uniform sampling, w.p. $\ge 1-\delta$:
$$ l(Z_S) - l(Z^*) \;\le\; \frac{8\,\ell\,K}{\sqrt{N}}\,[\cdots] \;+\; [\cdots]\;\mathcal{M}\,\bar{x}\,\bar{y}\,\big(1+\log(2d)\big) \;+\; b\,[\cdots], $$
where $\sqrt{r} = \mathcal{M}/\sqrt{d_1 d_2}$ and $b$ is a bound on the loss. Furthermore, the above result holds under the following more general "uniform inductive marginals" condition (analogous to the "uniform marginals"):
$$ \forall i,\;\sum_{j} p_{i,j}\,\|y_j\|^2 = \frac{\|Y\|_{\mathrm{Fr}}^2}{mn} \qquad\text{and}\qquad \forall j,\;\sum_{i} p_{i,j}\,\|x_i\|^2 = \frac{\|X\|_{\mathrm{Fr}}^2}{mn}. \qquad (9) $$

Remarks: If $\|x_i\|$ and $\|y_j\|$ are constant over $i$ and $j$, then the above conditions (9) reduce to a requirement of uniform marginal probabilities. Note that $\sqrt{r} = \mathcal{M}/\sqrt{d_1 d_2}$ acts as a soft relaxation of the (square root of the) rank of $M$: indeed, if the entries of $M$ are bounded by 1 and $\mathrm{rank}(M)\le r$, then $\|M\|_*\le\mathcal{M}$, i.e., $XMY^\top\in\mathcal{F}_{\mathcal{M}}$. If $X = I$ and $Y = I$, then conditions (9) reduce to the uniform marginals condition from [23]. In particular, we see that in the case of identity side information, we require $O(dr\log(d))$ samples to reach a given accuracy. However, the result above is deeper when the side information is non-trivial. Indeed, the quantity $\max\big(\sqrt{\|X^\top X\|\,\|Y\|_{\mathrm{Fr}}^2},\,\sqrt{\|Y^\top Y\|\,\|X\|_{\mathrm{Fr}}^2}\big)$, which equals $\sqrt{\max(d_1,d_2)}$ in the case of identity (or equal-size community) side information, is sensitive to the relative orientation of the columns of $X$ and $Y$: if the side information $X$ and $Y$ are properly scaled and approximately of rank $\rho$, then this quantity will approach $\rho$. We discuss this in more detail in the appendix.

To prove the above result, we will show a slightly more general result below (Prop 3.2). In order to capture the interaction between the side information and the data distribution, we must define a distribution-dependent inner product $\langle\cdot,\cdot\rangle_l$ (resp. $\langle\cdot,\cdot\rangle_r$) on the column space of $X$ (resp. $Y$): for two vectors $u^1, u^2\in\mathbb{R}^m$ (resp. $v^1, v^2\in\mathbb{R}^n$) we define $\langle u^1, u^2\rangle_l = \sum_{i=1}^m u^1_i u^2_i\, q_i$ (resp. $\langle v^1, v^2\rangle_r = \sum_{j=1}^n v^1_j v^2_j\, \kappa_j$), where the $q_i$s and $\kappa_j$s are defined by
$$ q_i = \sum_{j=1}^n p_{i,j}\,\|y_j\|^2 \quad \forall i\le m, \qquad \kappa_j = \sum_{i=1}^m p_{i,j}\,\|x_i\|^2 \quad \forall j\le n. \qquad (10) $$
We now define the vector $\sigma^1\in\mathbb{R}^{d_1}$ (resp. $\sigma^2\in\mathbb{R}^{d_2}$) as the vector of singular values of the matrix $X$ (resp. $Y$) with respect to (w.r.t.) the inner product $\langle\cdot,\cdot\rangle_l$ (resp. $\langle\cdot,\cdot\rangle_r$). In other words, the entries of $\sigma^1\in\mathbb{R}^{d_1}$ (resp. $\sigma^2\in\mathbb{R}^{d_2}$) are the square roots of the eigenvalues of the symmetric matrix $L := X^\top\mathrm{diag}(q)\,X = \sum_{i,j} p_{i,j}\,x_i x_i^\top\,\|y_j\|^2 \in\mathbb{R}^{d_1\times d_1}$ (resp. $R := Y^\top\mathrm{diag}(\kappa)\,Y = \sum_{i,j} p_{i,j}\,y_j y_j^\top\,\|x_i\|^2 \in\mathbb{R}^{d_2\times d_2}$). We also write $\sigma_1^* = \max(\sigma^1)$ and $\sigma_2^* = \max(\sigma^2)$.
Proposition 3.2. With the same notation as in Proposition 3.1, w.p. $\ge 1-\delta$ over the draw of the training set $\Omega$:
$$ l(Z_S) - l(Z^*) \;\le\; \frac{8\,\ell}{\sqrt{N}}\,\mathcal{M}\,\max(\sigma_1^*, \sigma_2^*)\,\big(1+\sqrt{\log(2d)}\big) \;+\; \frac{12\,\ell}{N}\,\mathcal{M}\,\bar{x}\,\bar{y}\,\big(1+\log(2d)\big) \;+\; b\,[\cdots]. $$

Remarks: Note that both $\sigma_1^*$ and $\sigma_2^*$ scale as the product of the scalings of $X$ and $Y$. The above result shows that if the distribution is only approximately uniform (sampling probabilities within a given ratio), then the bound is only penalised proportionately to this ratio: for identity side information, $[\sigma_1^*]^2$ is the maximum user (marginal) probability, which scales like $1/d_1$ for approximately uniform marginals. Similarly $[\sigma_2^*]^2 \approx 1/d_2$, yielding a sample complexity bound of order $dr\log(d)$ as expected.

3.3 Proposed adjusted regularisers and notation

In this section, we introduce our adjusted trace norm regulariser and its variants. We first recall that in standard (non-inductive) matrix completion, the weighted trace norm [26, 23] of a matrix $Z$ is defined as $\|\sqrt{D}\,Z\,\sqrt{E}\|_*$, where $D\in\mathbb{R}^{m\times m}$ (resp. $E\in\mathbb{R}^{n\times n}$) are diagonal matrices whose diagonal entries contain the marginal row (resp. column) sampling probabilities. Regularising the weighted trace norm instead of the standard trace norm increases performance [26] and leads to better theoretical guarantees. In this work we extend those advantages to the setting where side information is available.

Notation: Recall $\Gamma = \sum_{i,j} p_{i,j}\,\|x_i\|^2\|y_j\|^2$. Our method is based on a careful distribution-dependent rescaling of the matrix $M$. The idea is that we must look at the principal directions (singular vectors) of the side information matrices, but computed with respect to a distribution-sensitive inner product: when computing inner products of vectors in the column space of $X$, components corresponding to users which are more likely to be sampled must be weighted more highly. Accordingly, we diagonalise the matrices $L = X^\top\mathrm{diag}(q)\,X$ (resp. $R = Y^\top\mathrm{diag}(\kappa)\,Y$) from above, writing $L = P^{-1}DP$ (resp. $R = Q^{-1}EQ$). We also define empirical versions of those quantities: $\hat\Gamma = \frac{1}{N}\sum_{i,j} h_{i,j}\|x_i\|^2\|y_j\|^2$, where $h_{i,j}$ is the number of times that entry $(i,j)$ was sampled: $h_{i,j} = \sum_{o=1}^N \mathbf{1}_{\xi_o=(i,j)} = \#\big(\Omega\cap\{(i,j)\}\big)$; $\hat q_i = \frac{1}{N}\sum_j h_{i,j}\|y_j\|^2$, $\hat\kappa_j = \frac{1}{N}\sum_i h_{i,j}\|x_i\|^2$, $\hat L = X^\top\mathrm{diag}(\hat q)\,X$, $\hat R = Y^\top\mathrm{diag}(\hat\kappa)\,Y$, and their diagonalisations $\hat L = \hat P^{-1}\hat D\hat P$ and $\hat R = \hat Q^{-1}\hat E\hat Q$. We can now write our predictors as
$$ XMY^\top \;=\; XP^{-1}D^{-\frac12}\,\big[D^{\frac12}PMQ^{-1}E^{\frac12}\big]\,E^{-\frac12}QY^\top \;=\; X\hat P^{-1}\hat D^{-\frac12}\,\big[\hat D^{\frac12}\hat PM\hat Q^{-1}\hat E^{\frac12}\big]\,\hat E^{-\frac12}\hat QY^\top. \qquad (11) $$
The simplest version of our proposed algorithm is to regularise the bracketed matrix $D^{\frac12}PMQ^{-1}E^{\frac12}$ instead of $M$. However, some extra technical modifications may be necessary: if some users or items have extremely small sampling probability, the corresponding entries of $D^{-\frac12}$ will be very large. To obtain good bounds, we tackle this issue by forcing the entries of $D, \hat D, E, \hat E$ to be bounded below, which we achieve via smoothing: fixing a parameter $\alpha\in[0,1]$, we define $\widetilde D = \alpha D + (1-\alpha)\frac{\Gamma}{d_1}I$ and $\widetilde E = \alpha E + (1-\alpha)\frac{\Gamma}{d_2}I$, where $I$ is the identity matrix. Similarly, $\check D = \alpha\hat D + (1-\alpha)\frac{\hat\Gamma}{d_1}I$ and $\check E = \alpha\hat E + (1-\alpha)\frac{\hat\Gamma}{d_2}I$. We also define accordingly
$$ M' = D^{\frac12}PMQ^{-1}E^{\frac12}; \quad \hat M = \hat D^{\frac12}\hat PM\hat Q^{-1}\hat E^{\frac12}; \quad \widetilde M = \widetilde D^{\frac12}PMQ^{-1}\widetilde E^{\frac12}; \quad \check M = \check D^{\frac12}\hat PM\hat Q^{-1}\check E^{\frac12}; $$
as well as, similarly, $\widetilde X = XP^{-1}\widetilde D^{-\frac12}$, $X' = XP^{-1}D^{-\frac12}$, $\hat X = X\hat P^{-1}\hat D^{-\frac12}$, $\check X = X\hat P^{-1}\check D^{-\frac12}$, $\widetilde Y = YQ^{-1}\widetilde E^{-\frac12}$, $Y' = YQ^{-1}E^{-\frac12}$, $\hat Y = Y\hat Q^{-1}\hat E^{-\frac12}$, $\check Y = Y\hat Q^{-1}\check E^{-\frac12}$. Thus
$$ XMY^\top \;=\; X'M'(Y')^\top \;=\; \widetilde X\,\widetilde M\,\widetilde Y^\top \;=\; \hat X\,\hat M\,\hat Y^\top \;=\; \check X\,\check M\,\check Y^\top. $$
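The empirical smoothed quantities above can be computed directly from the sample. The following is a minimal numpy sketch under our notational assumptions: the function name `adjusted_factors` is ours, the orthogonal eigendecompositions play the role of the diagonalisations $\hat P^{-1}\hat D\hat P$ and $\hat Q^{-1}\hat E\hat Q$, and $\Omega$ is assumed to be given as an $(N, 2)$ integer array of sampled indices (possibly with repetitions), as in the earlier illustration.

```python
import numpy as np

def adjusted_factors(X, Y, Omega, alpha=0.5):
    """Empirical smoothed adjustment of Section 3.3 (a sketch, not the authors' code).

    Returns X_check, Y_check and the maps M -> M_check and M_check -> M, so that
    X @ M @ Y.T == X_check @ M_check @ Y_check.T for any core matrix M.
    """
    m, d1 = X.shape
    n, d2 = Y.shape
    N = len(Omega)

    # h_{ij}: number of times entry (i, j) was sampled (np.add.at handles repeated indices).
    h = np.zeros((m, n))
    np.add.at(h, (Omega[:, 0], Omega[:, 1]), 1.0)

    row_sq = (X ** 2).sum(axis=1)            # ||x_i||^2
    col_sq = (Y ** 2).sum(axis=1)            # ||y_j||^2
    q_hat = (h @ col_sq) / N                 # hat q_i     = (1/N) sum_j h_ij ||y_j||^2
    k_hat = (h.T @ row_sq) / N               # hat kappa_j = (1/N) sum_i h_ij ||x_i||^2
    gamma_hat = row_sq @ h @ col_sq / N      # hat Gamma

    # hat L = X^T diag(hat q) X and hat R = Y^T diag(hat kappa) Y, diagonalised orthogonally.
    L_hat = X.T @ (q_hat[:, None] * X)
    R_hat = Y.T @ (k_hat[:, None] * Y)
    d_hat, P_inv = np.linalg.eigh(L_hat)     # L_hat = P_inv @ diag(d_hat) @ P_inv.T
    e_hat, Q_inv = np.linalg.eigh(R_hat)

    # Smoothing (alpha < 1 keeps the entries bounded away from zero).
    D_check = alpha * d_hat + (1 - alpha) * gamma_hat / d1
    E_check = alpha * e_hat + (1 - alpha) * gamma_hat / d2

    X_check = X @ P_inv / np.sqrt(D_check)   # X hat-P^{-1} check-D^{-1/2}
    Y_check = Y @ Q_inv / np.sqrt(E_check)
    to_check = lambda M: np.sqrt(D_check)[:, None] * (P_inv.T @ M @ Q_inv) * np.sqrt(E_check)
    from_check = lambda Mc: P_inv @ (Mc / np.sqrt(D_check)[:, None] / np.sqrt(E_check)) @ Q_inv.T
    return X_check, Y_check, to_check, from_check
```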
Proposed models: We then propose a variety of adjusted regularisation strategies by replacing the regularisation of $M$ by that of $M'$, $\widetilde M$, $\hat M$ or $\check M$, depending on whether the ground-truth distribution is known and whether smoothing is desired. For instance, in the smoothed, empirical case, we will solve the following optimization problem:
$$ \min_M\;\sum_{\xi\in\Omega} l\big((XMY^\top)_\xi,\, G_\xi+\zeta_\xi\big) \;+\; \lambda\,\big\|\check D^{\frac12}\hat PM\hat Q^{-1}\check E^{\frac12}\big\|_*. \qquad (12) $$

Remark: Similarly to the matrix case, the smoothing parameter $\alpha$ is set to $\frac12$ in all theorem statements⁵. In the experiments, we vary $\alpha$ as indicated.

⁵ It is trivial to extend the proofs to arbitrary $\alpha$ at the cost of a factor of $1/\min(\alpha, 1-\alpha)$.

We will prove results for the empirical risk minimiser belonging to the following function classes:
$$ \widetilde{\mathcal F}_r := \big\{\,XMY^\top : \|\widetilde M\|_* \le \sqrt{r}\,\Gamma\,\big\}, \qquad \check{\mathcal F}_r := \big\{\,XMY^\top : \|\check M\|_* \le \sqrt{r}\,\hat\Gamma\,\big\}, \qquad (13) $$
corresponding to the smoothed and smoothed empirical versions of our algorithm. Note that the factors of $\Gamma$ are added purely for convenience in the final formula, so that we can understand the final formulae in terms of a soft concept of "rank". Indeed we have
$$ \|\widetilde D^{\frac12}\|_{\mathrm{Fr}}^2 \;\le\; d_1\,\frac{\Gamma}{2d_1} + \frac12\,\big\|\mathrm{diag}(q)^{\frac12}X\big\|_{\mathrm{Fr}}^2 \;=\; \frac{\Gamma}{2} + \frac12\sum_{i,u}X_{i,u}^2\sum_j p_{i,j}\|y_j\|^2 \;=\; \Gamma, \qquad (14) $$
and similarly $\|\widetilde E^{\frac12}\|_{\mathrm{Fr}}^2 \le \Gamma$. Thus if $\|PMQ^{-1}\|_\infty \le 1$ and $\mathrm{rank}(M)\le\rho$, we have
$$ \|\widetilde M\|_* \;\le\; \sqrt{\rho}\,\|\widetilde M\|_{\mathrm{Fr}} \;\le\; \sqrt{\rho}\,\sqrt{\textstyle\sum_{u,v}\big[\widetilde D^{\frac12}_u\big]^2\big[\widetilde E^{\frac12}_v\big]^2\big[PMQ^{-1}\big]_{u,v}^2} \;\le\; \sqrt{\rho}\,\|PMQ^{-1}\|_\infty\,\sqrt{\textstyle\sum_{u,v}\big[\widetilde D^{\frac12}_u\big]^2\big[\widetilde E^{\frac12}_v\big]^2} \;\le\; \sqrt{\rho}\,\Gamma. $$
Similarly, $\|\check M\|_* \lesssim \sqrt{\rho}\,\hat\Gamma$ under the condition $\|\hat PM\hat Q^{-1}\|_\infty \lesssim 1$.

3.4 Generalisation bounds for the smoothed adjusted trace norm

Although knowing the distribution is not realistic, it is instructive to see that one can obtain guarantees of order $O(dr\log(d))$ for the function class $\widetilde{\mathcal F}_r$ as a reasonably straightforward extension of the ideas developed for Proposition 3.2. The proof is provided in Appendix C.

Proposition 3.3. Let $\widetilde M_S = \arg\min_{\|\widetilde M\|_*\le\sqrt{r}\Gamma}\sum_{\xi\in\Omega} l\big((\widetilde X\widetilde M\widetilde Y^\top)_\xi, G_\xi+\zeta_\xi\big)$ be the trained matrix $\widetilde M$ and $\widetilde Z^* = \arg\min_{Z\in\widetilde{\mathcal F}_r}\mathbb{E}\,l(Z_\xi, G_\xi+\zeta_\xi)$ be the optimal $\widetilde Z$ when the predictors are restricted to the class $\widetilde{\mathcal F}_r$. Let also $\widetilde Z_S = \widetilde X\widetilde M_S\widetilde Y^\top$. We have w.p. $\ge 1-\delta$:
$$ l(\widetilde Z_S) - l(\widetilde Z^*) \;\le\; \frac{16\,\ell}{\sqrt{N}}\,[\cdots] \;+\; 24\,\ell\,\bar{x}\,\bar{y}\,\sqrt{d_1 d_2\, r}\,\big(1+\log(2d)\big)\,[\cdots]. $$

3.5 Generalisation bounds for the smoothed empirically adjusted trace norm

Below is a more challenging result (proof in Appendix D), which concerns the function class $\check{\mathcal F}_r$ corresponding to the empirically smoothed regulariser.

Theorem 3.2. Fix any target matrix $G$ and distribution $p$. Define $\check Z_S = \arg\min\{\hat l_S(Z) : Z\in\check{\mathcal F}_r\}$, where $\hat l_S(Z) = \frac1N\sum_{\xi\in\Omega} l(Z_\xi, G_\xi+\zeta_\xi)$. For any $\delta\in(0,1)$, w.p. $\ge 1-\delta$,
$$ l(\check Z_S) \;\le\; \inf_{Z\in\widetilde{\mathcal F}_r} l(Z) \;+\; C\,\frac{\ell\,\sqrt{r}\,\gamma\,(\bar{x}+\bar{y})^2\,[\cdots] \;+\; b\,[\cdots]}{\sqrt{N}}, \qquad (15) $$
where $\gamma$ is a quantity defined from $\bar{x}^2, \bar{y}^2, \underline{x}^2$ and $\underline{y}^2$, and $C$ is a universal constant. In particular, in expectation over the draw of the training set we have
$$ \mathbb{E}\,l(\check Z_S) \;\le\; \inf_{Z\in\widetilde{\mathcal F}_r} l(Z) \;+\; 2\,C\,\frac{\ell\,\sqrt{r}\,\gamma\,(\bar{x}+\bar{y})^2\,[\cdots] \;+\; b\,[\cdots]}{\sqrt{N}}. $$

The significance of this result is that even in the case of an arbitrary distribution, minimizing the smoothed empirical adjusted nuclear norm $\|\check M\|_*$ results in sample complexity bounds of order $dr\log(d)$, meaning that our distribution-dependent transformations have completely removed the negative effects of non-uniformity on the sample complexity. Note the proof requires careful technical variations compared to the proof of the comparable results in [23]. As an example, Lemma E.1 is the equivalent of Lemma 2 on page 8 of the supplementary material of [23] (whose proof is far shorter).
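Since $XMY^\top = \check X\check M\check Y^\top$ and (12) penalises the plain nuclear norm of $\check M$, problem (12) can be treated as an ordinary IMC problem (2) in the transformed variables. The following hypothetical glue code (a sketch only; it reuses the `imc` and `adjusted_factors` functions and the synthetic `X`, `Y`, `G`, `Omega`, `m`, `n` from the earlier illustrations, and the values of $\lambda$, the step size and $\alpha$ are arbitrary) illustrates this reparametrisation trick.

```python
# E-ATR sketch: optimise over M_check with a plain nuclear-norm prox on the transformed
# side information, then map back to the original parametrisation.
mask = np.zeros((m, n))
mask[Omega[:, 0], Omega[:, 1]] = 1.0   # 0/1 observation mask (multiplicities ignored for simplicity)

X_chk, Y_chk, to_check, from_check = adjusted_factors(X, Y, Omega, alpha=0.5)
M_check = imc(G, mask, X_chk, Y_chk, lam=0.1, lr=1e-3)   # square loss on observed entries
M_eatr = from_check(M_check)                             # back to the original core matrix
G_pred = X @ M_eatr @ Y.T                                # predictions for all entries
```

The back-map `from_check` recovers the core matrix in the original parametrisation, so predictions can be formed as $XMY^\top$ exactly as in standard IMC.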
3.6 Variations on the optimization problems

As in the related literature ([19, 22] etc.), we worked with a bounded loss, and expressed our results for the loss minimizer within a function class defined by explicit norm constraints. However, it is also possible to modify the results (under some boundedness assumptions) to make them apply to Lagrangian formulations such as (1), (2) and (18). In typical contexts where the entries are known to be bounded, this can even be done with the square loss. As an example, we consider the following immediate corollary of Proposition 3.3 and its global version C.1 (appendix):

Corollary 3.4. Assume that all of the entries of the ground truth are bounded by a constant $C$, and that they are observed without noise. Let $Z^\# = XM^\#Y^\top$ be the solution to the following optimization problem:
$$ \min_M\;\big\|\widetilde D^{\frac12}PMQ^{-1}\widetilde E^{\frac12}\big\|_* \quad\text{subject to}\quad \big[XMY^\top\big]_\xi = G_\xi \;\;\forall\xi\in\Omega. \qquad (17) $$
Let $\Phi_C(x) = \mathrm{sign}(x)\min(|x|, C)$. For any $\ell$-Lipschitz loss $l$, we have (with probability $\ge 1-\delta$)
$$ l\big(\Phi_C(Z^\#)\big) \;\le\; \frac{8\,\ell}{\sqrt{N}}\,[\cdots] \;+\; 12\,\ell\,\bar{x}\,\bar{y}\,\sqrt{d_1 d_2\, r_G}\,\big(1+\log(2d)\big)\,[\cdots], $$
where $r_G$ is the smallest $r$ such that the ground truth $G$ satisfies $G\in\widetilde{\mathcal F}_r$.

A further result which applies in the presence of noise is provided in Appendix H.

4 Experimental verification

In this section, we experimentally validate the advantages of our adjusted regularisation strategies described in Subsection 3.3. In all experiments, we work with the square loss.

4.1 Experiments on synthetic data

We construct square data matrices in $\mathbb{R}^{n\times n}$ with a given rank $r\le d$ for several combinations of $n, d, r$. We provide each model with $d$-dimensional side information spanning the row and column spaces. The sampling distribution is a power-type law depending on a parameter $\Lambda$ such that $\Lambda = 0$ yields uniform sampling (details in appendix). We compare three approaches: (1) standard inductive matrix completion with the side information matrices $X, Y$ (IMC); (2) our smoothed adjusted regulariser $\lambda\|\widetilde M\|_*$ (for several values of $\alpha$) (ATR)⁶; and finally (3) our smoothed empirically adjusted regulariser $\lambda\|\check M\|_*$ (for several values of $\alpha$) (E-ATR). For each $n\in\{100, 200\}$ we evaluate the following $(d, r)$ combinations: $(30, 4)$, $(50, 6)$ and $(80, 10)$. In order to study a meaningful data-sparsity regime, in each case we sampled $dr\omega$ entries, where $\omega\in\{1, 2, 3, 4, 5\}$. We show the most representative results here. More comprehensive results are provided in the supplementary material.

We observe that our methods outperform standard inductive matrix completion by significant margins in many regimes, even in the case of uniform sampling. Furthermore, the empirical version of our model actually often performs better than the exact one, which matches the observations made in [23] in the case of standard matrix completion. More detailed results are reported in the appendix.

⁶ Note that in this synthetic context, it is actually possible to compute $\widetilde M$ since the distribution is known.

Figure 1: Left: performance as a function of the data sparsity parameter $\omega$ for $(n, d, r) = (200, 80, 10)$. Right: performance on different $(n, d, r)$ combinations for $\omega = 4$. Legend: the parameter to the right is $\alpha$.

Table 3: Results on real-world datasets (RMSE)

|           | SoftImpute [6] | IMCNF [19] | E-ATR-0.5 | E-ATR-0.75 | E-ATR-1.0 |
| Douban    | 0.9582         | 0.8197     | 0.7691    | 0.7614     | 0.8779    |
| LastFM    | 2.4109         | 1.7612     | 1.6159    | 1.6943     | 2.3371    |
| MovieLens | 0.9280         | 0.9252     | 0.9056    | 0.9139     | 0.9262    |

4.2 Real data experiments

We evaluate the performance of our model on three real-life datasets: Douban, LastFM and MovieLens (further described in the supplementary material).
On real data we work with the following adjusted version of the model in [19]:
$$ \min_{M, Z}\;\frac1N\sum_{(i,j)\in\Omega} l\big((XMY^\top + Z)_{i,j},\, G_{i,j}+\zeta_{i,j}\big) \;+\; \lambda_1\big\|\check D^{\frac12}\hat PM\hat Q^{-1}\check E^{\frac12}\big\|_* \;+\; \lambda_2\big\|\check D_I^{\frac12}\, Z\,\check E_I^{\frac12}\big\|_*, \qquad (18) $$
where $\check D, \check E$ are defined as above based on the side information matrices $X, Y$, and $\check D_I, \check E_I$ are defined as $\check D, \check E$ except based on the side information matrices $(I, I)$. In particular, $\|\check D_I^{\frac12} Z\check E_I^{\frac12}\|_* = \|\check Z\|_*$ is the smoothed weighted trace norm of $Z$ in the sense of [23]. We report results in Table 3 and note our method outperforms both SoftImpute and IMCNF, especially with appropriate smoothing.

5 Conclusion

In this paper, we have provided the first distribution-free bounds for approximate recovery in inductive matrix completion with the trace norm with the following two desirable properties: (1) being non-vacuous for identity or community side information, and (2) being completely independent of the size of the matrix. We further presented an adjusted regularisation strategy which relies on a careful rescaling along distribution-dependent directions that captures the interaction between the side information matrices and the sampling distribution. Our bounds, which concern both the standard regulariser (rate $O(d^{3/2}\sqrt{r}\log(d))$) and our adjusted version (rate $O(dr\log(d))$), are almost exactly what one would obtain by replacing the size of the matrix with the size of the side information in the standard matrix completion bound. Thus, we have bridged the large gap between the theoretical guarantees for matrix completion and inductive matrix completion.

Broader impact

The work in this paper is theoretical and without any foreseeable significant societal impact.

Acknowledgements

The authors acknowledge support by the German Research Foundation (DFG) awards KL 2698/2-1 and KL 2698/5-1, as well as by the Federal Ministry of Science and Education (BMBF) awards 031L0023A, 01IS18051A, and 031B0770E.

References

[1] Zhao Kang, Chong Peng, and Qiang Cheng. Top-N recommender system via matrix completion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[2] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30-37, 2009.
[3] Cho-Jui Hsieh, Kai-Yang Chiang, and Inderjit S. Dhillon. Low rank modeling of signed networks. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 507-515, New York, NY, USA, 2012.
[4] Vassilis Kalofolias, Xavier Bresson, Michael Bronstein, and Pierre Vandergheynst. Matrix completion on graphs. arXiv e-prints, arXiv:1408.1717, August 2014.
[5] Hao Ma, Dengyong Zhou, Chao Liu, Michael R. Lyu, and Irwin King. Recommender systems with social regularization. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 287-296, 2011.
[6] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287-2322, August 2010.
[7] Trevor Hastie, Rahul Mazumder, Jason D. Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16(104):3367-3402, 2015.
[8] Rong Li, Yongcheng Dong, Qifan Kuang, Yiming Wu, Yizhou Li, Min Zhu, and Menglong Li. Inductive matrix completion for predicting adverse drug reactions (ADRs) integrating drug-target interactions. Chemometrics and Intelligent Laboratory Systems, 144:71-79, 2015.
[9] H. Wang, Y. Wei, M. Cao, M. Xu, W. Wu, and E. P. Xing. Deep inductive matrix completion for biomedical interaction prediction. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 520-527, 2019.
[10] Nagarajan Natarajan and Inderjit S. Dhillon. Inductive matrix completion for predicting gene and disease associations. Bioinformatics, 30(12):i60-i68, 2014.
[11] Mark Herbster, Stephen Pasteris, and Lisa Tse. Online matrix completion with side information. Advances in Neural Information Processing Systems, 33, 2020.
[12] Xiao Zhang, Simon Du, and Quanquan Gu. Fast and sample efficient inductive matrix completion via multi-phase Procrustes flow. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5756-5765. PMLR, 2018.
[13] Aditya Krishna Menon and Charles Elkan. Link prediction via matrix factorization. In Machine Learning and Knowledge Discovery in Databases, pages 437-452. Springer Berlin Heidelberg, 2011.
[14] Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, and Yong Yu. SVDFeature: A toolkit for feature-based collaborative filtering. The Journal of Machine Learning Research, 2012.
[15] Fabian Jirasek, Rodrigo A. S. Alves, Julie Damay, Robert A. Vandermeulen, Robert Bamler, Michael Bortz, Stephan Mandt, Marius Kloft, and Hans Hasse. Machine learning in thermodynamics: Prediction of activity coefficients by matrix completion. The Journal of Physical Chemistry Letters, 11(3):981-985, 2020.
[16] Prateek Jain and Inderjit S. Dhillon. Provable inductive matrix completion. CoRR, abs/1306.0626, 2013.
[17] Miao Xu, Rong Jin, and Zhi-Hua Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 2301-2309, Red Hook, NY, USA, 2013. Curran Associates Inc.
[18] Jin Lu, Guannan Liang, Jiangwen Sun, and Jinbo Bi. A sparse interactive model for matrix completion with side information. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
[19] Kai-Yang Chiang, Inderjit S. Dhillon, and Cho-Jui Hsieh. Using side information to reliably learn low-rank matrices from missing and corrupted observations. Journal of Machine Learning Research, 2018.
[20] Kai-Yang Chiang, Cho-Jui Hsieh, and Inderjit S. Dhillon. Matrix completion with noisy side information. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
[21] P. Giménez-Febrer, A. Pagès-Zamora, and G. B. Giannakis. Generalization error bounds for kernel matrix completion and extrapolation. IEEE Signal Processing Letters, 27:326-330, 2020.
[22] Ohad Shamir and Shai Shalev-Shwartz. Collaborative filtering with the trace norm: Learning, bounding, and transducing. In Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 661-678. PMLR, 2011.
[23] Rina Foygel, Ohad Shamir, Nati Srebro, and Russ R. Salakhutdinov. Learning with the weighted trace-norm under arbitrary sampling distributions. In Advances in Neural Information Processing Systems 24, pages 2133-2141. Curran Associates, Inc., 2011.
[24] Kai Zhong, Zhao Song, Prateek Jain, and Inderjit S. Dhillon. Provable non-linear inductive matrix completion. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[25] Ohad Shamir and Shai Shalev-Shwartz. Matrix completion with the trace norm: Learning, bounding, and transducing. Journal of Machine Learning Research, 15:3401-3423, 2014.
[26] Nathan Srebro and Russ R. Salakhutdinov. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. In Advances in Neural Information Processing Systems 23, pages 2056-2064. Curran Associates, Inc., 2010.
[27] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053-2080, May 2010.
[28] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413-3430, December 2011.
[29] Raghunandan Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. In Advances in Neural Information Processing Systems 22, pages 952-960. Curran Associates, Inc., 2009.
[30] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.
[31] Yudong Chen. Incoherence-optimal matrix completion. IEEE Transactions on Information Theory, 61, 2013.
[32] Lijun Ding and Yudong Chen. Leave-one-out approach for matrix completion: Primal and dual analysis. IEEE Transactions on Information Theory, 66:7274-7301, 2020.
[33] P. Giménez-Febrer, A. Pagès-Zamora, and G. B. Giannakis. Matrix completion and extrapolation via kernel regression. IEEE Transactions on Signal Processing, 67(19):5004-5017, 2019.
[34] T. Tony Cai and Wen-Xin Zhou. Matrix completion via max-norm constrained optimization. Electronic Journal of Statistics, 10(1):1493-1525, 2016.
[35] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In Learning Theory, pages 545-560, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg.
[36] Antoine Ledent, Rodrigo Alves, and Marius Kloft. Orthogonal inductive matrix completion. IEEE Transactions on Neural Networks and Learning Systems, pages 1-12, 2021.
[37] Rodrigo Alves, Antoine Ledent, Renato Assunção, and Marius Kloft. An empirical study of the discreteness prior in low-rank matrix completion. In NeurIPS 2020 Workshop on Preregistration in Machine Learning, volume 148 of Proceedings of Machine Learning Research, pages 111-125. PMLR, 2021.
[38] Qiaosheng Zhang, Vincent Tan, and Changho Suh. Community detection and matrix completion with social and item similarity graphs. IEEE Transactions on Signal Processing, 69:917-931, 2021.
[39] Qiaosheng Eric Zhang, Geewon Suh, Changho Suh, and Vincent Y. F. Tan. MC2G: An efficient algorithm for matrix completion with social and item similarity graphs. CoRR, abs/2006.04373, 2020.
[40] Rina Foygel, Nathan Srebro, and Russ R. Salakhutdinov. Matrix reconstruction with the local max norm. In Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.
[41] Muhan Zhang and Yixin Chen. Inductive matrix completion based on graph neural networks. In International Conference on Learning Representations, 2020.
[42] Qitian Wu, Hengrui Zhang, Xiaofeng Gao, Junchi Yan, and Hongyuan Zha. Towards open-world recommendation: An inductive model-based collaborative filtering approach. In International Conference on Machine Learning, pages 11329-11339. PMLR, 2021.
[43] Rafał Latała. Some estimates of norms of random matrices. Proceedings of the American Mathematical Society, 133(5):1273-1282, 2005.