# Generalization Analysis of Deep Non-linear Matrix Completion

Antoine Ledent 1   Rodrigo Alves 2

1 School of Computing and Information Sciences, Singapore Management University, Singapore. 2 Department of Applied Mathematics, Czech Technical University in Prague, Prague, Czech Republic. Correspondence to: Antoine Ledent.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

We provide generalization bounds for matrix completion with Schatten $p$ quasi-norm constraints, which is equivalent to deep matrix factorization with Frobenius constraints. In the uniform sampling regime, the sample complexity scales like $\tilde{O}(rn)$, where $n$ is the size of the matrix and $r$ is a constraint of the same order as the ground truth rank in the isotropic case. In the distribution-free setting, the bounds scale as $\tilde{O}\big(r^{1-\frac{p}{2}}\, n^{1+\frac{p}{2}}\big)$, which reduces to the familiar $\sqrt{r}\, n^{\frac{3}{2}}$ for $p=1$. Furthermore, we provide an analogue of the weighted trace norm for this setting which brings the sample complexity down to $\tilde{O}(nr)$ in all cases. We then present a non-linear model, Functionally Rescaled Matrix Completion (FRMC), which applies a single trainable function from $\mathbb{R} \to \mathbb{R}$ to each entry of a latent matrix, and prove that this adds only negligible terms to the overall sample complexity, whilst experiments demonstrate that this simple model improvement already leads to significant gains on real data. We also provide extensions of our results to various neural architectures, thereby providing the first comprehensive uniform convergence PAC analysis of neural network matrix completion.

1. Introduction

Matrix Completion (MC), the problem which consists in estimating a ground truth matrix $G \in \mathbb{R}^{m \times n}$ from a small number $N \ll mn$ of observations, is an important machine learning problem with applications in various fields such as recommender systems (Mazumder et al., 2010; Hastie et al., 2015; Zhang et al., 2018; Koren et al., 2009), community discovery (Qiaosheng et al., 2019) and drug interaction prediction (Li et al., 2015). To recover a ground truth matrix based on a small number of observations, it is necessary to assume that it has some structure. Accordingly, there is a wide set of constraints and regularizers which aim to indirectly induce rank sparsity. One of the most well-known examples is the nuclear norm $\|\cdot\|_*$, which is defined as the sum of the singular values (Mazumder et al., 2010). The Schatten $p$ quasi-norm (for $p < 1$) provides an alternative form of rank-sparsity-inducing constraint. The Schatten $p$ quasi-norm $\|Z\|_{\mathrm{sc},p}$ of a matrix $Z$ is defined as $\big[\sum_v \rho_v^p\big]^{\frac{1}{p}}$, where the $\rho_v$ are the singular values of $Z$. In particular, as $p$ approaches $0$, $\|Z\|_{\mathrm{sc},p}^p = \sum_v \rho_v^p$ approaches the rank of $Z$. When $p = \frac{2}{d}$ for some integer $d$, this is known to be equivalent to the popular deep matrix factorization (DMF) framework (De Handschutter et al., 2021; Arora et al., 2019; Fan & Cheng, 2018), whose predictors take the form of a product of matrices $A D_1 \cdots D_{d-2} B^\top$, with a regularizer of the form $L(A, D, B) := \sum_v \|D_v\|_{\mathrm{Fr}}^2 + \|A\|_{\mathrm{Fr}}^2 + \|B\|_{\mathrm{Fr}}^2$. Indeed, the minimum

$\min\ L(A, D, B) \quad \text{s.t.} \quad A D_1 \cdots D_{d-2} B^\top = Z \qquad (1)$

is $d\, \|Z\|_{\mathrm{sc},p}^p$ (see (Dai et al., 2021), or Theorem F.22). This equivalence with Schatten $p$ quasi-norm constrained MC has been gathering substantial interest in recent years (Arora et al., 2019; Giampouras et al., 2020), and its implications for sample complexity are not fully explored.
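As a concrete illustration of the equivalence in Equation (1), the following minimal numerical sketch (our own illustration, not code from the paper; the helper names `schatten_p` and `balanced_deep_factorization` are ours) builds the balanced depth-$d$ factorization obtained from the SVD of $Z$, which attains the Frobenius cost $d\,\|Z\|_{\mathrm{sc},p}^p$ for $p = 2/d$.

```python
import numpy as np

def schatten_p(Z, p):
    """Schatten-p quasi-norm ||Z||_{sc,p} = (sum_i sigma_i^p)^(1/p)."""
    s = np.linalg.svd(Z, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

def balanced_deep_factorization(Z, d):
    """Balanced depth-d factorization Z = A @ D_1 @ ... @ D_{d-2} @ B.T
    built from the SVD; it achieves the minimum in Eq. (1)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    S_root = np.diag(s ** (1.0 / d))
    A = U @ S_root                                 # m x k
    B = Vt.T @ S_root                              # n x k
    Ds = [S_root.copy() for _ in range(d - 2)]     # k x k middle factors
    return A, Ds, B

def frobenius_cost(A, Ds, B):
    return sum(np.linalg.norm(M, 'fro') ** 2 for M in [A, B] + Ds)

rng = np.random.default_rng(0)
m, n, rank, d = 40, 30, 5, 3                       # depth d = 3  <=>  p = 2/3
Z = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))
p = 2.0 / d

A, Ds, B = balanced_deep_factorization(Z, d)
prod = A
for D in Ds:
    prod = prod @ D
prod = prod @ B.T

print(np.allclose(prod, Z))                                # True: exact factorization
print(frobenius_cost(A, Ds, B), d * schatten_p(Z, p) ** p) # the two values coincide
```

The factorization $A = U\Sigma^{1/d}$, $D_v = \Sigma^{1/d}$, $B = V\Sigma^{1/d}$ spreads the spectrum evenly across the $d$ factors, which is what makes the sum of squared Frobenius norms collapse to $d\sum_i \sigma_i^{2/d}$.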
Indeed, the early literature on deep matrix factorization is mostly concerned with algorithmic and optimization issues rather than statistical guarantees. It is also worth noting that the equivalence has intriguing implications beyond matrix completion, for the study of the implicit regularization of depth in neural networks, which has seen an explosion of recent interest in the community (Jacot, 2022; Wang & Jacot, 2023).

The last few years have also witnessed a surge in the popularity of non-linear matrix completion models. For instance, a branch of the deep matrix factorization literature incorporates non-linear functions inside the product (Xue et al., 2017; Fan & Cheng, 2018; Fan, 2021), leading to predictors of the form $g_0\big(A\, g_1(D_1 \cdots g_{d-2}(D_{d-2}\, g_{d-1}(B^\top)) \cdots)\big)$, where the $g$'s are activation functions. Moreover, many models are simply neural network architectures which take a (row, column) combination as input. Such models typically incorporate learnable row and column embeddings. This idea appears to date back to the Neural Network Matrix Factorisation model (NNMF) (Dziugaite & Roy, 2015). A relevant variant is (He et al., 2017), which involves a concatenation of Hadamard products of user and item embeddings and neural encodings, followed by a linear layer. Whilst existing research provides many new algorithms and insights into the optimization landscape of various DMF and NNMF methods, very few works provide a sample complexity analysis of the associated function classes: to the best of our knowledge, most of the existing work in this direction is limited to MC without non-linearities and with $p = 1$ (Shamir & Shalev-Shwartz, 2011), with the exception of (Fan et al., 2020).

In this paper, we study the sample complexity of DMF with and without non-linear components. Our programme is to study a broad class of predictors: $g(i,j) = \phi\big(Z^1_{i,j}, \ldots, Z^{m}_{i,j}, \Psi(i,j)\big)$, where $\phi$ and $\Psi$ are neural networks and the matrices $Z^1, \ldots, Z^{m}$ may be subject to various Schatten-type constraints. We include a large variety of results for many such models in the supplementary material, but for the sake of simplicity, here we focus our exposition on the following two much simpler cases: (1) MC with Schatten $p$ constraints: $g_{i,j} = Z_{i,j}$, subject to $\|Z\|_{\mathrm{sc},p} \le M$ for some constant $M$; and (2) Functionally Rescaled Matrix Completion (FRMC): $g_{i,j} = f_\theta(Z_{i,j})$, subject to $\|Z\|_{\mathrm{sc},p} \le M$, $\|Z\|_\infty \le B_0$ and $\|f_\theta\|_{\mathrm{lip}} \le L_f$ for some constants $M$ and $L_f$. In addition, inspired by earlier work on the weighted trace norm (Foygel et al., 2011; Srebro & Salakhutdinov, 2010), we study alternative constraints based on the following weighted version of the Schatten quasi-norm: $\|\operatorname{diag}(\check{p})^{\frac{1}{2}} Z \operatorname{diag}(\check{q})^{\frac{1}{2}}\|_{\mathrm{sc},p}$, where the vectors $\check{p}, \check{q}$ are estimates of the marginal row and column probabilities. Throughout the paper, we use abbreviations such as FSd ("functionally rescaled Schatten-$d$") for the model with Schatten $\frac{2}{d}$ constraint and a rescaling function $f$, and other similar acronyms, which we summarize in the table of notation in Section A.

Our contributions are as follows: For MC with a Schatten quasi-norm constraint in the uniform sampling regime, we show sample complexity bounds of $\tilde{O}\big((m+n)r\big)$, where $r = \big[\frac{M}{\sqrt{mn}}\big]^{\frac{2p}{2-p}}$ scales like the rank of the ground truth. In the distribution-free setting, we show a sample complexity bound of $\tilde{O}\big(r^{1-\frac{p}{2}}\,(m+n)^{1+\frac{p}{2}}\big)$. This reduces to the classic rate of $\tilde{O}\big(\sqrt{r}\,(m+n)^{\frac{3}{2}}\big)$ (cf. (Shamir & Shalev-Shwartz, 2011; 2014)) for $p = 1$.
By considering the weighted version of the Schatten quasi-norm, we are able to bring the rate back to r O prpm nqq, analogously to the case p 1 in (Foygel et al., 2011). As can be seen in Table 1 the Functionally Rescaled model in all of the cases above, we show that learning the function only brings a negligible cost to the sample complexity (it merely adds a constant which depends on the Lipschitz and boundedness parameters). We provide extensions of our results to the case of multiple latent matrices and neural encodings in the appendix. Cf. Subsection C.3 and Section G. In particular, some of our results apply to (Dziugaite & Roy, 2015; He et al., 2017), which we show (cf. Sec F.6) involve implicit Schatten 2{3 regularization. Our proofs rely on low-level modifications of chaining arguments which may be of independent interest. In particular, we prove multi-class chaining Lemmas E.4 and E.3, which allow one to bound the Rademacher complexity of combinations of function classes without access to covering numbers for each individual class. In extensive synthetic and real life experiments, we evaluate the effects of the depth parameter d, the presence or absence of weights in the norm constraints, and the presence or absence of additional neural embeddings. We find that p 2 3 generally performs significantly better than p 1, and our proposed weighted Schatten norm is slightly superior to their non-weighted counterparts. Our results and the comparison to the related works can also be seen in Tables 1, 4 and 5. In all our results, we assume a bounded and Lipschitz loss function. 2. Related Works Approximate Recovery in Matrix Completion: There is a substantial body of literature on the sample complexity of matrix completion with bounded Lipschitz losses and norm constraints. In particular, our work takes much inspiration from the pioneering works of (Foygel et al., 2011) and (Shamir & Shalev-Shwartz, 2011; 2014), which proved some particular cases of some of our results for MC, without a learnable function, in the case p 1. The explicitly rank-restricted case was studied in classification settings in (Srebro et al., 2004; Srebro & Shraibman, 2005; Srebro & Jaakkola, 2005). Table 4 positions our work within the approximate recovery literature. Beyond the examples above, we are not aware of any work on the approximate recovery for Schatten norm constrained matrix completion. However, similar problems have been studied with different losses or sampling regimes. In particular, (Fan et al., 2020; Fan, 2021) study approximate tensor recovery with Schatten regularization over the space SK d,n of order d tensors with orthogonal Generalization Analysis of Deep Non-linear Matrix Completion Table 1: Summary of our results for Functionally Rescaled Matrix completion (FRMC). The r O notation hides logarithmic factors, including N, m, n, r, ℓ, B the failure probability δ and the constraint on B0 on the maximum entry. fp Zq with Sampling Generalization bound Result } Z ?mn}p sc,p ď r1 p 2 Uniform r O ˆ B B2 B0 Lf ℓB Thm 3.4 }f}Lip ď Lf, }Z}8 ď B0 } Z ?mn}p sc,p ď r1 p 2 Arbitrary r O ˆ B1 p B2 B0 Lf ℓB Thm 3.5 }f}Lip ď Lf, }Z}8 ď B0 } r Z}sc,p ď r1 p 2 Arbitrary r O ˆ B B2 B0 Lf ℓB Thm 3.4 }f}Lip ď Lf, }Z}8 ď B0 CP factors1. Since tensors are more general and generally more complex to study than matrices, the results go well beyond the more restricted setting of matrix completion which we study here. However, in the case of a 2-way tensor (i.e. 
a matrix), the results can be interpreted as a Lagrangian formulation of the empirical risk minimization problems we study. The loss function is the square loss and sampling is uniformly at random without replacement, which means the results are not quite directly comparable. Nonetheless, the achieved L2 excess risk bounds scale like 2p 2 p q{p Npq(cf. (Fan et al., 2020), Theorem 4), where M is an upper bound on the }.}sc,p norm of the recovered matrix. Expressed in terms of our rank-like quantity r, this turns into prn 2 2 p q{p Npq. In contrast, our result 2 p q{p Npq , which trans- lates to r O a . Firstly, note both results scale like r Oprnq when p Ñ 0 (though the constant blows up like 1{p in both cases). Secondly, our rate is uniformly tighter since 2{p2 pq ą 1. And lastly, the bound in (Fan et al., 2020) is vacuous for p 1, scaling like r Oprn2q in that case, compared to r Oprnq in our result. Exact and perturbed recovery for matrix completion (and inductive matrix completion) is a very well-studied problem (Recht, 2011; Candes & Plan, 2010; Cand es & Tao, 2010). In general, using nuclear norm constraints or regularization (which is equivalent to the case p 1 from our study) results in a sample complexity of r Oprnq. We refer the reader to (Recht, 2011; Xu et al., 2013) for more details. There is also a substantial amount of work on other soft relaxations of the rank, such as the max norm. In particular, the early work of (Srebro & Shraibman, 2005) shows a sample complexity of r Opn M 2q, where M is a constraint on the max norm. A low-noise recovery result was achieved for the max norm in the classic work of (Cai & Zhou, 2016), which was further extended in (Wang et al., 2021) to provide 1This is a strict subset of the set of tensors of order d when d ą 2, but it coincides with the set of all matrices when d 2. bounds on the uniformly weighted Frobenius error of the recovered matrix in the non-uniform sampling regime (under some approximate uniformity assumption on the sampling probabilities). For Schatten constraints with p ă 1, there appears to be little to no existing work in the case of randomly sampled entries. However, there are several works on the sample complexity of compressed sensing for Schatten quasi-norm MC (Zhang et al., 2013; Arora et al., 2019; Liu et al., 2014; Recht et al., 2010). Nonetheless, compressed sensing is not directly comparable to matrix completion, especially in the arbitrary sampling regime we study. Cf Section I for more details. Earlier works on deep matrix factorization often focus on the optimization and algorithmic aspects (Trigeorgis et al., 2016; Zhao et al., 2017) without providing sample complexity bounds, though some include non-linear components (Xue et al., 2017; Fan & Cheng, 2018; Wang et al., 2017; De Handschutter et al., 2021; Wei et al., 2020; Lara Cabrera et al., 2020). Note that the non-linear components in those works are interspersed between each matrix in the product which implies the models are different from both our proposed FRMC and the analogous models we study. The observation that deep matrix factorization is equivalent to Schatten norm regularization was made in other works, including (Arora et al., 2019), which studies the optimization landscape of the problem in a compressed sensing setting where the measurement matrices commute (which does not apply to indicator measurements). 
The implications this has for the implicit rank-restriction which occurs when training deep neural networks are currently the subject of a large amount of interest in the community (Dai et al., 2021; Jacot, 2022; Wang & Jacot, 2023). However, those works typically do not study sample complexity, perhaps because it is only non-trivial when the matrix is not flat, which corresponds to a multi-output scenario in the neural network context. Nevertheless, the potential to generalize our results to that situation is a tantalizing direction for future work which may shed a different light on implicit rank-restriction in DNN training.

3. Main Results

Notation and Setting: In line with much of the literature on approximate recovery in matrix completion (Shamir & Shalev-Shwartz, 2011; 2014; Foygel et al., 2011), we assume an i.i.d. sampling regime in a supervised learning setting where the input space is the set of entries $[m] \times [n]$: each sample/observation consists in a pair $(\xi, \tilde{G})$ sampled i.i.d. from a joint distribution, where $\xi \in [m] \times [n]$ and $\tilde{G}$ is a real number. We try to learn a model $g : [m] \times [n] \to \mathbb{R}$, whose performance is evaluated by a loss function $l$, which can depend on $\xi$, $\tilde{G}$ and the prediction $g_\xi$ of the model. The loss function $l$ is assumed to be $\ell$-Lipschitz w.r.t. the prediction $g(\xi)$ (for fixed $\xi, \tilde{G}$), and uniformly bounded by a constant $B$. For each fixed $\xi = (i,j)$, we choose $G_\xi \in \arg\min_{g} \mathbb{E}\big(l(g, \tilde{G}, \xi)\big)$, where the expectation is over the observation $\tilde{G}$ conditionally on $\xi$. The resulting matrix $G \in \mathbb{R}^{m \times n}$ is referred to as the ground truth matrix. For instance, if $l(g, \tilde{G}, \xi) = F(|g - \tilde{G}|)$ where $F : \mathbb{R} \to \mathbb{R}$ is a strictly increasing function and for each $(i,j) \in [m] \times [n]$ we have $\tilde{G}_{(i,j)} = R_{i,j} + \zeta$, where $\zeta$ is generated via i.i.d. noise from a symmetric distribution with $\mathbb{E}(\zeta) = 0$, then $G = R$. The marginal distribution over $\xi$ is represented by a stochastic matrix (non-negative entries summing to one) whose $(i,j)$th entry we denote by $p_{i,j}$. We also write $p_i$ and $q_j$ for the marginal probabilities of the $i$th row and $j$th column, respectively. Our training set $S$ consists of $N$ i.i.d. samples: $S = \big\{(\xi_1, \tilde{G}_1), \ldots, (\xi_N, \tilde{G}_N)\big\} \subset ([m] \times [n]) \times \mathbb{R}$. By abuse of notation, we sometimes omit the dependence of $l$ on $\tilde{G}$ and $\xi$ by writing the empirical expectation of a function $F$ as $\hat{\mathbb{E}}(F(\xi)) = \frac{1}{N}\sum_{o=1}^{N} l_o(F)$ instead of $\frac{1}{N}\sum_{o=1}^{N} l(F_o, \tilde{G}_o, \xi_o)$, and we sometimes use notations such as $g_{i,j}$ and $g(i,j)$ interchangeably to denote the prediction made by the predictor $g \in \mathbb{R}^{m \times n}$ for entry $(i,j)$. In addition, by further abuse of notation, we will often write $\mathbb{E}(l(Z))$ and $\hat{\mathbb{E}}(l(Z))$ instead of the previous quantities. A table of notations (Table 3) is available in Appendix A.

For a predictor $g \in \mathbb{R}^{m \times n}$, the empirical loss is $\hat{l}(g) := \hat{\mathbb{E}}\big(l(g(\xi), \tilde{G}, \xi)\big) = \frac{1}{N}\sum_{o=1}^{N} l(g_{\xi_o}, \tilde{G}_o, \xi_o)$. In particular, if the ground truth matrix $G \in \mathbb{R}^{m \times n}$ is observed without noise, the $N$ observations are distinct and the loss function $l$ is the square loss, then $\hat{l}(g) = \frac{1}{N}\sum_{(i,j) \in \Omega} l\big(g_{i,j}, G_{i,j}, (i,j)\big) = \frac{1}{N}\sum_{(i,j) \in \Omega} |g_{i,j} - G_{i,j}|^2$, where $\Omega \subset [m] \times [n]$ is the set of observed entries. The population expected loss is $l(g) := \mathbb{E}\big(l(g(\xi), \tilde{G}, \xi)\big)$, where the expectation runs over a random joint draw of the entry $\xi \in [m] \times [n]$ and the observation $\tilde{G}$. In particular, if the entries of the ground truth matrix $G$ are observed without noise and the loss function $l$ is the square loss, we have

$l(g) = \sum_{(i,j) \in [m] \times [n]} p_{i,j}\, l(g_{i,j}, G_{i,j}) = \sum_{(i,j) \in [m] \times [n]} p_{i,j}\, (g_{i,j} - G_{i,j})^2,$

where $p_{i,j}$ denotes the marginal probability of sampling entry $(i,j)$. For uniform sampling, we further have $l(g) = \frac{1}{mn}\|g - G\|_{\mathrm{Fr}}^2$, where $\|\cdot\|_{\mathrm{Fr}}$ denotes the Frobenius norm.
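To make the sampling model and the two losses concrete, the following sketch (our own illustration, not code from the paper; all variable names are ours) draws $N$ i.i.d. entries from an arbitrary sampling distribution $P$ and evaluates the empirical loss $\hat{l}$ and the (noiseless, square-loss) population loss $l$ of a candidate predictor.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, N = 20, 15, 200

# Ground truth and an arbitrary sampling distribution over entries (sums to 1).
G = rng.standard_normal((m, 5)) @ rng.standard_normal((5, n))
P = rng.random((m, n)); P /= P.sum()

# Draw N i.i.d. observed entries xi_o ~ P (with replacement) and noisy labels.
flat = rng.choice(m * n, size=N, p=P.ravel())
rows, cols = np.unravel_index(flat, (m, n))
labels = G[rows, cols] + 0.1 * rng.standard_normal(N)

def sq_loss(pred, label):
    return (pred - label) ** 2

def empirical_loss(Z):
    """hat{l}(Z) = (1/N) sum_o l(Z_{xi_o}, tilde{G}_o, xi_o)."""
    return sq_loss(Z[rows, cols], labels).mean()

def population_loss(Z):
    """l(Z) = sum_{ij} p_{ij} (Z_{ij} - G_{ij})^2  (noiseless square-loss case)."""
    return (P * (Z - G) ** 2).sum()

Z = G + 0.05 * rng.standard_normal((m, n))   # a candidate predictor
print(empirical_loss(Z), population_loss(Z))
# Under uniform sampling, population_loss(Z) equals ||Z - G||_Fr^2 / (m*n).
```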
For a general predictor $g \in \mathbb{R}^{m \times n}$, the generalization error is $l(g) - \hat{l}(g)$. Given a class of matrices $\mathcal{F} \subset \mathbb{R}^{m \times n}$, the empirical risk minimizer $\hat{g} \in \mathcal{F}$ is defined by $\hat{g} \in \arg\min_{g \in \mathcal{F}} \hat{l}(g)$. The excess risk is then

$l(\hat{g}) - \min_{g \in \mathcal{F}} l(g). \qquad (2)$

$O$ and $\tilde{O}$ Notations: For simplicity, some of our results (e.g. Thms 3.4 and 3.5 and the results in the summary tables) are expressed in terms of a $\tilde{O}$ notation which hides polylogarithmic factors in all variables, including $B_0$, $L_f$, the failure probability $\delta$, the constants $\ell, B$ relative to the loss function, etc. Both the $O$ and $\tilde{O}$ notations also assume that $B_0, B \ge 1$. The formal results including all polylogarithmic terms are in the appendix.

3.1. Excess Risk Bounds for Matrix Completion with the Schatten Quasi-norm

In this subsection, we present our results for matrix completion with Schatten quasi-norm constraints. A summary of our results is available in Appendix B.

Notation for the Weighted Setting: In the weighted setting, we require empirical estimates $\hat{p}_i = \frac{1}{N}\sum_{o=1}^{N} \mathbb{1}_{(\xi_o)_1 = i}$ and $\hat{q}_j = \frac{1}{N}\sum_{o=1}^{N} \mathbb{1}_{(\xi_o)_2 = j}$ of the quantities $p_i$ and $q_j$ respectively. Furthermore, similarly to the literature on the weighted trace norm (Foygel et al., 2011; Srebro & Salakhutdinov, 2010), we also work with the smoothed versions $\bar{p}_i = \frac{1}{2}p_i + \frac{1}{2m}$ and $\bar{q}_j = \frac{1}{2}q_j + \frac{1}{2n}$ of the ground truth distribution, as well as the empirically evaluated analogues $\check{p}_i = \frac{1}{2}\hat{p}_i + \frac{1}{2m}$ and $\check{q}_j = \frac{1}{2}\hat{q}_j + \frac{1}{2n}$. By abuse of notation, we write $\operatorname{diag}(p)$ and $\operatorname{diag}(q)$ for the diagonal matrices with diagonal elements $p_1, \ldots, p_m$ and $q_1, \ldots, q_n$ respectively (and use similar notations for $\bar{p}$ and $\bar{q}$). For a matrix $Z$, we denote by $\tilde{Z}$ the matrix $\operatorname{diag}(\bar{p})^{\frac{1}{2}} Z \operatorname{diag}(\bar{q})^{\frac{1}{2}}$, so that $\|\tilde{Z}\|_*$ is the (smoothed) weighted trace norm (Foygel et al., 2011). Similarly, $\check{Z} = \operatorname{diag}(\check{p})^{\frac{1}{2}} Z \operatorname{diag}(\check{q})^{\frac{1}{2}}$.

Remark on the Definition of the Rank-like Quantity r: To better illustrate the implicit dimensional dependence of the bounds which arise from our norm-based constraints, we typically express our constraints on the Schatten norms of matrices in terms of the rank-like quantity $r = \big[\frac{M}{\sqrt{mn}}\big]^{\frac{2p}{2-p}}$, where $M$ is an upper bound constraint on $\|Z\|_{\mathrm{sc},p}$. In the case $p = 1$ this is a well-established convention in the literature (Foygel et al., 2011; Srebro & Salakhutdinov, 2010; Ledent et al., 2021b; Foygel et al., 2012). Let us briefly explain the rationale behind this notation in the case of an arbitrary $p$. Suppose the entries of some matrix $Z$ are bounded above by some constant $C$: $|Z_{i,j}| \le C$ for all $i,j$. Then we have $\|Z\|_{\mathrm{Fr}}^2 = \sum_{i,j} |Z_{i,j}|^2 \le C^2 mn$. Writing $\rho_1, \ldots, \rho_r$ for the singular values of $Z$ in decreasing order (so that $r$ is the rank of $Z$), we then have, by Hölder's inequality:

$\|Z\|_{\mathrm{sc},p}^p = \sum_{o=1}^{r} \rho_o^p \le r^{1-\frac{p}{2}} \Big[\sum_{o=1}^{r} \rho_o^2\Big]^{\frac{p}{2}} \le \big[mn\, C^2\big]^{\frac{p}{2}}\, r^{1-\frac{p}{2}} = O\big(\sqrt{mn}^{\,p}\, r^{1-\frac{p}{2}}\big). \qquad (3)$

Similarly, if we have $|Z_{i,j}| \ge C_0$ for all $i,j$, then $\|Z\|_{\mathrm{Fr}} \ge C_0\sqrt{mn}$. If the spectrum is homogeneous, i.e., $\rho_1/\rho_r =: \kappa = O(1)$, then $r\rho_r^2 \ge \|Z\|_{\mathrm{Fr}}^2/\kappa^2$, and we also have

$\|Z\|_{\mathrm{sc},p}^p = \sum_{o=1}^{r} \rho_o^p \ge r\, \rho_r^p = r^{1-\frac{p}{2}}\big[r\rho_r^2\big]^{\frac{p}{2}} \ge \kappa^{-p}\, r^{1-\frac{p}{2}}\big[C_0^2\, mn\big]^{\frac{p}{2}} = \Omega\big(\sqrt{mn}^{\,p}\, r^{1-\frac{p}{2}}\big).$

Thus, if a matrix $Z$ has $\Omega(1)$ entries and an approximately uniform spectrum, then its Schatten quasi-norm is $\Omega\big(\sqrt{mn}\, r^{\frac{2-p}{2p}}\big)$, which justifies the notation: enforcing the constraint $\big[\frac{\|Z\|_{\mathrm{sc},p}}{\sqrt{mn}}\big]^{\frac{2p}{2-p}} \le r$ can be understood as a soft analogue of restricting the rank to $r$ or less, with a tolerance for additional singular values of very small magnitude. The tolerance is greater for larger values of $p$. A similar argument can easily be derived for the weighted case by substituting the estimates of the Frobenius norm with the following: $C_0^2 \le \|\tilde{Z}\|_{\mathrm{Fr}}^2 = \sum_{i,j} \bar{p}_i\, \bar{q}_j\, |Z_{i,j}|^2 \le C^2$. This leads to the conclusion that $\|\tilde{Z}\|_{\mathrm{sc},p}^p \le C^p\, r^{1-\frac{p}{2}}$ (and, in the case of a uniform spectrum, $C_0^p\, r^{1-\frac{p}{2}} \le \|\tilde{Z}\|_{\mathrm{sc},p}^p$ up to a factor of $\kappa^p$). This justifies the use of the notation $r$ for constraints imposed on $\|\tilde{Z}\|_{\mathrm{sc},p}^{\frac{2p}{2-p}}$.
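To ground the rank-like quantity $r$ and the weighted matrix $\tilde{Z}$ defined above, here is a small numerical sketch of our own (the helper names and the specific matrix construction are ours, purely for illustration): for a low-rank matrix with roughly unit-scale entries and an approximately flat spectrum, $\big[\|Z\|_{\mathrm{sc},p}/\sqrt{mn}\big]^{\frac{2p}{2-p}}$ stays within a constant factor of the true rank across several values of $p$, and the weighted Frobenius bound $\|\tilde{Z}\|_{\mathrm{Fr}} \le \max_{i,j}|Z_{i,j}|$ can be checked directly.

```python
import numpy as np

def schatten_p(Z, p):
    s = np.linalg.svd(Z, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

def rank_like_r(Z, p):
    """Rank-like quantity r = [ ||Z||_{sc,p} / sqrt(mn) ]^{2p/(2-p)}."""
    m, n = Z.shape
    return (schatten_p(Z, p) / np.sqrt(m * n)) ** (2 * p / (2 - p))

def weighted(Z, p_bar, q_bar):
    """tilde{Z} = diag(p_bar)^{1/2} Z diag(q_bar)^{1/2}."""
    return np.sqrt(p_bar)[:, None] * Z * np.sqrt(q_bar)[None, :]

rng = np.random.default_rng(2)
m, n, rank = 60, 50, 4
# A rank-4 matrix with roughly unit-scale entries and a roughly flat spectrum.
Z = np.sign(rng.standard_normal((m, rank))) @ np.sign(rng.standard_normal((rank, n))) / np.sqrt(rank)

for p in (1.0, 2 / 3, 1 / 2):
    print(p, rank_like_r(Z, p))   # stays within a constant factor of the true rank (4)

# Smoothed marginals of a sampling distribution P, and the weighted matrix.
P = rng.random((m, n)); P /= P.sum()
p_bar = 0.5 * P.sum(axis=1) + 0.5 / m
q_bar = 0.5 * P.sum(axis=0) + 0.5 / n
Z_tilde = weighted(Z, p_bar, q_bar)
print(np.linalg.norm(Z_tilde, 'fro') <= np.abs(Z).max())   # ||tilde Z||_Fr <= C = max |Z_ij|
```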
We provide the following results in this subsection: (i) a sample complexity result of $\tilde{O}\big((m+n)\, r\, p^{-1}\big)$ for matrix completion with the Schatten $p$ quasi-norm ($p \le 1$) weighted with the smoothed ground truth marginals (Theorem 3.1); in particular, this result applies to the unweighted Schatten quasi-norm regularized matrix completion problem in the uniform sampling regime; (ii) a sample complexity result of $\tilde{O}\big((m+n)^{1+\frac{p}{2}}\, r^{1-\frac{p}{2}}\, p^{-1}\big)$ for the unweighted Schatten quasi-norm regularized matrix completion problem in the distribution-free setting; (iii) an excess risk bound corresponding to a sample complexity of $\tilde{O}\big((m+n)\, r\big)$ for the empirically weighted Schatten quasi-norm regularized problem in the distribution-free setting, under the assumption that $p = \frac{2}{d}$ for some integer $d$. Furthermore, the factors of $p$ can be removed at the cost of an additional factor of $\log(B_0)$, where $B_0$ is an upper bound imposed on the entries. See also Table 5.

Theorem 3.1 (cf. Theorems C.1 and C.2). As in the rest of this paper, assume the loss function $l$ is $\ell$-Lipschitz and bounded by $B$. Let $p > 0$ be a fixed Schatten index and let $r > 0$ be a fixed real number. Consider the class $\mathcal{F}^p_t$ of matrices with Schatten quasi-norm bounded by $\sqrt{mn}\, r^{\frac{1}{p}-\frac{1}{2}}$:

$\mathcal{F}^p_t = \big\{ Z \in \mathbb{R}^{m \times n} : \|Z\|_{\mathrm{sc},p}^p \le \sqrt{mn}^{\,p}\, r^{1-\frac{p}{2}} \big\}.$

If the sampling distribution over entries is uniform, then for any $\delta > 0$, w.p. $\ge 1-\delta$ over the draw of the training set, every matrix $Z \in \mathcal{F}^p_t$ satisfies a generalization error bound of the form

$\mathbb{E}\, l(Z_\xi, \tilde{G}, \xi) - \hat{\mathbb{E}}\, l(Z_\xi, \tilde{G}, \xi) \le \tilde{O}\Bigg( \sqrt{\frac{(m+n)\, r}{N\, p}} \Bigg),$

where the $\tilde{O}$ hides the constants $B$ and $\ell$, the failure probability $\delta$, and logarithmic factors in $mn$, $N$, $\bar{\ell} := \ell + 1$ and $\bar{r} := r + 1$ (the precise expression is given in Theorem C.1). More generally, if the sampling distribution is arbitrary with smoothed marginals $\bar{p}$ and $\bar{q}$, the result holds for the class $\tilde{\mathcal{F}}^p_r = \big\{ Z \in \mathbb{R}^{m \times n} : \|\tilde{Z}\|_{\mathrm{sc},p}^p \le r^{1-\frac{p}{2}} \big\}$, where $\tilde{Z} = \operatorname{diag}(\bar{p})^{\frac{1}{2}} Z \operatorname{diag}(\bar{q})^{\frac{1}{2}}$. Furthermore, if one incorporates an enforced upper bound on the absolute values of all the entries, i.e. for the class $\tilde{\mathcal{F}}^p_{r,B_0} = \big\{ Z \in \mathbb{R}^{m \times n} : \|\tilde{Z}\|_{\mathrm{sc},p}^p \le r^{1-\frac{p}{2}};\ \|Z\|_\infty \le B_0 \big\}$, then we have instead (w.p. $\ge 1-\delta$)

$\mathbb{E}\, l(Z) - \hat{\mathbb{E}}\, l(Z) \le \tilde{O}\Bigg( \sqrt{\frac{(m+n)\, r\, \log(B_0)}{N}} \Bigg),$

without the factor of $\frac{1}{p}$. See Theorems C.1 and C.2 in the appendix for the full statements and proofs.

To illustrate the implications of Theorem 3.1, let us consider the idealized situation where the ground truth matrix $G$ has low rank $r_G$, the loss function $l$ is the truncated square loss $l(a, b, (i,j)) = \min\big((a-b)^2, 1\big)$, there is no noise in the observations and the sampling distribution is uniform. By Equation (3), we have $\|G\|_{\mathrm{sc},p} \le \sqrt{mn}\, r_G^{\frac{1}{p}-\frac{1}{2}}$. Thus, if we solve the following empirical risk minimization problem:

$\text{Minimize}_{Z}\ \ \frac{1}{N}\sum_{(i,j) \in \Omega} \min\big(|Z_{i,j} - G_{i,j}|^2, 1\big) \quad \text{subject to} \quad \|Z\|_{\mathrm{sc},p} \le \sqrt{mn}\, r^{\frac{1}{p}-\frac{1}{2}}$

for some $r_G \le r \le O(r_G)$, where the set of observed entries $\Omega$ is counted with multiplicity, Theorem 3.1 implies a high-probability error bound for the (global) minimizer $\hat{Z}$:

$\frac{1}{mn}\sum_{(i,j) \in [m] \times [n]} \min\big(|\hat{Z}_{i,j} - G_{i,j}|^2, 1\big) \le \tilde{O}\Bigg( \sqrt{\frac{(m+n)\, r}{N\, p}} \Bigg).$

This matches the sample complexity rate of $\tilde{O}(n\, r_G)$ achieved by nuclear norm regularization. However, the result is more general, since it is not necessary to impose $\mathrm{rank}(G) \le r$, only $\|G\|_{\mathrm{sc},p} \le \sqrt{mn}\, r^{\frac{1}{p}-\frac{1}{2}}$: thus, the sample complexity can adapt to the approximate low-rankness of the ground truth as expressed through its Schatten quasi-norm.

Sketch of proof of Theorem 3.1. The proof uses a novel technique we refer to as "parametric interpolation", which consists in interpolating between the regimes $p = 0$ and $p = 1$.
Since we need to use the boundedness of the loss function to get a tight bound on the parametric component, the combination also requires the refined multi-class chaining arguments from Lemma E.3, but we leave the details to the Appendix and focus on the intuition in this proof sketch. For simplicity, we treat B and ℓas constants and absorb all logarithmic factors of N, m, n, r into r O notation. See Theorems C.1 and C.2 (and the results they rely on, such as Theorem D.2) for details. At the left extreme (p Ñ 0), it is known that the class of matrices whose rank is explicitly restricted to some value r1 exhibits a sample complexity of r Opr1pm nqq (see (Srebro & Shraibman, 2005; Srebro et al., 2004) for an early discussion of a nearly identical problem where the target matrix is assumed to be in t 1, 1umˆn and the distribution is uniform, see (Vandermeulen & Ledent, 2021) for a covering number of the class of low rank matrices, see Lemma D.1). This is in line with the fact that explicitly rank-restricted matrix completion is a parametric model, leading to a sample complexity of the same order as the number of parameters, omitting logarithmic factors of the magnitude of them. More precisely, by Lemma D.1, the sample complexity of a bounded loss class associated to the set of matrices of rank r1 is r Opr1pm nq logp B0qq, where the r O notation hides logarithmic factors of n, m, N and r. Similarly, the sample complexity of matrices Z satisfying } r Z} ď ?r2 is r Oppm nqr2q by the more recent results of (Foygel et al., 2011). Next, for any matrix Z with } r Z}p sc,p ď r1 p 2 (and }Z}8 ď B0), we can write r Z r Z1 r Z2 where r Z1 is the sum of the terms in the singular value decomposition of r Z associated with a singular value greater than τ for some threshold τ. Writing ρ1, ρ2, . . . for the singular values of r Z, since } r Z}p sc,p ř ρp v ď r1 p 2 , by Markov s inequality, we have r1 rankp Z1q ď r1 p 2 τ p . (6) Furthermore, since all the singular values of r Z2 are bounded above by τ, the nuclear norm } r Z2} řn v r1 1 ρv can be controlled as r2 : } r Z2} v r1 1 pρvqppρvq1 p ď } r Z2}p sc,p τ 1 p ď r1 p Thus, the function class r Fp r,B0 is included in the function class Rτ Tτ, where Rτ " Z1 : rankp Z1q ď r1 p 2 τ p , }Z1} ď ?mn B0 Tτ ! Z2 : } r Z2} ď r1 p Thus by Lemma D.1 (parameter counting bound of Rτ) and Proposition F.5 (norm-based bound on the set of low nuclear norm matrices), together with the sample complexity of r Fp r,B0 can be upper bounded by r O ˆ pm nq r1 p 2 τ p logp B0q r2 p τ 2 2p ȷ . Setting the threshold as r 1 2 yields a sample complexity of r O ppm nqr logp B0qq, as expected. When no separate upper bound is enforced on the entries, we can still upper bound }Z}8 by 2?mn} r Z}sc,p ď 2?mnr 2p , which implies the additional factor logp B0q becomes r Oplogpr 2p qq r Op 1 Next, we also control the sample complexity of learning with the non-weighted trace norm under arbitrary sampling. Theorem 3.2 (Cf. Theorem C.3). Consider the following function class for 0 ă p ď 1: Fp t Z P Rmˆn : }Z}p sc,p ď Mp ?mn pr1 p Generalization Analysis of Deep Non-linear Matrix Completion W.p. ě 1 δ, every Z in Fp t satisfies E lp Zq p E lp Zq ď where c1 lnpmn Nr ℓ q and c2 lnpmn Nr Mp 1s ℓ q with ℓ ℓ 1 and r r 1. Thus (fixing B, ℓ) the sample complexity is r Opr1 p 2 {pq. For the class Fp r,B0 : ! Z P Rmˆn : }Z}sc,p ď M ?mnrrs 2p , }Z}8 ď B0 ) , then we have instead (w.h.p.) E lp Zq p E lp Zq ď where c3 lnpmn Nr B0 1srℓ 1sq. 
Remark: The above result reverts to the classic r Oppm nq 3 2 r 1 2 q when p 1. As p Ñ 0, the first bound in (7) blows up due to the factor of p 1 2 , whilst the second yields a complexity of r O ppm nq Mpq, in line with the parameter counting argument. Finally, an excess risk bound can be shown in the more realistic case where the function class restriction relies on the empirical marginals instead of the true marginals. Theorem 3.3 (Cf. Theorem C.4). Assume p 2 d for some integer d and that the ground truth is realizable: } G }sc,p ď 2 . Let p Z P arg min Z p Ep Zq : } q Z}p sc,p ď r2rs1 p 2 . We have the following excess risk bound w.h.p. (where ℓ ℓ 1 and r r 1): Eplp p Zqq Eplp Gqq ď Np ln r mn Nℓ The proof relies mostly on Lemma E.5, which is a generalization of Lemma 4 in (Foygel et al., 2011) to the case p 1. This lemma shows that for large enough N, the Schatten quasi-norms of q Z and r Z (for any Z) are within a small ratio of each other (w.h.p.). This allows us to show that the class q Fp 2r contains the ground truth with high probability. Note the constraint in our result is } q Z}p sc,p ď r2rs1 p 2 rather than } q Z}p sc,p ď r1 p 2 . This is in contrast to the case p 1 in (Foygel et al., 2011) with the more natural constraint } q Z} ď r. However, for practical purposes, the presence of the factor of 2 in our result is not an issue, since it merely slightly increases the cross-validation cost of the constraint parameter. Also, the result only works for p 2 d, i.e., when the optimization problem can be reformulated as p Z P arg min Z Eplp Zqq : q Z A v 1 Dv BJ : }A}2 Fr }B}2 Fr v 1 }Dv}2 Fr ď dr2rs1 p Remark: Interestingly, reformulating the condition as above makes the factors of p disappear from the bounds: if we reformulate that condition by writing r1 d 2 2 p r, so that r2r1s1 p 2 is an upper bound on }A}2 Fr }B}2 Fr řd 2 v 1 }Dv}2 Fr, the final sample complex- ity is r Oppm nqrp 1q r O pm nqr1d 2 2 p 1 , i.e. r O pm nqr1d 1 d 1 r O ppm nqr1q. Of course, this applies to Theorems 3.1 and 3.2 as well. 3.2. Generalization Bounds for FRMC We now move on to our results on a new class of models we refer to as Functionally Rescaled Matrix Completion (FRMC), where the predictors take the form fθp Zq where fθ is a trainable function and Z is a Schatten-constrained matrix. Thus, these models can be seen as an analogue of generalized linear models in Matrix Completion. In a nutshell, our results show that learning the rescaling function f can be done at negligible cost to function class capacity and generalization performance. Indeed, our generalization error bounds take the form of a sum of two terms, one corresponding to learning the complexity of the matrix class, and another one corresponding to the function class Flip,Lf,Bf of bounded Lipschitz functions from r B0, B0s Ñ R, which has very small function class capacity thanks to the low dimensionality (see Proposition F.12 from (von Luxburg & Bousquet, 2004) and (Tikhomirov, 1993)). Theorem 3.4 (Cf. Theorem C.5). Let Flip,Lf,Bf f : r B0, B0s Ñ R : }f}lip ď Lf; }f}8 ď Bf ( . Consider the following function class for our learning algorithm: Flip,Lf,Bf r Fp r,B0 : g : rms ˆ rns Ñ R : Df P Flip,Lf,Bf , Z P r Fp r,B0 : gpi, jq fp Zi,jq ( . 
With probability greater than 1 δ over the draw of the training set, the following holds for all g P Flip,Lf,Bf r Fp r,B0: Eplpgqq p Eplpgqq ď r O B2 B0 Lf ℓB Generalization Analysis of Deep Non-linear Matrix Completion Furthermore, similarly to Theorem 3.3 above, an excess risk result holds for the minimization problem over the empirically weighted class Flip,Lf,Bf q Fp 2r,B0 (cf. Thm C.7). In the distribution-free unweighted setting, we have: Theorem 3.5 (cf. Thm C.6). Consider the function class Flip,Lf,Bf Fp r,B0 : g : rms ˆ rns Ñ R : Df P Flip,Lf,Bf , Z P Fp r,B0 : gpi, jq fp Zi,jq W.p. ě 1 δ, for each g P Flip,Lf,Bf Fp r,B0, one has Eplpgqq p Eplpgqq ď r O 2 rℓLfs p 2 B2 B0 Lf ℓB Note that in all cases above, the incorporation of a learnable function f P Flip,Lf,Bf only contributes an additional B2 B0 Lf ℓB N to the generalization error bound: there is no dependence on the architectural or norm-based parameters such as m, n, r and the function f only needs to be learned once for the whole dataset. The interaction between this learning task and the learning of the low rank latent matrix does not introduce any additional challenge: the complexities of both tasks are disentanglable. It is worth noting that the proofs are far from being a trivial combination of the proofs from Subsection 3.1 above and Proposition F.12. Indeed, no sufficiently tight covering number bound is available for any of the function classes discussed in Section 3.12, not even for p 1: of course, it is possible to obtain such a cover respect to the Frobenius or L8 norms via covering numbers for linear function classes applied to the matrices A, B, D1, . . . , Dd 2, but this leads to loose covering number bounds that translate to vacuous results in terms of sample complexity. For instance, the error bound (applicable to a setting analogous to uniform sampling) for higher order tensors of (Fan et al., 2020; Fan, 2021) is based on a Frobenius covering number bound, but for the case of matrices and for p 1, the log covering number scales as r O rpm nq2 , which is vacuous. In fact, since the Rademacher complexities involved in the proofs of the results in Section 3.1 depend subtly on the sampling distribution, it is clear that the metric used in the cover must be carefully chosen. Moreover, the Frobenius norm doesn t seem to work well, not even in the uniform sampling case. 2except the parametric class Er,t of matrices with explicitly restricted rank Our proof of the results of this section relies instead on multi-class generalization of classic chaining arguments. More specifically, in Section E, we establish two generalizations of Dudley s Entropy Theorem, Lemma E.4 and Lemma E.3, which allow one to bound the Rademacher complexity of the function class Fp F1, F2q, where F is a fixed function and the following two conditions are satisfied: (1) A covering number is available for Fp F1, f2q, uniformly over any choice of f2 P F2 and (2) A Rademacher complexity complexity bound is available for Fpf1, F2q, uniformly over any choice of f1 P F1. Results with some similarities can be traced back to (Golowich et al., 2018) (Thm. 4) and (Ledent et al., 2021b) (Prop. A.4.). 3.3. Generalization Bounds with Neural Encodings In this section, we briefly describe some of our extended results for the Sd+NN setting, which includes an additional neural network encoding. 
Specifically, we consider neural encodings of the form Ψpi, jq fp A0pui, vjq Jq where ui is the embedding for row i, vj is the embedding for column j, f is the neural network given by fpxq σL AL Relu Relu . . . Relu A1x . . . , where the matrices R1ˆwl 1 Q AL, . . . , A1 are the weight matrices. The predictors then take the form gi,j Zi,j Ψi,j where Ψ is the neural encoding and Z is a matrix to which (potentially weighted) Schatten p quasi-norm regularization is applied. In particular, for p 2 3, the model corresponds to the one presented in (He et al., 2017). Our results (cf. Thm C.8, Thm C.9) show that the generalization error is bounded as a sum of terms corresponding to the matrix class and the neural encoding class. Extension to Multiple Latent Matrices: In the Appendix, we extend our results to the case of models of the form ϕ p Z1, Z2, . . . , Zm, Ψq where ϕ and Ψ are trainable networks (Ψ a neural encoding) and the matrices Z1, . . . , Zm are constrained via various Schatten p quasi-norms. 4. Experiments Synthetic Data Experiments: We generated synthetic square data matrices in Rnˆn with a specified rank r. We varied the proportion of observed entries in the generated matrices (%obs = Er N{n2s), with a non-uniform sampling distribution. A summary of the results is provided in Figure 1. The results demonstrate, unsurprisingly, that FRMC achieves better performance than methods which do not incorporate a non-linearity. Going deeper, we observe that with sufficiently many observations, the model is able to recover the ground truth function nearly perfectly, together with the low rank latent matrix. We provide an example of the recovered functions in Figure 2 for %obs P t0.14, 0.20u. Moreover, we observe that the weighted version of the Generalization Analysis of Deep Non-linear Matrix Completion Figure 1: Summary of the results of the synthetic data experiments. Ground-truth generated by considering fpxq σpxq. Figure 2: Learned function f by our model (yellow curve). Red curves represent the ground truth. Blue dots are the predictions, which ideally should lie under the red curve. regularizer works slightly better, especially with d 2. Finally, an exponential performance improvement occurs when d increases, especially from 2 to 3. This is in line with the expectation that lower values of p induce more and more rank-sparsity, and with our theoretical results in the distribution-free case, which show much better sample complexities when p is small (cf. our sample complexity of r O r1 p 2 ). For additional results, such as a comparison with the identity function as ground truth, see Figure 3. For detailed information, including the generation procedure, parameter selection, and validation setup, refer to the appendix in Section H.1. Real Data Experiments: MC can be applied to a wide range of domains, such as recommender systems, human event dynamics, and chemical engineering. We chose three standard datasets to evaluate our method in a real data scenario: DOUBAN and Movie Lens 25M (MVL25) from the recommender systems domain, and Last FM, which stores listening habits of users in a music streaming platform. For descriptions of datasets and implementation details of the real-world strand, refer to Section H.2 in the Appendix. In Figure 2, we plot the functions learned by our model on real data. 
Interestingly, we see that the chosen functions look somewhat sigmoidal, probably to avoid out-of-range predictions and model the vanishing significance of increments between very high or very low ratings. Furthermore, we observe that in 2 out of 3 datasets, our mildly non-linear Table 2: Test RMSE for the assessed methods. Notation: Weighted (W) models use weighted-norm regularization in the embeddings. Our methods learn a re-scaling function (FS). Thus, S2 refers to the traditional nuclear norm regularization, SW3 refers to weighted Schatten 2{3 norm regularization and FRMC-FSW2 refers to the model fp Zq with nuclear norm constraint on Z and a trainable componentwise rescaling function f. Model d W FS Douban Last FM MVL25 ˆ ˆ 0.8042 2.5885 0.8047 SW2 ˆ 0.7981 2.4980 0.7625 FRMC-FS2 ˆ 0.7627 1.0327 0.7795 FRMC-FSW2 0.7626 1.0091 0.7776 ˆ ˆ 0.8050 2.0512 0.7786 SW3 ˆ 0.8030 2.0417 0.7876 FRMC-FS3 ˆ 0.7674 0.9952 0.7711 FRMC-FSW3 0.7616 0.9904 0.7799 model FRMC substantially outperforms traditional matrix completion. Furthermore, p 2{3 outperforms p 1 and the weighted version outperforms the unweighted version. 5. Conclusion We studied matrix completion with Schatten p quasi-norm constraints for 0 ď p ď 1 in the approximate recovery setting. Ignoring the dependence on Lipschitz and boundedness constants, we provided sample complexity bounds of r Oprpm nqq and r Opr1 p 2 q in the uniform and arbitrary sampling regimes respectively. The results show the stronger rank-sparsity inducing properties of lower order Schatten p quasi-norms, which we also observe in our experiments. Moreover, we showed that the use of the weighted Schatten p quasi-norm can bring both rates back to r Oprpm nqq. We introduced a parsimonious non-linear model, Functionally Rescaled Matrix Completion (FRMC), which consists in applying a trainable function from R Ñ R to the entries of a latent matrix. We show extensions of all of our results to the FRMC setting, which demonstrate that the addition of a learnable function from R to R negligibly increases function class capacity. Generalization Analysis of Deep Non-linear Matrix Completion Acknowledgements Antoine Ledent s research was supported by the Singapore Ministry of Education (MOE) Academic Research Fund (Ac RF) Tier 1 grant. Rodrigo Alves thanks Recombee for supporting this research, specially while using the DGX Server. Impact Statement Our work is mostly theoretical in nature, and is unlikely to have a negative societal impact. Alves, R., Ledent, A., Assunc ao, R., and Kloft, M. An empirical study of the discreteness prior in low-rank matrix completion. In Neur IPS 2020 Workshop on Preregistration in Machine Learning, volume 148 of Proceedings of Machine Learning Research, pp. 111 125. PMLR, 11 Dec 2021. Angluin, D. and Valiant, L. G. Fast probabilistic algorithms for hamiltonian circuits and matchings. Journal of Computer and system Sciences, 18(2):155 193, 1979. Arora, S., Cohen, N., Hu, W., and Luo, Y. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems, 32, 2019. Bartlett, P. L. and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. In Helmbold, D. and Williamson, B. (eds.), Computational Learning Theory, pp. 224 240, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg. ISBN 978-3-540-44581-4. Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrallynormalized margin bounds for neural networks. In Guyon, I., Luxburg, U. 
V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6240 6249. Curran Associates, Inc., 2017. Boucheron, S., Lugosi, G., and Bousquet, O. Concentration inequalities. Lecture Notes in Computer Science, 3176: 208 240, 2004. Cai, T. T. and Zhou, W.-X. Matrix completion via max-norm constrained optimization. Electronic Journal of Statistics, 10(1):1493 1525, 2016. doi: 10.1214/16-EJS1147. Candes, E. J. and Plan, Y. Matrix completion with noise. Proceedings of the IEEE, 98(6):925 936, 2010. Cand es, E. J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009. Cand es, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053 2080, May 2010. Chen, Y. Incoherence-optimal matrix completion. Information Theory, IEEE Transactions on, 61, 10 2013. doi: 10.1109/TIT.2015.2415195. Chen, Y., Chi, Y., Fan, J., Ma, C., and Yan, Y. Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM journal on optimization, 30(4):3098 3121, 2020. Chen, Y., Chi, Y., Fan, J., Ma, C., et al. Spectral methods for data science: A statistical perspective. Foundations and Trends in Machine Learning, 14(5):566 806, 2021. Chiang, K.-Y., Hsieh, C.-J., and Dhillon, I. S. Matrix completion with noisy side information. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. Chiang, K.-Y., Dhillon, I. S., and Hsieh, C.-J. Using side information to reliably learn low-rank matrices from missing and corrupted observations. Journal of Machine Learning Ressearch, 2018. Dai, Z., Karzand, M., and Srebro, N. Representation costs of linear neural networks: Analysis and design. Advances in Neural Information Processing Systems, 34:26884 26896, 2021. De Handschutter, P., Gillis, N., and Siebert, X. A survey on deep matrix factorizations. Computer Science Review, 42:100423, 2021. Dziugaite, G. K. and Roy, D. M. Neural network matrix factorization. ar Xiv preprint ar Xiv:1511.06443, 2015. Fan, J. Multi-mode deep matrix and tensor factorization. In international conference on learning representations, 2021. Fan, J. and Cheng, J. Matrix completion by deep matrix factorization. Neural Networks, 98:34 41, 2018. Fan, J., Ding, L., Yang, C., Zhang, Z., and Udell, M. Euclidean-norm-induced schatten-p quasi-norm regularization for low-rank tensor completion and tensor robust principal component analysis. ar Xiv e-prints, pp. ar Xiv 2012, 2020. Foygel, R., Shamir, O., Srebro, N., and Salakhutdinov, R. R. Learning with the weighted trace-norm under arbitrary sampling distributions. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 2133 2141. Curran Associates, Inc., 2011. Generalization Analysis of Deep Non-linear Matrix Completion Foygel, R., Srebro, N., and Salakhutdinov, R. R. Matrix reconstruction with the local max norm. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. Giampouras, P., Vidal, R., Rontogiannis, A., and Haeffele, B. A novel variational form of the schatten-p quasi-norm. 
Advances in Neural Information Processing Systems, 33: 21453 21463, 2020. Gin e, E. and Guillou, A. On consistency of kernel density estimators for randomly censored data: Rates holding uniformly over adaptive intervals. Annales de l Institut Henri Poincare (B) Probability and Statistics, 37:503 522, 07 2001. doi: 10.1016/S0246-0203(01)01081-0. Golowich, N., Rakhlin, A., and Shamir, O. Size-independent sample complexity of neural networks. In Conference On Learning Theory, pp. 297 299. PMLR, 2018. Graf, F., Zeng, S., Rieck, B., Niethammer, M., and Kwitt, R. On measuring excess capacity in neural networks. Advances in Neural Information Processing Systems, 35: 10164 10178, 2022. Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548 1566, March 2011. ISSN 1557-9654. doi: 10.1109/TIT.2011.2104999. Guermeur, Y. Combinatorial and structural results for gamma-psi-dimensions. ar Xiv preprint ar Xiv:1809.07310, 2018. Guermeur, Y. Rademacher complexity of margin multicategory classifiers. Neural Computing and Applications, 32(24):17995 18008, 2020. Gui, Y., Barber, R., and Ma, C. Conformalized matrix completion. Advances in Neural Information Processing Systems, 36:4820 4844, 2023. Hagerup, T. and R ub, C. A guided tour of chernoff bounds. Inf. Process. Lett., 33:305 308, 1990. URL https://api.semanticscholar. org/Corpus ID:40984617. Hastie, T., Mazumder, R., Lee, J. D., and Zadeh, R. Matrix completion and low-rank svd via fast alternating least squares. Journal of Machine Learning Research, 16 (104):3367 3402, 2015. URL http://jmlr.org/ papers/v16/hastie15a.html. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pp. 173 182, 2017. He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., and Wang, M. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pp. 639 648, 2020. Jacot, A. Implicit bias of large depth networks: a notion of rank for nonlinear functions. ar Xiv preprint ar Xiv:2209.15055, 2022. Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 42(8): 30 37, 2009. Lara-Cabrera, R., Gonz alez-Prieto, A., and Ortega, F. Deep matrix factorization approach for collaborative filtering recommender systems. Applied Sciences, 10(14):4926, 2020. Latała, R. Some estimates of norms of random matrices. Proceedings of the American Mathematical Society, 133 (5):1273 1282, 2005. ISSN 00029939, 10886826. Ledent, A., Alves, R., and Kloft, M. Orthogonal inductive matrix completion. IEEE Transactions on Neural Networks and Learning Systems, pp. 1 12, 2021a. doi: 10.1109/TNNLS.2021.3106155. Ledent, A., Alves, R., Lei, Y., and Kloft, M. Fine-grained generalization analysis of inductive matrix completion. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 25540 25552. Curran Associates, Inc., 2021b. Ledent, A., Mustafa, W., Lei, Y., and Kloft, M. Norm-based generalisation bounds for deep multi-class convolutional neural networks. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):8279 8287, May 2021c. Ledent, A., Alves, R., Lei, Y., Guermeur, Y., and Kloft, M. 
Generalization bounds for inductive matrix completion in low-noise settings. Proceedings of the AAAI Conference on Artificial Intelligence, 37(7):8447 8455, Jun. 2023. doi: 10.1609/aaai.v37i7.26018. Ledoux, M. and Talagrand, M. Probability in Banach spaces : isoperimetry and processes. Springer, Berlin [u.a.], 1991. ISBN 3540520139. Li, R., Dong, Y., Kuang, Q., Wu, Y., Li, Y., Zhu, M., and Li, M. Inductive matrix completion for predicting adverse drug reactions (adrs) integrating drug target interactions. Chemometrics and Intelligent Laboratory Systems, 144: 71 79, 2015. ISSN 0169-7439. doi: https://doi.org/10. 1016/j.chemolab.2015.03.013. Generalization Analysis of Deep Non-linear Matrix Completion Liu, L., Huang, W., and Chen, D.-R. Exact minimum rank approximation via schatten p-norm minimization. Journal of Computational and Applied Mathematics, 267: 218 227, 2014. Long, P. M. and Sedghi, H. Size-free generalization bounds for convolutional neural networks. In International Conference on Learning Representations, 2020. Ma, W. and Chen, G. H. Missing not at random in matrix completion: The effectiveness of estimating missingness probabilities under a low nuclear norm assumption. Advances in neural information processing systems, 32, 2019. Mao, K., Zhu, J., Xiao, X., Lu, B., Wang, Z., and He, X. Ultragcn: ultra simplification of graph convolutional networks for recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 1253 1262, 2021. Mazumder, R., Hastie, T., and Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11: 2287 2322, August 2010. Meir, R. and Zhang, T. Generalization error bounds for bayesian mixture algorithms. J. Mach. Learn. Res., 4 (null):839 860, dec 2003. ISSN 1532-4435. Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2 edition, 2018. ISBN 978-0-262-03940-6. Mustafa, W., Lei, Y., Ledent, A., and Kloft, M. Fine-grained generalization analysis of structured output prediction. In Zhou, Z.-H. (ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI21, pp. 2841 2847. International Joint Conferences on Artificial Intelligence Organization, 8 2021. Main Track. Pal, S. and Jain, P. Online low rank matrix completion. In The Eleventh International Conference on Learning Representations, 2022. Pal, S., Sai Suggala, A., Shanmugam, K., and Jain, P. Optimal algorithms for latent bandits with cluster structure. In Ruiz, F., Dy, J., and van de Meent, J.-W. (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pp. 7540 7577. PMLR, 25 27 Apr 2023. Platen, E. Pollard, d.:convergence of stochastic processes. (springer series in statistics). springer-verlag, new york - berlin - heidelberg - tokyo 1984, 216 pp., 36 illustr., dm 82. Biometrical Journal, 28(5):644 644, 1986. doi: 10.1002/bimj.4710280516. Qiaosheng, Zhang, Tan, V. Y. F., and Suh, C. Community Detection and Matrix Completion with Two-Sided Graph Side-Information. ar Xiv e-prints, art. ar Xiv:1912.04099, December 2019. Recht, B. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(null):3413 3430, December 2011. ISSN 1532-4435. Recht, B., Fazel, M., and Parrilo, P. A. 
Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471 501, 2010. doi: 10.1137/070697835. Scott, C. Rademacher complexity. Lecture Notes, Statistical Learning Theory, 2014. Shamir, O. and Shalev-Shwartz, S. Collaborative filtering with the trace norm: Learning, bounding, and transducing. In Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pp. 661 678. PMLR, 2011. Shamir, O. and Shalev-Shwartz, S. Matrix completion with the trace norm: Learning, bounding, and transducing. Journal of Machine Learning Research, 15:3401 3423, 2014. Sinclair, A. Lecture notes for the course CS271 Randomness and computation . URL https: //web.archive.org/web/20141031035717/ http://www.cs.berkeley.edu/ sinclair/ cs271/n13.pdf. Sportisse, A., Boyer, C., and Josse, J. Imputation and lowrank estimation with missing not at random data. Statistics and Computing, 30(6):1629 1643, 2020. Srebro, N. Learning with matrix factorizations. Ph D Thesis, 2004. Srebro, N. and Jaakkola, T. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances In Neural Information Processing Systems 17, pp. 5 27. MIT Press, 2005. Srebro, N. and Salakhutdinov, R. R. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. In Lafferty, J. D., Williams, C. K. I., Shawe Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 2056 2064. Curran Associates, Inc., 2010. Srebro, N. and Shraibman, A. Rank, trace-norm and maxnorm. In Auer, P. and Meir, R. (eds.), Learning Theory, pp. 545 560, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. ISBN 978-3-540-31892-7. Generalization Analysis of Deep Non-linear Matrix Completion Srebro, N., Alon, N., and Jaakkola, T. Generalization error bounds for collaborative prediction with low-rank matrices. Advances In Neural Information Processing Systems, 17, 2004. Talagrand, M. Sharper bounds for gaussian and empirical processes. The Annals of Probability, 22(1):28 76, 1994. ISSN 00911798. Talagrand, M. New concentration inequalities in product spaces. Inventiones mathematicae, 126(3):505 563, Nov 1996. ISSN 1432-1297. doi: 10.1007/s002220050108. Tikhomirov, V. M. ϵ-Entropy and ϵ-Capacity of Sets In Functional Spaces, pp. 86 170. Springer Netherlands, Dordrecht, 1993. ISBN 978-94-017-2973-4. doi: 10. 1007/978-94-017-2973-4 7. Trigeorgis, G., Bousmalis, K., Zafeiriou, S., and Schuller, B. W. A deep matrix factorization method for learning attribute representations. IEEE transactions on pattern analysis and machine intelligence, 39(3):417 429, 2016. Tropp, J. A. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12 (4):389 434, 2012. Vandermeulen, R. A. and Ledent, A. Beyond smoothness: Incorporating low-rank analysis into nonparametric density estimation. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 12180 12193. Curran Associates, Inc., 2021. Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. von Luxburg, U. and Bousquet, O. Distance-based classification with lipschitz functions. J. Mach. Learn. Res., 5 (Jun):669 695, 2004. Wang, J., Wong, R. K., Mao, X., and Chan, K. C. G. Matrix completion with model-free weighting. 
In International Conference on Machine Learning, pp. 10927 10936. PMLR, 2021. Wang, Q., Sun, M., Zhan, L., Thompson, P., Ji, S., and Zhou, J. Multi-modality disease modeling via collective deep matrix factorization. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1155 1164, 2017. Wang, Z. and Jacot, A. Implicit bias of sgd in l {2}- regularized linear dnns: One-way jumps from high to low rank. ar Xiv preprint ar Xiv:2305.16038, 2023. Wei, S., Wang, J., Yu, G., Domeniconi, C., and Zhang, X. Multi-view multiple clusterings using deep matrix factorization. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 6348 6355, 2020. Wu, L., Ledent, A., Lei, Y., and Kloft, M. Fine-grained generalization analysis of vector-valued learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35 (12):10338 10346, May 2021. doi: 10.1609/aaai.v35i12. 17238. Xu, M., Jin, R., and Zhou, Z.-H. Speedup matrix completion with side information: Application to multi-label learning. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS 13, pp. 2301 2309, Red Hook, NY, USA, 2013. Curran Associates Inc. Xue, H.-J., Dai, X., Zhang, J., Huang, S., and Chen, J. Deep matrix factorization models for recommender systems. In IJCAI, volume 17, pp. 3203 3209. Melbourne, Australia, 2017. Zhang, M. and Chen, Y. Inductive matrix completion based on graph neural networks. In International Conference on Learning Representations, 2020. Zhang, M., Huang, Z.-H., and Zhang, Y. Restricted p - isometry properties of nonconvex matrix recovery. IEEE Transactions on Information Theory, 59(7):4316 4323, 2013. doi: 10.1109/TIT.2013.2250577. Zhang, Q., Suh, G., Suh, C., and Tan, V. Y. F. Mc2g: An efficient algorithm for matrix completion with social and item similarity graphs. IEEE Transactions on Signal Processing, 70:2681 2697, 2022. doi: 10.1109/TSP.2022. 3174423. Zhang, X., Du, S., and Gu, Q. Fast and sample efficient inductive matrix completion via multi-phase procrustes flow. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5756 5765, Stockholmsm assan, Stockholm Sweden, 10 15 Jul 2018. PMLR. Zhao, H., Ding, Z., and Fu, Y. Multi-view clustering via deep matrix factorization. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017. Generalization Analysis of Deep Non-linear Matrix Completion A. Table of Notations Table 3: Table of notations for quick reference Notation Meaning Sampling setting G P Rmˆn Ground truth matrix N Number of samples S tpξ1, r G1q, . . . , pξN, r GNqu Training set ζo Noise in oth observation r Go Gξo ζo oth observation ξ P rms ˆ rns (resp. ξorms ˆ rns) Observed entry (oth resp. 
observed entry) l Loss function lpgξ0, r Go, ξoq lopgξoq Loss function at oth datapoint pi,j Ppξ pi, jqq, (marginal) probability of observing entry i, j pi ř j pi,j Row i marginal probability qj ř i pi,j Column j marginal probability p Ep Fpξ, r Gqq 1 N řN o 1 Fpξo, r Goq Empirical expectation of F ei P Rm Indicator vector of ith row ej P Rn Indicator vector of jth column (Weighted) norms }.} Spectral norm }.}Fr Frobenius norm }.} Nuclear norm }.}sc,p Schatten p quasi-norm (p ď 1) řn j 1 břm i 1 A2 i,j řN o 1 1pξoq1 i N ith empirical row marginal řN o 1 1pξoq2 j N jth empirical column marginal pi 1 2pi 1 2m Smoothed row marginal qj 1 2qj 1 2n Smoothed column marginal ˇpi 1 2pi 1 2m Smoothed empirical row marginal ˇqj 1 2qj 1 2n Smoothed empirical column marginal r Z diagp pq 1 2 Z diagp qq 1 2 q Z diagpˇpq 1 2 Z diagpˇqq 1 2 M (in Fn Class definitions) Upper bound on }.}sc,p r (in Fn Class definitions) Upper bound on } r Z} 2p 2 p sc,p , } q Z} 2p 2 p sc,p or }Z} 2p 2 p sc,p ?mn 2p d Depth of deep matrix factorization A ś Di BJ d ) (also d) width of 1st layer after embedding in Ψ P N1,W Definitions of r Unweighted Setting r M ?mn ı 2p Weighted Setting } r Z}p sc,p ď r1 p 2p 2 p sc,p Matrix Function Classes Er,t t R P Rmˆn : }R} ď t, rankp Rq ď ru r F1 r Rmˆn Q Z : } r Z} ď ?r ( Rmˆn Q Z : } r Z} ď ?r, }Z}8 ď B0 ( Z P Rmˆn : } r Z}p sc,p ď r1 p Z P Rmˆn : }Z}sc,p ď M ( Generalization Analysis of Deep Non-linear Matrix Completion ! Z P Rmˆn : }Z}sc,p ď M rrs 2p ?mn, }Z}8 ď B0 ) Z P Rmˆn : } r Z}p sc,p ď r1 p 2 ; }Z}8 ď B0 ( Z P Rmˆn : } q Z}p sc,p ď rrs1 p Z P Rmˆn : } q Z}p sc,p ď rrs1 p 2 , }Z}8 ď B0 ( Other function classes Lℓ,B Set of all loss functions bounded by B from R2 ˆ prms ˆ rnsq to R, which are ℓ-Lipschitz in the first argument Flip,Lf,Bf Set of all Bf-bounded, Lf-Lipschitz functions DNN classes ϕ : Rm 1 Ñ R Neural Network with final output Ψ : rms ˆ rns Ñ R Encoder Net taking users and items as input N1,W Networks satisfying conditions (261) N2,W Networks satisfying conditions (265) N0,W,cpa, s, cq g : rms ˆ rns Ñ R1ˇˇDf P N1,Wpa, sq, U P Rmˆ m, V P Rnˆ m : }U}2 Fr }V }2 Fr ď c2 maxpm, nq, }A0} ď s0 : gpi, jq fp A0pui, vjq Jq @i, j Č N0,W,cpa, s, cq g : rms ˆ rns Ñ R1ˇˇDf P N1,Wpa, sq, U P Rmˆ m, V P Rnˆ m : } diagprpq 1 2 U}2 Fr } diagprqq 1 2 V }2 Fr ď c2, }A0} ď s0 : gpi, jq fp A0pui, vjq Jq @i, j N0,W,cpa, s, cq g : rms ˆ rns Ñ R1ˇˇDf P N1,Wpa, sq, U P Rmˆ m, V P Rnˆ m : } diagpqpq 1 2 U}2 Fr } diagpqqq 1 2 V }2 Fr ď c, }A0} ď s0 : gpi, jq fp A0pui, vjq Jq @i, j N1,W,idpa, sqi,j ϕpxξq where ϕ is a network form (259) satisfying Cond. 
(265) and xi,j : concatpei, ejq Composite function classes (illustrative examples) Flip,Lf,Bf Fp r,B0 g : rms ˆ rns Ñ R : Df P Flip,Lf,Bf , Z P Fp r,B0 : gpi, jq fp Zi,jq ( Z0 Latent matrix in representation of ground truth G as f Z0 Flip,Lf,Bf r Fp r,B0 g : rms ˆ rns Ñ R : Df P Flip,Lf,Bf , Z P r Fp r,B0 : gpi, jq fp Zi,jq ( r Fp r,B0 N0,W,c g : rms ˆ rns Ñ R : DZ P r Fp r,B0 Ψ P N0,W,c : gpi, jq Zi,j Ψpi, jq ( l N2,W pa1, s1qp r Fp r , N2,W pa, sqq Set of functions G written as Gpξ, r Gq lpϕ1p Z ϕ2q, r G, ξq for some Z P r Fp r , N2,W pa1, s1q Q ϕ1 : R2 Ñ R, N2,W pa, sq Q ϕ2 : rms ˆ rns Ñ R Constants ℓ Lipschitz constant of l B Bound on the loss function l Generalization Analysis of Deep Non-linear Matrix Completion C Constant from (Latała, 2005) C1 maxp C, 1q Lf Bound on the Lipschitz constant of f Bf Bound on the values of f m Number of latent matrices Constants in Neural Networks W Number of parameters (of DNN) L Number of layers (of DNN) w1, . . . , w L 1 Layer widths s1, . . . , s L Constraints on }W1}, . . . , }WL} a1, . . . , a L (in N1,W) Constraints on }p W 1 M 1q J}2,1, . . . , }p W 1 M 1q J}2,1 a1, . . . , a L (in N2,W ) Constraints on }W 1 M 1}, . . . , }W 1 M 1} W 1, . . . , W L Weight matrices (of DNN) M 1, . . . , M L Initialised weights (of DNN) s0 Upper bound on }A0} c2 (with weights) Upper bound on } diagprpq 1 2 U}2 Fr } diagprqq 1 2 V }2 Fr c2 (without weights) Upper bound on }U}2 Fr }V }2 Fr maxpm, nq řL ℓ 1 23{2 śL ℓ 1 sℓ řL ℓ 1 Q aℓ sℓ Constants in log terms Γ r Fp r ,ℓ 6pm nq Npℓ 1qpr 1q δ ΓFp t ,ℓ 6Npm nqpr 1qpℓ 1q ΓFp r,B0,ℓ 6Npm nqpr 1qpℓ 1qp B0 1q Γ r Fp r,B0,ℓ 3Nmn3r B0 1srℓ 1s 1 96W s0pm nq?mn śL ℓ 1 sℓ ϵ 1 For multiple latent matrices m Number of latent matrices pv vth latent matrix Schatten index rv Constraint on }Z} sr řm v 1 rv ΓW,m 12N śL ℓ 1 sℓ m B0 ı śL ℓ 1 s1 ℓ ı śL ℓ 1 sℓ ı rř ℓaℓs 1 Γ 3Nmn3r B0 1srℓ 1s 1 H N2,W pa1, s1q pconcatm v 1p r Fpv rv,B0q, N1,W,idpa, sqq Model abbreviations Name and function class Sd Schatten matrix completion (MC) Fp t , Fp r,B0 SWd Schatten weighted MC r Fp r , r Fp r,B0, q Fp r , q Fp r,B0 FRMC-FSd Functionally rescaled Schatten MC Flip,Lf,Bf Fp t , Flip,Lf,Bf Fp r,B0 FRMC-FSWd Functionally rescaled Schatten weighted MC Flip,Lf,Bf q Fp r,B0, Flip,Lf,Bf q Fp r,B0 FRMC-Sd+NN Sum of Sd and NN Fp t N0,W,c, Fp r,B0 N0,W,c FRMC-SWd+NN Sum of SWd and NN r Fp r N0,W,c, q Fp r N0,W,2c, r Fp r,B0 N0,W,c , etc. Generalization Analysis of Deep Non-linear Matrix Completion B. Table summary of results Table 4: Table of our results for the Schatten p constrained matrix completion without linear components as compared to the previous works for p 1, p ă 1 and p 0. For simplicity, we omit polylogarithmic factors in all relevant quantities such as N, m, n, B, B0, Bf, ℓand the failure probability δ.). NC stands for Not comparable , NDC stands for Not directly comparable . The compressed sensing literature (Zhang et al., 2013; Recht et al., 2010; Liu et al., 2014) offers results which can loosely be compared to an exact recovery sample complexity of r Opm nqr where r is the ground truth rank with uniform RIP measurements (e.g. Gaussian measurements, loosely analogous to uniform sampling). (Fan, 2021) includes somewhat different assumptions. The rank-restricted version (p 0, c.f. Lemma D.1) is a simple consequence of parameter counting (see (Long & Sedghi, 2020; Graf et al., 2022; Mohri et al., 2018; Ledent et al., 2021b; Gin e & Guillou, 2001; Platen, 1986; Talagrand, 1994; 1996)). 
There is also an analogous result for classification with uniform sampling (Srebro et al., 2004; Srebro & Shraibman, 2005; Srebro & Jaakkola, 2005). Constraint Sampling Our Bound Previous Work Comment }Z} ď ?rmn M Uniform 1 N minpm,nq (Thm 3.1) 1 N minpm,nq N (Foygel et al., 2011) }Z} ď ?rmn M Arbitrary B ℓpm nq 3 2 ?r N M b N (Thm 3.2) B ℓpm nq 3 2 ?r N M b N (Shamir & Shalev-Shwartz, 2011) } r Z} ď ?r Arbitrary B b N (Thm 3.1) N (Foygel et al., 2011) }Z}p sc,p ď Mp 2 ?mnp Uniform M 2p 2 p pm nq 2 3p 2 p Np (Thm 3.1) rn 2 2 p Np 2 p M 2p 2 p Np (Fan, 2021) B, ℓconstant; n m no replacement }Z}p sc,p ď Mp 2 ?mnp Arbitrary 2 Np (Thm 3.2) N/A NC to Comp. sensing } r Z}p sc,p ď r1 p 2 Arbitrary B Np (Thm 3.1) }Z}8 ď B0 rankp Zq ď r Arbitrary N logpℓB0q Parameter counting (Lemma D.1) N logpℓB0q Parameter counting cf. also (Srebro, 2004) (Mohri et al., 2018) Generalization Analysis of Deep Non-linear Matrix Completion Table 5: Very short summary of our results for Schatten quasi-norm matrix completion including dependence on p. For simplicity, the r O notation hides logarithmic factors of the relevant quantities, including of the failure probability δ and the constraint quantity B0. Main constraint Sampling }Z}8 unconstrained }Z}8 ď B0 } Z ?mn}p sc,p ď r1 p } Z ?mn}p sc,p ď r1 p 2 Arbitrary } r Z}sc,p ď r1 p 2 Arbitrary Generalization Analysis of Deep Non-linear Matrix Completion Table 6: detailed tabular summary of our results in the supplementary, expressed in terms of generalization error bounds. The r O notation hides polylogarithmic factors in all variables and constraints. Similar excess risk bounds hold for all function classes under the condition p 2 d for some d. Similar excess risk bounds hold for the empirically weighted analogues (with a multiple of the constraint r) under both realisability assumptions and the assumption p 2 d. See Thms C.4, C.7 and C.10. Function class Generalization Bound Relevant Theorem Sd MC l r Fp r O ˆ B Np logp r mn Nℓ l r Fp r,B0 O ˆ B N logp mn Nr ℓ B0 2 logpmn N ℓ r q Np B b l Fp r,B0 O ˆ B1 p 2 logpmn N ℓ r B0q l Flip,Lf,Bf r Fp r,B0 2 p r Lf ℓs N logp Nq logp mn B0rℓLf 1s B0 Lf ℓB B2 N logp Nq B b l Flip,Lf,Bf Fp r,B0 2 r Lf ℓs p 2 b 2 N log 3 2 p Nmn B0rℓLf 1sq B0 Lf ℓB B2 N logp Nq B b l p r Fp r,B0 Č N0,W,cq r O N B s0c RW ? l p Fp r,B0 N0,W,cq N B s0c?m n RW ? Multi-latent extension H : N2,W pa1, s1q pconcatm v 1p r Fpv rv,B0q, N1,W,idpa, sqq N B0 S1 ℓ b where S1 śL ℓ 1 s1 ℓ ı Generalization Analysis of Deep Non-linear Matrix Completion C. Generalization and Excess Risk Results In this section, we prove our main results. Many of the key difficulties involved in the proofs have already been overcome in the proofs of the relevant partial results in Sections D, which itself relies on lower level tools from Section E. C.1. Generalization and Excess Risk Bounds Schatten Norm Matrix Completion (Sd and SWd) In this subsection, we prove generalization and excess risk bounds for ordinary matrix completion (without a non-linear component) with Schatten norm regularization. Theorem C.1. Let l P Lℓ,B be a loss function. We consider the function class r Fp r : ! Z P Rmˆn : } r Z}p sc,p ď r1 p 2 ) . Let ˆZ : min ZP r Fp r p Eplp Zξ, r Gqq. With probability greater than 1 δ, we have the following excess risk bound: Eplp ˆZξ, r Gξqq Eplp Gξ, r Gξqq ď 12 B Np logp2Γ r Fp r ,ℓq where Γ r Fp r ,ℓ: 6pm nq Npℓ 1qpr 1q In particular, if the sampling distribution is uniform, the same result holds for Fp t . 
Furthermore, the same upper bound holds for sup ZP r Fp r Eplp Z, r Gqq p Eplp Z, r Gqq. (In fact, the generalization bound holds with a factor of 1{2 on the right with probability ě 1 δ.) Proof. This follows immediately from Theorem D.2, and Theorem F.11. Very similarly, we have the following result which applies with an additional constraint on the maximum entry: Theorem C.2. Let l P Lℓ,B be a loss function. Consider the following function class: r Fp r,B0 : ! Z P Rmˆn : } r Z}p sc,p ď r1 p 2 ; }Z}8 ď B0 ) . (11) Let ˆZ : min ZP r Fp r p Eplp Zξ, r Gqq. With probability greater than 1 δ, we have the following excess risk bound: Eplp ˆZξ, r Gξqq Eplp Gξ, r Gξqq ď 12 B N logp2Γ r Fp r,B0,ℓq where Γ r Fp r,B0,ℓ: 3Nmn3r B0 1srℓ 1s 1 δ . In particular, in the case of a uniform distribution, the same result holds for Fp t . The same result also holds for the generalization error sup Z r Fp r,B0 Eplp Zξ, r Gξqq s Eplp Zξ, r Gξqq. Proof. This follows immediately from Theorem D.4, and Theorem F.11. Next, we consider the case of a non-uniform distribution with non-weighted trace norm constraints: Theorem C.3. Consider the following function class: Fp t : ! Z P Rmˆn : }Z}sc,p ď M rr?mns Generalization Analysis of Deep Non-linear Matrix Completion With probability ě 1 δ, excess risk bound for ˆZ P arg min ZPFp t p Eplp Z, r Gqq: Eplp ˆZξ, r Gξqq Eplp Gξ, r Gξqq ď 12 B 2 log 2ΓFp t ,ℓ where ΓFp t ,ℓ: 6Npm nqpr 1qpℓ 1q. Furthermore, if we consider instead the optimization over the function class Fp r,B0 : ! Z P Rmˆn : }Z}sc,p ď M rr?mns 2p , }Z}8 ď B0 ) (15) then we have instead Eplp ˆZξ, r Gξ, ξqq Eplp Gξ, r Gξ, ξqq ď 12 B g f f er1 p 2 log 2ΓFp r,B0,ℓ where ΓFp r,B0,ℓ: 6pm nq Nr B0 1srℓ 1s. Furthermore,the same upper bounds hold for sup ZP r Fp r,B0 Eplp Z, r G, ξqq p Eplp Z, r G, ξqq. Proof. This follows immediately from Theorems D.3, D.5 and F.11. Next, we provide an excess risk result for the empirically weighted version. Theorem C.4. Assume that p 2 d for some integer d. Let p Z P arg min p Eplp Zξ, r Gqq : Z P q Fp 2r . If we assume that the ground truth G belongs to r Fp r , we have the following excess risk bound, which holds with probability ě 1 δ under the condition that N ě 140pm nq log 3pm nq Eplp p Zξ, r G, ξqq Eplp G, r G, ξqq ď 12 B Np logp6Γ r Fp r ,ℓq. (17) Furthermore, the upper bound also holds for the generalisation error (with the same error probability), and an analogous result holds for the class q Fp r,B0, with Γ r Fp r ,ℓreplaced by Γ r Fp r,B0,ℓand with the factor of p removed. Proof. By lemma E.5 we have, with probability ě 1 δ{3, }| Z0}2{d sc,2{d ď g f f e6pm nq log 3pm nq 2 d sc,2{d. (18) In particular, as long as N ě 24pm nq log 3pm nq 2 1 2 (note that since p is at most 1, this is satisfied as long as N ě 140pm nq log pm nq δ ) we certainly have }| Z0}p sc,p ď 21 p 2 }Ă Z0}p sc,p ď r2rs1 p Generalization Analysis of Deep Non-linear Matrix Completion This implies that G P q Fp 2r. (20) Similarly, by Lemma F.10, we also have with probability ě 1 δ{3 (as long as N ě 8pm nq logp 3pm nq δ q, which is already required by the stronger condition for (19)): 2 and ˇqj ě qj Under this condition, for any matrix Z P q Fp 2r, } r Z}p sc,p } a diagp pq diagpˇpq 1 q Z a diagp qq diagpˇqq 1q}p sc,p ď 2p} q Z}p sc,p ď 2pr2rs1 p 2p 2 p 2rs1 p 2 ď r4rs1 p Hence, we certainly have: q Fp 2r Ă r Fp 4r. (22) Now, by Theorem C.1 we have w.p. 
ě 1 δ{3 simultaneously over all Z P r Fp 4r: Eplp Zξ, r Gξqq p Eplp Gξ, r Gξqq ď 6 B Np logp6Γ r Fp 4r,ℓq Thus after a union bound, equations (19), (22) an (23) hold simultaneously with probability 1 δ and applying this to p Z and the ground truth G, we obtain: Eplp p Zξ, r Gqq Eplp G, r Gqq ď Eplp p Zξ, r Gqq p Eplp p Zξ, r Gqq p Eplp p Zξ, r Gqq p Eplp Gξ, r Gqq p Eplp Gξ, r Gqq Eplp G, r Gqq Np logp6Γ r Fp r ,ℓq as expected. C.2. Generalization and Excess Risk Bounds for Functionally Rescaled Schatten quasi-norm Matrix Completion (FRMC-FSd and FRMC-FSWd) We consider the following function class: Flip,Lf,Bf r Fp r,B0 : ! g : rms ˆ rns Ñ R : Df P Flip,Lf,Bf , Z P r Fp r,B0 : gpi, jq fp Zi,jq ) . (25) Theorem C.5. With probability greater than 1 δ over the draw of the training set we have the following bound on the empirical Rademacher complexity of the class Flip,Lf,Bf r Fp r : p Rp Flip,Lf,Bf r Fp r,B0q ď 150 B0 Lf Bf Bf 2 1 ? N log2p Nq (26) N log2p2NΓ r Fp r,B0,ℓLfq where Γ r Fp r,B0,Lf : 3Nmn3r B0 1sr Lf 1s 1 Generalization Analysis of Deep Non-linear Matrix Completion In particular, for any fixed loss function l P Lℓ,B, with probability greater than 1 δ over the draw of the training set, we have the following generalization bound for any f Z P Flip,Lf,Bf r Fp r : Epℓpfp Zξq, r Gqq 1 o 1 ℓopfp Zξoqq ď 6 B B0 Lf ℓB B2 1 ? N log2p Nq (27) 2 p p Lf ℓq N log2p4NΓ r Fp r,B0,ℓLfq Furthermore, we also have the following excess risk bound, which holds with probability greater than 1 δ: Epℓpˆgpi, jq, r Gq Epℓpg pi, jq, r Gq B0 Lf ℓB B2 1 ? N log2p Nq (28) 2 p p Lf ℓq N log2p4NΓ r Fp r,B0,ℓLfq where g and ˆg denote ming PFlip,Lf ,Bf r Fp r,B0 Eplpgξ, r Gqq and ming PFlip,Lf ,Bf r Fp r,B0 p Eplpgξ, r Gqq respectively. In particular, if the distribution is uniform, the same results hold for the function class Flip,Lf,Bf Fp r,B0. Proof. By Proposition F.13 with d 1, for every ϵ ą 0, there exists a uniform cover Cpϵq of Flip,Lf,Bf with cardinality satisfying log p| Cpϵq |q ď 3 R2 B0 Lf V 1 ȷ . (29) Note that since this is a cover of Flip,Lf,Bf with respect to the uniform norm, it satisfies the properties of Lemma E.4, to wit, for any matrix Z P r Fp r,B0, |pf fqp Zi,jq| ď ϵ (and therefore | lpfp Zi,jq, r G, pi, jqq lp fp Zi,jq, r G, pi, jqq| ď ϵ ℓholds uniformly over any matrix Z and any input pi, jq (this is the condition from Lemma E.3 (cf. Eq. (174))), which is stronger than that in Lemma E.4 (cf. Eq. (183))). Thus, we can apply our Lemma E.4 with Θ1 Flip,Lf,Bf and Θ2 r Fp r . p Rpl Flip,Lf,Bf r Fp r,B0q ď Eσ sup θ1,2PΘ1,2 i 1 σifipθ1, θ2q (30) sup θ1PΘ1 p R p Fθ1q 4α 4 ? logp Np Flip,Lf,Bf , ϵ{ ℓqq For the first term, note that by Theorem D.2, with probability ě 1 δ over the draw of the training set, we actually have sup f PCpϵq p RSpl f r Fp r,B0q ď sup f PFlip,Lf ,Bf p RSpl f r Fp r,B0q (32) 2 p r Lf ℓs N logpΓ r Fp r,B0q Regarding the second term in Equation (31), we have the following simple calculation: logp NpΘ1, ϵqq 2?ϵ B α 4 ? 15 B0 Lf ℓB B0 Lf ℓB B2 Generalization Analysis of Deep Non-linear Matrix Completion Plugging this back into Equations (31) and (32) we get p Rpl Flip,Lf,Bf r Fp r,B0q ď N log2p Nq 11 B 2 p r Lf ℓs N logpΓ r Fp r,B0,ℓLfq log2p Nq B0 Lf ℓB B2 2 p r Lf ℓs N logpΓ r Fp r,B0,ℓLfq log2p Nq B0 Lf ℓB B2 1 ? N log2p Nq, where we have assumed w.l.o.g. that N ě 2 (the Theorem statement is obvious for N 1). Setting l Id establishes the first inequality (since in this case B Bf and ℓ 1). 
The generalization bound then follows from Theorem F.11 and a union bound over the two failure probabilities. We now move on to prove a distribution-free result for the class Flip,Lf,Bf Fp r,B0. Theorem C.6. Consider the following function class: Flip,Lf,Bf Fp r,B0 : ! g : rms ˆ rns Ñ R : Df P Flip,Lf,Bf , Z P Fp r,B0 : gpi, jq fp Zi,jq ) . We have the following excess risk bound, which holds with probability greater than 1 δ: Epℓpˆgpξq, r Gq Epℓpg pξq, r Gq ď 12 B B0 Lf ℓB B2 1 ? 19C1 Mppm nq1 p log ΓFp r,B0,Lf ℓ fi where ΓFp r,B0,ℓ: 6pm nq Nr B0 1srℓ 1s, C1 maxp C, 1q (C being the constant from (Latała, 2005)) g and ˆg denote ming PFlip,Lf ,Bf Fp r,B0 Eplpgξ, r Gqq and ming PFlip,Lf ,Bf Fp r,B0 p Eplpgξ, r Gqq respectively. Proof. By the same arguments (cf. Equation (31)) as in the proof of Theorem C.5, we have the following bound on the Rademacher complexity of ℓ Flip,Lf,Bf Fp r,B0: p Rpℓ Flip,Lf,Bf Fp r,B0q ď log2p Nq sup f PFlip,Lf ,Bf p Rpl f Fp r,B0q 4 B B0 ℓLf B B2 Next, by Theorem D.5, we can continue p Rpl Flip,Lf,Bf Fp r,B0q ď 2 p Lf ℓq p 2 2 Mppm nq1 p log ΓFp r,B0,Lf ℓ fi B0 Lf ℓB B2 B0 Lf B ℓ B2 1 ? 19C1 Mppm nq1 p log ΓFp r,B0,Lf ℓ fi which holds for any training sample. In particular, we can apply Theorem F.11 to yield the result immediately. Generalization Analysis of Deep Non-linear Matrix Completion We next turn our attention to the slightly more delicate case of results for the minimizer of the empirically weighted trace norm. Theorem C.7. Assume that p 2 d for some integer d. Let pg P arg min p Eplpgξ, r Gqq : g P Flip,Lf,Bf q Fp 2r,B0 q Fp 2r,B0 is the data dependent function class ! Z P Rmˆn : } q Z}p sc,p ď r2rs1 p 2 , }Z}8 ď B0 ) . If we assume that the ground truth G belongs to Flip,Lf,Bf r Fp r,B0, we have the following excess risk bound, which holds with probability ě 1 δ under the condition that N ě 140pm nq log m n Epℓpˆgpi, jq, r Gq Epℓp Gi,j, r Gq ď 6 B B0 Lf ℓB B2 1 ? 2 p p Lf ℓq N log2p12NΓ r Fp r,B0,ℓLfq. (39) Proof. The proof is similar to the proof of Theorem C.4. Let us write the ground truth as G f Z0 (40) with f P Flip,Lf,Bf and Z0 P r Fp r,B0. As in the proof of Theorem C.4, as long as N ě 140pm nq log m n δ we certainly have w.p. ě 1 δ{3 }| Z0}p sc,p ď 21 p 2 }Ă Z0}p sc,p ď r2rs1 p This implies that Z0 P q Fp 2r,B0. (42) Similarly, by Lemma F.10, we also have with probability ě 1 δ{3 (as long as N ě 8pm nq logp 3pm nq δ q, which is already required by the stronger condition for (41)): 2 and ˇqj ě qj Under this condition, we certainly have, by the same argument as in Equation (22) in the proof of Theorem C.4: q Fp 2r,B0 Ă r Fp 4r,B0. (44) This, together with equation (42), implies that ˆg P Flip,Lf,Bf r Fp 4r,B0. (45) Thus, we can apply Theorem C.5 (with r Ð 4r, δ Ð 3δ, ℓÐ ℓLf) to obtain that an additional failure probability of δ, we have the following for every g P r Fp 4r,B0: Epℓpgξ, r Gqq 1 o 1 ℓopgξoq ď 3 B B0 Lf B B2 1 ? 2 p r Lf ℓs N logp12NΓ r Fp r,B0,ℓLfq Generalization Analysis of Deep Non-linear Matrix Completion In particular, by a union bound, equations (45), (44) an (46) hold simultaneously with probability 1 δ and applying this to ˆg and the ground truth G, we obtain: Eplpˆgξ, r Gqq Eplp G, r Gqq ď Eplppgξ, r Gqq p Eplppgξ, r Gqq p Eplppgξ, r Gqq p Eplp Gξ, r Gqq p Eplp Gξ, r Gqq Eplp Gξ, r Gqq B0 Lf ℓB B2 1 ? N log2p Nq (47) 2 p r Lf ℓs N logp12NΓ r Fp r,B0,ℓLfq as expected. C.3. Generalization and Excess Risk Bounds for a Sum of a Latent Matrix and a Neural Encoding (Sd+NN) Theorem C.8. 
Fix a loss function l P Lℓ,B and consider the following function class: r Fp r,B0 Č N0,W,c (48) where we assume the output dimension in the class N0,W,c is K 0. Assume that N ě 8pm nq logp 3pm nq Let pg P arg min p Eplpgi,j, r Gqq : g P r Fp r,B0 Č N0,W,c and g P arg min Eplpgi,j, r Gqq : g P r Fp r,B0 Č N0,W,c . Define N B s0c RW ? With probability greater than 1 δ over the draw of the training set, we have the following: p Rpl p r Fp r,B0 N0,W,cqq ď r O s B (49) g P r Fp r,B0 Č N0,W,c Eplpgξ, r G, ξqq p Eplpgξ, r G, ξqq ď r O s B O Eplpˆg, r G, ξqq ď Eplpg , r G, ξqq r O s B O where the r O notation hides polylogarithmic factors of all relevant quantities (B, B0, l, N, m, n, c, s0, RW, śL ℓ 1 sℓetc.). In particular, if the distribution is uniform, the same result holds for the class Fp r,B0 N0,W,c instead. Furthermore, the same results hold for the class r Fp r Č N0,W,c with s B replaced by B where N B s0c RW ? Proof. We aim to use Lemma E.4 with Θ1 Č N0,W,c and Θ2 r Fp r,B0. Assume that equations (270) are satisfied (this happens with probability ě 1 δ{3 as long as N ě 8pm nq logp 3pm nq δ q, by Lemma F.10) ). Then we can let C be a cover of granularity ϵ ℓof the class Č N0,W,c, as guaranteed by Proposition E.6. By Proposition E.6, we have logp|C|q ď 2dpm nq 32s2 0c2 1 ϵ2 1 ȷ RW 2 ȷ log ΓW,ϵ{ ℓ . (52) Now, for any Ψ P Č N0,W,c, we write sΨ for the associated cover element. For any g P r Fp r,B0 Č N0,W,c, we can write g as g Z Ψ. We define an associated cover element in r Fp r,B0 C as Z sΨ. For any value of Z P r Fp r,B0, it is certainly the Generalization Analysis of Deep Non-linear Matrix Completion o 1 plpgξo, r Goq lpsgξo, r Goqq2 ď ℓ2 1 o 1 pgξo sgξoq2 o 1 pgξo sgξoq2 ℓ2 1 o 1 p Zξo Ψξo r Zξo sΨξosq2 ℓ2 1 o 1 pΨξo sΨξoq2 ď ϵ2. (53) Thus, the condition (183) is satisfied and we can apply Lemma (E.4) to obtain p Rpl p r Fp r,B0 Č N0,W,cqq ď sup θ1PΘ1 p R p Fθ1q 4α 4 ? ΨP Č N0,W,c p Rpl p r Fp r,B0 Ψqq 4α 4 ? Now, we tackle both main terms in equation (54) separately. For the first term, it is clear that by applying Theorem D.4, (with an additional failure probability of δ{3): ΨP Č N0,W,c p Rpℓ p r Fp r,B0 Ψqq N logp3Γ r Fp r,B0,ℓq N logp3Γ r Fp r,B0,ℓq. (55) That is because Theorem D.4 explicitly holds uniformly over all loss functions in Lℓ,B, which includes all the loss functions l1 : py, r G, ξq ÞÑ l1py, r G, ξq lpy Ψξ, r G, ξq for any ℓP Lℓ,B and any Ψ : rms ˆ rns Ñ R. Note that at the third line, we have used the condition N ě 8pm nq logp 3pm nq For the second main term, we simply calculate the integral relying on Equation (52) (setting α 1 d 2dpm nq 32s2 0c2 1 ϵ2 1 RW 2 log ΓW,ϵ{ ℓ logpΓW,1{p N ℓqq N 4s0c RW ? N rlogp B Nq Bs Plugging Equations (55) and (56) back into Equation (54), we obtain, with overall probability ě 1 2δ{3, that p Rpl p r Fp r,B0 Č N0,W,cqq ď logpΓW,1{p N ℓqq N 4s0c RW ? N rlogp B Nq Bs N logp3Γ r Fp r,B0,ℓq The results then follow immediately from Theorem F.11. The proof for r Fp r Č N0,W,c is the same except we are using Theorem D.2 instead of Theorem D.4. We now turn our attention to the case of the non-weighted regularization in the arbitrary sampling case: Generalization Analysis of Deep Non-linear Matrix Completion Theorem C.9. We now consider the following function class: Fp r,B0 N0,W,c. We also let ˆg P arg ming p Eplpgξ, r G, ξqq : g P Fp r,B0 N0,W,c and g P arg ming Eplpgξ, r G, ξqq : g P Fp r,B0 N0,W,c . Let N B s0c?m n RW ? 
With probability ě 1 δ, we have p Rpl p Fp r,B0 N0,W,cqq ď r O s C (58) sup g P r Fp r,B0 N0,W,c Eplpgξ, r G, ξqq ď r O s C O Eplpˆg, r G, ξqq ď Eplpg , r G, ξqq r O s C O where the r O notation hides polylogarithmic factors of all relevant quantities (B, B0, l, N, m, n, c, s0, RW, śL ℓ 1 sℓetc.). Furthermore, if we consider instead the class Fp t N0,W,c, then the same result holds with s C replaced by N B s0c?m n RW ? Proof. By the same arguments as in the proof of Theorem C.8, W.p. ě 1 δ{2, as long as N ě 8pm nq logp 3pm nq δ q there exists a cover C of N0,W,c satisfying condition (183) and logp|C|q ď 2dpm nq 32s2 0c2pm nq 1 ϵ2 1 ȷ RW 2 ȷ log ΓW,ϵ{ ℓ . (61) Furthermore, p Rpl p r Fp r,B0 N0,W,cqq ď log2 p Nq sup ΨP Č N0,W,c p Rpl p r Fp r,B0 Ψqq (62) N s0c?m n RW ? N logp B Nq For the first term, we now have by Theorem D.5: log2 p Nq sup ΨP Č N0,W,c p Rpl p r Fp r,B0 N0,W,cqq (63) 19C1 Mppm nq1 p 2 N logpΓFp r,B0,ℓq Plugging this back into Equation (62) yields the result. Finally, we consider excess risk bounds for the empirically weighted version of our algorithm. In this case, the doubling argument must be used for both components of the model. Theorem C.10. Assume that the ground truth G belongs to q Fp r,B0 Č N0,W,c. Let ˆg P arg ming p Eplpgξ, r G, ξqq : g P q Fp 2r,B0 N0,W,2c . Let Generalization Analysis of Deep Non-linear Matrix Completion With probability greater than 1 δ over the draw of the training set, we have g P q Fp 2r,B0 N0,W,2c Eplpgξ, r G, ξqq p Eplpgξ, r G, ξqq ď r O p Dq O Eplpˆg, r G, ξqq Eplpg , r G, ξqq ď r O p Dq O where the r O notation hides logarithmic factors of all relevant quantities (B, B0, l, N, m, n, c, s0, RW, śL ℓ 1 sℓetc.). Proof. Let us write the ground truth as G Z0 Ψ0 (67) Just as in the proof of theorem C.4, by lemma E.5 we have, with probability ě 1 2δ{3, and as long as N ě 140pm nq log m n δ , that the following are all simultaneously satisfied: Z0 P q Fp 2r. (68) 2 and ˇqj ě qj q Fp 2r Ă r Fp 4r. (70) Also, by the proof of Lemma F.10, on the same high probability event as above, we have (c.f. Equation 208) ˇpi ď 2 pi and ˇqj ď 2 qj (for all i, j). Thus since Ψ0 P Č N0,W,c, we also have Ψ0 P N0,W,2c Ă N0,W,4c . (71) Thus, we have G P q Fp 2r,B0 N0,W,2c Ă r Fp 4r,B0 Č N0,W,4c (72) And thus, the theorem follows by applying Theorem C.8 with δ Ð δ{3. Remarks about the Norm Based Bounds on Neural Encodings: The results in this Section comes with some caveats regarding the improvements offered by the weighting } diagp pq 1 2 U}2 Fr } diagp qq 1 2 V }2 Fr . Indeed, the term b N still contains a parametric dependency on the dimension d of A0pui, vjq J, so it cannot be said that the bounds in Theorems C.8 and C.9 capture any rank sparsity inducing properties of the regularizer on U, V . In fact, the weighting merely serves to increase the uniformity of the input norms of the embeddings, which improves the behavior of norm-based bounds. However, one can also control the complexity of the neural encoding with a parameter counting strategy, which would remove any difference between the weighted and unweighted scenarios. This is what we do in the Section G, which deals with the case where multiple hidden matrices are present. D. Our Results on the Complexity of Matrix Classes with the Schatten quasi-Norms This section compiles our first end-product results: Rademacher complexity bounds for classes of matrices with low Schatten quasi-norms. 
This section is divided into two very similar sections where we treat the two cases where a separate upper bound on the entries is enforced or not. Indeed, a such a separate condition is required to prevent the bounds from blowing up as p Ñ 0. This condition is very mild, but we still cover both cases since it may be interesting to study the dependence on p. The proofs rely on the important tools developed in Section E. Generalization Analysis of Deep Non-linear Matrix Completion D.1. Without Constraints on Entries First, we will need the following simple parameter-counting lemma (the proofs use standard techniques) for matrices of fixed rank. Similar results hark back to (Srebro & Shraibman, 2005; Srebro et al., 2004; Vandermeulen & Ledent, 2021), but the rest of our proofs will require the variation below, which is uniform over any draw of the sample set ξ1, . . . , ξN P rms ˆ rns. Lemma D.1. Consider the following function class over matrices in Rmˆn: Er,t : R P Rmˆn : }R} ď t, rankp Rq ď r ( . (73) For all u ď N let lu : R Ñ R be ℓ-Lipschitz functions which are bounded by B. The covering number of Er,t is bounded as follows: log N8 p Er,t, ϵq ď pm nqr log ˆ3 ? 2t ϵ 1 . (74) Furthermore, we have the following parameter counting bound on the empirical Rademacher complexity of l Er,t. p Rpl Er,tq : sup ZPEr,t o 1 lop Zξoq ď 1 N log 3N ℓ ? 2t 1 B . (75) Proof. First, note that by Lemma F.18 for any R P Er,t, we can find two matrices A P Rmˆo and B P Rnˆo such that }A}2 Fr }B}2 Fr ď 2t. (77) In particular, Equation (77) certainly implies that }A}Fr, }B}Fr ď ? Next, setting ε ϵ 4 ? 2t, by Lemma F.17, there exist covers CA Ă A P Rmˆr : }A}Fr ď ? 2t ( and CB Ă B P Rnˆr : }B}Fr ď ? 2t ( (with respect to the Frobenius norm) such that for all A P Rmˆr (resp. B P Rmˆr), there exists a s A P CA (resp. s B P CB) such that }A s A}Fr, }B s B}Fr ď ε and |CA| ď ˆ3 ? 2t ϵ 1 mr and (78) |CB| ď ˆ3 ? 2t ϵ 1 nr . (79) The cover C : R P Rmˆn : R ABJ : A1 P CA, B1 P CB ( Ă Rmˆn is an (external) ϵ{2-cover of Er,t with respect to the L8 norm. Indeed, for any R P Er,t, we can write R ABJ for some A P CA, B P CB. Then, writing s A (resp. s B) for the corresponding nearest cover elements in CA (resp. CB), and writing s R s A s BJ we have for any i ď m, j ď n: s R R i,j ď s R R Fr s A s BJ ABJ Fr s A s BJ BJ s A A BJ Fr ď s A Fr s BJ BJ Fr s A A Fr s B Fr (80) In particular (by Equations (78)), there must exist an internal ϵ-cover C1 Ă Er,t with ˇˇC1ˇˇ ď |C| ď |CA| |CB| (82) 2t ϵ 1 mr ˆ3 ? 2t ϵ 1 nr ˆ3 ? 2t ϵ 1 pm nqr . (83) Next, we need a simple argument analogous to proofs of Dudley s entropy integrals (however, since the covering number has mild dependence on ϵ, it is not necessary to use a full chaining argument via Lemma E.4, E.3 or E.1). For any R P Er,t Generalization Analysis of Deep Non-linear Matrix Completion let s R be the closest cover element in C1, we have, for any sample set ξ1, . . . , ξN P rms ˆ rns; Eσ sup RPEr,t u 1 σu lup Rξuq (84) ď Eσ sup RPEr,t u 1 σuplup Rξuq lup s Rξuqq Eσ sup RPEr,t u 1 σu lup s Rξuq ď Eσ sup RPEr,t u 1 σuplup Rξuq lup s Rξuqq Eσ sup RPC1 1 N u 1 σu lup s Rξuq 2 logp|C1|q B ? where the fifth line (85) follows from Proposition F.14. Setting ϵ 1 N ℓyields the result. Theorem D.2. Consider the following function class: r Fp r : ! 
Z P Rmˆn : } r Z}p sc,p ď r1 p With probability ě 1 δ over the draw of the training set, the following bound holds on the Rademacher complexity of l r Fp r holds simultaneously over all choices of l P Lℓ,B : Eσ sup ZP r Fp r o 1 σo lop Zξoq ď Np logpΓ r Fp r q where Γ r Fp r : 6pm nq Npℓ 1qpr 1q Proof. Let us write t for t r1 p 2 . Then we have r Fp r ! Z P Rmˆn, } r Z}p sc,p ď t ) . For any matrix X of rank k, let us write ρ1p Xq, . . . , ρkp Xq for the singular values of X, ordered from largest to smallest. We can also define, for any τ ě 0, the quantity Up Xq : |tκ ď k : ρκ ě τu|. By Markov s inequality, we certainly have for any X: Up Xq ď }X}p sc,p τ p . (87) Note also that for if ρκ ď τ for all κ ě U 1, then we certainly have κ U 1 ρp κρ1 p κ ď τ 1 p kÿ κ U 1 ρp κ. (88) Thus, applying the decomposition to X r Z we have the following super-decomposition of the function class r Fp r : r Fp r Ă Rτ Tτ, (89) where Rτ consists in the contribution from singular values greater than τ and Tτ consists in the contribution from singular values less than τ. Thus, recalling Equation (73), Rτ : ! R P Rmˆn : } r R} ď st, rankp Rq ď s U ) (90) Generalization Analysis of Deep Non-linear Matrix Completion where r R diagp? pq R diagp? qq and and st t1{pn nr where the expression for s U follows from Equation (87) and the expression for st follows from the fact that for any matrix X P Rmˆn, }X}sc,pn ě }X}sc,prankp Xq ě rankp Xq}X} ě }X} . Note that we upper bound the rank by n rather than minpm, nq for cosmetic reasons only: the corresponding terms will only give rise to logarithmic factors in any case. For Tτ, we have Tτ : r F1 rt : ! Z P Rmˆn : } r Z} ď rt ) , (92) where from Equation (88) we obtain the suitable expression for rt as rt : t τ 1 p r 2 p 2 τ 1 p . (93) Since Equation (89) holds with the prescribed values of s U, st and rt we can upper bound the Rademacher complexity via Lemma E.3 for F ! lopp Z1 Z2qξoq oďN : Z1 P Rτ, Z2 P Tτ ) , Θ1 Rτ and Θ2 Tτ: Eσ sup ZP r Fp r o 1 σo lup Zξuq ď Eσ sup Z1PRτ ,Z2PTτ o 1 σo lup Z1 ξu Z2 ξuq (94) ď ϵ sup s Z1PCpϵq p R p F s Z1q B logp Np F1, ϵqq where Cpϵq is an ϵ-uniform cover of Θ1 Rτ in the sense of Lemma E.3, and for the avoidance of doubt, F s Z1 ! lopp s Z1 Z2qξoq oďN : Z2 P Tτ ) . The above equation (95) holds for any ϵ, and since the loss function is uniformly Lipschitz, it is certainly true that if Z1 is a cover element such that ˇˇˇ Z1 Z1 ξo ˇˇˇ ď ϵ for any o, then we also have that ˇˇlopp Z1 Z2qξoq lopp Z1 Z2qξoq ˇˇ ď ϵ ℓ (96) holds uniformly over all Z2 P Tτ, all os and all loss functions ls satisfying the required conditions. Np F1, ϵq ď N8 Rτ, ϵ Note also that since pi ě 1 2m and qj ě 1 2n, for any matrix X with } r X} ď st, we have }X} ď ?n} r X}Fr2?mn ď 2 ? Thus we have Rτ Ă ! R P Rmˆn : }R} ď 2 ? mn2st, rankp Rq ď s U ) . Thus, using Lemma (D.1) and plugging the result Generalization Analysis of Deep Non-linear Matrix Completion back into Equation (95) above, we obtain: Eσ sup ZP r Fp r o 1 σo lup Zξuq (98) ď ϵ sup s Z1PCpϵq Eσ sup Z2PTτ o 1 σo lop Z2 ξo s Z1 ξoq (99) g f f f e s Upm nq log ˆ 3 ? mn2st ℓ ϵ 1 ď ϵ ℓEσ sup Z2PTτ o 1 σo Zξu B g f f f f e s Upm nq log N ℓEσ sup Z2PTτ o 1 σo Zξu (101) 2 pm nq log p6pm nq Npℓ 1qpr 1qq where the second line (99) we used Lemma D.1, at the next line (100) we have used the Talagrand contraction Lemma (once for each value of s Z1), and at the final line (101), we have set ϵ 1 N , used the fact that s U ď tp 2 τ p and the fact that p ď 1. 
Next, by applying Theorem F.5 (with r Ð rt2) we know that with probability ě 1 δ over the draw of the training set we have Eσ sup Z2PTτ o 1 σo Zξu ď 4 3N log ˆm n 3N log ˆm n ď 4 τ p1 pq d 3N log ˆm n 16 τ 1 p ?mnr 2 p 2 3N log ˆm n where at line (103) we used the definition of rt (cf. Equation (93)). Plugging Equation (103) back into Equation (101) we obtain the following bound for the Rademacher complexity of r Fp r , which holds with probability ě 1 δ over the draw of the training set: Eσ sup ZP r Fp r o 1 σo lup Zξuq (104) N 16 ℓτ 1 p ?mnr 2 p 2 3N log ˆm n 2 pm nq log p6pm nq Npℓ 1qpr 1qq 4 ℓτ p1 pq d 3N log ˆm n Ignoring logarithmic factors, the dominant terms scale as follows: Eσ sup ZP r Fp r o 1 σo lup Zξuq ď r O 2 pm nq Np τ p ℓτ 1 pr1 p Generalization Analysis of Deep Non-linear Matrix Completion Thus, optimizing over the choice of τ, we select τ B 2 2 p ℓ 2 2 p r 1 2 p 1 2 p . (107) Substituting this back into Equation (105) we obtain Rp r Fp r q ď N 16 ℓτ 1 p ?mnr 2 p 2 3N log ˆm n 4 ℓτ p1 pq d 3N log ˆm n 2 pm nq log p6pm nq Npℓ 1qpr 1qq N 16 ℓr B 2 2 p ℓ 2 2 p r 1 2 p 1 2 p s1 p?mnr 2 p 2 3N log ˆm n 2 pm nq log p6pm nq Npℓ 1qpr 1qq Npr B 2 2 p ℓ 2 2 p r 1 2 p 1 2 p sp 4 ℓr B 2 2 p ℓ 2 2 p r 1 2 p 1 2 p sp1 pq d 3N log ˆm n p 2 p r 1 2 ?mn 1 p 2 p log ˆm n 2rpm nq log p6pm nq Npℓ 1qpr 1qq 2 p log ˆm n p 2 p r 1 2 ?mn log m n 1 p 2 p (108) g f f erpm nq log 6pm nq Npℓ 1qpr 1q N logpΓ r Fp r q N logpΓ r Fp r q as expected. Next, we consider the situation where the distribution is arbitrary, but the Schatten quasi norm is not weighted: Theorem D.3. Consider the following function class: Fp t : Z P Rmˆn : }Z}sc,p ď M ( . (111) With probability ě 1 δ, we have the following bound on the Rademacher complexity of l Fp t , where l is any set of N ℓ-Lipschitz functions uniformly bounded by B: p Rp r Fp r q 2 log ΓFp t ,ℓ where ΓFp t ,ℓ: 6Npm nqpr 1qpℓ 1q. Generalization Analysis of Deep Non-linear Matrix Completion Proof. Similarly to the above proofs we can write r Fp r Ă Rτ Tτ, (113) Rτ : R P Rmˆn : }R} ď st, rankp Rq ď s U ( (114) and st ?mnr 2p n. (115) And similarly Tτ : Z P Rmˆn : }Z} ď rt ( (116) 2 ?mn p τ 1 p . (117) Thus, by the same argument as in the proof of Theorem D.2 we have Eσ sup ZP r Fp r o 1 σo lop Zξoq (118) N sup s Z1PCp1{Nq p R p F s Z1q B g f f e s Upm nq log 3N a F s Z1 ! lopp s Z1 Z2qξoq oďN : Z2 P Tτ ) . (120) with the values for Tτ, s U, st, rt defined as per Equations (116), (115) and (117) respectively. Replacing the appropriate values for s U and st, we have the following g f f e s Upm nq log 3N a g f f f er1 p 2 ?mnppm nq log ˆ 3N b N τ p (122) g f f f er1 p 2 pm nqp 1 log ˆ 3N b N τ p (123) 2 pm nqp 1 log p6Npm nqpr 1qpℓ 1qq Np τ p . (124) For any s Z1 P Cpϵq we can apply Proposition F.3 from (Shamir & Shalev-Shwartz, 2011) to obtain the following inequality, which is valid for any training set: Eσ sup Z2PTτ o 1 lop s Z1 Z2q ď 9C B ℓrtp?m ?nq 2 ?mnp τ 1 pp?m ?nq Generalization Analysis of Deep Non-linear Matrix Completion Taking a suprememum over s Z1 we certainly have sup s Z1PCpϵq Eσ sup Z2PTτ o 1 lop s Z1 Z2q ď 2 ?mnp τ 1 pp?m ?nq Plugging Equations (125) and (124) back into Equation (119) we obtain: Eσ sup ZP r Fp r o 1 σo lop Zξoq (126) 18C B ℓr1 p 2 τ 1 pp?m nq1 2p 2 pm nqp 1 log p6Npm nqpr 1qpℓ 1qq this motivates the following choice of threshold τ : B ℓ 1 p 1? which yields: Eσ sup ZP r Fp r o 1 σo lop Zξoq (129) 18C B2 p ℓp r1 p 2 Np1 p (130) 2 log p6Npm nqpr 1qpℓ 1qq 2 p6Npm nqpr 1qpℓ 1qq as expected. D.2. 
With Constraints on Entries Theorem D.4. Consider the following function class: r Fp r,B0 : ! Z P Rmˆn : } r Z}p sc,p ď r1 p 2 ; }Z}8 ď B0 ) . (131) With probability ě 1 δ, we have the following bound on the Rademacher complexity of l r Fp r , where l a Lipschitz function of ξo and r Go but is ℓ-Lipschitz with respect to the first argument and uniformly bounded by B: ˆRp r Fp r,B0q Eσ sup ZP r Fp r o 1 σo lop Zξoq (132) N logpΓ r Fp r,B0,ℓq Proof. Let us write t for t r1 p 2 . Then we have r Fp r,B0 ! Z P Rmˆn, } r Z}p sc,p ď t, }Z}8 ď B0 ) . Generalization Analysis of Deep Non-linear Matrix Completion As before, we decompose our function class into two parts r Fp r,B0 Ă Rτ Tτ, (133) where Rτ is a class of rank U matrices where the contribution from singular values greater than τ belong and Tτ contains in the contribution from singular values less than τ. More precisely, Rτ : R P Rmˆn : }R} ď st, rankp Rq ď s U ( (134) where r R diagp? pq R diagp? qq and and st ?mnn B0, (135) where the expression for s U comes from Markov s inequality (as before) and (differently from before) the expression for st comes from the fact that for any matrix R, ρiěτ ρiuipviq J ď }Z}rankp Zq ď ?mn B0 rankp Zq ď ?mnn B0 . (136) Note also that exactly as before, for if ρκ ď τ for all κ ě U 1, then we certainly have κ U 1 ρp κρ1 p κ ď τ 1 p kÿ κ U 1 ρp κ. (137) Thus, for Tτ, we have Tτ : r F1 rt : ! Z P Rmˆn : } r Z} ď rt ) , (138) where from equation (137) we obtain the suitable expression for rt as rt : t τ 1 p r 2 p 2 τ 1 p . (139) Since Equation (133) holds with the prescribed values of s U, st and rt we can upper bound the Rademacher complexity via Lemma E.3 for F ! lopp Z1 Z2qξoq oďN : Z1 P Rτ, Z2 P Tτ ) , Θ1 Rτ and Θ2 Tτ: Eσ sup ZP r Fp r o 1 σo lup Zξuq ď Eσ sup Z1PRτ ,Z2PTτ o 1 σo lup Z1 ξu Z2 ξuq (140) ď ϵ sup s Z1PCpϵq p R p F s Z1q B logp Np F1, ϵqq By the same arguments as in the proof of Theorem D.2, Np F1, ϵq ď N8 Rτ, ϵ Generalization Analysis of Deep Non-linear Matrix Completion Thus, using Lemma D.1 and plugging the result back into Equation (141) above, we obtain: Eσ sup ZP r Fp r o 1 σo lup Zξuq (143) ď ϵ sup s Z1PCpϵq Eσ sup Z2PTτ o 1 σo lop Z2 ξo s Z1 ξoq p F s Z1q (144) g f f f e s Upm nq log ˆ 3?? N ℓEσ sup Z2PTτ o 1 σo Zξu B 2 pm nq log p3Nmn3r B0 1srℓ 1s 1q N τ p (145) at the final line (145), we have set ϵ 1 N , used the fact that s U ď tp Next, the application of Theorem F.5 (with r Ð rt2) to Tτ is unchanged from the proof of Theorem D.2, thus we know as before that with probability ě 1 δ over the draw of the training set we have Eσ sup Z2PTτ o 1 σo Zξu ď 4 3N log ˆm n 3N log ˆm n ď 4 τ p1 pq d 3N log ˆm n 16 τ 1 p ?mnr 2 p 2 3N log ˆm n where at line (147) we used the definition of rt (cf. Equation (139)). 
Plugging Equation (147) back into Equation (145) we obtain the following bound for the Rademacher complexity of r Fp r , which holds with probability ě 1 δ over the draw of the training set: Eσ sup ZP r Fp r o 1 σo lup Zξuq (148) N 16 ℓτ 1 p ?mnr 2 p 2 3N log ˆm n 2 pm nq log p3Nmn3r B0 1srℓ 1s 1q N τ p 4 ℓτ p1 pq d 3N log ˆm n Thus, optimizing over the choice of τ (ignoring logarithmic factors), we select τ B 2 2 p ℓ 2 2 p r 1 Substituting this back into Equation (149) and writing Γ r Fp r,B0,ℓfor 3Nmn3r B0 1srℓ 1s 1 δ ě 3Nmn3r B0 1srℓ 1s 1 we Generalization Analysis of Deep Non-linear Matrix Completion obtain Rp r Fp r q ď N 16 ℓτ 1 p ?mnr 2 p 2 3N log ˆm n g f f er1 p 2 pm nq log Γ r Fp r,B0,ℓ N τ p 4 ℓτ p1 pq d 3N log ˆm n N 16 ℓr B 2 2 p ℓ 2 2 p r 1 2 s1 p?mnr 2 p 2 3N log ˆm n g f f f er1 p 2 pm nq log Γ r Fp r,B0,ℓ Nr B 2 2 p ℓ 2 2 p r 1 2 sp 4 ℓr B 2 2 p ℓ 2 2 p r 1 3N log ˆm n p 2 p r 1 2 ?mn 3N log ˆm n g f f erpm nq log Γ r Fp r,B0,ℓ 3N log ˆm n p 2 p r 1 2 ?mn log m n g f f erpm nq log Γ r Fp r,B0,ℓ N logpΓ r Fp r,B0,ℓq as expected. Theorem D.5. Consider the following function class: Fp r,B0 : ! Z P Rmˆn : }Z}sc,p ď M rr?mns 2p , }Z}8 ď B0 ) (152) With probability ě 1 δ, we have the following bound on the Rademacher complexity of l Fp t , where l is any set of N ℓ-Lipschitz functions uniformly bounded by B: ˆRp Fp r,B0q Eσ sup ZPFp t o 1 σo lop Zξoq (153) 2 Mppm nq1 p log p6pm nq Nr B0 1srℓ 1sq , where C is the absolute constant from (Latała, 2005). Note that similarly to Theorem D.3, the sample complexity is r O p Mppm nqpq r O r1 p where r is the rank-like quantity M ?mn ı 2p Generalization Analysis of Deep Non-linear Matrix Completion Proof. Similarly to the above proofs we can write r Fp r Ă Rτ Tτ, (155) Rτ : R P Rmˆn : }R} ď st, rankp Rq ď s U ( (156) and st B0 ?mnn. (157) And similarly Tτ : Z P Rmˆn : }Z} ď rt ( (158) rt Mp τ 1 p . (159) Thus, by the same argument as in the proof of Theorem D.2 we have Eσ sup ZP r Fp r o 1 σo lop Zξoq (160) N sup s Z1PCp1{Nq p R p F s Z1q B g f f e s Upm nq log 3N ? F s Z1 ! lopp s Z1 Z2qξoq oďN : Z2 P Tτ ) . (162) with the values for Tτ, s U, st, rt defined as per Equations (158), (157) and (159) respectively. For any s Z1 P Cpϵq we can apply Proposition F.3 from (Shamir & Shalev-Shwartz, 2011) to obtain the following inequality, which is valid for any training set: Eσ sup Z2PTτ o 1 lop s Z1 Z2q ď 9C B ℓrtp?m ?nq 9C B ℓMp τ 1 pp?m ?nq Taking a suprememum over s Z1 we certainly have sup s Z1PCpϵq Eσ sup Z2PTτ o 1 lop s Z1 Z2q ď 9C B ℓMp τ 1 pp?m ?nq where C is the absolute constant in (Latała, 2005). Furthermore, replacing the appropriate values for s U and st, we have the following g f f e s Upm nq log 3N ? 2 Mppm nq log p6pm nq Nr B0 1srℓ 1sq N τ p . (165) Plugging Equations (164) and (165) back into Equation (161) we get: Generalization Analysis of Deep Non-linear Matrix Completion Eσ sup ZP r Fp r o 1 σo lop Zξoq (166) 18C B ℓMp τ 1 pp?m nq 2 Mppm nq log p6pm nq Nr B0 1srℓ 1sq This motivates the following choice of threshold: which yields Eσ sup ZP r Fp r o 1 σo lop Zξoq (169) 18C B ℓMp τ 1 pp?m nq 2Mppm nq log p6pm nq Nr B0 1srℓ 1sq 18C B2 p ℓp Mpp?m nq1 p 2Mppm nq1 p 2 B2 p ℓp log p6pm nq Nr B0 1srℓ 1sq 2 log p6pm nq Nr B0 1srℓ 1sq , where as usual at the second line we have used the fact that 1 E. 
Important Tools In this section, we collect some of the main building blocks of our proofs, which include estimates of perturbations of Schatten quasi-norms in the estimation of the marginals, various tailor-made generalizations of chaining arguments, and generalization bounds for classes of neural embeddings. E.1. Generalizations of Dudley s Entropy Theorem This subsection details some of our results on how to calculate Rademacher complexities of composite function classes when a covering number is only available for one of the classes. Recall the following standard form of Dudley s entropy integral: Lemma E.1. Let F be a real-valued function class taking values in r0, 1s, and assume that 0 P F. Let S be a finite sample of size n. We have the following relationship between the Rademacher complexity Rp F|Sq and the L2 covering number Np F|S, ϵ, }.}2q. Rp F|Sq ď inf αą0 log Np F|S, ϵ, }.}2q dϵ, where the norm }.}2 on RN is defined by }x}2 2 1 N přN i 1 |xi|2q. Generalization Analysis of Deep Non-linear Matrix Completion We will also to extend the above Lemma to various settings where several a function class over two parameters. For this, we will first need to establish the following slight extension of Lemma F.19: Lemma E.2 (Scale sensitive concentration of Rademacher complexity). For any fixed x1, . . . , x N and any function class F Ă Rr Ns. Assume that there are N numbers s1, s2, . . . , s N ą 0 such that for all f P F we have |fi| ď si (170) i 1 s2 i c2 for some c ą 0. We have with probability ě 1 δ over the draw of the Rademacher variables σ1, . . . , σN, ˇˇˇˇˇsup f PF i 1 σifi Eσ sup f PF Proof. This is a direct application of the Mc Diarmid inequality with the variables being the σ1, . . . , σN. Indeed, if σ, sσ P t 1, 1u N differ only in the ith component σi, then we certainly have ˇˇˇˇˇsup f PF i 1 σifi sup f PF ˇˇˇˇˇ ď 2si, (172) which means that Mc Diarmid s inequality implies ˇˇˇˇˇsup f PF i 1 σifi Eσ sup f PF ď 2 exp ˆ 2ϵ2 The lemma follows upon rearranging. The two Lemmas below perform a generalized chaining argument (Guermeur, 2018; Vershynin, 2018; Boucheron et al., 2004) with multiple component function classes. They are extensions of Proposition A.4. (page 3 of the supplementary) of (Ledent et al., 2021b) and allow one to bound the Rademacher complexity of a composition of two function classes in terms of the uniform covering number of the second class and the Rademacher complexity of the first one. Lemma E.3 (Multi-class chaining: simple compositional extension of Dudley s entropy theorem). Let F : tfipθ1, θ2q : θa P Θ1, θ2 P Θ2u be a class of functions on t1, 2, . . . , nu with values in r B, Bs and dependent on two parameters θ1 P Θ1 and θ2 P Θ2. Let ϵ ě 0 and assume that Θ1 admits an ϵ-uniform cover Cpϵq Ă Θ1 (of size Np F1, ϵq, dependent on ϵ) in the following sense: For any θ1 P Θ1 there exists a θ P Cpϵq such that for all θ2 P Θ2 and for all i ď n we have ˇˇfipθ1, θ2q fip θ, θ2q ˇˇ ď ϵ. (174) Then we have the following result on the Rademacher complexity of the function class F: p Rp Fq Eσ sup θ1,2PΘ1,2 i 1 σifipθ1, θ2q ď ϵ sup θPCpϵq p R p F θq B logp Np F1, ϵqq where for all θ P Θ1 we define F θ : f.p θ, θ2q : θ2 P Θ2 ( , and as usual the σis are i.i.d. Rademacher variables. Proof. Without loss of generality, we can assume B 1. 
Generalization Analysis of Deep Non-linear Matrix Completion For any θ1 P Θ1 let us denote by sθ1 the cover element associated to θ1 as in equation (174), by the assumption on the cover we have p Rp Fq Eσ sup θ1,2PΘ1,2 i 1 σifipθ1, θ2q ď Eσ sup θ1,2PΘ1,2 i 1 σifipsθ1, θ2q rfipθ1, θ2q fipsθ1, θ2qs ď ϵ Eσ sup θ1,2PΘ1,2 i 1 σifipsθ1, θ2q ď ϵ Eσ sup θPCpϵq sup θ2PΘ2 i 1 σifip θ, θ2q. (176) By Lemma F.19 we have, for any choice of θ P Θ1, that with probability ě 1 δ over the draw of the Rademacher variables σ1, . . . , σN, ˇˇˇˇˇ sup θ2PΘ2 i 1 σifip θ, θ2q Eσ sup θ2PΘ2 i 1 σifip θ, θ2q Thus, by a union bound we have that with probability ě 1 δ over the draw of the Rademacher variables: sup θPCpϵq sup θ2PΘ2 i 1 σifip θ, θ2q sup θPCpϵq p R p F θq ˇˇˇˇˇ sup θPCpϵq sup θ2PΘ2 i 1 σifip θ, θ2q sup θPCpϵq Eσ sup θ2PΘ2 i 1 σifip θ, θ2q ď sup θPCpϵq ˇˇˇˇˇ sup θ2PΘ2 i 1 σifip θ, θ2q Eσ sup θ2PΘ2 i 1 σifip θ, θ2q logp Np F1, ϵqq Let X denote the random variable p Rp Fq ϵ sup θPCpϵq p R p F θq logp Np F1, ϵqq (with the randomness arising from the Rademacher variables σis). By equations (178) and (176) we have for all ε ą 0 Pp X ě εq ď 2 exp ˆ ε2N Integrating over ε we obtain Ep Xq ď ż 8 0 2 exp ˆ ε2N 0 expp θ2qdθ which, plugged back into the definition of X (i.e. Equation (179)) gives the result. Generalization Analysis of Deep Non-linear Matrix Completion Whilst Lemma E.3 above works well when the function class Θ1 enjoys a log covering number with very mild dependence on the granularity ϵ (e.g. logp1{ϵq), it is insufficient to handle the typical 1{ϵ2 dependency of norm-based bounds. The generalization below is more suitable in this case. Lemma E.4 (Multi-class chaining: full compositional generalization of Dudley s entropy theorem). Let F : tfipθ1, θ2q : θa P Θ1, θ2 P Θ2u be a class of functions on t1, 2, . . . , nu with values in r B, Bs and dependent on two parameters θ1 P Θ1 and θ2 P Θ2. Assume that there exists a θo P Θ1 such that fipθ0, θ2q 0 for all i and for all θ2 P Θ2. Assume that for all ϵ ą 0, Θ1 admits an ϵ-uniform cover Cpϵq Ă Θ1 (of minimum size NpΘ1, ϵq, dependent on ϵ) in the following sense: For any θ1 P Θ1 there exists a θ P Cpϵq such that for all θ2 P Θ2 we have fipθ1, θ2q fip θ, θ2q 2 ď ϵ2. (183) The Rademacher complexity of the function class F is bounded as follows: p Rp Fq : Eσ sup θ1,2PΘ1,2 i 1 σifipθ1, θ2q (184) sup θ1PΘ1 p R p Fθ1q 4α 4 ? logp NpΘ1, ϵqq where for any fixed θ1 P Θ1, Fθ1 is the function class tfpθ1, θ2q : θ2 P Θ2u. Proof. W.l.o.g. we assume B 1. Let H be arbitrary, and let ϵh 2 ph 1q for h 1, 2, . . . , H. For all h, let Vh Ă Θ1 denote the cover achieving (183), where we can choose v1 tθ0u. For each θ1 P Θ1 let us also write vhrθ1s for the cover element n Vh which achieves (183). Using a similar decomposition to classic proofs of the standard Dudley entropy theorem, we have for any value of the Rademacher variables σ1, . . . , σN: sup θ1,2PΘ1,2 i 1 σifipθ1, θ2q ď sup θ1,2PΘ1,2 i 1 σifipv1rθ1s, θ2q sup θ1,2PΘ1,2 i 1 σirfipθ1, θ2q fipv Hrθ1s, θ2qs sup θ1,2PΘ1,2 h 1 σirfipvhrθ1s, θ2q fipvh 1rθ1s, θ2qs. ď sup θ1,2PΘ1,2 i 1 σifipv1rθ1s, θ2q ϵH sup θ1,2PΘ1,2 h 1 σirfipvhrθ1s, θ2q fipvh 1rθ1s, θ2qs ď ϵH sup θ1,2PΘ1,2 h 1 σirfipvhrθ1s, θ2q fipvh 1rθ1s, θ2qs, (185) where the second inequality follows from the definition of the cover and the Cauchy-Schwartz inequality, and the last inequality follows from v1 tθ0u. Next let us define the function class Wh tw P Rr NsbΘ2 : Dθ1 P Θ1 s.t. wipθ2q fipvhrθ1s, θ2q fipvh 1rθ1s, θ2qu. 
Then note that we have |Wh| ď |Vh||Vh 1| ď |Vh 1|2 NpΘ1, ϵh 1q2. (186) Note that for all h ď H we have Eσ sup θ1,2PΘ1,2 i 1 σirfipvhrθ1s, θ2q fipvh 1rθ1s, θ2qs ď Eσ sup θ2PΘ2 sup w PWh,θ2 i 1 σiwi. (187) Generalization Analysis of Deep Non-linear Matrix Completion hďH Wh. By definition of the cover (cf. Equation (183)), we have, for any θ2 P Θ2: i 1 rfipvhrθ1s, θ2q fipvh 1rθ1s, θ2qs2 (188) i 1 rfipvhrθ1s, θ2q fipθ1, θ2q fipθ1, θ2q fipvh 1rθ1s, θ2qs2 fipvhrθ1s, θ2q fipθ1, θ2q 2 2 fipθ1, θ2q fipvh 1rθ1s, θ2q 2 ď 2ϵ2 h 2ϵ2 h 1 ď 10ϵ2 h 1. (189) Hence, we can apply Lemma E.2 to conclude that, for any choice of w P W, with probability ě 1 δ over the draw of the Rademacher variables σ1, . . . , σN, we have (simultaneously over all θ2): ˇˇˇˇˇ sup θ2PΘ2 i 1 σiwipθ2q Eσ sup θ2PΘ2 i 1 σiwipθ2q Thus, by a union bound running over w P W together with equation (186) we have that with probability ě 1 δ over the draw of the Rademacher variables, the following holds for all values of w P Wh simultaneously: ˇˇˇˇˇ sup θ2PΘ2 σi N wipθ2q Eσ sup θ2PΘ2 σi N wipθ2q 4 logp NpΘ1, ϵh 1qq Furthermore, for any w P W, we certainly have: Eσ sup θ2PΘ2 σi N wipθ2q ď 2 sup θ1PΘ1 Eσ sup θ2PΘ2 σi N fipθ1, θ2q 2 sup θ1PΘ1 p R p Fθ1q . (192) Plugging equations (191) and (192) back into equations (187) and (185), we have with probability ě 1 δ over the draw of the Rademacher variables σ1, . . . , σN: sup θ1,2PΘ1,2 i 1 σifipθ1, θ2q (193) ď ϵH 2p H 1q sup θ1PΘ1 p R p Fθ1q sup θ1,2PΘ1,2 σi N rfipvhrθ1s, θ2q fipvh 1rθ1s, θ2qs ď ϵH 2p H 1q sup θ1PΘ1 p R p Fθ1q 4 logp NpΘ1, ϵh 1qq ď ϵH 2p H 1q sup θ1PΘ1 p R p Fθ1q 4 h 1 rϵh ϵh 1s logp NpΘ1, ϵhqq Finally, take H to be the largest integer such that ϵH 1 ą α, then ϵH 4ϵH 2 ď 4α and we can continue to show that (w.p. ě 1 δ over the draw of σ): sup θ1,2PΘ1,2 i 1 σifipθ1, θ2q ď ϵH 2p H 1q sup θ1PΘ1 p R p Fθ1q 4 logp NpΘ1, ϵqq sup θ1PΘ1 p R p Fθ1q 4α 4 ? logp NpΘ1, ϵqq Generalization Analysis of Deep Non-linear Matrix Completion Next, let X denote the random variable sup θ1,2PΘ1,2 i 1 σifipθ1, θ2q 4α 4 ? logp NpΘ1, ϵqq N dϵ 2 log2 with the randomness arising from the Rademacher variables σis. By Equation (194) we have Pp X ě εq ď 2 exp ε2N . (196) Integrating over ε we obtain Ep Xq ď ż 8 0 2 exp ε2N dε (197) 0 expp θ2qdθ c π Plugging this back into the definition of X (eq (179)) after taking expectations with respect to σ1, . . . , σN, we get: p Rp Fq Eσ sup θ1,2PΘ1,2 i 1 σifipθ1, θ2q (199) sup θ1PΘ1 p R p Fθ1q 4α 4 ? logp NpΘ1, ϵqq as expected. E.2. On the Impact of the Estimation of the Marginals on the Schatten quasi-norms of r Z In this subsection, we present the following result, which is useful when proving our excess risk bounds. Lemma E.5 (Generalization of Lemma 4 in (Foygel et al., 2011)). Let d ě 2 be an integer. For any N ě 6pm nq logp m n δ qq we have the following inequality with probability greater than 1 δ over the draw of the training set: } q Z}2{d sc,2{d ď 6pm nq log m n 2 d sc,2{d (201) Proof. Let A, B, D1, . . . , Dd 1 be such that i 1 }Di}2 Fr } diagprpq 1 2 A}2 Fr } diagprqq 1 2 B}2 Fr (202) i 1 Di BJ Z. (203) By Corollary (F.23) we have: d} q Z}2{d sc,2{d ď i 1 }Di}2 Fr } diagpˇp{ pq r A}2 Fr } diagpˇq{ qq r B}2 Fr (204) where r A : diagp pq 1 2 A and r B : diagp pq 1 2 B. 
Generalization Analysis of Deep Non-linear Matrix Completion By a Bernstein bound, for any ϵ ď 1, ď exp ˆ ϵ2N and similarly ď exp ˆ ϵ2N Thus we have for any ϵ ď 1, using a union bound: 2 ě ϵ _ Dj : ď pm nq exp ˆ ϵ2N 6pn mq Thus, we know that with probability greater than 1 δ we have simultaneously over all i, j: 6pm nq log m n 6pm nq log m n Plugging this back into Equation (204) we obtain (w.p. 1 δ): d} q Z}2{d sc,2{d ď i 1 }Di}2 Fr } diagpˇp{ pq r A}2 Fr } diagpˇq{ qq r B}2 Fr (209) 6pm nq log m n } r A}2 Fr } r B}2 Fr ı i 1 }Di}2 Fr (210) 6pm nq log m n } r A}2 Fr } r B}2 Fr i 1 }Di}2 Fr 6pm nq log m n 2 d sc,2{d, (212) as expected. E.3. Covering Number Bounds for Neural Embeddings In this section, we provide covering number bounds for neural embeddings, including both with weighted and un-weighted versions of the constraints on the zeroth layer learnable embeddings. This section relies on results from subsection F.3 such as Proposition F.8. We consider the following function classes. Generalization Analysis of Deep Non-linear Matrix Completion Č N0,W,cpa, s, cq : g : rms ˆ rns Ñ R1ˇˇDf P N1,Wpa, sq, U P Rmˆ m, V P Rnˆ m : (213) } diagprpq 1 2 U}2 Fr } diagprqq 1 2 V }2 Fr ď c, }A0} ď s0 : gpi, jq fp A0pui, vjq Jq @i, j N0,W,cpa, s, cq : g : rms ˆ rns Ñ R1ˇˇDf P N1,Wpa, sq, U P Rmˆ m, V P Rnˆ m : (214) } diagpqpq 1 2 U}2 Fr } diagpqqq 1 2 V }2 Fr ď c, }A0} ď s0 : gpi, jq fp A0pui, vjq Jq @i, j N0,W,cpa, s, cq : g : rms ˆ rns Ñ R1ˇˇDf P N1,Wpa, sq, U P Rmˆ m, V P Rnˆ m : (215) }U}2 Fr }V }2 Fr ď c2 maxpm, nq, }A0} ď s0 : gpi, jq fp A0pui, vjq Jq @i, j where ui and vj denote the ith and jth rows of U and V respectively and A0 P Rdˆp2 mq. Proposition E.6 (L2 covers of N0,W,c, Č N0,W,c). Assume as usual that sl ě 1 for all l and χ2 ě 1. For any sample ξ1, . . . , ξN P rms ˆ rns such that @i, j, ˆpi ď 2pi and ˆqj ď 2qj, any ϵ ą 0, there exists a cover Cpϵq Ă N0,W,cpa, s, cq (resp. Č N0,W,cpa, s, cq) with the following properties. 1. For g P N0,W,cpa, s, cq (resp. Č N0,W,c) there exists a g in Cpϵq such that o 1 pg gqξo ď ϵ2. (216) logp|Cpϵq|q ď 2dpm nq 32s2 0c2 1 ϵ2 1 ȷ RW 2 ȷ log pΓW,ϵq for Č N0,W,c (217) 2dpm nq 32s2 0c2pm nq 1 ϵ2 1 ȷ RW 2 ȷ log pΓW,ϵq for N0,W,c (218) where ΓW,ϵ 96W s0pm nq?mn śL ℓ 1 sℓ ϵ 1. Proof. We write a single proof for both cases as they are very similar. Step 1: Uniform cover C1 of the embeddings A0pui, vjq J For any A0, U, V , we write s Up A0, U, V q for the tensor in Rdˆmˆn such that s Up A0, U, V qu,i,j A0pui, vjq J u for any i ď m, j ď n, u ď d. Let U : U P Rdˆmn : Dp A0, U, V q P Aa,s ( where Aa,s is the set of admissible U, V, A0 defined according to the corresponding equations (213), (214) depending on whether we are proving the Theorem for the class N0,W,c, Č N0,W,c. Note that for any A0, we can always write A0pui, vjq J A0 1u J i A0 2v J j where A0 1 and A0 2 represent the first and last m columns of A0 respectively. Thus, we have U Ă U1 U2 where U1 : A1U JˇˇDV, A2 : pp A1, A2q, U, V q P Aa,s ( and U2 : A2V JˇˇDU, A1 : pp A1, A2q, U, V q P Aa,s ( . For each admissible U, V, A0, it is certainly the case that }A0u J i }, }A0v J j } ď }A0} }U}2 Fr }V }2 Fr 1 2 ď 2s0cpm nq for any i, j, where we have used the fact that pi ě 1 2m and qj ě 1 2n. Generalization Analysis of Deep Non-linear Matrix Completion Thus, we have U1 Ă r U1 : U P Rdˆm : } U}Fr ď 2s0pm nq?m ( and U2 Ă r U2 : s U P Rdˆn : }s U}Fr ď 2s0pm nq?n ( . 
Thus, by Lemma F.17 applied to r U1 and r U2, we can obtain internal covers r C11 and r C12 of r U1 and r U2 with respect to the }.}Fr norm with granularity ϵ{p8 śL ℓ 1 sℓq such that log min ˇˇˇ r C11 ˇˇˇ , ˇˇˇ r C12 ˇˇˇ ı (219) 6s0pm nq?mn ϵ p8 śL ℓ 1 sℓq 1 48s0pm nq?mn śL ℓ 1 sℓ ϵ 1 which immediately yields an external cover of U with granularity ϵ{p4 śL ℓ 1 sℓq, and finally an internal cover C1 of the same, with granularity ϵ{p2 śL ℓ 1 sℓq and cardinality log p|C1|q ď 2dpm nq log 48s0pm nq?mn śL ℓ 1 sℓ ϵ 1 Step 2: L2 covers of the network class N1,W relative to the cover elements in C1 First, note that for any admissible U, V, A0, we certainly have o 1 }A0puξo 1, vξo 2q J}2 ď s2 0 1 N }uξo 1}2 }vξo 2}2 s2 0 ÿ i ˆpi}ui}2 s2 0 ÿ ď 2s2 0 min i pi}ui}2 ÿ j qj}vj}2, max i }ui}2 max j }vj}2 ď4s2 0c2, for Č N0,W,c, and (221) 4s2 0c2pmaxpm, nqq for N0,W,c . (222) Thus, for each cover element s U p A0, U, V q P C1, there we can apply Proposition F.8 to show that there is a cover C2ps Uq Ă N1,W such that the following properties are satisfied: 1. For any f P N1,W there exists f P C2ps Uq such that rf fsp A0pui, vjq Jq 2 ď ϵ2{4, (223) log p C2q ď 32s2 0c2 1 ϵ2 1 ȷ RW 2 logp2Wq, (224) where as usual RW is defined by Equation (264). Step 3: L2 cover of N0,W,c and Č N0,W,c We now finally define the cover C Ă N0,W,c via C f s U ˇˇs U P C1, f P C2ps Uq ( , (225) where by abuse of notation, f s U denotes the function g : rms ˆ rns Ñ R1 such that gpi, jq fps U.,i,jq fp A0pui, vjq Jq (where A0, U, V realise the element s U of U and as usual ui and vj denote the ith and jth rows of U and V respectively). Generalization Analysis of Deep Non-linear Matrix Completion We now have that C is an ϵ cover with respect to the L2 norm and the sample ξ1, . . . , ξN. Indeed, for g f s U P N0,W,c, let g f ss U be the associated cover element in C, we have o 1 pg gq2 ξo ď 2 1 f f s U 2 ξo 2 1 f s U ss U ı 2 4 śL ℓ 1 sℓ ı2 ϵ2 (227) where at the first line (226) we have used the AM-GM inequality and at the last line (227) we have used the properties of both covers. This established the validity of the cover (Equation (216)). We now only have to estimate the cardinality to establish Equation (217): log p|Cpϵq|q ď ÿ ˇˇC2ps Uq ˇˇ (228) ď 2dpm nq log 48s0pm nq?mn śL ℓ 1 sℓ ϵ 1 ϵ2 1 ȷ RW 2 logp2Wq (229) ď 2dpm nq 32s2 0c2 1 ϵ2 1 ȷ RW 2 ȷ log 96Ws0pm nq?mn śL ℓ 1 sℓ ϵ 1 where c stands for c if we are covering Č N0,W,c and c a maxpm, nq if we are covering N0,W,c, and (at Equation (229)) we have used Equations (224) and (220). The result follows. Generalization Analysis of Deep Non-linear Matrix Completion F. Variations on Known Results This section compiles some minor variations of known results. We include some proofs both for completeness and because the precise results we need sometimes deviate slightly from the known version. For instance, we often require high probability versions of results which were previously presented in expectation over the training set. Remark: The modification is necessary for incorporation with the neural network bounds of Section E.3 and even for incorporation with the Lipschitz constant bounds of Section C.2 via Lemma E.4 and Lemma E.3. Indeed, although the fact that the construction of the L2 cover of the class N1,W is non constructive and dependent on the sample ξ1, . . . , ξN may not be a strong obstacle to applying an expectation version of Lemma E.4 or Lemma E.3, the supremum over θ1 is a bigger issue. 
Taking the example of Lemma F.1, we need to know that with high probability over the draw of the training set, Equation (231) will be satisfied, allowing us to show that for this particular training set, the inequality in Theorem D.2 holds uniformly over all bounded Lipschitz functions ℓ. F.1. On the Complexity of Classes of Matrices with Nuclear Norm Constraints We will need the following classic Lemma: Lemma F.1 (Cf. (Foygel et al., 2011), proof of Theorem 1, (Tropp, 2012), Remarks 6.4 and 6.5, cf. also (Ledent et al., 2021b)). Consider the matrix RN 1 N řN i 1 σieζi where the σis are Rademacher variables and for all i, j, epi,jq is the matrix with a 1 in the entry pi, jq and zeros in all other entries and the entries are selected i.i.d from a distribution with uniform marginals. With probability greater than 1 δ over the draw of the training set and the draw of the Rademacher variables σ1, . . . , σN, we have 1 N minpm, nq 3N log ˆm n Proof. RN is an average of N i.i.d. matrices Xi σieζi. Thus, we can apply the non commutative Bernstein inequality (Proposition F.20). We have }Xi} ď 1{N with probability 1 for all i, thus M is 1{N. Furthermore, we can compute the ρ2 as follows: E Xi XJ i 1 N 2 } diagpp1, p2, . . . , pmq} 1 m N 2 (232) and E XJ i Xi 1 N 2 } diagpq1, q2, . . . , qnq} 1 n N 2 , (233) where the pis (resp. qjs) denote the marginal probabilities for each row (column), which are uniform by our assumption. Hence ρ2 k is 1 N2 minpm,nq and σ2 1 N minpm.nq. From this, it follows by applying Proposition F.21 that with probability ě 1 δ over the draw of the training set, we have 1 N minpm, nq 3N log ˆm n as expected. Lemma F.2 (Cf. (Foygel et al., 2011), proof of Theorem 1 cf. also (Ledent et al., 2021b)). Consider the matrix RN 1 N řN o 1 σo eξo ? rpξo 1 rqξo 2 where the σos are Rademacher variables, for all i, j, epi,jq is the matrix with a 1 in the entry pi, jq and zeros in all other entries, and as usual. With probability greater than 1 δ over the draw of the training set and the draw of the Rademacher variables σ1, . . . , σN, we have 3N log ˆm n Generalization Analysis of Deep Non-linear Matrix Completion Proof. RN is a sum of N i.i.d. matrices Xo eξo ?pξo 1,ξo 2 . Thus, we can apply the non commutative Bernstein inequality (Proposition F.20). Since rpi ě 1 2m and rqi ě 1 2n for all i, j, we have }Xo} ď 2?mn N with probability 1 for all o, thus M is 2?mn N . Furthermore, we can compute the ρ2 as follows. for any i ď m, we have N 2E Xo XJ o j 1 pi,j 1 rqj ď 2n j 1 pi,j 2npi rpi ď 4n, (237) where as usual the pis (resp. qjs) denote the marginal probabilities for each row (column), which are uniform by our assumption. Similarly, N 2E XJ o Xo i 1 pi,j 1 rpi ď 2m i 1 pi,j 2qj rqj ď 4m (238) Hence ρ2 k can be taken as 4pm nq N2 and σ2 can be taken as 4pm nq N . From this, it follows by applying Proposition F.21 that with probability ě 1 δ over the draw of the training set, we have 3N log ˆm n as expected. For the distribution-free case with the nuclear norm, recall the following theorem: Proposition F.3 (Theorem 2 page 3405 in (Shamir & Shalev-Shwartz, 2011)). Consider the following function class: F1 t : Rmˆn Q Z : }Z} ď t ( . (241) Let l : R ˆ R Ñ R be a function which is uniformly bounded by B and ℓ-Lipschitz w.r.t. the second argument, for any training set ξ1, . . . , ξN P rms ˆ rns we have the following bound on the empirical Rademacher complexity of l F1 t : Eσ sup ZPF1 t o 1 σo lp Zξo, r Goq ď 9C B ℓtp?m ?nq where C is the (absolute) constant in (Latała, 2005). 
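To make the duality step used in the proofs of Propositions F.4 and F.5 below concrete, the following is a minimal numerical sketch (our own illustration, not part of the paper's experiments; all function and variable names are ours). For the nuclear-norm ball, the supremum inside the empirical Rademacher complexity equals t·‖R_N‖, where R_N = (1/N)∑_o σ_o e_{ξ_o} and ‖·‖ is the spectral norm, so the complexity can be estimated by Monte Carlo over the Rademacher and sampling draws and compared against the roughly t/√(N·min(m,n)) behaviour (up to logarithmic factors) predicted under uniform sampling.

```python
import numpy as np

def empirical_rademacher_nuclear_ball(m, n, N, t, num_draws=200, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||Z||_* <= t} (1/N) sum_o sigma_o Z_{xi_o}.

    By nuclear/spectral duality the supremum equals t * ||R_N||_op with
    R_N = (1/N) sum_o sigma_o e_{xi_o}; entries xi_o are drawn uniformly
    with replacement (the uniform-marginals regime).
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_draws):
        rows = rng.integers(0, m, size=N)
        cols = rng.integers(0, n, size=N)
        sigma = rng.choice([-1.0, 1.0], size=N)
        R_N = np.zeros((m, n))
        np.add.at(R_N, (rows, cols), sigma / N)      # R_N = (1/N) sum sigma_o e_{xi_o}
        total += t * np.linalg.norm(R_N, ord=2)      # spectral norm, by duality
    return total / num_draws

# The estimate should be roughly t * sqrt(log(m + n) / (N * min(m, n))) (cf. Lemma F.1).
print(empirical_rademacher_nuclear_ball(m=60, n=80, N=2000, t=1.0))
```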
Proposition F.4 (High probability version of Theorem 1 in (Foygel et al., 2011)). Consider the following function class: F1 t : Rmˆn Q Z : }Z} ď t ( . (243) Assume that the sampling distribution has uniform marginals. For any δ ą 0 we have the following bound on the empirical Rademacher complexity of F1 t with probability ě 1 δ over the draw of the dataset: Eσ sup ZPF1 t o 1 σo Zξo ď 8 3N minpm, nq log ˆm n 3N log ˆm n Proof. This follows immediately (as a particular case) from Proposition F.5 below. Generalization Analysis of Deep Non-linear Matrix Completion Proposition F.5 (High probability version of Theorem 3 in (Foygel et al., 2011)). Consider the following function class: r F1 r : ! Rmˆn Q Z : } r Z} ď ?r ) , (245) where as usual r Z diagp pq Z diagpqqq. For any δ ą 0 we have the following bound on the empirical Rademacher complexity of F1 t with probability ě 1 δ over the draw of the dataset: Eσ sup ZP r F1 r o 1 σo Zξo ď 4 3N log ˆm n 3N log ˆm n Proof. Similarly to the original proof, expanding the definition of the Rademacher complexity, we have Eσ sup ZP r F1 r o 1 σo Zξo Eσ sup ZP r F1 r A RN, diagp a pq Z diagp a ď Eσ sup ZP r F1 r }RN}} diagp a pq Z diagp a ď Eσ}RN} sup ZP r F1 r } r Z} ?r Eσ}RN} 3N log ˆm n 3N log ˆm n where at the second line (247) we have used the duality between the nuclear norm and the spectral norm, and at the last line (248), we have used Lemma F.2. F.2. Computational Lemmas This subsection compiles basic calculations which are useful when translating a high-probability bound into a bound in expectation. Lemma F.6. Let F be a random variable that depends only on the draw of the training set. Assume that with probability ě 1 δ, Ep Fq ď fpδq, (249) for some given monotone increasing function f. Then we have, in expectation over the training set: i 1 fp2 iq21 i, (250) In particular, if fpδq C1 b δ q C2, then we have in expectation over the draw of the training set: Ep Fq ď C1 ? 2C1 C2. (251) Further, if fpδq C3 logp 1 δ q, then we have Ep Fq ď 6C3. (252) Proof. By assumption we have for any δ: P p X ě fpδqq ď δ (253) Let us write Ai for the event Ai t F ď fpδiqu where we set δi 2 i for i 1, 2, ... . We also set Ai Aiz Ai 1 for i 1, 2, ... with the convention that A0 H so that A1 A1. Generalization Analysis of Deep Non-linear Matrix Completion We have, for i ě 2, Pp r Aiq ď Pp Ac i 1q ď δi 1, and for i 1, Pp r A1q ď 1 δi 1. Thus we can write i 1 Ep X| r Aiq Pp r Aiq ď i 1 Ep X| r Aiqδi 1 ď i 1 fpδiqδi 1, (254) yielding identity (250) as expected. Next, assuming fpδq C1 b δ q C2, we can continue as follows: i 1 fpδiqδi 1 ď logp2iqs21 i (255) is21 i ď C1 where at the second line we have used the fact that for any natural number i, ? If we assume that fpδq C3 logp 1 δ q we have instead i 1 fpδiqδi 1 ď i 1 r C3 logp2iqs21 i (257) i 1 i21 i ď C3 2 i 121 i ď 3C3 2 ? 2 ď 6C3 (258) Lemma F.7. Let a, b, s ą 0 be three positive real numbers: we have logp1 absq ď logpp1 aqp1 bsqq ď logp1 aq logp1 p1 bqsq ď logp1 aq logp2p1 bqsq ď logp2p1 aqq s logp1 bq. F.3. Covering Numbers for Neural Networks In this subsection, we collect variations of known results on covering numbers of classes of neural networks. These results are useful in Subsection E.3, where apply them to construct covers of the space of neural embeddings over elements of rms ˆ rns. In line with much of the literature, we consider fully-connected neural networks of the following form: fpxq σL ALσL 1 σL 2 . . . σ1 A1x . . . 
, (259) where the input x P Rd and the output fpxq P R1 and the activations σℓ(for ℓď L) are assumed to be 1-Lipschitz. We write W for the total number of parameters of the network. For a fixed architecture defined by the intermediary widths w1, . . . , w L 1 (so that W řL ℓ 1 w LWL 1), and for a fixed set of constants a1, . . . , a L and s1, s2, . . . , s L and initialization matrices M 1, M 2, . . . , M L we consider the N1,Wps, aq class of networks f that satisfy the following conditions: Aℓ M ℓ J 2,1 ď aℓ @ℓď L and (260) Aℓ ď sℓ @ℓď L. (261) The following is a particular case of (Graf et al., 2022), Theorem C.15 (cf also (Bartlett et al., 2017; Ledent et al., 2021c)) applied to fully-connected neural networks. Generalization Analysis of Deep Non-linear Matrix Completion Proposition F.8 (L2 covering number for N1,W). Assume that χ2 1 N řN o 1 }xo}2 ě 1 and @l, sl ě 1. For any ϵ ą 0, there is an L2 cover Cpϵq of N1,W with the following properties: 1. For any f P N1,W there exists a f in Cpϵq such that ˇˇfpxoq fpxoq ˇˇ2 ď ϵ2. (262) logp|C|q ď R 1 V χ2 RW 2 logp2Wq, (263) o 1 }xo}2. (264) Notes: We omit the improved dependency on the output dimension which can be derived from the techniques of (Ledent et al., 2021c; Wu et al., 2021; Mustafa et al., 2021) since the output of our network is one dimensional. Such techniques can also be easily used to make the above cover uniform over the samples at the cost of additional logarithmic factors. Proposition F.9 (Uniform L8 covering number for N2,W ). Consider the class N2,W ps, aq of fully connected neural networks with the same architecture as those in N1,W, but where the weight matrices only need satisfy the following constraints: }Aℓ} ď sℓ }Aℓ M ℓ} ď al. (265) We further assume that sℓě 1 for all ℓď L. For any 1 ą ϵ ą 0, and any ℓLipschitz loss function l there is an L2 cover Cpϵq of N2,W with the following properties: 1. For any f P N1,W there exists a f in Cpϵq such that for any x P Rd with }x} ď χ, we have: ˇˇl fpxq l fpxq ˇˇ ď ϵ (266) logp|C|q ď W log 6χ ℓ śL ℓ 1 sℓ ı rř ℓaℓs Remark: This result and its proof are very similar to analogous results in (Long & Sedghi, 2020; Graf et al., 2022) (no claim of originality is made here), but the requirement on the cover is stricter than in the Theorem statement in (Graf et al., 2022). The control on the bounds is also looser, but we do not invest too much into the technicalities of obtaining tighter logarithmic factors, since the aim of this section is merely to illustrate how to combine our bounds on Schatten quasi-norm regularized matrices with neural network bounds. Proof. For the sake of completeness, we repeat the main parts of the proof here, which mostly follow (Long & Sedghi, 2020). Note that by a standard argument (see, e.g. pages 4,5 in (Long & Sedghi, 2020)), we have for any sets of matrices A1, . . . , AL and s A1, . . . , s AL satisfying the conditions (261) and corresponding networks f and f: ˇˇfpxq fpxq ˇˇ ď }x} Aℓ s Aℓ (268) Generalization Analysis of Deep Non-linear Matrix Completion with the inequality holding uniformly over any x P Rd. Thus, as long as s Cpϵq is a cover of the space A p A1, . . . , ALq : }Aℓ M ℓ} ď aℓ @ℓ ( with respect to the norm }A} : řL ℓ 1 }Aℓ} with granularity ε : ϵ χ ℓśL ℓ 1 sℓ, the associated cover Cpϵq Ă N2,W gives a uniform ϵ-cover of l N2,W . 
Furthermore, by Lemma F.17 and a doubling argument (to ensure the condition }Aℓ} ď sℓis also satisfied by each element of the cover) such a cover exists with cardinality ď 6χ ℓ śL ℓ 1 sℓ ı rř and the result follows. F.4. A Result on the Estimation of the Marginals Lemma F.10 (Variation on Lemma 2 in (Foygel et al., 2011) and Lemma E.1 in (Ledent et al., 2021b)). For any δ ą 0, if N ě 8pm nq logp pm nq δ q then with probability ě 1 δ, the following holds for all i ď m and j ď n: 2 and ˇqj ě qj Proof. The proof is almost identical to that of Lemma 2 in (Foygel et al., 2011) but we repeat it for completeness as we need our variant with arbitrarily high probability. Note that this lemma is also a particular case of the inductive case from Lemma E.1 in (Ledent et al., 2021b) with identity side information matrices. m (resp. qj ď 1 n), then pi ď 1 m and ˇpi ě 1 2m (resp. qj ď 1 n and ˇqj ě 1 2n). On the other hand, if pi ą 1 n then by a multiplicative Chernoff bound (Lemma F.15), we have for any i ď m: ď exp ˆ Npi ď δ pm nq, (271) where at the last inequality we have used the fact that N ě 8pm nq logp pm nq Similarly, for all j ď n: ď exp ˆ Nqj ď δ pm nq. (272) The result then follows immediately from a union bound. F.5. Basic Covering Numbers, Concentration Inequalities and Classic Results in Learning Theory In this section, we summarize some existing results which are useful to our study. Theorem F.11 (Generalization bound from Rademacher complexity, cf. e.g., (Bartlett & Mendelson, 2001), (Scott, 2014), (Guermeur, 2020) etc.). Let Z, Z1, . . . , ZN be iid random variables taking values in a set Z. Consider a set of functions F P r0, 1s Z. @δ ą 0, we have with probability ě 1 δ over the draw of the sample S that @f P F, Epfp Zqq ď 1 i 1 fpziq 2 RSp Fq 3 where RSp Fq can be either the empirical or expected Rademacher complexity. In particular, if f P arg minf PF Epfp Zqq and ˆf P arg minf PF 1 N řN i 1 fpziq, then Ep ˆfp Zqq ď Epf p Zqq 4 RSp Fq 6 Generalization Analysis of Deep Non-linear Matrix Completion Recall the following theorem on the covering number of classes of Lipschitz functions. Proposition F.12 (Covering number of Lipschitz function balls, see (von Luxburg & Bousquet, 2004), Theorem 17 page 684, see also (Tikhomirov, 1993)). Let X be a connected and centred metric space (i.e. for all A Ă X with diamp Aq ď 2r there exists an x P X such that dpx, aq ď r for all a P A). Let B denote the set of 1-Lipschitz functions from X to R which are uniformly bounded by diamp Xq. For any ϵ ą 0 we have the following bound on the covering number of the class B as a function of the covering number of X: N p B, ϵ, }.}8q ď ˆR2 diamp Xq V 1 ˆ 2Np X, ϵ 2 ,dq. (273) Applying the above to the d dimensional euclidean space, we immediately obtain: Proposition F.13. Let }.}max denote max norm on Rd, i.e. }x}d maxi |xi|. Let Flip,Lf,Bf denote the set of Lf-Lipschitz functions from r B0, B0sd to R. We have the following bound on the covering number of Flip,Lf,Bf with respect to the L8 (uniform) norm on functions: log Np Flip,Lf,Bf q ď log ˆR4 B0 Lf V 1 R2 B0 Lf V 1 ȷd logp2q (274) ď 3 R2 B0 Lf V 1 ȷd (275) Proof. W.l.o.g, let Lf 1. Then, take the following ϵ{2 cover of r B0, B0sd: rr B0, B0s X ϵZsd, which has cardinality less than P 2 B0 Proposition F.14 (Massart s finite class lemma). Let A Ă RN be a finite class of functions from r Ns to R. Let r maxu PA }u}2. 
We have the following bound on the Rademacher complexity of A over the sample r Ns: 1 N sup u PA 2 log #p Aq Lemma F.15 (Multiplicative Chernoff bound, well known, Cf e.g. Corollary 13.3 (pp. 13-3 and 13-4) in (Sinclair). i.i.d. case originally from (Angluin & Valiant, 1979), Proposition 2.4 p 158, cf. also (Boucheron et al., 2004) (exercise 2.10 on p. 48) and (Hagerup & R ub, 1990)). Suppose X1, . . . , XN are independent random variables taking values in t0, 1u. Let X řN i 1 Xi denote their sum. For any δ ą 0 we have Pp X ě p1 δq Ep Xqq ď exp ˆ δ2Ep Xq In addition, for all 0 ă δ ă 1 we have Pp X ď p1 δq Ep Xqq ď exp ˆ δ2Ep Xq We will need the following further consequence in our analysis: Corollary F.16. Let m P N and ηi be independent categorical variables on the domain t1, 2, . . . , mu. For all j ď m let us write Xj : řN i 1 1pηi jq for the number of ηis which assume the value j. For any 0 ă δ ă 1 have P Dj, s.t. Xj ď p1 δq Ep Xjq ď m exp ˆ Nδ2µ where µ minj Ep Xjq. Lemma F.17 (Lemma 8 in (Long & Sedghi, 2020)). The (internal) covering number N of the ball of radius κ in dimension d (with respect to any norm }.}) can be bounded by: ϵ 1 d (280) Generalization Analysis of Deep Non-linear Matrix Completion Recall the following result from (Mazumder et al., 2010): Lemma F.18 (Lemma 6 in (Mazumder et al., 2010)). For any matrix X P Rmˆn, the following holds: }X} min A,B,ABJ X 1 2 }A}2 Fr }B}2 Fr . (281) Recall the following useful result, which is an immediate consequence of the Mc Diarmid inequality. A similar result was presented in (Bartlett & Mendelson, 2001) (cf. Theorem 11 page 469) for the expected Rademacher complexity. See also (Ledent et al., 2021b) for the exact result. Lemma F.19. For any fixed x1, . . . , x N and any function class F mapping to r 1, 1s we have with probability ě 1 δ over the draw of the Rademacher variables σ1, . . . , σN, ˇˇˇˇˇsup f PF i 1 σifpxiq p Rpx1,...,xnqp Fq Proposition F.20 (Non commutative Bernstein inequality, Cf. (Recht, 2011)). Let X1, . . . , XS be independent, zero mean random matrices of dimension m ˆ n. For all k, assume }Xk} ď M almost surely, and denote ρ2 k maxp}Ep Xk XJ k q}, }Ep XJ k Xkq}q and ν2 ř k ρ2 k. For any τ ą 0, ď pm nq exp τ 2{2 řS k 1 ρ2 k Mτ{3 Proposition F.21 (High probability version of Bernstein inequality). Let X1, . . . , XS be independent, zero mean random matrices of dimension m ˆ n. For all k, assume }Xk} ď M almost surely, and denote ρ2 k maxp}Ep Xk XJ k q}, }Ep XJ k Xkq}q. Writing σ2 řS k 1 ρ2 k, for any δ ą 0, we have, with probability greater than 1 δ: Proof. From Proposition F.20, we can make the following conclusions splitting into two cases depending on whether Mτ ď σ2 or Mτ ě σ2: If Mτ ď σ2, we have, with probability ě 1 δ: Similarly, if Mτ ě σ2, we have, with probability ě 1 δ: Thus, in all cases, we certainly have, with probability greater than 1 δ: as expected. F.6. Deterministic Results In this subsection, we show how some of the most popular recent neural network models indirectly contain Schatten norm regularized matrix components. Generalization Analysis of Deep Non-linear Matrix Completion One of the first descriptions of non-linear matrix factorization methods appears in (Dziugaite & Roy, 2015), which describes the following very general class of models: gpi, jq fθ u1 i , v1 j , u2 i v2 j , . . . , um where denotes an element wise product, fθ is a trainable neural network and u1, . . . , um (resp. v1, . . . , vm) are low dimensional row (resp. column) embeddings. 
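For concreteness, the following is a minimal TensorFlow 2 sketch (our own illustration; the layer sizes, names and depth are arbitrary choices, and this is not the architecture used in our experiments) of a predictor of this form: several pairs of row/column embeddings are combined by element-wise products, concatenated, and passed through a small trainable network f_θ.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_nnmf(num_rows, num_cols, emb_dim=20, num_pairs=2, hidden=64):
    """NNMF-style predictor: g(i, j) = f_theta(u^1_i * v^1_j, ..., u^k_i * v^k_j)."""
    row_id = layers.Input(shape=(), dtype=tf.int32, name="row_id")
    col_id = layers.Input(shape=(), dtype=tf.int32, name="col_id")

    products = []
    for k in range(num_pairs):
        u = layers.Embedding(num_rows, emb_dim, name=f"row_emb_{k}")(row_id)
        v = layers.Embedding(num_cols, emb_dim, name=f"col_emb_{k}")(col_id)
        products.append(u * v)                  # element-wise product of the k-th embedding pair

    x = layers.Concatenate()(products)          # input to the trainable network f_theta
    x = layers.Dense(hidden, activation="relu")(x)
    out = layers.Dense(1, name="prediction")(x)
    return Model(inputs=[row_id, col_id], outputs=out)

# Trained on observed (i, j, rating) triples with the square loss.
model = build_nnmf(num_rows=1000, num_cols=1500)
model.compile(optimizer=tf.keras.optimizers.Nadam(), loss="mse")
```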
A particularly famous example of a more specific architectural variation is the model presented in (He et al., 2017), which has achieved state of the art performance in various recommender systems datasets. From an architectural perspective, the model can be described as follows. gpi, jq @ concatpu1 i v1 j, f1pui, vjqq, concatpd, d1q D , (289) where f1 is a neural network with a mutli-dimensional output, and concatpd, d1q is a vector representing the last linear layer. Since then, further models have been proposed which rely on expressing the set of observed entries as a graph through which one can propagate the embeddings (He et al., 2020; Zhang & Chen, 2020; Mao et al., 2021), to cite but a few. In Corollary F.23 below, we show that a global minimum of (1) can always be attained with the additional constraint that D1, . . . Dd 2 be diagonal matrices. In particular, this further cements the validity of the Schatten norm as a regularizer: the models (288) and (1) also hide implicit Schatten norm regularized components. For instance, the model in (He et al., 2017) is in fact equivalent to gpi, jq Zi,j Ψpi, jq, (290) where Ψpi, jq x D1, f2pui, vjqy is a neural network encoding derived from f1 with an additional linear layer, and the matrix Z satisfies Zi,j xpui vjq, dy u J i diagpdqv J j so that Z UDV J for D diagpdq where the rows of U (resp. V ) collect the row (resp. column) embeddings. Thus, imposing L2 regularization (which is arguably implicitly present in popular optimization schemes such as gradient descent) on the parameters d, U, V is equivalent to imposing regularization on the Schatten quasi-norm of Z with p 2 We first recall the following reformulation of Theorem 1 in (Dai et al., 2021), which can be interpreted as a generalization of Lemma F.18 (i.e. Lemma 6 in (Mazumder et al., 2010)): Theorem F.22 (Theorem 1 in (Dai et al., 2021)). Let Z P Rmˆn, for any integers d P N and o P N with o ě minpm, nq we have min Wd PRoˆn,W1PRmˆo Wk PRoˆo@k 1,d k 1 }Wk}2 Fr u 1 σ2{d u d}Z}2{d sc,2{d, (291) where r rankp Zq, the σus are the singular values of Z and }.}sc,p is the p-Schatten quasi-norm. From this, we obtain the following result: Corollary F.23. Let Z P Rmˆn, for any integers d P N and o P N with o ě minpm, nq and d ě 2 we have min APRmˆo,BPRnˆo @kďd 2, Dk diagpdkq, dk PRo }A}2 Fr }B}2 Fr k 1 }Dk}2 Fr k 1 σ2{d k d}Z}SC 2{d, (292) where the minimum runs over all matrices A P Rmˆo, B P Rnˆo and D1, . . . , Dd 2 P Q where Q is either the set of all matrices in Roˆo or the set of all diagonal matrices in Roˆo. (As in Theorem F.22 r rankp Zq, the σus are the singular values of Z and }.}sc,p is the p-Schatten quasi-norm.) In particular, for d 3, we have min APRmˆo,BPRnˆo, Roˆo D diagpdq,d PRo }A}2 Fr }B}2 Fr }D}2 Fr ˇˇˇˇˇADBJ Z k 1 σ2{3 k 3}Z}2{3 sc,2{3. (293) Generalization Analysis of Deep Non-linear Matrix Completion Proof. We prove the theorem in two directions. This follows immediately from Theorem F.22, since our set of admissible factor matrices p A, D1, . . . , dd 2, BJq is a subset of the set of admissible p W1, W2, . . . , Wdq in Theorem F.22 (as a result of the additional constraint that the matrices D1, . . . , Dd 2 must be in Q). This follows by constructing a set of matrices A, B, D1, . . . , Dd 2 which achieves the minimum whilst satisfying the strictest constraint that D1, . . . , Dd 2 are diagonal matrices. For this, let Z UΣV J be the singular value decomposition of Z with additional dimensions chosen such that U P Rmˆo, Σ P Roˆo, V P Rnˆo. 
We now choose: Dk Σ1{d p@k ď d 2q. (294) It is clear that BJ UΣ1{dΣpd 2q{dΣ1{d V J UΣV J Z. In addition, we by the invariance of the Frobenius norm to rotations, we also have }A}2 Fr }B}2 Fr k 1 }Dk}2 Fr }UΣ1{d}2 Fr }V Σ1{d}2 Fr k 1 }Σ1{d}2 Fr }Σ1{d}2 Fr }Σ1{d}2 Fr k 1 }Σ1{d}2 Fr d}Σ1{d}2 Fr k 1 σ2{d k d}Z}SC 2{d RHS, (295) as expected. The result follows. We have the following immediate variant of Corollary F.23, which shows how we can add the weights to our regularizers in our practical experiments. Corollary F.24. Let ˇp P Rm and ˇq P Rn be arbitrary vectors and let Z P Rmˆn, for any integers d P N and o P N with o ě minpm, nq and d ě 2 we have min APRmˆo,BPRnˆo @kďd 2, Dk PQ } q A}2 Fr } q B}2 Fr k 1 }Dk}2 Fr k 1 σ2{d k d} q Z}SC 2{d, (296) where q A : diagp?ˇpq A, q B : diagp?ˇqq B, q Z diagp?ˇpq Z diagp?ˇqq and the minimum runs over all matrices A P Rmˆo, B P Rnˆo and D1, . . . , Dd 2 P Q where Q is either the set of all matrices in Roˆo or the set of all diagonal matrices in Roˆo. Generalization Analysis of Deep Non-linear Matrix Completion G. Extention: Multiple Latent Matrices In this section, we develop tools to extend some of our results to situations where there are multiple latent matrices, possibly with different Schatten indices. This section is mostly illustrative: the models themselves are quite complicated, making the bounds less meaningful than in other sections. In addition, the dependence on the number of latent factors obtainable with the techniques below is quite strong. However, the tools developed show a general strategy which could be used for a broad class of similar models. G.1. Extension: Multi-latent Lipschitz Decomposition Lemmas In this subsection, we prove some new results, analogous both Talagrand s concentration lemma and Dudley s entropy theorem, aimed at bounding the Rademacher complexity of ℓp F1, . . . , Fmq where Fv (for v ď m) are function classes and ℓis Lipschitz. The aim is to be able to bound the Rademacher complexity of F even if a covering number is not available for any of the Fv. We begin by remind the reader of the following classic. Lemma G.1 (Talagrand contraction lemma (cf. (Ledoux & Talagrand, 1991) see also (Meir & Zhang, 2003) page 846)). Let g : R Ñ R be 1-Lipschitz. Consider the set of functions tfipθq, i ď Nu (on t1, 2, . . . , Nu) depending on a parameter θ P Θ. For any function cpx, θq where x P X and any probability distribution on X, we have EσEX sup θPΘ i 1 σigpfipθqq ď EσEηEX sup θPΘ i 1 σifipθq where the σis are i.i.d. Rademacher variables. We then present our first extension of the above: Lemma G.2. Let g : Rm Ñ R be a function satisfying the following Lipschitz condition: ˇˇgpy2q gpy1q ˇˇ ď ÿ ˇˇy2 k y1 k ˇˇ λk (298) with řm k 1 λk ℓ. Consider f 1 i pθq ( , f 2 i pθq ( , . . . , tf m i pθqu, i ď N functions (on t1, 2, . . . , Nu) depending on a parameter θ P Θ. Define the function g on t1, 2 . . . , Nu by gipθq gpf 1 i pθq, f 2 i pθq, . . . , f m i pθqq for all i ď N. For any function cpx, θq where x P X and any probability distribution on X, we have EσEX sup θPΘ i 1 σigipθq ď EσEηEX sup θPΘ i 1 σif ηi i pθq where the σis are i.i.d. Rademacher variables and the ηi are i.i.d. categorical variables on the domain t1, 2, . . . , mu with corresponding probabilities λ1{ ℓ, . . . , λm{ ℓ. N.B.: The case m 1 is the standard Talagrand concentration Lemma. Proof. W.l.o.g., we assume ℓ 1. The proof is by induction over N. The initial case N 1 certainly holds. Assume it holds for N, we will show it holds for N 1. 
Generalization Analysis of Deep Non-linear Matrix Completion Eσ1,...,σN 1EX sup θPΘ i 1 σigipθq ď Eσ1,...,σN EX sup θ1,θ2PΘ # cp X, θ1q cp X, θ2q i 1 σi gipθ1q gipθ2q 2 g N 1pθ1q g N 1pθ2q Eσ1,...,σN EX sup θ1,θ2PΘ # cp X, θ1q cp X, θ2q i 1 σi gipθ1q gipθ2q 2 |g N 1pθ1q g N 1pθ2q| ď Eσ1,...,σN EX sup θ1,θ2PΘ # cp X, θ1q cp X, θ2q i 1 σi gipθ1q gipθ2q λj|f j N 1pθ1q f j N 1pθ2q| 2 ď Eσ1,...,σN EX j 1 sup θ1,θ2PΘ # λjcp X, θ1q λjcp X, θ2q i 1 σiλj gipθ1q gipθ2q 2 ℓ λj|f j N 1pθ1q f j N 1pθ2q| 2 Eσ1,...,σN EX j 1 sup θ1,θ2PΘ # λjcp X, θ1q λjcp X, θ2q i 1 σiλj gipθ1q gipθ2q 2 ℓ λj f j N 1pθ1q f j N 1pθ2q ı ď Eσ1,...,σN EXEσN 1 j 1 sup θPΘ # λjcp X, θq i 1 σi λjgipθq ℓ σN 1λjf j N 1pθq ď Eσ1,...,σN EXEσN 1E ηPrms N η λ sup θPΘ i 1 σif ηi i pθq j 1 σN 1λjf j N 1pθq Eσ1,...,σN,σN 1EXE ηPrms N 1 η λ sup θPΘ i 1 σif ηi i pθq where the line (300) follows from the induction hypothesis. This completes the proof. Using this, we then obtain the following result. Proposition G.3. Let g : Rm Ñ R be a function satisfying the following condition: ˇˇgpy2q gpy1q ˇˇ ď ÿ ˇˇy2 k y1 k ˇˇ λk (301) with řm k 1 λk ℓ. Consider m function classes F1, . . . , Fm from r Ns to r B, Bs and define G : g : x Ñ gpxq gpf 1pxq, f 2pxq, . . . , f mpxqq : f1 P F1, . . . , fm P Fm ( . We have the following bound on the Rademacher complexity of G: Generalization Analysis of Deep Non-linear Matrix Completion p Rp Gq Eσ sup f1,...,fm o 1 σogpf 1poq, f 2poq, . . . , f mpoqq (302) ď B ℓ13m logp2m Nq ℓ max kě Nλj p RN,kp Fjq, (303) where p RN,kp Fjq : E CĂr Ns |C| k EσPt 1,1u C supf PFj 1 o PC σofpoq. Proof. By Lemma G.2 we have Rp Gq ď ℓEσEη sup f1,...,fm o 1 σof ηipoq (304) ℓEσEη sup f1,...,fm Eoσof ηipoq, (305) where the expectation over o runs over the uniform distribution over r Ns. Now, let Cj ti : ηi ju. By Corollary F.16, we have that with probability greater than 1 δ{2 over the draw of the variables η1, . . . , ηN, 2 ℓ for all j s.t. λj ě γ : 8ℓlogp 2m δ q N . (306) Similarly, by a Chernoff bound (see Lemma F.15), w.p. ě 1 δ{2, the following holds for all j ď m: Thus, with overall probability greater than 1 δ over the draw of the ηis, both Equations (306) and (307) hold under their respective constraints, which implies that we can continue the calculation from Equation (305) as follows, where we write J1 (resp. J2) for the sets of indices satisfying (resp. not satisfying) the second equation in Equation (306) Rp Gq ď ℓEσEη sup f1,...,fm Eoσof ηipoq ď ℓEσEη sup f1,...,fm N Eo PCjσof ηipoq ď ℓEσEη sup f1,...,fm N Eo PCjσof jpoq ℓEσEη sup f1,...,fm N Eo PCjσof jpoq j PJ1 EσEη sup f j |Cj| N Eo PCjσof jpoq ÿ j PJ2 ℓEσEη sup f j |Cj| N Eo PCjσof jpoq ď δ B ℓ B ℓ ÿ ℓ max kě Nλj 2ℓ E Cj : |Cj | k Eσ sup f j Eo PCjσof jpoq (308) 8 logp2m Nq 8 logp2m Nq N 3 log p2m Nq ℓ max kě Nλj p RN,kp Fjq (309) ď B ℓ13m logp2m Nq ℓ max kě Nλj p RN,kp Fjq (310) where at line (309), we have simply set δ 1 N and replaced the definition of RN,kp Fjq. The result follows. Generalization Analysis of Deep Non-linear Matrix Completion Finally, the next corollary is the result we need for our analysis of NNm Sd(+NN). Corollary G.4. Let g : Rm Ñ R satisfy the conditions of Proposition G.3 and consider m functions classes F1, . . . , Fm from X to r B, Bs. As in Proposition G.3, define G : g : x Ñ gpxq gpf 1pxq, f 2pxq, . . . , f mpxqq : f1 P F1, . . . , fm P Fm ( . 
Assume that the Rademacher complexities of the individual function classes Fj satisfy the following inequality for any k ď N: Rp Fjq EXEσ sup f PFj o 1 σofpxoq ď where Rjpkq is an increasing function of k for all j ď m. We have the following bound on the Rademacher complexity of the function class G: Rp Gq EXEσ sup g PG o 1 σogpxoq ď B ℓ13m logp2m Nq Proof. By Proposition G.3 for any sample x1, . . . , x N we have p Rp Gq ď B ℓ13m logp2m Nq ℓ max kě Nλj p RN,kp Fjq. (312) Taking expectations with respect to the sample x1, . . . , x N on both sides we obtain: Rp Gq ď B ℓ13m logp2m Nq ℓ max kě Nλj 2ℓ EX p RN,kp Fjq. (313) Since the distribution of a uniformly random subset of size k of tx1, . . . , x Nu where the xos are drawn i.i.d. from X is distributed as k i.i.d. samples from X, we have EX p RN,kp Fjq EX p Rkp Fjq ď Plugging this back into Equation (313) we obtain: Rp Gq ď B ℓ13m logp2m Nq ℓ max kě Nλj ď B ℓ13m logp2m Nq ď B ℓ13m logp2m Nq ď B ℓ13m logp2m Nq where at Line (317) we have used the fact that γ : 8ℓlogp2m Nq N (cf. Equation (306) with δ 1{N, or the end of the statement of Proposition G.3) and at the last line (318), we have used the Cauchy-Schwarz inequality. G.2. Extension: an Example Generalization and Excess Risk Bound for a Composite Model with Multiple Latent Matrices In this subsection, we prove an additional result for a model which takes several latent matrices as input. In this preliminary version, we allow a factor of the embedding size of the neural network embeddings in the final bounds and reserve removing this factor to future work. We consider the following function class: H : N2,W pa1, s1q pconcatm v 1p r Fpv rv,B0q, N1,W,idpa, sqq (319) Generalization Analysis of Deep Non-linear Matrix Completion where the input of the network Ψ P N1,W,idpa, sq is a concatenation of a user ID and an item ID: N1,W,id being the class of matrices in Rmˆn which can be represented as ϕpxξq where ϕ is a network form (259) satisfying Conditions (265) and xi,j : concatpei, ejq. In this subsection, we assume that sℓ, aℓ, s1 ℓ, a1 ℓ, B0, B ě 1. Theorem G.5. Define ˆg P arg min p Eplpgξ, r G, ξqq : g P H and g P arg min Eplpgξ, r G, ξqq : g P H . Define sr řm v 1 rv and assume that sℓ, aℓ, s1 ℓ, a1 ℓ, B0, B ě 1. With probability greater than 1 δ over the draw of the training set we have p Rp Hq, sup g PH Eplpg, r G, ξqq p Eplpg, r G, ξqq, Eplpˆg, r G, ξqq Eplpg , r G, ξqq ď s R, (320) where S1 śL ℓ 1 s1 ℓ, and the r O notation hides polylogarithmic factors of m, n, N, m, B0, B, ℓ, ś ℓa1 ℓ (in particular, the depth L is considered constant). Remark: Note that the dependence on m is very strong. In particular, the last term alone contributes a sample complexity of r Opm3q. This is due to the need to bound the Lipschitz constant of the network Ψ in each dimension individually, resulting in an additional factor of m outside the square root in the last two terms. We leave the delicate question of mitigating this dependence, perhaps via an improved version of Corollary G.4, to future work. Proof. Similarly to the proof of Theorem C.8, we use Lemma E.3 to join the Rademacher complexities of r Fpv rv,B0 for all values of v with the covering numbers of the classes N1,W,idpa, sq and N2,W pa1, s1q. 
To that aim, we begin by using Proposition F.9 with l Id, ℓ 1 and χ śL ℓ 1 sℓ m B0, and ϵ Ð ϵ 2 to obtain a cover of C1 Ă N2,W pa1, s1q such that for any ϕ P N2,W there exists a sϕ P C1 such that for any y P R1 with }y} ď χ we have |pϕ sϕqpyq| ď ϵ logp|C1|q ď W 1 log 12 śL ℓ 1 sℓ B0 m ı śL ℓ 1 s1 ℓ ı rř ď W logpΓW,mq, (324) after setting ϵ 1{N and ΓW,m : 12N śL ℓ 1 sℓ m B0 ı śL ℓ 1 s1 ℓ ı śL ℓ 1 sℓ ı rř ℓaℓs 1. Next we can invoke Proposition F.9 again to construct a cover C2 of N2,W pa, sq such that for any Ψ P N2,W pa, sq there exists a sΨ P C2 such that ˇˇΨpξq sΨpξq ˇˇ ď ϵ 2 śL ℓ 1 s1 ℓ ı (325) logp|C2|q ď W 1 log 12 śL ℓ 1 sℓ śL ℓ 1 s1 ℓ ı rř ℓaℓs ď W 1 logpΓW,mq, (326) where we set ϵ 1 N . Note that for any fixed set of matrices Zv P r Fpv rv,B0 (v ď m), we have ˇˇϕp Z1, . . . , Zm, Ψq sϕp Z1, . . . , Zm, sΨq ˇˇ (327) ď ˇˇϕp Z1, . . . , Zm, Ψq sϕp Z1, . . . , Zm, Ψq ˇˇ ˇˇsϕp Z1, . . . , Zm, Ψq sϕp Z1, . . . , Zm, sΨq ˇˇ Generalization Analysis of Deep Non-linear Matrix Completion where at the last line we have used Equations (322) and (325). Thus, we are in a position to apply Lemma E.3 with ϵ 1 N to obtain: p Rp Hq ď 1 N sup sϕ, sΨ p R ϕ pconcatp r Fpv rv,B0q, Ψq B logp|C1p1{Nq||C2p1{Nq|q Note that by Theorem D.4, and Lemma F.6, we have for any N 1 ě 9pm nq, Eξ p RN1p r Fpv rv,B0q ď 7 B0 2 1 N 1 88 B0 N 1 logpΓq, (330) where Γ : 6Nmn3r B0 1s 1, and thus for any N 1, Rv ď 2 ˆ 882 log2pΓq B0 2 rvpm nq p B0 2 1q . (331) Thus by Corollary G.4 we have, for any ℓ śL ℓ 1 s1 ℓ ı -Lipschitz function G : Rm Ñ R: Rp Gpconcatm v 1p r Fpv rv,B0qqq ď B0 ℓ ff 13 m2 logp2 m Nq ff 13 m2 logp2 m Nq N 176 logpΓq ℓ m2 B0 2 srpm nq m3p B0 2 1q N and therefore, for any δ, with probability greater than 1 δ we have p RNp Gpconcatm v 1p r Fpv rv,B0qqq ď R O By a union bound, inequality (334) below holds with probability ě 1 δ over all the choices of G given by Gpxq sϕpx, sΨq for sϕ P C2 and sΨ P C1: p RNp Gpconcatm v 1p r Fpv rv,B0qqq ď R O logp|C1||C2|q Plugging this back into Equation (329) (after setting ϵ 1 N ) we obtain (w.p. ě 1 δ) p Rp Hq ď 1 logp|C1||C2|q p W W 1q log ΓW,m R : s R, (336) as expected. H. Additional Experimental Details To assess the proposed methodology in this paper and compare it with related matrix completion approaches, we conducted experiments using both synthetic and real-world datasets. On the one hand, the synthetic experiments were designed to evaluate the performance of the methods and related mechanisms by varying proportions of observed entries in the incomplete matrix targeted for completion. On the other hand, the real-world datasets were employed in practical scenarios to gain insights into the methods behavior in real-world applications of matrix completion. Generalization Analysis of Deep Non-linear Matrix Completion Figure 3: Summary of the results from the synthetic data experiments with ground truth was generated by considering fpxq x. H.1. Experiments on Synthetic Data As described in the main paper, we generate synthetic square data matrices in Rnˆn with a given rank r. For each matrix, we vary the proportion of observed entries (%obs=Er N{n2s) as well as performed non-uniform sampling distribution. Regarding the proportion of observed entries %obs, we explore values in the set t0.08, 0.10, . . . , 0.20, 0.25, . . . , 0.40u. Concerning the sampling distribution, we divide the matrix into four equal-sized regions of size n{2 ˆ n{2. In the first region, the probability of entry sampling equals α. In regions 2 and 3, it is 2α, and in region 4, it is 4α. 
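As an illustration, the following is a small NumPy sketch (ours; the assignment of the four blocks to regions 1–4 is an arbitrary labelling) of this non-uniform sampling scheme: the four n/2 × n/2 blocks receive relative weights α, 2α, 2α and 4α, the weights are normalized into an entry-sampling probability matrix, and N index pairs are drawn from it.

```python
import numpy as np

def block_sampling_probabilities(n):
    """Four n/2 x n/2 regions with relative sampling weights alpha, 2*alpha, 2*alpha, 4*alpha."""
    h = n // 2
    w = np.ones((n, n))        # region 1: weight alpha (up to normalization)
    w[:h, h:] = 2.0            # region 2: weight 2*alpha
    w[h:, :h] = 2.0            # region 3: weight 2*alpha
    w[h:, h:] = 4.0            # region 4: weight 4*alpha
    return w / w.sum()         # normalizing fixes alpha so the probabilities sum to 1

def sample_entries(P, N, seed=0):
    """Draw N (row, column) index pairs i.i.d. from the probability matrix P."""
    rng = np.random.default_rng(seed)
    flat = rng.choice(P.size, size=N, p=P.ravel())
    return np.unravel_index(flat, P.shape)

P = block_sampling_probabilities(n=100)
rows, cols = sample_entries(P, N=int(0.20 * 100 * 100))   # e.g. %obs = 0.20
```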
For a given function f, the generation procedure is described as follows:
1. Randomly generate matrices U ∈ R^{n×r} and V ∈ R^{n×r} with each entry (i, j) sampled from a normal distribution N(0, 1).
2. Set R = UV^⊤ and rescale the product as R ← n · R/‖R‖_Fr.
3. Apply the function f element-wise to R and return the resulting matrix as the ground truth.

We generated 25 independent instances via the above procedure. For each matrix, we varied the proportion of observations accordingly.
Remark: We observe that the number of degrees of freedom of an n × n matrix of rank r is nr, which (up to logarithmic terms) is loosely connected to the sample complexity. Consequently, the proportion of observed entries necessary for the prediction task to be (statistically) feasible is linked to the choice of n and r. In our synthetic experiments, we opt for smaller matrices due to the high number of simulations. Therefore, we set n = 100 and r = 3 in line with the above observation.
As choices for f, we considered the identity function g(x) = x and the sigmoid function σ(x) = 1/(1 + e^{−x}).
Further Results: Figure 3 provides detailed results of our synthetic data experiments when the ground truth function is the identity. In this case, traditional matrix completion methods perform similarly to ours: the unneeded additional representative power does not hurt performance, since it is small enough to come with negligible additional sample complexity as per Theorem C.6.
Validation, Optimization and Hardware Specifications: For all synthetic data experiments, we employed a proportion %obs of the data for training (specified in each case), allocating 10% for validation, and utilizing the remaining portion as the test set. In the validation procedure, we selected the regularization parameter from the set λ ∈ {10^{−7}, 10^{−6}, . . . , 10^{2}} and fixed the size of the embeddings to 20. We optimized all models with ADAM using Nesterov optimization through TensorFlow 2. We allowed a maximum of 100 epochs with early stopping and a patience of 5 on the validation loss, returning the best weights. Regarding hardware, all synthetic experiments were executed on a CPU cluster with 128 threads and 512GB of RAM.

H.2. Experiments on Real-world Data
Validation, Optimization and Hardware Specifications: As in the synthetic data experiments, we optimized the models using ADAM with Nesterov optimization in TensorFlow 2. We set a maximum of 100 epochs with early stopping, employing a patience of 15 on the validation loss for all models; the best weights were selected during training. Real-world experiments were conducted on Nvidia DGX-A100 graphics cards with 40GB of GPU RAM. Regarding our validation procedure, we randomly split the observed entries uniformly, resulting in 90% for training and 5% each for validation and test. The parameter λ was selected from a sequence exponentially distributed between 10^{−7} and 10^{2}. For DOUBAN and LASTFM, this sequence has size 50, and for MVL25, it has size 15. For all datasets and methods, we fixed the embeddings to have size 128.

H.3. Datasets
DOUBAN [m = 2718, n = 34893, %obs = 1.2%]: Douban serves as a platform for users to curate movies. Within this matrix, users are interconnected within the social network, and the items represent movies. User ratings, ranging from 0.5 to 5 (in intervals of 0.5), are denoted by the entry (i, j), corresponding to the rating of user i for movie j.
LASTFM [m = 1892, n = 17632, %obs = 0.27%]: Last.fm profiles users' musical preferences and habits. In contrast to the other datasets, entries (i, j) in this matrix signify the log-scaled number of views user i has for band/artist j.
MVL25 [m = 162541, n = 57971, %obs = 0.27%]: The MovieLens 25M dataset, a widely adopted and stable benchmark dataset, originates from a non-commercial movie recommendation website. Similar to Douban, entries (i, j) here indicate the rating of user i for movie j, but on a scale from 1 to 5.

I. More Detailed Related Works
Note: in Matrix Completion, by "approximate recovery", we mean results which bound the excess risk by a function of architectural parameters and the number of samples, with a decay rate typically of the form 1/√N (but sometimes 1/N, or 1/N^{1/4} if the bound is expressed in terms of the Frobenius norm error rather than the square loss). For instance, in the realisable case, if the noise is independent of the entries and has standard deviation ε and the loss function is the square loss, this means that the normalized Frobenius norm of the error scales like ε²√(R/N), where R is some architectural quantity. By "exact recovery", we mean results which guarantee that the ground truth matrix is recovered exactly with high probability when the number of samples N is large enough, as long as there is no noise in the observations. By "perturbed recovery", we mean results which guarantee that for large enough N, an error of the type ε√(R̄/N) is achievable with high probability for some other architectural quantity R̄. The quantity R̄ typically has a much worse dependence on architectural parameters than the quantity R, and as long as that is the case, approximate recovery and perturbed/exact recovery are not subordinate to each other, even if we ignore the minor differences in the optimization problem and sampling regime. To the best of our knowledge, the only existing result which achieves the extremely impressive task of providing a perturbed recovery result where the architectural dependence of R̄ is as tight as that of R in competing approximate recovery results is (Chen et al., 2020), which only deals with the nuclear norm (p = 1) and does not include non-linearities. In addition, like all exact and perturbed recovery results we are aware of, the results in (Chen et al., 2020) are limited to the uniform sampling case. Our results concern approximate recovery with the Schatten (quasi-)norm, but we still compare to some exact and noisy recovery results for illustrative purposes.
Approximate Recovery in Matrix Completion: There is a large literature on the sample complexity of matrix completion with bounded Lipschitz losses and norm constraints. In particular, our work takes much inspiration from the pioneering works of (Foygel et al., 2011) and (Shamir & Shalev-Shwartz, 2011; 2014), which proved analogues of our results (without a learnable function) in the case p = 1. The explicitly rank-restricted case was studied in classification settings in (Srebro et al., 2004; Srebro & Shraibman, 2005; Srebro & Jaakkola, 2005). In general, the sample complexity is Õ(rn).
Alternative Learning Settings and Models: There is also a substantial amount of work on other soft relaxations of the rank, such as the max norm. In particular, the early work of (Srebro & Shraibman, 2005) shows a sample complexity of Õ(n M²), where M is a constraint on the max norm.
A perturbed recovery result was achieved for the max norm in the classic work of (Cai & Zhou, 2016), which was further extended in (Wang et al., 2021) to provide bounds on the uniformly weighted Frobenius error of the recovered matrix in the non-uniform sampling regime (under an approximate uniformity assumption on the sampling probabilities). With nuclear norm regularizers, other works which provide uniform Frobenius error bounds without uniform sampling include the missing-not-at-random setting (Ma & Chen, 2019; Sportisse et al., 2020), which adopts a Bayesian approach. Further, the pioneering work of (Gui et al., 2023) computes entry-wise confidence intervals in low-rank matrix completion with an arbitrary backbone model, substantially extending the entry-wise guarantees provided in the known-rank case in (Chen et al., 2021).
Exact Recovery in Matrix Completion: The problem of exactly recovering the entries of a uniformly sampled matrix can be traced back to the work of Candès and Tao (Candès & Tao, 2010) and has been widely studied since (Recht, 2011; Gross, 2011; Candès & Recht, 2009; Chen, 2013): by minimizing the nuclear norm, the matrix can be recovered with high probability after sampling Õ(nr) entries, where r is the rank of the ground truth matrix.
Perturbed Recovery in Matrix Completion: The first bounds with a refined dependence on the variance of the noise can be traced back to the early work of (Candes & Plan, 2010), which, roughly speaking, shows an excess risk bound of order r Opr1 b N sσq, where σ is the standard deviation of the perturbation, sampling is without replacement, and N ≫ Õ(nr). Thus, the architectural dependence on the matrix size n is very strong inside the term which involves the variance parameter σ. Much later, a nearly optimal bound of Õ(σ√(nr/N)) (also for sampling without replacement) was achieved in (Chen et al., 2020).
Inductive Matrix Completion: Inductive Matrix Completion studies predictors of the form AD₁D₂B^⊤ where A and B are fixed matrices which collect side information about the rows and columns. Thus, this can be viewed as an analogue of deep matrix factorization with d = 4 with A and B fixed. However, since A, B are fixed, the problem behaves more similarly to matrix completion with nuclear norm constraints. To the best of our knowledge, the first bounds for this model in the approximate recovery setting are from (Chiang et al., 2018; 2015), giving bounds of order M√(1/N), where M is a bound on the nuclear norm of D₁D₂. Expressed in terms of rank-like quantities, this yields r Op b N q, where a is the number of columns of A and B. Later, stronger results were provided in (Ledent et al., 2021b) which match the non-inductive literature, with a playing the role of n in standard MC. For instance, the distribution-free sample complexity rate is Õ(a^{3/2}√r). For exact recovery, a sample complexity rate of Õ(ar) was provided in (Xu et al., 2013). Later, (Ledent et al., 2023) provided a perturbed recovery bound of r O ˆ σ b . Furthermore, several works study more specific settings where the rows and columns have implicit cluster structure (Qiaosheng et al., 2019; Zhang et al., 2022; Ledent et al., 2021a; Alves et al., 2021). Such assumptions are also becoming common in the field of low-rank bandits (Pal & Jain, 2022; Pal et al., 2023). However, none of these works consider the situation where the matrices A and B are trainable (which corresponds to the case d = 4 in our setting).
Orthogonal Tensor Recovery with the Schatten Quasi-Norm: Beyond the examples above, we are not aware of any work on approximate recovery for Schatten norm constrained matrix completion. However, similar problems have been studied with different losses or sampling regimes. In particular, (Fan et al., 2020; Fan, 2021) study approximate tensor recovery with Schatten regularization. The results are far-reaching and go well beyond the more restricted setting of matrix completion which we study here. However, in the case of a 2-way tensor (i.e. a matrix), the results can be interpreted as a Lagrangian formulation of the empirical risk minimization problems we study. The loss function is the square loss and sampling is uniformly at random without replacement, which means the results are not directly comparable. The achieved excess Frobenius norm bounds scale like 4 c 2 p M 2p 2 p Np (cf. (Fan, 2021), Theorem 4), where M is an upper bound on ‖·‖_{sc,p}.³ Expressed in terms of our rank-like quantity r, this turns into 4 c rn 2 2 p Np . In contrast, our result is M 2p 2 p n 2 3p , which translates to r O ˆb . Firstly, note that both results scale like Õ(rn) when p → 0 (though the constant blows up like 1/p in both cases). Secondly, our rate is uniformly tighter since 2 2 p ą 1. And lastly, the bound in (Fan, 2021) is vacuous for p = 1, scaling like Õ(rn²) in that case, compared to Õ(rn) in our result.
(Footnote 3: We express our bounds in terms of excess risk with a bounded loss (which could be the truncated square loss), so the decay rate in N can be understood as comparable: the main difference lies in the architectural sample complexity.)
Matrix sensing with the Schatten Quasi-Norm: While exact and perturbed recovery for matrix completion with the nuclear norm (and inductive matrix completion) is a very well-studied problem, for p < 1 there appears to be little to no existing work in the case of randomly sampled entries. However, there is a lot of work on the sample complexity of compressed sensing for matrix completion, including with Schatten norm minimization (Zhang et al., 2013; Arora et al., 2019; Liu et al., 2014; Recht et al., 2010). In compressed sensing, instead of observing entries, we observe measurements in the form of Frobenius inner products of the ground truth with certain matrices. Although MC can be expressed in the language of compressed sensing by saying that the measurement matrices are indicator functions of entries (and inductive matrix completion can be expressed by saying that the measurement matrices are all the possible outer products of row and column side information vectors), it is not easy to deduce even existing results for matrix completion or IMC from their compressed sensing analogues: indeed, the conditions on the measurement matrices are typically expressed deterministically via the Restricted Isometry Property, which cannot be easily checked for indicator measurement matrices, though it holds with high probability for certain classes of measurement matrices. For instance, (Zhang et al., 2013; Liu et al., 2014) show a sample complexity of nr for perturbed recovery with the Schatten norm for a broad class of measurement matrices called nearly isometric families (cf. (Recht et al., 2010)), which includes measurement matrices with i.i.d.
Gaussian entries, but not indicator measurements: in that case, Property 4.3 from (Recht et al., 2010) only holds for uniform sampling, and Property 4.1 only holds for bounded X, which violates the definition (which requires the property to be satisfied for all X), though the fact that it does hold for bounded X may offer insights into the relationship between the proof techniques. It is clear from the uniform sampling complexity of Õ(nr) that this setting, although much more general in many ways, cannot capture the detailed effects of the sampling distribution on the function class capacity of matrices with constrained norms offered by (Shamir & Shalev-Shwartz, 2011; 2014; Ledent et al., 2021b) and the present work.
Earlier works on deep matrix factorization often focus on the optimization and algorithmic aspects (Trigeorgis et al., 2016; Zhao et al., 2017) without providing sample complexity bounds, though some include non-linear components (Xue et al., 2017; Fan & Cheng, 2018; Wang et al., 2017; De Handschutter et al., 2021; Wei et al., 2020; Lara-Cabrera et al., 2020). Note also that the non-linear components in those works are interspersed between each matrix in the product (by analogy with the activation functions in feedforward neural networks), rather than applied entry-wise after the matrix multiplication (as in FRMC), which implies that the models are also different. The observation that deep matrix factorization is equivalent to Schatten norm regularization was made in other works, including (Arora et al., 2019), which studies the optimization landscape of the problem in a compressed sensing setting where the measurement matrices commute (which does not apply to indicator measurements). The implications this has for the implicit rank restriction which occurs when training deep neural networks are currently the subject of a large amount of interest in the community (Dai et al., 2021; Jacot, 2022; Wang & Jacot, 2023). However, those works typically do not study sample complexity, perhaps because it is only non-trivial when the matrix is not flat, which implies a multi-output scenario in the neural network context. Nevertheless, the potential to generalize our results to that situation is a tantalizing direction for future work which may shed a different light on implicit rank restriction in DNN training.

J. Future Directions
There are plenty of unanswered questions which can be studied in future work. For instance:
1. Can the strong dependence on m in the results in Section G be improved through a more refined handling of the L1 Lipschitz constant in Proposition G.3?
2. Our results concern matrix completion. However, the equivalence between Schatten quasi-norm regularization and L2 regularization of factor matrices is valid in the case of neural networks as well: in fact, there is a large amount of renewed enthusiasm for this problem in the community in recent years from the optimization perspective (Dai et al., 2021; Wang & Jacot, 2023; Giampouras et al., 2020; Arora et al., 2019). Do our results extend to this case? A simple question is whether the sample complexity of linear networks of the form R^m ∋ f(x) = A_L · · · A_1 x (x ∈ R^n) (337) behaves similarly to our bounds, where the quantity M would be replaced by an upper bound on ∑_ℓ ‖A_ℓ‖²_Fr. The two problems are still technically distinct, and adaptations of the techniques would be necessary. The question can also be extended to non-zero reference matrices, which appears to be a highly non-trivial problem.
More generally, the relationship between our results and those of (Dai et al., 2021) could be investigated further in this context.
3. Perhaps a unifying question regarding both points above is whether the results of Section C.2 can be obtained through a covering number approach.
4. Can our chaining and Talagrand-type arguments in Lemmas E.4 and E.3, as well as Proposition G.3, be used to improve existing generalization bounds for neural networks (with activations), at least by removing certain logarithmic terms?
5. Do our results extend to unbounded losses?
6. What can be said in the transductive case? Since this has been studied in the case of nuclear norm regularization before (Shamir & Shalev-Shwartz, 2011), it is not unreasonable to assume that similar results could hold for our setting.
7. Our results concern excess risk bounds which correspond to traditional performance measures (e.g. RMSE). However, Recommendation Systems typically rely on measures more sensitive to higher predictions than lower ones (e.g. recall, NDCG). Can generalization bounds be proved in those settings?
8. In recommendation systems settings, do our results extend to Graph Neural Networks such as LightGCN (He et al.,