Matrix Completion with Model-free Weighting

Jiayi Wang¹, Raymond K. W. Wong¹, Xiaojun Mao², Kwun Chuen Gary Chan³

¹Department of Statistics, Texas A&M University, College Station, TX 77843, USA. ²School of Data Science, Fudan University, Shanghai, 200433, China. ³Department of Biostatistics, University of Washington, Seattle, WA 98195, USA. Correspondence to: Xiaojun Mao.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

In this paper, we propose a novel method for matrix completion under general non-uniform missing structures. By controlling an upper bound of a novel balancing error, we construct weights that can actively adjust for the non-uniformity in the empirical risk without explicitly modeling the observation probabilities, and can be computed efficiently via convex optimization. The recovered matrix based on the proposed weighted empirical risk enjoys appealing theoretical guarantees. In particular, the proposed method achieves a stronger guarantee than existing work in terms of the scaling with respect to the observation probabilities, under asymptotically heterogeneous missing settings (where entry-wise observation probabilities can be of different orders). These settings can be regarded as a better theoretical model of missing patterns with highly varying probabilities. We also provide a new minimax lower bound under a class of heterogeneous settings. Numerical experiments are also provided to demonstrate the effectiveness of the proposed method.

1. Introduction

Matrix completion is a modern missing data problem where the object of interest is a high-dimensional and often low-rank matrix. In its simplest form, a partial (noisy) observation of the target matrix is collected, and the goal is to impute the missing entries and sometimes also to de-noise the observed ones. There are various related applications in, e.g., bioinformatics (Chi et al., 2013), causal inference (Athey et al., 2018; Kallus et al., 2018), collaborative filtering (Rennie & Srebro, 2005), computer vision (Weinberger & Saul, 2006), positioning (Montanari & Oh, 2010), survey imputation (Davenport et al., 2014; Zhang et al., 2020; Sengupta et al., 2021) and quantum state tomography (Wang, 2013; Cai et al., 2016).

Matrix completion has been popularized by the famous Netflix prize problem (Bennett & Lanning, 2007), in which a large matrix of movie ratings is partially observed. Each row of this matrix consists of ratings from a particular customer, while each column records the ratings of a particular movie. Matrix completion has attracted significant interest from the machine learning and statistics communities (e.g., Koltchinskii et al., 2011; Hernández-Lobato et al., 2014; Klopp, 2014; Lafond et al., 2014; Hastie et al., 2015; Klopp et al., 2015; Bhaskar, 2016; Cai & Zhou, 2016; Kang et al., 2016; Zhu et al., 2016; Bi et al., 2017; Fithian & Mazumder, 2018; Dai et al., 2019; Robin et al., 2020; Chen et al., 2020). Although many statistical and computational breakthroughs (e.g., Candès & Recht, 2009; Koltchinskii et al., 2011; Recht, 2011) have been made in this area in the last decade, most work (with theoretical guarantees) is developed under a uniform missing structure where every entry is assumed to be observed with the same probability. However, uniform missingness is unrealistic in many applications.
Work under non-uniform missingness is relatively sparse, and can be roughly divided into two major classes. The first class (e.g., Srebro et al., 2005; Foygel & Srebro, 2011; Klopp, 2014; Cai & Zhou, 2016) focuses on a form of robustness result, and shows that without actively adjusting for the non-uniform missing structure (e.g., simply applying the uniform empirical risk function $\hat{R}_{\mathrm{uni}}$ defined below), nuclear-norm and max-norm regularized methods can still lead to consistent estimation. Since no direct adjustment is imposed, there is no need to model the non-uniform missing structure. The second class aims to improve the estimation by modeling the missing structure and actively adjusting for non-uniformity. Several works (e.g., Srebro & Salakhutdinov, 2010; Foygel et al., 2011; Negahban & Wainwright, 2012; Mao et al., 2019) fall into this class. However, many of the underlying models can be viewed as special low-rank (e.g., rank-1) missing structures. For instance, a common model is the product sampling model (Negahban & Wainwright, 2012), where a row and a column are chosen independently according to possibly non-uniform marginal distributions, leading to a rank-1 matrix of observation probabilities. The specific model choices of non-uniformity restrict the applicability and theoretical guarantees of these works. One notable exception is Foygel et al. (2011), which actively adjusts for a product sampling model via a variant of weighted trace-norm regularization, but still provides guarantees under general missing structures. Despite these efforts, the study of non-uniform missing mechanisms is still far from comprehensive.

In this work, we propose a novel method of balancing weighting to actively adjust for the non-uniform empirical risk due to general unbalanced (i.e., non-uniform) sampling, without explicitly modeling the probabilities of observation. This is especially attractive when such a model is hard to choose or estimate. We summarize our major contributions as follows.

First, we propose a novel balancing idea to adjust for the non-uniformity in matrix completion problems. Unlike many existing works, this idea does not require specific modeling of the observation probabilities. Thanks to the proposed relaxation of the balancing error (Lemma 1), the balancing weights can be obtained via a constrained spectral norm minimization, which is a convex optimization problem.

Second, we provide theoretical guarantees on the balancing performance of the proposed weights, as well as on the matrix recovery via the corresponding weighted empirical risk estimator. We note that the estimated nature of the balancing weights introduces non-trivial dependence in the weighted empirical risk, as opposed to the typical unweighted empirical risk (often assumed to be a sum of independent quantities). This leads to a non-standard analysis of the proposed matrix estimator.

Third, we investigate a new type of asymptotic regime: asymptotically heterogeneous missing structures. This regime allows observation probabilities to be of different orders, a more reasonable asymptotic model for scenarios with highly varying probabilities among entries. Under asymptotically heterogeneous settings, we show that our estimator achieves a significantly better error upper bound than existing upper bounds in terms of the scaling with respect to the observation probabilities.
Such scaling is shown to be optimal via a new minimax result based on a class of asymptotically heterogeneous settings. Note that we focus on the challenging uniform error $d^2$ as opposed to the weighted (non-uniform) error $d_\Pi^2$ (see Section 5), so as to ensure that entries with high missing rates are given non-negligible emphasis in our error measure.

2. Background

2.1. Notation

Throughout the paper, we use several matrix norms: the nuclear norm $\|\cdot\|_*$, the Frobenius norm $\|\cdot\|_F$, the spectral norm $\|\cdot\|$, the entry-wise maximum norm $\|\cdot\|_\infty$ and the max norm $\|\cdot\|_{\max}$. Specifically, the entry-wise maximum norm of a matrix $B = (B_{ij})$ is defined as $\|B\|_\infty = \max_{i,j} |B_{ij}|$, while the max norm is defined as $\|B\|_{\max} = \inf\{\|U\|_{2,\infty} \|V\|_{2,\infty} : B = UV^\top\}$, where $\|\cdot\|_{2,\infty}$ denotes the maximum $\ell_2$-row-norm of a matrix. See, e.g., Srebro & Shraibman (2005) for the properties of the max norm. The Frobenius inner product and the Hadamard product between two matrices $B_1 = (B_{1,ij})$ and $B_2 = (B_{2,ij})$ of the same dimensions are represented by $\langle B_1, B_2 \rangle = \sum_{i,j} B_{1,ij} B_{2,ij}$ and $B_1 \circ B_2 = (B_{1,ij} B_{2,ij})$, respectively. For any $a \in \mathbb{R}$ and any matrix $B = (B_{ij})$, we write $B^{(a)} = (B_{ij}^a)$. We also adopt the following asymptotic notations. Let $(b_n)_{n \ge 1}$ and $(c_n)_{n \ge 1}$ be two sequences of nonnegative numbers. We write $b_n = O(c_n)$ if $b_n \le K c_n$ for some constant $K > 0$, and $b_n \asymp c_n$ if $b_n = O(c_n)$ and $c_n = O(b_n)$. In addition, we use $\mathrm{polylog}(n)$ to represent a polylogarithmic function of $n$, i.e., a polynomial in $\log n$. So $O(\mathrm{polylog}(n))$ represents a polylogarithmic order in $n$.

2.2. Model

We aim to recover an unknown target matrix $A_\star = (A_{\star,ij})_{i,j=1}^{n_1,n_2} \in \mathbb{R}^{n_1 \times n_2}$ from a partial observation of its noisy realization $Y = (Y_{ij})_{i,j=1}^{n_1,n_2} \in \mathbb{R}^{n_1 \times n_2}$. Denote the observation indicator matrix by $T = (T_{ij})_{i,j=1}^{n_1,n_2} \in \mathbb{R}^{n_1 \times n_2}$, where $T_{ij} = 1$ if $Y_{ij}$ is observed and $T_{ij} = 0$ otherwise. We consider an additive noise model
$$Y_{ij} = A_{\star,ij} + \epsilon_{ij}, \quad i = 1, \dots, n_1; \; j = 1, \dots, n_2,$$
where $\{\epsilon_{ij}\}$ are independent errors with zero mean, and are independent of $\{T_{ij}\}$. Also, $\{T_{ij}\}$ are independent Bernoulli random variables with $\pi_{ij} = \Pr(T_{ij} = 1)$. We write $\Pi = (\pi_{ij})_{i,j=1}^{n_1,n_2}$.

2.3. Uniformity Versus Non-uniformity

Due to the complexity of the data, it is often undesirable in practice to posit an additional distributional model for $\{\epsilon_{ij}\}$ (such as normality). To recover $A_\star$, an empirical risk minimization framework is commonly adopted with the risk function
$$R(A) = \frac{1}{n_1 n_2} \mathbb{E}\|Y - A\|_F^2, \quad A \in \mathbb{R}^{n_1 \times n_2}.$$
Under uniform sampling (i.e., $\pi_{ij} \equiv \pi$), this motivates the use of the popular empirical risk
$$\hat{R}_{\mathrm{uni}}(A) = \frac{1}{n_1 n_2} \|T \circ (Y - A)\|_F^2, \quad A \in \mathbb{R}^{n_1 \times n_2},$$
which is unbiased for $\pi R(A)$ (e.g., Candès & Recht, 2009; Candès & Plan, 2010; Koltchinskii et al., 2011; Klopp, 2014). To minimize $\hat{R}_{\mathrm{uni}}$, we can ignore the constant multiplier $\pi$. In such settings, a popular form of estimator is $\arg\min_{A \in \mathcal{A}_{n_1,n_2}} \hat{R}_{\mathrm{uni}}(A)$, where examples of the hypothesis class $\mathcal{A}_{n_1,n_2}$ include a set of matrices with rank at most $r$ (i.e., $\{A : \mathrm{rank}(A) \le r\}$) and a nuclear norm ball of radius $\nu$ (i.e., $\{A : \|A\|_* \le \nu\}$). In the latter case, one can also adopt an equivalent minimization $\arg\min_A \{\hat{R}_{\mathrm{uni}}(A) + \lambda \|A\|_*\}$, obtained by the method of Lagrange multipliers. However, uniform sampling is a strong assumption and often not satisfied (e.g., Srebro & Salakhutdinov, 2010; Foygel et al., 2011; Hernández-Lobato et al., 2014). In the empirical risk minimization framework, it is natural to adjust for such non-uniformity since $\hat{R}_{\mathrm{uni}}$ is no longer unbiased for $R$.
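To make the bias concrete, the following is a small numerical sketch (our own illustration, not code from the paper; all names are ours): under non-uniform sampling, the expectation of $\hat{R}_{\mathrm{uni}}(A)$ over the missingness weights each entry's squared error by its observation probability $\pi_{ij}$, so it is no longer proportional to $R(A)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 100, 80
A_star = rng.uniform(0, 2, (n1, 5)) @ rng.uniform(0, 2, (5, n2))
Pi = rng.uniform(0.05, 0.5, (n1, n2))        # heterogeneous observation probabilities
A = A_star + rng.normal(0, 0.5, (n1, n2))    # an arbitrary candidate matrix

def R_uni_hat(A, Y, T):
    """Uniform empirical risk ||T o (Y - A)||_F^2 / (n1 n2)."""
    return np.sum((T * (Y - A)) ** 2) / (n1 * n2)

# Monte Carlo over the missingness only (noiseless Y = A_star for clarity)
mc = np.mean([R_uni_hat(A, A_star, rng.binomial(1, Pi)) for _ in range(500)])
weighted = np.sum(Pi * (A_star - A) ** 2) / (n1 * n2)   # what it actually targets
full_risk = np.sum((A_star - A) ** 2) / (n1 * n2)       # what we would like
print(mc, weighted, full_risk)   # mc ~ weighted, but != full_risk
```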
Interestingly, such bias does not lead to an inconsistent estimator in an asymptotic sense (Klopp, 2014), a form of robustness result (the first class of works under non-uniformity mentioned in Section 1). This is because $A_\star$ still minimizes $\mathbb{E}\{\hat{R}_{\mathrm{uni}}(A)\}$ even when the $\pi_{ij}$'s are heterogeneous, and, to achieve consistency, the theory requires that $\mathcal{A}_{n_1,n_2}$ grow asymptotically so that some appropriate distance between $A_\star$ and the set $\mathcal{A}_{n_1,n_2}$ converges to zero. In finite samples, one often encounters some form of misspecification ($A_\star$ is not close to $\mathcal{A}_{n_1,n_2}$). In such settings, the estimator based on $\hat{R}_{\mathrm{uni}}(A)$ is inclined to favor entries with a higher chance of observation, which is often not desirable. For movie recommendation, it is generally not a good idea to neglect those people who rate less frequently, as they might be the customers who do not watch as frequently, and successful movie recommendation would help retain these customers from discontinuing movie subscription services. This is highly related to misspecification in low-dimensional models, where misspecification requires weighting adjustments (Wooldridge, 2007). However, matrix completion problems involve a much more challenging high-dimensional setup with possibly diminishing observation probabilities (e.g., Candès & Recht, 2009; Koltchinskii, 2011). That is, $\pi_L := \min_{i,j} \pi_{ij} \to 0$ as $n_1, n_2 \to \infty$. In fact, the diminishing setting is of great interest and plays a central role in most analyses, since it mimics high-missingness situations such as the Netflix prize problem (< 1% of ratings observed).

2.4. Extremely Varying Probabilities: Heterogeneity Meets Asymptotics

For non-uniform settings, one expects heterogeneity among the entries of $\Pi$. We argue that there exist different levels of heterogeneity, and only the simplest level has been well studied. Define $\pi_U := \max_{i,j} \pi_{ij}$ and $\pi_L := \min_{i,j} \pi_{ij}$. Existing work (e.g., Negahban & Wainwright, 2012; Klopp, 2014; Lafond et al., 2014; Cai & Zhou, 2016) is based on the assumption that $\pi_U \asymp \pi_L$, which enforces that all observation probabilities are of the same order. We call this an asymptotically homogeneous missing structure. When the observation probabilities vary highly among different entries, this asymptotic framework may not reflect the empirical world. Highly varying probabilities are not rare. As demonstrated in Section 2.3 of Mao et al. (2020), the estimated ratio of $\pi_U$ to $\pi_L$ can be high (around 20000) in the Yahoo! Webscope dataset, under low-rank models of $\Pi$ (e.g., Negahban & Wainwright, 2012). In our theoretical analysis (Section 5), we also look into asymptotically heterogeneous settings where $\pi_U$ and $\pi_L$ are of different orders.

3. Empirical Risk Balancing

3.1. Propensity Approaches and their Drawbacks

To deal with non-uniformity, a natural idea is to utilize a weighted empirical risk:
$$\hat{R}_W(A) = \frac{1}{n_1 n_2} \|T \circ W^{(1/2)} \circ (Y - A)\|_F^2, \tag{1}$$
where $W = (W_{ij})_{i,j=1}^{n_1,n_2}$ is a matrix composed of weights such that $W_{ij} \ge 1$ for all $i, j$. A natural choice of $W$ is $(\pi_{ij}^{-1})_{i,j=1}^{n_1,n_2}$, which leads to an unbiased risk estimator for $R(A)$; this method is known as inverse probability weighting (IPW) in the missing data literature. As $\{\pi_{ij}\}$ are unknown in general, most methods with IPW plug in estimated probabilities based on certain models. These ideas have been studied in, e.g., Schnabel et al. (2016) under the form of a nuclear-norm regularized estimator:
$$\arg\min_A \{\hat{R}_W(A) + \lambda \|A\|_*\}, \tag{2}$$
where $\lambda > 0$ is a tuning parameter.
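For intuition, here is a minimal sketch (ours, not the paper's code) of the weighted empirical risk (1) with oracle IPW weights $W_{ij} = \pi_{ij}^{-1}$; in the noiseless case its expectation over the missingness recovers the fully observed risk. The practical catch, discussed next, is that such oracle weights are unavailable.

```python
import numpy as np

def R_W_hat(A, Y, T, W):
    """Weighted empirical risk (1): ||T o W^(1/2) o (Y - A)||_F^2 / (n1 n2)."""
    n1, n2 = A.shape
    return np.sum(T * W * (Y - A) ** 2) / (n1 * n2)

rng = np.random.default_rng(1)
n1, n2 = 100, 80
A_star = rng.uniform(0, 2, (n1, 5)) @ rng.uniform(0, 2, (5, n2))
A = A_star + rng.normal(0, 0.5, (n1, n2))
Pi = rng.uniform(0.05, 0.5, (n1, n2))
W_ipw = 1.0 / Pi                               # oracle inverse-probability weights

mc = np.mean([R_W_hat(A, A_star, rng.binomial(1, Pi), W_ipw) for _ in range(500)])
print(mc, np.sum((A_star - A) ** 2) / (n1 * n2))   # now unbiased for the full risk
```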
Despite its conceptual simplicity, it is well known in the statistical literature that IPW estimators can produce unstable results due to extreme weights (Rubin, 2001; Kang & Schafer, 2007). More problematically for matrix completion, the estimation quality of a high-dimensional probability matrix $\Pi = (\pi_{ij})_{i,j=1}^{n_1,n_2}$ can also be worsened significantly by diminishing probabilities of observation (as $n_1, n_2 \to \infty$) (Davenport et al., 2014). To address this problem, Mao et al. (2020) imposed a constraint (effectively an upper bound) on the estimated inverse probabilities, where the constraint has to be aggressively chosen, such that some true inverse probabilities do not necessarily satisfy it in finite samples. However, there are still two general issues in this line of research. First, the estimation of $\Pi$ is required. One could come up with a variety of ways to model $\Pi$, but it is not obvious how to choose a good model for $\Pi$. Second, the constraint level is tricky to select and difficult to analyze theoretically. Indeed, the analysis of the effect of the constraint on matrix recovery forms the bulk of the analysis in Mao et al. (2020).

The goal of this work is to propose a method that does not require specific modeling and estimation of $\Pi$ but still actively adjusts for the non-uniformity in the sampling. This method aims to directly find a stable weight matrix $W$ that adjusts for non-uniformity, without enforcing $W$ to be an IPW matrix derived from a specific model.

3.2. Balancing Weights

When $\epsilon_{ij} = 0$ for all $i, j$ (only for motivational purposes, not required for the proposed techniques), we aim to choose $W$ such that $\hat{R}_W$ (left-hand side) approximates the desirable fully observed risk (right-hand side):
$$\frac{1}{n_1 n_2} \|T \circ W^{(1/2)} \circ (A - A_\star)\|_F^2 \approx \frac{1}{n_1 n_2} \|A - A_\star\|_F^2, \tag{3}$$
for a set of $A$ (a hypothesis class of $A_\star$ which grows with $n_1, n_2$) to be specified below. Indeed, we only need to determine those $W_{ij}$ such that $T_{ij} = 1$, since the values of the remaining $W_{ij}$ play no role in (3). Intuitively, the weights $W$ are introduced to maintain balance between the left- and right-hand sides of (3). Therefore, we may work with $\hat{R}_W$ as if we were using the uniform empirical risk $\hat{R}_{\mathrm{uni}}$. The condition (3) can be written as
$$0 \approx \frac{1}{n_1 n_2} \left| \langle (T \circ W - J) \circ \Delta, \Delta \rangle \right|, \tag{4}$$
where $\Delta = A - A_\star$ and $J \in \mathbb{R}^{n_1 \times n_2}$ is a matrix of ones. We call the right-hand side the balancing error of $\Delta$ with respect to $W$, denoted by $S(W, \Delta)$. Naturally, we want to find weights $W$ that minimize the uniform balancing error
$$F(W) := \sup_{\Delta \in \mathcal{D}_{n_1,n_2}} S(W, \Delta),$$
for a (standardized) set $\mathcal{D}_{n_1,n_2}$ induced by the hypothesis class $\mathcal{A}_{n_1,n_2}$ of $A_\star$. A typical assumption is that $A_\star$ is low-rank or approximately low-rank, and various classes have been shown to achieve such modeling. For instance, $\mathcal{A}_{n_1,n_2}$ can be chosen as a max-norm ball $\{A : \|A\|_{\max} \le \beta\}$ (e.g., Srebro et al., 2005; Foygel & Srebro, 2011; Cai & Zhou, 2013; 2016; Fang et al., 2018), and the induced choice of $\mathcal{D}_{n_1,n_2}$ would be $\{\Delta : \|\Delta\|_{\max} \le 2\beta\}$. However, the uniform balancing error does not have a closed form, and so the computation of the weights would be significantly more difficult and expensive. A similar difficulty exists for nuclear-norm balls. To solve this problem, we have developed the following novel lemma, which allows us to focus on a relaxed version of the balancing error that enjoys strong theoretical guarantees (see Section 5).

Lemma 1. For any matrices $B, C \in \mathbb{R}^{n_1 \times n_2}$, we have
$$|\langle C \circ B, B \rangle| \le \|C\| \, \|B\|_{\max} \, \|B\|_* \le \sqrt{n_1 n_2} \, \|C\| \, \|B\|_{\max}^2.$$

The proof of this lemma can be found in Section A of the supplemental document.
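As a quick numerical sanity check of Lemma 1 (our own illustration), one can evaluate both inequalities on rank-one matrices $B = uv^\top$, for which the max norm has the closed form $\|B\|_{\max} = \|u\|_\infty \|v\|_\infty$:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 60, 40
for _ in range(5):
    u, v = rng.normal(size=n1), rng.normal(size=n2)
    B = np.outer(u, v)                                 # rank one
    C = rng.normal(size=(n1, n2))
    lhs = abs(np.sum(C * B * B))                       # |<C o B, B>|
    spec = np.linalg.norm(C, 2)                        # ||C|| (spectral norm)
    max_norm = np.abs(u).max() * np.abs(v).max()       # ||B||_max for rank one
    nuc = np.linalg.norm(B, "nuc")                     # ||B||_*
    assert lhs <= spec * max_norm * nuc + 1e-8
    assert spec * max_norm * nuc <= np.sqrt(n1 * n2) * spec * max_norm ** 2 + 1e-8
print("Both inequalities of Lemma 1 hold on all draws.")
```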
The inequalities in Lemma 1 are tight in general: if $C = aJ$ and $B = bJ$, where $a, b \in \mathbb{R}$ and $J$ is the matrix whose entries are all 1, the two equalities hold simultaneously. By Lemma 1,
$$S(W, \Delta) \le \frac{1}{\sqrt{n_1 n_2}} \|T \circ W - J\| \, \|\Delta\|_{\max}^2$$
for any $\Delta \in \mathbb{R}^{n_1 \times n_2}$, where the right-hand side can be regarded as the relaxed balancing error. If we focus on the max-norm ball (for $\mathcal{A}_{n_1,n_2}$ and hence $\mathcal{D}_{n_1,n_2}$) as discussed before, we are only required to control the spectral norm of $T \circ W - J$, which is a convex function of $W$. Therefore, we propose the following novel weights:
$$\widehat{W} = \arg\min_W \|T \circ W - J\| \quad \text{subject to } \|T \circ W\|_F \le \kappa \text{ and } W_{ij} \ge 1, \tag{5}$$
where the optimization is taken only over $W_{ij}$ such that $T_{ij} = 1$. Here $\kappa \ge (\sum_{i,j} T_{ij})^{1/2}$ is a tuning parameter. The weights $\{W_{ij}\}$ are restricted to be greater than or equal to 1, as their counterparts, the inverse probabilities, satisfy $\pi_{ij}^{-1} \ge 1$. The term $\|T \circ W\|_F$ regularizes $W$ and is particularly important when the $\epsilon_{ij}$'s are not zero.

Let $h(\kappa) = \|T \circ \widehat{W} - J\|$, where $\widehat{W}$ is defined by (5) with the tuning parameter $\kappa$. It is proportional to the relaxed balancing error with respect to $\widehat{W}$. As $\kappa$ increases, a weaker constraint is imposed on $W$; therefore $h(\kappa)$ is non-increasing in $\kappa$. It can be shown that $h(\kappa)$ stays constant for all large enough $\kappa$, i.e., $h$ achieves its smallest value. The percentage of (relaxed) balancing with respect to a specific $\kappa$ is defined as $[M - h(\kappa)]/(M - m)$, where $M := \max_\kappa h(\kappa)$ and $m := \min_\kappa h(\kappa)$. One way to tune $\kappa$ is to choose the $\kappa$ that achieves a certain pre-specified percentage of balancing. We can also select $\kappa$ from multiple values corresponding to certain balancing percentages, via a validation set. In Sections 6 and 7, we compare $\kappa$ with respect to balancing percentages 100%, 75%, and 50%, and select the one with the smallest validation error.

3.3. Computation

The dual Lagrangian form of the constrained problem (5) is
$$\min_W \left\{ \|T \circ W - J\| + \kappa' \|T \circ W\|_F^2 \right\}, \tag{6}$$
where $\kappa'$ is the dual parameter. Denote $X = T \circ W - J$. We can obtain an analytic form of the subgradient of the largest singular value, $\partial \sigma_1(X) = u_1 v_1^\top$, where $u_1$ and $v_1$ are the left and right singular vectors corresponding to the largest singular value of the matrix $X$. Thus we have $\partial \sigma_1(X) / \partial W_{ij} = (u_1 v_1^\top)_{ij} T_{ij}$ and $\partial \|T \circ W\|_F^2 / \partial W_{ij} = 2 T_{ij} W_{ij}$. This allows us to efficiently adopt typical algorithms for smooth optimization with box constraints, such as the L-BFGS-B algorithm (a minimal sketch is given after the estimator (7) below).

4. Estimation of $A_\star$

Given the weight estimator $\widehat{W}$ defined by (5), we propose the following hybrid estimator that utilizes the advantages of both max-norm and nuclear-norm regularization:
$$\widehat{A} = \arg\min_{\|A\|_{\max} \le \beta} \left\{ \hat{R}_{\widehat{W}}(A) + \mu \|A\|_* \right\}, \tag{7}$$
where $\|\cdot\|_*$ denotes the nuclear norm, and $\beta > 0$, $\mu \ge 0$ are tuning parameters. As explained in Section 3.2, the balancing weights $\widehat{W}$ aim to make $\hat{R}_{\widehat{W}}$ behave like the uniform empirical risk $\hat{R}_{\mathrm{uni}}$ over a max-norm ball. Although not entirely necessary, the additional nuclear-norm penalty can sometimes produce a tighter relaxation, as shown in Lemma 1. As discussed in Fang et al. (2018), the additional nuclear-norm bound shows its advantages under the uniform sampling scheme when the target matrix is exactly low-rank. We also find that using the hybrid of max-norm and nuclear-norm regularization improves the estimation performance. If one enforces all the elements of $\widehat{W}$ to be 1 (uniform weighting), then the estimator (7) degenerates to the estimator defined in Fang et al. (2018). The major novelty of our work is the stable weights.
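To make the weight computation of Section 3.3 concrete, below is a minimal sketch (ours, not the authors' released code) of solving the dual form (6) with scipy's L-BFGS-B, using the subgradient $u_1 v_1^\top$ described above; `kappa_dual` is our hypothetical name for the dual parameter, and the spectral norm is nonsmooth, so this is a heuristic application of a smooth optimizer, as in the paper's description.

```python
import numpy as np
from scipy.optimize import minimize

def balancing_weights(T, kappa_dual=1e-4):
    """Sketch of (6): min over observed W_ij >= 1 of ||T o W - J|| + k'||T o W||_F^2."""
    n1, n2 = T.shape
    obs = T == 1
    n_obs = int(obs.sum())

    def objective(w):
        X = -np.ones((n1, n2))           # X = T o W - J; unobserved entries stay -1
        X[obs] += w
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        f = s[0] + kappa_dual * np.sum(w ** 2)
        G = np.outer(U[:, 0], Vt[0, :])  # subgradient u1 v1^T of the top singular value
        return f, G[obs] + 2.0 * kappa_dual * w

    res = minimize(objective, np.ones(n_obs), jac=True, method="L-BFGS-B",
                   bounds=[(1.0, None)] * n_obs)    # box constraint W_ij >= 1
    W = np.ones((n1, n2))
    W[obs] = res.x
    return W

rng = np.random.default_rng(3)
T = rng.binomial(1, rng.uniform(0.1, 0.6, (50, 40))).astype(float)
W_hat = balancing_weights(T)
J = np.ones(T.shape)
# the relaxed balancing error shrinks relative to uniform weighting (W = 1)
print(np.linalg.norm(T * W_hat - J, 2), np.linalg.norm(T - J, 2))
```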
We extend the algorithm proposed in Fang et al. (2018) to handle the weighted empirical risk function, so as to solve (7). Corresponding details can be found in Section E.1 of the supplemental document.

5. Theoretical Properties

We provide a non-asymptotic analysis of the proposed estimator (7). One major challenge of our analysis is the estimated nature of the weights. As the same set of data is used to obtain the weights, the weighted empirical risk $\hat{R}_{\widehat{W}}(A)$ possesses a complicated dependence structure, as opposed to the uniform empirical risk $\hat{R}_{\mathrm{uni}}(A)$ (which is assumed to be a sum of independent variables), even for a fixed $A$. To study the convergence, we carefully decompose the errors into different components. We utilize the properties of the true weights to control the balancing error term. Besides, we develop a novel lemma (Lemma S4) to study the concentration of the dual max-norm of the noise matrix with entry-wise multiplicative perturbation. The following two assumptions will be used in our theoretical analysis. Recall that $\pi_U = \max_{i,j} \pi_{ij}$ and $\pi_L = \min_{i,j} \pi_{ij}$.

Assumption 1. The observation indicators $\{T_{ij}\}$ are independent Bernoulli random variables with $\pi_{ij} = \Pr(T_{ij} = 1)$. The minimum observation probability $\pi_L$ is positive, but it can depend on $n_1, n_2$. In particular, both $\pi_U$ and $\pi_L$ are allowed to diminish to zero as $n_1, n_2 \to \infty$.

Assumption 2. The random errors $\{\epsilon_{ij}\}$ are independent and centered sub-Gaussian random variables such that $\mathbb{E}(\epsilon_{ij}) = 0$ and $\max_{i,j} \|\epsilon_{ij}\|_{\psi_2} \le \tau$, where $\|\epsilon_{ij}\|_{\psi_2} := \inf\{t > 0 : \mathbb{E}[\exp(\epsilon_{ij}^2/t^2)] \le 2\}$ is the sub-Gaussian norm of $\epsilon_{ij}$. Also, $\{\epsilon_{ij}\}$ are independent of $\{T_{ij}\}$.

We start with an essential result showing that the estimated weights $\widehat{W}$ possess the power to balance the non-uniform empirical risk. More specifically, in the following theorem, we derive a non-asymptotic upper bound on the uniform balancing error evaluated at $\widehat{W}$, where the balancing error can be written as
$$S(W, \Delta) = \frac{1}{n_1 n_2} \left| \|T \circ W^{(1/2)} \circ \Delta\|_F^2 - \|\Delta\|_F^2 \right|.$$

Theorem 1. Suppose Assumption 1 holds. Take $\kappa \ge (2 \sum_{i,j} \pi_{ij}^{-1})^{1/2}$. There exists an absolute constant $C_1 > 0$ such that for any $\beta' > 0$,
$$\sup_{\|\Delta\|_{\max} \le \beta'} S(\widehat{W}, \Delta) \le \frac{C_1 \beta'^2}{\sqrt{\pi_L (n_1 \wedge n_2)}} \min\left\{ [\log(n_1 + n_2)]^{1/2}, \pi_L^{-1/2} \right\},$$
with probability at least $1 - \exp\{-2^{-1} (\log 2) \pi_L^2 \sum_{i,j} \pi_{ij}^{-1}\} - 1/(n_1 + n_2)$.

If $\|A_\star\|_{\max} \le \beta$, it is natural to take $\beta' = 2\beta$, since $\|\Delta\|_{\max} = \|A - A_\star\|_{\max} \le 2\beta$ for any $A$ such that $\|A\|_{\max} \le \beta$. Therefore, we can take $\beta' = 2\beta$ in Theorem 1 to achieve uniform control over the balancing error associated with the estimation (7).

With the above balancing guarantee, we are now in a good position to study $\widehat{A}$. Our guarantee for $\widehat{A}$ is in terms of the uniform error $d^2(\widehat{A}, A_\star) := (n_1 n_2)^{-1} \|\widehat{A} - A_\star\|_F^2$, instead of the non-uniform error $d_\Pi^2(\widehat{A}, A_\star) = \|\Pi^{(1/2)} \circ (\widehat{A} - A_\star)\|_F^2 / \|\Pi^{(1/2)}\|_F^2$ (e.g., Klopp, 2014; Cai & Zhou, 2016). Note that the non-uniform error $d_\Pi^2(\widehat{A}, A_\star)$ places less emphasis on entries that are less likely to be observed, although a guarantee in terms of the non-uniform error can be stronger and is easier to obtain. In asymptotically heterogeneous missing settings (i.e., $\pi_U$ and $\pi_L$ are of different orders), entries with probabilities of order smaller than $\pi_U$ may be ignored within the non-uniform error in the asymptotic sense. Therefore it is not a good measure of performance if the guarantees over these entries are also important.
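The following toy computation (ours) illustrates the difference between the two error measures: an error placed only on rarely observed entries is essentially invisible to $d_\Pi^2$ but registers in the uniform error $d^2$.

```python
import numpy as np

def d2(A_hat, A_star):
    n1, n2 = A_star.shape
    return np.sum((A_hat - A_star) ** 2) / (n1 * n2)

def d2_pi(A_hat, A_star, Pi):
    return np.sum(Pi * (A_hat - A_star) ** 2) / np.sum(Pi)

n1 = n2 = 100
A_star = np.zeros((n1, n2))
Pi = np.full((n1, n2), 0.9)
Pi[:5, :5] = 0.001                 # 25 rarely observed entries
A_hat = A_star.copy()
A_hat[:5, :5] = 1.0                # error concentrated on those entries
print(d2(A_hat, A_star))           # ~2.5e-3: the uniform error registers it
print(d2_pi(A_hat, A_star, Pi))    # ~2.8e-6: the weighted error nearly ignores it
```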
In the following theorem, we provide a non-asymptotic error bound for our estimator (7) (based on the estimated weights).

Theorem 2. Suppose Assumptions 1–2 hold. Assume $\|A_\star\|_{\max} \le \beta$ and $\mu = O(\min\{[\log(n_1 + n_2)]^{1/2}, \pi_L^{-1/2}\} / \sqrt{\pi_L (n_1 \wedge n_2)})$. Then there exists an absolute constant $C_2 > 0$ such that for any $\kappa \ge (2 \sum_{i,j} \pi_{ij}^{-1})^{1/2}$,
$$d^2(\widehat{A}, A_\star) \le C_2 \left( \frac{\beta^2 \min\{[\log(n_1 + n_2)]^{1/2}, \pi_L^{-1/2}\}}{\sqrt{\pi_L (n_1 \wedge n_2)}} + \frac{\beta \tau \kappa \sqrt{n_1 + n_2}}{n_1 n_2} \right)$$
with probability at least $1 - \exp\{-2^{-1} (\log 2) \pi_L^2 \sum_{i,j} \pi_{ij}^{-1}\} - 2\exp\{-(n_1 + n_2)\} - 1/(n_1 + n_2)$.

First, we consider the asymptotically homogeneous missing structures (i.e., $\pi_L \asymp \pi_U$) that most existing work assumes. Under $\pi_L \asymp \pi_U$, the two errors $d^2(\widehat{A}, A_\star)$ and $d_\Pi^2(\widehat{A}, A_\star)$ are of the same order because
$$\frac{\pi_L}{\pi_U} \, d^2(\widehat{A}, A_\star) \le d_\Pi^2(\widehat{A}, A_\star) \le \frac{\pi_U}{\pi_L} \, d^2(\widehat{A}, A_\star). \tag{8}$$
Therefore, the upper bound for $d_\Pi^2(\widehat{A}, A_\star)$ that most existing work provides can be directly used to derive an upper bound for $d^2(\widehat{A}, A_\star)$, which shares the same order. Note that $\pi_U$ and $\pi_L$ are allowed to be different despite $\pi_U \asymp \pi_L$, so certain non-uniform missing structures are still allowed under the setting of asymptotically homogeneous missingness. This setting has been studied in Negahban & Wainwright (2012); Klopp (2014); Lafond et al. (2014); Cai & Zhou (2016). Our bound is directly comparable to the work of Cai & Zhou (2016), which studies a max-norm constrained estimator. Their result assumes $\|A_\star\|_\infty \le \alpha$ for some $\alpha$, which allows their bound to depend on $\alpha\beta$ instead of $\beta^2$ as in our bound. The comparison of error bounds between max-norm-constrained estimation and nuclear-norm-regularized estimation is given in Section 3.5 of Cai & Zhou (2016). As for exactly low-rank matrices, we can further show that our estimator achieves the optimal error bound (up to a logarithmic order). Roughly speaking, if $\kappa$ is small (so the weights are close to constant), our estimator behaves like a standard nuclear-norm regularized estimator, and hence shares the (near-)optimality of such an estimator. We provide the error bound of our estimator under the exactly low-rank setting and asymptotically homogeneous missingness in Theorem S1 of the supplemental document.

For non-uniform missing structures, the orders of $\pi_U$ and $\pi_L$ do not necessarily match. When their orders are different, we call these missing structures asymptotically heterogeneous. We now focus on how the upper bound depends on $\pi_U$ and $\pi_L$. As mentioned before, existing results are scarce. Recently, Mao et al. (2020) (their Section 5.3) provided an extension of existing upper bounds to possibly asymptotically heterogeneous settings, with a careful analysis. The corresponding upper bound scales with $\pi_L^{-1} \pi_U^{1/2}$. They also provided an additional result when one has access to the true probabilities $\Pi$, and showed that the upper bound of the estimator based on the empirical risk defined via the true probabilities can achieve the scaling $\pi_L^{-1/2}$, which is significantly better than $\pi_L^{-1} \pi_U^{1/2}$. However, until now, it has remained unclear whether there exists an estimator with this scaling of $\pi_U$ and $\pi_L$ without access to the true probabilities. Interestingly, Theorem 2 provides a positive result, and shows that the upper bound for the proposed estimator achieves this scaling $\pi_L^{-1/2}$ under the very mild assumption that $\pi_L$ diminishes at least at a slow order, more specifically $\pi_L = O(1/\log(n_1 + n_2))$.

Next, we provide a theoretical result indicating that the scaling $\pi_L^{-1/2}$ cannot be improved under asymptotically heterogeneous missing structures. Below, we give a minimax lower bound based on a class of asymptotically heterogeneous settings. To the best of the authors' knowledge, minimax lower bounds under asymptotically heterogeneous regimes have never been studied.
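Before stating the lower bound, a toy calculation (ours) shows how large the gap between the two scalings becomes under the kind of heterogeneity discussed above, taking $\pi_U \asymp 1$:

```python
# Compare the scaling of the model-based bound, pi_L^{-1} pi_U^{1/2}, with the
# proposed bound's scaling, pi_L^{-1/2}, as pi_L diminishes while pi_U = 1.
for pi_L in [1e-1, 1e-2, 1e-3, 1e-4]:
    pi_U = 1.0
    existing = pi_U ** 0.5 / pi_L      # scaling pi_L^{-1} pi_U^{1/2}
    proposed = pi_L ** -0.5            # scaling pi_L^{-1/2}
    print(f"pi_L={pi_L:.0e}: existing {existing:10.1f} vs proposed {proposed:8.1f}")
```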
The heterogeneous class that we consider posits
$$(n_1 n_2)^{-1} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \pi_{ij} \asymp \pi_L. \tag{9}$$
It is clear that (9) does not exclude asymptotically homogeneous settings. To demonstrate the heterogeneity, we provide an example as follows. Suppose there is only a fixed number of entries with observation probabilities of constant order, and the observation probabilities of the remaining entries are of the same order as $\pi_L$. Then $\pi_U \asymp 1$, and (9) is satisfied. Therefore, for any diminishing $\pi_L$, this setting is asymptotically heterogeneous. Now, we provide the minimax result.

Theorem 3. Let $\{\epsilon_{ij}\}$ be i.i.d. Gaussian $N(0, \sigma^2)$ with $\sigma^2 > 0$. For any $\beta > 0$, assume (9) holds with $\pi_L^{-1} = O(\beta^2 (n_1 \wedge n_2) / (\sigma \wedge \beta)^2)$. Then, there exist constants $\delta \in (0, 1)$ and $c > 0$ such that
$$\inf_{\widehat{A}} \sup_{\|A_\star\|_{\max} \le \beta} \Pr\left( d^2(\widehat{A}, A_\star) > \frac{c (\sigma \wedge \beta) \beta}{\sqrt{\pi_L (n_1 \wedge n_2)}} \right) \ge \delta.$$

In the discussion below, we focus on $\sigma \asymp 1$, which, most notably, excludes asymptotically noiseless settings. Theorem 3 shows that the scaling $\pi_L^{-1/2}$ in our upper bound obtained in Theorem 2 is essential. Due to the general inequality (Srebro & Shraibman, 2005)
$$\|A_\star\|_{\max} \le \sqrt{\mathrm{rank}(A_\star)} \, \|A_\star\|_\infty, \tag{10}$$
$\beta$ is not expected to grow fast for low-rank $A_\star$ with bounded entries. For $\beta = O(\mathrm{polylog}(n))$, our upper bound matches the lower bound in Theorem 3 up to a logarithmic factor. For general $\beta$, our upper bound scales with $\beta^2$ instead of $(\sigma \wedge \beta)\beta$, despite its matching scaling with respect to $\pi_L$. Indeed, a mismatch between the upper bound and the lower bound also occurs in Cai & Zhou (2016) under asymptotically homogeneous settings, where their bound is derived via an additional assumption $\|A_\star\|_\infty \le \alpha$. Their upper bound scales with $\alpha\beta$ instead of $(\sigma \wedge \alpha)\beta$ as in their minimax lower bound. We leave a more detailed study of the scaling with respect to $\beta$ as a future direction.

6. Simulations

In this simulation study, we let the target matrix $A_\star \in \mathbb{R}^{n_1 \times n_2}$ be generated by $A_\star = UV^\top$, where $U \in \mathbb{R}^{n_1 \times r}$, $V \in \mathbb{R}^{n_2 \times r}$, and each entry of $U$ and $V$ is sampled uniformly and independently from $[0, 2]$. We set $n_1 = n_2 = 200$ and $r = 5$; therefore, the rank of the target matrix is 5. The contaminated version of $A_\star$ is then generated as $Y = A_\star + \epsilon$, where $\epsilon \in \mathbb{R}^{n_1 \times n_2}$ has i.i.d. mean-zero Gaussian entries $\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)$. There are three settings of $\sigma_\epsilon$, chosen such that the signal-to-noise ratios ($\mathrm{SNR} := (\mathbb{E}\|A_\star\|_F^2 / \mathbb{E}\|\epsilon\|_F^2)^{1/2}$) are 1, 5 and 10. We consider three different missing mechanisms and generate the observation indicator matrix $T$ from $\Pi = (\pi_{ij})_{i,j=1}^{n_1,n_2}$, specified as follows:

Setting 1: This is a uniform missing setting: $\pi_{ij} = 0.25$ for all $i, j = 1, \dots, 200$.

Setting 2: In this setting, we relate the missingness to the value of the target matrix. Entries with high values are more likely to be observed. More specifically, we set
$$\pi_{ij} = \begin{cases} 1/16, & \text{if } A_{\star,ij} \le q_{0.25}, \\ 0.25, & \text{if } q_{0.25} < A_{\star,ij} \le q_{0.75}, \\ 7/16, & \text{if } A_{\star,ij} > q_{0.75}, \end{cases}$$
where $q_a$ is the $a$-quantile of $\{A_{\star,ij} : i, j = 1, \dots, 200\}$.

Setting 3: This setting is the opposite of Setting 2. Entries with high values are less likely to be observed:
$$\pi_{ij} = \begin{cases} 7/16, & \text{if } A_{\star,ij} \le q_{0.25}, \\ 0.25, & \text{if } q_{0.25} < A_{\star,ij} \le q_{0.75}, \\ 1/16, & \text{if } A_{\star,ij} > q_{0.75}, \end{cases}$$
where $q_a$ is the $a$-quantile of $\{A_{\star,ij} : i, j = 1, \dots, 200\}$.

We generate 200 simulated data sets separately for each of the above settings to compare different matrix completion methods, including the proposed method (Bal Weights) and five existing matrix completion methods: Mazumder et al. (2010) (Soft Impute), Cai & Zhou (2016) (CZ), Fang et al. (2018) (FLT), Koltchinskii et al. (2011) (KLT) and Negahban & Wainwright (2012) (NW).
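For concreteness, here is a compact sketch (ours) of the data-generating process above, instantiated for Setting 2; the seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n1 = n2 = 200
r = 5
U = rng.uniform(0, 2, (n1, r))
V = rng.uniform(0, 2, (n2, r))
A_star = U @ V.T                                      # rank-5 target

snr = 5.0
sigma_eps = np.sqrt(np.mean(A_star ** 2)) / snr       # solves SNR = sqrt(E||A*||_F^2 / E||eps||_F^2)
Y = A_star + rng.normal(0, sigma_eps, (n1, n2))

q25, q75 = np.quantile(A_star, [0.25, 0.75])
Pi = np.where(A_star <= q25, 1 / 16,
              np.where(A_star <= q75, 0.25, 7 / 16))  # Setting 2 probabilities
T = rng.binomial(1, Pi)                               # observation indicators
```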
For all methods mentioned above, we randomly separate out 20% of the observed entries in every simulated dataset and use them as the validation set to select tuning parameters. In addition to the empirical root mean squared error (RMSE), we also report the estimated rank and the test error
$$\mathrm{TE} := \frac{\|(J - T) \circ (\widetilde{A} - A_\star)\|_F}{\sqrt{n_1 n_2 - N}},$$
where $\widetilde{A}$ is a generic estimator of $A_\star$, $T$ is the matrix of observation indicators and $N$ is the number of observed entries. The test error measures the relative estimation error of the unobserved entries. Due to space limitations, we only present the results for SNR = 5; results for SNR = 1 and SNR = 10 can be found in Section F of the supplemental document.
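In code, the two evaluation metrics look as follows (our sketch; we take the empirical RMSE over all entries of the matrix, which is one natural reading of the definition):

```python
import numpy as np

def rmse(A_tilde, A_star):
    """Root mean squared error over all entries."""
    return np.sqrt(np.mean((A_tilde - A_star) ** 2))

def test_error(A_tilde, A_star, T):
    """TE = ||(J - T) o (A_tilde - A*)||_F / sqrt(n1 n2 - N)."""
    miss = T == 0
    return np.sqrt(np.sum((A_tilde - A_star)[miss] ** 2) / miss.sum())
```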
Table 1 summarizes the average RMSE, average TE, and average estimated ranks for all three settings. In all three settings, Soft Impute, CZ and KLT do not provide results as competitive as the others. For Setting 1, NW achieves the smallest RMSE and TE, but Bal Weights performs close to it. When SNR = 1 (shown in the supplemental document), Bal Weights performs best: the average RMSE of Bal Weights is 1.901, while the average RMSE of NW is 2.012. As for Settings 2 and 3, Bal Weights outperforms the other methods. Also, NW performs significantly worse than Bal Weights in Setting 2. FLT has average RMSE and TE close to those of Bal Weights in Setting 2 but does not perform well in Setting 3. As a result, we can see that Bal Weights is quite robust across different missing structures.

Table 1. Simulation results for the three settings when SNR = 5. The average RMSE (RMSE), average TE (TE), and average estimated ranks ($\bar{r}$) with standard errors (SE) in parentheses are provided for the six methods in comparison (Bal Weights, Soft Impute, CZ, FLT, NW and KLT). For the columns related to RMSE and TE, we bold the two smallest errors.

Setting 1

| Method | RMSE | TE | $\bar{r}$ |
|---|---|---|---|
| Bal Weights | **0.679(0.001)** | **0.700(0.001)** | 25.150(0.128) |
| Soft Impute | 0.699(0.001) | 0.721(0.001) | 45.005(0.161) |
| CZ | 0.895(0.002) | 0.899(0.002) | 51.075(0.121) |
| FLT | 0.682(0.001) | 0.703(0.001) | 26.705(0.131) |
| NW | **0.668(0.001)** | **0.688(0.001)** | 28.04(0.187) |
| KLT | 1.913(0.003) | 1.976(0.003) | 8.720(0.060) |

Setting 2

| Method | RMSE | TE | $\bar{r}$ |
|---|---|---|---|
| Bal Weights | **0.624(0.001)** | **0.635(0.001)** | 24.980(0.136) |
| Soft Impute | 0.648(0.001) | 0.660(0.001) | 41.240(0.104) |
| CZ | 0.922(0.002) | 0.945(0.002) | 47.170(0.156) |
| FLT | **0.628(0.001)** | **0.640(0.001)** | 26.045(0.145) |
| NW | 0.665(0.002) | 0.674(0.002) | 22.030(0.806) |
| KLT | 1.980(0.006) | 1.880(0.004) | 1.355(0.141) |

Setting 3

| Method | RMSE | TE | $\bar{r}$ |
|---|---|---|---|
| Bal Weights | **0.925(0.002)** | **1.002(0.002)** | 24.090(0.138) |
| Soft Impute | 1.143(0.003) | 1.254(0.003) | 47.240(0.144) |
| CZ | 1.222(0.003) | 1.324(0.003) | 50.590(0.151) |
| FLT | 1.026(0.002) | 1.118(0.003) | 32.440(0.131) |
| NW | **0.964(0.002)** | **1.043(0.002)** | 18.350(0.319) |
| KLT | 3.174(0.006) | 3.477(0.006) | 9.575(0.093) |

7. Real Data Applications

We applied the above methods to two real datasets:

1. Coat Shopping Dataset, available at http://www.cs.cornell.edu/~schnabts/mnar/. As described in Schnabel et al. (2016), the dataset contains ratings from 290 Turkers on an inventory of 300 items. The self-selected ratings form the training set and the uniformly selected ratings form the test set. The training set consists of 6960 entries and the test set consists of 4640 entries.

2. Yahoo! Webscope Dataset, available at http://research.yahoo.com/Academic_Relations. It contains (incomplete) ratings from 15,400 users on 1000 songs. The dataset consists of two subsets, a training set and a test set. The training set records approximately 300,000 ratings given by the aforementioned 15,400 users. Each song has at least 10 ratings. The test set was constructed by surveying 5,400 out of these 15,400 users, such that each selected user rated exactly 10 additional songs.

For the second dataset, due to its large size, we use a non-convex algorithm of Lee et al. (2010) to obtain CZ. Also, we modify this algorithm to incorporate an additional nuclear-norm regularization, to obtain Bal Weights and FLT. The detailed algorithm can be found in Section E.2 of the supplemental document. For both datasets, we separate out half of the test set as the validation set to select tuning parameters for all methods; the remaining half of the test set is used as the evaluation set. Here, we report the test root mean squared error
$$\mathrm{TRMSE} := \frac{\|T_e \circ (\widetilde{A} - A_\star)\|_F}{\sqrt{N_e}},$$
where $\widetilde{A}$ is a generic estimator of $A_\star$, $T_e$ is the indicator matrix for the evaluation set and $N_e$ is the number of evaluation entries, and the test mean absolute error
$$\mathrm{TMAE} := \frac{1}{N_e} \sum_{(i,j): T_{e,ij} = 1} |\widetilde{A}_{ij} - A_{\star,ij}|,$$
to measure the performance of all the methods. Rank estimation is also provided.

Table 2. Test root mean squared errors (TRMSE), test mean absolute errors (TMAE) and estimated ranks (Rank) based on the evaluation sets of the Coat Shopping Dataset and the Yahoo! Webscope Dataset, for Bal Weights and five existing methods proposed respectively in Mazumder et al. (2010) (Soft Impute), Cai & Zhou (2016) (CZ), Fang et al. (2018) (FLT), Negahban & Wainwright (2012) (NW) and Koltchinskii et al. (2011) (KLT). For the columns related to TRMSE and TMAE, we bold the two smallest errors.

Coat Shopping Dataset

| Method | TRMSE | TMAE | Rank |
|---|---|---|---|
| Bal Weights | **0.9888** | **0.7627** | 26 |
| Soft Impute | 1.1401 | 0.8485 | 15 |
| CZ | 1.0354 | 0.8279 | 31 |
| FLT | **0.9980** | **0.7723** | 32 |
| NW | 1.0553 | 0.7972 | 25 |
| KLT | 2.0838 | 1.5733 | 2 |

Yahoo! Webscope Dataset

| Method | TRMSE | TMAE | Rank |
|---|---|---|---|
| Bal Weights | **1.0111** | **0.7739** | 64 |
| Soft Impute | 1.2172 | 0.9230 | 31 |
| CZ | 1.0339 | 0.8156 | 29 |
| FLT | 1.0339 | 0.8156 | 29 |
| NW | **1.0338** | **0.7954** | 25 |
| KLT | 3.811 | 1.6589 | 1 |

Table 2 shows the TRMSE, TMAE and estimated ranks for the two datasets with all the methods mentioned above. For the Coat Shopping Dataset, compared with the existing methods, the proposed method Bal Weights achieves the best TRMSE and TMAE. The errors of FLT are similar to those of Bal Weights, but its estimated rank is larger than that of Bal Weights; in other words, Bal Weights is significantly more efficient in capturing the signal. For the Yahoo! Webscope Dataset, Bal Weights also has the smallest errors among all the methods.
However, compared with CZ, FLT and NW, whose errors are relatively close to those of Bal Weights, Bal Weights has a higher estimated rank, though 64 is a reasonably small rank for a matrix of size 1000 by 15400. To confirm that the higher errors of CZ, FLT and NW are not due to their smaller rank estimates, we look into the test error sequences obtained by varying the tuning parameters for each of these three methods. We find that the change in test errors (based on the evaluation set) aligns well with the validation errors (based on the validation set), and the chosen tuning parameters indeed correspond to almost the smallest test errors these methods can achieve. This suggests that these three estimators are not able to capture additional useful information and hence produce smaller rank estimates, whereas the proposed estimator is able to capitalize on these additional signals to achieve a reduction in errors.

Acknowledgements

The authors thank the reviewers for their helpful comments and suggestions. The work of Raymond K. W. Wong is partially supported by the US National Science Foundation (DMS-1711952, DMS-1806063 and CCF-1934904). The work of Xiaojun Mao is partially supported by NSFC Grants No. 12001109 and 92046021, Shanghai Sailing Program 19YF1402800, and the Science and Technology Commission of Shanghai Municipality grant 20dz1200600. The work of K. C. G. Chan is partially supported by the US National Science Foundation (DMS-1711952). Portions of this research were conducted with high performance research computing resources provided by Texas A&M University (https://hprc.tamu.edu).

References

Athey, S., Bayati, M., Doudchenko, N., Imbens, G., and Khosravi, K. Matrix completion methods for causal panel data models. Technical report, National Bureau of Economic Research, 2018.

Bennett, J. and Lanning, S. The Netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, pp. 35, 2007.

Bhaskar, S. A. Probabilistic low-rank matrix completion from quantized measurements. Journal of Machine Learning Research, 17(60):1–34, 2016.

Bi, X., Qu, A., Wang, J., and Shen, X. A group-specific recommender system. Journal of the American Statistical Association, 112(519):1344–1353, 2017.

Cai, T., Kim, D., Wang, Y., Yuan, M., and Zhou, H. H. Optimal large-scale quantum state tomography with Pauli measurements. The Annals of Statistics, 44(2):682–712, 2016.

Cai, T. T. and Zhou, W.-X. A max-norm constrained minimization approach to 1-bit matrix completion. The Journal of Machine Learning Research, 14(1):3619–3647, 2013.

Cai, T. T. and Zhou, W.-X. Matrix completion via max-norm constrained optimization. Electronic Journal of Statistics, 10(1):1493–1525, 2016.

Candès, E. J. and Plan, Y. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

Chen, Y., Chi, Y., Fan, J., Ma, C., and Yan, Y. Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM Journal on Optimization, 30(4):3098–3121, 2020.

Chi, E. C., Zhou, H., Chen, G. K., Del Vecchyo, D. O., and Lange, K. Genotype imputation via matrix completion. Genome Research, 23(3):509–518, 2013.

Dai, B., Wang, J., Shen, X., and Qu, A. Smooth neighborhood recommender systems. Journal of Machine Learning Research, 20(16):1–24, 2019.

Davenport, M. A., Plan, Y., van den Berg, E., and Wootters, M. 1-bit matrix completion. Information and Inference, 3(3):189–223, 2014.
Fang, E. X., Liu, H., Toh, K.-C., and Zhou, W.-X. Max-norm optimization for robust matrix recovery. Mathematical Programming, 167(1):5–35, 2018.

Fithian, W. and Mazumder, R. Flexible low-rank statistical modeling with missing data and side information. Statistical Science, 33(2):238–260, 2018.

Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 315–340, 2011.

Foygel, R., Shamir, O., Srebro, N., and Salakhutdinov, R. R. Learning with the weighted trace-norm under arbitrary sampling distributions. In Advances in Neural Information Processing Systems, pp. 2133–2141, 2011.

Hastie, T., Mazumder, R., Lee, J. D., and Zadeh, R. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16(1):3367–3402, 2015.

Hernández-Lobato, J. M., Houlsby, N., and Ghahramani, Z. Probabilistic matrix factorization with non-random missing data. In International Conference on Machine Learning, pp. 1512–1520, 2014.

Kallus, N., Mao, X., and Udell, M. Causal inference with noisy and missing covariates via matrix factorization. In Advances in Neural Information Processing Systems, pp. 6921–6932, 2018.

Kang, J. D. and Schafer, J. L. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523–539, 2007.

Kang, Z., Peng, C., and Cheng, Q. Top-N recommender system via matrix completion. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Klopp, O. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1):282–303, 2014.

Klopp, O., Lafond, J., Moulines, E., Salmon, J., et al. Adaptive multinomial matrix completion. Electronic Journal of Statistics, 9(2):2950–2975, 2015.

Koltchinskii, V. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, 2011.

Koltchinskii, V., Lounici, K., and Tsybakov, A. B. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329, 2011.

Lafond, J., Klopp, O., Moulines, E., and Salmon, J. Probabilistic low-rank matrix completion on finite alphabets. In Advances in Neural Information Processing Systems, pp. 1727–1735, 2014.

Lee, J. D., Recht, B., Srebro, N., Tropp, J., and Salakhutdinov, R. R. Practical large-scale optimization for max-norm regularization. In Advances in Neural Information Processing Systems, pp. 1297–1305, 2010.

Mao, X., Chen, S. X., and Wong, R. K. Matrix completion with covariate information. Journal of the American Statistical Association, 114(525):198–210, 2019.

Mao, X., Wong, R. K., and Chen, S. X. Matrix completion under low-rank missing mechanism. Statistica Sinica, 2020.

Mazumder, R., Hastie, T., and Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.

Montanari, A. and Oh, S. On positioning via distributed matrix completion. In Sensor Array and Multichannel Signal Processing Workshop (SAM), 2010 IEEE, pp. 197–200, 2010.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13(1):1665–1697, 2012.

Recht, B. A simpler approach to matrix completion. The Journal of Machine Learning Research, 12:3413–3430, 2011.
Rennie, J. D. and Srebro, N. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, pp. 713–719, 2005.

Robin, G., Klopp, O., Josse, J., Moulines, E., and Tibshirani, R. Main effects and interactions in mixed and incomplete data frames. Journal of the American Statistical Association, 115(531):1292–1303, 2020.

Rubin, D. B. Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3-4):169–188, 2001.

Schnabel, T., Swaminathan, A., Singh, A., Chandak, N., and Joachims, T. Recommendations as treatments: Debiasing learning and evaluation. In Proceedings of Machine Learning Research, volume 48, pp. 1670–1679, New York, New York, USA, 2016. PMLR.

Sengupta, N., Srebro, N., and Evans, J. Simple surveys: Response retrieval inspired by recommendation systems. Social Science Computer Review, 39(1):105–129, 2021.

Srebro, N. and Salakhutdinov, R. R. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. In Advances in Neural Information Processing Systems, volume 23, pp. 2056–2064, 2010.

Srebro, N. and Shraibman, A. Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, pp. 545–560. Springer, 2005.

Srebro, N., Rennie, J., and Jaakkola, T. S. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pp. 1329–1336, 2005.

Wang, Y. Asymptotic equivalence of quantum state tomography and noisy matrix completion. The Annals of Statistics, 41(5):2462–2504, 2013.

Weinberger, K. Q. and Saul, L. K. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1):77–90, 2006.

Wooldridge, J. M. Inverse probability weighted estimation for general missing data problems. Journal of Econometrics, 141(2):1281–1301, 2007.

Zhang, C., Taylor, S. J., Cobb, C., Sekhon, J., et al. Active matrix factorization for surveys. Annals of Applied Statistics, 14(3):1182–1206, 2020.

Zhu, Y., Shen, X., and Ye, C. Personalized prediction and sparsity pursuit in latent factor models. Journal of the American Statistical Association, 111(513):241–252, 2016.