# Conformalized matrix completion

Yu Gui, Rina Foygel Barber, Cong Ma
Department of Statistics, University of Chicago
{yugui,rina,congm}@uchicago.edu

Matrix completion aims to estimate missing entries in a data matrix, using the assumption of a low-complexity structure (e.g., low rank) so that imputation is possible. While many effective estimation algorithms exist in the literature, uncertainty quantification for this problem has proved to be challenging, and existing methods are extremely sensitive to model misspecification. In this work, we propose a distribution-free method for predictive inference in the matrix completion problem. Our method adapts the framework of conformal prediction, which provides confidence intervals with guaranteed distribution-free validity in the setting of regression, to the problem of matrix completion. Our resulting method, conformalized matrix completion (cmc), offers provable predictive coverage regardless of the accuracy of the low-rank model. Empirical results on simulated and real data demonstrate that cmc is robust to model misspecification while matching the performance of existing model-based methods when the model is correct.

1 Introduction

Matrix completion, the task of filling in missing entries in a large data matrix, has a wide range of applications such as gene-disease association analysis in bioinformatics (Natarajan and Dhillon, 2014), collaborative filtering in recommender systems (Rennie and Srebro, 2005), and panel data prediction and inference in econometrics (Amjad et al., 2018; Bai and Ng, 2021; Athey et al., 2021). If the underlying signal is assumed to be low-rank, a range of estimation algorithms have been proposed in the literature, including approaches based on convex relaxations of rank (Candes and Plan, 2010; Candès and Tao, 2010; Koltchinskii et al., 2011; Foygel and Srebro, 2011; Negahban and Wainwright, 2012; Yang and Ma, 2022) and nonconvex optimization over the space of low-rank matrices (Keshavan et al., 2010; Jain et al., 2013; Sun and Luo, 2016; Ma et al., 2020).

While the problem of estimation has been explored through many different approaches in the literature, the question of uncertainty quantification for matrix completion remains challenging and under-explored. Cai et al. (2016) and Carpentier et al. (2018) construct confidence intervals for the missing entries using order-wise concentration inequalities, which can be extremely loose. Chen et al. (2019) propose a de-biased estimator for matrix completion and characterize its asymptotic distribution, from which entrywise confidence intervals are derived for an underlying incoherent matrix with i.i.d. Gaussian noise. See also Xia and Yuan (2021); Farias et al. (2022) for related approaches. The effectiveness of the aforementioned uncertainty quantification methods relies heavily on the exact low-rank structure as well as assumptions on the tail of the noise. In practice, the low-rank assumption can only be met approximately, and heavy-tailed noise is quite common in areas such as macroeconomics and genomics (Nair et al., 2022). Without exact model specification, the solution of a matrix completion algorithm is hard to interpret, and the asymptotic distribution of the obtained estimator is no longer valid.
The dependence of model-based approaches on exact model assumptions motivates us to answer the question: can we construct valid and short confidence intervals for the missing entries that are free of both model assumptions and of constraints on which estimation algorithms we may use?

1.1 Conformal prediction for distribution-free uncertainty quantification

Our proposed method is built on the idea of conformal prediction. The conformal prediction framework, developed by Vovk et al. (2005) and Shafer and Vovk (2008), has drawn significant attention in recent years because it enables the construction of distribution-free confidence intervals that are valid with exchangeable data from any underlying distribution and with any black-box algorithm. To reduce the computational cost of full conformal prediction, variants exist based on data splitting (Vovk et al., 2005; Lei et al., 2018), leave-one-out (Barber et al., 2021), cross-validation (Vovk, 2015; Barber et al., 2021), etc. When distribution shift is present, with covariate shift as a special case, exchangeability is violated, and Tibshirani et al. (2019) propose a more general procedure called weighted conformal prediction that guarantees validity under the weighted exchangeability of the data. Barber et al. (2022) further relax the exchangeability assumption on the data and characterize the robustness of a generalized weighted approach via an upper bound on the coverage gap. Other related works (Lei and Candès, 2020; Candès et al., 2021; Jin et al., 2023) adapt the conformal framework to applications in causal inference, survival analysis, etc., and study the robustness of weighted conformal prediction with estimated weights.

1.2 Main contributions

In this paper, we adapt the framework of conformal prediction to the problem of matrix completion, and make the following contributions:

1. We construct distribution-free confidence intervals for the missing entries in matrix completion via conformal prediction. The validity is free of any assumption on the underlying matrix and holds regardless of the estimation algorithm practitioners use. To achieve this, we prove the (weighted) exchangeability of unobserved and observed units when they are sampled (possibly nonuniformly) without replacement from a finite population.

2. When the sampling mechanism is unknown, we develop a provable lower bound for the coverage rate which degrades gracefully as the estimation error of the sampling probability increases.

3. In addition, to improve computational efficiency when faced with a large number of missing entries, we propose a one-shot conformalized matrix completion approach that only computes the weighted quantile once for all missing entries.

2 Problem setup

In this section, we formulate the matrix completion problem and contrast our distribution-free setting with the model-based settings that are more common in the existing literature. For a partially-observed $d_1\times d_2$ matrix, the subset $S\subseteq[d_1]\times[d_2]$ denotes the set of indices where data is observed; that is, our observations consist of $(M_{ij})_{(i,j)\in S}$. In much of the matrix completion literature, it is common to assume a signal-plus-noise model for the observations, $M_{ij} = M^*_{ij} + E_{ij}$, where $M^* = (M^*_{ij})_{(i,j)\in[d_1]\times[d_2]}$ is the true underlying signal while $(E_{ij})_{(i,j)\in S}$ denotes the (typically zero-mean) noise.
Frequently, it is assumed that $M^*$ is low-rank, or follows some other latent low-dimensional structure, so that recovering $M^*$ from the observed data is an identifiable problem. In works of this type, the goal is to construct an estimator $\widehat{M}$ that accurately recovers the underlying $M^*$. In contrast, in a more model-free setting, we may no longer wish to hypothesize a signal-plus-noise model, and can instead assume that we are observing a random subset of entries of a deterministic matrix $M$; in this setting, estimating $M$ itself becomes the goal. For both frameworks, many existing results focus on the problem of estimation (either of $M^*$ or of $M$), with results establishing bounds on estimation errors under various assumptions. For instance, see Candes and Plan (2010); Candès and Tao (2010); Negahban and Wainwright (2012); Chen et al. (2020) for results on estimating $M^*$ (under stronger conditions, e.g., a low-rank and incoherent signal, and zero-mean sub-Gaussian noise), or Srebro and Shraibman (2005); Foygel and Srebro (2011) for results on estimating $M$ (under milder conditions); note that in the latter case, it is common to instead refer to predicting the entries of $M$, e.g., temperature or stock market returns, since we often think of these as noisy observations of some underlying signal.

On the other hand, relatively little is known about the problem of uncertainty quantification for these estimates: given some estimator $\widehat{M}$, can we produce a confidence interval around each $\widehat{M}_{ij}$ that is (asymptotically) guaranteed to contain the target with some minimum probability? The results of Chen et al. (2019) address this question under strong assumptions, namely, a low-rank and incoherent signal $M^*$, plus i.i.d. Gaussian noise $(E_{ij})_{(i,j)\in[d_1]\times[d_2]}$. Zhao and Udell (2020) propose another inferential procedure under a Gaussian copula assumption for the data. In this work, we provide a complementary answer that instead addresses the model-free setting: without relying on assumptions, we aim to produce a provably valid confidence interval for the entries $M_{ij}$ of $M$ (not of $M^*$, because without assuming a low-rank model, there is no meaningful notion of an underlying signal). To do so, we will treat the matrix $M$ as deterministic, while the randomness arises purely from the random subset of entries that are observed. More specifically, we assume
$$Z_{ij} = \mathbf{1}\{(i,j)\text{ is observed}\} \sim \mathrm{Bern}(p_{ij}), \quad \text{independently for all } (i,j)\in[d_1]\times[d_2], \quad (1)$$
and let $S = \{(i,j)\in[d_1]\times[d_2] : Z_{ij}=1\}$ be the subset of observed locations. We consider the setting where the sampling probabilities $p_{ij}$ are all nonzero. After observing $M_S = (M_{ij})_{(i,j)\in S}$, our goal is to provide confidence intervals $\{\widehat{C}(i,j)\}_{(i,j)\in S^c}$ for all unobserved entries, with $1-\alpha$ coverage over the unobserved portion of the matrix; that is, we would like our confidence intervals to satisfy
$$\mathbb{E}\big[\mathrm{AvgCov}(\widehat{C}; M, S)\big] \ge 1-\alpha, \quad (2)$$
where
$$\mathrm{AvgCov}(\widehat{C}; M, S) = \frac{1}{|S^c|}\sum_{(i,j)\in S^c} \mathbf{1}\big\{M_{ij}\in\widehat{C}(i,j)\big\} \quad (3)$$
measures the average coverage rate of the confidence intervals. (Footnote 1: To handle the degenerate case where $S = [d_1]\times[d_2]$, i.e., we happen to have observed the entire matrix, and thus $S^c=\varnothing$, we simply define $\mathrm{AvgCov}(\widehat{C}; M, S)\equiv 1$.)

The goal of this work is to provide an algorithm for constructing confidence intervals $\{\widehat{C}(i,j)\}_{(i,j)\in S^c}$ based on the observed entries $M_S$, satisfying (2) with no assumptions on $M$. In particular, given the strong success of various matrix completion algorithms for the problem of estimation, we would like to be able to provide uncertainty quantification around any base estimator; i.e., given any algorithm for producing an estimate $\widehat{M}$, our method constructs confidence intervals $\{\widehat{C}(i,j)\}_{(i,j)\in S^c}$ around this initial estimate.
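For concreteness, the observation model (1) and the average-coverage criterion (3) can be simulated directly. The short NumPy sketch below (function and variable names are ours, for illustration only, not taken from the paper's code) draws a Bernoulli mask and evaluates AvgCov for a given collection of intervals.

```python
import numpy as np

def sample_mask(P, rng):
    """Draw Z_ij ~ Bern(p_ij) independently, as in the observation model (1)."""
    return rng.random(P.shape) < P

def avg_cov(M, lower, upper, observed):
    """Average coverage over the unobserved entries, as in definition (3)."""
    unobserved = ~observed
    if unobserved.sum() == 0:          # degenerate case S = [d1] x [d2]
        return 1.0
    hit = (M >= lower) & (M <= upper)
    return hit[unobserved].mean()

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 40))          # deterministic target matrix
P = np.full(M.shape, 0.3)              # uniform sampling probabilities
observed = sample_mask(P, rng)
# trivial constant intervals, just to exercise avg_cov
lower, upper = M.mean() - 2.0, M.mean() + 2.0
print(avg_cov(M, np.full(M.shape, lower), np.full(M.shape, upper), observed))
```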
Notation. For a matrix $A = (A_{ij})\in\mathbb{R}^{m_1\times m_2}$, we define $\|A\|_\infty := \max_{(i,j)\in[m_1]\times[m_2]} |A_{ij}|$ and $\|A\|_1 := \sum_{(i,j)\in[m_1]\times[m_2]} |A_{ij}|$. We use $A_{i,\cdot}$ and $A_{\cdot,j}$ to refer to the $i$th row and $j$th column of $A$, respectively. For any $t\in\mathbb{R}$, $\delta_t$ denotes the distribution given by a point mass at $t$.

3 Conformalized matrix completion

We next present our method, conformalized matrix completion (cmc). Our procedure adapts the split conformal prediction method (Vovk et al., 2005) to the problem at hand. As is standard in the conformal prediction framework, the goal of cmc is to provide uncertainty quantification (i.e., confidence intervals) around the output of any existing estimation procedure that the analyst chooses to use; this is a core strength of the conformal methodology, as it allows the analyst to use state-of-the-art estimation procedures without compromising on the validity of inference. At a high level, the cmc procedure outputs a confidence interval of the form $\widehat{C}(i,j) = \widehat{M}_{ij} \pm \widehat{q}\cdot\widehat{s}_{ij}$ for the matrix value $M_{ij}$ at each unobserved entry $(i,j)$. Here, after splitting the observed entries into a training set and a calibration set, the training set is used to produce $\widehat{M}_{ij}$ (a point estimate for the target value $M_{ij}$) and $\widehat{s}_{ij}$ (a scaling parameter that estimates our uncertainty for this point estimate); then, the calibration set is used to tune the scalar $\widehat{q}$, to adjust the width of these confidence intervals and ensure coverage at the desired level $1-\alpha$ (see Footnote 2 below).

Algorithm 1 Conformalized matrix completion (cmc)
1: Input: target coverage level $1-\alpha$; data splitting proportion $q\in(0,1)$; observed entries $M_S$.
2: Split the data: draw $W_{ij}$ i.i.d. $\sim\mathrm{Bern}(q)$, and define training and calibration sets $S_{\mathrm{tr}} = \{(i,j)\in S : W_{ij}=1\}$ and $S_{\mathrm{cal}} = \{(i,j)\in S : W_{ij}=0\}$.
3: Using the training data $M_{S_{\mathrm{tr}}}$ indexed by $S_{\mathrm{tr}}\subseteq[d_1]\times[d_2]$, compute: an initial estimate $\widehat{M}$ using any matrix completion algorithm (with $\widehat{M}_{ij}$ estimating the target $M_{ij}$); optionally, a local uncertainty estimate $\widehat{s}$ (with $\widehat{s}_{ij}$ estimating our relative uncertainty in the estimate $\widehat{M}_{ij}$), or otherwise set $\widehat{s}_{ij}\equiv 1$; and an estimate $\widehat{P}$ of the observation probabilities (with $\widehat{p}_{ij}$ estimating $p_{ij}$, the probability of entry $(i,j)$ being observed).
4: Compute normalized residuals on the calibration set, $R_{ij} = |M_{ij} - \widehat{M}_{ij}|/\widehat{s}_{ij}$, for $(i,j)\in S_{\mathrm{cal}}$.
5: Compute estimated odds ratios for the calibration set and test set, $\widehat{h}_{ij} = (1-\widehat{p}_{ij})/\widehat{p}_{ij}$, for $(i,j)\in S_{\mathrm{cal}}\cup S^c$, and then compute weights for the calibration set and the test point,
$$\widehat{w}_{ij} = \frac{\widehat{h}_{ij}}{\sum_{(i',j')\in S_{\mathrm{cal}}}\widehat{h}_{i'j'} + \max_{(i',j')\in S^c}\widehat{h}_{i'j'}}, \ (i,j)\in S_{\mathrm{cal}}, \qquad \widehat{w}_{\mathrm{test}} = \frac{\max_{(i,j)\in S^c}\widehat{h}_{ij}}{\sum_{(i',j')\in S_{\mathrm{cal}}}\widehat{h}_{i'j'} + \max_{(i',j')\in S^c}\widehat{h}_{i'j'}}. \quad (4)$$
6: Compute the threshold $\widehat{q} = \mathrm{Quantile}_{1-\alpha}\big(\sum_{(i,j)\in S_{\mathrm{cal}}}\widehat{w}_{ij}\,\delta_{R_{ij}} + \widehat{w}_{\mathrm{test}}\,\delta_{+\infty}\big)$.
7: Output: confidence intervals $\widehat{C}(i,j) = \widehat{M}_{ij} \pm \widehat{q}\cdot\widehat{s}_{ij}$ for each unobserved entry $(i,j)\in S^c$.

3.1 Exchangeability and weighted exchangeability

We split the set of observed indices $S$ into a training and a calibration set, $S = S_{\mathrm{tr}}\cup S_{\mathrm{cal}}$, as shown in Algorithm 1. Note that sampling without replacement introduces implicit dependence as well as a distribution shift relative to i.i.d. sampling; thus the two sets are dependent, unlike in split conformal methods for regression problems. Before we present the coverage guarantees for our method, we first examine the role of (weighted) exchangeability in this method, to build intuition for how the method is constructed.
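To make the steps of Algorithm 1 concrete, here is a minimal Python sketch of the split, the weighted quantile of step 6, and the resulting intervals. The base estimator (a rank-3 truncated SVD of a mean-imputed matrix) and the constant $\widehat{s}\equiv 1$ are placeholders chosen only for illustration; as the algorithm states, any matrix completion method and any uncertainty estimate fit on the training split can be substituted.

```python
import numpy as np

def weighted_quantile(values, weights, level):
    """Smallest v whose cumulative weight reaches `level`. The weights sum to less than 1;
    the remaining mass (w_test, sitting at +infinity) is handled by returning +inf when
    `level` is never reached, which is how the term w_test * delta_{+inf} in step 6 enters."""
    order = np.argsort(values)
    v, cum = values[order], np.cumsum(weights[order])
    idx = np.searchsorted(cum, level, side="left")
    return v[idx] if idx < len(v) else np.inf

def cmc_intervals(M_obs, observed, alpha=0.1, q_split=0.5, p_hat=None, seed=0):
    """Split-conformal matrix completion in the spirit of Algorithm 1, with a placeholder base estimator."""
    rng = np.random.default_rng(seed)
    d1, d2 = M_obs.shape
    W = rng.random((d1, d2)) < q_split                   # step 2: W_ij ~ Bern(q)
    train, cal = observed & W, observed & ~W
    # step 3 (placeholder): rank-3 truncated SVD of a mean-imputed matrix, constant s_hat
    X = np.where(train, M_obs, M_obs[train].mean())      # entries outside `observed` may be NaN
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    M_hat = (U[:, :3] * s[:3]) @ Vt[:3]
    s_hat = np.ones((d1, d2))
    if p_hat is None:                                    # homogeneous plug-in estimate of p
        p_hat = np.full((d1, d2), observed.mean())
    h_hat = (1.0 - p_hat) / p_hat                        # step 5: estimated odds ratios
    R = (np.abs(M_obs - M_hat) / s_hat)[cal]             # step 4: calibration residuals
    denom = h_hat[cal].sum() + h_hat[~observed].max()
    w_cal = h_hat[cal] / denom                           # weights (4); w_test = max h / denom
    q_hat = weighted_quantile(R, w_cal, 1.0 - alpha)     # step 6
    return M_hat - q_hat * s_hat, M_hat + q_hat * s_hat  # step 7
```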
3.1.1 Intuition: the uniform sampling case

First, for intuition, consider the simple case where the entries are sampled with constant probability, $p_{ij}\equiv p$. If this is the case, then the set of calibration samples and the test point are exchangeable; that is, for $(i_*,j_*)$ denoting the test point that is drawn uniformly from $S^c$, if we are told that the combined set $S_{\mathrm{cal}}\cup\{(i_*,j_*)\}$ is equal to $\{(i_1,j_1),\dots,(i_{n_{\mathrm{cal}}+1},j_{n_{\mathrm{cal}}+1})\}$ in no particular order (where $n_{\mathrm{cal}} = |S_{\mathrm{cal}}|$), then the test point location $(i_*,j_*)$ is equally likely to be at any one of these $n_{\mathrm{cal}}+1$ locations. Consequently, by exchangeability, the calibration residuals $\{R_{ij} : (i,j)\in S_{\mathrm{cal}}\}$, defined in Algorithm 1 above, are exchangeable with the test residual $R_{i_*j_*} = |M_{i_*j_*} - \widehat{M}_{i_*j_*}|/\widehat{s}_{i_*j_*}$; as a result, we can construct our confidence interval $\widehat{C}(i_*,j_*)$ with $\widehat{q}$ determined by the $(1-\alpha)$-quantile of the calibration residuals $\{R_{ij} : (i,j)\in S_{\mathrm{cal}}\}$. Indeed, this is exactly what cmc does in this case: if we are aware that sampling is uniform, it would naturally follow that we estimate $\widehat{p}_{ij}\equiv\widehat{p}_0$ by some (potentially data-dependent) scalar $\widehat{p}_0$; consequently, all the weights are given by $\widehat{w}_{ij}\equiv\frac{1}{n_{\mathrm{cal}}+1}$ and $\widehat{w}_{\mathrm{test}} = \frac{1}{n_{\mathrm{cal}}+1}$, and thus $\widehat{q}$ is the (unweighted) $(1-\alpha)$-quantile of the calibration residuals (with a small correction term, i.e., the term $+\,\widehat{w}_{\mathrm{test}}\,\delta_{+\infty}$ appearing in the definition of $\widehat{q}$, to ensure the correct probability of coverage at finite sample sizes).

(Footnote 2: More generally, cmc can be implemented with any choice of a nonconformity score that determines the shape of the resulting confidence intervals; indeed, this generalization allows for categorical rather than real-valued matrix entries $M_{ij}$. We address this extension in the Supplementary Material, as well as a version of cmc based on full conformal rather than split conformal, in order to avoid data splitting.)

3.1.2 Weighted exchangeability for the matrix completion problem

Next, we move to the general case, where now the sampling may be highly nonuniform. First suppose the sampling probabilities $p_{ij}$ are known, and suppose again that we are told that the combined set $S_{\mathrm{cal}}\cup\{(i_*,j_*)\}$ is equal to $\{(i_1,j_1),\dots,(i_{n_{\mathrm{cal}}+1},j_{n_{\mathrm{cal}}+1})\}$ in no particular order. In this case, the test point $(i_*,j_*)$ is not equally likely to be at any one of these $n_{\mathrm{cal}}+1$ locations. Instead, we have the following result:

Lemma 3.1. Following the notation defined above, if $(i_*,j_*)\mid S \sim \mathrm{Unif}(S^c)$, it holds that
$$\mathbb{P}\big((i_*,j_*) = (i_k,j_k) \mid S_{\mathrm{cal}}\cup\{(i_*,j_*)\} = \{(i_1,j_1),\dots,(i_{n_{\mathrm{cal}}+1},j_{n_{\mathrm{cal}}+1})\},\ S_{\mathrm{tr}}\big) = w_{i_k j_k},$$
where we define the weights $w_{i_k j_k} = h_{i_k j_k}\big/\sum_{k'=1}^{n_{\mathrm{cal}}+1} h_{i_{k'} j_{k'}}$ for odds ratios given by $h_{ij} = (1-p_{ij})/p_{ij}$.

Consequently, while the test point's residual $R_{i_*j_*} = |M_{i_*j_*} - \widehat{M}_{i_*j_*}|/\widehat{s}_{i_*j_*}$ is not exchangeable with the calibration set residuals $\{R_{ij} : (i,j)\in S_{\mathrm{cal}}\}$, the framework of weighted exchangeability (Tibshirani et al., 2019) tells us that we can view $R_{i_*j_*}$ as a draw from the weighted distribution that places weight $w_{i_k j_k}$ on each residual $R_{i_k j_k}$; in particular,
$$\mathbb{P}\Big(\big|M_{i_*j_*} - \widehat{M}_{i_*j_*}\big| \le q^*_{i_*,j_*}\,\widehat{s}_{i_*j_*}\Big) = \mathbb{P}\big(R_{i_*j_*} \le q^*_{i_*,j_*}\big) \ge 1-\alpha, \quad \text{where } q^*_{i_*,j_*} = \mathrm{Quantile}_{1-\alpha}\Big(\textstyle\sum_{k=1}^{n_{\mathrm{cal}}+1} w_{i_k j_k}\,\delta_{R_{i_k j_k}}\Big).$$
Noting that this threshold $q^*_{i_*,j_*}$ may depend on the test location $(i_*,j_*)$, to accelerate the computation of prediction sets for all missing entries, we instead propose a one-shot weighted conformal approach by upper bounding the threshold uniformly over all test points: $q^*_{i_*,j_*} \le \widetilde{q}^*$, where
$$\widetilde{q}^* := \mathrm{Quantile}_{1-\alpha}\Big(\textstyle\sum_{(i,j)\in S_{\mathrm{cal}}} w^*_{ij}\,\delta_{R_{ij}} + w^*_{\mathrm{test}}\,\delta_{+\infty}\Big), \quad \text{with, for } (i,j)\in S_{\mathrm{cal}},$$
$$w^*_{ij} = \frac{h_{ij}}{\sum_{(i',j')\in S_{\mathrm{cal}}} h_{i'j'} + \max_{(i',j')\in S^c} h_{i'j'}}, \qquad w^*_{\mathrm{test}} = \frac{\max_{(i,j)\in S^c} h_{ij}}{\sum_{(i',j')\in S_{\mathrm{cal}}} h_{i'j'} + \max_{(i',j')\in S^c} h_{i'j'}}. \quad (5)$$
Now the threshold $\widetilde{q}^*$ no longer depends on the location $(i_*,j_*)$ of the test point. (Footnote 3: A comparison between the one-shot weighted approach and the exact weighted approach, Algorithm 3, is conducted in Section B.1.) If the true probabilities $p_{ij}$ were known, then we could define oracle conformal confidence intervals $\widehat{C}^*(i,j) = \widehat{M}_{ij} \pm \widetilde{q}^*\,\widehat{s}_{ij}$ for $(i,j)\in S^c$. As we will see in the proofs below, weighted exchangeability ensures that, if we do have oracle knowledge of the $p_{ij}$'s, then these confidence intervals satisfy the goal (2) of $1-\alpha$ average coverage. Since the $p_{ij}$'s are not known in general, our algorithm simply replaces the oracle weights $w^*$ in (5) with their estimates $\widehat{w}$ defined in (4), using $\widehat{p}_{ij}$ as a plug-in estimate of the unknown $p_{ij}$.

3.2 Theoretical guarantee

In the setting where the $p_{ij}$'s are not known but instead are estimated, our cmc procedure simply repeats the procedure described above but with weights $\widehat{w}$ calculated using the $\widehat{p}_{ij}$'s in place of the $p_{ij}$'s. Our theoretical results therefore need to account for the errors in our estimates $\widehat{p}_{ij}$, to quantify the coverage gap of conformal prediction in the presence of these estimation errors. To do this, we define an estimation gap term
$$\Delta = \sum_{(i,j)\in S_{\mathrm{cal}}\cup\{(i_*,j_*)\}} \Bigg|\frac{\widehat{h}_{ij}}{\sum_{(i',j')\in S_{\mathrm{cal}}\cup\{(i_*,j_*)\}} \widehat{h}_{i'j'}} - \frac{h_{ij}}{\sum_{(i',j')\in S_{\mathrm{cal}}\cup\{(i_*,j_*)\}} h_{i'j'}}\Bigg|, \quad (6)$$
where $(i_*,j_*)$ denotes the test point. Effectively, $\Delta$ quantifies the difference between the oracle weights $w^*_{ij}$ defined in (5) and the estimated weights $\widehat{w}_{ij}$ defined in (4), except that $\Delta$ is defined relative to a specific test point, while the weights $\widehat{w}$ (and $w^*$, for the oracle version) provide a one-shot procedure that is universal across all possible test points, and is thus slightly more conservative.

Theorem 3.2. Let $\widehat{M}$, $\widehat{s}$, and $\widehat{P}$ be estimates constructed using any algorithms, which depend on the data only through the training samples $M_{S_{\mathrm{tr}}}$ at locations $S_{\mathrm{tr}}$. Then, under the notation and definitions above, conformalized matrix completion (cmc) satisfies
$$\mathbb{E}\big[\mathrm{AvgCov}(\widehat{C}; M, S)\big] \ge 1-\alpha-\mathbb{E}[\Delta].$$
In the homogeneous case, where we are aware that sampling is uniform (i.e., $p_{ij}\equiv p$ for some potentially unknown $p\in(0,1)$), our error in estimating the weights is given by $\Delta = 0$ (since $h_{ij}$ is constant across all $(i,j)$, and the same holds for $\widehat{h}_{ij}$, so that the true and estimated weights are all equal to $\frac{1}{n_{\mathrm{cal}}+1}$). In this setting, therefore, we would achieve the exact coverage goal (2), with $\mathbb{E}\big[\mathrm{AvgCov}(\widehat{C}; M, S)\big]$ guaranteed to be at least $1-\alpha$.

3.3 Examples for missingness

To characterize the coverage gap explicitly, we present the following concrete examples for $P$ and show how the coverage gap can be controlled. Technical details are given in the Supplementary Material.

3.3.1 Logistic missingness

Suppose that the missingness follows a logistic model, with $\log(p_{ij}/(1-p_{ij})) = -\log(h_{ij}) = u_i + v_j$ for some $u\in\mathbb{R}^{d_1}$ and $v\in\mathbb{R}^{d_2}$, where we assume $\mathbf{1}^\top u = 0$ for identifiability. This model is closely related to logistic regression with a diverging number of covariates (Portnoy, 1988; He and Shao, 2000; Wang, 2011). Following Chen et al.
(2023), we estimate $u, v$ via maximum likelihood,
$$\widehat{u},\widehat{v} = \arg\max_{u,v}\ \mathcal{L}(u,v) \quad \text{subject to } \mathbf{1}^\top u = 0.$$
Here the log-likelihood is defined by
$$\mathcal{L}(u,v) = \sum_{(i,j)\in[d_1]\times[d_2]} \Big\{ -\log\big(1+\exp(-u_i-v_j)\big) + \mathbf{1}\{(i,j)\in S_{\mathrm{tr}}^c\}\,\log\big(1-q+\exp(-u_i-v_j)\big)\Big\}.$$
Consequently, we define the estimate of $p_{ij}$ as $\widehat{p}_{ij} = \big(1+\exp(-\widehat{u}_i-\widehat{v}_j)\big)^{-1}$, for which one can show that $\mathbb{E}[\Delta] \lesssim \sqrt{\log(\max\{d_1,d_2\})/\min\{d_1,d_2\}}$. The proof of this upper bound is given in Section A.3.1.

3.3.2 Missingness with a general link function

More broadly, we can consider a general link function, where we assume $\log(p_{ij}/(1-p_{ij})) = -\log(h_{ij}) = \phi(A_{ij})$, where $\phi$ is a link function and $A\in\mathbb{R}^{d_1\times d_2}$ is the model parameter. The logistic model introduced above corresponds to the case where $\phi$ is the identity function and $A = u\mathbf{1}^\top + \mathbf{1}v^\top$ is a rank-2 matrix. In this general setup, if $\mathrm{rank}(A) = k$ and $\|A\|_\infty \le \tau$, we can apply one-bit matrix completion (Davenport et al., 2014) to estimate $\{p_{ij}\}$. More precisely, we define the log-likelihood
$$\mathcal{L}_{S_{\mathrm{tr}}}(B) = \sum_{(i,j)\in S_{\mathrm{tr}}} \log\big(\psi(B_{ij})\big) + \sum_{(i,j)\in S_{\mathrm{tr}}^c} \log\big(1-\psi(B_{ij})\big),$$
where $\psi(t) = q(1+e^{-\phi(t)})^{-1}$ (here rescaling by the constant $q$ is necessary to account for subsampling the training set $S_{\mathrm{tr}}$ from the observed data indices $S$), and solve the following optimization problem
$$\widehat{A} = \arg\max_B\ \mathcal{L}_{S_{\mathrm{tr}}}(B) \quad \text{subject to } \|B\|_* \le \tau\sqrt{k\,d_1 d_2},\ \|B\|_\infty \le \tau, \quad (7)$$
where $\|B\|_*$ is the nuclear norm of $B$. Consequently, we let $\widehat{h}_{ij} = \exp(-\phi(\widehat{A}_{ij}))$, which leads to $\mathbb{E}[\Delta] \lesssim \min\{d_1,d_2\}^{-1/4}$. The proof of this upper bound is given in Section A.3.2.

4 Simulation studies

In this section, we conduct numerical experiments to verify the coverage guarantee of conformalized matrix completion (cmc) using both synthetic and real datasets. As the validity of cmc is independent of the choice of base algorithm, we choose alternating least squares (als) (Jain et al., 2013) as the base algorithm (results using a convex relaxation (Fazel et al., 2004; Candes and Recht, 2012; Chen et al., 2020) are shown in the Supplementary Material). We use cmc-als to refer to this combination. We use $\mathrm{AvgCov}(\widehat{C})$ (the average coverage, defined in (3)) and the average confidence interval length, $\mathrm{AvgLength}(\widehat{C}) = \frac{1}{|S^c|}\sum_{(i,j)\in S^c}\mathrm{length}\big(\widehat{C}(i,j)\big)$, to evaluate the performance of different methods. All the results in this section can be replicated with the code available at https://github.com/yugjerry/conf-mc.

4.1 Synthetic dataset

Throughout the experiments, we set the true rank $r^* = 8$, the desired coverage rate $1-\alpha = 0.90$, and report the average results over 100 random trials.

Data generation. In the synthetic setting, we generate the data matrix by $M = M^* + E$, where $M^*$ is a rank-$r^*$ matrix and $E$ is a noise matrix (with distribution specified below). The low-rank component $M^*$ is generated by $M^* = \kappa U V^\top$, where $U\in\mathbb{R}^{d_1\times r^*}$ is the orthonormal basis of a random $d_1\times r^*$ matrix consisting of i.i.d. entries from a certain distribution $P_{u,v}$ (specified below) and $V\in\mathbb{R}^{d_2\times r^*}$ is independently generated in the same manner. In addition, the constant $\kappa$ is chosen such that the average magnitude of the entries $|M^*_{ij}|$ is 2. Following the observation model (1), we only observe the entries in the random set $S$. A short code sketch of this generation process is given below, after the implementation details.

Implementation. First, when implementing als, we adopt $r$ as the hypothesized rank of the underlying matrix. For model-based methods (Chen et al., 2019), by the asymptotic distributional characterization $\widehat{M}_{ij} - M^*_{ij} \overset{d}{\to} N(0,\theta^2_{ij})$ and the independence of $E_{ij}$ and $\widehat{M}_{ij} - M^*_{ij}$, one has $\widehat{M}_{ij} - M_{ij} \overset{d}{\to} N(0,\theta^2_{ij}+\sigma^2)$. Consequently, the confidence interval for the model-based methods can be constructed as $\widehat{M}_{ij} \pm q_{1-\alpha/2}\sqrt{\widehat{\theta}^2_{ij}+\widehat{\sigma}^2}$, where $q_\beta$ is the $\beta$th quantile of $N(0,1)$.
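The synthetic data-generating process just described is easy to reproduce. The sketch below (assuming NumPy, with the calibration of $\kappa$ interpreted as matching the mean absolute entry of $M^*$ to 2) mirrors the Gaussian and heavy-tailed noise settings.

```python
import numpy as np

def generate_synthetic(d1=500, d2=500, r_star=8, p=0.8, noise="gaussian", seed=0):
    """Synthetic data roughly following Section 4.1: M = kappa * U V^T + E, Bernoulli(p) mask."""
    rng = np.random.default_rng(seed)
    # orthonormal bases of random d x r* matrices with i.i.d. entries from P_{u,v}
    U, _ = np.linalg.qr(rng.standard_normal((d1, r_star)))
    V, _ = np.linalg.qr(rng.standard_normal((d2, r_star)))
    M_star = U @ V.T
    kappa = 2.0 / np.abs(M_star).mean()       # scale so the average entry magnitude is 2
    M_star *= kappa
    if noise == "gaussian":
        E = rng.standard_normal((d1, d2))
    else:                                      # heavy-tailed: 0.2 * t_{1.2}, as in Setting 3
        E = 0.2 * rng.standard_t(df=1.2, size=(d1, d2))
    M = M_star + E
    observed = rng.random((d1, d2)) < p        # Z_ij ~ Bern(p)
    return M, M_star, observed
```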
The estimates are obtained via $\widehat{\sigma}^2 = \|M_{S_{\mathrm{tr}}} - \widehat{M}_{S_{\mathrm{tr}}}\|_F^2/|S_{\mathrm{tr}}|$ and $\widehat{\theta}^2_{ij} = (\widehat{\sigma}^2/\widehat{p})\big(\|\widehat{U}_{i,\cdot}\|^2 + \|\widehat{V}_{j,\cdot}\|^2\big)$, with the SVD $\widehat{M} = \widehat{U}\widehat{\Sigma}\widehat{V}^\top$. For cmc-als, we obtain $\widehat{M}, \widehat{s}$ from $M_{S_{\mathrm{tr}}}$ by running als on this training set (and taking $\widehat{s}_{ij} = (\widehat{\theta}^2_{ij}+\widehat{\sigma}^2)^{1/2}$). The probabilities $\widehat{p}_{ij}$ are estimated via $\widehat{p}_{ij} \equiv (d_1 d_2 q)^{-1}|S_{\mathrm{tr}}|$ for this setting where the $p_{ij}$'s are homogeneous. A code sketch of these plug-in estimates is given at the end of this subsection.

4.1.1 Homogeneous missingness

In the homogeneous case, we consider the following four settings with $d_1 = d_2 = 500$:

Setting 1: large sample size + Gaussian noise. $p = 0.8$, $P_{u,v} = N(0,1)$, and $E_{ij}\sim N(0,1)$.
Setting 2: small sample size + Gaussian noise. $p = 0.2$, $P_{u,v} = N(0,1)$, and $E_{ij}\sim N(0,1)$.
Setting 3: large sample size + heavy-tailed noise. $p = 0.8$, $P_{u,v} = N(0,1)$, and $E_{ij}\sim 0.2\cdot t_{1.2}$, where $t_\nu$ denotes the $t$ distribution with $\nu$ degrees of freedom.
Setting 4: violation of incoherence. $p = 0.8$, $P_{u,v} = t_{1.2}$, and $E_{ij}\sim N(0,1)$.

Figure 1 displays the results (i.e., the coverage rate and the interval length) for both cmc-als and als as we vary the hypothesized rank $r$ from 2 to 40. The oracle length is the difference between the $(1-\alpha/2)$-th and $(\alpha/2)$-th quantiles of the underlying distribution of $E_{ij}$. As can be seen, cmc-als achieves nearly exact coverage regardless of the choice of $r$ across all four settings. When the hypothesized rank $r$ is chosen well, cmc-als is often able to achieve an average length similar to that of the oracle. For the model-based method als, when we underestimate the true rank ($r < r^*$), we may still see adequate coverage since $\widehat{\sigma}$ tends to over-estimate $\sigma$ by including the remaining $r^* - r$ factors as noise. However, if we overestimate the rank ($r > r^*$), then als tends to overfit the noise with additional factors and under-estimate $\sigma$, leading to significant undercoverage in Settings 1, 2 and 4. In Setting 3 with heavy-tailed noise, als tends to be conservative due to overestimating $\sigma$. Overall, Figure 1 confirms our expectation that the validity of the model-based estimates relies crucially on a well-specified model, and it fails to hold when the sample size is small (cf. Figure 1b), when the noise is heavy tailed (cf. Figure 1c), and when the underlying matrix is not incoherent (cf. Figure 1d).

[Figure 1: Comparison between cmc-als and als. Panels: (a) Setting 1: large sample size + Gaussian noise; (b) Setting 2: small sample size + Gaussian noise; (c) Setting 3: large sample size + heavy-tailed noise; (d) Setting 4: violation of incoherence. Each panel plots coverage and interval length against the hypothesized rank (2 to 40), with the oracle length shown for reference.]

[Figure 2: Histogram of standardized scores for als and prediction lower and upper bounds for 50 distinct unobserved entries. Panels: (a) large sample size + Gaussian noise, r = r* = 8; (b) large sample size + Gaussian noise, r* = 8 < 50 = r; (c) violation of incoherence, r = 6 < 8 = r*.]
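The quantities $\widehat{\sigma}^2$, $\widehat{\theta}^2_{ij}$ and $\widehat{s}_{ij}$ entering these intervals can be computed by a direct plug-in of the formulas above; the following sketch (our own illustration, not the authors' implementation) reads the factors off an SVD of the fitted matrix.

```python
import numpy as np

def local_scales(M_obs, M_hat, train_mask, p_hat, rank):
    """Plug-in sigma_hat^2, theta_hat^2_ij and s_hat_ij from the formulas above."""
    resid = (M_obs - M_hat)[train_mask]
    sigma2_hat = np.sum(resid**2) / train_mask.sum()        # ||M_Str - M_hat_Str||_F^2 / |S_tr|
    U, _, Vt = np.linalg.svd(M_hat, full_matrices=False)
    U_r, V_r = U[:, :rank], Vt[:rank].T                      # top-r factors of M_hat
    row_norms = np.sum(U_r**2, axis=1)                       # ||U_hat_{i,.}||^2
    col_norms = np.sum(V_r**2, axis=1)                       # ||V_hat_{j,.}||^2
    theta2_hat = (sigma2_hat / p_hat) * (row_norms[:, None] + col_norms[None, :])
    s_hat = np.sqrt(theta2_hat + sigma2_hat)                 # local uncertainty used by cmc-als
    return sigma2_hat, theta2_hat, s_hat
```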
In Figure 2, we present the histogram of standardized scores $(\widehat{M}_{ij} - M_{ij})\big/\sqrt{\widehat{\theta}^2_{ij}+\widehat{\sigma}^2}$ and the plot of the upper and lower bounds for three settings. In Figure 2a, when the model assumptions are met and $r = r^*$, the scores match well with the standard Gaussian, and the prediction bounds produced by als and cmc-als are similar. With the same data generating process, when the rank is overparametrized, the distribution of the scores is not captured by the standard Gaussian, and thus the quantiles are misspecified. As we can see from the confidence intervals, als tends to produce smaller intervals, which leads to undercoverage. In the last setting, the underlying matrix is no longer incoherent. When the rank is underestimated, the remaining $r^* - r$ factors are absorbed into the noise term, and the high heterogeneity in the entries further leads to an overestimated noise level. As a result, the intervals produced by als are much larger, while the conformalized intervals are more adaptive to the magnitude of the entries.

4.1.2 Heterogeneous missingness

Now we move on to the case with heterogeneous missingness, where the observed entries are no longer sampled uniformly. To simulate this setting, we draw $\{a_{il} : i\in[d_1],\, l\in[k^*]\}$ i.i.d. from $\mathrm{Unif}(0,1)$ and $\{b_{lj} : l\in[k^*],\, j\in[d_2]\}$ i.i.d. from $\mathrm{Unif}(-0.5,0.5)$, and define sampling probabilities by
$$\log\big(p_{ij}/(1-p_{ij})\big) = \sum_{l=1}^{k^*} a_{il} b_{lj}.$$
We consider two settings with $k^* = 1$ and $k^* = 5$, respectively. In both settings, $\widehat{p}_{ij}$ is constructed by estimating $\widehat{a}_{il}$ and $\widehat{b}_{lj}$ via constrained maximum likelihood estimation as described in Example 3.3.2. To analyze the effect of the estimation error predicted by Theorem 3.2, we consider three matrix generation processes for each choice of $k^*$, where $M^*$ is generated in the same way as in Section 4.1.1 with $d_1 = d_2 = d = 500$ and $P_{u,v} = N(0,1)$:

Gaussian homogeneous noise: $E_{ij}\sim N(0,1)$ independently.
Adversarial heterogeneous noise: $E_{ij}\sim N(0,\sigma^2_{ij})$ independently, where we take $\sigma_{ij} = 1/(2p_{ij})$. For this setting, note that the high-noise entries (i.e., those entries that are hardest to predict) occur in locations $(i,j)$ that are least likely to be observed during training.
Random heterogeneous noise: $E_{ij}\sim N(0,\sigma^2_{ij})$ independently, where $(\sigma_{ij})_{(i,j)\in[d_1]\times[d_2]}$ is drawn from the same distribution as $(1/(2p_{ij}))_{(i,j)\in[d_1]\times[d_2]}$.

We write cmc*-als to denote the conformalized method (with als as the base algorithm) that uses oracle knowledge of the true $p_{ij}$'s for weighting, while cmc-als uses the estimates $\widehat{p}_{ij}$. The results are shown in Figure 3 for the settings $k^* = 1$ and $k^* = 5$ (i.e., the rank of $P$); a short sketch of this heterogeneous sampling scheme is given below.

[Figure 3: Comparison between cmc with true and estimated weights under heterogeneous missingness. Panels: (a, b) Gaussian noise with k* = 1 and k* = 5; (c, d) adversarial heterogeneous noise with k* = 1 and k* = 5; (e, f) random heterogeneous noise with k* = 1 and k* = 5. Each panel plots coverage and interval length against the hypothesized rank.]
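The heterogeneous sampling probabilities used in this subsection can be generated as follows (a sketch under the stated distributions for $a_{il}$ and $b_{lj}$; the adversarial noise scale is tied to $p_{ij}$ exactly as described above).

```python
import numpy as np

def heterogeneous_probs(d1=500, d2=500, k_star=5, seed=0):
    """p_ij from the logit model log(p_ij/(1-p_ij)) = sum_l a_il * b_lj of Section 4.1.2."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(0.0, 1.0, size=(d1, k_star))
    b = rng.uniform(-0.5, 0.5, size=(k_star, d2))
    logits = a @ b
    return 1.0 / (1.0 + np.exp(-logits))

P = heterogeneous_probs()
sigma = 1.0 / (2.0 * P)   # adversarial heterogeneous noise level sigma_ij = 1/(2 p_ij)
```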
Under both homogeneous noise and random heterogeneous noise, both cmc*-als and cmc-als achieve the correct coverage level, with nearly identical interval lengths. For adversarial heterogeneous noise, on the other hand, cmc*-als achieves the correct coverage level as guaranteed by the theory, but cmc-als shows some undercoverage due to the errors in the estimates $\widehat{p}_{ij}$ (since now the variance of the noise is adversarially aligned with these errors); nonetheless, the undercoverage is mild. In Section B.2 in the appendix, the local coverage of cmc-als conditional on $p_{ij} = p_0$ is presented for varying $p_0$.

4.2 Real data application

We also compare conformalized approaches with model-based approaches using the Rossmann sales dataset (https://www.kaggle.com/c/rossmann-store-sales) (Farias et al., 2022). This real dataset provides the underlying fixed matrix $M$; the missingness pattern is generated artificially, as detailed below. We focus on the daily sales (in thousands of dollars) of 1115 drug stores on workdays from Jan 1, 2013 to July 31, 2015, and hence the underlying matrix has dimension $1115\times 780$. The hypothesized rank $r$ is varied from 5 to 60 with step size 5, and we set the target level to be $1-\alpha = 0.9$. We apply random masking for 100 independent trials and report the averages of AvgCov and AvgLength in Figure 4.

Homogeneous $p_{ij}$. For each entry, we first apply random masking $1 - Z_{ij} \sim \mathrm{Bern}(0.2)$ independently. In Figure 4a, the conformalized approach has exact coverage at $1-\alpha$, but als tends to be conservative when $r$ is small and loses coverage when $r$ increases to 50.

Heterogeneous $p_{ij}$. After drawing the $p_{ij}$'s from $\log(p_{ij}/(1-p_{ij})) = -\log(h_{ij}) = \phi(A_{ij})$ as in Section 4.1.2 with $k^* = 1$, $\widehat{P}$ is obtained via the constrained maximum likelihood estimator of Example 3.3.2. In Figure 4b, cmc-als has coverage around $1-\alpha$; in comparison, when $r$ is greater than 40, als fails to guarantee the coverage and AvgCov decreases to 0.75.

[Figure 4: Comparison between conformalized and model-based matrix completion with the sales dataset. Panels: (a) als vs cmc-als under homogeneous missingness; (b) als vs cmc-als under logistic missingness. Each panel plots coverage and interval length against the hypothesized rank (5 to 60).]

5 Discussion

In this paper, we introduce the conformalized matrix completion method cmc, which offers a finite-sample coverage guarantee for all missing entries on average and requires no assumptions on the underlying matrix or any specification of the algorithm adopted, relying only on an estimate of the entrywise sampling probabilities. Given an estimate of the entrywise sampling probability, we provide an upper bound for the coverage gap and give its explicit form for examples with logistic as well as general one-bit low-rank missingness. For the implementation, an efficient one-shot weighted conformal approach is proposed with a provable guarantee, and it achieves nearly exact coverage in practice.

We can compare our findings to a few related results in the literature. The work of Chernozhukov et al. (2021) applies conformal prediction to counterfactual and synthetic control methods, where they include matrix completion as an example with a regression-type formulation. However, their approach relies on a test statistic that is a function of estimated residuals. Consequently, their method requires the assumption of stationarity and weak dependence of errors.
Furthermore, the validity of their approach is contingent upon the estimator being either consistent or stable. Additionally, Wieczorek (2023) studies conformal prediction with samples from a deterministic and finite population, but the validity under the sampling without replacement remains an open question in their work. Our results suggest a number of questions for further inquiry. In the setting of matrix completion, while the results are assumption-free with regard to the matrix M itself, estimating the sampling probabilities P that determine which entries are observed remains a key step in the procedure; while our empirical results suggest the method is robust to estimation error in this step, studying the robustness properties more formally is an important open question. More broadly, our algorithm provides an example of how the conformal prediction framework can be applied in a setting that is very different from the usual i.i.d. sampling regime, and may lend insights for how to develop conformal methodologies in other applications with non-i.i.d. sampling structure. Acknowledgement R.F.B. was partially supported by the Office of Naval Research via grant N00014-20-1-2337, and by the National Science Foundation via grant DMS-2023109. C.M. was partially supported by the National Science Foundation via grant DMS-2311127. Amjad, M., Shah, D., and Shen, D. (2018). Robust synthetic control. The Journal of Machine Learning Research, 19(1):802 852. Angelopoulos, A., Bates, S., Malik, J., and Jordan, M. I. (2020). Uncertainty sets for image classifiers using conformal prediction. ar Xiv preprint ar Xiv:2009.14193. Athey, S., Bayati, M., Doudchenko, N., Imbens, G., and Khosravi, K. (2021). Matrix completion methods for causal panel data models. Journal of the American Statistical Association, 116(536):1716 1730. Bai, J. and Ng, S. (2021). Matrix completion, counterfactuals, and factor analysis of missing data. Journal of the American Statistical Association, 116(536):1746 1763. Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. J. (2021). Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486 507. Barber, R. F., Candes, E. J., Ramdas, A., and Tibshirani, R. J. (2022). Conformal prediction beyond exchangeability. ar Xiv preprint ar Xiv:2202.13415. Cai, T. T., Liang, T., and Rakhlin, A. (2016). Geometric inference for general high-dimensional linear inverse problems. The Annals of Statistics, 44(4):1536 1563. Candes, E. and Recht, B. (2012). Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111 119. Candès, E. J., Lei, L., and Ren, Z. (2021). Conformalized survival analysis. ar Xiv preprint ar Xiv:2103.09763. Candes, E. J. and Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6):925 936. Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053 2080. Cao, Y. and Xie, Y. (2015). Categorical matrix completion. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 369 372. IEEE. Carpentier, A., Klopp, O., Löffler, M., and Nickl, R. (2018). Adaptive confidence sets for matrix completion. Bernoulli, 24(4A):2429 2460. Chen, W., Chun, K.-J., and Barber, R. F. (2018). Discretized conformal prediction for efficient distribution-free inference. Stat, 7(1):e173. Chen, W., Wang, Z., Ha, W., and Barber, R. F. (2016). 
Trimmed conformal prediction for highdimensional models. ar Xiv preprint ar Xiv:1611.09933. Chen, Y., Chi, Y., Fan, J., Ma, C., and Yan, Y. (2020). Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM journal on optimization, 30(4):3098 3121. Chen, Y., Fan, J., Ma, C., and Yan, Y. (2019). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116(46):22931 22937. Chen, Y., Li, C., Ouyang, J., and Xu, G. (2023). Statistical inference for noisy incomplete binary matrix. Journal of Machine Learning Research, 24(95):1 66. Chernozhukov, V., Wüthrich, K., and Zhu, Y. (2021). An exact and robust conformal inference method for counterfactual and synthetic controls. Journal of the American Statistical Association, 116:1849 1864. Davenport, M. A., Plan, Y., Van Den Berg, E., and Wootters, M. (2014). 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189 223. Farias, V., Li, A. A., and Peng, T. (2022). Uncertainty quantification for low-rank matrix completion with heterogeneous and sub-exponential noise. In International Conference on Artificial Intelligence and Statistics, pages 1179 1189. PMLR. Fazel, M., Hindi, H., and Boyd, S. (2004). Rank minimization and applications in system theory. In Proceedings of the 2004 American control conference, volume 4, pages 3273 3278. IEEE. Foygel, R. and Srebro, N. (2011). Concentration-based guarantees for low-rank matrix reconstruction. In Kakade, S. M. and von Luxburg, U., editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 315 340, Budapest, Hungary. PMLR. He, X. and Shao, Q.-M. (2000). On parameters of increasing dimensions. Journal of multivariate analysis, 73(1):120 135. Jain, P., Netrapalli, P., and Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 665 674. Jin, Y., Ren, Z., and Candès, E. J. (2023). Sensitivity analysis of individual treatment effects: A robust conformal inference approach. Proceedings of the National Academy of Sciences, 120(6):e2214889120. Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from a few entries. IEEE transactions on information theory, 56(6):2980 2998. Koltchinskii, V., Lounici, K., and Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302 2329. Lei, J., G Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094 1111. Lei, L. and Candès, E. J. (2020). Conformal inference of counterfactuals and individual treatment effects. ar Xiv preprint ar Xiv:2006.06138. Ma, C., Wang, K., Chi, Y., and Chen, Y. (2020). Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. Foundations of Computational Mathematics, 20:451 632. Nair, J., Wierman, A., and Zwart, B. (2022). The fundamentals of heavy tails: Properties, emergence, and estimation, volume 53. Cambridge University Press. Natarajan, N. and Dhillon, I. S. (2014). Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12):i60 i68. Negahban, S. and Wainwright, M. J. 
(2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13(1):1665 1697. Portnoy, S. (1988). Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. The Annals of Statistics, pages 356 366. Rennie, J. D. and Srebro, N. (2005). Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning, pages 713 719. Romano, Y., Sesia, M., and Candes, E. (2020). Classification with valid and adaptive coverage. Advances in Neural Information Processing Systems, 33:3581 3591. Shafer, G. and Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3). Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In Learning Theory: 18th Annual Conference on Learning Theory, COLT 2005, Bertinoro, Italy, June 27-30, 2005. Proceedings 18, pages 545 560. Springer. Sun, R. and Luo, Z.-Q. (2016). Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535 6579. Tibshirani, R. J., Barber, R. F., Candès, E. J., and Ramdas, A. (2019). Conformal prediction under covariate shift. In Neur IPS. Vovk, V. (2015). Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 74:9 28. Vovk, V., Gammerman, A., and Shafer, G. (2005). Algorithmic learning in a random world, volume 29. Springer. Wang, L. (2011). Gee analysis of clustered binary data with diverging number of covariates. The Annals of Statistics. Wieczorek, J. (2023). Design-based conformal prediction. ar Xiv preprint ar Xiv:2303.01422. Xia, D. and Yuan, M. (2021). Statistical inferences of linear forms for noisy matrix completion. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(1):58 77. Yang, Y. and Ma, C. (2022). Optimal tuning-free convex relaxation for noisy matrix completion. ar Xiv preprint ar Xiv:2207.05802. Zhao, Y. and Udell, M. (2020). Matrix completion with quantified uncertainty through low rank gaussian copula. Advances in Neural Information Processing Systems, 33:20977 20988. A Proofs 14 A.1 Proof of Lemma 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 A.2 Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 A.3 Proofs for examples in Section 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.3.1 Proof for Example 3.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.3.2 Proof of Example 3.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.3.3 Proof of Lemma A.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B Additional numerical experiments 18 B.1 Comparison between one-shot and exact weighted approaches . . . . . . . . . . . 19 B.2 Local coverage of cmc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 B.3 Additional results for homogeneous misingness . . . . . . . . . . . . . . . . . . . 19 B.4 Estimation error for one-bit matrix estimation . . . . . . . . . . . . . . . . . . . . 19 B.5 Additional results for heterogeneous missingness . . . . . . . . . . . . . . . . . . 20 B.6 Misspecified sampling probability . . . . . . . . . . . . . . . . . . . . . . . . . . 20 B.7 Additional results for the sales dataset . . . . . . . . . . . . . . . . . . . . . . . . 22 B.7.1 Comparison under misspecified missingness . . . . . . . . . . . . . . . . 
22 B.7.2 Results with convex optimization . . . . . . . . . . . . . . . . . . . . . . 22 C Additional details of algorithms and extensions 23 C.1 Extension to likelihood-based scores and categorical matrices . . . . . . . . . . . . 23 C.2 Full conformalized matrix completion . . . . . . . . . . . . . . . . . . . . . . . . 24 C.3 Exact split conformalized matrix completion . . . . . . . . . . . . . . . . . . . . . 25 The section collects the proofs for the technical results in the main text. Section A.1 and A.2 are devoted to the proof of Lemma 3.1 and Theorem 3.2, respectively. In the end, Section A.3 provides proofs for the coverage gap bounds in the two examples in Section 3.3. A.1 Proof of Lemma 3.1 Fix a set of locations B := {(i1, j1), . . . , (incal+1, jncal+1)}, and an index 1 m ncal + 1. By the definition of conditional probability, we have P ((i , j ) = (im, jm) | Scal {(i , j )} = B, Str = S0) =P ((i , j ) = (im, jm), Scal = B\{(im, jm)} | Str = S0, |S| = n) P (Scal {(i , j )} = B | Str = S0, |S| = n) = P ((i , j ) = (im, jm), Scal = B\{(im, jm)} | Str = S0, |S| = n) Pncal+1 l=1 P ((i , j ) = (il, jl), Scal = B\{(il, jl)} | Str = S0, |S| = n) . It then boils down to computing P (Scal = S1, (i , j ) = (il, jl) | Str = S0, |S| = n) for any fixed set S1 and fixed location (il, jl) / S0 S1. To this end, we have the following claim (we defer its proof to the end of this section) P (Scal = S1, (i , j ) = (il, jl) | Str = S0, |S| = n) Q (i,j) S1 pij 1 pij P (i ,j ) A pi j 1 pi j , (8) where ΩStr,n = {A [d1] [d2] : |A| = n |Str|, A Str = }. As a result, we obtain P ((i , j ) = (im, jm) | Scal {(i , j )} = B, Str = S0) Q (i,j) B\{(im,jm)} pij 1 pij Pncal+1 l=1 Q (i,j) B\{(il,jl)} pij 1 pij = him,jm Pncal+1 l=1 hil,jl , (9) where hij = (1 pij)/pij. This finishes the proof. Proof of Equation (8). Denote W the matrix consisting of entries Wij defined in Algorithm 1. The matrix Z consists of Zij which are the indicators of non-missingness, i.e. Zij = 1{(i, j) is observed}. Fix any two disjoint subsets S0, S1 [d1] [d2] with |S0| + |S1| = n. We have P (Scal = S1, Str = S0 | |S| = n) =P (supp(Z) = S0 S1, S0 supp(W), S1 supp(W)c | |S| = n) . Since W and Z are independent, one further has P (supp(Z) = S0 S1, S0 supp(W), S1 supp(W)c | |S| = n) =P (S0 supp(W), S1 supp(W)c) P (supp(Z) = S0 S1 | |S| = n) =q|S0|(1 q)|S1| P (supp(Z) = S0 S1 | |S| = n) =q|S0|(1 q)|S1| (i ,j ) [d1] [d2] (1 pi j ) Y (i,j) S0 S1 pij 1 pij , (10) where the last two identities are based on direct computations. Based on (10), one can further compute P (Str = S0 | |S| = n) A ΩStr,n P (Scal = A, Str = S0 | |S| = n) =q|S0|(1 q)n |S0| (i ,j ) [d1] [d2] (1 pi j ) Y pij 1 pij , (11) where the last identity uses Equation (10). Now we are ready to prove (8). Recall that the new data point (i , j ) | S is drawn uniformly from Unif(Sc). Therefore one has P (Scal = S1, (i , j ) = (in+1, jn+1) | Str = S0, |S| = n) = P (Scal = S1, Str = S0, (i , j ) = (in+1, jn+1) | |S| = n) P (Str = S0 | |S| = n) = P ((i , j ) = (in+1, jn+1) | S = S1 S0) P (Scal = S1, Str = S0 | |S| = n) P (Str = S0 | |S| = n) (i,j) S1 pij 1 pij P (i ,j ) A pi j 1 pi j , (12) where the last line follows from (10) and (11). A.2 Proof of Theorem 3.2 First, fix any a [0, 1]. Lemma 3.1, together with the weighted conformal prediction framework of Tibshirani et al. (2019), implies that P Mi j c Mi j q i j (a) bsi j Str, Scal {(i , j )} 1 a, q i j (a) = Quantile1 a (i,j) Scal {(i ,j )} wij δRij and wij = hij P (i ,j ) Scal {(i ,j )} hi j . 
Indeed, here a can be any function of the random variables we are conditioning on that is, a may depend on n, on Str, and on Scal {(i , j )}. Next define a = α + where is defined as in (6). We observe that, since each bhij is a function of Str, then (and thus also a) can therefore be expressed as a function of n, Str, and Scal {(i , j )}. Therefore, applying the work above, we have P Mi j c Mi j q i j (α + ) bsi j Str, Scal {(i , j )} 1 α and thus, after marginalizing, P Mi j c Mi j q i j (α + ) bsi j 1 α E[ ]. Next, we verify that bq q i j (α + ) holds almost surely if this is indeed the case, then we have shown that P Mi j c Mi j bq bsi j 1 α E[ ], which establishes the desired result. Thus we only need to show that bq q i j (α+ ), or equivalently, Quantile1 α (i,j) Scal bwij δRij + bwtest δ+ Quantile1 α (i,j) Scal {(i ,j )} wij δRij w ij = bhij P (i ,j ) Scal {(i ,j )} bhi j (13) for all (i, j) S {(i , j )}. Then by definition of bw, we see that w ij bwij for (i, j) S, and therefore, Quantile1 α (i,j) Scal bwij δRij + bwtest δ+ Quantile1 α (i,j) Scal w ij δRij + w (i ,j ) δ+ Quantile1 α (i,j) Scal {(i ,j )} w ij δRij holds almost surely. Therefore it suffices to show that Quantile1 α (i,j) Scal {(i ,j )} w ij δRij Quantile1 α (i,j) Scal {(i ,j )} wij δRij holds almost surely. Indeed, we have (i,j) Scal {(i ,j )} w ij δRij, X (i,j) Scal {(i ,j )} wij δRij (i,j) Scal {(i ,j )} |w ij wij| = , where d TV denotes the total variation distance. This completes the proof. A.3 Proofs for examples in Section 3.3 Recall the definition of : (i,j) Scal {(i ,j )} (i ,j ) Scal {(i ,j )} bhi j hij P (i ,j ) Scal {(i ,j )} hi j Note that 1 by definition. We start with stating a useful lemma to bound the coverage gap using the estimation error of bhij. Lemma A.1. By the definition of the coverage gap , we have (i,j) Scal {(i ,j )} |bhij hij| (i ,j ) Scal bhi j . (14) A.3.1 Proof for Example 3.3.1 Under the logistic model, one has for any (i, j) P((i, j) Str) = q pij = q exp(ui + vj) 1 + exp(ui + vj). In this case, if bu and bv are the constrained maximum likelihood estimators in Example 3.3.1, Theorem 6 in Chen et al. (2023) implies that , bv v = OP with the proviso that u + v τ < , d2 d1 log d1 and d1 (log d2)2. Then for hij = exp( ui vj) and bhij = exp( bui bvj), we have max i,j |bhij hij| = max i,j e ui vj e (bui ui) (bvj vj) 1 = OP Further, as mini,j hi,j = mini,j exp( ui vj) e τ, then with probability approaching one, for every (i, j) [d1] [d2], one has bhi,j hi,j |hij bhij| e τ/2 =: h0. By the upper bound (14), we have log max{d1, d2} min{d1, d2} . Further, as 1, we have E[ ] q log max{d1,d2} min{d1,d2} . A.3.2 Proof of Example 3.3.2 Define the link function ψ(t) = q(1 + e ϕ(t)), where ϕ is monotonic. Applying Theorem 1 in the paper (Davenport et al., 2014), we obtain that with probability at least 1 C1/(d1 + d2) 1 d1d2 b A A 2 F d1d2 , (15) with the proviso that d1d2 (d1 + d2) log(d1d2). Here e Cτ = 239/4e9/4(1 + 6)τLτβτ with Lτ = sup τ t τ |ψ (t)| ψ(t)(1 ψ(t)), and βτ = sup τ t τ ψ(t)(1 ψ(t)) |ψ (t)|2 . (16) Denote this high probability event to be E0. Since A 1 d1d2 A F, on this event E0, we further have 1 d1d2 b A A 1 Cτ where Cτ = ζ Lτβτ and ζ is a universal constant. Recall that hij = exp( ϕ(Aij)) and A τ. On the same event E0, we further have 1 d1d2 b H H 1 C τ where C τ = 2eϕ(τ) q By the feasibility of the minimizer b A and the fact that bhij = exp( ϕ( b Aij)), we have bhij h0 = e ϕ(τ) for all (i, j). 
This together with the upper bound (14) implies that (i,j) Scal {(i ,j )} |hij bhij| 1 h0ncal b H H 1. Define a second high probability event E1 = {ncal (1 c)(1 q) P 1}. Using the Chernoff bound, we have P (E1) 1 C0(d1d2) 5. Therefore, on the event E0 E1, we have d1d2 ch0(1 q) P 1 C τ Under the assumptions that pij = 1/(1 + e ϕ(Aij)), and that A τ, we know that pij C2 for some constant that only depends on τ. As a result, we have P 1 C2d1d2, which further leads to the conclusion that 1 c C2h0(1 q)C τ on the event E0 E1. In addition, on the small probability event (E0 E1), one trivially has 1. Therefore simple combinations of the cases yields the desired bound on E[ ]. A.3.3 Proof of Lemma A.1 Reusing the definition of w (13), one has (i,j) Scal {(i ,j )} |wij w ij| (i,j) Scal {(i ,j )} (i ,j ) Scal {(i ,j )} bhi j bhij P (i ,j ) Scal {(i ,j )} hi j P (i ,j ) Scal {(i ,j )} bhi j P (i ,j ) Scal {(i ,j )} hi j (i,j) Scal {(i ,j )} (i ,j ) Scal {(i ,j )} bhi j P (i ,j ) Scal {(i ,j )} hi j P (i ,j ) Scal {(i ,j )} bhi j P (i ,j ) Scal {(i ,j )} hi j + |bhij hij| P (i ,j ) Scal {(i ,j )} hi j P (i ,j ) Scal {(i ,j )} bhi j P (i ,j ) Scal {(i ,j )} hi j (i ,j ) Scal {(i ,j )} bhi j P (i ,j ) Scal {(i ,j )} hi j P (i ,j ) Scal {(i ,j )} bhi j + 1 (i,j) Scal {(i ,j )} |bhij hij| P (i ,j ) Scal {(i ,j )} bhi j (i,j) Scal {(i ,j )} |bhij hij| P (i ,j ) Scal {(i ,j )} bhi j . (18) This completes the proof. B Additional numerical experiments In this section, we provide additional simulation results. In Section B.1, we compare the proposed one-shot weighted approach with the exact weighted conformal prediction. In Section B.2, the cmc-og-als cmc-1-als cmc-og-als cmc-1-als (a) Exact/original (og) cmc vs 1-shot cmc. 0.0 0.2 0.4 0.6 0.8 1.0 pij = p0 Conditional coverage pij = p0 0.0 0.2 0.4 0.6 0.8 1.0 pij = p0 (b) Local performance of coverage given pij = p0. local coverage of cmc-als conditioning on pij is evaluated. We present the simulation results for the convex relaxation (cvx) and the conformalized convex method (cmc-cvx) in both settings with homogeneous and heterogeneous missingness in Section B.3 and B.5. We further evaluate the performance of cmc-als when the sampling probability is misspecified in Section B.6. B.1 Comparison between one-shot and exact weighted approaches We conduct comparison in the setting with heterogeneous missingness and Gaussian noise as shown in Section 4.1.2 with k = 12. From the result in Figure 5a, the performance of two aproaches are essentially the same while the one-shot cmc-als is more computationally efficient. B.2 Local coverage of cmc In Figure 5b, we evaluate the local performance of cmc-als when conditioning on {pij = p0}, i.e. a subpopulation determined by the value of sampling probability. We use the setting with heterogeneous missingness and Gaussian noise as shown in Section 4.1.2 with k = 12 and p0 ranging from 0.1 to 0.9 with a step size of 0.1. The conditional coverage is approximated via kernel smoothing: with the indicators Aij for coverage, we use the weight Kij = ϕp0,h(pij) and calculate the conditional coverage by (P Aij Kij)/(P Kij). Here ϕµ,σ is the density function of N(µ, σ2) and h = 0.05. When p0 increases from 0.2 to 0.8, the conditional coverage increases and stays around 0.9. At p0 = 0.1, 0.9, the coverage varies much, which is due to the small effective sample size for the edge values of pij. Moreover, for uniform sampling, since the rows and columns are generated i.i.d., there are no meaningful subgroups of the data to condition on. 
B.3 Additional results for homogeneous missingness

In this section, we present the results for the synthetic simulation with the convex relaxation cvx and the conformalized convex matrix completion method cmc-cvx. Settings 1, 2, 3, 4 are the same as introduced in Section 4.1. The true rank is $r^* = 8$ and the hypothesized rank varies from 4 to 24 with step size 4. The conformalized methods, regardless of the base algorithm adopted, have nearly exact coverage around $1-\alpha$. But we observe different behaviors between als and cvx, since the convex relaxation is free of the choice of $r$ until the projection of $\widehat{M}_{\mathrm{cvx}}$ onto the rank-$r$ subspace (Chen et al., 2020). As a result, when $r > r^*$, cvx tends to overestimate the strength of the noise. In Figures 6a, 6b and 6d, when $r > r^*$, cvx has a coverage rate higher than the target level and its confidence intervals are more conservative than those of the conformalized methods. Since the accuracy of cvx relies on a large sample size, in Figure 6b, when the effective sample size is insufficient due to small $p_{ij}$, the residuals from cvx deviate substantially from the reference distribution and the intervals are much larger than the oracle ones. Besides, in Figure 6c, when the noise has heavier tails than Gaussian variables, cvx overestimates the noise strength similarly to als and is conservative in coverage. When the incoherence condition is violated, in Figure 6d, if $r < r^*$, both cvx and als fit the missed factors by overestimating the noise strength and produce extremely large intervals.

[Figure 6: Comparison between conformalized and model-based matrix completion approaches (cmc-als, als, cmc-cvx, cvx, oracle). Panels: (a) Setting 1: large sample size + Gaussian noise; (b) Setting 2: small sample size + Gaussian noise; (c) Setting 3: large sample size + heavy-tailed noise; (d) Setting 4: violation of incoherence. Each panel plots coverage and interval length against the hypothesized rank (4 to 24).]

B.4 Estimation error for one-bit matrix estimation

The estimation error in $\widehat{p}_{ij}$ can be visualized from heatmaps comparing $P$ and $\widehat{P}$ (Figure 7). Here the entries are sorted by the order of $p_{ij}$ for each row and each column.

[Figure 7: Heatmaps for $P$ and $\widehat{P}$. Panels: (a) $k^* = 1$, $d_1 = d_2 = 500$; (b) $k^* = 1$, $d_1 = d_2 = 2000$; (c) $k^* = 5$, $d_1 = d_2 = 500$; (d) $k^* = 5$, $d_1 = d_2 = 2000$.]

B.5 Additional results for heterogeneous missingness

In Figure 8, we present the results for the synthetic simulation with the convex relaxation cvx as the base algorithm, where cmc-cvx and cmc*-cvx denote the conformalized matrix completion method with estimated weights and with true weights, respectively. The three settings with heterogeneous missingness are the same as in Figure 3.

B.6 Misspecified sampling probability

We consider the following four settings:

(a) The underlying missingness follows the rank-one model where $p_{ij} = a_i b_j$, with both $a_i$ and $b_j$ generated i.i.d. from $\mathrm{Unif}(0.2, 1)$. The noise is adversarial. We estimate $p_{ij}$ via one-bit matrix completion based on the logistic model (working model with hypothesized rank $k = 5$).
(b) The underlying missingness follows the logistic model as in Example 3.3.2 with $k^* = 5$ and we adopt the adversarial noise. But $p_{ij}$ is estimated under the assumption of uniform sampling (working model), i.e., $\widehat{p}_{ij} = \widehat{p}$.
Figure 8: Comparison under heterogeneous missingness. (a) Gaussian noise: k = 1. (b) Gaussian noise: k = 5. (c) Adversarial heterogeneous noise: k = 1. (d) Adversarial heterogeneous noise: k = 5. (e) Random heterogeneous noise: k = 1. (f) Random heterogeneous noise: k = 5.

Figure 9: Coverage of cmc-als under misspecified sampling models (cmc-als-mis) versus cmc-als. (a) Estimating the rank-one model with the logistic model (adversarial noise). (b) Estimating the logistic model with uniform sampling (adversarial noise). (c) Estimating the rank-one model with the logistic model (random noise). (d) Estimating the logistic model with uniform sampling (random noise).

From the results, we can see that when the noise is generated from the random heterogeneous model, in which the entry values are independent of the sampling probabilities, misspecification of the sampling model only slightly affects the coverage. When the noise is generated in the adversarial way, in which the entry values depend on the $p_{ij}$'s, the coverage under a misspecified sampling model falls below the target level but remains above 0.85 in practice; the size of this gap depends on the divergence between the true and the working sampling models.

B.7 Additional results for the sales dataset

Denote by $M$ the underlying matrix in the sales dataset. In Figure 10a, we plot the singular values of $M$; the top five singular values contain a large proportion of the information. In Figure 10b, we plot the histogram of the entries $M_{ij}$ of the underlying matrix; the entries range from 0 to over 20 thousand, with a heavy tail.

Figure 10: Descriptive plots for the underlying matrix $M$. (a) Singular values of $M$. (b) Histogram of the entries of $M$.

B.7.1 Comparison under misspecified missingness

We consider the sales dataset with the following missingness pattern: on weekdays, each entry is observed with probability $p = 0.8$; on weekends, as stores are more likely to be operated by less experienced interns or to report sales less frequently, each entry is observed with a lower probability of $0.8/3$. Moreover, since there could be a subgroup of stores that report sales less frequently overall, 200 stores are sampled at random, for which $p = 0.8/3$ throughout. We use the logistic model with k = 5 to estimate $p_{ij}$. The comparison between als and cmc-als is shown in Figure 11.

B.7.2 Results with convex optimization

In Figure 12, cmc-cvx has nearly exact coverage at $1-\alpha$, but cvx tends to have coverage higher than the target level.
Moreover, the convex approach produces much wider intervals when $r$ is large, which may be caused by overfitting of the observed entries. Since the conformalized approach uses only a proportion of the observed entries as the training set, the intervals produced by cmc-cvx are less accurate than those of cvx, owing to the more poorly behaved base algorithm.

Figure 11: Sales dataset with heterogeneous and misspecified missingness.

Figure 12: Comparison between conformalized and model-based matrix completion on the sales dataset. (a) Homogeneous missingness. (b) Heterogeneous missingness with k = 1.

C Additional details of algorithms and extensions

C.1 Extension to likelihood-based scores and categorical matrices

We show in this section that cmc can also be applied to categorical matrix completion (Cao and Xie, 2015), or to the more general setting in (19); validity in this setting is again guaranteed by the theorem presented earlier.

Setup. To formulate the problem, consider an underlying parameter matrix $M^\star = (M^\star_{ij})_{(i,j)\in[d_1]\times[d_2]}$, and suppose the observations $\{M_{ij} : i\in[d_1], j\in[d_2]\}$ are drawn from the distribution
$$M_{ij} \mid M^\star_{ij} \;\sim\; P_{M^\star_{ij}}, \qquad (19)$$
where $\{P_\theta\}_{\theta\in\Theta}$ is a family of parametric distributions with probability density (or mass) function $p_\theta$. Categorical matrix completion is a specific example in which the support of $M_{ij}$ is finite or countable. For example, a Poisson matrix is generated by $M_{ij} \sim \mathrm{Pois}(M^\star_{ij})$, where $M^\star_{ij}$ is the Poisson mean. Similarly to the previous setup, we treat $M^\star$ as deterministic, and the entries in the subset $S \subseteq [d_1]\times[d_2]$ are available. Here $S$ is sampled in the same manner as before, with sampling-probability matrix $P = (p_{ij})_{(i,j)\in[d_1]\times[d_2]}$.

Split conformal approach. Consider the split approach with the partition $S = S_{\mathrm{tr}} \cup S_{\mathrm{cal}}$ as in Algorithm 1. With the training set $M_{S_{\mathrm{tr}}}$, we obtain an estimated likelihood function $\hat\pi(m; i, j)$, namely an estimate of $p_{M^\star_{ij}}(m)$, the true likelihood of $M_{ij}$ at the value $m$. Estimation is feasible given certain low-complexity structures. For example, suppose a hypothesized distribution family $\{Q_\theta\}_{\theta\in\Theta}$ with probability density $q_\theta$ is given, and the underlying parameter matrix $M^\star$ is assumed to be low-rank. Then $M$ can be viewed as a perturbation of $M^\star$, and we can estimate $M^\star$ via matrix completion algorithms applied to the entries in $M_{S_{\mathrm{tr}}}$. Denoting by $\widehat M$ the resulting estimate of $M^\star$, we obtain the estimated likelihood $\hat\pi(m; i, j) = q_{\widehat M_{ij}}(m)$. The odds ratios $\hat h_{ij}$ are also estimated from the training set, from which we compute the weights
$$\hat w_{ij} = \frac{\hat h_{ij}}{\sum_{(i',j')\in S_{\mathrm{cal}}} \hat h_{i'j'} + \max_{(i',j')\in S^c} \hat h_{i'j'}}, \quad (i,j)\in S_{\mathrm{cal}}, \qquad \hat w_{\mathrm{test}} = \frac{\max_{(i',j')\in S^c} \hat h_{i'j'}}{\sum_{(i',j')\in S_{\mathrm{cal}}} \hat h_{i'j'} + \max_{(i',j')\in S^c} \hat h_{i'j'}}.$$
For each $(i,j)\in S_{\mathrm{cal}}$, calculate the likelihood-based nonconformity score $R_{ij} = \hat\pi(M_{ij}; i, j)$; note that, in contrast to residual-based scores, a larger value of this score indicates better conformity. Then, for any test point $(i^*, j^*)\in S^c$, we can construct the confidence set
$$\widehat C(i^*, j^*) = \left\{ m\in[K] : \hat\pi(m; i^*, j^*) \ge \hat q \right\},$$
where $\hat q$ is the weighted quantile
$$\hat q = \mathrm{Quantile}_{\alpha}\Big( \sum_{(i,j)\in S_{\mathrm{cal}}} \hat w_{ij}\, \delta_{R_{ij}} + \hat w_{\mathrm{test}}\, \delta_{-\infty} \Big),$$
with the unknown test score conservatively placed at $-\infty$ since smaller likelihood indicates a less conforming value. More examples of conformal methods for classification can be found in Romano et al. (2020) and Angelopoulos et al. (2020), among others.
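As a schematic illustration of this construction (not the implementation used in the experiments), the sketch below computes the level-$\alpha$ weighted quantile of the likelihood scores and returns the resulting prediction set; it assumes the estimated likelihoods $\hat\pi(\cdot; i^*, j^*)$, the calibration scores $R_{ij}$, and the estimated odds ratios $\hat h$ have already been computed, and all function and variable names are illustrative.

```python
import numpy as np

def weighted_quantile_alpha(scores, weights, alpha):
    """Level-alpha quantile of sum_k weights[k] * delta_{scores[k]};
    weights are assumed to sum to one."""
    order = np.argsort(scores)
    cum = np.cumsum(weights[order])
    idx = np.searchsorted(cum, alpha, side="left")
    return scores[order[min(idx, len(scores) - 1)]]

def categorical_prediction_set(pi_hat_test, R_cal, h_cal, h_test_max, alpha=0.1):
    """pi_hat_test : length-K vector of estimated likelihoods pi_hat(m; i*, j*)
    R_cal         : calibration scores R_ij = pi_hat(M_ij; i, j)
    h_cal         : estimated odds ratios h_ij on the calibration set
    h_test_max    : max of h_hat over the unobserved set (one-shot weighting)
    Returns the 0-based category indices in the prediction set."""
    denom = h_cal.sum() + h_test_max
    w_cal = h_cal / denom
    w_test = h_test_max / denom
    # place the unknown test score conservatively at -inf
    scores = np.concatenate([R_cal, [-np.inf]])
    weights = np.concatenate([w_cal, [w_test]])
    q_hat = weighted_quantile_alpha(scores, weights, alpha)
    return np.where(pi_hat_test >= q_hat)[0]   # categories m with pi_hat >= q_hat
```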
C.2 Full conformalized matrix completion

In Algorithm 2, we present the procedure of full conformalized matrix completion (full-cmc). This full conformal version of cmc offers the same coverage guarantee as given in Theorem 3.2 for the split version of cmc (except with the entire observed set $S$ in place of $S_{\mathrm{cal}}$ when defining $\hat w$); the formal proof of this bound for full conformal is very similar to the proof of Theorem 3.2, using an analogous weighted exchangeability argument as in Lemma 3.1, and so we omit it here. To define this algorithm, we need some notation: given the observed data $M_S$, plus a test point location $(i^*, j^*)$ and a hypothesized value $m$ for the test point $M_{i^*j^*}$, define a matrix $M^{(m)}$ with entries
$$M^{(m)}_{ij} = \begin{cases} M_{ij}, & (i,j)\in S, \\ m, & (i,j) = (i^*, j^*), \\ \ast, & \text{otherwise}, \end{cases} \qquad (20)$$
where, abusing notation, $M_{ij} = \ast$ denotes that no information is observed for this entry.

Algorithm 2 full-cmc: full conformalized matrix completion
1: Input: target level $1-\alpha$; partially observed matrix $M_S$.
2: Using the training data $M_S$, compute an estimate $\hat P$ of the observation probabilities (with $\hat p_{ij}$ estimating $p_{ij}$, the probability of entry $(i,j)$ being observed).
3: for $(i^*, j^*)$ in $S^c$ do
4:   for $m \in \mathcal{M}$ do
5:     Augment $M_S$ with one additional hypothesized entry, $\{M_{i^*j^*} = m\}$, to obtain $M^{(m)}$ defined as in (20).
6:     Using the augmented matrix $M^{(m)}_{S\cup\{(i^*,j^*)\}}$, compute: an initial estimate $\widehat M^{(m)}$ using any matrix completion algorithm (with $\widehat M^{(m)}_{ij}$ estimating the target $M_{ij}$); optionally, a local uncertainty estimate $\hat s^{(m)}$ (with $\hat s^{(m)}_{ij}$ estimating our relative uncertainty in the estimate $\widehat M^{(m)}_{ij}$), or otherwise set $\hat s^{(m)}_{ij} \equiv 1$; and an estimate $\hat P$ of the observation probabilities (with $\hat p_{ij}$ estimating $p_{ij}$, the probability of entry $(i,j)$ being observed).
7:     Compute normalized residuals for $(i,j)\in S\cup\{(i^*,j^*)\}$: $R^{(m)}_{ij} = \frac{|M_{ij} - \widehat M^{(m)}_{ij}|}{\hat s^{(m)}_{ij}}$.
8:     Compute weights $\hat w_{ij} = \frac{\hat h_{ij}}{\sum_{(i',j')\in S\cup\{(i^*,j^*)\}} \hat h_{i'j'}}$, where $\hat h_{ij} = \frac{1-\hat p_{ij}}{\hat p_{ij}}$, for $(i,j)\in S\cup\{(i^*,j^*)\}$.
9:     Compute the weighted quantile $\hat q^{(m)}(i^*,j^*) = \mathrm{Quantile}_{1-\alpha}\big( \sum_{(i,j)\in S} \hat w_{ij}\, \delta_{R^{(m)}_{ij}} + \hat w_{i^*j^*}\, \delta_{+\infty} \big)$.
10:   end for
11: end for
12: Output: $\big\{ \widehat C(i^*, j^*) = \{ m\in\mathcal{M} : R^{(m)}_{i^*j^*} \le \hat q^{(m)}(i^*,j^*) \} : (i^*,j^*)\in S^c \big\}$.

We note that, when $\mathcal{M} = \mathbb{R}$ (or an infinite subset of $\mathbb{R}$), exact computation of the prediction set is impossible in most cases. In that case, our algorithm can be modified via a trimmed or discretized approximation; these extensions are presented for the regression setting in the work of Chen et al. (2016, 2018), and can be extended to the matrix completion setting in a straightforward way.

C.3 Exact split conformalized matrix completion

In Algorithm 3, we present the exact split approach, which is less conservative than our one-shot approach given in Algorithm 1, but may be less computationally efficient. In this version of the algorithm, the quantile $\hat q = \hat q(i^*, j^*)$ needs to be computed for each missing entry, since the weight vector $\hat w$ depends on the value of $\hat p_{i^*j^*}$.

Algorithm 3 split-cmc: split conformalized matrix completion
1: Input: target coverage level $1-\alpha$; data splitting proportion $q\in(0,1)$; observed entries $M_S$.
2: Split the data: draw $W_{ij} \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(q)$, and define the training and calibration sets $S_{\mathrm{tr}} = \{(i,j)\in S : W_{ij} = 1\}$ and $S_{\mathrm{cal}} = \{(i,j)\in S : W_{ij} = 0\}$.
3: Using the training data $M_{S_{\mathrm{tr}}}$ indexed by $S_{\mathrm{tr}} \subseteq [d_1]\times[d_2]$, compute: an initial estimate $\widehat M$ using any matrix completion algorithm (with $\widehat M_{ij}$ estimating the target $M_{ij}$); optionally, a local uncertainty estimate $\hat s$ (with $\hat s_{ij}$ estimating our relative uncertainty in the estimate $\widehat M_{ij}$), or otherwise set $\hat s_{ij} \equiv 1$; and an estimate $\hat P$ of the observation probabilities (with $\hat p_{ij}$ estimating $p_{ij}$, the probability of entry $(i,j)$ being observed).
4: Compute normalized residuals on the calibration set: $R_{ij} = \frac{|M_{ij} - \widehat M_{ij}|}{\hat s_{ij}}$, $(i,j)\in S_{\mathrm{cal}}$.
5: Compute estimated odds ratios for the calibration set and the test set: $\hat h_{ij} = \frac{1-\hat p_{ij}}{\hat p_{ij}}$, $(i,j)\in S_{\mathrm{cal}}\cup S^c$.
6: for $(i^*, j^*) \in S^c$ do
7:   Compute weights for the calibration set and the test point:
$$\hat w_{ij} = \frac{\hat h_{ij}}{\sum_{(i',j')\in S_{\mathrm{cal}}} \hat h_{i'j'} + \hat h_{i^*j^*}}, \quad (i,j)\in S_{\mathrm{cal}}, \qquad \hat w_{\mathrm{test}} = \frac{\hat h_{i^*j^*}}{\sum_{(i',j')\in S_{\mathrm{cal}}} \hat h_{i'j'} + \hat h_{i^*j^*}}.$$
8:   Compute the threshold $\hat q(i^*, j^*) = \mathrm{Quantile}_{1-\alpha}\big( \sum_{(i,j)\in S_{\mathrm{cal}}} \hat w_{ij}\, \delta_{R_{ij}} + \hat w_{\mathrm{test}}\, \delta_{+\infty} \big)$, where $\delta_t$ denotes the point mass at $t$.
9: end for
10: Output: confidence intervals $\widehat C(i^*, j^*) = \widehat M_{i^*j^*} \pm \hat q(i^*, j^*)\,\hat s_{i^*j^*}$ for each unobserved entry $(i^*, j^*)\in S^c$.
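The following Python sketch mirrors steps 4 to 10 of Algorithm 3, assuming the training-stage outputs $\widehat M$, $\hat s$ and $\hat P$ have already been computed; it is a schematic illustration with illustrative variable names rather than a reference implementation.

```python
import numpy as np

def weighted_quantile_upper(scores, weights, level):
    """Smallest t with cumulative weight >= level, for the distribution
    sum_k weights[k] * delta_{scores[k]} (weights sum to one)."""
    order = np.argsort(scores)
    cum = np.cumsum(weights[order])
    idx = np.searchsorted(cum, level, side="left")
    return scores[order[min(idx, len(scores) - 1)]]

def split_cmc_intervals(M, M_hat, s_hat, P_hat, cal_mask, test_mask, alpha=0.1):
    """M, M_hat, s_hat, P_hat : (d1, d2) arrays (observed values, point
    estimates, local uncertainty, estimated observation probabilities).
    cal_mask / test_mask      : boolean masks for S_cal and S^c."""
    # Step 4: normalized residuals on the calibration set
    R_cal = np.abs(M[cal_mask] - M_hat[cal_mask]) / s_hat[cal_mask]
    # Step 5: estimated odds ratios h_ij = (1 - p_ij) / p_ij
    H = (1.0 - P_hat) / P_hat
    h_cal = H[cal_mask]
    lower = np.full(M.shape, np.nan)
    upper = np.full(M.shape, np.nan)
    # Steps 6-9: one weighted quantile per test entry
    for i, j in zip(*np.nonzero(test_mask)):
        denom = h_cal.sum() + H[i, j]
        w_cal = h_cal / denom
        w_test = H[i, j] / denom
        scores = np.concatenate([R_cal, [np.inf]])   # test score placed at +inf
        weights = np.concatenate([w_cal, [w_test]])
        q = weighted_quantile_upper(scores, weights, 1 - alpha)
        # Step 10: interval M_hat +/- q * s_hat
        lower[i, j] = M_hat[i, j] - q * s_hat[i, j]
        upper[i, j] = M_hat[i, j] + q * s_hat[i, j]
    return lower, upper
```

Placing the test-point mass at $+\infty$ reproduces the conservative behavior of the weighted quantile in step 8: if the calibration weights alone do not reach level $1-\alpha$, the returned quantile is infinite and the interval becomes unbounded.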