# Naive Imputation Implicitly Regularizes High-Dimensional Linear Models

Alexis Ayme 1, Claire Boyer 1, Aymeric Dieuleveut 2, Erwan Scornet 2

Abstract. Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithm, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) used, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits good predictive performance. In this paper, we study the impact of imputation in a high-dimensional linear model with MCAR missing data. We prove that zero imputation performs an implicit regularization closely related to the ridge method, often used in high-dimensional problems. Leveraging this connection, we establish that the imputation bias is controlled by a ridge bias, which vanishes in high dimension. As a predictor, we argue in favor of the averaged SGD strategy, applied to zero-imputed data. We establish an upper bound on its generalization error, highlighting that imputation is benign in the $d \gg n$ regime. Experiments illustrate our findings.

1. Introduction

Missing data has become an inherent problem in modern data science. Indeed, most real-world data sets contain missing entries due to a variety of reasons: merging different data sources, sensor failures, difficulty to collect/access data in sensitive fields (e.g., health), just to name a few. The simple, yet quite extreme, solution of throwing partial observations away can drastically reduce the data set size and thereby hinder further statistical analysis. Specific methods should therefore be developed to handle missing values.

1 Sorbonne Université, CNRS, Laboratoire de Probabilités, Statistique et Modélisation (LPSM), F-75005 Paris, France. 2 CMAP, UMR 7641, Ecole Polytechnique, IP Paris, 91128 Palaiseau, France. Correspondence to: Alexis Ayme. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Most existing methods are dedicated to model estimation, aiming at inferring the underlying model parameters despite missing values (see, e.g., Rubin, 1976). In this paper, we take a different route and consider a supervised machine learning (ML) problem with missing values in the training and test inputs, for which our aim is to build a prediction function (and not to estimate accurately the true model parameters).

Prediction with NA. A common practice to perform supervised learning with missing data is to simply impute the data set first, and then train any predictor on the completed/imputed data set. The imputation technique can be simple (e.g., mean imputation) or more elaborate (Van Buuren & Groothuis-Oudshoorn, 2011; Yoon et al., 2018; Muzellec et al., 2020; Ipsen et al., 2022). While such widely-used two-step strategies lack deep theoretical foundations, they have been shown to be consistent, provided that the approximation capacity of the chosen predictor is large enough (see Josse et al., 2019; Le Morvan et al., 2021).
When considering low-capacity predictors, such as linear models, other theoretically sound strategies consist in decomposing the prediction task with respect to all possible missing patterns (see Le Morvan et al., 2020b; Ayme et al., 2022) or in automatically detecting the relevant patterns for prediction, thus breaking the combinatorics of such pattern-by-pattern predictors (see the specific NeuMiss architecture in Le Morvan et al., 2020a). Proved to be nearly optimal (Ayme et al., 2022), such approaches are likely to be robust to very pessimistic missing data scenarios. Inherently, though, they do not scale with high-dimensional data sets, as the variety of missing patterns explodes. Another direction is advocated in Agarwal et al. (2019), relying on principal component regression (PCR) in order to train linear models with missing inputs. However, out-of-sample prediction in such a case requires retraining the predictor on the training and test sets together (to perform a global PC analysis), which strongly departs from the classical ML algorithms massively used in practice. In this paper, we focus on the high-dimensional regime for linear predictors, which turns out to be more favorable to handling missing values via simple and cheap imputation methods, in particular in the missing completely at random (MCAR) case.

High-dimensional linear models. In supervised learning with complete inputs, when training a parametric method (such as a linear model) in a high-dimensional framework, one often resorts to an $\ell_2$ or ridge regularization technique. On the one hand, such regularization speeds up the optimization procedure (improving its convergence rate) (Dieuleveut et al., 2017); on the other hand, it also improves the generalization capabilities of the trained predictor (Caponnetto & De Vito, 2007; Hsu et al., 2012). In general, this second point holds for explicit $\ell_2$-regularization, but some works also emphasize the ability of optimization algorithms to induce an implicit regularization, e.g., via early stopping (Yao et al., 2007) and, more recently, via gradient strategies in interpolation regimes (Bartlett et al., 2020; Chizat & Bach, 2020; Pesme et al., 2021).

Contributions. For supervised learning purposes, we consider a zero-imputation strategy consisting in replacing missing input entries by zero, and we formalize the induced bias on a regression task (Section 2). When the missing values are Missing Completely At Random (MCAR), we prove that zero imputation, used prior to training a linear model, introduces an implicit regularization closely related to that of ridge regression (Section 3). This bias is shown to be negligible in settings commonly encountered in high-dimensional regimes, e.g., when the inputs admit a low-rank covariance matrix. We then advocate for the choice of an averaged stochastic gradient algorithm (SGD) applied to zero-imputed data (Section 4). Indeed, such a predictor, being computationally efficient, remains particularly relevant for high-dimensional learning. For this strategy, we establish a generalization bound valid for all $d$ and $n$, in which the impact of imputation on MCAR data is attenuated when $d \gg n$. These theoretical results legitimize the widespread imputation approach, adopted by most practitioners, and are corroborated by numerical experiments in Section 5. All proofs can be found in the Appendix.

2. Background and motivation

2.1. General setting and notations
In the context of supervised learning, consider $n \in \mathbb{N}$ input/output observations $((X_i, Y_i))_{i \in [n]}$, i.i.d. copies of a generic pair $(X, Y) \in \mathbb{R}^d \times \mathbb{R}$. By some abuse of notation, we always use $X_i$ with $i \in [n]$ to denote the $i$-th observation living in $\mathbb{R}^d$, and $X_j$ (or $X_k$) with $j \in [d]$ (or $k \in [d]$) to denote the $j$-th (or $k$-th) coordinate of the generic input $X$ (see Section A for notations).

Missing values. In real data sets, the input covariates $(X_i)_{i \in [n]}$ are often only partially observed. To code for this missing information, we introduce the random vector $P \in \{0,1\}^d$, referred to as the mask or missing pattern, such that $P_j = 0$ if the $j$-th coordinate of $X$, denoted $X_j$, is missing and $P_j = 1$ otherwise. The random vectors $P_1, \dots, P_n$ are assumed to be i.i.d. copies of a generic random variable $P \in \{0,1\}^d$ and are the missing patterns of $X_1, \dots, X_n$. Note that we assume that the output is always observed and that only entries of the input vectors can be missing. Missing data are usually classified into three types, initially introduced by Rubin (1976). In this paper, we focus on the MCAR assumption, under which missing patterns and (underlying) inputs are independent.

Assumption 1 (Missing Completely At Random - MCAR). The pair $(X, Y)$ and the missing pattern $P$ associated to $X$ are independent.

For $j \in [d]$, we define $\rho_j := \mathbb{P}(P_j = 1)$, i.e., $1 - \rho_j$ is the expected proportion of missing values on the $j$-th feature. A particular case of MCAR data requires not only the independence of the mask and the data, but also the independence between all mask components, as follows.

Assumption 1' (Ho-MCAR: MCAR pattern with independent homogeneous components). The pair $(X, Y)$ and the missing pattern $P$ associated to $X$ are independent, and the distribution of $P$ satisfies $P \sim \mathcal{B}(\rho)^{\otimes d}$ for some $0 < \rho \le 1$, with $1 - \rho$ the expected proportion of missing values and $\mathcal{B}$ the Bernoulli distribution.

Naive imputation of covariates. A common way to handle missing values for any learning task is to first impute the missing data, so as to obtain a complete data set to which standard ML algorithms can then be applied. In particular, constant imputation (using the empirical mean or an oracle constant provided by experts) is very common among practitioners. In this paper, we consider, even for non-centered distributions, the naive imputation by zero, so that the imputed-by-0 observation $(X_{\mathrm{imp}})_i$, for $i \in [n]$, is given by

$(X_{\mathrm{imp}})_i = P_i \odot X_i.$ (1)

Risk. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a measurable prediction function, based on a complete $d$-dimensional input. Its predictive performance can be measured through its quadratic risk,

$R(f) := \mathbb{E}\big[(Y - f(X))^2\big].$ (2)

Accordingly, we let $f^\star(X) = \mathbb{E}[Y \mid X]$ be the Bayes predictor for the complete case and $R^\star$ the associated risk. In the presence of missing data, one can still use the prediction function $f$, applied to the imputed-by-0 input $X_{\mathrm{imp}}$, resulting in the prediction $f(X_{\mathrm{imp}})$. In such a setting, the risk of $f$, acting on the imputed data, is defined by

$R_{\mathrm{imp}}(f) := \mathbb{E}\big[(Y - f(X_{\mathrm{imp}}))^2\big].$ (3)

For the class $\mathcal{F}$ of linear prediction functions from $\mathbb{R}^d$ to $\mathbb{R}$, we respectively define

$R^\star(\mathcal{F}) = \inf_{f \in \mathcal{F}} R(f),$ (4)

and

$R^\star_{\mathrm{imp}}(\mathcal{F}) = \inf_{f \in \mathcal{F}} R_{\mathrm{imp}}(f),$ (5)

as the infima over the class $\mathcal{F}$ with respectively complete and imputed-by-0 input data. For any linear prediction function defined by $f_\theta(x) = \theta^\top x$ for all $x \in \mathbb{R}^d$ and a fixed $\theta \in \mathbb{R}^d$, as $f_\theta$ is completely determined by the parameter $\theta$, we write, by a slight abuse of notation, $R(\theta)$ for $R(f_\theta)$ (and $R_{\mathrm{imp}}(\theta)$ for $R_{\mathrm{imp}}(f_\theta)$).
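To make the setting concrete, here is a minimal sketch (not from the paper; sizes and names are illustrative) that draws Ho-MCAR masks as in Assumption 1' and forms the zero-imputed inputs of Equation (1).

```python
import numpy as np

rng = np.random.default_rng(0)

def ho_mcar_mask(n, d, rho, rng):
    """Ho-MCAR masks: each entry is observed (P_ij = 1) independently with probability rho."""
    return rng.binomial(1, rho, size=(n, d))

def zero_impute(X, P):
    """Imputed-by-0 inputs: X_imp = P * X, i.e., missing entries (P = 0) are replaced by 0."""
    return P * X

n, d, rho = 1000, 50, 0.7
X = rng.standard_normal((n, d))     # toy complete inputs
P = ho_mcar_mask(n, d, rho, rng)
X_imp = zero_impute(X, P)           # what an impute-then-regress pipeline actually sees
```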
We also let $\theta^\star \in \mathbb{R}^d$ (resp. $\theta^\star_{\mathrm{imp}}$) be a parameter achieving the best risk over the class of linear functions, i.e., such that $R^\star(\mathcal{F}) = R(\theta^\star)$ (resp. $R^\star_{\mathrm{imp}}(\mathcal{F}) = R_{\mathrm{imp}}(\theta^\star_{\mathrm{imp}})$).

Imputation bias. Even if the preprocessing step consisting in imputing the missing data by 0 is often used in practice, this imputation technique can introduce a bias in the prediction. We formalize this imputation bias as

$B_{\mathrm{imp}}(\mathcal{F}) := R^\star_{\mathrm{imp}}(\mathcal{F}) - R^\star(\mathcal{F}).$ (6)

This quantity represents the difference in predictive performance between the best predictor on complete data and the best predictor on imputed-by-0 inputs. In particular, if this quantity is small, the risk of the best predictor on imputed data is close to that of the best predictor when all data are available. Note that, in the presence of missing values, one might be interested in the Bayes predictor

$f^\star_{\mathrm{mis}}(X_{\mathrm{imp}}, P) = \mathbb{E}[Y \mid X_{\mathrm{imp}}, P],$ (7)

and its associated risk $R^\star_{\mathrm{mis}}$.

Lemma 2.1. Assume that the regression model $Y = f^\star(X) + \epsilon$ is such that $\epsilon$ and $P$ are independent. Then $R^\star \le R^\star_{\mathrm{mis}}$.

Intuitively, under the classical assumption $\epsilon \perp P$ (see Josse et al., 2019), which is verified under Assumption 1, missing data ineluctably deteriorate the original prediction problem. As a direct consequence, for a well-specified linear model in the complete case ($f^\star \in \mathcal{F}$),

$R^\star_{\mathrm{imp}}(\mathcal{F}) - R^\star_{\mathrm{mis}} \le B_{\mathrm{imp}}(\mathcal{F}).$ (8)

Consequently, in this paper, we focus our analysis on the bias (and excess risk) associated to impute-then-regress strategies with respect to the complete-case problem (right-hand side of (8)), thus controlling the excess risk of imputation with respect to the missing data scenario (left-hand side of (8)). In a nutshell, the quantity $B_{\mathrm{imp}}(\mathcal{F})$ represents how missing values, handled with zero imputation, increase the difficulty of the learning problem. This effect can be tempered in a high-dimensional regime, as rigorously studied in Section 3. To give some intuition, let us now study the following toy example.

Example 2.2. Assume an extremely redundant setting in which all covariates are equal, that is, for all $j \in [d]$, $X_j = X_1$ with $\mathbb{E}[X_1^2] = 1$. Also assume that the output is such that $Y = X_1$ and that Assumption 1' holds with $\rho = 1/2$. In this scenario, due to the input redundancy, all $\theta$ satisfying $\sum_{j=1}^d \theta_j = 1$ minimize $\theta \mapsto R(\theta)$. Letting, for example, $\theta_1 = (1, 0, \dots, 0)^\top$, we have $R^\star = R(\theta_1) = 0$ but

$R_{\mathrm{imp}}(\theta_1) = \mathbb{E}\big[(X_1 - P_1 X_1)^2\big] = 1/2.$

This choice of $\theta_1$ introduces an irreducible discrepancy between the risk computed on the imputed data and the Bayes risk $R^\star = 0$. Another choice of parameter can actually help to close this gap. Indeed, by exploiting the redundancy in the covariates, the parameter $\theta_2 = (2/d, 2/d, \dots, 2/d)^\top$ (which is no longer a minimizer of the initial risk) gives

$R_{\mathrm{imp}}(\theta_2) = \mathbb{E}\Big[\big(X_1 - \tfrac{2}{d}\textstyle\sum_{j=1}^d P_j X_j\big)^2\Big] = 1/d,$

so that the imputation bias $B_{\mathrm{imp}}(\mathcal{F})$ is bounded by $1/d$, tending to zero as the dimension increases. Two other important observations follow from this example. First, the bound above remains valid if $\mathbb{E}X_1 \neq 0$; thus imputation by 0 is still relevant even for non-centered data. Second, we remark that $\|\theta_2\|_2^2 = 4/d$, so that good candidates for prediction with imputation seem to be of small norm in high dimension. This will be proved in more general settings in Section 4.

The purpose of this paper is to generalize the phenomenon described in Example 2.2 to less stringent settings. In light of this example, we focus our analysis on scenarios in which some information is shared across input variables: for linear models, correlation plays such a role.
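As a quick sanity check (illustrative only, not part of the paper), the following Monte Carlo simulation of Example 2.2 recovers $R_{\mathrm{imp}}(\theta_1) \approx 1/2$ and $R_{\mathrm{imp}}(\theta_2) \approx 1/d$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, rho = 100_000, 50, 0.5

x1 = rng.standard_normal(n)               # E[X_1^2] = 1
X = np.tile(x1[:, None], (1, d))          # all covariates equal to X_1
Y = x1                                    # Y = X_1
X_imp = rng.binomial(1, rho, size=(n, d)) * X

def risk_imp(theta):
    return np.mean((Y - X_imp @ theta) ** 2)

theta1 = np.zeros(d); theta1[0] = 1.0     # best complete-data predictor
theta2 = np.full(d, 2.0 / d)              # small-norm predictor exploiting redundancy

print(risk_imp(theta1))   # approx. 1/2
print(risk_imp(theta2))   # approx. 1/d = 0.02
```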
Covariance matrix. For a generic complete input $X \in \mathbb{R}^d$, call $\Sigma := \mathbb{E}[XX^\top]$ the associated covariance matrix, admitting the singular value decomposition

$\Sigma = \sum_{j=1}^d \lambda_j v_j v_j^\top,$ (9)

where the $\lambda_j$ (resp. $v_j$) are the singular values (resp. singular vectors) of $\Sigma$, ordered so that $\lambda_1 \ge \dots \ge \lambda_d$. The associated pseudo-norm is given, for all $\theta \in \mathbb{R}^d$, by

$\|\theta\|_\Sigma^2 := \theta^\top \Sigma \theta = \sum_{j=1}^d \lambda_j (v_j^\top \theta)^2.$

For the best linear prediction, write $Y = X^\top \theta^\star + \epsilon$, where the noise satisfies $\mathbb{E}[\epsilon X] = 0$ (first-order condition). Denoting $\mathbb{E}[\epsilon^2] = \sigma^2$, we have

$\mathbb{E}Y^2 = \|\theta^\star\|_\Sigma^2 + \sigma^2 = \sum_{j=1}^d \lambda_j (v_j^\top \theta^\star)^2 + \sigma^2.$ (10)

The quantity $\lambda_j (v_j^\top \theta^\star)^2$ can therefore be interpreted as the part of the variance explained by the singular direction $v_j$.

Remark 2.3. Note that, in the setting of Example 2.2, $\Sigma$ has a unique positive singular value $\lambda_1 = d$, that is to say, all of the variance is concentrated on the first singular direction. Our analysis will actually stress that a proper decay of the singular values leads to low imputation biases.

Furthermore, for the rest of our analysis, we need the following assumptions on the second-order moments of $X$.

Assumption 2. There exists $L < \infty$ such that, for all $j \in [d]$, $\mathbb{E}[X_j^2] \le L^2$.

Assumption 3. There exists $\ell > 0$ such that, for all $j \in [d]$, $\mathbb{E}[X_j^2] \ge \ell^2$.

For example, Assumptions 2 and 3 hold with $L^2 = \ell^2 = 1$ for normalized data.

3. Imputation bias for linear models

3.1. Implicit regularization of imputation

Ridge regression, widely used in high-dimensional settings, notably for computational reasons, amounts to forming an $\ell_2$-penalized version of the least-squares estimator:

$\hat\theta_\lambda \in \arg\min_{\theta \in \mathbb{R}^d} \sum_{i=1}^n (Y_i - f_\theta(X_i))^2 + \lambda \|\theta\|_2^2,$

where $\lambda > 0$ is the penalization parameter. The associated generalization risk can be written as

$R_\lambda(\theta) := R(\theta) + \lambda \|\theta\|_2^2.$

Proposition 3.1 establishes a link between imputation and ridge penalization.

Proposition 3.1. Under Assumption 1, let $V$ be the covariance matrix of $P$ (i.e., $V_{ij} = \mathrm{Cov}(P_i, P_j)$) and $H = \mathrm{diag}(\rho_1, \dots, \rho_d)$, with $\rho_j = \mathbb{P}(P_j = 1)$. Then, for all $\theta$,

$R_{\mathrm{imp}}(\theta) = R(H\theta) + \|\theta\|_{V \odot \Sigma}^2.$

In particular, under Assumptions 1', 2 and 3 with $L^2 = \ell^2$,

$R_{\mathrm{imp}}(\theta) = R(\rho\theta) + L^2 \rho(1-\rho)\,\|\theta\|_2^2.$ (11)

This result highlights the implicit $\ell_2$-regularization at work: performing standard regression on zero-imputed Ho-MCAR data can be seen as performing a ridge regression on complete data, whose strength depends on the proportion of missing values. More precisely, using Equation (11), the optimal predictor $\theta^\star_{\mathrm{imp}}$ working with imputed samples verifies

$\theta^\star_{\mathrm{imp}} = \frac{1}{\rho}\,\arg\min_{\theta \in \mathbb{R}^d}\big\{ R(\theta) + \lambda_{\mathrm{imp}} \|\theta\|_2^2 \big\}, \quad \text{with } \lambda_{\mathrm{imp}} := L^2\,\frac{1-\rho}{\rho}.$

We exploit this correspondence in Sections 3.2 and 3.3 to control the imputation bias.

3.2. Imputation bias for linear models with Ho-MCAR missing inputs

When the inputs admit Ho-MCAR missing patterns (Assumption 1'), the zero-imputation bias $B_{\mathrm{imp}}(\mathcal{F})$ induced in the linear model is controlled by a particular instance of the ridge regression bias (see, e.g., Hsu et al., 2012; Dieuleveut et al., 2017; Mourtada, 2019), defined in general by

$B_{\mathrm{ridge},\lambda}(\mathcal{F}) := \inf_{\theta \in \mathbb{R}^d} \{ R_\lambda(\theta) - R^\star(\mathcal{F}) \}$ (12)
$\phantom{B_{\mathrm{ridge},\lambda}(\mathcal{F})} = \lambda\, \|\theta^\star\|^2_{\Sigma(\Sigma + \lambda I)^{-1}}.$ (13)

Theorem 3.2. Under Assumptions 1', 2, and 3, one has

$B_{\mathrm{ridge},\lambda'_{\mathrm{imp}}}(\mathcal{F}) \;\le\; B_{\mathrm{imp}}(\mathcal{F}) \;\le\; B_{\mathrm{ridge},\lambda_{\mathrm{imp}}}(\mathcal{F}),$

with $\lambda'_{\mathrm{imp}} := \ell^2\,\frac{1-\rho}{\rho}$ and $\lambda_{\mathrm{imp}} := L^2\,\frac{1-\rho}{\rho}$.

As could be expected from Proposition 3.1, the zero-imputation bias is lower- and upper-bounded by ridge biases, with penalization constants depending on the fraction of missing values. In the specific case where $\ell^2 = L^2$ (identical second-order moments), the imputation bias exactly equals a ridge bias with constant $L^2(1-\rho)/\rho$.
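The identity (11) can be checked numerically. The sketch below (illustrative, not from the paper) compares both sides by Monte Carlo for isotropic inputs, for which $L^2 = \ell^2 = 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, rho, L2 = 200_000, 20, 0.6, 1.0

theta_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))                     # Sigma = I, so E[X_j^2] = 1 for all j
Y = X @ theta_star + 0.1 * rng.standard_normal(n)
X_imp = rng.binomial(1, rho, size=(n, d)) * X       # Ho-MCAR masks + zero imputation

theta = rng.standard_normal(d)                      # an arbitrary parameter
lhs = np.mean((Y - X_imp @ theta) ** 2)             # R_imp(theta)
rhs = np.mean((Y - X @ (rho * theta)) ** 2) \
      + L2 * rho * (1 - rho) * np.sum(theta ** 2)   # R(rho*theta) + L^2 rho(1-rho) ||theta||^2
print(lhs, rhs)   # both sides should agree up to Monte Carlo error
```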
Besides, in the extreme case where there is no missing data ($\rho = 1$), $\lambda_{\mathrm{imp}} = 0$ and the bias vanishes. On the contrary, if there is a large percentage of missing values ($\rho \to 0$), then $\lambda'_{\mathrm{imp}} \to +\infty$ and the imputation bias amounts to the excess risk of the naive predictor, i.e., $B_{\mathrm{imp}}(\mathcal{F}) = R(0_{\mathbb{R}^d}) - R^\star(\mathcal{F})$. For the intermediate case where half of the data is likely to be missing ($\rho = 1/2$), we obtain $\lambda_{\mathrm{imp}} = L^2$. Thus, in terms of statistical guarantees, performing linear regression on imputed inputs suffers from a bias comparable to that of a ridge penalization, but with a fixed hyperparameter $\lambda_{\mathrm{imp}}$. Note that, when performing standard ridge regression in a high-dimensional setting, the best theoretical choice of the penalization parameter usually scales as $d/n$ (see Sridharan et al., 2008; Hsu et al., 2012; Mourtada & Rosasco, 2022, for details). If $\rho \ge \frac{L^2 n}{d + L^2 n}$ (which is equivalent to $\lambda_{\mathrm{imp}} \le \frac{d}{n}$), the imputation bias remains smaller than that of the ridge regression with the optimal hyperparameter $\lambda = d/n$ (a choice commonly accepted in applications). In this context, performing zero imputation prior to applying a ridge regression allows one to handle missing data easily without drastically increasing the overall bias.

It turns out that the bias of ridge regression in random designs, and thus the imputation bias, can be controlled under classical assumptions on low-rank covariance structures (Caponnetto & De Vito, 2007; Hsu et al., 2012; Dieuleveut et al., 2017). In all the following examples, we consider that $\mathrm{Tr}(\Sigma) = d$, which holds in particular for normalized data.

Example 3.3 (Low-rank covariance matrix with equal singular values). Consider a covariance matrix with low rank $r \ll d$ and constant non-zero eigenvalues ($\lambda_1 = \dots = \lambda_r = \frac{d}{r}$). Then $\Sigma(\Sigma + \lambda_{\mathrm{imp}} I)^{-1} \preceq \lambda_1^{-1}\Sigma = \frac{r}{d}\Sigma$, and Theorem 3.2 leads to

$B_{\mathrm{imp}}(\mathcal{F}) \le \lambda_{\mathrm{imp}}\,\frac{r}{d}\,\|\theta^\star\|_\Sigma^2.$

Hence, the imputation bias is small when $r \ll d$ (low-rank setting). Indeed, for a fixed dimension, when the covariance is low-rank, there is a lot of redundancy across variables, which helps counterbalance the missing information in the input variables, thereby reducing the prediction bias. Note that Example 3.3 ($r \ll d$) is a generalization of Example 2.2 (in which $r = 1$), and is rotation-invariant, contrary to the latter.

Remark 3.4. A first-order condition (see Equation (29)) implies that $\|\theta^\star\|_\Sigma^2 + \sigma^2 = \mathbb{E}Y^2 = R(0_{\mathbb{R}^d})$, which is independent of the dimension $d$. Thus, in all our upper bounds, $\|\theta^\star\|_\Sigma^2$ can be replaced by $\mathbb{E}Y^2$, which is dimension-free. Consequently, the upper bound of Example 3.3 (and of the following examples) can be interpreted as follows: if $r \ll d$, then the risk of the naive predictor is divided by $d/r \gg 1$. As a consequence, $B_{\mathrm{imp}}$ tends to zero when the dimension increases and the rank is fixed.

Example 3.5 (Low-rank covariance matrix compatible with $\theta^\star$). Consider a covariance matrix with low rank $r \ll d$ and assume that $\langle\theta^\star, v_1\rangle^2 \ge \dots \ge \langle\theta^\star, v_d\rangle^2$ (meaning that $\theta^\star$ is well represented in the first eigendirections of $\Sigma$). Theorem 3.2 then leads to

$B_{\mathrm{imp}}(\mathcal{F}) \le \lambda_{\mathrm{imp}}\,\frac{r(\log(r) + 1)}{d}\,\|\theta^\star\|_\Sigma^2.$

This result is similar to Example 3.3 (up to a log factor), except that the assumption on the eigenvalues of $\Sigma$ has been replaced by a condition on the compatibility between the covariance structure and $\theta^\star$. If $\theta^\star$ is well explained by the largest eigenvalues, then the imputation bias remains low. This underlines that the imputation bias does not only depend on the spectral structure of $\Sigma$ but also on $\theta^\star$.
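To illustrate Example 3.3, the following sketch (illustrative, not from the paper) evaluates the ridge bias of Equation (13) for a low-rank covariance with equal non-zero eigenvalues, and compares it with the upper bound $\lambda_{\mathrm{imp}}\,(r/d)\,\|\theta^\star\|_\Sigma^2$.

```python
import numpy as np

def ridge_bias(Sigma, theta_star, lam):
    """B_ridge,lambda(F) = lambda * theta*^T Sigma (Sigma + lambda I)^{-1} theta*   (Eq. (13))."""
    d = Sigma.shape[0]
    return lam * theta_star @ (Sigma @ np.linalg.inv(Sigma + lam * np.eye(d))) @ theta_star

d, r, rho, L2 = 200, 5, 0.5, 1.0
lam_imp = L2 * (1 - rho) / rho

rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((d, r)))   # orthonormal basis of a random r-dim subspace
Sigma = (d / r) * U @ U.T                          # rank r, equal eigenvalues d/r, Tr(Sigma) = d
theta_star = U @ np.ones(r) / np.sqrt(d)           # normalized so that ||theta*||_Sigma^2 = 1

print(ridge_bias(Sigma, theta_star, lam_imp))                 # exact ridge bias
print(lam_imp * (r / d) * theta_star @ Sigma @ theta_star)    # Example 3.3 upper bound
```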
Example 3.6 (Spiked model, Johnstone (2001)). In this model, the covariance matrix can be decomposed as $\Sigma = \Sigma_{\le r} + \Sigma_{>r}$, where $\Sigma_{\le r}$ corresponds to the low-rank part of the data with large eigenvalues and $\Sigma_{>r}$ to the residual high-dimensional part. Suppose that $\Sigma_{>r} \preceq \eta I$ (small operator norm) and that all non-zero eigenvalues of $\Sigma_{\le r}$ are equal. Then Theorem 3.2 gives

$B_{\mathrm{imp}}(\mathcal{F}) \le \lambda_{\mathrm{imp}}\,\frac{r}{(1-\eta)\,d}\,\|\theta^\star\|_\Sigma^2 + \eta\,\|\theta^\star_{>r}\|_2^2,$

where $\theta^\star_{>r}$ is the projection of $\theta^\star$ on the range of $\Sigma_{>r}$. Contrary to Example 3.3, $\Sigma$ is only approximately low-rank, and one can refer to $r$ as the effective rank of $\Sigma$ (see Bartlett et al., 2020). The above upper bound admits a term in $O(r/d)$ (as in Example 3.3), but also suffers from an incompressible part $\eta\,\|\theta^\star_{>r}\|_2^2$, due to the presence of residual (potentially noisy) high-dimensional data. Note that, if $\theta^\star_{>r} = 0$ (only the low-dimensional part of the data is informative), then we retrieve the same rate as in Example 3.3.

3.3. Imputation bias for linear models and general MCAR settings

Theorem 3.2 holds only for Ho-MCAR settings, which excludes any dependence between mask components. To cover the case of dependent variables $P_1, \dots, P_d$ under Assumption 1, recall that $\rho_j := \mathbb{P}(P_j = 1)$ is the probability that component $j$ is not missing, and define the matrix $C \in \mathbb{R}^{d\times d}$ associated to $P$ by

$C_{kj} := \frac{V_{kj}}{\rho_k \rho_j}, \qquad (k, j) \in [d] \times [d].$ (14)

Furthermore, under Assumption 2, define

$\Lambda_{\mathrm{imp}} := L^2\,\lambda_{\max}(C).$ (15)

The following result establishes an upper bound on the imputation bias for general MCAR settings.

Proposition 3.7. Under Assumptions 1 and 2, we have $B_{\mathrm{imp}}(\mathcal{F}) \le B_{\mathrm{ridge},\Lambda_{\mathrm{imp}}}(\mathcal{F})$.

The bound on the bias is similar to that of Theorem 3.2 but involves $\lambda = \Lambda_{\mathrm{imp}}$, which takes into account the correlations between the components of the missing patterns. Remark that, under Assumption 1', there are no correlations and $\Lambda_{\mathrm{imp}} = L^2\,\frac{1-\rho}{\rho}$, thus matching the result of Theorem 3.2. The following examples highlight generic scenarios in which an explicit control on $\Lambda_{\mathrm{imp}}$ is obtained.

Example 3.8 (Limited number of correlations). If each missing pattern component is correlated with at most $k - 1$ other components, then $\Lambda_{\mathrm{imp}} \le L^2 k\, \max_{j \in [d]} \frac{1-\rho_j}{\rho_j}$.

Example 3.9 (Sampling without replacement). If the missing pattern components are sampled as $k$ components drawn without replacement in $[d]$, then $\Lambda_{\mathrm{imp}} = L^2\,\frac{k+1}{d-k}$. In particular, if one half of the data is missing ($k = \frac{d}{2}$), then $\Lambda_{\mathrm{imp}} \le 3L^2$.

In conclusion, we proved that the imputation bias is controlled by the ridge bias, with penalization constant $\Lambda_{\mathrm{imp}}$, under any MCAR setting. More precisely, all the examples of the previous section (Examples 3.3, 3.5 and 3.6), relying on a specific structure of the covariance matrix $\Sigma$ and of the best predictor $\theta^\star$, remain valid, replacing $\lambda_{\mathrm{imp}}$ by $\Lambda_{\mathrm{imp}}$. Additionally, specifying the missing data generation (as in Examples 3.8 and 3.9) allows us to control the imputation bias, which is then proved to be small in high dimension, for all the above examples.
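When the mask distribution is only accessible through samples, $\Lambda_{\mathrm{imp}}$ in Equation (15) can be estimated by plugging empirical moments into Equation (14). The sketch below is illustrative and not part of the paper.

```python
import numpy as np

def estimate_Lambda_imp(P, L2=1.0):
    """Plug-in estimate of Lambda_imp = L^2 * lambda_max(C), with C_kj = V_kj / (rho_k rho_j)."""
    rho = P.mean(axis=0)                      # empirical observation probabilities rho_j
    V = np.cov(P, rowvar=False, bias=True)    # empirical covariance matrix of the mask
    C = V / np.outer(rho, rho)
    return L2 * np.linalg.eigvalsh(C).max()

# Ho-MCAR sanity check: Lambda_imp should be close to L^2 (1 - rho) / rho = 1 for rho = 0.5
rng = np.random.default_rng(4)
P = rng.binomial(1, 0.5, size=(100_000, 30))
print(estimate_Lambda_imp(P))
```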
4. SGD on zero-imputed data

Since the imputation bias is only a part of the story, we need to propose a proper estimation strategy for $\theta^\star_{\mathrm{imp}}$. To this aim, we choose to train a linear predictor on imputed samples, using an averaged stochastic gradient algorithm (Polyak & Juditsky, 1992), described below. We then establish generalization bounds on the excess risk of this estimation strategy.

4.1. Algorithm

Given an initialization $\theta_0 \in \mathbb{R}^d$ and a constant learning rate $\gamma > 0$, the iterates of the averaged SGD algorithm are given at iteration $t$ by

$\theta_{\mathrm{imp},t} = \big(I - \gamma\, X_{\mathrm{imp},t} X_{\mathrm{imp},t}^\top\big)\,\theta_{\mathrm{imp},t-1} + \gamma\, Y_t\, X_{\mathrm{imp},t},$ (16)

so that after one pass over the data (early stopping), the final estimator $\bar\theta_{\mathrm{imp},n}$ is given by the Polyak-Ruppert average $\bar\theta_{\mathrm{imp},n} = \frac{1}{n+1}\sum_{t=0}^{n}\theta_{\mathrm{imp},t}$. Such recursive procedures are suitable for high-dimensional settings and are indicated for model misspecification (induced here by missing entries), as studied in Bach & Moulines (2013). Besides, they are very competitive for large-scale datasets, since one pass over the data requires only $O(dn)$ operations.
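A minimal sketch of one pass of the recursion (16) with Polyak-Ruppert averaging follows (illustrative data and learning rate; $\kappa = 3$ corresponds to the Gaussian case of Assumption 4, introduced in the next subsection).

```python
import numpy as np

def averaged_sgd_zero_imputed(X_imp, Y, gamma, theta0=None):
    """Single pass of recursion (16) on a stream of zero-imputed samples, with averaging."""
    n, d = X_imp.shape
    theta = np.zeros(d) if theta0 is None else theta0.copy()
    theta_bar = theta.copy()                        # running sum of theta_0, ..., theta_n
    for t in range(n):
        x, y = X_imp[t], Y[t]
        theta = theta - gamma * (x @ theta - y) * x   # (I - gamma x x^T) theta + gamma y x
        theta_bar += theta
    return theta_bar / (n + 1)                      # Polyak-Ruppert average

# usage sketch with gamma = 1 / (kappa Tr(Sigma) sqrt(n)), and Tr(Sigma) ~ d for normalized data
rng = np.random.default_rng(5)
n, d, rho, kappa = 5000, 200, 0.5, 3.0
theta_star = rng.standard_normal(d) / d
X = rng.standard_normal((n, d))
Y = X @ theta_star + 0.1 * rng.standard_normal(n)
X_imp = rng.binomial(1, rho, size=(n, d)) * X
theta_hat = averaged_sgd_zero_imputed(X_imp, Y, gamma=1.0 / (kappa * d * np.sqrt(n)))
```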
4.2. Generalization bound

Our aim is to derive a generalization bound on the predictive performance of the above algorithm, trained on zero-imputed data. To do so, we require the following extra assumptions on the complete data.

Assumption 4. There exist $\sigma > 0$ and $\kappa > 0$ such that $\mathbb{E}\big[XX^\top \|X\|_2^2\big] \preceq \kappa\,\mathrm{Tr}(\Sigma)\,\Sigma$ and $\mathbb{E}\big[\epsilon^2 \|X\|_2^2\big] \le \sigma^2 \kappa\,\mathrm{Tr}(\Sigma)$, where $\epsilon = Y - X^\top\theta^\star$.

Assumption 4 is a classical fourth-moment assumption in stochastic optimization (see Bach & Moulines, 2013; Dieuleveut et al., 2017, for details). Indeed, the first statement in Assumption 4 holds, for example, if $X$ is a Gaussian vector (with $\kappa = 3$) or when $X$ satisfies $\|X\|^2 \le \kappa\,\mathrm{Tr}(\Sigma)$ almost surely. The second statement in Assumption 4 holds, for example, if the model is well specified or when the noise $\epsilon$ is almost surely bounded. Note that if the first part holds, then the second part holds with $\sigma^2 \le 2\,\mathbb{E}[Y^2] + 2\,\mathbb{E}[Y^4]^{1/2}$. Our main result, establishing an upper bound on the risk of SGD applied to zero-imputed data, follows.

Theorem 4.1. Under Assumption 4, choosing a constant learning rate $\gamma = \frac{1}{\kappa\,\mathrm{Tr}(\Sigma)\sqrt{n}}$ leads to

$\mathbb{E}\big[R_{\mathrm{imp}}(\bar\theta_{\mathrm{imp},n})\big] - R^\star(\mathcal{F}) \;\lesssim\; \frac{\kappa\,\mathrm{Tr}(\Sigma)\,\|\theta^\star_{\mathrm{imp}} - \theta_0\|_2^2 + \sigma^2 + \|\theta^\star\|_\Sigma^2}{\sqrt{n}} + B_{\mathrm{imp}}(\mathcal{F}),$

where $\theta^\star$ (resp. $\theta^\star_{\mathrm{imp}}$) is the best linear predictor for the complete (resp. imputed) case.

Theorem 4.1 gives an upper bound on the difference between $\mathbb{E}[R_{\mathrm{imp}}(\bar\theta_{\mathrm{imp},n})]$, the averaged risk of the estimated linear predictor with imputed missing values (in both train and test samples), and $R^\star(\mathcal{F})$, the risk of the best linear predictor on the complete case. Interestingly, by Lemma 2.1 and under a well-specified linear model, the same bound also holds for $\mathbb{E}[R_{\mathrm{imp}}(\bar\theta_{\mathrm{imp},n})] - R^\star_{\mathrm{mis}}$. The generalization bound in Theorem 4.1 takes into account the statistical error of the method as well as the optimization error. More precisely, the upper bound can be decomposed into (i) a bias associated to the initial condition, (ii) a variance term of the considered method, and (iii) the aforementioned imputation bias. The variance term (ii) depends on the second moment of $Y$ (as $\|\theta^\star\|_\Sigma^2 \le \mathbb{E}Y^2$) and decreases at a slow rate $1/\sqrt{n}$. As seen in Section 3, the imputation bias is upper-bounded by the ridge bias with penalization parameter $\lambda_{\mathrm{imp}}$, which is controlled in high dimension for low-rank data (see the examples in Section 3.2). The bias (i) due to the initial condition is the most critical. Indeed, $\mathrm{Tr}(\Sigma) = \mathbb{E}[\|X\|_2^2]$ is likely to increase with $d$; e.g., under Assumption 2, $\mathrm{Tr}(\Sigma) \le d L^2$. Besides, the starting point $\theta_0$ may be far from $\theta^\star_{\mathrm{imp}}$. Fortunately, Lemma 4.2 establishes some properties of $\theta^\star_{\mathrm{imp}}$.

Lemma 4.2. Under Assumptions 1 and 3, let $V$ be the covariance matrix of $P$ defined in Proposition 3.1. If $V$ is invertible, then

$\|\theta^\star_{\mathrm{imp}}\|_2^2 \le \frac{B_{\mathrm{imp}}(\mathcal{F})}{\ell^2\,\lambda_{\min}(V)}.$ (17)

In particular, under Assumption 1',

$\|\theta^\star_{\mathrm{imp}}\|_2^2 \le \frac{B_{\mathrm{imp}}(\mathcal{F})}{\ell^2\,\rho(1-\rho)}.$ (18)

Lemma 4.2 controls the norm of the optimal predictor $\theta^\star_{\mathrm{imp}}$ by the imputation bias: if the imputation bias is small, then the optimal predictor on zero-imputed data is of low norm. According to Section 3, this holds in particular in high-dimensional settings. Thus, choosing $\theta_0 = 0$ permits us to exploit the upper bound provided by Lemma 4.2 in Theorem 4.1. With such an initialization, the bias due to the initial condition is upper bounded by $\frac{\kappa\,\mathrm{Tr}(\Sigma)}{\sqrt{n}}\,\|\theta^\star_{\mathrm{imp}}\|_2^2$. Intuitively, as $\theta^\star_{\mathrm{imp}}$ lies in an $\ell_2$-ball of small radius, choosing $\theta_0$ within that ball, e.g., $\theta_0 = 0$, is a good choice. Taking Lemma 4.2 into account, Proposition 4.3 establishes our final upper bound for SGD on zero-imputed data.

Proposition 4.3. Under Assumptions 1', 2, 3 and 4, the predictor $\bar\theta_{\mathrm{imp},n}$ resulting from the SGD strategy defined in Section 4.1, with starting point $\theta_0 = 0$ and learning rate $\gamma = \frac{1}{d\kappa L^2\sqrt{n}}$, satisfies

$\mathbb{E}\big[R_{\mathrm{imp}}(\bar\theta_{\mathrm{imp},n})\big] - R^\star(\mathcal{F}) \;\lesssim\; \Big(\frac{L^2}{\ell^2}\,\frac{\kappa\, d}{\rho(1-\rho)\sqrt{n}} + 1\Big)\, B_{\mathrm{imp}}(\mathcal{F}) + \frac{\sigma^2 + \|\theta^\star\|_\Sigma^2}{\sqrt{n}}.$

In this upper bound, the first term encapsulates the imputation bias together with the bias due to the initial condition, whilst the second one corresponds to the variance of the training procedure. As soon as $d \gtrsim \frac{\ell^2}{L^2}\,\frac{\rho(1-\rho)\sqrt{n}}{\kappa}$, the imputation bias is negligible compared to that of the initial condition.

4.3. Examples

According to Examples 3.3 and 3.6, $B_{\mathrm{imp}}(\mathcal{F})$ decreases with the dimension, provided that $\Sigma$ or $\theta^\star$ are structured. Strikingly, Corollary 4.4 highlights cases where the upper bound of Proposition 4.3 is actually dimension-free.

Corollary 4.4. Suppose that the assumptions of Proposition 4.3 hold. Recall that $\lambda_1 \ge \dots \ge \lambda_d$ are the eigenvalues of $\Sigma$, associated with the eigenvectors $v_1, \dots, v_d$.

(i) (Example 3.3 - Low-rank $\Sigma$). If $\Sigma$ has low rank $r \ll d$ and equal non-zero singular values, then

$\mathbb{E}\big[R_{\mathrm{imp}}(\bar\theta_{\mathrm{imp},n})\big] - R^\star(\mathcal{F}) \;\lesssim\; \frac{L^2}{\ell^2}\Big(\frac{\kappa}{\rho\sqrt{n}} + \frac{1-\rho}{d}\Big)\,\frac{L^2\, r\,\|\theta^\star\|_\Sigma^2}{\rho} + \frac{\sigma^2 + \|\theta^\star\|_\Sigma^2}{\sqrt{n}}.$

(ii) (Example 3.6 - Spiked model). If $\Sigma = \Sigma_{\le r} + \Sigma_{>r}$ with $\Sigma_{>r} \preceq \ell^2\eta I$, if $\Sigma_{\le r}$ has low rank $r \ll d$ with equal non-zero singular values, and if the projection of $\theta^\star$ on the range of $\Sigma_{>r}$ satisfies $\theta^\star_{>r} = 0$, then

$\mathbb{E}\big[R_{\mathrm{imp}}(\bar\theta_{\mathrm{imp},n})\big] - R^\star(\mathcal{F}) \;\lesssim\; \frac{L^2}{\ell^2}\Big(\frac{\kappa}{\rho\sqrt{n}} + \frac{1-\rho}{d}\Big)\,\frac{L^2\, r\,\|\theta^\star\|_\Sigma^2}{\rho(1-\eta)} + \frac{\sigma^2 + \|\theta^\star\|_\Sigma^2}{\sqrt{n}}.$

Corollary 4.4 establishes upper bounds on the risk of SGD applied to zero-imputed data, for particular structures of $\Sigma$ and $\theta^\star$. These bounds take into account the statistical error as well as the optimization error, and are expressed as functions of $d$ and $n$. Since $\|\theta^\star\|_\Sigma^2$ is upper bounded by $\mathbb{E}Y^2$ (a dimension-free term), the risks in Corollary 4.4 can also be upper bounded by dimension-free quantities, provided $d > \frac{\ell^2}{L^2}\,\frac{\rho(1-\rho)\sqrt{n}}{\kappa}$. Besides, Corollary 4.4 shows that, for $d \gtrsim \frac{\ell^2}{L^2}\,\frac{\rho(1-\rho)\sqrt{n}}{\kappa}$, the imputation bias is negligible with respect to the stochastic error of SGD. Therefore, for structured problems in high-dimensional settings for which $d \gtrsim \frac{\ell^2}{L^2}\,\frac{\rho(1-\rho)\sqrt{n}}{\kappa}$, the zero-imputation strategy is consistent, with a slow rate of order $1/\sqrt{n}$.

Remark 4.5 (Discussion about slow rates). An important limitation of coupling naive imputation with SGD is that fast convergence rates cannot be reached. Indeed, in large dimension, the classical fast rate is given by $\mathrm{Tr}(\Sigma(\Sigma + \lambda I)^{-1})/n$, with $\lambda$ the penalization hyperparameter. The quantity $\mathrm{Tr}(\Sigma(\Sigma + \lambda I)^{-1})$, often called the degrees of freedom, can be negligible w.r.t. $d$ (for instance when $\Sigma$ has a fast eigenvalue decay). However, when working with an imputed dataset, the covariance matrix of the data is no longer $\Sigma$ but $\Sigma_{\mathrm{imp}} = \mathbb{E}[X_{\mathrm{imp}} X_{\mathrm{imp}}^\top]$.
Therefore, in the case of Assumption 1' (Ho-MCAR), all the eigenvalues of $\Sigma_{\mathrm{imp}}$ are larger than $\rho(1-\rho)$, preventing the eigenvalue decay obtained when working with complete inputs. By concavity of the degrees of freedom (over positive semi-definite matrices), one can show that $\mathrm{Tr}\big(\Sigma_{\mathrm{imp}}(\Sigma_{\mathrm{imp}} + \lambda I)^{-1}\big) \ge \frac{d\,\rho(1-\rho)}{1+\lambda}$, hindering traditional fast rates.

Link with dropout. Dropout is a classical regularization technique used in deep learning, consisting in randomly discarding some neurons at each SGD iteration (Srivastava et al., 2014). The regularization properties of dropout have attracted a lot of attention (e.g., Gal & Ghahramani, 2016). Interestingly, setting a neuron to 0 on the input layer is equivalent to masking the corresponding feature. Running SGD (as in Section 4) on a stream of zero-imputed data is thus equivalent to training a neural network with no hidden layer, a single output neuron, and dropout on the input layer. Our theoretical analysis describes the implicit regularization impact of dropout in that very particular case. Interestingly, this also applies to the fine-tuning of the last layer of any regression network structure.

Other imputations. In this paper, we proved that imputation by zero reduces the impact of missing data in high-dimensional settings by acting as an implicit regularization method. Our results remain valid for more generic imputation procedures as soon as (i) the mean values of the imputation functions $x \mapsto \mathbb{E}[X_{\mathrm{imp}} \mid X = x, P = p]$ are linear (for all missing patterns $p$) and (ii) there exists a constant $l$ such that $\mathbb{E}\big[(X - \mathbb{E}[X_{\mathrm{imp}} \mid X])^2\big] \ge l^2$. Condition (i) is necessary to preserve the linearity of predictors after imputation, which is at the core of our analysis. Condition (ii) is more surprising, as it states that the imputation is inaccurate on average. Such a property is the cornerstone of the implicit regularization provided by imputation methods. In particular, our analysis can be extended to classical imputation strategies such as imputation by the mean or by a standard Gaussian.

5. Numerical experiments

Data simulation. We generate $n = 500$ complete input data points according to a normal distribution with two different covariance structures. First, in the low-rank setting (Examples 3.3 and 3.5), the output is formed as $Y = \beta^\top Z + \epsilon$, with $\beta \in \mathbb{R}^r$, $Z \sim \mathcal{N}(0, I_r)$ and $\epsilon \sim \mathcal{N}(0, 2)$, and the inputs are given by $X = AZ + \mu$, with a full-rank matrix $A \in \mathbb{R}^{d\times r}$ and a mean vector $\mu \in \mathbb{R}^d$. Note that the dimension $d$ varies in the experiments, while $r = 5$ is kept fixed. Besides, the full model can be rewritten as $Y = X^\top\theta^\star + \epsilon$ with $\theta^\star = (A^\dagger)^\top\beta$, where $A^\dagger$ is the Moore-Penrose inverse of $A$. Secondly, in the spiked model (Example 3.6), the input and the output are decomposed as $X = (X_1, X_2) \in \mathbb{R}^{d/2}\times\mathbb{R}^{d/2}$ and $Y = Y_1 + Y_2$, where $(X_1, Y_1)$ is generated according to the low-rank model above and $(X_2, Y_2)$ is given by a linear model $Y_2 = \theta_2^\top X_2$ with $X_2 \sim \mathcal{N}(0, I_{d/2})$, choosing $\theta_2 = 0.2$. Two missing data scenarios, with a proportion $\rho$ of observed entries equal to 50%, are simulated: (i) the Ho-MCAR setting (Assumption 1'); and (ii) the self-masking MNAR setting, which departs significantly from the MCAR case as the presence of missing data depends on the underlying value itself. More precisely, we set $\alpha \in \mathbb{R}^d$ such that, for all $j \in [d]$, $\mathbb{P}(P_j = 1 \mid X) = (1 + e^{-\alpha_j X_j})^{-1}$ and $\mathbb{E}[P_j] = 0.5$ (50% of missing data on average per component).
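The simulation protocol above can be sketched as follows (the particular choices of $A$, $\mu$, $\beta$ and $\alpha$ below are illustrative placeholders, not the ones used in the paper; in particular, the paper calibrates $\alpha$ so that $\mathbb{E}[P_j] = 0.5$).

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, r = 500, 100, 5

# low-rank design: X = A Z + mu, Y = beta^T Z + eps
A = rng.standard_normal((d, r))                  # full-rank d x r matrix (illustrative choice)
mu = rng.standard_normal(d)
beta = rng.standard_normal(r)
Z = rng.standard_normal((n, r))
X = Z @ A.T + mu
Y = Z @ beta + rng.normal(scale=np.sqrt(2), size=n)

# (i) Ho-MCAR masks with 50% observed entries
P_mcar = rng.binomial(1, 0.5, size=(n, d))

# (ii) self-masking MNAR: P(P_j = 1 | X) = sigmoid(alpha_j X_j); alpha = 1 is a placeholder
alpha = np.ones(d)
P_mnar = rng.binomial(1, 1.0 / (1.0 + np.exp(-alpha * X)))

X_imp = P_mcar * X    # zero-imputed inputs fed to the two-step strategies
```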
Regressors. For two-step strategies, different imputers are combined with different regressors. The considered imputers are: the zero-imputation method (0-imp), complying with the theoretical analysis developed in this paper; the optimal imputation by a constant for each input variable (Opti-imp), obtained by training a linear model on the augmented data $(P \odot X, 1_d - P)$ (see Le Morvan et al., 2020b, Proposition 3.1); and single imputation by chained equations (ICE) (Van Buuren & Groothuis-Oudshoorn, 2011), implemented as IterativeImputer in scikit-learn (Pedregosa et al., 2011). The subsequent regressors, implemented in scikit-learn (Pedregosa et al., 2011), are either averaged SGD (SGD, package SGDRegressor) with $\theta_0 = 0$ and $\gamma = (d\sqrt{n})^{-1}$ (see Proposition 4.3), or the ridge regressor (with leave-one-out cross-validation, package ridge). Two specific methods that do not resort to prior imputation are also assessed: a pattern-by-pattern regressor (Le Morvan et al., 2020b; Ayme et al., 2022) (Pat-by-Pat) and a neural network architecture (NeuMiss) (Le Morvan et al., 2020a) specifically designed to handle missing data in linear prediction.

Numerical results. In Figure 1 (a) and (b), we consider Ho-MCAR patterns with Gaussian inputs with, respectively, a low-rank and a spiked covariance matrix. The two-step strategies perform remarkably well, with the ICE imputer on the top of the podium, being highly appropriate to the type of data (MCAR Gaussian) at play. Nonetheless, the naive imputation by zero remains competitive in terms of predictive performance and is computationally efficient, with a complexity of $O(nd)$, especially compared to ICE, whose complexity is of order $n^2 d^3$. Regarding Figure 1 (b), we note that ridge regression outperforms SGD for large $d$. Note that, in the regime where $d \gg n$, the imputation bias is negligible w.r.t. the method bias, the latter being lower in the case of ridge regression. This highlights the benefit of explicit ridge regularization (with a tuned hyperparameter) over the implicit regularization induced by imputation. In practice, missing data are not always of the Ho-MCAR type; we therefore also compare the different algorithms on self-masked data. In Figure 1 (c), we note that the specific methods remain competitive for larger $d$ than in MCAR settings. This was to be expected, since those methods were designed to handle complex missing not at random (MNAR) data. However, they still suffer from the curse of dimensionality and turn out to be inefficient in large dimension, compared to all two-step strategies.

6. Discussion and conclusion

In this paper, we study the impact of zero imputation in high-dimensional linear models. We demystify this widespread technique by exposing its implicit regularization mechanism when dealing with MCAR data. We prove that, in high-dimensional regimes, the induced bias is similar to that of ridge regression, commonly accepted by practitioners. By providing generalization bounds on SGD trained on zero-imputed data, we establish that such two-step procedures are statistically sound, while being computationally appealing. Theoretical results remain to be established beyond the MCAR case, to properly analyze and compare the different strategies for dealing with missing data in MNAR settings (see Figure 1 (c)). Extending our results to a broader class of functions (beyond linear ones), or even to a classification framework, would be valuable to fully understand the properties of imputation.

Acknowledgements. This work, as part of Alexis Ayme's thesis, was funded by the City of Paris through the Emergence(s) project Int Artica.
Figure 1. Risk w.r.t. the input dimension $d$ (evaluated on $10^4$ test samples) when 50% of the input data is missing. The y-axis corresponds to $R_{\mathrm{mis}}(f) - R^\star = \mathbb{E}\big[(Y - f(X_{\mathrm{imp}}, P))^2\big] - \sigma^2$. The averaged risk is depicted over 10 repetitions within a 95% confidence interval. Panels: (a) Ho-MCAR + Low-rank model; (b) Ho-MCAR + Spiked model; (c) Self-Masked + Low-rank model. Compared regressors: 0-imp+SGD, 0-imp+Ridge, Opti-imp+Ridge, ICE+Ridge (imputation-based), Pat-by-Pat and NeuMiss (specific methods).

References

Agarwal, A., Shah, D., Shen, D., and Song, D. On robustness of principal component regression. Advances in Neural Information Processing Systems, 32, 2019.

Ayme, A., Boyer, C., Dieuleveut, A., and Scornet, E. Near-optimal rate of consistency for linear models with missing values. In International Conference on Machine Learning, pp. 1211-1243. PMLR, 2022.

Bach, F. and Moulines, E. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems, 26, 2013.

Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063-30070, 2020.

Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368, 2007.

Chizat, L. and Bach, F. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pp. 1305-1338. PMLR, 2020.

Dieuleveut, A., Flammarion, N., and Bach, F. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520-3570, 2017.

Gal, Y. and Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. Advances in Neural Information Processing Systems, 29, 2016.

Hsu, D., Kakade, S. M., and Zhang, T. Random design analysis of ridge regression. In Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 2012.

Ipsen, N. B., Mattei, P.-A., and Frellsen, J. How to deal with missing data in supervised deep learning? In ICLR 2022 - 10th International Conference on Learning Representations, 2022.

Johnstone, I. M. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2):295-327, 2001. doi: 10.1214/aos/1009210544. URL https://doi.org/10.1214/aos/1009210544.

Josse, J., Prost, N., Scornet, E., and Varoquaux, G. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.

Le Morvan, M., Josse, J., Moreau, T., Scornet, E., and Varoquaux, G. NeuMiss networks: differentiable programming for supervised learning with missing values. In NeurIPS 2020 - 34th Conference on Neural Information Processing Systems, Vancouver / Virtual, Canada, December 2020a. URL https://hal.archives-ouvertes.fr/hal-02888867.

Le Morvan, M., Prost, N., Josse, J., Scornet, E., and Varoquaux, G. Linear predictor on linearly-generated data with missing values: non consistency and solutions. In International Conference on Artificial Intelligence and Statistics, pp. 3165-3174. PMLR, 2020b.

Le Morvan, M., Josse, J., Scornet, E., and Varoquaux, G. What's a good imputation to predict with missing values? Advances in Neural Information Processing Systems, 34:11530-11540, 2021.

Mourtada, J.
Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. ar Xiv preprint ar Xiv:1912.10754, 2019. Mourtada, J. and Rosasco, L. An elementary analysis of ridge regression with random design. ar Xiv preprint ar Xiv:2203.08564, 2022. Muzellec, B., Josse, J., Boyer, C., and Cuturi, M. Missing data imputation using optimal transport. In Interna- Naive imputation implicitly regularizes high-dimensional linear models. tional Conference on Machine Learning, pp. 7130 7140. PMLR, 2020. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Pesme, S., Pillaud-Vivien, L., and Flammarion, N. Implicit bias of sgd for diagonal linear networks: a provable benefit of stochasticity. Advances in Neural Information Processing Systems, 34:29218 29230, 2021. Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838 855, 1992. Rubin, D. B. Inference and missing data. Biometrika, 63(3):581 592, 12 1976. ISSN 0006-3444. doi: 10. 1093/biomet/63.3.581. URL https://doi.org/10. 1093/biomet/63.3.581. Sridharan, K., Shalev-Shwartz, S., and Srebro, N. Fast rates for regularized objectives. Advances in neural information processing systems, 21, 2008. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929 1958, 2014. Van Buuren, S. and Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in r. Journal of statistical software, 45:1 67, 2011. Yao, Y., Rosasco, L., and Caponnetto, A. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289 315, 2007. Yoon, J., Jordon, J., and Schaar, M. Gain: Missing data imputation using generative adversarial nets. In International conference on machine learning, pp. 5689 5698. PMLR, 2018. Naive imputation implicitly regularizes high-dimensional linear models. A. Notations Notations For two vectors (or matrices) a, b, we denote by a b the Hadamard product (or component-wise product). [n] = {1, 2, ..., n}. For two symmetric matrices A and B, A B means that B A is positive semi-definite. The symbol denotes the inequality up to a universal constant. Table 1 summarizes the notations used throughout the paper. Table 1. Notations P Mask F Set of linear functions Bimp Imputation bias Σ EXX λj eigenvalues of Σ vj eigendirections of Σ ΣP EPP L2 the largest second moments maxj EX2 j (Assumption 2) ℓ2 the smallest second moments minj EX2 j (Assumption 3) θ Best linear predictor on complete data θ imp Best linear predictor on imputed data r Rank of Σ ρj Theoretical proportion of observed entries for the j-th variable in a MCAR setting V Covariance matrix associated to the missing patterns C Covariance matrix V renormalized by (ρj)j defined in (14) κ Kurtosis of the input X B. Proof of the main results B.1. Proof of Lemma 2.1 The proof is based on the definition of the conditional expectation, and given that R = E h (Y E [Y |X])2i . Note that E [Y |X, P] = E [f (X) + ϵ|X, P] = E [f (X)|X, P] = f (X) (by independence of ϵ and P). 
Therefore, R = E h (Y f (X))2i E h (Y E [Y |X, P])2i E h (Y E [Y |Ximp, P])2i using that E [Y |Ximp, P] is a measurable function of (X, P). B.2. Preliminary lemmas Notation Let Xa be a random variable of law La (a modified version of the law of the underlying input X) on Rd, and for f F define Ra(f) = E h (Y f(Xa))2i , the associate risk. The Bayes risk is given by R a(F) = inf f F E h (Y f(Xa))2i , if the infimum is reached, we denote by f a arg minf F Ra(f). The discrepancy between both risks, involving either the modified input Xa or the initial input X, can be measured through the following bias: Ba = R a(F) R (F). Naive imputation implicitly regularizes high-dimensional linear models. General decomposition The idea of the next lemma is to compare Ra(f) with the true risk R(f). Lemma B.1. If (Xa Y )|X, then, for all θ Rd, Ra (fθ) = R (gθ) + θ 2 Γ , where gθ(X) = θ E [Xa|X] and Γ = E (Xa E [Xa|X])(Xa E [Xa|X]) the integrated conditional covariance matrix. In consequence, if there exists an invertible linear application H such that, E [Xa|X] = H 1X, then For all θ Rd, gθ is a linear function and R a(F) = inf θ Rd n R (fθ) + θ 2 H ΓH o . (19) If λmax(HΓH ) Λ, then Ba(F) inf θ Rd n R(fθ) + Λ θ 2 2 o = Bridge,Λ. (20) If λmin(Γ) µ > 0, then θ a 2 2 Ba(F) Remark B.2. Equation (21) is crucial because a bound on the bias Ba(F) actually gives a bound for θ a 2 2 too. This will be of particular interest for Theorem 4.1. Ra (fθ) = E h Y θ Xa 2i = E h E h Y E θ Xa|X + E θ Xa|X θ Xa 2 X ii = E h Y E θ Xa|X 2i + E h E h E θ Xa|X θ Xa 2 X ii = E h (Y gθ(X))2i + E h E h E θ Xa|X θ Xa 2 X ii = R(gθ) + E h E h E θ Xa|X θ Xa 2 X ii . since E E θ Xa|X θ Xa|X = 0. Furthermore, E h E h E θ Xa|Z θ Xa 2 |X ii = θ E h (E [Xa|X] Xa) (E [Xa|X] Xa) i θ = E h θ E h (E [Xa|X] Xa) (E [Xa|X] Xa) |X i θ i = E h θ 2 E[(E[Xa|X] Xa)(E[Xa|X] Xa) |X] = E h θ 2 Γ i . Ra (fθ) = R(gθ) + θ 2 Γ. Assume that an invertible matrix H exists such that gθ(X) = θ H 1X, thus gθ is a linear function. Equation (19) is then obtained by using a change of variable: θ = (H 1) θ = (H ) 1θ and θ = H θ . Thus, we have gθ (X) = θ X = fθ(X) and Ra (fθ ) = R(fθ) + H θ 2 Γ = R(fθ) + θ 2 HΓH . Naive imputation implicitly regularizes high-dimensional linear models. Then using HΓH ΛI proves (20). Note that, without resorting to the previous change of variable, the bias can be written as Ba(F) = R gθ a R (fθ ) + θ a 2 Γ . (22) By linearity of gθ a, R gθ a R (fθ ) = R (F) (because gθ a F). Thus, θ a 2 Γ Ba(F). Assuming µI Γ gives (21), as µ θ a 2 θ a 2 Γ Ba(F). B.3. Proof of Section 3 We consider the case of imputed-by-0 data, i.e., Ximp = P X. Under the MCAR setting (Assumption 1), E [Ximp|X] = H 1X, with H = diag(ρ 1 1 , ..., ρ 1 d ) (variables always missing are discarded) and (ρj)j [d] the observation rates associated to each input variable. Proof of Proposition 3.1. For i, j [d], Γij = E h (Ximp)i E (Ximp)i |X (Ximp)j E h (Ximp)j |X i i = E [Xi Xj(Pi EPi)(Pj EPj)] = E [Xi Xj] Cov(Pi, Pj), = Σij Vij (23) since P and X are independent and with V defined in Proposition 3.1. Therefore, applying Lemma B.1 with Γ = Σ V proves the first part of Proposition 3.1. Regarding the second part, under the Ho-MCAR assumption, one has V = ρ(1 ρ)I, thus Γ = ρ(1 ρ)diag(Σ). Furthermore, if L2 = ℓ2, then diag(Σ) = L2I which gives Γ = L2ρ(1 ρ)I. Proof of Theorem 3.2 and Proposition 3.7. Under Assumption 1, since H is a diagonal matrix, H ΓH = Σ C, where C is defined in Equation (14). Under Assumption 1 , the matrix C satisfies C = 1 ρ ρ I. 
Moreover, under Assumption 2 (resp. Assumption 3), one has Σ C 1 ρ ρ L2I = λimp (resp. Σ C 1 ρ ρ ℓ2I = λ imp) using (19), we obtain n R (θ) + λ imp θ 2 2 o R imp inf θ Rd n R (θ) + λimp θ 2 2 o . Subtracting R (F), one has Bridge,λ imp Bimp Bridge,λimp, which concludes the proof of Theorem 3.2. Under Assumption 1, we have HΓH = Σ C. Using Lemma E.2, we obtain for all θ, θ 2 HΓH = θ 2 Σ C λmax(C) θ 2 diag(Σ) . Under Assumption 2, we have diag(Σ) L2I, thus θ 2 HΓH L2λmax(C) θ 2 2 . This shows that λmax(HΓH ) L2λmax(C) = Λimp We conclude on Proposition 3.7 using Equation (19). Naive imputation implicitly regularizes high-dimensional linear models. B.4. Proof of Lemma 4.2 Proof. Using (23), we have Γ = V Σ. Using that λmin(V )I V , by Lemma E.1, we obtain λmin(V )I Σ Γ, and equivalently λmin(V ) diag(Σ) Γ. Under Assumption 3, we have ℓ2I diag(Σ), thus ℓ2λmin(V )I Γ. Therefore, λmin(Γ) ℓ2λmin(V ). Thus, using (21), we obtain the first part of Lemma 4.2: ℓ2λmin(V ) θ imp 2 2 Bimp(F). (24) Under Assumption 1 , λmin(V ) = ρ(1 ρ), so that ℓ2ρ(1 ρ) θ imp 2 2 Bimp(F), (25) which proves the second part of Lemma 4.2. C. Stochastic gradient descent C.1. Proof of Theorem 4.1 Lemma C.1. Assume (xn, ξn) H H are Fn-measurable for a sequence of increasing σ-fields (Fn), n 1. Assume that E [ξn | Fn 1] = 0, E h ξn 2 | Fn 1 i is finite and E h xn 2 xn xn | Fn 1 i R2H, with E [xn xn | Fn 1] = H for all n 1, for some R > 0 and invertible operator H. Consider the recursion αn = (I γxn xn) αn 1 + γξn, with γR2 1. Then: 1 γR2 E [ αn 1, H αn 1 ] + 1 2nγ E αn 2 1 2nγ α0 2 + γ k=1 E ξk 2 . Proof. The idea is to use Lemma C.1 with xk = Ximp,k, yk = Yk H = Σimp = E h Ximp,k X imp,k i = ΣP Σ where ΣP = E PP αk = θimp,k θ imp ξk = Ximp,k(Yk X imp,kθ imp) γ = 1 2R2 n R2 = κTr(Σ) We can show, with these notations, that recursion (16) leads to recursion αn = (I γxn xn) αn 1 + γξn with α0 = θ0 θ imp. Now, let s check the assumption of Lemma C.1. Let show that E h Ximp X imp Ximp 2 2 i R2Σimp. Indeed, E h Ximp X imp Ximp 2 2 i E h Ximp X imp X 2 2 i , Naive imputation implicitly regularizes high-dimensional linear models. using that Ximp 2 2 X 2 2, and 0 Ximp X imp. Then, E h Ximp X imp X 2 2 i = EE h Ximp X imp X 2 2 |P i = EE h PP XX X 2 2 |P i = E h ΣP XX X 2 2 i = ΣP E h XX X 2 2 i . According to Assumption 4, E h XX X 2 2 i R2Σ, and Lemma E.1 lead to E h Ximp X imp Ximp 2 2 i R2ΣP Σ = R2Σimp. Define ϵimp = Y X impθ imp = X θ + ϵ X impθ imp . First, we have ϵ2 imp 3 X θ 2 + ϵ2 + X impθ imp 2 , then E h ξ 2 2 i = E h ϵ2 imp Ximp 2 2 i 3E h X θ 2 + ϵ2 + X impθ imp 2 Ximp 2 2 i 3 E h X θ 2 X 2 2 i + E h ϵ2 X 2 2 i + E h X impθ imp 2 Ximp 2 2 i Let remark that, using Assumption 4 E h X θ 2 X 2 2 i = E h θ XX X 2 2 θ i θ 2 Σ = R2 θ 2 Σ . Using the first point, by the same way, E h X impθ imp 2 Ximp 2 2 i θ imp 2 Σimp . By Assumption 4, we have also than E h ϵ2 X 2 2 i σ2R2. Thus, E h ξ 2 2 i 3R2 σ2 + θ 2 Σ + θ imp 2 Σimp 3R2 σ2 + 2 θ 2 Σ , because θ 2 Σ = R (θ ) Rimp θ imp = θ imp 2 Consequently we can apply Lemma C.1, to obtain 1 1 2 n E θimp,n θ imp, Σimp( θimp,n θ imp) + 1 2nγ E θimp,n θ imp 2 1 2nγ θ imp θ0 2+γ k=1 E ξk 2 . The choice γ = 1 2R2 n leads to E θimp,n θ imp 2 Σimp 2R2 θ imp θ0 2 + 4σ2 + 2 θ 2 Σ n . We conclude on Theorem 4.1 using that, E Rimp θimp R = E Rimp θimp R imp + R imp R = E θimp,n θ imp 2 Σimp + Bimp. Naive imputation implicitly regularizes high-dimensional linear models. C.2. Proof of Proposition 4.3 and Corollary 4.4 Proof of Proposition 4.3. First, under Assumption 2, Tr(Σ) d L2. 
Then, initial conditions term with θ0 = 0, θ imp 2 2 κL2d nℓ2ρ(1 ρ)Bimp(F), (26) using Lemma 4.2. We obtain Proposition 4.3 using inequality above in Theorem 4.1. proof of Corollary 4.4. We obtain the upper bounds considered that: according to Theorem 3.2, Bimp Bridge,λimp; under Assumption 3, Tr(Σ) dℓ2. Then, we put together Proposition 4.3 and ridge bias bound (see Appendix D). C.3. Miscellaneous Proposition C.2. If X statisfies E h XX X 2 2 i κTr(Σ)Σ, then E h ϵ2 X 2 2 i σ2κTr(Σ) with σ2 2E[Y 2] + 2E[Y 4]1/2. E h ϵ2 X 2 2 i = E h Y X θ 2 X 2 2 i 2E h X θ 2 + Y 2 X 2 2 i 2E h Y 2 X 2 2 i + 2E h X θ 2 X 2 2 i . First term: by Cauchy Schwarz, E h Y 2 X 2 2 i2 E Y 4 E h X 4 2 i E Y 4 E h Tr XX X 2 2 i E Y 4 κTr(Σ)2. Second term: E h X θ 2 X 2 2 i = E h (θ ) XX X 2 2 θ i κTr(Σ)E (θ ) Σθ κTr(Σ) θ 2 2 . E h ϵ2 X 2 2 i E Y 4 1 2 κTr(Σ) + κTr(Σ) θ 2 Σ σ2κTr(Σ) θ 2 Σ . D. Details on examples Recall that Bridge,λ(F) = λ θ 2 Σ(Σ+λI) 1 (27) λj λj + λ(v j θ )2. (28) Naive imputation implicitly regularizes high-dimensional linear models. D.1. Low-rank covariance matrix (Example 3.3) Proposition D.1 (Low-rank covariance matrix with equal singular values). Consider a covariance matrix with a low rank r d and constant eigenvalues (λ1 = λ2 = ... = λr). Then, Bridge,λ(F) = λ r Tr(Σ) θ 2 Σ . Proof. Using that λ1 = = λr and Pr j=1 λj = Tr(Σ), we have λ1 = = λr = Tr(Σ) r . Then Σ(Σ + λI) 1 λ 1 r Σ = r Tr(Σ)Σ. Thus, Bridge,λ(F) = λ θ 2 Σ(Σ+λI) 1 = λ r Tr(Σ) θ 2 Σ . D.2. Low-rank covariance matrix compatible with θ (Example 3.5) Proposition D.2 (Low-rank covariance matrix compatible with θ ). Consider a covariance matrix with a low rank r d and assume that θ , v1 2 θ , vd 2, then Bridge,λ(F) λr(log(r) + 1) Tr(Σ) θ 2 Σ . Proof. Recall that j=1 λj(v j θ )2. (29) Under the assumptions of Example 3.5, using that (λj)j and (v j θ )2 j are decreasing, then for all k [r], j=1 λj(v k θ )2 θ 2 Σ. Thus, for all k [r], (v k θ )2 θ 2 Σ Pk j=1 λj . Using that Pr j=1 λj = Tr(Σ) and that eigenvalues are decreasing, we have Pk j=1 λj k r Tr(Σ) using Lemma E.3. Then Bridge,λ(F) = λ λk λk + λ(v k θ )2 k=1 (v k θ )2 1 Pk j=1 λj λ r Tr(Σ)(log(r) + 1), by upper-bounding the Euler-Maclaurin formula. Naive imputation implicitly regularizes high-dimensional linear models. D.3. Spiked covariance matrix (Example 3.6) Proposition D.3 (Spiked model). Assume that the covariance matrix is decomposed as Σ = Σ r + Σ>r. Suppose that Σ>r ηI (small operator norm) and that all non-zero eigenvalues of Σ r are equal, then Bridge,λ(F) r Tr(Σ) dη θ 2 Σ + η θ > 2 2 . where θ >r is the projection of θ on the range of Σ>r, Proof. One has Σ(Σ + λI) 1 = Σ (Σ + λI) 1 + Σ>(Σ + λI) 1 Σ (Σ + λI) 1 + Σ>(Σ> + λI) 1 where µ is the non-zero eigenvalue of Σ . Thus, Bridge,λ(F) = θ 2 λΣ(Σ+λI) 1 θ 2 λ µ Σ +Σ> µ θ 2 Σ + θ 2 Σ> . Using that λmax(Σ>) η, we have Bridge,λ(F) λ µ θ 2 Σ + η θ > 2 2 . Using Weyl s inequality, for all j [d], λj(Σ + Σ>) λj(Σ ) + η. Summing the previous inequalities, we get Tr(Σ) rµ + dη. In consequence, Bridge,λ(F) r Tr(Σ) dη θ 2 Σ + η θ > 2 2 . E. Technical lemmas Lemma E.1. Let A, B, V be three symmetric non-negative matrix, if A B then A V B V . Proof. Let X N(0, V ) and θ Rd, Naive imputation implicitly regularizes high-dimensional linear models. θ 2 A V = θ A V θ = θ EXX A θ = E θ XX A θ i,j θi XX A i,j θi Xi Xj Aijθj i,j (θi Xi) (θj Xj) Aij = E h X θ 2 A i E h X θ 2 B i = θ 2 B V . Lemma E.2. 
Let A, B be two non-negative symmetric matrices, then A B is non-negative symmetric and, for all θ Rd: θ 2 A B λmax(B) θ 2 diag(A) , where diag(A) is the diagonal matrix containing the diagonal terms of A. Proof. Let X N(0, A), thus A = E XX , then for θ Rd θ 2 A B = θ A Bθ = θ EXX B θ = E θ XX B θ i,j θi XX B i,j θi Xi Xj Bijθj i,j (θi Xi) (θj Xj) Bij = E h (X θ) B (X θ) i Naive imputation implicitly regularizes high-dimensional linear models. using that B is positive. Thus A B is positive. Furthermore, θ 2 A B = E h (X θ) B (X θ) i λmax(B)E h (X θ) (X θ) i i θ2 i X2 i = λmax(B) X i θ2 i E X2 i = λmax(B) X = λmax(B) θ 2 diag(A) . Lemma E.3. Let (vj)j [d] a non-increasing sequence of positive number, and S = Pd j=1 vj, for all k [d], Proof. We use an apagogical arguments, if Pk j=1 vj < k d S. Then, using that (vj)j [d] is non-increasing, Thus vk+1 < 1 d S, summing last elements, d X j=r+1 vj < d r j=r+1 vj < k Thus, this is absurd. F. Additional experiments To illustrate the influence of the input dimension and the missing data proportion on the bias, we consider the low-rank setting described in Section 5, making the dimension d and the missing value proportion ρ vary (and keeping a fixed rank r = 5). The imputation bias is estimated using 10 000 (complete) observations. Naive imputation implicitly regularizes high-dimensional linear models. Number of features d Bimp estimated 0.0 0.1 0.5 0.9 Figure 2. Estimated imputation bias with n = 10000 observations in a low rank setting (r = 5).