# HMLasso: Lasso with High Missing Rate

Masaaki Takada¹, Hironori Fujisawa², and Takeichiro Nishikawa¹
¹Toshiba Corporation
²The Institute of Statistical Mathematics
masaaki1.takada@toshiba.co.jp, fujisawa@ism.ac.jp, takeichiro.nishikawa@toshiba.co.jp

Abstract

Sparse regression such as the Lasso has achieved great success in handling high-dimensional data. However, one of the biggest practical problems is that high-dimensional data often contain large amounts of missing values. Convex Conditioned Lasso (CoCoLasso) has been proposed for dealing with high-dimensional data with missing values, but it performs poorly when there are many missing values, so the high-missing-rate problem remains unresolved. In this paper, we propose a novel Lasso-type regression method for high-dimensional data with high missing rates. We effectively incorporate the mean imputed covariance, overcoming its inherent estimation bias. The result is an optimally weighted modification of CoCoLasso according to missing ratios. We theoretically and experimentally show that our proposed method is highly effective even when there are many missing values.

1 Introduction

High-dimensional data appear in a wide range of fields, including biology, economics, and industry. Over the past several decades, sparse regression has achieved great success in dealing with high-dimensional data, because it performs model estimation and variable selection simultaneously and efficiently. Sparse regression methods include the Lasso [Tibshirani, 1996], Elastic Net [Zou and Hastie, 2005], SCAD [Fan and Li, 2001], and MCP [Zhang, 2010].

In practice, high-dimensional data often contain large amounts of missing values. For example, educational and psychological studies commonly have missing data ratios of 15-20% [Enders, 2003], while maintenance data for typical industrial processes had over 75% missing values in over 50% of variables [Lakshminarayan et al., 1999]. Up to 90% of traffic data can be missing [Tan et al., 2013]. There is thus demand for methods that can accommodate high-dimensional data with a high missing rate.

Missing data analysis has a long history. Listwise deletion (complete case analysis) and pairwise deletion are widely used because of their simplicity. Various methods have been developed, including the expectation maximization (EM) algorithm [Dempster et al., 1977], multiple imputation (MI) [Rubin, 1987; Schafer, 1997; Buuren and Groothuis-Oudshoorn, 2011], and full information maximum likelihood (FIML) [Hartley and Hocking, 1971; Enders, 2001]. However, these methods focus on low-dimensional missing data and are intractable for high-dimensional data due to their computational cost.

To deal with high-dimensional missing data, a direct regression method using a pairwise covariance matrix has been proposed [Loh and Wainwright, 2012]. This method incurs low computational costs, but it heavily depends on critical unknown parameters that must be determined in advance due to its nonconvexity. Convex Conditioned Lasso (CoCoLasso) was proposed to avoid this problem [Datta and Zou, 2017]. Because it is convex thanks to a positive semidefinite approximation of the pairwise covariance matrix, it does not suffer from local optima or critical parameters. However, we found that CoCoLasso can deteriorate at high missing rates.
Indeed, estimator accuracy may significantly worsen even when only one variable has a high missing rate, so the high-missing-rate problem remains unresolved.

In this paper, we propose a novel regression method that overcomes problems related to both high dimensionality and high missing rates. We use the mean imputed covariance matrix, effectively incorporating it into the Lasso despite its well-known estimation bias under missing data. The resulting optimization problem can be seen as a weighted modification of CoCoLasso using the missing ratio, and it is quite effective for high-dimensional data with a high missing rate. Our proposed method is free from local optima and critical parameters due to convexity, and it is theoretically and experimentally superior to CoCoLasso in terms of estimation error.

The contributions of this study are as follows:

- We propose a novel regression method for handling high-dimensional data with high missing rates.
- We analyze theoretical properties of our method, showing that our formulation is superior to all other weighted formulations with regard to estimation error.
- We demonstrate the effectiveness of our method through both numerical simulations and real-world data experiments. Our method outperforms other methods in almost all cases, and shows particularly significant improvement for high-dimensional data with high missing rates.

The remainder of this paper is organized as follows. We first review existing methods and describe our proposed method. We then show the advantages of our method through theoretical analyses, numerical simulations, and real-world data experiments.

1.1 Notations

Let $v \in \mathbb{R}^p$. $\|v\|_q$ ($q > 0$) is the $\ell_q$ norm, that is, $\|v\|_q = (|v_1|^q + \cdots + |v_p|^q)^{1/q}$. Let $M \in \mathbb{R}^{n \times p}$. $\|M\|_F$ is the Frobenius norm, that is, $\|M\|_F = (\sum_{j,k} M_{jk}^2)^{1/2}$. $\|M\|_{\max}$ is the max norm, that is, $\|M\|_{\max} = \max_{j,k} |M_{jk}|$. Let $M_1, M_2 \in \mathbb{R}^{n \times p}$. $M_1 \circ M_2$ is the element-wise product (Hadamard product) of $M_1$ and $M_2$. $M_1 \oslash M_2$ is the element-wise division of $M_1$ by $M_2$. Let $M \in \mathbb{R}^{p \times p}$ be a symmetric matrix. $M \succeq 0$ denotes that $M$ is positive semidefinite, that is, $v^\top M v \ge 0$ for any $v \in \mathbb{R}^p$.

2 Methods

2.1 Problem Formulation

Consider a linear regression model $y = X\beta + \varepsilon$, where $X \in \mathbb{R}^{n \times p}$ is a complete design matrix, $y \in \mathbb{R}^n$ is a response, $\beta \in \mathbb{R}^p$ is a regression coefficient vector, and $\varepsilon \in \mathbb{R}^n$ is noise. Suppose that $y$ and each column of $X$ are centered without loss of generality. The ordinary problem is to estimate the regression coefficient $\beta$ given complete data $X$ and $y$. In contrast, in this paper we consider the situation where some elements of $X$ are missing.

If $X$ does not contain missing values, the Lasso is one of the most promising methods for handling high-dimensional data. It solves the problem [Tibshirani, 1996]

$$\hat{\beta} = \operatorname*{argmin}_{\beta}\; \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1, \qquad (1)$$

where $\lambda > 0$ is a regularization parameter. Since the objective function (1) is regularized by the $\ell_1$ norm of $\beta$, the solution is sparse and often has a small generalization error. In the presence of missing values, however, it is impossible to apply the Lasso directly.

2.2 Review of Existing Methods

Interestingly, estimating the parameter $\beta$ in the presence of missing values does not require imputation of $X$, which is computationally expensive for high-dimensional data. We can directly estimate the parameter without imputation.
The Lasso objective function (1) can be reformulated as

$$\hat{\beta} = \operatorname*{argmin}_{\beta}\; \frac{1}{2}\beta^\top S \beta - \rho^\top \beta + \lambda \|\beta\|_1, \qquad (2)$$

where $S = \frac{1}{n} X^\top X$ (the sample covariance matrix of $X$) and $\rho = \frac{1}{n} X^\top y$ (the sample covariance vector of $X$ and $y$); indeed, $\frac{1}{2n}\|y - X\beta\|_2^2 = \frac{1}{2}\beta^\top S\beta - \rho^\top\beta + \frac{1}{2n}\|y\|_2^2$, and the last term does not depend on $\beta$. Using (2), we can estimate $\beta$ via $S$ and $\rho$ instead of $X$ and $y$.

If data are missing completely at random, we can easily construct unbiased estimators of the covariance matrix and vector using the pairwise covariance, that is, $S^{\mathrm{pair}} = (S^{\mathrm{pair}}_{jk})$ and $\rho^{\mathrm{pair}} = (\rho^{\mathrm{pair}}_{j})$ with

$$S^{\mathrm{pair}}_{jk} := \frac{1}{n_{jk}} \sum_{i \in I_{jk}} X_{ij} X_{ik}, \qquad \rho^{\mathrm{pair}}_{j} := \frac{1}{n_{jj}} \sum_{i \in I_{jj}} X_{ij} y_i,$$

where $I_{jk} := \{i : X_{ij} \text{ and } X_{ik} \text{ are observed}\}$ and $n_{jk}$ is the number of elements of $I_{jk}$. Thus, we can replace $S$ and $\rho$ in (2) by $S^{\mathrm{pair}}$ and $\rho^{\mathrm{pair}}$, respectively.

The major problem here is that $S^{\mathrm{pair}}$ may not be positive semidefinite (PSD). In other words, it may have negative eigenvalues. This is critical because negative eigenvalues can cause the objective function to diverge to minus infinity, meaning the optimization fails. A regression method with a nonconvex constrained formulation has been proposed to avoid this problem [Loh and Wainwright, 2012]. However, this method's nonconvexity causes some difficulties in practice. Namely, different initial values result in multiple global or local minimizers that produce different output solutions, and the solutions depend on unknown parameters that must be determined in advance.

To avoid the above difficulties, a convex optimization problem called CoCoLasso has been proposed [Datta and Zou, 2017]. The problem is formulated as

$$\hat{\beta} = \operatorname*{argmin}_{\beta}\; \frac{1}{2}\beta^\top \check{\Sigma} \beta - (\rho^{\mathrm{pair}})^\top \beta + \lambda \|\beta\|_1, \qquad (3)$$

$$\check{\Sigma} = \operatorname*{argmin}_{\Sigma \succeq 0}\; \|\Sigma - S^{\mathrm{pair}}\|_{\max}. \qquad (4)$$

CoCoLasso obtains the PSD covariance matrix in (4) via the alternating direction method of multipliers (ADMM) algorithm, and then optimizes the Lasso objective function (3). This formulation overcomes the difficulties seen in [Loh and Wainwright, 2012]: the objective function (3) is convex due to the PSD matrix $\check{\Sigma}$, so it has no local minimizers, and it uses no unknown parameters that must be determined in advance. In addition, non-asymptotic statistical properties were also derived. For these reasons, CoCoLasso is practical and state-of-the-art.

However, a high missing rate can deteriorate the estimation of the covariance matrix in CoCoLasso. If some pairwise observation numbers $n_{jk}$ are very small, then the corresponding pairwise covariances $S^{\mathrm{pair}}_{jk}$ are quite unreliable, possibly becoming very large or very small. Since (4) is based on the max norm, unreliable elements of $S^{\mathrm{pair}}$ greatly affect the estimator. As a result, other elements of the estimator can deviate substantially from the corresponding elements of $S^{\mathrm{pair}}$, even if their variables have few missing values. This indicates that CoCoLasso results can significantly worsen even if only one variable has a high missing rate. The problem is that CoCoLasso does not account for the differences in reliability of the pairwise covariances. The next subsection describes how we overcome this problem.

We mention other approaches for regression with missing data. A simple approach is listwise deletion. This method is very fast but inappropriate when there are few complete samples, as is common with high-dimensional data.
Another typical approach is to impute missing values, for example by mean imputation, the EM algorithm [Dempster et al., 1977], MI [Rubin, 1987; Schafer, 1997; Buuren and Groothuis-Oudshoorn, 2011], FIML [Hartley and Hocking, 1971; Enders, 2001], or other non-parametric methods [Stekhoven and Bühlmann, 2012]. These methods, however, typically incur large computational costs or cause large bias in high-dimensional problems, and theoretical analysis is hard to conduct for them. Other direct modeling methods use the Dantzig selector instead of the Lasso [Rosenbaum and Tsybakov, 2010; Rosenbaum and Tsybakov, 2013]. An advantage of the Lasso-type approach is that computation is empirically much faster than with the Dantzig selector [Efron et al., 2007].

2.3 Proposed Method: HMLasso

The mean imputation method is commonly used in practice. Let $Z$ be the mean imputed data of $X$. Because $X$ is centered, $Z_{ij} = X_{ij}$ for observed elements and $Z_{ij} = 0$ for missing elements. The covariance matrix of the mean imputed data, $S^{\mathrm{imp}} = (S^{\mathrm{imp}}_{jk})$, is

$$S^{\mathrm{imp}}_{jk} = \frac{1}{n}\sum_{i=1}^{n} Z_{ij} Z_{ik} = \frac{1}{n}\sum_{i \in I_{jk}} X_{ij} X_{ik} = \frac{n_{jk}}{n} S^{\mathrm{pair}}_{jk}. \qquad (5)$$

We can equivalently express (5) as

$$S^{\mathrm{imp}} = R \circ S^{\mathrm{pair}}, \qquad (6)$$

where $R = (R_{jk})$ with $R_{jk} = n_{jk}/n$. The mean imputed covariance matrix $S^{\mathrm{imp}}$ is biased but PSD, while the pairwise covariance matrix $S^{\mathrm{pair}}$ is unbiased but not PSD. To take the best aspects of both, we optimize

$$\tilde{\Sigma} = \operatorname*{argmin}_{\Sigma \succeq 0}\; \|R \circ \Sigma - S^{\mathrm{imp}}\|_F^2 \qquad (7)$$

to obtain a low-biased and PSD matrix. Direct use of mean imputation for covariance matrix estimation is known to produce estimation bias. However, we can use the relation (6) between $S^{\mathrm{imp}}$ and $S^{\mathrm{pair}}$ to effectively incorporate both into the optimization problem (7).

The formulation (7) has a useful property, in that it is equivalent to

$$\tilde{\Sigma} = \operatorname*{argmin}_{\Sigma \succeq 0}\; \|R \circ (\Sigma - S^{\mathrm{pair}})\|_F^2. \qquad (8)$$

The formulation (8) can be seen as a weighted modification of CoCoLasso (4) using the observed ratio matrix $R$, where the max norm is replaced by the Frobenius norm. This weighting is beneficial under high missing rates. When there are missing observations, the objective function downweights the corresponding term $\Sigma_{jk} - S^{\mathrm{pair}}_{jk}$ by the observed ratio $R_{jk} = n_{jk}/n$. In particular, the downweighting is reasonable when $n_{jk}$ is small, because the pairwise covariance $S^{\mathrm{pair}}_{jk}$ is then unreliable.

Figure 1: (Left) Two-dimensional covariance matrix space. (Right) Two simple missing patterns. Black elements represent missing values.

Based on the above, we extend the formulation and propose a novel optimization problem to estimate the regression model:

$$\hat{\beta} = \operatorname*{argmin}_{\beta}\; \frac{1}{2}\beta^\top \tilde{\Sigma} \beta - (\rho^{\mathrm{pair}})^\top \beta + \lambda \|\beta\|_1, \qquad (9)$$

$$\tilde{\Sigma} = \operatorname*{argmin}_{\Sigma \succeq 0}\; \|W \circ (\Sigma - S^{\mathrm{pair}})\|_F^2, \qquad (10)$$

where $W$ is a weight matrix whose $(j, k)$-th element is $W_{jk} = R_{jk}^{\alpha}$ with a constant $\alpha \ge 0$. We obtain a PSD matrix by minimizing the weighted Frobenius norm in (10), and then optimize the Lasso problem (9). This Lasso-type formulation allows us to efficiently deal with high missing rates, so we call our method HMLasso.

Several values of $\alpha$ can be considered. Setting $\alpha = 0$ corresponds to the non-weighted case, which is just a projection of $S^{\mathrm{pair}}$ onto the PSD region. This is the same as CoCoLasso when the Frobenius norm in (10) is replaced by the max norm. The case $\alpha = 1$ relates to mean imputation, as described above. As shown below, non-asymptotic analyses support that $\alpha = 1$ is reasonable, and numerical experiments show that this setting delivers the best performance. Therefore, we recommend setting $\alpha = 1$ in practice. The case $\alpha = 1/2$ can be roughly viewed as the maximum likelihood method from an asymptotic perspective; we discuss it in the supplementary material.¹ Note that in (10), we use the Frobenius norm instead of the max norm, because the Frobenius norm delivered better performance in numerical experiments.

¹The whole paper including the supplementary material is available at https://arxiv.org/abs/1811.00255.
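To make the quantities above concrete, here is a minimal numpy sketch (our own illustration, not the authors' implementation; the function and variable names are hypothetical) that computes $S^{\mathrm{pair}}$, $\rho^{\mathrm{pair}}$, the observed ratio matrix $R$, and the mean imputed covariance $S^{\mathrm{imp}}$ from a centered design matrix with NaN entries, and checks the relation (6):

```python
import numpy as np

def pairwise_covariances(X_nan, y):
    """Pairwise and mean-imputed covariance estimators from a centered design
    matrix with missing entries encoded as np.nan, and a fully observed,
    centered response y. Returns S_pair, rho_pair, R, S_imp with
    S_imp = R * S_pair elementwise (relation (6))."""
    n, p = X_nan.shape
    M = (~np.isnan(X_nan)).astype(float)          # observation pattern matrix
    Z = np.nan_to_num(X_nan)                      # zero (= mean) imputation
    n_jk = M.T @ M                                # pairwise observation counts n_jk
    R = n_jk / n                                  # observed ratios R_jk = n_jk / n
    S_imp = (Z.T @ Z) / n                         # mean-imputed covariance (5)
    S_pair = S_imp * n / np.maximum(n_jk, 1)      # S_pair_jk = (n / n_jk) * S_imp_jk
    rho_pair = (Z.T @ y) / np.maximum(np.diag(n_jk), 1)
    return S_pair, rho_pair, R, S_imp

# toy check of relation (6) and of the proposed weights W = R^alpha
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X[rng.random(X.shape) < 0.3] = np.nan             # 30% missing completely at random
X = X - np.nanmean(X, axis=0)                     # center columns on observed entries
y = rng.standard_normal(200)
S_pair, rho_pair, R, S_imp = pairwise_covariances(X, y)
assert np.allclose(S_imp, R * S_pair)
W = R ** 1.0                                      # recommended weight, alpha = 1
```

In such a sketch, $S^{\mathrm{pair}}$ may fail to be PSD, which is exactly what the optimization (10) repairs.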
2.4 Comparison Using a Simple Example

The following simple example shows that our weighted formulation (10) is better than the non-weighted formulation. Consider three-dimensional data $X \in \mathbb{R}^{n \times 3}$. To derive simple analytical results, we suppose that the pairwise covariance matrix and the observed ratio matrix are

$$S^{\mathrm{pair}} = \begin{pmatrix} 1 & s_1 & s_2 \\ s_1 & 1 & s_2 \\ s_2 & s_2 & 1 \end{pmatrix}, \qquad R = \frac{1}{n}\begin{pmatrix} n & n_1 & n_2 \\ n_1 & n & n_2 \\ n_2 & n_2 & n \end{pmatrix},$$

respectively. We restrict the diagonal elements of the covariance estimate to 1 for simplicity. By the symmetry of the problem, we can parameterize the covariance estimate as $\Sigma = [1, \sigma_1, \sigma_2;\ \sigma_1, 1, \sigma_2;\ \sigma_2, \sigma_2, 1]$. Hence, we can view $S^{\mathrm{pair}}$ and $\Sigma$ in the same two-dimensional space $(\sigma_1, \sigma_2)$ as in Figure 1. A simple calculation shows that the PSD condition is $2\sigma_2^2 - 1 \le \sigma_1 \le 1$, shown in gray in Figure 1. Hereafter, to show the differences among the methods, without loss of generality we suppose $s_1 < 2s_2^2 - 1$ so that $S^{\mathrm{pair}}$ is not PSD, and also $s_2 \ge 0$. Point P in Figure 1 is an example of such an $S^{\mathrm{pair}}$.

Consider the case where $n_1$ is sufficiently small and $n_2$ is sufficiently large, which is realized by missing data pattern 1 in Figure 1. The CoCoLasso (non-weighted max norm) solution is the tangent point of the gray region and a square contour centered at P, shown as point A in Figure 1. The solution of the non-weighted Frobenius norm is the tangent point of the gray region and an elliptic contour centered at P, shown as point B. The optimal point of HMLasso (the weighted norm) will be close to point C, because the weights $R_{13}$ and $R_{23}$ are much larger than $R_{12}$, so the optimal solution must be close to the tangent point of the gray PSD region and line CD. On the other hand, the sample covariance matrix $S$ with complete observations satisfies $\sigma_2 \simeq s_2$ and $2\sigma_2^2 - 1 \le \sigma_1 \le 1$, represented as line segment CD. Since point C is closer to any point on line segment CD than A or B is, the HMLasso estimate is always closer to the covariance matrix of the complete data than the estimates using non-weighted norms.

Consider another case where $n_1$ is sufficiently large and $n_2$ is sufficiently small, which is realized by missing pattern 2 in Figure 1. By reasoning similar to the case described above, the solutions for the non-weighted max and Frobenius norms are A and B, respectively, and the HMLasso optimal point will be close to point E in Figure 1. The sample covariance matrix $S$ with complete observations satisfies $\sigma_1 \simeq s_1$ and $2\sigma_2^2 - 1 \le \sigma_1 \le 1$, shown as line segment EF. Hence, the HMLasso estimate is again superior to those of the non-weighted norms.

2.5 Algorithms

The Lasso optimization problem using the covariance matrix (9) can be solved by various algorithms for the Lasso, such as the coordinate descent algorithm [Friedman et al., 2010] and the least angle regression algorithm [Efron et al., 2004]. Our implementation uses the coordinate descent algorithm because it is efficient for high-dimensional data. The algorithm details are described in the supplementary material. We use the warm-start and safe screening techniques to speed up the algorithm.
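As an illustration of how (9) is solved once $\tilde{\Sigma}$ and $\rho^{\mathrm{pair}}$ are available, the following is a minimal coordinate descent sketch for the covariance-form Lasso (our own simplified version with a plain soft-thresholding update; the warm starts and safe screening mentioned above are omitted):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cov_lasso_cd(Sigma, rho, lam, n_iter=1000, tol=1e-8):
    """Coordinate descent for min_beta 0.5 beta' Sigma beta - rho' beta + lam * ||beta||_1,
    where Sigma is a PSD covariance estimate (e.g. the solution of (10)) and rho is
    the covariance vector between the predictors and the response."""
    p = len(rho)
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            # r_j = rho_j - sum_{k != j} Sigma_jk * beta_k
            r_j = rho[j] - Sigma[j] @ beta + Sigma[j, j] * beta[j]
            beta[j] = soft_threshold(r_j, lam) / Sigma[j, j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

Note that this only requires the $p \times p$ matrix $\tilde{\Sigma}$ and the $p$-vector $\rho^{\mathrm{pair}}$, not the raw incomplete data.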
We use the ADMM algorithm [Boyd et al., 2011] to solve the PSD covariance matrix optimization (10), which can be rewritten as

$$(\tilde{A}, \tilde{B}) = \operatorname*{argmin}_{A \succeq 0,\; B = A - S^{\mathrm{pair}}}\; \|W \circ B\|_F^2.$$

Therefore, the augmented Lagrangian function is

$$f(A, B, \Lambda) = \frac{1}{2}\|W \circ B\|_F^2 - \langle \Lambda,\; A - B - S^{\mathrm{pair}} \rangle + \frac{1}{2\mu}\|A - B - S^{\mathrm{pair}}\|_F^2, \qquad (11)$$

where $\Lambda$ is a Lagrangian multiplier matrix and $\mu$ is an augmented Lagrangian parameter. We iteratively update $A$, $B$, and $\Lambda$ subject to $A \succeq 0$ by minimizing (11) with respect to each variable. The resulting algorithm is similar to the CoCoLasso algorithm, except for the update rule for $B$ due to the weight matrix $W$. To derive the B-step update equation, differentiating $f(A, B, \Lambda)$ with respect to $B$ yields

$$\nabla_B f(A, B, \Lambda) = W \circ W \circ B + \Lambda - \frac{1}{\mu}\left(A - B - S^{\mathrm{pair}}\right).$$

Solving $\nabla_B f(A, B, \Lambda) = 0$, we obtain the update rule

$$B \leftarrow (A - S^{\mathrm{pair}} - \mu\Lambda) \oslash (\mu\, W \circ W + \mathbf{1}\mathbf{1}^\top),$$

where $\mathbf{1}\mathbf{1}^\top$ is the matrix of ones. The algorithm for solving (10) is presented as Algorithm 1. The difference between the max norm and the Frobenius norm is trivial when we use ADMM. The supplementary material also describes the algorithm for the weighted max norm.

Algorithm 1: Covariance Estimation with ADMM
Input: $S^{\mathrm{pair}}$, $W$, $\mu$
for $k = 1, 2, \ldots$ do
  $A_{k+1} \leftarrow$ projection of $B_k + S^{\mathrm{pair}} + \mu\Lambda_k$ onto the PSD cone
  $B_{k+1} \leftarrow (A_{k+1} - S^{\mathrm{pair}} - \mu\Lambda_k) \oslash (\mu\, W \circ W + \mathbf{1}\mathbf{1}^\top)$
  $\Lambda_{k+1} \leftarrow \Lambda_k - \frac{1}{\mu}(A_{k+1} - B_{k+1} - S^{\mathrm{pair}})$
end for
Output: $A_{k+1}$
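A compact numpy sketch of Algorithm 1 is given below (our own illustration of the updates derived above; the fixed iteration count and the choice $\mu = 1$ are placeholders rather than tuned values). The PSD projection is computed by zeroing out the negative eigenvalues of the symmetrized matrix.

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return (V * np.maximum(w, 0.0)) @ V.T

def weighted_psd_covariance(S_pair, W, mu=1.0, n_iter=200):
    """ADMM for problem (10): min_{Sigma PSD} || W o (Sigma - S_pair) ||_F^2.
    Follows Algorithm 1 with A = Sigma and B = Sigma - S_pair."""
    p = S_pair.shape[0]
    B = np.zeros((p, p))
    Lam = np.zeros((p, p))
    ones = np.ones((p, p))
    for _ in range(n_iter):
        A = project_psd(B + S_pair + mu * Lam)              # A-step: PSD projection
        B = (A - S_pair - mu * Lam) / (mu * W * W + ones)   # B-step: closed-form update
        Lam = Lam - (A - B - S_pair) / mu                   # dual update
    return A
```

In practice one would add a stopping criterion based on the primal residual $\|A - B - S^{\mathrm{pair}}\|_F$ rather than a fixed iteration count.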
3 Theoretical Properties

In this section, we investigate statistical properties of the proposed estimator. We first obtain a refined non-asymptotic property for the pairwise covariance matrix, which explicitly includes missing rate effects. We then derive a non-asymptotic property of our estimate in (10). These results show that the $\alpha = 1$ weighting is superior, in terms of non-asymptotic properties, to other weightings ($\alpha \ne 1$), including the non-weighted formulation ($\alpha = 0$). Note that we focus on the Frobenius norm formulation in (10), but the $\alpha = 1$ weighting is likewise superior for the max norm formulation, even though CoCoLasso uses the non-weighted norm ($\alpha = 0$). Complete proofs for the propositions and theorems in this section are given in the supplementary material.

3.1 Preliminaries

Let $M = (M_{ij}) \in \mathbb{R}^{n \times p}$ be the observation pattern matrix whose elements are 1 when data are observed and 0 otherwise, so that $Z = M \circ X$. We suppose a sub-Gaussian assumption on $M$, which plays a key role in the non-asymptotic properties. This assumption often holds, as seen in the following proposition.

Definition 1. A random variable $u \in \mathbb{R}$ is said to be sub-Gaussian with $\tau^2$ if $\mathbb{E}[\exp(s(u - \mathbb{E}[u]))] \le \exp(\tau^2 s^2 / 2)$ for all $s \in \mathbb{R}$. A random vector $u \in \mathbb{R}^p$ is said to be sub-Gaussian with $\tau^2$ if $v^\top u$ is a sub-Gaussian variable with $\tau^2$ for all $v \in \mathbb{R}^p$ satisfying $\|v\|_2 = 1$.

Assumption 1. The rows of $M$ are independent and identically distributed with mean $\mu_M$, covariance $\Sigma_M$, and sub-Gaussian parameter $\tau^2$.

Proposition 1. Assume that the $M_{ij}$ are independent and identically distributed as a Bernoulli distribution with mean $\mu_j$. Then, the rows of $M$ are sub-Gaussian with $\tau^2 = \max_j \mu_j(1 - \mu_j) \le 1/4$.

For the theoretical analysis in this section, we substitute $\hat{\Sigma} := \frac{1}{n} Z^\top Z \oslash \Pi$ for $S^{\mathrm{pair}}$, where $\Pi := (\pi_{jk}) := \Sigma_M + \mu_M \mu_M^\top$. This is reasonable because $S^{\mathrm{pair}} = \frac{1}{n} Z^\top Z \oslash R$ from (5), and the expectation of $R$ is $\mathbb{E}[R] = \mathbb{E}[\frac{1}{n} M^\top M] = \Pi$. We call $\pi_{jk}$ and $1 - \pi_{jk}$ the observed rate and the missing rate, respectively, since they are the expectations of the observed and missing ratios. We suppose that $\mu_M$ and $\Sigma_M$ are known. This substitution and assumption were also used in previous theoretical research [Loh and Wainwright, 2012; Datta and Zou, 2017].

3.2 Statistical Properties

We first derive a refined non-asymptotic property of $\hat{\Sigma}$.

Theorem 2. Under Assumption 1, we have, for all $\varepsilon \le c\tau^2 X_{\max}^2 / \pi_{jk}$,

$$\Pr\left(\left|\hat{\Sigma}_{jk} - S_{jk}\right| \le \varepsilon\right) \ge 1 - C\exp\left(-c n \varepsilon^2 \pi_{jk}^2 \zeta^{-1}\right),$$

where $\zeta = \tau^2 X_{\max}^4 \max\{\tau^2, \mu_j^2, \mu_k^2\}$, $X_{\max} = \max_{i,j}|X_{ij}|$, and $C$ and $c$ are universal constants.

Sketch of Proof. The difference $\hat{\Sigma}_{jk} - S_{jk}$ can be decomposed into terms involving $\frac{1}{n\pi_{jk}}\sum_{i=1}^n v_{ijk}(m_{ij} - \mu_j)(m_{ik} - \mu_k)$, $\frac{1}{n\pi_{jk}}\sum_{i=1}^n v_{ijk}(m_{ik} - \mu_k)$, and $\frac{1}{n\pi_{jk}}\sum_{i=1}^n v_{ijk}(m_{ij} - \mu_j)$, where $v_{ijk} := x_{ij}x_{ik}$. The first term is bounded using Lemma B.1 in [Datta and Zou, 2017], and the second and third terms are bounded using Property (B.2) in [Datta and Zou, 2017]. Careful analyses considering $\pi_{jk}$, $\mu_j$, and $\mu_k$ yield the assertion.

In Theorem 2, the missing rate appears explicitly, and the non-asymptotic property is sharper than those of Definition 1 and Lemma 2 in [Datta and Zou, 2017]. To see the missing rate effect clearly, we replace $\varepsilon$ by $\varepsilon/\pi_{jk}$. Then, for all $\varepsilon \le c\tau^2 X_{\max}^2$, we have

$$\Pr\left(\pi_{jk}\left|\hat{\Sigma}_{jk} - S_{jk}\right| \le \varepsilon\right) \ge 1 - C\exp\left(-c n \varepsilon^2 \zeta^{-1}\right).$$

Since the right side does not depend on $\pi_{jk}$, the concentration probability of $\pi_{jk}|\hat{\Sigma}_{jk} - S_{jk}|$ is equally bounded regardless of the missing rate (a small simulation illustrating this is given at the end of this section). This implies that our weighted formulation balances the uncertainty of each element of $\hat{\Sigma}$, while non-weighted formulations such as CoCoLasso suffer from this imbalance.

Next, we derive a non-asymptotic property of our weighted estimator $\tilde{\Sigma}$.

Theorem 3. Under Assumption 1, we have, for all $\varepsilon \le c\tau^2 X_{\max}^2 (\min_{j,k} W_{jk}/\pi_{jk})/W_{\min}$,

$$\Pr\left(\left\|\tilde{\Sigma} - S\right\|_{\max} \le \varepsilon\right) \ge 1 - p^2 C\exp\left(-c n \varepsilon^2 W_{\min}^2 \min_{j,k}\left(\frac{\pi_{jk}}{W_{jk}}\right)^2 \zeta^{-1}\right),$$

where $\zeta = \tau^2 X_{\max}^4 \max\{\tau^2, \mu_1^2, \ldots, \mu_p^2\}$, $W_{\min} = \min_{j,k} W_{jk}$, $X_{\max} = \max_{i,j}|X_{ij}|$, and $C$ and $c$ are universal constants.

Sketch of Proof. We have $\|W \circ (\tilde{\Sigma} - S)\|_F \le \|W \circ (\tilde{\Sigma} - \hat{\Sigma})\|_F + \|W \circ (\hat{\Sigma} - S)\|_F \le 2\|W \circ (\hat{\Sigma} - S)\|_F$ by the triangle inequality and the optimality of $\tilde{\Sigma}$. Using $W_{\min}^2 \|\tilde{\Sigma} - S\|_F^2 \le \|W \circ (\tilde{\Sigma} - S)\|_F^2$, Theorem 2 yields the assertion.

According to Theorem 3, the proposed weight $W_{jk} = \pi_{jk}$ (the population counterpart of $R_{jk} = n_{jk}/n$) is optimal at the population level, as follows.

Theorem 4. Let $W_{jk} = \pi_{jk}^{\alpha}$. Then, the lower bound of the concentration probability and the upper bound of $\varepsilon$ in Theorem 3 are simultaneously maximized when $\alpha = 1$.

Proof. Let $\pi_{\min} = \min_{j,k}\pi_{jk}$ and $\pi_{\max} = \max_{j,k}\pi_{jk}$. Substituting $W_{jk} = \pi_{jk}^{\alpha}$, we have the concentration probability

$$1 - p^2 C\exp\left(-c n \varepsilon^2 \pi_{\min}^{2\alpha}\min_{j,k}\pi_{jk}^{2-2\alpha}\,\zeta^{-1}\right)$$

with the constraint $\varepsilon \le c\tau^2 X_{\max}^2 (\min_{j,k}\pi_{jk}^{\alpha-1})/\pi_{\min}^{\alpha}$. (i) For $0 \le \alpha \le 1$, the concentration probability becomes $1 - p^2 C\exp(-c n \varepsilon^2 \pi_{\min}^2 \zeta^{-1})$, and the constraint on $\varepsilon$ becomes $\varepsilon \le c\tau^2 X_{\max}^2 (\pi_{\max}/\pi_{\min})^{\alpha}/\pi_{\max}$. Since $\pi_{\max}/\pi_{\min} \ge 1$, the constraint region of $\varepsilon$ is maximized at $\alpha = 1$. (ii) For $\alpha \ge 1$, the concentration probability becomes $1 - p^2 C\exp(-c n \varepsilon^2 \pi_{\max}^2(\pi_{\min}/\pi_{\max})^{2\alpha}\zeta^{-1})$, and the constraint on $\varepsilon$ becomes $\varepsilon \le c\tau^2 X_{\max}^2/\pi_{\min}$. Since $\pi_{\min}/\pi_{\max} \le 1$, the concentration probability is maximized at $\alpha = 1$.
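The following short Monte Carlo sketch (our own illustration, not an experiment from the paper) exemplifies the remark after Theorem 2 for a single pair of variables: as the observed rate $\pi_{jk}$ decreases, the raw deviations $|\hat{\Sigma}_{jk} - S_{jk}|$ grow, whereas the scaled deviations $\pi_{jk}|\hat{\Sigma}_{jk} - S_{jk}|$ stay on a comparable scale. For simplicity, the deviation is measured from the population covariance rather than the complete-data sample covariance, and the observed rate is assumed known, as in Section 3.1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, S_true = 2000, 500, 0.5                     # true covariance between the two predictors
X = rng.multivariate_normal([0.0, 0.0], [[1.0, S_true], [S_true, 1.0]], size=(reps, n))

for pi_jk in (0.9, 0.1):                             # observed rate of the pair (j, k)
    # independent per-column masks so that P(both entries observed) = pi_jk
    mask = rng.random((reps, n, 2)) < np.sqrt(pi_jk)
    Z = X * mask                                     # zero imputation of missing entries
    Sigma_hat = (Z[:, :, 0] * Z[:, :, 1]).mean(axis=1) / pi_jk   # pairwise-type estimator
    dev = np.abs(Sigma_hat - S_true)
    print(f"pi_jk = {pi_jk:.1f}: mean |Sigma_hat - S| = {dev.mean():.4f}, "
          f"mean pi_jk * |Sigma_hat - S| = {(pi_jk * dev).mean():.4f}")
```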
4 Numerical Experiments

We conducted experiments using both synthetic and real-world data. More comprehensive simulation results under various conditions are given in the supplementary material due to space limitations.

4.1 Simulations with Various Norms

First, we investigated the effect of various weighted norms in (10). We compared the max norm with $\alpha = 0, 1/2, 1, 2$ and the Frobenius norm with $\alpha = 0, 1/2, 1, 2$. The max norm with $\alpha = 0$ corresponds to CoCoLasso, and the Frobenius norm with $\alpha = 0$ corresponds to a simple projection onto the PSD region. Training data were $X \in \mathbb{R}^{n \times p}$ with $n = 10000$ and $p = 100$, generated from $N(0, \Sigma^*)$ with $\Sigma^*_{jk} = 0.5$ for $j \ne k$ and $\Sigma^*_{jk} = 1$ for $j = k$. Responses $y$ were defined as $y = X\beta + \varepsilon$ with $\beta_1 = 10$, $\beta_{11} = 9$, $\beta_{21} = 8$, $\beta_{31} = 7$, $\ldots$, $\beta_{91} = 1$, $\beta_j = 0$ otherwise, and $\varepsilon \sim N(0, 1)$. We introduced missing values completely at random, setting a missing rate for each column sampled from a uniform distribution $U(0, 1)$ (0.5 on average). Test data were generated independently in the same manner, except that we did not introduce missing values, for evaluation purposes. The regularization parameter $\lambda$ was selected by five-fold corrected cross-validation as in [Datta and Zou, 2017]. We iterated each experiment 30 times and plotted averaged results with standard errors.

Figure 2: Comparison of the max and Frobenius norms with $\alpha = 0, 0.5, 1, 2$.

Figure 2 shows the results. The performance measures were the $\ell_2$ error of the regression coefficients and the root mean square error of prediction. The weighted norms with $\alpha = 1$ were effective for both the Frobenius and max norm formulations, as suggested by the non-asymptotic theoretical analyses. In addition, the Frobenius norms outperformed the max norms. Therefore, we use the Frobenius norm with $\alpha = 1$ as the proposed method, namely HMLasso, in subsequent experiments. In contrast, CoCoLasso (the max norm with $\alpha = 0$) was clearly inferior to HMLasso.

4.2 Simulations Under Various Conditions

We next compared the performance of HMLasso with other methods under various conditions, examining the mean imputation method, CoCoLasso (the max norm with $\alpha = 0$), and HMLasso (the Frobenius norm with $\alpha = 1$). Following the simulation setting in the previous subsection, we varied the missing data rate, covariance level, sample size, and number of variables. We set the average missing rates to $\mu = 0.1, 0.5, 0.9$, the covariance levels to $r = 0.1, 0.5, 0.9$, the sample sizes to $n = 10^3, 10^4, 10^5$, and the numbers of variables to $p = 50, 100, 150, 200$. Note that we also examined other missing value imputation methods such as mice [Buuren and Groothuis-Oudshoorn, 2011] and missForest [Stekhoven and Bühlmann, 2012], but their computational costs were over 100 times larger than those of the above methods, so we excluded them from our experiments.

Figure 3: Simulation results with various missing rates, covariance levels, sample sizes, and dimension numbers.

Figure 3 shows the results. HMLasso outperformed the other methods under almost all conditions, especially for data with high missing rates and high covariance levels. In addition, high dimensionality did not adversely affect HMLasso, while the other methods showed gradually worse performance.
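For concreteness, here is a minimal numpy sketch of the synthetic-data generation described in Sections 4.1 and 4.2 (our own reconstruction of the stated setup, not the authors' experimental code; the function name and defaults are illustrative):

```python
import numpy as np

def generate_data(n=10_000, p=100, r=0.5, seed=0):
    """Synthetic data following Section 4.1: equicorrelated Gaussian design,
    sparse coefficients (beta_1 = 10, beta_11 = 9, ..., beta_91 = 1), unit-variance
    noise, and entries missing completely at random with column-wise missing
    rates drawn from U(0, 1) (0.5 on average)."""
    rng = np.random.default_rng(seed)
    Sigma_star = np.full((p, p), r)                    # Sigma*_jk = r for j != k
    np.fill_diagonal(Sigma_star, 1.0)                  # Sigma*_jj = 1
    X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)

    beta = np.zeros(p)
    idx = np.arange(0, min(p, 100), 10)                # 0-based indices 0, 10, ..., 90 (i.e. beta_1, beta_11, ...)
    beta[idx] = np.arange(10, 10 - len(idx), -1)       # values 10, 9, ..., 1
    y = X @ beta + rng.standard_normal(n)              # eps ~ N(0, 1)

    col_rates = rng.uniform(0.0, 1.0, size=p)          # per-column missing rates
    X_obs = X.copy()
    X_obs[rng.random((n, p)) < col_rates] = np.nan     # missing completely at random
    return X_obs, y, X, beta
```

Test data for evaluation would be generated in the same way without introducing missing values, as described above.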
4.3 Residential Building Dataset

We evaluated performance on a real-world residential building dataset [Rafiei and Adeli, 2016] from the UCI machine learning repository.² The dataset includes construction costs, sale prices, project variables, and economic variables corresponding to single-family residential apartments in Tehran. The objective was to predict sale prices and construction costs from physical, financial, and economic variables. The data consist of $n = 372$ samples and $p = 105$ variables, including two output variables. We introduced missing values at average missing rates $\mu = 0, 0.2, 0.4, 0.6, 0.8$. We evaluated performance in terms of prediction error on complete samples. We randomly split the data into 300 samples for training, 36 for validation, and 36 for testing, and iterated the experiments 30 times.

Figure 4: Analysis of residential building data with various patterns and rates of missing data. The outcomes are sale prices (left) and construction costs (right).

Figure 4 shows the results. HMLasso outperformed the other methods in nearly all cases. The advantage of HMLasso was especially clear for high missing rates.

5 Conclusion

We proposed a novel regression method for high-dimensional data with high missing rates, and demonstrated the advantages of our method through theoretical analyses and numerical experiments.

²https://archive.ics.uci.edu/ml/datasets/Residential+Building+Data+Set

References

[Boyd et al., 2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

[Buuren and Groothuis-Oudshoorn, 2011] S. van Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 2011.

[Datta and Zou, 2017] Abhirup Datta and Hui Zou. CoCoLasso for high-dimensional error-in-variables regression. The Annals of Statistics, 45(6):2400-2426, 2017.

[Dempster et al., 1977] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1-22, 1977.

[Efron et al., 2004] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407-499, 2004.

[Efron et al., 2007] Bradley Efron, Trevor Hastie, and Robert Tibshirani. Discussion: The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2358-2364, 2007.

[Enders, 2001] Craig K. Enders. A primer on maximum likelihood algorithms available for use with missing data. Structural Equation Modeling, 8(1):128-141, 2001.

[Enders, 2003] Craig K. Enders. Using the expectation maximization algorithm to estimate coefficient alpha for scales with item-level missing data. Psychological Methods, 8(3):322, 2003.

[Fan and Li, 2001] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.

[Friedman et al., 2010] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

[Hartley and Hocking, 1971] H. O. Hartley and R. R. Hocking. The analysis of incomplete data. Biometrics, pages 783-823, 1971.

[Lakshminarayan et al., 1999] Kamakshi Lakshminarayan, Steven A. Harp, and Tariq Samad. Imputation of missing data in industrial databases. Applied Intelligence, 11(3):259-275, 1999.

[Loh and Wainwright, 2012] Po-Ling Loh and Martin J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637-1664, 2012.
[Rafiei and Adeli, 2016] Mohammad Hossein Rafiei and Hojjat Adeli. A novel machine learning model for estimation of sale prices of real estate units. Journal of Construction Engineering and Management, 142(2):04015066, 2016.

[Rosenbaum and Tsybakov, 2010] Mathieu Rosenbaum and Alexandre B. Tsybakov. Sparse recovery under matrix uncertainty. The Annals of Statistics, 38(5):2620-2651, 2010.

[Rosenbaum and Tsybakov, 2013] Mathieu Rosenbaum and Alexandre B. Tsybakov. Improved matrix uncertainty selector. pages 276-290, 2013.

[Rubin, 1987] Donald B. Rubin. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, 1987.

[Schafer, 1997] Joseph L. Schafer. Analysis of Incomplete Multivariate Data. Chapman and Hall/CRC, 1997.

[Stekhoven and Bühlmann, 2012] Daniel J. Stekhoven and Peter Bühlmann. MissForest: non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112-118, 2012.

[Tan et al., 2013] Huachun Tan, Guangdong Feng, Jianshuai Feng, Wuhong Wang, Yu-Jin Zhang, and Feng Li. A tensor-based method for missing traffic data completion. Transportation Research Part C: Emerging Technologies, 28:15-27, 2013.

[Tibshirani, 1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288, 1996.

[Zhang, 2010] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894-942, 2010.

[Zou and Hastie, 2005] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301-320, 2005.