# The Limits of Learning with Missing Data

Brian Bullins, Elad Hazan (Princeton University, Princeton, NJ), {bbullins,ehazan}@cs.princeton.edu
Tomer Koren (Google Brain, Mountain View, CA), tkoren@google.com

**Abstract.** We study linear regression and classification in a setting where the learning algorithm is allowed to access only a limited number of attributes per example, known as the limited attribute observation model. In this well-studied model, we provide the first lower bounds giving a limit on the precision attainable by any algorithm for several variants of regression, notably linear regression with the absolute loss and the squared loss, as well as for classification with the hinge loss. We complement these lower bounds with a general purpose algorithm that gives an upper bound on the achievable precision limit in the setting of learning with missing data.

## 1 Introduction

The primary objective of linear regression is to determine the relationships between multiple variables and how they may affect a certain outcome. A standard example is that of medical diagnosis, whereby the data gathered for a given patient provides information about their susceptibility to certain illnesses. A major drawback of this process is the work necessary to collect the data, as it requires running numerous tests for each person, some of which may be discomforting. In such cases it may be necessary to impose limitations on the amount of data available for each example. For medical diagnosis, this might mean having each patient undergo only a small subset of tests.

A formal setting that captures regression and learning with limits on the number of attribute observations is the Limited Attribute Observation (LAO) setting, first introduced by Ben-David and Dichterman [1]. For example, in a regression problem the learner has access to a distribution $D$ over data $(x, y) \in \mathbb{R}^d \times \mathbb{R}$, and fits the best (generalized) linear model according to a certain loss function $\ell$, i.e., it approximately solves the optimization problem
$$\min_{w \,:\, \|w\|_p \le B} \mathcal{L}_D(w), \qquad \mathcal{L}_D(w) = \mathbb{E}_{(x,y) \sim D}\big[\ell(w^\top x, y)\big].$$
In the LAO setting, the learner does not have complete access to the examples $x$, which the reader may think of as the attributes of a certain patient. Rather, the learner can observe at most a fixed number of these attributes, denoted $k \le d$. If $k = d$, this is the standard regression problem, which can be solved to arbitrary precision.

The main question we address is: is it possible to compute an arbitrarily accurate solution if the number of observations per example, $k$, is strictly less than $d$? More formally, given any $\varepsilon > 0$, can one compute a vector $w$ for which
$$\mathcal{L}_D(w) \le \min_{\|w^\star\|_p \le B} \mathcal{L}_D(w^\star) + \varepsilon\,?$$

Efficient algorithms for regression with the squared loss when $k < d$ have been given in previous work [2], and the sample complexity bounds have since been tightened [6, 8]. However, similar results for other common loss functions, such as the absolute loss, have only been shown by relaxing the hard limit of $k$ attributes per example [3, 6].

In this paper we show, for the first time, that in fact this problem cannot be solved in general. Our main result shows that even for regression with the absolute loss function, for any $k \le d - 1$, there is an information-theoretic lower bound on the error attainable by any algorithm. That is, there is some $\varepsilon_0 > 0$ for which an $\varepsilon_0$-optimal solution cannot be determined, irrespective of the number of examples the learner sees.
Formally, with constant probability, any algorithm returning a vector $w \in \mathbb{R}^d$ must satisfy
$$\mathcal{L}_D(w) > \min_{\|w^\star\|_p \le B} \mathcal{L}_D(w^\star) + \varepsilon_0.$$
We further show that this ultimate achievable precision parameter is bounded from below by a polynomial in the dimension, namely $\varepsilon_0 = \Omega(d^{-3/2})$.

Additionally, for the basic setting of Ridge regression (with the squared loss), we give a tight lower bound for the LAO setting. Cesa-Bianchi et al. [2] provided the first efficient algorithm for this setting, with sample complexity $O(d^2/k\varepsilon^2)$ for $\varepsilon$ error. Hazan and Koren [6] improved upon this result and gave a tight sample complexity of $O(d/k\varepsilon^2)$ to achieve $\varepsilon$ error. In both cases, however, the algorithms only work when $k \ge 2$. We complete the picture and show that $k \ge 2$ attributes are in fact necessary to obtain arbitrarily low error. That is, with only one attribute per example, there is an information-theoretic limit on the accuracy attainable by any regression algorithm. We remark that a similar impossibility result was proven by Cesa-Bianchi et al. [3] in the related setting of learning with noisy examples.

Classification may be similarly cast in the LAO setting. For classification with the hinge loss, namely soft-margin SVM, we give a related lower bound, showing that it is impossible to achieve arbitrarily low error if the number of observed attributes is bounded by $k \le d - 1$. However, unlike our lower bound for regression, the lower bound we prove for classification decays exponentially with the dimension. Although Hazan et al. [7] showed how classification may be done with missing data, their work relies on low-rank assumptions and is therefore not in contradiction with the lower bounds presented here.

Similar to the LAO setting, the setting of learning with missing data [9, 4, 10, 11] presents the learner with examples whose attributes are only randomly observed. Since the missing data setting is at least as difficult as the LAO setting, our lower bounds extend to this case as well. We complement these lower bounds with a general purpose algorithm for regression and classification with missing data that, given a sufficient number of samples, can achieve an error of $O(1/\sqrt{d})$. This result leaves only a small polynomial gap compared to the information-theoretic lower bound that we prove.

## 2 Setup and Statement of Results

The general framework of linear regression involves a set of instances, each of the form $(x, y)$, where $x \in \mathbb{R}^d$ is the attribute vector and $y \in \mathbb{R}$ is the corresponding target value. Under the typical statistical learning framework [5], each $(x, y)$ pair is drawn from a joint distribution $D$ over $\mathbb{R}^d \times \mathbb{R}$. The learner's objective is to determine some linear predictor $w$ such that $w^\top x$ does well in predicting $y$. The quality of prediction is measured according to a loss function $\ell : \mathbb{R} \to \mathbb{R}$. Two commonly used loss functions for regression are the squared loss $\ell(w^\top x - y) = \frac{1}{2}(w^\top x - y)^2$ and the absolute loss $\ell(w^\top x - y) = |w^\top x - y|$. Since our examples are drawn from some arbitrary distribution $D$, it is best to consider the expected loss
$$\mathcal{L}_D(w) = \mathbb{E}_{(x,y) \sim D}\big[\ell(w^\top x - y)\big].$$
The learner's goal then is to determine a regressor $w$ that minimizes the expected loss $\mathcal{L}_D(w)$. To avoid overfitting, a regularization term is typically added, which up to some constant factor is equivalent to solving
$$\min_{w \in \mathbb{R}^d} \mathcal{L}_D(w) \quad \text{s.t.} \quad \|w\|_p \le B$$
for some regularization parameter $B > 0$, where $\|\cdot\|_p$ is the standard $\ell_p$ norm, $p \ge 1$. Two common variants of regression are Ridge regression ($p = 2$ with the squared loss) and Lasso regression ($p = 1$ with the squared loss).
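To make the access model concrete, here is a minimal Python sketch (not from the paper) of the LAO constraint: samples are drawn from a joint distribution, but the learner may read at most $k$ of the $d$ attributes of each example. The toy distribution and all function names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_loss(z):   # l(w^T x - y) = 0.5 * (w^T x - y)^2
    return 0.5 * z ** 2

def absolute_loss(z):  # l(w^T x - y) = |w^T x - y|
    return np.abs(z)

def expected_loss(w, X, Y, loss):
    """Monte-Carlo estimate of L_D(w) from samples (X, Y) drawn from D."""
    return float(np.mean(loss(X @ w - Y)))

def lao_view(x, k):
    """LAO access: the learner picks k of the d attributes and sees only those."""
    idx = rng.choice(len(x), size=k, replace=False)
    return idx, x[idx]          # the remaining attributes of x stay hidden

# A toy distribution D (illustrative only): ||x||_inf <= 1 and y in [-1, 1].
d, n, k = 8, 1000, 2
X = rng.uniform(-1.0, 1.0, size=(n, d))
Y = np.clip(X @ (np.ones(d) / d) + 0.05 * rng.standard_normal(n), -1.0, 1.0)

w = np.zeros(d)
print("squared loss of w = 0:", expected_loss(w, X, Y, squared_loss))
idx, vals = lao_view(X[0], k)   # all the learner may see of the first example
print("observed attributes:", idx, vals, "target:", Y[0])
```

The full vectors `X` exist only inside the simulated environment; an LAO learner interacts with the data exclusively through calls such as `lao_view`, which is precisely the restriction the lower bounds below exploit.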
The framework for classification is nearly identical to that of linear regression. The main distinction comes from the different meaning of $y \in \mathbb{R}$: namely, $y$ acts as a label for the corresponding example. The loss function also changes when learning a classifier, and in this paper we are interested in the hinge loss $\ell(y \cdot w^\top x) = \max\{0,\, 1 - y \cdot w^\top x\}$. The overall goal of the learner, however, remains the same: namely, to determine a classifier $w$ such that $\mathcal{L}_D(w)$ is minimized. Throughout the paper, we let $w^\star$ denote the minimizer of $\mathcal{L}_D(w)$.

### 2.1 Main Results

As a first step, for Lasso and Ridge regression, we show that one always needs to observe at least two attributes in order to learn a regressor to arbitrary precision. This is stated formally in Theorem 1.

**Theorem 1.** Let $0 < \varepsilon < \frac{1}{32}$ and let $\ell$ be the squared loss. Then there exists a distribution $D$ over $\{x : \|x\|_\infty \le 1\} \times [-1, 1]$ such that $\|w^\star\|_1 \le 2$, and any regression algorithm that can observe at most one attribute of each training example of a training set $S$ cannot output a regressor $\hat w$ such that $\mathbb{E}_S[\mathcal{L}_D(\hat w)] < \mathcal{L}_D(w^\star) + \varepsilon$.

**Corollary 2.** Let $0 < \varepsilon < \frac{1}{64}$ and let $\ell$ be the squared loss. Then there exists a distribution $D$ over $\{x : \|x\|_2 \le 1\} \times [-1, 1]$ such that $\|w^\star\|_2 \le 2$, and any regression algorithm that can observe at most one attribute of each training example of a training set $S$ cannot output a regressor $\hat w$ such that $\mathbb{E}_S[\mathcal{L}_D(\hat w)] < \mathcal{L}_D(w^\star) + \varepsilon$.

The lower bounds are tight: recall that with two attributes it is indeed possible to learn a regressor to within arbitrary precision [2, 6]. Also, notice the order of quantification in the theorems: it turns out that there exists a single distribution which is hard for all algorithms (rather than a different hard distribution for each algorithm).

For regression with the absolute loss, we consider the setting where the learner is limited to seeing $k$ or fewer attributes of each training sample. Theorem 3 below shows that in the case where $k < d$ the learner cannot hope to learn an $\varepsilon$-optimal regressor for some $\varepsilon > 0$.

**Theorem 3.** Let $d \ge 4$, $d \equiv 0 \pmod{2}$, $0 < \varepsilon < \frac{1}{60} d^{-3/2}$, and let $\ell$ be the absolute loss. Then there exists a distribution $D$ over $\{x : \|x\|_\infty \le 1\} \times [-1, 1]$ such that $\|w^\star\|_1 \le 2$, and any regression algorithm that can observe at most $d - 1$ attributes of each training example of a training set $S$ cannot output a regressor $\hat w$ such that $\mathbb{E}_S[\mathcal{L}_D(\hat w)] < \mathcal{L}_D(w^\star) + \varepsilon$.

**Corollary 4.** Let $0 < \varepsilon < \frac{1}{60} d^{-2}$, and let $\ell$ be the absolute loss. Then there exists a distribution $D$ over $\{x : \|x\|_2 \le 1\} \times [-1, 1]$ such that $\|w^\star\|_2 \le 1$, and any regression algorithm that can observe at most $d - 1$ attributes of each training example of a training set $S$ cannot output a regressor $\hat w$ such that $\mathbb{E}_S[\mathcal{L}_D(\hat w)] < \mathcal{L}_D(w^\star) + \varepsilon$.

We complement our findings for regression with the following analogous lower bound for classification with the hinge loss (a.k.a. soft-margin SVM).

**Theorem 5.** Let $d \ge 4$, $d \equiv 0 \pmod{2}$, and let $\ell$ be the hinge loss. Then there exists an $\varepsilon_0 > 0$ such that the following holds: there exists a distribution $D$ over $\{x : \|x\|_2 \le 1\} \times [-1, 1]$ such that $\|w^\star\|_2 \le 1$, and any classification algorithm that can observe at most $d - 1$ attributes of each training example of a training set $S$ cannot output a classifier $\hat w$ such that $\mathbb{E}_S[\mathcal{L}_D(\hat w)] < \mathcal{L}_D(w^\star) + \varepsilon_0$.

## 3 Lower Bounds

In this section we discuss our lower bounds for regression with missing attributes. As a warm-up, we first prove Theorem 1 for regression with the squared loss. While the proof is very simple, it illustrates some of the main ideas used in all of our lower bounds. Then, we give a proof of Theorem 3 for regression with the absolute loss.
The proofs of the remaining bounds are deferred to the supplementary material.

### 3.1 Lower bounds for the squared loss

*Proof of Theorem 1.* It is enough to prove the theorem for deterministic learning algorithms, namely, algorithms that do not use any external randomization (i.e., any randomization besides the random samples drawn from the data distribution itself). This is because any randomized algorithm can be thought of as a distribution over deterministic algorithms, and this distribution is independent of the data distribution.

Now, suppose $0 < \varepsilon < \frac{1}{32}$. Let $X_1 = \{(0, 0), (1, 1)\}$ and $X_2 = \{(0, 1), (1, 0)\}$, and let $D_1$ and $D_2$ be the uniform distributions over $X_1 \times \{1\}$ and $X_2 \times \{1\}$, respectively. The main observation is that any learner that can observe at most one attribute of each example cannot distinguish between the two distributions with probability greater than $\frac{1}{2}$, no matter how many samples it is given. This is because the marginal distributions of the individual attributes under both $D_1$ and $D_2$ are exactly the same. Thus, to prove the theorem it is enough to show that the sets of $\varepsilon$-optimal solutions under the distributions $D_1$ and $D_2$ are disjoint. Indeed, suppose that there is a learning algorithm that emits a vector $\hat w$ such that $\mathbb{E}[\mathcal{L}_D(\hat w) - \mathcal{L}_D(w^\star)] < \varepsilon/2$ (where the expectation is over the random samples from $D$ used by the algorithm). By Markov's inequality, it holds that $\mathcal{L}_D(\hat w) < \mathcal{L}_D(w^\star) + \varepsilon$ with probability $> \frac{1}{2}$. Hence, the output of the algorithm allows one to distinguish between the two distributions with probability $> \frac{1}{2}$, contradicting the indistinguishability property.

We now set out to characterize the sets of $\varepsilon$-optimal solutions under $D_1$ and $D_2$. For $D_1$ we have
$$\mathcal{L}_{D_1}(w) = \frac{1}{2} \sum_{x \in X_1} \frac{1}{2}(w^\top x - 1)^2 = \frac{1}{4}(w_1 + w_2 - 1)^2 + \frac{1}{4},$$
while for $D_2$,
$$\mathcal{L}_{D_2}(w) = \frac{1}{2} \sum_{x \in X_2} \frac{1}{2}(w^\top x - 1)^2 = \frac{1}{4}(w_1 - 1)^2 + \frac{1}{4}(w_2 - 1)^2.$$
Note that the set of $\varepsilon$-optimal regressors for $\mathcal{L}_{D_1}$ is $S_1 = \{w : |w^\top \mathbf{1} - 1| \le 2\sqrt{\varepsilon}\}$, whereas for $\mathcal{L}_{D_2}$ the set is $S_2 = \{w : \|w - \mathbf{1}\|_2 \le 2\sqrt{\varepsilon}\}$. Let $S_2' = \{w : |w^\top \mathbf{1} - 2| \le 2\sqrt{2\varepsilon}\}$. Then $S_2 \subseteq S_2'$, so it is sufficient to show that $S_1$ and $S_2'$ are disjoint. Since $\varepsilon < \frac{1}{32}$, for any $w \in S_1$ we have $|w^\top \mathbf{1} - 1| < \frac{1}{2}$, meaning $w^\top \mathbf{1} < \frac{3}{2}$. However, for any $w \in S_2'$ we have $|w^\top \mathbf{1} - 2| < \frac{1}{2}$, meaning $w^\top \mathbf{1} > \frac{3}{2}$, and so no $w$ can be a member of both $S_1$ and $S_2'$; in particular, $S_1$ and $S_2$ are disjoint. As we argued earlier, this suffices to prove the theorem. $\square$

### 3.2 Lower bounds for the absolute loss

As in the proof of Theorem 1, the main idea is to design two distributions that are indistinguishable to a learner who can observe no more than $d - 1$ attributes of any sample drawn from the distribution (i.e., their marginals over any choice of $d - 1$ attributes are identical), but whose respective sets of $\varepsilon$-optimal regressors are disjoint. However, in contrast to Theorem 1, both handling general $d$ and switching to the absolute loss introduce additional complexities into the proof that require different techniques.

We start by constructing these two distributions $D_1$ and $D_2$. Let
$$X_1 = \{x \in \{0, 1\}^d : \|x\|_1 \equiv 0 \pmod{2}\} \quad \text{and} \quad X_2 = \{x \in \{0, 1\}^d : \|x\|_1 \equiv 1 \pmod{2}\},$$
and let $D_1$ and $D_2$ be uniform over $X_1 \times \{1\}$ and $X_2 \times \{1\}$, respectively. From this construction, it is not hard to see that for any choice of $k \le d - 1$ attributes, the marginals of both distributions over those $k$ attributes are identical: both are the uniform distribution over $k$ bits. Thus, the distributions $D_1$ and $D_2$ are indistinguishable to a learner that can observe at most $d - 1$ attributes of each example.
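As a quick sanity check on this construction (not part of the paper), the following short Python enumeration verifies, for a small even $d$, that the uniform distributions over $X_1$ and $X_2$ induce identical marginals on every subset of at most $d - 1$ coordinates.

```python
from itertools import combinations, product
from collections import Counter

def parity_class(d, parity):
    """All x in {0,1}^d whose Hamming weight has the given parity."""
    return [x for x in product((0, 1), repeat=d) if sum(x) % 2 == parity]

def marginal(points, coords):
    """Marginal of the uniform distribution over `points`, restricted to `coords`."""
    counts = Counter(tuple(x[i] for i in coords) for x in points)
    return {pattern: c / len(points) for pattern, c in counts.items()}

d = 6
X1 = parity_class(d, 0)   # support of D_1 (even parity)
X2 = parity_class(d, 1)   # support of D_2 (odd parity)

# The marginals over any k <= d-1 coordinates coincide (both are uniform on k bits),
# so a learner seeing at most d-1 attributes per example cannot tell D_1 from D_2.
for k in range(1, d):
    for coords in combinations(range(d), k):
        assert marginal(X1, coords) == marginal(X2, coords)
print("all marginals on at most d-1 coordinates agree")
```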
Let $\ell(w^\top x - y) = |w^\top x - y|$, and let
$$\mathcal{L}_{D_1}(w) = \mathbb{E}_{(x,y) \sim D_1}\big[\ell(w^\top x - y)\big] = \frac{1}{2^{d-1}} \sum_{x \in X_1} |w^\top x - 1|, \qquad \mathcal{L}_{D_2}(w) = \mathbb{E}_{(x,y) \sim D_2}\big[\ell(w^\top x - y)\big] = \frac{1}{2^{d-1}} \sum_{x \in X_2} |w^\top x - 1|.$$
It turns out that the subgradients of $\mathcal{L}_{D_1}(w)$ and $\mathcal{L}_{D_2}(w)$, which we denote by $\partial\mathcal{L}_{D_1}(w)$ and $\partial\mathcal{L}_{D_2}(w)$ respectively, can be expressed precisely. In fact, the full subgradient set at every point in the domain can be made explicit for both functions. With these representations in hand, we can show that $w_1^\star = \frac{2}{d}\mathbf{1}_d$ and $w_2^\star = \frac{2}{d+2}\mathbf{1}_d$ are minimizers of $\mathcal{L}_{D_1}(w)$ and $\mathcal{L}_{D_2}(w)$, respectively.

In fact, using the subgradient sets we can prove a much stronger property of the expected losses $\mathcal{L}_{D_1}$ and $\mathcal{L}_{D_2}$, akin to a directional strong convexity property around their respective minimizers. The geometric idea behind this property is shown in Figure 1, whereby $\mathcal{L}_D$ is lower bounded by an absolute value function.

Figure 1: Geometric intuition for Lemmas 6 and 7. The lower-bounding absolute value function acts as a relaxation of the true expected loss $\mathcal{L}_D$ (depicted as a cone).

**Lemma 6.** Let $w_1^\star = \frac{2}{d}\mathbf{1}_d$. For any $w \in \mathbb{R}^d$ we have $\mathcal{L}_{D_1}(w) \ge \mathcal{L}_{D_1}(w_1^\star) + \frac{1}{30\sqrt{d}}\,|\mathbf{1}^\top w - 2|$.

**Lemma 7.** Let $w_2^\star = \frac{2}{d+2}\mathbf{1}_d$. For any $w \in \mathbb{R}^d$ we have $\mathcal{L}_{D_2}(w) \ge \mathcal{L}_{D_2}(w_2^\star) + \frac{1}{30\sqrt{d}}\,\big|\mathbf{1}^\top w - \frac{2d}{d+2}\big|$.

Given Lemmas 6 and 7, the proof of Theorem 3 is immediate.

*Proof of Theorem 3.* As a direct consequence of Lemmas 6 and 7, we obtain that the sets
$$S_1 = \Big\{ w : |\mathbf{1}^\top w - 2| \le 30\sqrt{d}\,\varepsilon \Big\}, \qquad S_2 = \Big\{ w : \Big|\mathbf{1}^\top w - \tfrac{2d}{d+2}\Big| \le 30\sqrt{d}\,\varepsilon \Big\}$$
contain the sets of $\varepsilon$-optimal regressors of $\mathcal{L}_{D_1}(w)$ and $\mathcal{L}_{D_2}(w)$, respectively. All that is needed now is to show a separation of the $\varepsilon$-optimal sets for $0 < \varepsilon < \frac{1}{60} d^{-3/2}$, and this is done by showing a separation of the more manageable sets $S_1$ and $S_2$. Indeed, fix $0 < \varepsilon < \frac{1}{60} d^{-3/2}$ and observe that for any $w \in S_1$ we have
$$\mathbf{1}^\top w \ge 2 - 30\sqrt{d}\,\varepsilon > 2 - \frac{1}{2d},$$
and so, for $d \ge 4$,
$$\mathbf{1}^\top w > 2 - \frac{1}{d+2} = \frac{2d+3}{d+2}.$$
On the other hand, for any $w \in S_2$ we have
$$\mathbf{1}^\top w \le \frac{2d}{d+2} + 30\sqrt{d}\,\varepsilon < \frac{2d}{d+2} + \frac{1}{2d} < \frac{2d}{d+2} + \frac{1}{d+2} = \frac{2d+1}{d+2}.$$
We see that no $w$ can belong to both $S_1$ and $S_2$, so these sets are disjoint. Theorem 3 follows by the same reasoning used to conclude the proof of Theorem 1. $\square$

It remains to prove Lemmas 6 and 7. As the proofs are very similar, we only prove Lemma 6 here and defer the proof of Lemma 7 to the supplementary material.

*Proof of Lemma 6.* We first write
$$\partial\mathcal{L}_{D_1}(w) = \frac{1}{2^{d-1}} \sum_{x \in X_1} \partial\,|w^\top x - 1| = \frac{1}{2^{d-1}} \sum_{x \in X_1} \operatorname{sign}(w^\top x - 1)\, x,$$
where $\operatorname{sign}(0)$ can be any number in $[-1, 1]$. At $w_1^\star = \frac{2}{d}\mathbf{1}_d$ we have $w_1^{\star\top} x = \frac{2}{d}\|x\|_1$, and hence
$$\partial\mathcal{L}_{D_1}(w_1^\star) = \frac{1}{2^{d-1}} \Bigg( \sum_{x \in X_1,\, \|x\|_1 = d/2} \operatorname{sign}(0)\, x \;+ \sum_{x \in X_1,\, \|x\|_1 > d/2} x \;- \sum_{x \in X_1,\, \|x\|_1 < d/2} x \Bigg).$$
Next, we compute the difference
$$\sum_{x \in X_1,\, \|x\|_1 > d/2} x \;- \sum_{x \in X_1,\, \|x\|_1 < d/2} x,$$
which by symmetry across coordinates is a nonnegative multiple of $\mathbf{1}_d$; the computation uses the elementary identity $\sum_{i=0}^{k} (-1)^i \binom{n}{i} = (-1)^k \binom{n-1}{k}$, which we prove in Lemma 9 in the supplementary material.

Now, let $\bar{\mathcal{X}} = \{x \in X_1 : \|x\|_1 = \frac{d}{2}\}$, let $m = |\bar{\mathcal{X}}|$, and let $\bar{X} = [x_1, \ldots, x_m] \in \mathbb{R}^{d \times m}$ be the matrix whose columns are the elements of $\bar{\mathcal{X}}$. Then we may express the entire subgradient set explicitly as
$$\partial\mathcal{L}_{D_1}(w_1^\star) = \left\{ \frac{1}{2^{d-1}} \left( \bar{X} r \;+ \sum_{x \in X_1,\, \|x\|_1 > d/2} x \;- \sum_{x \in X_1,\, \|x\|_1 < d/2} x \right) \;:\; r \in [-1, 1]^m \right\}.$$
Thus, any choice of $r \in [-1, 1]^m$ results in a specific subgradient of $\mathcal{L}_{D_1}(w_1^\star)$. Consider two such choices: $r_1 = 0$ and $r_2 = -\mathbf{1}_m$. Note that $\bar{X} r_1 = 0$ and $\bar{X} r_2 = -\binom{d-1}{d/2-1}\mathbf{1}_d$; to see the last equality, consider any fixed coordinate $i$ and notice that the number of elements of $\bar{\mathcal{X}}$ with a non-zero value in the $i$-th coordinate is equal to the number of ways to choose the remaining $\frac{d}{2} - 1$ non-zero coordinates from the other $d - 1$ coordinates. We then observe that the corresponding subgradients $h^+$ (from $r_1$) and $h^-$ (from $r_2$) are multiples of $\mathbf{1}_d$ of opposite signs. Since the set of subgradients of $\mathcal{L}_{D_1}(w_1^\star)$ is a convex set, by taking an appropriate convex combination of $h^+$ and $h^-$ it follows that $0 \in \partial\mathcal{L}_{D_1}(w_1^\star)$, and so we see that $w_1^\star$ is indeed a minimizer of $\mathcal{L}_{D_1}(w)$.

Given this handle on the subgradient set, we now show that the coefficients of $h^+$ and $h^-$ are bounded away from zero by an inverse polynomial in $d$. Using the Stirling bounds $\sqrt{2\pi n}\,(n/e)^n \le n! \le e\sqrt{n}\,(n/e)^n$, one can verify that the coordinates of $h^+$ and of $-h^-$ are all at least $\frac{1}{30\sqrt{d}}$. Define $h = \frac{1}{30\sqrt{d}}\,\mathbf{1}_d$.
Since $h$ can be written as a convex combination of $h^+$ and $0$ (both of which lie in the convex set $\partial\mathcal{L}_{D_1}(w_1^\star)$), we see that $h \in \partial\mathcal{L}_{D_1}(w_1^\star)$. Similarly, since $-h$ can be written as a convex combination of the vectors $h^-$ and $0$ in the subgradient set, we may conclude that $-h \in \partial\mathcal{L}_{D_1}(w_1^\star)$ as well. By the subgradient inequality it follows that, for all $w \in \mathbb{R}^d$,
$$\mathcal{L}_{D_1}(w) \ge \mathcal{L}_{D_1}(w_1^\star) + h^\top(w - w_1^\star) \qquad \text{and} \qquad \mathcal{L}_{D_1}(w) \ge \mathcal{L}_{D_1}(w_1^\star) - h^\top(w - w_1^\star),$$
which taken together imply that
$$\mathcal{L}_{D_1}(w) \ge \mathcal{L}_{D_1}(w_1^\star) + \big|h^\top(w - w_1^\star)\big| = \mathcal{L}_{D_1}(w_1^\star) + \frac{1}{30\sqrt{d}}\,|\mathbf{1}^\top w - 2|,$$
as required. $\square$

## 4 General Algorithm for Limited Precision

Although we have established limits on the attainable precision for some learning problems, there remains the possibility of reaching this limit. In this section we provide a general algorithm whereby a learner that can observe $k < d$ attributes can always achieve an expected excess loss of $O(\sqrt{1 - k/d})$. We provide the pseudo-code in Algorithm 1. Although similar to the AERR algorithm of Hazan and Koren [6], which is designed to work only with the squared loss, Algorithm 1 avoids the need for an unbiased gradient estimator by replacing the original loss function with a slightly biased one. As long as the new loss function is chosen carefully (and the functions are Lipschitz bounded), and given enough samples, the algorithm can return a regressor of limited precision. This is in contrast to AERR, whereby an arbitrarily precise regressor can always be achieved with enough samples.

Formally, for Algorithm 1 we prove the following (the proof is given in the supplementary material).

**Theorem 8.** Let $\ell : \mathbb{R} \to \mathbb{R}$ be an $H$-Lipschitz function defined over $[-2B, 2B]$. Assume the distribution $D$ is such that $\|x\|_2 \le 1$ and $|y| \le B$ with probability $1$. Let $\tilde{B} = \max\{B, 1\}$, and let $\hat w$ be the output of Algorithm 1 when run with $\eta = \frac{2\tilde{B}}{G\sqrt{m}}$ (one may take $G = H$, since $\ell$ is $H$-Lipschitz and $\|\tilde{x}_t\|_2 \le 1$). Then $\|\hat w\|_2 \le B$, and for any $w^\star \in \mathbb{R}^d$ with $\|w^\star\|_2 \le B$,
$$\mathbb{E}[\mathcal{L}_D(\hat w)] \le \mathcal{L}_D(w^\star) + \frac{2H\tilde{B}}{\sqrt{m}} + 2H\tilde{B}^2\sqrt{1 - \tfrac{k}{d}}.$$

**Algorithm 1** General algorithm for regression/classification with missing attributes
Input: loss function $\ell$, training set $S = \{(x_t, y_t)\}_{t \in [m]}$, $k$, $B$, $\eta > 0$
Output: regressor $\hat w$ with $\|\hat w\|_2 \le B$
1: Initialize $w_1 \ne 0$ with $\|w_1\|_2 \le B$ arbitrarily
2: for $t = 1$ to $m$ do
3: Uniformly choose a subset of $k$ indices $\{i_{t,r}\}_{r \in [k]}$ from $[d]$ without replacement
4: Set $\tilde{x}_t = \sum_{r=1}^{k} x_t[i_{t,r}]\, e_{i_{t,r}}$
5: Regression case:
6: Choose $\hat\phi_t \in \partial\ell(w_t^\top \tilde{x}_t - y_t)$
7: Classification case:
8: Choose $\hat\phi_t \in \partial\ell(y_t\, w_t^\top \tilde{x}_t)$
9: Update $w_{t+1} = \frac{B}{\max\{\|w_t - \eta(\hat\phi_t \tilde{x}_t)\|_2,\, B\}}\,\big(w_t - \eta(\hat\phi_t \tilde{x}_t)\big)$
10: end for
11: Return $\hat w = \frac{1}{m}\sum_{t=1}^{m} w_t$

In particular, for $m \ge d/(d-k)$ we have
$$\mathbb{E}[\mathcal{L}_D(\hat w)] \le \mathcal{L}_D(w^\star) + 4H\tilde{B}^2\sqrt{1 - \tfrac{k}{d}},$$
and so when the learner observes $k = d - 1$ attributes, the expected loss is $O(1/\sqrt{d})$-away from the optimum. (An illustrative implementation sketch of the regression case of Algorithm 1 is given after Section 5.)

## 5 Conclusions and Future Work

In the limited attribute observation setting, we have shown information-theoretic lower bounds for several variants of regression, proving that a distribution-independent algorithm for regression with the absolute loss that attains arbitrarily small error $\varepsilon$ cannot exist, and closing the gap for Ridge regression as suggested by Hazan and Koren [6]. We have also shown that the proof technique applied to regression with the absolute loss can be extended to give a similar bound for classification with the hinge loss. In addition, we have described a general purpose algorithm which complements these results by providing a means of achieving error up to a certain precision limit. An interesting possibility for future work would be to bridge the gap between the upper and lower bounds on the precision limits, particularly in the case of the exponential gap for classification with the hinge loss. Another direction would be to develop a more comprehensive understanding of these lower bounds for more general loss functions, one example being classification with the logistic loss.
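Below is a minimal Python sketch of the regression case of Algorithm 1 with the absolute loss, as referenced at the end of Section 4. It is an illustration under the assumptions of Theorem 8 ($\|x\|_2 \le 1$, $|y| \le B$), not the authors' reference implementation; the synthetic data generator and all names are assumptions made for the example.

```python
import numpy as np

def algorithm1_regression(S, k, B, eta, loss_subgrad):
    """Sketch of Algorithm 1 (regression case).

    S            : list of (x, y) pairs, x a length-d numpy array with ||x||_2 <= 1
    k            : number of attributes observed per example
    B            : radius of the l2 ball over which we learn
    eta          : step size (Theorem 8 uses eta = 2*max(B,1)/(G*sqrt(m)))
    loss_subgrad : subgradient of the scalar loss, e.g. np.sign for the absolute loss
    """
    rng = np.random.default_rng(0)
    d = len(S[0][0])
    w = np.zeros(d)                # a starting point inside the l2 ball of radius B
    w_sum = np.zeros(d)
    for x, y in S:
        w_sum += w                 # accumulate iterates for the final average
        # Observe only k attributes, chosen uniformly without replacement.
        idx = rng.choice(d, size=k, replace=False)
        x_tilde = np.zeros(d)
        x_tilde[idx] = x[idx]
        # Subgradient of the (slightly biased) loss on the censored example.
        phi = loss_subgrad(w @ x_tilde - y)
        # Gradient step followed by projection onto the l2 ball of radius B.
        w = w - eta * phi * x_tilde
        w *= B / max(np.linalg.norm(w), B)
    return w_sum / len(S)          # averaged regressor w_hat

# Illustrative usage on synthetic data (hypothetical distribution).
rng = np.random.default_rng(1)
d, m, B = 20, 5000, 1.0
w_true = np.ones(d) / np.sqrt(d)
X = rng.uniform(-1.0, 1.0, size=(m, d)) / np.sqrt(d)   # ensures ||x||_2 <= 1
Y = np.clip(X @ w_true, -B, B)
S = list(zip(X, Y))
eta = 2 * max(B, 1.0) / (1.0 * np.sqrt(m))             # G = H = 1 for the absolute loss
w_hat = algorithm1_regression(S, k=d - 1, B=B, eta=eta, loss_subgrad=np.sign)
print("average absolute loss of w_hat:", np.mean(np.abs(X @ w_hat - Y)))
```

Note that the pseudo-code initializes at an arbitrary nonzero point of the ball, whereas the sketch simply starts at the origin.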
## References

[1] S. Ben-David and E. Dichterman. Learning with restricted focus of attention. Journal of Computer and System Sciences, 56(3):277–298, 1998.
[2] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Efficient learning with partially observed attributes. In Proceedings of the 27th International Conference on Machine Learning, 2010.
[3] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Online learning of noisy data. IEEE Transactions on Information Theory, 57(12):7907–7931, 2011.
[4] O. Dekel, O. Shamir, and L. Xiao. Learning to classify with missing and corrupted features. Machine Learning Journal, 81(2):149–178, 2010.
[5] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.
[6] E. Hazan and T. Koren. Linear regression with limited observation. In Proceedings of the 29th International Conference on Machine Learning (ICML'12), Edinburgh, Scotland, UK, 2012.
[7] E. Hazan, R. Livni, and Y. Mansour. Classification with low rank and missing data. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[8] D. Kukliansky and O. Shamir. Attribute efficient linear regression with data-dependent sampling. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[9] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data, 2nd Edition. Wiley-Interscience, 2002.
[10] P.-L. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, 2011.
[11] A. Rostamizadeh, A. Agarwal, and P. Bartlett. Learning with missing features. In The 27th Conference on Uncertainty in Artificial Intelligence, 2011.
[12] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[13] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 2003.