# Distributionally Robust Local Non-parametric Conditional Estimation

Viet Anh Nguyen, Fan Zhang, José Blanchet (Stanford University, United States), {viet-anh.nguyen, fzh, jose.blanchet}@stanford.edu
Erick Delage (HEC Montréal, Canada), erick.delage@hec.ca
Yinyu Ye (Stanford University, United States), yinyu-ye@stanford.edu

Abstract. Conditional estimation given specific covariate values (i.e., local conditional estimation or functional estimation) is ubiquitously useful, with applications in engineering, social and natural sciences. Existing data-driven non-parametric estimators mostly focus on structured homogeneous data (e.g., weakly independent and stationary data); thus they are sensitive to adversarial noise and may perform poorly under a low sample size. To alleviate these issues, we propose a new distributionally robust estimator that generates non-parametric local estimates by minimizing the worst-case conditional expected loss over all adversarial distributions in a Wasserstein ambiguity set. We show that despite being generally intractable, the local estimator can be efficiently found via convex optimization under broadly applicable settings, and that it is robust to the corruption and heterogeneity of the data. Experiments with synthetic and MNIST datasets show the competitive performance of this new class of estimators.

## 1 Introduction

We consider the estimation of conditional statistics of a response variable $Y \in \mathbb{R}^m$ given the value of a predictor or covariate $X \in \mathbb{R}^n$. The single most important instance of these types of problems involves estimating the conditional mean, also known as the regression function. Under finite variance assumptions, the conditional mean $\mathbb{E}_{\mathbb{P}}[Y \mid X = x_0]$ is technically defined as $\psi^\star(x_0)$ for some measurable function $\psi^\star$ that solves the minimum mean square error problem
$$\min_{\psi}\ \mathbb{E}_{\mathbb{P}}\big[\|Y - \psi(X)\|_2^2\big],$$
where the minimization is taken over the space of all measurable functions from $\mathbb{R}^n$ to $\mathbb{R}^m$. While the optimal solution $\psi^\star$ is unique up to sets of $\mathbb{P}$-measure zero, solving for $\psi^\star$ is unfortunately challenging because it is an infinite-dimensional optimization problem. The regression function $\psi^\star$ can be efficiently found only under specific settings, for example, if one assumes that $(X, Y)$ follows a jointly Gaussian distribution. However, these specific situations are overly restrictive in practice.

In order to bypass the infinite-dimensional challenge involved in directly computing $\psi^\star$, we may instead consider a family of optimization problems that are parametrized by $x_0$. More specifically, in the presence of a regular conditional distribution, the conditional mean $\mathbb{E}_{\mathbb{P}}[Y \mid X = x_0]$ can be estimated pointwise by $\hat\beta$ defined as
$$\hat\beta \in \arg\min_{\beta}\ \mathbb{E}_{\mathbb{P}}\big[\|Y - \beta\|_2^2 \mid X = x_0\big]$$
for any covariate value $x_0$ of interest. This presents the challenge of effectively accessing the conditional distribution, which is particularly difficult if the event $X = x_0$ has $\mathbb{P}$-probability zero. Using an analogous argument, if we are interested in the conditional $(\tau \times 100)\%$-quantile of $Y$ given $X$, then this conditional statistic can be estimated pointwise at any location $x_0$ of interest by
$$\hat\beta \in \arg\min_{\beta}\ \mathbb{E}_{\mathbb{P}}\big[\max\{-\tau(Y - \beta),\ (1-\tau)(Y - \beta)\} \mid X = x_0\big].$$
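As a quick numerical sanity check of this pointwise view (a toy NumPy illustration, not code from the paper), minimizing the empirical squared loss over $\beta$ recovers the sample mean, and minimizing the empirical pinball loss recovers a sample quantile:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=2_000)     # i.i.d. draws standing in for Y
betas = np.linspace(-1.0, 5.0, 601)                # candidate estimates beta
resid = y[None, :] - betas[:, None]                # Y - beta for every candidate

# Squared loss: the minimizing beta approximates E[Y].
sq_loss = (resid ** 2).mean(axis=1)
print(betas[np.argmin(sq_loss)], y.mean())         # both close to 2.0

# Pinball loss max{-tau (Y - beta), (1 - tau)(Y - beta)} as written above:
tau = 0.8
pin_loss = np.maximum(-tau * resid, (1 - tau) * resid).mean(axis=1)
print(betas[np.argmin(pin_loss)], np.quantile(y, 1 - tau))  # minimizer sits at a quantile of Y
```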
The previous examples illustrate that the estimation of a wide range of conditional statistics can be recast into solving a family of finite-dimensional optimization problems parametrized by $x_0$:
$$\min_{\beta}\ \mathbb{E}_{\mathbb{P}}\big[\ell(Y, \beta) \mid X = x_0\big] \qquad (1)$$
with an appropriately chosen statistical loss function $\ell$. Problem (1) poses several challenges, some of which were alluded to earlier. First, it requires integration with respect to a conditional probability distribution that is difficult to compute. Second, the probability measure $\mathbb{P}$ is generally unknown, hence we lack a fundamental input to solve (1). Finally, in a data-driven setting, there may be few, or even no, observations with covariate value $X = x_0$.

To alleviate these difficulties, our formulation, as we shall explain, involves two features. First, we consider a relaxation of problem (1) in which the event $X = x_0$ is replaced by a neighborhood $\mathcal{N}_\gamma(x_0)$ of a suitable radius $\gamma \ge 0$ around $x_0$. Second, we introduce a data-driven distributionally robust optimization (DRO) formulation (e.g., [8, 5, 21]) in order to mitigate the problem that $\mathbb{P}$ is unknown. In turn, the DRO formulation involves a novel class of conditional ambiguity sets which copes with the underlying conditional distribution being unknown. In particular, we propose the following distributionally robust local conditional estimation problem
$$\min_{\beta}\ \sup_{\mathbb{Q} \in \mathbb{B}_\rho^\infty,\ \mathbb{Q}(X \in \mathcal{N}_\gamma(x_0)) > 0}\ \mathbb{E}_{\mathbb{Q}}\big[\ell(Y, \beta) \mid X \in \mathcal{N}_\gamma(x_0)\big], \qquad (2)$$
where the maximization is taken over all probability measures $\mathbb{Q}$ that are within distance $\rho$, in the $\infty$-Wasserstein sense, of a benchmark nominal model, which often corresponds to the empirical distribution of the available data. The probability measures $\mathbb{Q}$ are constrained so that $\mathbb{Q}(X \in \mathcal{N}_\gamma(x_0)) > 0$ to eliminate the complication of conditioning on a set of measure zero.

Contributions. Resting on formulation (2), our main contributions are summarized as follows.

1. We introduce a novel paradigm of non-parametric local conditional estimation based on distributionally robust optimization. In contrast to classical non-parametric conditional estimators, our new class of estimators is endowed by design with robustness features. These estimators are structurally built to mitigate the impact of model contamination and therefore may be reasonably applied to heterogeneous data (e.g., non-i.i.d. input).
2. We demonstrate that when the ambiguity set is a type-$\infty$ Wasserstein ball around the empirical measure, the proposed min-max estimation problem can be efficiently solved in many applicable settings, including notably local conditional mean and quantile estimation.
3. We show that this class of type-$\infty$ Wasserstein local conditional estimators can be considered as a systematic robustification of the k-nearest neighbor estimator. We also provide further insights on the statistical properties of our approach and empirical evidence, with both synthetic and real data sets, that our approach can provide more accurate estimates in practically relevant settings.
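Before turning to the related literature, the following minimal sketch (our illustration; it assumes a Euclidean $D_{\mathcal X}$ and the squared loss) makes the neighborhood relaxation in the first feature concrete: it implements the non-robust special case of (2) with $\rho = 0$, which simply averages the responses whose covariates fall in $\mathcal{N}_\gamma(x_0)$.

```python
import numpy as np

def local_mean_estimate(X, Y, x0, gamma):
    """Non-robust local conditional mean: problem (1) with the event X = x0
    relaxed to the neighborhood N_gamma(x0) = {x : ||x - x0||_2 <= gamma} and
    P replaced by the empirical distribution (the rho = 0 special case of (2))."""
    in_nbhd = np.linalg.norm(X - x0, axis=1) <= gamma
    if not in_nbhd.any():
        raise ValueError("empty neighborhood: increase gamma")
    # For the squared loss, the minimizer over beta is the average response.
    return Y[in_nbhd].mean(axis=0)

# Toy usage on simulated data.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
Y = X.sum(axis=1, keepdims=True) + 0.1 * rng.normal(size=(200, 1))
print(local_mean_estimate(X, Y, x0=np.array([0.5, 0.5]), gamma=0.2))
```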
Related work. One can argue that every single prediction task in machine learning ultimately relates to conditional estimation, so attempting to provide a full literature survey on non-parametric conditional estimation is an impossible task. Since our contribution is primarily the introduction of a novel conceptual paradigm powered by DRO, we focus on discussing well-understood estimators that encompass most of the conceptual ideas used to mitigate the challenges exposed earlier.

The challenges of conditioning on zero-probability events and the fact that $x_0$ may not be part of the sample are addressed based on the ideas of averaging around a neighborhood of the point of interest and smoothing. This gives rise to estimators such as k-NN (see, for example, [9]) and kernel estimators, including, for instance, the Nadaraya-Watson estimator [28, 38] and the Epanechnikov estimator [10], among others. Additional averaging methods include, for example, random forests [6] and Classification and Regression Trees (CARTs) [7]; see also [16] for other techniques. These averaging and smoothing ideas are well understood, leading to the optimal selection (in a suitable sense) of the kernel along with the associated tuning parameters such as the bandwidth size. These choices are then used to deal with the ignorance of the true data-generating distribution by assuming a certain degree of homogeneity in the data, such as stationarity and weak dependence, in order to guarantee consistency and recovery of the underlying generating model. However, none of these estimators is directly designed to cope with the problem of general (potentially adversarial) data contamination.

The latter issue, revolving around the evaluation of an unknown conditional probability model, is connected with robustness, another classical topic in statistics [17]. Much of the classical literature on robustness focuses on the impact of outliers. The work of [41] studies robust-against-outliers kernel regression, which enjoys asymptotic consistency and normality under i.i.d. assumptions in a setting where the data contamination becomes negligible. In contrast to this type of contamination, our estimators are designed to be min-max optimal in the DRO sense by supplying the best response against a large (non-parametric) class of adversarial contaminations.

Our results can also be seen as connected to adversarial training, which has received a significant amount of attention in recent years [15, 22, 25, 35, 33, 31]. Existing robustifications of nearest neighbors, and of non-parametric classifiers in general, can be streamlined into two main strategies: i) global approaches that modify the whole training dataset, e.g., adversarial pruning [37, 40, 4], and ii) local approaches that study well-crafted attacks and seek appropriate defenses for specific classifiers such as 1-NN [20, 36, 24]. Following this line of ideas, one can interpret our approach as a novel method to train conditional estimators against adversarial attacks. The difference, in the k-NN estimation setting for example, is that our attacks are optimal in a distributional sense. Our proposed estimator is thus provably the best against a uniform class of distributional attacks. Compared to the current literature, we believe that our approach is also more general in two significant ways: first, we start from a generic min-max estimation problem, and our ideas and methodology are easily applicable to other non-parametric settings; second, we allow for perturbations on $Y$ to hedge against label contamination.

DRO-based estimators have generated a great deal of interest because they possess various desirable properties in connection to various forms of regularization (e.g., variance [29], norm [32], shrinkage [30]). The tools that we employ are related to those currently being investigated. Our formulation considers adversarial perturbations based on the Wasserstein distance [26, 5, 13, 21].
In particular, the type-$\infty$ Wasserstein distance [14] has recently been applied in DRO formulations [1, 3, 39]. The work of [2] considers adversarial conditional estimation: it takes as input various classical estimators (e.g., k-NN, kernel methods, etc.) and proposes a robustification approach that considers only perturbations of the response variable. Our method, in contrast, allows perturbations of both the covariate and the response variables, which is technically more subtle because of the local conditioning problem. Relative to this DRO robustification of k-NN, our numerical experiments in Section 4 show substantial benefits of our local conditioning approach, especially in dealing with non-homogeneity and sharp variations in the underlying density. Our proposed framework is also relevant to the emerging stream of decision making with side information, where recent approaches rely on sample average approximation [19], decision forests [18] and probability trimmings [11].

Notations. For any integer $M \in \mathbb{N}_+$, we denote by $[M]$ the set $\{1, \ldots, M\}$. For any set $\mathcal{S}$, $\mathcal{M}(\mathcal{S})$ is the space of all probability measures supported on $\mathcal{S}$.

## 2 Local Conditional Estimate using the Type-$\infty$ Wasserstein Ambiguity Set

We start by delineating the building blocks of our distributionally robust estimation problem (2). The nominal measure is set to the empirical distribution of the available data, $\hat{\mathbb{P}} = N^{-1}\sum_{i \in [N]} \delta_{(\hat x_i, \hat y_i)}$, where $\delta_{(\hat x, \hat y)}$ represents the Dirac distribution at $(\hat x, \hat y)$. The ambiguity set $\mathbb{B}_\rho^\infty$ is a Wasserstein ball around $\hat{\mathbb{P}}$ that contains the true distribution $\mathbb{P}$ with high confidence.

Definition 2.1 (Wasserstein distance). Let $D$ be a metric on $\Xi$. The type-$p$ ($1 \le p < +\infty$) Wasserstein distance between $\mathbb{Q}_1$ and $\mathbb{Q}_2$ is defined as
$$W_p(\mathbb{Q}_1, \mathbb{Q}_2) \triangleq \inf\Big\{ \mathbb{E}_\pi\big[D(\xi_1, \xi_2)^p\big]^{\frac{1}{p}} : \pi \in \Pi(\mathbb{Q}_1, \mathbb{Q}_2) \Big\},$$
where $\Pi(\mathbb{Q}_1, \mathbb{Q}_2)$ is the set of all probability measures on $\Xi \times \Xi$ with marginals $\mathbb{Q}_1$ and $\mathbb{Q}_2$, respectively. The type-$\infty$ Wasserstein distance is defined as the limit of $W_p$ as $p$ tends to $\infty$ and amounts to
$$W_\infty(\mathbb{Q}_1, \mathbb{Q}_2) \triangleq \inf\Big\{ \pi\text{-}\operatorname{ess\,sup}\big\{ D(\xi_1, \xi_2) : (\xi_1, \xi_2) \in \Xi \times \Xi \big\} : \pi \in \Pi(\mathbb{Q}_1, \mathbb{Q}_2) \Big\}.$$

We assume that $(X, Y)$ admits values in $\mathcal{X} \times \mathcal{Y} \subseteq \mathbb{R}^n \times \mathbb{R}^m$, and that the distance $D$ on $\mathcal{X} \times \mathcal{Y}$ is
$$D\big((x, y), (x', y')\big) = D_{\mathcal{X}}(x, x') + D_{\mathcal{Y}}(y, y') \qquad \forall (x, y), (x', y') \in \mathcal{X} \times \mathcal{Y},$$
where $D_{\mathcal{X}}$ and $D_{\mathcal{Y}}$ are continuous metrics on $\mathcal{X}$ and $\mathcal{Y}$, respectively. The joint ambiguity set $\mathbb{B}_\rho^\infty$ is now formally defined as a type-$\infty$ Wasserstein ball in the space of joint probability measures
$$\mathbb{B}_\rho^\infty \triangleq \big\{ \mathbb{Q} \in \mathcal{M}(\mathcal{X} \times \mathcal{Y}) : W_\infty(\mathbb{Q}, \hat{\mathbb{P}}) \le \rho \big\}.$$
We assume further that the compact neighborhood $\mathcal{N}_\gamma(x_0)$ around $x_0$ is prescribed using the distance $D_{\mathcal{X}}$ as $\mathcal{N}_\gamma(x_0) \triangleq \{x \in \mathcal{X} : D_{\mathcal{X}}(x, x_0) \le \gamma\}$, and that the loss function $\ell$ is jointly continuous in $y$ and $\beta$.

To solve the estimation problem (2), we study the worst-case conditional expected loss function
$$f(\beta) \triangleq \sup_{\mathbb{Q} \in \mathbb{B}_\rho^\infty,\ \mathbb{Q}(X \in \mathcal{N}_\gamma(x_0)) > 0} \mathbb{E}_{\mathbb{Q}}\big[\ell(Y, \beta) \mid X \in \mathcal{N}_\gamma(x_0)\big],$$
which corresponds to the inner maximization problem of (2). To ensure that the value $f(\beta)$ is well-defined, we first investigate the conditions under which the above supremum problem has a non-empty feasible set. Towards this end, for any set $\mathcal{N}_\gamma(x_0) \subseteq \mathcal{X}$, define the quantities $\kappa_{i,\gamma}$ as
$$0 \le \kappa_{i,\gamma} \triangleq \min_{x \in \mathcal{N}_\gamma(x_0)} D_{\mathcal{X}}(x, \hat x_i) + \inf_{y \in \mathcal{Y}} D_{\mathcal{Y}}(y, \hat y_i) \qquad \forall i \in [N]. \qquad (3)$$
The value $\kappa_{i,\gamma}$ signifies the unit cost of moving a point mass from an observation $(\hat x_i, \hat y_i)$ to the fiber set $\mathcal{N}_\gamma(x_0) \times \mathcal{Y}$. We also define $\hat x_i^p$ as the projection of $\hat x_i$ onto the neighborhood $\mathcal{N}_\gamma(x_0)$, which coincides with the optimal solution in the variable $x$ of the minimization problem in (3).
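For concreteness, here is a small sketch (our illustration; it assumes $D_{\mathcal{X}}$ is the Euclidean metric, $\mathcal{Y} = \mathbb{R}^m$ so that the infimum over $y$ in (3) vanishes, and $\mathcal{N}_\gamma(x_0)$ is the corresponding Euclidean ball) of the quantities $\kappa_{i,\gamma}$ and the projections $\hat x_i^p$:

```python
import numpy as np

def kappa_and_projection(X_hat, x0, gamma):
    """kappa_{i,gamma} from (3) and the projection of each observed covariate onto
    N_gamma(x0), assuming D_X is Euclidean and the inner infimum over y is zero."""
    diff = X_hat - x0                                 # shape (N, n)
    dist = np.linalg.norm(diff, axis=1)               # D_X(x0, x_i)
    kappa = np.maximum(dist - gamma, 0.0)             # distance from x_i to the ball
    # Points inside the ball project onto themselves; outside points move to the boundary.
    scale = np.where(dist > gamma, gamma / np.maximum(dist, 1e-12), 1.0)
    X_proj = x0 + scale[:, None] * diff
    return kappa, X_proj

# Any radius rho >= kappa.min() makes the supremum defining f well posed
# (cf. Proposition 2.2 below).
```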
The next proposition asserts that $f(\beta)$ is well-defined if the radius $\rho$ is sufficiently large.

Proposition 2.2 (Minimum radius). For any $x_0 \in \mathcal{X}$ and $\gamma \in \mathbb{R}_+$, there exists a distribution $\mathbb{Q} \in \mathbb{B}_\rho^\infty$ that satisfies $\mathbb{Q}(X \in \mathcal{N}_\gamma(x_0)) > 0$ if and only if $\rho \ge \min_{i \in [N]} \kappa_{i,\gamma}$.

We now proceed to the reformulation of $f(\beta)$. Let $\mathcal{I}$ be the index set defined as
$$\mathcal{I} \triangleq \big\{ i \in [N] : D_{\mathcal{X}}(x_0, \hat x_i) \le \rho + \gamma \big\}, \qquad (4a)$$
and $\mathcal{I}$ is decomposed further into two disjoint subsets
$$\mathcal{I}_1 = \big\{ i \in \mathcal{I} : D_{\mathcal{X}}(x_0, \hat x_i) + \rho \le \gamma \big\} \quad \text{and} \quad \mathcal{I}_2 = \mathcal{I} \setminus \mathcal{I}_1. \qquad (4b)$$
Intuitively speaking, $\mathcal{I}$ contains the indices of data points whose covariate $\hat x_i$ is sufficiently close to $x_0$ as measured by $D_{\mathcal{X}}$, and which are thus relevant to the local estimation problem. The index set $\mathcal{I}_1$ indicates the data points that lie strictly inside the neighborhood, while the set $\mathcal{I}_2$ contains those points that are in a boundary ring of width $\rho$ around the neighborhood $\mathcal{N}_\gamma(x_0)$. The value $f(\beta)$ can be efficiently computed in a quasi-closed form thanks to the following result.

Theorem 2.3 (Worst-case conditional expected loss computation). For any $\gamma \in \mathbb{R}_+$, suppose that $\rho \ge \min_{i \in [N]} \kappa_{i,\gamma}$. For any $\beta \in \mathcal{Y}$, let $v_i^\star(\beta)$ be defined as
$$v_i^\star(\beta) \triangleq \sup_{y_i}\ \big\{ \ell(y_i, \beta) : y_i \in \mathcal{Y},\ D_{\mathcal{Y}}(y_i, \hat y_i) \le \rho - D_{\mathcal{X}}(\hat x_i^p, \hat x_i) \big\} \qquad \forall i \in \mathcal{I}. \qquad (5)$$
The worst-case conditional expected loss is equal to $f(\beta) = \big(\sum_{i \in \mathcal{I}} \alpha_i\big)^{-1} \sum_{i \in \mathcal{I}} \alpha_i v_i^\star(\beta)$, where $\alpha_i$ admits the value
$$\alpha_i = \begin{cases} 1 & \text{if } i \in \mathcal{I}_1 \text{ or } \big(\mathcal{I}_1 = \emptyset \text{ and } v_i^\star(\beta) = \max_{j \in \mathcal{I}_2} v_j^\star(\beta)\big), \\[4pt] 1 & \text{if } v_i^\star(\beta) > \dfrac{\sum_{j \in \mathcal{I}_1} v_j^\star(\beta) + \sum_{j \in \mathcal{I}_2 : v_j^\star(\beta) > v_i^\star(\beta)} v_j^\star(\beta)}{|\mathcal{I}_1| + |\{ j \in \mathcal{I}_2 : v_j^\star(\beta) > v_i^\star(\beta) \}|}, \\[4pt] 0 & \text{otherwise.} \end{cases}$$

Figure 1: Illustration around the neighborhood of $x_0$ with $\rho < \gamma$. Black crosses are samples in the set $\mathcal{I}$.

If we possess an oracle that evaluates (5) at a cost $O$, then by Theorem 2.3, quantifying $f(\beta)$ reduces to calculating the $|\mathcal{I}|$ values $v_i^\star(\beta)$ and then sorting these values in order to determine the value of $\alpha$. Thus, computing $f(\beta)$ takes an amount of time of order $\mathcal{O}\big(|\mathcal{I}|(\log|\mathcal{I}| + O)\big)$. Moreover, $f(\beta)$ depends solely on the observations in the locality of $x_0$ whose indices belong to the index set $\mathcal{I}$, the cardinality of which can be substantially smaller than the total number of training samples $N$. If $\ell$ is a convex function in $\beta$, then a standard result from convex analysis implies that $f$, being a pointwise supremum of convex functions, is also convex. If $\mathcal{Y}$, and hence $\beta$, is unidimensional, a golden section search algorithm can be utilized to identify the local conditional estimate $\beta^\star$ that solves (2) in an amount of time of order $\mathcal{O}\big(\log(1/\epsilon)\,|\mathcal{I}|(\log|\mathcal{I}| + O)\big)$, where $\epsilon > 0$ is an arbitrary accuracy level. Fortunately, in the case of conditional mean and quantile estimation, we also have access to closed-form expressions for $v_i^\star(\beta)$ as long as $D_{\mathcal{Y}}$ is the absolute distance.

Corollary 2.4 (Value of $v_i^\star(\beta)$). Suppose that $\mathcal{Y} = [a, b] \subseteq [-\infty, +\infty]$ and $D_{\mathcal{Y}}(y_i, \hat y_i) = |y_i - \hat y_i|$.
(i) Conditional mean estimation: if $\ell(y, \beta) = (y - \beta)^2$, then for all $i \in \mathcal{I}$
$$v_i^\star(\beta) = \max\Big\{ \big(\max\{\hat y_i - \rho + D_{\mathcal{X}}(\hat x_i^p, \hat x_i),\ a\} - \beta\big)^2,\ \big(\min\{\hat y_i + \rho - D_{\mathcal{X}}(\hat x_i^p, \hat x_i),\ b\} - \beta\big)^2 \Big\}.$$
(ii) Conditional quantile estimation: if $\ell(y, \beta) = \max\{-\tau(y - \beta), (1-\tau)(y - \beta)\}$, then for all $i \in \mathcal{I}$
$$v_i^\star(\beta) = \max\Big\{ -\tau\big(\max\{\hat y_i - \rho + D_{\mathcal{X}}(\hat x_i^p, \hat x_i),\ a\} - \beta\big),\ (1-\tau)\big(\min\{\hat y_i + \rho - D_{\mathcal{X}}(\hat x_i^p, \hat x_i),\ b\} - \beta\big) \Big\}.$$
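The following sketch (ours, not the authors' released code) illustrates Theorem 2.3 and Corollary 2.4(i) in the univariate conditional-mean case, assuming $\mathcal{Y} = \mathbb{R}$ (so $a = -\infty$, $b = +\infty$), the absolute distance $D_{\mathcal{Y}}$, a Euclidean $D_{\mathcal{X}}$, and the squared loss. The assignment of $\alpha$ is implemented greedily: all of $\mathcal{I}_1$ is kept, then members of $\mathcal{I}_2$ are added in decreasing order of $v_i^\star(\beta)$ while they exceed the running average, which matches the threshold rule in the theorem. The outer minimization uses SciPy's bounded scalar minimizer in place of the golden section search mentioned above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def worst_case_loss(beta, y_hat, dist_x0, rho, gamma):
    """f(beta) from Theorem 2.3 for the univariate conditional-mean case:
    squared loss, D_Y(y, y') = |y - y'|, Y = R, D_X Euclidean (dist_x0 holds
    D_X(x0, x_i)).  v_i^* follows Corollary 2.4(i) with a = -inf, b = +inf."""
    I = dist_x0 <= rho + gamma                        # relevant samples, cf. (4a)
    if not I.any():
        raise ValueError("no samples within rho + gamma of x0")
    I1 = dist_x0[I] + rho <= gamma                    # samples that always stay inside
    budget = rho - np.maximum(dist_x0[I] - gamma, 0)  # rho - D_X(x_i^p, x_i) >= 0
    # Worst response moves to the farther interval endpoint (Corollary 2.4(i)).
    v = np.maximum((y_hat[I] - budget - beta) ** 2, (y_hat[I] + budget - beta) ** 2)
    # alpha-assignment of Theorem 2.3: keep all of I_1, then greedily add members
    # of I_2 (largest v first) while they strictly exceed the running average.
    vals = list(v[I1])
    for vi in sorted(v[~I1], reverse=True):
        if not vals or vi > np.mean(vals):
            vals.append(vi)
        else:
            break
    return np.mean(vals)

def drcme(x_hat, y_hat, x0, rho, gamma):
    """Minimize the convex function f over beta via a bounded 1-D search."""
    dist_x0 = np.abs(x_hat - x0) if x_hat.ndim == 1 else np.linalg.norm(x_hat - x0, axis=1)
    lo, hi = y_hat.min() - rho, y_hat.max() + rho
    res = minimize_scalar(lambda b: worst_case_loss(b, y_hat, dist_x0, rho, gamma),
                          bounds=(lo, hi), method="bounded")
    return res.x
```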
If $\mathcal{Y}$ is multidimensional, the structure of $\ell(y, \beta)$ and $D_{\mathcal{Y}}$ might be exploited to identify tractable optimization reformulations. The next result focuses on local conditional mean estimation.

Proposition 2.5 (Multivariate conditional mean estimation). Let $\mathcal{Y} = \mathbb{R}^m$ and $\ell(y, \beta) = \|y - \beta\|_2^2$.
(i) Suppose that $D_{\mathcal{Y}}$ is the 2-norm on $\mathcal{Y}$, that is, $D_{\mathcal{Y}}(y, \hat y) = \|y - \hat y\|_2$. The distributionally robust local conditional estimation problem (2) is equivalent to the second-order cone program
$$\begin{array}{cl} \min & \lambda \\ \mathrm{s.t.} & \beta \in \mathbb{R}^m,\ \lambda \in \mathbb{R},\ u_i \in \mathbb{R}\ \forall i \in \mathcal{I}_1,\ u_i \in \mathbb{R}_+\ \forall i \in \mathcal{I}_2,\ t_i \in \mathbb{R}_+\ \forall i \in \mathcal{I} \\ & \sum_{i \in \mathcal{I}} u_i \le 0, \qquad t_i \ge \|\hat y_i - \beta\|_2 \quad \forall i \in \mathcal{I} \\ & \big\| \big[\, t_i + \rho - D_{\mathcal{X}}(\hat x_i^p, \hat x_i)\ ;\ \tfrac{1}{2}(1 - \lambda - u_i) \,\big] \big\|_2 \le \tfrac{1}{2}(1 + \lambda + u_i) \quad \forall i \in \mathcal{I}. \end{array}$$
(ii) Suppose that $D_{\mathcal{Y}}$ is the $\infty$-norm on $\mathcal{Y}$, that is, $D_{\mathcal{Y}}(y, \hat y) = \|y - \hat y\|_\infty$. The distributionally robust local conditional estimation problem (2) is equivalent to the second-order cone program
$$\begin{array}{cl} \min & \lambda \\ \mathrm{s.t.} & \beta \in \mathbb{R}^m,\ \lambda \in \mathbb{R},\ T \in \mathbb{R}_+^{|\mathcal{I}| \times m},\ u_i \in \mathbb{R}\ \forall i \in \mathcal{I}_1,\ u_i \in \mathbb{R}_+\ \forall i \in \mathcal{I}_2 \\ & \sum_{i \in \mathcal{I}} u_i \le 0 \\ & \big\| \big[\, T_{i1}\ ;\ T_{i2}\ ;\ \cdots\ ;\ T_{im}\ ;\ \tfrac{1}{2}(1 - \lambda - u_i) \,\big] \big\|_2 \le \tfrac{1}{2}(1 + \lambda + u_i) \quad \forall i \in \mathcal{I} \\ & -T_{ij} \le \hat y_{ij} - \beta_j - \rho + D_{\mathcal{X}}(\hat x_i^p, \hat x_i) \le T_{ij}, \quad -T_{ij} \le \hat y_{ij} - \beta_j + \rho - D_{\mathcal{X}}(\hat x_i^p, \hat x_i) \le T_{ij} \quad \forall (i, j) \in \mathcal{I} \times [m], \end{array}$$
where $\hat y_{ij}$ and $\beta_j$ are the $j$-th components of $\hat y_i$ and $\beta$, respectively.

Both optimization problems presented in Proposition 2.5 can be solved at large scale by commercial optimization solvers such as MOSEK [27].
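To illustrate Proposition 2.5(i), here is a sketch in CVXPY (our choice of modeling layer for exposition; the paper's code may call MOSEK directly, and the preprocessing of the index sets, projections and distances is assumed to have been done beforehand, e.g., as in the earlier sketches):

```python
import numpy as np
import cvxpy as cp

def drcme_multivariate_2norm(Y_hat, dist_proj, is_I1, rho):
    """Sketch of the SOCP in Proposition 2.5(i): multivariate local conditional
    mean with D_Y the 2-norm.  Y_hat stacks the responses of the samples in I
    (shape |I| x m), dist_proj[i] = D_X(x_i^p, x_i), is_I1 flags membership in I_1."""
    n_I, m = Y_hat.shape
    beta, lam = cp.Variable(m), cp.Variable()
    u = cp.Variable(n_I)
    t = cp.Variable(n_I, nonneg=True)

    cons = [cp.sum(u) <= 0]
    idx_I2 = np.where(~is_I1)[0]
    if idx_I2.size:                       # u_i >= 0 only for the boundary samples in I_2
        cons.append(u[idx_I2] >= 0)
    for i in range(n_I):
        cons.append(t[i] >= cp.norm(Y_hat[i] - beta, 2))
        # ||[ t_i + rho - D_X(x_i^p, x_i) ; (1 - lam - u_i)/2 ]||_2 <= (1 + lam + u_i)/2
        cons.append(cp.norm(cp.hstack([t[i] + rho - dist_proj[i],
                                       0.5 * (1 - lam - u[i])]), 2)
                    <= 0.5 * (1 + lam + u[i]))

    cp.Problem(cp.Minimize(lam), cons).solve()
    return beta.value, lam.value          # lam.value approximates the worst-case loss f(beta*)
```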
For other multivariate conditional estimation problems, there is also the possibility of employing subgradient methods by leveraging the next proposition.

Proposition 2.6 (Subgradient of $f$). Suppose that $D_{\mathcal{Y}}$ is coercive and $\ell(y, \cdot)$ is convex. Under the conditions of Theorem 2.3, for any $\beta \in \mathbb{R}^m$, a subgradient of the function $f$ at $\beta$ is given by $\nabla f(\beta) = \big(\sum_{i \in \mathcal{I}} \alpha_i\big)^{-1} \sum_{i \in \mathcal{I}} \alpha_i \nabla_\beta \ell(y_i^\star, \beta)$, where the value of $\alpha$ is as defined in Theorem 2.3 and $y_i^\star$ satisfies $y_i^\star \in \{ y_i \in \mathcal{Y} : D_{\mathcal{Y}}(y_i, \hat y_i) \le \rho - D_{\mathcal{X}}(\hat x_i^p, \hat x_i),\ \ell(y_i, \beta) = v_i^\star(\beta) \}$ for all $i \in \mathcal{I}$.

Just as an adversarial example provides a description of how to optimally perturb a data point from the adversary's viewpoint [20, 36], the worst-case distribution provides full information on how to adversarially perturb the empirical distribution $\hat{\mathbb{P}}$. For our distributionally robust estimator, the worst-case distribution can be obtained from the result of Theorem 2.3.

Lemma 2.7 (Worst-case distribution). Fix an estimate $\beta \in \mathcal{Y}$. Suppose that $\rho \ge \min_{i \in [N]} \kappa_{i,\gamma}$ and let $v^\star(\beta)$ and $\alpha$ be determined as in Theorem 2.3. Moreover, let $y_i^\star$ satisfy $y_i^\star \in \{ y_i \in \mathcal{Y} : D_{\mathcal{Y}}(y_i, \hat y_i) \le \rho - D_{\mathcal{X}}(\hat x_i^p, \hat x_i),\ \ell(y_i, \beta) = v_i^\star(\beta) \}$ for all $i \in \mathcal{I}$. Then the distribution
$$\mathbb{Q}^\star = N^{-1} \Big( \sum_{i \in \mathcal{I} : \alpha_i = 1} \delta_{(\hat x_i^p,\, y_i^\star)} + \sum_{i \in \mathcal{I} : \alpha_i = 0} \delta_{(\hat x_i, \hat y_i)} + \sum_{i \in [N] \setminus \mathcal{I}} \delta_{(\hat x_i, \hat y_i)} \Big)$$
satisfies $f(\beta) = \mathbb{E}_{\mathbb{Q}^\star}\big[\ell(Y, \beta) \mid X \in \mathcal{N}_\gamma(x_0)\big]$.

The values of $\alpha$ calculated in Theorem 2.3 are of an indicative nature: $\alpha_i = 1$ if it is optimal to perturb the sample point $i$ when computing the worst-case conditional expected loss. The construction of the worst-case distribution is hence intuitive: it involves computing and sorting the values $v_i^\star(\beta)$, and then performing a greedy assignment in order to maximize the objective value.

## 3 Probabilistic Theoretical Properties

We now study some statistical properties of our proposed estimator. Under some regularity conditions, the type-$\infty$ Wasserstein ball can be viewed as a confidence set that contains the true distribution $\mathbb{P}$ with high probability, provided that the radius $\rho$ is chosen judiciously. The value $f(\beta^\star)$ thus constitutes a generalization bound on the out-of-sample performance of the optimal conditional estimate $\beta^\star$. This idea can be formalized as follows.

Proposition 3.1 (Finite sample guarantee). Suppose that $\mathcal{X} \times \mathcal{Y}$ is bounded, open, and connected with a Lipschitz boundary. Suppose that the true probability measure $\mathbb{P}$ of $(X, Y)$ admits a density function $\nu$ satisfying $\bar\nu^{-1} \le \nu(x, y) \le \bar\nu$ for some constant $\bar\nu \ge 1$. For any $\gamma > 0$, if
$$\rho \ge \begin{cases} C N^{-\frac{1}{2}} \log(N)^{\frac{3}{4}} & \text{when } n + m = 2, \\ C N^{-\frac{1}{n+m}} \log(N)^{\frac{1}{n+m}} & \text{otherwise,} \end{cases}$$
where $C$ is a constant dependent on $\mathcal{X} \times \mathcal{Y}$ and $\bar\nu$, then with probability at least $1 - \mathcal{O}(N^{-c})$, where $c > 1$ is a constant dependent on $C$, we have $\mathbb{E}_{\mathbb{P}}[\ell(Y, \beta^\star) \mid X \in \mathcal{N}_\gamma(x_0)] \le f(\beta^\star)$, where $\beta^\star$ is the optimal conditional estimate that solves problem (2).

We now switch gears to study the properties of our estimator in the asymptotic regime; in particular, we focus on the consistency of our estimator. The interplay between the neighborhood radius $\gamma$ and the ambiguity size $\rho$ often produces tangled effects on the asymptotic convergence of the estimate. We thus showcase two exemplary setups in which either $\gamma$ or $\rho$ is zero, which interestingly produce two opposite outcomes for the consistency of the estimator. This underlines the intricacy of the problem.

Example 3.2 (Non-consistency when $\gamma = 0$). Suppose that $\gamma = 0$, $\rho \in \mathbb{R}_{++}$ is a fixed constant, $\mathcal{Y} = \mathbb{R}$, $\ell(y, \beta) = (y - \beta)^2$, and $D_{\mathcal{Y}}$ is the absolute distance. Let $\hat\beta_N$ be the optimal estimate that solves (2) based on $\{(\hat x_i, \hat y_i)\}_{i=1,\ldots,N}$. If, under the true distribution $\mathbb{P}$, $X$ is independent of $Y$, $\mathbb{P}(D_{\mathcal{X}}(X, x_0) \le \rho) > 0$, $\mathbb{P}(Y \ge 0) = 1$ and $\mathbb{P}(Y \ge y) > 0$ for all $y > 0$, then with probability 1 we have $\hat\beta_N \to +\infty$ while $\mathbb{E}_{\mathbb{P}}[Y \mid X = x_0] < \infty$.

Example 3.3 (Consistency when $\rho = 0$). Suppose that $\rho = 0$, $\mathcal{Y} = \mathbb{R}$, $\ell(y, \beta) = (y - \beta)^2$, $D_{\mathcal{X}}$ and $D_{\mathcal{Y}}$ are the Euclidean distance, and $k_N$ is a sequence of integers. Let $\gamma$ be the $k_N$-th smallest value of $D_{\mathcal{X}}(x_0, \hat x_i)$; then the $\hat\beta_N$ that solves (2) recovers the $k_N$-nearest neighbor regression estimator. If $k_N$ satisfies $\lim_{N \to \infty} k_N = \infty$ and $\lim_{N \to \infty} k_N / N = 0$, and $\mathbb{E}_{\mathbb{P}}[Y \mid X = x]$ is a continuous function of $x$, then $\lim_{N \to \infty} \hat\beta_N = \mathbb{E}_{\mathbb{P}}[Y \mid X = x_0]$ by [34, Corollary 3].

Example 3.3 suggests that if the radius $\gamma$ of the neighborhood is chosen adaptively based on the available training data, then our proposed estimator coincides with the k-nearest neighbor estimator, and hence consistency is inherited in a straightforward manner. The robust estimator with an ambiguity size $\rho > 0$ and an adaptive neighborhood radius $\gamma$ can thus be considered as a robustification of k-nearest neighbors, obtained in a systematic way using the DRO framework.

It is desirable to provide a descriptive connection between the distributionally robust estimator and some popular statistical quantities. For local conditional mean estimation, our estimate $\beta^\star$ coincides with the conditional mean of the distribution with the highest conditional variance. This insight culminates in the next proposition and bolsters the explainability of this class of estimators.

Proposition 3.4 (Conditional mean estimate). Suppose that $\mathcal{Y} = \mathbb{R}$, $\ell(y, \beta) = (y - \beta)^2$ and $D_{\mathcal{Y}}(\cdot, \hat y)$ is convex and coercive for any $\hat y$. For any $\rho \ge \min_{i \in [N]} \kappa_{i,\gamma}$, define $\mathbb{Q}^\star$ as
$$\mathbb{Q}^\star = \arg\max_{\mathbb{Q} \in \mathbb{B}_\rho^\infty,\ \mathbb{Q}(X \in \mathcal{N}_\gamma(x_0)) > 0} \operatorname{Variance}_{\mathbb{Q}}\big(Y \mid X \in \mathcal{N}_\gamma(x_0)\big);$$
then $\beta^\star = \mathbb{E}_{\mathbb{Q}^\star}[Y \mid X \in \mathcal{N}_\gamma(x_0)]$ is the optimal estimate that solves problem (2).

## 4 Numerical Experiments

In this section we compare the quality of our proposed Distributionally Robust Conditional Mean Estimator (DRCME) to the k-nearest neighbor (k-NN), Nadaraya-Watson (N-W), and Nadaraya-Epanechnikov (N-E) estimators, together with the robust k-NN approach in [2] (BertEtAl), using a synthetic dataset and the MNIST dataset. Codes are available at https://github.com/nvietanh/DRCME.

Figure 2: Comparison of the mean absolute errors of conditional mean estimators for synthetic data. The gray shade shows the density of $X$.

Figure 3: Comparison of the distributions of absolute estimation errors of conditional mean estimators for synthetic data.

### 4.1 Conditional Mean Estimation With Synthetic Data

In this section, we conducted 500 independent experiments in which the training set contains $N = 100$ i.i.d. samples of $(X, Y)$ in each experiment. The marginal distribution of $X$ has the piecewise constant density function $p(x)$, which is chosen as $p(x) = 100/72$ if $x \in [0, 0.3] \cup [0.7, 1]$ and $p(x) = 30/72$ if $x \in (0.3, 0.7)$. Given $X$, the distribution of $Y$ is determined by $Y = f(X) + \varepsilon$, where $f(x) = \sin(10x)$ and $\varepsilon$ is i.i.d. Gaussian noise independent of $X$ with mean 0 and variance 0.01.
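A sketch of this data-generating process (our reading of the setup; the sampling implementation in the authors' code may differ in detail):

```python
import numpy as np

def sample_synthetic(n, rng):
    """Draw n i.i.d. pairs (X, Y): X has the piecewise-constant density p
    (100/72 on [0, 0.3] U [0.7, 1], 30/72 on (0.3, 0.7)), and Y = sin(10 X) + eps
    with eps ~ N(0, 0.01) independent of X."""
    # P(X in [0, 0.3] U [0.7, 1]) = (100/72) * 0.6 = 5/6; the rest falls in (0.3, 0.7).
    in_dense = rng.random(n) < 5.0 / 6.0
    x = np.where(
        in_dense,
        np.where(rng.random(n) < 0.5, rng.uniform(0.0, 0.3, n), rng.uniform(0.7, 1.0, n)),
        rng.uniform(0.3, 0.7, n),
    )
    y = np.sin(10 * x) + rng.normal(0.0, np.sqrt(0.01), n)
    return x, y

x_train, y_train = sample_synthetic(100, np.random.default_rng(0))
```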
The conditional mean estimation problem is challenging when $x_0$ is close to the jump points of the density function $p(x)$, that is, at $x_0 = 0.3$ or $x_0 = 0.7$, because the data are gathered unequally in the neighborhoods. Thus, to test the robustness of all the estimators, we employ all five estimators to estimate the conditional mean $\mathbb{E}_{\mathbb{P}}[Y \mid X = x_0]$ for $x_0 = 0.2, 0.21, \ldots, 0.4$ around the jump point $x_0 = 0.3$. We select $D_{\mathcal{X}}(x, x') = |x - x'|$ and $D_{\mathcal{Y}}(y, y') = |y - y'|$. The hyperparameters of all the estimators, whose ranges and selection are given in Appendix A, are chosen by leave-one-out cross validation.

Figure 2 displays the average of the mean estimation errors taken over 500 independent runs for different values $x_0 \in [0.2, 0.4]$. One can observe from the figure that DRCME uniformly outperforms k-NN and BertEtAl for all $x_0$ of interest. When compared with N-W and N-E, we remark that DRCME is the most accurate estimator around the jump point of $p(x)$. As $x_0$ moves away from the location 0.3, the performance of DRCME decays and becomes slightly worse than that of N-W far from the jump point. Figure 3 presents the cumulative distribution of the estimation errors when $x_0 \in [0.28, 0.32]$. The empirical error distribution of DRCME is stochastically smaller than that of the other estimators, which reinforces that DRCME outperforms them around the jump point in a strong sense.

### 4.2 Digit Estimation With MNIST Database

In this section, we compare the quality of the estimators on a digit estimation problem using the MNIST database [23]. While to date most studies have focused on out-of-sample classification performance for this dataset, here we shift our attention to the task of estimating digits as cardinal quantities and are especially interested in performance in a low-data regime. Treating the labels as cardinal quantities allows us to assess the distinctive features of DRCME in its most simplistic form (i.e., univariate conditional mean estimation of a real random variable). Mean estimation might in fact be more relevant than classification when trying to recognize handwritten measurements, where confusing a 0 with a 6 is more damaging than confusing it with a 3. We executed 100 experiments where training and test sets were randomly drawn without replacement from the 60,000 training examples of this dataset. Training set sizes were $N = 50$, 100, or 500, while the test set size remained at 100. Each $(x, y)$ pair is composed of the vector of grayscale intensities in $\mathbb{R}^{28^2}$, normalized so that $\|x\|_1 = 1$, together with the digit label. For simplicity, we let $D_{\mathcal{X}}(x, \hat x) = \|x - \hat x\|_2$ and $D_{\mathcal{Y}}(y, \hat y) = \theta|y - \hat y|$.
In each experiment, the hyper-parameters of all methods were chosen based on a leave-one-out cross validation process. In the case of DRCME, we adapt the radius of the neighborhood $\gamma$ and the ambiguity size $\rho$ locally at $x_0$ to account for the non-uniform density of $X$. Specifically, we let $\gamma = h_i^\gamma(x_0) := \kappa_{[\lfloor i \rfloor],0} + (i - \lfloor i \rfloor)\big(\kappa_{[\lceil i \rceil],0} - \kappa_{[\lfloor i \rfloor],0}\big)$, where $[j]$ refers to the $j$-th smallest element while $\lfloor \cdot \rfloor$ and $\lceil \cdot \rceil$ refer to the floor and ceiling operations; i.e., the radius is set to the linear interpolation between the distances of the $\lfloor i \rfloor$-th and $\lceil i \rceil$-th closest members of the training set to $x_0$. We further let $\rho$ be proportional to $\gamma$. This lets DRCME reduce to k-NN when $\gamma = h_k^\gamma(x_0)$, $\rho = 0$, and $\theta = 1$. Table 1 presents the median choice of hyper-parameters for each estimator.

Table 1: Median of hyper-parameters (H.P.) obtained with cross-validation.

| Method   | H.P. | N=50                    | N=100                   | N=500                   |
|----------|------|-------------------------|-------------------------|-------------------------|
| k-NN     | k    | 3                       | 4                       | 4                       |
| N-W      | h    | 0.022                   | 0.019                   | 0.015                   |
| N-E      | h    | 0.087                   | 0.078                   | 0.068                   |
| BertEtAl | k    | 3                       | 4                       | 5                       |
|          | ρ    | 0.712                   | 1.313                   | 1.313                   |
| DRCME    | γ    | $h^\gamma_{1.3}(\cdot)$ | $h^\gamma_{1.3}(\cdot)$ | $h^\gamma_{1.6}(\cdot)$ |
|          | ρ    | 0.13γ                   | 0.13γ                   | 0.06γ                   |
|          | θ    | 0.004                   | 0.002                   | 0.001                   |

Figure 4: Comparison of the distributions of out-of-sample absolute estimation errors of conditional mean estimators for the MNIST database under different training set sizes.

Figure 4 presents the out-of-sample estimation error distribution of the conditional estimators. One can quickly remark that DRCME outperforms the BertEtAl, k-NN, and N-E estimators, especially in the low-data regime. In particular, for all three training set sizes, the distribution of errors for DRCME stochastically dominates the three other distributions. One even notices, in the $N = 500$ panel of Figure 4, that DRCME has the largest chance of reaching an exact estimation: 66% compared to 60%, 55%, 30%, and 8% for the other estimators. This explains why DRCME is also the most accurate estimator when its estimate is rounded to the nearest integer, as reported in Table 2, with a margin greater than 4% over all other estimators across all $N$.

Table 2: Comparison of expected out-of-sample classification accuracy (in %, with 90% confidence intervals) from rounded estimates.

| Method   | N=50   | N=100  | N=500  |
|----------|--------|--------|--------|
| k-NN     | 24 ± 2 | 33 ± 2 | 60 ± 1 |
| N-W      | 30 ± 2 | 38 ± 2 | 65 ± 1 |
| N-E      | 26 ± 1 | 32 ± 1 | 50 ± 1 |
| BertEtAl | 29 ± 2 | 41 ± 2 | 67 ± 1 |
| DRCME    | 36 ± 2 | 46 ± 1 | 71 ± 1 |

It is worth noting that while N-W does not produce high-accuracy estimates, it has a smaller chance of producing estimates with large errors. This is also apparent when comparing the expected type-$p$ deviation of the estimation error, i.e., $(\mathbb{E}[|y - \hat y|^p])^{1/p}$, for each estimator. Specifically, N-W slightly outperforms DRCME for deviation metrics of type $p \ge 1$, e.g., with a root mean square error of 1.32 compared to 1.41 when $N = 500$. On the other hand, DRCME significantly outperforms N-W when $p < 1$, where high-precision estimators are encouraged. We refer the reader to Appendix A for further details.

Finally, we report on an experiment that challenges the capacity of both the N-W and DRCME estimators to be resilient to adversarial corruption of the test images. This is done by exposing the two estimators to images from the training set ($N = 100$) that have been corrupted in a way that makes them resemble the closest differently-labeled image in the set. (Implementation-wise, we exploit the Python Optimal Transport toolbox [12] to compute different entropic regularized Wasserstein barycenters of the two normalized images treated as distributions.) Figure 5 presents several visual examples of the progressively corrupted images and the resulting N-W and DRCME estimations.

Figure 5: Comparison of estimations from N-W and DRCME on entropic regularized Wasserstein barycenters of pairs of images from the training set. Estimations are presented above each image in the format (N-W, DRCME).

Overall, one quickly notices how the estimation produced by DRCME is less sensitive to such attacks, sticking to the original label until there is substantial evidence of a new label. More examples are in Appendix A.
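To indicate how such progressively corrupted images can be produced, here is a sketch using the POT toolbox mentioned above (our illustration only; the choice of barycenter routine, the cost normalization, the regularization strength, and the inputs `img_a` and `img_b`, assumed to be 28 x 28 grayscale arrays, are our assumptions rather than the authors' exact settings):

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

def corrupt_towards(img_a, img_b, weight_b, reg=0.005):
    """Entropic-regularized Wasserstein barycenter of two 28x28 images treated as
    distributions; weight_b in [0, 1] controls how far img_a is pushed towards img_b."""
    a = img_a.ravel() / img_a.sum()
    b = img_b.ravel() / img_b.sum()
    A = np.vstack([a, b]).T                         # histograms as columns, shape (784, 2)
    # Squared Euclidean cost between pixel locations on the 28 x 28 grid.
    grid = np.array([(i, j) for i in range(28) for j in range(28)], dtype=float)
    M = ot.dist(grid, grid)
    M /= M.max()
    bary = ot.bregman.barycenter(A, M, reg, weights=np.array([1 - weight_b, weight_b]))
    return bary.reshape(28, 28)

# Example: a sequence of increasingly corrupted versions of img_a.
# corrupted = [corrupt_towards(img_a, img_b, w) for w in np.linspace(0.0, 1.0, 6)]
```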
## Broader Impact

Our paper contributes theoretical insights at the intersection of statistics and optimization, with potential applications in diverse areas of machine learning. In particular, our proposed estimator can be used in almost all applications in which non-parametric conditional estimators (including k-nearest neighbors and kernel estimators) are currently utilized, including regression and classification tasks, with potential impact in health sciences, economics, business, finance, climate, various engineering areas, logistics, risk analysis, etc. Using ideas from the distributionally robust optimization framework, we propose a principled and systematic way to obtain a robustification of the popular k-nearest neighbors. At a methodological level, we contribute a novel paradigm that can be used to enhance the robustness of conditional statistical estimation against model misspecification and adversarial attacks.

Because our paper provides novel techniques for conditional estimation in the context of structured and contaminated data, we believe that our work has the potential to enable more applications in which data sets are pulled together from different sources (e.g., the prediction of health care policy evaluations, in which information from different environments needs to be put together to mitigate the lack of data given the need for quick decision making under time constraints; or online advertisement recommendation systems, in which the behaviors of many customers are employed to predict the behavior of incoming customers conditional on their profile). In addition, the results in this paper are part of the thesis work of a Ph.D. student, thus promoting the training of highly qualified personnel.

Acknowledgments. Material in this paper is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-20-1-0397. Additional support is gratefully acknowledged from NSF grants 1915967, 1820942, 1838676, NSERC grant RGPIN-2016-05208, and from the China Merchant Bank. Finally, this research was enabled in part by support provided by Compute Canada.

## References

[1] D. Bertsimas, V. Gupta, and N. Kallus. Data-driven robust optimization. Mathematical Programming, 167(2):235–292, 2018.
[2] D. Bertsimas, C. McCord, and B. Sturt. Dynamic optimization with side information. arXiv preprint arXiv:1907.07307, 2019.
[3] D. Bertsimas, S. Shtern, and B. Sturt. Two-stage sample robust optimization. arXiv preprint arXiv:1907.07142, 2019.
[4] R. Bhattacharjee and K. Chaudhuri. When are non-parametric methods robust? In International Conference on Machine Learning, 2020.
[5] J. Blanchet and K. Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600, 2019.
[6] L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[7] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.
[8] E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.
[9] L. Devroye. The uniform convergence of nearest neighbor regression function estimators and their application in optimization. IEEE Transactions on Information Theory, 24(2):142–151, 1978.
[10] V. A. Epanechnikov. Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications, 14(1):153–158, 1969.
[11] A. Esteban-Pérez and J. Morales. Distributionally robust stochastic programs with side information based on trimmings. arXiv preprint arXiv:2009.10592, 2020.
[12] R. Flamary and N. Courty. POT: Python Optimal Transport library, 2017.
[13] R. Gao and A. J. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
[14] C. Givens and R. Shortt. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.
[15] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the Third International Conference on Learning Representations, 2015.
[16] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
[17] P. J. Huber. Robust estimation of a location parameter. In Breakthroughs in Statistics, pages 492–518. Springer, 1992.
[18] N. Kallus and X. Mao. Stochastic optimization forests. arXiv preprint arXiv:2008.07473, 2020.
[19] R. Kannan, G. Bayraksan, and J. R. Luedtke. Data-driven sample average approximation with covariate information. Available on Optimization Online, 2020.
[20] M. Khoury and D. Hadfield-Menell. On the geometry of adversarial examples. arXiv preprint arXiv:1811.00525, 2018.
[21] D. Kuhn, P. M. Esfahani, V. A. Nguyen, and S. Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. INFORMS TutORials in Operations Research, pages 130–166, 2019.
[22] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. In Proceedings of the Fifth International Conference on Learning Representations, 2017.
[23] Y. LeCun and C. Cortes. The MNIST Database of Handwritten Digits, 1998 (accessed May 28, 2020).
[24] X. Li, Y. Chen, Y. He, and H. Xue. AdvKNN: Adversarial attacks on k-nearest neighbor classifiers with approximate gradients. arXiv preprint arXiv:1911.06591, 2019.
[25] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In Proceedings of the Sixth International Conference on Learning Representations, 2018.
[26] P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166, 2018.
[27] MOSEK ApS. MOSEK Optimizer API for Python 9.2.10, 2019.
[28] E. A. Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.
[29] H. Namkoong and J. C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems 30, pages 2971–2980, 2017.
[30] V. A. Nguyen, D. Kuhn, and P. Mohajerin Esfahani. Distributionally robust inverse covariance estimation: The Wasserstein shrinkage estimator. arXiv preprint arXiv:1805.07194, 2018.
[31] A. Raghunathan, J. Steinhardt, and P. Liang. Certified defenses against adversarial examples. In International Conference on Learning Representations, 2018.
[32] S. Shafieezadeh-Abadeh, D. Kuhn, and P. M. Esfahani. Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1–68, 2019.
[33] A. Sinha, H. Namkoong, and J. Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
[34] C. J. Stone. Consistent nonparametric regression. Annals of Statistics, 5(4):595–620, 1977.
[35] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. D. McDaniel. Ensemble adversarial training: Attacks and defenses. In Proceedings of the Sixth International Conference on Learning Representations, 2018.
[36] L. Wang, X. Liu, J. Yi, Z.-H. Zhou, and C.-J. Hsieh. Evaluating the robustness of nearest neighbor classifiers: A primal-dual perspective. arXiv preprint arXiv:1906.03972, 2019.
[37] Y. Wang, S. Jha, and K. Chaudhuri. Analyzing the robustness of nearest neighbors to adversarial examples. In International Conference on Machine Learning, pages 5133–5142, 2018.
[38] G. S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964.
[39] W. Xie. Tractable reformulations of distributionally robust two-stage stochastic programs with ∞-Wasserstein distance. arXiv preprint arXiv:1908.08454, 2019.
[40] Y.-Y. Yang, C. Rashtchian, Y. Wang, and K. Chaudhuri. Robustness for non-parametric classification: A generic attack and defense. In International Conference on Artificial Intelligence and Statistics, 2020.
[41] G. Zhao and Y. Ma. Robust nonparametric kernel regression estimator. Statistics & Probability Letters, 116:72–79, 2016.