Published as a conference paper at ICLR 2019

ROBUST ESTIMATION AND GENERATIVE ADVERSARIAL NETWORKS

Chao Gao, Department of Statistics, University of Chicago, Chicago, IL 60637, USA. chaogao@galton.uchicago.edu
Jiyi Liu, Department of Statistics and Data Science, Yale University, New Haven, CT 06511, USA. jiyi.liu@yale.edu
Yuan Yao & Weizhi Zhu, Department of Mathematics, Hong Kong University of Science and Technology, Kowloon, Hong Kong. yuany@ust.hk; wzhuai@connect.ust.hk

ABSTRACT

Robust estimation under Huber's $\epsilon$-contamination model has become an important topic in statistics and theoretical computer science. Statistically optimal procedures such as Tukey's median and other estimators based on depth functions are impractical because of their computational intractability. In this paper, we establish an intriguing connection between f-GANs and various depth functions through the lens of f-Learning. Similar to the derivation of f-GANs, we show that the depth functions that lead to statistically optimal robust estimators can all be viewed as variational lower bounds of the total variation distance in the framework of f-Learning. This connection opens the door to computing robust estimators using tools developed for training GANs. In particular, we show in both theory and experiments that appropriate structures of discriminator networks with hidden layers in GANs lead to statistically optimal robust location estimators for both the Gaussian distribution and general elliptical distributions whose first moment may not exist.

1 INTRODUCTION

In the setting of Huber's $\epsilon$-contamination model (Huber, 1964; 1965), one has i.i.d. observations
$$X_1, \ldots, X_n \sim (1-\epsilon)P_\theta + \epsilon Q, \qquad (1)$$
and the goal is to estimate the model parameter $\theta$. Under the data generating process (1), each observation has a $1-\epsilon$ probability of being drawn from $P_\theta$ and the remaining $\epsilon$ probability of being drawn from the contamination distribution $Q$. The presence of an unknown contamination distribution poses both statistical and computational challenges. For example, consider a normal mean estimation problem with $P_\theta = N(\theta, I_p)$. Due to the contamination of data, the sample average, which is optimal when $\epsilon = 0$, can be arbitrarily far away from the true mean if $Q$ charges a positive probability at infinity. Moreover, even robust estimators such as the coordinatewise median and the geometric median are provably suboptimal under the setting of (1) (Chen et al., 2018; Diakonikolas et al., 2016a; Lai et al., 2016). The search for procedures that are both statistically optimal and computationally feasible has become a fundamental problem in areas including statistics and computer science.

For the normal mean estimation problem, it has been shown in Chen et al. (2018) that the minimax rate with respect to the squared $\ell_2$ loss is $\frac{p}{n} \vee \epsilon^2$, and it is achieved by Tukey's median (Tukey, 1975). Despite the statistical optimality of Tukey's median, its computation is not tractable. In fact, even an approximate algorithm takes $O(e^{Cp})$ time (Amenta et al., 2000; Chan, 2004; Rousseeuw & Struyf, 1998).

Recent developments in theoretical computer science have focused on the search for computationally tractable algorithms for estimating $\theta$ under Huber's $\epsilon$-contamination model (1). The success of these efforts started from two fundamental papers, Diakonikolas et al. (2016a) and Lai et al.
(2016), where two different but related computational strategies, iterative filtering and dimension halving, were proposed to robustly estimate the normal mean. These algorithms can provably achieve the minimax rate $\frac{p}{n} \vee \epsilon^2$ up to a poly-logarithmic factor in polynomial time. The main idea behind the two methods is the critical fact that a good robust moment estimator can be certified efficiently by higher moments. This idea was later extended (Diakonikolas et al., 2017; Du et al., 2017; Diakonikolas et al., 2016b; 2018a;c;b; Kothari et al., 2018) to develop robust and computable procedures for various other problems. However, many of the computationally feasible procedures for robust mean estimation in the literature rely on knowledge of the covariance matrix and sometimes of the contamination proportion. Even though these assumptions can be relaxed, nontrivial modifications of the algorithms are required for such extensions and the statistical error rates may also be affected.

Compared with these computationally feasible procedures proposed in the recent literature, Tukey's median (9) and other depth-based estimators (Rousseeuw & Hubert, 1999; Mizera, 2002; Zhang, 2002; Mizera & Müller, 2004; Paindaveine & Van Bever, 2017) have some indispensable advantages in terms of their statistical properties. First, the depth-based estimators have clear objective functions that can be interpreted from the perspective of projection pursuit (Mizera, 2002). Second, the depth-based procedures are adaptive to unknown nuisance parameters in the models such as covariance structures, contamination proportion, and error distributions (Chen et al., 2018; Gao, 2017). Last but not least, Tukey's depth and other depth functions are mostly designed for robust quantile estimation, while the recent advancements in the theoretical computer science literature all focus on robust moment estimation. Although this is not an issue for normal mean estimation, the difference is fundamental for robust estimation under general settings such as elliptical distributions, where moments do not necessarily exist.

Given the desirable statistical properties discussed above, this paper focuses on the development of computational strategies for depth-like procedures. Our key observation is that robust estimators that are maximizers of depth functions, including halfspace depth, regression depth and covariance matrix depth, can all be derived under the framework of f-GAN (Nowozin et al., 2016). As a result, these depth-based estimators can be viewed as minimizers of variational lower bounds of the total variation distance between the empirical measure and the model distribution (Proposition 2.1). This observation allows us to leverage recent developments in the deep learning literature to compute these variational lower bounds through neural network approximations. Our theoretical results give insights on how to choose appropriate neural network classes that lead to minimax optimal robust estimation under Huber's $\epsilon$-contamination model. In particular, Theorems 3.1 and 3.2 characterize the networks that can robustly estimate the Gaussian mean by TV-GAN and JS-GAN, respectively; Theorem 4.1 is an extension to robust location estimation under the class of elliptical distributions, which includes the Cauchy distribution whose mean does not exist. Numerical experiments in Section 5 demonstrate the success of these GANs.
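Throughout the paper, data are generated from the contamination model (1). To make this setting concrete, the following minimal sketch (our own illustration, not part of the original experiments; the helper `sample_huber` and the specific choice of $Q$ are ours) samples from $(1-\epsilon)N(\theta, I_p) + \epsilon Q$ and shows how the sample mean is pulled away by the outliers while the coordinatewise median stays bounded (though, as noted above, it is still suboptimal).

```python
import numpy as np

def sample_huber(n, p, theta, eps, contaminate, rng):
    """Draw n observations from (1 - eps) * N(theta, I_p) + eps * Q."""
    is_outlier = rng.random(n) < eps               # Bernoulli(eps) contamination mask
    X = rng.standard_normal((n, p)) + theta        # inliers from N(theta, I_p)
    X[is_outlier] = contaminate(int(is_outlier.sum()), p, rng)  # outliers from Q
    return X

rng = np.random.default_rng(0)
n, p, eps = 5000, 100, 0.2
theta = np.zeros(p)
# Q = N(5 * 1_p, I_p), one of the contaminations used in Section 5
Q = lambda m, p, rng: rng.standard_normal((m, p)) + 5.0
X = sample_huber(n, p, theta, eps, Q, rng)

print(np.linalg.norm(X.mean(axis=0) - theta))       # inflated: roughly eps * 5 * sqrt(p) = 10
print(np.linalg.norm(np.median(X, axis=0) - theta))  # bounded, but known to be rate-suboptimal
```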
2 ROBUST ESTIMATION AND f-GAN

We start with the definition of f-divergence (Csiszár, 1964; Ali & Silvey, 1966). Given a strictly convex function $f$ that satisfies $f(1) = 0$, the f-divergence between two probability distributions $P$ and $Q$ is defined by
$$D_f(P\|Q) = \int f\left(\frac{p}{q}\right) q. \qquad (2)$$
Here, we use $p(\cdot)$ and $q(\cdot)$ to stand for the density functions of $P$ and $Q$ with respect to some common dominating measure. For a fully rigorous definition, see Polyanskiy & Wu (2017). Let $f^*$ be the convex conjugate of $f$, that is, $f^*(t) = \sup_{u \in \mathrm{dom}_f}(ut - f(u))$. A variational lower bound of (2) is
$$D_f(P\|Q) \ge \sup_{T \in \mathcal{T}} \left[ E_P T(X) - E_Q f^*(T(X)) \right]. \qquad (3)$$
Note that the inequality (3) holds for any class $\mathcal{T}$, and it becomes an equality whenever the class $\mathcal{T}$ contains the function $f'(p/q)$ (Nguyen et al., 2010). For notational simplicity, we also use $f'$ for an arbitrary element of the subdifferential when the derivative does not exist. With i.i.d. observations $X_1, \ldots, X_n \sim P$, the variational lower bound (3) naturally leads to the following learning method
$$\widehat{P} = \mathop{\mathrm{argmin}}_{Q \in \mathcal{Q}} \sup_{T \in \mathcal{T}} \left[ \frac{1}{n}\sum_{i=1}^n T(X_i) - E_Q f^*(T(X)) \right]. \qquad (4)$$
The formula (4) is a powerful and general way to learn the distribution $P$ from its i.i.d. observations. It is known as f-GAN (Nowozin et al., 2016), an extension of GAN (Goodfellow et al., 2014), which stands for generative adversarial networks. The idea is to find a $\widehat{P}$ so that the best discriminator $T$ in the class $\mathcal{T}$ cannot tell the difference between $\widehat{P}$ and the empirical distribution $\frac{1}{n}\sum_{i=1}^n \delta_{X_i}$.

2.1 f-LEARNING: A UNIFIED FRAMEWORK

Our f-Learning framework is based on a special case of the variational lower bound (3). That is,
$$D_f(P\|Q) \ge \sup_{\widetilde{Q} \in \widetilde{\mathcal{Q}}_Q} \left[ E_P f'\left(\frac{\widetilde{q}(X)}{q(X)}\right) - E_Q f^*\left(f'\left(\frac{\widetilde{q}(X)}{q(X)}\right)\right) \right], \qquad (5)$$
where $\widetilde{q}(\cdot)$ stands for the density function of $\widetilde{Q}$. Note that here we allow the class $\widetilde{\mathcal{Q}}_Q$ to depend on the distribution $Q$ in the second argument of $D_f(P\|Q)$. Comparing (5) with (3), it is easy to see that (5) is a special case of (3) with
$$\mathcal{T} = \mathcal{T}_Q = \left\{ f'\left(\frac{\widetilde{q}}{q}\right) : \widetilde{Q} \in \widetilde{\mathcal{Q}}_Q \right\}. \qquad (6)$$
Moreover, the inequality (5) becomes an equality as long as $P \in \widetilde{\mathcal{Q}}_Q$. The sample version of (5) leads to the following learning method
$$\widehat{P} = \mathop{\mathrm{argmin}}_{Q \in \mathcal{Q}} \sup_{\widetilde{Q} \in \widetilde{\mathcal{Q}}_Q} \left[ \frac{1}{n}\sum_{i=1}^n f'\left(\frac{\widetilde{q}(X_i)}{q(X_i)}\right) - E_Q f^*\left(f'\left(\frac{\widetilde{q}(X)}{q(X)}\right)\right) \right]. \qquad (7)$$
The learning method (7) will be referred to as f-Learning in the sequel. It is a very general framework that covers many important learning procedures as special cases. For example, consider the special case where $\widetilde{\mathcal{Q}}_Q = \widetilde{\mathcal{Q}}$ is independent of $Q$, $\mathcal{Q} = \widetilde{\mathcal{Q}}$, and $f(x) = x\log x$. Direct calculations give $f'(x) = \log x + 1$ and $f^*(t) = e^{t-1}$. Therefore, (7) becomes
$$\widehat{P} = \mathop{\mathrm{argmin}}_{Q \in \mathcal{Q}} \sup_{\widetilde{Q} \in \mathcal{Q}} \frac{1}{n}\sum_{i=1}^n \log\frac{\widetilde{q}(X_i)}{q(X_i)} = \mathop{\mathrm{argmax}}_{q \in \mathcal{Q}} \frac{1}{n}\sum_{i=1}^n \log q(X_i),$$
which is the maximum likelihood estimator (MLE).

2.2 TV-LEARNING AND DEPTH-BASED ESTIMATORS

An important generator $f$ that we discuss here is $f(x) = (x-1)_+$. This leads to the total variation distance $D_f(P\|Q) = \frac{1}{2}\int |p - q|$. With $f'(x) = \mathbb{I}\{x \ge 1\}$ and $f^*(t) = t\,\mathbb{I}\{0 \le t \le 1\}$, TV-Learning is given by
$$\widehat{P} = \mathop{\mathrm{argmin}}_{Q \in \mathcal{Q}} \sup_{\widetilde{Q} \in \widetilde{\mathcal{Q}}_Q} \left[ \frac{1}{n}\sum_{i=1}^n \mathbb{I}\left\{\frac{\widetilde{q}(X_i)}{q(X_i)} \ge 1\right\} - Q\left(\frac{\widetilde{q}}{q} \ge 1\right) \right]. \qquad (8)$$
A closely related idea was previously explored by Yatracos (1985); Devroye & Lugosi (2012). The following proposition shows that when $\widetilde{\mathcal{Q}}_Q$ approaches $Q$ in some neighborhood, TV-Learning leads to robust estimators that are defined as the maximizers of various depth functions, including Tukey's depth, regression depth, and covariance depth.

Proposition 2.1. The TV-Learning (8) includes the following special cases:

1. Tukey's halfspace depth: Take $\mathcal{Q} = \{N(\eta, I_p) : \eta \in \mathbb{R}^p\}$ and $\widetilde{\mathcal{Q}}_\eta = \{N(\widetilde{\eta}, I_p) : \|\widetilde{\eta} - \eta\| \le r\}$. As $r \to 0$, (8) becomes
$$\widehat{\theta} = \mathop{\mathrm{argmax}}_{\eta \in \mathbb{R}^p} \inf_{\|u\|=1} \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{u^T(X_i - \eta) \ge 0\} \qquad (9)$$
(a numerical sketch of this objective is given right after the proposition).

2. Regression depth: Take $\mathcal{Q} = \{P_{y,X} = P_{y|X}P_X : P_{y|X} = N(X^T\eta, 1), \eta \in \mathbb{R}^p\}$ and $\widetilde{\mathcal{Q}}_\eta = \{P_{y,X} = P_{y|X}P_X : P_{y|X} = N(X^T\widetilde{\eta}, 1), \|\widetilde{\eta} - \eta\| \le r\}$. As $r \to 0$, (8) becomes
$$\widehat{\theta} = \mathop{\mathrm{argmax}}_{\eta \in \mathbb{R}^p} \inf_{\|u\|=1} \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{u^T X_i (y_i - X_i^T\eta) \ge 0\}. \qquad (10)$$

3. Covariance matrix depth: Take $\mathcal{Q} = \{N(0, \Gamma) : \Gamma \in \mathcal{E}_p\}$, where $\mathcal{E}_p$ stands for the class of $p \times p$ covariance matrices, and $\widetilde{\mathcal{Q}}_\Gamma = \{N(0, \widetilde{\Gamma}) : \widetilde{\Gamma}^{-1} = \Gamma^{-1} + \widetilde{r}uu^T \in \mathcal{E}_p,\ |\widetilde{r}| \le r,\ \|u\| = 1\}$. As $r \to 0$, (8) becomes
$$\widehat{\Sigma} = \mathop{\mathrm{argmin}}_{\Gamma \in \mathcal{E}_p} \sup_{\|u\|=1} \left[ \left( \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^TX_i|^2 \le u^T\Gamma u\} - \mathbb{P}(\chi^2_1 \le 1) \right) \vee \left( \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^TX_i|^2 > u^T\Gamma u\} - \mathbb{P}(\chi^2_1 > 1) \right) \right]. \qquad (11)$$
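The inner objective of (9) involves an infimum over all unit directions $u$, which is exactly what makes exact computation hard. The sketch below is our own illustration (the name `tukey_depth_objective` and the Monte Carlo approximation of the infimum by randomly sampled directions are not from the paper); it only approximates the infimum, but it shows what the objective rewards: points deep inside every one-dimensional projection of the data.

```python
import numpy as np

def tukey_depth_objective(X, eta, n_directions=2000, rng=None):
    """Monte Carlo approximation of inf_{||u||=1} (1/n) sum_i I{u^T (X_i - eta) >= 0},
    the inner objective of (9); the true infimum is over all unit vectors u."""
    rng = rng or np.random.default_rng(0)
    U = rng.standard_normal((n_directions, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)    # random unit directions
    proj = (X - eta) @ U.T                           # shape (n, n_directions)
    depth_per_direction = (proj >= 0).mean(axis=0)   # fraction of points on one side of each hyperplane
    return depth_per_direction.min()

# A deep point (near the Tukey median) has objective close to 1/2;
# a point far outside the data cloud has objective close to 0.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
print(tukey_depth_objective(X, np.zeros(5), rng=rng))     # close to 0.5
print(tukey_depth_objective(X, 5 * np.ones(5), rng=rng))  # close to 0.0
```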
The formula (9) is recognized as Tukey's median, the maximizer of Tukey's halfspace depth. A traditional understanding of Tukey's median is that (9) maximizes the halfspace depth (Donoho & Gasko, 1992), so that $\widehat{\theta}$ is close to the centers of all one-dimensional projections of the data. In the f-Learning framework, $N(\widehat{\theta}, I_p)$ is understood to be the minimizer of a variational lower bound of the total variation distance. The formula (10) gives the estimator that maximizes the regression depth proposed by Rousseeuw & Hubert (1999). It is worth noting that the derivation of (10) does not depend on the marginal distribution $P_X$ in the linear regression model. Finally, (11) is related to the covariance matrix depth (Zhang, 2002; Chen et al., 2018; Paindaveine & Van Bever, 2017). All of the estimators (9), (10) and (11) are proved to achieve the minimax rate for the corresponding problems under Huber's $\epsilon$-contamination model (Chen et al., 2018; Gao, 2017).

2.3 FROM f-LEARNING TO f-GAN

The connection to various depth functions shows the importance of TV-Learning in robust estimation. However, it is well known that depth-based estimators are very hard to compute (Amenta et al., 2000; van Kreveld et al., 1999; Rousseeuw & Struyf, 1998), which limits their applications to very low-dimensional problems. On the other hand, the general f-GAN framework (4) has been successfully applied to learn complex distributions and images in practice (Goodfellow et al., 2014; Radford et al., 2015; Salimans et al., 2016). The major difference that gives the computational advantage to f-GAN is its flexibility in designing the discriminator class $\mathcal{T}$ using neural networks, compared with the pre-specified choice (6) in f-Learning. While f-Learning provides a unified perspective for understanding various depth-based procedures in robust estimation, we can step back into the more general f-GAN for its computational advantages and design efficient computational strategies.

3 ROBUST MEAN ESTIMATION VIA GAN

In this section, we focus on the problem of robust mean estimation under Huber's $\epsilon$-contamination model. Our goal is to reveal how the choice of the class of discriminators affects robustness and statistical optimality under the simplest possible setting. That is, we have i.i.d. observations $X_1, \ldots, X_n \sim (1-\epsilon)N(\theta, I_p) + \epsilon Q$, and we need to estimate the unknown location $\theta \in \mathbb{R}^p$ from the contaminated data. Our goal is to achieve the minimax rate $\frac{p}{n} \vee \epsilon^2$ with respect to the squared $\ell_2$ loss uniformly over all $\theta \in \mathbb{R}^p$ and all $Q$.

Figure 1: Landscape of the TV-GAN objective function $F(\eta, w) = \sup_b\left[E_P\,\mathrm{sigmoid}(wX + b) - E_{N(\eta,1)}\,\mathrm{sigmoid}(wX + b)\right]$, where $b$ is maximized out for visualization. Samples are drawn from $P = (1-\epsilon)N(1, 1) + \epsilon N(10, 1)$ with $\epsilon = 0.2$. Left: a surface plot of $F(\eta, w)$. The solid curves are marginal functions for fixed $\eta$'s: $F(1, w)$ (red) and $F(5, w)$ (blue), and the dashed curves are marginal functions for fixed $w$'s: $F(\eta, 10)$ (orange) and $F(\eta, -10)$ (green).
Right: a heatmap of $F(\eta, w)$. It is clear that $F(w) = F(\eta, w)$ has two local maxima for a given $\eta$, achieved at $w = +\infty$ and $w = -\infty$. In fact, the global maximum of $F(w)$ has a phase transition from $w = +\infty$ to $w = -\infty$ as $\eta$ grows. For example, the maximum is achieved at $w = +\infty$ when $\eta = 1$ (blue solid) and at $w = -\infty$ when $\eta = 5$ (red solid). Unfortunately, even if we initialize with $\eta_0 = 1$ and $w_0 > 0$, gradient ascent on $\eta$ will only increase the value of $\eta$ (green dash); thus, as long as the discriminator cannot reach the global maximizer, $w$ will be stuck in the positive half space $\{w : w > 0\}$ and further increase the value of $\eta$.

3.1 RESULTS FOR TV-GAN

We start with the total variation GAN (TV-GAN) with $f(x) = (x-1)_+$ in (4). For the Gaussian location family, (4) can be written as
$$\widehat{\theta} = \mathop{\mathrm{argmin}}_{\eta \in \mathbb{R}^p} \max_{D \in \mathcal{D}} \left[ \frac{1}{n}\sum_{i=1}^n D(X_i) - E_{N(\eta, I_p)} D(X) \right], \qquad (12)$$
with $T(x) = D(x)$ in (4). Now we need to specify the class of discriminators $\mathcal{D}$ to solve the classification problem between $N(\eta, I_p)$ and the empirical distribution $\frac{1}{n}\sum_{i=1}^n \delta_{X_i}$. One of the simplest discriminator classes is logistic regression,
$$\mathcal{D} = \left\{ D(x) = \mathrm{sigmoid}(w^Tx + b) : w \in \mathbb{R}^p, b \in \mathbb{R} \right\}. \qquad (13)$$
With $D(x) = \mathrm{sigmoid}(w^Tx + b) = (1 + e^{-w^Tx - b})^{-1}$ in (13), the procedure (12) can be viewed as a smoothed version of TV-Learning (8). To be specific, the sigmoid function $\mathrm{sigmoid}(w^Tx + b)$ tends to an indicator function as $\|w\| \to \infty$, which leads to a procedure very similar to (9). In fact, the class (13) is richer than the one used in (9), and thus (12) can be understood as the minimizer of a sharper variational lower bound than that of (9).

Theorem 3.1. Assume $\frac{p}{n} + \epsilon^2 \le c$ for some sufficiently small constant $c > 0$. With i.i.d. observations $X_1, \ldots, X_n \sim (1-\epsilon)N(\theta, I_p) + \epsilon Q$, the estimator $\widehat{\theta}$ defined by (12) satisfies
$$\|\widehat{\theta} - \theta\|^2 \le C\left(\frac{p}{n} \vee \epsilon^2\right)$$
with probability at least $1 - e^{-C'(p + n\epsilon^2)}$ uniformly over all $\theta \in \mathbb{R}^p$ and all $Q$. The constants $C, C' > 0$ are universal.

Though TV-GAN can achieve the minimax rate $\frac{p}{n} \vee \epsilon^2$ under Huber's contamination model, it may suffer from optimization difficulties, especially when the distributions $Q$ and $N(\theta, I_p)$ are far away from each other, as shown in Figure 1.

3.2 RESULTS FOR JS-GAN

Given the intractable optimization property of TV-GAN, we next turn to the Jensen-Shannon GAN (JS-GAN) with $f(x) = x\log x - (x+1)\log\frac{x+1}{2}$. The estimator is defined by
$$\widehat{\theta} = \mathop{\mathrm{argmin}}_{\eta \in \mathbb{R}^p} \max_{D \in \mathcal{D}} \left[ \frac{1}{n}\sum_{i=1}^n \log D(X_i) + E_{N(\eta, I_p)} \log(1 - D(X)) \right] + \log 4, \qquad (14)$$
with $T(x) = \log D(x)$ in (4). This is exactly the original GAN (Goodfellow et al., 2014) specialized to the normal mean estimation problem. The advantages of JS-GAN over other forms of GAN have been studied extensively in the literature (Lucic et al., 2017; Kurach et al., 2018).

Unlike TV-GAN, our experimental results show that (14) with the logistic regression discriminator class (13) is not robust to contamination. However, if we replace (13) by a neural network class with one or more hidden layers, the estimator will be robust and will also work very well numerically. To understand why and how the class of discriminators affects the robustness property of JS-GAN, we introduce a new concept called restricted Jensen-Shannon divergence. Let $g: \mathbb{R}^p \to \mathbb{R}^d$ be a function that maps a $p$-dimensional observation to a $d$-dimensional feature space. The restricted Jensen-Shannon divergence between two probability distributions $P$ and $Q$ with respect to the feature $g$ is defined as
$$JS_g(P, Q) = \max_{w \in \mathcal{W}} \left[ E_P \log\,\mathrm{sigmoid}(w^Tg(X)) + E_Q \log(1 - \mathrm{sigmoid}(w^Tg(X))) \right] + \log 4.$$
In other words, $P$ and $Q$ are distinguished by a logistic regression classifier that uses the feature $g(X)$. It is easy to see that $JS_g(P, Q)$ is a variational lower bound of the original Jensen-Shannon divergence.
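The quantity $JS_g$ can be estimated from two samples by fitting exactly this logistic regression. The sketch below is our own illustration (the name `estimate_js_g` is ours, scikit-learn is used as the classifier, weak regularization stands in for the maximization over all $w$, and the intercept corresponds to appending a constant coordinate to $g$); it previews the moment-matching behavior formalized in Proposition 3.1 below: a linear feature cannot distinguish two distributions with equal means, while a nonlinear feature can.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_js_g(samples_p, samples_q, g):
    """Plug-in estimate of JS_g(P, Q): fit a logistic regression on the feature g(X)
    to separate the two samples, then evaluate the objective defining JS_g."""
    fp, fq = g(samples_p), g(samples_q)
    features = np.vstack([fp, fq])
    labels = np.concatenate([np.ones(len(fp)), np.zeros(len(fq))])
    clf = LogisticRegression(C=1e4, max_iter=1000).fit(features, labels)
    d_p = clf.predict_proba(fp)[:, 1]   # D(X) = sigmoid(w^T g(X) + b) on P-samples
    d_q = clf.predict_proba(fq)[:, 1]   # and on Q-samples
    return np.mean(np.log(d_p)) + np.mean(np.log(1.0 - d_q)) + np.log(4.0)

rng = np.random.default_rng(0)
# Two distributions with the same mean but different spread.
P = rng.standard_normal((20000, 1))
Q = 3.0 * rng.standard_normal((20000, 1))

print(estimate_js_g(P, Q, lambda x: x))                     # linear feature: approximately 0
print(estimate_js_g(P, Q, lambda x: np.hstack([x, x**2])))  # nonlinear feature: clearly positive
```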
The key property of $JS_g(P, Q)$ is given by the following proposition.

Proposition 3.1. Assume $\mathcal{W}$ is a convex set that contains an open neighborhood of $0$. Then, $JS_g(P, Q) = 0$ if and only if $E_P g(X) = E_Q g(X)$.

The proposition asserts that $JS_g(\cdot, \cdot)$ cannot distinguish $P$ and $Q$ if the feature $g(X)$ has the same expected value under the two distributions. This generalized moment matching effect has also been studied by Liu et al. (2017) for general f-GANs. However, the linear discriminator class considered in Liu et al. (2017) is parametrized in a different way compared with the discriminator class here.

When we apply Proposition 3.1 to robust mean estimation, the JS-GAN is trying to match the values of $\frac{1}{n}\sum_{i=1}^n g(X_i)$ and $E_{N(\eta, I_p)}g(X)$ for the feature $g(X)$ used in the logistic regression classifier. This explains what we observed in our numerical experiments. A neural net without any hidden layer is equivalent to a logistic regression with a linear feature $g(X) = (X^T, 1)^T \in \mathbb{R}^{p+1}$. Therefore, whenever $\eta = \frac{1}{n}\sum_{i=1}^n X_i$, we have $JS_g\left(\frac{1}{n}\sum_{i=1}^n \delta_{X_i}, N(\eta, I_p)\right) = 0$, which implies that the sample mean is a global minimizer of (14). On the other hand, a neural net with at least one hidden layer involves a nonlinear feature function $g(X)$, which is the key that leads to the robustness of (14). We will show rigorously that a neural net with one hidden layer is sufficient to make (14) robust and optimal. Consider the following class of discriminators,
$$\mathcal{D} = \left\{ D(x) = \mathrm{sigmoid}\left(\sum_{j \ge 1} w_j \sigma(u_j^Tx + b_j)\right) : \sum_{j \ge 1} |w_j| \le \kappa,\ u_j \in \mathbb{R}^p,\ b_j \in \mathbb{R} \right\}. \qquad (15)$$
The class (15) consists of two-layer neural network functions. While the dimension of the input layer is $p$, the dimension of the hidden layer can be arbitrary, as long as the weights have a bounded $\ell_1$ norm. The nonlinear activation function $\sigma(\cdot)$ is allowed to be 1) indicator: $\sigma(x) = \mathbb{I}\{x \ge 1\}$, 2) sigmoid: $\sigma(x) = \frac{1}{1+e^{-x}}$, or 3) ramp: $\sigma(x) = \max(\min(x + 1/2, 1), 0)$. Other bounded activation functions are also possible, but we do not list them exhaustively. The rectified linear unit (ReLU) will be studied in Appendix A.

Theorem 3.2. Consider the estimator $\widehat{\theta}$ defined by (14) with $\mathcal{D}$ specified by (15). Assume $\frac{p}{n} + \epsilon^2 \le c$ for some sufficiently small constant $c > 0$, and set $\kappa = O\left(\sqrt{\frac{p}{n}} + \epsilon\right)$. With i.i.d. observations $X_1, \ldots, X_n \sim (1-\epsilon)N(\theta, I_p) + \epsilon Q$, we have
$$\|\widehat{\theta} - \theta\|^2 \le C\left(\frac{p}{n} \vee \epsilon^2\right)$$
with probability at least $1 - e^{-C'(p + n\epsilon^2)}$ uniformly over all $\theta \in \mathbb{R}^p$ and all $Q$. The constants $C, C' > 0$ are universal.

4 ELLIPTICAL DISTRIBUTIONS

An advantage of Tukey's median (9) is that it leads to optimal robust location estimation under general elliptical distributions such as the Cauchy distribution, whose mean does not exist. In this section, we show that JS-GAN shares the same property. A random vector $X \in \mathbb{R}^p$ follows an elliptical distribution if it admits a representation
$$X = \theta + \xi AU,$$
where $U$ is uniformly distributed on the unit sphere $\{u \in \mathbb{R}^p : \|u\| = 1\}$ and $\xi \ge 0$ is a random variable independent of $U$ that determines the shape of the elliptical distribution (Fang, 2017). The center and the scatter matrix are $\theta$ and $\Sigma = AA^T$. For a unit vector $v$, let the density function of $\xi v^TU$ be $h$. Note that $h$ does not depend on $v$ because of the symmetry of $U$. Then, there is a one-to-one relation between the distribution of $\xi$ and $h$, and thus the triplet $(\theta, \Sigma, h)$ fully parametrizes an elliptical distribution.
Note that $h$ and $\Sigma = AA^T$ are not identifiable, because $\xi A = (c\xi)(c^{-1}A)$ for any $c > 0$. Therefore, without loss of generality, we can restrict $h$ to be a member of the following class
$$\mathcal{H} = \left\{ h : h(t) = h(-t),\ h \ge 0,\ \int h = 1,\ \int \sigma(t)(1-\sigma(t))h(t)\,dt = 1 \right\}.$$
This makes the parametrization $(\theta, \Sigma, h)$ of an elliptical distribution fully identifiable, and we use $EC(\theta, \Sigma, h)$ to denote an elliptical distribution parametrized in this way. The JS-GAN estimator is defined as
$$(\widehat{\theta}, \widehat{\Sigma}, \widehat{h}) = \mathop{\mathrm{argmin}}_{\eta \in \mathbb{R}^p,\, \Gamma \in \mathcal{E}_p(M),\, g \in \mathcal{H}}\ \max_{D \in \mathcal{D}} \left[ \frac{1}{n}\sum_{i=1}^n \log D(X_i) + E_{EC(\eta, \Gamma, g)} \log(1 - D(X)) \right] + \log 4, \qquad (16)$$
where $\mathcal{E}_p(M)$ is the set of all positive semi-definite matrices with spectral norm bounded by $M$.

Theorem 4.1. Consider the estimator $\widehat{\theta}$ defined above with $\mathcal{D}$ specified by (15). Assume $M = O(1)$, $\frac{p}{n} + \epsilon^2 \le c$ for some sufficiently small constant $c > 0$, and set $\kappa = O\left(\sqrt{\frac{p}{n}} + \epsilon\right)$. With i.i.d. observations $X_1, \ldots, X_n \sim (1-\epsilon)EC(\theta, \Sigma, h) + \epsilon Q$, we have
$$\|\widehat{\theta} - \theta\|^2 \le C\left(\frac{p}{n} \vee \epsilon^2\right)$$
with probability at least $1 - e^{-C'(p + n\epsilon^2)}$ uniformly over all $\theta \in \mathbb{R}^p$, $\Sigma \in \mathcal{E}_p(M)$ and all $Q$. The constants $C, C' > 0$ are universal.

Remark 4.1. The result of Theorem 4.1 also holds (and is proved) under the strong contamination model (Diakonikolas et al., 2016a). That is, we have i.i.d. observations $X_1, \ldots, X_n \sim P$ for some $P$ satisfying $TV(P, EC(\theta, \Sigma, h)) \le \epsilon$. See its proof in Appendix D.2.

Note that Theorem 4.1 guarantees the same convergence rate as in the Gaussian case for all elliptical distributions. This even includes the multivariate Cauchy, whose mean does not exist. Therefore, the location estimator (16) is fundamentally different from Diakonikolas et al. (2016a); Lai et al. (2016), which are designed only for robust mean estimation. We will show such a difference in our numerical results.

To achieve rate-optimality for robust location estimation under general elliptical distributions, the estimator (16) differs from (14) only in the generator class; they share the same discriminator class (15). This underlines an important principle for designing GAN estimators: the overall statistical complexity of the estimator is determined only by the discriminator class. The estimator (16) also outputs $(\widehat{\Sigma}, \widehat{h})$, but we do not claim any theoretical property for $(\widehat{\Sigma}, \widehat{h})$ in this paper. This will be systematically studied in a future project.

5 NUMERICAL EXPERIMENTS

In this section, we give extensive numerical studies of robust mean estimation via GAN. After introducing the implementation details in Section 5.1, we verify our theoretical results on minimax estimation with both TV-GAN and JS-GAN in Section 5.2. Comparison with other methods for robust mean estimation in the literature is given in Section 5.3. The effects of various network structures are studied in Section 5.4. Adaptation to unknown covariance is studied in Section 5.5. In all these cases, we assume i.i.d. observations are drawn from $(1-\epsilon)N(0_p, I_p) + \epsilon Q$ with $\epsilon$ and $Q$ to be specified. Finally, adaptation to elliptical distributions is studied in Section 5.6.

5.1 IMPLEMENTATIONS

We adopt the standard algorithmic framework of f-GANs (Nowozin et al., 2016) for the implementation of JS-GAN and TV-GAN for robust mean estimation. In particular, the generator for mean estimation is $G_\eta(Z) = Z + \eta$ with $Z \sim N(0_p, I_p)$; the discriminator $D$ is a multilayer perceptron (MLP), where each layer consists of a linear map followed by a sigmoid activation function, and the number of nodes varies across the experiments specified below.
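A condensed PyTorch sketch of this setup is given below. It is our own simplification of Algorithm 1 in Appendix B.1 (the name `js_gan_mean`, the hardcoded batch size, and the omission of the final averaging of $\eta$ over the last epochs are ours; learning rates, inner discriminator steps, and the coordinatewise-median initialization roughly follow Table 5 and Algorithm 1), not the released implementation.

```python
import torch
import torch.nn as nn

def js_gan_mean(X, hidden=20, epochs=150, d_steps=5, lr_d=0.2, lr_g=0.02, batch=500):
    """Minimal JS-GAN for robust mean estimation: generator G_eta(Z) = Z + eta and a
    one-hidden-layer sigmoid discriminator, alternating the updates of Algorithm 1."""
    n, p = X.shape
    D = nn.Sequential(nn.Linear(p, hidden), nn.Sigmoid(), nn.Linear(hidden, 1), nn.Sigmoid())
    eta = X.median(dim=0).values.clone().requires_grad_(True)  # coordinatewise-median init
    opt_d = torch.optim.SGD(D.parameters(), lr=lr_d)
    opt_g = torch.optim.SGD([eta], lr=lr_g)
    for _ in range(epochs):
        for _ in range(d_steps):  # inner maximization over D of the objective in (14)
            idx = torch.randint(0, n, (batch,))
            real = X[idx]
            fake = (torch.randn(batch, p) + eta).detach()
            loss_d = -(torch.log(D(real) + 1e-6).mean()
                       + torch.log(1.0 - D(fake) + 1e-6).mean())
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        fake = torch.randn(batch, p) + eta  # outer minimization over eta
        loss_g = torch.log(1.0 - D(fake) + 1e-6).mean()
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return eta.detach()

# Usage on contaminated data: X ~ 0.8 * N(0, I_p) + 0.2 * N(5 * 1_p, I_p)
torch.manual_seed(0)
p = 100
X = torch.cat([torch.randn(4000, p), torch.randn(1000, p) + 5.0])
print(js_gan_mean(X).norm())  # should be small; the sample mean is off by about 0.2 * 5 * sqrt(p) = 10
```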
Details related to the algorithms, tuning, critical hyper-parameters, structures of the discriminator networks, and other training tricks for stabilization and acceleration are discussed in Appendix B.1. A PyTorch implementation is available at https://github.com/zhuwzh/Robust-GAN-Center.

5.2 NUMERICAL SUPPORTS FOR THE MINIMAX RATES

We verify the minimax rates achieved by TV-GAN (Theorem 3.1) and JS-GAN (Theorem 3.2) via numerical experiments. The two main scenarios we consider are $\sqrt{p/n} < \epsilon$ and $\sqrt{p/n} > \epsilon$, where in both cases various types of contamination distributions $Q$ are considered. Specifically, the choices of contamination distributions $Q$ include $N(\mu \mathbf{1}_p, I_p)$ with $\mu$ ranging over $\{0.2, 0.5, 1, 5\}$, $N(0.5\,\mathbf{1}_p, \Sigma)$ and $\mathrm{Cauchy}(\tau \mathbf{1}_p)$. Details of the construction of the covariance matrix $\Sigma$ are given in Appendix B.2. The distribution $\mathrm{Cauchy}(\tau \mathbf{1}_p)$ is obtained by combining $p$ independent one-dimensional standard Cauchy components with location parameters $\tau_j = 0.5$.

Figure 2: $\ell_2$ error $\|\widehat{\theta} - \theta\|$ against $\epsilon$ (left: $p = 100$, $n = 50{,}000$, and $\epsilon$ ranges from 0.05 to 0.20), $p$ (middle: $n = 1{,}000$, $\epsilon = 0.1$, and $p$ ranges from 10 to 100) and $1/\sqrt{n}$ (right: $p = 50$, $\epsilon = 0.1$, and $n$ ranges from 50 to 1,000), respectively. Net structure: one hidden layer with 20 hidden units (JS-GAN), zero hidden layers (TV-GAN). The vertical bars indicate standard deviations.

The main experimental results are summarized in Figure 2, where the $\ell_2$ error we present is the maximum error over all choices of $Q$; detailed numerical results can be found in Tables 7, 8 and 9 in the Appendix. We separately explore the relation between the error and each of $\epsilon$, $p$ and $1/\sqrt{n}$ with the other two parameters fixed. The study of the relation between the $\ell_2$ error and $\epsilon$ is in the regime $\sqrt{p/n} < \epsilon$, so that $\epsilon$ dominates the minimax rate. The scenario $\sqrt{p/n} > \epsilon$ is considered in the study of the effects of $p$ and $1/\sqrt{n}$. As shown in Figure 2, the errors are approximately linear in the corresponding parameters in all cases, which empirically verifies the conclusions of Theorem 3.1 and Theorem 3.2.

5.3 COMPARISONS WITH OTHER METHODS

We perform additional experiments to compare with other methods, including dimension halving (Lai et al., 2016) and iterative filtering (Diakonikolas et al., 2017), under various settings. We emphasize that our method does not require any knowledge of nuisance parameters such as the contamination proportion $\epsilon$. Tuning GAN is only a matter of optimization, and one can tune parameters based on the objective function alone.

Table 1: Comparison of various robust mean estimation methods. Net structure: one-hidden-layer network with 20 hidden units when $n = 50{,}000$ and 2 hidden units when $n = 5{,}000$. The number in each cell is the average $\ell_2$ error $\|\widehat{\theta} - \theta\|$ with standard deviation in parentheses, estimated from 10 repeated experiments; the smallest error among the four methods is highlighted in bold.
Q n p ϵ TV-GAN JS-GAN Dimension Halving Iterative Filtering N(0.5 1p, Ip) 50,000 100 .2 0.0953 (0.0064) 0.1144 (0.0154) 0.3247 (0.0058) 0.1472 (0.0071) N(0.5 1p, Ip) 5,000 100 .2 0.1941 (0.0173) 0.2182 (0.0527) 0.3568 (0.0197) 0.2285 (0.0103) N(0.5 1p, Ip) 50,000 200 .2 0.1108 (0.0093) 0.1573 (0.0815) 0.3251 (0.0078) 0.1525 (0.0045) N(0.5 1p, Ip) 50,000 100 .05 0.0913 (0.0527) 0.1390 (0.0050) 0.0814 (0.0056) 0.0530 (0.0052) N(5 1p, Ip) 50,000 100 .2 2.7721 (0.1285) 0.0534 (0.0041) 0.3229 (0.0087) 0.1471 (0.0059) N(0.5 1p, Σ) 50,000 100 .2 0.1189 (0.0195) 0.1148 (0.0234) 0.3241 (0.0088) 0.1426 (0.0113) Cauchy(0.5 1p) 50,000 100 .2 0.0738 (0.0053) 0.0525 (0.0029) 0.1045 (0.0071) 0.0633 (0.0042) Table 1 shows the performances of JS-GAN, TV-GAN, dimension halving, and iterative filtering. The network structure, for both JS-GAN and TV-GAN, has one hidden layer with 20 hidden units when the sample size is 50,000 and 2 hidden units when sample size is 5,000. The critical hyper-parameters we apply is given in Appendix and it turns out that the choice of the hyper-parameter is robust against different models when the net structures are the same. To summarize, our method outperforms other algorithms in most cases. TV-GAN is good at cases when Q and N(0p, Ip) are non-separable but fails when Q is far away from N(0p, Ip) due to optimization issues discussed in Section 3.1 (Figure 1). On the other hand, JS-GAN stably achieves the lowest error in separable cases and also shows competitive performances for non-separable ones. 5.4 NETWORK STRUCTURES We further study the performance of JS-GAN with various structures of neural networks. The main observation is tuning networks with one-hidden layer becomes tough as the dimension grows (e.g. p 200), while a deeper network can significantly refine the situation perhaps by improving the landscape. Some experiment results are given in Table 2. On the other hand, one-hidden layer performs not worse than deeper networks when dimension is not very large (e.g. p 100). More experiments are given in Appendix B.4. Additional theoretical results for deep neural nets are given in Appendix A. Table 2: Experiment results for JS-GAN using networks with different structures in high dimension. Settings: ϵ = 0.2, p {200, 400} and n = 50, 000. p 200-100-20-1 200-200-100-1 200-100-1 200-20-1 200 0.0910 (0.0056) 0.0790 (0.0026) 0.3064 (0.0077) 0.1573 (0.0815) p 400-200-100-50-20-1 400-200-100-20-1 400-200-20-1 400-200-1 400 0.1477 (0.0053) 0.1732 (0.0397) 0.1393 (0.0090) 0.3604 (0.0990) Published as a conference paper at ICLR 2019 5.5 ADAPTATION TO UNKNOWN COVARIANCE The robust mean estimator constructed through JS-GAN can be easily made adaptive to unknown covariance structure, which is a special case of (16). We define (bθ, bΣ) = argmin η Rp,Γ Ep max D D i=1 log D(Xi) + EN(η,Γ) log(1 D(Xi)) The estimator bθ, as a result, is rate-optimal even when the true covariance matrix is not necessarily identity and is unknown (see Theorem 4.1). Below, we demonstrate some numerical evidence of the optimality of bθ as well as the error of bΣ in Table 3. 
Data generating process Network structure bθ 0p bΣ Σ1 op 0.8N(0p, Σ1) + 0.2N(0.5 1p, Σ2) 100-20-1 0.1680 (0.1540) 1.9716 (0.7405) 0.8N(0p, Σ1) + 0.2N(0.5 1p, Σ2) 100-20-20-1 0.1824 (0.3034) 1.4495 (0.6028) 0.8N(0p, Σ1) + 0.2N(1p, Σ2) 100-20-1 0.0817 (0.0213) 1.2753 (0.4523) 0.8N(0p, Σ1) + 0.2N(6 1p, Σ2) 100-20-1 0.1069 (0.0357) 1.1668 (0.1839) 0.8N(0p, Σ1) + 0.2Cauchy(0.5 1p) 100-20-1 0.0797 (0.0257) 4.0653 (0.1569) Table 3: Numerical experiments for robust mean estimation with unknown covariance trained with 50, 000 samples. The covariance matrices Σ1 and Σ2 are generated by the same way described in Appendix B.2. 5.6 ADAPTATION TO ELLIPTICAL DISTRIBUTIONS We consider the estimation of the location parameter θ in elliptical distribution EC(θ, Σ, h) by the JS-GAN defined in (16). In particular, we study the case with i.i.d. observations X1, ..., Xn (1 ϵ)Cauchy(θ, Ip)+ϵQ. The density function of Cauchy(θ, Σ) is given by p(x; θ, Σ) |Σ| 1/2 1 + (x θ)T Σ 1(x θ) (1+p)/2. Compared with Algorithm (1), the difference lies in the choice of the generator. We consider the generator G1(ξ, U) = gω(ξ)U + θ, where gω(ξ) is a non-negative neural network parametrized by ω and some random variable ξ. The random vector U is sampled from the uniform distribution on {u Rp : u = 1}. If the scatter matrix is unknown, we will use the generator G2(ξ, U) = gω(ξ)AU +θ, with AAT modeling the scatter matrix. Table 4 shows the comparison with other methods. Our method still works well under Cauchy distribution, while the performance of other methods that rely on moment conditions deteriorates in this setting. Table 4: Comparison of various methods of robust location estimation under Cauchy distributions. Samples are drawn from (1 ϵ)Cauchy(0p, Ip) + ϵQ with ϵ = 0.2, p = 50 and various choices of Q. Sample size: 50,000. Discriminator net structure: 50-50-25-1. Generator gω(ξ) structure: 48-48-32-24-12-1 with absolute value activation function in the output layer. Contamination Q JS-GAN (G1) JS-GAN (G2) Dimension Halving Iterative Filtering Cauchy(1.5 1p, Ip) 0.0664 (0.0065) 0.0743 (0.0103) 0.3529 (0.0543) 0.1244 (0.0114) Cauchy(5.0 1p, Ip) 0.0480 (0.0058) 0.0540 (0.0064) 0.4855 (0.0616) 0.1687 (0.0310) Cauchy(1.5 1p, 5 Ip) 0.0754 (0.0135) 0.0742 (0.0111) 0.3726 (0.0530) 0.1220 (0.0112) Normal(1.5 1p, 5 Ip) 0.0702 (0.0064) 0.0713 (0.0088) 0.3915 (0.0232) 0.1048 (0.0288)) ACKNOWLEDGEMENT The research of Chao Gao was supported in part by NSF grant DMS-1712957 and NSF Career Award DMS1847590. The research of Yuan Yao was supported in part by Hong Kong Research Grant Council (HKRGC) grant 16303817, National Basic Research Program of China (No. 2015CB85600), National Natural Science Foundation of China (No. 61370004, 11421110001), as well as awards from Tencent AI Lab, Si Family Foundation, Baidu Big Data Institute, and Microsoft Research-Asia. Published as a conference paper at ICLR 2019 Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pp. 131 142, 1966. Nina Amenta, Marshall Bern, David Eppstein, and S-H Teng. Regression depth and center points. Discrete & Computational Geometry, 23(3):305 323, 2000. Peter L Bartlett. For valid generalization the size of the weights is more important than the size of the network. In Advances in neural information processing systems, pp. 134 140, 1997. Peter L Bartlett and Shahar Mendelson. 
Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463 482, 2002. Timothy M Chan. An optimal randomized algorithm for maximum tukey depth. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 430 436. Society for Industrial and Applied Mathematics, 2004. Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under hubers contamination model. The Annals of Statistics, 46(5):1932 1960, 2018. Imre Csisz ar. Eine informationstheoretische ungleichung und ihre anwendung auf beweis der ergodizitaet von markoffschen ketten. Magyer Tud. Akad. Mat. Kutato Int. Koezl., 8:85 108, 1964. Luc Devroye and G abor Lugosi. Combinatorial methods in density estimation. Springer Science & Business Media, 2012. Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pp. 655 664. IEEE, 2016a. Ilias Diakonikolas, Daniel Kane, and Alistair Stewart. Robust learning of fixed-structure bayesian networks. ar Xiv preprint ar Xiv:1606.07384, 2016b. Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. ar Xiv preprint ar Xiv:1703.00893, 2017. Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. ar Xiv preprint ar Xiv:1803.02815, 2018a. Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. List-decodable robust mean estimation and learning mixtures of spherical gaussians. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1047 1060. ACM, 2018b. Ilias Diakonikolas, Weihao Kong, and Alistair Stewart. Efficient algorithms and lower bounds for robust linear regression. ar Xiv preprint ar Xiv:1806.00040, 2018c. David L Donoho and Miriam Gasko. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. The Annals of Statistics, 20(4):1803 1827, 1992. Simon S Du, Sivaraman Balakrishnan, and Aarti Singh. Computationally efficient robust estimation of sparse functionals. ar Xiv preprint ar Xiv:1702.07709, 2017. Kai Wang Fang. Symmetric Multivariate and Related Distributions: 0. Chapman and Hall/CRC, 2017. Chao Gao. Robust regression via mutivariate regression depth. ar Xiv preprint ar Xiv:1702.04656, 2017. Published as a conference paper at ICLR 2019 Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249 256, 2010. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672 2680, 2014. Peter J Huber. Robust estimation of a location parameter. The annals of mathematical statistics, 35(1):73 101, 1964. Peter J Huber. A robust version of the probability ratio test. The Annals of Mathematical Statistics, 36(6): 1753 1758, 1965. Pravesh K Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. 
In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1035 1046. ACM, 2018. Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The gan landscape: Losses, architectures, regularization, and normalization. ar Xiv preprint ar Xiv:1807.04720, 2018. Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pp. 665 674. IEEE, 2016. Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pp. 5545 5553, 2017. Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are gans created equal? a large-scale study. ar Xiv preprint ar Xiv:1711.10337, 2017. Colin Mc Diarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148 188, 1989. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. ar Xiv preprint ar Xiv:1802.05957, 2018. Ivan Mizera. On depth and deep points: a calculus. The Annals of Statistics, 30(6):1681 1736, 2002. Ivan Mizera and Christine H M uller. Location scale depth. Journal of the American Statistical Association, 99(468):949 966, 2004. Youssef Mroueh, Tom Sercu, and Vaibhava Goel. Mcgan: Mean and covariance feature matching gan. ar Xiv preprint ar Xiv:1702.08398, 2017. Xuan Long Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847 5861, 2010. Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271 279, 2016. Davy Paindaveine and Germain Van Bever. Halfspace depths for scatter, concentration and shape matrices. ar Xiv preprint ar Xiv:1704.06160, 2017. David Pollard. Convergence of stochastic processes. Springer Science & Business Media, 2012. Yury Polyanskiy and Yihong Wu. Lecture notes on information theory. 2017. Published as a conference paper at ICLR 2019 Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ar Xiv preprint ar Xiv:1511.06434, 2015. Peter J Rousseeuw and Mia Hubert. Regression depth. Journal of the American Statistical Association, 94 (446):388 402, 1999. Peter J Rousseeuw and Anja Struyf. Computing location depth and regression depth in higher dimensions. Statistics and Computing, 8(3):193 203, 1998. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234 2242, 2016. John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pp. 523 531, 1975. Aad W Van Der Vaart and Jon A Wellner. Weak convergence. In Weak convergence and empirical processes, pp. 16 28. Springer, 1996. Marc van Kreveld, Joseph SB Mitchell, Peter Rousseeuw, Micha Sharir, Jack Snoeyink, and Bettina Speckmann. Efficient algorithms for maximum regression depth. In Proceedings of the fifteenth annual symposium on Computational geometry, pp. 31 40. ACM, 1999. Yannis G Yatracos. 
Rates of convergence of minimum distance estimators and kolmogorov s entropy. The Annals of Statistics, 13(2):768 774, 1985. Jian Zhang. Some extensions of tukey s depth function. Journal of Multivariate Analysis, 82(1):134 165, 2002. Published as a conference paper at ICLR 2019 A ADDITIONAL THEORETICAL RESULTS In this section, we investigate the performance of discriminator classes of deep neural nets with the Re LU activation function. Since our goal is to learn a p-dimensional mean vector, a deep neural network discriminator without any regularization will certainly lead to overfitting. Therefore, it is crucial to design a network class with some appropriate regularizations. Inspired by the work of Bartlett (1997); Bartlett & Mendelson (2002), we consider a network class with ℓ1 regularizations on all layers except for the second last layer with an ℓ2 regularization. With GH 1 (B) = g(x) = Re LU(v T x) : v 1 B , a neural network class with l + 1 layers is defined as GH l+1(B) = g(x) = Re LU h=1 vhgh(x) h=1 |vh| B, gh GH l (B) Combining with the last sigmoid layer, we obtain the following discriminator class, FH L (κ, τ, B) = D(x) = sigmoid j 1 wjsigmoid h=1 ujhgjh(x) + bj j 1 |wj| κ, h=1 u2 jh 2, |bj| τ, gjh GH L 1(B) Note that all the activation functions are Re LU( ) except that we use sigmoid( ) in the last layer in the feature map g( ). A theoretical guarantees of the class defined above is given by the following theorem. Theorem A.1. Assume p log p n ϵ2 c for some sufficiently small constant c > 0. Consider i.i.d. observations X1, ..., Xn (1 ϵ)N(θ, Ip) + ϵQ and the estimator bθ defined by (14) with D = FH L (κ, τ, B) with H 2p, 2 L = O(1), 2 B = O(1), and τ = p log p. We set κ = O q n + ϵ . Then, for the estimator bθ defined by (14) with D = FH L (κ, τ, B), we have bθ θ 2 C p log p with probability at least 1 e C (p log p+nϵ2) uniformly over all θ Rp such that θ log p and all Q. The theorem shows that JS-GAN with a deep Re LU network can achieve the error rate p log p n ϵ2 with respect to the squared ℓ2 loss. The condition θ log p for the Re LU network can be easily satisfied with a simple preprocessing step. We split the data into two halves, whose sizes are log n and n log n, respectively. Then, we calculate the coordinatewise median eθ using the small half. It is easy to show that eθ θ q log p log n ϵ with high probability. Then, for each Xi from the second half, the conditional distribution of Xi eθ given the first half is (1 ϵ)N(θ eθ, Ip)+ϵ e Q. Since q log p log n ϵ log p, the condition θ eθ log p is satisfied, and thus we can apply the estimator (14) using the shifted data Xi eθ from the second half. The theoretical guarantee of Theorem A.1 will be bθ (θ eθ) 2 C p log p with high probability. Hence, we can use bθ + eθ as the final estimator to achieve the same rate in Theorem A.1. On the other hand, our experiments show that this preprocessing step is not needed. We believe that the assumption θ log p is a technical artifact in the analysis of the Rademacher complexity. It can probably be dropped by a more careful analysis. Published as a conference paper at ICLR 2019 B DETAILS OF EXPERIMENTS B.1 TRAINING DETAILS The implementation for JS-GAN is given in Algorithm 1, and a simple modification of the objective function leads to that of TV-GAN. Algorithm 1 JS-GAN: argminη maxw[ 1 n Pn i=1 log Dw(Xi) + E log(1 Dw(Gη(Z)))] Input: Observation set S = {X1, . . . 
, Xn} Rp, discriminator network Dw(x), generator network Gη(z) = z +η, learning rates γd and γg for the discriminator and the generator, batch size m, discriminator steps in each iteration K, total epochs T, average epochs T0. Initialization: Initialize η with coordinatewise median of S. Initialize w with N(0, .05) independently on each element or Xavier (Glorot & Bengio, 2010). 1: for t = 1, . . . , T do 2: for k = 1, . . . , K do 3: Sample mini-batch {X1, . . . , Xm} from S. Sample {Z1, . . . , Zm} from N(0, Ip) 4: gw w[ 1 mΣm i=1 log Dw(Xi) + 1 mΣm i=1 log(1 Dw(Gη(Zi)))] 5: w w + γdgw 6: end for 7: Sample {Z1, . . . , Zm} from N(0, Ip) 8: gη η[ 1 mΣm i=1 log(1 Dw(Gη(Zi)))] 9: η η γggη 10: end for Return: The average estimate η over the last T0 epochs. Several important implementation details are discussed below. How to tune parameters? The choice of learning rates is crucial to the convergence rate, but the minimax game is hard to evaluate. We propose a simple strategy to tune hyper-parameters including the learning rates. Suppose we have estimators bθ1, . . . , bθM with corresponding discriminator networks D b w1,..., D b w M . Fixing η = bθ, we further apply gradient descent to Dw with a few more epochs (but not many in order to prevent overfitting, for example 10 epochs) and select the bθ with the smallest value of the objective function (14) (JS-GAN) or (12) (TV-GAN). We note that training discriminator and generator alternatively usually will not suffer from overfitting since the objective function for either the discriminator or the generator is always changing. However, we must be careful about the overfitting issue when training the discriminator alone with a fixed η, and that is why we apply an early stopping strategy here. Fortunately, the experiments show if the structures of networks are same (then of course, the dimensions of the inputs are same), the choices of hyper-parameters are robust to different models and we present the critical parameters in Table 5 to reproduce the experiment results in Table 1 and Table 2. When to stop training? Judging convergence is a difficult task in GAN trainings, since sometimes oscillation may occur. In computer vision, people often use a task related measure and stop training once the requirement based on the measure is achieved. In our experiments below, we simply use a sufficiently large T which works well, but it is still interesting to explore an efficient early stopping rule in the future work. How to design the network structure? Although Theorem 3.1 and Theorem 3.2 guarantee the minimax rates of TV-GAN without hidden layer and JS-GAN with one hidden layer, one may wonder whether deeper network structures will perform better. From our preliminary experiments, TV-GAN with one hidden layer is significantly better than TV-GAN without any hidden layer. Moreover, JS-GAN Published as a conference paper at ICLR 2019 Table 5: Choices of hyper-parameters. The parameter λ is the penalty factor for the regularization term (17) and other parameters are listed in Algorithm 1. We apply Xavier initialization (Glorot & Bengio, 2010) for both JS-GAN and TV-GAN trainings. 
Structure Net γg γd K T T0 λ 100-20-1 JS 0.02 0.2 5 150 25 0 TV 0.0001 0.3 2 150 1 0.1 100-2-1 JS 0.01 0.2 5 150 25 0 TV 0.01 0.1 5 150 1 1 200-20-1 JS 0.02 0.2 5 200 25 0 TV 0.0001 0.1 2 200 1 0.5 200-200-100-1 JS 0.005 0.1 2 200 25 0 400-200-20-1 JS 0.02 0.05 2 250 25 0.5 with deep network structures can significantly improve over shallow networks especially when the dimension is large (e.g. p 200). For a network with one hidden layer, the choice of width may depend on the sample size. If we only have 5,000 samples of 100 dimensions, two hidden units performs better than five hidden units, which performs better than twenty hidden units. If we have 50,000 samples, networks with twenty hidden units perform the best. How to stabilize and accelerate TV-GAN? As we have discussed in Section 3.1, TV-GAN has a bad landscape when N(θ, Ip) and the contamination distribution Q are linearly separable (see Figure 1). An outlier removal step before training TV-GAN may be helpful. Besides, spectral normalization (Miyato et al., 2018) is also worth trying since it can prevent the weight from going to infinity and thus can increase the chance to escape from bad saddle points. To accelerate the optimization of TVGAN, in all the numerical experiments below, we adopt a regularized version of TV-GAN inspired by Proposition 3.1. Since a good feature extractor should match nonlinear moments of P = (1 ϵ)N(θ, Ip) + ϵQ and N(η, Ip), we use an additional regularization term that can accelerate training and sometimes even leads to better performances. Specifically, let D(x) = sigmoid(w T Φ(x)) be the discriminator network with w being the weights of the output layer and ΦD(x) be the corresponding network after removing the output layer from D(x). The quantity ΦD(x) is usually viewed as a feature extractor, which naturally leads to the following regularization term (Salimans et al., 2016; Mroueh et al., 2017), defined as i=1 T(ΦD, Xi) T(ΦD, N(η, Ip)) where T(Φ, P) = EP Φ(X) (moment matching) or T(Φ, P) = Median X P ΦD(X) (median matching). B.2 SETTINGS OF CONTAMINATION Q We introduce the contamination distributions Q used in the experiments. We first consider Q = N(µ, Ip) with µ ranges in {0.2, 0.5, 1, 5}. Note that the total variation distance between N(0p, Ip) and N(µ, Ip) is of order 0p µ = µ . We hope to use different levels of µ to test the algorithm and verify the error rate in the worst case. Second, we consider Q = N(1.5 1p, Σ) to be a Gaussian distribution with a non-trivial covariance matrix Σ. The covariance matrix is generated according to the following steps. First generate a sparse precision matrix Γ = (γij) with each entry γij = zij τij, i j, where zij and τij are independently generated from Uniform(0.4, 0.8) and Bernoulli(0.1). We then define γij = γji for all i > j and Γ = Γ + (| min eig(Γ)| + 0.05)Ip to make the precision matrix symmetric and positive definite, where min eig(Γ) is the smallest eigenvalue of Γ. The covariance matrix is Σ = Γ 1. Finally, we consider Q to be a Cauchy Published as a conference paper at ICLR 2019 distribution with independent component, and the jth component takes a standard Cauchy distribution with location parameter τj = 0.5. B.3 COMPARISON DETAILS In Section 5.3, we compare GANs with the dimension halving (Lai et al., 2016) and iterative filtering (Diakonikolas et al., 2017). Dimension Halving. Experiments conducted are based on the code from https://github.com/ kal2000/Agnostic Mean And Covariance Code. 
The only hyper-parameter is the threshold in the outlier removal step, and we take C = 2 as suggested in the file out Rem Sperical.m. Iterative Filtering. Experiments conducted are based on the code from https://github.com/ hoonose/robust-filter. We assume ϵ is known and take other hyper-parameters as suggested in the file filter Gaussian Mean.m. B.4 SUPPLEMENTARY EXPERIMENTS FOR NETWORK STRUCTURES The experiments are conducted with i.i.d. observations drawn from (1 ϵ)N(0p, Ip) + ϵN(0.5 1p, Ip) with ϵ = 0.2. Table 6 summarizes results for p = 100, n {5000, 50000} and various network structures. We observe that TV-GAN that uses neural nets with one hidden layer improves over the performance of that without any hidden layer. This indicates that the landscape of TV-GAN might be improved by a more complicated network structure. However, adding one more layer does not improve the results. For JS-GAN, we omit the results without hidden layer because of its lack of robustness (Proposition 3.1). Deeper networks sometimes improve over shallow networks, but this is not always true. We also observe that the optimal choice of the width of the hidden layer depends on the sample size. Table 6: Experiment results for JS-GAN and TV-GAN with various network structures. Structure n JS-GAN TV-GAN 100-1 50,000 - 0.1173 (0.0056) 100-20-1 50,000 0.0953 (0.0064) 0.1144 (0.0154) 100-50-1 50,000 0.2409 (0.0500) 0.1597 (0.0219) 100-20-20-1 50,000 0.1131 (0.0855) 0.1724 (0.0295) 100-1 5,000 - 0.9818 (0.0417) 100-2-1 5,000 0.1941 (0.0173) 0.1941 (0.0173) 100-5-1 5,000 0.2148 (0.0241) 0.2244 (0.0238) 100-20-1 5,000 0.3379 (0.0273) 0.3336 (0.0186) B.5 TABLES FOR TESTING THE MINIMAX RATES Tables 7, 8 and 9 show numerical results corresponding to Figure 2. Published as a conference paper at ICLR 2019 Table 7: Scenario I: p p/n < ϵ. Setting: p = 100, n = 50, 000, and ϵ from 0.05 to 0.20. Network structure of JS-GAN: one hidden layer with 5 hidden units. Network structure of TV-GAN: zero-hidden layer. The number in each cell is the average of ℓ2 error bθ θ with standard deviation in parenthesis estimated from 10 repeated experiments. The bold character marks the worst case among our choices of Q at each ϵ level. The results of TV-GAN for Q = N(5 1p, Ip) are highlighted in slanted font. The failure of training in this case is due to the bad landscape when N(0p, Ip) and Q are linearly separable, as discussed in Section 3.1 (see Figure 1). Q Net ϵ = 0.05 ϵ = 0.10 ϵ = 0.15 ϵ = 0.20 N(0.2 1p, Ip) JS 0.1025 (0.0080) 0.1813 (0.0122) 0.2632 (0.0080) 0.3280 (0.0069) TV 0.1110 (0.0204) 0.2047 (0.0112) 0.2769 (0.0315) 0.3283 (0.0745) N(0.5 1p, Ip) JS 0.1407 (0.0061) 0.1895 (0.0070) 0.1714 (0.0502) 0.1227 (0.0249) TV 0.2003 (0.0480) 0.2065 (0.1495) 0.2088 (0.0100) 0.3985 (0.0112) N(1p, Ip) JS 0.0855 (0.0054) 0.1055 (0.0322) 0.0602 (0.0133) 0.0577 (0.0029) TV 0.1084 (0.0063) 0.0842 (0.0036) 0.3228 (0.0123) 0.1329 (0.0125) N(5 1p, Ip) JS 0.0587 (0.0033) 0.0636 (0.0025) 0.0625 (0.0045) 0.0591 (0.0040) TV 1.2886 (0.5292) 4.4511 (0.8754) 7.3868 (0.8081) 10.5724 (1.2605) Cauchy(0.5 1p) JS 0.0625 (0.0045) 0.0652 (0.0044) 0.0648 (0.0035) 0.0687 (0.0042) TV 0.2280 (0.0067) 0.3842 (0.0083) 0.5740 (0.0071) 0.7768 (0.0074) N(0.5 1p, Σ) JS 0.1490 (0.0061) 0.1958 (0.0074) 0.2379 (0.0076) 0.1973 (0.0679) TV 0.2597 (0.0090) 0.4621 (0.0649) 0.6344 (0.0905) 0.7444 (0.3115) Table 8: Scenario II-a: p p/n > ϵ. Setting: n = 1, 000, ϵ = 0.1, and p from 10 to 100. Other details are the same as above. 
Q Net p = 10 p = 25 p = 50 p = 75 p = 100 N(0.2 1p, Ip) JS 0.1078 (0.0338) 0.1819 (0.0215) 0.3355 (0.0470) 0.4806 (0.0497) 0.5310 (0.0414) TV 0.2828 (0.0580) 0.4740 (0.1181) 0.5627 (0.0894) 0.8217 (0.0382) 0.8090 (0.0457) N(0.5 1p, Ip) JS 0.1587 (0.0438) 0.2684 (0.0386) 0.4213 (0.0356) 0.5355 (0.0634) 0.6825 (0.0981) TV 0.2864 (0.0521) 0.5024 (0.1038) 0.6878 (0.1146) 0.9204 (0.0589) 0.9418 (0.0551) N(1p, Ip) JS 0.1644 (0.0255) 0.2177 (0.0480) 0.3505 (0.0552) 0.4740 (0.0742) 0.6662 (0.0611) TV 0.3733 (0.0878) 0.5407 (0.0634) 0.9061 (0.1029) 1.0672 (0.0629) 1.1150 (0.0942) N(5 1p, Ip) JS 0.0938 (0.0195) 0.2058 (0.0218) 0.3316 (0.0462) 0.4054 (0.0690) 0.5553 (0.0518) TV 0.3707 (0.2102) 0.7434 (0.3313) 1.1532 (0.3488) 1.1850 (0.3739) 1.3257 (0.1721) Cauchy(0.5 1p) JS 0.1188 (0.0263) 0.1855 (0.0282) 0.2967 (0.0284) 0.4094 (0.0385) 0.4826 (0.0479) TV 0.3198 (0.1543) 0.5205 (0.1049) 0.6240 (0.0652) 0.7536 (0.0673) 0.7612 (0.0613) N(0.5 1p, Σ) JS 0.1805 (0.0220) 0.2692 (0.0318) 0.3885 (0.0339) 0.5144 (0.0547) 0.6833 (0.1094) TV 0.3036 (0.0736) 0.5152 (0.0707) 0.7305 (0.0966) 0.9460 (0.0900) 1.0888 (0.0863) Published as a conference paper at ICLR 2019 Table 9: Scenario II-b: p p/n > ϵ. Setting: p = 50, ϵ = 0.1, and n from 50 to 1, 000. Other details are the same as above. Q Net n = 50 n = 100 n = 200 n = 500 n = 1000 N(0.2 1p, Ip) JS 1.3934 (0.5692) 1.0055 (0.1040) 0.8373 (0.1335) 0.4781 (0.0677) 0.3213 (0.0401) TV 1.9714 (0.1552) 1.2629 (0.0882) 0.7579 (0.0486) 0.6640 (0.0689) 0.6348 (0.0547) N(0.5 1p, Ip) JS 1.6422 (0.6822) 1.2101 (0.2826) 0.8374 (0.1021) 0.5832 (0.0595) 0.3930 (0.0485) TV 1.9780 (0.2157) 1.2485 (0.0668) 0.8198 (0.0778) 0.7597 (0.0456) 0.7346 (0.0750) N(1p, Ip) JS 1.8427 (0.9633) 1.2179 (0.2782) 1.0147 (0.2170) 0.5586 (0.1013) 0.3639 (0.0464) TV 1.9907 (0.1498) 1.4575 (0.1270) 0.9724 (0.0802) 0.9050 (0.1479) 0.8747 (0.0757) N(5 1p, Ip) JS 2.6392 (1.3877) 1.3966 (0.5370) 0.9633 (0.1383) 0.5360 (0.0808) 0.3265 (0.0336) TV 2.1050 (0.3763) 1.5205 (0.2221) 1.1909 (0.2273) 1.0957 (0.1390) 1.0695 (0.2639) Cauchy(0.5 1p) JS 1.6563 (0.5246) 1.0857 (0.3613) 0.8944 (0.1759) 0.5363 (0.0593) 0.3832 (0.0408) TV 2.1031 (0.2300) 1.1712 (0.1493) 0.6904 (0.0763) 0.6300 (0.0642) 0.5085 (0.0662) N(0.5 1p, Σ) JS 1.2296 (0.3157) 0.7696 (0.0786) 0.5892 (0.0931) 0.5015 (0.0831) 0.4085 (0.0209) TV 1.9243 (0.2079) 1.2217 (0.0681) 0.7939 (0.0688) 0.7033 (0.0414) 0.7125 (0.0490) C PROOFS OF PROPOSITION 2.1 AND PROPOSITION 3.1 In the first example, consider Q = {N(η, Ip) : η Rp} , e Qη = {N(eη, Ip) : eη η r} . In other words, Q is the class of Gaussian location family, and e Qη is taken to be a subset in a local neighborhood of N(η, Ip). Then, with Q = N(η, Ip) and e Q = N(eη, Ip), the event eq(X)/q(X) 1 is equivalent to X eη 2 X η 2. Since eη η r, we can write eη = η + eru for some er R and u Rp that satisfy 0 er r and u = 1. Then, (8) becomes bθ = argmin η Rp sup u =1 0 er r i=1 I u T (Xi η) er P N(0, 1) er Letting r 0, we obtain (9), the exact formula of Tukey s median. The next example is a linear model y|X N(XT θ, 1). Consider the following classes Q = Py,X = Py|XPX : Py|X = N(XT η, 1), η Rp , e Qη = Py,X = Py|XPX : Py|X = N(XT eη, 1), eη η r . Here, Py,X stands for the joint distribution of y and X. The two classes Q and e Q share the same marginal distribution PX and the conditional distributions are specified by N(XT η, 1) and N(XT eη, 1), respectively. Follow the same derivation of Tukey s median, let r 0, and we obtain the exact formula of regression depth (10). 
The last example is on covariance/scatter matrix estimation. For this task, we set $\mathcal{Q} = \{N(0, \Gamma) : \Gamma \in \mathcal{E}_p\}$, where $\mathcal{E}_p$ is the class of all p × p covariance matrices. Inspired by the derivations of Tukey depth and regression depth, it is tempting to choose $\widetilde{\mathcal{Q}}$ in a neighborhood of $N(0, \Gamma)$. However, a naive choice would lead to a definition that is not even Fisher consistent. We propose a rank-one neighborhood, given by

$\widetilde{\mathcal{Q}}_\Gamma = \left\{ N(0, \widetilde\Gamma) : \widetilde\Gamma^{-1} = \Gamma^{-1} + \tilde r uu^T \in \mathcal{E}_p,\ |\tilde r| \le r,\ \|u\| = 1 \right\}.$  (19)

Then, a direct calculation gives

$\left\{ \frac{dN(0,\widetilde\Gamma)}{dN(0,\Gamma)}(X) \ge 1 \right\} = \left\{ \tilde r |u^TX|^2 \le \log(1 + \tilde r u^T\Gamma u) \right\}.$  (20)

Since $\lim_{\tilde r \to 0} \frac{\log(1+\tilde r u^T\Gamma u)}{\tilde r u^T\Gamma u} = 1$, the limiting event of (20) is either $\{|u^TX|^2 \ge u^T\Gamma u\}$ or $\{|u^TX|^2 \le u^T\Gamma u\}$, depending on whether $\tilde r$ tends to zero from the left or from the right. Therefore, with the above $\mathcal{Q}$ and $\widetilde{\mathcal{Q}}_\Gamma$, (8) becomes (11) under the limit $r \to 0$. Even though the definition of (19) is given by a rank-one neighborhood of the inverse covariance matrix, the formula (11) can also be derived with $\widetilde\Gamma^{-1} = \Gamma^{-1} + \tilde r uu^T$ in (19) replaced by $\widetilde\Gamma = \Gamma + \tilde r uu^T$, by applying the Sherman-Morrison formula. A similar formula to (11) in the literature is given by

$\hat\Sigma = \mathop{\rm argmax}_{\Gamma\in\mathcal{E}_p} \inf_{\|u\|=1} \min\left\{ \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^TX_i|^2 \le \beta u^T\Gamma u\},\ 1 - \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^TX_i|^2 \le \beta u^T\Gamma u\} \right\},$  (21)

which is recognized as the maximizer of what is known as the matrix depth function (Zhang, 2002; Chen et al., 2018; Paindaveine & Van Bever, 2017). The $\beta$ in (21) is a scalar defined through the equation $\mathbb{P}(N(0,1) \le \sqrt{\beta}) = 3/4$. It is proved in Chen et al. (2018) that $\hat\Sigma$ achieves the minimax rate under Huber's ϵ-contamination model. While the formula (11) can be derived from TV-Learning with discriminators of the form $\mathbb{I}\left\{\frac{dN(0,\widetilde\Gamma)}{dN(0,\Gamma)}(X) \ge 1\right\}$, a special case of (6), the formula (21) can be derived directly from TV-GAN with discriminators of the form $\mathbb{I}\left\{\frac{dN(0,\beta\widetilde\Gamma)}{dN(0,\beta\Gamma)}(X) \ge 1\right\}$, by following a similar rank-one neighborhood argument. This completes the derivation of Proposition 2.1.

To prove Proposition 3.1, we define

$F(w) = \mathbb{E}_P \log\,{\rm sigmoid}(w^Tg(X)) + \mathbb{E}_Q \log\left(1 - {\rm sigmoid}(w^Tg(X))\right) + \log 4,$

so that $JS_g(P, Q) = \max_{w\in\mathcal{W}} F(w)$. The gradient and Hessian of $F(w)$ are given by

$\nabla F(w) = \mathbb{E}_P\, \frac{e^{-w^Tg(X)}}{1+e^{-w^Tg(X)}}\, g(X) - \mathbb{E}_Q\, \frac{e^{w^Tg(X)}}{1+e^{w^Tg(X)}}\, g(X),$
$\nabla^2 F(w) = -\mathbb{E}_P\, \frac{e^{w^Tg(X)}}{(1+e^{w^Tg(X)})^2}\, g(X)g(X)^T - \mathbb{E}_Q\, \frac{e^{-w^Tg(X)}}{(1+e^{-w^Tg(X)})^2}\, g(X)g(X)^T.$

Therefore, $F(w)$ is concave in $w$, and $\max_{w\in\mathcal{W}} F(w)$ is a convex optimization problem for a convex $\mathcal{W}$. Suppose $JS_g(P, Q) = 0$. Then $\max_{w\in\mathcal{W}} F(w) = 0 = F(0)$, which implies $\nabla F(0) = 0$, and thus we have $\mathbb{E}_P g(X) = \mathbb{E}_Q g(X)$. Now suppose $\mathbb{E}_P g(X) = \mathbb{E}_Q g(X)$, which is equivalent to $\nabla F(0) = 0$. Therefore, $w = 0$ is a stationary point of a concave function, and we have $JS_g(P, Q) = \max_{w\in\mathcal{W}} F(w) = F(0) = 0$.
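The degeneracy described by Proposition 3.1 is easy to see numerically. The following numpy sketch is ours; the choices Q = Unif(−3, 3) and g(x) = x are illustrative assumptions. It estimates F(w) by Monte Carlo for two very different distributions with matched first moments and shows that the maximum is essentially zero, attained near w = 0.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# P = N(0, 1) and Q = Unif(-3, 3) differ as distributions but share E g(X) for g(x) = x.
xP = rng.standard_normal(200_000)
xQ = rng.uniform(-3.0, 3.0, 200_000)

def F(w):
    # Monte Carlo estimate of E_P log sigmoid(w x) + E_Q log(1 - sigmoid(w x)) + log 4.
    return (np.log(sigmoid(w * xP)).mean()
            + np.log(1.0 - sigmoid(w * xQ)).mean() + np.log(4.0))

ws = np.linspace(-2.0, 2.0, 401)
vals = [F(w) for w in ws]
print(max(vals), ws[int(np.argmax(vals))])   # maximum is approximately 0, attained near w = 0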
D PROOFS OF MAIN RESULTS

In this section, we present the proofs of all main theorems in the paper. We first establish some useful lemmas in Section D.1, and then the proofs of the main theorems are given in Section D.2.

D.1 SOME AUXILIARY LEMMAS

Lemma D.1. Given i.i.d. observations $X_1, ..., X_n \sim P$ and the function class $\mathcal{D}$ defined in (13), we have for any $\delta > 0$,

$\sup_{D\in\mathcal{D}} \left| \frac{1}{n}\sum_{i=1}^n D(X_i) - \mathbb{E}D(X) \right| \le C\left(\sqrt{\frac{p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right)$

with probability at least $1-\delta$ for some universal constant $C > 0$.

Proof. Let $f(X_1, ..., X_n) = \sup_{D\in\mathcal{D}} \left|\frac{1}{n}\sum_{i=1}^n D(X_i) - \mathbb{E}D(X)\right|$. It is clear that $f(X_1, ..., X_n)$ satisfies the bounded difference condition. By McDiarmid's inequality (McDiarmid, 1989), we have

$f(X_1, ..., X_n) \le \mathbb{E}f(X_1, ..., X_n) + \sqrt{\frac{\log(1/\delta)}{2n}}$

with probability at least $1-\delta$. Using a standard symmetrization technique (Pollard, 2012), we obtain the following bound that involves the Rademacher complexity,

$\mathbb{E}f(X_1, ..., X_n) \le 2\,\mathbb{E}\sup_{D\in\mathcal{D}} \left|\frac{1}{n}\sum_{i=1}^n \epsilon_i D(X_i)\right|,$  (22)

where $\epsilon_1, ..., \epsilon_n$ are independent Rademacher random variables. The Rademacher complexity can be bounded by Dudley's integral entropy bound, which gives

$\mathbb{E}\sup_{D\in\mathcal{D}} \left|\frac{1}{n}\sum_{i=1}^n \epsilon_i D(X_i)\right| \lesssim \frac{1}{\sqrt{n}}\int_0^2 \sqrt{\log\mathcal{N}(\delta, \mathcal{D}, \|\cdot\|_n)}\, d\delta,$

where $\mathcal{N}(\delta, \mathcal{D}, \|\cdot\|_n)$ is the $\delta$-covering number of $\mathcal{D}$ with respect to the empirical $\ell_2$ distance $\|f-g\|_n = \sqrt{\frac{1}{n}\sum_{i=1}^n (f(X_i)-g(X_i))^2}$. Since the VC-dimension of $\mathcal{D}$ is $O(p)$, we have $\mathcal{N}(\delta, \mathcal{D}, \|\cdot\|_n) \lesssim p(16e/\delta)^{O(p)}$ (see Theorem 2.6.7 of Van Der Vaart & Wellner (1996)). This leads to the bound $\frac{1}{\sqrt{n}}\int_0^2\sqrt{\log\mathcal{N}(\delta,\mathcal{D},\|\cdot\|_n)}\,d\delta \lesssim \sqrt{p/n}$, which gives the desired result.

Lemma D.2. Given i.i.d. observations $X_1, ..., X_n \sim P$ and the function class $\mathcal{D}$ defined in (15), we have for any $\delta > 0$,

$\sup_{D\in\mathcal{D}} \left| \frac{1}{n}\sum_{i=1}^n \log D(X_i) - \mathbb{E}\log D(X) \right| \le C\kappa\left(\sqrt{\frac{p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right)$

with probability at least $1-\delta$ for some universal constant $C > 0$.

Proof. Let $f(X_1, ..., X_n) = \sup_{D\in\mathcal{D}} \left|\frac{1}{n}\sum_{i=1}^n \log D(X_i) - \mathbb{E}\log D(X)\right|$. Since $\sup_{D\in\mathcal{D}}\sup_x|\log(2D(x))| \le \kappa$, we have

$\sup_{x_1,...,x_n,x_i'} |f(x_1, ..., x_n) - f(x_1, ..., x_{i-1}, x_i', x_{i+1}, ..., x_n)| \le \frac{2\kappa}{n}.$

Therefore, by McDiarmid's inequality (McDiarmid, 1989), we have

$f(X_1, ..., X_n) \le \mathbb{E}f(X_1, ..., X_n) + \kappa\sqrt{\frac{2\log(1/\delta)}{n}}$  (23)

with probability at least $1-\delta$. By the same argument as (22), it is sufficient to bound the Rademacher complexity $\mathbb{E}\sup_{D\in\mathcal{D}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\log(2D(X_i))\right|$. Since the function $\psi(x) = \log(2\,{\rm sigmoid}(x))$ has Lipschitz constant 1 and satisfies $\psi(0) = 0$, we have

$\mathbb{E}\sup_{D\in\mathcal{D}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\log(2D(X_i))\right| \le 2\,\mathbb{E}\sup_{\sum_{j\ge 1}|w_j|\le\kappa,\,u_j\in\mathbb{R}^p,\,b_j\in\mathbb{R}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sum_{j\ge 1}w_j\sigma(u_j^TX_i+b_j)\right|,$

which uses Theorem 12 of Bartlett & Mendelson (2002). By Hölder's inequality, we further have

$\mathbb{E}\sup_{\sum_{j\ge 1}|w_j|\le\kappa,\,u_j\in\mathbb{R}^p,\,b_j\in\mathbb{R}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sum_{j\ge 1}w_j\sigma(u_j^TX_i+b_j)\right| \le \kappa\,\mathbb{E}\max_{j\ge 1}\sup_{u_j\in\mathbb{R}^p,\,b_j\in\mathbb{R}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sigma(u_j^TX_i+b_j)\right| = \kappa\,\mathbb{E}\sup_{u\in\mathbb{R}^p,\,b\in\mathbb{R}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sigma(u^TX_i+b)\right|.$

Note that for a monotone function $\sigma : \mathbb{R} \to [0,1]$, the VC-dimension of the class $\{\sigma(u^Tx+b) : u\in\mathbb{R}^p, b\in\mathbb{R}\}$ is $O(p)$. Therefore, by the same Dudley integral entropy bound argument as in the proof of Lemma D.1, we have

$\mathbb{E}\sup_{u\in\mathbb{R}^p,\,b\in\mathbb{R}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sigma(u^TX_i+b)\right| \lesssim \sqrt{\frac{p}{n}},$

which leads to the desired result.

Lemma D.3. Given i.i.d. observations $X_1, ..., X_n \sim N(\theta, I_p)$ and the function class $\mathcal{F}^H_L(\kappa, \tau, B)$, assume $\|\theta\| \le \sqrt{\log p}$ and set $\tau = \sqrt{p\log p}$. We have for any $\delta > 0$,

$\sup_{D\in\mathcal{F}^H_L(\kappa,\tau,B)} \left|\frac{1}{n}\sum_{i=1}^n\log D(X_i) - \mathbb{E}\log D(X)\right| \le C\kappa\left((2B)^{L-1}\sqrt{\frac{p\log p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right)$

with probability at least $1-\delta$ for some universal constant $C > 0$.

Proof. Write $f(X_1, ..., X_n) = \sup_{D\in\mathcal{F}^H_L(\kappa,\tau,B)}\left|\frac{1}{n}\sum_{i=1}^n\log D(X_i) - \mathbb{E}\log D(X)\right|$. Then, the inequality (23) holds with probability at least $1-\delta$, and it is sufficient to analyze the Rademacher complexity. Using the fact that the function $\log(2\,{\rm sigmoid}(x))$ is Lipschitz together with Hölder's inequality, we have

$\mathbb{E}\sup_{D\in\mathcal{F}^H_L(\kappa,\tau,B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\log(2D(X_i))\right|$
$\le 2\,\mathbb{E}\sup_{\|w\|_1\le\kappa,\,\|u_j\|\le\sqrt{2},\,|b_j|\le\tau,\,g_{jh}\in\mathcal{G}^H_{L-1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sum_{j\ge 1}w_j\,{\rm sigmoid}\Big(\sum_h u_{jh}g_{jh}(X_i)+b_j\Big)\right|$
$\le 2\kappa\,\mathbb{E}\sup_{\|u\|\le\sqrt{2},\,|b|\le\tau,\,g_h\in\mathcal{G}^H_{L-1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\,{\rm sigmoid}\Big(\sum_h u_hg_h(X_i)+b\Big)\right|$
$\le 4\kappa\,\mathbb{E}\sup_{\|u\|\le\sqrt{2},\,|b|\le\tau,\,g_h\in\mathcal{G}^H_{L-1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\Big(\sum_h u_hg_h(X_i)+b\Big)\right|$
$\le 8\sqrt{p}\,\kappa\,\mathbb{E}\sup_{g\in\mathcal{G}^H_{L-1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_ig(X_i)\right| + \frac{4\kappa\tau}{\sqrt{n}}.$

Now we use the notation $Z_i = X_i - \theta \sim N(0, I_p)$ for $i = 1, ..., n$, and we bound $\mathbb{E}\sup_{g\in\mathcal{G}^H_{L-1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_ig(Z_i+\theta)\right|$ by induction. Since

$\mathbb{E}\sup_{g\in\mathcal{G}^H_1(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_ig(Z_i+\theta)\right| \lesssim B\,\mathbb{E}\sup_{\|v\|_1\le 1}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_iv^T(Z_i+\theta)\right| \le CB\,\frac{\sqrt{\log p}+\|\theta\|}{\sqrt{n}},$

and, for each $l$,

$\mathbb{E}\sup_{g\in\mathcal{G}^H_{l+1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_ig(Z_i+\theta)\right| \le 2\,\mathbb{E}\sup_{\|v\|_1\le B,\,g_h\in\mathcal{G}^H_l(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sum_h v_hg_h(Z_i+\theta)\right| \le 2B\,\mathbb{E}\sup_{g\in\mathcal{G}^H_l(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_ig(Z_i+\theta)\right|,$

induction gives

$\mathbb{E}\sup_{g\in\mathcal{G}^H_{L-1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_ig(Z_i+\theta)\right| \le C(2B)^{L-1}\,\frac{\sqrt{\log p}+\|\theta\|}{\sqrt{n}}.$

Combining the above inequalities, we get

$\mathbb{E}\sup_{D\in\mathcal{F}^H_L(\kappa,\tau,B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\log(2D(Z_i+\theta))\right| \le C\kappa\left(\sqrt{p}\,(2B)^{L-1}\,\frac{\sqrt{\log p}+\|\theta\|}{\sqrt{n}} + \frac{\tau}{\sqrt{n}}\right),$

which leads to the desired result under the conditions on $\tau$ and $\|\theta\|$.
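To give a numerical sense of the $\sqrt{p/n}$ scaling that drives Lemmas D.1-D.3, the following numpy sketch (ours) computes a crude Monte Carlo lower bound on the Rademacher complexity of sigmoid discriminators sigmoid(w^T x + b) by replacing the supremum over (w, b) with a maximum over randomly sampled pairs; it only illustrates the decay in n at a fixed p and is not a substitute for the Dudley bound.

import numpy as np

def rademacher_lower_bound(n, p, n_dirs=2000, n_rep=20, seed=0):
    # Crude lower bound on E sup_{w,b} |(1/n) sum_i eps_i sigmoid(w^T X_i + b)|:
    # the supremum over (w, b) is replaced by a maximum over random draws.
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_rep):
        X = rng.standard_normal((n, p))
        eps = rng.choice([-1.0, 1.0], size=n)
        W = rng.standard_normal((n_dirs, p))
        b = rng.standard_normal((n_dirs, 1))
        D = 1.0 / (1.0 + np.exp(-(W @ X.T + b)))   # n_dirs x n values of sigmoid(w^T X_i + b)
        vals.append(np.abs(D @ eps / n).max())
    return float(np.mean(vals))

for n in [200, 800, 3200]:
    print(n, rademacher_lower_bound(n, p=20), np.sqrt(20 / n))  # decay in n roughly tracks sqrt(p/n)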
D.2 PROOFS OF MAIN THEOREMS

Proof of Theorem 3.1. We first introduce some notation. Define $F(P, \eta) = \max_{w,b} F_{w,b}(P,\eta)$, where

$F_{w,b}(P,\eta) = \mathbb{E}_P\,{\rm sigmoid}(w^TX+b) - \mathbb{E}_{N(\eta,I_p)}\,{\rm sigmoid}(w^TX+b).$

With this definition, we have $\hat\theta = \mathop{\rm argmin}_\eta F(\mathbb{P}_n, \eta)$, where we use $\mathbb{P}_n$ for the empirical distribution $\frac{1}{n}\sum_{i=1}^n\delta_{X_i}$. We shorthand $N(\eta, I_p)$ by $P_\eta$, and then

$F(P_\theta, \hat\theta) \le F((1-\epsilon)P_\theta + \epsilon Q, \hat\theta) + \epsilon$  (24)
$\le F(\mathbb{P}_n, \hat\theta) + \epsilon + C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (25)
$\le F(\mathbb{P}_n, \theta) + \epsilon + C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (26)
$\le F((1-\epsilon)P_\theta + \epsilon Q, \theta) + \epsilon + 2C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (27)
$\le F(P_\theta, \theta) + 2\epsilon + 2C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (28)
$= 2\epsilon + 2C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right).$  (29)

With probability at least $1-\delta$, the above inequalities hold. We now explain each inequality. Since

$F((1-\epsilon)P_\theta+\epsilon Q, \eta) = \max_{w,b}\left[(1-\epsilon)F_{w,b}(P_\theta,\eta) + \epsilon F_{w,b}(Q,\eta)\right],$

we have

$\sup_\eta |F((1-\epsilon)P_\theta+\epsilon Q,\eta) - F(P_\theta,\eta)| \le \epsilon,$

which implies (24) and (28). The inequalities (25) and (27) are implied by Lemma D.1 and the fact that

$\sup_\eta |F(\mathbb{P}_n,\eta) - F((1-\epsilon)P_\theta+\epsilon Q,\eta)| \le \sup_{w,b}\left|\frac{1}{n}\sum_{i=1}^n{\rm sigmoid}(w^TX_i+b) - \mathbb{E}\,{\rm sigmoid}(w^TX+b)\right|.$

The inequality (26) is a direct consequence of the definition of $\hat\theta$. Finally, it is easy to see that $F(P_\theta,\theta)=0$, which gives (29). In summary, we have derived that with probability at least $1-\delta$,

$F_{w,b}(P_\theta,\hat\theta) \le 2\epsilon + 2C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$

for all $w\in\mathbb{R}^p$ and $b\in\mathbb{R}$. For any $u\in\mathbb{R}^p$ such that $\|u\|=1$, we take $w=u$ and $b=-u^T\theta$, and we have

$f(0) - f(u^T(\theta-\hat\theta)) \le 2\epsilon + 2C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right),$

where $f(t) = \int\frac{1}{1+e^{z+t}}\varphi(z)dz$, with $\varphi(\cdot)$ being the probability density function of $N(0,1)$. It is not hard to see that as long as $|f(t)-f(0)| \le c$ for some sufficiently small constant $c>0$, then $|f(t)-f(0)| \ge c'|t|$ for some constant $c'>0$. This implies

$\|\hat\theta-\theta\| = \sup_{\|u\|=1}|u^T(\hat\theta-\theta)| \le \frac{1}{c'}\sup_{\|u\|=1}\left(f(0)-f(u^T(\theta-\hat\theta))\right) \lesssim \epsilon + \sqrt{p/n} + \sqrt{\log(1/\delta)/n}$

with probability at least $1-\delta$. The proof is complete.

Proof of Theorem 3.2. We continue to use $P_\eta$ to denote $N(\eta, I_p)$. Define $F(P,\eta) = \max_{\|w\|_1\le\kappa,u,b} F_{w,u,b}(P,\eta)$, where

$F_{w,u,b}(P,\eta) = \mathbb{E}_P\log D(X) + \mathbb{E}_{N(\eta,I_p)}\log(1-D(X)) + \log 4, \quad \text{with } D(x) = {\rm sigmoid}\Big(\sum_{j\ge 1}w_j\sigma(u_j^Tx+b_j)\Big).$

Then,

$F(P_\theta,\hat\theta) \le F((1-\epsilon)P_\theta+\epsilon Q,\hat\theta) + 2\kappa\epsilon$  (30)
$\le F(\mathbb{P}_n,\hat\theta) + 2\kappa\epsilon + C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (31)
$\le F(\mathbb{P}_n,\theta) + 2\kappa\epsilon + C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (32)
$\le F((1-\epsilon)P_\theta+\epsilon Q,\theta) + 2\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (33)
$\le F(P_\theta,\theta) + 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (34)
$= 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$

with probability at least $1-\delta$. The inequalities (30)-(34) follow arguments similar to those for (24)-(28). To be specific, (31) and (33) are implied by Lemma D.2, and (32) is a direct consequence of the definition of $\hat\theta$. To see (30) and (34), note that for any $w$ such that $\|w\|_1\le\kappa$, we have

$|\log(2D(X))| \le \Big|\sum_{j\ge 1}w_j\sigma(u_j^TX+b_j)\Big| \le \kappa.$

A similar argument gives the same bound for $|\log(2(1-D(X)))|$. This leads to

$\sup_\eta|F((1-\epsilon)P_\theta+\epsilon Q,\eta) - F(P_\theta,\eta)| \le 2\kappa\epsilon,$  (35)

which further implies (30) and (34). To summarize, we have derived that with probability at least $1-\delta$,

$F_{w,u,b}(P_\theta,\hat\theta) \le 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$

for all $\|w\|_1\le\kappa$, $\|u_j\|\le 1$ and $b_j$. Take $w_1=\kappa$, $w_j=0$ for all $j>1$, $u_1=u$ for some unit vector $u$, and $b_1=-u^T\theta$, and we get

$f_{u^T(\hat\theta-\theta)}(\kappa) \le 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right),$  (36)

where

$f_\delta(t) = \mathbb{E}\log\frac{2}{1+e^{-t\sigma(Z)}} + \mathbb{E}\log\frac{2}{1+e^{t\sigma(Z+\delta)}},$  (37)

with $Z\sim N(0,1)$. Direct calculations give

$f_\delta'(t) = \mathbb{E}\,\frac{e^{-t\sigma(Z)}}{1+e^{-t\sigma(Z)}}\,\sigma(Z) - \mathbb{E}\,\frac{e^{t\sigma(Z+\delta)}}{1+e^{t\sigma(Z+\delta)}}\,\sigma(Z+\delta),$
$f_\delta''(t) = -\mathbb{E}\,\sigma(Z)^2\frac{e^{t\sigma(Z)}}{(1+e^{t\sigma(Z)})^2} - \mathbb{E}\,\sigma(Z+\delta)^2\frac{e^{t\sigma(Z+\delta)}}{(1+e^{t\sigma(Z+\delta)})^2}.$  (38)

Therefore, $f_\delta(0)=0$, $f_\delta'(0)=\frac{1}{2}\left(\mathbb{E}\sigma(Z)-\mathbb{E}\sigma(Z+\delta)\right)$, and $f_\delta''(t)\ge -\frac{1}{2}$. By the inequality $f_\delta(\kappa)\ge f_\delta(0)+\kappa f_\delta'(0)-\kappa^2/4$, we have $\kappa f_\delta'(0)\le f_\delta(\kappa)+\kappa^2/4$. In view of (36), we have

$\int\sigma(z)\varphi(z)dz - \int\sigma(z+u^T(\hat\theta-\theta))\varphi(z)dz \le 8\epsilon + 4C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right) + \kappa/2.$

It is easy to see that for the choices of $\sigma(\cdot)$ considered, $\int\sigma(z)\varphi(z)dz - \int\sigma(z+t)\varphi(z)dz$ is locally linear with respect to $t$. This implies that

$\kappa\|\hat\theta-\theta\| = \kappa\sup_{\|u\|=1}u^T(\hat\theta-\theta) \lesssim \kappa\left(\epsilon + \sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right) + \kappa^2.$

Therefore, with a $\kappa\asymp\sqrt{p/n}+\epsilon$, the proof is complete.
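Before turning to the elliptical case, we note that the estimator analyzed in Theorem 3.2 corresponds to the JS-GAN computed in our experiments by alternating gradient updates. The following PyTorch sketch is a minimal illustration of that computation: the generator is simply the location parameter η, and the discriminator is a one-hidden-layer sigmoid network. The optimizer choices, step sizes, iteration counts and the function name js_gan_mean are ours rather than the tuned settings of Section 5, and the ℓ1 constraint on the output-layer weights is omitted for simplicity.

import torch
import torch.nn as nn

def js_gan_mean(X, hidden=20, steps=2000, d_steps=5, lr=0.02):
    # X: (n, p) tensor of (possibly contaminated) observations.
    # Generator: the location parameter eta of N(eta, I_p), updated to minimize the objective.
    # Discriminator: D(x) = sigmoid(w^T sigmoid(U x + b) + c), updated to maximize the JS lower bound.
    n, p = X.shape
    eta = X.median(dim=0).values.clone().requires_grad_(True)   # coordinatewise median as a robust start
    D = nn.Sequential(nn.Linear(p, hidden), nn.Sigmoid(), nn.Linear(hidden, 1), nn.Sigmoid())
    opt_g = torch.optim.SGD([eta], lr=lr)
    opt_d = torch.optim.SGD(D.parameters(), lr=lr)
    for _ in range(steps):
        for _ in range(d_steps):                      # inner maximization over the discriminator
            fake = eta.detach() + torch.randn(n, p)   # samples from N(eta, I_p)
            loss_d = -(torch.log(D(X)).mean() + torch.log(1.0 - D(fake)).mean())
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        fake = eta + torch.randn(n, p)                # reparametrized samples, differentiable in eta
        loss_g = torch.log(1.0 - D(fake)).mean()      # only this term depends on eta
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return eta.detach()

# Example usage: 20% of the points are shifted outliers; the output should stay near the true mean 0.
# X = torch.cat([torch.randn(800, 10), torch.randn(200, 10) + 5.0])
# print(js_gan_mean(X))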
Proof of Theorem 4.1. We use $P_{\theta,\Sigma,h}$ to denote the elliptical distribution $EC(\theta,\Sigma,h)$. Define $F(P,(\eta,\Gamma,g)) = \max_{\|w\|_1\le\kappa,u,b}F_{w,u,b}(P,(\eta,\Gamma,g))$, where

$F_{w,u,b}(P,(\eta,\Gamma,g)) = \mathbb{E}_P\log D(X) + \mathbb{E}_{EC(\eta,\Gamma,g)}\log(1-D(X)) + \log 4, \quad \text{with } D(x)={\rm sigmoid}\Big(\sum_{j\ge 1}w_j\sigma(u_j^Tx+b_j)\Big).$

Let $P$ be the data generating process that satisfies ${\rm TV}(P,P_{\theta,\Sigma,h})\le\epsilon$. Then there exist probability distributions $Q_1$ and $Q_2$ such that

$P + \epsilon Q_1 = P_{\theta,\Sigma,h} + \epsilon Q_2.$

The explicit construction of $Q_1, Q_2$ is given in the proof of Theorem 5.1 of Chen et al. (2018). This implies that

$|F(P,(\eta,\Gamma,g)) - F(P_{\theta,\Sigma,h},(\eta,\Gamma,g))| \le \sup_{\|w\|_1\le\kappa,u,b}|F_{w,u,b}(P,(\eta,\Gamma,g)) - F_{w,u,b}(P_{\theta,\Sigma,h},(\eta,\Gamma,g))| = \epsilon\sup_{\|w\|_1\le\kappa,u,b}|\mathbb{E}_{Q_2}\log(2D(X)) - \mathbb{E}_{Q_1}\log(2D(X))| \le 2\kappa\epsilon.$  (39)

Then, the same argument as in the proof of Theorem 3.2 (with (35) replaced by (39)) leads to the fact that with probability at least $1-\delta$,

$F_{w,u,b}(P_{\theta,\Sigma,h},(\hat\theta,\hat\Sigma,\hat h)) \le 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$

for all $\|w\|_1\le\kappa$, $\|u_j\|\le 1$ and $b_j$. Take $w_1=\kappa$, $w_j=0$ for all $j>1$, $u_1=u/\sqrt{u^T\hat\Sigma u}$ for some unit vector $u$, and $b_1=-u^T\theta/\sqrt{u^T\hat\Sigma u}$, and we get

$f_{u^T(\hat\theta-\theta)/\sqrt{u^T\hat\Sigma u}}(\kappa) \le 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right),$

where now

$f_\delta(t) = \int\log\frac{2}{1+e^{-t\sigma\left(s\sqrt{u^T\Sigma u/u^T\hat\Sigma u}\right)}}\,h(s)ds + \int\log\frac{2}{1+e^{t\sigma(\delta+s)}}\,\hat h(s)ds,$

with $\delta = u^T(\hat\theta-\theta)/\sqrt{u^T\hat\Sigma u}$. A similar argument to the proof of Theorem 3.2 gives

$\frac{\kappa}{2}\left(\int\sigma\left(s\sqrt{u^T\Sigma u/u^T\hat\Sigma u}\right)h(s)ds - \int\sigma(\delta+s)\hat h(s)ds\right) \le 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right) + \kappa^2/4.$

Since

$\int\sigma\left(s\sqrt{u^T\Sigma u/u^T\hat\Sigma u}\right)h(s)ds = \frac{1}{2} = \int\sigma(s)\hat h(s)ds$

by the symmetry of $h$ and $\hat h$, the above bound is equivalent to

$\frac{\kappa}{2}\left(H(0)-H(\delta)\right) \le 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right) + \kappa^2/4,$

where $H(\delta) = \int\sigma(\delta+s)\hat h(s)ds$. The above bound also holds for $\frac{\kappa}{2}(H(\delta)-H(0))$ by a symmetric argument, and therefore the same bound holds for $\frac{\kappa}{2}|H(\delta)-H(0)|$. Since $H'(0) = \int\sigma(s)(1-\sigma(s))\hat h(s)ds \asymp 1$, $H(\delta)$ is locally linear at $\delta=0$, which leads to the desired bound for $\delta = u^T(\hat\theta-\theta)/\sqrt{u^T\hat\Sigma u}$. Finally, since $u^T\hat\Sigma u\le M$, we get the bound for $u^T(\hat\theta-\theta)$. The proof is complete by taking the supremum over all unit vectors $u$.

Proof of Theorem A.1. We continue to use $P_\eta$ to denote $N(\eta, I_p)$. Define $F(P,\eta) = \sup_{D\in\mathcal{F}^H_L(\kappa,\tau,B)}F_D(P,\eta)$, with

$F_D(P,\eta) = \mathbb{E}_P\log D(X) + \mathbb{E}_{N(\eta,I_p)}\log(1-D(X)) + \log 4.$

Following the same argument as in the proof of Theorem 3.2 and using Lemma D.3, we have

$F_D(P_\theta,\hat\theta) \le C\kappa\left(\epsilon + (2B)^{L-1}\sqrt{\frac{p\log p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right)$

uniformly over $D\in\mathcal{F}^H_L(\kappa,\tau,B)$ with probability at least $1-\delta$. Choose $w_1=\kappa$ and $w_j=0$ for all $j>1$. For any unit vector $\tilde u\in\mathbb{R}^p$, take $u_{1h}=-u_{1(h+p)}=\tilde u_h$ for $h=1,...,p$ and $b_1=-\tilde u^T\theta$. For $h=1,...,p$, set $g_{1h}(x)=\max(x_h,0)$, and for $h=p+1,...,2p$, set $g_{1h}(x)=\max(-x_{h-p},0)$. It is obvious that such $u$ and $b$ satisfy $\sum_h u_{1h}^2 \le 2$ and $|b_1|\le\|\theta\|\le\sqrt{\log p}\le\sqrt{p\log p}=\tau$. We also need to show that both of the functions $\max(x,0)$ and $\max(-x,0)$ are elements of $\mathcal{G}^H_{L-1}(B)$, which can be proved by induction. It is obvious that $\max(x_h,0),\max(-x_h,0)\in\mathcal{G}^H_1(B)$ for any $h=1,...,p$. Suppose $\max(x_h,0),\max(-x_h,0)\in\mathcal{G}^H_l(B)$ for any $h=1,...,p$. Then, since

$\max\big(\max(x_h,0)-\max(-x_h,0),\,0\big)=\max(x_h,0),$
$\max\big(\max(-x_h,0)-\max(x_h,0),\,0\big)=\max(-x_h,0),$

we have $\max(x_h,0),\max(-x_h,0)\in\mathcal{G}^H_{l+1}(B)$ as long as $B\ge 2$. Hence, the above construction satisfies $D(x)={\rm sigmoid}\big(\kappa\,{\rm sigmoid}(\tilde u^T(x-\theta))\big)\in\mathcal{F}^H_L(\kappa,\tau,B)$, and we have

$f_{\tilde u^T(\hat\theta-\theta)}(\kappa) \le C\kappa\left(\epsilon + (2B)^{L-1}\sqrt{\frac{p\log p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right),$

where the definition of $f_\delta(t)$ is given by (37) with $Z\sim N(0,1)$ and $\sigma(\cdot)$ taken to be ${\rm sigmoid}(\cdot)$. Applying an argument similar to that in the proof of Theorem 3.2, we obtain the desired result.
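The induction step in the proof above rests on the identity max(max(x,0) − max(−x,0), 0) = max(x,0) and its mirror image. The following short numpy check (ours, not part of the paper) verifies the identity numerically on a grid.

import numpy as np

relu = lambda t: np.maximum(t, 0.0)
x = np.linspace(-5.0, 5.0, 1001)
# Passing the pair (max(x,0), max(-x,0)) through one more ReLU layer reproduces the pair,
# which is why max(x,0) and max(-x,0) stay in G^H_l(B) for every l when B >= 2.
assert np.allclose(relu(relu(x) - relu(-x)), relu(x))
assert np.allclose(relu(relu(-x) - relu(x)), relu(-x))
print("identity verified on the grid")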