Published as a conference paper at ICLR 2019

ROBUST ESTIMATION AND GENERATIVE ADVERSARIAL NETWORKS

Chao Gao, Department of Statistics, University of Chicago, Chicago, IL 60637, USA. chaogao@galton.uchicago.edu
Jiyi Liu, Department of Statistics and Data Science, Yale University, New Haven, CT 06511, USA. jiyi.liu@yale.edu
Yuan Yao & Weizhi Zhu, Department of Mathematics, Hong Kong University of Science and Technology, Kowloon, Hong Kong. yuany@ust.hk; wzhuai@connect.ust.hk

ABSTRACT

Robust estimation under Huber's $\epsilon$-contamination model has become an important topic in statistics and theoretical computer science. Statistically optimal procedures such as Tukey's median and other estimators based on depth functions are impractical because of their computational intractability. In this paper, we establish an intriguing connection between f-GANs and various depth functions through the lens of f-Learning. Similar to the derivation of f-GANs, we show that the depth functions that lead to statistically optimal robust estimators can all be viewed as variational lower bounds of the total variation distance in the framework of f-Learning. This connection opens the door to computing robust estimators using tools developed for training GANs. In particular, we show in both theory and experiments that appropriate structures of discriminator networks with hidden layers in GANs lead to statistically optimal robust location estimators for both the Gaussian distribution and general elliptical distributions whose first moment may not exist.

1 INTRODUCTION

In the setting of Huber's $\epsilon$-contamination model (Huber, 1964; 1965), one has i.i.d. observations
$$X_1, \ldots, X_n \sim (1-\epsilon)P_\theta + \epsilon Q, \qquad (1)$$
and the goal is to estimate the model parameter $\theta$. Under the data generating process (1), each observation has a $1-\epsilon$ probability of being drawn from $P_\theta$ and the remaining $\epsilon$ probability of being drawn from the contamination distribution $Q$. The presence of an unknown contamination distribution poses both statistical and computational challenges. For example, consider a normal mean estimation problem with $P_\theta = N(\theta, I_p)$. Due to the contamination of data, the sample average, which is optimal when $\epsilon = 0$, can be arbitrarily far away from the true mean if $Q$ charges a positive probability at infinity. Moreover, even robust estimators such as the coordinatewise median and the geometric median are provably suboptimal under the setting of (1) (Chen et al., 2018; Diakonikolas et al., 2016a; Lai et al., 2016). The search for procedures that are both statistically optimal and computationally feasible has become a fundamental problem in areas including statistics and computer science.

For the normal mean estimation problem, it has been shown in Chen et al. (2018) that the minimax rate with respect to the squared $\ell_2$ loss is $\frac{p}{n} \vee \epsilon^2$, and it is achieved by Tukey's median (Tukey, 1975). Despite the statistical optimality of Tukey's median, its computation is not tractable. In fact, even an approximate algorithm takes $O(e^{Cp})$ time (Amenta et al., 2000; Chan, 2004; Rousseeuw & Struyf, 1998).

Recent developments in theoretical computer science have focused on the search for computationally tractable algorithms for estimating $\theta$ under Huber's $\epsilon$-contamination model (1). The success of these efforts started from two fundamental papers, Diakonikolas et al. (2016a) and Lai et al.
(2016), where two different but related computational strategies, iterative filtering and dimension halving, were proposed to robustly estimate the normal mean. These algorithms can provably achieve the minimax rate $\frac{p}{n} \vee \epsilon^2$ up to a poly-logarithmic factor in polynomial time. The main idea behind the two methods is the critical fact that a good robust moment estimator can be certified efficiently by higher moments. This idea was later extended (Diakonikolas et al., 2017; Du et al., 2017; Diakonikolas et al., 2016b; 2018a;c;b; Kothari et al., 2018) to develop robust and computable procedures for various other problems. However, many of the computationally feasible procedures for robust mean estimation in the literature rely on knowledge of the covariance matrix and sometimes of the contamination proportion. Even though these assumptions can be relaxed, nontrivial modifications of the algorithms are required for such extensions and the statistical error rates may also be affected.

Compared with these computationally feasible procedures proposed in the recent literature, Tukey's median (9) and other depth-based estimators (Rousseeuw & Hubert, 1999; Mizera, 2002; Zhang, 2002; Mizera & Müller, 2004; Paindaveine & Van Bever, 2017) have some indispensable advantages in terms of their statistical properties. First, the depth-based estimators have clear objective functions that can be interpreted from the perspective of projection pursuit (Mizera, 2002). Second, the depth-based procedures are adaptive to unknown nuisance parameters in the models such as covariance structures, contamination proportion, and error distributions (Chen et al., 2018; Gao, 2017). Last but not least, Tukey's depth and other depth functions are mostly designed for robust quantile estimation, while the recent advancements in the theoretical computer science literature all focus on robust moment estimation. Although this is not an issue for normal mean estimation, the difference is fundamental for robust estimation under general settings such as elliptical distributions, where moments do not necessarily exist.

Given the desirable statistical properties discussed above, this paper focuses on the development of computational strategies for depth-like procedures. Our key observation is that robust estimators that are maximizers of depth functions, including halfspace depth, regression depth and covariance matrix depth, can all be derived under the framework of f-GAN (Nowozin et al., 2016). As a result, these depth-based estimators can be viewed as minimizers of variational lower bounds of the total variation distance between the empirical measure and the model distribution (Proposition 2.1). This observation allows us to leverage recent developments in the deep learning literature to compute these variational lower bounds through neural network approximations. Our theoretical results give insights on how to choose appropriate neural network classes that lead to minimax optimal robust estimation under Huber's $\epsilon$-contamination model. In particular, Theorems 3.1 and 3.2 characterize the networks that can robustly estimate the Gaussian mean by TV-GAN and JS-GAN, respectively; Theorem 4.1 is an extension to robust location estimation under the class of elliptical distributions, which includes the Cauchy distribution whose mean does not exist. Numerical experiments in Section 5 demonstrate the success of these GANs.
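Throughout the paper, data are generated from the contamination model (1). To make this setting concrete, the following minimal sketch (our own illustration, not part of the original experiments; the helper `sample_huber` and the specific choice of $Q$ are ours) samples from $(1-\epsilon)N(\theta, I_p) + \epsilon Q$ and shows how the sample mean is pulled away by the outliers while the coordinatewise median stays bounded (though, as noted above, it is still suboptimal).

```python
import numpy as np

def sample_huber(n, p, theta, eps, contaminate, rng):
    """Draw n observations from (1 - eps) * N(theta, I_p) + eps * Q."""
    is_outlier = rng.random(n) < eps               # Bernoulli(eps) contamination mask
    X = rng.standard_normal((n, p)) + theta        # inliers from N(theta, I_p)
    X[is_outlier] = contaminate(int(is_outlier.sum()), p, rng)  # outliers from Q
    return X

rng = np.random.default_rng(0)
n, p, eps = 5000, 100, 0.2
theta = np.zeros(p)
# Q = N(5 * 1_p, I_p), one of the contaminations used in Section 5
Q = lambda m, p, rng: rng.standard_normal((m, p)) + 5.0
X = sample_huber(n, p, theta, eps, Q, rng)

print(np.linalg.norm(X.mean(axis=0) - theta))       # inflated: roughly eps * 5 * sqrt(p) = 10
print(np.linalg.norm(np.median(X, axis=0) - theta))  # bounded, but known to be rate-suboptimal
```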
2 ROBUST ESTIMATION AND f-GAN

We start with the definition of f-divergence (Csiszár, 1964; Ali & Silvey, 1966). Given a strictly convex function $f$ that satisfies $f(1) = 0$, the f-divergence between two probability distributions $P$ and $Q$ is defined by
$$D_f(P\|Q) = \int f\left(\frac{p}{q}\right) q. \qquad (2)$$
Here, we use $p(\cdot)$ and $q(\cdot)$ to stand for the density functions of $P$ and $Q$ with respect to some common dominating measure. For a fully rigorous definition, see Polyanskiy & Wu (2017). Let $f^*$ be the convex conjugate of $f$, that is, $f^*(t) = \sup_{u \in \mathrm{dom}_f}(ut - f(u))$. A variational lower bound of (2) is
$$D_f(P\|Q) \ge \sup_{T \in \mathcal{T}} \left[ E_P T(X) - E_Q f^*(T(X)) \right]. \qquad (3)$$
Note that the inequality (3) holds for any class $\mathcal{T}$, and it becomes an equality whenever the class $\mathcal{T}$ contains the function $f'(p/q)$ (Nguyen et al., 2010). For notational simplicity, we also use $f'$ for an arbitrary element of the subdifferential when the derivative does not exist. With i.i.d. observations $X_1, \ldots, X_n \sim P$, the variational lower bound (3) naturally leads to the following learning method
$$\widehat{P} = \mathop{\mathrm{argmin}}_{Q \in \mathcal{Q}} \sup_{T \in \mathcal{T}} \left[ \frac{1}{n}\sum_{i=1}^n T(X_i) - E_Q f^*(T(X)) \right]. \qquad (4)$$
The formula (4) is a powerful and general way to learn the distribution $P$ from its i.i.d. observations. It is known as f-GAN (Nowozin et al., 2016), an extension of GAN (Goodfellow et al., 2014), which stands for generative adversarial networks. The idea is to find a $\widehat{P}$ so that the best discriminator $T$ in the class $\mathcal{T}$ cannot tell the difference between $\widehat{P}$ and the empirical distribution $\frac{1}{n}\sum_{i=1}^n \delta_{X_i}$.

2.1 f-LEARNING: A UNIFIED FRAMEWORK

Our f-Learning framework is based on a special case of the variational lower bound (3). That is,
$$D_f(P\|Q) \ge \sup_{\widetilde{Q} \in \widetilde{\mathcal{Q}}_Q} \left[ E_P f'\left(\frac{\widetilde{q}(X)}{q(X)}\right) - E_Q f^*\left(f'\left(\frac{\widetilde{q}(X)}{q(X)}\right)\right) \right], \qquad (5)$$
where $\widetilde{q}(\cdot)$ stands for the density function of $\widetilde{Q}$. Note that here we allow the class $\widetilde{\mathcal{Q}}_Q$ to depend on the distribution $Q$ in the second argument of $D_f(P\|Q)$. Comparing (5) with (3), it is easy to see that (5) is a special case of (3) with
$$\mathcal{T} = \mathcal{T}_Q = \left\{ f'\left(\frac{\widetilde{q}}{q}\right) : \widetilde{Q} \in \widetilde{\mathcal{Q}}_Q \right\}. \qquad (6)$$
Moreover, the inequality (5) becomes an equality as long as $P \in \widetilde{\mathcal{Q}}_Q$. The sample version of (5) leads to the following learning method
$$\widehat{P} = \mathop{\mathrm{argmin}}_{Q \in \mathcal{Q}} \sup_{\widetilde{Q} \in \widetilde{\mathcal{Q}}_Q} \left[ \frac{1}{n}\sum_{i=1}^n f'\left(\frac{\widetilde{q}(X_i)}{q(X_i)}\right) - E_Q f^*\left(f'\left(\frac{\widetilde{q}(X)}{q(X)}\right)\right) \right]. \qquad (7)$$
The learning method (7) will be referred to as f-Learning in the sequel. It is a very general framework that covers many important learning procedures as special cases. For example, consider the special case where $\widetilde{\mathcal{Q}}_Q = \widetilde{\mathcal{Q}}$ is independent of $Q$, $\mathcal{Q} = \widetilde{\mathcal{Q}}$, and $f(x) = x\log x$. Direct calculations give $f'(x) = \log x + 1$ and $f^*(t) = e^{t-1}$. Therefore, (7) becomes
$$\widehat{P} = \mathop{\mathrm{argmin}}_{Q \in \mathcal{Q}} \sup_{\widetilde{Q} \in \mathcal{Q}} \frac{1}{n}\sum_{i=1}^n \log\frac{\widetilde{q}(X_i)}{q(X_i)} = \mathop{\mathrm{argmax}}_{q \in \mathcal{Q}} \frac{1}{n}\sum_{i=1}^n \log q(X_i),$$
which is the maximum likelihood estimator (MLE).

2.2 TV-LEARNING AND DEPTH-BASED ESTIMATORS

An important generator $f$ that we discuss here is $f(x) = (x-1)_+$. This leads to the total variation distance $D_f(P\|Q) = \frac{1}{2}\int |p - q|$. With $f'(x) = \mathbb{I}\{x \ge 1\}$ and $f^*(t) = t\,\mathbb{I}\{0 \le t \le 1\}$, TV-Learning is given by
$$\widehat{P} = \mathop{\mathrm{argmin}}_{Q \in \mathcal{Q}} \sup_{\widetilde{Q} \in \widetilde{\mathcal{Q}}_Q} \left[ \frac{1}{n}\sum_{i=1}^n \mathbb{I}\left\{\frac{\widetilde{q}(X_i)}{q(X_i)} \ge 1\right\} - Q\left(\frac{\widetilde{q}}{q} \ge 1\right) \right]. \qquad (8)$$
A closely related idea was previously explored by Yatracos (1985); Devroye & Lugosi (2012). The following proposition shows that when $\widetilde{\mathcal{Q}}_Q$ approaches $Q$ in some neighborhood, TV-Learning leads to robust estimators that are defined as the maximizers of various depth functions, including Tukey's depth, regression depth, and covariance depth.

Proposition 2.1. The TV-Learning (8) includes the following special cases:

1. Tukey's halfspace depth: Take $\mathcal{Q} = \{N(\eta, I_p) : \eta \in \mathbb{R}^p\}$ and $\widetilde{\mathcal{Q}}_\eta = \{N(\widetilde{\eta}, I_p) : \|\widetilde{\eta} - \eta\| \le r\}$. As $r \to 0$, (8) becomes
$$\widehat{\theta} = \mathop{\mathrm{argmax}}_{\eta \in \mathbb{R}^p} \inf_{\|u\|=1} \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{u^T(X_i - \eta) \ge 0\} \qquad (9)$$
(a numerical sketch of this objective is given right after the proposition).

2. Regression depth: Take $\mathcal{Q} = \{P_{y,X} = P_{y|X}P_X : P_{y|X} = N(X^T\eta, 1), \eta \in \mathbb{R}^p\}$ and $\widetilde{\mathcal{Q}}_\eta = \{P_{y,X} = P_{y|X}P_X : P_{y|X} = N(X^T\widetilde{\eta}, 1), \|\widetilde{\eta} - \eta\| \le r\}$. As $r \to 0$, (8) becomes
$$\widehat{\theta} = \mathop{\mathrm{argmax}}_{\eta \in \mathbb{R}^p} \inf_{\|u\|=1} \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{u^T X_i (y_i - X_i^T\eta) \ge 0\}. \qquad (10)$$

3. Covariance matrix depth: Take $\mathcal{Q} = \{N(0, \Gamma) : \Gamma \in \mathcal{E}_p\}$, where $\mathcal{E}_p$ stands for the class of $p \times p$ covariance matrices, and $\widetilde{\mathcal{Q}}_\Gamma = \{N(0, \widetilde{\Gamma}) : \widetilde{\Gamma}^{-1} = \Gamma^{-1} + \widetilde{r}uu^T \in \mathcal{E}_p,\ |\widetilde{r}| \le r,\ \|u\| = 1\}$. As $r \to 0$, (8) becomes
$$\widehat{\Sigma} = \mathop{\mathrm{argmin}}_{\Gamma \in \mathcal{E}_p} \sup_{\|u\|=1} \left[ \left( \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^TX_i|^2 \le u^T\Gamma u\} - \mathbb{P}(\chi^2_1 \le 1) \right) \vee \left( \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^TX_i|^2 > u^T\Gamma u\} - \mathbb{P}(\chi^2_1 > 1) \right) \right]. \qquad (11)$$
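The inner objective of (9) involves an infimum over all unit directions $u$, which is exactly what makes exact computation hard. The sketch below is our own illustration (the name `tukey_depth_objective` and the Monte Carlo approximation of the infimum by randomly sampled directions are not from the paper); it only approximates the infimum, but it shows what the objective rewards: points deep inside every one-dimensional projection of the data.

```python
import numpy as np

def tukey_depth_objective(X, eta, n_directions=2000, rng=None):
    """Monte Carlo approximation of inf_{||u||=1} (1/n) sum_i I{u^T (X_i - eta) >= 0},
    the inner objective of (9); the true infimum is over all unit vectors u."""
    rng = rng or np.random.default_rng(0)
    U = rng.standard_normal((n_directions, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)    # random unit directions
    proj = (X - eta) @ U.T                           # shape (n, n_directions)
    depth_per_direction = (proj >= 0).mean(axis=0)   # fraction of points on one side of each hyperplane
    return depth_per_direction.min()

# A deep point (near the Tukey median) has objective close to 1/2;
# a point far outside the data cloud has objective close to 0.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
print(tukey_depth_objective(X, np.zeros(5), rng=rng))     # close to 0.5
print(tukey_depth_objective(X, 5 * np.ones(5), rng=rng))  # close to 0.0
```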
The formula (9) is recognized as Tukey's median, the maximizer of Tukey's halfspace depth. A traditional understanding of Tukey's median is that (9) maximizes the halfspace depth (Donoho & Gasko, 1992), so that $\widehat{\theta}$ is close to the centers of all one-dimensional projections of the data. In the f-Learning framework, $N(\widehat{\theta}, I_p)$ is understood to be the minimizer of a variational lower bound of the total variation distance. The formula (10) gives the estimator that maximizes the regression depth proposed by Rousseeuw & Hubert (1999). It is worth noting that the derivation of (10) does not depend on the marginal distribution $P_X$ in the linear regression model. Finally, (11) is related to the covariance matrix depth (Zhang, 2002; Chen et al., 2018; Paindaveine & Van Bever, 2017). All of the estimators (9), (10) and (11) are proved to achieve the minimax rate for the corresponding problems under Huber's $\epsilon$-contamination model (Chen et al., 2018; Gao, 2017).

2.3 FROM f-LEARNING TO f-GAN

The connection to various depth functions shows the importance of TV-Learning in robust estimation. However, it is well known that depth-based estimators are very hard to compute (Amenta et al., 2000; van Kreveld et al., 1999; Rousseeuw & Struyf, 1998), which limits their applications to very low-dimensional problems. On the other hand, the general f-GAN framework (4) has been successfully applied to learn complex distributions and images in practice (Goodfellow et al., 2014; Radford et al., 2015; Salimans et al., 2016). The major difference that gives the computational advantage to f-GAN is its flexibility in designing the discriminator class $\mathcal{T}$ using neural networks, compared with the pre-specified choice (6) in f-Learning. While f-Learning provides a unified perspective for understanding various depth-based procedures in robust estimation, we can step back into the more general f-GAN for its computational advantages and design efficient computational strategies.

3 ROBUST MEAN ESTIMATION VIA GAN

In this section, we focus on the problem of robust mean estimation under Huber's $\epsilon$-contamination model. Our goal is to reveal how the choice of the class of discriminators affects robustness and statistical optimality under the simplest possible setting. That is, we have i.i.d. observations $X_1, \ldots, X_n \sim (1-\epsilon)N(\theta, I_p) + \epsilon Q$, and we need to estimate the unknown location $\theta \in \mathbb{R}^p$ from the contaminated data. Our goal is to achieve the minimax rate $\frac{p}{n} \vee \epsilon^2$ with respect to the squared $\ell_2$ loss uniformly over all $\theta \in \mathbb{R}^p$ and all $Q$.

Figure 1: Landscape of the TV-GAN objective function $F(\eta, w) = \sup_b\left[E_P\,\mathrm{sigmoid}(wX + b) - E_{N(\eta,1)}\,\mathrm{sigmoid}(wX + b)\right]$, where $b$ is maximized out for visualization. Samples are drawn from $P = (1-\epsilon)N(1, 1) + \epsilon N(10, 1)$ with $\epsilon = 0.2$. Left: a surface plot of $F(\eta, w)$. The solid curves are marginal functions for fixed $\eta$'s: $F(1, w)$ (red) and $F(5, w)$ (blue), and the dashed curves are marginal functions for fixed $w$'s: $F(\eta, 10)$ (orange) and $F(\eta, -10)$ (green).
Right: a heatmap of $F(\eta, w)$. It is clear that $F(w) = F(\eta, w)$ has two local maxima for a given $\eta$, achieved at $w = +\infty$ and $w = -\infty$. In fact, the global maximum of $F(w)$ has a phase transition from $w = +\infty$ to $w = -\infty$ as $\eta$ grows. For example, the maximum is achieved at $w = +\infty$ when $\eta = 1$ (blue solid) and at $w = -\infty$ when $\eta = 5$ (red solid). Unfortunately, even if we initialize with $\eta_0 = 1$ and $w_0 > 0$, gradient ascent on $\eta$ will only increase the value of $\eta$ (green dash); thus, as long as the discriminator cannot reach the global maximizer, $w$ will be stuck in the positive half space $\{w : w > 0\}$ and further increase the value of $\eta$.

3.1 RESULTS FOR TV-GAN

We start with the total variation GAN (TV-GAN) with $f(x) = (x-1)_+$ in (4). For the Gaussian location family, (4) can be written as
$$\widehat{\theta} = \mathop{\mathrm{argmin}}_{\eta \in \mathbb{R}^p} \max_{D \in \mathcal{D}} \left[ \frac{1}{n}\sum_{i=1}^n D(X_i) - E_{N(\eta, I_p)} D(X) \right], \qquad (12)$$
with $T(x) = D(x)$ in (4). Now we need to specify the class of discriminators $\mathcal{D}$ to solve the classification problem between $N(\eta, I_p)$ and the empirical distribution $\frac{1}{n}\sum_{i=1}^n \delta_{X_i}$. One of the simplest discriminator classes is logistic regression,
$$\mathcal{D} = \left\{ D(x) = \mathrm{sigmoid}(w^Tx + b) : w \in \mathbb{R}^p, b \in \mathbb{R} \right\}. \qquad (13)$$
With $D(x) = \mathrm{sigmoid}(w^Tx + b) = (1 + e^{-w^Tx - b})^{-1}$ in (13), the procedure (12) can be viewed as a smoothed version of TV-Learning (8). To be specific, the sigmoid function $\mathrm{sigmoid}(w^Tx + b)$ tends to an indicator function as $\|w\| \to \infty$, which leads to a procedure very similar to (9). In fact, the class (13) is richer than the one used in (9), and thus (12) can be understood as the minimizer of a sharper variational lower bound than that of (9).

Theorem 3.1. Assume $\frac{p}{n} + \epsilon^2 \le c$ for some sufficiently small constant $c > 0$. With i.i.d. observations $X_1, \ldots, X_n \sim (1-\epsilon)N(\theta, I_p) + \epsilon Q$, the estimator $\widehat{\theta}$ defined by (12) satisfies
$$\|\widehat{\theta} - \theta\|^2 \le C\left(\frac{p}{n} \vee \epsilon^2\right)$$
with probability at least $1 - e^{-C'(p + n\epsilon^2)}$ uniformly over all $\theta \in \mathbb{R}^p$ and all $Q$. The constants $C, C' > 0$ are universal.

Though TV-GAN can achieve the minimax rate $\frac{p}{n} \vee \epsilon^2$ under Huber's contamination model, it may suffer from optimization difficulties, especially when the distributions $Q$ and $N(\theta, I_p)$ are far away from each other, as shown in Figure 1.

3.2 RESULTS FOR JS-GAN

Given the intractable optimization property of TV-GAN, we next turn to the Jensen-Shannon GAN (JS-GAN) with $f(x) = x\log x - (x+1)\log\frac{x+1}{2}$. The estimator is defined by
$$\widehat{\theta} = \mathop{\mathrm{argmin}}_{\eta \in \mathbb{R}^p} \max_{D \in \mathcal{D}} \left[ \frac{1}{n}\sum_{i=1}^n \log D(X_i) + E_{N(\eta, I_p)} \log(1 - D(X)) \right] + \log 4, \qquad (14)$$
with $T(x) = \log D(x)$ in (4). This is exactly the original GAN (Goodfellow et al., 2014) specialized to the normal mean estimation problem. The advantages of JS-GAN over other forms of GAN have been studied extensively in the literature (Lucic et al., 2017; Kurach et al., 2018).

Unlike TV-GAN, our experimental results show that (14) with the logistic regression discriminator class (13) is not robust to contamination. However, if we replace (13) by a neural network class with one or more hidden layers, the estimator will be robust and will also work very well numerically. To understand why and how the class of discriminators affects the robustness property of JS-GAN, we introduce a new concept called restricted Jensen-Shannon divergence. Let $g: \mathbb{R}^p \to \mathbb{R}^d$ be a function that maps a $p$-dimensional observation to a $d$-dimensional feature space. The restricted Jensen-Shannon divergence between two probability distributions $P$ and $Q$ with respect to the feature $g$ is defined as
$$JS_g(P, Q) = \max_{w \in \mathcal{W}} \left[ E_P \log\,\mathrm{sigmoid}(w^Tg(X)) + E_Q \log(1 - \mathrm{sigmoid}(w^Tg(X))) \right] + \log 4.$$
In other words, $P$ and $Q$ are distinguished by a logistic regression classifier that uses the feature $g(X)$. It is easy to see that $JS_g(P, Q)$ is a variational lower bound of the original Jensen-Shannon divergence.
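The quantity $JS_g$ can be estimated from two samples by fitting exactly this logistic regression. The sketch below is our own illustration (the name `estimate_js_g` is ours, scikit-learn is used as the classifier, weak regularization stands in for the maximization over all $w$, and the intercept corresponds to appending a constant coordinate to $g$); it previews the moment-matching behavior formalized in Proposition 3.1 below: a linear feature cannot distinguish two distributions with equal means, while a nonlinear feature can.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_js_g(samples_p, samples_q, g):
    """Plug-in estimate of JS_g(P, Q): fit a logistic regression on the feature g(X)
    to separate the two samples, then evaluate the objective defining JS_g."""
    fp, fq = g(samples_p), g(samples_q)
    features = np.vstack([fp, fq])
    labels = np.concatenate([np.ones(len(fp)), np.zeros(len(fq))])
    clf = LogisticRegression(C=1e4, max_iter=1000).fit(features, labels)
    d_p = clf.predict_proba(fp)[:, 1]   # D(X) = sigmoid(w^T g(X) + b) on P-samples
    d_q = clf.predict_proba(fq)[:, 1]   # and on Q-samples
    return np.mean(np.log(d_p)) + np.mean(np.log(1.0 - d_q)) + np.log(4.0)

rng = np.random.default_rng(0)
# Two distributions with the same mean but different spread.
P = rng.standard_normal((20000, 1))
Q = 3.0 * rng.standard_normal((20000, 1))

print(estimate_js_g(P, Q, lambda x: x))                     # linear feature: approximately 0
print(estimate_js_g(P, Q, lambda x: np.hstack([x, x**2])))  # nonlinear feature: clearly positive
```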
The key property of $JS_g(P, Q)$ is given by the following proposition.

Proposition 3.1. Assume $\mathcal{W}$ is a convex set that contains an open neighborhood of $0$. Then, $JS_g(P, Q) = 0$ if and only if $E_P g(X) = E_Q g(X)$.

The proposition asserts that $JS_g(\cdot, \cdot)$ cannot distinguish $P$ and $Q$ if the feature $g(X)$ has the same expected value under the two distributions. This generalized moment matching effect has also been studied by Liu et al. (2017) for general f-GANs. However, the linear discriminator class considered in Liu et al. (2017) is parametrized in a different way compared with the discriminator class here.

When we apply Proposition 3.1 to robust mean estimation, the JS-GAN is trying to match the values of $\frac{1}{n}\sum_{i=1}^n g(X_i)$ and $E_{N(\eta, I_p)}g(X)$ for the feature $g(X)$ used in the logistic regression classifier. This explains what we observed in our numerical experiments. A neural net without any hidden layer is equivalent to a logistic regression with a linear feature $g(X) = (X^T, 1)^T \in \mathbb{R}^{p+1}$. Therefore, whenever $\eta = \frac{1}{n}\sum_{i=1}^n X_i$, we have $JS_g\left(\frac{1}{n}\sum_{i=1}^n \delta_{X_i}, N(\eta, I_p)\right) = 0$, which implies that the sample mean is a global minimizer of (14). On the other hand, a neural net with at least one hidden layer involves a nonlinear feature function $g(X)$, which is the key that leads to the robustness of (14). We will show rigorously that a neural net with one hidden layer is sufficient to make (14) robust and optimal. Consider the following class of discriminators,
$$\mathcal{D} = \left\{ D(x) = \mathrm{sigmoid}\left(\sum_{j \ge 1} w_j \sigma(u_j^Tx + b_j)\right) : \sum_{j \ge 1} |w_j| \le \kappa,\ u_j \in \mathbb{R}^p,\ b_j \in \mathbb{R} \right\}. \qquad (15)$$
The class (15) consists of two-layer neural network functions. While the dimension of the input layer is $p$, the dimension of the hidden layer can be arbitrary, as long as the weights have a bounded $\ell_1$ norm. The nonlinear activation function $\sigma(\cdot)$ is allowed to be 1) indicator: $\sigma(x) = \mathbb{I}\{x \ge 1\}$, 2) sigmoid: $\sigma(x) = \frac{1}{1+e^{-x}}$, or 3) ramp: $\sigma(x) = \max(\min(x + 1/2, 1), 0)$. Other bounded activation functions are also possible, but we do not list them exhaustively. The rectified linear unit (ReLU) will be studied in Appendix A.

Theorem 3.2. Consider the estimator $\widehat{\theta}$ defined by (14) with $\mathcal{D}$ specified by (15). Assume $\frac{p}{n} + \epsilon^2 \le c$ for some sufficiently small constant $c > 0$, and set $\kappa = O\left(\sqrt{\frac{p}{n}} + \epsilon\right)$. With i.i.d. observations $X_1, \ldots, X_n \sim (1-\epsilon)N(\theta, I_p) + \epsilon Q$, we have
$$\|\widehat{\theta} - \theta\|^2 \le C\left(\frac{p}{n} \vee \epsilon^2\right)$$
with probability at least $1 - e^{-C'(p + n\epsilon^2)}$ uniformly over all $\theta \in \mathbb{R}^p$ and all $Q$. The constants $C, C' > 0$ are universal.

4 ELLIPTICAL DISTRIBUTIONS

An advantage of Tukey's median (9) is that it leads to optimal robust location estimation under general elliptical distributions such as the Cauchy distribution, whose mean does not exist. In this section, we show that JS-GAN shares the same property. A random vector $X \in \mathbb{R}^p$ follows an elliptical distribution if it admits a representation
$$X = \theta + \xi AU,$$
where $U$ is uniformly distributed on the unit sphere $\{u \in \mathbb{R}^p : \|u\| = 1\}$ and $\xi \ge 0$ is a random variable independent of $U$ that determines the shape of the elliptical distribution (Fang, 2017). The center and the scatter matrix are $\theta$ and $\Sigma = AA^T$. For a unit vector $v$, let the density function of $\xi v^TU$ be $h$. Note that $h$ does not depend on $v$ because of the symmetry of $U$. Then, there is a one-to-one relation between the distribution of $\xi$ and $h$, and thus the triplet $(\theta, \Sigma, h)$ fully parametrizes an elliptical distribution.
Note that $h$ and $\Sigma = AA^T$ are not identifiable, because $\xi A = (c\xi)(c^{-1}A)$ for any $c > 0$. Therefore, without loss of generality, we can restrict $h$ to be a member of the following class
$$\mathcal{H} = \left\{ h : h(t) = h(-t),\ h \ge 0,\ \int h = 1,\ \int \sigma(t)(1-\sigma(t))h(t)\,dt = 1 \right\}.$$
This makes the parametrization $(\theta, \Sigma, h)$ of an elliptical distribution fully identifiable, and we use $EC(\theta, \Sigma, h)$ to denote an elliptical distribution parametrized in this way. The JS-GAN estimator is defined as
$$(\widehat{\theta}, \widehat{\Sigma}, \widehat{h}) = \mathop{\mathrm{argmin}}_{\eta \in \mathbb{R}^p,\, \Gamma \in \mathcal{E}_p(M),\, g \in \mathcal{H}}\ \max_{D \in \mathcal{D}} \left[ \frac{1}{n}\sum_{i=1}^n \log D(X_i) + E_{EC(\eta, \Gamma, g)} \log(1 - D(X)) \right] + \log 4, \qquad (16)$$
where $\mathcal{E}_p(M)$ is the set of all positive semi-definite matrices with spectral norm bounded by $M$.

Theorem 4.1. Consider the estimator $\widehat{\theta}$ defined above with $\mathcal{D}$ specified by (15). Assume $M = O(1)$, $\frac{p}{n} + \epsilon^2 \le c$ for some sufficiently small constant $c > 0$, and set $\kappa = O\left(\sqrt{\frac{p}{n}} + \epsilon\right)$. With i.i.d. observations $X_1, \ldots, X_n \sim (1-\epsilon)EC(\theta, \Sigma, h) + \epsilon Q$, we have
$$\|\widehat{\theta} - \theta\|^2 \le C\left(\frac{p}{n} \vee \epsilon^2\right)$$
with probability at least $1 - e^{-C'(p + n\epsilon^2)}$ uniformly over all $\theta \in \mathbb{R}^p$, $\Sigma \in \mathcal{E}_p(M)$ and all $Q$. The constants $C, C' > 0$ are universal.

Remark 4.1. The result of Theorem 4.1 also holds (and is proved) under the strong contamination model (Diakonikolas et al., 2016a). That is, we have i.i.d. observations $X_1, \ldots, X_n \sim P$ for some $P$ satisfying $TV(P, EC(\theta, \Sigma, h)) \le \epsilon$. See its proof in Appendix D.2.

Note that Theorem 4.1 guarantees the same convergence rate as in the Gaussian case for all elliptical distributions. This even includes the multivariate Cauchy, whose mean does not exist. Therefore, the location estimator (16) is fundamentally different from Diakonikolas et al. (2016a); Lai et al. (2016), which are designed only for robust mean estimation. We will show such a difference in our numerical results.

To achieve rate-optimality for robust location estimation under general elliptical distributions, the estimator (16) differs from (14) only in the generator class; they share the same discriminator class (15). This underlines an important principle for designing GAN estimators: the overall statistical complexity of the estimator is determined only by the discriminator class. The estimator (16) also outputs $(\widehat{\Sigma}, \widehat{h})$, but we do not claim any theoretical property for $(\widehat{\Sigma}, \widehat{h})$ in this paper. This will be systematically studied in a future project.

5 NUMERICAL EXPERIMENTS

In this section, we give extensive numerical studies of robust mean estimation via GAN. After introducing the implementation details in Section 5.1, we verify our theoretical results on minimax estimation with both TV-GAN and JS-GAN in Section 5.2. Comparison with other methods for robust mean estimation in the literature is given in Section 5.3. The effects of various network structures are studied in Section 5.4. Adaptation to unknown covariance is studied in Section 5.5. In all these cases, we assume i.i.d. observations are drawn from $(1-\epsilon)N(0_p, I_p) + \epsilon Q$ with $\epsilon$ and $Q$ to be specified. Finally, adaptation to elliptical distributions is studied in Section 5.6.

5.1 IMPLEMENTATIONS

We adopt the standard algorithmic framework of f-GANs (Nowozin et al., 2016) for the implementation of JS-GAN and TV-GAN for robust mean estimation. In particular, the generator for mean estimation is $G_\eta(Z) = Z + \eta$ with $Z \sim N(0_p, I_p)$; the discriminator $D$ is a multilayer perceptron (MLP), where each layer consists of a linear map followed by a sigmoid activation function, and the number of nodes varies across the experiments specified below.
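A condensed PyTorch sketch of this setup is given below. It is our own simplification of Algorithm 1 in Appendix B.1 (the name `js_gan_mean`, the hardcoded batch size, and the omission of the final averaging of $\eta$ over the last epochs are ours; learning rates, inner discriminator steps, and the coordinatewise-median initialization roughly follow Table 5 and Algorithm 1), not the released implementation.

```python
import torch
import torch.nn as nn

def js_gan_mean(X, hidden=20, epochs=150, d_steps=5, lr_d=0.2, lr_g=0.02, batch=500):
    """Minimal JS-GAN for robust mean estimation: generator G_eta(Z) = Z + eta and a
    one-hidden-layer sigmoid discriminator, alternating the updates of Algorithm 1."""
    n, p = X.shape
    D = nn.Sequential(nn.Linear(p, hidden), nn.Sigmoid(), nn.Linear(hidden, 1), nn.Sigmoid())
    eta = X.median(dim=0).values.clone().requires_grad_(True)  # coordinatewise-median init
    opt_d = torch.optim.SGD(D.parameters(), lr=lr_d)
    opt_g = torch.optim.SGD([eta], lr=lr_g)
    for _ in range(epochs):
        for _ in range(d_steps):  # inner maximization over D of the objective in (14)
            idx = torch.randint(0, n, (batch,))
            real = X[idx]
            fake = (torch.randn(batch, p) + eta).detach()
            loss_d = -(torch.log(D(real) + 1e-6).mean()
                       + torch.log(1.0 - D(fake) + 1e-6).mean())
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        fake = torch.randn(batch, p) + eta  # outer minimization over eta
        loss_g = torch.log(1.0 - D(fake) + 1e-6).mean()
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return eta.detach()

# Usage on contaminated data: X ~ 0.8 * N(0, I_p) + 0.2 * N(5 * 1_p, I_p)
torch.manual_seed(0)
p = 100
X = torch.cat([torch.randn(4000, p), torch.randn(1000, p) + 5.0])
print(js_gan_mean(X).norm())  # should be small; the sample mean is off by about 0.2 * 5 * sqrt(p) = 10
```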
Details related to the algorithms, tuning, critical hyper-parameters, structures of the discriminator networks, and other training tricks for stabilization and acceleration are discussed in Appendix B.1. A PyTorch implementation is available at https://github.com/zhuwzh/Robust-GAN-Center.

5.2 NUMERICAL SUPPORTS FOR THE MINIMAX RATES

We verify the minimax rates achieved by TV-GAN (Theorem 3.1) and JS-GAN (Theorem 3.2) via numerical experiments. The two main scenarios we consider are $\sqrt{p/n} < \epsilon$ and $\sqrt{p/n} > \epsilon$, where in both cases various types of contamination distributions $Q$ are considered. Specifically, the choices of contamination distributions $Q$ include $N(\mu \mathbf{1}_p, I_p)$ with $\mu$ ranging over $\{0.2, 0.5, 1, 5\}$, $N(0.5\,\mathbf{1}_p, \Sigma)$ and $\mathrm{Cauchy}(\tau \mathbf{1}_p)$. Details of the construction of the covariance matrix $\Sigma$ are given in Appendix B.2. The distribution $\mathrm{Cauchy}(\tau \mathbf{1}_p)$ is obtained by combining $p$ independent one-dimensional standard Cauchy components with location parameters $\tau_j = 0.5$.

Figure 2: $\ell_2$ error $\|\widehat{\theta} - \theta\|$ against $\epsilon$ (left: $p = 100$, $n = 50{,}000$, and $\epsilon$ ranges from 0.05 to 0.20), $p$ (middle: $n = 1{,}000$, $\epsilon = 0.1$, and $p$ ranges from 10 to 100) and $1/\sqrt{n}$ (right: $p = 50$, $\epsilon = 0.1$, and $n$ ranges from 50 to 1,000), respectively. Net structure: one hidden layer with 20 hidden units (JS-GAN), zero hidden layers (TV-GAN). The vertical bars indicate standard deviations.

The main experimental results are summarized in Figure 2, where the $\ell_2$ error we present is the maximum error over all choices of $Q$; detailed numerical results can be found in Tables 7, 8 and 9 in the Appendix. We separately explore the relation between the error and each of $\epsilon$, $p$ and $1/\sqrt{n}$ with the other two parameters fixed. The study of the relation between the $\ell_2$ error and $\epsilon$ is in the regime $\sqrt{p/n} < \epsilon$, so that $\epsilon$ dominates the minimax rate. The scenario $\sqrt{p/n} > \epsilon$ is considered in the study of the effects of $p$ and $1/\sqrt{n}$. As shown in Figure 2, the errors are approximately linear in the corresponding parameters in all cases, which empirically verifies the conclusions of Theorem 3.1 and Theorem 3.2.

5.3 COMPARISONS WITH OTHER METHODS

We perform additional experiments to compare with other methods, including dimension halving (Lai et al., 2016) and iterative filtering (Diakonikolas et al., 2017), under various settings. We emphasize that our method does not require any knowledge of nuisance parameters such as the contamination proportion $\epsilon$. Tuning GAN is only a matter of optimization, and one can tune parameters based on the objective function alone.

Table 1: Comparison of various robust mean estimation methods. Net structure: one-hidden-layer network with 20 hidden units when $n = 50{,}000$ and 2 hidden units when $n = 5{,}000$. The number in each cell is the average $\ell_2$ error $\|\widehat{\theta} - \theta\|$ with standard deviation in parentheses, estimated from 10 repeated experiments; the smallest error among the four methods is highlighted in bold.
Q n p ϵ TV-GAN JS-GAN Dimension Halving Iterative Filtering N(0.5 1p, Ip) 50,000 100 .2 0.0953 (0.0064) 0.1144 (0.0154) 0.3247 (0.0058) 0.1472 (0.0071) N(0.5 1p, Ip) 5,000 100 .2 0.1941 (0.0173) 0.2182 (0.0527) 0.3568 (0.0197) 0.2285 (0.0103) N(0.5 1p, Ip) 50,000 200 .2 0.1108 (0.0093) 0.1573 (0.0815) 0.3251 (0.0078) 0.1525 (0.0045) N(0.5 1p, Ip) 50,000 100 .05 0.0913 (0.0527) 0.1390 (0.0050) 0.0814 (0.0056) 0.0530 (0.0052) N(5 1p, Ip) 50,000 100 .2 2.7721 (0.1285) 0.0534 (0.0041) 0.3229 (0.0087) 0.1471 (0.0059) N(0.5 1p, Σ) 50,000 100 .2 0.1189 (0.0195) 0.1148 (0.0234) 0.3241 (0.0088) 0.1426 (0.0113) Cauchy(0.5 1p) 50,000 100 .2 0.0738 (0.0053) 0.0525 (0.0029) 0.1045 (0.0071) 0.0633 (0.0042) Table 1 shows the performances of JS-GAN, TV-GAN, dimension halving, and iterative filtering. The network structure, for both JS-GAN and TV-GAN, has one hidden layer with 20 hidden units when the sample size is 50,000 and 2 hidden units when sample size is 5,000. The critical hyper-parameters we apply is given in Appendix and it turns out that the choice of the hyper-parameter is robust against different models when the net structures are the same. To summarize, our method outperforms other algorithms in most cases. TV-GAN is good at cases when Q and N(0p, Ip) are non-separable but fails when Q is far away from N(0p, Ip) due to optimization issues discussed in Section 3.1 (Figure 1). On the other hand, JS-GAN stably achieves the lowest error in separable cases and also shows competitive performances for non-separable ones. 5.4 NETWORK STRUCTURES We further study the performance of JS-GAN with various structures of neural networks. The main observation is tuning networks with one-hidden layer becomes tough as the dimension grows (e.g. p 200), while a deeper network can significantly refine the situation perhaps by improving the landscape. Some experiment results are given in Table 2. On the other hand, one-hidden layer performs not worse than deeper networks when dimension is not very large (e.g. p 100). More experiments are given in Appendix B.4. Additional theoretical results for deep neural nets are given in Appendix A. Table 2: Experiment results for JS-GAN using networks with different structures in high dimension. Settings: ϵ = 0.2, p {200, 400} and n = 50, 000. p 200-100-20-1 200-200-100-1 200-100-1 200-20-1 200 0.0910 (0.0056) 0.0790 (0.0026) 0.3064 (0.0077) 0.1573 (0.0815) p 400-200-100-50-20-1 400-200-100-20-1 400-200-20-1 400-200-1 400 0.1477 (0.0053) 0.1732 (0.0397) 0.1393 (0.0090) 0.3604 (0.0990) Published as a conference paper at ICLR 2019 5.5 ADAPTATION TO UNKNOWN COVARIANCE The robust mean estimator constructed through JS-GAN can be easily made adaptive to unknown covariance structure, which is a special case of (16). We define (bθ, bΣ) = argmin η Rp,Γ Ep max D D i=1 log D(Xi) + EN(η,Γ) log(1 D(Xi)) The estimator bθ, as a result, is rate-optimal even when the true covariance matrix is not necessarily identity and is unknown (see Theorem 4.1). Below, we demonstrate some numerical evidence of the optimality of bθ as well as the error of bΣ in Table 3. 
Data generating process Network structure bθ 0p bΣ Σ1 op 0.8N(0p, Σ1) + 0.2N(0.5 1p, Σ2) 100-20-1 0.1680 (0.1540) 1.9716 (0.7405) 0.8N(0p, Σ1) + 0.2N(0.5 1p, Σ2) 100-20-20-1 0.1824 (0.3034) 1.4495 (0.6028) 0.8N(0p, Σ1) + 0.2N(1p, Σ2) 100-20-1 0.0817 (0.0213) 1.2753 (0.4523) 0.8N(0p, Σ1) + 0.2N(6 1p, Σ2) 100-20-1 0.1069 (0.0357) 1.1668 (0.1839) 0.8N(0p, Σ1) + 0.2Cauchy(0.5 1p) 100-20-1 0.0797 (0.0257) 4.0653 (0.1569) Table 3: Numerical experiments for robust mean estimation with unknown covariance trained with 50, 000 samples. The covariance matrices Σ1 and Σ2 are generated by the same way described in Appendix B.2. 5.6 ADAPTATION TO ELLIPTICAL DISTRIBUTIONS We consider the estimation of the location parameter θ in elliptical distribution EC(θ, Σ, h) by the JS-GAN defined in (16). In particular, we study the case with i.i.d. observations X1, ..., Xn (1 ϵ)Cauchy(θ, Ip)+ϵQ. The density function of Cauchy(θ, Σ) is given by p(x; θ, Σ) |Σ| 1/2 1 + (x θ)T Σ 1(x θ) (1+p)/2. Compared with Algorithm (1), the difference lies in the choice of the generator. We consider the generator G1(ξ, U) = gω(ξ)U + θ, where gω(ξ) is a non-negative neural network parametrized by ω and some random variable ξ. The random vector U is sampled from the uniform distribution on {u Rp : u = 1}. If the scatter matrix is unknown, we will use the generator G2(ξ, U) = gω(ξ)AU +θ, with AAT modeling the scatter matrix. Table 4 shows the comparison with other methods. Our method still works well under Cauchy distribution, while the performance of other methods that rely on moment conditions deteriorates in this setting. Table 4: Comparison of various methods of robust location estimation under Cauchy distributions. Samples are drawn from (1 ϵ)Cauchy(0p, Ip) + ϵQ with ϵ = 0.2, p = 50 and various choices of Q. Sample size: 50,000. Discriminator net structure: 50-50-25-1. Generator gω(ξ) structure: 48-48-32-24-12-1 with absolute value activation function in the output layer. Contamination Q JS-GAN (G1) JS-GAN (G2) Dimension Halving Iterative Filtering Cauchy(1.5 1p, Ip) 0.0664 (0.0065) 0.0743 (0.0103) 0.3529 (0.0543) 0.1244 (0.0114) Cauchy(5.0 1p, Ip) 0.0480 (0.0058) 0.0540 (0.0064) 0.4855 (0.0616) 0.1687 (0.0310) Cauchy(1.5 1p, 5 Ip) 0.0754 (0.0135) 0.0742 (0.0111) 0.3726 (0.0530) 0.1220 (0.0112) Normal(1.5 1p, 5 Ip) 0.0702 (0.0064) 0.0713 (0.0088) 0.3915 (0.0232) 0.1048 (0.0288)) ACKNOWLEDGEMENT The research of Chao Gao was supported in part by NSF grant DMS-1712957 and NSF Career Award DMS1847590. The research of Yuan Yao was supported in part by Hong Kong Research Grant Council (HKRGC) grant 16303817, National Basic Research Program of China (No. 2015CB85600), National Natural Science Foundation of China (No. 61370004, 11421110001), as well as awards from Tencent AI Lab, Si Family Foundation, Baidu Big Data Institute, and Microsoft Research-Asia. Published as a conference paper at ICLR 2019 Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pp. 131 142, 1966. Nina Amenta, Marshall Bern, David Eppstein, and S-H Teng. Regression depth and center points. Discrete & Computational Geometry, 23(3):305 323, 2000. Peter L Bartlett. For valid generalization the size of the weights is more important than the size of the network. In Advances in neural information processing systems, pp. 134 140, 1997. Peter L Bartlett and Shahar Mendelson. 
Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463 482, 2002. Timothy M Chan. An optimal randomized algorithm for maximum tukey depth. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 430 436. Society for Industrial and Applied Mathematics, 2004. Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under hubers contamination model. The Annals of Statistics, 46(5):1932 1960, 2018. Imre Csisz ar. Eine informationstheoretische ungleichung und ihre anwendung auf beweis der ergodizitaet von markoffschen ketten. Magyer Tud. Akad. Mat. Kutato Int. Koezl., 8:85 108, 1964. Luc Devroye and G abor Lugosi. Combinatorial methods in density estimation. Springer Science & Business Media, 2012. Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pp. 655 664. IEEE, 2016a. Ilias Diakonikolas, Daniel Kane, and Alistair Stewart. Robust learning of fixed-structure bayesian networks. ar Xiv preprint ar Xiv:1606.07384, 2016b. Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. ar Xiv preprint ar Xiv:1703.00893, 2017. Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. ar Xiv preprint ar Xiv:1803.02815, 2018a. Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. List-decodable robust mean estimation and learning mixtures of spherical gaussians. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1047 1060. ACM, 2018b. Ilias Diakonikolas, Weihao Kong, and Alistair Stewart. Efficient algorithms and lower bounds for robust linear regression. ar Xiv preprint ar Xiv:1806.00040, 2018c. David L Donoho and Miriam Gasko. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. The Annals of Statistics, 20(4):1803 1827, 1992. Simon S Du, Sivaraman Balakrishnan, and Aarti Singh. Computationally efficient robust estimation of sparse functionals. ar Xiv preprint ar Xiv:1702.07709, 2017. Kai Wang Fang. Symmetric Multivariate and Related Distributions: 0. Chapman and Hall/CRC, 2017. Chao Gao. Robust regression via mutivariate regression depth. ar Xiv preprint ar Xiv:1702.04656, 2017. Published as a conference paper at ICLR 2019 Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249 256, 2010. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672 2680, 2014. Peter J Huber. Robust estimation of a location parameter. The annals of mathematical statistics, 35(1):73 101, 1964. Peter J Huber. A robust version of the probability ratio test. The Annals of Mathematical Statistics, 36(6): 1753 1758, 1965. Pravesh K Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. 
In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1035 1046. ACM, 2018. Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The gan landscape: Losses, architectures, regularization, and normalization. ar Xiv preprint ar Xiv:1807.04720, 2018. Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pp. 665 674. IEEE, 2016. Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pp. 5545 5553, 2017. Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are gans created equal? a large-scale study. ar Xiv preprint ar Xiv:1711.10337, 2017. Colin Mc Diarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148 188, 1989. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. ar Xiv preprint ar Xiv:1802.05957, 2018. Ivan Mizera. On depth and deep points: a calculus. The Annals of Statistics, 30(6):1681 1736, 2002. Ivan Mizera and Christine H M uller. Location scale depth. Journal of the American Statistical Association, 99(468):949 966, 2004. Youssef Mroueh, Tom Sercu, and Vaibhava Goel. Mcgan: Mean and covariance feature matching gan. ar Xiv preprint ar Xiv:1702.08398, 2017. Xuan Long Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847 5861, 2010. Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271 279, 2016. Davy Paindaveine and Germain Van Bever. Halfspace depths for scatter, concentration and shape matrices. ar Xiv preprint ar Xiv:1704.06160, 2017. David Pollard. Convergence of stochastic processes. Springer Science & Business Media, 2012. Yury Polyanskiy and Yihong Wu. Lecture notes on information theory. 2017. Published as a conference paper at ICLR 2019 Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ar Xiv preprint ar Xiv:1511.06434, 2015. Peter J Rousseeuw and Mia Hubert. Regression depth. Journal of the American Statistical Association, 94 (446):388 402, 1999. Peter J Rousseeuw and Anja Struyf. Computing location depth and regression depth in higher dimensions. Statistics and Computing, 8(3):193 203, 1998. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234 2242, 2016. John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pp. 523 531, 1975. Aad W Van Der Vaart and Jon A Wellner. Weak convergence. In Weak convergence and empirical processes, pp. 16 28. Springer, 1996. Marc van Kreveld, Joseph SB Mitchell, Peter Rousseeuw, Micha Sharir, Jack Snoeyink, and Bettina Speckmann. Efficient algorithms for maximum regression depth. In Proceedings of the fifteenth annual symposium on Computational geometry, pp. 31 40. ACM, 1999. Yannis G Yatracos. 
Rates of convergence of minimum distance estimators and kolmogorov s entropy. The Annals of Statistics, 13(2):768 774, 1985. Jian Zhang. Some extensions of tukey s depth function. Journal of Multivariate Analysis, 82(1):134 165, 2002. Published as a conference paper at ICLR 2019 A ADDITIONAL THEORETICAL RESULTS In this section, we investigate the performance of discriminator classes of deep neural nets with the Re LU activation function. Since our goal is to learn a p-dimensional mean vector, a deep neural network discriminator without any regularization will certainly lead to overfitting. Therefore, it is crucial to design a network class with some appropriate regularizations. Inspired by the work of Bartlett (1997); Bartlett & Mendelson (2002), we consider a network class with ℓ1 regularizations on all layers except for the second last layer with an ℓ2 regularization. With GH 1 (B) = g(x) = Re LU(v T x) : v 1 B , a neural network class with l + 1 layers is defined as GH l+1(B) = g(x) = Re LU h=1 vhgh(x) h=1 |vh| B, gh GH l (B) Combining with the last sigmoid layer, we obtain the following discriminator class, FH L (κ, τ, B) = D(x) = sigmoid j 1 wjsigmoid h=1 ujhgjh(x) + bj j 1 |wj| κ, h=1 u2 jh 2, |bj| τ, gjh GH L 1(B) Note that all the activation functions are Re LU( ) except that we use sigmoid( ) in the last layer in the feature map g( ). A theoretical guarantees of the class defined above is given by the following theorem. Theorem A.1. Assume p log p n ϵ2 c for some sufficiently small constant c > 0. Consider i.i.d. observations X1, ..., Xn (1 ϵ)N(θ, Ip) + ϵQ and the estimator bθ defined by (14) with D = FH L (κ, τ, B) with H 2p, 2 L = O(1), 2 B = O(1), and τ = p log p. We set κ = O q n + ϵ . Then, for the estimator bθ defined by (14) with D = FH L (κ, τ, B), we have bθ θ 2 C p log p with probability at least 1 e C (p log p+nϵ2) uniformly over all θ Rp such that θ log p and all Q. The theorem shows that JS-GAN with a deep Re LU network can achieve the error rate p log p n ϵ2 with respect to the squared ℓ2 loss. The condition θ log p for the Re LU network can be easily satisfied with a simple preprocessing step. We split the data into two halves, whose sizes are log n and n log n, respectively. Then, we calculate the coordinatewise median eθ using the small half. It is easy to show that eθ θ q log p log n ϵ with high probability. Then, for each Xi from the second half, the conditional distribution of Xi eθ given the first half is (1 ϵ)N(θ eθ, Ip)+ϵ e Q. Since q log p log n ϵ log p, the condition θ eθ log p is satisfied, and thus we can apply the estimator (14) using the shifted data Xi eθ from the second half. The theoretical guarantee of Theorem A.1 will be bθ (θ eθ) 2 C p log p with high probability. Hence, we can use bθ + eθ as the final estimator to achieve the same rate in Theorem A.1. On the other hand, our experiments show that this preprocessing step is not needed. We believe that the assumption θ log p is a technical artifact in the analysis of the Rademacher complexity. It can probably be dropped by a more careful analysis. Published as a conference paper at ICLR 2019 B DETAILS OF EXPERIMENTS B.1 TRAINING DETAILS The implementation for JS-GAN is given in Algorithm 1, and a simple modification of the objective function leads to that of TV-GAN. Algorithm 1 JS-GAN: argminη maxw[ 1 n Pn i=1 log Dw(Xi) + E log(1 Dw(Gη(Z)))] Input: Observation set S = {X1, . . . 
, Xn} Rp, discriminator network Dw(x), generator network Gη(z) = z +η, learning rates γd and γg for the discriminator and the generator, batch size m, discriminator steps in each iteration K, total epochs T, average epochs T0. Initialization: Initialize η with coordinatewise median of S. Initialize w with N(0, .05) independently on each element or Xavier (Glorot & Bengio, 2010). 1: for t = 1, . . . , T do 2: for k = 1, . . . , K do 3: Sample mini-batch {X1, . . . , Xm} from S. Sample {Z1, . . . , Zm} from N(0, Ip) 4: gw w[ 1 mΣm i=1 log Dw(Xi) + 1 mΣm i=1 log(1 Dw(Gη(Zi)))] 5: w w + γdgw 6: end for 7: Sample {Z1, . . . , Zm} from N(0, Ip) 8: gη η[ 1 mΣm i=1 log(1 Dw(Gη(Zi)))] 9: η η γggη 10: end for Return: The average estimate η over the last T0 epochs. Several important implementation details are discussed below. How to tune parameters? The choice of learning rates is crucial to the convergence rate, but the minimax game is hard to evaluate. We propose a simple strategy to tune hyper-parameters including the learning rates. Suppose we have estimators bθ1, . . . , bθM with corresponding discriminator networks D b w1,..., D b w M . Fixing η = bθ, we further apply gradient descent to Dw with a few more epochs (but not many in order to prevent overfitting, for example 10 epochs) and select the bθ with the smallest value of the objective function (14) (JS-GAN) or (12) (TV-GAN). We note that training discriminator and generator alternatively usually will not suffer from overfitting since the objective function for either the discriminator or the generator is always changing. However, we must be careful about the overfitting issue when training the discriminator alone with a fixed η, and that is why we apply an early stopping strategy here. Fortunately, the experiments show if the structures of networks are same (then of course, the dimensions of the inputs are same), the choices of hyper-parameters are robust to different models and we present the critical parameters in Table 5 to reproduce the experiment results in Table 1 and Table 2. When to stop training? Judging convergence is a difficult task in GAN trainings, since sometimes oscillation may occur. In computer vision, people often use a task related measure and stop training once the requirement based on the measure is achieved. In our experiments below, we simply use a sufficiently large T which works well, but it is still interesting to explore an efficient early stopping rule in the future work. How to design the network structure? Although Theorem 3.1 and Theorem 3.2 guarantee the minimax rates of TV-GAN without hidden layer and JS-GAN with one hidden layer, one may wonder whether deeper network structures will perform better. From our preliminary experiments, TV-GAN with one hidden layer is significantly better than TV-GAN without any hidden layer. Moreover, JS-GAN Published as a conference paper at ICLR 2019 Table 5: Choices of hyper-parameters. The parameter λ is the penalty factor for the regularization term (17) and other parameters are listed in Algorithm 1. We apply Xavier initialization (Glorot & Bengio, 2010) for both JS-GAN and TV-GAN trainings. 
Structure Net γg γd K T T0 λ 100-20-1 JS 0.02 0.2 5 150 25 0 TV 0.0001 0.3 2 150 1 0.1 100-2-1 JS 0.01 0.2 5 150 25 0 TV 0.01 0.1 5 150 1 1 200-20-1 JS 0.02 0.2 5 200 25 0 TV 0.0001 0.1 2 200 1 0.5 200-200-100-1 JS 0.005 0.1 2 200 25 0 400-200-20-1 JS 0.02 0.05 2 250 25 0.5 with deep network structures can significantly improve over shallow networks especially when the dimension is large (e.g. p 200). For a network with one hidden layer, the choice of width may depend on the sample size. If we only have 5,000 samples of 100 dimensions, two hidden units performs better than five hidden units, which performs better than twenty hidden units. If we have 50,000 samples, networks with twenty hidden units perform the best. How to stabilize and accelerate TV-GAN? As we have discussed in Section 3.1, TV-GAN has a bad landscape when N(θ, Ip) and the contamination distribution Q are linearly separable (see Figure 1). An outlier removal step before training TV-GAN may be helpful. Besides, spectral normalization (Miyato et al., 2018) is also worth trying since it can prevent the weight from going to infinity and thus can increase the chance to escape from bad saddle points. To accelerate the optimization of TVGAN, in all the numerical experiments below, we adopt a regularized version of TV-GAN inspired by Proposition 3.1. Since a good feature extractor should match nonlinear moments of P = (1 ϵ)N(θ, Ip) + ϵQ and N(η, Ip), we use an additional regularization term that can accelerate training and sometimes even leads to better performances. Specifically, let D(x) = sigmoid(w T Φ(x)) be the discriminator network with w being the weights of the output layer and ΦD(x) be the corresponding network after removing the output layer from D(x). The quantity ΦD(x) is usually viewed as a feature extractor, which naturally leads to the following regularization term (Salimans et al., 2016; Mroueh et al., 2017), defined as i=1 T(ΦD, Xi) T(ΦD, N(η, Ip)) where T(Φ, P) = EP Φ(X) (moment matching) or T(Φ, P) = Median X P ΦD(X) (median matching). B.2 SETTINGS OF CONTAMINATION Q We introduce the contamination distributions Q used in the experiments. We first consider Q = N(µ, Ip) with µ ranges in {0.2, 0.5, 1, 5}. Note that the total variation distance between N(0p, Ip) and N(µ, Ip) is of order 0p µ = µ . We hope to use different levels of µ to test the algorithm and verify the error rate in the worst case. Second, we consider Q = N(1.5 1p, Σ) to be a Gaussian distribution with a non-trivial covariance matrix Σ. The covariance matrix is generated according to the following steps. First generate a sparse precision matrix Γ = (γij) with each entry γij = zij τij, i j, where zij and τij are independently generated from Uniform(0.4, 0.8) and Bernoulli(0.1). We then define γij = γji for all i > j and Γ = Γ + (| min eig(Γ)| + 0.05)Ip to make the precision matrix symmetric and positive definite, where min eig(Γ) is the smallest eigenvalue of Γ. The covariance matrix is Σ = Γ 1. Finally, we consider Q to be a Cauchy Published as a conference paper at ICLR 2019 distribution with independent component, and the jth component takes a standard Cauchy distribution with location parameter τj = 0.5. B.3 COMPARISON DETAILS In Section 5.3, we compare GANs with the dimension halving (Lai et al., 2016) and iterative filtering (Diakonikolas et al., 2017). Dimension Halving. Experiments conducted are based on the code from https://github.com/ kal2000/Agnostic Mean And Covariance Code. 
The only hyper-parameter is the threshold in the outlier removal step, and we take C = 2 as suggested in the file out Rem Sperical.m. Iterative Filtering. Experiments conducted are based on the code from https://github.com/ hoonose/robust-filter. We assume ϵ is known and take other hyper-parameters as suggested in the file filter Gaussian Mean.m. B.4 SUPPLEMENTARY EXPERIMENTS FOR NETWORK STRUCTURES The experiments are conducted with i.i.d. observations drawn from (1 ϵ)N(0p, Ip) + ϵN(0.5 1p, Ip) with ϵ = 0.2. Table 6 summarizes results for p = 100, n {5000, 50000} and various network structures. We observe that TV-GAN that uses neural nets with one hidden layer improves over the performance of that without any hidden layer. This indicates that the landscape of TV-GAN might be improved by a more complicated network structure. However, adding one more layer does not improve the results. For JS-GAN, we omit the results without hidden layer because of its lack of robustness (Proposition 3.1). Deeper networks sometimes improve over shallow networks, but this is not always true. We also observe that the optimal choice of the width of the hidden layer depends on the sample size. Table 6: Experiment results for JS-GAN and TV-GAN with various network structures. Structure n JS-GAN TV-GAN 100-1 50,000 - 0.1173 (0.0056) 100-20-1 50,000 0.0953 (0.0064) 0.1144 (0.0154) 100-50-1 50,000 0.2409 (0.0500) 0.1597 (0.0219) 100-20-20-1 50,000 0.1131 (0.0855) 0.1724 (0.0295) 100-1 5,000 - 0.9818 (0.0417) 100-2-1 5,000 0.1941 (0.0173) 0.1941 (0.0173) 100-5-1 5,000 0.2148 (0.0241) 0.2244 (0.0238) 100-20-1 5,000 0.3379 (0.0273) 0.3336 (0.0186) B.5 TABLES FOR TESTING THE MINIMAX RATES Tables 7, 8 and 9 show numerical results corresponding to Figure 2. Published as a conference paper at ICLR 2019 Table 7: Scenario I: p p/n < ϵ. Setting: p = 100, n = 50, 000, and ϵ from 0.05 to 0.20. Network structure of JS-GAN: one hidden layer with 5 hidden units. Network structure of TV-GAN: zero-hidden layer. The number in each cell is the average of ℓ2 error bθ θ with standard deviation in parenthesis estimated from 10 repeated experiments. The bold character marks the worst case among our choices of Q at each ϵ level. The results of TV-GAN for Q = N(5 1p, Ip) are highlighted in slanted font. The failure of training in this case is due to the bad landscape when N(0p, Ip) and Q are linearly separable, as discussed in Section 3.1 (see Figure 1). Q Net ϵ = 0.05 ϵ = 0.10 ϵ = 0.15 ϵ = 0.20 N(0.2 1p, Ip) JS 0.1025 (0.0080) 0.1813 (0.0122) 0.2632 (0.0080) 0.3280 (0.0069) TV 0.1110 (0.0204) 0.2047 (0.0112) 0.2769 (0.0315) 0.3283 (0.0745) N(0.5 1p, Ip) JS 0.1407 (0.0061) 0.1895 (0.0070) 0.1714 (0.0502) 0.1227 (0.0249) TV 0.2003 (0.0480) 0.2065 (0.1495) 0.2088 (0.0100) 0.3985 (0.0112) N(1p, Ip) JS 0.0855 (0.0054) 0.1055 (0.0322) 0.0602 (0.0133) 0.0577 (0.0029) TV 0.1084 (0.0063) 0.0842 (0.0036) 0.3228 (0.0123) 0.1329 (0.0125) N(5 1p, Ip) JS 0.0587 (0.0033) 0.0636 (0.0025) 0.0625 (0.0045) 0.0591 (0.0040) TV 1.2886 (0.5292) 4.4511 (0.8754) 7.3868 (0.8081) 10.5724 (1.2605) Cauchy(0.5 1p) JS 0.0625 (0.0045) 0.0652 (0.0044) 0.0648 (0.0035) 0.0687 (0.0042) TV 0.2280 (0.0067) 0.3842 (0.0083) 0.5740 (0.0071) 0.7768 (0.0074) N(0.5 1p, Σ) JS 0.1490 (0.0061) 0.1958 (0.0074) 0.2379 (0.0076) 0.1973 (0.0679) TV 0.2597 (0.0090) 0.4621 (0.0649) 0.6344 (0.0905) 0.7444 (0.3115) Table 8: Scenario II-a: p p/n > ϵ. Setting: n = 1, 000, ϵ = 0.1, and p from 10 to 100. Other details are the same as above. 
Q Net p = 10 p = 25 p = 50 p = 75 p = 100 N(0.2 1p, Ip) JS 0.1078 (0.0338) 0.1819 (0.0215) 0.3355 (0.0470) 0.4806 (0.0497) 0.5310 (0.0414) TV 0.2828 (0.0580) 0.4740 (0.1181) 0.5627 (0.0894) 0.8217 (0.0382) 0.8090 (0.0457) N(0.5 1p, Ip) JS 0.1587 (0.0438) 0.2684 (0.0386) 0.4213 (0.0356) 0.5355 (0.0634) 0.6825 (0.0981) TV 0.2864 (0.0521) 0.5024 (0.1038) 0.6878 (0.1146) 0.9204 (0.0589) 0.9418 (0.0551) N(1p, Ip) JS 0.1644 (0.0255) 0.2177 (0.0480) 0.3505 (0.0552) 0.4740 (0.0742) 0.6662 (0.0611) TV 0.3733 (0.0878) 0.5407 (0.0634) 0.9061 (0.1029) 1.0672 (0.0629) 1.1150 (0.0942) N(5 1p, Ip) JS 0.0938 (0.0195) 0.2058 (0.0218) 0.3316 (0.0462) 0.4054 (0.0690) 0.5553 (0.0518) TV 0.3707 (0.2102) 0.7434 (0.3313) 1.1532 (0.3488) 1.1850 (0.3739) 1.3257 (0.1721) Cauchy(0.5 1p) JS 0.1188 (0.0263) 0.1855 (0.0282) 0.2967 (0.0284) 0.4094 (0.0385) 0.4826 (0.0479) TV 0.3198 (0.1543) 0.5205 (0.1049) 0.6240 (0.0652) 0.7536 (0.0673) 0.7612 (0.0613) N(0.5 1p, Σ) JS 0.1805 (0.0220) 0.2692 (0.0318) 0.3885 (0.0339) 0.5144 (0.0547) 0.6833 (0.1094) TV 0.3036 (0.0736) 0.5152 (0.0707) 0.7305 (0.0966) 0.9460 (0.0900) 1.0888 (0.0863) Published as a conference paper at ICLR 2019 Table 9: Scenario II-b: p p/n > ϵ. Setting: p = 50, ϵ = 0.1, and n from 50 to 1, 000. Other details are the same as above. Q Net n = 50 n = 100 n = 200 n = 500 n = 1000 N(0.2 1p, Ip) JS 1.3934 (0.5692) 1.0055 (0.1040) 0.8373 (0.1335) 0.4781 (0.0677) 0.3213 (0.0401) TV 1.9714 (0.1552) 1.2629 (0.0882) 0.7579 (0.0486) 0.6640 (0.0689) 0.6348 (0.0547) N(0.5 1p, Ip) JS 1.6422 (0.6822) 1.2101 (0.2826) 0.8374 (0.1021) 0.5832 (0.0595) 0.3930 (0.0485) TV 1.9780 (0.2157) 1.2485 (0.0668) 0.8198 (0.0778) 0.7597 (0.0456) 0.7346 (0.0750) N(1p, Ip) JS 1.8427 (0.9633) 1.2179 (0.2782) 1.0147 (0.2170) 0.5586 (0.1013) 0.3639 (0.0464) TV 1.9907 (0.1498) 1.4575 (0.1270) 0.9724 (0.0802) 0.9050 (0.1479) 0.8747 (0.0757) N(5 1p, Ip) JS 2.6392 (1.3877) 1.3966 (0.5370) 0.9633 (0.1383) 0.5360 (0.0808) 0.3265 (0.0336) TV 2.1050 (0.3763) 1.5205 (0.2221) 1.1909 (0.2273) 1.0957 (0.1390) 1.0695 (0.2639) Cauchy(0.5 1p) JS 1.6563 (0.5246) 1.0857 (0.3613) 0.8944 (0.1759) 0.5363 (0.0593) 0.3832 (0.0408) TV 2.1031 (0.2300) 1.1712 (0.1493) 0.6904 (0.0763) 0.6300 (0.0642) 0.5085 (0.0662) N(0.5 1p, Σ) JS 1.2296 (0.3157) 0.7696 (0.0786) 0.5892 (0.0931) 0.5015 (0.0831) 0.4085 (0.0209) TV 1.9243 (0.2079) 1.2217 (0.0681) 0.7939 (0.0688) 0.7033 (0.0414) 0.7125 (0.0490) C PROOFS OF PROPOSITION 2.1 AND PROPOSITION 3.1 In the first example, consider Q = {N(η, Ip) : η Rp} , e Qη = {N(eη, Ip) : eη η r} . In other words, Q is the class of Gaussian location family, and e Qη is taken to be a subset in a local neighborhood of N(η, Ip). Then, with Q = N(η, Ip) and e Q = N(eη, Ip), the event eq(X)/q(X) 1 is equivalent to X eη 2 X η 2. Since eη η r, we can write eη = η + eru for some er R and u Rp that satisfy 0 er r and u = 1. Then, (8) becomes bθ = argmin η Rp sup u =1 0 er r i=1 I u T (Xi η) er P N(0, 1) er Letting r 0, we obtain (9), the exact formula of Tukey s median. The next example is a linear model y|X N(XT θ, 1). Consider the following classes Q = Py,X = Py|XPX : Py|X = N(XT η, 1), η Rp , e Qη = Py,X = Py|XPX : Py|X = N(XT eη, 1), eη η r . Here, Py,X stands for the joint distribution of y and X. The two classes Q and e Q share the same marginal distribution PX and the conditional distributions are specified by N(XT η, 1) and N(XT eη, 1), respectively. Follow the same derivation of Tukey s median, let r 0, and we obtain the exact formula of regression depth (10). 
The last example is on covariance/scatter matrix estimation. For this task, we set $\mathcal{Q} = \{N(0, \Gamma) : \Gamma \in \mathcal{E}_p\}$, where $\mathcal{E}_p$ is the class of all p × p covariance matrices. Inspired by the derivations of Tukey depth and regression depth, it is tempting to choose $\widetilde{\mathcal{Q}}$ in a neighborhood of $N(0, \Gamma)$. However, a naive choice would lead to a definition that is not even Fisher consistent. We propose a rank-one neighborhood, given by

$\widetilde{\mathcal{Q}}_\Gamma = \left\{ N(0, \widetilde\Gamma) : \widetilde\Gamma^{-1} = \Gamma^{-1} + \tilde r uu^T \in \mathcal{E}_p,\ |\tilde r| \le r,\ \|u\| = 1 \right\}.$  (19)

Then, a direct calculation gives

$\left\{ \frac{dN(0,\widetilde\Gamma)}{dN(0,\Gamma)}(X) \ge 1 \right\} = \left\{ \tilde r |u^TX|^2 \le \log(1 + \tilde r u^T\Gamma u) \right\}.$  (20)

Since $\lim_{\tilde r \to 0} \frac{\log(1+\tilde r u^T\Gamma u)}{\tilde r u^T\Gamma u} = 1$, the limiting event of (20) is either $\{|u^TX|^2 \ge u^T\Gamma u\}$ or $\{|u^TX|^2 \le u^T\Gamma u\}$, depending on whether $\tilde r$ tends to zero from the left or from the right. Therefore, with the above $\mathcal{Q}$ and $\widetilde{\mathcal{Q}}_\Gamma$, (8) becomes (11) under the limit $r \to 0$. Even though the definition of (19) is given by a rank-one neighborhood of the inverse covariance matrix, the formula (11) can also be derived with $\widetilde\Gamma^{-1} = \Gamma^{-1} + \tilde r uu^T$ in (19) replaced by $\widetilde\Gamma = \Gamma + \tilde r uu^T$, by applying the Sherman-Morrison formula. A similar formula to (11) in the literature is given by

$\hat\Sigma = \mathop{\rm argmax}_{\Gamma\in\mathcal{E}_p} \inf_{\|u\|=1} \min\left\{ \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^TX_i|^2 \le \beta u^T\Gamma u\},\ 1 - \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^TX_i|^2 \le \beta u^T\Gamma u\} \right\},$  (21)

which is recognized as the maximizer of what is known as the matrix depth function (Zhang, 2002; Chen et al., 2018; Paindaveine & Van Bever, 2017). The $\beta$ in (21) is a scalar defined through the equation $\mathbb{P}(N(0,1) \le \sqrt{\beta}) = 3/4$. It is proved in Chen et al. (2018) that $\hat\Sigma$ achieves the minimax rate under Huber's ϵ-contamination model. While the formula (11) can be derived from TV-Learning with discriminators of the form $\mathbb{I}\left\{\frac{dN(0,\widetilde\Gamma)}{dN(0,\Gamma)}(X) \ge 1\right\}$, a special case of (6), the formula (21) can be derived directly from TV-GAN with discriminators of the form $\mathbb{I}\left\{\frac{dN(0,\beta\widetilde\Gamma)}{dN(0,\beta\Gamma)}(X) \ge 1\right\}$, by following a similar rank-one neighborhood argument. This completes the derivation of Proposition 2.1.

To prove Proposition 3.1, we define

$F(w) = \mathbb{E}_P \log\,{\rm sigmoid}(w^Tg(X)) + \mathbb{E}_Q \log\left(1 - {\rm sigmoid}(w^Tg(X))\right) + \log 4,$

so that $JS_g(P, Q) = \max_{w\in\mathcal{W}} F(w)$. The gradient and Hessian of $F(w)$ are given by

$\nabla F(w) = \mathbb{E}_P\, \frac{e^{-w^Tg(X)}}{1+e^{-w^Tg(X)}}\, g(X) - \mathbb{E}_Q\, \frac{e^{w^Tg(X)}}{1+e^{w^Tg(X)}}\, g(X),$
$\nabla^2 F(w) = -\mathbb{E}_P\, \frac{e^{w^Tg(X)}}{(1+e^{w^Tg(X)})^2}\, g(X)g(X)^T - \mathbb{E}_Q\, \frac{e^{-w^Tg(X)}}{(1+e^{-w^Tg(X)})^2}\, g(X)g(X)^T.$

Therefore, $F(w)$ is concave in $w$, and $\max_{w\in\mathcal{W}} F(w)$ is a convex optimization problem for a convex $\mathcal{W}$. Suppose $JS_g(P, Q) = 0$. Then $\max_{w\in\mathcal{W}} F(w) = 0 = F(0)$, which implies $\nabla F(0) = 0$, and thus we have $\mathbb{E}_P g(X) = \mathbb{E}_Q g(X)$. Now suppose $\mathbb{E}_P g(X) = \mathbb{E}_Q g(X)$, which is equivalent to $\nabla F(0) = 0$. Therefore, $w = 0$ is a stationary point of a concave function, and we have $JS_g(P, Q) = \max_{w\in\mathcal{W}} F(w) = F(0) = 0$.
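The degeneracy described by Proposition 3.1 is easy to see numerically. The following numpy sketch is ours; the choices Q = Unif(−3, 3) and g(x) = x are illustrative assumptions. It estimates F(w) by Monte Carlo for two very different distributions with matched first moments and shows that the maximum is essentially zero, attained near w = 0.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# P = N(0, 1) and Q = Unif(-3, 3) differ as distributions but share E g(X) for g(x) = x.
xP = rng.standard_normal(200_000)
xQ = rng.uniform(-3.0, 3.0, 200_000)

def F(w):
    # Monte Carlo estimate of E_P log sigmoid(w x) + E_Q log(1 - sigmoid(w x)) + log 4.
    return (np.log(sigmoid(w * xP)).mean()
            + np.log(1.0 - sigmoid(w * xQ)).mean() + np.log(4.0))

ws = np.linspace(-2.0, 2.0, 401)
vals = [F(w) for w in ws]
print(max(vals), ws[int(np.argmax(vals))])   # maximum is approximately 0, attained near w = 0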
D PROOFS OF MAIN RESULTS

In this section, we present the proofs of all main theorems in the paper. We first establish some useful lemmas in Section D.1, and then the proofs of the main theorems are given in Section D.2.

D.1 SOME AUXILIARY LEMMAS

Lemma D.1. Given i.i.d. observations $X_1, ..., X_n \sim P$ and the function class $\mathcal{D}$ defined in (13), we have for any $\delta > 0$,

$\sup_{D\in\mathcal{D}} \left| \frac{1}{n}\sum_{i=1}^n D(X_i) - \mathbb{E}D(X) \right| \le C\left(\sqrt{\frac{p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right)$

with probability at least $1-\delta$ for some universal constant $C > 0$.

Proof. Let $f(X_1, ..., X_n) = \sup_{D\in\mathcal{D}} \left|\frac{1}{n}\sum_{i=1}^n D(X_i) - \mathbb{E}D(X)\right|$. It is clear that $f(X_1, ..., X_n)$ satisfies the bounded difference condition. By McDiarmid's inequality (McDiarmid, 1989), we have

$f(X_1, ..., X_n) \le \mathbb{E}f(X_1, ..., X_n) + \sqrt{\frac{\log(1/\delta)}{2n}}$

with probability at least $1-\delta$. Using a standard symmetrization technique (Pollard, 2012), we obtain the following bound that involves the Rademacher complexity,

$\mathbb{E}f(X_1, ..., X_n) \le 2\,\mathbb{E}\sup_{D\in\mathcal{D}} \left|\frac{1}{n}\sum_{i=1}^n \epsilon_i D(X_i)\right|,$  (22)

where $\epsilon_1, ..., \epsilon_n$ are independent Rademacher random variables. The Rademacher complexity can be bounded by Dudley's integral entropy bound, which gives

$\mathbb{E}\sup_{D\in\mathcal{D}} \left|\frac{1}{n}\sum_{i=1}^n \epsilon_i D(X_i)\right| \lesssim \frac{1}{\sqrt{n}}\int_0^2 \sqrt{\log\mathcal{N}(\delta, \mathcal{D}, \|\cdot\|_n)}\, d\delta,$

where $\mathcal{N}(\delta, \mathcal{D}, \|\cdot\|_n)$ is the $\delta$-covering number of $\mathcal{D}$ with respect to the empirical $\ell_2$ distance $\|f-g\|_n = \sqrt{\frac{1}{n}\sum_{i=1}^n (f(X_i)-g(X_i))^2}$. Since the VC-dimension of $\mathcal{D}$ is $O(p)$, we have $\mathcal{N}(\delta, \mathcal{D}, \|\cdot\|_n) \lesssim p(16e/\delta)^{O(p)}$ (see Theorem 2.6.7 of Van Der Vaart & Wellner (1996)). This leads to the bound $\frac{1}{\sqrt{n}}\int_0^2\sqrt{\log\mathcal{N}(\delta,\mathcal{D},\|\cdot\|_n)}\,d\delta \lesssim \sqrt{p/n}$, which gives the desired result.

Lemma D.2. Given i.i.d. observations $X_1, ..., X_n \sim P$ and the function class $\mathcal{D}$ defined in (15), we have for any $\delta > 0$,

$\sup_{D\in\mathcal{D}} \left| \frac{1}{n}\sum_{i=1}^n \log D(X_i) - \mathbb{E}\log D(X) \right| \le C\kappa\left(\sqrt{\frac{p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right)$

with probability at least $1-\delta$ for some universal constant $C > 0$.

Proof. Let $f(X_1, ..., X_n) = \sup_{D\in\mathcal{D}} \left|\frac{1}{n}\sum_{i=1}^n \log D(X_i) - \mathbb{E}\log D(X)\right|$. Since $\sup_{D\in\mathcal{D}}\sup_x|\log(2D(x))| \le \kappa$, we have

$\sup_{x_1,...,x_n,x_i'} |f(x_1, ..., x_n) - f(x_1, ..., x_{i-1}, x_i', x_{i+1}, ..., x_n)| \le \frac{2\kappa}{n}.$

Therefore, by McDiarmid's inequality (McDiarmid, 1989), we have

$f(X_1, ..., X_n) \le \mathbb{E}f(X_1, ..., X_n) + \kappa\sqrt{\frac{2\log(1/\delta)}{n}}$  (23)

with probability at least $1-\delta$. By the same argument as (22), it is sufficient to bound the Rademacher complexity $\mathbb{E}\sup_{D\in\mathcal{D}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\log(2D(X_i))\right|$. Since the function $\psi(x) = \log(2\,{\rm sigmoid}(x))$ has Lipschitz constant 1 and satisfies $\psi(0) = 0$, we have

$\mathbb{E}\sup_{D\in\mathcal{D}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\log(2D(X_i))\right| \le 2\,\mathbb{E}\sup_{\sum_{j\ge 1}|w_j|\le\kappa,\,u_j\in\mathbb{R}^p,\,b_j\in\mathbb{R}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sum_{j\ge 1}w_j\sigma(u_j^TX_i+b_j)\right|,$

which uses Theorem 12 of Bartlett & Mendelson (2002). By Hölder's inequality, we further have

$\mathbb{E}\sup_{\sum_{j\ge 1}|w_j|\le\kappa,\,u_j\in\mathbb{R}^p,\,b_j\in\mathbb{R}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sum_{j\ge 1}w_j\sigma(u_j^TX_i+b_j)\right| \le \kappa\,\mathbb{E}\max_{j\ge 1}\sup_{u_j\in\mathbb{R}^p,\,b_j\in\mathbb{R}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sigma(u_j^TX_i+b_j)\right| = \kappa\,\mathbb{E}\sup_{u\in\mathbb{R}^p,\,b\in\mathbb{R}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sigma(u^TX_i+b)\right|.$

Note that for a monotone function $\sigma : \mathbb{R} \to [0,1]$, the VC-dimension of the class $\{\sigma(u^Tx+b) : u\in\mathbb{R}^p, b\in\mathbb{R}\}$ is $O(p)$. Therefore, by the same Dudley integral entropy bound argument as in the proof of Lemma D.1, we have

$\mathbb{E}\sup_{u\in\mathbb{R}^p,\,b\in\mathbb{R}}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sigma(u^TX_i+b)\right| \lesssim \sqrt{\frac{p}{n}},$

which leads to the desired result.

Lemma D.3. Given i.i.d. observations $X_1, ..., X_n \sim N(\theta, I_p)$ and the function class $\mathcal{F}^H_L(\kappa, \tau, B)$, assume $\|\theta\| \le \sqrt{\log p}$ and set $\tau = \sqrt{p\log p}$. We have for any $\delta > 0$,

$\sup_{D\in\mathcal{F}^H_L(\kappa,\tau,B)} \left|\frac{1}{n}\sum_{i=1}^n\log D(X_i) - \mathbb{E}\log D(X)\right| \le C\kappa\left((2B)^{L-1}\sqrt{\frac{p\log p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right)$

with probability at least $1-\delta$ for some universal constant $C > 0$.

Proof. Write $f(X_1, ..., X_n) = \sup_{D\in\mathcal{F}^H_L(\kappa,\tau,B)}\left|\frac{1}{n}\sum_{i=1}^n\log D(X_i) - \mathbb{E}\log D(X)\right|$. Then, the inequality (23) holds with probability at least $1-\delta$, and it is sufficient to analyze the Rademacher complexity. Using the fact that the function $\log(2\,{\rm sigmoid}(x))$ is Lipschitz together with Hölder's inequality, we have

$\mathbb{E}\sup_{D\in\mathcal{F}^H_L(\kappa,\tau,B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\log(2D(X_i))\right|$
$\le 2\,\mathbb{E}\sup_{\|w\|_1\le\kappa,\,\|u_j\|\le\sqrt{2},\,|b_j|\le\tau,\,g_{jh}\in\mathcal{G}^H_{L-1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sum_{j\ge 1}w_j\,{\rm sigmoid}\Big(\sum_h u_{jh}g_{jh}(X_i)+b_j\Big)\right|$
$\le 2\kappa\,\mathbb{E}\sup_{\|u\|\le\sqrt{2},\,|b|\le\tau,\,g_h\in\mathcal{G}^H_{L-1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\,{\rm sigmoid}\Big(\sum_h u_hg_h(X_i)+b\Big)\right|$
$\le 4\kappa\,\mathbb{E}\sup_{\|u\|\le\sqrt{2},\,|b|\le\tau,\,g_h\in\mathcal{G}^H_{L-1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\Big(\sum_h u_hg_h(X_i)+b\Big)\right|$
$\le 8\sqrt{p}\,\kappa\,\mathbb{E}\sup_{g\in\mathcal{G}^H_{L-1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_ig(X_i)\right| + \frac{4\kappa\tau}{\sqrt{n}}.$

Now we use the notation $Z_i = X_i - \theta \sim N(0, I_p)$ for $i = 1, ..., n$, and we bound $\mathbb{E}\sup_{g\in\mathcal{G}^H_{L-1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_ig(Z_i+\theta)\right|$ by induction. Since

$\mathbb{E}\sup_{g\in\mathcal{G}^H_1(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_ig(Z_i+\theta)\right| \lesssim B\,\mathbb{E}\sup_{\|v\|_1\le 1}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_iv^T(Z_i+\theta)\right| \le CB\,\frac{\sqrt{\log p}+\|\theta\|}{\sqrt{n}},$

and, for each $l$,

$\mathbb{E}\sup_{g\in\mathcal{G}^H_{l+1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_ig(Z_i+\theta)\right| \le 2\,\mathbb{E}\sup_{\|v\|_1\le B,\,g_h\in\mathcal{G}^H_l(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\sum_h v_hg_h(Z_i+\theta)\right| \le 2B\,\mathbb{E}\sup_{g\in\mathcal{G}^H_l(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_ig(Z_i+\theta)\right|,$

induction gives

$\mathbb{E}\sup_{g\in\mathcal{G}^H_{L-1}(B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_ig(Z_i+\theta)\right| \le C(2B)^{L-1}\,\frac{\sqrt{\log p}+\|\theta\|}{\sqrt{n}}.$

Combining the above inequalities, we get

$\mathbb{E}\sup_{D\in\mathcal{F}^H_L(\kappa,\tau,B)}\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i\log(2D(Z_i+\theta))\right| \le C\kappa\left(\sqrt{p}\,(2B)^{L-1}\,\frac{\sqrt{\log p}+\|\theta\|}{\sqrt{n}} + \frac{\tau}{\sqrt{n}}\right),$

which leads to the desired result under the conditions on $\tau$ and $\|\theta\|$.
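To give a numerical sense of the $\sqrt{p/n}$ scaling that drives Lemmas D.1-D.3, the following numpy sketch (ours) computes a crude Monte Carlo lower bound on the Rademacher complexity of sigmoid discriminators sigmoid(w^T x + b) by replacing the supremum over (w, b) with a maximum over randomly sampled pairs; it only illustrates the decay in n at a fixed p and is not a substitute for the Dudley bound.

import numpy as np

def rademacher_lower_bound(n, p, n_dirs=2000, n_rep=20, seed=0):
    # Crude lower bound on E sup_{w,b} |(1/n) sum_i eps_i sigmoid(w^T X_i + b)|:
    # the supremum over (w, b) is replaced by a maximum over random draws.
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_rep):
        X = rng.standard_normal((n, p))
        eps = rng.choice([-1.0, 1.0], size=n)
        W = rng.standard_normal((n_dirs, p))
        b = rng.standard_normal((n_dirs, 1))
        D = 1.0 / (1.0 + np.exp(-(W @ X.T + b)))   # n_dirs x n values of sigmoid(w^T X_i + b)
        vals.append(np.abs(D @ eps / n).max())
    return float(np.mean(vals))

for n in [200, 800, 3200]:
    print(n, rademacher_lower_bound(n, p=20), np.sqrt(20 / n))  # decay in n roughly tracks sqrt(p/n)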
D.2 PROOFS OF MAIN THEOREMS

Proof of Theorem 3.1. We first introduce some notation. Define $F(P, \eta) = \max_{w,b} F_{w,b}(P,\eta)$, where

$F_{w,b}(P,\eta) = \mathbb{E}_P\,{\rm sigmoid}(w^TX+b) - \mathbb{E}_{N(\eta,I_p)}\,{\rm sigmoid}(w^TX+b).$

With this definition, we have $\hat\theta = \mathop{\rm argmin}_\eta F(\mathbb{P}_n, \eta)$, where we use $\mathbb{P}_n$ for the empirical distribution $\frac{1}{n}\sum_{i=1}^n\delta_{X_i}$. We shorthand $N(\eta, I_p)$ by $P_\eta$, and then

$F(P_\theta, \hat\theta) \le F((1-\epsilon)P_\theta + \epsilon Q, \hat\theta) + \epsilon$  (24)
$\le F(\mathbb{P}_n, \hat\theta) + \epsilon + C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (25)
$\le F(\mathbb{P}_n, \theta) + \epsilon + C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (26)
$\le F((1-\epsilon)P_\theta + \epsilon Q, \theta) + \epsilon + 2C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (27)
$\le F(P_\theta, \theta) + 2\epsilon + 2C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (28)
$= 2\epsilon + 2C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right).$  (29)

With probability at least $1-\delta$, the above inequalities hold. We now explain each inequality. Since

$F((1-\epsilon)P_\theta+\epsilon Q, \eta) = \max_{w,b}\left[(1-\epsilon)F_{w,b}(P_\theta,\eta) + \epsilon F_{w,b}(Q,\eta)\right],$

we have

$\sup_\eta |F((1-\epsilon)P_\theta+\epsilon Q,\eta) - F(P_\theta,\eta)| \le \epsilon,$

which implies (24) and (28). The inequalities (25) and (27) are implied by Lemma D.1 and the fact that

$\sup_\eta |F(\mathbb{P}_n,\eta) - F((1-\epsilon)P_\theta+\epsilon Q,\eta)| \le \sup_{w,b}\left|\frac{1}{n}\sum_{i=1}^n{\rm sigmoid}(w^TX_i+b) - \mathbb{E}\,{\rm sigmoid}(w^TX+b)\right|.$

The inequality (26) is a direct consequence of the definition of $\hat\theta$. Finally, it is easy to see that $F(P_\theta,\theta)=0$, which gives (29). In summary, we have derived that with probability at least $1-\delta$,

$F_{w,b}(P_\theta,\hat\theta) \le 2\epsilon + 2C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$

for all $w\in\mathbb{R}^p$ and $b\in\mathbb{R}$. For any $u\in\mathbb{R}^p$ such that $\|u\|=1$, we take $w=u$ and $b=-u^T\theta$, and we have

$f(0) - f(u^T(\theta-\hat\theta)) \le 2\epsilon + 2C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right),$

where $f(t) = \int\frac{1}{1+e^{z+t}}\varphi(z)dz$, with $\varphi(\cdot)$ being the probability density function of $N(0,1)$. It is not hard to see that as long as $|f(t)-f(0)| \le c$ for some sufficiently small constant $c>0$, then $|f(t)-f(0)| \ge c'|t|$ for some constant $c'>0$. This implies

$\|\hat\theta-\theta\| = \sup_{\|u\|=1}|u^T(\hat\theta-\theta)| \le \frac{1}{c'}\sup_{\|u\|=1}\left(f(0)-f(u^T(\theta-\hat\theta))\right) \lesssim \epsilon + \sqrt{p/n} + \sqrt{\log(1/\delta)/n}$

with probability at least $1-\delta$. The proof is complete.

Proof of Theorem 3.2. We continue to use $P_\eta$ to denote $N(\eta, I_p)$. Define $F(P,\eta) = \max_{\|w\|_1\le\kappa,u,b} F_{w,u,b}(P,\eta)$, where

$F_{w,u,b}(P,\eta) = \mathbb{E}_P\log D(X) + \mathbb{E}_{N(\eta,I_p)}\log(1-D(X)) + \log 4, \quad \text{with } D(x) = {\rm sigmoid}\Big(\sum_{j\ge 1}w_j\sigma(u_j^Tx+b_j)\Big).$

Then,

$F(P_\theta,\hat\theta) \le F((1-\epsilon)P_\theta+\epsilon Q,\hat\theta) + 2\kappa\epsilon$  (30)
$\le F(\mathbb{P}_n,\hat\theta) + 2\kappa\epsilon + C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (31)
$\le F(\mathbb{P}_n,\theta) + 2\kappa\epsilon + C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (32)
$\le F((1-\epsilon)P_\theta+\epsilon Q,\theta) + 2\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (33)
$\le F(P_\theta,\theta) + 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$  (34)
$= 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$

with probability at least $1-\delta$. The inequalities (30)-(34) follow arguments similar to those for (24)-(28). To be specific, (31) and (33) are implied by Lemma D.2, and (32) is a direct consequence of the definition of $\hat\theta$. To see (30) and (34), note that for any $w$ such that $\|w\|_1\le\kappa$, we have

$|\log(2D(X))| \le \Big|\sum_{j\ge 1}w_j\sigma(u_j^TX+b_j)\Big| \le \kappa.$

A similar argument gives the same bound for $|\log(2(1-D(X)))|$. This leads to

$\sup_\eta|F((1-\epsilon)P_\theta+\epsilon Q,\eta) - F(P_\theta,\eta)| \le 2\kappa\epsilon,$  (35)

which further implies (30) and (34). To summarize, we have derived that with probability at least $1-\delta$,

$F_{w,u,b}(P_\theta,\hat\theta) \le 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$

for all $\|w\|_1\le\kappa$, $\|u_j\|\le 1$ and $b_j$. Take $w_1=\kappa$, $w_j=0$ for all $j>1$, $u_1=u$ for some unit vector $u$, and $b_1=-u^T\theta$, and we get

$f_{u^T(\hat\theta-\theta)}(\kappa) \le 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right),$  (36)

where

$f_\delta(t) = \mathbb{E}\log\frac{2}{1+e^{-t\sigma(Z)}} + \mathbb{E}\log\frac{2}{1+e^{t\sigma(Z+\delta)}},$  (37)

with $Z\sim N(0,1)$. Direct calculations give

$f_\delta'(t) = \mathbb{E}\,\frac{e^{-t\sigma(Z)}}{1+e^{-t\sigma(Z)}}\,\sigma(Z) - \mathbb{E}\,\frac{e^{t\sigma(Z+\delta)}}{1+e^{t\sigma(Z+\delta)}}\,\sigma(Z+\delta),$
$f_\delta''(t) = -\mathbb{E}\,\sigma(Z)^2\frac{e^{t\sigma(Z)}}{(1+e^{t\sigma(Z)})^2} - \mathbb{E}\,\sigma(Z+\delta)^2\frac{e^{t\sigma(Z+\delta)}}{(1+e^{t\sigma(Z+\delta)})^2}.$  (38)

Therefore, $f_\delta(0)=0$, $f_\delta'(0)=\frac{1}{2}\left(\mathbb{E}\sigma(Z)-\mathbb{E}\sigma(Z+\delta)\right)$, and $f_\delta''(t)\ge -\frac{1}{2}$. By the inequality $f_\delta(\kappa)\ge f_\delta(0)+\kappa f_\delta'(0)-\kappa^2/4$, we have $\kappa f_\delta'(0)\le f_\delta(\kappa)+\kappa^2/4$. In view of (36), we have

$\int\sigma(z)\varphi(z)dz - \int\sigma(z+u^T(\hat\theta-\theta))\varphi(z)dz \le 8\epsilon + 4C\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right) + \kappa/2.$

It is easy to see that for the choices of $\sigma(\cdot)$ considered, $\int\sigma(z)\varphi(z)dz - \int\sigma(z+t)\varphi(z)dz$ is locally linear with respect to $t$. This implies that

$\kappa\|\hat\theta-\theta\| = \kappa\sup_{\|u\|=1}u^T(\hat\theta-\theta) \lesssim \kappa\left(\epsilon + \sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right) + \kappa^2.$

Therefore, with a $\kappa\asymp\sqrt{p/n}+\epsilon$, the proof is complete.
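Before turning to the elliptical case, we note that the estimator analyzed in Theorem 3.2 corresponds to the JS-GAN computed in our experiments by alternating gradient updates. The following PyTorch sketch is a minimal illustration of that computation: the generator is simply the location parameter η, and the discriminator is a one-hidden-layer sigmoid network. The optimizer choices, step sizes, iteration counts and the function name js_gan_mean are ours rather than the tuned settings of Section 5, and the ℓ1 constraint on the output-layer weights is omitted for simplicity.

import torch
import torch.nn as nn

def js_gan_mean(X, hidden=20, steps=2000, d_steps=5, lr=0.02):
    # X: (n, p) tensor of (possibly contaminated) observations.
    # Generator: the location parameter eta of N(eta, I_p), updated to minimize the objective.
    # Discriminator: D(x) = sigmoid(w^T sigmoid(U x + b) + c), updated to maximize the JS lower bound.
    n, p = X.shape
    eta = X.median(dim=0).values.clone().requires_grad_(True)   # coordinatewise median as a robust start
    D = nn.Sequential(nn.Linear(p, hidden), nn.Sigmoid(), nn.Linear(hidden, 1), nn.Sigmoid())
    opt_g = torch.optim.SGD([eta], lr=lr)
    opt_d = torch.optim.SGD(D.parameters(), lr=lr)
    for _ in range(steps):
        for _ in range(d_steps):                      # inner maximization over the discriminator
            fake = eta.detach() + torch.randn(n, p)   # samples from N(eta, I_p)
            loss_d = -(torch.log(D(X)).mean() + torch.log(1.0 - D(fake)).mean())
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        fake = eta + torch.randn(n, p)                # reparametrized samples, differentiable in eta
        loss_g = torch.log(1.0 - D(fake)).mean()      # only this term depends on eta
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return eta.detach()

# Example usage: 20% of the points are shifted outliers; the output should stay near the true mean 0.
# X = torch.cat([torch.randn(800, 10), torch.randn(200, 10) + 5.0])
# print(js_gan_mean(X))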
Proof of Theorem 4.1. We use $P_{\theta,\Sigma,h}$ to denote the elliptical distribution $EC(\theta,\Sigma,h)$. Define $F(P,(\eta,\Gamma,g)) = \max_{\|w\|_1\le\kappa,u,b}F_{w,u,b}(P,(\eta,\Gamma,g))$, where

$F_{w,u,b}(P,(\eta,\Gamma,g)) = \mathbb{E}_P\log D(X) + \mathbb{E}_{EC(\eta,\Gamma,g)}\log(1-D(X)) + \log 4, \quad \text{with } D(x)={\rm sigmoid}\Big(\sum_{j\ge 1}w_j\sigma(u_j^Tx+b_j)\Big).$

Let $P$ be the data generating process that satisfies ${\rm TV}(P,P_{\theta,\Sigma,h})\le\epsilon$. Then there exist probability distributions $Q_1$ and $Q_2$ such that

$P + \epsilon Q_1 = P_{\theta,\Sigma,h} + \epsilon Q_2.$

The explicit construction of $Q_1, Q_2$ is given in the proof of Theorem 5.1 of Chen et al. (2018). This implies that

$|F(P,(\eta,\Gamma,g)) - F(P_{\theta,\Sigma,h},(\eta,\Gamma,g))| \le \sup_{\|w\|_1\le\kappa,u,b}|F_{w,u,b}(P,(\eta,\Gamma,g)) - F_{w,u,b}(P_{\theta,\Sigma,h},(\eta,\Gamma,g))| = \epsilon\sup_{\|w\|_1\le\kappa,u,b}|\mathbb{E}_{Q_2}\log(2D(X)) - \mathbb{E}_{Q_1}\log(2D(X))| \le 2\kappa\epsilon.$  (39)

Then, the same argument as in the proof of Theorem 3.2 (with (35) replaced by (39)) leads to the fact that with probability at least $1-\delta$,

$F_{w,u,b}(P_{\theta,\Sigma,h},(\hat\theta,\hat\Sigma,\hat h)) \le 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right)$

for all $\|w\|_1\le\kappa$, $\|u_j\|\le 1$ and $b_j$. Take $w_1=\kappa$, $w_j=0$ for all $j>1$, $u_1=u/\sqrt{u^T\hat\Sigma u}$ for some unit vector $u$, and $b_1=-u^T\theta/\sqrt{u^T\hat\Sigma u}$, and we get

$f_{u^T(\hat\theta-\theta)/\sqrt{u^T\hat\Sigma u}}(\kappa) \le 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right),$

where now

$f_\delta(t) = \int\log\frac{2}{1+e^{-t\sigma\left(s\sqrt{u^T\Sigma u/u^T\hat\Sigma u}\right)}}\,h(s)ds + \int\log\frac{2}{1+e^{t\sigma(\delta+s)}}\,\hat h(s)ds,$

with $\delta = u^T(\hat\theta-\theta)/\sqrt{u^T\hat\Sigma u}$. A similar argument to the proof of Theorem 3.2 gives

$\frac{\kappa}{2}\left(\int\sigma\left(s\sqrt{u^T\Sigma u/u^T\hat\Sigma u}\right)h(s)ds - \int\sigma(\delta+s)\hat h(s)ds\right) \le 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right) + \kappa^2/4.$

Since

$\int\sigma\left(s\sqrt{u^T\Sigma u/u^T\hat\Sigma u}\right)h(s)ds = \frac{1}{2} = \int\sigma(s)\hat h(s)ds$

by the symmetry of $h$ and $\hat h$, the above bound is equivalent to

$\frac{\kappa}{2}\left(H(0)-H(\delta)\right) \le 4\kappa\epsilon + 2C\kappa\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right) + \kappa^2/4,$

where $H(\delta) = \int\sigma(\delta+s)\hat h(s)ds$. The above bound also holds for $\frac{\kappa}{2}(H(\delta)-H(0))$ by a symmetric argument, and therefore the same bound holds for $\frac{\kappa}{2}|H(\delta)-H(0)|$. Since $H'(0) = \int\sigma(s)(1-\sigma(s))\hat h(s)ds \asymp 1$, $H(\delta)$ is locally linear at $\delta=0$, which leads to the desired bound for $\delta = u^T(\hat\theta-\theta)/\sqrt{u^T\hat\Sigma u}$. Finally, since $u^T\hat\Sigma u\le M$, we get the bound for $u^T(\hat\theta-\theta)$. The proof is complete by taking the supremum over all unit vectors $u$.

Proof of Theorem A.1. We continue to use $P_\eta$ to denote $N(\eta, I_p)$. Define $F(P,\eta) = \sup_{D\in\mathcal{F}^H_L(\kappa,\tau,B)}F_D(P,\eta)$, with

$F_D(P,\eta) = \mathbb{E}_P\log D(X) + \mathbb{E}_{N(\eta,I_p)}\log(1-D(X)) + \log 4.$

Following the same argument as in the proof of Theorem 3.2 and using Lemma D.3, we have

$F_D(P_\theta,\hat\theta) \le C\kappa\left(\epsilon + (2B)^{L-1}\sqrt{\frac{p\log p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right)$

uniformly over $D\in\mathcal{F}^H_L(\kappa,\tau,B)$ with probability at least $1-\delta$. Choose $w_1=\kappa$ and $w_j=0$ for all $j>1$. For any unit vector $\tilde u\in\mathbb{R}^p$, take $u_{1h}=-u_{1(h+p)}=\tilde u_h$ for $h=1,...,p$ and $b_1=-\tilde u^T\theta$. For $h=1,...,p$, set $g_{1h}(x)=\max(x_h,0)$, and for $h=p+1,...,2p$, set $g_{1h}(x)=\max(-x_{h-p},0)$. It is obvious that such $u$ and $b$ satisfy $\sum_h u_{1h}^2 \le 2$ and $|b_1|\le\|\theta\|\le\sqrt{\log p}\le\sqrt{p\log p}=\tau$. We also need to show that both of the functions $\max(x,0)$ and $\max(-x,0)$ are elements of $\mathcal{G}^H_{L-1}(B)$, which can be proved by induction. It is obvious that $\max(x_h,0),\max(-x_h,0)\in\mathcal{G}^H_1(B)$ for any $h=1,...,p$. Suppose $\max(x_h,0),\max(-x_h,0)\in\mathcal{G}^H_l(B)$ for any $h=1,...,p$. Then, since

$\max\big(\max(x_h,0)-\max(-x_h,0),\,0\big)=\max(x_h,0),$
$\max\big(\max(-x_h,0)-\max(x_h,0),\,0\big)=\max(-x_h,0),$

we have $\max(x_h,0),\max(-x_h,0)\in\mathcal{G}^H_{l+1}(B)$ as long as $B\ge 2$. Hence, the above construction satisfies $D(x)={\rm sigmoid}\big(\kappa\,{\rm sigmoid}(\tilde u^T(x-\theta))\big)\in\mathcal{F}^H_L(\kappa,\tau,B)$, and we have

$f_{\tilde u^T(\hat\theta-\theta)}(\kappa) \le C\kappa\left(\epsilon + (2B)^{L-1}\sqrt{\frac{p\log p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right),$

where the definition of $f_\delta(t)$ is given by (37) with $Z\sim N(0,1)$ and $\sigma(\cdot)$ taken to be ${\rm sigmoid}(\cdot)$. Applying an argument similar to that in the proof of Theorem 3.2, we obtain the desired result.
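The induction step in the proof above rests on the identity max(max(x,0) − max(−x,0), 0) = max(x,0) and its mirror image. The following short numpy check (ours, not part of the paper) verifies the identity numerically on a grid.

import numpy as np

relu = lambda t: np.maximum(t, 0.0)
x = np.linspace(-5.0, 5.0, 1001)
# Passing the pair (max(x,0), max(-x,0)) through one more ReLU layer reproduces the pair,
# which is why max(x,0) and max(-x,0) stay in G^H_l(B) for every l when B >= 2.
assert np.allclose(relu(relu(x) - relu(-x)), relu(x))
assert np.allclose(relu(relu(-x) - relu(x)), relu(-x))
print("identity verified on the grid")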