# Graph-based Discriminators: Sample Complexity and Expressiveness

Roi Livni, Tel Aviv University, rlivni@tauex.tau.ac.il
Yishay Mansour, Tel Aviv University and Google, mansour.yishay@gmail.com

Abstract. A basic question in learning theory is to identify whether two distributions are identical when we have access only to examples sampled from the distributions. This basic task is considered, for example, in the context of Generative Adversarial Networks (GANs), where a discriminator is trained to distinguish between a real-life distribution and a synthetic distribution. Classically, we use a hypothesis class $H$ and claim that the two distributions are distinct if for some $h \in H$ the expected value on the two distributions is (significantly) different. Our starting point is the following fundamental problem: "is having the hypothesis depend on more than a single random example beneficial?" To address this challenge we define $k$-ary based discriminators, which have a family of Boolean $k$-ary functions $\mathcal{G}$. Each function $g \in \mathcal{G}$ naturally defines a hyper-graph, indicating whether a given hyper-edge exists. A function $g \in \mathcal{G}$ distinguishes between two distributions if the expected value of $g$, on a $k$-tuple of i.i.d. examples, on the two distributions is (significantly) different. We study the expressiveness of families of $k$-ary functions, compared to the classical hypothesis class $H$, which corresponds to $k = 1$. We show a separation in expressiveness of $(k+1)$-ary versus $k$-ary functions. This demonstrates the great benefit of having $k \geq 2$-ary functions as distinguishers. For $k \geq 2$ we introduce a notion similar to the VC-dimension and show that it controls the sample complexity. We proceed and provide upper and lower bounds as a function of our extended notion of VC-dimension.

1 Introduction

The task of discrimination consists of a discriminator that receives finite samples from two distributions, say $p_1$ and $p_2$, and needs to certify whether the two distributions are distinct. Discrimination has a central role within the framework of Generative Adversarial Networks [12], where a discriminator trains a neural net to distinguish between samples from a real-life distribution and samples generated synthetically by another neural network, called a generator.

A possible formal setup for discrimination identifies the discriminator with some class $\mathcal{D} = \{f : \mathcal{X} \to \mathbb{R}\}$ of distinguishing functions. In turn, the discriminator wishes to find the best $d \in \mathcal{D}$ that distinguishes between the two distributions. Formally, she wishes to find $d \in \mathcal{D}$ such that¹
$$\mathbb{E}_{x \sim p_1}[d(x)] - \mathbb{E}_{x \sim p_2}[d(x)] \;>\; \sup_{d' \in \mathcal{D}} \left( \mathbb{E}_{x \sim p_1}[d'(x)] - \mathbb{E}_{x \sim p_2}[d'(x)] \right) - \epsilon. \tag{1}$$

¹Note that with such a $d$ at hand, with an order of $O(1/\epsilon^2)$ examples one can verify whether any discriminator in the class certifies that the two distributions are distinct.

For example, in GANs, the class of distinguishing functions could be the class of neural networks trained by the discriminator. The first term in the RHS of eq. (1) is often referred to as the Integral Probability Metric (IPM) distance w.r.t. a class $\mathcal{D}$ [16], denoted $\mathrm{IPM}_{\mathcal{D}}$. As such, we can think of the discriminator as computing the $\mathrm{IPM}_{\mathcal{D}}$ distance. Whether two given distributions can be distinguished by the discriminator becomes, in the IPM setup, a property of the distinguishing class. Also, the number of examples that need to be observed will depend on the class in question.
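As a concrete aside (not part of the paper's analysis), the following minimal Python sketch shows how the empirical version of eq. (1) can be computed when the distinguishing class is small and finite. The threshold class, the two toy distributions, and the sample size are illustrative assumptions introduced here.

```python
# A minimal sketch (illustrative, not from the paper): estimating the IPM gap
# of eq. (1) from finite samples, for a small finite distinguishing class D.
import random

def empirical_mean(d, sample):
    return sum(d(x) for x in sample) / len(sample)

def best_distinguisher(D, sample1, sample2):
    """Return the d in D maximizing |E_S1[d] - E_S2[d]|, and the empirical gap."""
    best_d, best_gap = None, -1.0
    for d in D:
        gap = abs(empirical_mean(d, sample1) - empirical_mean(d, sample2))
        if gap > best_gap:
            best_d, best_gap = d, gap
    return best_d, best_gap

# Toy distinguishing class: thresholds d_t(x) = 1[x >= t].
D = [(lambda x, t=t: 1 if x >= t else 0) for t in range(0, 20)]

# Two toy distributions over {0, ..., 19}.
p1 = lambda: random.randint(0, 9)      # uniform on {0, ..., 9}
p2 = lambda: random.randint(0, 19)     # uniform on {0, ..., 19}

m = 2000                               # roughly O(1/eps^2) samples per distribution
S1 = [p1() for _ in range(m)]
S2 = [p2() for _ in range(m)]
d_hat, gap_hat = best_distinguisher(D, S1, S2)
print(f"empirical IPM_D estimate: {gap_hat:.3f}")   # true IPM_D here is 0.5 (at t = 10)
```

The brute-force maximization over $\mathcal{D}$ is only feasible because the toy class is finite; for rich classes (e.g., neural networks) this step is replaced by training.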
Thus, if we take a large, expressive class of distinguishers, the discriminator can potentially distinguish between any two distributions that are far in total variation. In that extreme, though, the class of distinguishers would need to be very large and, in turn, the number of samples that need to be observed scales accordingly. One could also choose a "small" class, but at the cost of smaller distinguishing power, which yields a smaller IPM distance. For example, consider two distributions over $[n]$ to be distinguished. We could choose as a distinguishing class the class of all possible subsets of $[n]$. This distinguishing class gives rise to the total variation distance, but the sample complexity turns out to be $O(n)$. Alternatively, we can consider the class of singletons: this class induces a simple IPM distance with graceful sample complexity, but in the worst case the IPM distance can be as small as $O(1/n)$ even though the total variation distance is large. Thus, the IPM framework initiates a study of generalization complexity, where we wish to understand the expressive power of each class and its sample complexity.

For the special case where $\mathcal{D}$ consists of Boolean functions, the problem turns out to be closely related to the classical statistical learning setting and prediction [20]. The sample complexity (i.e., the number of samples that need to be observed by the discriminator) is governed by a combinatorial measure termed the VC dimension. Specifically, for the discriminator to be able to find a $d$ as in eq. (1), she needs to observe on the order of $\Theta(\frac{\rho}{\epsilon^2})$ examples, where $\rho$ is the VC dimension of the class $\mathcal{D}$ [5, 20].

In this work we consider a natural extension of this framework to more sophisticated discriminators. For example, consider a discriminator that observes pairs of points from the distribution and checks for collisions: such a distinguisher cannot, a priori, be modeled as a test of Boolean functions, as the tester measures a relation between two points and not a property of a single point. The collision test has indeed been used, in the context of synthetic data generation, to evaluate the diversity of the synthetic distribution [2]. More generally, suppose we have a class of 2-ary Boolean functions $\mathcal{G} = \{g : g(x_1, x_2) \in \{0, 1\}\}$ and the discriminator wishes to (approximately) compute
$$\mathbb{E}_{(x_1, x_2) \sim p_1^2}[g(x_1, x_2)] - \mathbb{E}_{(x_1, x_2) \sim p_2^2}[g(x_1, x_2)]. \tag{2}$$
Here $p^2$ denotes the product distribution of $p$ with itself. More generally, we may consider $k$-ary mappings, but for the sake of clarity we will restrict our attention in this introduction to $k = 2$. Such a 2-ary Boolean mapping can be considered as a graph, where $g(x_1, x_2) = 1$ symbolizes that there exists an edge between $x_1$ and $x_2$ and, similarly, $g(x_1, x_2) = 0$ denotes that there is no such edge. The collision test, for example, is modelled by a graph that contains only self loops. We thus call such multi-ary statistical tests graph-based distinguishers. Two natural questions then arise:

1. Do graph-based discriminators have any added distinguishing power over classical discriminators?
2. What is the sample complexity of graph-based discriminators?

With respect to the first question we give an affirmative answer: we show a separation between the distinguishing power of graph-based discriminators and classical discriminators. As to the second question, we introduce a new combinatorial measure (termed the graph VC dimension) that governs the sample complexity of graph-based discriminators, analogously to the VC characterization of the sample complexity of classical discriminators.
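To make the collision test mentioned above concrete, here is a hedged sketch of how one might estimate the 2-ary quantity in eq. (2) for the collision graph $g(x_1, x_2) = \mathbb{1}[x_1 = x_2]$. The distributions and sample sizes below are illustrative assumptions, not the paper's construction.

```python
# A minimal sketch (illustrative): the collision test viewed as a 2-ary
# graph-based distinguisher g(x1, x2) = 1[x1 == x2]. We estimate eq. (2), the
# gap in expected edge frequency under the two product distributions, by
# averaging g over disjoint sample pairs.
import random

def collision_rate(sample):
    """Unbiased estimate of E_{(x1,x2) ~ p^2}[1[x1 == x2]] from an i.i.d. sample."""
    pairs = zip(sample[0::2], sample[1::2])        # disjoint pairs -> independent draws
    vals = [1 if a == b else 0 for a, b in pairs]
    return sum(vals) / len(vals)

n = 100
p1 = lambda: random.randrange(n)                   # uniform on a support of size n
p2 = lambda: random.randrange(n // 2)              # uniform on a support of size n/2

m = 200_000                                        # sample size is illustrative
S1 = [p1() for _ in range(m)]
S2 = [p2() for _ in range(m)]

# A smaller support yields a higher collision rate: the collision probability is
# 1/|support|, so the estimated gap should be roughly 2/n - 1/n = 0.01 here.
print(f"estimated collision gap: {abs(collision_rate(S1) - collision_rate(S2)):.4f}")
```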
We next elaborate on each of these two results. As to the distinguishing power of graph-based discriminators, we give an affirmative answer in the following sense: we show that there exists a single graph $g$ such that, for any distinguishing class $\mathcal{D}$ with bounded VC dimension and any $\epsilon$, there are two distributions $p_1$ and $p_2$ that are $\mathcal{D}$-indistinguishable, yet $g$ certifies that $p_1$ and $p_2$ are distinct. Namely, the quantity in eq. (2) is at least $1/4$ for $\mathcal{G} = \{g\}$.

This result may be surprising. It is indeed known that for any two distributions that are $\epsilon$-far in total variation, there exists a Boolean mapping $d$ that distinguishes between the two distributions. In that sense, distinguishing classes are known to be universal. Thus, asymptotically, with enough samples, any two distributions can ultimately be distinguished via a standard distinguishing function. Nevertheless, our result shows that, given finite data, the restriction to classes with finite capacity is limiting, and there can be graph-based distinguishing functions whose distinguishing power is not comparable to that of any class with finite capacity. We stress that the same graph competes with all finite-capacity classes, irrespective of their VC dimension.

With respect to the second question, we introduce a new VC-like notion, termed the graph VC dimension, that extends naturally to graphs (and hypergraphs). On a high level, we show that for a class of graph-based distinguishers with graph VC dimension $\rho$, $O(\rho)$ examples are sufficient for discrimination and $\Omega(\sqrt{\rho})$ examples are necessary. This leaves a gap of a factor of $\sqrt{\rho}$, which we leave as an open question. The notion we introduce is strictly weaker than the standard VC dimension of families of multi-ary functions, and the proofs we provide do not follow directly from classical results on learnability of finite VC classes [20, 5].

In more detail, a graph-based distinguishing class $\mathcal{G}$ is a family of Boolean functions over the product space of vertices $\mathcal{V}$: $\mathcal{G} \subseteq \{0, 1\}^{\mathcal{V}^2}$. As such, it is equipped with a VC dimension: the size of the largest set of pairs of vertices that is shattered by $\mathcal{G}$. It is not hard to show that a finite VC dimension is sufficient to achieve finite sample complexity bounds over 2-ary functions [9]. It turns out, though, that it is not a necessary condition: for example, one can show that the class of $k$-regular graphs has finite graph VC dimension but infinite VC dimension. Thus, even though such classes are not learnable in the standard PAC setting, they have finite sample complexity within the framework of discrimination. The reason for this gap, between learnability and discriminability, is that learning requires uniform convergence with respect to any possible distribution over pairs, while discrimination requires uniform convergence only with respect to product distributions. Formally, then, it is a weaker task and, potentially, can be performed even for classes with infinite VC dimension.

1.1 Related Work

The task of discrimination has been considered as early as the work of Vapnik and Chervonenkis [20]. In fact, even though the original work of Vapnik and Chervonenkis is often referred to in the context of prediction, it considered the question of when the empirical frequency of Boolean functions converges uniformly to the true probability over a class of functions. In that sense, this work can be considered as a natural extension to $k$-ary functions and a generalization of the notion of VC dimension.
The work of [9, 8] also studies a generalization of VC theory to multi-ary functions, in the context of ranking tasks and U-statistics. They study the standard notion of VC dimension: specifically, they consider the function class as Boolean functions over multi-tuples, and the VC dimension is defined by the largest set of multi-tuples that can be shattered. Their work provides several interesting fast-rate convergence guarantees. As discussed in the introduction, our notion of capacity is weaker, and in general the results are incomparable.

GANs. A more recent interest in discrimination tasks is motivated by the framework of GANs, where a neural network is trained to distinguish between two sets of data: one is real and the other is generated by another neural network, called a generator. Multi-ary tests have been proposed to assess the quality of GAN networks. [2] suggests a birthday-paradox test to evaluate diversity in GANs. [18] uses binning to assess the solution proposed by GANs. Closer to this work, [15] suggests the use of a discriminator that observes samples from the $m$-th product distribution. Motivated by the problem of mode collapse, they propose a theoretical framework in which they study the algorithmic benefits of such discriminators and observe that they can significantly reduce mode collapse. In contrast, our work is less concerned with the problem of mode collapse directly; we ask, in general, whether we can boost the distinguishing power of discriminators via multi-ary discrimination. Moreover, we provide several novel sample complexity bounds.

Property Testing. A problem related to ours is that of testing closeness of distributions [3, 11]. Traditionally, testing closeness of distributions is concerned with evaluating whether two discrete distributions are close vs. far/identical in total variation. [11], motivated by testing graph expansion, proposes a collision test to verify whether a certain distribution is close to uniform. Interestingly, a collision test is a graph-based discriminator, which turns out to be optimal for this setting [17]. Our sample complexity lower bounds are derived from these results: specifically, we reduce discrimination to testing uniformity [17]. Other lower bounds in the literature can be similarly used to achieve alternative (yet incomparable) bounds (e.g., [7] provides an $\Omega(n^{2/3}/\epsilon^{4/3})$ lower bound for testing whether two distributions are far or close). In contrast with the aforementioned setup, here we do not measure distance between distributions in terms of total variation but in terms of an IPM distance induced by a class of distinguishers. The advantage of the IPM distance is that it can sometimes be estimated with a limited amount of samples, while the total variation distance scales with the size of the support, which is often too large to allow estimation.

Several works do study the question of distinguishing between two distributions w.r.t. a finite-capacity class of tests. Specifically, the work of [14] studies refutation algorithms that distinguish between noisy labels and labels that correlate with a bounded hypothesis class. [19] studies a closely related question in the context of realizable PAC learning. A graph-based discriminator can be directly turned into a refutation algorithm, and both works [14, 19] show reductions from refutation to learning. In turn, the agnostic bounds of [14] can be harnessed to achieve lower bounds for graph-based discrimination. Unfortunately, this approach leads to suboptimal lower bounds.
It would be interesting to see if one can improve the guarantees for such reductions, and in turn exploit them for our setting.

2 Problem Setup

2.1 Basic Notations

Graphs and Hypergraphs. Recall that a $k$-hypergraph $g$ consists of a set $V_g$ of vertices and a collection of non-empty $k$-tuples over $\mathcal{V}$, $E_g \subseteq \mathcal{V}^k$, which are referred to as hyperedges. If $k = 2$ then $g$ is called a graph. $1$-hypergraphs are simply identified with subsets of $\mathcal{V}$. We will normally use $d$ to denote such $1$-hypergraphs and will refer to them as distinguishers. A distinguisher $d$ can be identified with a Boolean function according to the rule: $d(x) = 1$ iff $x \in E_d$. Similarly, we can identify a $k$-hypergraph with a function $g : \mathcal{V}^k \to \{0, 1\}$. Namely, for any graph $g$ we identify it with the Boolean function
$$g(v_1, \ldots, v_k) = \begin{cases} 1 & (v_1, \ldots, v_k) \in E_g \\ 0 & \text{otherwise.} \end{cases}$$
We will further simplify and assume that $g$ is undirected; this means that for any permutation $\pi : [k] \to [k]$ we have $g(v_{\pi(1)}, v_{\pi(2)}, \ldots, v_{\pi(k)}) = g(v_1, \ldots, v_k)$. We will call undirected $k$-hypergraphs $k$-distinguishers. A collection of $k$-distinguishers over a common set of vertices $\mathcal{V}$ will be referred to as a $k$-distinguishing class. If $k = 1$ we will simply call such a collection a distinguishing class. For $k > 1$ we will normally denote such a collection by $\mathcal{G}$, and for $k = 1$ we will often use the letter $\mathcal{D}$.

Next, given a distribution $P$ over vertices and a $k$-hypergraph $g$, let us denote as follows the frequency of an edge w.r.t. $P$:
$$E_P(g) = \mathbb{E}_{v_{1:k} \sim P^k}[g(v_{1:k})] = P^k\left[\{(v_1, \ldots, v_k) : (v_1, \ldots, v_k) \in E_g\}\right],$$
where we use the notation $v_{1:t}$ as shorthand for the sequence $(v_1, \ldots, v_t) \in \mathcal{V}^t$, and $P^k$ denotes the product distribution of $P$, $k$ times. Similarly, given a sample $S = \{v_i\}_{i=1}^m$, we denote the empirical frequency of an edge:
$$E_S(g) = \frac{1}{m^k} \sum_{u_{1:k} \in S^k} g(u_{1:k}) = \frac{\left|\{(u_1, \ldots, u_k) \in E_g : \forall i,\ u_i \in S\}\right|}{m^k}.$$
As a final set of notations: given a $k$-hypergraph $g$ and a sequence $v_{1:n}$ with $n < k$, we define a $(k-n)$-distinguisher $g_{v_{1:n}}$ as follows: $g_{v_{1:n}}(u_{1:k-n}) = g(v_1, \ldots, v_n, u_1, \ldots, u_{k-n})$. In turn, we define the following distinguishing classes: for every sequence $v_{1:n}$, $n < k$, the distinguishing class $\mathcal{G}_{v_{1:n}}$ is defined as
$$\mathcal{G}_{v_{1:n}} = \{g_{v_{1:n}} : g \in \mathcal{G}\}. \tag{3}$$
Finally, we point out that we will mainly be concerned with the case that $|\mathcal{V}| < \infty$ or $\mathcal{V} = \mathbb{N}$. However, all the results here can easily be extended to other domains as long as certain (natural) measurability assumptions are made to ensure that VC theory holds (see [20, 4]).

2.2 IPM distance

Given a class of distinguishers $\mathcal{D}$, the induced IPM distance [16], denoted by $\mathrm{IPM}_{\mathcal{D}}$, is a (pseudo) metric between distributions over $\mathcal{V}$ defined as follows:
$$\mathrm{IPM}_{\mathcal{D}}(p_1, p_2) = \sup_{d \in \mathcal{D}} |E_{p_1}(d) - E_{p_2}(d)| = \sup_{d \in \mathcal{D}} \left| \mathbb{E}_{v \sim p_1}[d(v)] - \mathbb{E}_{v \sim p_2}[d(v)] \right|.$$
The definition can naturally be extended to a general family of graphs, and we define
$$\mathrm{IPM}_{\mathcal{G}}(p_1, p_2) = \sup_{g \in \mathcal{G}} |E_{p_1}(g) - E_{p_2}(g)| = \sup_{g \in \mathcal{G}} \left| \mathbb{E}_{v_{1:k} \sim p_1^k}[g(v_{1:k})] - \mathbb{E}_{v_{1:k} \sim p_2^k}[g(v_{1:k})] \right|.$$
Another metric we will care about is the total variation metric. Given two distributions $p_1$ and $p_2$, the total variation distance is defined as
$$\mathrm{TV}(p_1, p_2) = \sup_{E} |p_1(E) - p_2(E)|,$$
where the supremum is over all measurable events $E \subseteq \mathcal{V}$. In contrast with an IPM distance, the total variation metric is indeed a metric: for any two distributions $p_1 \neq p_2$ we have $\mathrm{TV}(p_1, p_2) > 0$. In fact, for every distinguishing class $\mathcal{D}$, $\mathrm{IPM}_{\mathcal{D}} \leq \mathrm{TV}$.² For finite classes of vertices $\mathcal{V}$, it is known that the total variation metric is given by $\mathrm{TV}(p_1, p_2) = \frac{1}{2} \sum_{v \in \mathcal{V}} |p_1(v) - p_2(v)|$. Further, if we let $\mathcal{D} = \mathcal{P}(\mathcal{V})$, the power set of $\mathcal{V}$, we obtain $\mathrm{IPM}_{\mathcal{P}(\mathcal{V})}(p_1, p_2) = \mathrm{TV}(p_1, p_2)$.

²We use the notation $f_1 \leq f_2$ to denote that for every $x, y$ we have $f_1(x, y) \leq f_2(x, y)$.
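As an illustration of the notation above, the following sketch computes the empirical edge frequency $E_S(g)$ by brute force and the resulting plug-in estimate of $\mathrm{IPM}_{\mathcal{G}}$ for a small finite class. The toy 2-distinguishers and samples are our own illustrative choices, not objects from the paper.

```python
# A minimal sketch of the notation above (the names and the toy class are ours):
# the empirical edge frequency E_S(g) of a k-distinguisher g and the induced
# plug-in estimate of IPM_G for a finite class G.
from itertools import product
from typing import Callable, Iterable, List

KDistinguisher = Callable[..., int]          # g(v_1, ..., v_k) in {0, 1}, symmetric

def empirical_frequency(g: KDistinguisher, k: int, S: List[int]) -> float:
    """E_S(g) = (1 / m^k) * sum over all k-tuples u_{1:k} in S^k of g(u_{1:k})."""
    m = len(S)
    return sum(g(*u) for u in product(S, repeat=k)) / (m ** k)   # brute force, O(m^k)

def plugin_ipm(G: Iterable[KDistinguisher], k: int, S1: List[int], S2: List[int]) -> float:
    """sup_{g in G} |E_{S1}(g) - E_{S2}(g)|: the empirical counterpart of IPM_G."""
    return max(abs(empirical_frequency(g, k, S1) - empirical_frequency(g, k, S2)) for g in G)

# Example with k = 2: the collision graph and a "same parity" graph.
collision = lambda u, v: 1 if u == v else 0
same_parity = lambda u, v: 1 if (u - v) % 2 == 0 else 0
G = [collision, same_parity]

S1 = [1, 3, 5, 7, 9, 11]                      # only odd vertices
S2 = [1, 2, 3, 4, 5, 6]
print(plugin_ipm(G, 2, S1, S2))               # same_parity: |1.0 - 0.5| = 0.5
```

Enumerating all of $S^k$ is only reasonable for tiny samples; in practice one would subsample tuples, in the spirit of incomplete U-statistics [8].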
2.3 Discriminating Algorithms

Definition 1. Given a distinguishing class $\mathcal{G}$, a $\mathcal{G}$-discriminating algorithm $A$ with sample complexity $m(\epsilon, \delta)$ is an algorithm that receives as input two finite samples $S = (S_1, S_2)$ of vertices and outputs a hypergraph $g^A_S \in \mathcal{G}$ such that: if $S_1, S_2$ are drawn i.i.d. from some unknown distributions $p_1, p_2$ respectively and $|S_1|, |S_2| > m(\epsilon, \delta)$, then with probability $(1 - \delta)$ the algorithm's output satisfies
$$|E_{p_1}(g^A_S) - E_{p_2}(g^A_S)| > \mathrm{IPM}_{\mathcal{G}}(p_1, p_2) - \epsilon.$$
The sample complexity of a class $\mathcal{G}$ is then given by the smallest possible sample complexity of a $\mathcal{G}$-discriminating algorithm $A$. A class $\mathcal{G}$ is said to be discriminable if it has finite sample complexity, namely, if there exists a discriminating algorithm for $\mathcal{G}$ with sample complexity $m(\epsilon, \delta) < \infty$.

VC classes are discriminable. For the case $k = 1$, discrimination is closely related to PAC learning. It is easy to see that a proper learning algorithm for a class $\mathcal{D}$ can be turned into a discriminator. Indeed, given access to samples from two distributions $p_1$ and $p_2$, we can provide a learner with labelled examples from a distribution $p$ defined as follows: $p(y = 1) = p(y = -1) = \frac{1}{2}$, $p(\cdot \mid y = 1) = p_1$, and $p(\cdot \mid y = -1) = p_2$. Given access to samples from $p_1$ and $p_2$ we can clearly generate i.i.d. samples from the distribution $p$. If, in turn, we provide a learner with samples from $p$ and it outputs a hypothesis $d \in \mathcal{D}$, we have that (w.h.p.)
$$|E_{p_1}(d) - E_{p_2}(d)| = 2\left|\tfrac{1}{2} \mathbb{E}_{(x,y) \sim p_1 \times \{1\}}[y\, d(x)] + \tfrac{1}{2} \mathbb{E}_{(x,y) \sim p_2 \times \{-1\}}[y\, d(x)]\right| = 2\left|\mathbb{E}_{(x,y) \sim p}[y\, d(x)]\right|$$
$$= 2\big(1 - 2\, p(d(x) \neq y)\big) \geq 2\Big(1 - 2\big(\min_{d' \in \mathcal{D}} p(d'(x) \neq y) + \epsilon\big)\Big) = \max_{d' \in \mathcal{D}} 2\left|\mathbb{E}_{(x,y) \sim p}[y\, d'(x)]\right| - 4\epsilon = \max_{d' \in \mathcal{D}} |E_{p_1}(d') - E_{p_2}(d')| - 4\epsilon = \mathrm{IPM}_{\mathcal{D}}(p_1, p_2) - 4\epsilon.$$
One can also see that a converse relation holds if we restrict our attention to learning balanced labels (i.e., $p(y = 1) = p(y = -1)$): namely, given labelled examples from some balanced distribution, the output of a discriminator is a predictor that competes with the class of predictors induced by $\mathcal{D}$.

Overall, the above calculation, together with Vapnik and Chervonenkis's classical result [20], shows that classes with finite VC dimension $\rho$ are discriminable with sample complexity $O(\frac{\rho}{\epsilon^2})$.³ The necessity of finite VC dimension for agnostic PAC learning was shown in [1]. Basically the same argument shows that, given a class $\mathcal{D}$, $\Omega(\frac{\rho}{\epsilon^2})$ examples are necessary for discrimination. We next introduce a natural extension of the VC dimension to hypergraphs, which will play a similar role.

³Recall that the VC dimension of a class $\mathcal{D}$ is the size of the largest set that can be shattered by $\mathcal{D}$, where a set $S \subseteq \mathcal{V}$ is said to be shattered if $\mathcal{D}$ restricted to $S$ consists of all $2^{|S|}$ possible Boolean functions.

2.4 VC Dimension of hypergraphs

We next define the notion of graph VC dimension for hypergraphs. As we will later see, this notion indeed characterizes the sample complexity of discriminating classes, and in that sense it is a natural extension of the notion of VC dimension for hypothesis classes:

Definition 2. Given a family of $k$-hypergraphs $\mathcal{G}$, the graph VC dimension of the class $\mathcal{G}$, denoted $\mathrm{gVC}(\mathcal{G})$, is defined inductively as follows:
- For $k = 1$, $\mathrm{gVC}(\mathcal{G})$ is the standard notion of VC dimension, i.e., $\mathrm{gVC}(\mathcal{G}) = \mathrm{VC}(\mathcal{G})$.
- For $k > 1$: $\mathrm{gVC}(\mathcal{G}) = \max_{v \in \mathcal{V}} \{\mathrm{gVC}(\mathcal{G}_v)\}$.

Roughly, the graph VC dimension of a hypergraph class is given by the VC dimension of the induced classes of distinguishers via projections. Namely, we can think of the VC dimension of hypergraphs as the projected VC dimension when we fix all coordinates in an edge except for one.
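The inductive definition above can be made concrete by brute force over a small finite vertex set. The sketch below is a toy illustration, not the paper's code: the interval-graph class is an assumption we introduce, and the exhaustive search is only viable for tiny domains. It computes $\mathrm{gVC}(\mathcal{G}) = \max_{v} \mathrm{VC}(\mathcal{G}_v)$ for $k = 2$.

```python
# A minimal sketch of Definition 2 for k = 2 (the vertex set and class are
# illustrative): gVC(G) = max over vertices v of the ordinary VC dimension of
# the projected class G_v = { g(v, .) : g in G }.
from itertools import combinations, product

def vc_dimension(D, V):
    """Brute-force VC dimension of a finite class D of Boolean functions over V."""
    best = 0
    for r in range(1, len(V) + 1):
        shattered_some = False
        for S in combinations(V, r):
            patterns = {tuple(d(x) for x in S) for d in D}
            if len(patterns) == 2 ** r:              # S is shattered
                shattered_some = True
                break
        if shattered_some:
            best = r
        else:
            break           # if no set of size r is shattered, no larger set can be
    return best

def graph_vc_dimension(G, V):
    """gVC for a class G of 2-distinguishers: max_v VC({ g(v, .) : g in G })."""
    return max(vc_dimension([lambda u, g=g, v=v: g(v, u) for g in G], V) for v in V)

# Toy class over V = {0,...,5}: symmetric "interval graphs" g_{a,b}(v,u) = 1[a <= u+v <= b].
V = list(range(6))
G = [lambda v, u, a=a, b=b: 1 if a <= u + v <= b else 0
     for a, b in product(range(11), repeat=2) if a <= b]
print(graph_vc_dimension(G, V))        # each projection is a class of intervals, so gVC = 2
```

The point of the toy class is that every projection $\mathcal{G}_v$ is a class of intervals in $u$, so its VC dimension, and hence the graph VC dimension, is 2 regardless of how many graphs the class contains.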
3 Main Results

We next describe the main results of this work. The results are divided into two parts: in the first part we characterize the sample complexity of graph-based distinguishing classes; the second part is concerned with the expressive/distinguishing power of graph-based discriminators. All proofs are provided in the full version.

3.1 The sample complexity of graph-based distinguishing classes

We begin by providing upper bounds on the sample complexity of discrimination.

Theorem 1 (Sample Complexity Upper Bound). Let $\mathcal{G}$ be a $k$-distinguishing class with $\mathrm{gVC}(\mathcal{G}) = \rho$. Then $\mathcal{G}$ has sample complexity $O(\frac{\rho k^2}{\epsilon^2} \log 1/\delta)$.

Theorem 1 is a corollary of the following uniform convergence upper bound for graph-based distinguishing classes.

Theorem 2 (Uniform Convergence). Let $\mathcal{G}$ be a $k$-distinguishing class with $\mathrm{gVC}(\mathcal{G}) = \rho$. Let $S = \{v_i\}_{i=1}^m$ be an i.i.d. sample of vertices drawn from some unknown distribution $P$. If $m = \Omega(\frac{\rho k^2}{\epsilon^2} \log 1/\delta)$ then with probability at least $(1 - \delta)$ (over the randomness of $S$):
$$\sup_{g \in \mathcal{G}} |E_S(g) - E_P(g)| \leq \epsilon.$$

We next provide a lower bound on the sample complexity of discriminating algorithms in terms of the graph VC dimension of the class.

Theorem 3 (Sample Complexity Lower Bound). Let $\mathcal{G}$ be a $k$-distinguishing class with $\mathrm{gVC}(\mathcal{G}) = \rho$. Any $\mathcal{G}$-discriminating algorithm with accuracy $\epsilon > 0$ that succeeds with probability $1 - 2^{-k \log k}$ must observe at least $\Omega\!\left(\frac{\sqrt{\rho}}{2^{7k^3} \epsilon^2}\right)$ samples.

Our upper and lower bounds leave a gap of order $O(\sqrt{\rho})$: as discussed in Section 2.3, for the case $k = 1$ we can provide a tight $\Theta(\frac{\rho}{\epsilon^2})$ bound through a reduction to agnostic PAC learning and the appropriate lower bounds [1].

3.2 The expressive power of graph-based distinguishing classes

So far we have characterized the discriminability of graph-based distinguishing classes. It is natural, though, to ask whether graph-based distinguishing classes add any advantage over standard 1-distinguishing classes. In this section we provide several results showing that graphs indeed provide extra expressive power over standard distinguishing classes. We begin with a result over an infinite domain.

Theorem 4. Let $\mathcal{V} = \mathbb{N}$. There exists a distinguishing graph class $\mathcal{G}$, with sample complexity $m(\epsilon, \delta) = O(\frac{\log 1/\delta}{\epsilon^2})$ (in fact $|\mathcal{G}| = 1$), such that: for any 1-distinguishing class $\mathcal{D}$ with finite VC dimension, and every $\epsilon > 0$, there are two distributions $p_1, p_2$ such that $\mathrm{IPM}_{\mathcal{D}}(p_1, p_2) < \epsilon$ but $\mathrm{IPM}_{\mathcal{G}}(p_1, p_2) > 1/2$.

Theorem 4 can be generalized to higher-order distinguishing classes:

Theorem 5. Let $\mathcal{V} = \mathbb{N}$. There exists a $k$-distinguishing class $\mathcal{G}_k$, with sample complexity $m(\epsilon, \delta) = O(\frac{k^2 + \log 1/\delta}{\epsilon^2})$, such that: for any $(k-1)$-distinguishing class $\mathcal{G}_{k-1}$ with bounded sample complexity, and every $\epsilon > 0$, there are two distributions $p_1, p_2$ such that $\mathrm{IPM}_{\mathcal{G}_{k-1}}(p_1, p_2) < \epsilon$ and $\mathrm{IPM}_{\mathcal{G}_k}(p_1, p_2) > 1/4$.

Finite Graphs. We next study the expressive power of distinguishing graphs over finite domains. It is known that, over a finite domain $\mathcal{V} = \{1, \ldots, n\}$, any distinguishing class can be learned with a sample complexity of $O(\frac{n}{\epsilon^2} \log 1/\delta)$. In fact, we can learn the total variation metric (indeed, the sample complexity of $\mathcal{P}(\mathcal{V})$ is bounded by $\log |\mathcal{P}(\mathcal{V})| = n$). Therefore, if we allow classes whose sample complexity scales linearly with $n$, we cannot hope to show any advantage for distinguishing graphs. However, in most natural problems $n$ is considered to be very large (for example, over the Boolean cube $n$ is exponential in the dimension). We thus, in general, would like to study classes that have better complexity in terms of $n$. In that sense, we can show that distinguishing graphs indeed yield extra expressive power.
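The remark above, that the total variation metric over a finite domain $[n]$ can be estimated but only at a cost scaling with $n$, is easy to make concrete. Below is a minimal, purely illustrative plug-in sketch (the distributions and sample sizes are our own assumptions); its accuracy is what degrades with $n$, which is why classes with sample complexity much smaller than $n$ are the interesting regime.

```python
# A minimal sketch of the plug-in total-variation estimate over a finite domain
# [n] (sample sizes and distributions are illustrative): it needs on the order
# of n samples, which motivates studying cheaper distinguishing classes.
import random
from collections import Counter

def plugin_tv(S1, S2, n):
    """TV(p1_hat, p2_hat) = (1/2) * sum_v |p1_hat(v) - p2_hat(v)| over v in [n]."""
    c1, c2 = Counter(S1), Counter(S2)
    return 0.5 * sum(abs(c1[v] / len(S1) - c2[v] / len(S2)) for v in range(n))

n = 1000
p1 = lambda: random.randrange(n)                    # uniform on [n]
p2 = lambda: random.randrange(n // 2)               # uniform on half of [n]
m = 20 * n                                          # roughly O(n / eps^2) samples
S1 = [p1() for _ in range(m)]
S2 = [p2() for _ in range(m)]
print(f"plug-in TV estimate: {plugin_tv(S1, S2, n):.3f}")   # true TV here is 0.5
```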
In particular, we show that for classes with sublogarithmic sample complexity, we can construct graphs that are incomparable with a higher-order distinguishing class.

Theorem 6. Let $|\mathcal{V}| = n$. There exists a $k$-distinguishing class $\mathcal{G}_k$, with sample complexity $m(\epsilon, \delta) = O(\frac{k^2 + \log 1/\delta}{\epsilon^2})$ (in fact $|\mathcal{G}_k| = 1$), such that: for any $\epsilon > 0$ and any $(k-1)$-distinguishing class $\mathcal{G}_{k-1}$, if $\mathrm{IPM}_{\mathcal{G}_{k-1}} \geq \epsilon \cdot \mathrm{IPM}_{\mathcal{G}_k}$ then $\mathrm{gVC}(\mathcal{G}_{k-1}) = \Omega(\epsilon^2 \cdots)$.

We can improve the bound in Theorem 6 for the graph case, i.e., when the competing class is a 1-distinguishing class:

Theorem 7. Let $|\mathcal{V}| = n$. There exists a 2-distinguishing class $\mathcal{G}$, with sample complexity $m(\epsilon, \delta) = O(\frac{\log 1/\delta}{\epsilon^2})$ (in fact $|\mathcal{G}| = 1$), such that: for any $\epsilon > 0$ and any distinguishing class $\mathcal{D}$, if $\mathrm{IPM}_{\mathcal{D}} \geq \epsilon \cdot \mathrm{IPM}_{\mathcal{G}}$ then $\mathrm{gVC}(\mathcal{D}) = \Omega(\epsilon^2 \log n)$.

4 Discussion and open problems

In this work we developed a generalization of the standard framework of discrimination to graph-based distinguishers that discriminate between two distributions by considering multi-ary tests. Several open questions arise from our results.

Improving Sample Complexity Bounds. In terms of sample complexity, while we give a natural upper bound of $O(\rho k^2)$, the lower bounds we provide are tight neither in $\rho$ nor in $k$: we provide a lower bound of $\Omega\!\left(\frac{\sqrt{\rho}}{2^{\mathrm{poly}(k)}}\right)$. This leaves room for improvement both in terms of $\rho$ and in terms of $k$.

Improving Expressiveness Bounds. We also showed that, over finite domains, we can construct a graph such that any class competing with it must have VC dimension $\Omega(\epsilon^2 \log n)$. The best upper bound we can provide (the VC dimension of a class that competes with any graph) is the naive $O(n)$, which is the VC dimension corresponding to the total variation metric. Additionally, for the $k$-hypergraph case, our bounds deteriorate to $\Omega(\epsilon^2 \log n)$. The improvement in the graph case follows from using an argument in the spirit of Boosting [10] and the Hardcore Lemma [13] to construct two indistinguishable probabilities with distinct support over a small domain. It would be interesting to extend these techniques in order to achieve similar bounds for the $k > 2$ case.

Relation to GANs and Extension to the Online Setting. Finally, a central motivation for studying the sample complexity of discriminators is the context of GANs. This raises interesting questions as to the foolability of graph-based distinguishers. The work of [6] suggests a framework for studying sequential games between generators and discriminators (GAM-fooling). In a nutshell, the GAM setting considers a sequential game between a generator $G$ that outputs distributions and a discriminator $D$ that has access to data from some distribution $p^*$ (not known to $G$). At each round of the game, the generator proposes a distribution and the discriminator outputs a $d \in \mathcal{D}$ which distinguishes between the distribution of $G$ and the true distribution $p^*$. The class $\mathcal{D}$ is said to be GAM-foolable if the generator outputs, after finitely many rounds, a distribution $p$ that is $\mathcal{D}$-indistinguishable from $p^*$. [6] showed that a class $\mathcal{D}$ is GAM-foolable if and only if it has finite Littlestone dimension. We then ask, similarly, which classes of graph-based distinguishers are GAM-foolable? A characterization of such classes can potentially lead to a natural extension of the Littlestone notion, and of online prediction, to graph-based classes, analogously to what this work does with respect to the VC dimension.

Acknowledgements. Y.M. is supported in part by a grant from the ISF.

References

[1] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
[2] Sanjeev Arora and Yi Zhang. Do GANs actually learn the distribution? An empirical study. arXiv preprint arXiv:1706.08224, 2017.
[3] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing that distributions are close. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 259-269. IEEE, 2000.
[4] Shai Ben-David. 2 notes on classes with Vapnik-Chervonenkis dimension 1. arXiv preprint arXiv:1507.05307, 2015.
[5] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929-965, 1989.
[6] Olivier Bousquet, Roi Livni, and Shay Moran. Passing tests without memorizing: Two models for fooling discriminators. arXiv preprint arXiv:1902.03468, 2019.
[7] Siu-On Chan, Ilias Diakonikolas, Paul Valiant, and Gregory Valiant. Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1193-1203. SIAM, 2014.
[8] Stéphan Clémençon, Igor Colin, and Aurélien Bellet. Scaling-up empirical risk minimization: Optimization of incomplete U-statistics. The Journal of Machine Learning Research, 17(1):2682-2717, 2016.
[9] Stéphan Clémençon, Gábor Lugosi, Nicolas Vayatis, et al. Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2):844-874, 2008.
[10] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In COLT, volume 96, pages 325-332. Citeseer, 1996.
[11] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. In Studies in Complexity and Cryptography: Miscellanea on the Interplay between Randomness and Computation, pages 68-75. Springer, 2011.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[13] Russell Impagliazzo. Hard-core distributions for somewhat hard problems. In Proceedings of the IEEE 36th Annual Foundations of Computer Science, pages 538-545. IEEE, 1995.
[14] Pravesh K. Kothari and Roi Livni. Agnostic learning by refuting. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
[15] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. PacGAN: The power of two samples in generative adversarial networks. In Advances in Neural Information Processing Systems, pages 1498-1507, 2018.
[16] Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997.
[17] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750-4755, 2008.
[18] Eitan Richardson and Yair Weiss. On GANs and GMMs. In Advances in Neural Information Processing Systems, pages 5847-5858, 2018.
[19] Salil P. Vadhan. On learning vs. refutation. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 1835-1848, 2017.
[20] Vladimir N. Vapnik and Aleksei Yakovlevich Chervonenkis. The uniform convergence of frequencies of the appearance of events to their probabilities. In Doklady Akademii Nauk, volume 181, pages 781-783. Russian Academy of Sciences, 1968.