# LEARNING WITH STOCHASTIC ORDERS

Published as a conference paper at ICLR 2023

Carles Domingo-Enrich, New York University, cd2754@nyu.edu
Yair Schiff, Cornell University, yzs2@cornell.edu (work done while at IBM)
Youssef Mroueh, IBM Research AI, mroueh@us.ibm.com

ABSTRACT

Learning high-dimensional distributions is often done with explicit likelihood modeling or implicit modeling via minimizing integral probability metrics (IPMs). In this paper, we expand this learning paradigm to stochastic orders, namely, the convex or Choquet order between probability measures. Towards this end, exploiting the relation between convex orders and optimal transport, we introduce the Choquet-Toland distance between probability measures, which can be used as a drop-in replacement for IPMs. We also introduce the Variational Dominance Criterion (VDC) to learn probability measures with dominance constraints, which encode the desired stochastic order between the learned measure and a known baseline. We analyze both quantities, show that they suffer from the curse of dimensionality, and propose surrogates via input convex maxout networks (ICMNs) that enjoy parametric rates. We provide a min-max framework for learning with stochastic orders and validate it experimentally on synthetic and high-dimensional image generation, with promising results. Finally, our ICMN class of convex functions and its derived Rademacher complexity are of independent interest beyond their application to convex orders. Code to reproduce experimental results is available here.

1 INTRODUCTION

Learning complex high-dimensional distributions with implicit generative models (Goodfellow et al., 2014; Mohamed & Lakshminarayanan, 2017; Arjovsky et al., 2017) via minimizing integral probability metrics (IPMs) (Müller, 1997a) has led to state-of-the-art generation across many data modalities (Karras et al., 2019; De Cao & Kipf, 2018; Padhi et al., 2020). An IPM compares probability distributions with a witness function belonging to a function class $\mathcal{F}$, e.g., the class of Lipschitz functions, which makes the IPM correspond to the 1-Wasserstein distance. While estimating the witness function in such large function classes suffers from the curse of dimensionality, restricting it to a class of neural networks leads to the so-called neural net distance (Arora et al., 2017), which enjoys parametric statistical rates.

In probability theory, the question of comparing distributions is not limited to assessing only equality between two distributions. Stochastic orders were introduced to capture the notion of dominance between measures. Similar to IPMs, stochastic orders can be defined by looking at the integrals of measures over function classes $\mathcal{F}$ (Müller, 1997b). Namely, for $\mu_+, \mu_- \in \mathcal{P}_1(\mathbb{R}^d)$, $\mu_+$ dominates $\mu_-$, written $\mu_- \preceq \mu_+$, if for any function $f \in \mathcal{F}$ we have $\int_{\mathbb{R}^d} f(x)\, d\mu_-(x) \leq \int_{\mathbb{R}^d} f(x)\, d\mu_+(x)$ (see Figure 1a for an example). In the present work, we focus on the Choquet or convex order (Ekeland & Schachermayer, 2014) generated by the space of convex functions (see Sec. 2 for more details).

Previous work has focused on learning with stochastic orders in the one-dimensional setting, as it has prominent applications in mathematical finance and distributional reinforcement learning (RL). The survival function gives a characterization of the convex order in one dimension (see Figure 1b and Sec. 2 for more details).
For instance, in portfolio optimization (Xue et al., 2020; Post et al., 2018; Dentcheva & Ruszczynski, 2003), the goal is to find the portfolio that maximizes the expected return under dominance constraints between the return distribution and a benchmark distribution.

Figure 1: VDC example in 1D. Figure 1a: $\mu_+$ is a mixture of 3 Gaussians and $\mu_-$ corresponds to a single mode of the mixture; $\mu_+$ dominates $\mu_-$ in the convex order. Figure 1b: univariate characterization of the convex order with survival functions (see Sec. 2 for details). Figure 1c: surrogate VDC computation with an input convex maxout network and gradient descent. The surrogate VDC tends to zero at the end of training and hence certifies the convex dominance of $\mu_+$ over $\mu_-$.

A similar concept was introduced in distributional RL (Martin et al., 2020) for learning policies with dominance constraints on the distribution of the reward. While these works are limited to the univariate setting, our work is the first, to the best of our knowledge, that provides a computationally tractable characterization of stochastic orders that is sample efficient and scalable to high dimensions.

The paper is organized as follows: in Sec. 3 we introduce the Variational Dominance Criterion (VDC); the VDC between measures $\mu_+$ and $\mu_-$ takes value 0 if and only if $\mu_+$ dominates $\mu_-$ in the convex order, but it suffers from the curse of dimensionality and cannot be estimated efficiently from samples. To remedy this, in Sec. 4 we introduce a VDC surrogate via input convex maxout networks (ICMNs). ICMNs are a new variant of input convex neural networks (Amos et al., 2017) that we propose as a proxy for convex functions and whose complexity we study. We show in Sec. 4 that the surrogate VDC has parametric rates and can be efficiently estimated from samples. The surrogate VDC can be computed using (stochastic) gradient descent on the parameters of the ICMN and can characterize convex dominance (see Figure 1c). We then show in Sec. 5 how to use the VDC and its surrogate to define a pseudo-distance on the probability space. Finally, in Sec. 6 we propose penalizing the training losses of generative models with the surrogate VDC to learn implicit generative models that have better coverage and spread than known baselines. This leads to a min-max game similar to GANs. We validate our framework in Sec. 7 with experiments on portfolio optimization and image generation.

2 THE CHOQUET OR CONVEX ORDER

Denote by $\mathcal{P}(\mathbb{R}^d)$ the set of Borel probability measures on $\mathbb{R}^d$ and by $\mathcal{P}_1(\mathbb{R}^d) \subseteq \mathcal{P}(\mathbb{R}^d)$ the subset of those which have finite first moment: $\mu \in \mathcal{P}_1(\mathbb{R}^d)$ if and only if $\int_{\mathbb{R}^d} \|x\|\, d\mu(x) < +\infty$.

Comparing probability distributions. Integral probability metrics (IPMs) are pseudo-distances between probability measures $\mu, \nu$ defined as $d_{\mathcal{F}}(\mu, \nu) = \sup_{f \in \mathcal{F}} \mathbb{E}_\mu f - \mathbb{E}_\nu f$, for a given function class $\mathcal{F}$ which is symmetric with respect to sign flips. They are ubiquitous in optimal transport and generative modeling to compare distributions: if $\mathcal{F}$ is the set of functions with Lipschitz constant 1, then the resulting IPM is the 1-Wasserstein distance; if $\mathcal{F}$ is the unit ball of an RKHS, the IPM is its maximum mean discrepancy. Clearly, $d_{\mathcal{F}}(\mu, \nu) = 0$ if and only if $\mathbb{E}_\mu f = \mathbb{E}_\nu f$ for all $f \in \mathcal{F}$, and when $\mathcal{F}$ is large enough, this is equivalent to $\mu = \nu$.
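To make the IPM comparison concrete, the following is a minimal numerical sketch (not from the paper) that estimates the squared maximum mean discrepancy from samples when $\mathcal{F}$ is the unit ball of a Gaussian-kernel RKHS; the kernel bandwidth, sample sizes, and function names are illustrative assumptions.

```python
import numpy as np

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of the squared MMD for the Gaussian kernel
    k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2))."""
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))

    kxx, kyy, kxy = gram(x, x), gram(y, y), gram(x, y)
    n, m = len(x), len(y)
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))  # off-diagonal average
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
mu_samples = rng.normal(0.0, 1.0, size=(500, 2))
nu_samples = rng.normal(0.5, 1.0, size=(500, 2))
print(mmd2_unbiased(mu_samples, nu_samples))  # clearly positive: the distributions differ
```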
The Choquet or convex order. When the class $\mathcal{F}$ is not symmetric with respect to sign flips, comparing the expectations $\mathbb{E}_\mu f$ and $\mathbb{E}_\nu f$ for $f \in \mathcal{F}$ does not yield a pseudo-distance. In the case where $\mathcal{F}$ is the set of convex functions, the convex order naturally arises instead:

Definition 1 (Choquet order, Ekeland & Schachermayer (2014), Def. 4). For $\mu_-, \mu_+ \in \mathcal{P}_1(\mathbb{R}^d)$, we say that $\mu_- \preceq \mu_+$ if for any convex function $f : \mathbb{R}^d \to \mathbb{R}$, we have $\int_{\mathbb{R}^d} f(x)\, d\mu_-(x) \leq \int_{\mathbb{R}^d} f(x)\, d\mu_+(x)$.

$\mu_- \preceq \mu_+$ is classically denoted as "$\mu_-$ is a balayée of $\mu_+$", or "$\mu_+$ dominates $\mu_-$". It turns out that $\preceq$ is a partial order on $\mathcal{P}_1(\mathbb{R}^d)$, meaning that reflexivity ($\mu \preceq \mu$), antisymmetry (if $\mu \preceq \nu$ and $\nu \preceq \mu$, then $\mu = \nu$), and transitivity (if $\mu_1 \preceq \mu_2$ and $\mu_2 \preceq \mu_3$, then $\mu_1 \preceq \mu_3$) hold. As an example, if $\mu_-, \mu_+$ are Gaussians $\mu_- = \mathcal{N}(0, \Sigma_-)$, $\mu_+ = \mathcal{N}(0, \Sigma_+)$, then $\mu_- \preceq \mu_+$ if and only if $\Sigma_- \preceq \Sigma_+$ in the positive-semidefinite order (Müller, 2001). Also, since linear functions are convex, $\mu_- \preceq \mu_+$ implies that both measures have the same expectation: $\mathbb{E}_{x \sim \mu_-} x = \mathbb{E}_{x \sim \mu_+} x$. In the univariate case, $\mu_- \preceq \mu_+$ implies that $\mathrm{supp}(\mu_-) \subseteq \mathrm{supp}(\mu_+)$ and that $\mathrm{Var}(\mu_-) \leq \mathrm{Var}(\mu_+)$, and we have that $\mu_- \preceq \mu_+$ holds if and only if for all $x \in \mathbb{R}$, $\int_x^{+\infty} \bar{F}_{\mu_-}(t)\, dt \leq \int_x^{+\infty} \bar{F}_{\mu_+}(t)\, dt$, where $\bar{F}$ is the survival function (one minus the cumulative distribution function). Note that this characterization can be checked efficiently if one has access to samples of $\mu_-$ and $\mu_+$. In the high-dimensional case, there exists an alternative characterization of the convex order:

Proposition 1 (Ekeland & Schachermayer (2014), Thm. 10). If $\mu_-, \mu_+ \in \mathcal{P}_1(\mathbb{R}^d)$, we have $\mu_- \preceq \mu_+$ if and only if there exists a Markov kernel $R$ (i.e., for all $x \in \mathbb{R}^d$, $\int_{\mathbb{R}^d} y\, dR_x(y) = x$) such that $\mu_+ = R\mu_-$. Equivalently, there exists a coupling $(X, Y)$ such that $\mathrm{Law}(X) = \mu_-$, $\mathrm{Law}(Y) = \mu_+$, and $X = \mathbb{E}(Y \mid X)$ almost surely.

Intuitively, this means that $\mu_+$ is more spread out than $\mu_-$. Remark that this characterization is difficult to check, especially in high dimensions.

3 THE VARIATIONAL DOMINANCE CRITERION

In this section, we present a quantitative way to deal with convex orders. Given a bounded open convex subset $\Omega \subseteq \mathbb{R}^d$ and a compact set $K \subseteq \mathbb{R}^d$, let $\mathcal{A} = \{u : \Omega \to \mathbb{R} \mid u \text{ convex and } \nabla u \in K \text{ almost everywhere}\}$. We define the Variational Dominance Criterion (VDC) between probability measures $\mu_+$ and $\mu_-$ supported on $\Omega$ analogously to IPMs, replacing $\mathcal{F}$ by $\mathcal{A}$:

$\mathrm{VDC}_{\mathcal{A}}(\mu_+ \| \mu_-) := \sup_{u \in \mathcal{A}} \int_\Omega u \, d(\mu_- - \mu_+).$  (1)

Remark that when $0 \in K$, $\mathrm{VDC}_{\mathcal{A}}(\mu_+ \| \mu_-) \geq 0$ because the zero function belongs to the set $\mathcal{A}$. We re-emphasize that since $\mathcal{A}$ is not symmetric with respect to sign flips ($f \in \mathcal{A}$ does not imply $-f \in \mathcal{A}$), the properties of the VDC are very different from those of IPMs. Most importantly, the following proposition, shown in App. A, links the VDC to the Choquet order.

Proposition 2. Let $K$ be a compact set such that the origin belongs to the interior of $K$. If $\mu_+, \mu_- \in \mathcal{P}(\Omega)$, then $\mathrm{VDC}_{\mathcal{A}}(\mu_+ \| \mu_-) := \sup_{u \in \mathcal{A}} \int_\Omega u \, d(\mu_- - \mu_+) = 0$ if and only if $\int_\Omega u \, d(\mu_- - \mu_+) \leq 0$ for any convex function $u$ on $\Omega$ (i.e., $\mu_- \preceq \mu_+$ according to the Choquet order).

That is, Proposition 2 states that the VDC between $\mu_+$ and $\mu_-$ takes value 0 if and only if $\mu_+$ dominates $\mu_-$. Combining this with the interpretation of Proposition 1, we see that, intuitively, the quantity $\mathrm{VDC}_{\mathcal{A}}(\mu_+ \| \mu_-)$ is small when $\mu_+$ is more spread out than $\mu_-$, and large otherwise. Hence, if we want to enforce or induce a Choquet ordering between two measures in an optimization problem, we can include the VDC (or rather, its surrogate introduced in Sec. 4) as a penalization term in the objective. Before this, we explore the connections between the VDC and optimal transport, and study some statistical properties of the VDC.
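Before moving to the optimal-transport view, here is a small numerical illustration of the dominance pictured in Figures 1a-1b, using the univariate survival-function characterization from Sec. 2 (which Proposition 2 ties to a vanishing VDC). It is a self-contained sketch: the mixture parameters, grid, and tolerances are illustrative assumptions, not the parameters behind the paper's figure.

```python
import numpy as np

def integrated_survival(samples, grid):
    """Empirical version of x -> E[(X - x)_+], i.e. the survival function integrated from x."""
    return np.array([np.maximum(samples - x, 0.0).mean() for x in grid])

rng = np.random.default_rng(0)
centers = np.array([-4.0, 0.0, 4.0])
# mu_plus: mixture of 3 Gaussians; mu_minus: its central mode (same mean, less spread out).
mix = rng.normal(centers[rng.integers(0, 3, size=40000)], 1.0)
mode = rng.normal(centers.mean(), 1.0, size=40000)

grid = np.linspace(-10.0, 10.0, 81)
gap = integrated_survival(mix, grid) - integrated_survival(mode, grid)
same_mean = abs(mix.mean() - mode.mean()) < 0.1   # the convex order forces equal means
dominated = bool(np.all(gap >= -0.1))             # tolerance for sampling error
print("mu_minus is dominated by mu_plus in the convex order:", same_mean and dominated)
```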
3.1 THE VDC AND OPTIMAL TRANSPORT

Toland duality provides a way to interpret the VDC through the lens of optimal transport. In the following, $W_2(\mu, \nu)$ denotes the 2-Wasserstein distance between $\mu$ and $\nu$.

Theorem 1 (Toland duality, adapted from Thm. 1 of Carlier (2008)). For any $\mu_+, \mu_- \in \mathcal{P}(\Omega)$, the VDC satisfies:

$\mathrm{VDC}_{\mathcal{A}}(\mu_+ \| \mu_-) = \sup_{\nu \in \mathcal{P}(K)} \left\{ \tfrac{1}{2} W_2^2(\mu_+, \nu) - \tfrac{1}{2} W_2^2(\mu_-, \nu) \right\} - \tfrac{1}{2} \int_\Omega \|x\|^2 \, d(\mu_+ - \mu_-)(x).$  (2)

The optimal convex function $u^\star$ of the VDC in (1) and the optimal $\nu^\star$ in the right-hand side of (2) satisfy $(\nabla u^\star)_{\#}\mu_+ = (\nabla u^\star)_{\#}\mu_- = \nu^\star$, where $(\nabla u^\star)_{\#}\mu_+$ denotes the pushforward of $\mu_+$ by $\nabla u^\star$.

Note that under the assumption $0 \in K$, Theorem 1 implies that $\mathrm{VDC}_{\mathcal{A}}(\mu_+ \| \mu_-) = 0$ if and only if $W_2^2(\mu_+, \nu) - \int_\Omega \|x\|^2 \, d\mu_+ \leq W_2^2(\mu_-, \nu) - \int_\Omega \|x\|^2 \, d\mu_-$ for any $\nu \in \mathcal{P}(K)$. Under the equivalence $\mathrm{VDC}_{\mathcal{A}}(\mu_+ \| \mu_-) = 0 \iff \mu_- \preceq \mu_+$ shown by Proposition 2, this provides yet another characterization of the convex order in arbitrary dimension.

3.2 STATISTICAL RATES FOR VDC ESTIMATION

In this subsection, we present an upper bound on the statistical rate of estimation of $\mathrm{VDC}_{\mathcal{A}}(\mu \| \nu)$ using the estimator $\mathrm{VDC}_{\mathcal{A}}(\mu_n \| \nu_n)$ based on the empirical distributions $\mu_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$, $\nu_n = \frac{1}{n}\sum_{i=1}^n \delta_{y_i}$ built from i.i.d. samples $(x_i)_{i=1}^n$, $(y_i)_{i=1}^n$ from $\mu$ and $\nu$, respectively.

Theorem 2. Let $\Omega = [-1, 1]^d$ and $K = \{x \in \mathbb{R}^d \mid \|x\|_2 \leq C\}$ for an arbitrary $C > 0$. With probability at least $1 - \delta$,
$|\mathrm{VDC}_{\mathcal{A}}(\mu \| \nu) - \mathrm{VDC}_{\mathcal{A}}(\mu_n \| \nu_n)| \leq \sqrt{18\, C^2 d \log(4/\delta) \cdot \tfrac{2}{n}} + 8 K n^{-2/d},$
where $K$ is a constant that depends on $C$ and $d$.

The proof of this result is in App. A. The dependency on $n^{-2/d}$ is indicative of the curse of dimensionality: we need a number of samples $n$ exponential in $d$ to control the estimation error. While Theorem 2 only shows an upper bound on the difference between the VDC and its estimator, in Subsec. 5.1 we study a related setting where an $\Omega(n^{-2/d})$ lower bound is available. Hence, we hypothesize that VDC estimation is in fact cursed by dimension in general.

4 A VDC SURROGATE VIA INPUT CONVEX MAXOUT NETWORKS

Given the link between the VDC and the convex order, one is inclined to use the VDC as a quantitative proxy to induce convex-order domination in optimization problems. Estimating the VDC entails solving an optimization problem over convex functions. In practice, we only have access to the empirical versions $\mu_n, \nu_n$ of the probability measures $\mu, \nu$; we could compute the VDC between the empirical measures by solving a linear program similar to the ones used in non-parametric convex regression (Hildreth, 1954). However, the statistical rates for VDC estimation from samples are cursed by dimension (Subsec. 3.2), which means that we would need a number of samples exponential in the dimension to get a good estimate. Our approach is to focus on a surrogate problem instead: $\sup_{u \in \hat{\mathcal{A}}} \int_\Omega u \, d(\mu_- - \mu_+)$, where $\hat{\mathcal{A}}$ is a class of neural network functions included in $\mathcal{A}$ over which we can optimize efficiently. In constructing $\hat{\mathcal{A}}$, we want to hardcode the constraints that $u$ is convex and that $\nabla u \in K$ almost everywhere into the neural network architecture. A possible approach would be to use the input convex neural networks (ICNNs) introduced by Amos et al. (2017), which have been used as a surrogate of convex functions for generative modeling with normalizing flows (Huang et al., 2021), in optimal transport (Korotin et al., 2021a;b; Huang et al., 2021; Makkuva et al., 2020), and in large-scale Wasserstein flows (Alvarez-Melis et al., 2021; Bunne et al., 2021; Mokrov et al., 2021). However, we found in early experimentation that a superior alternative is to use input convex maxout networks (ICMNs), which are maxout networks (Goodfellow et al., 2013) that are convex with respect to their inputs.
Maxout networks and ICMNs are defined as follows:

Definition 2 (Maxout networks). For a depth $L \geq 2$, let $M = (m_1, \ldots, m_L)$ be a vector of positive integers such that $m_1 = d$. Let $\mathcal{F}_{L,M,k}$ be the space of $k$-maxout networks of depth $L$ and widths $M$, which contains functions of the form

$f(x) = \frac{1}{m_L} \sum_{i=1}^{m_L} a_i \max_{j \in [k]} \langle w^{(L-1)}_{i,j}, (x^{(L-1)}, 1) \rangle, \quad a_i \in \mathbb{R},\ w^{(L-1)}_{i,j} \in \mathbb{R}^{m_{L-1}+1},$  (3)

where for any $2 \leq \ell \leq L-1$ and any $1 \leq i \leq m_\ell$, the $i$-th component of $x^{(\ell)} = (x^{(\ell)}_1, \ldots, x^{(\ell)}_{m_\ell})$ is computed recursively as:

$x^{(\ell)}_i = \frac{1}{m_\ell} \max_{j \in [k]} \langle w^{(\ell-1)}_{i,j}, (x^{(\ell-1)}, 1) \rangle, \quad w^{(\ell)}_{i,j} \in \mathbb{R}^{m_\ell+1},$  (4)

with $x^{(1)} = x$.

Definition 3 (Input convex maxout networks or ICMNs). A maxout network $f$ of the form (3)-(4) is an input convex maxout network if (i) for any $1 \leq i \leq m_L$, $a_i \geq 0$, and (ii) for any $2 \leq \ell \leq L-1$, $1 \leq i \leq m_{\ell+1}$, $1 \leq j \leq k$, the first $m_\ell$ components of $w^{(\ell)}_{i,j}$ are non-negative. We denote the space of ICMNs as $\mathcal{F}_{L,M,k,+}$.

In other words, a maxout network is an ICMN if all the non-bias weights beyond the first layer are constrained to be non-negative. This definition is analogous to the one of ICNNs in Amos et al. (2017), which are also defined as neural networks with positivity constraints on non-bias weights beyond the first layer. Proposition 5 in App. B shows that ICMNs are convex with respect to their inputs. It remains to impose the condition $\nabla u \in K$ almost everywhere, which in practice is enforced by adding the norms of the weights as a regularization term to the loss function. For theoretical purposes, we define $\mathcal{F}_{L,M,k}(1)$ (resp. $\mathcal{F}_{L,M,k,+}(1)$) as the subset of $\mathcal{F}_{L,M,k}$ (resp. $\mathcal{F}_{L,M,k,+}$) such that for all $1 \leq \ell \leq L-1$, $1 \leq i \leq m_\ell$, $1 \leq j \leq k$, $\|w^{(\ell)}_{i,j}\|_2 \leq 1$, and $\|a\|^2 = \sum_{i=1}^{m_L} a_i^2 \leq 1$. The following proposition, proven in App. B, shows simple bounds on the values of the functions in $\mathcal{F}_{L,M,k}(1)$ and their derivatives.

Proposition 3. Let $f$ be an ICMN that belongs to $\mathcal{F}_{L,M,k}(1)$. For almost every $x$ in $B_1(\mathbb{R}^d)$, $\|\nabla f(x)\| \leq 1$. Moreover, for all $x \in B_1(\mathbb{R}^d)$, $|f(x)| \leq L$, and for $1 \leq \ell \leq L$, $\|x^{(\ell)}\| \leq \ell$.

When $K = B_1(\mathbb{R}^d)$, we have that the space of ICMNs $\mathcal{F}_{L,M,k,+}(1)$ is included in $\mathcal{A}$. Hence, we define the surrogate VDC associated to $\mathcal{F}_{L,M,k,+}(1)$ as:

$\mathrm{VDC}_{\mathcal{F}_{L,M,k,+}(1)}(\mu_+ \| \mu_-) = \sup_{u \in \mathcal{F}_{L,M,k,+}(1)} \int_\Omega u \, d(\mu_- - \mu_+).$  (5)

Theorem 3. Suppose that for all $1 \leq \ell \leq L$, the widths $m_\ell$ satisfy $m_\ell \leq m$, and assume that $\mu, \nu$ are supported on the ball of $\mathbb{R}^d$ of radius $r$. With probability at least $1 - \delta$,
$|\mathrm{VDC}_{\mathcal{F}_{L,M,k,+}(1)}(\mu \| \nu) - \mathrm{VDC}_{\mathcal{F}_{L,M,k,+}(1)}(\mu_n \| \nu_n)| \leq \sqrt{18\, r^2 \log(4/\delta) \cdot \tfrac{2}{n}} + \frac{512}{\sqrt{n}}\, \sqrt{(L-1)\,k\,m\,(m+1)}\, \Big( \sqrt{(L+1)\log(2) + \tfrac{1}{2}\log(L^2+1)} + \tfrac{\sqrt{\pi}}{2} \Big).$

We see from Theorem 3 that $\mathrm{VDC}_{\mathcal{F}_{L,M,k,+}(1)}$, in contrast to $\mathrm{VDC}_{\mathcal{A}}$, has parametric rates and hence favorable properties for estimation from samples. In the following section, we see that the VDC can be used to define a pseudo-distance on the probability space.
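The following PyTorch sketch illustrates an ICMN in the spirit of Definitions 2-3 and the gradient-ascent computation of the surrogate VDC in (5). It is a simplified illustration rather than the paper's reference implementation: the $1/m_\ell$ scaling, weight-norm constraints, and the regularization that enforces the gradient constraint are omitted, and the optimizer and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class MaxoutLayer(nn.Module):
    """d_in -> d_out units, each the max over k affine pieces of the input."""
    def __init__(self, d_in, d_out, k):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, k, d_in) / d_in ** 0.5)
        self.bias = nn.Parameter(torch.zeros(d_out, k))

    def forward(self, x):
        z = torch.einsum("bi,oki->bok", x, self.weight) + self.bias
        return z.max(dim=-1).values

class ICMN(nn.Module):
    """Input convex maxout network: non-negative non-bias weights beyond the
    first layer and a non-negative read-out make the output convex in x."""
    def __init__(self, d, width=32, depth=3, k=4):
        super().__init__()
        dims = [d] + [width] * (depth - 1)
        self.layers = nn.ModuleList(
            MaxoutLayer(dims[i], dims[i + 1], k) for i in range(depth - 1)
        )
        self.a = nn.Parameter(torch.rand(width) / width)  # non-negative read-out weights
        self.project()

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x @ self.a

    def project(self):
        """Projection step: clamp hidden-to-hidden weights and read-out to be >= 0."""
        with torch.no_grad():
            for layer in self.layers[1:]:
                layer.weight.clamp_(min=0.0)
            self.a.clamp_(min=0.0)

def surrogate_vdc(critic, x_minus, x_plus, steps=500, lr=1e-3):
    """Gradient ascent on Eq. (5): sup_u E_{mu_-}[u] - E_{mu_+}[u] over the ICMN critic."""
    opt = torch.optim.Adam(critic.parameters(), lr=lr)
    for _ in range(steps):
        loss = critic(x_plus).mean() - critic(x_minus).mean()  # negative of the VDC objective
        opt.zero_grad()
        loss.backward()
        opt.step()
        critic.project()
    with torch.no_grad():
        return (critic(x_minus).mean() - critic(x_plus).mean()).item()
```

Applied to the 1D example above, with the mixture as $\mu_+$ and the single mode as $\mu_-$, the returned value should decay towards zero as in Figure 1c, while swapping the roles of the two sample sets should leave it bounded away from zero.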
5 FROM THE CONVEX ORDER BACK TO A PSEUDO-DISTANCE

We define the Choquet-Toland distance (CT distance) as the map $d_{\mathrm{CT},\mathcal{A}} : \mathcal{P}(\Omega) \times \mathcal{P}(\Omega) \to \mathbb{R}$ given by
$d_{\mathrm{CT},\mathcal{A}}(\mu_+, \mu_-) := \mathrm{VDC}_{\mathcal{A}}(\mu_+ \| \mu_-) + \mathrm{VDC}_{\mathcal{A}}(\mu_- \| \mu_+).$
That is, the CT distance between $\mu_+$ and $\mu_-$ is simply the sum of the two Variational Dominance Criteria. Applying Theorem 1, we obtain that
$d_{\mathrm{CT},\mathcal{A}}(\mu_+, \mu_-) = \tfrac{1}{2}\Big(\sup_{\nu \in \mathcal{P}(K)} \big\{W_2^2(\mu_+, \nu) - W_2^2(\mu_-, \nu)\big\} + \sup_{\nu \in \mathcal{P}(K)} \big\{W_2^2(\mu_-, \nu) - W_2^2(\mu_+, \nu)\big\}\Big).$
The following result, shown in App. C, states that $d_{\mathrm{CT},\mathcal{A}}$ is indeed a distance.

Theorem 4. Suppose that the origin belongs to the interior of $K$. Then $d_{\mathrm{CT},\mathcal{A}}$ is a distance, i.e., it fulfills:
(i) $d_{\mathrm{CT},\mathcal{A}}(\mu_+, \mu_-) \geq 0$ for any $\mu_+, \mu_- \in \mathcal{P}(\Omega)$ (non-negativity).
(ii) $d_{\mathrm{CT},\mathcal{A}}(\mu_+, \mu_-) = 0$ if and only if $\mu_+ = \mu_-$ (identity of indiscernibles).
(iii) If $\mu_1, \mu_2, \mu_3 \in \mathcal{P}(\Omega)$, then $d_{\mathrm{CT},\mathcal{A}}(\mu_1, \mu_2) \leq d_{\mathrm{CT},\mathcal{A}}(\mu_1, \mu_3) + d_{\mathrm{CT},\mathcal{A}}(\mu_3, \mu_2)$ (triangle inequality).

As in (5), we define the surrogate CT distance as:

$d_{\mathrm{CT},\mathcal{F}_{L,M,k,+}(1)}(\mu_+, \mu_-) = \mathrm{VDC}_{\mathcal{F}_{L,M,k,+}(1)}(\mu_+ \| \mu_-) + \mathrm{VDC}_{\mathcal{F}_{L,M,k,+}(1)}(\mu_- \| \mu_+).$  (6)

5.1 STATISTICAL RATES FOR CT DISTANCE ESTIMATION

We show almost-tight upper and lower bounds on the expectation of the Choquet-Toland distance between a probability measure and its empirical version. Namely, up to logarithmic factors, $\mathbb{E}[d_{\mathrm{CT},\mathcal{A}}(\mu, \mu_n)] \asymp n^{-2/d}$.

Theorem 5. Let $C_0, C_1$ be universal constants independent of the dimension $d$. Let $\Omega = [-1, 1]^d$ and $K = \{x \in \mathbb{R}^d \mid \|x\|_2 \leq C\}$. Let $\mu_n$ be the $n$-sample empirical measure corresponding to a probability measure $\mu$ over $[-1, 1]^d$. When $\mu$ is the uniform probability measure and $n \geq C_1 \log(d)$, we have that $\mathbb{E}[d_{\mathrm{CT},\mathcal{A}}(\mu, \mu_n)] \geq C_0 \sqrt{\tfrac{C}{d(1+C)\log(n)}}\, n^{-2/d}$. For any probability measure $\mu$ over $[-1, 1]^d$, $\mathbb{E}[d_{\mathrm{CT},\mathcal{A}}(\mu, \mu_n)] \leq K n^{-2/d}$, where $K$ is a constant depending on $C$ and $d$ (but not on the measure $\mu$).

Overlooking logarithmic factors, we can summarize Theorem 5 as $\mathbb{E}[d_{\mathrm{CT},\mathcal{A}}(\mu, \mu_n)] \asymp n^{-2/d}$. The estimation of the CT distance is cursed by dimension: one needs a sample size exponential in the dimension $d$ for $\mu_n$ to be at a desired distance from $\mu$. It is also interesting to contrast the rate $n^{-2/d}$ with the rates for similar distances. For example, for the $r$-Wasserstein distance we have $\mathbb{E}[W_r^r(\mu, \mu_n)] \asymp n^{-r/d}$ (Singh & Póczos (2018), Section 4). Given the link of the CT distance and the VDC with the squared 2-Wasserstein distance (see Subsec. 3.1), the $n^{-2/d}$ rate is natural.

The proof of Theorem 5, which can be found in App. D, is based on upper- and lower-bounding the metric entropy of the class of bounded Lipschitz convex functions with respect to an appropriate pseudo-norm. Then we use Dudley's integral bound and Sudakov minoration to upper- and lower-bound the Rademacher complexity of this class, and we finally show upper and lower bounds of the CT distance by the Rademacher complexity. Bounds on the metric entropy of bounded Lipschitz convex functions have been computed and used before (Balazs et al., 2015; Guntuboyina & Sen, 2013), but in $L^p$ and supremum norms, not in our pseudo-norm.

Next, we see that the surrogate CT distance defined in (6) does enjoy parametric estimation rates.

Theorem 6. Suppose that for all $1 \leq \ell \leq L$, the widths $m_\ell$ satisfy $m_\ell \leq m$. We have that
$\mathbb{E}[d_{\mathrm{CT},\mathcal{F}_{L,M,k,+}(1)}(\mu, \mu_n)] \leq \frac{256}{\sqrt{n}}\, \sqrt{(L-1)\,k\,m\,(m+1)}\, \Big( \sqrt{(L+1)\log(2) + \tfrac{1}{2}\log(L^2+1)} + \tfrac{\sqrt{\pi}}{2} \Big).$

In short, we have that $\mathbb{E}[d_{\mathrm{CT},\mathcal{F}_{L,M,k,+}(1)}(\mu, \mu_n)] = O(Lm\sqrt{k/n})$. Hence, if we take the number of samples $n$ larger than $k$ times the squared product of width and depth, we can make the surrogate CT distance small. Theorem 6, proven in App. D, is based on a Rademacher complexity bound for the space $\mathcal{F}_{L,M,k}(1)$ of maxout networks, which may be of independent interest; to our knowledge, existing Rademacher complexity bounds for maxout networks are restricted to depth-two networks (Balazs et al., 2015; Kontorovich, 2018).

Theorems 5 and 6 show that the advantages of the surrogate CT distance over the CT distance are not only computational but also statistical; the CT distance is such a strong metric that moderate-size empirical versions of a distribution are always very far from it. Hence, it is not a good criterion to compare how close an empirical distribution is to a population distribution. In contrast, the surrogate CT distance between a distribution and its empirical version is small for samples of moderate size. An analogous observation for the Wasserstein distance versus the neural net distance was made by Arora et al. (2017).
If $\mu_n, \nu_n$ are empirical versions of $\mu, \nu$, it is also interesting to bound
$|d_{\mathrm{CT},\mathcal{F}_{L,M,k,+}(1)}(\mu_n, \nu_n) - d_{\mathrm{CT},\mathcal{A}}(\mu, \nu)| \leq |d_{\mathrm{CT},\mathcal{F}_{L,M,k,+}(1)}(\mu_n, \nu_n) - d_{\mathrm{CT},\mathcal{F}_{L,M,k,+}(1)}(\mu, \nu)| + |d_{\mathrm{CT},\mathcal{F}_{L,M,k,+}(1)}(\mu, \nu) - d_{\mathrm{CT},\mathcal{A}}(\mu, \nu)|.$
The first term has an $O(\sqrt{k/n})$ bound following from Theorem 6, while the second term is upper-bounded by $2 \sup_{f \in \mathcal{A}} \inf_{\tilde{f} \in \mathcal{F}_{L,M,k}(1)} \|f - \tilde{f}\|_\infty$, which is (twice) the approximation error of the class $\mathcal{A}$ by the class $\mathcal{F}_{L,M,k}(1)$. Such bounds have only been derived for $L = 2$: Balazs et al. (2015) shows that $\sup_{f \in \mathcal{A}} \inf_{\tilde{f} \in \mathcal{F}_{2,(d,1),k}(1)} \|f - \tilde{f}\|_\infty = O(d\, k^{-2/d})$. Hence, we need $k$ exponential in $d$ to make the second term small, and thus $n \gtrsim k$, also exponential in $d$, to make the first term small.

6 LEARNING DISTRIBUTIONS WITH THE SURROGATE VDC AND CT DISTANCE

We provide a min-max framework to learn distributions with stochastic orders. As in the generative adversarial network (GAN, Goodfellow et al. (2014); Arjovsky et al. (2017)) framework, we parametrize probability measures implicitly as the pushforward $\mu = g_{\#}\mu_0$ of a base measure $\mu_0$ by a generator function $g$ in a parametric class $\mathcal{G}$, and we optimize over $g$. The loss functions involve a maximization over ICMNs corresponding to the computation of a surrogate VDC or CT distance (and possibly additional maximization problems), yielding a min-max problem analogous to GANs.

Enforcing dominance constraints with the surrogate VDC. In some applications, we want to optimize a loss $\mathcal{L} : \mathcal{P}(\Omega) \to \mathbb{R}$ under the constraint that $\mu = g_{\#}\mu_0$ dominates a baseline measure $\nu$. We can enforce, or at least bias $\mu$ towards, the dominance constraint by adding a penalization term proportional to the surrogate VDC between $\mu$ and $\nu$, in application of Proposition 2.

A first instance of this approach appears in portfolio optimization (Xue et al., 2020; Post et al., 2018). Let $\xi = (\xi_1, \ldots, \xi_p)$ be a random vector of return rates of $p$ assets and let $Y_1 := G_1(\xi)$, $Y_2 := G_2(\xi)$ be real-valued functions of $\xi$ which represent the return rates of two different asset allocations or portfolios, e.g. $G_i(\xi) = \langle \omega_i, \xi \rangle$ with $\omega_i \in \mathbb{R}^p$. The goal is to find a portfolio $G_2$ that enhances a benchmark portfolio $G_1$ in a certain way. For a portfolio $G$ with return rate $Y := G(\xi)$, we let $F^{(1)}_Y(x) = \mathbb{P}_\xi(Y \leq x)$ be the CDF of its return rate, and $F^{(2)}_Y(x) = \int_{-\infty}^x F^{(1)}_Y(x')\, dx'$. If $Y_1, Y_2$ are the return rates of $G_1, G_2$, we say that $Y_2$ dominates $Y_1$ in second order, written $Y_2 \succeq_2 Y_1$, if for all $x \in \mathbb{R}$, $F^{(2)}_{Y_2}(x) \leq F^{(2)}_{Y_1}(x)$, which intuitively means that the return rates $Y_2$ are less spread out than the return rates $Y_1$, i.e. the risk is smaller. Formally, the portfolio optimization problem can be written as:

$\max_{G_2}\ \mathbb{E}[Y_2 := G_2(\xi)], \quad \text{s.t. } Y_2 \succeq_2 Y_1 := G_1(\xi).$  (9)

It turns out that $Y_2 \succeq_2 Y_1$ if and only if $\mathbb{E}[(\eta - Y_2)_+] \leq \mathbb{E}[(\eta - Y_1)_+]$ for any $\eta \in \mathbb{R}$, or, equivalently, if $\mathbb{E}[u(Y_2)] \geq \mathbb{E}[u(Y_1)]$ for all concave non-decreasing $u : \mathbb{R} \to \mathbb{R}$ (Dentcheva & Ruszczyński, 2004). Although different, note that the second order is intimately connected to the Choquet order for 1-dimensional distributions, and it can be handled with similar tools. Define $\mathcal{F}_{L,M,k,-,+}(1)$ as the subset of $\mathcal{F}_{L,M,k,+}(1)$ such that the first $m_1$ components of the weights $w^{(1)}_{i,j}$ are non-positive for all $1 \leq i \leq m_2$, $1 \leq j \leq k$. If we set the input width $m_1 = 1$, we can encode the condition $Y_2 \succeq_2 Y_1$ as $\mathrm{VDC}_{\mathcal{F}_{L,M,k,-,+}(1)}(\nu \| \mu) = 0$, where $\nu = \mathcal{L}(Y_1)$ and $\mu = \mathcal{L}(Y_2)$ are the distributions of $Y_1, Y_2$, respectively. Hence, with the appropriate Lagrange multiplier, we convert problem (9) into a min-max problem between $\mu$ and the potential $u$ of the VDC:

$\min_{\mu : \mu = \mathcal{L}(\langle \xi, \omega_2 \rangle)}\ -\int_{\mathbb{R}} x \, d\mu(x) + \lambda\, \mathrm{VDC}_{\mathcal{F}_{L,M,k,-,+}(1)}(\nu \| \mu).$  (10)
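A compact sketch of how (10) can be optimized by alternating gradient steps is given below. It assumes the ICMN class from the sketch in Sec. 4 is in scope and restricts it to a decreasing critic by clamping the first-layer weights to be non-positive; the portfolio parametrization $G(\xi; z) = z\xi$, the staircase benchmark, and the hyperparameters mirror the experiment in Sec. 7 but are otherwise assumptions, not the paper's code.

```python
import torch

def benchmark(xi):
    # Staircase benchmark G1 from Xue et al. (2020), Example 1: i/20 on [0.05*i, 0.05*(i+1)).
    return torch.clamp(torch.floor(xi / 0.05), max=19.0) / 20.0

critic = ICMN(d=1, width=32, depth=3, k=4)   # projected below onto decreasing ICMNs
z = torch.tensor(1.0, requires_grad=True)    # portfolio parameter, constrained to [1, 2]
opt_z = torch.optim.SGD([z], lr=1e-3)
opt_u = torch.optim.SGD(critic.parameters(), lr=1e-3)
lam = 1.0

def project_decreasing():
    critic.project()                          # non-negative hidden-to-hidden weights
    with torch.no_grad():                     # non-positive input weights: u is non-increasing
        critic.layers[0].weight.clamp_(max=0.0)

project_decreasing()
for step in range(5000):
    xi = torch.rand(512, 1)
    y1 = benchmark(xi)

    # Critic ascent on the surrogate VDC(nu || mu) = E_mu[u] - E_nu[u].
    vdc = critic((z * xi).detach()).mean() - critic(y1).mean()
    opt_u.zero_grad(); (-vdc).backward(); opt_u.step(); project_decreasing()

    # Portfolio step on (10): minimize -E[Y2] + lambda * VDC(nu || mu), then clip z to [1, 2].
    loss = -(z * xi).mean() + lam * (critic(z * xi).mean() - critic(y1).mean())
    opt_z.zero_grad(); loss.backward(); opt_z.step()
    with torch.no_grad():
        z.clamp_(1.0, 2.0)

print(float(z))  # should approach the known solution z = 2
```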
A second instance of this approach is in GAN training. Assuming that we have a baseline generator $g_0$ that can be obtained via regular training, we consider the problem:

$\min_{g \in \mathcal{G}} \Big\{ \max_{f \in \mathcal{F}} \big\{\mathbb{E}_{X \sim \nu_n}[f(X)] - \mathbb{E}_{Y \sim \mu_0}[f(g(Y))]\big\} + \lambda\, \mathrm{VDC}_{\mathcal{F}_{L,M,k,+}(1)}(g_{\#}\mu_0 \| (g_0)_{\#}\mu_0) \Big\}.$  (11)

The first term in the objective function is the usual WGAN loss (Arjovsky et al., 2017), although it can be replaced by any other standard GAN loss. The second term, which is proportional to $\mathrm{VDC}_{\mathcal{F}_{L,M,k,+}(1)}(g_{\#}\mu_0 \| (g_0)_{\#}\mu_0) = \max_{u \in \mathcal{F}_{L,M,k,+}(1)} \{\mathbb{E}_{Y \sim \mu_0}[u(g_0(Y))] - \mathbb{E}_{Y \sim \mu_0}[u(g(Y))]\}$, enforces that $g_{\#}\mu_0 \succeq (g_0)_{\#}\mu_0$ in the Choquet order, and thus $u$ acts as a second, Choquet critic. Tuning $\lambda$ appropriately, the rationale is that we want a generator that optimizes the standard GAN loss, with the condition that the generated distribution dominates the baseline distribution. As stated by Proposition 1, dominance in the Choquet order translates to $g_{\#}\mu_0$ being more spread out than $(g_0)_{\#}\mu_0$, which should help avoid mode collapse and improve the diversity of generated samples. In practice, this min-max game is solved via Algorithm 1, given in App. F. For the Choquet critic, this amounts to an SGD step followed by a projection step to impose non-negativity of the hidden-to-hidden weights.

Generative modeling with the surrogate CT distance. The surrogate Choquet-Toland distance is well-suited for generative modeling, as it can be used in GANs in place of the usual discriminator. Namely, if $\nu$ is a target distribution, $\nu_n$ is its empirical distribution, and $\mathcal{D} = \{\mu = g_{\#}\mu_0 \mid g \in \mathcal{G}\}$ is a class of distributions that can be realized as the pushforward of a base measure $\mu_0 \in \mathcal{P}(\mathbb{R}^{d_0})$ by a function $g : \mathbb{R}^{d_0} \to \mathbb{R}^d$, the problem to solve is
$g^\star = \arg\min_{g \in \mathcal{G}} d_{\mathrm{CT},\mathcal{F}_{L,M,k,+}(1)}(g_{\#}\mu_0, \nu_n) = \arg\min_{g \in \mathcal{G}} \Big\{ \max_{u \in \mathcal{F}_{L,M,k,+}(1)}\{\mathbb{E}_{X \sim \nu_n}[u(X)] - \mathbb{E}_{Y \sim \mu_0}[u(g(Y))]\} + \max_{u' \in \mathcal{F}_{L,M,k,+}(1)}\{\mathbb{E}_{Y \sim \mu_0}[u'(g(Y))] - \mathbb{E}_{X \sim \nu_n}[u'(X)]\} \Big\}.$
Algorithm 2, given in App. F, summarizes learning with the surrogate CT distance.
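Before turning to the experiments, here is a condensed sketch of one alternating update for the min-max problem (11). The paper's actual procedure is Algorithm 1 in App. F; the WGAN-GP gradient penalty is omitted here, the function and argument names are assumptions, and the Choquet critic is assumed to be an ICMN with a `project` method as in the earlier sketch.

```python
import torch

def vdc_gan_step(g, g0, wgan_critic, choquet_critic, real_batch, z_batch,
                 opt_g, opt_f, opt_u, lam=10.0):
    """One alternating update for Eq. (11): WGAN critic f, Choquet critic u, generator g."""
    fake = g(z_batch)
    baseline = g0(z_batch).detach()

    # WGAN critic f ascends E_data[f] - E_g[f].
    f_loss = wgan_critic(fake.detach()).mean() - wgan_critic(real_batch).mean()
    opt_f.zero_grad(); f_loss.backward(); opt_f.step()

    # Choquet critic u ascends the surrogate VDC(g#mu0 || g0#mu0) = E_{g0}[u] - E_{g}[u],
    # followed by projection onto non-negative hidden-to-hidden weights.
    u_loss = choquet_critic(fake.detach()).mean() - choquet_critic(baseline).mean()
    opt_u.zero_grad(); u_loss.backward(); opt_u.step(); choquet_critic.project()

    # Generator descends the penalized objective: it raises both critics on its own samples.
    fake = g(z_batch)
    g_loss = -wgan_critic(fake).mean() - lam * choquet_critic(fake).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return g_loss.item()
```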
7 EXPERIMENTS

Portfolio optimization under dominance constraints. In this experiment, we use the VDC to optimize an illustrative example from Xue et al. (2020) (Example 1) that follows the paradigm laid out in Sec. 6. In this example, $\xi \in \mathbb{R}$ is drawn uniformly from $[0, 1]$ under $P$, and we define the benchmark portfolio as
$G_1(\xi) = \begin{cases} \frac{i}{20} & \text{if } \xi \in [0.05\, i,\ 0.05\,(i+1)),\ i = 0, \ldots, 19, \\ 1 & \text{if } \xi = 1, \end{cases}$
and the optimization is over the parameterized portfolio $G_2(\xi) := G(\xi; z) = z\xi$. The constrained optimization problem is thus specified as:
$\max_z\ \mathbb{E}_P[G(\xi; z)] \quad \text{s.t.}\quad G(\xi; z) \succeq_2 G_1(\xi),\quad 1 \leq z \leq 2.$
As stated in Xue et al. (2020), this example has a known solution at $z = 2$, where $\mathbb{E}_P[G_2(\xi; 2)] = 1$ outperforms the benchmark $\mathbb{E}_P[G_1(\xi)] = 0.5$. We relax the constrained optimization by including the constraint in the objective function, thus creating the min-max game (10) introduced in Sec. 6. We parameterize $\mathcal{F}_{L,M,k,-,+}(1)$ with a 3-layer, fully-connected, decreasing ICMN with hidden dimension 32 and maxout kernel size 4. After 5000 steps of stochastic gradient descent on $z$ (learning rate 1e-3) and on the parameters of the ICMN (learning rate 1e-3), using a batch size of 512 and $\lambda = 1$, we are able to attain accurate approximate values of the known solution: $z = 2$, $\frac{1}{512}\sum_{j=1}^{512} G_z(\xi_j) = 1.042$, and $\frac{1}{512}\sum_{j=1}^{512} G_1(\xi_j) = 0.496$.

Figure 2: Training generative models by enforcing dominance with the surrogate VDC on a pre-trained CIFAR-10 WGAN-GP (Left: (a) CIFAR-10 samples from $g$) and with the surrogate CT distance on 2D point clouds (Right: (b) initialisation, (c) 5k iterations, (d) 80k iterations). Ground-truth point cloud distributions (blue) consist of a swiss roll (Top), a circle of eight Gaussians (Middle), and the GitHub icon converted to a point cloud (Bottom).

Image generation with baseline model dominance. Another application of learning with the VDC is in the high-dimensional setting of CIFAR-10 (Krizhevsky & Hinton, 2009) image generation. As detailed in Sec. 6, we start by training a baseline generator $g_0$ using the regularized Wasserstein GAN paradigm (WGAN-GP) introduced in Arjovsky et al. (2017); Gulrajani et al. (2017), where gradient regularization is computed for the discriminator with respect to interpolates between real and generated data. When training $g$ and $g_0$, we used the same WGAN-GP hyperparameter configuration. We set $\lambda$ in Equation (11) to 10 (see App. F and App. H for more details). Training runs for $g$ and $g_0$ were performed in computational environments that contained 1 CPU and 1 A100 GPU. To evaluate the performance of $g$ vs. $g_0$, we rely on the Fréchet Inception Distance (FID) introduced in Heusel et al. (2017). Note that when training a WGAN-GP baseline from scratch we used hyperparameters that potentially differ from those used in state-of-the-art implementations. Additionally, for computing FIDs we use a pytorch-lightning implementation of the inception network, which is different from the widely used Tensorflow implementation (Salimans et al., 2016), resulting in potential discrepancies between our reported baseline FID and those in the literature.

Table 1: FID scores for WGAN-GP and for WGAN-GP with the VDC surrogate, with convex functions approximated by either ICNNs with softplus activations or ICMNs. ICMNs improve upon the baseline $g_0$ and outperform ICNNs with softplus. FID scores for WGAN-GP + VDC are mean values ± one standard deviation over 5 repeated runs with different random initialization seeds.
$g_0$: WGAN-GP, FID 69.67
$g$: WGAN-GP + VDC (CP-Flow ICNN), FID 83.470 ± 3.732
$g$: WGAN-GP + VDC (ICMN, ours), FID 67.317 ± 0.776

FID results are reported in Table 1, where we see improved image quality from $g$ (as measured by lower FID) relative to the pre-trained baseline $g_0$. We therefore find that the VDC surrogate improves upon $g_0$ by providing $g$ with larger support, preventing mode collapse. Samples generated from $g$ are displayed in Figure 2a. In order to highlight the representation power of ICMNs, we replace them in the VDC estimation by the ICNN implementation of Huang et al. (2021). Instead of a maxout activation, Huang et al. (2021) use a softplus activation, and instead of a projection step they also use a softplus operation to impose non-negativity of the hidden-to-hidden weights and enforce convexity. We see in Table 1 that VDC estimation with ICMNs outperforms the ICNNs of Huang et al. (2021). While ICMNs have maxima of affine functions as a building block, the ICNN proof of universality in Huang et al. (2021) relies on approximating such maxima; this could be one of the reasons behind the superiority of ICMNs.

Probing mode collapse. To investigate how training with the surrogate VDC regularizer helps alleviate mode collapse in GAN training, we implemented GANs trained with the IPM objective alone and compared this to training with the surrogate VDC regularizer for a mixture-of-8-Gaussians target distribution. In Figure 3, given in App. G, we quantify mode collapse by looking at two scores: 1) the entropy of the discrete assignment of generated points to the means of the mixture, and 2) the negative log-likelihood (NLL) of the Gaussian mixture.
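A minimal sketch of how these two scores can be computed for a batch of generated 2D points is given below (an assumed implementation, not the paper's evaluation code): each point is assigned to its nearest mixture mean for the entropy score, and the equal-weight mixture density is evaluated for the NLL; the means, standard deviation, and collapsed-generator example are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def mode_collapse_scores(samples, means, sigma=0.05):
    """Entropy of nearest-mean assignments and NLL under an equal-weight 2D Gaussian mixture."""
    d2 = ((samples[:, None, :] - means[None, :, :]) ** 2).sum(-1)       # (n, n_modes)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(means))
    probs = counts / counts.sum()
    entropy = -(probs[probs > 0] * np.log(probs[probs > 0])).sum()       # max value: log(n_modes)

    log_comp = -d2 / (2 * sigma ** 2) - np.log(2 * np.pi * sigma ** 2)   # log N(x; mean, sigma^2 I)
    nll = -(logsumexp(log_comp, axis=1) - np.log(len(means))).mean()
    return entropy, nll

# Example: eight means on the unit circle, and a generator collapsed onto a single mode.
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
means = np.stack([np.cos(angles), np.sin(angles)], axis=1)
fake = np.random.default_rng(0).normal(means[0], 0.05, size=(1000, 2))
print(mode_collapse_scores(fake, means))   # near-zero entropy signals mode collapse
```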
When training with the VDC regularizer to improve upon the collapsed generator $g_0$ (which is taken from step 55k of the unregularized GAN training), we see more stable training and better mode coverage, as quantified by our scores.

2D point cloud generation with $d_{\mathrm{CT}}$. We apply learning with the CT distance in a 2D generative modeling setting. Both the generator and the CT critic architectures are comprised of fully-connected neural networks with maxout non-linearities of kernel size 2. The progression of the generated samples can be found in the right-hand panel of Figure 2, where we see the trained generator accurately learn the ground-truth distribution. All experiments were performed in a single-CPU compute environment.

8 CONCLUSION

In this paper, we introduced learning with stochastic orders in high dimensions via the surrogate Variational Dominance Criterion and the surrogate Choquet-Toland distance. These surrogates leverage input convex maxout networks, a new variant of input convex neural networks. Our surrogates have parametric statistical rates and lead to new learning paradigms that incorporate dominance constraints to improve upon a baseline. Experiments on synthetic and real image generation yield promising results. Finally, our work, although theoretical in nature, can be subject to misuse, similar to any generative method.

REFERENCES

David Alvarez-Melis, Yair Schiff, and Youssef Mroueh. Optimizing functionals on the space of probabilities with input convex neural networks, 2021.

Brandon Amos, Lei Xu, and J. Zico Kolter. Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 146-155. PMLR, 06-11 Aug 2017.

Martin Arjovsky, Soumith Chintala, and Leon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 224-232. JMLR, 2017.

Gabor Balazs, András György, and Csaba Szepesvári. Near-optimal max-affine estimators for convex regression. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38, pp. 56-64. PMLR, 2015.

Efim M. Bronshtein. ε-entropy of convex sets and functions. Siberian Mathematical Journal, 17:393-398, 1976.

Charlotte Bunne, Laetitia Meng-Papaxanthos, Andreas Krause, and Marco Cuturi. JKOnet: Proximal optimal transport modeling of population dynamics, 2021.

G. Carlier. Remarks on Toland's duality, convexity constraint and optimal transport. Pacific Journal of Optimization, 4, 09 2008.

Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.

Darinka Dentcheva and Andrzej Ruszczyński. Optimization with stochastic dominance constraints. SIAM Journal on Optimization, 14(2):548-566, 2003.

Darinka Dentcheva and Andrzej Ruszczyński. Optimality and duality theory for stochastic optimization problems with nonlinear dominance constraints. Mathematical Programming, 99:329-350, 01 2004. doi: 10.1007/s10107-003-0453-z.

R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290-330, 1967.

Ivar Ekeland and Walter Schachermayer. Optimal transport and the geometry of $L^1(\mathbb{R}^d)$.
Proceedings of the American Mathematical Society, 142, 10 2014.

William Falcon et al. PyTorch Lightning. GitHub. Note: https://github.com/PyTorchLightning/pytorchlightning, 3:6, 2019.

Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pp. 1319-1327. PMLR, 2013.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767-5777, 2017.

Adityanand Guntuboyina and Bodhisattva Sen. Covering numbers for convex functions. IEEE Transactions on Information Theory, 59(4):1957-1965, 2013.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626-6637, 2017.

Clifford Hildreth. Point estimates of ordinates of concave functions. Journal of the American Statistical Association, 49(267):598-619, 1954.

Chin-Wei Huang, Ricky T. Q. Chen, Christos Tsirigotis, and Aaron Courville. Convex potential flows: Universal probability distributions with optimal transport and convex optimization. In International Conference on Learning Representations, 2021.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401-4410, 2019.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Aryeh Kontorovich. Rademacher complexity of k-fold maxima of hyperplanes, 2018.

Alexander Korotin, Lingxiao Li, Aude Genevay, Justin Solomon, Alexander Filippov, and Evgeny Burnaev. Do neural optimal transport solvers work? A continuous Wasserstein-2 benchmark. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021a.

Alexander Korotin, Lingxiao Li, Justin Solomon, and Evgeny Burnaev. Continuous Wasserstein-2 barycenter estimation without minimax optimization. In International Conference on Learning Representations, 2021b.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport mapping via input convex neural networks. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119, pp. 6672-6681. PMLR, 2020.

John D. Martin, Michal Lyskawinski, Xiaohu Li, and Brendan Englot. Stochastically dominant distributional reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020.

Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. In ICLR, 2017.

Petr Mokrov, Alexander Korotin, Lingxiao Li, Aude Genevay, Justin Solomon, and Evgeny Burnaev. Large-scale Wasserstein gradient flows, 2021.

Youssef Mroueh and Tom Sercu. Fisher GAN.
Advances in Neural Information Processing Systems, 30, 2017.

Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997a.

Alfred Müller. Stochastic orders generated by integrals: A unified study. Advances in Applied Probability, 29(2):414-428, 1997b.

Alfred Müller. Stochastic ordering of multivariate normal distributions. Annals of the Institute of Statistical Mathematics, 53:567-575, 02 2001.

Inkit Padhi, Pierre Dognin, Ke Bai, Cicero Nogueira dos Santos, Vijil Chenthamarakshan, Youssef Mroueh, and Payel Das. Learning implicit text generation via feature matching. arXiv preprint arXiv:2005.03588, 2020.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024-8035. Curran Associates, Inc., 2019.

Thierry Post, Selçuk Karabatı, and Stelios Arvanitis. Portfolio optimization based on stochastic dominance and empirical likelihood. Journal of Econometrics, 206(1):167-186, 2018. ISSN 0304-4076.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234-2242, 2016.

Shashank Singh and Barnabás Póczos. Minimax distribution estimation in Wasserstein distance, 2018.

Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet. On integral probability metrics, φ-divergences and binary classification, 2009.

V. N. Sudakov. A remark on the criterion of continuity of Gaussian sample functions. In Proceedings of the Second Japan-USSR Symposium on Probability Theory, pp. 444-454, Berlin, Heidelberg, 1973. Springer Berlin Heidelberg.

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

Meng Xue, Yun Shi, and Hailin Sun. Portfolio optimization with relaxation of stochastic second order dominance constraints via conditional value at risk. Journal of Industrial and Management Optimization, 16(6):2581-2602, 2020.

APPENDIX CONTENTS

A Proofs of Sec. 3
B Proofs of Sec. 4
C Proofs of Sec. 5
D Proofs of Subsec. 5.1
E Simple examples in dimension 1 for VDC and d_CT
F Algorithms for learning distributions with surrogate VDC and CT distance
G Probing mode collapse
H Additional experimental details
I Assets

A PROOFS OF SEC. 3

Proof of Proposition 2. Since $\mathcal{A}$ is included in the set of convex functions, the implication from right to left is straightforward. For the other implication, we prove the contrapositive. Suppose that $u$ is a convex function on $\Omega$ such that $\int_\Omega u\, d(\mu_- - \mu_+) > 0$. Then we show that we can construct $\tilde{u} \in \mathcal{A}$ such that $\int_\Omega \tilde{u}\, d(\mu_- - \mu_+) > 0$, which shows that the supremum over $\mathcal{A}$ is strictly positive. Denote by $\mathcal{C}$ the set of convex functions $f : \mathbb{R}^d \to \mathbb{R}$ which are the pointwise supremum of finitely many affine functions, i.e. $f(x) = \max_{i \in I}\{\langle y_i, x\rangle - a_i\}$ for some finite family $(y_i, a_i) \in \mathbb{R}^d \times \mathbb{R}$.
For any convex function g, there is an increasing sequence (gn)n C such that g = supn gn pointwise (Ekeland & Schachermayer (2014), p. 3). Applying this to u, we know there exists an increasing sequence (un)n C such that u = limn un. By the dominated convergence theorem, Ωun d(µ µ+) = R Ωu d(µ µ+) > 0, which means that for some N large enough, R Ωu N d(µ µ+) > 0 as well. Since u N admits a representation f(x) = maxi I{ yi, x ai} for some finite family I, it may be trivially extended to a convex function on Rd. Let (ηϵ)ϵ be a family of non-negative radially symmetric functions in C2(Rd) supported on the ball Bϵ(0) of radius ϵ centered at zero, and such that R Rd ηϵ(x) dx = 1. Let Ωϵ = {x Ω| dist(x, Ω) ϵ}. For any x Rd, we have that (u N ηϵ)(x) = R Rd ηϵ(x y)u(y) dy = R Rd ηϵ(y )u N(x y ) dy . (12) By the convexity of u N, we have that u N(λx+(1 λ)x y ) λu N(x y )+(1 λ)u N(x y ) for any x, x , y Rd. Thus, (12) implies that (u N ηϵ)(λx + (1 λ)x ) λ(u N ηϵ)(x) + (1 λ)(u N ηϵ)(x ), which means that u N ηϵ is convex. Also, by the dominated convergence theorem, Ω(u N ηϵ) d(µ µ+) = R Ωu N d(µ µ+) > 0 Hence, there exists ϵ0 > 0 such that R Ω(u N ηϵ) d(µ µ+) > 0. Since u N ηϵ0 is in C2(Rd) by the properties of convolutions, its gradient is continuous. Since the closure Ωis compact because Ω is bounded, by the Weierstrass theorem we have that supx Ω (u N ηϵ0)(x) 2 < + Let r > 0 be such that the ball Br(0) is included in K. Rescaling u N ηϵ0 by an appropriate constant, we have that supx Ω (u N ηϵ0)(x) 2 < r, and that means that (u N ηϵ0)(x) K for any x Ω. Thus, u N ηϵ0 A, and that means that supu A{ R Ωu d(µ µ+)} > 0, concluding the proof. Published as a conference paper at ICLR 2023 Lemma 1. Let Ω= [ 1, 1]d and K = {x Rd | x 2 C}. The function class A = {u : Ω R, u convex and u a.e. K} is equal to the space Fd(M, C) of convex functions on [ 1, 1]d such that |f(x)| M and |f(x) f(y)| C x y for any x, y [ 1, 1]d, up to a constant term. Here, M = C Proof. Looking at problem (1), note that adding a constant to a function u A does not change the value of the objective function. Thus, we can add the restriction that u(0) = 0, and since Ωis compact and has Lipschitz constant upper-bounded by supx K x , any such function u must fulfill M := supu A supx Ω|u(x)| < + . Thus, we have that functions in A belong to {u : Ω R, u convex and u a.e. K, supx Ω|u(x)| M} up to a constant term, for a well chosen M. Now we will use the particular form of Ωand K. First, note that we can take M = C d without loss of generality. Given u A, we have that u(x) 2 C for a.e. x [ 1, 1]d. By the mean value theorem, we have that |u(x) u(y)| = | R 1 0 f(tx + (1 t)y), x y dt| C x y , implying that u is C-Lipschitz. This shows that A Fd(M, C). Rademacher s theorem states that C-Lipschitz functions are a.e. differentiable, and gradient norms must be upper-bounded by C wherever gradients exist as otherwise one reaches a contradiction. Hence, Fd(M, C) A, concluding the proof. Lemma 2 (Metric entropy of convex functions, Bronshtein (1976), Thm. 6). Let Fd(M, C) be the compact space of convex functions on [ 1, 1]d such that |f(x)| M and |f(x) f(y)| C x y for any x, y [ 1, 1]d. The metric entropy of this space with respect to the uniform norm topology satisfies log N(δ; Fd(M, C), ) Kδ d 2 , for some constant K that depend on C, M and d. Lemma 3 (Dudley s entropy integral bound, Wainwright (2019), Thm. 5.22, Dudley (1967)). Let {Xθ | θ T} be a zero-mean sub-Gaussian process with respect the metric ρX on T. Let D = supθ,θ T ρX(θ, θ ). 
Then for any δ [0, D] such that N(δ; T, ρX) 10, we have E[supθ,θ T(Xθ Xθ )] E sup γ,γ T ρX(γ,γ ) δ (Xγ Xγ ) + 32 R D δ p log N(t; T, ρX) dt. Proposition 4. For any family of n points (Xi)n i=1 [ 1, 1]d, the empirical Rademacher complexity of the function class Fd(M, C) satisfies Eϵ[ Sn Fd(M,C)] Kn 2 d , where K is a constant depending on M, C and d. Proof. We choose Tn = {(f(Xi))n i=1 Rn | f Fd(M, C)}, we define the Rademacher process Xf = Pn i=1 ϵif(Xi), which is sub-Gaussian with respect to the metric ρn(f, f ) = p Pn i=1(f(Xi) f (Xi))2. Remark that D 2M n. For any δ [0, D], we apply Lemma 3 setting f 0 and we get Eϵ[ Sn Fd(M,C)] = 1 n E h supf Tn Xf i E sup f,f Tn ρn(f,f ) δ (Xf Xf ) + 32 R D δ p log N(t; Fd(M, C), ρn) dt . (13) Note that for any f, f Fd(M, C), ρn(f, f ) n f f , which means that log N(δ; Fd(M, C), ρn) log N(δ/ n; Fd(M, C), ). Thus, Lemma 2 implies that log N(δ; Fd(M, C), ρn) K δ n d Hence, R D δ p log N(t; Fd(M, C), ρn) dt R D δ Kn d 8 d 4 +1 (δ d 4 +1 (2M n) d Published as a conference paper at ICLR 2023 We set δ = n 1 2 2 d , and we get that δ d 4 +1 = n( 1 4 +1) = n d d . Hence, the right-hand side of (14) is upper-bounded by K d 4 +1n1 2 d . And since Pn i=1 ϵi(f(Xi) f (Xi)) Pn i=1 |f(Xi) f (Xi)| nρn(f, f ), we have E sup γ,γ T ρX(γ,γ ) δ (Xγ Xγ ) nδ = nn 1 2 2 Plugging these bounds back into (13), we obtain Eϵ[ Sn Fd(M,C)] K d 4 +1 + 1 n 2 Since K already depends on d, we rename it as K K d 4 +1 + 1, concluding the proof. Proof of Theorem 2. Let Fd(C) be the space of convex functions on [ 1, 1]d such that |f(x)| C d and |f(x) f(y)| C x y for any x, y [ 1, 1]d. Lemma 1 shows that when Ω= [ 1, 1]d and K = {x Rd | x 2 C}, functions in A belong to Fd(C) up to a constant term, which means that VDCA(µ||µ) = supu Fd(C) R Ωu d(µ µ+). Theorem 11 of Sriperumbudur et al. (2009) shows that for any function class F on Ωsuch that r := supf F,x Ω|f(x)| < + , with probability at least 1 δ we have | supf F{Eµf(x) Eµf(x)} supf F{Eµnf(x) Eµnf(x)}| 18r2 log( δ 4)( 1 m + 1 n) + 2 ˆRn(F, (xi)n i=1) + 2 ˆRn(F, (yi)n i=1), where ˆRn denotes the empirical Rademacher complexity. Proposition 4 shows that ˆRn(Fd(C), (xi)n i=1) Kn 2 d for any (xi)n i=1 Ω [ 1, 1]d, where K depends on C and d. This concludes the proof. B PROOFS OF SEC. 4 Proposition 5. Input convex maxout networks (Definition 3) are convex with respect to their input. Proof. The proof is by finite induction. We show that for any 2 ℓ L 1 and 1 i mℓ, the function x 7 x(ℓ) i is convex. The base case ℓ= 2 holds because x 7 x(2) i = 1 m2 maxj [k] w(1) i,j , (x, 1) is a pointwise supremum of convex (affine) functions, which is convex. For the induction case, we have that x 7 x(ℓ 1) i is convex for any 1 i mℓ 1 by the induction hypothesis. Since a linear combination of convex functions with non-negative coefficients is convex, we have that for any 1 i mℓ, 1 j k, x 7 w(ℓ 1) i,j , (x(ℓ 1), 1) is convex. Finally, x 7 x(ℓ) i = 1 mℓmaxj [k] w(ℓ 1) i,j , (x(ℓ 1), 1) is convex because it is the pointwise supremum of convex functions. Proof of Proposition 3. We can reexpress f(x) as: f(x) = 1 m L i=1 ai w(L 1) i,j L 1,i, (x(L 1), 1) , j L 1,i = arg max j [k] w(L 1) i,j , (x(L 1), 1) (15) x(ℓ) i = 1 mℓ w(ℓ 1) i,j ℓ 1,i, (x(ℓ 1), 1) , j ℓ 1,i = arg max j [k] w(ℓ 1) i,j ℓ 1,i, (x(ℓ 1), 1) . For 1 ℓ L 1, we define the matrices W ℓ Rmℓ+1,mℓsuch that their i-th row is the vector [ 1 mℓ+1 w(ℓ) i,j ℓ,i]1:mℓ, i.e. the vector containing the first mℓcomponents of 1 mℓ+1 w(ℓ) i,j ℓ,i. 
Iterating the chain rule, one can see that for almost every x Rd, 1 f(x) = (W 1 ) (W 2 ) . . . (W L 1) a 1The gradient of f is well defined when there exists a neighborhood of x for which f is an affine function. Published as a conference paper at ICLR 2023 Since the spectral norm 2 is sub-multiplicative and A 2 = A 2, we have that f(x) = (W 1 ) (W 2 ) . . . (W L 1) 2 a = W 1 2 W 2 2 . . . W L 1 2 a . We compute the Frobenius norm of W ℓ: W ℓ 2 F = 1 mℓ Pmℓ i=1 [w(ℓ) i,j ℓ,i]1:mℓ 2 1 mℓ Pmℓ i=1 w(ℓ) i,j ℓ,i Since for any matrix A, A 2 A F , and the vector a satisfies a 1, we obtain that f(x) 1. To obtain the bound on |f(x)|, we use again the expression (15). For 1 ℓ L 1, we let bℓ Rmℓ+1 be the vector such that the i-th component is [ 1 mℓ+1 w(ℓ) i,j ℓ,i]mℓ+1. Since |[ 1 mℓ+1 w(ℓ) i,j ℓ,i]mℓ+1| 1 mℓ+1 , bℓ 1. It is easy to see that f(x) = a W L 1 W 1 x + PL 1 ℓ=1 a W L 1 W ℓ+1bℓ. |f(x)| a W L 1 W 1 x + PL 1 ℓ=1 W L 1 W ℓ+1bℓ L. The bound on x(ℓ) follows similarly, as xℓ= W ℓ 1 W 1 x + Pℓ 1 ℓ =1 a W ℓ 1 W ℓ+1bℓ. Proposition 6. Let FL,M,k(1) be the subset of FL,M,k such that for all 1 ℓ L 1, 1 i mℓ, 1 j k, w(ℓ) i,j 2 1, and a 2 = Pm L i=1 a2 i 1. For any f FL,M,k(1) and x B1(Rd), we have that |f(x)| 1. The metric entropy of FL,M,k(1) with respect to ρn admits the upper bound: log N δ; FL,M,k(1), ρn PL ℓ=2 kmℓ(mℓ 1 + 1) log 1 + 23+L ℓ δ + m L log 1 + 4 Proof. We define the function class GL,M,k that contains the functions from Rd to Rm L of the form g(x) = 1 m L maxj [k] w(L 1) i,j , (x(L 1), 1) m L 2 ℓ L 1, x(ℓ) i = 1 mℓmaxj [k] w(ℓ 1) i,j , (x(ℓ 1), 1) , x(1) = x, 1 ℓ L 1, 1 i mℓ, 1 j k, w(ℓ) i,j 2 1. Given {Xi}n i=1 B1(Rd) we define the pseudo-metric ρn between functions from Rd to Rm L as ρn(f, f ) = q 1 n Pm L i=1 Pn j=1(fi(Xj) f i(Xj))2. We prove by induction that log N(δ; GL,M,k, ρn) PL ℓ=2 kmℓ(mℓ 1 + 1) log 1 + 22+L ℓ To show the induction case, note that for L 3, any g GL,M,k can be written as g(x) = 1 m L maxj [k] w(L 1) i,j , (h(x), 1) m L where h GL 1,M,k. Remark that given a δ 2 (L 1)2+1-cover C of B1(Rm L 1+1), there exist w(L 1) i,j C such that w(L 1) i,j w(L 1) i,j δ 2 (L 1)2+1. Hence, if h is such that ρn( h, h) δ Published as a conference paper at ICLR 2023 and we define g(x) = ( 1 2m L maxj [k] w(L 1) i,j , ( h(x), 1) )m L i=1, we obtain = 1 nm L Pm L i=1 Pn k=1 maxj [k] w(L 1) i,j , ( h(Xk), 1) maxj [k] w(L 1) i,j , (h(Xk), 1) 2 1 nm L Pm L i=1 Pn k=1 maxj [k] w(L 1) i,j w(L 1) i,j , ( h(Xk), 1) w(L 1) i,j , (h(Xk) h(Xk), 0) 2 2 nm L Pm L i=1 Pn k=1 maxj [k] w(L 1) i,j w(L 1) i,j , ( h(Xk), 1) 2 + w(L 1) i,j , (h(Xk) h(Xk), 0) 2 2 nm L Pm L i=1 Pn k=1 maxj [k] w(L 1) i,j w(L 1) i,j 2 ( h(Xk), 1) 2 + w(L 1) i,j 2 h(Xk) h(Xk) 2 2 nm L Pm L i=1 Pn k=1 maxj [k] w(L 1) i,j w(L 1) i,j 2 ( h(Xk), 1) 2 + w(L 1) i,j 2 h(Xk) h(Xk) 2 1 nm L Pm L i=1 Pn k=1 δ2 4(L2+1)(L2 + 1) + h(Xk) h(Xk) 2 1 nm L Pm L i=1 Pn k=1 δ2 n Pn k=1 h(Xk) h(Xk) 2 δ2. In the second-to-last inequality we used that if h Gℓ,M,k, h(x) ℓ. This is equivalent to the bound x(ℓ) ℓshown in Proposition 3. Hence, we can build a δ-cover of GL,M,k in the pseudo-metric ρn from the Cartesian product of a δ 2-cover of GL 1,M,k in ρn and km L copies of a δ 2 (L 1)2+1-cover of B1(Rm L 1+1) in the 2 norm. Thus, N(δ; GL,M,k, ρn) N δ 2; GL 1,M,k, ρn N δ 2 (L 1)2+1; B1(Rm L 1+1), 2 The metric entropy of the unit ball admits the upper bound log N(δ; B1(Rd), 2) d log(1 + 2 δ ) (Wainwright (2019), Example 5.8). 
Consequently, log N(δ; GL,M,k, ρn) log N δ 2; GL 1,M,k, ρn + km L log N δ 2 (L 1)2+1; B1(Rm L 1+1), 2 PL 1 ℓ=2 kmℓ(mℓ 1 + 1) log 1 + 22+L ℓ + km L(m L 1 + 1) log 1 + 4 = PL ℓ=2 kmℓ(mℓ 1 + 1) log 1 + 22+L ℓ In the second inequality we used the induction hypothesis. To conclude the proof, note that an arbitrary function f FL,M,k(1) can be written as f(x) = a, g(x) , where a B1(Rm L) and g GL,M,k. Applying an analogous argument, we see that a δ 2 L2+1-cover of B1(Rm L) and a δ 2-cover of GL,M,k give rise to a δ-cover of FL,M,k(1). Hence, log N δ; FL,M,k(1), ρn log N δ 2; GL,M,k, ρn + log N δ 2; B1(Rm L), 2 PL ℓ=2 kmℓ(mℓ 1 + 1) log 1 + 23+L ℓ + m L log 1 + 4 Finally, to show the bound |f(x)| 1 for all f FL,M,k(1) and x B1(Rd), we use that |f(x)| | a, g(x) | a g(x) 1 and that if g Gℓ,M,k and x B1(Rd), then g(x) 1, as shown before. Published as a conference paper at ICLR 2023 Proposition 7. Suppose that for all 1 ℓ L, the widths mℓsatisfy mℓ m. Then, the Rademacher complexity of the class FL,M,k(1) satisfies: Eϵ[ Sn FL,M,k(1)] 64 q (L 1)km(m+1) (L + 1) log(2) + 1 2 log(L2 + 1) + π Proof. We apply Dudley s entropy integral bound (Lemma 3). We choose Tn = {(f(Xi))n i=1 Rn | f FL,M,k(1)}, we define the Rademacher process Xf = 1 n Pn i=1 ϵif(Xi), which is sub- Gaussian with respect to the metric ρn(f, f ) = q 1 n Pn i=1(f(Xi) f (Xi))2. Remark that D 2. Setting f 0 and δ = 0 in Lemma 3, we obtain that Eϵ[ Sn FL,M,k(1)] = 1 n E h supf Tn Xf i 32 n R 2 0 p log N(t; FL,M,k(1), ρn) dt. Applying the metric entropy bound from Proposition 6, we get p log N(δ; FL,M,k(1), ρn) PL ℓ=2 kmℓ(mℓ 1 + 1) log 1 + 23+L ℓ + m L log 1 + 4 (L 1)km(m + 1) log 1 + 2L+1 (L 1)km(m + 1) log 2L+2 In the last equality we used that for δ [0, 2], 1 + 2L+1 L2+1 δ 2L+2 L2+1 δ . We compute the integral L2+1 t dt = 2L+2 L2 + 1 R 2 (L+1) 1 log(t ) dt . Applying Lemma 4 with z = 2 (L+1) 1 L2+1, we obtain that R 2 (L+1) 1 log(t ) dt 2 (L+1) q (L + 1) log(2) + 1 2 log(L2 + 1) + π Putting everything together yields equation (16). Lemma 4. For any z (0, 1], we have that R z 0 p log(x) dx = z p log(z)) z p Here, erfc denotes the complementary error function, defined as erfc(x) = 2 π R + x e t2 dt. Proof. We rewrite the integral as: R z 0 R log(x) 0 1 2 y dy dx = R log(z) 0 1 2 y R z 0 dx dy + R + log(z) 1 2 y R e y = z R log(z) 0 1 2 y dy + R + log(z) e y 2 y dy log(z) + R + log(z) e t2 dt The complementary error function satisfies the bound erfc(x) e x2 for any x > 0, which implies the final inequality. Published as a conference paper at ICLR 2023 Proof of Theorem 3. Proposition 7 proves that under the condition mℓ m, the empirical Rademacher complexity of the class FL,M,k(1) satisfies ˆRn(FL,M,k(1)) (L 1)km(m+1) (L + 1) log(2) + 1 2 log(L2 + 1) + π 2 ). Applying Theorem 11 of Sriperum- budur et al. (2009) as in the proof of Theorem 2, and that for any f FL,M,k(1) and x Br(Rd), we have that |f(x)| r (see Proposition 6), we obtain the result. C PROOFS OF SEC. 5 Proof of Theorem 4. To show (i), we have that for any ν P(K), 2W 2 2 (µ+, ν ) 1 2W 2 2 (µ , ν ) + 1 2W 2 2 (µ , ν ) 1 2W 2 2 (µ+, ν ) 2W 2 2 (µ+, ν) 1 2W 2 2 (µ , ν) + sup ν P(K) 2W 2 2 (µ , ν) 1 2W 2 2 (µ+, ν) = d CT,A(µ+, µ ). The right-to-left implication of (ii) is straight-forward. To show the left-to-right one, we use the definition for the CT distance, rewriting VDCA(µ+||µ ) and VDCA(µ ||µ+) in terms of their definitions: d CT,A(µ+, µ ) = sup u A Ω u d(µ+ µ ) + sup u A Ω u d(µ+ µ ) . 
(17) Since the two terms in the right-hand side are non-negative, d CT,A(µ+, µ ) = 0 implies that they are both zero. Then, applying Proposition 2, we obtain that µ µ+ and µ+ µ according to the Choquet order. The antisymmetry property of partial orders then implies that µ+ = µ . To show (iii), we use equation (17) again. The result follows from Ωu d(µ1 µ2) supu A Ωu d(µ1 µ3) + supu A Ωu d(µ3 µ2) , Ωu d(µ2 µ1) supu A Ωu d(µ2 µ3) + supu A Ωu d(µ3 µ1) . D PROOFS OF SUBSEC. 5.1 Lemma 5. For a function class F that contains the zero function, define Sn F = supf F | 1 n Pn i=1 ϵif(Xi)|, where ϵi are Rademacher variables. EX,ϵ[ Sn F] is known as the Rademacher complexity. Suppose that 0 belongs to the compact set K. We have that 1 2EX,ϵ[ Sn A] E[d CT,A(µ, µn)] 4EX,ϵ[ Sn A], where A = {f Eµ[f] | f A} is the centered version of the class A. Proof. We will use an argument similar to the proof of Prop. 4.11 of Wainwright (2019) (with the appropriate modifications) to obtain the Rademacher complexity upper and lower bounds. We start Published as a conference paper at ICLR 2023 with the lower bound: EX,ϵ[ Sn A] = EX,ϵ i=1 ϵi(f(Xi) EYi[f(Yi)]) i=1 ϵi(f(Xi) f(Yi)) i=1 (f(Xi) f(Yi)) i=1 (f(Xi) E[f]) i=1 (f(Yi) E[f]) i=1 (f(Xi) E[f]) i=1 (E[f] f(Xi)) = 2E[VDCA(µn||µ) + VDCA(µ||µn)]. The last inequality follows from the fact that supf A | 1 n Pn i=1(f(Xi) E[f])| supf A 1 n Pn i=1(f(Xi) E[f]) + supf A 1 n Pn i=1(E[f] f(Xi)), which holds as long as the two terms in the right-hand side are non-negative. This happens when f 0 belongs to A as a consequence of 0 K. The upper bound follows essentially from the classical symmetrization argument: E[d CT,A(µ, µn)] 2EX 1 n Pn i=1(f(Xi) E[f]) 1 n Pn i=1 ϵif(Xi) Lemma 6 (Relation between Rademacher and Gaussian complexities, Exercise 5.5, Wainwright (2019)). Let Gn F = supf F | 1 n Pn i=1 zif(Xi)|, where zi are standard Gaussian variables. EX,z[ Gn F] is known as the Gaussian complexity. We have EX,z[ Gn F] 2 log n EX,ϵ[ Sn F] p π 2 EX,z[ Gn F]. Given a set T Rn, the family of random variables {Gθ, θ T}, where Gθ := ω, θ = Pn i=1 ωiθi and ωi N(0, 1) i.i.d., defines a stochastic process is known as the canonical Gaussian process associated with T. D.1 RESULTS USED IN THE LOWER BOUND OF THEOREM 5 Lemma 7 (Sudakov minoration, Wainwright (2019), Thm. 5.30; Sudakov (1973)). Let {Gθ, θ T} be a zero-mean Gaussian process defined on the non-empty set T. Then, E supθ T Gθ supδ>0 δ 2 p log MG(δ; T). where MG(δ; T) is the δ-packing number of T in the metric ρG(θ, θ ) = p E[(Xθ Xθ )2]. Proposition 8. Let C0, C1 be universal constants independent of the dimension d, and suppose that n C1 log(d). Recall that Fd(M, C) is the set of convex functions on [ 1, 1]d such that |f(x)| M and |f(x) f(y)| C x y for any x, y [ 1, 1]d, and that Fd(M, C) = {f Eµ[f] | f Fd(M, C)}. The Gaussian complexity of the set Fd(M, C) satisfies EX,z[ Gn Fd(M,C)] C0 q C d(1+C)n 2 Published as a conference paper at ICLR 2023 Proof. Given (Xi)n i=1 Rd, let Tn = {(f(Xi))n i=1 Rn | f Fd(M, C)}. Let Xi be sampled i.i.d. from the uniform measure on [ 1, 1]d. 
We have that with probability at least 1/2 on the instantiation of (Xi)n i=1, Ez[ Gn Fd(M,C)] = 1 n E supθ Tn Xθ Cn 128d(1+C)n 2 Cn 128d(1+C)n 2 d ; Fd(M, C), ρn Cn 128d(1+C)n 2 d log 2 n 16 2d(1+C)Cn+1 Cn 128d(1+C)n 2 d n 16 log(2) log(32 p 2d(1 + C)C) 1 2 + 2 C d(1+C)n 2 The first inequality follows from Sudakov minoration (Lemma 7) by setting δ = q Cn 128d(1+C)n 2 The second inequality follows from the lower bound on the packing number of A given by Corollary 1. In the following approximation we neglected the term 1 in the numerator inside of the logarithm. In the last inequality, C0 is a universal constant independent of the dimension d. The inequality holds as long as log(32 p 2d(1 + C)C) + ( 1 d) log(n) n 32 log(2), which is true when n C1 log(d) for some universal constant C1. Proposition 9. With probability at least (around) 1/2 on the instantiation of (Xi)n i=1 as i.i.d. samples from the uniform distribution on Sd 1, the packing number of the set Fd(M, C) with respect to the metric ρn(f, f ) = p Pn i=1(f(Xi) f (Xi))2 satisfies Cn 32d(1+C)n 2 d ; Fd(M, C), ρn Proof. This proof uses the same construction of Thm. 6 of Bronshtein (1976), which shows a lower bound on the metric entropy of Fd(M, C) in the L norm. His result follows from associating a subset of convex functions in Fd(M, C) to a subset of convex polyhedrons, and then lower-bounding the metric entropy of this second set. Note that the pseudo-metric ρn is weaker than the L norm, which means that our result does not follow from his. Instead, we need to use a more intricate construction for the set of convex polyhedrons, and rely on the Varshamov-Gilbert lemma (Lemma 8). First, we show that if we sample n points (Xi)n i=1 i.i.d. from the uniform distribution over the unit ball of Rd, with constant probability (around 1 2) there exists a δ-packing of the unit ball of Rd with 2 points. For any n Z+, we define the set-valued random variable Sn = {i {1, . . . , n} | 1 j < i, Xi Xj 2 δ}, and the random variable An = |Sn|. That is, An is the number of points Xi that are at least epsilon away of any point with a lower index; clearly the set of such points constitutes a δ-packing of the unit ball of Rd. We have that E[An|(Xi)n 1 i=1 ] = An 1 + Pr( 1 i n 1, Xn Xi 2 δ). Since for a fixed Xi and uniformly distributed Xn, Pr( Xn Xi 2 δ) δd, a union bound shows that Pr( 1 i n 1, Xn Xi 2 δ) 1 (k 1)δd. Thus, by the tower property of conditional expectations: E[An] E[An 1] + 1 (n 1)δd. A simple induction shows that E[An] n Pn 1 i=0 iδd = n n(n 1)δd 2 . This is a quadratic function of n. For a given δ > 0, we choose n that maximizes this expression, which results in 2 0 = n δ d + 1 = E[An] δ d + 1 2 + (δ d+ 1 2 = δ d + 1 Published as a conference paper at ICLR 2023 where we used instead of = to remark that n must be an integer. Since An is lower-bounded by 1 and upper-bounded by n δ d + 1 2, Markov s inequality shows that P(An E[An]) where the approximation works for δ 1. Taking k = An, this shows the existence of a δ-packing of the unit ball of size δ d/2 2 with probability at least (around) 1 Next, we use a construction similar to the proof of the lower bound in Thm. 6 of Bronshtein (1976). That is, we consider the set Sn = {(x, g(x)) | x Sn} Rd+1, where g : [ 1, 1]d R is the map that parameterizes part of the surface of the (d 1)-dimensional hypersphere Sd 1(R, t) centered at (0, . . . , 0, t) of radius R for well chosen t, R. As in the proof of Thm. 
6 of Bronshtein (1976), if x Sn Sd 1(R, t) and we let Sd 1(R, t)x denote the tangent space at x, we construct a hyperplane P that is parallel to Sd 1(R, t)x at a distance ϵ < R from the latter and that intersects the sphere. A simple trigonometry exercise involving the lengths of the chord and the sagitta of a two-dimensional circular segment shows that any point in Sd 1(R, t) that is at least 2Rϵ away from x will be separated from x by the hyperplane P. Thus, the convex hull of Sn \ {x} is at least ϵ away from x. For any subset S Sn, we define the function g S as the function whose epigraph (the set of points on or above the graph of a convex function) is equal to the convex hull of S . Note that g S is convex and piecewise affine. Let us set δ = 2Rϵ. If x is a point in Sn \ S , by the argument in the previous paragraph the convex hull of S is at least ϵ away from x. Thus, |g S (x) g S {x}| ϵ. The functions g S and g S will differ by at least ϵ at each point in the symmetric difference S S . By the Varshamov-Gilbert lemma (Lemma 8), there is a set of 2k/8 different subsets S such that g S differ pairwise at at least k/4 points in Sn by at least ϵ. Thus, for any S = S in this set, we have that ρn(g S , g S ) = p Pn i=1(g S (Xi) g S (Xi))2 q We have to make sure that all the functions g S belong to the set Fd(M, C). Since | d R2 x2| = |x| R2 x2 , the function g will be C-Lipschitz on [ 1, 1]d if we take R2 d(1+C) C , and in that case g S will be Lipschitz as well for any S Sn. To make sure that the uniform bound g S M holds, we adjust the parameter t. To obtain the statement of the proposition we start from a certain n and set δ such that (18) holds: δ = n 1 d , and we set k n 2 . Since δ = 2Rϵ, we have that ϵ = 1 2Rδ2 1 C d(1+C)n 2 d , which means that 2k/8 2n/16, q Cn 32d(1+C)n 2 Lemma 8 (Varshamov-Gilbert). Let N 8. There exists a subset Ω {0, 1}N such that |Ω| 2N/8 and for any x, x Ωsuch that x = x , at least N/4 of the components differ. Lemma 9. For any ϵ > 0, the packing number of the centered function class Fd(M, C) = {f Eµ[f]| f Fd(M, C)} with respect to the metric ρn fulfills M(ϵ/2; Fd(M, C), ρn) M(ϵ; Fd(M, C), ρn)/(4n C/ϵ + 1), Proof. Let S be an ϵ-packing of Fd(M, C) with respect to ρn (i.e. |S| = M(ϵ; Fd(M, C), ρn)). Let Fd(M, C, t, δ) = {f Fd(M, C) | |Eµ[f] t| δ}. If we let (ti)m i=1 be a δ packing of [ C, C], we can write Fd(M, C) = m i=1Fd(M, C, ti, δ). Published as a conference paper at ICLR 2023 By the pigeonhole principle, there exists an index i {1, . . . , m} such that |S Fd(M, C, ti, δ)| |S|/m. Let Fd(M, C, ti) = {f Fd(M, C) | Eµ[f] = ti}, and let P : Fd(M, C) Fd(M, C, ti) be the projection operator defined as f 7 f Eµ[f] + ti. If we take δ = ϵ/(4n), we have that ρn(Pf, Pf ) ϵ/2 for any f = f S Fd(M, C, ti, δ), as ρn(Pf, Pf ) ρn(Pf, f) + ρn(f, f ) + ρn(f , Pf ), and ρn(f, Pf) n|Eµ[f] ti| nδ = ϵ/4, while ρn(f, f ) ϵ. Thus, the ϵ/2-packing number of Fd(M, C, ti) is lower-bounded by |S Fd(M, C, ti, δ)| |S|/m. Since m = (2C + 2δ)/(2δ) 4n C/ϵ + 1, this is lower-bounded by |S|/(4n C/ϵ + 1). The proof concludes using the observation that the map Fd(M, C, ti) Fd(M, C) defined as f 7 f ti is an isometric bijection with respect to ρn. Corollary 1. With probability at least 1/2, the packing number of the set Fd(M, C) with respect to the metric ρn satisfies Cn 128d(1+C)n 2 d ; Fd(M, C), ρn 2d(1+C)Cn+1 Proof of Theorem 5. Lemma 5 provides upper and lower bounds to E[d CT,A(µ, µn)] in terms of the Rademacher complexities of A and its centered version, A. Lemma 1 in App. 
A states that A is equal to the space Fd(M, C) of convex functions on [ 1, 1]d such that |f(x)| M and |f(x) f(y)| C x y for any x, y [ 1, 1]d. Let EX,ϵ[ Sn F], EX,ϵ[ Gn F] be the Rademacher and Gaussian complexities of a function class F (see definitions in Lemma 5 and Lemma 6). We can write EX,ϵ[ Sn A] EX,z[ Gn A] 2 log n C0 q C 2d(1+C) log(n)n 2 where the first inequality holds by Lemma 6, and the second inequality follows from Proposition 8. This gives rise to (7) upon redefining C0 C0/ 8. Equation (8) follows from the Rademacher complexity upper bound in Lemma 5, and the upper bound on the empirical Rademacher complexity of Fd(M, C) given by Proposition 4. Proof of Theorem 6. We upper-bound E[d CT,FL,M,k,+(1)(µ, µn)] as in the upper bound of E[d CT,A(µ, µn)] in Lemma 5: E[d CT,FL,M,k,+(1)(µ, µn)] 2EX supf FL,M,k,+(1) 1 n Pn i=1(f(Xi) E[f]) supf FL,M,k,+(1) 1 n Pn i=1 ϵif(Xi) = 4Eϵ[ Sn FL,M,k,+(1)]. Likewise, we get that E[VDCFL,M,k,+(1)(µ, µn)] 2Eϵ[ Sn FL,M,k,+(1)]. Since FL,M,k,+(1) FL,M,k(1), we have that Eϵ[ Sn FL,M,k,+(1)] Eϵ[ Sn FL,M,k(1)]. Then, the result follows from the Rademacher complexity upper bound from Proposition 7. E SIMPLE EXAMPLES IN DIMENSION 1 FOR VDC AND d CT For simple distributions over compact sets of R, we can compute the CT discrepancy exactly. Let η : R R be a non-negative bump-like function, supported on [ 1, 1], symmetric w.r.t 0, increasing on ( 1, 0], and such that R R η = 1. Clearly η is the density of some probability measure with respect to the Lebesgue measure. Same variance, different mean. Let µ+ P(R) with density η(x a) for some a > 0, and let µ P(R) with density η(x + a). We let K = [ C, C] for any C > 0. Published as a conference paper at ICLR 2023 Proposition 10. Let F(x) = R x (η(y+a) η(y a)) dy, and G(x) = R x 0 F(y) dy. We obtain that VDCK(µ+||µ ) = 2CG(+ ). The optimum of (P2) is equal to the Dirac delta at C: ν = δ C, while an optimum of (P1) is u = Cx. We have that VDCK(µ+||µ ) = 2CG(+ ) as well, and thus, d CT,A(µ , µ+) = 4CG(+ ). Proof. Suppose that first that a > 0. We have that F( ) = F( a 1) = 0, F(+ ) = F(a + 1) = 0. F is even, and supx |F(x)| = 1. We have that G is odd (in particular G( a 1) = G(a + 1)) and non-decreasing on R. Also, 0 < G(+ ) = G(a + 1) a + 1. Let K = [ C, C] where C > 0, and let Ω= [ a 1, a + 1]. For any twice-differentiable2 convex function u on Ω such that u K, we have R Ωu(x) d(µ µ+)(y) = R Ωu(x)(η(y + a) η(y a)) dy = [u(x)F(x)]a+1 a 1 R Ωu (x)F(x) dy Ωu (x)F(x) dy = [ u (x)G(x)]a+1 a 1 + R Ωu (x)G(x) dy = G(a + 1)(u (a + 1) + u ( a 1)) + R Ωu (x)G(x) dy If we reexpress u(a + 1) = u ( a 1) + R a+1 a 1 u (x) dx, the right-hand side becomes 2G(a + 1)u ( a 1) + R Ωu (x)(G(x) G(a + 1)) dy (19) Since G is increasing, for any x [ a 1, a + 1), G(x) < G(a + 1). The convexity assumption on u implies that u 0 on Ω, and the condition u K means that u [ C, C]. Thus, the function u that maximizes (19) fulfills u C, u 0. Thus, we can take u = Cx. The measure ν = ( u)#µ = ( u)#µ+ is equal to the Dirac delta at C: ν = δ C. Hence, Ωu d(µ µ+) = 2CG(a + 1). Thus, VDCK(µ+||µ ) = 2CG(a + 1). Reproducing the same argument yields VDCK(µ ||µ+) = 2CG(a + 1), and hence the CT distance is equal to d CT,A(µ , µ+) = 4CG(a + 1). Same mean, different variance. Let µ+ P(R) with density 1 a) for some a > 0, and let µ P(R) with density η(x). Note that µ+ has support [ a, a], and standard deviation equal to a times the standard deviation of µ . We let K = [ C, C] for any C > 0 as before. Proposition 11. 
Let F(x) = ∫_{−∞}^{x} (η(y) − (1/a) η(y/a)) dy, and G(x) = ∫_{−∞}^{x} F(y) dy. When a < 1, we have that VDC_K(µ+‖µ−) = 2C G(0). The optimum of (P2) is ν* = (1/2)(δ_{−C} + δ_{C}), while an optimum of (P1) is u*(x) = C|x|. When a > 1, VDC_K(µ+‖µ−) = 0. For any a > 0, d_CT,A(µ−, µ+) = 2C|G(0)|.

Proof. We define Ω = [−max{a, 1}, max{a, 1}] and K = [−C, C]. We have that F(−∞) = F(−max{a, 1}) = 0, F(+∞) = F(max{a, 1}) = 0, and F(0) = 0. F is odd. When a < 1 it is non-positive on [0, +∞) and non-negative on (−∞, 0]. When a > 1, it is non-negative on [0, +∞) and non-positive on (−∞, 0]. We have that G(−∞) = G(−max{a, 1}) = 0 and G(+∞) = G(max{a, 1}) = 0. G is even. When a < 1, G is non-negative, non-decreasing on (−∞, 0] and non-increasing on [0, +∞): it has a global maximum at 0. When a > 1, G is non-positive, non-increasing on (−∞, 0] and non-decreasing on [0, +∞): it has a global minimum at 0. Integrating by parts twice (F and G vanish at the endpoints of Ω), we have that

∫_Ω u(x) d(µ− − µ+)(x) = ∫_Ω u(x)(η(x) − (1/a) η(x/a)) dx = [u(x)F(x)]_{−max{a,1}}^{max{a,1}} − ∫_Ω u′(x)F(x) dx = −∫_Ω u′(x)F(x) dx = −[u′(x)G(x)]_{−max{a,1}}^{max{a,1}} + ∫_Ω u″(x)G(x) dx = ∫_Ω u″(x)G(x) dx.   (20)

Thus, when a < 1 this expression is maximized when u″ concentrates at 0. Taking into account the constraint u′ ∈ [−C, C], the optimal u satisfies u′(x) = C sign(x), which means that an optimal u is u(x) = C|x|.²

Thus, the measure ν* = (∇u)#µ− = (∇u)#µ+ is equal to the average of the Dirac deltas at −C and C: ν* = (1/2)(δ_{−C} + δ_{C}). We obtain that ∫_Ω u* d(µ− − µ+) = 2C G(0), and hence VDC_K(µ+‖µ−) = 2C G(0). When a > 1, the expression (20) is maximized when u″ = 0, which means that any constant u′ and any affine u work. Any measure ν concentrated at a point in [−C, C] is optimal. Thus, sup_{u∈A} ∫_Ω u d(µ− − µ+) = 0, i.e., VDC_K(µ+‖µ−) = 0. The same computation with the roles of µ+ and µ− exchanged (which flips the sign of G) shows that VDC_K(µ−‖µ+) = 0 when a < 1 and VDC_K(µ−‖µ+) = 2C|G(0)| when a > 1. We conclude that d_CT,A(µ+, µ−) = 2C|G(0)| for any a > 0.

²Using a mollifier sequence, any convex function can be approximated arbitrarily well by a twice-differentiable convex function.

F ALGORITHMS FOR LEARNING DISTRIBUTIONS WITH SURROGATE VDC AND CT DISTANCE

In Algorithms 1 and 2, we present the steps for learning with the VDC and the CT distance, respectively. In order to enforce convexity of the ICMNs, we apply a projected gradient step after each update of the hidden neural network parameters. Additionally, to regularize the u networks in Algorithm 1, we include a term that penalizes the square of the outputs of the Choquet critic on the baseline and generated samples (Mroueh & Sercu, 2017). A minimal code sketch of the update structure of Algorithm 1 is given right after its pseudocode below.

Algorithm 1 Enforcing dominance constraints with the surrogate VDC
Input: target distribution ν, baseline g0, latent distribution µ0, integer maxEpochs, integer discriminatorEpochs, Choquet weight λ, GAN learning rate η, Choquet critic learning rate η_VDC, Choquet critic regularization weight λ_{u,reg}, WGAN gradient penalty regularization weight λ_GP
Initialize: untrained discriminator f_φ, untrained generator g_θ, untrained Choquet ICMN critic u_ψ
ψ ← Project_Hidden_Weights_To_Non_Negative(ψ)
for i = 1 to maxEpochs do
    L_WGAN(φ, θ) = E_{Y∼µ0}[f_φ(g_θ(Y))] − E_{X∼ν}[f_φ(X)]
    L_GP = E_{t∼Unif[0,1]} E_{X∼ν, Y∼µ0}[(‖∇_x f_φ(t g_θ(Y) + (1 − t)X)‖ − 1)²]
    L_VDC(ψ, θ) = E_{Y∼µ0}[u_ψ(g_θ(Y)) − u_ψ(g0(Y))]
    for j = 1 to discriminatorEpochs do
        φ ← φ − η ∇_φ(L_WGAN + λ_GP L_GP)   {ADAM optimizer}
        L_VDCreg = E_{Y∼µ0}[u_ψ(g_θ(Y))² + u_ψ(g0(Y))²]
        ψ ← ψ − η_VDC ∇_ψ(L_VDC + λ_{u,reg} L_VDCreg)   {ADAM optimizer}
        ψ ← Project_Hidden_Weights_To_Non_Negative(ψ)
    end for
    θ ← θ + η ∇_θ(L_WGAN + λ L_VDC)   {ADAM optimizer}
end for
Return g_θ
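For concreteness, the following is a minimal PyTorch sketch of one outer iteration of Algorithm 1. It is a sketch under assumptions rather than the paper's exact code: `f_phi`, `g_theta`, `g0`, and `u_psi` stand for the discriminator, generator, frozen baseline, and ICMN Choquet critic; `sample_real` and `sample_latent` are hypothetical sampling callables; and `u_psi.hidden_layers` is an assumed attribute listing the linear layers whose weights must stay non-negative.

```python
# Minimal sketch (not the authors' exact implementation) of one outer iteration of
# Algorithm 1: WGAN-GP training of (f_phi, g_theta) with a surrogate-VDC penalty
# enforced by an ICMN Choquet critic u_psi against a frozen baseline generator g0.
import torch


def gradient_penalty(f_phi, real, fake):
    # Standard WGAN-GP penalty on random interpolates between real and generated samples.
    t = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (t * real + (1 - t) * fake).requires_grad_(True)
    grad = torch.autograd.grad(f_phi(interp).sum(), interp, create_graph=True)[0]
    return ((grad.flatten(1).norm(dim=1) - 1.0) ** 2).mean()


def project_hidden_weights_nonnegative(u_psi):
    # Projection step that keeps hidden-to-hidden weights non-negative, preserving
    # input convexity of the ICMN critic. `hidden_layers` is an assumed attribute.
    with torch.no_grad():
        for layer in u_psi.hidden_layers:
            layer.weight.clamp_(min=0.0)


def outer_iteration(f_phi, g_theta, g0, u_psi, sample_real, sample_latent,
                    opt_phi, opt_psi, opt_theta, batch_size=64,
                    lam=10.0, lam_gp=10.0, lam_u_reg=10.0, disc_epochs=5):
    for _ in range(disc_epochs):
        x = sample_real(batch_size)          # X ~ nu
        y = sample_latent(batch_size)        # Y ~ mu_0
        fake, base = g_theta(y).detach(), g0(y).detach()
        # Discriminator step: descend L_WGAN + lam_gp * L_GP.
        l_wgan = f_phi(fake).mean() - f_phi(x).mean()
        opt_phi.zero_grad()
        (l_wgan + lam_gp * gradient_penalty(f_phi, x, fake)).backward()
        opt_phi.step()
        # Choquet critic step: descend L_VDC + lam_u_reg * L_VDCreg, then project.
        l_vdc = (u_psi(fake) - u_psi(base)).mean()
        l_reg = (u_psi(fake) ** 2 + u_psi(base) ** 2).mean()
        opt_psi.zero_grad()
        (l_vdc + lam_u_reg * l_reg).backward()
        opt_psi.step()
        project_hidden_weights_nonnegative(u_psi)
    # Generator step: ascend L_WGAN + lam * L_VDC (descend its negation).
    # The E[f_phi(X)] term is constant in theta and therefore omitted.
    y = sample_latent(batch_size)
    fake = g_theta(y)
    gen_obj = f_phi(fake).mean() + lam * (u_psi(fake) - u_psi(g0(y).detach())).mean()
    opt_theta.zero_grad()
    (-gen_obj).backward()
    opt_theta.step()
```

The projection after every critic step is what keeps u_ψ input convex, so that the learned critic remains a valid surrogate for the dominance constraint.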
G PROBING MODE COLLAPSE

As described in Sec. 7, to investigate how training with the surrogate VDC regularizer helps alleviate mode collapse in GAN training, we implemented GANs trained with the IPM objective alone and compared this to training with the surrogate VDC regularizer on a mixture of 8 Gaussians as the target distribution. To track mode collapse, we report two metrics: 1) the mode collapse score, which we define as follows: each generated point is assigned to the nearest cluster (mode) in the target mixture, yielding a histogram over the modes computed on all generated points. The closer this histogram is to the uniform distribution, the less mode-collapsed the generator. We quantify this with the KL divergence between this histogram and the uniform distribution on 8 modes. 2) The negative log-likelihood (NLL) of the generated samples under the Gaussian mixture. A converged generator should have both a low NLL and a low mode collapse score. In Figure 3, we see that in the baseline (unregularized) GAN training we observe mode collapse and cycling between modes, evidenced by the fluctuating mode collapse score and NLL. In contrast, when training with the VDC regularizer to improve upon the collapsed generator g0 (taken from step 55k of the unregularized GAN training), we see more stable training and better mode coverage. As the regularization weight λ_VDC for the VDC increases, the dominance constraint is more strongly enforced, resulting in a better NLL and a smaller mode collapse score.

Algorithm 2 Generative modeling with the surrogate CT distance
Input: target distribution ν, latent distribution µ0, integer maxEpochs, integer criticEpochs, learning rate η
Initialize: untrained generator g_θ, untrained Choquet ICMN critics u_ψ1, u_ψ2
ψ1 ← Project_Hidden_Weights_To_Non_Negative(ψ1)
ψ2 ← Project_Hidden_Weights_To_Non_Negative(ψ2)
for i = 1 to maxEpochs do
    L_DCT,1(ψ1, θ) = E_{Y∼µ0}[u_ψ1(g_θ(Y))] − E_{X∼ν}[u_ψ1(X)]
    L_DCT,2(ψ2, θ) = E_{X∼ν}[u_ψ2(X)] − E_{Y∼µ0}[u_ψ2(g_θ(Y))]
    L_dCT(ψ1, ψ2, θ) = L_DCT,1(ψ1, θ) + L_DCT,2(ψ2, θ)
    for j = 1 to criticEpochs do
        ψ1 ← ψ1 − η ∇_ψ1 L_DCT,1   {ADAM optimizer}
        ψ1 ← Project_Hidden_Weights_To_Non_Negative(ψ1)
        ψ2 ← ψ2 − η ∇_ψ2 L_DCT,2   {ADAM optimizer}
        ψ2 ← Project_Hidden_Weights_To_Non_Negative(ψ2)
    end for
    θ ← θ + η ∇_θ L_dCT   {ADAM optimizer}
end for
Return g_θ

Figure 3: Probing mode collapse for GAN training.

H ADDITIONAL EXPERIMENTAL DETAILS

Portfolio optimization In Figure 4 we plot the trajectory of the z parameter from the portfolio optimization example solved in Sec. 7. The convex decreasing u network was parameterized with a 3-layer, fully-connected, decreasing ICMN with hidden dimension 32 and maxout kernel size 4. See Table 2 for the full architectural details. Optimization was performed on CPU.

Figure 4: Trajectory of the z parameter from the portfolio optimization example in Sec. 7.

Table 2: Architectural details for the convex decreasing u network used in portfolio optimization.
Convex decreasing u   Input dimension   Output dimension   Kernel   Restriction
Linear                1                 32                 -        Non-positivity
MaxOut                -                 -                  4        -
Linear                8                 32                 -        Non-negativity
MaxOut                -                 -                  4        -
Linear                8                 1                  -        Non-negativity

Image generation In training WGAN-GP and WGAN-GP + VDC, we use a residual convolutional neural network (CNN) for the generator and a CNN for the discriminator (both with ReLU nonlinearities). See Table 3, Table 4, and Table 5 for the full architectural details. Note that in Table 3, PixelShuffle refers to a dimension rearrangement where an input of dimension Cr² × H × W is rearranged to C × Hr × Wr (see the pytorch documentation for more details).
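As a quick sanity check of this rearrangement, here is a minimal sketch using torch.nn.PixelShuffle; the input shape is our own choice, picked to match the 128-channel feature maps in Table 3.

```python
# PixelShuffle with upscale factor r = 2 maps (N, C*r^2, H, W) to (N, C, H*r, W*r).
import torch
import torch.nn as nn

x = torch.randn(1, 128 * 4, 8, 8)                   # C = 128, r = 2, H = W = 8
print(nn.PixelShuffle(upscale_factor=2)(x).shape)   # torch.Size([1, 128, 16, 16])
```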
Our latent dimension for µ0 is 128 and λ_GP = 10. We use ADAM optimizers (Kingma & Ba, 2015) for both networks, learning rates of 1e-4, and a batch size of 64. We use the CIFAR-10 training data and split it as 95% training and 5% validation. FID is calculated using the validation set. The generator was trained every 6th epoch, and training was executed for about 400 epochs in total. When training g with the surrogate VDC, we use a learning rate of 1e-5 for the Choquet critic, λ = 10, and λ_{u,reg} = 10. Training the baseline g0 and g with the surrogate VDC was done on a compute environment with 1 CPU and 1 A100 GPU.

2D point cloud generation When training with the d_CT surrogate for point cloud generation, the generator and Choquet critics are parameterized by residual maxout networks with a maxout kernel size of 2. The critics are ICMNs. Our latent dimension for µ0 is 32. Both the generator and the critics have a hidden dimension of 32. The generator consists of 10 fully-connected layers and the critics consist of 5. For all networks, we add residual connections from input-to-hidden layers (as opposed to hidden-to-hidden). The last layer of all networks is fully-connected linear. See Table 6 for the full architectural details. We use ADAM optimizers for all networks, a learning rate of 5e-4 for the generator, learning rates of 1e-4 for the Choquet critics, and a batch size of 512. Training was done on a single-CPU environment.

Table 3: Architectural details for the WGAN-GP generator g_θ.
WGAN Generator g_θ           Kernel size   Output shape
Y ∼ µ0                       -             128
ConvTranspose                4 × 4         128 × 4 × 4
Residual Block               3 × 3         128 × 8 × 8
Residual Block               3 × 3         128 × 16 × 16
Residual Block               3 × 3         128 × 32 × 32
BatchNorm, ReLU, Dropout     -             -
Conv                         3 × 3         3 × 32 × 32

Residual Block (kernel size specified above)
Main:      LayerNorm, ReLU; PixelShuffle 2x; Conv; LayerNorm, ReLU; Conv
Residual:  PixelShuffle 2x; Conv

Table 4: Architectural details for the WGAN-GP discriminator f_φ.
WGAN Discriminator f_φ   Kernel size   Stride   Output shape
Input                    -             -        3 × 32 × 32
Conv w/ ReLU             3 × 3         2 × 2    128 × 16 × 16
Conv w/ ReLU             3 × 3         2 × 2    256 × 8 × 8
Conv w/ ReLU             3 × 3         2 × 2    512 × 4 × 4
Linear                   -             -        1

Table 5: Architectural details for the Choquet critic u_ψ in the image domain.
Choquet critic u_ψ   Kernel size   Stride   Output shape
Input                -             -        3 × 32 × 32
Conv                 3 × 3         2 × 2    1024 × 16 × 16
MaxOut + Dropout     16            -        64 × 16 × 16
Conv                 3 × 3         2 × 2    2048 × 8 × 8
MaxOut + Dropout     16            -        128 × 8 × 8
Conv                 3 × 3         2 × 2    4096 × 4 × 4
MaxOut + Dropout     16            -        256 × 4 × 4
Linear               -             -        1

Table 6: Architectural details for the generator g_θ and the Choquet critics u_ψ1, u_ψ2 in the 2D point cloud domain.
                 Generator   Choquet critics
Input dim        32          2
Hidden dim       32          32
Num. layers      10          5
Output dim       2           1

Fully-connected Residual Blocks
Main:      Linear (hidden-to-hidden); MaxOut kernel size 2; Dropout
Residual:  Linear (input-to-hidden)

Libraries Our experiments rely on various open-source libraries, including pytorch (Paszke et al., 2019) (license: BSD) and pytorch-lightning (Falcon et al., 2019) (Apache 2.0).

Code re-use For several of our generator, discriminator, and Choquet critic networks, we draw inspiration from and leverage code from the following public Github repositories: (1) https://github.com/caogang/wgan-gp, (2) https://github.com/ozanciga/gans-with-pytorch, and (3) https://github.com/CW-Huang/CP-Flow.
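To make the ICMN building blocks behind Tables 2, 5, and 6 concrete, the following is a minimal sketch (with our own simplified module names, not the paper's code) of a fully-connected maxout layer and a small input convex maxout network with the Table 2 layer sizes, together with the non-negativity projection used in Algorithms 1 and 2; the sign restriction that makes the Table 2 network decreasing is omitted.

```python
import torch
import torch.nn as nn


class MaxOut(nn.Module):
    """Maxout over groups of `kernel_size` consecutive units: width m -> m // kernel_size."""
    def __init__(self, kernel_size: int):
        super().__init__()
        self.kernel_size = kernel_size

    def forward(self, x):
        n, m = x.shape
        return x.view(n, m // self.kernel_size, self.kernel_size).max(dim=2).values


class SimpleICMN(nn.Module):
    """Sketch of a 3-layer input convex maxout network with the Table 2 widths."""
    def __init__(self, in_dim=1, hidden=32, kernel=4):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, hidden), MaxOut(kernel),            # first layer: unrestricted weights
            nn.Linear(hidden // kernel, hidden), MaxOut(kernel),  # hidden weights kept non-negative
            nn.Linear(hidden // kernel, 1),                       # output weights kept non-negative
        )

    def forward(self, x):
        return self.layers(x)

    def project_hidden_weights_nonnegative(self):
        # Clamp every linear layer after the first affine map to non-negative weights.
        with torch.no_grad():
            for layer in list(self.layers)[2:]:
                if isinstance(layer, nn.Linear):
                    layer.weight.clamp_(min=0.0)


u = SimpleICMN()
u.project_hidden_weights_nonnegative()
print(u(torch.randn(5, 1)).shape)  # torch.Size([5, 1])
```

Clamping only the weights after the first affine layer suffices for input convexity: a maxout of affine functions of the input is convex, and non-negative linear combinations followed by further maxouts preserve convexity.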
Data In our experiments, we use the following publicly available data: (1) the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset, released under the MIT license, and (2) the Github icon silhouette, which was copied from https://github.com/CW-Huang/CP-Flow/blob/main/imgs/github.png. CIFAR-10 is not known to contain personally identifiable information or offensive content.
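For completeness, here is a minimal sketch of the 95%/5% CIFAR-10 train/validation split used for FID evaluation in the image generation experiments above; it assumes torchvision is available, and the transform and random seed are our own choices rather than the paper's.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Load the 50,000-image CIFAR-10 training set and hold out 5% for validation/FID.
train_full = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
n_val = int(0.05 * len(train_full))
train_set, val_set = random_split(train_full, [len(train_full) - n_val, n_val],
                                  generator=torch.Generator().manual_seed(0))
print(len(train_set), len(val_set))  # 47500 2500
```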