# Generalised Lipschitz Regularisation Equals Distributional Robustness

Zac Cranko*¹, Zhan Shi*², Xinhua Zhang², Richard Nock³, Simon Kornblith³

*Equal contribution. ¹Universität Tübingen, Tübingen, Germany. ²University of Illinois at Chicago, IL, USA. ³Google Brain. Correspondence to: Xinhua Zhang.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

**Abstract.** The problem of adversarial examples has highlighted the need for a theory of regularisation that is general enough to apply to exotic function classes, such as universal approximators. In response, we have been able to significantly sharpen existing results regarding the relationship between distributional robustness and regularisation, when defined with a transportation cost uncertainty set. The theory allows us to characterise the conditions under which the distributional robustness equals a Lipschitz-regularised model, and to tightly quantify, for the first time, the slackness under very mild assumptions. As a theoretical application we show a new result explicating the connection between adversarial learning and distributional robustness. We then give new results for how to achieve Lipschitz regularisation of kernel classifiers, which are demonstrated experimentally.

## 1. Introduction

When learning a statistical model, it is rare that one has complete access to the distribution. More often one approximates the risk minimisation by an empirical risk, using a sequence of samples from the distribution. In practice this can be problematic, particularly when the curse of dimensionality is in full force: it is difficult to (a) know with certainty that one has enough samples, and (b) guarantee good performance away from the data. Both of these problems can, in effect, be cast as problems of ensuring generalisation. A remedy for both has been proposed in the form of a modification to the risk minimisation framework, wherein we integrate a certain amount of distrust of the distribution. This distrust results in a guarantee of worst-case performance if it turns out later that the distribution was specified imprecisely, improving generalisation.

In order to make this notion of distrust concrete, we introduce some mathematical notation. The set of Borel probability measures on an outcome space Ω is P(Ω). A loss function is a mapping f : Ω → R̄ so that f(ω) is the loss incurred with some prediction under the outcome ω ∈ Ω. For example, if Ω = X × Y then f_v(x, y) = (v(x) − y)² could be a loss function for regression or classification with some classifier v : X → Y. For a distribution µ ∈ P(Ω) we replace the objective in the classical risk minimisation min_v E_µ[f_v] with the robust Bayes risk

$$\sup_{\nu \in B_c(\mu, r)} E_\nu[f], \tag{rB}$$

where B_c(µ, r) ⊆ P(Ω) is a set containing µ, called the uncertainty set (viz. Berger, 1993; Vidakovic, 2000; Grünwald & Dawid, 2004, §4). It is in this way that we introduce distrust into the classical risk minimisation: by instead minimising the worst-case risk over a set of distributions.

It is sometimes the case that for an uncertainty set B_c(µ, r) ⊆ P(Ω) there is a functional lip_c : R̄^Ω → R̄_{≥0} (not necessarily the usual Lipschitz constant) so that

$$\sup_{\nu \in B_c(\mu, r)} E_\nu[f] \le E_\mu[f] + r\,\mathrm{lip}_c(f). \tag{L}$$
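As a purely illustrative sketch of what (L) asserts (ours, not part of the original development), consider an empirical distribution on the real line with cost c(x, y) = |x − y| and the 1-Lipschitz loss f = |·|; pushing every sample a distance r away from the origin is a feasible member of B_c(µ, r), so its risk cannot exceed the regularised bound:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)            # samples of an empirical mu on the real line
r, lip_f = 0.5, 1.0                  # transport budget and lip_c(f) for f = |.|, c(x, y) = |x - y|

# A feasible nu in B_c(mu, r): push every sample a distance r away from the origin,
# so the transport cost E_mu |T(x) - x| is exactly r.
x_pushed = x + r * np.sign(x)

lhs = np.abs(x_pushed).mean()        # E_nu[f], a lower bound on the robust Bayes risk (rB)
rhs = np.abs(x).mean() + r * lip_f   # the regularised risk on the right of (L)
print(f"E_nu[f] = {lhs:.4f}  <=  E_mu[f] + r * lip_c(f) = {rhs:.4f}")
```

For this particular f the two sides coincide, which previews the equality cases characterised in §3; for losses that do not attain their Lipschitz slope, (L) is strict and Theorem 1 below quantifies the gap.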
Results like (L) have been studied in the literature; however, these usually make onerous assumptions on the structure of the loss function or model class (Shafieezadeh-Abadeh et al., 2019; Blanchet et al., 2019), or on the cost function underpinning the uncertainty set (Kuhn et al., 2019), thus ruling out application to many common machine learning and statistical techniques. Therefore, in §3, our first major contribution is to revisit such a result using a new proof technique that relies on the difference-convex optimisation literature to strictly generalise and improve upon several well-known related results (summarised in Table 1). In particular, a major novelty of our approach lies with the characterisation of when (L) holds as an equality, and when the bound is tight. These characterisations are quite involved and are as important as the inequality (L) itself.

In practice, however, the evaluation of the Lipschitz constant is NP-hard for neural networks (Scaman & Virmaux, 2018), compelling approximations of it, or the explicit engineering of Lipschitz layers and analysis of the resulting expressiveness in specific cases (e.g., the ℓ∞-norm; Anil et al., 2019).

Table 1: Comparison of results related to (L). Assumptions listed in boldface are the weakest.

| Reference | (L) | f | c | µ | X |
| --- | --- | --- | --- | --- | --- |
| Shafieezadeh-Abadeh et al. (2019, Thm. 14) | = | convex Lipschitz margin loss with linear classifier | norm | empirical dist. | R^d |
| Kuhn et al. (2019, Thm. 5) | ≤ | upper semicontinuous | norm | empirical dist. | R^d |
| Kuhn et al. (2019, Thm. 10) | = | convex, Lipschitz | norm | empirical dist. | R^d |
| Gao & Kleywegt (2016, Cor. 2 (iv)) | ≤ | similar to generalised Lipschitz | p-metric | empirical dist. | R^d |
| Theorem 1 (this paper) | = | **generalised Lipschitz** | **convex, k-positively homogeneous** | **probability measure** | **separable Banach space** |

By comparison, kernel machines have a reproducing kernel Hilbert space (RKHS) encompassing a family of models that are universal (Micchelli et al., 2006). Our second major contribution, in §4, is to show that product kernels, such as Gaussian kernels, have a Lipschitz constant that can be efficiently approximated and optimised with high probability. Using the Nyström approximation (Williams & Seeger, 2000; Drineas & Mahoney, 2005), we show that an approximation error of ε requires only O(1/ε²) samples. Such a sampling-based approach also leads to a single convex constraint, making it scalable to large sample sizes, even with an interior-point solver (§5). As our experiments show, this method achieves higher robustness than the state of the art (Cisse et al., 2017; Anil et al., 2019).

## 2. Preliminaries

Let R̄ := [−∞, ∞] and R̄_{≥0} := [0, ∞], with similar notation for the real numbers. Let [n] denote the set {1, . . . , n} for n ∈ N. Unless otherwise specified, X, Y, Ω are topological outcome spaces. Often X will be used when there is some linear structure, so that Ω = X × Y may be interpreted as the classical outcome space for classification problems (cf. Vapnik, 2000). In particular, in all cases X and Y can be taken to be R^d and {1, . . . , k} respectively. The Dirac measure at some point ω ∈ Ω is δ_ω ∈ P(Ω), and the set of Borel mappings X → Y is L⁰(X, Y). For µ ∈ P(Ω), denote by L^p(Ω, µ) the Lebesgue space of functions f ∈ L⁰(Ω, R) satisfying (∫ |f(ω)|^p µ(dω))^{1/p} < ∞ for p ≥ 1. The continuous real functions on Ω are collected in C(Ω).
In many of our subsequent formulas it is more convenient to write an expectation directly as an integral: E_µ[f] = ∫ f dµ. For two measures µ, ν ∈ P(Ω), the set of (µ, ν)-couplings is Π(µ, ν) ⊆ P(Ω × Ω), where π ∈ Π(µ, ν) if and only if the marginals of π are µ and ν. For a cost function c : Ω × Ω → R̄, the c-transportation cost of µ, ν ∈ P(Ω) is cost_c(µ, ν) := inf_{π∈Π(µ,ν)} ∫ c dπ. The c-transportation cost ball of radius r ≥ 0 centred at µ ∈ P(Ω) is B_c(µ, r) := {ν ∈ P(Ω) | cost_c(µ, ν) ≤ r}, and serves as our uncertainty set. The least c-Lipschitz constant (cf. Cranko et al., 2019) of a function f : X → R̄ is the number lip_c(f) := inf Λ_c(f), where Λ_c(f) := {λ ≥ 0 | ∀x, y ∈ X : |f(x) − f(y)| ≤ λ c(x, y)}. Thus when (X, d) is a metric space, lip_d(f) agrees with the usual Lipschitz notion. When c maps X → R̄, for example when c is a norm, we let c(x, y) := c(x − y) for all x, y ∈ X. A function f : X → R̄ is called k-positively homogeneous if, for all a > 0, f(ax) = a^k f(x) for all x ∈ X. Throughout we always assume k ≥ 1. To a function f : X → R̄ we associate another function co f : X → R̄, called the convex envelope of f, defined to be the greatest closed convex function that minorises f. The quantity Δ(f) := sup_{x∈X}(f(x) − co f(x)) was first suggested by Aubin & Ekeland (1976) to quantify the lack of convexity of a function f, and has since been shown to be of considerable interest for, among other things, bounding the duality gap in nonconvex optimisation (cf. Lemaréchal & Renaud, 2001; Udell & Boyd, 2016; Askari et al., 2019; Kerdreux et al., 2019). In particular, observe Δ(f) = 0 ⟺ f = co f ⟺ f is closed convex. While it may seem like somewhat of an intractable quantity, Δ(f) can be estimated in principle; details are included in the supplementary material (Supplement B). Complete proofs of all technical results are relegated to the supplementary material.

## 3. Distributional robustness

In this section we present our major result regarding identities of the form (L).

**Theorem 1.** Suppose X is a separable Banach space and fix µ ∈ P(X). Suppose c : X → R̄_{≥0} is closed convex, k-positively homogeneous, and f ∈ L¹(X, µ) is upper semicontinuous with lip_c(f) < ∞. Then for all r ≥ 0 there exists Δ_{f,c,r}(µ) ≥ 0 so that

$$\sup_{\nu\in B_c(\mu,r)} \int f\,d\nu + \Delta_{f,c,r}(\mu) = \int f\,d\mu + r\,\mathrm{lip}_c(f), \tag{1}$$

and

$$\Delta_{f,c,r}(\mu) \le r\,\mathrm{lip}_c(f) - \max\{0,\; r\,\mathrm{lip}_c(\mathrm{co}\, f) - E_\mu[f - \mathrm{co}\, f]\}. \tag{2}$$

Observe that when f is closed convex, (2) implies Δ_{f,c,r}(µ) = 0. A summary of the results Theorem 1 improves upon is presented in Table 1, and a more detailed discussion follows in the supplementary material (Supplement A).

Table 2: Comparison of results related to Theorem 2. Assumptions listed in boldface are the weakest, and assumptions in red are prohibitive.

| Reference | Result | f | c | µ | X |
| --- | --- | --- | --- | --- | --- |
| Staib & Jegelka (2017, Prop. 3.1) | | unclear | p-metric | unclear | metric space |
| Shafieezadeh-Abadeh et al. (2019, Thm. 12) | =, under additional strong regularity | Lipschitz margin loss with linear classifier | norm | empirical dist. | R^d |
| Gao & Kleywegt (2016, Cor. 2 (ii)) | = | concave | p-metric | empirical dist. | convex subset |
| Theorem 2 (this paper) | = | **continuous** | | **non-atomic probability measure, compact support** | **separable Banach space** |

**Proposition 1.** Suppose X is a separable Banach space. Suppose c : X → R̄_{≥0} satisfies the conditions of Theorem 1, and f ∈ ⋂_{µ∈P(X₀)} L¹(X, µ) is upper semicontinuous, has lip_c(f) < ∞, and attains its maximum on X₀ ⊆ X. Then for all r ≥ 0,

$$\sup_{\mu\in P(X_0)} \Delta_{f,c,r}(\mu) = r\,\mathrm{lip}_c(f) - \max\{0,\; r\,\mathrm{lip}_c(\mathrm{co}\, f) - \Delta(f)\}.$$
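The quantities appearing in Theorem 1 and Proposition 1 (the convex envelope co f, the non-convexity gap Δ(f), and the slack bound (2)) can be estimated numerically. The following sketch is our illustration, not from the paper; the toy loss f(x) = min(|x|, 1) on X = [−2, 2], the grid, the cost c(x, y) = |x − y|, and the uniform µ are assumptions made only for this example. It computes co f by biconjugation on a grid and evaluates the right-hand side of (2):

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 2001)
f = np.minimum(np.abs(xs), 1.0)                  # nonconvex toy loss f(x) = min(|x|, 1)

# Convex envelope via the biconjugate f** computed on a dual grid of slopes.
ys = np.linspace(-5.0, 5.0, 2001)
f_star = np.max(ys[:, None] * xs[None, :] - f[None, :], axis=1)      # f*(y)
co_f = np.max(xs[:, None] * ys[None, :] - f_star[None, :], axis=1)   # f**(x) ~ co f(x)

h = xs[1] - xs[0]
lip = np.abs(np.diff(f)).max() / h               # lip_c(f) for c(x, y) = |x - y|
lip_co = np.abs(np.diff(co_f)).max() / h         # lip_c(co f)
delta_f = np.max(f - co_f)                       # non-convexity gap Delta(f)

mu = np.ones_like(xs) / xs.size                  # a toy uniform mu on the grid
r = 1.0
slack_bound = r * lip - max(0.0, r * lip_co - np.sum(mu * (f - co_f)))
print(f"lip(f)={lip:.3f}  lip(co f)={lip_co:.3f}  Delta(f)={delta_f:.3f}")
print(f"bound (2) on the slack: {slack_bound:.3f}  (vs the trivial r*lip(f) = {r*lip:.3f})")
```

For this f the envelope is co f(x) = |x|/2, so lip_c(f) = 1, lip_c(co f) = 1/2 and Δ(f) = 1/2, which the grid estimates recover; the bound (2) is then strictly smaller than the trivial slack r·lip_c(f).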
**Remark 1.** Proposition 1 shows that for any compact subset X₀ ⊆ R^d (such as the set of d-dimensional images, [0, 1]^d) the bound (1) is tight with respect to the set of distributions supported there, for any upper semicontinuous f ∈ ⋂_{µ∈P(X₀)} L¹(X, µ). It is the first time, to our knowledge, that the slackness (2) has been characterised tightly. Remark 3 (in Appendix A.1) discusses a similar way to construct such a bound from some existing results in the literature, and compares it to Theorem 2.

### 3.1. Adversarial learning

Szegedy et al. (2014) observe that deep neural networks, trained for image classification using empirical risk minimisation, exhibit a curious behaviour whereby an image x ∈ R^d and a small, imperceptible amount of noise δ_x ∈ R^d may be found so that the network classifies x and x + δ_x differently. Imagining that the troublesome noise vector is sought by an adversary seeking to defeat the classifier, such pairs have come to be known as adversarial examples (Moosavi-Dezfooli et al., 2017; Goodfellow et al., 2015; Kurakin et al., 2017).

The closed c-ball of radius r ≥ 0, centred at x ∈ X, is denoted B_c(x, r) := {y ∈ X | c(x − y) ≤ r}, for c : X → R̄. Let X be a linear space and Y a topological space. Fix µ ∈ P(X × Y). The following objective has been proposed as a means of learning classifiers that are robust to adversarial examples (viz. Madry et al., 2018; Shaham et al., 2018; Carlini & Wagner, 2017; Cisse et al., 2017):

$$\int \sup_{\delta \in B_c(0, r)} f(x + \delta, y)\, \mu(dx\, dy), \tag{3}$$

where f : X × Y → R̄ is the loss of some classifier.

**Theorem 2.** Suppose (X, c₀) is a separable Banach space. Fix µ ∈ P(X) and for r ≥ 0 let R_µ(r) := {g ∈ L⁰(X, R̄_{≥0}) | ∫ g dµ ≤ r}. Then for f ∈ L⁰(X, R̄) and r ≥ 0 there is

$$\sup_{g \in R_\mu(r)} \int \mu(d\omega) \sup_{\omega' \in B_{c_0}(\omega,\, g(\omega))} f(\omega') \;\le\; \sup_{\nu \in B_{c_0}(\mu, r)} \int f\, d\nu. \tag{4}$$

If f is continuous and µ is non-atomically concentrated with compact support, then (4) is an equality.

**Remark 2.** By observing that the constant function g ≡ r is included in the set R_µ(r), it is easy to see that the adversarial risk (3) is upper bounded as follows:

$$\int \sup_{\delta \in B_c(0, r)} f(x + \delta, y)\, \mu(dx\, dy) = \int \mu(d\omega) \sup_{\omega' \in B_c(\omega, r)} f(\omega') \;\le\; \sup_{g \in R_\mu(r)} \int \mu(d\omega) \sup_{\omega' \in B_c(\omega,\, g(\omega))} f(\omega'), \tag{5}$$

where, in the equality, we extend c₀ to a metric c on X × Y in the same way as (B.6).

Theorem 2 generalises and subsumes a number of existing results to relate the adversarial risk minimisation (3) to the distributionally robust risk in Theorem 1. A discussion and summary of the improvements made by Theorem 2 on other comparable results is presented in §3.2, with a table that is similar to Table 1.

A simulation demonstrates that the sum of the gaps from Theorems 1 and 2 and Equation (5) is relatively low. We randomly generated 100 Gaussian kernel classifiers f = Σ_{i=1}^{100} γ_i k(x_i, ·), where x_i was sampled from the MNIST dataset and γ_i was sampled uniformly from [−2, 2]. The bandwidth was set to the median of pairwise distances. In Figure 3, the x-axis is the adversarial risk (LHS of (5), i.e., (3)), where the perturbation δ is bounded in an ℓ_p ball and computed by projected gradient descent (PGD). The y-axis is the Lipschitz-regularised empirical risk (RHS of (1)). The scattered dots lie close to the diagonal, demonstrating that the above bounds are tight in practice.
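A low-dimensional, self-contained stand-in for this simulation (our sketch; the dimensions, the uniform data distribution, and the Monte-Carlo Lipschitz estimate are assumptions, unlike the MNIST/BFGS setup used for Figure 3):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, sigma, r = 5, 50, 1.0, 0.3          # toy stand-ins for the MNIST setting
X = rng.uniform(0, 1, size=(m, d))        # expansion points x_i
gamma = rng.uniform(-2, 2, size=m)        # coefficients gamma_i

def f(x):                                  # f(x) = sum_i gamma_i k(x_i, x), Gaussian kernel
    k = np.exp(-((X - x) ** 2).sum(-1) / (2 * sigma ** 2))
    return gamma @ k

def grad_f(x):                             # analytic gradient of f
    k = np.exp(-((X - x) ** 2).sum(-1) / (2 * sigma ** 2))
    return ((gamma * k) @ (X - x)) / sigma ** 2

def pgd_value(x, steps=100, step=0.05):    # l2-bounded maximisation of f around x
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = grad_f(x + delta)
        delta += step * g / (np.linalg.norm(g) + 1e-12)
        if np.linalg.norm(delta) > r:      # project back onto the l2 ball of radius r
            delta *= r / np.linalg.norm(delta)
    return f(x + delta)

samples = rng.uniform(0, 1, size=(200, d))                  # draws standing in for mu
adv_risk = np.mean([pgd_value(x) for x in samples])         # estimate of the LHS of (5)

# Crude estimate of lip(f): largest gradient norm over a cover of the perturbed region.
lip = max(np.linalg.norm(grad_f(x)) for x in rng.uniform(-r, 1 + r, size=(4000, d)))
reg_risk = np.mean([f(x) for x in samples]) + r * lip       # estimate of the RHS of (1)
print(f"adversarial risk ~ {adv_risk:.3f}  <=  regularised risk ~ {reg_risk:.3f}")
```

Both printed numbers are Monte-Carlo estimates, but their ordering mirrors the chain of bounds running from (5) through (4) to (1).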
### 3.2. Results related to Theorem 2

Similarly to Theorem 1, Theorem 2 improves upon a number of existing results in the literature. These are listed in Table 2. The majority of the other results mentioned are formulated with respect to an empirical distribution, that is, an average of Dirac masses. Of course any finite set is compact, and so these empirical distributions satisfy the concentration assumption. Staib & Jegelka (2017, Prop. 3.1) also state an equality result, but this is in the setting of an ∞-Wasserstein ball, which is a much more exotic object (viz. Champion et al., 2008), and it is not obvious how it relates to the other results, so we choose to omit it from Table 2.

## 4. Lipschitz regularisation for kernel methods

Theorems 1 and 2 open up a new path to optimising the adversarial risk (3) by Lipschitz regularisation (RHS of (1)). In general, however, it is still hard to compute the Lipschitz constant of a nonlinear model (Scaman & Virmaux, 2018). Interestingly, we will show that for some types of kernels this can be done efficiently for functions in the corresponding RKHS, which is rich enough to approximate continuous functions on a bounded domain (Micchelli et al., 2006). Thanks to the connections between kernel methods and deep learning, this technique also potentially benefits the latter. For example, ℓ1-regularised neural networks are compactly contained in the RKHS of multi-layer inverse kernels k(x, y) = (2 − x⊤y)^{−1} with ‖x‖₂ ≤ 1 and ‖y‖₂ ≤ 1 (Zhang et al., 2016, Lem. 1 & Thm. 1; Shalev-Shwartz et al., 2011; Zhang et al., 2017), and possibly of Gaussian kernels k(x, y) = exp(−‖x − y‖²/(2σ²)) (Shalev-Shwartz et al., 2011, §5).

Consider a Mercer kernel k on a convex domain X ⊆ R^d, with the corresponding RKHS denoted H. The standard kernel method seeks a discriminant function f from H in the conventional form of a finite kernel expansion, f = Σ_a γ_a k(x_a, ·), such that the regularised empirical risk can be minimised with the standard (hinge) loss and RKHS norm. We start with real-valued f for univariate output such as binary classification, and later extend it to multiclass. Our goal here is to additionally enforce, while retaining a convex optimisation in γ := {γ_a}, that the Lipschitz constant of f falls below a prescribed threshold L > 0, which is equivalent to sup_{x∈X} ‖∇f(x)‖₂ ≤ L thanks to the convexity of X.

A quick but primitive solution is to piggyback on the standard RKHS norm constraint ‖f‖_H ≤ C, in view that it already induces an upper bound on ‖∇f(x)‖₂, as shown in Example 3.23 of Shafieezadeh-Abadeh et al. (2019):

$$\|\nabla f(x)\|_2 \le \|f\|_H \sup_{z>0} \frac{g(z)}{z}, \tag{6}$$

where g(z) := sup_{x,x′∈X: ‖x−x′‖₂=z} ‖k(x, ·) − k(x′, ·)‖_H. For Gaussian kernels, g(z) = max{σ^{−1}, 1} z. For exponential and inverse kernels, g(z) = z (Bietti & Mairal, 2019). Bietti et al. (2019) justified that the RKHS norm of a neural network may serve as a surrogate for Lipschitz regularisation. But the quality of such an approximation, i.e., the gap in (6), can be loose, as we will see later in Figure 4. Besides, C and L are supposed to be independent parameters.
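As an aside, (6) is straightforward to evaluate for a Gaussian kernel machine. The following sketch (our illustration; the toy dimensions and random coefficients are assumptions) compares it with a crude Monte-Carlo estimate of the quantity it bounds, sup_x ‖∇f(x)‖₂:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, sigma = 5, 50, 1.5
X = rng.uniform(0, 1, size=(m, d))          # expansion points x_a
gamma = rng.uniform(-2, 2, size=m)          # coefficients gamma_a

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
rkhs_norm = np.sqrt(gamma @ K @ gamma)      # ||f||_H for f = sum_a gamma_a k(x_a, .)

# Right-hand side of (6) for the Gaussian kernel, with g(z) = max{1/sigma, 1} z as above.
naive_bound = rkhs_norm * max(1.0 / sigma, 1.0)

def grad_f(x):
    k = np.exp(-((X - x) ** 2).sum(-1) / (2 * sigma ** 2))
    return ((gamma * k) @ (X - x)) / sigma ** 2

# Monte-Carlo estimate of sup_x ||grad f(x)||_2, the quantity that (6) upper-bounds.
grad_sup = max(np.linalg.norm(grad_f(x)) for x in rng.uniform(0, 1, size=(5000, d)))
print(f"sampled sup ||grad f|| = {grad_sup:.3f}   RKHS-norm bound (6) = {naive_bound:.3f}")
```

The gap between the two printed numbers is exactly the looseness discussed above and revisited in Figure 4.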
How can we tighten the approximation? A natural idea is to directly bound the gradient norm at n random locations {w_s}_{s=1}^n sampled i.i.d. from X, an approach adopted by Arbel et al. (2018, Appendix D). These constraints are obviously convex in γ. But how many samples are needed to ensure ‖∇f(x)‖₂ ≤ L + ε for all x ∈ X? Unfortunately, as shown in Appendix C.1, n may have to grow exponentially, on the order of 1/ε^d for a d-dimensional space. Therefore we seek a more efficient approach by first slightly relaxing ‖∇f(x)‖₂. Let g_j(x) := ∂_j f(x) be the partial derivative with respect to the j-th coordinate of x, and let ∂_{i,j}k(x, y) be the partial derivative with respect to x_i and y_j; i or j being 0 means no derivative.

Assuming sup_{x∈X} k(x, x) = 1 and g_j ∈ H (true for various kernels considered by Assumptions 1 and 2 below), we get the bound

$$\|\nabla f(x)\|_2^2 = \sum_{j=1}^d \langle g_j, k(x,\cdot)\rangle_H^2 \;\le\; \sup_{\phi:\|\phi\|_H=1} \sum_{j=1}^d \langle g_j, \phi\rangle_H^2 = \lambda_{\max}(G^\top G), \tag{7}$$

where λ_max evaluates the maximum eigenvalue and G := (g₁, . . . , g_d). The matrix is only a notation because each column is a function in H, and obviously the (i, j)-th entry of G⊤G is ⟨g_i, g_j⟩_H.

Why does λ_max(G⊤G) tend to provide a lower (i.e., tighter) approximation of the Lipschitz constant than (6)? To gain some intuition, note that the latter takes two steps of relaxation: |f(x) − f(x′)| ≤ ‖f‖_H ‖k(x, ·) − k(x′, ·)‖_H and ‖k(x, ·) − k(x′, ·)‖_H ≤ ‖x − x′‖₂ sup_{z>0} g(z)/z. They attain equality at potentially very different (x, x′) pairs, and the former depends on f while the latter does not. In contrast, our bound in (7) only relaxes once, leveraging the efficiently approximable partial derivatives g_j in §4.1 and capturing the correlations across different coordinates j by the eigenvalue.

[Figure 3: Empirical evaluation of the sum of the gaps from Theorems 1 and 2 (panel (b): ‖δ‖∞ ≤ 0.3). The Lipschitz constants sup_{x∈X} ‖∇f(x)‖_q (left: p = 2; right: p = ∞; 1/p + 1/q = 1) were estimated by BFGS.]

[Figure 4: Comparison of λ_max(G⊤G) and the RHS of (6) as upper bounds for the Lipschitz constant; smaller values are tighter. Panels: (a) 5-layer inverse kernel, (b) Gaussian kernel. We sampled 100 functions in the same way as in Figure 3.]

An empirical comparison is further shown in Figure 4, where λ_max(G⊤G) was computed from (9) derived below, and the landmarks {w_s} consisted of the whole training set; drawing more samples led to little difference. The gap is smaller when the bandwidth σ is larger, making functions smoother. To be fair, both Figure 3 and Figure 4 set σ to the median of pairwise distances, a common practice. Such a positive result motivated us to develop refined algorithms to address the only remaining obstacle to leveraging λ_max(G⊤G): a computational strategy.

Interestingly, it is readily approximable in both theory and practice. Indeed, the role of g_j can be played by its Nyström approximation g̃_j ∈ R^n (Williams & Seeger, 2000; Drineas & Mahoney, 2005): with K := [k(w_i, w_{i′})]_{i,i′} and Z := (k(w₁, ·), k(w₂, ·), . . . , k(w_n, ·)),

$$\tilde g_j := K^{-1/2}\big(g_j(w_1), \ldots, g_j(w_n)\big)^\top = (Z^\top Z)^{-1/2} Z^\top g_j, \tag{8}$$

because g_j(w_i) = ⟨g_j, k(w_i, ·)⟩_H. Then, to ensure λ_max(G⊤G) ≤ L² + ε, intuitively we can enforce λ_max(G̃⊤G̃) ≤ L², where G̃ := (g̃₁, . . . , g̃_d). This retains the convexity of the constraint on γ. However, to guarantee an ε error, the number of samples n required is generally exponential (Barron, 1994). Fortunately, we will next show that n can be reduced to polynomial for quite a general class of kernels that possess some decomposed structure.
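A sketch of the holistic estimate in (8) for a Gaussian kernel machine (our illustration; the toy sizes, the uniform landmark distribution, and the jitter term are assumptions). Writing D for the n × d matrix with entries D_{ij} = g_j(w_i), we have G̃ = K^{−1/2}D and hence G̃⊤G̃ = D⊤K^{−1}D:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n, sigma = 5, 50, 200, 1.5
X = rng.uniform(0, 1, size=(m, d))         # expansion points x_a of f
gamma = rng.uniform(-2, 2, size=m)
W = rng.uniform(0, 1, size=(n, d))         # Nystrom landmarks w_1, ..., w_n

def kmat(A, B):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))

def grad_f(x):                              # rows of D below: D[i, j] = g_j(w_i)
    k = np.exp(-((X - x) ** 2).sum(-1) / (2 * sigma ** 2))
    return ((gamma * k) @ (X - x)) / sigma ** 2

D = np.stack([grad_f(w) for w in W])        # n x d
K = kmat(W, W) + 1e-10 * np.eye(n)          # landmark Gram matrix (with a small jitter)

# (8): tilde g_j = K^{-1/2} D[:, j], hence tilde G^T tilde G = D^T K^{-1} D.
GtG = D.T @ np.linalg.solve(K, D)
lam = np.linalg.eigvalsh(GtG).max()

grad_sq = max((grad_f(x) ** 2).sum() for x in rng.uniform(0, 1, size=(5000, d)))
print(f"sampled sup ||grad f||^2 = {grad_sq:.3f}   Nystrom lambda_max = {lam:.3f}")
```

With enough landmarks the printed eigenvalue tracks λ_max(G⊤G) from (7), which in turn upper-bounds the sampled squared gradient norm.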
### 4.1. A Nyström approximation for product kernels

A number of kernels factor multiplicatively over the coordinates, such as periodic kernels (MacKay, 1998), Gaussian kernels, and Laplacian kernels. Let us consider k(x, y) = Π_{j=1}^d k₀(x_j, y_j), where X = (X₀)^d and k₀ is a base kernel on an interval X₀. Let the RKHS of k₀ be H₀, and let µ₀ be a finite Borel measure with supp[µ₀] = X₀. Periodic kernels, for instance, take k₀(x_j, y_j) to be the exponential of a squared sinusoid of x_j − y_j (MacKay, 1998). We stress that product kernels can induce very rich function spaces. For example, the Gaussian kernel is universal (Micchelli et al., 2006), meaning that its RKHS is dense in the space of continuous functions in the ∞-norm over any bounded domain. Also note that the factorisation of the kernel k does not imply that a function f ∈ H must factor as a product over coordinates.

The key benefit of this decomposition of k is that the derivative ∂_{0,1}k(x, y) can be written as ∂_{0,1}k₀(x₁, y₁) Π_{j=2}^d k₀(x_j, y_j). Since k₀(x_j, y_j) can be easily dealt with, approximation will be needed only for ∂_{0,1}k₀(x₁, y₁). Applying this idea to g_j = Σ_a γ_a ∂_{0,j}k(x^a, ·), we can derive, for example,

$$\langle g_1, g_2\rangle_H = \sum_{a,b} \gamma_a \gamma_b\, \partial_{0,1}k_0(x^a_1, x^b_1)\, \partial_{0,1}k_0(x^b_2, x^a_2) \prod_{j\ge 3} k_0(x^a_j, x^b_j). \tag{9}$$

So the off-diagonal entries of G⊤G can be computed exactly. But this is not the case for the diagonal entries, because M_{a,b} := ⟨∂_{0,1}k₀(x^a₁, ·), ∂_{0,1}k₀(x^b₁, ·)⟩_{H₀} is not equal to ∂_{1,1}k₀(x^a₁, x^b₁). This differs from the (∂f(x)/∂x₁)² used in Arbel et al. (2018), which can be computed with more ease via ⟨f, [∂_{1,0}k(x, ·) ⊗ ∂_{1,0}k(x, ·)] f⟩_H. Now it is natural to apply a Nyström approximation to the M_{a,b} in the diagonal, using samples {w¹₁, . . . , wⁿ₁} from µ₀:

$$M_{a,b} \approx \big(Z_1^\top \partial_{0,1}k_0(x^a_1, \cdot)\big)^\top (Z_1^\top Z_1)^{-1} \big(Z_1^\top \partial_{0,1}k_0(x^b_1, \cdot)\big), \qquad Z_1 := \big(k_0(w^1_1, \cdot), \ldots, k_0(w^n_1, \cdot)\big), \tag{10}$$

where Z₁⊤ ∂_{0,1}k₀(x^a₁, ·) = (∂_{0,1}k₀(x^a₁, w¹₁), . . . , ∂_{0,1}k₀(x^a₁, wⁿ₁))⊤, and similarly for Z₁⊤ ∂_{0,1}k₀(x^b₁, ·). Denote the resulting approximation of G⊤G by Ĝ. Clearly, λ_max(Ĝ) ≤ L² is a convex constraint on γ, based on i.i.d. samples {w^s_j | s ∈ [n], j ∈ [d]} from µ₀.

The overall convex training procedure is summarised in Algorithm 1, where the goal is to train a kernel SVM with the additional constraint that the Lipschitz constant is at most L. More detailed formulations are available in Appendix D. The three different ways to enforce the Lipschitz constant discussed above correspond to options 1 to 3. For practical efficiency, we greedily expand the Nyström landmark set S by locally maximising the norm of the gradient at each iteration (step b; a sketch of this step follows the algorithm below). Figure 7 in §5.1 will show that the Nyström-based algorithm is much more efficient than the brute-force counterpart, and that the greedy approach significantly reduces the number of samples for both algorithms.

**Algorithm 1** Training an L-Lipschitz binary SVM
1. Randomly sample S = {w₁, . . . , w_n} from X.
2. For i = 1, 2, . . .:
   1. Train an SVM under one of the following constraints:
      - Option 1 (brute force): ‖∇f(w)‖₂² ≤ L² for all w ∈ S.
      - Option 2 (Nyström, holistic): λ_max(G̃⊤G̃) ≤ L², with G̃ as in (8) built on S.
      - Option 3 (Nyström, coordinate-wise): λ_max(Ĝ) ≤ L², with Ĝ as in (10) built on S.
   2. Let the trained SVM be f⁽ⁱ⁾.
   3. Add a new w to S by one of the following methods:
      - Option a (random): randomly sample w from X.
      - Option b (greedy): find argmax_{x∈X} ‖∇f⁽ⁱ⁾(x)‖ by local optimisation (L-BFGS with 10 random initialisations) and add the distinct results.
   4. Return if L⁽ⁱ⁾ := max_{x∈X} ‖∇f⁽ⁱ⁾(x)‖ falls below L.
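A sketch of the greedy expansion step (b) (our illustration; the toy Gaussian-kernel f, the box domain [0, 1]^d, and the distinctness tolerance are assumptions):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
d, m, sigma = 5, 50, 1.5
X = rng.uniform(0, 1, size=(m, d))
gamma = rng.uniform(-2, 2, size=m)

def grad_f(x):
    k = np.exp(-((X - x) ** 2).sum(-1) / (2 * sigma ** 2))
    return ((gamma * k) @ (X - x)) / sigma ** 2

def neg_grad_norm_sq(x):                    # objective for the greedy step: -||grad f(x)||^2
    return -float((grad_f(x) ** 2).sum())

def greedy_landmarks(n_starts=10, tol=1e-3):
    """Step (b): multi-start local maximisation of ||grad f|| over X = [0, 1]^d."""
    found = []
    for _ in range(n_starts):
        x0 = rng.uniform(0, 1, size=d)
        res = minimize(neg_grad_norm_sq, x0, method="L-BFGS-B",
                       bounds=[(0.0, 1.0)] * d)
        w = res.x
        if all(np.linalg.norm(w - v) > tol for v in found):   # keep distinct maximisers
            found.append(w)
    return found

S_new = greedy_landmarks()
print(f"adding {len(S_new)} new landmarks; largest local gradient norm found:",
      f"{max(-neg_grad_norm_sq(w) for w in S_new) ** 0.5:.3f}")
```

Each call returns up to 10 distinct local maximisers of ‖∇f⁽ⁱ⁾‖, which are appended to S before the SVM is retrained.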
### 4.2. General sample complexity and assumptions

Finally, it is important to analyse how many samples w^s_j are needed such that, with high probability, λ_max(Ĝ) ≤ L² implies λ_max(G⊤G) ≤ L² + ε. Fortunately, product kernels only require approximation bounds for each coordinate, making the sample complexity immune to exponential growth in the dimensionality d. Specifically, we first consider base kernels k₀ with scalar input, i.e., X₀ ⊆ R. Recall from Steinwart & Christmann (2008, §4) that the integral operator for k₀ and µ₀ is T_{k₀} := I ∘ S_{k₀}, where S_{k₀} : L²(X₀, µ₀) → C(X₀) operates according to (S_{k₀} f)(x) := ∫ k₀(x, y) f(y) µ₀(dy) for all f ∈ L²(X₀, µ₀), and I : C(X₀) ↪ L²(X₀, µ₀) is the inclusion operator. By the spectral theorem, if T_{k₀} is compact, then there is an at most countable orthonormal set {e_j}_{j∈J} of L²(X₀, µ₀) and {λ_j}_{j∈J} with λ₁ ≥ λ₂ ≥ · · · > 0 such that T_{k₀} f = Σ_{j∈J} λ_j ⟨f, e_j⟩_{L²(X₀,µ₀)} e_j for all f ∈ L²(X₀, µ₀). It follows that φ_j := √λ_j e_j is an orthonormal basis of H₀ (cf. Steinwart & Christmann, 2008). Our proof is built upon the following two assumptions on the base kernel. The first asserts that, fixing x, the energy of k₀(x, ·) and ∂_{0,1}k₀(x, ·) concentrates on the leading eigenfunctions.

**Assumption 1.** Suppose k₀(x, x) = 1 and ∂_{0,1}k₀(x, ·) ∈ H₀ for all x ∈ X₀. For all ε > 0, there exists N_ε ∈ N such that the tail energy of ∂_{0,1}k₀(x, ·) and of k₀(x, ·) beyond the N_ε-th eigenpair is less than ε, uniformly for all x ∈ X₀. That is, denoting Φ_m := (φ₁, . . . , φ_m), N_ε < ∞ is the smallest m such that, for all x ∈ X₀,

$$\big\|\partial_{0,1}k_0(x,\cdot) - \Phi_m \Phi_m^\top \partial_{0,1}k_0(x,\cdot)\big\|_{H_0} \le \varepsilon \quad\text{and}\quad \big\|k_0(x,\cdot) - \Phi_m \Phi_m^\top k_0(x,\cdot)\big\|_{H_0} \le \varepsilon.$$

The second assumption asserts the smoothness and range of the eigenfunctions in a uniform sense.

**Assumption 2.** Under Assumption 1, {e_j(x) : j ∈ [N_ε]} is uniformly bounded over x ∈ X₀, and the RKHS inner product of ∂_{0,1}k₀(x, ·) with {e_j : j ∈ [N_ε]} is also uniformly bounded over x ∈ X₀:

$$M_\varepsilon := \max_{j\in[N_\varepsilon]} \sup_{x\in X_0} \big|\langle \partial_{0,1}k_0(x,\cdot),\, e_j\rangle_{H_0}\big| < \infty \quad\text{and}\quad Q_\varepsilon := \max_{j\in[N_\varepsilon]} \sup_{x\in X_0} |e_j(x)| < \infty.$$

**Theorem 3.** Suppose k₀, X₀, and µ₀ satisfy Assumptions 1 and 2. Let {w^s_j : s ∈ [n], j ∈ [d]} be sampled i.i.d. from µ₀. Then for any f whose coordinate-wise Nyström approximation (9) and (10) satisfies λ_max(Ĝ) ≤ L², the Lipschitz condition λ_max(G⊤G) ≤ L² + ε is met with probability 1 − δ, as long as n grows as Õ(1/ε²), almost independently of d. Here Õ hides all poly-log terms except those involving d.

The proof is deferred to Appendix C.3. The log d dependence on the dimension d is interesting, but not surprising: after all, only the diagonal entries of G⊤G need approximation, and the quantity of interest is its spectral norm, not its Frobenius norm. Compared with the brute-force approach in Arbel et al. (2018), which incurs exponential sample complexity, we manage to reduce it to 1/ε² by making two assumptions, which interestingly hold true for important classes of kernels.

**Theorem 4.** Assumptions 1 and 2 hold for the periodic kernel and the Gaussian kernel with O(1) values of N_ε, M_ε, and Q_ε.

The proofs are in Appendices C.4 and C.5. It remains open whether non-product kernels such as the inverse kernel also enjoy this polynomial sample complexity. Appendix C.6 suggests that its complexity may be quasi-polynomial.

## 5. Experimental results

We studied the empirical robustness and accuracy of the proposed Lipschitz regularisation technique for adversarial training of kernel methods, under both the Gaussian kernel and the inverse kernel. Comparisons are made with state-of-the-art defence algorithms under effective attacks.

**Datasets.** We tested on three datasets: MNIST, Fashion-MNIST, and CIFAR10. The numbers of training/validation/test examples for the three datasets are 54k/6k/10k, 54k/6k/10k, and 45k/5k/10k, respectively. Each image in MNIST and Fashion-MNIST is represented as a 784-dimensional feature vector, with each feature/pixel normalised to [0, 1]. For CIFAR10, we trained a residual network to obtain a 512-dimensional feature embedding, which was subsequently normalised to [0, 1].

**Attacks.** To evaluate the robustness of the trained models, we attacked them on test examples using the randomly initialised projected gradient descent method with 100 steps (PGD, Madry et al., 2018) under two losses: cross-entropy and the C&W loss (Carlini & Wagner, 2017). The perturbation δ was constrained in an ℓ2-norm or ℓ∞-norm ball. To evaluate robustness, we scaled the perturbation bound on δ from 0.1 to 0.6 for the ℓ∞ norm, and from 1 to 6 for the ℓ2 norm (when ‖δ‖₂ = 6, the average magnitude per coordinate is 0.214).
We normalised the gradient and fine-tuned the step size.

**Algorithms.** We compared four training algorithms. The Parseval network orthonormalises the weight matrices to enforce the Lipschitz constant (Cisse et al., 2017). We used three hidden layers of 1024 units and ReLU activation (Par-ReLU). Also considered is the Parseval network with MaxMin activations (Par-MaxMin), which enjoys much improved robustness (Anil et al., 2019). Both algorithms can be customised for ℓ2-norm or ℓ∞-norm attacks, and were trained under the corresponding norms. Using the multi-class hinge loss, they constitute strong baselines for adversarial learning. We followed the code from LNets with β = 0.5, which is equivalent to the first-order Bjorck algorithm. The final upper bound of the Lipschitz constant computed from the learned weight matrices satisfied the orthogonality constraint, as shown by Anil et al. (2019, Fig. 13). Both the Gaussian and inverse kernel machines applied Lipschitz regularisation by randomly and greedily selecting {w_s}, and they will be referred to as Gauss-Lip and Inverse-Lip, respectively. In practice, Gauss-Lip with the coordinate-wise Nyström approximation (λ_max(Ĝ) from (10)) can approximate λ_max(G⊤G) with a much smaller number of samples than the holistic approximation in (8). Furthermore, we found an even more efficient approach: inside the iterative training algorithm, we used L-BFGS to find the input that yields the steepest gradient under the current solution, and then added it to the set {w_s} (which was initialised with 15 random points). Although L-BFGS is only a local solver, this greedy approach empirically reduces the number of samples by an order of magnitude. See the empirical convergence results in §5.1. Its theoretical analysis is left for future investigation. We also applied this greedy approach to Inverse-Lip.

**Extending binary kernel machines to multiclass.** The standard kernel methods learn a discriminant function f^c := Σ_a γ^c_a k(x_a, ·) for each class c ∈ [10], based on which a large variety of multiclass classification losses can be applied, e.g., CS (Crammer & Singer, 2001), which was used in our experiments. Since the Lipschitz constant of the mapping from {f^c} to a real-valued loss is typically at most 1, it suffices to bound the Lipschitz constant of x ↦ (f¹(x), . . . , f¹⁰(x))⊤ via max_x λ_max(G(x)G(x)⊤), where G(x) := [∇f¹(x), · · · , ∇f¹⁰(x)] = [⟨g^c_j, k(x, ·)⟩_H]_{j∈[d], c∈[10]} with g^c_j := ∂_j f^c. As ‖k(x, ·)‖_H = 1, we then enforce

$$\max_{\|\phi\|_H = 1} \lambda_{\max}\!\Big(\big[\langle g^c_j, \phi\rangle_H\big]_{j\in[d],\,c\in[10]}\, \big[\langle g^c_j, \phi\rangle_H\big]^\top_{j\in[d],\,c\in[10]}\Big) \le L^2. \tag{11}$$

The LHS of (11) is amenable to the same Nyström approximation as in the binary case. Further, the principle can be extended to ℓ∞-norm attacks, whose details are in Appendix D.1. (A brute-force sketch of the criterion behind (11) is given below, after the parameter settings.)

**Parameter selection.** We used the same parameters as in Anil et al. (2019) for training Par-ReLU and Par-MaxMin. To defend against ℓ2-norm attacks, we set L = 100 for all algorithms. Gauss-Lip achieved high accuracy and robustness on the validation set with bandwidth σ = 1.5 for Fashion-MNIST and CIFAR-10, and σ = 2 for MNIST. To defend against ℓ∞-norm attacks, we set L = 1000 for all four methods, as in Anil et al. (2019). The best σ for Gauss-Lip is 1 for all datasets. Inverse-Lip used 5 layers.
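A brute-force numerical sketch of the multiclass criterion max_x λ_max(G(x)G(x)⊤) ≤ L² (our illustration; the toy dimensions, random coefficients, and random evaluation points stand in for the Nyström-constrained training problem actually used):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, C, sigma, L = 5, 50, 10, 1.5, 100.0        # C classes; toy stand-ins
X = rng.uniform(0, 1, size=(m, d))
Gamma = rng.uniform(-2, 2, size=(C, m))          # gamma^c_a for each class c

def grads_all_classes(x):
    """Columns of G(x) = [grad f^1(x), ..., grad f^C(x)] for f^c = sum_a gamma^c_a k(x_a, .)."""
    k = np.exp(-((X - x) ** 2).sum(-1) / (2 * sigma ** 2))       # (m,)
    return ((Gamma * k) @ (X - x)).T / sigma ** 2                # d x C

# Sample-based check of max_x lambda_max(G(x) G(x)^T) against the threshold L^2.
worst = max(np.linalg.eigvalsh(G @ G.T).max()
            for G in (grads_all_classes(x) for x in rng.uniform(0, 1, size=(2000, d))))
print(f"max sampled lambda_max(G G^T) = {worst:.2f}  (constraint: <= L^2 = {L ** 2:.0f})")
```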
**Results.** Figures 5 and 6 show how the test accuracy decays as an increasing amount of perturbation δ, bounded in ℓ2 and ℓ∞ norm respectively, is added to the test images. Clearly Gauss-Lip achieves higher accuracy and robustness than Par-ReLU and Par-MaxMin on the three datasets, under both ℓ2- and ℓ∞-bounded PGD attacks with the C&W loss. In contrast, Inverse-Lip only performs similarly to Par-ReLU. Interestingly, ℓ2-norm based Par-MaxMin is only slightly better than Par-ReLU under ℓ2-norm attacks, although the former does perform significantly better under ℓ∞-norm attacks. The results for cross-entropy PGD attacks are deferred to Figures 9 and 10 in Appendix E.1. There, cross-entropy PGD attackers find stronger attacks against Parseval networks but not against our kernel models. Our Gauss-Lip again significantly outperforms Par-MaxMin on all three datasets and under both the ℓ2 and ℓ∞ norms.

[Figure 5: Test accuracy under PGD attacks on the C&W approximation with an ℓ2-norm bound (0 to 6); panels: (a) MNIST, (b) Fashion-MNIST, (c) CIFAR10; methods: Gauss-Lip, Par-MaxMin, Inverse-Lip, Par-ReLU (all PGD-cw).]

[Figure 6: Test accuracy under PGD attacks on the C&W approximation with an ℓ∞-norm bound (0 to 0.6); panels and methods as in Figure 5.]

The improved robustness of Gauss-Lip does not seem to be attributable to obfuscated (masked) gradients (Athalye et al., 2018), because, as shown in Figures 5, 6, 9 and 10, an increased distortion bound does increase the attack success rate, and unbounded attacks drive the accuracy down to very low values. In practice, we also observed that random sampling finds much weaker attacks, and that taking 10 steps of PGD is much stronger than one step.

**Obfuscated gradient.** To further illustrate the properties of Gauss-Lip-trained models, we visualised large-perturbation adversarial examples with the ℓ2 norm bounded by 8. Figure 11 in Appendix E.2 shows the result of running the PGD attack for 100 steps on a Gauss-Lip-trained model using the (targeted) cross-entropy approximation. On a randomly sampled set of 10 images from MNIST, PGD successfully turned all of them into any target class by following the gradient. We further ran PGD on the C&W approximation in Figure 12, and this untargeted attack succeeds on all 10 images. In both cases, the final images are quite consistent with human perception.

### 5.1. Efficiency of enforcing the Lipschitz constant

Figure 7 plots how fast the Lipschitz constant L⁽ⁱ⁾ at iteration i is reduced by the variants 1a, 1c, 3a, and 3c of Algorithm 1, as more and more points w are added to the constraint set S. We used 400 random examples from the MNIST dataset (200 images each of digits 0 and 1) and set L = 3 and RKHS norm ‖f‖_H ≤ 1 for all algorithms. Clearly the Nyström algorithm is more efficient than the brute-force algorithm, and the greedy method significantly reduces the number of samples for both algorithms. In fact, Nyström with greedy selection (3c) eventually fell slightly below the pre-specified L, because of the gap in (7).

[Figure 7: Comparison of efficiency in enforcing the Lipschitz constant by various methods. y-axis: Lipschitz constant L⁽ⁱ⁾ in Algorithm 1; x-axis: size of S (number of w added), i.e., index i in Algorithm 1. Curves: 3c Nyström (greedy), 1c brute-force (greedy), 3a Nyström (random), 1a brute-force (random).]
## 6. Conclusion

Risk minimisation can fail to be optimal when there is some misspecification of the distribution, such as when, as we always must, we work with its empirical counterpart. Therefore we must turn to other techniques in order to ensure stability when learning a model. The robust Bayes framework provides a systematic approach to these problems; however, it leaves open the choice as to which uncertainty set is most appropriate. We show that in many cases the popular Lipschitz regularisation corresponds to robust Bayes with a transportation-cost-based uncertainty set.

## References

Anil, C., Lucas, J., and Grosse, R. Sorting out Lipschitz function approximation. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 291–301, Long Beach, CA, USA, June 2019. PMLR.

Arbel, M., Sutherland, D. J., Binkowski, M., and Gretton, A. On gradient regularizers for MMD GANs. In Advances in Neural Information Processing Systems (NIPS), 2018.

Askari, A., d'Aspremont, A., and El Ghaoui, L. Naive feature selection: sparsity in naive Bayes. arXiv:1905.09884 [cs, stat], May 2019.

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In International Conference on Machine Learning (ICML), 2018.

Aubin, J.-P. and Ekeland, I. Estimates of the duality gap in nonconvex optimization. Mathematics of Operations Research, 1(3):225–245, 1976. doi: 10.1287/moor.1.3.225.

Barron, A. R. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115–133, 1994.

Benoist, J. and Hiriart-Urruty, J.-B. What is the subdifferential of the closed convex hull of a function? SIAM Journal on Mathematical Analysis, 27(6):1661–1679, November 1996. doi: 10.1137/S0036141094265936.

Berger, J. O. Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics. Springer-Verlag, New York, NY, USA, 2nd edition, 1993.

Bietti, A. and Mairal, J. Group invariance, stability to deformations, and complexity of deep convolutional representations. Journal of Machine Learning Research, 20(25):1–49, 2019.

Bietti, A., Mialon, G., Chen, D., and Mairal, J. A kernel perspective for regularizing deep neural networks. In International Conference on Machine Learning (ICML), 2019.

Blanchet, J. and Murthy, K. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600, May 2019. doi: 10.1287/moor.2018.0936.

Blanchet, J., Kang, Y., and Murthy, K. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857, 2019. doi: 10.1017/jpr.2019.49.

Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, San Jose, CA, USA, May 2017. IEEE. doi: 10.1109/sp.2017.49.

Champion, T., De Pascale, L., and Juutinen, P. The ∞-Wasserstein distance: local solutions and existence of optimal transport maps. SIAM Journal on Mathematical Analysis, 40(1):1–20, 2008. doi: 10.1137/07069938x.
Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. Parseval networks: improving robustness to adversarial examples. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 854–863, Sydney, Australia, August 2017. PMLR.

Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec):265–292, 2001.

Cranko, Z., Menon, A., Nock, R., Ong, C. S., Shi, Z., and Walder, C. Monge blunts Bayes: hardness results for adversarial training. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1406–1415, Long Beach, CA, USA, June 2019. PMLR.

Drineas, P. and Mahoney, M. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.

Fasshauer, G. E. Positive definite kernels: past, present and future. Dolomites Research Notes on Approximation, 4, 2011.

Fasshauer, G. and McCourt, M. Stable evaluation of Gaussian radial basis function interpolants. SIAM Journal on Scientific Computing, 34(2):A737–A762, 2012.

Gao, R. and Kleywegt, A. J. Distributionally robust stochastic optimization with Wasserstein distance. arXiv:1604.02199 [math], July 2016.

Giner, E. Necessary and sufficient conditions for the interchange between infimum and the symbol of integration. Set-Valued and Variational Analysis, 17(4):321–357, 2009. doi: 10.1007/s11228-009-0119-y.

Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

Grünwald, P. D. and Dawid, A. P. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32(4):1367–1433, August 2004. doi: 10.1214/009053604000000553.

Hiriart-Urruty, J.-B. A general formula on the conjugate of the difference of functions. Canadian Mathematical Bulletin, 29(4):482–485, December 1986. doi: 10.4153/cmb-1986-076-7.

Hiriart-Urruty, J.-B. From convex optimization to nonconvex optimization: necessary and sufficient conditions for global optimality. In Clarke, F. H., Dem'yanov, V. F., and Giannessi, F. (eds.), Nonsmooth Optimization and Related Topics, pp. 219–239. Springer, Boston, MA, USA, 1989. doi: 10.1007/978-1-4757-6019-4_13.

Hiriart-Urruty, J.-B. and Lemaréchal, C. Convex Analysis and Minimization Algorithms II. Springer-Verlag, Berlin, Germany, 2010.

Kerdreux, T., Colin, I., and d'Aspremont, A. An approximate Shapley-Folkman theorem. arXiv:1712.08559 [math], July 2019.

König, H. Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel, 1986.

Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. Wasserstein distributionally robust optimization: theory and applications in machine learning. In Netessine, S., Shier, D., and Greenberg, H. J. (eds.), Operations Research & Management Science in the Age of Analytics, pp. 130–166. INFORMS, 2019. doi: 10.1287/educ.2019.0198.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. Technical report, 2017.

Lafferty, J. and Lebanon, G. Diffusion kernels on statistical manifolds. Journal of Machine Learning Research, 6:129–163, 2005.
Lemaréchal, C. and Renaud, A. A geometric study of duality gaps, with applications. Mathematical Programming, 90(3):399–427, May 2001. doi: 10.1007/pl00011429.

Lin, S.-B., Guo, X., and Zhou, D.-X. Distributed learning with regularized least squares. Journal of Machine Learning Research, 18(92):1–31, 2017.

LNets. https://github.com/cemanil/LNets.

MacKay, D. J. C. Introduction to Gaussian processes. In Bishop, C. M. (ed.), Neural Networks and Machine Learning, pp. 133–165. Springer, Berlin, 1998.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

McShane, E. J. Extension of range of functions. Bulletin of the American Mathematical Society, 40(12):837–842, 1934. doi: 10.1090/s0002-9904-1934-05978-0.

Micchelli, C. A., Xu, Y., and Zhang, H. Universal kernels. Journal of Machine Learning Research, 7:2651–2667, 2006.

Minh, H. Q. Some properties of Gaussian reproducing kernel Hilbert spaces and their implications for function approximation and learning theory. Constructive Approximation, 32(2):307–338, 2010.

Minh, H. Q., Niyogi, P., and Yao, Y. Mercer's theorem, feature maps, and smoothing. In Lugosi, G. and Simon, H. U. (eds.), Conference on Computational Learning Theory (COLT), pp. 154–168, 2006.

Moosavi-Dezfooli, S. M., Fawzi, A., Fawzi, O., and Frossard, P. Universal adversarial perturbations. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017. IEEE. doi: 10.1109/CVPR.2017.17.

Pratelli, A. On the equality between Monge's infimum and Kantorovich's minimum in optimal mass transportation. Annales de l'Institut Henri Poincaré (B) Probability and Statistics, 43(1):1–13, January 2007. doi: 10.1016/j.anihpb.2005.12.001.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

Scaman, K. and Virmaux, A. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3839–3848, Montréal, Canada, 2018. Curran Associates Inc.

Schneider, H. An inequality for latent roots applied to determinants with dominant principal diagonal. Journal of the London Mathematical Society, s1-28(1):8–20, 1953.

Shafieezadeh-Abadeh, S., Kuhn, D., and Esfahani, P. M. Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1–68, 2019.

Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: increasing local stability of supervised models through robust optimization. Neurocomputing, 307:195–204, September 2018. doi: 10.1016/j.neucom.2018.04.027.

Shalev-Shwartz, S., Shamir, O., and Sridharan, K. Learning kernel-based halfspaces with the 0-1 loss. SIAM Journal on Computing, 40(6):1623–1646, 2011.

Shi, Z.-C. and Wang, B.-Y. Bounds for the determinant, characteristic roots and condition number of certain types of matrices. Acta Mathematica Sinica, 15(3):326–341, 1965.

Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
Staib, M. and Jegelka, S. Distributionally robust deep learning as a generalization of adversarial training. In 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 2017.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer Science & Business Media, 2008.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.

Toland, J. F. A duality principle for non-convex optimisation and the calculus of variations. Archive for Rational Mechanics and Analysis, 71(1):41–61, May 1979. doi: 10.1007/bf00250669.

Tropp, J. A. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

Udell, M. and Boyd, S. Bounding duality gap for separable problems with linear constraints. Computational Optimization and Applications, 64(2):355–378, June 2016. doi: 10.1007/s10589-015-9819-4.

Vapnik, V. N. The Nature of Statistical Learning Theory. Springer, New York, NY, USA, 2000.

Vidakovic, B. Γ-minimax: a paradigm for conservative robust Bayesians. In Bickel, P., Diggle, P., Fienberg, S., Krickeberg, K., Olkin, I., Wermuth, N., Zeger, S., Insua, D. R., and Ruggeri, F. (eds.), Robust Bayesian Analysis, volume 152, pp. 241–259. Springer, New York, NY, USA, 2000. doi: 10.1007/978-1-4612-1306-2_13.

Villani, C. Optimal Transport: Old and New. Number 338 in Grundlehren der mathematischen Wissenschaften. Springer-Verlag, Berlin, Germany, 2009.

Whitney, H. Analytic extensions of differentiable functions defined in closed sets. Transactions of the American Mathematical Society, 36(1):63–89, 1934. doi: 10.1090/s0002-9947-1934-1501735-3.

Williams, C. K. I. and Seeger, M. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2000.

Williamson, R. C., Smola, A. J., and Schölkopf, B. Generalization bounds for regularization networks and support vector machines via entropy numbers of compact operators. IEEE Transactions on Information Theory, 47(6):2516–2532, 2001.

Yang, F. and Wei, Z. Generalized Euler identity for subdifferentials of homogeneous functions and applications. Journal of Mathematical Analysis and Applications, 337(1):516–523, 2008. doi: 10.1016/j.jmaa.2007.04.008.

Zhang, Y., Lee, J. D., and Jordan, M. I. ℓ1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning (ICML), 2016.

Zhang, Y., Liang, P., and Wainwright, M. Convexified convolutional neural networks. In International Conference on Machine Learning (ICML), 2017.

Zhou, D.-X. The covering number in learning theory. Journal of Complexity, 18(3):739–767, 2002.

Zhu, H., Williams, C. K., Rohwer, R. J., and Morciniec, M. Gaussian regression and optimal finite dimensional linear models. In Bishop, C. M. (ed.), Neural Networks and Machine Learning. Springer-Verlag, Berlin, 1998.

Zălinescu, C. Convex Analysis in General Vector Spaces. World Scientific, River Edge, NJ, USA, 2002.