# Dimension-Free Error Bounds from Random Projections

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Ata Kabán
School of Computer Science, The University of Birmingham, Edgbaston, B15 2TT Birmingham, UK
A.Kaban@cs.bham.ac.uk

## Abstract

Learning from high-dimensional data is challenging in general; however, often the data is not truly high-dimensional in the sense that it may have some hidden low-complexity geometry. We give new, user-friendly PAC bounds that are able to take advantage of such benign geometry to reduce the dimension-dependence of error guarantees in settings where such dependence is known to be essential in general. This is achieved by employing random projection as an analytic tool, and exploiting its structure-preserving compression ability. We introduce an auxiliary function class that operates on reduced-dimensional inputs, and a new complexity term, defined as the distortion of the loss under random projections. The latter is a hypothesis-dependent data-complexity, whose analytic estimates turn out to recover various regularisation schemes in parametric models, and a notion of intrinsic dimension, as quantified by the Gaussian width of the input support in the case of the nearest neighbour rule. If there is benign geometry present, then the bounds become tighter; otherwise they recover the original dimension-dependent bounds.

## Introduction

We consider learning settings where the generalisation error has a known essential dependence on the dimension of the input representation; examples include learning on unbounded input domains, learning with scale-insensitive loss functions, metric learning, and non-parametric methods. Much of the statistical analysis of learning concerns how fast the empirical error converges to the true error as the sample size increases, and how large the sample must be to match the complexity of the model or hypothesis class of choice. However, in practice, and especially in the above settings, the answers to these questions are often not sufficiently informative, since one cannot have access to unlimited sample sizes. Instead, here we are mainly interested in the following questions: what error guarantee can be given for the available sample size in problems where the input dimension can be arbitrarily high, and what characteristics of the problem ensure good generalisation despite the sample size being small? A nice example illustrating that analysis in the small and large sample regimes may bring different insights is found in (Kontorovich and Weiss 2014).

We approach this problem from first principles. Although it is common practice, as well as technically convenient, to prime the analysis with a margin or some regularisation scheme in order to obtain dimension-free guarantees, where these geometric structures come from some form of prior knowledge or experience, here we are instead interested in a generic principle that brings these in automatically when necessary. To this end, we introduce a notion of compressibility that quantifies the distortion suffered by the loss when the inputs are subjected to a universal compression, namely a random projection. Random projection is often used for dimensionality reduction in algorithms; however, in this context the role of this compression will be purely analytic.
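Before formalising the framework, the following minimal numerical sketch (not from the paper; the dimensions, the data-generating model, and the $1/\sqrt{k}$ scaling of the projection are illustrative assumptions) shows the structure-preserving compression ability referred to above: pairwise distances of points with hidden low-complexity geometry are approximately preserved even though the ambient dimension drops from $d = 1000$ to $k = 50$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 1000, 50, 100

# Data with hidden low-complexity geometry: points lying close to a 5-dimensional subspace of R^d.
basis = rng.standard_normal((5, d))
X = rng.standard_normal((n, 5)) @ basis + 0.01 * rng.standard_normal((n, d))

# Gaussian random projection, scaled so that squared norms are preserved in expectation.
R = rng.standard_normal((k, d)) / np.sqrt(k)
XR = X @ R.T

def pairwise_dists(A):
    # Euclidean distances between all pairs of rows of A (upper triangle only).
    sq = (A * A).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.sqrt(np.maximum(d2, 0.0))[np.triu_indices(len(A), k=1)]

ratios = pairwise_dists(XR) / pairwise_dists(X)
print(f"distance ratios after projection: mean={ratios.mean():.3f}, "
      f"min={ratios.min():.3f}, max={ratios.max():.3f}")
```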
In this role, it is somewhat analogous to the one-dimensional random projection of function outputs in Rademacher and Gaussian complexities, but here instead a random matrix acts on the inputs to the functions. Based on these ideas we derive new generalisation bounds, which will depend on the complexity of an auxiliary low-dimensional hypothesis class instead of the original one, plus a new complexity term that we call the data-complexity of the original function class. The latter captures and identifies benign geometric structures for the problem (of which margin is a special case). We give the generic formalism first, which we then instantiate in concrete models.

## Framework

### Notations and Problem Setup

Let $\mathcal{X}_d \subseteq \mathbb{R}^d$ be an input domain, and $\mathcal{Y}$ the set of target values. In binary classification $\mathcal{Y} = \{-1, 1\}$ is the set of class labels; in regression $\mathcal{Y} \subseteq \mathbb{R}$. Let $\mathcal{H}_d$ be a function class (hypothesis class) with elements $h \in \mathcal{H}_d$ of the form $h : \mathcal{X}_d \to \mathcal{Y}$. Let $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, \bar{\ell}]$ be a bounded loss function. We are given a set of labelled examples $T_N = \{(x_1, y_1), \dots, (x_N, y_N)\}$ drawn i.i.d. from some unknown distribution $\mathcal{D}_d$ over $\mathcal{X}_d \times \mathcal{Y}$. The learning problem is to use these to select a function from $\mathcal{H}_d$ with smallest generalisation error $E_{(x,y)\sim\mathcal{D}_d}[\ell(h(x), y)]$. Let $\mathcal{G}_d = \ell \circ \mathcal{H}_d = \{g : (x, y) \in \mathcal{X}_d \times \mathcal{Y} \mapsto \ell(h(x), y) : h \in \mathcal{H}_d\}$ denote the function class under study. Expectation w.r.t. the unknown data distribution $\mathcal{D}_d$ is denoted $E[g] := E_{(x,y)\sim\mathcal{D}_d}[g(x, y)]$. Expectation w.r.t. the empirical measure defined by a sample $T_N$ is denoted $\hat{E}_{T_N}[g] = \frac{1}{N}\sum_{n=1}^N g(x_n, y_n)$, where the empirical measure is $\mathcal{D}_{T_N} = \frac{1}{N}\sum_{n=1}^N \delta_{x_n}$ and $\delta_x$ is the probability distribution concentrated at the point $x$.

### Definitions of Concepts

For the analysis that follows we make an auxiliary construction. Let $R \in \mathbb{R}^{k \times d}$, $k \le d$, be a random matrix, a so-called random projection (RP) matrix. For instance, a random matrix with i.i.d. Gaussian entries, or a Haar-distributed matrix, is convenient, so that $R$ has full row rank almost surely (a.s.) and has the low-distortion property when used to reduce dimension (Dasgupta and Gupta 2003). As this matrix is used in a purely analytic role, computationally fast variants are not required. We apply $R$ to the points in $\mathcal{X}_d$. For labelled points we use the convention $R(x, y) \equiv (Rx, y)$. This creates a random input space $R\mathcal{X}_d$ that is $k$-dimensional a.s., which we denote $\mathcal{X}_R$. When we refer to it as a generic $k$-dimensional input domain, we write $\mathcal{X}_k$ instead. On $\mathcal{X}_k$ (or $\mathcal{X}_R$) we define an auxiliary function class, denoted $\mathcal{G}_k = \ell \circ \mathcal{H}_k$ (or $\mathcal{G}_R = \ell \circ \mathcal{H}_R$), with elements $g_R = \ell \circ h_R$. This class may be chosen freely, and each choice gives rise to a different generalisation bound. Examples will be given later. A natural choice is, for instance, the class with the same functional form as the elements of $\mathcal{G}_d$, but operating on $k$- rather than $d$-dimensional inputs. An often more convenient choice is $\mathcal{G}_R := \mathcal{G}_d \circ R^T$.

Definition 1. We define the following functional to represent the compressive distortion of a function $g \in \mathcal{G}_d$ relative to $g_R \in \mathcal{G}_R$:
$$D_R(g, g_R) \equiv |g_R \circ R - g|.$$
When $g_R = g \circ R^T$, we write instead $D_R(g) \equiv |g \circ R^T R - g|$.

We note that, conveniently, under suitable and fairly standard assumptions on the loss function, the compressive distortion can be bounded independently of the targets $y$. Examples in a later section will make this concrete.

Remark 2. (i) There exists $k \le d$ such that $D_R(g) = 0$; (ii) if $g(x, y) \in [0, \bar{\ell}]$ for all $(x, y)$, then $D_R(g) \in [0, \bar{\ell}]$ for all $k$.

By Remark 2(i), it is always possible to choose $\mathcal{G}_R$ and $k$ to have zero compressive distortion. In particular, the choice $\mathcal{G}_R = \mathcal{G}_d \circ R^T$ with $k = d$ will recover the traditional error analysis.
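To make the compressive distortion of Definition 1 concrete, here is a minimal numerical sketch (not from the paper; the hypothesis, data and dimensions are illustrative assumptions) that evaluates $D_R(g)$ pointwise for a linear classifier with the 0-1 loss, using the choice $g_R = g \circ R^T$, i.e. $g_R(Rx, y) = g(R^T R x, y)$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, N = 500, 20, 1000

# Toy hypothesis h and inputs; with the 0-1 loss, g(x, y) = 1[sign(h^T x) != y].
h = rng.standard_normal(d)
X = rng.standard_normal((N, d))
y = np.sign(X @ h)            # labels consistent with h, so g(x_n, y_n) = 0 on every point

def zero_one_loss(scores, labels):
    return (np.sign(scores) != labels).astype(float)

# Compressive distortion D_R(g) = |g(R^T R x, y) - g(x, y)| for the choice g_R = g o R^T.
R = rng.standard_normal((k, d)) / np.sqrt(k)
g_orig = zero_one_loss(X @ h, y)                 # losses on the original inputs
g_comp = zero_one_loss(X @ R.T @ R @ h, y)       # losses after the R^T R "compress-and-lift" map
print("empirical mean of D_R(g):", np.abs(g_comp - g_orig).mean())
```

In this toy setting the mean distortion is simply the fraction of points whose predicted sign flips under the random projection, foreshadowing the margin-distribution-type condition that appears later in Theorem 6.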
In turn, the use of the above quantity captures a condition that allows us to reduce dimensional dependence in error bounds. To this end we define the following new notion of complexity, which will play a key role in the sequel.

Definition 3. We define the data-complexity of a function class $\mathcal{G}_d$ as
$$C_{2N,k}(\mathcal{G}_d) = E_{T_{2N} \sim \mathcal{D}_d^{2N}} \sup_{g \in \mathcal{G}_d} E_R \inf_{g_R \in \mathcal{G}_k} \hat{E}_{T_{2N}}[D_R(g, g_R)].$$

We may think of this as the largest (w.r.t. $g \in \mathcal{G}_d$) mimicking error, on average over training sets, of an ensemble of learners that each receive a randomly compressed version of the inputs and train to behave as $g$. We note that this complexity term is always non-negative and, as already mentioned, it can be made zero by our choices of $\mathcal{G}_R$ and $k$. The interesting cases are those in which this quantity is small despite $k < d$.

## Generic Bound

This section gives a novel and generic PAC-style uniform generalisation bound. It bounds the error of any function in a given class in terms of the notion of data-complexity of the class introduced in the previous section, which allows the use of a low-complexity auxiliary function class that operates on low-dimensional random projections of the inputs. The former is based on the distortion of the loss incurred by the random projection of the inputs; therefore, due to the structure-preserving property of random projections, this term may be low when the data exhibits some benign geometry with respect to the function class under study. This allows us to quantify and exploit benign geometry that may be present, that is, naturally occurring structures that make a learning problem less complex than it appears to be. Examples will follow in the subsequent sections.

In generic terms, we prove the following Theorem 4. Recall that the empirical Rademacher complexity of a function class $\mathcal{G}$ is defined as $\hat{R}_N(\mathcal{G}) = \frac{1}{N} E_\sigma \sup_{g \in \mathcal{G}} \sum_{n=1}^N \sigma_n g(x_n)$, where $\sigma = (\sigma_1, \dots, \sigma_N)$ are i.i.d. Uniform$(\{-1, +1\})$ variables.

Theorem 4. Let $\mathcal{G}_d$ be the function class associated with the hypothesis class $\mathcal{H}_d$ and a bounded loss function $\ell$ taking values in $[0, \bar{\ell}]$: $\mathcal{G}_d = \{g : (x, y) \in \mathcal{X}_d \times \mathcal{Y} \mapsto \ell(h(x), y) \text{ s.t. } h \in \mathcal{H}_d\}$. Let $T_N = \{(x_n, y_n)\}_{n=1}^N \sim \mathcal{D}_d^N$ be a training set over $\mathcal{X}_d \times \mathcal{Y}$. Then, for any $\delta > 0$, w.p. at least $1 - 2\delta$, uniformly for all $g \in \mathcal{G}_d$, we have:
$$E[g] \le \hat{E}_{T_N}[g] + 2 C_{2N,k}(\mathcal{G}_d) + 2 E_R[\hat{R}_N(\mathcal{G}_R)] + 3\bar{\ell}\sqrt{\frac{\ln(1/\delta)}{2N}},$$
where $R$ is a $k \times d$, $k \le d$, random matrix independent of the sample, and $\mathcal{G}_R$ is an auxiliary function class on $\mathcal{X}_R \times \mathcal{Y}$ chosen before seeing the sample.

By Remark 2(i), for any choice of $\zeta \ge 0$ we can have the data-complexity term $C_{2N,k}(\mathcal{G}_d) \le \zeta$ for a suitable choice of $k$. In particular, $\zeta = 0$ is always satisfied by $k = d$, in which case we recover the classical Rademacher complexity based bound. However, when the Rademacher complexity term is dimension-dependent, the data-complexity term may reduce this dependence by reducing dimension and exploiting the presence of naturally occurring structures through the structure-preserving properties of random projections. In addition, even in the absence of favourable geometry, $k$ and $\zeta$ allow us to control the tradeoff between bias and variance by choosing $k \le d$ according to the available sample size $N$, at the expense of a bias $\zeta$. As we shall see in concrete examples in the next section, the analytic estimate of the data-complexity term takes different forms, depending on the form of the function class of interest: in some of the parametric models it brings in a constraint on the margin distribution, or a finiteness constraint on the norms of various parameters, akin to regularisation. In the nonparametric case it takes the form of a notion of intrinsic dimension.
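The complexity quantities appearing in Theorem 4 can be approximated numerically on toy data. The sketch below (illustrative only; the finite hypothesis class and all sizes are assumptions made for the example, not the paper's construction) estimates the empirical Rademacher complexity $\hat{R}_N$ of a small finite class of 0-1 loss functions on randomly projected inputs by Monte Carlo over the Rademacher signs $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, k, n_hyp = 200, 10, 50

# Randomly projected inputs, labels, and a small finite hypothesis class in the reduced space.
XR = rng.standard_normal((N, k))
y = rng.choice([-1.0, 1.0], size=N)
W = rng.standard_normal((n_hyp, k))
losses = (np.sign(XR @ W.T) != y[:, None]).astype(float)   # shape (N, n_hyp): 0-1 loss per hypothesis

def empirical_rademacher(loss_matrix, n_sigma=2000):
    # hat{R}_N(G) = (1/N) E_sigma sup_{g in G} sum_n sigma_n g(x_n, y_n), sigma_n ~ Uniform{-1, +1}
    n = loss_matrix.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_sigma, n))
    return (sigma @ loss_matrix).max(axis=1).mean() / n

print("Monte Carlo estimate of hat{R}_N over the reduced class:", empirical_rademacher(losses))
```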
Proof of Theorem 4. Let $\varphi(T_N) := \sup_{g \in \mathcal{G}_d} \big( E[g] - \hat{E}_{T_N}[g] \big)$. By McDiarmid's inequality, w.p. $1 - \delta$,
$$\varphi(T_N) \le E_{T_N \sim \mathcal{D}_d^N}[\varphi(T_N)] + \bar{\ell}\sqrt{\frac{\ln(1/\delta)}{2N}}.$$
Now, bounding this expectation, we have:
$$\begin{aligned}
E_{T_N \sim \mathcal{D}_d^N}[\varphi(T_N)] &= E_{T_N} \sup_{g \in \mathcal{G}_d} \big( E[g] - \hat{E}_{T_N}[g] \big) = E_{T_N} \sup_{g \in \mathcal{G}_d} \big( E_{T'_N \sim \mathcal{D}_d^N}[\hat{E}_{T'_N}[g]] - \hat{E}_{T_N}[g] \big) \\
&\le E_{T_N, T'_N \sim \mathcal{D}_d^{2N}} \sup_{g \in \mathcal{G}_d} \big( \hat{E}_{T'_N}[g] - \hat{E}_{T_N}[g] \big) \quad \text{by Jensen's inequality} \\
&= E_{T_N, T'_N} \sup_{g \in \mathcal{G}_d} E_R \inf_{g_R \in \mathcal{G}_R} \Big\{ \hat{E}_{T'_N}[g] - \hat{E}_{R T'_N}[g_R] + \hat{E}_{R T'_N}[g_R] - \hat{E}_{R T_N}[g_R] + \hat{E}_{R T_N}[g_R] - \hat{E}_{T_N}[g] \Big\} \\
&\le E_{T_N, T'_N} \sup_{g \in \mathcal{G}_d} E_R \inf_{g_R \in \mathcal{G}_R} \Big\{ \hat{E}_{T'_N}[g] - \hat{E}_{R T'_N}[g_R] + \hat{E}_{R T_N}[g_R] - \hat{E}_{T_N}[g] \Big\} + E_{T_N, T'_N} E_R \sup_{g_R \in \mathcal{G}_R} \Big\{ \hat{E}_{R T'_N}[g_R] - \hat{E}_{R T_N}[g_R] \Big\}.
\end{aligned}$$
The first term can be bounded by the triangle inequality:
$$\begin{aligned}
& E_{T_N, T'_N} \sup_{g \in \mathcal{G}_d} E_R \inf_{g_R \in \mathcal{G}_R} \Big\{ \hat{E}_{T'_N}[g] - \hat{E}_{R T'_N}[g_R] + \hat{E}_{R T_N}[g_R] - \hat{E}_{T_N}[g] \Big\} \\
&\le E_{T_N, T'_N} \sup_{g \in \mathcal{G}_d} E_R \inf_{g_R \in \mathcal{G}_R} \left\{ \frac{1}{N}\sum_{n=1}^N |g(x'_n, y'_n) - g_R(R x'_n, y'_n)| + \frac{1}{N}\sum_{n=1}^N |g(x_n, y_n) - g_R(R x_n, y_n)| \right\} \\
&= 2\, E_{T_{2N} \sim \mathcal{D}_d^{2N}} \sup_{g \in \mathcal{G}_d} E_R \inf_{g_R \in \mathcal{G}_R} \frac{1}{2N}\sum_{n=1}^{2N} |g(x_n, y_n) - g_R(R x_n, y_n)| = 2 C_{2N,k}(\mathcal{G}_d).
\end{aligned}$$
To estimate the second term we observe that it is the expected supremum of the empirical process indexed by the reduced class $\mathcal{G}_R$, hence we can apply the classical steps of symmetrisation and McDiarmid's inequality. Let $\sigma \sim \text{Uniform}(\{-1,+1\})^N$ be Rademacher variables. We have:
$$\begin{aligned}
E_{T_N, T'_N} E_R \sup_{g_R \in \mathcal{G}_R} \Big\{ \hat{E}_{R T'_N}[g_R] - \hat{E}_{R T_N}[g_R] \Big\} &= E_R E_{T_N, T'_N, \sigma} \sup_{g_R \in \mathcal{G}_R} \frac{1}{N}\sum_{n=1}^N \sigma_n \big[ g_R(R x_n, y_n) - g_R(R x'_n, y'_n) \big] \\
&\le 2 E_R E_{T_N \sim \mathcal{D}_d^N, \sigma} \sup_{g_R \in \mathcal{G}_R} \frac{1}{N}\sum_{n=1}^N \sigma_n g_R(R x_n, y_n) = 2 E_R[R_N(\mathcal{G}_R \circ R)],
\end{aligned}$$
which, by McDiarmid's inequality again, is at most $2 E_R[\hat{R}_N(\mathcal{G}_R \circ R)] + 2\bar{\ell}\sqrt{\frac{\ln(1/\delta)}{2N}}$ w.p. $1 - \delta$. Plugging back, and noticing that $\hat{R}_N(\mathcal{G}_R \circ R) = \hat{R}_N(\mathcal{G}_R)$, completes the proof.

## Examples

To instantiate Theorem 4 in concrete learning settings, we will need to estimate our newly introduced data-complexity term for the specific function classes. The following properties will be useful for doing so.

Remark 5. The following simpler expressions upper bound the data-complexity:
(i) $C_{2N,k}(\mathcal{G}_d) \le E_{T_N \sim \mathcal{D}_d^N} \sup_{g \in \mathcal{G}_d} \hat{E}_{T_N} E_R[D_R(g)] \equiv C_k(\mathcal{G}_d)$;
(ii) $C_k(\mathcal{G}_d) \le \sup_{g \in \mathcal{G}_d} \sup_{(x,y) \in \mathcal{X}_d \times \mathcal{Y}} E_R[D_R(g)]$.

Proof. Relaxing the infimum, and using Jensen's inequality,
$$\begin{aligned}
C_{2N,k}(\mathcal{G}_d) &= E_{T_N, T'_N \sim \mathcal{D}_d^{2N}} \sup_{g \in \mathcal{G}_d} E_R \inf_{g_R \in \mathcal{G}_R} \hat{E}_{T_N \cup T'_N} |g_R \circ R - g| \le E_{T_N, T'_N \sim \mathcal{D}_d^{2N}} \sup_{g \in \mathcal{G}_d} E_R\, \hat{E}_{T_N \cup T'_N} |g_R \circ R - g|, \quad \forall g_R \in \mathcal{G}_R \\
&\le \tfrac{1}{2} E_{T_N \sim \mathcal{D}_d^N} \sup_{g \in \mathcal{G}_d} E_R\, \hat{E}_{T_N} |g_R \circ R - g| + \tfrac{1}{2} E_{T'_N \sim \mathcal{D}_d^N} \sup_{g \in \mathcal{G}_d} E_R\, \hat{E}_{T'_N} |g_R \circ R - g| \\
&= E_{T_N \sim \mathcal{D}_d^N} \sup_{g \in \mathcal{G}_d} E_R\, \hat{E}_{T_N} |g_R \circ R - g|, \quad \forall g_R \in \mathcal{G}_R.
\end{aligned}$$
Now choose $\mathcal{G}_R = \mathcal{G}_d \circ R^T$ (i.e. $g_R = g \circ R^T$) to complete the proof.

### Thresholded Linear Model with the 0-1 Loss

We start with the classical example of learning a halfspace, which allows us to demonstrate the workings of our generic Theorem 4 on a simple example, and at the same time gives us the chance to fix an error in the existing literature. Consider the linear function class in $\mathbb{R}^d$, $\mathcal{H}_d = \{x \mapsto h^T x : h, x \in \mathbb{R}^d\}$, with the 0-1 loss $\ell_{01} : \mathcal{Y} \times \mathcal{Y} \to \{0, 1\}$, $\ell_{01}(\hat{y}, y) = \mathbb{1}(\hat{y} y \le 0)$, where $\mathbb{1}(\cdot)$ takes the value 1 if its argument is true, and 0 otherwise. Let $\mathcal{G}_d = \ell_{01} \circ \mathcal{H}_d$. By a slight abuse of notation we identify the hypothesis $h$ with its parameter vector. It is known that the generalisation error of this class, when working with the 0-1 loss directly and allowing $\mathcal{X}$ to be unbounded, is of order $\Theta(\sqrt{d/N})$, and this dependence on $d$ cannot be removed in general. However, deploying our Theorem 4 yields a condition of geometric nature under which we can prove the following dimension-free bound.

Theorem 6. Let $\mathcal{G}_d = \ell_{01} \circ \mathcal{H}_d$ as above. Let $T_N = \{(x_n, y_n)\}_{n=1}^N \sim \mathcal{D}_d^N$ be a training set over $\mathcal{X}_d \times \mathcal{Y}$. Let $R$ be a $k \times d$ Gaussian random matrix, $k \le d$. Suppose that for some $\zeta = \zeta(k) > 0$ we have, for all $g \in \mathcal{G}_d$ and for all samples $T_N \sim \mathcal{D}_d^N$ of size $N$, that $\frac{1}{N}\sum_{n=1}^N \mathbb{1}\big(\mathrm{sign}(h^T x_n) \neq \mathrm{sign}(h^T R^T R x_n)\big) \le \zeta(k)$ w.p. $1 - \delta$ with respect to the random draw of $R$. Then, for any $\delta > 0$, w.p. $1 - 2\delta$ the following holds uniformly for all $g \in \mathcal{G}_d$:
$$E[g] \le \hat{E}_{T_N}[g] + 2\zeta(k)\,\mathbb{1}(k < d) + C\sqrt{\frac{k}{N}} + 3\sqrt{\frac{\ln(1/\delta)}{2N}},$$
where $C > 0$ is an absolute constant.

The use of random projection (RP) to derive a dimension-free bound for this function class was previously attempted in (Garg, Har-Peled, and Roth 2002), but unfortunately an error in their proof¹ makes a meaningful comparison difficult. Nevertheless, we believe the idea itself has potential. An alternative approach specialised to halfspace learning is pursued in (Kabán and Durrant 2017).

¹ In (Garg, Har-Peled, and Roth 2002), Lemma 3.5 correctly bounds the absolute difference of empirical errors from two independent samples $S_1, S_2$ in terms of their RP-ed counterpart, for a fixed classifier $h$. But the authors then attempt to apply Lemma 3.5, on page 8 after their eq. (3), with respect to a supremum over all hypotheses, unfortunately neglecting the effect of taking unions over events outside the subspace of the RP. The error carries over throughout the rest of the proof of their Theorem 3.1 and invalidates its main statement.

We note that both $k$ and $\zeta(k)$ need to be chosen before seeing the sample. It is interesting to notice that the role of $k$ in the above bound replaces that of the VC dimension in classical bounds, which in this example would be $d$. A sensible choice is to set $k$ proportional to $N$, which is typically known. The classical VC bound is recovered if we take the worst-case complexity $k = d$ to have $\zeta(k) = 0$, and demand the sample size $N$ to be proportional to it. However, one typically cannot have access to unlimited sample size $N$. Instead, this new type of bound allows us to choose $k$ proportional to the $N$ we do have access to, and pay the price accordingly through $\zeta(k)$. Note, however, that $\zeta(k)$ may still be small if there is benign geometry present, for instance if most points of the two classes are well separated. The proof of Theorem 6 shows how the requirement for small data-complexity translates into an average margin-distribution-like condition in this case.

Proof of Theorem 6. We apply Theorem 4. Choosing $\mathcal{G}_R := \mathcal{G}_d \circ R^T$, by Remark 5(i) we have:
$$C_{2N,k}(\mathcal{G}_d) \le E_{T_N \sim \mathcal{D}_d^N} \sup_{g \in \mathcal{G}_d} E_R\, \hat{E}_{T_N}\big| g \circ R^T R - g \big|,$$
and plugging in the form of $\mathcal{G}_d$, we get:
$$\hat{E}_{T_N}\big| g \circ R^T R - g \big| = \frac{1}{N}\sum_{n=1}^N \Big| \mathbb{1}\big(\mathrm{sign}(h^T x_n) \neq y_n\big) - \mathbb{1}\big(\mathrm{sign}(h^T R^T R x_n) \neq y_n\big) \Big| = \frac{1}{N}\sum_{n=1}^N \mathbb{1}\big(\mathrm{sign}(h^T x_n) \neq \mathrm{sign}(h^T R^T R x_n)\big).$$
Hence,
$$C_{2N,k}(\mathcal{G}_d) \le E_{T_N \sim \mathcal{D}_d^N} \sup_{h \in \mathcal{H}_d} \frac{1}{N}\sum_{n=1}^N \Pr_R\big(\mathrm{sign}(h^T x_n) \neq \mathrm{sign}(h^T R^T R x_n)\big).$$
It now remains to estimate the complexity of the function class in the reduced space, $\hat{R}_N(\mathcal{H}_k)$. By Lemma 3.1 in (Mohri, Rostamizadeh, and Talwalkar 2012), and the known inequality between empirical Rademacher complexity and VC dimension for binary-valued function classes (Bartlett and Mendelson 2002), we have $\hat{R}_N(\mathcal{H}_k) \le C\sqrt{\frac{V(\mathcal{H}_k)}{N}}$, where $C > 0$ is an absolute constant and $V(\mathcal{H}_k) = k$ in this example. Hence the result follows from Theorem 4.

### Generalised Linear Models

Consider the function class $\mathcal{G}_d = \ell \circ \mathcal{H}_d$, where $\mathcal{H}_d = \{x \mapsto h^T x : x \in \mathcal{X}, h \in \mathbb{R}^d\}$ and $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, \bar{\ell}]$ is a bounded loss function that is also $L_\ell$-Lipschitz in its first argument. The input domain is not assumed to be bounded; it will be sufficient to require that $E_x[x x^T] < \infty$. Furthermore, we do not impose any constraint a priori on the parameter vector $h$; such a constraint will pop out of deploying our generic Theorem 4. Because of the unbounded input domain and the absence of a priori constraints on $h$, the error is again known to be of order $\Theta(\sqrt{d/N})$ in general. By deploying our Theorem 4 we can prove the following.

Theorem 7.
Let $\mathcal{G}_d$ be the class of generalised linear models of the form $\mathcal{G}_d = \ell \circ \mathcal{H}_d$, where $\mathcal{H}_d = \{x \mapsto h^T x : h, x \in \mathbb{R}^d\}$, and the loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, \bar{\ell}]$ is $L_\ell$-Lipschitz in its first argument. Let $T_N = \{(x_n, y_n)\}_{n=1}^N \sim \mathcal{D}_d^N$ be a training set over $\mathcal{X}_d \times \mathcal{Y}$, where $\mathcal{D}_d$ satisfies that $\Sigma \equiv E_{x \sim \mathcal{D}_d}[x x^T]$ is finite. Then, for any positive integer $k \le d$ and any $\delta > 0$, w.p. $1 - 2\delta$ we have uniformly for all $g \in \mathcal{G}_d$:
$$E[g] \le \hat{E}_{T_N}[g] + 2 L_\ell \sqrt{\frac{2}{k}}\, E_x\big[\|x\|_2\big]\, \mathbb{1}\big(k < \mathrm{rank}(\Sigma)\big) \sup_{h \in \mathcal{H}_d} \|h\|_2 + 2\bar{\ell} C\sqrt{\frac{k}{N}} + 3\bar{\ell}\sqrt{\frac{\ln(1/\delta)}{2N}},$$
where $C$ is an absolute constant.

We can interpret the various terms in the bound as follows. The term after the empirical error on the r.h.s. is an upper bound on the data-complexity term. It provides, as in the previous section, a condition under which a dimension-free bound holds. Specifically, in this example it tells us that finiteness of $\|h\|_2$ is such a condition, and that small values of this norm represent a benign geometric structure that reduces the generalisation error when $N$ is limited. While this is no news, finding it from our generic theorem validates the principle behind it, namely that the distortion of the loss under a random projection of the inputs is able to capture a meaningful condition that explains what makes the learning problem easier despite the limited sample size. It will be interesting to follow this principle through in several other models too, in the next couple of subsections.

It is also interesting to note the similarities and differences of the obtained Theorem 7 with traditional Rademacher bounds. It is well known that, if $\mathcal{X}$ is bounded, then together with the norm constraint on $\|h\|_2$ derived above (or pre-imposed, as usual in the literature), one can have a dimension-free estimate of the empirical Rademacher complexity term. If we were to assume that $\mathcal{X}$ is bounded, then our bound recovers exactly the Rademacher bound with the choice $k \ge \mathrm{rank}(\Sigma)$, that is, when the data-complexity term vanishes; other choices of $k$ are then at a disadvantage. On the other hand, boundedness of $\mathcal{X}$ is not required for Theorem 7 to hold, and, as already discussed, the constraint on $\|h\|_2$ pops out from a generic procedure without the need for any prior knowledge.

Proof of Theorem 7. We can work with Gaussian $R$ with i.i.d. 0-mean entries of variance $1/k$, so that on average any projected vector has the same squared length as the original. Choose the auxiliary function class to be $\mathcal{G}_R = \mathcal{G}_d \circ R^T$; by the Lipschitz property of the loss we have for the data-complexity term:
$$C_{2N,k}(\mathcal{G}_d) \le L_\ell\, E_{T_N, T'_N \sim \mathcal{D}_d^{2N}} \sup_{h \in \mathcal{H}_d} E_R \frac{1}{2N}\sum_{n=1}^{2N} \big| h^T x_n - h^T R^T R x_n \big|. \quad (1)$$
By Jensen's inequality, and Lemma 2 from (Kabán 2014), we get (after some algebra):
$$E_R\big| h^T x - h^T R^T R x \big| \le \Big( E_R\big| h^T x - h^T R^T R x \big|^2 \Big)^{1/2} \le \sqrt{\frac{2}{k}}\, \|h\|_2 \|x\|_2.$$
Having decoupled $h$ and the data, we plug this back into eq. (1) to get the following when $k < \mathrm{rank}(\mathcal{X}_d)$ (and 0 otherwise, by construction):
$$C_{2N,k}(\mathcal{G}_d) \le L_\ell \sqrt{\frac{2}{k}}\, E\|x\|_2 \sup_{h \in \mathcal{H}_d} \|h\|_2. \quad (2)$$
We note that the factor of 2 can be improved to 1, at the expense of a lengthier derivation, if we chose to work with a scaled Haar-distributed $R$. Next, we estimate the Rademacher complexity term in the reduced space. Since there is no constraint on the parameters or the input domain, we instead exploit that the loss function is bounded, through the following lemma (proof omitted).

Lemma 8. Let $\mathcal{F}_k = \{x \mapsto f(w^T x) \in [0, 1] : x \in \mathbb{R}^k\}$. Then there exists $C > 0$ s.t. $\hat{R}_N(\mathcal{F}_k) \le C\sqrt{\frac{k}{N}}$.

Using this, let $\mathcal{F}_R := \mathcal{G}_R / \bar{\ell}$, and we have:
$$\hat{R}_N(\mathcal{G}_R) = \bar{\ell}\, \hat{R}_N(\mathcal{G}_R / \bar{\ell}) \le \bar{\ell} C \sqrt{\frac{k}{N}}. \quad (3)$$
Putting together eqs. (3) and (2) completes the proof.
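The key decoupling step in the proof of Theorem 7 is the inequality $E_R|h^T x - h^T R^T R x| \le \sqrt{2/k}\,\|h\|_2\|x\|_2$ for Gaussian $R$ with i.i.d. $\mathcal{N}(0, 1/k)$ entries. The following minimal Monte Carlo sketch (the dimensions and vectors are illustrative assumptions) checks this numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 300, 30
h = rng.standard_normal(d)
x = rng.standard_normal(d)

# Monte Carlo estimate of E_R |h^T x - h^T R^T R x| for Gaussian R with i.i.d. N(0, 1/k) entries,
# compared against the analytic upper bound sqrt(2/k) * ||h||_2 * ||x||_2 used in the proof.
n_trials = 5000
vals = np.empty(n_trials)
for t in range(n_trials):
    R = rng.standard_normal((k, d)) / np.sqrt(k)
    vals[t] = abs(h @ x - h @ R.T @ (R @ x))

bound = np.sqrt(2.0 / k) * np.linalg.norm(h) * np.linalg.norm(x)
print(f"E_R|h^T x - h^T R^T R x| ~= {vals.mean():.2f}  <=  sqrt(2/k)||h|| ||x|| = {bound:.2f}")
```

Note that the bound decouples the hypothesis from the data, which is what lets the norm constraint on $h$ emerge from the data-complexity term rather than being imposed a priori.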
### Mahalanobis Metric Learning Classifiers

In this section we consider the problem of learning a generic classifier simultaneously with a linear transformation of the inputs, equivalent to learning a Mahalanobis metric, that enhances classification performance. Here we will assume, for convenience, a bounded input space living in a ball of radius $B$, i.e. $\mathcal{X}_d \subseteq B(0, B)$, as did previous work by (Verma and Branson 2015) that studied the exact same problem. The work of (Verma and Branson 2015) gave an error bound with dimensional dependence of order $O(d\sqrt{\log(d)})$, and proved that this is not improvable in general. The authors then proposed a restriction on the Mahalanobis metrics, under which they show, for the specific class of 2-layer perceptrons, that the dimensional dependence of the risk upper bound reduces to $O(\sqrt{\log(d)})$. Our main purpose here is to demonstrate how this regulariser comes out automatically from the data-complexity term when applying our generic Theorem 4, without the need to specify it from a priori knowledge, and we may take advantage of it more widely to tighten the bound whenever there are weakly informative features.

Let $\mathcal{H}_d = \{x \mapsto h(x) : x \in \mathcal{X}_d\}$ be some parametric $L_h$-Lipschitz function class with uniformly bounded fat-shattering dimension over all scales.² Define $\mathcal{M}_d = \{M \in \mathbb{R}^{d \times d} : \sigma_{\max}(M) \le 1\}$, where the condition $\sigma_{\max}(M) \le 1$ (also assumed in (Verma and Branson 2015)) is not a restriction but serves to remove arbitrary scaling. The matrix $M^T M$ may be thought of as a metric tensor in the high-dimensional space, although we work with $M$ directly. As in the previous section, the loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, \bar{\ell}]$ is assumed to be bounded and $L_\ell$-Lipschitz in its first argument. The function class of our interest is then $\mathcal{G}_d = \ell \circ \mathcal{H}_d \circ \mathcal{M}_d$. Again we deploy Theorem 4, and now obtain the following.

² This condition can be dropped at the expense of a $\sqrt{\log(N)}$ factor in the error bound (by using a fixed scale $\alpha$ instead of Dudley's integral inequality in the proof, as was done in Lemma 3 of (Verma and Branson 2015)). However, for function classes that are closed under scalar multiplication it is simply equivalent to having a bounded pseudo-dimension, cf. Theorem 11.14 in (Anthony and Bartlett 1999).

Theorem 9. Consider the class of functions of the form $\mathcal{G}_d = \ell \circ \mathcal{H}_d \circ \mathcal{M}_d$, where $\mathcal{M}_d = \{M \in \mathbb{R}^{d \times d} : \sigma_{\max}(M) \le 1\}$, $\mathcal{H}_d = \{x \mapsto h(x) : x \in \mathcal{X}_d,\ h \text{ is } L_h\text{-Lipschitz},\ \mathrm{fat}_\alpha(\mathcal{H}) \le f_d\}$ for some $f_d > 0$, the loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, \bar{\ell}]$ is $L_\ell$-Lipschitz in its first argument, and $\mathcal{X} \subseteq B(0, B)$. Let $T_N = \{(x_n, y_n)\}_{n=1}^N \sim \mathcal{D}_d^N$ be a training set of size $N$ taking values in $\mathcal{X}_d \times \mathcal{Y}$. Then, for any positive integer $k \le d$, and any $\eta > 0$, $\delta > 0$, w.p. $1 - 2\delta$ we have uniformly for all $g \in \mathcal{G}_d$ that:
$$E[g] \le \hat{E}_{T_N}[g] + 2 L_\ell L_h \mathbb{1}\big(k < \mathrm{rank}(\mathcal{X}_d)\big) \sqrt{\frac{2}{k}}\, E\|x\|_2 \sup_{M \in \mathcal{M}_d} \|M\|_{Fro} + C\left(\eta + \sqrt{\frac{k d \ln\!\big(1 + \tfrac{2 L_h B \sqrt{d}}{\eta}\big) + f_k \ln\!\big(\tfrac{4}{\eta}\big)}{N}}\right) + 3\bar{\ell}\sqrt{\frac{\ln(1/\delta)}{2N}}.$$

The proof is rather lengthy, and is deferred to the full version. The main steps are to choose $\mathcal{G}_R$ in a way that makes sure the scaling indeterminacy of the metric is taken care of in the auxiliary function class (if we work with Gaussian $R$, this involves a truncation step), and to estimate the Rademacher complexity in the reduced class. The latter involves Dudley's inequality and covering number estimation (Mendelson 2003) on the matrix class that represents the metric in the reduced space.

The bound of Theorem 9 has the same high-level structure as what we have seen in the previous sections: on the r.h.s.
we have the empirical error, the data-complexity term (which can only be nonzero for $k < d$), and the complexity of the auxiliary function class, whose dependence on $d$ is reduced to $O(\sqrt{d \log d})$ whenever $k < d$ is chosen. The choice $k = d$ (assuming $\mathcal{X}_d$ is not degenerate) recovers the bound of (Verma and Branson 2015) with two small improvements: in the log factor, $d$ is improved to $\sqrt{d}$, and a $\log(N)$ factor is eliminated. Note that the complexity of the high-dimensional class $\mathcal{H}_d$ is not present in this bound; instead we have the complexity of its $k$-variate version, $f_k$, the fat-shattering dimension of $\mathcal{H}_k$. Note also that we did not specify the class $\mathcal{H}_d$, so we can plug in any function class for which we have an estimate of its fat-shattering dimension. Upon further specification, the dependence on $d$ may be further reducible.

In general, coming up with the extra constraints that reduce dimensional dependence is not a trivial task, and typically relies on some prior knowledge about the problem. By contrast, in our framework we do not require any a priori knowledge of the specific problem; instead we require a robustness to perturbations created by the random projection. This automatically yields an appropriate constraint, in the present example the magnitude of $\|M\|_{Fro}$, which reflects the robustness to distortions from random projection that our generic Theorem 4 rests on. Since $\sigma_{\max}(M) \le 1$, this is at most $\sqrt{d}$, but it can be much smaller if there are weakly informative features that are then down-weighted by the linear mapping being learned.

An equivalent formulation of Theorem 9 is to say that, for some $k$ and $\zeta(k)$ chosen before seeing the data, we require the precondition that $\|M\|_{Fro} \le \zeta(k)$ for all $M \in \mathcal{M}_d$. On the r.h.s., $\zeta(k)$ then takes the place of $\sup_{M \in \mathcal{M}_d} \|M\|_{Fro}$. Of course, the supremum of the Frobenius norms of all matrices in $\mathcal{M}_d$ is not likely to be known, but we can use the technique of Structural Risk Minimisation (Vapnik 1998) to convert the bound into an algorithm that uses an estimate from the data. In this example, this would yield Frobenius-norm regularisation of the metric.

Finally, for the specific class of 2-layer perceptrons, it is natural to wonder whether we can recover a bound in $O(\sqrt{\log(d)})$ as in (Verma and Branson 2015). It turns out that, by using our main Theorem 4 with a choice of fixed metric in the auxiliary class $\mathcal{G}_R$, we can actually obtain a dimension-free bound:

Corollary 10. Consider the class of functions of the form $\mathcal{G}_d = \ell \circ \mathcal{H}_d \circ \mathcal{M}_d$, where $\mathcal{H}_d$ has the following form. Let $\phi : \mathbb{R} \to [-b, b]$ be $L_\phi$-Lipschitz, and $\mathcal{H}_d = \{x \mapsto \sum_{i=1}^m v_i \phi(w_i^T x) : x \in \mathcal{X}_d,\ \|w_i\|_1 \le 1,\ \|v\|_1 \le 1\}$. Let $T_N$ be a training set, as before. Then, for any positive integer $k \le d$ and any $\delta > 0$, w.p. $1 - \delta$ we have uniformly for all $g \in \mathcal{G}_d$ that:
$$E[g] \le \hat{E}_{T_N}[g] + C L_\ell b \sqrt{\frac{2}{k}}\, E\|x\|_2\, \mathbb{1}\big(k < \mathrm{rank}(\mathcal{X}_d)\big) \sup_{v, W, M} \|v\|_2 \|W^T M\|_{Fro} + \cdots$$

### Nearest Neighbour

The previous sections concerned various linear and nonlinear parametric models. Here we take a simple representative of nonparametric models: a nearest neighbour classifier. The nearest neighbour rule can be expressed as follows (Kontorovich and Weiss 2015). Denote by $T_N^+, T_N^- \subseteq T_N$, with $T_N^+ \cup T_N^- = T_N$, the positively and negatively labelled training points respectively. Define the distance of a point $x \in \mathcal{X}$ to a set $S$ as $d(x, S) = \inf_{z \in S} \|x - z\|$.
Then $N^+(x)$ and $N^-(x)$, the points attaining $d(x, T_N^+)$ and $d(x, T_N^-)$, are the nearest positive and nearest negative neighbours of $x$ respectively, and the label prediction for $x \in \mathcal{X}$ is given by the sign of the following 1-Lipschitz function:
$$h(x : T_N^+, T_N^-) := \tfrac{1}{2}\big(d(x, T_N^-) - d(x, T_N^+)\big) = \tfrac{1}{2}\big(\|x - N^-(x)\| - \|x - N^+(x)\|\big). \quad (4)$$
We use Euclidean norms throughout. To facilitate the analysis, we assume a bounded input domain, $\mathcal{X}_d \subseteq B(0, B)$, the same as in (Kontorovich and Weiss 2015), and the truncated $\gamma$-margin loss, which is $[0, 1]$-valued and $1/\gamma$-Lipschitz, where $\gamma \in (0, 1]$, so that in the reduced space we can make use of existing estimates. The function class of our interest is therefore the composition of the $1/\gamma$-Lipschitz loss and the 1-Lipschitz classifier of the form given in eq. (4); that is, $\mathcal{G}_d \subseteq \{x \mapsto g(x) : x \in \mathcal{X}_d,\ g \text{ is } 1/\gamma\text{-Lipschitz}\}$.

By applying again our generic Theorem 4, we will obtain a bound where the data-complexity term turns out to be bounded by the Gaussian width of $\mathcal{X}_d$. For a set $T$ the Gaussian width (Vershynin 2018; Liaw et al. 2017) is defined as $w(T) = E_g \sup_{x \in T} \langle g, x \rangle$, where $g \sim \mathcal{N}(0, I_d)$. It is a measure of the complexity of a set (see e.g. (Vershynin 2018), sec. 7.5 and references therein). Hence we shall see from the example below that the name data-complexity that we introduced in an earlier section is quite appropriate. In this setting, Theorem 4 yields the following.

Theorem 11. Let $\mathcal{X}_d \subseteq B(0, B)$, $\mathcal{Y} = \{-1, 1\}$, and $T_N \sim \mathcal{D}^N$. For any positive integer $k \le d$, and any $\gamma > 0$, $\delta \in (0, 1)$, w.p. $1 - \delta$, uniformly for all $g \in \mathcal{G}_d$ we have:
$$E[g] \le \hat{E}_{T_N}[g] + \frac{4c}{\gamma\sqrt{k}}\, w(\mathcal{X}_d)\,\mathbb{1}(k < d) + \frac{C}{\gamma}\, B N^{-\frac{1}{k+1}} + 3\sqrt{\frac{\ln(1/\delta)}{2N}}, \quad (5)$$
where $c$ and $C$ are constants.

Eq. (5) holds for any positive integer $k$ chosen before seeing the data. For instance, if we set it to make the data-complexity term fall below some $\eta \in (0, 1)$, this is:
$$k \ge \frac{16 c^2 w^2(\mathcal{X}_d)}{\gamma^2 \eta^2}. \quad (6)$$
Then replacing this choice of $k$ in the bound of Theorem 11 yields a bound with the flavour of those obtained previously in doubling metric spaces in (Gottlieb, Kontorovich, and Krauthgamer 2016), with the squared Gaussian width taking the place of the doubling dimension. This is an interesting connection, since there is a known link between the doubling dimension and the squared Gaussian width (Indyk 2007): in a Euclidean metric space with algebraic dimension $d$ they are both of order $\Theta(d)$, but they are otherwise more general and can take fractional values. The Gaussian width is sensitive to structure embedded in Euclidean spaces, such as the existence of a sparse representation, smooth manifold structure, and so on.

Despite the above connection, there are differences. If the sample size is too small compared to the intrinsic dimension of the input space, then we might opt for a lower value of $k$ in the bound of Theorem 11 than that of eq. (6), at the expense of a larger bias. This is also practical, since $N$ is known while the Gaussian width may be unknown. As we see from the function class complexity term, the sample size needs to be exponential in $k$. Conversely, if $N$ increases then $k$ should be increased as well, in order to reduce or to eliminate (as $k$ reaches $d$) the bias.
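To make the role of the Gaussian width as an intrinsic-dimension measure concrete, the following minimal Monte Carlo sketch (the sets and dimensions are illustrative assumptions, not from the paper) estimates $w(T) = E_g \sup_{x \in T}\langle g, x\rangle$ for the unit Euclidean ball in $\mathbb{R}^d$ and for its restriction to a 5-dimensional coordinate subspace; the two widths scale like $\sqrt{d}$ and $\sqrt{5}$ respectively.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_trials = 200, 2000

# Gaussian width w(T) = E_g sup_{x in T} <g, x>, with g ~ N(0, I_d).
# For the unit ball the supremum equals ||g||; for the ball restricted to a 5-dimensional
# coordinate subspace it equals the norm of the corresponding 5 coordinates of g.
G = rng.standard_normal((n_trials, d))
w_ball = np.linalg.norm(G, axis=1).mean()          # ~ sqrt(d): full ambient complexity
w_low = np.linalg.norm(G[:, :5], axis=1).mean()    # ~ sqrt(5): benign low-complexity support
print(f"w(unit ball in R^d)            ~= {w_ball:.2f}   (sqrt(d) = {np.sqrt(d):.2f})")
print(f"w(5-dim slice of the unit ball) ~= {w_low:.2f}   (sqrt(5) = {np.sqrt(5):.2f})")
```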
Another difference is in the methodological focus. In (Kontorovich and Weiss 2015; Gottlieb, Kontorovich, and Krauthgamer 2016), bounding the error in terms of a notion of intrinsic dimension was made possible by a property of the Lipschitz class, namely that the covering numbers of the function class are upper bounded in terms of the covering numbers of the input space. By contrast, in our strategy the starting point was to exploit random projection to obtain an auxiliary class with lower complexity, and as such, the Lipschitz property of the classifier functions is not in general required for our strategy to yield bounds in terms of the complexity of the input space. Indeed, we have seen throughout the various examples in this section that the same starting point has drawn together the margin distribution and some widely used regularisation schemes in the case of parametric models, as well as the Gaussian width in the nearest neighbour example.

We note that in Theorem 11 the parameter $\gamma$ needs to be chosen before seeing the data. Alternatively, if the Gaussian width is known, and noting that the constant $C$ is specified in (Kontorovich and Weiss 2015), one can pursue SRM to tune the value of $\gamma$ on the training set by minimising the bound.

Proof of Theorem 11. We take $R \in \mathbb{R}^{k \times d}$ a random projection matrix with 0-mean, $1/k$-variance i.i.d. Gaussian entries, and will use the notations $N^+_R(x)$ and $N^-_R(x)$ for the points whose image under the random projection is the nearest positive or nearest negative point to $Rx$. For $\mathcal{G}_R$ we choose a function class of the same form as $\mathcal{G}_d$, but acting on the compressed $k$-dimensional inputs instead. The data-complexity term can be bounded as follows:
$$C_{2N,k}(\mathcal{G}_d) \le C_k(\mathcal{G}_d) = E_{T_N} \sup_{g \in \mathcal{G}_d} E_R \frac{1}{N}\sum_{n=1}^N |g_R(Rx_n, y_n) - g(x_n, y_n)|. \quad (7)$$
Using the $1/\gamma$-Lipschitz property of the loss and the form of the classifier in eq. (4), for any $(x, y)$,
$$\begin{aligned}
|g_R(Rx, y) - g(x, y)| &\le \frac{1}{2\gamma}\Big| \big(\|Rx - RN^-_R(x)\| - \|Rx - RN^+_R(x)\|\big) - \big(\|x - N^-(x)\| - \|x - N^+(x)\|\big) \Big| \\
&\le \frac{1}{2\gamma}\Big( \big|\|Rx - RN^-_R(x)\| - \|x - N^-(x)\|\big| + \big|\|Rx - RN^+_R(x)\| - \|x - N^+(x)\|\big| \Big). \quad (8)
\end{aligned}$$
Note that $\|Rx - RN^-_R(x)\| \le \|Rx - RN^-(x)\|$ and $\|x - N^-(x)\| \le \|x - N^-_R(x)\|$, hence
$$\begin{aligned}
|g_R(Rx, y) - g(x, y)| \le \frac{1}{2\gamma}\Big( &\max\big\{ \big|\|Rx - RN^-(x)\| - \|x - N^-(x)\|\big|,\ \big|\|Rx - RN^-_R(x)\| - \|x - N^-_R(x)\|\big| \big\} \\
+ &\max\big\{ \big|\|Rx - RN^+(x)\| - \|x - N^+(x)\|\big|,\ \big|\|Rx - RN^+_R(x)\| - \|x - N^+_R(x)\|\big| \big\} \Big).
\end{aligned}$$
Therefore the supremum over $g \in \mathcal{G}_d$ amounts to a supremum over points $N^+(x), N^-(x) \in \mathcal{X}_d$. So,
$$\begin{aligned}
\text{eq. (7)} &\le \frac{1}{2\gamma} E_{T_N} \sup_{N^-(\cdot)} E_R \frac{1}{N}\sum_{n=1}^N \big|\|Rx_n - RN^-(x_n)\| - \|x_n - N^-(x_n)\|\big| + \frac{1}{2\gamma} E_{T_N} \sup_{N^+(\cdot)} E_R \frac{1}{N}\sum_{n=1}^N \big|\|Rx_n - RN^+(x_n)\| - \|x_n - N^+(x_n)\|\big| \\
&\le \frac{1}{2\gamma} \sup_{x, N^-(x) \in \mathcal{X}_d} E_R \big|\|Rx - RN^-(x)\| - \|x - N^-(x)\|\big| + \frac{1}{2\gamma} \sup_{x, N^+(x) \in \mathcal{X}_d} E_R \big|\|Rx - RN^+(x)\| - \|x - N^+(x)\|\big| \\
&\le \frac{1}{2\gamma} E_R \sup_{x, N^-(x) \in \mathcal{X}_d} \big|\|Rx - RN^-(x)\| - \|x - N^-(x)\|\big| + \frac{1}{2\gamma} E_R \sup_{x, N^+(x) \in \mathcal{X}_d} \big|\|Rx - RN^+(x)\| - \|x - N^+(x)\|\big| \\
&= \frac{1}{\gamma} E_R \sup_{x, x' \in \mathcal{X}_d} \big|\|Rx - Rx'\| - \|x - x'\|\big| \quad (9) \\
&\le \frac{c}{\gamma}\, \frac{w(\mathcal{X}_d - \mathcal{X}_d)}{\sqrt{k}} \quad (10) \\
&\le \frac{2c}{\gamma}\, \frac{w(\mathcal{X}_d)}{\sqrt{k}}, \quad (11)
\end{aligned}$$
where $w(\mathcal{X}_d)$ is the Gaussian width of $\mathcal{X}_d$, and the last two steps follow from (Liaw et al. 2017).

For the function class complexity term we can use existing estimates. Note that $\mathcal{G}_R$ is a class of $1/\gamma$-Lipschitz functions on $k$-dimensional inputs. The empirical Rademacher complexity³ of the class of Lipschitz functions with a fixed Lipschitz constant has been derived in (Gottlieb, Kontorovich, and Krauthgamer 2016), which in our case takes the following form:
$$\hat{R}_N(\mathcal{G}_R) \le C\, \frac{\mathrm{diam}(R\mathcal{X}_d)}{\gamma}\, N^{-\frac{1}{k+1}}, \quad (12)$$
where $C$ is an absolute constant. To assemble the generalisation bound, we require the expected value $E_R[\hat{R}_N(\mathcal{G}_R)]$. Since $E_R\|Rx\|^2 = \|x\|^2$, and by convexity of the supremum and the square root functions, Jensen's inequality yields that
$$E_R[\mathrm{diam}(R\mathcal{X}_d)] \le 2B, \quad (13)$$
where the factor of 2 can be absorbed into the constant $C$ above, and so
$$E_R[\hat{R}_N(\mathcal{G}_R)] \le C\, \frac{B}{\gamma}\, N^{-\frac{1}{k+1}}. \quad (14)$$
Putting the pieces together completes the proof.

³ An alternative approach, pursued in (Gilad-Bachrach, Navot, and Tishby 2004; Gottlieb and Kontorovich 2014), would be to use the fat-shattering dimension of the Lipschitz class, which gives a better convergence rate once the sample size exceeds a large threshold, but is less tight in the small-sample regime.

## Conclusions

We presented an approach to reduce the dimensional dependence of error bounds in learning settings where such dependence is known to be essential in general. This is achieved by the ability of random projections to take advantage of benign low-complexity geometry, leveraged here in a purely analytic (rather than algorithmic) role. First we gave a generic uniform upper bound on the generalisation error in terms of the complexity of an auxiliary function class, and a new complexity term that quantifies benign geometry in a problem-dependent manner.
We then instantiated this to parametric linear and nonlinear models, as well as a simple non-parametric model. If there is benign geometry present, then the bounds become tighter; otherwise they recover existing bounds. It is also possible in principle to use these results in conjunction with the classical technique of structural risk minimisation to convert them into regularised estimators. Future work will extend this framework to other learning settings, to find out what other geometric structures are benign for high-dimensional learning.

## Acknowledgements

We thank Henry Reeve and Vinesh Solanki for useful discussions. This work is funded by EPSRC under Fellowship grant EP/P004245/1, and a Turing Fellowship (grant EP/N510129/1).

## References

Anthony, M., and Bartlett, P. L. 1999. Neural Network Learning: Theoretical Foundations. Cambridge University Press.

Bartlett, P. L., and Mendelson, S. 2002. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research 3:463-482.

Dasgupta, S., and Gupta, A. 2003. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22(1):60-65.

Garg, A.; Har-Peled, S.; and Roth, D. 2002. On generalization bounds, projection profile, and margin distribution. In International Conference on Machine Learning (ICML), 171-178.

Gilad-Bachrach, R.; Navot, A.; and Tishby, N. 2004. Margin based feature selection: Theory and algorithms. In International Conference on Machine Learning (ICML).

Gottlieb, L.-A., and Kontorovich, A. 2014. Efficient classification for metric data. IEEE Transactions on Information Theory 60(9):5750-5759.

Gottlieb, L.-A.; Kontorovich, A.; and Krauthgamer, R. 2016. Adaptive metric dimensionality reduction. Theoretical Computer Science 620(21):105-118.

Indyk, P. 2007. Nearest-neighbor-preserving embeddings. ACM Transactions on Algorithms 3:3.

Kabán, A., and Durrant, R. J. 2017. Structure-aware error bounds for linear classification with the zero-one loss. arXiv preprint arXiv:1709.09782.

Kabán, A. 2014. New bounds on compressed linear least squares regression. In International Conference on Artificial Intelligence and Statistics (AISTATS), 448-456.

Kontorovich, A., and Weiss, R. 2014. Maximum margin multiclass nearest neighbors. In Proceedings of the 31st International Conference on Machine Learning (ICML), 892-900.

Kontorovich, A., and Weiss, R. 2015. A Bayes consistent 1-NN classifier. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Liaw, C.; Mehrabian, A.; Plan, Y.; and Vershynin, R. 2017. A simple tool for bounding the deviation of random matrices on geometric sets. Geometric Aspects of Functional Analysis, 277-299.

Mendelson, S. 2003. A few notes on statistical learning theory. In Mendelson, S., and Smola, A. J., eds., Advanced Lectures in Machine Learning, vol. 2600 of Lecture Notes in Computer Science. Springer-Verlag, Berlin. 1-40.

Mohri, M.; Rostamizadeh, A.; and Talwalkar, A. 2012. Foundations of Machine Learning. MIT Press.

Vapnik, V. N. 1998. Statistical Learning Theory. New York: Wiley-Interscience.

Verma, N., and Branson, K. 2015. Sample complexity of learning Mahalanobis distance metrics. Advances in Neural Information Processing Systems (NIPS) 28:2584-2592.

Vershynin, R. 2018.
High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.