# Nonparametric Online Regression while Learning the Metric

Ilja Kuzborskij, EPFL, Switzerland (ilja.kuzborskij@gmail.com)
Nicolò Cesa-Bianchi, Dipartimento di Informatica, Università degli Studi di Milano, Milano 20135, Italy (nicolo.cesa-bianchi@unimi.it)

We study algorithms for online nonparametric regression that learn the directions along which the regression function is smoother. Our algorithm learns the Mahalanobis metric based on the gradient outer product matrix $G$ of the regression function (automatically adapting to the effective rank of this matrix), while simultaneously bounding the regret on the same data sequence in terms of the spectrum of $G$. As a preliminary step in our analysis, we extend a nonparametric online learning algorithm by Hazan and Megiddo, enabling it to compete against functions whose Lipschitzness is measured with respect to an arbitrary Mahalanobis metric.

## 1 Introduction

An online learner is an agent interacting with an unknown and arbitrary environment over a sequence of rounds. At each round $t$, the learner observes a data point (or instance) $x_t \in \mathcal{X} \subset \mathbb{R}^d$, outputs a prediction $\hat{y}_t$ for the label $y_t \in \mathbb{R}$ associated with that instance, and incurs some loss $\ell_t(\hat{y}_t)$, which in this paper is the square loss $(\hat{y}_t - y_t)^2$. At the end of the round, the label $y_t$ is given to the learner, who can use it to reduce the loss in subsequent rounds. The performance of an online learner is typically measured using the regret. This is defined as the amount by which the learner's cumulative loss exceeds the cumulative loss (on the same sequence of instances and labels) of any function $f$ in a given reference class $\mathcal{F}$ of functions,
$$R_T(f) = \sum_{t=1}^{T} \ell_t(\hat{y}_t) - \sum_{t=1}^{T} \ell_t\big(f(x_t)\big), \qquad f \in \mathcal{F}. \tag{1}$$
Note that typical regret bounds apply to all $f \in \mathcal{F}$ and to all individual data sequences. However, the bounds are allowed to scale with parameters arising from the interplay between $f$ and the data sequence.

In order to capture complex environments, the reference class of functions should be large. In this work we focus on nonparametric classes $\mathcal{F}$, containing all differentiable functions that are smooth with respect to some metric on $\mathcal{X}$. Our approach builds on the simple and versatile algorithm for nonparametric online learning introduced in [6]. This algorithm has a bound on the regret $R_T(f)$ of order (ignoring logarithmic factors)
$$\left(1 + \sqrt{\sum_{i=1}^{d} \|\nabla_i f\|_\infty^2}\right) T^{\frac{d}{1+d}}, \qquad f \in \mathcal{F}. \tag{2}$$
Here $\|\nabla_i f\|_\infty$ is the value of the partial derivative $\partial f(x)/\partial x_i$ maximized over $x \in \mathcal{X}$. The square root term is the Lipschitz constant of $f$, measuring smoothness with respect to the Euclidean metric.

However, in some directions $f$ may be smoother than in others. Therefore, if we knew in advance the set of directions along which the best performing reference functions $f$ are smooth, we could use this information to control regret better. In this paper we extend the algorithm from [6] and make it adaptive to the Mahalanobis distance defined through an arbitrary positive definite matrix $M$ with spectrum $\{(u_i, \lambda_i)\}_{i=1}^{d}$ and unit spectral radius ($\lambda_1 = 1$). We prove a bound on the regret $R_T(f)$ of order (ignoring logarithmic factors)
$$\left(1 + \sqrt{\det{}_\kappa(M) \sum_{i=1}^{d} \frac{\|\nabla_{u_i} f\|_\infty^2}{\lambda_i}}\right) T^{\frac{\rho_T}{1+\rho_T}}, \qquad f \in \mathcal{F}. \tag{3}$$
Here $\rho_T \le d$ is, roughly, the number of eigenvalues of $M$ larger than a threshold shrinking polynomially in $T$, and $\det_\kappa(M) \le 1$ is the determinant of $M$ truncated at $\lambda_\kappa$ (with $\kappa \le \rho_T$). The quantity $\|\nabla_{u_i} f\|_\infty$ is defined like $\|\nabla_i f\|_\infty$, but with the directional derivative $\nabla f(x)^\top u_i$ instead of the partial derivative.
When the spectrum of $M$ is light-tailed (so that $\rho_T \ll d$ and, simultaneously, $\det_\kappa(M) \ll 1$), with the smaller eigenvalues $\lambda_i$ corresponding to eigenvectors along which $f$ is smoother (so that the ratios $\|\nabla_{u_i} f\|_\infty^2 / \lambda_i$ remain controlled), our bound improves on (2). On the other hand, when no preliminary knowledge about good $f$ is available, we may run the algorithm with $M$ equal to the identity matrix and recover exactly the bound (2).

Given that the regret can be improved by informed choices of $M$, it is natural to ask whether some kind of improvement is still possible when $M$ is learned online, from the same data sequence on which the regret is being measured. Of course, this question makes sense only if the data tell us something about the smoothness of the $f$ against which we are measuring the regret. In the second part of the paper we implement this idea by considering a scenario where instances are drawn i.i.d. from some unknown distribution, labels are stochastically generated by some unknown regression function $f_0$, and we have no preliminary knowledge about the directions along which $f_0$ is smoother. In this stochastic scenario, the expected gradient outer product matrix $G = \mathbb{E}\big[\nabla f_0(X)\, \nabla f_0(X)^\top\big]$ provides a natural choice for the matrix $M$ in our algorithm. Indeed, $\mathbb{E}\big[\big(\nabla f_0(X)^\top u_i\big)^2\big] = \mu_i$, where $u_1, \dots, u_d$ are the eigenvectors of $G$ and $\mu_1, \dots, \mu_d$ are the corresponding eigenvalues. Thus, the eigenvectors $u_1, \dots, u_d$ capture the principal directions of variation of $f_0$. In fact, assuming that the labels obey a statistical model $Y = g(BX) + \varepsilon$, where $\varepsilon$ is the noise and $B \in \mathbb{R}^{k \times d}$ projects $X$ onto a $k$-dimensional subspace of $\mathcal{X}$, one can show [21] that $\mathrm{span}(B) = \mathrm{span}(u_1, \dots, u_k)$. In this sense, $G$ is the best metric, because it recovers the $k$-dimensional relevant subspace.
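As a toy illustration of this point, consider a single-index model $f_0(x) = \sin(Bx)$ with one relevant direction. The short sketch below (our own choice of function and projection, using a plain Monte Carlo average with analytic gradients rather than the estimator of [14] used later) computes $G$ numerically and checks that it has one dominant eigenvalue whose eigenvector aligns with $B$.

```python
# Toy illustration: for f0(x) = sin(Bx) with a single-row projection B, the expected
# gradient outer product G = E[grad f0(X) grad f0(X)^T] has rank one, and its leading
# eigenvector spans the row space of B.
import numpy as np

rng = np.random.default_rng(0)
d = 5
B = rng.standard_normal((1, d))                 # hypothetical 1-dimensional relevant subspace
B /= np.linalg.norm(B)

def grad_f0(x):
    # f0(x) = sin(Bx), hence grad f0(x) = cos(Bx) * B^T by the chain rule
    return np.cos(B @ x) * B.ravel()

# Instances projected into the unit ball B(1)
X = rng.standard_normal((10000, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))

# Monte Carlo estimate of G and its eigendecomposition
G = np.mean([np.outer(grad_f0(x), grad_f0(x)) for x in X], axis=0)
mu, U = np.linalg.eigh(G)                       # eigenvalues in ascending order

print("eigenvalues of G:", np.round(mu[::-1], 4))                  # one dominant eigenvalue
print("alignment of top eigenvector with B:", abs(U[:, -1] @ B.ravel()))  # close to 1
```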
When $G$ is unknown, we run our algorithm in phases, using a recently proposed estimator $\widehat{G}$ of $G$. The estimator is trained on the same data sequence and is fed to the algorithm at the beginning of each phase. Under mild assumptions on $f_0$, the noise in the labels, and the instance distribution, we prove a high-probability bound on the regret $R_T(f_0)$ of order (ignoring logarithmic factors)
$$\left(1 + \sqrt{\sum_{j=1}^{d} \big(\|\nabla_{u_j} f_0\|_\infty + \|\nabla_{\overline{V}} f_0\|_\infty\big)^2}\right) T^{\frac{\widetilde{\rho}_T}{1+\widetilde{\rho}_T}}. \tag{4}$$
Observe that the rate at which the regret grows is the same as in (3), though now the effective dimension parameter $\widetilde{\rho}_T$ is larger than $\rho_T$ by an amount related to the rate of convergence of the eigenvalues of $\widehat{G}$ to those of $G$. The square root term is also similar to (3), but for the extra quantity $\|\nabla_{\overline{V}} f_0\|_\infty$, which accounts for the error in approximating the eigenvectors of $G$. More precisely, $\|\nabla_{\overline{V}} f_0\|_\infty$ is $\|\nabla_v f_0\|_\infty$ maximized over directions $v$ in the span of $\overline{V}$, where $\overline{V}$ contains those eigenvectors of $G$ that cannot be identified because their eigenvalues are too close to each other (we come back to this issue shortly). Finally, we lose the dependence on the truncated determinant, which is replaced here by its trivial upper bound 1.

The proof of (2) in [6] is based on the sequential construction of a sphere packing of $\mathcal{X}$, where the spheres are centered on adaptively chosen instances $x_t$ and have radii shrinking polynomially with time. Each sphere hosts an online learner, and each new incoming instance is predicted using the learner hosted in the nearest sphere. Our variant of that algorithm uses an ellipsoid packing and computes distances using the Mahalanobis distance $\|\cdot\|_M$. The main new ingredient in the analysis leading to (3) is our notion of effective dimension $\rho_T$ (we call it the effective rank of $M$), which measures how fast the spectrum of $M$ vanishes. The proof also uses an ellipsoid packing bound and a lemma relating the Lipschitz constant to the Mahalanobis distance.

The proof of (4) is more intricate because $G$ is only known up to a certain approximation. We use an estimator $\widehat{G}$, recently proposed in [14], which is consistent under mild distributional assumptions when $f_0$ is continuously differentiable. The first source of difficulty is adjusting the notion of effective rank (which the algorithm needs to compute) to compensate for the uncertainty in the knowledge of the eigenvalues of $G$. A further problematic issue arises because we want to measure the smoothness of $f_0$ along the eigendirections of $G$, and so we need to control the convergence of the eigenvectors, given that $\widehat{G}$ converges to $G$ in spectral norm. However, when two eigenvalues of $G$ are close, the corresponding eigenvectors of the estimated matrix $\widehat{G}$ are strongly affected by the stochastic perturbation (a phenomenon known as hybridization or spectral leaking in matrix perturbation theory; see [1, Section 2]). Hence, in our analysis we need to separate out the eigenvectors that correspond to well-spaced eigenvalues from the others. This lack of discrimination causes the appearance of the extra term $\|\nabla_{\overline{V}} f_0\|_\infty$ in the regret.

## 2 Related works

Nonparametric estimation problems have been a long-standing topic of study in statistics, where one is concerned with the recovery of an optimal function from a rich class under appropriate probabilistic assumptions. In online learning, the nonparametric approach was investigated in [15, 16, 17] by Vovk, who considered regression problems in large spaces and proved bounds on the regret. Minimax rates for the regret were later derived in [13] using a non-constructive approach. The first explicit online nonparametric algorithms for regression with minimax rates were obtained in [4].

The nonparametric online algorithm of [6] is known to have a suboptimal regret bound for Lipschitz classes of functions. However, it is a simple and efficient algorithm, well suited to the design of extensions that exhibit different forms of adaptivity to the data sequence. For example, the paper [9] derived a variant that automatically adapts to the intrinsic dimension of the data manifold. Our work explores an alternative direction of adaptivity, mainly aimed at taming the effect of the curse of dimensionality in nonparametric prediction through the learning of an appropriate Mahalanobis distance on the instance space.

There is a rich literature on metric learning (see, e.g., the survey [2]), where the Mahalanobis metric $M$ is typically learned through minimization of a pairwise loss function of the form $\ell(M, x, x')$. This loss is high whenever dissimilar pairs $x$ and $x'$ are close in the Mahalanobis metric, and whenever similar ones are far apart in the same metric; see, e.g., [19]. The works [5, 7, 18] analyzed generalization and consistency properties of online learning algorithms employing pairwise losses. In this work we are primarily interested in using a metric $M$ that is close to the gradient outer product matrix of the best model in the reference class of functions. As we are not aware whether pairwise loss functions can consistently recover such metrics, we directly estimate the gradient outer product matrix. This approach to metric learning was mostly explored in statistics, e.g., by locally-linear Empirical Risk Minimization on RKHS [12, 11], and through Stochastic Gradient Descent [3].
Our learning approach combines, in a phased manner, a Mahalanobis metric extension of the algorithm by [6] with the estimator of [14]. Our work is also similar in spirit to the gradient weights approach of [8], which learns a distance based on a simpler diagonal matrix.

Preliminaries and notation. Let $B(z, r) \subset \mathbb{R}^d$ be the ball of center $z$ and radius $r > 0$, and let $B(r) = B(0, r)$. We assume instances $x$ belong to $\mathcal{X} \subseteq B(1)$ and labels $y$ belong to $\mathcal{Y} \subseteq [0, 1]$. We consider the following online learning protocol with oblivious adversary. Given an unknown sequence $(x_1, y_1), (x_2, y_2), \dots \in \mathcal{X} \times \mathcal{Y}$ of instances and labels, for every round $t = 1, 2, \dots$:

1. the environment reveals the instance $x_t \in \mathcal{X}$;
2. the learner selects an action $\hat{y}_t \in \mathcal{Y}$ and incurs the square loss $\ell_t(\hat{y}_t) = (\hat{y}_t - y_t)^2$;
3. the learner observes $y_t$.

Given a positive definite $d \times d$ matrix $M$, the norm $\|x - z\|_M$ induced by $M$ (a.k.a. Mahalanobis distance) is defined by $\sqrt{(x - z)^\top M (x - z)}$.

Definition 1 (Covering and Packing Numbers). An $\varepsilon$-cover of a set $S$ w.r.t. some metric $\rho$ is a set $\{x'_1, \dots, x'_n\} \subseteq S$ such that for each $x \in S$ there exists $i \in \{1, \dots, n\}$ such that $\rho(x, x'_i) \le \varepsilon$. The covering number $\mathcal{N}(S, \varepsilon, \rho)$ is the smallest cardinality of an $\varepsilon$-cover. An $\varepsilon$-packing of a set $S$ w.r.t. some metric $\rho$ is a set $\{x'_1, \dots, x'_m\} \subseteq S$ such that for any distinct $i, j \in \{1, \dots, m\}$ we have $\rho(x'_i, x'_j) > \varepsilon$. The packing number $\mathcal{M}(S, \varepsilon, \rho)$ is the largest cardinality of an $\varepsilon$-packing. It is well known that $\mathcal{M}(S, 2\varepsilon, \rho) \le \mathcal{N}(S, \varepsilon, \rho) \le \mathcal{M}(S, \varepsilon, \rho)$.

For all differentiable $f : \mathcal{X} \to \mathcal{Y}$ and for any orthonormal system $V = \{u_1, \dots, u_k\}$ with $k \le d$, we define
$$\|\nabla_V f\|_\infty = \max_{\substack{v \in \mathrm{span}(V) \\ \|v\| = 1}} \; \sup_{x \in \mathcal{X}} \nabla f(x)^\top v.$$
If $V = \{u\}$ we simply write $\|\nabla_u f\|_\infty$. In the following, $M$ is a positive definite $d \times d$ matrix with eigenvalues $\lambda_1 \ge \dots \ge \lambda_d > 0$ and eigenvectors $u_1, \dots, u_d$. For each $k = 1, \dots, d$ the truncated determinant is $\det_k(M) = \lambda_1 \cdots \lambda_k$.

Figure 1: A quickly decreasing spectrum of $M$ implies slow growth of its effective rank in $t$ (left panel: eigenvalues of $M$; right panel: effective rank of $M$ as a function of $t$).

The kappa function for the matrix $M$ is defined by
$$\kappa(r, t) = \max\Big\{ m : \lambda_m \ge t^{-\frac{2}{1+r}}, \; m = 1, \dots, d \Big\} \tag{5}$$
for $t \ge 1$ and $r = 1, \dots, d$. Note that $\kappa(r + 1, t) \le \kappa(r, t)$. Now define the effective rank of $M$ at horizon $t$ by
$$\rho_t = \min\big\{ r : \kappa(r, t) \le r, \; r = 1, \dots, d \big\}. \tag{6}$$
Since $\kappa(d, t) \le d$ for all $t \ge 1$, this is a well defined quantity. Note that $\rho_1 \le \rho_2 \le \dots \le d$. Also, $\rho_t = d$ for all $t \ge 1$ when $M$ is the $d \times d$ identity matrix. The effective rank $\rho_t$ measures the number of eigenvalues that are larger than a threshold that shrinks with $t$. Hence matrices $M$ with extremely light-tailed spectra cause $\rho_t$ to remain small even when $t$ grows large. This behaviour is shown in Figure 1. Throughout the paper, we write $f \overset{O}{=} g$ and $f \overset{\widetilde{O}}{=} g$ to denote, respectively, $f = O(g)$ and $f = \widetilde{O}(g)$.
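The definitions (5) and (6) translate directly into a few lines of code. The sketch below is our own illustration (the example spectrum is hypothetical): it computes the kappa function and the effective rank from a nonincreasing vector of eigenvalues, and shows that a light-tailed spectrum keeps the effective rank small while the identity matrix always has effective rank $d$, as discussed above.

```python
# Minimal sketch following definitions (5) and (6): kappa function and effective rank
# of M, given its eigenvalues lam sorted in nonincreasing order with lam[0] = 1.
import numpy as np

def kappa(lam, r, t):
    # kappa(r, t) = max { m : lambda_m >= t^(-2/(1+r)) }; for a nonincreasing spectrum
    # this is simply the number of eigenvalues above the threshold.
    return int(np.sum(lam >= t ** (-2.0 / (1.0 + r))))

def effective_rank(lam, t):
    # rho_t = min { r : kappa(r, t) <= r }, well defined because kappa(d, t) <= d
    d = len(lam)
    return next(r for r in range(1, d + 1) if kappa(lam, r, t) <= r)

# A light-tailed (hypothetical) spectrum vs. the identity matrix
lam_light = np.array([1.0, 0.5, 0.1, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9])
for t in (10, 1000, 100000):
    print(t, effective_rank(lam_light, t), effective_rank(np.ones(10), t))
```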
## 3 Online nonparametric learning with ellipsoid packing

In this section we present a variant (Algorithm 1) of the online nonparametric regression algorithm introduced in [6]. Since our analysis is invariant to rescalings of the matrix $M$, without loss of generality we assume $M$ has unit spectral radius (i.e., $\lambda_1 = 1$). Algorithm 1 sequentially constructs a packing of $\mathcal{X}$ using $M$-ellipsoids centered on a subset of the past observed instances. At each step $t$, the label of the current instance $x_t$ is predicted using the average $\hat{y}_t$ of the labels of past instances that fell inside the ellipsoid whose center $x_s$ is closest to $x_t$ in the Mahalanobis metric. At the end of the step, if $x_t$ was outside of the closest ellipsoid, then a new ellipsoid is created with center $x_t$. The radii $\varepsilon_t$ of all ellipsoids are shrunk at rate $t^{-1/(1+\rho_t)}$. Note that efficient (i.e., logarithmic in the number of centers) implementations of approximate nearest-neighbor search for the active center $x_s$ exist [10].

Algorithm 1 (Nonparametric online regression)
Input: positive definite $d \times d$ matrix $M$.
1: $S \leftarrow \emptyset$  (centers)
2: for $t = 1, 2, \dots$ do
3: &nbsp;&nbsp; $\varepsilon_t \leftarrow t^{-\frac{1}{1+\rho_t}}$  (update radius)
4: &nbsp;&nbsp; observe $x_t$
5: &nbsp;&nbsp; if $S = \emptyset$ then
6: &nbsp;&nbsp;&nbsp;&nbsp; $S \leftarrow \{t\}$, $T_t \leftarrow \emptyset$  (create initial ball)
7: &nbsp;&nbsp; end if
8: &nbsp;&nbsp; $s \leftarrow \arg\min_{s \in S} \|x_t - x_s\|_M$  (find active center)
9: &nbsp;&nbsp; if $T_s = \emptyset$ then
10: &nbsp;&nbsp;&nbsp;&nbsp; $\hat{y}_t \leftarrow \tfrac{1}{2}$
11: &nbsp;&nbsp; else
12: &nbsp;&nbsp;&nbsp;&nbsp; $\hat{y}_t \leftarrow \frac{1}{|T_s|} \sum_{t' \in T_s} y_{t'}$  (predict using active center)
13: &nbsp;&nbsp; end if
14: &nbsp;&nbsp; observe $y_t$
15: &nbsp;&nbsp; if $\|x_t - x_s\|_M \le \varepsilon_t$ then
16: &nbsp;&nbsp;&nbsp;&nbsp; $T_s \leftarrow T_s \cup \{t\}$  (update list for active center)
17: &nbsp;&nbsp; else
18: &nbsp;&nbsp;&nbsp;&nbsp; $S \leftarrow S \cup \{t\}$, $T_t \leftarrow \emptyset$  (create new center)
19: &nbsp;&nbsp; end if
20: end for

The core idea of the proof (deferred to the supplementary material) is to maintain a trade-off between the regret contribution of the ellipsoids and an additional regret term due to the approximation of $f$ by the Voronoi partitioning. The regret contribution of each ellipsoid is logarithmic in the number of predictions made. Since each instance is predicted by a single ellipsoid, if we ignore log factors the overall regret contribution is equal to the number of ellipsoids, which is essentially controlled by the packing number w.r.t. the metric defined by $M$. The second regret term is due to the fact that at any point in time the prediction of the algorithm is constant within the Voronoi cells of $\mathcal{X}$ induced by the current centers (recall that we predict with nearest neighbor). Hence, we pay an extra term equal to the radius of the ellipsoids times the Lipschitz constant, which depends on the directional Lipschitzness of $f$ with respect to the eigenbasis of $M$.

Theorem 1 (Regret with Fixed Metric). Suppose Algorithm 1 is run with a positive definite matrix $M$ with eigenbasis $u_1, \dots, u_d$ and eigenvalues $1 = \lambda_1 \ge \dots \ge \lambda_d > 0$. Then, for any differentiable $f : \mathcal{X} \to \mathcal{Y}$ we have that
$$R_T(f) \;\overset{\widetilde{O}}{=}\; \left(1 + \sqrt{\det{}_\kappa(M) \sum_{i=1}^{d} \frac{\|\nabla_{u_i} f\|_\infty^2}{\lambda_i}}\right) T^{\frac{\rho_T}{1+\rho_T}},$$
where $\kappa = \kappa(\rho_T, T) \le \rho_T \le d$.

We first prove two technical lemmas about packings of ellipsoids.

Lemma 1 (Volumetric packing bound). Consider a pair of norms $\|\cdot\|$, $\|\cdot\|'$ and let $B, B' \subseteq \mathbb{R}^d$ be the corresponding unit balls. Then
$$\mathcal{M}\big(B, \varepsilon, \|\cdot\|'\big) \le \frac{\mathrm{vol}\big(B + \frac{\varepsilon}{2} B'\big)}{\mathrm{vol}\big(\frac{\varepsilon}{2} B'\big)}.$$

Lemma 2 (Ellipsoid packing bound). If $B$ is the unit Euclidean ball, then
$$\mathcal{M}\big(B, \varepsilon, \|\cdot\|_M\big) \;\overset{O}{=}\; \Big(\frac{1}{\varepsilon}\Big)^{s} \sqrt{\lambda_1 \cdots \lambda_s}, \qquad \text{where } s = \max\big\{ i : \sqrt{\lambda_i} \ge \varepsilon, \; i = 1, \dots, d \big\}.$$

The following lemma states that whenever $f$ has bounded derivatives with respect to the eigenbasis of $M$, then $f$ is Lipschitz with respect to $\|\cdot\|_M$.

Lemma 3 (Bounded derivatives imply Lipschitzness in the $M$-metric). Let $f : \mathcal{X} \to \mathbb{R}$ be everywhere differentiable. Then for any $x, x' \in \mathcal{X}$,
$$\big|f(x) - f(x')\big| \le \|x - x'\|_M \sqrt{\sum_{i=1}^{d} \frac{\|\nabla_{u_i} f\|_\infty^2}{\lambda_i}}.$$
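The following Python sketch is our own compact rendering of Algorithm 1 and follows the listing above: a brute-force nearest-center search stands in for the approximate nearest-neighbor structure of [10], the default prediction when the active ball holds no labels is $1/2$, and the `effective_rank` helper follows definition (6). It is meant as an illustration of the construction, not a tuned implementation.

```python
# Sketch of Algorithm 1 (nonparametric online regression with M-ellipsoid packing).
import numpy as np

def effective_rank(lam, t):
    # rho_t from definition (6): smallest r with kappa(r, t) <= r
    d = len(lam)
    return next(r for r in range(1, d + 1)
                if np.sum(lam >= t ** (-2.0 / (1.0 + r))) <= r)

class NonparametricOnlineRegression:
    def __init__(self, M):
        self.M = M / np.linalg.norm(M, 2)                 # normalize to unit spectral radius
        self.lam = np.sort(np.linalg.eigvalsh(self.M))[::-1]
        self.centers, self.labels = [], []                # centers x_s and label lists T_s

    def _dist(self, x, z):
        diff = x - z
        return float(np.sqrt(diff @ self.M @ diff))       # Mahalanobis distance ||x - z||_M

    def predict(self, x, t):
        eps_t = t ** (-1.0 / (1.0 + effective_rank(self.lam, t)))   # shrinking radius
        if not self.centers:                              # create initial ball (lines 5-6)
            self.centers.append(x); self.labels.append([])
        s = min(range(len(self.centers)), key=lambda i: self._dist(x, self.centers[i]))
        y_hat = 0.5 if not self.labels[s] else float(np.mean(self.labels[s]))
        return y_hat, s, eps_t

    def update(self, x, y, s, eps_t):
        if self._dist(x, self.centers[s]) <= eps_t:
            self.labels[s].append(y)                      # update list for active center
        else:
            self.centers.append(x); self.labels.append([])  # open a new ball centered at x_t

# Usage on a synthetic stream: the label depends only on the first coordinate, and M
# down-weights the remaining directions (all choices below are hypothetical).
rng = np.random.default_rng(1)
learner = NonparametricOnlineRegression(np.diag([1.0, 0.05, 0.05, 0.05]))
for t in range(1, 2001):
    x = rng.uniform(-1, 1, size=4) / 2.0                  # instances inside B(1)
    y = 0.5 + 0.4 * np.sin(3 * x[0])                      # labels in [0, 1], smooth along e_1
    y_hat, s, eps_t = learner.predict(x, t)
    learner.update(x, y, s, eps_t)
```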
## 4 Learning while learning the metric

In this section, we assume instances $x_t$ are realizations of i.i.d. random variables $X_t$ drawn according to some fixed and unknown distribution $\mu$ which has a continuous density on its support $\mathcal{X}$. We also assume labels $y_t$ are generated according to the noise model $y_t = f_0(x_t) + \nu(x_t)$, where $f_0$ is some unknown regression function and $\nu(x)$ is a subgaussian zero-mean random variable for all $x \in \mathcal{X}$. We then simply write $R_T$ to denote the regret $R_T(f_0)$. Note that $R_T$ is now a random variable, which we bound with high probability. We now show how the nonparametric online learning algorithm (Algorithm 1) of Section 3 can be combined with an algorithm that learns an estimate
$$\widehat{G}_n = \frac{1}{n} \sum_{t=1}^{n} \widehat{\nabla} f_0(x_t)\, \widehat{\nabla} f_0(x_t)^\top \tag{7}$$
of the expected gradient outer product matrix $G = \mathbb{E}\big[\nabla f_0(X)\, \nabla f_0(X)^\top\big]$. The estimator (described in the supplementary material) is consistent under the following assumptions. Let $\mathcal{X}(\tau)$ be $\mathcal{X}$ blown up by a factor of $1 + \tau$.

Assumption 1.
1. There exists $\tau_0 > 0$ such that $f_0$ is continuously differentiable on $\mathcal{X}(\tau_0)$.
2. There exists $G > 0$ such that $\max_{x \in \mathcal{X}(\tau_0)} \|\nabla f_0(x)\| \le G$.
3. The distribution $\mu$ is full-dimensional: there exists $C_\mu > 0$ such that for all $x \in \mathcal{X}$ and $\varepsilon > 0$, $\mu\big(B(x, \varepsilon)\big) \ge C_\mu \varepsilon^d$.

In particular, the next lemma states that, under Assumption 1, $\widehat{G}_n$ is a consistent estimate of $G$.

Lemma 4 ([14, Theorem 1]). If Assumption 1 holds, then there exists a nonnegative and nonincreasing sequence $\{\bar{\gamma}_n\}_{n \ge 1}$ such that for all $n$ the estimated gradient outer product (7), computed with parameters $\varepsilon_n > 0$ and $0 < \tau_n < \tau_0$, satisfies $\big\|\widehat{G}_n - G\big\|_2 \le \bar{\gamma}_n$ with high probability with respect to the random draw of $X_1, \dots, X_n$. Moreover, if $\tau_n = \Theta\big(\varepsilon_n^{1/4}\big)$, $\varepsilon_n = \Omega\big((\ln n / n)^{1/2}\big)$, and $\varepsilon_n = O\big(n^{-\frac{1}{2(d+1)}}\big)$, then $\bar{\gamma}_n \to 0$ as $n \to \infty$.

Our algorithm works in phases $i = 1, 2, \dots$, where phase $i$ has length $n(i) = 2^i$. Let $T(i) = 2^{i+1} - 2$ be the index of the last time step in phase $i$. The algorithm uses a nonincreasing regularization sequence $\gamma_0 \ge \gamma_1 \ge \dots > 0$. Let $\widehat{M}(0) = \gamma_0 I$. During each phase $i$, the algorithm predicts the data points by running Algorithm 1 with $M = \widehat{M}(i-1) \big/ \big\|\widehat{M}(i-1)\big\|_2$ (where $\|\cdot\|_2$ denotes the spectral norm). Simultaneously, the gradient outer product estimate (7) is trained over the same data points. At the end of phase $i$, the current gradient outer product estimate $\widehat{G}(i) = \widehat{G}_{T(i)}$ is used to form a new matrix $\widehat{M}(i) = \widehat{G}(i) + \gamma_{T(i)} I$. Algorithm 1 is then restarted in phase $i + 1$ with $M = \widehat{M}(i) \big/ \big\|\widehat{M}(i)\big\|_2$. Note that the metric learning algorithm can also be implemented efficiently through nearest-neighbor search, as explained in [14].

Let $\mu_1 \ge \mu_2 \ge \dots \ge \mu_d$ be the eigenvalues and $u_1, \dots, u_d$ be the eigenvectors of $G$. We define the $j$-th eigenvalue separation $\Delta_j$ by
$$\Delta_j = \min_{k \ne j} |\mu_j - \mu_k|.$$
For any $\Delta > 0$ define also $V_\Delta = \big\{ u_j : |\mu_j - \mu_k| \ge \Delta, \; \forall k \ne j \big\}$ and $\overline{V}_\Delta = \{u_1, \dots, u_d\} \setminus V_\Delta$. Our results are expressed in terms of the effective rank (6) of $G$ at horizon $T$. However, in order to account for the error in estimating the eigenvalues of $G$, we define the effective rank $\widetilde{\rho}_t$ with respect to the following slight variant of the kappa function,
$$\widetilde{\kappa}(r, t) = \max\Big\{ m : \mu_m + 2\gamma_t \ge \mu_1 t^{-\frac{2}{1+r}}, \; m = 1, \dots, d \Big\}$$
for $t \ge 1$ and $r = 1, \dots, d$. Let $\widehat{M}(i)$ be the estimated gradient outer product matrix constructed at the end of phase $i$, and let $\hat{\mu}_1(i) + \gamma(i) \ge \dots \ge \hat{\mu}_d(i) + \gamma(i)$ and $\hat{u}_1(i), \dots, \hat{u}_d(i)$ be the eigenvalues and eigenvectors of $\widehat{M}(i)$, where we write $\gamma(i)$ and $\bar{\gamma}(i)$ to denote $\gamma_{T(i)}$ and $\bar{\gamma}_{T(i)}$. We use $\hat{\kappa}$ to denote the kappa function with estimated eigenvalues and $\hat{\rho}$ to denote the effective rank defined through $\hat{\kappa}$. We start with a technical lemma.

Lemma 5. Let $\mu_d, \alpha > 0$ and $d \ge 1$. Then the derivative of $F(t) = \big(\mu_d + 2(T_0 + t)^{-\alpha}\big)\, t^{\frac{2}{1+d}}$ is positive for all $t \ge 1$ when $T_0 \ge \Big(\frac{d+1}{2\mu_d}\Big)^{1/\alpha}$.

Proof. We have that $F'(t) \ge 0$ if and only if
$$t \le \frac{2(T_0 + t)}{\alpha(d+1)} \Big(1 + (T_0 + t)^\alpha \mu_d\Big).$$
This is implied by $t \le \frac{2\mu_d (T_0 + t)^{1+\alpha}}{\alpha(d+1)}$ or, equivalently, $T_0 \ge A^{1/(1+\alpha)} t^{1/(1+\alpha)} - t$, where $A = \alpha(d+1)/(2\mu_d)$. The right-hand side $A^{1/(1+\alpha)} t^{1/(1+\alpha)} - t$ is a concave function of $t$. Hence its maximum is attained at the value of $t$ where the derivative is zero; this value satisfies $\frac{A^{1/(1+\alpha)}}{1+\alpha}\, t^{-\alpha/(1+\alpha)} = 1$, which solved for $t$ gives $t = A^{1/\alpha} (1+\alpha)^{-(1+\alpha)/\alpha}$. Substituting this value of $t$ in $A^{1/(1+\alpha)} t^{1/(1+\alpha)} - t$ gives the condition $T_0 \ge A^{1/\alpha} \alpha (1+\alpha)^{-(1+\alpha)/\alpha}$, which is satisfied when $T_0 \ge \big(\frac{d+1}{2\mu_d}\big)^{1/\alpha}$.
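Before stating the main regret bound, here is a schematic sketch of the phased procedure just described. It is our own rendering: it reuses the `NonparametricOnlineRegression` class sketched in Section 3, abstracts the gradient outer product estimator (7) of [14] as a user-supplied callable `estimate_G`, and restarts Algorithm 1 (including its step counter) at the beginning of each phase, which is one possible reading of the restart.

```python
# Schematic sketch of the phased procedure of Section 4 (assumptions as stated above).
import numpy as np

def phased_metric_learning(stream, d, gamma, estimate_G, gamma0=1.0, n_phases=8):
    """stream yields (x_t, y_t); gamma(t) is the regularization sequence, e.g. t**(-alpha);
    estimate_G(data) stands in for the gradient outer product estimator (7)."""
    M_hat = gamma0 * np.eye(d)                     # M_hat(0) = gamma_0 * I
    data = []
    for i in range(1, n_phases + 1):
        learner = NonparametricOnlineRegression(M_hat)   # normalizes by the spectral norm
        for t_phase in range(1, 2 ** i + 1):             # phase i has length n(i) = 2^i
            x, y = next(stream)
            y_hat, s, eps_t = learner.predict(x, t_phase)
            learner.update(x, y, s, eps_t)
            data.append((x, y))                          # estimator trained on the same points
        T_i = len(data)                                  # T(i) = 2^(i+1) - 2
        M_hat = estimate_G(data) + gamma(T_i) * np.eye(d)  # M_hat(i) = G_hat(i) + gamma_{T(i)} I
    return learner
```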
Theorem 2. Suppose Assumption 1 holds. Assume the algorithm is run with regularization sequence $\gamma_0 = 1$ and $\gamma_t = t^{-\alpha}$ for some $\alpha > 0$ such that $\gamma_t \ge \bar{\gamma}_t$ for all
$$t \ge \Big(\frac{d+1}{2\mu_d}\Big)^{1/\alpha}, \tag{8}$$
where $\bar{\gamma}_1 \ge \bar{\gamma}_2 \ge \dots > 0$ is a sequence satisfying Lemma 4. Then for any given $\Delta > 0$,
$$R_T \;\overset{\widetilde{O}}{=}\; \left(1 + \sqrt{\sum_{j=1}^{d} \big(\|\nabla_{u_j} f_0\|_\infty + \|\nabla_{\overline{V}_\Delta} f_0\|_\infty\big)^2}\right) T^{\frac{\widetilde{\rho}_T}{1+\widetilde{\rho}_T}}$$
with high probability with respect to the random draw of $X_1, \dots, X_T$.

Note that the asymptotic notation hides terms that depend on $1/\Delta$, hence we cannot make the term $\|\nabla_{\overline{V}_\Delta} f_0\|_\infty$ vanish from the bound by taking $\Delta$ arbitrarily small.

Proof. Pick the smallest $i_0$ such that $T(i_0) \ge \big(\frac{d+1}{2\mu_d}\big)^{1/\alpha}$ (we need this condition in the proof). The total regret in phases $1, 2, \dots, i_0$ is bounded by $O\big(\big(\frac{d+1}{2\mu_d}\big)^{1/\alpha}\big) = O(1)$. Let the value of $\hat{\rho}_{T(i)}$ at the end of phase $i$ be denoted by $\hat{\rho}(i)$. By Theorem 1, the regret $R(i+1)$ of Algorithm 1 in each phase $i + 1 > i_0$ is deterministically upper bounded by
$$R(i+1) \le 8 \ln\big(e\, 2^{i+1}\big) \left( 8\sqrt{2}\, \hat{\rho}(i+1) + 4 \sqrt{\sum_{j=1}^{d} \|\nabla_{\hat{u}_j(i)} f_0\|_\infty^2\, \frac{\lambda_1(i)}{\lambda_j(i)}} \right) 2^{(i+1) \frac{\hat{\rho}(i+1)}{1+\hat{\rho}(i+1)}}, \tag{9}$$
where $\lambda_j(i) = \hat{\mu}_j(i) + \gamma(i)$. Here we used the trivial upper bound $\det_\kappa\big(\widehat{M}(i) / \|\widehat{M}(i)\|_2\big) \le 1$ for all $\kappa = 1, \dots, d$.

Now assume that $\hat{\mu}_1(i) + \gamma(i) \le \big(\hat{\mu}_m(i) + \gamma(i)\big)\, t^{\frac{2}{1+r}}$ for some $m, r \in \{1, \dots, d\}$ and for some $t$ in phase $i + 1$. Using Lemma 4 and $\gamma_t \ge \bar{\gamma}_t$, we have that
$$\max_{j=1,\dots,d} \big|\hat{\mu}_j(i) - \mu_j\big| \le \big\|\widehat{G}(i) - G\big\|_2 \le \bar{\gamma}(i) \le \gamma(i) \quad \text{with high probability,} \tag{10}$$
where the first inequality is straightforward. Hence we may write
$$\mu_1 \le \mu_1 - \bar{\gamma}(i) + \gamma(i) \le \hat{\mu}_1(i) + \gamma(i) \le \big(\hat{\mu}_m(i) + \gamma(i)\big)\, t^{\frac{2}{1+r}} \le \big(\mu_m + \bar{\gamma}(i) + \gamma(i)\big)\, t^{\frac{2}{1+r}} \quad \text{(using Lemma 4)} \quad \le \big(\mu_m + 2\gamma(i)\big)\, t^{\frac{2}{1+r}}.$$
Recall that $\gamma(i) = T(i)^{-\alpha}$. Using Lemma 5, we observe that the derivative of $F(t) = \big(\mu_m + 2(T(i) + t)^{-\alpha}\big)\, t^{\frac{2}{1+r}}$ is positive for all $t \ge 1$ when $T(i) \ge \big(\frac{r+1}{2\mu_m}\big)^{1/\alpha}$, which is guaranteed by our choice (8). Hence
$$\big(\mu_m + 2\gamma(i)\big)\, t^{\frac{2}{1+r}} \le \big(\mu_m + 2\gamma_T\big)\, T^{\frac{2}{1+r}},$$
and so $\hat{\mu}_1(i) + \gamma(i) \le \big(\hat{\mu}_m(i) + \gamma(i)\big)\, t^{\frac{2}{1+r}}$ implies $\mu_1 \le \big(\mu_m + 2\gamma_T\big)\, T^{\frac{2}{1+r}}$. Recalling the definitions of $\widetilde{\kappa}$ and $\hat{\kappa}$, this in turn implies $\hat{\kappa}(r, t) \le \widetilde{\kappa}(r, T)$, which also gives $\hat{\rho}_t \le \widetilde{\rho}_T$ for all $t \le T$.

Next, we bound the approximation error in each individual eigenvalue of $G$. By (10) we obtain, for any phase $i$ and for any $j = 1, \dots, d$,
$$\mu_j + 2\gamma(i) \ge \mu_j + \bar{\gamma}(i) + \gamma(i) \ge \hat{\mu}_j(i) + \gamma(i) \ge \mu_j - \bar{\gamma}(i) + \gamma(i) \ge \mu_j.$$
Hence, bound (9) implies
$$R(i+1) \le 8 \ln\big(e\, 2^{i+1}\big) \left( 12\, \widetilde{\rho}_T + 4 \sqrt{\big(\mu_1 + 2\gamma(i)\big) \sum_{j=1}^{d} \frac{\|\nabla_{\hat{u}_j(i)} f_0\|_\infty^2}{\mu_j}} \right) 2^{(i+1) \frac{\widetilde{\rho}_T}{1+\widetilde{\rho}_T}}. \tag{11}$$

The error in approximating the eigenvectors of $G$ is controlled via the following first-order eigenvector approximation result from matrix perturbation theory [20, equation (10.2)]: for any vector $v$ of constant norm,
$$v^\top\big(\hat{u}_j(i) - u_j\big) = \sum_{k \ne j} \frac{u_k^\top\big(\widehat{M}(i) - G\big) u_j}{\mu_j - \mu_k}\, v^\top u_k + o\big(\|\widehat{M}(i) - G\|_2^2\big) \le \sum_{k \ne j} \frac{2\gamma(i)}{|\mu_j - \mu_k|}\, \big|v^\top u_k\big| + o\big(\gamma(i)^2\big),$$
where we used $u_k^\top\big(\widehat{M}(i) - G\big) u_j \le \big\|\widehat{M}(i) - G\big\|_2 \le \bar{\gamma}(i) + \gamma(i) \le 2\gamma(i)$. Then for all $j$ such that $u_j \in V_\Delta$,
$$\nabla f_0(x)^\top\big(\hat{u}_j(i) - u_j\big) \le \sum_{k \ne j} \frac{2\gamma(i)}{|\mu_j - \mu_k|}\, \big|\nabla f_0(x)^\top u_k\big| + o\big(\gamma(i)^2\big) \le \frac{2\gamma(i)}{\Delta}\, \sqrt{d}\, \big\|\nabla f_0(x)\big\|_2 + o\big(\gamma(i)^2\big).$$
Note that the coefficients
$$\alpha_k = \frac{u_k^\top\big(\widehat{M}(i) - G\big) u_j}{\mu_j - \mu_k} + o\big(\gamma(i)^2\big), \qquad k \ne j,$$
are a subset of the coordinate values of the vector $\hat{u}_j(i) - u_j$ w.r.t. the orthonormal basis $u_1, \dots, u_d$. Then, by Parseval's identity,
$$4 \ge \big\|\hat{u}_j(i) - u_j\big\|_2^2 \ge \sum_{k \ne j} \alpha_k^2.$$
Therefore, it must be that
$$\left|\frac{u_k^\top\big(\widehat{M}(i) - G\big) u_j}{\mu_j - \mu_k}\right| \le 2 + o\big(\gamma(i)^2\big).$$
For any $j$ such that $u_j \in \overline{V}_\Delta$, since $|\mu_j - \mu_k| \ge \Delta$ for all $u_k \in V_\Delta$, we may write
$$\nabla f_0(x)^\top\big(\hat{u}_j(i) - u_j\big) \le \sum_{u_k \in V_\Delta} \frac{2\gamma(i)}{\Delta}\, \big|\nabla f_0(x)^\top u_k\big| + \sum_{u_k \in \overline{V}_\Delta,\, k \ne j} \big(2 + o(\gamma(i)^2)\big)\big|\nabla f_0(x)^\top u_k\big| + o\big(\gamma(i)^2\big) \le \frac{2\gamma(i)}{\Delta}\, \sqrt{d}\, \big\|P_{V_\Delta} \nabla f_0(x)\big\|_2 + \big(2 + o(\gamma(i)^2)\big)\sqrt{d}\, \big\|P_{\overline{V}_\Delta} \nabla f_0(x)\big\|_2,$$
where $P_{V_\Delta}$ and $P_{\overline{V}_\Delta}$ are the orthogonal projections onto, respectively, $\mathrm{span}(V_\Delta)$ and $\mathrm{span}(\overline{V}_\Delta)$. Therefore, we have that
$$\|\nabla_{\hat{u}_j(i)} f_0\|_\infty = \sup_{x \in \mathcal{X}} \nabla f_0(x)^\top \hat{u}_j(i) = \sup_{x \in \mathcal{X}} \nabla f_0(x)^\top\big(\hat{u}_j(i) - u_j + u_j\big) \le \sup_{x \in \mathcal{X}} \nabla f_0(x)^\top u_j + \sup_{x \in \mathcal{X}} \nabla f_0(x)^\top\big(\hat{u}_j(i) - u_j\big) \le \|\nabla_{u_j} f_0\|_\infty + \frac{2\gamma(i)}{\Delta}\, \sqrt{d}\, \|\nabla_{V_\Delta} f_0\|_\infty + \big(2 + o(\gamma(i)^2)\big)\sqrt{d}\, \|\nabla_{\overline{V}_\Delta} f_0\|_\infty.$$
Letting
$$\alpha_\Delta(i) = \frac{2\gamma(i)}{\Delta}\, \sqrt{d}\, \|\nabla_{V_\Delta} f_0\|_\infty + \big(2 + o(\gamma(i)^2)\big)\sqrt{d}\, \|\nabla_{\overline{V}_\Delta} f_0\|_\infty,$$
we can upper bound (11) as follows:
$$R(i+1) \le 8 \ln\big(e\, 2^{i+1}\big) \left( 12\, \widetilde{\rho}_T + 4 \sqrt{\big(\mu_1 + 2\gamma(i)\big) \sum_{j=1}^{d} \frac{\big(\|\nabla_{u_j} f_0\|_\infty + \alpha_\Delta(i)\big)^2}{\mu_j}} \right) 2^{(i+1) \frac{\widetilde{\rho}_T}{1+\widetilde{\rho}_T}}.$$
Recall that, due to (10), the above holds at the end of each phase $i + 1$ with high probability. Now observe that $\gamma(i) = O\big(2^{-\alpha i}\big)$, and so $\alpha_\Delta(i) \overset{O}{=} \|\nabla_{V_\Delta} f_0\|_\infty / \Delta + \|\nabla_{\overline{V}_\Delta} f_0\|_\infty$.
Hence, by summing over phases $i = 1, \dots, \log_2 T$ and applying the union bound,
$$R_T \le \sum_{i=1}^{\log_2 T} 8 \ln(e T) \left( 12 d + 4 \sqrt{\big(\mu_1 + 2\gamma(i-1)\big) \sum_{j=1}^{d} \frac{\big(\|\nabla_{u_j} f_0\|_\infty + \alpha_\Delta(i-1)\big)^2}{\mu_j}} \right) 2^{\,i \frac{\widetilde{\rho}_T}{1+\widetilde{\rho}_T}} \;\overset{\widetilde{O}}{=}\; \left(1 + \sqrt{\sum_{j=1}^{d} \big(\|\nabla_{u_j} f_0\|_\infty + \|\nabla_{\overline{V}_\Delta} f_0\|_\infty\big)^2}\right) T^{\frac{\widetilde{\rho}_T}{1+\widetilde{\rho}_T}},$$
concluding the proof.

## 5 Conclusions and future work

We presented an efficient algorithm for online nonparametric regression which adapts to the directions along which the regression function $f_0$ is smoother. It does so by learning the Mahalanobis metric through the estimation of the gradient outer product matrix $\mathbb{E}\big[\nabla f_0(X)\, \nabla f_0(X)^\top\big]$. As a preliminary result, we analyzed the regret of a generalized version of the algorithm from [6], capturing situations where one competes against functions with directional Lipschitzness with respect to an arbitrary Mahalanobis metric. Our main result is then obtained through a phased algorithm that estimates the gradient outer product matrix while running online nonparametric regression on the same sequence. Both algorithms automatically adapt to the effective rank of the metric.

This work could be extended by investigating a variant of Algorithm 1 for classification, in which ball radii shrink at a nonuniform rate, depending on the mistakes accumulated within each ball rather than on time. This could lead to the ability of competing against functions $f$ that are only locally Lipschitz. In addition, it is conceivable that, under appropriate assumptions, a fraction of the balls could stop shrinking at a certain point when no more mistakes are made. This might yield better asymptotic bounds than those implied by Theorem 1, because $\rho_T$ would never attain the ambient dimension $d$.

Acknowledgments

The authors would like to thank Sébastien Gerchinovitz and Samory Kpotufe for useful discussions on this work. IK would like to thank Google for travel support. This work was also funded in part by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 637076).

References

[1] R. Allez and J.-P. Bouchaud. Eigenvector dynamics: general theory and some applications. Physical Review E, 86(4):046202, 2012.
[2] A. Bellet, A. Habrard, and M. Sebban. A Survey on Metric Learning for Feature Vectors and Structured Data. arXiv preprint arXiv:1306.6709, 2013.
[3] X. Dong and D.-X. Zhou. Learning Gradients by a Gradient Descent Algorithm. Journal of Mathematical Analysis and Applications, 341(2):1018-1027, 2008.
[4] P. Gaillard and S. Gerchinovitz. A Chaining Algorithm for Online Nonparametric Regression. In Conference on Learning Theory (COLT), 2015.
[5] Z.-C. Guo, Y. Ying, and D.-X. Zhou. Online Regularized Learning with Pairwise Loss Functions. Advances in Computational Mathematics, 43(1):127-150, 2017.
[6] E. Hazan and N. Megiddo. Online Learning with Prior Knowledge. In Learning Theory, pages 499-513. Springer, 2007.
[7] R. Jin, S. Wang, and Y. Zhou. Regularized Distance Metric Learning: Theory and Algorithm. In Conference on Neural Information Processing Systems (NIPS), 2009.
[8] S. Kpotufe, A. Boularias, T. Schultz, and K. Kim. Gradients Weights Improve Regression and Classification. Journal of Machine Learning Research, 17(22):1-34, 2016.
[9] S. Kpotufe and F. Orabona. Regression-Tree Tuning in a Streaming Setting. In Conference on Neural Information Processing Systems (NIPS), 2013.
[10] R. Krauthgamer and J. R. Lee. Navigating nets: simple algorithms for proximity search. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 798-807. Society for Industrial and Applied Mathematics, 2004.
[11] S. Mukherjee and Q. Wu. Estimation of Gradients and Coordinate Covariation in Classification. Journal of Machine Learning Research, 7:2481-2514, 2006.
[12] S. Mukherjee and D.-X. Zhou. Learning Coordinate Covariances via Gradients. Journal of Machine Learning Research, 7:519-549, 2006.
[13] A. Rakhlin and K. Sridharan. Online Non-Parametric Regression. In Conference on Learning Theory (COLT), 2014.
[14] S. Trivedi, J. Wang, S. Kpotufe, and G. Shakhnarovich. A Consistent Estimator of the Expected Gradient Outerproduct. In Conference on Uncertainty in Artificial Intelligence (UAI), 2014.
[15] V. Vovk. Metric entropy in competitive on-line prediction. arXiv preprint cs/0609045, 2006.
[16] V. Vovk. On-line regression competitive with reproducing kernel Hilbert spaces. In International Conference on Theory and Applications of Models of Computation. Springer, 2006.
[17] V. Vovk. Competing with wild prediction rules. Machine Learning, 69(2):193-212, 2007.
[18] Y. Wang, R. Khardon, D. Pechyony, and R. Jones. Generalization Bounds for Online Learning Algorithms with Pairwise Loss Functions. In Conference on Learning Theory (COLT), 2012.
[19] K. Q. Weinberger and L. K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10:207-244, 2009.
[20] J. H. Wilkinson. The Algebraic Eigenvalue Problem, volume 87. Clarendon Press, Oxford, 1965.
[21] Q. Wu, J. Guinney, M. Maggioni, and S. Mukherjee. Learning gradients: predictive models that infer geometry and statistical dependence. Journal of Machine Learning Research, 11:2175-2198, 2010.