Journal of Machine Learning Research 18 (2018) 1-43. Submitted 3/16; Revised 1/18; Published 4/18.

Compact Convex Projections

Steffen Grünewälder S.GRUNEWALDER@LANCASTER.AC.UK
Department of Mathematics and Statistics
Lancaster University
Lancaster, England

Editor: Mark Schmidt

Abstract

We study the usefulness of conditional-gradient-like methods for determining projections onto convex sets, in particular, projections onto naturally arising convex sets in reproducing kernel Hilbert spaces. Our work is motivated by the recently introduced kernel herding algorithm, which is closely related to the conditional gradient method (CGM). It is known that the herding algorithm converges with a rate of 1/t, where t counts the number of iterations, when a point in the interior of a convex set is approximated. We generalize this result and provide a necessary and sufficient condition for the algorithm to approximate projections with a rate of 1/t. The CGM, which is in general vastly superior to the herding algorithm, achieves only an inferior rate of 1/√t in this setting. We study the usefulness of such projection algorithms further by exploring ways to use them for solving concrete machine learning problems. In particular, we derive non-parametric regression algorithms which use at their core a slightly modified kernel herding algorithm to determine projections. We derive bounds to control the approximation errors of these methods and we demonstrate via experiments that the developed regressors are on par with state-of-the-art regression algorithms for large scale problems.

1. Introduction

Convex sets and projections onto convex sets are omnipresent in machine learning and statistics. Projections appear already in the most basic approaches, like ordinary least squares regression, where the estimate can be interpreted as the projection of some target vector onto a linear subspace.
By adding constraints one arrives naturally at a convex projection problem. Similarly, the ridge regressor and the Gaussian process regressor are closely related to projections onto balls, and the lasso estimator (Tibshirani, 1996) corresponds to a projection onto a simplex (again a convex set). Regularization approaches with convex constraints are, in general, closely related to the problem of determining a projection. For example, sparsity constraints are often enforced by optimizing over the standard simplex, and algorithms that determine projections onto the simplex have been studied extensively (Duchi, Shalev-Shwartz, Singer, and Chandra, 2008). This paper is about exploring the usefulness of some closely related algorithms for determining projections onto convex sets in the large data context. Our inspiration for this line of research dates back to the paper by Welling (2009) in which an algorithm, called the herding algorithm, was introduced which computes compact representations of probability measures. By a compact representation we mean here a representation that is supported on few points of the sample space and that can be used to efficiently calculate expectations of functions. The algorithm has garnered attention in recent years and various generalizations of the method have been explored. In one of these works it has been studied how the algorithm behaves in a non-parametric setting where a kernel function is defined on the sample space (Chen, Welling, and Smola, 2010). The authors analyzed the interplay between the algorithm and the reproducing kernel Hilbert space (RKHS, Aronszajn (1950)) associated with the kernel function.

©2018 Steffen Grünewälder. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-147.html.
Their main finding was that the algorithm has a guaranteed rate of convergence towards the given probability measure (or, more accurately, towards its representer in the RKHS) of 1/t, where t is the number of iterations for which the algorithm is run. The rate of 1/t is an improvement over randomly selected support points, which would guarantee a rate of 1/√t, but the rate is slow compared to what many other numerical algorithms yield. It is worth pointing out that we consider here convergence in ‖·‖, while in the literature results are often reported for quadratic objectives like ‖·‖². The virtue of this algorithm is that the computational costs per iteration are linear in the sample size. These low costs in dependence of the sample size make this algorithm, and related algorithms, attractive in the large data context, where algorithms with faster rates of convergence are often prohibitively expensive to compute. The herding algorithm is iterative and uses at its core an inner product between the approximation error y − p_t at iteration t and a set of candidates S that can be chosen to reduce the error. Here, y lies inside a convex set C, p_t is the approximation of y at iteration t, and S is a subset of C. The algorithm selects elements which maximize this inner product ⟨y − p_t, x⟩ over x ∈ S. We asked ourselves somewhat naïvely what would happen if y lies outside of C. One would hope that the algorithm converges in this case to the point in C that is closest to y, or in other words, that it converges to the projection Py of y onto C. The first observation we had was that this will, in fact, be true if y − Py stands orthogonal on the (affine) subspace spanned by C and if Py lies in the (affine) interior of C, because in this case the algorithm is completely unaffected by y − Py and it behaves equivalently to the case where we would apply it directly to Py ∈ C.
In this setting, the standard guarantees tell us that the algorithm converges with a rate of 1/t to the projection. Our initial observation was in a way naïve, since the literature on herding provides a way to show that the algorithm with an added line search converges to the projection. The argument to verify the convergence is based on the observation from Bach, Lacoste-Julien, and Obozinski (2012) that the herding algorithm with an added line search is equivalent to a well known method called the conditional gradient method (CGM, Frank and Wolfe (1956)). Basic guarantees on the convergence of the CGM imply directly that the method converges with a rate of 1/√t to the projection (see our literature section below for more details). This convergence result is reaffirming; however, the rate of convergence the result guarantees is a slow rate of 1/√t and not 1/t. The rate of 1/√t is, in fact, the actual rate with which the CGM converges in non-trivial projection problems. The reason for this is that, in interesting cases, projections lie in the boundary of the convex set, and the CGM itself achieves only in exceptional cases a rate that is better than 1/√t if the solution of an optimization problem lies in the boundary (Cannon and Cullum, 1968). In recent years extensions of the basic CGM algorithm have become popular and it was shown that a linear rate of convergence (a rate significantly better than 1/t) is achieved by a particular extension, the CGM with away steps, even if the solution lies in the boundary (Lacoste-Julien and Jaggi, 2015). This is a significant result, but it comes with a caveat: we gain strong guarantees only if the convex set has a large pyramidal width. For instance, the d-dimensional cube has a large pyramidal width and the authors provide an upper bound on the reduction of the approximation error in the order of 1/d² per step.
The convex sets we are mainly interested in are of the form C = cch {k(x, ·) : x ∈ X} ⊆ H, where k is the reproducing kernel of the RKHS H. Here, the dimension of the spanned subspace is often equal to the sample size (when using a Gaussian kernel, for example).

Figure 1: An illustration of the application of the herding algorithm to a projection problem. The aim is to find the projection of a ray starting at the black dot onto the house. The house is compact and convex. The herding algorithm finds for such sets approximations of the projection with an accuracy in the order of 1/t, where t is the number of iterations. The red dots show the approximation after 0, 2, 5 and 100 steps with an initial estimate that equals the north-west corner on the bottom of the house.

Furthermore, C is significantly more complicated than a cube, and we expect that the upper bound gives us guarantees on the reduction that are significantly worse than 1/n² per iteration, where n is the size of the sample. In the large data context, where n ≥ 10⁵, these reductions per step are tiny. Another recent approach to improve the convergence behavior of the CGM has been developed in Garber and Hazan (2013). Standard CGM is in this approach combined with what the authors call a Local Linear Optimization Oracle (LLOO) and it is shown that a linear rate of convergence is achieved by their method. Like in the away step approach, the geometry of the convex set factors into the performance. For the method from Garber and Hazan (2013) the geometry affects the run-time of the LLOO and, hence, the run-time of the CGM with the added LLOO. The approach works well for simply shaped convex sets like the simplex but becomes difficult when C has no particular structure, as in our RKHS setting. Understanding better the relevant geometric properties of C with respect to the reproducing kernel is an intriguing problem, but it is beyond the aims of this paper.
We aim here for a deeper understanding of the core algorithms and their interplay with the projection problem in Hilbert space. The following figure visualizes the relation between the different CGM-like methods. We focus in this paper on the methods in the shaded area.

Figure 2: The relation between various CGM-like methods: herding, CGM, CGM with away steps, and CGM with LLOO.

CGM is known to converge with a linear rate if there exists a ball around the solution inside the convex set (this is also known as Slater's condition, Beck and Teboulle (2004)). Similarly, for the herding algorithm it has been shown that this very same condition guarantees a rate of 1/t (Chen et al., 2010). To the best of our knowledge, both convergence results have been derived independently of each other and it is telling that in both approaches the same assumption plays a fundamental role (the existence of a ball around the optimum). The assumption appears naturally in the analyses since it allows a strong reduction of estimation errors per iteration, independently of the direction in which the errors point. If this assumption does not hold then there are directions which pose problems, i.e. if an error builds up in such a direction then it will only decay slowly over time. This is why solutions that lie in the boundary pose problems: errors that point away from the convex set cannot be easily reduced. We refine the convergence argument to be able to deal with such errors. On a high level our approach can be described in the following way: consider a convex set with finitely many extremes; then each point in the boundary is in the relative interior of a face of the original convex set. For example, in Figure 1 our convex set is the house and we want to compute the projection of the black dot onto the house. The dotted line visualizes the process of projecting onto the house and the red dot at the end of the line is the projection that we want to determine algorithmically.
The projection does not lie in the interior of the house but it lies inside a face of the convex set (the triangle that contains the projection). A face of a convex set is in turn a convex set, and if we were to apply the algorithm directly to this face then we could show fast convergence. In general, we cannot identify the face that contains the projection without using significant computational resources. However, we can treat elements outside of this face, which are chosen by the algorithm, as perturbations that disturb the behavior of the algorithm. For instance, we applied the herding algorithm to the problem in Figure 1 and we used an initial approximation of the projection which corresponds to the north-west corner on the bottom of the house. The dashed line in red shows how the approximation evolves over multiple steps and how the error that is introduced by the initialization shrinks over time. Our line of attack in the theoretical part of this paper is to control such perturbations and to show that the rate of convergence of the herding algorithm is not negatively affected by these. We were also interested in understanding how CGM-like algorithms for determining projections in Hilbert spaces can be exploited in statistical problems. A blueprint is given in the paper by Chen et al. (2010) where the herding algorithm is applied in an RKHS H to approximate elements inside a compact and convex set C ⊆ H. It is natural to consider extensions in which C is modified in a suitable way to represent a space of solutions of certain statistical problems. We showcase this approach in the regression problem. In this case, C is the space of regression functions. The projection approach itself is generic and one can gain with relative ease algorithms that infer kernel regressors from data. We call these CCP-regressors, where the acronym CCP stands for compact convex projection.
The advantage of this approach is that the corresponding algorithms are cheap to compute and they are applicable in the large data context. We find in experiments that the algorithms we derive are on par with established kernel regression algorithms like the Gaussian process or the fast kernel ridge regressor (Fast KRR, Zhang, Duchi, and Wainwright (2013)).

1.1 Literature

We aim in this section for a short overview summarizing key results that put our work in perspective. We start by stating the herding algorithm for approximating a point x∗ inside a convex set C that is induced by the finite set of points S, i.e. C = cch S, where cch denotes the closed convex hull operator. C is typically a subset of ℝᵈ equipped with the Euclidean inner product, or some other Hilbert space. Given the starting value w₀ = x∗ and the initialization t = 1, the herding algorithm iterates

1. choose x_t ∈ arg max_{x∈S} ⟨w_{t−1}, x⟩,
2. set w_t = w_{t−1} − (x_t − x∗),
3. set t ← t + 1.

This basic routine is combined with some termination criterion, like running the algorithm for T iterations. The approximation of x∗ is then (x₁ + ⋯ + x_T)/T, where T denotes the number of iterations the algorithm was run for. The herding algorithm is a special case of what is called the conditional gradient method (CGM) or the Frank-Wolfe algorithm (Frank and Wolfe, 1956; Bach et al., 2012). The conditional gradient method has been known since the fifties and various results about its convergence behavior have been derived throughout the years. The CGM tries to approximate the minimal value of a differentiable function f : ℝᵈ → ℝ on a set C by iterating two steps after an initialization with some value x₀ ∈ C:

1. choose an x∗ ∈ arg min_{x∈C} ⟨x − x_{t−1}, ∇f(x_{t−1})⟩,
2. perform a line search over γ ∈ [0, 1], i.e. choose a γ∗ ∈ arg min_{γ∈[0,1]} f(x_{t−1} + γ(x∗ − x_{t−1})) and update x_t = x_{t−1} + γ∗(x∗ − x_{t−1}).
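The herding iteration above can be sketched in a few lines when C is the convex hull of finitely many vectors. The following is a minimal NumPy illustration; the function name `herd` and the representation of S as a matrix of row vectors are our own choices for this sketch, not notation from the paper:

```python
import numpy as np

def herd(x_star, S, T):
    """Herding sketch (Section 1.1): approximate x_star inside cch(S)
    by the uniform average of T greedily chosen points of S."""
    w = x_star.copy()                  # w_0 = x*
    total = np.zeros_like(x_star)
    for _ in range(T):
        x_t = S[np.argmax(S @ w)]      # x_t in arg max_{x in S} <w_{t-1}, x>
        w = w - (x_t - x_star)         # w_t = w_{t-1} - (x_t - x*)
        total += x_t
    return total / T                   # (x_1 + ... + x_T)/T
```

For a point in the interior of the hull, e.g. x∗ = (0.3, 0.6) inside the unit square, the average of the chosen corners approaches x∗ at the 1/t rate discussed above.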
Standard results for the CGM hold for continuously differentiable functions on convex compact subsets C of ℝᵈ (Beck and Teboulle, 2004). Let f∗ = min_{x∈C} f(x); then it is known that a sequence {x_t}_{t≥0} produced by the CGM attains the minimum eventually, i.e. lim_{t→∞} f(x_t) = f∗, and there exists a constant b such that f(x_t) − f∗ ≤ b/t. A differentiable function on a compact set is Lipschitz continuous and the constant b depends on the Lipschitz constant and on the diameter of the set C, diam(C) = sup_{x,y∈C} ‖x − y‖. We can use the CGM to find the projection of z onto C. Letting f(x) = ‖x − z‖², with ‖·‖ the Euclidean norm, we can observe that f is continuously differentiable with derivative 2⟨x − z, ·⟩ and we are assured that ‖x_t − z‖² − min_{x∈C} ‖x − z‖² ≤ b/t for some constant b and all t ≥ 0. Observe that this rate is slower than what we aim for, since we gain here a rate of about t^{−1/2} for the convergence of ‖x_t − z‖. The norm itself is not differentiable and we cannot derive the faster rate of 1/t through these results. It is also known that the rate of convergence cannot be improved in general (Cannon and Cullum, 1968, Thm. 2). Hence, stronger assumptions are needed to gain faster rates of convergence. One typical assumption is that Slater's condition holds. Slater's condition is in our context the assumption that z ∈ int C, where int C denotes the interior of C. Proposition 3.2 in Beck and Teboulle (2004) demonstrates that if Slater's condition is fulfilled then the CGM produces a sequence {x_t}_{t≥0} which satisfies ‖x_t − z‖ ≤ b e^{−tq/2} for some positive constants b and q. The rate is significantly faster than what is known for the herding algorithm, but the result is not applicable to the problem of computing projections since (non-trivial) projections lie in the boundary of C and not in the interior. These convergence results are nevertheless fundamental and they are a cornerstone for various studies.
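To make the CGM concrete for the projection objective f(x) = ‖x − z‖², note two simplifications: the linear subproblem over a polytope cch S is always attained at one of the points of S, and the line search over γ ∈ [0, 1] has a closed form for a quadratic. A minimal NumPy sketch under our own naming (`cgm_project` and the demo set S are illustrative choices):

```python
import numpy as np

def cgm_project(z, S, T):
    """CGM sketch for min_{x in cch(S)} ||x - z||^2. The linear
    subproblem over the polytope is attained at a point of S and the
    exact line search over gamma has a closed form for this quadratic."""
    x = S[0].astype(float).copy()              # some x_0 in C
    for _ in range(T):
        grad = 2.0 * (x - z)                   # grad f(x) = 2(x - z)
        v = S[np.argmin(S @ grad)]             # arg min_{v in S} <v - x, grad>
        d = v - x
        denom = d @ d
        if denom == 0.0:
            break                              # subproblem returned x itself
        gamma = np.clip(-((x - z) @ d) / denom, 0.0, 1.0)  # exact line search
        x = x + gamma * d
    return x
```

For example, projecting z = (2, 0.5) onto the unit square yields the boundary point (1, 0.5); in general one only has the slow 1/√t guarantee on ‖x_t − Pz‖ discussed above.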
For instance, in Jaggi (2013) these convergence results for the CGM are combined with a duality argument to give bounds on the duality gap between the primal and dual solution of the optimization problem. Another line of research considers extensions of the CGM which are built to circumvent the sub-linear rate of convergence. We discussed already the most prominent approaches: the classical approach from Wolfe (1970), in which away steps are added to the CGM, and the approach from Garber and Hazan (2013), which both guarantee a linear rate of convergence. Much of the research on the CGM has been done in the general context where an arbitrary continuously differentiable function on a convex set is optimized. However, the more specialized problem of determining projections onto convex sets has also garnered attention, especially in the context of sparsity. The convex set C that is typically considered in this context is the standard simplex in ℝᵈ, which is

Δ_{d−1} = cch {e_n : 1 ≤ n ≤ d},

where e₁, …, e_d is the standard basis in ℝᵈ. The simplex Δ_{d−1} is of particular importance because ℓ₁-constraints can be naturally expressed through it: w ∈ ℝ₊ᵈ with ‖w‖₁ = 1 implies that w = Σ_{i=1}^d ⟨w, e_i⟩ e_i ∈ cch {e_n : 1 ≤ n ≤ d} = Δ_{d−1} (and vice versa). The CGM acting on the simplex can be used to find sparse solutions for a variety of problems. Most notably, the lasso estimator (Tibshirani, 1996) can be found through an application of the CGM to a simplex (Clarkson, 2010). In Duchi et al. (2008) a similar problem is studied. The motivation there are algorithms which require projections onto a simplex. The authors develop a projection algorithm that can be computed in the order of d operations on average (the algorithm is stochastic), where d is the dimension of the simplex. This resembles the order of run time of the CGM.
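For concreteness, the Euclidean projection onto Δ_{d−1} admits a classical O(d log d) sort-based routine; the expected-linear-time algorithm of Duchi et al. (2008) refines this idea by replacing the sort with randomized pivoting. A sketch of the sort-based variant (our own implementation, not the paper's code):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the standard simplex
    {w : w >= 0, sum(w) = 1} via the classical sort-based scheme:
    find the largest rho with u_rho - (cumsum_rho - 1)/rho > 0,
    then shift by theta and clip at zero."""
    u = np.sort(v)[::-1]                            # sort descending
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)          # optimal shift
    return np.maximum(v - theta, 0.0)
```

For example, `project_simplex(np.array([2.0, 0.0]))` returns the vertex (1, 0), and projections of generic vectors come out non-negative and summing to one.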
Another set of important results for the CGM relies on strong convexity assumptions about the function f and the set C; see for example Garber and Hazan (2015) and references therein. Under such assumptions one can obtain a rate of 1/t for finding the projection on C, i.e. the rate of convergence applies independently of where in C the solution lies. The boundary of a strongly convex set C bends significantly. In our paper we assume very different properties of C. We are interested in the convex hull C of a finite set S, and faces of such sets C are flat, i.e. they do not bend. Closely related to the closed convex hull of a set S is the minimum enclosing ball of S (MEB, Clarkson (2010)). The difference between the two is that a Euclidean ball bends (the Euclidean norm is uniformly convex) while the closed convex hull interpolates the boundaries through hyperplanes. For example, the MEB of S = {±e₁, …, ±e_d} is the Euclidean unit ball while cch S is an ℓ₁-ball (a diamond). There exist efficient algorithms for determining the MEB under different conditions. Furthermore, a variety of machine learning problems can be rephrased as the problem of finding an MEB (support vector algorithms are approached in this way in Tsang et al. (2005a,b)).

1.2 Contributions

In this paper we study the usefulness of conditional gradient methods to determine projections onto compact convex sets in Hilbert spaces. The contributions split naturally into theoretical and applied contributions through which we explore this theme. The theoretical contributions are two theorems. Our first theorem shows that the herding algorithm retains its rate of convergence of 1/t if, and only if, the perturbations introduced through points outside the minimal face are coupled in a certain way.
The approach we use (considering the minimal face, to which well-known results can be easily adapted, and controlling perturbations which are introduced by elements outside of the minimal face) is to the best of our knowledge new, and we expect that this line of reasoning can be extended in the future to study more advanced methods like the CGM with away steps. Our second theorem analyses why the standard CGM fares worse in this setting than the herding algorithm. On the applied side we show how projections onto compact convex sets in kernel space can be exploited to develop novel machine learning algorithms. We demonstrate this by developing a new and fast kernel regressor that is on par with state-of-the-art non-parametric regressors. We provide approximation error bounds for the method by means of a dual approach and we study its performance in experiments. The approximation error bound is a novel modification of an approach from Wolfe (1976) which sacrifices tightness in favor of a significantly reduced computation time.

2. Compact Convex Projections

We use the following algorithm to find the projection of y ∈ H onto the compact convex set C ⊆ H in terms of a set of support points from the compact set S ⊆ C which contains the extremes of C. In the following we denote the extremes of C by ex C, i.e. ex C = {x : x ∈ C, there exist no y, z ∈ C with y ≠ z and λ ∈ (0, 1) such that x = λy + (1 − λ)z}.

Algorithm 1. Input: y ∈ H, T ∈ ℕ, and S ⊆ H.
1. [Initialize] Set w₀ = y, t = 1.
2. [Optimization oracle] Choose x_t ∈ arg max_{x∈S} ⟨w_{t−1}, x⟩.
3. [Update weight] Set w_t = w_{t−1} − (x_t − y).
4. [Iterate] If t < T then increment t by 1 and go back to 2.
5. [Terminate] Return the approximation (x₁ + ⋯ + x_T)/T.

A maximizer exists in Step 2 because we search for a maximum of a continuous function on a compact set. If there are multiple maximizers then we can choose an arbitrary one; for instance, we can enumerate S and choose the maximizer with the lowest index.
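When H = ℝᵈ and S is a finite set of vectors, Algorithm 1 translates directly into code. A minimal NumPy sketch (names are our own; `np.argmax` breaks ties by the lowest index, matching the convention suggested above):

```python
import numpy as np

def algorithm1(y, S, T):
    """Algorithm 1 sketch: herding applied to a point y that may lie
    outside cch(S); returns the average of the chosen support points."""
    w = y.astype(float).copy()         # step 1: w_0 = y
    total = np.zeros_like(w)
    for _ in range(T):
        x_t = S[np.argmax(S @ w)]      # step 2: optimization oracle
        w = w - (x_t - y)              # step 3: update weight
        total += x_t
    return total / T                   # step 5: (x_1 + ... + x_T)/T
```

For example, with S the corners of the unit square and y = (2, 0.5) outside the square, the returned average approaches the projection (1, 0.5) on the right edge.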
This is the herding algorithm as stated in Section 1.1, with the only modification being that it is applied to y, which can lie outside of C. The weight at step t is essentially the approximation error scaled up by t, because w_t = (t + 1)y − (x₁ + ⋯ + x_t) = t(y − p_t) + y, where p_t = (x₁ + ⋯ + x_t)/t is the approximation. We can also add a line search to the optimization:

Algorithm 1 (ls). Input: y ∈ H, T ∈ ℕ, and S ⊆ H.
1. [Initialize] Set w₀ = y, t = 1.
2. [Optimization oracle] Choose x_t ∈ arg max_{x∈S} ⟨w_{t−1}, x⟩.
3. [Line search] Calculate α̃_t = ⟨w_{t−1}, w_{t−1} + (x_t − y)⟩ / ‖w_{t−1} + (x_t − y)‖².
4. [Line search] If t = 1 set α_t = 1, otherwise set α_t = min(1, α̃_t).
5. [Update weight] Set w_t = (1 − α_t)w_{t−1} − α_t(x_t − y).
6. [Iterate] If t < T then increment t by 1 and go back to 2.
7. [Terminate] Set β_i = α_i ∏_{j=i+1}^T (1 − α_j) for all i ≤ T and return the approximation β₁x₁ + ⋯ + β_T x_T.

We use here the convention that 0/0 = 1, which implies that α̃_t = 1 if x_t = p_{t−1}. The weight in the line search algorithm is similar to the weight of the herding algorithm scaled down by 1/t, i.e. the weight in the line search algorithm is w_t = y − (β^t_1 x₁ + ⋯ + β^t_t x_t), where β^t_i = α_i ∏_{j=i+1}^t (1 − α_j). The choice α̃_t = ⟨w_{t−1}, w_{t−1} + (x_t − y)⟩ / ‖w_{t−1} + (x_t − y)‖² minimizes ‖w_t‖ over all choices of the step size in ℝ. We need to be assured that the scaling α_t lies in the interval [0, 1] to guarantee that we have a convex combination of points of S as the approximation. α̃_t is always non-negative since for t ≥ 1

⟨w_{t−1}, w_{t−1} + (x_t − y)⟩ = ⟨w_{t−1}, x_t⟩ − Σ_{i=1}^{t−1} β^{t−1}_i ⟨w_{t−1}, x_i⟩ ≥ ⟨w_{t−1}, x_t⟩ − max_{x∈S} ⟨w_{t−1}, x⟩ = 0,

where we define a sum from i = 1 to 0 to be 0. The inequality above holds because the β's are non-negative and they sum to 1. α̃_t can, however, be strictly larger than 1. Hence, to guarantee that our approximation is a convex combination of points from S, we need to force the scaling factor back into the interval [0, 1]. We do this in the algorithm by assigning the value 1 to the scaling factor in this case.
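Algorithm 1 (ls) can be sketched in the same finite-dimensional setting. Rather than storing the β's explicitly, the sketch below tracks the running convex combination p_t = (1 − α_t)p_{t−1} + α_t x_t, which equals β^t_1 x₁ + ⋯ + β^t_t x_t, so that w_{t−1} = y − p_{t−1} (function and variable names are our own):

```python
import numpy as np

def algorithm1_ls(y, S, T):
    """Algorithm 1 (ls) sketch: herding with a line search; tracks the
    running convex combination p_t instead of the individual betas."""
    w = y.astype(float).copy()         # w_0 = y
    p = np.zeros_like(w)               # current approximation p_{t-1}
    for t in range(1, T + 1):
        x_t = S[np.argmax(S @ w)]      # optimization oracle
        d = w + (x_t - y)              # equals x_t - p_{t-1}
        denom = d @ d
        a_tilde = 1.0 if denom == 0.0 else (w @ d) / denom   # 0/0 := 1
        a = 1.0 if t == 1 else min(1.0, a_tilde)             # clip to [0, 1]
        p = (1.0 - a) * p + a * x_t    # convex combination update
        w = (1.0 - a) * w - a * (x_t - y)                    # weight update
    return p
```

On the unit-square example with y = (2, 0.5), the line search variant reaches the projection (1, 0.5) after very few iterations, in line with its CGM interpretation.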
This choice minimizes ‖w_t‖ because the derivative of ‖w_t‖² with respect to the step size α is non-positive for all α ≤ α̃_t, that is,

2(α‖w_{t−1} + (x_t − y)‖² − ⟨w_{t−1}, w_{t−1} + (x_t − y)⟩) ≤ 0 for all α ∈ (−∞, α̃_t].

Note that Algorithm 1 is the herding algorithm applied to a point outside the convex set. This corresponds to the CGM where the optimization over the step size is replaced by a step size of 1/t. Algorithm 1 (ls) is the CGM applied to the problem of finding the projection of y onto the convex set spanned by S.

2.1 Rate of Convergence

We take a closer look at the convergence behavior of the herding algorithm and the standard CGM when applied to a projection problem. It is known that the herding algorithm converges with a rate of 1/t to elements in the interior of compact convex sets in Hilbert space (this is shown in Chen et al. (2010) for what is called the marginal polytope; the same proof works for arbitrary compact convex sets in a Hilbert space). Hence, if we want to approximate the projection of an element y onto the convex set C and y is already in the interior of C (in essence a trivial projection problem) then standard proofs guarantee a rate of 1/t. Obviously, the interesting case is one where the projection lies in the boundary. We provide a theorem below (with the proof being postponed to Section 6.1, p. 26) which extends current results to the case where y lies in the boundary or outside the convex set, if an assumption on the perturbations is fulfilled. We also show that this assumption is both necessary and sufficient for the algorithm to converge with a rate of 1/t. We conclude this section by presenting a number of settings in which this assumption is fulfilled. It is instructive to go through the main ideas of our approach. Consider again Figure 1. We can observe that there is a minimal face of the convex set which contains the projection Py of y onto the convex set, i.e. there exists a face that contains Py and is a subset of any other face that contains Py.
In the figure, the minimal face is the convex set which contains the projection, i.e. the red dot. The vector y − Py, with Py ∈ C being the projection, stands orthogonal on this minimal face. Furthermore, this minimal face is either an extreme point or Py lies in the relative interior of it. In the latter case we have a ball around Py (relative to the affine subspace spanned by the minimal face) which is contained in the minimal face.

Figure 3: (a) A visualization of Lemma 1. (b) The figure depicts a triangular prism. The point of projection Py is marked by the red dot. The minimal face that contains Py is the top edge of the prism. The red arrows visualize the three subspaces U_F, U_b and U_δ. (c) The figure shows the projection of a set C_c onto the subspace U_F ⊕ U_δ (the polytope enclosed by the thick line). The red dot marks the projection of Py onto the subspace and the red arrow visualizes the error at some stage t, i.e. w_t. The horizontal line represents {x : ⟨w_t, x⟩ = 0}. The inner product between the black dot and w_t might be very different if evaluated over the whole space, i.e. if the component P_b w_t is considered too. In particular, the polytope might appear to the algorithm like the modified polytope in which the black dot is replaced by the blue dot. In this modified polytope the best aligned element is the leftmost point.

This property of the existence of a ball around the element that we want to approximate is of crucial importance in all current approaches that demonstrate a fast reduction of the error. We can therefore hope for a fast reduction of the error in the affine subspace spanned by F. In fact, since y − Py stands orthogonal on the minimal face, we can also observe that the algorithm applied to the minimal face behaves in exactly the same way as if we ran it with Py instead of y, and we can conclude that the algorithm converges with a rate of 1/t if it chooses only elements of F.
The obvious problem is that we do not know the minimal face beforehand and we cannot force the algorithm to choose only elements from it. At this point, a difficult task is to measure how much harm elements outside the minimal face can do when they are chosen by the algorithm. One important observation here is that for any element x outside the minimal face there exists no ball B around Py such that aff{x, Py} ∩ B is contained in C (aff refers to the affine hull operator). Figure 3 (a) is a visualization of this property. The red dot marks Py. The line going through the bottom left corner and Py leaves the polytope at Py, and there exists no ball in C such that the line segment in this ball lies fully in C. This property is also not difficult to prove formally. We summarize the result in the following simple lemma.

Lemma 1 Given a compact convex set C ⊆ H and a y ∈ H, there exists a minimal face F that contains Py, and for every x ∈ C ∖ F and any ball B centered at Py with radius greater than zero we have that aff{x, Py} ∩ B ⊄ C.

The existence of the minimal face is shown in Step (ii) of the proof of Theorem 2. If the second part of the lemma were false then we would have a point in aff{x, Py} ∩ C such that Py is a convex combination of x and this point. This would imply x ∈ F (see eq. 3 on p. 28), in contradiction to the assumptions of the lemma. The lemma implies that any element outside the minimal face which is chosen by the algorithm will introduce perturbations into the estimate that cannot be easily removed: the line from x through Py leaves the convex set at Py and there is no element z ∈ C that is well aligned with −(x − Py), i.e. a component of −(x − Py) points outward of the convex set and there exists no z in C such that z − Py has a positive inner product with this component.
These perturbations only get smaller over time because of the down-scaling of the old approximation at step t by t/(t + 1), and not because of future choices of the algorithm which are well aligned with the error. We are able to show that, despite this property, the herding algorithm converges under an assumption on these perturbations with the same rate of 1/t that it would achieve if Py lay in the interior of C.

Three subspaces. To state our assumption, it is convenient to introduce the centered versions C_c = {x − Py : x ∈ C} of C and F_c = {x − Py : x ∈ F} of F. We denote the linear subspace spanned by C_c by span C_c. We can write span C_c as a direct sum U_F ⊕ U_δ ⊕ U_b of three orthogonal subspaces U_F, U_δ, U_b ⊆ span C_c, and we denote the orthogonal projections onto these subspaces and onto span C_c by P_F, P_δ, P_b and P_C. Here, U_F = span F_c (U_F = {0} if F_c = {0}) is the (unique) subspace spanned by the minimal face, and U_δ and U_b are any two orthogonal subspaces of span C_c that are also orthogonal to U_F and which fulfill

1. span C_c = U_F ⊕ U_δ ⊕ U_b,
2. there exists a ball (relative to the subspace U_δ) around 0 in {P_δ x : x ∈ C_c},
3. there exists an orthonormal basis of U_b, say e₁, …, e_k, k = dim U_b, such that for all i ≤ k, {0} ≠ {⟨e_i, x⟩ : x ∈ C_c} ⊆ (−∞, 0],
4. P_C(y − Py) ∈ U_b.

U_b = {0} and U_δ = {0} are possible. Observe that this representation is not invariant under rotations. In particular, U_δ and U_b can change dimensions when C is rotated (see Figure 4 (a) for an example). The split into these three subspaces is helpful since it allows us to separate the errors that can be minimized by choosing particular elements x ∈ S from the errors that cannot be reduced by choosing elements in S. Recall that the errors of our algorithms at step t are w_t. The perturbations at step t that cannot be minimized by suitable choices of elements in S are P_b w_t, and the remainder P_δ w_t + P_F w_t consists of the error that can be reduced by choosing appropriate elements in S.
The split into these subspaces is shown in Figure 3 (b). $U_F$ is here the subspace spanned by the line on top of the triangular prism, $U_\delta$ is the subspace orthogonal to it and aligned with the base of the prism, and $U_b$ is orthogonal to these two subspaces; the whole prism lies below $Py$ (the red dot) in $U_b$.

A perturbed optimization problem. Apart from adding to the overall error, these perturbations introduce a subtle and more serious problem, as follows. For any $(P_F + P_\delta)w_t$ we have an element $s^*$ that maximizes $\langle s^*, (P_F + P_\delta)w_t\rangle$, and by choosing $s^*$ we reduce the overall error (the reduction will be significant if $\|(P_F + P_\delta)w_t\|$ is large). Imagine there is a second element $s$ which has a very small positive inner product with $(P_F + P_\delta)w_t$, i.e. $0 < \gamma_t = \langle s, (P_F + P_\delta)w_t\rangle \le \langle s^*, (P_F + P_\delta)w_t\rangle$. $s$ will not be chosen if the algorithm optimizes over $U_F \oplus U_\delta$; however, since we optimize over all of $\mathrm{span}\, C_c$, it is possible that $\langle s, w_t\rangle > \langle s^*, w_t\rangle$. In particular, this may happen if $\langle P_b w_t, s^*\rangle / \langle P_b w_t, s\rangle$ is large, and the perturbations $P_b w_t$ are then effectively changing the geometry of the optimization problem in $U_F \oplus U_\delta$ (see Figure 3 (c) for a visualization of this effect). The rate of convergence can be significantly reduced by this effect. In particular, this happens if the perturbations make the algorithm choose a sequence of elements with inner products $\gamma_t$ converging to $0$, since the error in $U_F \oplus U_\delta$ will then not decrease over time. This problem can only occur if the $P_b w_t$ values affect the different elements in $S$ in a very unbalanced way, i.e. there must be some elements $x, x' \in S$ such that $\langle P_b w_t, x\rangle \gg \langle P_b w_t, x'\rangle$. Our theorem below shows that this is the central problem that can hinder the rate of convergence. We demonstrate that Algorithm 1 converges with a rate of $1/t$ if, and only if, the perturbations affect the different elements in $S$ in a sufficiently balanced way.

The main assumption and the theorems.
Assumption 1 below formalizes the concept of a balanced influence of the perturbations on the elements of $S$. To state this assumption we need the following set of critical points for a given sequence of elements $x_t$ chosen by the algorithm,

$D(\{x_t\}) = \{x : x \in S \setminus F,\ x_t = x \text{ for infinitely many } t\}.$

We use here $\{x_t\}$ to abbreviate $\{x_t\}_{t \ge 1}$. Only elements in $D(\{x_t\})$ can lead to a reduction in convergence rates, and it suffices to check that none of the elements in $D(\{x_t\})$ leads to the effect described above in order to demonstrate fast convergence. Our main assumption is now the following:

Assumption 1: Given sequences $\{x_t\}_{t \ge 1}$, $\{w_t\}_{t \ge 0}$, we assume that there exists a representation $U_F \oplus U_\delta \oplus U_b = \mathrm{span}\, C_c$ as described above such that for all $x, x' \in D(\{x_t\})$ there exist a $\lambda_{x,x'} < \infty$ and a $t_0 < \infty$ with $\langle P_b w_t, x - Py\rangle \le \lambda_{x,x'}\,\langle P_b w_t, x' - Py\rangle$ for all $t \ge t_0$.

We restrict our analysis to the case where there are only finitely many extremes of $C$. We thus assume that $C$ is compact, convex and has only finitely many extremes. A set $C$ with these properties is called a convex polytope. The following theorem demonstrates that for a convex polytope the above assumption is both necessary and sufficient for the fast rate of convergence of Algorithm 1.

Theorem 2 Given a compact convex set $C \subseteq H$ and a finite subset $S$ of $C$ with $\mathrm{ex}\, C \subseteq S$, there exists for $y \in H$ a constant $b > 0$ such that Algorithm 1 has a worst-case approximation error of $b/t$ for all $t \ge 1$ if, and only if, Assumption 1 is fulfilled for the sequences $\{x_t\}_{t \ge 1}$ and $\{w_t\}_{t \ge 0}$ generated by the algorithm.

The reason why the herding algorithm retains its rate of convergence is that elements outside the minimal face are not chosen anymore after some time $t_0 \in \mathbb{N}$, and the decay $t/(t+1)$ is high enough to remove the perturbations introduced in these first $t_0$ steps fast enough that they do not harm the overall rate of convergence, i.e.
if we assume $Py = 0$ then

$\|(I - P_F)w_t\| = \frac{t-1}{t}\,\|(I - P_F)w_{t-1}\| = \frac{t-2}{t}\,\|(I - P_F)w_{t-2}\| = \cdots = \frac{t_0}{t}\,\|(I - P_F)w_{t_0}\|$

and the error orthogonal to the minimal face decays with a rate of $1/t$. The situation is different for the CGM. While we can show that no element outside the minimal face is chosen after some $t_0 \in \mathbb{N}$, the perturbations introduced in these initial steps dominate the rate of convergence of the algorithm even for steps $t \ge t_0$ (in the case of applying the CGM with away-steps it is known that the minimal face is identified after finitely many steps under certain conditions, and the algorithm optimizes after this step over the minimal face (Wolfe, 1970; Guelat and Marcotte, 1986)). In the theorem below we decompose the error into two parts: the error in the affine subspace spanned by the minimal face $F$ and the error in the orthogonal complement of that space. This decomposition is natural in the sense that we have a strong reduction of the error in the minimal face per step and only a weak reduction in the orthogonal complement of the minimal face. The first part of the theorem is standard and follows from Beck and Teboulle (2004). In the following, $p_t$ denotes the approximation at time $t$. We use again Assumption 1, but be aware that the weight $w_t$ is defined slightly differently for the CGM, i.e. the weight is scaled down by a factor of $1/t$ at step $t$.
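The telescoping identity above is easy to check numerically: a component that is only down-scaled by $(t-1)/t$ at every step $t > t_0$ equals $(t_0/t)$ times its value at $t_0$. A minimal sketch in plain Python (the starting value standing in for $\|(I - P_F)w_{t_0}\|$ is an arbitrary illustrative choice):

```python
# Down-scaling of the component orthogonal to the minimal face:
# from step t0 on the component is multiplied by (s - 1)/s at every
# step s, so after step t it equals (t0 / t) times its value at t0.
def orthogonal_error(c_t0, t0, t):
    """Value at step t of a component that equals c_t0 at step t0 and
    is scaled by (s - 1)/s at each step s = t0 + 1, ..., t."""
    c = c_t0
    for s in range(t0 + 1, t + 1):
        c *= (s - 1) / s
    return c

c0, t0 = 3.7, 5
for t in (10, 100, 1000):
    # telescoping: c_t == (t0 / t) * c_t0 up to floating point error
    assert abs(orthogonal_error(c0, t0, t) - (t0 / t) * c0) < 1e-12
```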
Theorem 3 Given a compact convex set $C \subseteq H$, a finite subset $S$ of $C$ with $\mathrm{ex}\, C \subseteq S$ and an element $y \in H$, the following holds: If only elements in $F \cap S$, where $F \subseteq C$ is the minimal face that contains $Py$, are chosen, then the method converges linearly to the projection and there exist constants $b, \beta > 0$ with

$\|Py - p_t\| \le b\, e^{-\beta t}.$

Under Assumption 1, if $\min_{x \in S}\|Py - x\| > 0$ and the approximation does not equal $Py$ after finitely many steps, then the sequence $\{\|(I - P_F)(Py - p_t)\|\}_{t \ge 0}$ converges sub-linearly and there exists a constant $d > 0$ such that

$\|P_F(Py - p_t)\|^2 \le \big(1 - \delta_F^2/\mathrm{diam}^2(F)\big)\,\|P_F(Py - p_{t-1})\|^2 + d\,\|(I - P_F)(Py - p_{t-1})\|^2.$

Furthermore, there exists a time $t_0$ after which only elements in $F \cap S$ are chosen.

It is worth noting that the term $\big(1 - \delta_F^2/\mathrm{diam}^2(F)\big)\,\|P_F(Py - p_{t-1})\|^2$ would guarantee a linear rate of convergence if the extra perturbation part were not present.

The constant. It is of obvious interest to gain insight into the size of the involved constants, in particular into the size of the constant $b$ in Theorem 2 and its relation to quantities like $\delta_F$ and the dimension of $\mathrm{span}\, C_c$. The baseline is here the constant that we gain in the trivial case where $F = C$. Here, $b$ can be taken to be $3r + 2r^2/\delta_F$, with $r = \sup_{x \in C}\|x\|$, and the constant does not depend on the affine dimension of $C$ (see the proof of Theorem 2, part (ii.f)). In the case where $F \ne C$ we can first observe that $b$ depends both on $\delta_m$, the radius of the largest ball inside the projection of $C_c$ onto $U_F \oplus U_\delta$, and on $\lambda_0 = \sup_{x,x' \in D(\{x_t\})} \lambda_{x,x'}$, where the $\lambda_{x,x'}$ are the constants in Assumption 1; i.e. $b$ depends on $\delta_m/\lambda_0$ (see the proof of Theorem 2, part (iv.b)). So the larger the ball around $Py$ in $U_F \oplus U_\delta$ and the more closely coupled the perturbations are, the smaller the constant $b$. Referring these quantities back to geometric properties of $C$ is non-trivial, and in all likelihood providing a general characterization is at least as difficult as providing a general characterization of when Assumption 1 is fulfilled.
However, in concrete settings it is often actually rather easy to provide bounds on $b$ depending, for instance, on the affine dimension of $C$. We provide a number of examples in the next section. In particular, Corollaries 6-8 contain concrete values for the constant. The dependence on $d$ is here linear or sub-linear if we ignore the dependence of the dimension on the size of the set $C$, i.e. $r = \sup_{x \in C}\|x\|$ might also depend on the dimension; for instance, for the standard hypercube in $d$ dimensions, $r = \sqrt{d}$.

Figure 4: (a) The figure shows two possible splits into $U_b \oplus U_\delta$, depending on how the square is rotated. The first one is visualized by solid red arrows; here $U_b = \mathbb{R}^2$ and $U_\delta = \{0\}$. The second split is visualized through the dashed arrows: the red dashed arrow is a basis vector of $U_b$ and the blue dashed arrow is a basis vector of $U_\delta$. (b) The figure shows two projection problems. The locations of $Py$ are marked with the red dots and are annotated with (1) and (2). The red arrows visualize the space $U_b$ for the two problems. For problem (1) this is a three-dimensional space, since $U_F$ is here $0$-dimensional, and for (2) it is a $2$-dimensional space. In both cases $U_\delta = \{0\}$. (c) The figure resembles (b); instead of the hypercube it shows the simplex in three dimensions. The red arrows depict basis vectors for $U_b$. $U_\delta = \{0\}$, as for the hypercube.

2.1.1 EXAMPLES

We now take a look at a number of concrete projection problems to demonstrate how Assumption 1 can be used to prove fast convergence of Algorithm 1 in various situations. Our first result does not rely on assumptions about the shape of the convex polytope, but assumes that its span is a $d < \infty$ dimensional subspace of a Hilbert space and that the projection $Py$ lies in a $(d-1)$-dimensional face of the convex set (Corollaries 4 and 5). In the applications we have in mind, the point of projection is related to an estimator and the convex polytope enforces some form of sparsity.
Typically, in such settings, the projection lies in some low-dimensional face of the convex polytope. We provide a rather general condition for this case in Corollary 6, and we use this corollary to demonstrate that Algorithm 1 converges with the fast rate independent of the location of $Py$ if we are working, for instance, with the hypercube or the simplex (Corollaries 7 and 8). In particular, $Py$ can lie in a low-dimensional face of the simplex or can be an extreme of it. There is certainly more to be understood here, but these cases demonstrate that the performance of Algorithm 1 is robust across a variety of settings, and they demonstrate the uses of Assumption 1. The proofs of the corollaries that follow are contained in the appendix, on page 39 and onward.

It seems also worth pointing out that we can rotate our coordinate system in an arbitrary way and translate the convex polytope and $y$ without affecting the algorithms or the rate of convergence. This holds because the algorithm uses only inner products, and orthogonal operators (rotations) do not change these. Similarly, a translation does not affect the maximization step or the update. This allows us, for instance, to prove the fast rate of convergence for the standard hypercube $[0,1]^d$ and to generalize this result to arbitrary hypercubes in $\mathbb{R}^d$.

One-dimensional $U_b$. If the space $U_b$ is one-dimensional, then the perturbations affect all elements in $S \setminus F$ equally (up to a finite multiplier) and Assumption 1 is always fulfilled. This is in particular the case if the minimal face is $(d-1)$-dimensional, but it can also be fulfilled for lower-dimensional faces if $U_\delta$ is not $0$-dimensional.

Corollary 4 Given a compact convex set $C \subseteq H$ and a finite subset $S$ of $C$ with $\mathrm{ex}\, C \subseteq S$. If for $y \in H$ there exists a decomposition $U_F \oplus U_\delta \oplus U_b$ of $\mathrm{span}\, C_c$ such that $U_b$ is one-dimensional, then Assumption 1 is fulfilled.
In particular, in this case there exists a constant $b > 0$ such that Algorithm 1 has a worst-case approximation error of $b/t$ for all $t \ge 1$.

One can relate the assumption that $U_b$ is one-dimensional to a uniqueness assumption about the projection $Py$. If $P_C y \notin C$ and only elements in $A = \{z \in H : z = \lambda P_C(y - Py) + P_C Py,\ \lambda \in [0, \infty)\}$ are projected onto $Py$, then $U_b$ is $1$-dimensional and Assumption 1 is fulfilled. We summarize this result in another corollary.

Corollary 5 Given a compact convex set $C \subseteq H$ and a finite subset $S$ of $C$ with $\mathrm{ex}\, C \subseteq S$. Assumption 1 is fulfilled if $P_C y \in H \setminus C$ and, whenever $Pz = Py$ for some $z \in H$, $P_C z \in A$ holds. In particular, in this case there exists a constant $b > 0$ such that Algorithm 1 has a worst-case approximation error of $b/t$ for all $t \ge 1$.

This last assumption is not fulfilled in Figure 1 if the projection lies in the corners or edges, but it is fulfilled in all other cases. A quite different application of our theorem allows us to exploit the geometry of the convex set to derive the fast rate of convergence independent of the location of $Py$.

Zero-dimensional $U_\delta$. If we can rotate the convex set such that we get a split $U_F \oplus U_\delta \oplus U_b$ for which $U_\delta = \{0\}$, then Assumption 1 is also fulfilled. The next corollary states this result. We use here, and in the following two corollaries, $d = \dim U_b$, and we let $e_1, \ldots, e_d$ be any basis of $U_b$. Furthermore, $\epsilon = \min_{i \le d} \min\{|\langle e_i, x - Py\rangle| : x \in S \setminus F,\ \langle e_i, x - Py\rangle \ne 0\} > 0$ and $r = \sup_{x \in C}\|x\|$.

Corollary 6 Let $C$ be a compact convex set in some Hilbert space $H$, $S$ a finite set with $\mathrm{cch}\, S = C$, and $y \in H$ such that there exists a split $U_F \oplus U_\delta \oplus U_b$ of $\mathrm{span}\, C_c$ with $U_\delta = \{0\}$. Then Assumption 1 is fulfilled and there exists a constant $b > 0$ such that Algorithm 1 has a worst-case approximation error of $b/t$ for all $t \ge 1$. The constant $b$ can be chosen as $\sqrt{d}\,4r^3/(\epsilon\delta_F) + 6r^2(1/\delta_F + 1/\epsilon) + 5r$.
This condition is somewhat abstract, but leads, for example, to results that prove that the algorithm converges with the fast rate in classical sparsity settings independent of the location of $Py$ (Corollaries 7 and 8 below). Also, observe that Corollary 6 together with Corollary 4 implies that if $Py$ lies in a face of dimension $d-2$ then Algorithm 1 will converge with the fast rate: if $U_\delta$ is $0$-dimensional then this follows from Cor. 6, and if $U_\delta$ is $1$-dimensional then $U_b$ is also $1$-dimensional and the result follows from Cor. 4 ($U_\delta$ cannot be $2$-dimensional due to Lemma 1).

Hypercubes. Here, we consider the compact convex set $[0,1]^d = \mathrm{cch}\,\{0,1\}^d \subseteq \mathbb{R}^d$ and transformations of it (rotations, translations and changes of its size). The set of extremes of the standard hypercube is $\{0,1\}^d$, and the $d$-dimensional hypercube has $2^{d-m}\binom{d}{m}$ many $m$-dimensional faces, where $m \le d$. We formulate the following corollary in terms of a rotation matrix $Q$, a translation by a vector $z$ and a scaling by a scalar $c$. In the proof of the corollary we transform the hypercube to the standard hypercube, and we show that $U_\delta = \{0\}$ independent of the location of $y$. This is visualized in Figure 4 (b) for a $3$-dimensional hypercube: (1) and (2) are here two projection problems, and the red arrows depict two bases of $U_b$. Constructing such bases for which $U_\delta = \{0\}$ is always possible for the standard hypercube.

Corollary 7 Let $y \in \mathbb{R}^d$, $Q$ any orthogonal matrix, $c > 0$ any scaling, $z \in \mathbb{R}^d$, $S = \{0,1\}^d$ and $C = [0,1]^d$, $d \ge 1$. Assumption 1 is fulfilled for the sets $\tilde S = cQ[S + z]$ and $\tilde C = cQ[C + z]$ (independently of the dimensionality of the face $Py$ lies in). In particular, there exists a constant $b > 0$ such that Algorithm 1 has a worst-case approximation error of $b/t$ for all $t \ge 1$. The constant $b$ can be chosen as $\sqrt{d}\,4r^3/(c\delta_F) + 6r^2(1/\delta_F + 1/c) + 5r$.

Standard simplex.
Algorithm 1 also attains a fast rate of convergence if the convex set we use is the standard simplex in $\mathbb{R}^d$, independently of the location of $y$. The standard simplex has $\binom{d}{m}$ many faces of dimension $m-1$, $1 \le m \le d$, and the dimension of the corresponding span of the centered face is $m-1$. The faces of such a simplex are again simplices. In particular, if we denote the standard simplex by $\Delta_{d-1} = \mathrm{cch}\,\{e_1, \ldots, e_d\}$, then the set of $(m-1)$-dimensional faces of $\Delta_{d-1}$ is $\{\mathrm{cch}\,\{e_{i_1}, \ldots, e_{i_m}\} : i_1 < i_2 < \cdots < i_m,\ i_j \le d\ \forall j \le m\}$. As for the hypercube, we can show that $U_\delta = \{0\}$. This is visualized in Figure 4 (c) for the standard simplex in $\mathbb{R}^3$; the figure shows two projection problems (1) and (2) and corresponding bases of $U_b$.

Corollary 8 Let $y \in \mathbb{R}^d$, $S = \{e_i : i \le d\}$ and $C = \Delta_{d-1} = \mathrm{cch}\, S$, $d \ge 1$. Assumption 1 is fulfilled and there exists a constant $b > 0$ such that Algorithm 1 has a worst-case approximation error of $b/t$ for all $t \ge 1$. The constant $b$ can be chosen as $4dr^3/\delta_F + 6r^2(1/\delta_F + \ldots)$

2.2 The Algorithm for Finite S

We are particularly interested in the case where $S = \{x_1, \ldots, x_n\}$ is finite. If $S$ is finite we can change the algorithm to keep track of $\langle w_t, x_i\rangle =: a_{ti}$ instead of $w_t \in H$. This reflects a change of representation from a basis representation of $w_t$ to one based on $S$. After this change of representation, the weight vector can be replaced by $a_t = (a_{t1}, \ldots, a_{tn})$ in the algorithm.

Algorithm 2. Input: $y \in H$, $T \in \mathbb{N}$, and $S \subseteq H$.
1. [Initialize] Set $a_0 = (\langle y, x_1\rangle, \ldots, \langle y, x_n\rangle)$ and $t = 1$.
2. [Optimization oracle] Choose $i_t \in \arg\max_{j \le n} a_{(t-1)j}$.
3. [Update weight] Set $a_t = a_{t-1} - (\langle x_{i_t}, x_1\rangle, \ldots, \langle x_{i_t}, x_n\rangle) + a_0$.
4. [Iterate] If $t < T$ then increment $t$ by 1 and go back to 2.
5. [Terminate] Return the approximation $(x_{i_1} + \cdots + x_{i_T})/T$.

The computational costs per iteration are then determined by the number of samples $n$ and the computational cost of calculating inner products in $H$ ($n$ inner products per iteration).
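Algorithm 2 can be sketched directly in numpy. The sketch below mirrors the bookkeeping in the text: only the inner products $a_t$ are updated (here via a precomputed Gram matrix), never the weight vector $w_t$ itself. The concrete set $S$ (corners of a square) and the target $y$ are illustrative choices:

```python
import numpy as np

def algorithm2(y, S, T):
    """Herding over a finite S (rows of a matrix), tracking only inner
    products: a0[j] = <y, x_j>; per step pick i maximizing a, then
    a <- a - (<x_i, x_1>, ..., <x_i, x_n>) + a0."""
    G = S @ S.T                 # Gram matrix <x_i, x_j>
    a0 = S @ y                  # <y, x_j>
    a = a0.copy()
    chosen = []
    for _ in range(T):
        i = int(np.argmax(a))   # optimization oracle
        chosen.append(i)
        a = a - G[i] + a0       # update weight representation
    return S[chosen].mean(axis=0)   # (x_{i_1} + ... + x_{i_T}) / T

S = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])  # extremes of a square
y = np.array([0.25, 0.6])                               # lies inside cch(S)
p = algorithm2(y, S, T=2000)
assert np.linalg.norm(p - y) < 0.05  # y is in C, so the approximation tends to y
```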
2.3 Line Search for Finite S

The line search in Algorithm 1 (ls) contains terms of the form $\|w_{t-1}\|^2$. The philosophy in Algorithm 2 is to avoid measuring objects in their norm and to work solely with relations between objects in the form of inner products $\langle y, x_i\rangle$, to reduce computational costs. To follow this philosophy we need a version of the line search that does not need access to $\|w_{t-1}\|$ or $\|y\|$. Achieving this goal is easy enough: the line search aims at finding an $\alpha_t$ such that $p_t = (1 - \alpha_t)p_{t-1} + \alpha_t x_{i_t}$, with $p_t$ being our approximation at step $t$, achieves minimal distance to $y$. So we want

$\alpha_t \in \arg\min_{\alpha \in [0,1]} \|(1-\alpha)p_{t-1} + \alpha x_{i_t} - y\|^2.$

Setting the derivative of the squared norm with respect to $\alpha$ to zero, we gain

$\tilde\alpha_t = \frac{\langle y - p_{t-1},\, x_{i_t} - p_{t-1}\rangle}{\|p_{t-1} - x_{i_t}\|^2}.$

The norm in the denominator can be calculated recursively, and calculating its value needs no significant computational resources. Because $\|w_t\| = \|p_t - y\|$, we gain the same $\tilde\alpha_t \in [0, \infty)$ as in Algorithm 1 (ls). Also, the minimizer over the interval $[0,1]$ is again $\alpha_t = 1 \wedge \tilde\alpha_t$.

2.4 Approximation Error Bounds through Duality

There exists a well-established test for the approximation error (Wolfe, 1976). Let $S = \{x_1, \ldots, x_n\}$ still be finite. For any $p \in C$, $p \ne y$, we can write $Py = \lambda_1 x_1 + \cdots + \lambda_n x_n$ with convex weights $\lambda_i$, so that

$\langle p - y,\, p - Py\rangle = \sum_{i=1}^n \lambda_i \langle p - x_i,\, p - y\rangle \le \max_{x \in S}\,\langle p - x,\, p - y\rangle,$

while $\|p - y\|\,(\|p - y\| - \|Py - y\|) \le \langle p - y,\, p - Py\rangle$. So

$\|p - y\| - \|Py - y\| \le \frac{1}{\|p - y\|}\,\max_{x \in S}\,\langle p - x,\, p - y\rangle.$

The last term tells us how far we are away from the optimum, and it can be used to guarantee a specific approximation error at termination. In the settings we are interested in, $\|p - y\|$ is, in fact, too expensive to compute, and we need to modify the approach from Wolfe (1976). Observe that the above gives us the easier-to-compute

$\max_{x \in S}\,\langle p - x,\, p - y\rangle \ge \|p - y\|\,(\|p - y\| - \|Py - y\|) \ge (\|p - y\| - \|Py - y\|)^2,$

because $\|p - y\| \ge \|p - y\| - \|Py - y\| \ge 0$; the left-hand side is thus directly an upper bound on $(\|p - y\| - \|Py - y\|)^2$. However, this bound is overly conservative, and we lose an order of magnitude, since we lower bound the factor $\|p - y\|$ with the quantity $\|p - y\| - \|Py - y\|$, which goes to $0$.
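The conservative bound can be sanity-checked on a toy problem where the exact projection is known. A small numpy sketch (the square and the points are illustrative; the projection of $y$ onto $[0,1]^2$ is coordinate-wise clipping):

```python
import numpy as np

# Conservative duality bound: sqrt(max_x <p - x, p - y>) upper-bounds
# ||p - y|| - ||Py - y|| for any feasible approximation p.
S = [np.array(v, dtype=float) for v in [(0, 0), (1, 0), (0, 1), (1, 1)]]
y = np.array([2.0, 0.75])
Py = np.clip(y, 0.0, 1.0)            # exact projection onto the square
p = np.array([0.25, 0.25])           # some feasible approximation
gap = np.linalg.norm(p - y) - np.linalg.norm(Py - y)
bound = np.sqrt(max(float(np.dot(p - x, p - y)) for x in S))
assert bound >= gap - 1e-12          # the bound dominates the true gap
```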
A better way seems to be to use linear functionals to lower bound $\|y - p\|$. Natural choices are here $\langle x, \cdot\rangle$ for $x \in S$, or $\langle p, \cdot\rangle$, because we can often compute these functionals efficiently. The former choice can be exploited in the following way:

$\|y - p\| \ge \max_{x \in S}\, |\langle y - p,\, x/\|x\|\rangle|,$

and we can calculate this lower bound on $\|y - p\|$ in about $O(n)$ operations (depending on the representation of $p$) by calculating the right-hand-side inner product for every $x \in S$. Using this bound we get the following upper bound on the approximation error:

$\frac{\max_{x \in S}\,\langle p - x,\, p - y\rangle}{\max_{x \in S}\, |\langle y - p,\, x/\|x\|\rangle|} \ge \|p - y\| - \|Py - y\|.$

The latter choice leads to

$\|y - p\| \ge \langle y - p,\, p/\|p\|\rangle.$

The inequality also holds when taking the absolute value of the right side, i.e. $\|y - p\| \ge |\langle y - p,\, p/\|p\|\rangle| = \big|\,\|p\| - \langle y,\, p/\|p\|\rangle\,\big|$, and

$\frac{\max_{x \in S}\,\langle p - x,\, p - y\rangle}{\big|\,\|p\| - \langle y,\, p/\|p\|\rangle\,\big|} \ge \|p - y\| - \|Py - y\|.$

If our convex set is a norm ball, then as $p \to Py$ we also have that $y - p$ gets more and more aligned with $p/\|p\|$, and we expect this bound to be good.

2.5 Parallelization

Algorithm 2 is easy to parallelize, since the bottleneck per iteration is the calculation of $n$ inner products in the update equation of $a_t$. These inner products can be calculated independently of each other, and by having $c$ processes available we can distribute the computation such that each process has to calculate at most $\lceil n/c\rceil$ inner products per iteration. Determining the $\arg\max$ can then be achieved by a loop over the $n$ entries of $a_t$. This operation is typically fast and no parallelization is needed, though it is easy to distribute this operation too, reducing the computation time to $\lceil n/c\rceil + c$ by first calculating $c$ maxima over sets of size at most $\lceil n/c\rceil$ and by then calculating the maximum of these $c$ maxima. Finally, the summation of the vector of inner products with $a_{t-1}$ and $a_0$ can also be split such that each process has to perform at most $\lceil n/c\rceil$ of these summations. This gives us in total a computation time in the order of $\lceil n/c\rceil$.

3. CCP-Kernel Regression

We now want to apply the algorithm to a challenging statistical problem.
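The first of the two computable bounds above (the one built from the functionals $\langle x, \cdot\rangle$, $x \in S$) can also be checked on the same kind of toy problem. A numpy sketch (square, clipping as exact projection; the points are illustrative):

```python
import numpy as np

def gap_upper_bound(p, y, S):
    """Upper bound on ||p - y|| - ||Py - y|| from Section 2.4:
    max_x <p - x, p - y> divided by the computable lower bound
    max_x |<y - p, x/||x||>| on ||y - p||."""
    num = max(float(np.dot(p - x, p - y)) for x in S)
    den = max(abs(float(np.dot(y - p, x) / np.linalg.norm(x)))
              for x in S if np.linalg.norm(x) > 0)
    return num / den

S = [np.array(v, dtype=float) for v in [(0, 0), (1, 0), (0, 1), (1, 1)]]
y = np.array([1.5, 0.5])
Py = np.clip(y, 0.0, 1.0)            # exact projection onto the square
p = np.array([0.5, 0.5])             # some feasible approximation
true_gap = np.linalg.norm(p - y) - np.linalg.norm(Py - y)
assert gap_upper_bound(p, y, S) >= true_gap - 1e-12
```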
That is the problem of non-parametric regression. For this we use a reproducing kernel Hilbert space (RKHS, Aronszajn (1950)), which is in a certain sense natural given past herding applications, but which is also computationally efficient thanks to the reproducing property. An RKHS $H$ is implicitly defined through a kernel function $k(x, x')$: $H$ is the completion of

$\Big\{\sum_{i=1}^n \alpha_i k(x_i, \cdot) : n \in \mathbb{N},\ x_i \in X,\ \alpha_i \in \mathbb{R}\Big\},$

where $X$ is the domain of the covariates (or inputs). In statistical applications we like to approximate $H$ with a nested family of sets of functions that are restricted in a certain way in their size. One nested family is, for instance, the family of closed balls $B_r$ of radius $r$ centered at zero. For these we know that $\bigcup_{r \ge 0} B_r = H$. The balls $B_r$ are, however, not compact if $H$ is infinite-dimensional. Also, controlling the extremes of $B_r$ is not necessarily easy, since we usually only have access to the kernel function $k$ and not to a basis $\{e_n\}_{n \ge 1}$ of $H$. So ideally we would like a family of sets that can approximate $H$ like the balls $B_r$, but whose members, in contrast to $B_r$, are compact, convex and controllable through the kernel. There is a family of sets with these properties which arises naturally from the kernel function (cl denotes the closure operator):

$C(r) = \mathrm{cch}\,\{r\,k(x, \cdot) : x \in X\} = \mathrm{cl}\,\Big\{\sum_{i=1}^n \alpha_i k(x_i, \cdot) : n \in \mathbb{N},\ x \in X^n,\ \alpha \in \mathbb{R}^n_+,\ \|\alpha\|_1 = r\Big\} \subseteq H.$

$C(r)$ is compact and convex, and the extremes of $C(r)$ are contained in the set $S = \{r\,k(x, \cdot) : x \in X\}$. There is an obvious similarity to the definition of $L$, and by suitably symmetrizing $C(r)$ we gain a family of sets which covers all of $L$ and can approximate any element in $H$ up to arbitrary precision.

3.1 Symmetric Closed-Convex Hull

The sets $C(r)$ might not contain $0$ or any elements on the negative axes. This is obviously not ideal for representing functions. We overcome this shortcoming by symmetrizing $C(r)$, i.e. by including the elements $-r\,k(x, \cdot)$.
In this way we get the family of sets

$C_s(r) = \mathrm{cl}\,\Big\{\sum_{i=1}^n \alpha_i k(x_i, \cdot) : n \in \mathbb{N},\ x \in X^n,\ \alpha \in \mathbb{R}^n,\ \|\alpha\|_1 \le r\Big\}.$

This family of sets has some similarity to the closed balls in $H$. In particular, if the kernel function is bounded then $C_s(r)$ is a subset of the closed ball of radius $\tilde r = r \sup_{x \in X} k(x,x)^{1/2}$ in $H$, since

$\Big\|\sum_{i=1}^n \alpha_i k(x_i, \cdot)\Big\| \le \|\alpha\|_1 \sup_{x \in X} k(x,x)^{1/2} \le \tilde r,$

and the smallest closed set containing the elements $\sum_i \alpha_i k(x_i, \cdot)$ cannot be larger than a closed ball of radius $\tilde r$.

Denseness and universal approximation. More important for us is the following property: the set $\bigcup_{r \ge 0} C_s(r)$ is dense in $H$, since for any $f \in H$ and $\epsilon > 0$ there exist an $n \in \mathbb{N}$, elements $x_1, \ldots, x_n \in X$ and coefficients $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ such that $\|f - \sum_{i=1}^n \alpha_i k(x_i, \cdot)\| \le \epsilon$. But $\sum_{i=1}^n \alpha_i k(x_i, \cdot)$ is an element of $C_s(r)$ if $r \ge \sum_{i=1}^n |\alpha_i|$. In other words, we can approximate any function in $H$ arbitrarily well by making $r$ large enough. Furthermore, if our kernel function is a universal kernel, then we can approximate any continuous function on $X$ arbitrarily well in the supremum norm. This property is sometimes called the universal approximation property.

3.2 Simplifying the Sets

Optimizing over $C_s(r)$ itself is difficult, and from a practical point of view it makes sense to reduce the sets further to save computation time. In regression we have a number of covariates $x_1, \ldots, x_n$ given, and we know that we can represent any RKHS function $h$ exactly on these points $x_i$, $i \le n$, by functions of the form $\sum_{i=1}^n \alpha_i k(x_i, \cdot)$, where $\alpha \in \mathbb{R}^n$ are suitable weights, if the kernel matrix $K = (k(x_i, x_j))_{i,j \le n}$ is of full rank, because $K\alpha = (h(x_1), \ldots, h(x_n))^\top$ has then a (unique) solution. From this point of view it makes sense to use the family of sets

$C(r) := \Big\{\sum_{i=1}^n \alpha_i k(x_i, \cdot) : \alpha \in \mathbb{R}^n,\ \|\alpha\|_1 \le r\Big\}$

instead of $C_s(r)$. $C(r)$ is fully characterized by $S = \{\pm r\,k(x_1, \cdot), \ldots, \pm r\,k(x_n, \cdot)\}$ in the sense that $C(r) = \mathrm{cch}\, S$ and $S$ contains the extremes of $C(r)$. $S$ itself is of size $2n$ and we can optimize efficiently over it.
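The containment $C_s(r) \subseteq B_{\tilde r}$ rests on the triangle inequality $\|\sum_i \alpha_i k(x_i, \cdot)\| \le \|\alpha\|_1 \sup_x k(x,x)^{1/2}$, which is easy to check numerically, since the squared RKHS norm of $\sum_i \alpha_i k(x_i,\cdot)$ is $\alpha^\top K \alpha$. A quick sketch (random points and coefficients, Gaussian kernel with $k(x,x) = 1$; all concrete values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=8)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)   # Gaussian kernel, k(x, x) = 1
alpha = rng.normal(size=8)
rkhs_norm = np.sqrt(alpha @ K @ alpha)        # ||sum_i alpha_i k(x_i, .)||_H
bound = np.abs(alpha).sum() * np.sqrt(K.diagonal().max())
assert rkhs_norm <= bound + 1e-12             # triangle inequality
```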
3.3 Interpolation in H

Before approaching the regression problem, we start with a closely related interpolation problem. We aim at interpolating a function $g \in H$ at $n$ support points $x_1, \ldots, x_n$. Let $y_1 = g(x_1), \ldots, y_n = g(x_n)$. The interpolation algorithm is essentially an efficient version of Algorithm 2 applied to the elements $\pm r\,k(x_1, \cdot), \ldots, \pm r\,k(x_n, \cdot)$, where $r$ is the scaling factor of $C(r)$. So at the heart of the algorithm we have a vector $a \in \mathbb{R}^n$ which keeps track of the approximation error, and the entries $a_i$ are just $\langle w_t, k(x_i, \cdot)\rangle$, where $w_t$ is the weight vector used in Algorithm 1. We formulate the algorithm in terms of assignments $\leftarrow$, which denote the operation of overwriting the left-hand side with the value on the right-hand side.

Algorithm 3. Input: $y \in \mathbb{R}^n$, $T \in \mathbb{N}$, $r > 0$, a kernel function $k$ and $x_1, \ldots, x_n \in X$.
1. [Initialize] Set $a = ry$ and $t = 1$.
2. [Optimization oracle] Choose $j' \in \arg\min_{i \le n} a_i$ and $j'' \in \arg\max_{i \le n} a_i$.
3. If $-\min_{i \le n} a_i \ge \max_{i \le n} a_i$ then let $j = j'$, $z_t = x_j$ and $s_t = -1$,
4. else let $j = j''$, $z_t = x_j$ and $s_t = 1$.
5. [Update weight] Set $a_i \leftarrow a_i + r y_i - r^2 s_t k(x_i, z_t)$ for all $i \le n$.
6. [Iterate] If $t < T$ then increment $t$ by 1 and go back to 2.
7. [Terminate] Return the regressor $f(x) = r\,(s_1 k(z_1, x) + \cdots + s_T k(z_T, x))/T$.

The complexity of the algorithm is $O(Tn)$, since we have $T$ iterations and we need to update $a \in \mathbb{R}^n$ in each iteration. The bottleneck in this algorithm is the evaluation of the kernel values $k(z_i, x)$.

Line search. The line search version of this algorithm is based on Section 2.3. To keep the complexity of the algorithm at $O(Tn)$, we update recursively the quantities $\rho, \mu, \kappa$, where at step $t$ we have $\rho = \|p_{t-1}\|^2$, $\mu = \langle p_{t-1}, y\rangle$ and $\kappa = (\langle p_{t-1}, k(x_1, \cdot)\rangle, \ldots, \langle p_{t-1}, k(x_n, \cdot)\rangle)$.

Algorithm 3 (ls). Input: $y \in \mathbb{R}^n$, $T \in \mathbb{N}$, $r > 0$, a kernel function $k$ and $x_1, \ldots, x_n \in X$.
1. [Initialize] Set $a = ry$, $t = 1$, $\rho = 0$, $\mu = 0$ and $\kappa = (0, \ldots, 0)$.
2. [Optimization oracle] Choose $j' \in \arg\min_{i \le n} a_i$ and $j'' \in \arg\max_{i \le n} a_i$.
3.
If $-\min_{i \le n} a_i \ge \max_{i \le n} a_i$ then let $j = j'$, $z_t = x_j$ and $s_t = -1$,
4. else let $j = j''$, $z_t = x_j$ and $s_t = 1$.
5. [Calculate step size] Let $v = r^2(k(x_1, z_t), \ldots, k(x_n, z_t))$ and (recalling $\rho = \|p_{t-1}\|^2$, $\mu = \langle p_{t-1}, y\rangle$ and $\kappa_i = \langle p_{t-1}, k(x_i, \cdot)\rangle$)

$\tilde\alpha_t = \frac{r s_t y_j - r s_t \kappa_j - \mu + \rho}{\rho - 2 r s_t \kappa_j + v_j};$

if $t = 1$ let $\alpha_t = 1$, and otherwise let $\alpha_t = 1 \wedge \tilde\alpha_t$.
6. [Update weight] $a \leftarrow (1 - \alpha_t)a + \alpha_t(ry - s_t v)$.
7. [Update line search variables] $\rho \leftarrow (1 - \alpha_t)^2\rho + 2\alpha_t(1 - \alpha_t) r s_t \kappa_j + \alpha_t^2 v_j$, $\kappa \leftarrow (1 - \alpha_t)\kappa + \alpha_t(s_t/r)v$ and $\mu \leftarrow (1 - \alpha_t)\mu + \alpha_t s_t r y_j$.
8. [Iterate] If $t < T$ then increment $t$ by 1 and go back to 2.
9. [Terminate] Return the regressor $f(x) = r\,(s_1\beta_1 k(z_1, x) + \cdots + s_T\beta_T k(z_T, x))$, with $\beta_i = \alpha_i \prod_{n=i+1}^T (1 - \alpha_n)$.

The expensive calculation is here the calculation of $v = r^2(k(x_1, z_t), \ldots, k(x_n, z_t))$.

Approximation error stopping rule. We can also use an approximation error below $\epsilon$ as the stopping criterion, by using the bounds from Section 2.4. The algorithm below achieves this in $O(Tn)$ for the bound that uses our approximation $p_t$ to gauge the approximation error. The algorithm is very similar to the line search algorithm, because we need to store similar quantities for calculating the bound as for the line search. We update recursively the elements $\rho, \mu, \kappa$, where at step $t$ (before the update) we have $\rho = \|p_{t-1}\|^2$, $\mu = \langle p_{t-1}, y\rangle$ and $\kappa = (\langle p_{t-1}, k(x_1, \cdot)\rangle, \ldots, \langle p_{t-1}, k(x_n, \cdot)\rangle)$. The complexity of the algorithm is again $O(Tn)$. The version shown below is for the line search; by replacing $\alpha_t$ with $1/t$ one can gain a version for the standard algorithm. We state only the changes to Algorithm 3 (ls).

Algorithm 3 (ls, ae). (replace 8. and 9. in Algorithm 3 (ls)). Input: $y \in \mathbb{R}^n$, $\epsilon > 0$, $r > 0$, a kernel function $k$ and $x_1, \ldots, x_n \in X$.
8. [Upper bound] Calculate the upper bound

$b = \max_{i \le n} \frac{\rho - \mu - r\kappa_i + r y_i}{\big|\sqrt{\rho} - \mu/\sqrt{\rho}\big|} \ \vee\ \max_{i \le n} \frac{\rho - \mu + r\kappa_i - r y_i}{\big|\sqrt{\rho} - \mu/\sqrt{\rho}\big|}.$

9. [Terminate] If $b \le \epsilon$, return the regressor $f(x) = r\,(s_1\beta_1 k(z_1, x) + \cdots + s_t\beta_t k(z_t, x))$, where $\beta_i = \alpha_i \prod_{n=i+1}^t (1 - \alpha_n)$. Otherwise, go back to 2.
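Algorithm 3 (the plain variant, without line search) can be sketched in a few lines of numpy. The setup below is illustrative: we pick a target $g$ that lies in $C(r)$ by construction (coefficients with $\|\alpha\|_1 \le r$) and check that the RKHS error, computable as $(\alpha - c)^\top K(\alpha - c)$, shrinks substantially over the iterations:

```python
import numpy as np

def algorithm3(y, K, r, T):
    """Herding interpolation over {+-r k(x_i, .)}: returns coefficients
    c with f_T = sum_i c_i k(x_i, .), i.e. c = (r/T) * signed pick counts."""
    n = len(y)
    a = r * y.astype(float).copy()
    c = np.zeros(n)
    for _ in range(T):
        j_min, j_max = int(np.argmin(a)), int(np.argmax(a))
        if -a[j_min] >= a[j_max]:
            j, s = j_min, -1.0
        else:
            j, s = j_max, 1.0
        a += r * y - r * r * s * K[:, j]   # a_i <- a_i + r y_i - r^2 s_t k(x_i, z_t)
        c[j] += s
    return r * c / T

x = np.linspace(0.0, 1.0, 15)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2 ** 2))  # Gaussian kernel
alpha_true = np.zeros(15)
alpha_true[3], alpha_true[10] = 0.6, -0.4      # ||alpha_true||_1 = 1 <= r
y = K @ alpha_true                             # y_i = g(x_i)
c = algorithm3(y, K, r=2.0, T=3000)

def rkhs_err_sq(coef):                         # ||g - f||_H^2 = d^T K d
    d = alpha_true - coef
    return float(d @ K @ d)

assert rkhs_err_sq(c) < 0.5 * rkhs_err_sq(np.zeros(15))
```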
As for the other two algorithms above, the computationally most demanding operation in this algorithm is the calculation of $(k(x_1, z_t), \ldots, k(x_n, z_t))$.

3.4 Norm-Minimizing Regressor

The interpolation algorithms can also be applied to the regression problem. Let the observations be $(x_1, y_1), \ldots, (x_n, y_n)$ and let $K$ be the kernel matrix. If the kernel matrix is of full rank, then with $\alpha = K^{-1}y$, where $y = (y_1, \ldots, y_n)^\top$, the function $h = \sum_{i=1}^n \alpha_i k(x_i, \cdot) \in H$ fulfills $(h(x_1), \ldots, h(x_n))^\top = K\alpha = y$, and the interpolation algorithm applied to $h$ will converge under the usual conditions with a rate of $1/t$, that is $\|h - g_t\|^2 \le b^2/t^2$ with $g_t = (1/t)\sum_{i=1}^t r s_i k(z_i, \cdot)$. There are two interesting observations here: first, the algorithm minimizes the distance to $h$ implicitly, without knowledge of $h$ itself. In fact, determining $h$ numerically is usually impossible due to ill-conditioning. Second, the algorithm minimizes the distance in the RKHS norm and not in a least-squares sense. Both the RKHS norm and the least-squares criterion can be seen as distance measures that are in certain ways related. For instance, a norm-distance of zero between two functions implies that they have the same least-squares error, and, furthermore, the least-squares error of our approximation is bounded by

$\frac{1}{n}\sum_{i=1}^n (y_i - g_t(x_i))^2 = \frac{1}{n}\sum_{i=1}^n |\langle h - g_t, k(x_i, \cdot)\rangle|^2 \le \Big(\frac{1}{n}\sum_{i=1}^n k(x_i, x_i)\Big)\|h - g_t\|^2 \le 2\Big(\frac{1}{n}\sum_{i=1}^n k(x_i, x_i)\Big)\|h - Ph\|^2 + 2c/t^2,$

where $Ph$ denotes the projection of $h$ onto $C(r)$ and the constant $c$ is $(b^2/n)\sum_{i=1}^n k(x_i, x_i)$. One can also formulate the norm minimization problem as a convex optimization problem,

$\min_{\lambda \in \mathbb{R}^n}\ \lambda^\top(K\lambda - 2y) \quad \text{s.t.}\quad \sum_{j \le n} |\lambda_j| \le r,$

and use convex programs to find the projection. This representation has the drawback that it makes explicit use of the $n \times n$ kernel matrix $K$, and for large-scale problems this formulation needs significant amounts of memory.

4. Experiments

We conducted a set of experiments to gauge the performance of the approach.
The first set of experiments was constructed to demonstrate the behavior of the error bounds and to compare the optimization routine with some standard optimization procedures. The second set of experiments compared the regressor with well-established regressors in a small-scale setting. The advantage of the small-scale setting is that we can compare to methods like the GP regressor. The final set of experiments focused on a large-scale benchmark data set and compared our method to Fast-KRR, which is the state-of-the-art regressor for large-scale problems.

4.1 Experiment 1: Empirical Rate of Convergence

The first set of experiments was conducted to explore basic properties of the optimization approach: how does a typical regression curve look in comparison to a GP regression curve? How do the error bounds relate? How does the optimization routine behave in comparison to a general-purpose optimizer? In our first experiment we generated 1000 data points from a Gaussian process (Gaussian covariance function) with normally distributed noise (the right plot in Figure 5). We fitted the maximum a posteriori estimator (MAP, known hyper-parameters) to it (red curve) and then split the data into an 800 and a 200 batch to run a cross-validation loop over the hyper-parameters (we also used a Gaussian covariance, but with an unknown width parameter). We ran the CCP algorithm for 20 (yellow) and 100 (purple) iterations (without line search). The red bars show the points and weights of the solution when the algorithm is run for 100 iterations. The black bars show the solution found by the cvx Matlab toolbox for a precision of about $10^{-16}$. The next experiment was about the bounds. We used again 1000 data points generated as above and a fixed hyper-parameter (no cv loop) to see how the bounds behave as the number of iterations increases.
Figure 5: Left: The plot shows three different bounds on the approximation error (Bound, Norm Distance and Conservative, on a log10 scale, over 10000 iterations). Right: The figure shows data from a Gaussian process with Gaussian likelihood function. The data is overlaid with three regression curves (MAP, CCP20 and CCP100). MAP is here the optimal regressor. The vertical bars show the weight that is assigned by the CCP-regressors with 100 iterations (red) and the cvx Matlab optimization toolbox (black) to observations at various locations x.

n      100         500        1000        2000        3000        4000        5000
CCP    0.03(0.03)  0.1(0.08)  0.2(0.09)   0.25(0.29)  0.3(0.26)   0.47(0.26)  0.57(0.61)
cvx    0.64(0.13)  1.3(0.08)  2.7(0.13)   8.16(0.12)  17.2(0.39)  31.3(1.68)  49.8(0.88)

Figure 6: Run time comparison between the CCP projection algorithm and a general purpose solver.

There is one difficulty here, which is that we do not have the best fitting function h ∈ H and we cannot calculate the exact distance to h. The three curves on the left side of the figure are our two bounds on the difference ‖h − g_t‖ − ‖h − Ph‖, where g_t is the regressor after t iterations and Ph is the solution of the optimization problem which we try to find (Conservative refers to the bound where we use ‖g_t − h‖ ≥ ‖g_t − h‖ − ‖Ph − h‖, and Bound refers to our second bound), and the distance to Ph (Norm Distance), that is ‖g_t − Ph‖, which is also an upper bound on ‖h − g_t‖ − ‖h − Ph‖ (we determined Ph by using cvx with a precision of about 10^{-16}). The results are shown on the left in Figure 5. One can observe that all three bounds show similar behavior; however, the conservative bound is, as expected, very loose. We were also interested in a run time comparison between the projection algorithm and a general purpose solver (the cvx toolbox) to see how much can be gained by using the specialized projection method. The cvx toolbox uses the SDPT3 package to solve semidefinite-quadratic-linear programming problems.
The details of the SDPT3 package are described in Toh, Todd, and Tutuncu (1999); Tutuncu, Toh, and Todd (2003). We used as a stopping rule for both methods an error below 10^{-4}, that is, CCP stopped when our bound guaranteed us that the error is below 10^{-4} and cvx stopped when its own error bound signaled an error below this threshold (details about how the precision is calculated in the SDPT3 package can be found in Toh et al. (1999)[Section 3]). The results of the run time comparison are shown in Figure 6 (mean (standard deviation) in seconds, averaged over 10 runs; same experimental setup as for the bounds plot; n is the number of data points).

4.2 Experiment 2: Run time vs. Least-Squares Error on a Small Scale Sample

The second set of experiments tested on a small scale how the run time of the CCP-regressor measures against the statistical performance. We were interested in seeing how the method fares in comparison to standard Gaussian process/ridge regression and Fast-KRR. We reproduced the experiment from Zhang et al. (2013) which uses the million songs data set (about 450000 data points and a covariate dimension of 90). We normalized the data as in Zhang et al. (2013) by letting each covariate dimension have standard deviation 1. We also used the same kernel (Gaussian with width parameter σ = 6). Finally, we normalized the response variable (year when a song appeared) to lie in [0, 1] by subtracting the minimal year and dividing by (maximum − minimum). 450000 data points are too many to run the standard GP regressor/ridge regressor on, and we downsampled the data set to a small subset of 5000 training points and 1000 test points. We were interested in how the radius r affects the performance of CCP and how this differs from how the regularization parameter λ affects the GP. We were also interested in seeing how the error threshold translates into least-squares performance and run time. The results are shown in the table in Figure 7.
The notation is (run time - least-squares error), r is the radius for CCP (with line search), p is the number of elements in each partition of Fast-KRR (we used for Fast-KRR the same regularization schedule as in Zhang et al. (2013)) and ε, in CCP-ε, refers to the stopping criterion. We used our upper bound to gauge the current error and we stopped when the error bound passed εr. The table below shows the results. We also ran the CCP-0.1 setting with r = 5000 to see how close we can get to the performance of the GP regressor. The estimator took 30.6 seconds to produce an estimate with an error of 0.124.

r(p)/λ      50/0.02       100/0.01      250/0.004     500/0.002      1000/0.001
GP          56 - 0.121    55 - 0.121    55 - 0.121    55 - 0.121     65 - 0.121
CCP-0.1     1.03 - 0.53   1.63 - 0.46   2.97 - 0.36   4.77 - 0.22    8.89 - 0.17
CCP-0.01    3.66 - 0.47   6.21 - 0.33   12.8 - 0.19   30.67 - 0.16   93.13 - 0.13
Fast-KRR    9.3 - 0.236   10.6 - 0.203  12.5 - 0.179  16.5 - 0.158

Figure 7: The table shows a comparison between the GP/ridge regressor, the Fast-KRR and the CCP-regressor. Each entry consists of the run time (seconds) and the least-squares error.

As one expects, the GP run time is essentially independent of λ. A bit surprising is here that the regularization parameter seems to have hardly any effect on the least-squares error of the GP regressor. The run time of the CCP algorithm, however, is strongly affected by r, which is consistent with our bound in which r influences the constant of the convergence rate. For both thresholds the computation time increases slightly non-linearly in r. The least-squares error depends on both ε and r. Increasing r results in this experiment in a higher reduction in least-squares error relative to the added computation time. The lowest error is achieved by the GP-regressor (which corresponds to the Fast-KRR with one partition). A slightly suboptimal solution with a least-squares error of 0.124 is achieved by the CCP-regressor in about 25 seconds less than what the GP-regressor needs.
One can also observe that Fast-KRR is fast in computing an estimate with a low least-squares error, which is on par with the CCP-regressor in this experiment (in 10 seconds the Fast-KRR reaches, for instance, about 0.2 and the CCP-regressor 0.17). The interesting question is here how the performance of Fast-KRR and the CCP-regressor scales with the amount of data.

4.3 Experiment 3: Run time vs. Least-Squares Error on a Large Scale Sample

In the last experiment we compared Fast-KRR with the CCP-regressor (line search) on the full million songs data set. We used similar partition sizes as in Zhang et al. (2013) for Fast-KRR and we ran the CCP-regressor with r = 100000. The bottleneck in the CCP-regressor is the number of kernel evaluations that need to be performed, which is tn for t iterations and n samples. The Fast-KRR regressor needs for m partitions (for simplicity let m be such that n ≡ 0 mod m) in the order of n²/m many kernel evaluations, and it needs to calculate m many inverses of matrices of size (n/m) × (n/m). So if t = n/m then the CCP-regressor needs to perform exactly as many kernel evaluations as the Fast-KRR algorithm. Especially for large m this will leave us with few iterations for the CCP-regressor, and one expects that if the kernel evaluation is expensive in comparison to the computational cost of an inverse then Fast-KRR will perform better. On the other hand, if we have either small partitions or cheap-to-evaluate kernels then the CCP-regressor will excel compared to Fast-KRR. In terms of memory: the CCP-regressor needs a small multiple of the original data size while the Fast-KRR needs to store another (n/m) × (n/m) matrix. We made 25 GB of memory available to the Fast-KRR method, which allowed us to go down to 26 partitions on our cluster. The results of the comparison are shown in Figure 8 (mean over 10 runs with randomly chosen training and test sets).
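The kernel-evaluation bookkeeping above is simple arithmetic and can be sketched directly; the two helper functions are our own illustration of the counts stated in the text.

```python
def ccp_kernel_evals(t, n):
    # CCP: one length-n kernel column per iteration -> t * n evaluations
    return t * n

def fast_krr_kernel_evals(n, m):
    # Fast-KRR: m partitions, each with its own (n/m) x (n/m) kernel matrix
    assert n % m == 0, "assume m divides n for simplicity, as in the text"
    return m * (n // m) ** 2
```

At t = n/m the two counts coincide, which is the break-even point discussed above: for n = 1000 and m = 10 both methods perform 100000 kernel evaluations.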
Figure 8: The plot shows the least-squares error of the Fast-KRR and the CCP-regressor in dependence of the run-time (minutes). We used partition sizes 26, 32, 38, 48, 64, 96, 128 and 256 for Fast-KRR. 26 was the limit we could achieve with 25 GB of memory. We also determined the standard deviations. These were for both regressors marginal and we did not plot error-bars (maximal standard deviation for Fast-KRR: 0.16 × 10^{-4}; CCP: 0.02).

The plot shows us that the Fast-KRR achieves already a very good performance if the number of samples per partition is small. The CCP-regressor is significantly outperformed at that stage. With more run-time the CCP-regressor catches up and approaches the minimizer in the set C(100000). After about three hours of run-time the CCP-regressor overtakes the Fast-KRR and achieves the lowest least-squares error. Another difference one can observe is that there is, in principle, no limit on how long we can let the CCP-regressor optimize, while we are limited with the Fast-KRR by the memory that we have available (the memory requirement increases linearly in the number of elements per partition and quadratically in the sample size).

5. Discussion and Open Problems

Motivated by the recent work on kernel herding we explored the uses of herding and the CGM algorithm for calculating projections onto convex sets. We derived a theorem that extends current convergence results for herding to the boundary, provided the perturbations introduced by our lack of knowledge of the minimal face that contains the projection are in a certain way well behaved. We also showed that our condition on the perturbations is both necessary and sufficient for the algorithm to converge with the fast rate. An important open question is whether there exist convex sets for which this assumption can fail to hold.
Furthermore, we demonstrated that the herding algorithm and the CGM will choose no elements outside of the minimal face after some finite time t_0 if this condition is fulfilled. Providing tight bounds on the size of t_0 and on the number of elements outside the minimal face that are chosen in these first iterations would, in future research, help to improve the understanding of herding and the CGM further. Such bounds should also be useful to provide a better understanding of algorithms like the CGM with away steps: away steps essentially try to remove the perturbations introduced through elements outside of the minimal face. It is known that the CGM with away-steps successfully removes perturbations after finitely many steps under certain conditions (e.g. Wolfe (1970); Guélat and Marcotte (1986)), though, as we discussed in the introduction, there is still much to be gained by refined analyses. Another interesting direction for future research concerns the geometry of the convex set in Hilbert space. The geometry factors into the rates of convergence of CGM-like algorithms in various ways (Slater's condition, pyramidal width, etc.). One might also aim for results that hold for convex sets with countably infinite or uncountably many extremes. We believe that one of the main difficulties will be here that there does not need to exist a gap between the minimal face and extremes that lie outside of the minimal face. We made use of such a gap to show that after some finite time only extremes of the minimal face will be chosen. Without such a gap one expects that there does not exist such a finite window of steps, and perturbations will be continuously injected into the approximation. On the practical side one can wonder how widely applicable the projection-in-kernel-space approach, which we used to derive a novel kernel regressor, is.
Projections onto the standard simplex have proven to be very valuable for a variety of statistical problems, and the projection onto the set {Σ_{i=1}^n α_i k(x_i, ·) : n ∈ N, ‖α‖_1 ≤ 1, x_i ∈ X for i ≤ n} is in a way the natural analogue of the standard simplex. There are some obvious differences. The set is, for example, from a geometrical point of view far more complicated, since in kernel space the natural way to describe objects is through the kernel function and not through an orthonormal basis. Nevertheless, our results suggest that this approach has merit and it is worth exploring its uses for other non-parametric problems.

6. The Proofs of the Two Theorems and the Corollaries

This section contains the proof details of our two theorems and the various corollaries. We start with a technical lemma that is needed in the proof of the first theorem. Using this we prove that the basic algorithm without line search achieves a convergence rate of 1/t.

6.1 The First Theorem

We need a technical lemma that guarantees us that if we have two operators A, B that pull elements x towards the origin if x gets large (in a certain sense), then compositions of these operators (for instance C = A²BA⁴B³) applied to x will be bounded, i.e. ‖Cx‖ ≤ b for some constant b. In fact, we need slightly more: we need a result which holds simultaneously for a family of operators A. We formulate the lemma in the way that some initial value x_0 and a particular infinite composition C of operators in A and B is given. This is necessary since our operators in A are not necessarily a contraction for arbitrary elements but only for the elements they are applied to. In the following, we denote with C_t the composition operator that consists of the first t operators in C (in the above example C_3 = A²B) and we use the notation C_{s,t} for the operator that fulfills C_t = C_{s,t} ∘ C_{s−1}.
We split the proof of the lemma and the following theorems into a number of separate claims, and we use P (proof) and QQQ (q.e.d.) brackets for the proofs of the claims.

Lemma 9 Let H be a Hilbert space, K a closed subspace of H, P the projection onto K, B : H → H with Bx = (I − P)x + PBPx an operator, A a family of operators A : H → H, C a composition of operators of A and B, and x_0 ∈ H such that there exist constants a, b, γ > 0 with

(i) ‖Ax‖ ≤ ‖x‖ + b and ‖Bx‖ ≤ ‖x‖ + b for every A ∈ A, x ∈ H.
(ii) For any element x ∈ {C_t x_0}_{t≥1} and for any t ≥ 1 for which C_{t,t} ∈ A, i.e. an operator in A is chosen at time t, we have with A = C_{t,t} ∈ A that ‖Ax‖² ≤ ‖x‖(‖x‖ − γ) + b.
(iii) If ‖Px‖ ≥ a for an element x ∈ H then ‖Bx‖ ≤ ‖x‖.

Then for all t ≥ 1,

‖C_t x_0‖² ≤ ‖x_0‖² ∨ (c + b)² + (a + b)²   and   ‖C_t x_0‖ ≤ ‖x_0‖ ∨ (c + b) + a + b,

with the constant c := ((a + b)² + b)/γ.

Proof (a) For any n ≥ 0, x ∈ H, it holds that ‖Bⁿx‖² ≤ ‖x‖² + (a + b)². P Observe that Bx = (I − P)x + BPx. ‖Px‖ < a implies ‖BPx‖ ≤ a + b, and if ‖Px‖ ≥ a then ‖BPx‖² = ‖Bx‖² − ‖(I − P)x‖² ≤ ‖x‖² − ‖(I − P)x‖² = ‖Px‖², and ‖BPx‖ ≤ ‖Px‖ ∨ (a + b). Hence,

‖BⁿPx‖ ≤ ‖PB^{n−1}x‖ ∨ (a + b) = ‖B^{n−1}Px‖ ∨ (a + b) ≤ … ≤ ‖Px‖ ∨ (a + b).

So for any n ≥ 0 and x ∈ H we know that

‖Bⁿx‖² = ‖(I − P)x‖² + ‖BⁿPx‖² ≤ ‖(I − P)x‖² + ‖Px‖² ∨ (a + b)² ≤ ‖x‖² + (a + b)². QQQ

(b) For any n ≥ 0, A ∈ A, if ‖x‖ ≥ c + b, x = C_t x_0 and C_{t+1,t+n+1} = ABⁿ for some t ≥ 0, then ‖ABⁿx‖ ≤ ‖x‖ and ‖ABⁿx‖ ≤ ‖x‖ ∨ (c + b). P If ‖x′‖ ≥ c then (ii) tells us that

‖Ax′‖² ≤ ‖x′‖² − γ‖x′‖ + b ≤ ‖x′‖² − (a + b)².

So, if ‖Bⁿx‖ ≥ c then (a) tells us that ‖ABⁿx‖² ≤ ‖Bⁿx‖² − (a + b)² ≤ ‖x‖². And, if ‖Bⁿx‖ < c then ‖ABⁿx‖ ≤ c + b. Also, ‖ABⁿx‖ ≤ ‖x‖ if ‖x‖ ≥ c + b. QQQ

(c) For any n ≥ 0, m ≥ 1, A_1, …, A_m ∈ A, if ‖x‖ ≥ c + b, x = C_t x_0 and C_{t+1,t+n+m} = A_1 ⋯ A_m Bⁿ for some t ≥ 1, then ‖A_1 ⋯ A_m Bⁿx‖ ≤ ‖x‖ and it holds that ‖A_1 ⋯ A_m Bⁿx‖ ≤ ‖x‖ ∨ (c + b). P If ‖z‖ ≥ b/γ for a z ∈ H then ‖Az‖ ≤ ‖z‖ for all A ∈ A. If ‖x‖ ≥ c + b then (b) tells us that ‖ABⁿx‖ ≤ ‖x‖ for all A ∈ A.
Hence, for any 2 ≤ l ≤ m − 1, ‖A_{m−l−1} ⋯ A_m Bⁿx‖ > ‖A_{m−l} ⋯ A_m Bⁿx‖ implies ‖A_{m−l} ⋯ A_m Bⁿx‖ < b/γ ≤ c. The maximal increase of operators A ∈ A is bounded by b due to (i). Since we can only see an increase if ‖A_{m−l} ⋯ A_m Bⁿx‖ ≤ c, we gain ‖A_{m−l−1} ⋯ A_m Bⁿx‖ ≤ (c + b) ∨ ‖x‖ = ‖x‖. A slight modification of the argument yields the second case. QQQ

(d) It follows that compositions of such sequences, say A_1 ⋯ A_{m_1} B^{n_1} A_{m_1+1} ⋯ A_{m_1+m_2} B^{n_2}, m_1, m_2 > 0, A_1, …, A_{m_1+m_2} ∈ A, do not increase the bound, since with x′ := A_{m_1+1} ⋯ A_{m_1+m_2} B^{n_2} x, ‖x′‖ ≤ ‖x‖ ∨ (c + b), we have

‖A_1 ⋯ A_{m_1} B^{n_1} x′‖ ≤ ‖x′‖ ∨ (c + b) ≤ ‖x‖ ∨ (c + b).

These cases cover all possible compositions with the only exception of sequences that start with some Bⁿ, n ≥ 1. But, if ‖x′‖ ≤ ‖x_0‖ ∨ (c + b) then from (a) we know that

‖C_t x_0‖² = ‖Bⁿx′‖² ≤ ‖x′‖² + (a + b)² ≤ ‖x_0‖² ∨ (c + b)² + (a + b)².

We need the definitions of the affine hull and the affine dimension. The affine hull of a set A ⊆ H is

aff A = {Σ_{i=1}^n λ_i x_i : n ≥ 1, x_i ∈ A, Σ_{i=1}^n λ_i = 1}.

The affine hull can be identified with a vector space by centering it around an arbitrary element x_0 ∈ aff A, i.e. V = {x − x_0 : x ∈ aff A} is a vector space. The dimension of this vector space is called the affine dimension of A. If A = {x} for some element x ∈ H then aff A is also just {x} and we define its affine dimension to be 0.

We recall some basic properties of the projection Py of y onto a compact convex set C. The projection Py is characterized by ‖y − Py‖ = min_{x∈C} ‖y − x‖ and the geometric property ⟨y − Py, x − Py⟩ ≤ 0 for every x ∈ C. The latter property translates into a sort of orthogonal decomposition for convex sets, that is

‖y − x‖² = ‖y − Py‖² − 2⟨y − Py, x − Py⟩ + ‖Py − x‖² ≥ ‖y − Py‖² + ‖Py − x‖². (1)

We also need the definition of a face of a convex set C. A face F is a set which fulfills that whenever there are two points a, b ∈ C and a λ ∈ (0, 1) with λa + (1 − λ)b ∈ F, then a, b ∈ F.
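The variational characterization ⟨y − Py, x − Py⟩ ≤ 0 and the resulting decomposition inequality can be checked numerically on a simple compact convex set; the box example below is our own NumPy illustration, not a construction from the paper.

```python
import numpy as np

def project_box(y, lo=-1.0, hi=1.0):
    # Coordinatewise clipping is the exact Euclidean projection onto the
    # box [lo, hi]^d, a compact convex set.
    return np.clip(y, lo, hi)

rng = np.random.default_rng(0)
y = np.array([2.0, -3.0])          # a point outside the box
Py = project_box(y)                # its projection onto the box
for _ in range(100):
    x = rng.uniform(-1.0, 1.0, size=2)        # arbitrary point of the set
    # geometric property of the projection:
    assert np.dot(y - Py, x - Py) <= 1e-12
    # 'orthogonal decomposition' inequality for convex sets (eq. 1 style):
    assert np.sum((y - x) ** 2) >= np.sum((y - Py) ** 2) + np.sum((Py - x) ** 2) - 1e-12
```

The inequality follows directly from expanding ‖y − x‖² around Py and dropping the nonpositive cross term, which is exactly what the loop verifies pointwise.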
Theorem 2 Given a compact convex set C ⊆ H and a finite subset S of C with ex C ⊆ S, there exists for y ∈ H a constant b > 0 such that Algorithm 1 has a worst-case approximation error of b/t for all t ≥ 1 if, and only if, Assumption 1 is fulfilled for the sequences {x_t}_{t≥1} and {w_t}_{t≥0} generated by the algorithm.

Proof We start with some general observations before proving that the assumption is sufficient for the fast rate of convergence. That the condition is necessary is shown at the end (see (v)).

(i) The algorithm converges with the right rate to Py if the sequence {w_t − y − t(y − Py)}_{t≥1} stays in a bounded norm ball of radius b, since then

b ≥ ‖w_t − y − t(y − Py)‖ = ‖(t + 1)y − Σ_{i=1}^t x_i − y − t(y − Py)‖ = t ‖Py − (1/t) Σ_{i=1}^t x_i‖ (2)

because w_t = (t + 1)y − (x_1 + … + x_t), and hence ‖Py − (x_1 + … + x_t)/t‖ ≤ b/t. Also observe that we can replace, for any z ∈ H, argmax_{x∈S} ⟨z, x⟩ with argmax_{x∈S} ⟨z, x − Py⟩ in the maximization step.

(ii) (a) There exists a minimal face F that contains Py, which is

F = {Py} ∪ {x ∈ C : ∃ z ∈ C, λ ∈ (0, 1) such that λx + (1 − λ)z = Py}. (3)

(b) F is a compact and convex set and the extremes of F are ex F = ex C ∩ F. Let F_c = {z − Py : z ∈ F} be the face F centered around Py ∈ F. The set aff F_c = span F_c is a closed subspace of H with orthogonal complement (aff F_c)^⊥ = F_c^⊥. So for any element z ∈ H we have a unique z_∥ ∈ aff F_c and z_⊥ ∈ (aff F_c)^⊥ with z = z_∥ + z_⊥. In the following, denote the projection onto aff F_c by P_F. (c) y − Py stands orthogonal on the centered face F_c, i.e. y − Py ∈ F_c^⊥. So the term t(y − Py) contained in w_t does not influence the maximization over F. (d) Either F_c = {0} or there exists a δ_F > 0 such that B(0, δ_F) ∩ span F_c = B(0, δ_F) ∩ F_c, where B(0, δ_F) is the closed ball of radius δ_F centered at the origin. (e) For any w_t, t ≥ 0, there exists an x ∈ C such that ⟨x − Py, w_t⟩ ≥ 0. Furthermore, if F_c ≠ {0}, whenever ‖P_F w_t‖ ≥ 2r²/δ_F, r := sup_{x∈C} ‖x‖ < ∞, and an element in S ∩ F is chosen by the algorithm, then ‖P_F w_{t+1}‖ ≤ ‖P_F w_t‖.
So, if ‖P_F w_t‖ ≥ 2r²/δ_F and an element in S ∩ F is chosen, then ‖w_{t+1} − y − (t + 1)(y − Py)‖ ≤ ‖w_t − y − t(y − Py)‖. (f) If F = C then b = 2r²/δ_F + 3r satisfies eq. 2 and the error of the algorithm is bounded by (2r²/δ_F + 3r)/t for any t ≥ 1.

P (a) F is the minimal face. F is certainly a subset of any face that contains Py by the very definition of a face. It is also a face itself, since if for any x ∈ F, x ≠ Py, there exist two points a, b ∈ C and a λ ∈ (0, 1) such that x = λa + (1 − λ)b, then there exists a μ ∈ (0, 1) and a z ∈ C such that

Py = μx + (1 − μ)z = μλ a + μ(1 − λ) b + (1 − μ)z = μλ a + (1 − μλ) c,

where c := (μ(1 − λ)/(1 − μλ)) b + ((1 − μ)/(1 − μλ)) z is an element of C, since it is a convex combination of b and z. Hence, with κ := μλ ∈ (0, 1) we see that Py = κa + (1 − κ)c and a ∈ F. Applying the same argument to b shows us also that b ∈ F, and F is a face.

(b) F is compact and convex and ex F = ex C ∩ F. F is a face and as such also convex. Furthermore, F = C ∩ aff F, where aff F is the affine hull of F. The affine hull has finite affine dimension since C has finite affine dimension. Affine hulls with finite affine dimension are closed. Hence, F is closed as the intersection of two closed sets. And since closed subsets of compact sets are compact, we see that F is compact. Finally, ex F = F ∩ ex C, that is, the extremes of F are just the extremes of C which lie in F. The last property can be verified in the following way: the extremes of C are by definition not convex combinations of any two points a, b ∈ C, a ≠ b, and hence of no two points in F. So F ∩ ex C ⊆ ex F. On the other hand, if c ∈ ex F then there is no point a ∈ C∖F for which a corresponding point b ∈ C and λ ∈ (0, 1) exist such that c = λa + (1 − λ)b (a would otherwise be in the face F). However, since c ∈ ex F there exists also no point a ∈ F with this property, and c ∈ ex C.

(c) y − Py is orthogonal to F_c.
If there were an element u ∈ aff F_c, u = Σ_{i=1}^n λ_i v_i for some n ≥ 1, v_i ∈ F_c, Σ_{i=1}^n λ_i = 1, such that ⟨u, y − Py⟩ ≠ 0, there would also be an element v ∈ F_c with ⟨v, y − Py⟩ ≠ 0 (otherwise ⟨u, y − Py⟩ = Σ_{i=1}^n λ_i ⟨v_i, y − Py⟩ = 0). Furthermore, by the characterization of the minimal face in (3), and because v + Py ∈ F, there exists an element z ∈ F_c and a λ ∈ (0, 1) such that Py = λ(v + Py) + (1 − λ)(z + Py) and 0 = λv + (1 − λ)z. Hence, 0 = λ⟨y − Py, v⟩ + (1 − λ)⟨y − Py, z⟩. Because ⟨y − Py, v⟩ ≠ 0 we know that ⟨y − Py, z⟩ ≠ 0, and both have opposing signs. So one of them, say without loss of generality ⟨y − Py, v⟩, is strictly greater than zero. But this cannot be, since then ṽ = v + Py ∈ C would fulfill, in contradiction to the properties of Py, ⟨y − Py, ṽ − Py⟩ > 0.

(d) F_c = {0} or there exists a ball B(0, δ), δ > 0. If the minimal face consists of a single element x then this element must be Py, since Py ∈ F, so in this case F_c = {0}. In the other case it suffices to consider a set of basis vectors which span the space span F_c, say e_1, …, e_d, where d < ∞ is the affine dimension of F. There exist γ_1, …, γ_d > 0 such that ±γ_i e_i ∈ F_c for all i ≤ d if, and only if, B(0, δ) ∩ span F_c = B(0, δ) ∩ F_c for some δ > 0. Since e_i ∈ aff F_c there exist an n ≥ 1, u_1, …, u_n ∈ F_c and β_1, …, β_n with Σ_{j=1}^n β_j = 1 and e_i = Σ_{j=1}^n β_j u_j. We can also represent e_i as a sum with only non-negative coefficients, i.e. there exist β̃_j ≥ 0, ũ_j ∈ F_c, j ≤ n, such that e_i = Σ_{j=1}^n β̃_j ũ_j. This can be achieved by doing the following for each j ≤ n: if β_j ≥ 0, then let β̃_j = β_j and ũ_j = u_j. If β_j < 0 and u_j = 0, then let β̃_j = −β_j and ũ_j = u_j. Finally, if neither case applies (β_j < 0, u_j ≠ 0), take a λ ∈ (0, 1) and a ũ_j ∈ F_c such that 0 = λu_j + (1 − λ)ũ_j. Such a λ and ũ_j exist due to the definition of F. With this choice β_j u_j = β̃_j ũ_j, where β̃_j = −β_j (1 − λ)/λ > 0. So e_i = Σ_{j=1}^n β̃_j ũ_j with β̃_j ≥ 0. By normalizing the sum with Λ := Σ_{j=1}^n β̃_j > 0 (some β̃_j must be greater than 0), we see that γ e_i ∈ F_c with γ := 1/Λ.
Furthermore, there exists a λ ∈ (0, 1) and a v ∈ F_c such that −e_i γ(1 − λ)/λ = v, where γ > 0 is the normalization constant from above. And our claim is shown to be true by letting γ_i = min{γ(1 − λ)/λ, γ}.

(e) Shrinking weight vector. If F consists of more than one element, then for any weight vector w_t there exists an element x in F with ⟨x − Py, P_F w_t⟩ = δ_F ‖P_F w_t‖, since δ_F P_F w_t/‖P_F w_t‖ is in B(0, δ_F) ∩ F_c and ⟨(δ_F P_F w_t/‖P_F w_t‖ + Py) − Py, P_F w_t⟩ = δ_F ‖P_F w_t‖. This implies directly that we have an element z ∈ ex F ⊆ S ∩ F with ⟨z − Py, P_F w_t⟩ ≥ δ_F ‖P_F w_t‖, because x = Σ_{j=1}^n β_j u_j for suitable n, β_1, …, β_n > 0, Σ_{j=1}^n β_j = 1 and elements u_j ∈ ex F (ex C is finite and the convex hull C of ex C is in this case closed; that means any element in C can be represented exactly by a convex combination of finitely many extremes), i.e. ⟨Σ_{j=1}^n β_j u_j − Py, P_F w_t⟩ = δ_F ‖P_F w_t‖ and at least one term ⟨u_j − Py, P_F w_t⟩ must be as large as δ_F ‖P_F w_t‖.

Now, whenever ‖P_F w_t‖ ≥ 2r²/δ_F and an element in S ∩ F is chosen, then ‖P_F w_{t+1}‖ ≤ ‖P_F w_t‖. This can be verified in the following way: there exists an element x′ ∈ F ∩ S such that ⟨x′ − Py, P_F w_t⟩ ≥ δ_F ‖P_F w_t‖ and, hence, the algorithm will choose an element x with ⟨x − Py, P_F w_t⟩ ≥ ⟨x′ − Py, P_F w_t⟩ ≥ δ_F ‖P_F w_t‖ (as said in (i), the maximization is unaffected by the translation by −Py). So, since x − Py ∈ F_c and P_F is linear, we have that

‖P_F w_{t+1}‖² = ‖P_F w_t − (x − Py)‖² = ‖P_F w_t‖² − 2⟨x − Py, P_F w_t⟩ + ‖x − Py‖².

Furthermore, ‖x − Py‖ ≤ ‖x‖ + ‖Py‖ is bounded by 2r, and ‖x − Py‖² − 2⟨x − Py, P_F w_t⟩ ≤ 4r² − 2δ_F ‖P_F w_t‖ ≤ 0 by our assumption on ‖P_F w_t‖. For x ∈ F ∩ S we can also observe that for any w_t, ⟨x − Py, w_t⟩ = ⟨P_F(x − Py), w_t⟩ = ⟨x − Py, P_F w_t⟩, since x − Py ∈ aff F_c and P_F is self-adjoint. So there always exists an x ∈ C such that ⟨x − Py, w_t⟩ ≥ 0. If F_c = {0}, that is F = {Py}, then x = Py gives us ⟨x − Py, w_t⟩ = 0.

(f) If F = {Py} then w_t − y − t(y − Py) = t Py − (x_1 + … + x_t) = 0, since all x_i = Py.
If F contains more than one element and C = F, then (e) tells us that if ‖P_F w_t‖ ≥ 2r²/δ_F then

‖w_{t+1} − y − (t + 1)(y − Py)‖ ≤ ‖w_t − y − t(y − Py)‖.

Observe that if F = C then the projection P_F is closely related to the projection P̃ onto the affine hull of C, that is, P̃x = P_F(x − P̃0) + P̃0. Also, P̃y = Py, because y − Py is orthogonal to span F_c. We need to show that ‖P_F w_t‖ ≥ 2r²/δ_F under the stated condition. We have that

‖P_F w_t‖ = ‖(t + 1)P_F y − Σ_{i=1}^t P_F x_i‖ = ‖P_F y − Σ_{i=1}^t P_F(x_i − y)‖ = ‖P_F Py − Σ_{i=1}^t (x_i − Py)‖,

because x_i − Py lies in span F_c and P_F y = P_F Py. On the other hand,

‖w_t − y − t(y − Py)‖ = ‖Σ_{i=1}^t (x_i − Py)‖ ≤ ‖P_F w_t‖ + ‖P_F Py‖ ≤ ‖P_F w_t‖ + r.

Hence, if ‖w_t − y − t(y − Py)‖ ≥ r + 2r²/δ_F then ‖w_{t+1} − y − (t + 1)(y − Py)‖ is smaller than ‖w_t − y − t(y − Py)‖. So we can only see an increase of the sequence {‖w_t − y − t(y − Py)‖}_{t≥1} if at any given t it holds that ‖w_t − y − t(y − Py)‖ < 2r²/δ_F + r, and since the change between t and t + 1 is just Py − x, we gain for any t ≥ 1 the bound

‖w_t − y − t(y − Py)‖ < 2r²/δ_F + r + ‖Py − x‖ ≤ 3r + 2r²/δ_F. QQQ

(iii) The case F = C is dealt with in (ii.f), so we may assume in the following that C∖F ≠ ∅. We assign to the three subspaces U_F, U_b and U_δ defined in Section 2.1 orthonormal bases. Take an arbitrary orthonormal basis of U_F and let E_F be the set of these basis elements. Similarly, choose a basis of U_δ and let E_δ be the set of these basis elements. Finally, group the basis elements of U_b, guaranteed to us in our assumption, in E_b. Furthermore, let δ_m be the largest radius of an open ball around 0 in U_F ⊕ U_δ; δ_m is always strictly larger than 0. For any e ∈ E_b and t ≥ 0,

⟨w_t, e⟩ = ⟨Py, e⟩ + ⟨(t + 1)(y − Py), e⟩ − Σ_{n=1}^t ⟨x_n − Py, e⟩ ≥ ⟨(t + 1)(y − Py), e⟩ + ⟨Py, e⟩,

and for any e ∈ E_b and any t ≥ 0 it holds that

⟨w_{t+1} − (y − Py), e⟩ − ⟨w_t, e⟩ = −⟨x_{t+1} − Py, e⟩ ≥ 0.

Hence ⟨w_{t+1} − (y − Py), e⟩ ≥ ⟨w_t, e⟩ for all e ∈ E_b. Clearly, F_c lies in U_F ⊕ U_δ. More importantly, any element that is not in F_c lies at least partly in U_b, or in other words, if x ∈ C∖F then P_b(x − Py) ≠ 0.
P For any element x ∈ C that is not in F there exists an e ∈ E_b such that ⟨x − Py, e⟩ ≠ 0. Otherwise, x − Py would lie in a subspace A for which there is a δ′ > 0 with B(0, δ′) ∩ A = B(0, δ′) ∩ A ∩ C_c. The element −(x − Py) would then also lie in A, and z = −(δ′/2)(x − Py)/‖x − Py‖ ∈ B(0, δ′) ∩ A = B(0, δ′) ∩ A ∩ C_c (x ≠ Py, since Py ∈ F, and the norm ‖x − Py‖ is strictly positive). So with μ = δ′/(2‖x − Py‖) and λ = μ/(1 + μ) ∈ (0, 1) we would have that

λ(x − Py) + (1 − λ)z = (x − Py)(λ − μ(1 − λ)) = 0

and λx + (1 − λ)(z + Py) = Py. But, because x and z + Py ∈ C, this implies that x ∈ F by the definition of the minimal face. The conclusion x ∈ F is a contradiction to our original assumption and our claim holds. QQQ

(iv) (a) If the sequence {‖(P_F + P_δ)(w_t − y − t(y − Py))‖}_{t≥1} is bounded, then the sequence {‖P_b(w_t − y − t(y − Py))‖}_{t≥1} is also bounded and elements in S∖F are chosen only finitely often. (b) Under Assumption 1 the sequence {‖(P_F + P_δ)(w_t − y − t(y − Py))‖}_{t≥1} is bounded. Furthermore, under this assumption the sequence {‖w_t − y − t(y − Py)‖}_{t≥1} is bounded and there exists a constant b such that the approximation error is bounded by b/t.

P (a) Let us assume that {‖P_b(w_t − y − t(y − Py))‖}_{t≥1} is unbounded. If an element in F is chosen at a stage t then P_b(w_t − y − t(y − Py)) = P_b(w_{t+1} − y − (t + 1)(y − Py)) and there is no increase of the normed sequence. Hence, there must exist an element x* ∈ S∖F which is selected infinitely often. Let e = −P_b(x* − Py)/‖P_b(x* − Py)‖ and observe that ⟨e, x − Py⟩ ≤ 0 for all x ∈ C, because ⟨P_b(x* − Py), x − Py⟩ = Σ_{e′∈E_b} ⟨e′, x* − Py⟩⟨e′, x − Py⟩ ≥ 0. If x* is selected at a step t then from ⟨x* − Py, w_t⟩ ≥ 0 it follows that

⟨x* − Py, (P_F + P_δ)w_t⟩ ≥ −⟨x* − Py, P_b w_t⟩ = ‖P_b(x* − Py)‖ ⟨e, w_t⟩,

and ⟨x* − Py, y − Py⟩ ≤ 0 implies ⟨x* − Py, (P_F + P_δ)(y − Py)⟩ ≤ −⟨x* − Py, P_b(y − Py)⟩, which in turn implies

⟨x* − Py, (P_F + P_δ)(−y − t(y − Py))⟩ ≥ −⟨x* − Py, P_b(−y − t(y − Py))⟩ − ⟨x* − Py, Py⟩ ≥ ⟨P_b(x* − Py), P_b(−y − t(y − Py))⟩ − 2r.

Together, these inequalities give us

⟨x* − Py, (P_F + P_δ)(w_t − y − t(y − Py))⟩ ≥ ‖P_b(x* − Py)‖ ⟨e, P_b(w_t − y − t(y − Py))⟩ − 2r.

Applying the Cauchy-Schwarz inequality then yields

‖x* − Py‖ ‖(P_F + P_δ)(w_t − y − t(y − Py))‖ + 2r ≥ ‖P_b(x* − Py)‖ ⟨e, w_t − y − t(y − Py)⟩ ≥ −‖P_b(x* − Py)‖ Σ_{s=1}^t ⟨e, x_s − Py⟩ 1{x_s = x*} = ‖P_b(x* − Py)‖² Σ_{s=1}^t 1{x_s = x*},

where 1 is the characteristic function. Since ‖P_b(x* − Py)‖ > 0 and x* is selected infinitely often, we conclude that ‖(P_F + P_δ)(w_t − y − t(y − Py))‖ diverges in t and the corresponding sequence {‖(P_F + P_δ)(w_t − y − t(y − Py))‖}_{t≥1} is unbounded.

(b) If U_F = U_δ = {0} then (P_F + P_δ)x = 0 for all x ∈ H and (a) gives us

‖w_t − y − t(y − Py)‖ = ‖Σ_{i=1}^t (x_i − Py)‖ = ‖Σ_{i=1}^t P_b(x_i − Py)‖ = ‖P_b(w_t − y − t(y − Py))‖,

and the sequence is bounded due to (a). Hence, the result follows. Now let us assume that U_F ⊕ U_δ ≠ {0}. We want to apply Lemma 9. Let H̃ := U_F ⊕ U_δ. H̃ together with the inner product inherited from H is a Hilbert space. Let A be the operator defined for all w ∈ H through

x′ ∈ argmax_{x ∈ S∖F} ⟨x − Py, w⟩   and   Aw = w + y − x′.

For any w for which there exist multiple maximizers assign to Aw one of these maximizers. Let B be defined in the same way with the only difference that S∖F is replaced by S ∩ F. We would like to study the interaction between A and B on the subspace U_F ⊕ U_δ. The operator B optimizes over elements in S ∩ F. For such elements x ∈ S ∩ F we know that x − Py is orthogonal to P_b w for all possible weights w, and P_b w does not influence the behavior of the operator B. Let B̃ be (P_F + P_δ)B restricted to H̃ and let K := U_F ⊆ H̃. The operator B̃ fulfills the assumptions of the lemma: B̃ leaves the space orthogonal to K in H̃ unchanged, and the maximization over S ∩ F does only depend on P_F w for any w ∈ H̃, so B̃w = (I − P_F)w + P_F B̃ P_F w. This also holds (trivially) if F_c = {0}. Let b := 2r; then for w ∈ H̃,

‖B̃w‖ = ‖(P_F + P_δ)(w + (y − Py) − (x − Py))‖ ≤ ‖w‖ + ‖x − Py‖ ≤ ‖w‖ + b,

where x ∈ S ∩ F and (P_F + P_δ)(y − Py) = 0 by our choice of U_δ. If F_c = {0} then point (iii) in Lemma 9 holds trivially.
In all other cases let a := 2r²/δ_F, δ_F > 0; then an argument as in (ii.e) tells us that if ‖P_F w‖ ≥ a for some w ∈ H̃ then ‖B̃w‖ ≤ ‖w‖. The operator A depends on P_b w and we need to account for this error when studying the behavior of A acting on U_F ⊕ U_δ. We do so by working with a family of operators A which are parameterized by P_b w. Let

A = {Ã_v : H̃ → H̃ : Ã_v(w) = (P_F + P_δ)A(w + v) for all w ∈ H̃, v ∈ U_b, ⟨v, e⟩ ≥ 0 for all e ∈ E_b}.

Observe that there is some time t″ < ∞ such that for all t ≥ t″ no element in S∖(F ∪ D({x_t})) is chosen and each element in D({x_t}) has been chosen at least once. Write U_b = U ⊕ U′ with U′ = span D({x_t}) and a suitable subspace U. Then P_U w_t = P_U w_{t̃} for all t ≥ t̃. Since each element in D({x_t}) is chosen infinitely often and ⟨P_b w_t, x − Py⟩ is non-decreasing for x ∈ C, there exists some t‴ ≥ t″ such that for all t ≥ t‴, ⟨P_U(x − Py), P_b w_t⟩ ≤ ⟨x′ − Py, P_b w_t⟩ for all x ∈ S∖D({x_t}) and x′ ∈ D({x_t}). Let t̃ < ∞ be the maximum over all t′ in Assumption 1 and t‴. The family of operators A fulfills the assumptions of the lemma if Assumption 1 holds and if we use x_{t̃} as the initial value. Given any w ∈ H̃ and Ã_v ∈ A (with corresponding v ∈ U_b, ⟨v, e⟩ ≥ 0 for all e ∈ E_b), there exists an x ∈ S∖F such that

‖Ã_v w‖ = ‖(P_F + P_δ)A(w + v)‖ = ‖(P_F + P_δ)(w + v + (y − Py) − (x − Py))‖ = ‖w − (P_F + P_δ)(x − Py)‖ ≤ ‖w‖ + b.

If D({x_t}) = ∅ then only elements in F are chosen after t̃ and the claim of the theorem follows by the arguments in (ii.e). In case D({x_t}) ≠ ∅, consider the maximum over all Λ in Assumption 1 and call this maximum Λ̃ < ∞. We claim that there exists a constant Λ′ < ∞ such that, with γ := 2δ_m/Λ′, for all w_t, t ≥ t̃, whenever x_t ∈ S∖F, then

⟨x_t − Py, (P_F + P_δ)w_t⟩ ≥ ‖(P_F + P_δ)w_t‖ δ_m/Λ′,

where x_t is the element selected in the argmax step of Ã_v, and w_t = u + v, u ∈ U_F ⊕ U_δ, v ∈ U_b.
Proof of claim. Whenever $x_t\in S\setminus F$, $x_t\in D(\{x_t\})$, for some $t\ge\tilde t$, we know that $\langle P_b w_t, x_t-\hat y\rangle \neq 0$ because each element $x\in D(\{x_t\})$ has been chosen at least once and each element outside the minimal face lies partly in $U_b$. For any $x\in D(\{x_t\})$ we also have $\langle P_b w_t, x-\hat y\rangle \neq 0$ and at step $t\ge\tilde t$
$$\begin{aligned}
\langle x_t-\hat y,\ (P_\delta+P_F)w_t\rangle
&\ge \langle x-\hat y,\ (P_F+P_\delta)w_t\rangle + \langle (x-\hat y)-(x_t-\hat y),\ P_b w_t\rangle\\
&= \langle x-\hat y,\ (P_F+P_\delta)w_t\rangle + \Big(\frac{\langle x-\hat y,\ P_b w_t\rangle}{\langle x_t-\hat y,\ P_b w_t\rangle} - 1\Big)\langle x_t-\hat y,\ P_b w_t\rangle\\
&\ge \langle x-\hat y,\ (P_F+P_\delta)w_t\rangle - (\tilde\kappa - 1)\,\langle x_t-\hat y,\ (P_F+P_\delta)w_t\rangle
\end{aligned}$$
and
$$\langle x_t-\hat y,\ (P_\delta+P_F)w_t\rangle \ \ge\ \langle x-\hat y,\ (P_\delta+P_F)w_t\rangle/\tilde\kappa. \qquad(\beta)$$
Consider a set of elements $z_1,\dots,z_{\tilde d}\in D(\{x_t\})$ such that $z_1-\hat y,\dots,z_{\tilde d}-\hat y$ are linearly independent, with $\dim U' = \tilde d \ge 1$. Apply the Gram-Schmidt method to transform these into an orthonormal basis $\tilde z_1,\dots,\tilde z_{\tilde d}$ such that $\tilde z_i = \sum_{j=1}^{i}\beta_{ij}(z_j-\hat y)$ for some scalars $\beta_{ij}$. Since the vectors are linearly independent we know that $\max_{i,j\le\tilde d}|\beta_{ij}| < \infty$. For any $x\in S\setminus(F\cup D(\{x_t\}))$, if $P_b x\in U$ then, by the choice of $\tilde t$, for all $t\ge\tilde t$
$$\langle P_b w_t,\ x-\hat y\rangle \ \le\ \min_{x'\in D(\{x_t\})}\langle P_b w_t,\ x'-\hat y\rangle.$$
If $P_b x\notin U$ then we can write $P_{U'}(x-\hat y) = \lambda_1\tilde z_1 + \dots + \lambda_{\tilde d}\tilde z_{\tilde d}$ for suitable scalars $\lambda_1,\dots,\lambda_{\tilde d}$. We claim that there exists a $\kappa_x < \infty$ such that
$$\langle P_b w_t,\ P_{U'}(x-\hat y)\rangle \ \ge\ \kappa_x \max_{i\le\tilde d}\langle P_b w_t,\ z_i-\hat y\rangle \quad\text{for all } t\ge\tilde t.$$
This holds since
$$\langle P_b w_t,\ P_{U'}(x-\hat y)\rangle \ \ge\ \tilde d\,\max_{i\le\tilde d}|\lambda_i|\,\max_{i\le\tilde d}\langle P_b w_t,\ \tilde z_i\rangle \ \ge\ \tilde d^{\,2}\,\max_{i\le\tilde d}|\lambda_i|\,\max_{i,j\le\tilde d}|\beta_{ij}|\,\max_{i\le\tilde d}\langle P_b w_t,\ z_i-\hat y\rangle$$
and, using Parseval's identity, $|\lambda_i| \le \|P_{U'}(x-\hat y)\| \le 2r$. Hence, the claim follows with $\kappa_x = 2r\tilde d^{\,2}\max_{i,j\le\tilde d}|\beta_{ij}| < \infty$. Furthermore, because for all $t\ge\tilde t$ and $x\in S\setminus(F\cup D(\{x_t\}))$ we have $\langle P_U(x-\hat y), P_b w_t\rangle \le \min_{i\le\tilde d}\langle z_i-\hat y, P_b w_t\rangle$ by our choice of $\tilde t$, we can observe that for any $x\in S\setminus(F\cup D(\{x_t\})) =: M$,
$$\langle P_b w_t,\ x-\hat y\rangle = \langle P_b w_t,\ P_U(x-\hat y)\rangle + \langle P_b w_t,\ P_{U'}(x-\hat y)\rangle \ \ge\ \Big(\sup_{x\in M}\kappa_x + 1\Big)\max_{i\le\tilde d}\langle P_b w_t,\ z_i-\hat y\rangle \ \ge\ \tilde\kappa\Big(\sup_{x\in M}\kappa_x + 1\Big)\langle P_b w_t,\ x_t-\hat y\rangle.$$
Repeating the argument leading to $(\beta)$ this tells us that
$$\langle x_t-\hat y,\ (P_\delta+P_F)w_t\rangle \ \ge\ \langle x-\hat y,\ (P_\delta+P_F)w_t\rangle\Big/\Big(\tilde\kappa\Big(\sup_{x\in M}\kappa_x + 1\Big)\Big). \qquad(\star)$$
The above argument also works for $x\in F$ without the need to adapt the multiplier $\kappa' := \tilde\kappa(\sup_{x\in M}\kappa_x + 1)$. Now, we know that there exists an element $x\in S$ such that $\langle x-\hat y, (P_F+P_\delta)w_t\rangle \ge \|(P_F+P_\delta)w_t\|\,\delta_m$ and, hence,
$$\langle x_t-\hat y,\ (P_F+P_\delta)w_t\rangle \ \ge\ \langle x-\hat y,\ (P_F+P_\delta)w_t\rangle/\kappa' \ \ge\ \|(P_F+P_\delta)w_t\|\,\delta_m/\kappa'$$
which proves the claim. □

Writing $u = w + v$, $w\in U_F\oplus U_\delta$, $v\in U_b$, this becomes $\langle x_t-\hat y, w\rangle \ge \|w\|\,\delta_m/\kappa'$ for all $t\ge\tilde t$. The usual argument gives us for the operator $\tilde A$ being used at time $t$ that
$$\|\tilde A w\|^2 = \|(P_F+P_\delta)(w+v+(y-\hat y)-(x'-\hat y))\|^2 = \|w\|^2 - 2\langle x'-\hat y, w\rangle + \|x'-\hat y\|^2 \le \|w\|^2 - 2\|w\|\,\delta_m/\kappa' + b = \|w\|(\|w\| - 2\epsilon) + b.$$
Hence, the lemma can be applied and the sequence of weights projected onto $U_F\oplus U_\delta$, i.e. $\{(P_F+P_\delta)w_t\}_{t\ge 0}$, is bounded in norm. Together with (a) this implies that the weight sequence stays bounded and the result follows. □

(v) The condition is also necessary for the fast rate of convergence. Observe that there always exists a decomposition $U_F\oplus U_\delta\oplus U_b$ of $\operatorname{span} C_c$: let $U_F = \operatorname{span} F_c$. Let $\tilde U$ be the orthogonal complement of $U_F$ in $\operatorname{span} C_c$. If this is empty then we are done. Otherwise, choose an orthonormal basis $e_1,\dots,e_k$ of $\tilde U$, $k = \dim\tilde U$, such that $e_1 = P_C(y-\hat y)/\|P_C(y-\hat y)\|$ if $P_C(y-\hat y)\neq 0$. Consider the set
$$E_b = \{e_i : i\le k,\ P_{e_i}[C_c]\subset(-\infty,0] \text{ or } P_{e_i}[C_c]\subset[0,\infty)\}$$
and let $E_\delta = \{e_1,\dots,e_k\}\setminus E_b$. Then $U_b = \operatorname{span} E_b$ and $U_\delta = \operatorname{span} E_\delta$ fulfill the assumptions. Since $P_\delta[C_c]$ is convex and we have an open interval around 0 in each direction $e\in E_\delta$, we have a ball around 0 in $U_\delta$. Similarly, the assumption for $U_b$ is fulfilled if we exchange $e\in E_b$ with $-e$ whenever $P_e[C_c]\subset[0,\infty)$. If Assumption 1 is not fulfilled for this decomposition then there exists an element $x\in S\setminus F$ which is selected infinitely often by the algorithm. But, since $\|P_b(w_t - t(y-\hat y))\|$ is non-decreasing in $t$ and because $x\in S\setminus F$ fulfills $\|P_b(x-\hat y)\| > 0$, we have
$$\|w_t - t(y-\hat y)\| \ \ge\ \|P_b(x-\hat y)\|\sum_{s=1}^{t}\chi\{x_s = x\}$$
and the right side diverges in $t$. Therefore, we have an unbounded sequence $\{\|w_t - t(y-\hat y)\|\}_{t\ge 1}$.
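The weight dynamics analysed in this proof can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's Algorithm 1 verbatim: a herding-type update that selects $x_t = \arg\max_{x\in S}\langle w_{t-1}, x\rangle$, sets $w_t = w_{t-1} + y - x_t$, and returns the running mean $p_t$ of the selected atoms. The set $S$ and the point $y$ are illustrative choices; when the sequence $w_t - t(y-\hat y)$ stays bounded, the error $\|p_t - \hat y\|$ decays at the rate $1/t$.

```python
# Herding-type projection sketch (illustrative; not the paper's exact Algorithm 1).
# S: vertices of the unit square; y lies in the interior of conv(S), so the
# projection of y onto conv(S) is y itself and the 1/t rate can be checked directly.

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def herd(S, y, steps):
    w = list(y)                                          # w_0 = y (a common choice)
    sums = [0.0] * len(y)
    for t in range(1, steps + 1):
        x = max(S, key=lambda s: dot(w, s))              # x_t = argmax_x <w_{t-1}, x>
        w = [wi + yi - xi for wi, yi, xi in zip(w, y, x)]  # w_t = w_{t-1} + y - x_t
        sums = [si + xi for si, xi in zip(sums, x)]
        p = [si / t for si in sums]                      # p_t = mean of x_1, ..., x_t
    return p, w

S = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = (0.3, 0.6)                                           # interior point
p, w = herd(S, y, 2000)
err = sum((pi - yi) ** 2 for pi, yi in zip(p, y)) ** 0.5
```

Since $p_t - y = (w_0 - w_t)/t$ here, a bounded weight sequence immediately gives the $b/t$ error bound discussed above.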
Observe that the algorithm does not converge with the rate $1/t$ if $\{\|w_t - t(y-\hat y)\|\}_{t\ge 1}$ is unbounded.

6.2 The Second Theorem

Theorem 3 Given a compact convex set $C\subset H$, a finite subset $S$ of $C$ with $\operatorname{ex} C\subset S$, and an element $y\in H$, the following holds: If only elements in $F\cap S$, where $F\subset C$ is the minimal face that contains $\hat y$, are chosen, then the method converges linearly to the projection and there exist constants $b, \beta > 0$ with
$$\|\hat y - p_t\| \le b\,e^{-\beta t}.$$
Under Assumption 1, if $\min_{x\in S}\|\hat y - x\| > 0$ and the approximation does not equal $\hat y$ in finitely many steps, then the sequence $\{\|(I-P_F)(\hat y-p_t)\|\}_{t\ge 0}$ converges sub-linearly and there exists a constant $d > 0$ such that
$$\|P_F(\hat y-p_t)\|^2 \ \le\ \big(1 - \delta_F^2/\operatorname{diam}^2(F)\big)\,\|P_F(\hat y-p_{t-1})\|^2 + d\,\|(I-P_F)(\hat y-p_{t-1})\|^2.$$
Furthermore, there exists a time $t_0$ after which only elements in $F\cap S$ are chosen.

Proof We aim for a similar argument as in the proof of the first theorem: (1) optimizing the approximation of $\hat y$ by using $y$ instead of $\hat y$ does not hurt the rate of convergence. (2) This guarantees us that the rate of convergence is what we aim for if we work solely with the minimal face $F$ that contains $\hat y$. Since we do not know this minimal face, we need to deal with perturbations that are introduced by elements in $S\setminus F$. These perturbations do slow down the rate of convergence. We adapt the proof from Beck and Teboulle (2004) to address (1). We make use of parts of the proof of Theorem 2 and we use Theorem 2 (ii.a) etc. to refer to it. In the following we will use $p_t$ to denote the approximation at step $t$ (with $p_0 = 0$), $x_t$ the element that is chosen by the algorithm, and $w_t = y - p_t$.

(i) Let $F$ be the minimal face which contains $\hat y$ (Thm. 2 (ii.a)) and assume that either $C = F$ or only elements in $F$ are chosen by the algorithm.

(a) We claim that in this case $\tilde\gamma_t\in[0,1]$ for all $t\ge 1$.

Proof of claim. In step 1 we have by definition that $\tilde\gamma_t = 1$. If $C$ consists of a single element then $\tilde\gamma_t = 0/0$, which we define to be 0, for all $t\ge 2$.
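The quantity $\tilde\gamma_t$ analysed in this proof is the exact line-search step of a conditional gradient method applied to $\min_{p\in C}\frac12\|y-p\|^2$, clipped to $[0,1]$. A minimal numerical sketch of such a projection step follows; the set $S$, the point $y$, and the vertex initialization are illustrative assumptions, not the paper's exact Algorithm 1.

```python
# Conditional-gradient (Frank-Wolfe) projection sketch with the clipped exact
# line-search step gamma_t = min(1, max(0, <y - p, x_t - p> / ||x_t - p||^2)).
# Illustrative values only; S spans the unit square and y lies outside it.

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def cgm_project(S, y, steps):
    p = list(S[0])                                       # p_0: a vertex of C
    for _ in range(steps):
        g = [yi - pi for yi, pi in zip(y, p)]            # residual y - p_{t-1}
        x = max(S, key=lambda s: dot(g, s))              # linear maximisation over S
        d = [xi - pi for xi, pi in zip(x, p)]            # direction x_t - p_{t-1}
        dd = dot(d, d)
        if dd == 0.0:
            break
        gamma = min(1.0, max(0.0, dot(g, d) / dd))       # clipped exact line search
        p = [pi + gamma * di for pi, di in zip(p, d)]
    return p

S = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = (0.5, -1.0)                                          # outside the square
p = cgm_project(S, y, 200)                               # projection is (0.5, 0.0)
```

In this example the projection $\hat y$ lies on a one-dimensional face, and the iterate reaches it after a single line-search step, illustrating the fast behaviour when only face elements are chosen.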
For the case that $C$ consists of more than a single element and for any step $t\ge 2$ we have that
$$\tilde\gamma_t = \frac{\langle w_{t-1},\ w_{t-1} + (x_t-y)\rangle}{\|w_{t-1} + x_t - y\|^2} = \frac{\langle y-p_{t-1},\ x_t-p_{t-1}\rangle}{\|x_t-p_{t-1}\|^2} = \frac{\langle\hat y-p_{t-1},\ x_t-p_{t-1}\rangle}{\|x_t-p_{t-1}\|^2},$$
where the last step holds because $x_t-p_{t-1}$ lies in the span of $C_c$ and $y-\hat y$ is orthogonal to $\operatorname{span} C_c$ (Thm. 2 (ii.c)), i.e. $\langle y-\hat y, x_t-p_{t-1}\rangle = 0$. We can also observe that
$$\langle\hat y-p_{t-1},\ x_t-\hat y\rangle = \max_{x\in C}\langle\hat y-p_{t-1},\ x-\hat y\rangle \ \ge\ \delta_F\,\|\hat y-p_{t-1}\| \ \ge\ 0 \qquad(4)$$
holds: $y-\hat y$ is orthogonal to $\operatorname{span} C_c$ and, hence,
$$\arg\max_{x\in S}\langle w_{t-1}, x\rangle = \arg\max_{x\in S}\langle\hat y-p_{t-1}, x\rangle \subset \arg\max_{x\in C}\langle\hat y-p_{t-1}, x\rangle.$$
Furthermore, $\hat y-p_{t-1}$ lies in the span of the centered minimal face, i.e. $\hat y-p_{t-1}\in\operatorname{span} F_c$, and Thm. 2 (ii.d) tells us that there exists a constant $\delta_F > 0$ which makes the above true. Following Beck and Teboulle (2004) we complete the square
$$\langle\hat y-p_{t-1},\ (\hat y-p_{t-1}) - (\hat y-x_t)\rangle \ \le\ \|\hat y-p_{t-1}\|^2 - 2\langle\hat y-p_{t-1},\ \hat y-x_t\rangle + \|\hat y-x_t\|^2 = \|p_{t-1}-x_t\|^2$$
and observe that $\tilde\gamma_t \le 1$. Hence, $\gamma_t = \tilde\gamma_t$. □

(b) If $F = C$ or only elements in $F$ are chosen by the algorithm, then for any $t\ge 2$ either $\hat y = p_t$ and for all $s\ge t$ we have $\hat y = p_s$, or $\|\hat y-p_t\|^2 \le (1-\delta_F^2/\operatorname{diam}^2(C))\,\|\tilde w_{t-1}\|^2$. Furthermore, there exist constants $b, \beta > 0$ such that $\|\hat y-p_t\| \le b\,e^{-\beta t}$.

Proof of claim. We aim at reproducing the argument from Beck and Teboulle (2004) by exploiting the orthogonality between $y-\hat y$ and $\operatorname{span} C_c$. Let $\tilde w_t = \hat y - p_t = \tilde w_{t-1} - \gamma_t(x_t-\hat y+\tilde w_{t-1})$ and assume that $\tilde w_{t-1}\neq 0$. In short, if $C$ consists of a single element then there is nothing to show and, otherwise, we have for $t\ge 2$ that $\gamma_t = \tilde\gamma_t$ and
$$\|\tilde w_t\|^2 = \|\tilde w_{t-1}\|^2 - 2\gamma_t\,\langle\tilde w_{t-1},\ \tilde w_{t-1}+(x_t-\hat y)\rangle + \gamma_t^2\,\|x_t-p_{t-1}\|^2.$$
Observe that $\langle w_t, \tilde w_t\rangle = \langle y-p_t, \hat y-p_t\rangle = \|\tilde w_t\|^2$ because of the orthogonality and because $\hat y-p_t\in\operatorname{span} C_c$.
This, together with the orthogonality between $y-\hat y$ and $x_t-\hat y$, yields
$$\langle w_{t-1},\ w_{t-1}+(x_t-y)\rangle = \langle w_{t-1},\ \hat y-p_{t-1}+x_t-\hat y\rangle = \langle\tilde w_{t-1},\ \tilde w_{t-1}+x_t-\hat y\rangle.$$
Furthermore, $\|w_{t-1}+x_t-y\| = \|x_t-p_{t-1}\|$ and, hence,
$$\gamma_t = \frac{\langle\tilde w_{t-1},\ \tilde w_{t-1}+(x_t-\hat y)\rangle}{\|x_t-p_{t-1}\|^2}.$$
By filling in this value of $\gamma_t$ we gain
$$\|\tilde w_t\|^2 = \|\tilde w_{t-1}\|^2 - \frac{|\langle\tilde w_{t-1},\ \tilde w_{t-1}+x_t-\hat y\rangle|^2}{\|x_t-p_{t-1}\|^2} = \frac{\|\tilde w_{t-1}\|^2\,\|\hat y-x_t\|^2 - |\langle\tilde w_{t-1},\ \hat y-x_t\rangle|^2}{\|x_t-p_{t-1}\|^2}.$$
Since $\arg\max_{x\in S}\langle w_{t-1}, x\rangle = \arg\max_{x\in S}\langle\tilde w_{t-1}, x\rangle$ we have that
$$\langle\tilde w_{t-1},\ x_t-\hat y\rangle = \max_{x\in C}\langle\tilde w_{t-1},\ x-\hat y\rangle \ \ge\ \delta_F\,\|\tilde w_{t-1}\|$$
and
$$\|\tilde w_{t-1}\|^2\,\|\hat y-x_t\|^2 - |\langle\tilde w_{t-1},\ \hat y-x_t\rangle|^2 \ \le\ \|\tilde w_{t-1}\|^2\,(\|\hat y-x_t\|^2 - \delta_F^2).$$
Eq. 4 tells us now that $\|x_t-\hat y+(\hat y-p_{t-1})\|^2 \ge \|x_t-\hat y\|^2$ and
$$\|\tilde w_t\|^2 \ \le\ \frac{\|x_t-\hat y\|^2 - \delta_F^2}{\|x_t-\hat y\|^2}\,\|\tilde w_{t-1}\|^2 \ \le\ \big(1 - \delta_F^2/\operatorname{diam}^2(C)\big)\,\|\tilde w_{t-1}\|^2.$$
This is the second part of our claim. The first part follows directly from the particular form that $\gamma_t$ attains. With $\tilde w_{t-1} = 0$ we have
$$\gamma_t = \frac{\langle\tilde w_{t-1},\ \tilde w_{t-1}+(x_t-\hat y)\rangle}{\|x_t-p_{t-1}\|^2} = 0$$
and $\tilde w_t = \tilde w_{t-1} = 0$. Finally, both cases imply the fast rate of convergence: Let $\epsilon = \delta_F^2/\operatorname{diam}^2(C)\in(0,1)$; then in both cases $\|\tilde w_t\|^2 \le (1-\epsilon)\,\|\tilde w_{t-1}\|^2$ for all $t\ge 2$ and Lemma A.1(ii) from Beck and Teboulle (2004) tells us that $\|\tilde w_t\|^2 \le e^{-\epsilon t}\,\|\tilde w_1\|^2$. But then with $\beta = \epsilon/2$ we have $\|\hat y-p_t\| \le e^{-\beta t}\,\|\tilde w_1\|$ and the result follows. □

(ii) We address now $F\neq C$. First, we can observe that in the general case either $\tilde\gamma_t\le 1$ holds after at most finitely many steps, or there is one final step where $\tilde\gamma_t > 1$ and the approximation error becomes zero afterwards; i.e., there exists a time $t_0 < \infty$ such that $\tilde\gamma_t\le 1$ for all $t\ge t_0$ or, if $\tilde\gamma_t > 1$ for some $t\ge t_0$, then for all $s > t$ we have that $p_s = \hat y$ and $\tilde\gamma_s\in[0,1]$. Here $t_0$ depends on $y$ and $S$.

Proof of claim. We expand the argument of (i.a).
In the case that $S$ does not contain the element $\hat y$ we can argue in the following way: If $\|y-\hat y\| > 0$ and $\|p_{t-1}-\hat y\| \le (\|x_t-\hat y\|^2 - \langle p_{t-1}-\hat y, x_t-\hat y\rangle)/\|y-\hat y\|$ then
$$\begin{aligned}
\langle y-p_{t-1},\ x_t-p_{t-1}\rangle
&= \|y-p_{t-1}\|^2 + \langle y-p_{t-1},\ x_t-y\rangle\\
&= \|y-p_{t-1}\|^2 + \langle y-\hat y,\ x_t-y\rangle + \langle\hat y-p_{t-1},\ \hat y-y\rangle + \langle\hat y-p_{t-1},\ x_t-\hat y\rangle\\
&= \|\hat y-p_{t-1}\|^2 - \langle p_{t-1}-\hat y,\ x_t-\hat y\rangle + \langle y-\hat y,\ \hat y-p_{t-1}\rangle + \langle y-\hat y,\ x_t-\hat y\rangle\\
&\le \|\hat y-p_{t-1}\|^2 - \langle p_{t-1}-\hat y,\ x_t-\hat y\rangle + \|y-\hat y\|\,\|p_{t-1}-\hat y\|,
\end{aligned}$$
and with the assumed bound on $\|p_{t-1}-\hat y\|$ this is at most $\|\hat y-p_{t-1}\|^2 - 2\langle p_{t-1}-\hat y, x_t-\hat y\rangle + \|x_t-\hat y\|^2 = \|x_t-p_{t-1}\|^2$, so $\tilde\gamma_t\le 1$. $S$ consists of finitely many elements and $\mu := \operatorname{dist}(S,\hat y) = \min_{x\in S}\|x-\hat y\| > 0$. Since $x_t\in S$ we know that $\|x_t-\hat y\| \ge \mu$. Furthermore,
$$\langle\hat y-p_{t-1},\ x_t-\hat y\rangle = \langle y-p_{t-1},\ x_t-\hat y\rangle - \langle y-\hat y,\ x_t-\hat y\rangle \ \ge\ 0$$
because the second inner product is non-positive and the first term is non-negative (the same argument as in Theorem 2 (ii.e)). Hence,
$$\|x_t-\hat y\|^2 - \langle p_{t-1}-\hat y,\ x_t-\hat y\rangle \ \ge\ \mu^2 > 0.$$
Standard results for the CGM tell us that we have a constant $c > 0$ such that $\|p_t-y\|^2 - \|\hat y-y\|^2 \le c/t$ and hence
$$\|p_t-\hat y\|^2 = \|p_t-y\|^2 + 2\langle p_t-y,\ y-\hat y\rangle + \|y-\hat y\|^2 \ \le\ \|p_t-y\|^2 - \|y-\hat y\|^2 \ \le\ c/t,$$
where we used that
$$\langle p_t-y,\ y-\hat y\rangle = \langle p_t-\hat y,\ y-\hat y\rangle - \|y-\hat y\|^2 \ \le\ -\|y-\hat y\|^2.$$
Hence, for all $t\in\mathbb N$ with $t\ge t_0$, where $t_0 = c\,\|y-\hat y\|^2/\mu^4$, we know that $\tilde\gamma_t\le 1$. If $y = \hat y$ then the argument simplifies to
$$\langle y-p_{t-1},\ x_t-p_{t-1}\rangle \ \le\ \|\hat y-p_{t-1}\|^2 - \langle p_{t-1}-\hat y,\ x_t-\hat y\rangle \ \le\ \|x_t-p_{t-1}\|^2.$$
The remaining case is where $\hat y\in S$. In fact, the only critical case is where $x_t = \hat y$. We can argue in the following way:
$$\tilde\gamma_t = \frac{\langle y-\hat y,\ \hat y-p_{t-1}\rangle + \langle\hat y-p_{t-1},\ \hat y-p_{t-1}\rangle}{\|\hat y-p_{t-1}\|^2} \ \ge\ 1$$
and $\gamma_t = 1$. Hence, $p_t = \hat y$ and $x_{t+1} = \arg\max_{x\in S}\langle y-p_t, x\rangle = \arg\max_{x\in S}\langle y-\hat y, x-\hat y\rangle$. If $x_{t+1} = \hat y$ then $p_{t+1} = \hat y$ and $\tilde\gamma_{t+1} = 1$. Otherwise, $x_{t+1}$ is an element of the minimal face. Hence, if $p_t = \hat y$ and $x_{t+1}\neq\hat y$ then
$$\tilde\gamma_{t+1} = \frac{\langle y-\hat y,\ x_{t+1}-p_t\rangle + \langle\hat y-p_t,\ x_{t+1}-p_t\rangle}{\|x_{t+1}-p_t\|^2} = 0 = \gamma_{t+1}$$
and $p_{t+1} = \hat y$. The same argument yields that $p_s = \hat y$ for all $s\ge t$ and $\tilde\gamma_s\in[0,1]$. □

(iii) Consider now the split $\operatorname{span} C_c = U_F\oplus U_\delta\oplus U_b$. We will use here the same notation $\delta_m, \delta_F$ etc. as in Theorem 2.

(a) If Assumption 1 holds then there exists an $s_0 < \infty$ such that for all $t\ge s_0$ the algorithm chooses elements $x_t\in F$.
Proof of claim. Assumption 1 provides us with a time $s_0$ after which only elements in $D(\{x_t\})\cup F$ are chosen. As in Theorem 2 (iv) we can observe that there exists a constant $c > 0$ such that for all $t\ge s_0$
$$\langle x_t-\hat y,\ (P_F+P_\delta)w_t\rangle \ \ge\ c\,\|(P_F+P_\delta)w_t\|,$$
where $x_t\in S$ is the element chosen at step $t$. Hence, the sequence $\{\|t(P_F+P_\delta)w_t\|\}_{t\ge 0}$ is bounded. Furthermore, because
$$0 \ \le\ \langle x_t-\hat y,\ t w_t\rangle = \langle x_t-\hat y,\ t(P_F+P_\delta)w_t\rangle + \langle x_t-\hat y,\ t P_b w_t\rangle$$
we can infer that
$$-\langle x_t-\hat y,\ t P_b w_t\rangle \ \le\ \langle x_t-\hat y,\ t(P_F+P_\delta)w_t\rangle \ \le\ \|x_t-\hat y\|\,\|t(P_F+P_\delta)w_t\|$$
and the sequence $\{-\langle x_t-\hat y, t P_b w_t\rangle\}_{t\ge 0}$ is bounded. But this implies that $x = x_t$ is either an element of $F$ or it is selected only finitely often (and, hence, $x\notin D(\{x_t\})$). Therefore, $D(\{x_t\})$ is empty and the result follows. □

(b) There exists a constant $d > 0$ and a time $u_0$ after which for all $t\ge u_0$
$$\|P_F\tilde w_t\|^2 \ \le\ \Big(1 - \frac{\delta_F^2}{\operatorname{diam}^2(F)}\Big)\|P_F\tilde w_{t-1}\|^2 + d\,\|(I-P_F)\tilde w_{t-1}\|^2.$$

Proof of claim. Let us consider $t > t_0\vee s_0$ with $s_0$ from (a) and $t_0$ from (ii). We know that only elements in $F$ are chosen at $t$ and hence
$$\|x_t-p_{t-1}\|^2\,\tilde\gamma_t = \langle w_{t-1},\ \tilde w_{t-1}+x_t-\hat y\rangle = \langle\tilde w_{t-1},\ \tilde w_{t-1}+x_t-\hat y\rangle + \langle y-\hat y,\ \tilde w_{t-1}\rangle = \langle P_F\tilde w_{t-1},\ P_F\tilde w_{t-1}+x_t-\hat y\rangle + \|(I-P_F)\tilde w_{t-1}\|^2 + \langle y-\hat y,\ \tilde w_{t-1}\rangle.$$
Also, $\tilde\gamma_t = \gamma_t$ and
$$\|P_F\tilde w_t\|^2 = \|P_F\tilde w_{t-1}\|^2 - 2\gamma_t\,\langle P_F\tilde w_{t-1},\ P_F\tilde w_{t-1}+x_t-\hat y\rangle + \gamma_t^2\,\|x_t-P_F p_{t-1}\|^2 \ \le\ \Big(1 - \frac{\delta_F^2}{\operatorname{diam}^2(F)}\Big)\|P_F\tilde w_{t-1}\|^2 + \frac{\big(\|(I-P_F)\tilde w_{t-1}\|^2 + \langle y-\hat y,\tilde w_{t-1}\rangle\big)^2}{\|x_t-p_{t-1}\|^2}.$$
Now, since $p_t$ converges to $\hat y$, there exists a time $u_0 > t_0\vee s_0$ after which $\|x_t-p_{t-1}\| \ge \|x_t-\hat y\|/2 \ge \min_{x\in S}\|x-\hat y\|/2$ and $\|(I-P_F)\tilde w_{t-1}\|^2 \le \|(I-P_F)\tilde w_{t-1}\| \le 1$. Hence, for all $t\ge u_0$ we have that
$$\|P_F\tilde w_t\|^2 \ \le\ \Big(1 - \frac{\delta_F^2}{\operatorname{diam}^2(F)}\Big)\|P_F\tilde w_{t-1}\|^2 + \frac{4(1+\|y-\hat y\|)}{\min_{x\in S}\|x-\hat y\|^2}\,\|(I-P_F)\tilde w_{t-1}\|^2.$$
Choosing $d = 4(1+\|y-\hat y\|)/\min_{x\in S}\|x-\hat y\|^2$ yields the result. □

(c) If $\min_{x\in S}\|x-\hat y\| > 0$ and if $\|(I-P_F)\tilde w_{t_0\vee s_0}\| > 0$, with $s_0$ from (a) and $t_0$ from (ii), then the sequence $\{\|(I-P_F)\tilde w_t\|\}_{t\ge 1}$ converges sub-linearly.

Proof of claim. Let us consider $t\ge t_0\vee s_0$. We know that only elements in $F$ are chosen at $t$ and that $\gamma_t = \tilde\gamma_t$.
Therefore $\|(I-P_F)\tilde w_{t+1}\| = (1-\gamma_{t+1})\,\|(I-P_F)\tilde w_t\|$ and
$$\frac{\|(I-P_F)\tilde w_{t+1}\|}{\|(I-P_F)\tilde w_t\|} = 1-\gamma_{t+1} = \frac{\|x_{t+1}-\hat y+\tilde w_t\|^2 - \langle w_t,\ \tilde w_t+x_{t+1}-\hat y\rangle}{\|x_{t+1}-p_t\|^2} = \frac{\|x_{t+1}-\hat y\|^2 + \langle P_F\tilde w_t,\ x_{t+1}-\hat y\rangle - \langle y-\hat y,\ P_b\tilde w_t\rangle}{\|x_{t+1}-\hat y\|^2 + 2\langle x_{t+1}-\hat y,\ \tilde w_t\rangle + \|\tilde w_t\|^2},$$
which converges to 1 since $\lim_{t\to\infty}\|\tilde w_t\| = 0$. This means that the sequence converges sub-linearly. □

6.3 Corollaries

Corollary 4 Given a compact convex set $C\subset H$ and a finite subset $S$ of $C$ with $\operatorname{ex} C\subset S$. If for $y\in H$ there exists a decomposition $U_F\oplus U_\delta\oplus U_b$ of $\operatorname{span} C_c$ such that $U_b$ is one dimensional, then Assumption 1 is fulfilled. In particular, in this case there exists a constant $b > 0$ such that Algorithm 1 has a worst-case approximation error of $b/t$ for all $t\ge 1$.

Proof Consider elements $x, x'\in S\setminus F$. Since $\|P_b(x-\hat y)\|, \|P_b(x'-\hat y)\| > 0$ we know that $\kappa = \|P_b(x-\hat y)\|/\|P_b(x'-\hat y)\| < \infty$ and, since $U_b$ is one-dimensional, we have for any $t$
$$\langle P_b w_t,\ x-\hat y\rangle = -\|P_b w_t\|\,\|P_b(x-\hat y)\| = -\kappa\,\|P_b w_t\|\,\|P_b(x'-\hat y)\| = \kappa\,\langle P_b w_t,\ x'-\hat y\rangle$$
and Assumption 1 is fulfilled.

Let for the following corollary $A = \{z\in H : z = \lambda\,P_C(y-\hat y) + P_C\hat y,\ \lambda\in[0,\infty)\}$.

Corollary 5 Given a compact convex set $C\subset H$ and a finite subset $S$ of $C$ with $\operatorname{ex} C\subset S$. Assumption 1 is fulfilled if $P_C y\in H\setminus C$ and, whenever a $z\in H$ has projection $\hat y$ onto $C$, then $P_C z\in A$ holds. In particular, in this case there exists a constant $b > 0$ such that Algorithm 1 has a worst-case approximation error of $b/t$ for all $t\ge 1$.

Proof The assumption guarantees us that $U_b$ is one-dimensional. Assume otherwise. By assumption $\|P_C(y-\hat y)\| > 0$ and we can choose $E_b$ such that there exist two elements $e_1, e_2$ in $E_b$ with $e_1 = P_C(y-\hat y)/\|P_C(y-\hat y)\|$ and $\langle e_1, e_2\rangle = 0$, $\|e_1\| = \|e_2\| = 1$. We claim that $z_1 = \hat y + e_1$ and $z_2 = \hat y + e_2$ are both projected onto $\hat y$: We know that $P_{e_1}[C_c]\subset(-\infty,0]$ and $\langle e_1, x-\hat y\rangle\le 0$ for all $x\in C$. Hence, for all $x\in C$ we know that $\langle z_1-\hat y, x-\hat y\rangle = \langle e_1, x-\hat y\rangle\le 0$. But this means that $\hat y$ is the projection of $z_1$. The same argument applies to $z_2$, and $\hat y$ is the projection of two elements for which $\langle z_1-\hat y, z_2-\hat y\rangle = 0$.
Furthermore, $P_C z_1 = P_C\hat y + e_1$ and $P_C z_2 = P_C\hat y + e_2$. If $P_C z_2\in A$ then there exist $\lambda, \tilde\lambda > 0$ with
$$e_2 + P_C\hat y = P_C z_2 = \lambda\,P_C(y-\hat y) + P_C\hat y = \tilde\lambda\,e_1 + P_C\hat y$$
and $e_2 = \tilde\lambda e_1$, a contradiction to orthogonality.

For the next corollaries let $d = \dim U_b$, let $e_1,\dots,e_d$ be any basis of $U_b$ and let $r = \sup_{x\in C}\|x\|$. Also introduce $\epsilon = \min_{i\le d}\min\{|\langle e_i, x-\hat y\rangle| : x\in S\setminus F,\ \langle e_i, x-\hat y\rangle\neq 0\} > 0$.

Corollary 6 Let $C$ be a compact convex set in some Hilbert space $H$, $S$ a finite set with $\operatorname{cch} S = C$, and $y\in H$ such that there exists a split $U_F\oplus U_\delta\oplus U_b$ of $\operatorname{span} C_c$ with $U_\delta = \{0\}$. Assumption 1 is fulfilled and there exists a constant $b > 0$ such that Algorithm 1 has a worst-case approximation error of $b/t$ for all $t\ge 1$. The constant $b$ can be chosen as $\sqrt d\,4r^3/(\epsilon\delta_F) + 6r^2(1/\delta_F + 1/\epsilon) + 5r$.

Proof If $F_c = \{0\}$ then each element $x\in S\setminus F$ can only be chosen once since $0 > \langle w_t, x-\hat y\rangle = \langle P_b w_t, x-\hat y\rangle$ if $x$ has been chosen at least once. This implies that $D(\{x_t\})$ is, in fact, the empty set and Assumption 1 is fulfilled. It is also easy to see that $|\langle w_t, e_i\rangle| \le \|x-\hat y\| \le 2r$ for all $i\le d$. This implies that the constant $b$ can in this case be chosen as $2r\sqrt d$. Also, since $\delta_F$ and $\epsilon$ are upper bounded by $r$ this implies that $b \le 2r^3\sqrt d/(\epsilon\delta_F)$.

Assume now that $F_c\neq\{0\}$ and that there exists an $x\in D(\{x_t\})$. By definition $x\in S\setminus F$ and $x$ is chosen infinitely often. Hence, $\{\langle P_b w_t, x-\hat y\rangle\}_{t\ge 1}$ is a non-increasing sequence that diverges to $-\infty$. There exists a $\delta_F > 0$ such that for any $w_t$, $\max_{x'\in S\cap F}\langle w_t, x'-\hat y\rangle \ge \delta_F\,\|P_F w_t\|$. Also, for any $x'\in S$, we have that $\langle w_t, x'-\hat y\rangle = \langle P_F w_t, x'-\hat y\rangle + \langle P_b w_t, x'-\hat y\rangle$ and, since the term $\langle P_b w_t, x'-\hat y\rangle$ is always non-positive, we know that if $P_F w_t\neq 0$ elements will be chosen such that $\langle P_F w_t, x'-\hat y\rangle \ge \delta_F\,\|P_F w_t\|$. Hence, whenever $\|P_F w_t\| \ge 2r^2/\delta_F$ for some $t$ then $\|P_F w_{t+1}\| \le \|P_F w_t\|$. This implies that $\|P_F w_t\| \le 2r^2/\delta_F + 3r$ for all $t\ge 1$.
Hence, for $x$ to be chosen infinitely often we need for all time steps $t$ where $x$ is chosen that
$$0 \ \le\ \langle w_t,\ x-\hat y\rangle = \langle P_F w_t,\ x-\hat y\rangle + \langle P_b w_t,\ x-\hat y\rangle \ \le\ 2r(2r^2/\delta_F + 3r) + \langle P_b w_t,\ x-\hat y\rangle,$$
since $\|x-\hat y\| \le 2r$. Therefore, $\langle P_b w_t, x-\hat y\rangle \ge -r(4r^2/\delta_F + 6r)$ and $\{\langle P_b w_t, x-\hat y\rangle\}_{t\ge 1}$ cannot diverge to $-\infty$. In other words, $x$ cannot be chosen infinitely often, a contradiction to the initial assumption.

We can also control the constant in this case: $\|P_b w_t\|$ can only grow if either $P_F w_t = 0$ or there is an $x'\in S\setminus F$ such that $\langle P_b w_t, x'-\hat y\rangle \ge -r(4r^2/\delta_F + 6r)$. Then for $w_t$, if for any $i\le d$ we have $\langle w_t, e_i\rangle > r(4r^2/\delta_F + 6r)/\epsilon$, then no $x\in S\setminus F$ with $\langle e_i, x-\hat y\rangle\neq 0$ can be played, since for such an $x$
$$\langle P_b w_t,\ x-\hat y\rangle \ \le\ \langle e_i,\ x-\hat y\rangle\,\langle e_i,\ w_t\rangle \ \le\ -\epsilon\,\langle e_i,\ w_t\rangle \ <\ -r(4r^2/\delta_F + 6r).$$
Hence, $\langle e_i, w_t\rangle \le r(4r^2/\delta_F + 6r)/\epsilon + 2r$, since the increment of $w_t$ in one step is bounded by $2r$. Since this holds for each $i\le d$ we obtain the bound
$$\|P_b w_t\| \ \le\ \sqrt d\,4r^3/(\epsilon\delta_F) + 6r^2/\epsilon + 2r \quad\text{and}\quad \|w_t\| \ \le\ \|P_F w_t\| + \|P_b w_t\| \ \le\ \sqrt d\,4r^3/(\epsilon\delta_F) + 6r^2(1/\delta_F + 1/\epsilon) + 5r.$$

Corollary 7 Let $y\in\mathbb R^d$, $Q$ any orthogonal matrix, $c > 0$ any scaling, $z\in\mathbb R^d$, $S = \{0,1\}^d$ and $C = [0,1]^d$, $d\ge 1$. Assumption 1 is fulfilled for the set $\tilde S = cQ[S+z]$ and $\tilde C = cQ[C+z]$ (independently of the dimensionality of the face $\hat y$ lies in). In particular, there exists a constant $b > 0$ such that Algorithm 1 has a worst-case approximation error of $b/t$ for all $t\ge 1$. The constant $b$ can be chosen as $\sqrt d\,4r^3/(c\delta_F) + 6r^2(1/\delta_F + 1/c) + 5r$.

Proof Because the algorithm does not change by translating and rotating $S$ and $C$ we can consider a hypercube anchored at the origin and aligned with the standard basis of $\mathbb R^d$. We therefore assume without loss of generality that we have to deal with a scaled standard hypercube in $\mathbb R^d$. Let us assume in the following that $\dim U_F = k\le d$. Observe that either $U_F = \{0\}$ or we can write $U_F = \operatorname{span}\{e_{i_1},\dots,e_{i_k}\}$, where $e_1,\dots,e_d$ is the standard basis in $\mathbb R^d$, $i_j\le d$ for all $j\le k$, and $k = \dim U_F$.
This holds because a $k$-dimensional face of the hypercube is a translated $k$-dimensional hypercube in a subspace spanned by some basis vectors $e_{i_1},\dots,e_{i_k}$. Define the index sets $I = \{i_1,\dots,i_k\}$ ($I = \emptyset$ if $U_F = \{0\}$) and $J = \{1,\dots,d\}\setminus I$. There are signs $s_j\in\{-1,1\}$, $j\in J$, such that $E_b = \{s_j e_j : j\in J\}$ is a basis for $U_b$, and the split $U_F\oplus U_\delta\oplus U_b$ satisfies the conditions of Assumption 1. In particular, $U_\delta = \{0\}$, $P_C(y-\hat y)\in U_b$ and for all $e\in E_b$ we have $\{\langle x, e\rangle : x\in C_c\}\subset(-\infty,0]$.

Proof of claim. $U_\delta = \{0\}$ because the minimal faces which are orthogonally projected onto a face that contains a ball around $\hat y$ (relative to the projection) are translated versions of the minimal face and they induce the same subspace as the minimal face. That also implies directly that $E_b$ can be chosen to fulfill Assumption 1. $P_C(y-\hat y)$ is orthogonal to $U_F$ and, hence, lies fully in $U_b$. □

Therefore, Corollary 6 applies. Observe that $\epsilon$ can here be chosen as the scaling factor $c$, and the corollary provides the stated result since $\dim U_b\le d$.

Corollary 8 Let $y\in\mathbb R^d$, $S = \{e_i : i\le d\}$ and $C = \Delta^{d-1} = \operatorname{cch} S$, $d\ge 1$. Assumption 1 is fulfilled and there exists a constant $b > 0$ such that Algorithm 1 has a worst-case approximation error of $b/t$ for all $t\ge 1$. The constant $b$ can be chosen as $4dr^3/\delta_F + 6r^2(1/\delta_F + \sqrt d) + 5r$.

Proof Due to the rotation invariance it suffices to consider the case where $F = \operatorname{cch}\{e_1,\dots,e_k\}$ for some $k\le d$. Let us consider first the case where $F = \{e_1\}$ and, hence, $F_c = \{0\}$. In this case, $C_c = \operatorname{cch}\{0, e_2-e_1,\dots,e_d-e_1\}$. Observe that $\langle x+e_1, e_j\rangle\ge 0$ for any $x\in C_c$ and all $j\le d$, since $x+e_1\in C$ and, hence, $x+e_1 = \sum_{i\le d}\lambda_i e_i$ for some non-negative $\lambda_i$. In particular, for $j\ge 2$, $\langle x, e_j\rangle\ge 0$ for all $x\in C_c$. Furthermore, $\langle x, e_1\rangle\le 0$ for all $x\in C_c$ because $x+e_1 = \sum_{i\le d}\lambda_i e_i$ with non-negative scalars that fulfill $\sum_{i\le d}\lambda_i = 1$; i.e., $\langle x, e_1\rangle = \langle x+e_1, e_1\rangle - 1 \le \|e_1\|^2 - 1 = 0$.
Therefore, we have an orthonormal basis $\{-e_1, e_2,\dots,e_d\}$ of $\mathbb R^d$ such that $\langle x, e\rangle\ge 0$ for all basis elements $e$ and all $x\in C_c$. This implies that there exists no basis element $\tilde e\in\mathbb R^d$ such that there exist elements $x, x'\in C_c$ with $\langle x,\tilde e\rangle > 0 > \langle x',\tilde e\rangle$ and, hence, $\{-e_1, e_2,\dots,e_d\}$ spans the $d$-dimensional space $U_b$ and $U_\delta = \{0\}$. The rate of convergence follows now from Corollary 6.

Let us consider the case $F = \operatorname{cch}\{e_1,\dots,e_k\}$ and $k\ge 2$. For any $e_j$, $j > k$, and any $x\in C_c$, $\langle x+\hat y, e_j\rangle\ge 0$ since $x+\hat y = \sum_{i\le d}\lambda_i e_i$ for some non-negative $\lambda_i$. But this implies $\langle x, e_j\rangle = \langle x+\hat y, e_j\rangle\ge 0$ because $\hat y$ is a convex combination of $e_1,\dots,e_k$. Also $e_{k+1},\dots,e_d\in F_c^\perp$ because $\operatorname{span} F_c = \operatorname{span}\{e_2-e_1,\dots,e_k-e_1\}$ and $\langle e_i-e_1, e_j\rangle = 0$ for any $i\le k$, $j > k$. Finally, consider $e = \sum_{j=1}^{k} e_j/\sqrt k$, $\|e\| = 1$, and observe that $\langle e, e_i\rangle = 0$ for all $i > k$. Also, $\langle e, e_i-e_1\rangle = 0$ for $i\le k$. This implies that $e$ is orthogonal to $F_c$ because for $v\in\operatorname{span} F_c$, $v = \sum_{i=2}^{k}\alpha_i(e_i-e_1)$, $\alpha_i\in\mathbb R$, we have that $\langle e, v\rangle = \sum_{i=2}^{k}\alpha_i\langle e, e_i-e_1\rangle = 0$. Furthermore, for any $x\in C_c$, $x = \sum_{i=1}^{d}\lambda_i e_i - \hat y$, $\sum_{i=1}^{d}\lambda_i = 1$,
$$\langle e, x\rangle = \sum_{i=1}^{d}\lambda_i\,\langle e,\ e_i-\hat y\rangle = \sum_{i=1}^{d}\lambda_i\,\langle e,\ e_i-e_1\rangle - \langle e,\ \hat y-e_1\rangle \ \le\ 0,$$
where $\langle e, \hat y-e_1\rangle = 0$. Hence, we have $d+1-k$ orthonormal vectors $e_{k+1},\dots,e_d, e$ that lie in the orthogonal complement of $F_c$ and for any $x\in C_c$, $\langle x, e_i\rangle\ge 0$ and $\langle x, e\rangle\le 0$ for all $k+1\le i\le d$. Hence, $U_\delta = \{0\}$ and the result on the rate of convergence follows from Corollary 6.

The constant also follows directly from the corollary: in case $F_c = \{0\}$ we have $S\setminus F = \{e_2,\dots,e_d\}$, the basis of $U_b$ is $\{-e_1, e_2,\dots,e_d\}$ and $\hat y = e_1$. $\epsilon$ can be lower bounded by 1: $\langle -e_1, x-\hat y\rangle = 1$ for any $x\in S\setminus F$; $\langle e_i, e_j-\hat y\rangle$ attains value 1 if $i = j\ge 2$ and value 0 if $i\neq j$, $i, j\ge 2$. Otherwise, $S\setminus F = \{e_{k+1},\dots,e_d\}$, the basis of $U_b$ is $\{e_{k+1},\dots,e_d, -\sum_{i=1}^{k} e_i/\sqrt k\}$ and $\hat y = \sum_{i=1}^{k}\lambda_i e_i$ for some $\lambda_i\ge 0$, $\sum_{i=1}^{k}\lambda_i = 1$. Hence, $\langle e_i, e_j-\hat y\rangle = 0$ if $i\neq j$, $i, j > k$; $\langle e_i, e_i-\hat y\rangle = 1$ if $i > k$; and $\langle -\sum_{i=1}^{k} e_i/\sqrt k,\ e_j-\hat y\rangle = 1/\sqrt k$, where $j > k$.
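Corollary 8 concerns projections onto the standard simplex with $S = \{e_1,\dots,e_d\}$. As an illustrative check under assumed example values (not taken from the paper), the following sketch compares a conditional-gradient projection with clipped exact line search against the standard sort-based exact simplex projection in the style of Duchi et al. (2008), which the introduction cites:

```python
# Illustrative check for the simplex setting (example values, not from the paper):
# a conditional-gradient projection onto the standard simplex versus the exact
# sort-based projection (style of Duchi et al., 2008).

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def simplex_project(y):
    # Exact Euclidean projection of y onto {x : x >= 0, sum(x) = 1}.
    u = sorted(y, reverse=True)
    cum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cum += ui
        t = (1.0 - cum) / i
        if ui + t > 0.0:
            theta = t
    return [max(yi + theta, 0.0) for yi in y]

def cgm_simplex(y, steps):
    d = len(y)
    S = [tuple(1.0 if i == j else 0.0 for j in range(d)) for i in range(d)]
    p = list(S[0])                                   # start at the vertex e_1
    for _ in range(steps):
        g = [yi - pi for yi, pi in zip(y, p)]        # residual y - p
        x = max(S, key=lambda s: dot(g, s))          # linear maximisation over S
        dvec = [xi - pi for xi, pi in zip(x, p)]
        dd = dot(dvec, dvec)
        if dd == 0.0:
            break
        step = min(1.0, max(0.0, dot(g, dvec) / dd)) # clipped exact line search
        p = [pi + step * di for pi, di in zip(p, dvec)]
    return p

y = [1.0, 0.5, -0.2]
exact = simplex_project(y)                           # [0.75, 0.25, 0.0]
approx = cgm_simplex(y, 200)
```

For this example the two procedures agree exactly; in general the conditional-gradient iterate only approximates the projection at the rates established above.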
Hence, $\epsilon \ge 1/\sqrt d$ and the result follows.

References

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 1950.

F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. In International Conference on Machine Learning, 2012.

A. Beck and M. Teboulle. A conditional gradient method with linear rate of convergence for solving convex linear systems. Mathematical Methods of Operations Research, 2004.

M. D. Canon and C. D. Cullum. A tight upper bound on the rate of convergence of the Frank-Wolfe algorithm. SIAM Journal on Control, 1968.

Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2010.

K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 2010.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In International Conference on Machine Learning, 2008.

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.

D. Garber and E. Hazan. A polynomial time conditional gradient algorithm with applications to online and stochastic optimization. CoRR, abs/1301.4666, 2013.

D. Garber and E. Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In International Conference on Machine Learning, 2015.

J. Guelat and P. Marcotte. Some comments on Wolfe's 'away step'. Mathematical Programming, 1986.

M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, 2013.

S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, 2015.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 1996.

K. C. Toh, M. J. Todd, and R. H. Tutuncu.
SDPT3: a Matlab software package for semidefinite programming. Optimization Methods and Software, 1999.

I. Tsang, J. Kwok, and P.-M. Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 2005a.

I. Tsang, J. Kwok, and K. Lai. Core vector regression for very large regression problems. In International Conference on Machine Learning, 2005b.

R. H. Tutuncu, K. C. Toh, and M. J. Todd. Solving semidefinite-quadratic-linear programs using SDPT3. Mathematical Programming, Series B, 2003.

M. Welling. Herding dynamical weights to learn. In International Conference on Machine Learning, 2009.

P. Wolfe. Convergence theory in nonlinear programming, chapter 1. North-Holland Publishing Company, 1970.

P. Wolfe. Finding the nearest point in a polytope. Mathematical Programming, 1976.

Y. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel ridge regression. In Conference on Learning Theory, 2013.