# Fast Projection Onto Convex Smooth Constraints

Ilnura Usmanova (1), Maryam Kamgarpour (2), Andreas Krause (3), Kfir Yehuda Levy (4,5)

(1) Automatic Control Laboratory, D-ITET, ETH Zürich, Switzerland. (2) Department of Electrical and Computer Engineering, University of British Columbia, Canada. (3) Department of Computer Science, ETH Zürich, Switzerland. (4) A Viterbi fellow. (5) Department of Electrical & Computer Engineering, Technion - Israel Institute of Technology. Correspondence to: Ilnura Usmanova. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract. The Euclidean projection onto a convex set is an important problem that arises in numerous constrained optimization tasks. Unfortunately, in many cases computing projections is computationally demanding. In this work, we focus on projection problems where the constraints are smooth and the number of constraints is significantly smaller than the dimension. The runtime of existing approaches to solving such problems is either cubic in the dimension or polynomial in the inverse of the target accuracy. Conversely, we propose a simple and efficient primal-dual approach, whose runtime scales only linearly with the dimension and only logarithmically in the inverse of the target accuracy. We empirically demonstrate its performance and compare it with standard baselines.

## 1. INTRODUCTION

Constrained optimization problems arise naturally in numerous fields such as control theory, communication, signal processing, and machine learning (ML). A common approach for solving constrained problems is to project onto the set of constraints in each step of the optimization method. Indeed, in ML the most popular learning method is projected stochastic gradient descent (SGD). Moreover, projections are employed within projected quasi-Newton (Schmidt et al., 2009) and projected Newton-type methods. The projection operation itself requires solving a quadratic optimization problem over the original constraints.

In this work, we address the case where we have several smooth constraints, i.e., our constraint set $K$ is
$$K = \{x \in \mathbb{R}^n : h_i(x) \le 0,\ i \in [m]\}, \qquad (1)$$
where the $h_i$'s are convex and smooth. We focus on the case where the dimension of the problem $n$ is high and the number of constraints $m$ is low. This captures several important ML applications, such as multiple kernel learning (Ye et al., 2007), semi-supervised learning (Zhu et al., 2006), triangulation in computer vision (Aholt et al., 2012), applications in signal processing (Huang and Palomar, 2014), and solving constrained MDPs (Altman, 1999; Jin and Sidford, 2020).

In some special cases, like box, $\ell_2$, or $\ell_1$ constraints, the projection problem can be solved very efficiently. Nevertheless, in general there does not exist a unified and scalable approach for projection. One generic family of approaches for solving convex constrained problems are Interior Point Methods (IPM) (Karmarkar, 1984; Nemirovski and Todd, 2008). Unfortunately, in general the runtime of IPMs scales as $O(n^3 m \log(n/\varepsilon))$, where $\varepsilon$ is the accuracy of the solution, so these methods are unsuitable for high-dimensional problems.

Our contribution. We propose a generic and scalable approach for projecting onto a small number of convex smooth constraints. Our approach applies to any constraint set that can be described by Eq. (1).
Moreover, our approach extends beyond the projection objective to any strongly convex and smooth objective. The overall runtime of our method for finding an approximate projection is $O\big(nm^{2.5}\log^2(1/\varepsilon) + m^{3.5}\log(1/\varepsilon)\big)$ (see Thm. 3.2 and the discussion afterwards). Thus, the runtime of our method scales linearly with $n$, making it highly suitable for solving the high-dimensional problems that are ubiquitous in ML. Furthermore, in contrast to the Frank-Wolfe (FW) algorithm (Frank and Wolfe, 1956), our approach is generic (i.e., it does not require a linear minimization oracle) and depends only logarithmically on the accuracy.

Moreover, we extend our technique beyond the case of intersections of few smooth constraints. In particular, we provide a conversion scheme that enables efficient projection onto norm balls using an oracle that projects onto their dual. One can interpret this result as an algorithmic equivalence between projection onto a norm ball and projection onto its dual ball. This holds for both smooth and non-smooth norms.

On the technical side, our approach utilizes the dual formulation of the problem and solves it using a cutting plane method. Our key observation is that in the special case of projections, one can efficiently compute approximate gradients and values for the dual problem, which we then use within the cutting plane method. Along the way, we prove the convergence of cutting plane methods with approximate gradient and value oracles, which may be of independent interest.

Related work. In the past years, projection-free first-order methods have been extensively investigated. In particular, the Frank-Wolfe (FW) algorithm (Frank and Wolfe, 1956), also known as the conditional gradient method, is explored, e.g., in (Jaggi, 2013; Garber and Hazan, 2015; Lacoste-Julien and Jaggi, 2015; Garber and Meshi, 2016; Garber, 2016; Lan and Zhou, 2016; Allen-Zhu et al., 2017; Lan et al., 2017). This approach avoids projections and instead assumes that one can efficiently solve an optimization problem with a linear objective over the constraints in each round. Unfortunately, the latter assumption holds only in special cases. In general, this linear minimization might be nontrivial and have the same complexity as the initial problem. Moreover, FW is in general unable to enjoy the fast convergence rates that apply to standard gradient-based methods.¹ In particular, FW does not achieve the linear rate obtained by projected gradient descent in the case of smooth and strongly convex problems. Moreover, FW does not enjoy the accelerated rate obtained by projected Nesterov's method for smooth and convex problems.

¹ Note that for some special cases like simplex constraints one can ensure fast rates for FW (Garber and Hazan, 2013; Lacoste-Julien and Jaggi, 2015). In general, FW requires $O(1/\varepsilon)$ calls to a linear minimization oracle to obtain $\varepsilon$-accurate solutions.

Further popular approaches for solving constrained problems are the Augmented Lagrangian method and ADMM (Alternating Direction Method of Multipliers) (Boyd et al., 2011; He and Yuan, 2012; Goldstein et al., 2014; Eckstein and Yao, 2012). Such methods work directly on the Lagrangian formulation of the problem while adding penalty terms. Under specific conditions, their convergence rate may be linear (Nishihara et al., 2015; Giselsson and Boyd, 2014). However, ADMM requires the ability to efficiently compute the proximal operator, which as a special case includes the projection operator. To project onto the intersection of convex constraint sets $\cap_{i=1}^m K_i$, consensus ADMM can exploit projection oracles for each $K_i$ separately. For this general case, only a sublinear rate is known (Xu et al., 2017; Peters and Herrmann, 2019).
In the special case of polyhedral sets $K_i$, it can attain a linear rate (Hong and Luo, 2017). Levy and Krause (2019) suggest a fast projection scheme that can approximately solve the projection onto a single smooth constraint. However, their approach cannot ensure an arbitrarily small accuracy. Li et al. (2020) extend this approach to a simple constraint, like an $\ell_1$- or $\ell_\infty$-ball, in addition to a single smooth constraint. Basu et al. (2017) address high-dimensional QCQPs (Quadratically Constrained Quadratic Programs) via a polyhedral approximation of the feasible set obtained by sampling low-discrepancy sequences. Nevertheless, they only show that their method converges asymptotically, and do not provide any convergence rates. There are also works focusing on fast projections onto sets with good structure, like $\ell_1$ and $\ell_\infty$ balls (Condat, 2016; Gustavo et al., 2018; Li and Li, 2020).

The primal-dual formulation of optimization problems is a standard tool that has been extensively explored in the literature. For example, Arora et al. (2005), Plotkin et al. (1995), and Lee et al. (2015) apply the primal-dual approach to solving LPs and SDPs. Nevertheless, almost all of the previous works consider problems which are either LPs or SDPs; these are very different from the projection problem that we consider here. In particular: (i) Those works make use of the specialized structure of LPs and SDPs, which does not apply to our setting, where we consider general constraints. Plotkin et al. (1995) consider general convex constraints, but assume the availability of an oracle that can efficiently solve LPs over this set, which is a very strong assumption that we do not make. (ii) We devise and employ an approximate gradient oracle for our dual problem in a novel way, through a natural combination of Nesterov's method in the primal with a cutting plane method in the dual. Furthermore, we provide a novel analysis for the projection problem, showing that an approximate solution to the dual problem can be translated into an approximate primal solution. Thus, the techniques and challenges in our paper are very different from those in the aforementioned papers.

Preliminaries and Notation. We denote the Euclidean norm by $\|\cdot\|$. For a positive integer $t$ we denote $[t] = \{1,\dots,t\}$. A function $F : \mathbb{R}^n \to \mathbb{R}$ is $\alpha$-strongly convex if
$$F(y) \ge F(x) + \nabla F(x)^\top (y - x) + \frac{\alpha}{2}\|x - y\|^2, \quad \forall x, y \in \mathbb{R}^n.$$
It is well known that strong convexity implies
$$\frac{\alpha}{2}\|x - x^*\|^2 \le F(x) - F(x^*), \quad \forall x \in \mathbb{R}^n, \text{ where } x^* = \arg\min_{x \in \mathbb{R}^n} F(x).$$

## 2. PROBLEM FORMULATION

The general problem of Euclidean projection is defined as a constrained optimization problem
$$\min_{x \in \mathbb{R}^n}\ \|x_0 - x\|^2 \quad \text{subject to}\quad h_i(x) \le 0,\ i = 1,\dots,m, \qquad \text{(P1)}$$
where the constraints $h_i : \mathbb{R}^n \to \mathbb{R}$ are convex.

Goal: Our goal in this work is to find an $\varepsilon$-approximate projection $\tilde x$, such that for any $x$ with $h_i(x) \le 0$ for all $i$ we have
$$\|\tilde x - x_0\|^2 \le \|x - x_0\|^2 + \varepsilon, \quad\text{and}\quad h_i(\tilde x) \le \varepsilon,\ \forall i \in [m].$$

Assumptions: Defining $K := \{x \in \mathbb{R}^n : h_i(x) \le 0,\ i \in [m]\}$, we assume that $K$ is compact. Furthermore, we assume the $h_i$'s to be $L$-smooth and $G$-Lipschitz continuous on the convex hull of $K$ and $x_0$, i.e.,
$$|h_i(x) - h_i(y)| \le G\|x - y\|, \quad \forall i \in [m],\ \forall x, y \in \mathrm{Conv}\{K, x_0\},$$
and
$$\|\nabla h_i(x) - \nabla h_i(y)\| \le L\|x - y\|, \quad \forall i \in [m],\ \forall x, y \in \mathbb{R}^n.$$
We assume both $G$ and $L$ to be known.
We denote by $H > 0$ a bound such that $\max_{x \in K}|h_i(x)| \le H$ for all $i \in [m]$. We further assume that the distance between $x_0$ and $K$ is bounded by $B$, i.e., $\min_{x \in K}\|x - x_0\| \le B$. Our method does not require knowledge of $B$ or $H$. The Lipschitz continuity and smoothness assumptions above are standard and often hold in machine learning applications.

KKT conditions: Our final assumption is that Slater's condition holds, i.e., that there exists a point $x \in \mathbb{R}^n$ such that $h_i(x) < 0$ for all $i \in [m]$. Along with convexity, this immediately implies that the optimal solution $x^*$ of Problem (P1) satisfies the KKT conditions, i.e., there exist $\lambda_*^{(1)},\dots,\lambda_*^{(m)} \in \mathbb{R}_+$ such that
$$(x^* - x_0) + \sum_{i=1}^m \lambda_*^{(i)}\nabla h_i(x^*) = 0, \qquad \lambda_*^{(i)} h_i(x^*) = 0,\ \forall i \in [m],$$
and that there exists a finite bound on $|\lambda_*^{(i)}|$ for all $i \in [m]$. Throughout this paper we assume knowledge of an upper bound on the multipliers, denoted by $R$: $|\lambda_*^{(i)}| \le R$ for all $i \in [m]$. In Appendix B.1 we showcase two problems where such a bound can be obtained explicitly. When such a bound is unknown in advance, one can estimate $R$ "on the fly" by applying a standard doubling trick; this only yields a constant factor increase in the overall runtime. We elaborate on this in Appendix B.2. For simplicity we assume throughout the paper that $R \ge 1$.

## 3. FAST PROJECTION APPROACH

### 3.1. Intuition: the Case of a Single Constraint

As a warm-up, consider the case of a single smooth constraint,
$$\min_{x \in \mathbb{R}^n : h(x) \le 0}\ \|x_0 - x\|^2. \qquad (2)$$
Our fast projection method relies on the (equivalent) dual formulation of the above problem. Let us first define the Lagrangian: for $x \in \mathbb{R}^n$ and $\lambda \ge 0$,
$$L(x, \lambda) := \|x_0 - x\|^2 + \lambda h(x).$$
Note that $L(\cdot,\cdot)$ is strongly convex in $x$ and concave in $\lambda$. Denoting the dual objective by $d(\lambda)$, the dual problem is
$$\max_{\lambda \ge 0} d(\lambda), \quad\text{where}\quad d(\lambda) := \min_{x \in \mathbb{R}^n}\ \|x_0 - x\|^2 + \lambda h(x). \qquad (3)$$
We denote an optimal dual solution by $\lambda^* \in \arg\max_{\lambda \ge 0} d(\lambda)$. Our approach is to find an approximately optimal solution to the dual problem $\max_{\lambda \ge 0} d(\lambda)$. Here we show how to do so, and demonstrate how this translates into an approximate solution of the original projection problem (Eq. (2)).

The intuition behind our method is the following. $L(x,\lambda)$ is linear in $\lambda$ and $d(\lambda) := \min_{x \in \mathbb{R}^n} L(x,\lambda)$, therefore $\max_{\lambda \ge 0} d(\lambda)$ is a one-dimensional concave problem. Moreover, $d(\lambda)$ is differentiable and smooth since the primal problem is strongly convex (see Lemma 3.2). Thus, if we could access an exact gradient oracle for $d(\cdot)$, we could use bisection (see Alg. 6 in the appendix) to find an $\varepsilon$-approximate solution to the dual problem within $O(\log(1/\varepsilon))$ iterations (Juditsky, 2015). Due to strong duality, this translates to an $\varepsilon$-approximate solution of the original problem (Eq. (2)).

While an exact gradient oracle for $d(\cdot)$ is unavailable, we can efficiently compute approximate gradients of $d(\cdot)$. Fixing $\lambda \ge 0$, this can be done by (approximately) solving the following program:
$$\min_{x \in \mathbb{R}^n}\ L(x,\lambda) := \|x - x_0\|^2 + \lambda h(x). \qquad (4)$$
Letting $x^*_\lambda = \arg\min_{x \in \mathbb{R}^n} \|x - x_0\|^2 + \lambda h(x)$, one can show that $\nabla d(\lambda) = h(x^*_\lambda)$. Thus, in order to devise an approximate estimate of $\nabla d(\lambda)$, it suffices to solve the above unconstrained program in $x$ to within sufficient accuracy. This can be done at a linear rate using Nesterov's Accelerated Gradient Descent (AGD) (see Alg. 3), since Eq. (4) is a smooth and strongly convex problem (recall that $h(\cdot)$ is smooth). These approximate gradients can then be used instead of the exact gradients of $d(\cdot)$ to find an $\varepsilon$-optimal solution to the dual problem within $O(\log(1/\varepsilon))$ iterations. The formal description for the case of a single constraint can be found in Appendix A.
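To make the single-constraint recipe concrete, the following sketch combines bisection over the dual variable with an inexact inner solver. It is our illustration of the idea above, not the algorithm from Appendix A: the helper names are ours, scipy's L-BFGS-B stands in for Nesterov's AGD as the inner solver, and `R` is assumed to upper-bound the optimal multiplier.

```python
import numpy as np
from scipy.optimize import minimize

def approx_dual_grad(lam, x0, h, grad_h, x_warm=None):
    """Approximately minimize L(x, lam) = ||x - x0||^2 + lam * h(x) and
    return (x_lam, h(x_lam)); h(x_lam) approximates d'(lam).
    L-BFGS is used here as a stand-in for Nesterov's AGD."""
    obj = lambda x: np.dot(x - x0, x - x0) + lam * h(x)
    jac = lambda x: 2.0 * (x - x0) + lam * grad_h(x)
    res = minimize(obj, x_warm if x_warm is not None else x0,
                   jac=jac, method="L-BFGS-B", tol=1e-12)
    return res.x, h(res.x)

def project_single_constraint(x0, h, grad_h, R=10.0, tol=1e-8):
    """Bisection on the dual variable lam in [0, R]: d'(lam) = h(x_lam) is
    non-increasing in lam, so we search for its (approximate) root.
    Assumes R upper-bounds the optimal multiplier."""
    x, hx = approx_dual_grad(0.0, x0, h, grad_h)
    if hx <= 0:            # the unconstrained minimizer x0 is already feasible
        return x
    lo, hi, x_warm = 0.0, R, x
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        x_warm, hx = approx_dual_grad(lam, x0, h, grad_h, x_warm)
        if hx > 0:         # constraint still violated -> increase lam
            lo = lam
        else:              # feasible -> lam may be too large
            hi = lam
    return x_warm

# Toy example: project onto the ellipsoid {x : x^T A x <= 1}.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T / 50 + np.eye(50)          # positive definite
h = lambda x: x @ A @ x - 1.0
grad_h = lambda x: 2.0 * A @ x
x0 = 3.0 * rng.standard_normal(50)
x_proj = project_single_constraint(x0, h, grad_h, R=100.0)
print(h(x_proj))                       # close to 0: the projection lies on the boundary
```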
Next we discuss our approach for the case of several constraints.

Remark: Note that using Nesterov's AGD for the dual problem, in the same way as we do for the primal problem, sounds like a very natural idea. However, we cannot guarantee strong concavity of the dual problem, and hence we cannot hope for a linear convergence rate with this approach. In contrast, the bisection algorithm guarantees a linear convergence rate even for non-strongly-concave dual problems.

### 3.2. Duality of Projections Onto Norm Balls

As a first application, we show how the approach from Section 3.1 can be used for efficient projection onto norm balls. The dual of a norm is an important notion that is often used in the analysis of algorithms. Here we show an algorithmic connection between projection onto a norm ball and projection onto its dual ball. Concretely, we show that one can use our framework to obtain an efficient conversion scheme that enables projecting onto a given unit norm ball using an oracle that projects onto its dual norm ball. This applies even if the norms are non-smooth, thus extending our technique beyond constraint sets that can be expressed as an intersection of few smooth constraints. Our approach can also be generalized to general convex sets and their dual (polar) sets.

Given a norm $P : \mathbb{R}^n \to \mathbb{R}$, its dual norm is defined as
$$P^*(x) := \max_{P(z) \le 1} z^\top x, \quad \forall x \in \mathbb{R}^n.$$
As an example, for any $p \ge 1$ the dual of the $\ell_p$-norm is the $\ell_q$-norm with $q = p/(p-1)$. Furthermore, the dual of the spectral norm (over matrices) is the nuclear norm; finally, for a positive definite matrix $A \in \mathbb{R}^{d\times d}$ we can define the induced norm $\|x\|_A = \sqrt{x^\top A x}$, whose dual is $(\|x\|_A)^* = \|x\|_{A^{-1}} = \sqrt{x^\top A^{-1} x}$. Our goal is to project onto the unit norm ball w.r.t. $P(\cdot)$, i.e.,
$$\min_{x \in \mathbb{R}^n : P(x) \le 1}\ \|x_0 - x\|^2. \qquad (5)$$
Next we state our main theorem for this section.

Theorem 3.1. Let $P(\cdot)$ be a norm, and assume that we have an oracle that projects onto its dual norm ball $P^*(\cdot)$. Then we can find an $\varepsilon$-approximate solution to Problem (5) using $O(\log(1/\varepsilon))$ calls to that oracle.

The idea behind this conversion scheme between norm ball projections is to start from the dual formulation of the problem, as in Eq. (3). Interestingly, one can show that in this case the projection oracle onto the dual norm ball enables computing the exact gradients of $d(\lambda)$. This in turn enables finding an approximate solution to the dual problem using only logarithmically many calls to the dual projection oracle. We then show that such a solution can be translated into an approximate primal solution. We elaborate on our approach in Appendix C.

### 3.3. Projecting onto the Intersection of Several Non-Linear Smooth Constraints

In the rest of this section we show how to extend our method from Section 3.1 to problems with several constraints. As in Section 3.1, we solve the dual objective using approximate gradients, which we obtain by running Nesterov's method over the primal variable $x$. Differently from the one-dimensional case, the dual problem is now multi-dimensional, so we cannot use bisection. Instead, we employ cutting plane methods such as the center of gravity method (Levin, 1965; Newman, 1965), the Ellipsoid method (Shor, 1977; Iudin and Nemirovskii, 1977), and Vaidya's method (Vaidya, 1989). These methods are especially attractive in our context, since their convergence rate depends only logarithmically on the accuracy and their runtime is linear in the dimension $n$.
Our main result, Theorem 3.2, states that we find an $\varepsilon$-approximate solution to the projection problem (P1) within a total runtime of $O\big(nm^{3.5}\log(m/\varepsilon) + m^{4}\log(m/\varepsilon)\big)$ if we use the classical Ellipsoid method, and within $O\big(nm^{2.5}\log(m/\varepsilon) + m^{3.5}\log(m/\varepsilon)\big)$ if we use the more sophisticated method of Vaidya (1989).

The Lagrangian of the original problem (P1) is defined as follows: for $x \in \mathbb{R}^n$ and $\lambda^{(1)},\dots,\lambda^{(m)} \ge 0$,
$$L(x,\lambda) := \|x - x_0\|^2 + \lambda^\top h(x),$$
where $\lambda := (\lambda^{(1)},\dots,\lambda^{(m)})$ and $h(x) := (h_1(x),\dots,h_m(x)) \in \mathbb{R}^m$. Defining $d(\lambda) := \min_{x \in \mathbb{R}^n} L(x,\lambda)$, the dual problem is now
$$\max_{\lambda \in \mathbb{R}^m,\ \lambda \ge 0} d(\lambda), \qquad (6)$$
where $\lambda \ge 0$ is an elementwise inequality. Recall that we assume we are given $R \ge 0$ such that $\lambda^* \in \{\lambda : \|\lambda\|_\infty \le R\}$ for some $\lambda^* \in \arg\max_{\lambda \ge 0} d(\lambda)$. Thus, our dual problem can be written as
$$\max_{\lambda \in D} d(\lambda), \qquad \text{(P2)}$$
where $D := \{\lambda \in \mathbb{R}^m : \lambda^{(i)} \in [0, R],\ i \in [m]\}$ and $\lambda^{(i)}$ is the $i$-th component of $\lambda$. Thus, $D$ is an $\ell_\infty$-ball of diameter $R$ centered at $[R/2,\dots,R/2]^\top$.

In the rest of this section we describe and analyze the two components of our fast projection algorithm. In Sec. 3.4 we describe the first component, a cutting plane method that we use to solve the dual objective. In contrast to the standard cutting plane setting, where exact gradient and value oracles are available, we describe and analyze a setting with approximate oracles. Next, in Sec. 3.5 we show how to construct approximate gradient and value oracles using a fast first-order method. Finally, in Sec. 3.6 we show how to combine these components into our fast projection algorithm, which approximately solves the projection problem.

### 3.4. Cutting Plane Scheme with Approximate Oracles

We first describe a general recipe for cutting plane methods, which captures the center of gravity method, the Ellipsoid method, and Vaidya's method, amongst others. Such schemes require access to exact gradient and value oracles for the objective. Unfortunately, in our setting we are only able to devise approximate oracles. To address this issue, we provide a generic analysis showing that every cutting plane method converges even when supplied with approximate oracles. This result may be of independent interest, since it enables the use of cutting plane schemes in ML applications where we often only have access to approximate oracles.

Cutting Plane Scheme: We seek to solve $\max_{\lambda \in D} d(\lambda)$, where $D$ is a compact convex set in $\mathbb{R}^m$ and $d(\cdot)$ is concave. Assume that we may access an exact separation oracle $O_s$ for $D$, that is, for any $\lambda_t \notin D$, $O_s$ outputs $w \in \mathbb{R}^m$ such that $D \subseteq \{\lambda \in \mathbb{R}^m : w^\top(\lambda - \lambda_t) \ge 0\}$. We also assume access to $(\varepsilon_g, \varepsilon_v)$-approximate gradient and value oracles $O_g : D \to \mathbb{R}^m$ and $O_v : D \to \mathbb{R}$ for $d(\cdot)$, meaning
$$\|\nabla d(\lambda) - O_g(\lambda)\| \le \varepsilon_g, \qquad |d(\lambda) - O_v(\lambda)| \le \varepsilon_v.$$
Finally, assume that we are given a point $\lambda_1 = [R/2,\dots,R/2] \in D$ and $R > 0$ such that $D \subseteq M_1 := \{\lambda \in \mathbb{R}^m_+ : \|\lambda - \lambda_1\|_\infty \le R/2\}$. A cutting plane method then works as demonstrated in Alg. 1.

Algorithm 1 (Cutting Plane Method with Approximate Oracles)
- Input: gradient and value oracles $O_g, O_v$ with accuracies $(\varepsilon_g, \varepsilon_v)$, and an exact separation oracle $O_s$.
- For $t \in [T]$:
  - If $\lambda_t \in D$: call the gradient oracle $g_t \leftarrow O_g(\lambda_t)$ and set $w_t = g_t$; else call the separation oracle and set $w_t \leftarrow O_s(\lambda_t)$.
  - Construct $M_{t+1}$ such that $\{\lambda \in M_t : w_t^\top(\lambda - \lambda_t) \ge 0\} \subseteq M_{t+1}$, and choose $\lambda_{t+1} \in M_{t+1}$.
- Output: $\bar\lambda \in \arg\max_{\lambda \in \{\lambda_1,\dots,\lambda_T\}\cap D} O_v(\lambda)$.

Remark 1: The set over which the output of Alg. 1 is chosen is always non-empty, since we have assumed $\lambda_1 \in D$. Cutting plane methods differ from each other in how they construct the sets $M_t$ and choose the query points $\lambda_t$; the classical ellipsoid update sketched below is one such construction.
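As an illustration of one possible construction of the sets $M_t$ in Alg. 1, the sketch below implements the classical central-cut ellipsoid update in numpy. It is our own example rather than the authors' implementation, and, as is standard for the ellipsoid method, it replaces the initial box $M_1$ by a ball that contains it.

```python
import numpy as np

def ellipsoid_cut(c, P, w):
    """One central-cut ellipsoid update (a possible M_t construction in Alg. 1).

    M_t is the ellipsoid {lam : (lam - c)^T P^{-1} (lam - c) <= 1}.  We keep the
    half-space {lam : w^T (lam - c) >= 0} and return the minimum-volume
    ellipsoid (c_new, P_new) containing the intersection.  The volume shrinks
    by a factor of roughly exp(-1/(2(m+1))), i.e. rate theta = O(1/m)."""
    m = len(c)
    Pw = P @ w
    g = Pw / np.sqrt(w @ Pw)              # scaled cut direction
    c_new = c + g / (m + 1)               # move the center into the kept half-space
    P_new = (m**2 / (m**2 - 1.0)) * (P - (2.0 / (m + 1)) * np.outer(g, g))
    return c_new, P_new

# Example: start from a ball containing D = [0, R]^m and apply one cut.
m, R = 3, 2.0
c = np.full(m, R / 2)                     # lambda_1, the center of D
P = (R**2 * m / 4) * np.eye(m)            # ball of radius R*sqrt(m)/2 contains the box D
w = np.array([1.0, -0.5, 0.2])            # e.g. an (approximate) dual gradient g_t
c, P = ellipsoid_cut(c, P, w)
```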
For such methods, the volume of the sets $M_t$ decreases exponentially fast with $t$, which gives rise to linear convergence guarantees in the case of exact gradient and value oracles.

Definition 3.1 ($\theta$-rate cutting plane method). We say that a cutting plane method has rate $\theta > 0$ if the following applies for all $t$: $\mathrm{Vol}(M_t)/\mathrm{Vol}(M_1) \le e^{-\theta t}$, where $\mathrm{Vol}$ is the usual $m$-dimensional volume.

For example, for the center of mass method as well as Vaidya's method we have $\theta = O(1)$; for the Ellipsoid method we have $\theta = O(1/m)$. Our next lemma extends the convergence of cutting plane methods to the case of approximate oracles. Let us first denote by $D_\varepsilon := \{\lambda \in D : d(\lambda) \ge d(\lambda^*) - \varepsilon\}$ the set of all $\varepsilon$-approximate solutions, where $\lambda^* \in \arg\max_{\lambda \in D} d(\lambda)$. We need $D_\varepsilon$ to have nonzero volume to ensure the required accuracy after a sufficient decrease of the volume of $M_t$. Later we show that in our case, with Lipschitz continuous and convex $h_1,\dots,h_m$, the set $D_\varepsilon$ contains an $\ell_\infty$-ball of radius $r(\varepsilon)$ on the order of $\varepsilon/m$ (Corollary 3.1).

Lemma 3.1. Let $\lambda_1 \in D$ and $R > 0$ be such that $D \subseteq \{\lambda : \|\lambda - \lambda_1\|_\infty \le R/2\}$. Given $\varepsilon > 0$, assume that there exists an $\ell_\infty$-ball of diameter $r(\varepsilon) > 0$ contained in $D_\varepsilon$. Assume further that $d(\lambda)$ is concave and that we use the cutting plane scheme of Alg. 1 with oracles satisfying $\varepsilon_g \le \frac{\varepsilon}{R\sqrt m}$ and $\varepsilon_v \le \varepsilon$. Then after $T = O\big(\frac{m}{\theta}\log(R/r(\varepsilon))\big)$ rounds it outputs $\bar\lambda \in D$ such that $\max_{\lambda\in D} d(\lambda) - d(\bar\lambda) \le 4\varepsilon$, where $\theta$ is the rate of the cutting plane method.

Proof. We denote $T_{\text{Active}} = \{t \in [T] : \lambda_t \in D\}$; this set is clearly non-empty since $\lambda_1 \in D$. Also, for any $t \in T_{\text{Active}}$ we denote $g_t := O_g(\lambda_t)$ (note that in this case $w_t = g_t$). We divide the proof into two cases: when some $\lambda_\varepsilon \in D_\varepsilon$ is cut off by a $w_t$ with $\lambda_t \in D$, and when not.

Case 1: Assume that there exist $t \in T_{\text{Active}}$ and $\lambda_\varepsilon \in D_\varepsilon$ such that $w_t^\top(\lambda_\varepsilon - \lambda_t) \le 0$, i.e., $g_t^\top(\lambda_t - \lambda_\varepsilon) \ge 0$. In this case, using the concavity of $d(\cdot)$ and the definitions of $g_t$ and $R$, we get
$$d(\lambda_t) \ge d(\lambda_\varepsilon) + \nabla d(\lambda_t)^\top(\lambda_t - \lambda_\varepsilon) = d(\lambda_\varepsilon) + g_t^\top(\lambda_t - \lambda_\varepsilon) + (\nabla d(\lambda_t) - g_t)^\top(\lambda_t - \lambda_\varepsilon) \ge d(\lambda^*) - \varepsilon - R\sqrt m\cdot\frac{\varepsilon}{R\sqrt m} = d(\lambda^*) - 2\varepsilon,$$
where we used $\|y\|_2 \le \sqrt m\,\|y\|_\infty$ for $y \in \mathbb{R}^m$. Thus,
$$d(\bar\lambda) \ge O_v(\bar\lambda) - \varepsilon \ge O_v(\lambda_t) - \varepsilon \ge d(\lambda_t) - 2\varepsilon \ge d(\lambda^*) - 4\varepsilon.$$

Case 2: Assume that for any $t \in T_{\text{Active}}$ and any $\lambda_\varepsilon \in D_\varepsilon$ we have $w_t^\top(\lambda_\varepsilon - \lambda_t) \ge 0$. Together with the separation oracle guarantee (used whenever $\lambda_t \notin D$), this implies that for all $t \in [T]$ and $\lambda_\varepsilon \in D_\varepsilon$, $w_t^\top(\lambda_\varepsilon - \lambda_t) \ge 0$. Hence $D_\varepsilon \subseteq M_t$ for all $t \in [T]$, implying
$$\forall t \in [T]:\quad \mathrm{Vol}(D_\varepsilon) \le \mathrm{Vol}(M_t). \qquad (7)$$
Next, we show that the above condition can hold only for bounded $T$. Indeed, according to our assumption, $\mathrm{Vol}(D_\varepsilon) \ge \mathrm{Vol}(\ell_\infty\text{-ball of diameter } r(\varepsilon)) = r^m(\varepsilon)$. On the other hand, $\mathrm{Vol}(M_t) \le e^{-\theta t}\,\mathrm{Vol}(\ell_\infty\text{-ball of radius } R/2) = e^{-\theta t}(R/2)^m$. Combining these with Eq. (7) implies that in order to satisfy Case 2 we must have $T \le \frac{m}{\theta}\log\big(R/(2r(\varepsilon))\big)$. Thus, for any $T > \frac{m}{\theta}\log(R/r(\varepsilon))$, Case 1 must hold, which establishes the lemma.

### 3.5. Gradient and Value Oracles for the Dual

Here we show how to efficiently devise gradient and value oracles for the dual objective. Our scheme is described in Alg. 2. Similarly to the one-dimensional case, given $\lambda$ we approximately minimize $L(\cdot,\lambda)$, which enables us to derive approximate gradient and value oracles. The guarantees of Alg. 2 are given in Lemma 3.3. Before stating them, we derive a closed-form formula for $\nabla d(\lambda)$. Recall that $d(\lambda) = \min_{x \in \mathbb{R}^n} L(x,\lambda)$ and that $L(x,\lambda)$ is $2$-strongly convex in $x$. This implies that the minimizer of $L(\cdot,\lambda)$ is unique, and we therefore denote
$$x^*_\lambda := \arg\min_{x \in \mathbb{R}^n} L(x,\lambda). \qquad (8)$$
The next lemma shows that we can compute $\nabla d(\lambda)$ based on $x^*_\lambda$, and states the smoothness of $d(\lambda)$.
Algorithm 2 (Approximate gradient and value oracles for $d(\cdot)$)
- Input: $\lambda \ge 0$, target accuracy $\tilde\varepsilon$.
- Compute $x_\lambda$, an $\tilde\varepsilon$-optimal solution of $\min_{x \in \mathbb{R}^n} L(x,\lambda) := \|x - x_0\|^2 + \lambda^\top h(x)$. Method: Nesterov's AGD (Alg. 3) with $\alpha = 2$, $\beta = 2 + \|\lambda\|_1 L$, and $T = O\big(\sqrt\beta\log(\beta B/\tilde\varepsilon)\big)$.
- Let $v := \|x_\lambda - x_0\|^2 + \lambda^\top h(x_\lambda)$ and $g := h(x_\lambda)$.
- Output: $(x_\lambda, g, v)$.

Algorithm 3 (Accelerated Gradient Descent (AGD), Nesterov (1998))
- Input: $F : \mathbb{R}^n \to \mathbb{R}$, $x_0 \in \mathbb{R}^n$, number of iterations $T$, strong convexity $\alpha$, smoothness $\beta$.
- Set: $y_0 = x_0$, $\kappa := \beta/\alpha$.
- For $t = 0,\dots,T-1$: $\ y_{t+1} = x_t - \frac{1}{\beta}\nabla F(x_t)$, and $x_{t+1} = \Big(1 + \frac{\sqrt\kappa - 1}{\sqrt\kappa + 1}\Big)y_{t+1} - \frac{\sqrt\kappa - 1}{\sqrt\kappa + 1}\,y_t$.
- Output: $y_T$.

Lemma 3.2. For any $\lambda \ge 0$ the following holds: (i) $\nabla d(\lambda) = h(x^*_\lambda)$, and for all $\lambda_1, \lambda_2 \ge 0$, $\|\nabla d(\lambda_1) - \nabla d(\lambda_2)\| \le m G^2\|\lambda_1 - \lambda_2\|$ and $\|x^*_{\lambda_1} - x^*_{\lambda_2}\| \le \sqrt m\,G\|\lambda_1 - \lambda_2\|$. Also, (ii) $d(\lambda^*) - d(\lambda) \le m^2 G^2\|\lambda - \lambda^*\|^2 + \sqrt m\,H\|\lambda - \lambda^*\|$. Moreover, (iii) $x^* = x^*_{\lambda^*}$, where $\lambda^*, x^*$ are the optimal solutions of the dual and primal problems.

The proof is quite technical and can be found in Appendix D.1. From the above lemma we can show that for any $\varepsilon$ there exists an $\ell_\infty$-ball of a sufficiently large radius $r(\varepsilon)$ contained in the set of $\varepsilon$-optimal solutions to the dual problem in $D$.

Corollary 3.1. Let $\varepsilon \in [0,1]$. Then there exists an $\ell_\infty$-ball of radius $r(\varepsilon) := (2m)^{-1}\min\{\varepsilon/H, \varepsilon/G\}$ that is contained in the set of $\varepsilon$-optimal solutions within $D$.

The proof is in Appendix D.2. Eq. (8) together with Lemma 3.2 suggests that exactly minimizing $L(\cdot,\lambda)$ yields exact gradient and value oracles for $d(\cdot)$. In Alg. 2 we do so approximately, and the next lemma shows that this translates into approximate oracles.

Lemma 3.3. Given $\lambda \ge 0$, Alg. 2 outputs $(x_\lambda, g, v)$ such that: (i) $\|g - \nabla d(\lambda)\| \le \sqrt m\,G\sqrt{\tilde\varepsilon}$; (ii) $\|x_\lambda - x^*_\lambda\|^2 \le \tilde\varepsilon$; and (iii) $|v - d(\lambda)| \le \tilde\varepsilon$. Additionally, Alg. 2 requires $T_{\text{Internal}} = O\big(\sqrt{1 + mRL}\,\log(m/\tilde\varepsilon)\big)$ queries to the gradients of $h(\cdot)$, and its total runtime is $O(nm\,T_{\text{Internal}}) \le O\big(nm^{3/2}\log(m/\tilde\varepsilon)\big)$.

The proof is in Appendix D.3. The proof of the first part is based on the $2$-strong-convexity of $L(\cdot,\lambda)$ and the $G$-Lipschitz continuity of $h(\cdot)$. The second part also uses the convergence rate of Nesterov's AGD (Nesterov, 1998), described in Theorem A.1 in the appendix. Using the notation of the cutting plane method (Alg. 1), we can think of Alg. 2 as a procedure that receives $\lambda \ge 0$ and returns a gradient oracle $O_g(\lambda) := g$, a value oracle $O_v(\lambda) := v$, and a primal solution oracle $O_x(\lambda) := x_\lambda$.

Remark: Notice that scaling the constraints $h$ by a factor $\alpha > 0$ leaves the constraint set unchanged, while scaling the smoothness $L$ by a factor of $\alpha$. Nonetheless, this naturally also scales the bound on the Lagrange multipliers, $R$, by a factor of $1/\alpha$. Lemma 3.3 tells us that the runtime of our algorithm, $T_{\text{Internal}}$, depends only on the product $RL$ and is therefore invariant to such scaling.

### 3.6. Fast Projection Algorithm

Below we describe how to compose the two components presented in Sections 3.4 and 3.5 into a complete algorithm for solving the projection problem (P1).

Algorithm 4 (Fast Projection Method)
- Input: accuracy parameter $\tilde\varepsilon > 0$, $\lambda_1 \in D$, number of rounds $T$.
- (1) For any $\lambda \in \mathbb{R}^m$ define three oracles, $O_g(\lambda) := g$, $O_v(\lambda) := v$, and $O_x(\lambda) := x_\lambda$, according to the output $(x_\lambda, g, v)$ of Alg. 2 with inputs $\lambda$ and $\tilde\varepsilon$.
- (2) Define the separation oracle $O_s(\lambda)_i := -1$ if $\lambda^{(i)} > R$; $0$ if $\lambda^{(i)} \in (0, R)$; $+1$ if $\lambda^{(i)} < 0$.
- (3) Employ a cutting plane method as in Alg. 1 to solve the dual problem $\max_{\lambda\in D} d(\lambda)$.
- Output: $\bar\lambda \in \arg\max_{\lambda \in \{\lambda_1,\dots,\lambda_T\}\cap D} O_v(\lambda)$, and $\tilde x = O_x(\bar\lambda)$.

Our method in Alg. 4 employs a cutting plane scheme (Alg. 1), while using Alg. 2 to devise the gradient and value oracles for $d(\cdot)$.
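The following sketch mirrors Algorithms 2 and 3 as reconstructed above: AGD is run on $L(x,\lambda) = \|x - x_0\|^2 + \lambda^\top h(x)$ and the approximate oracle returns $(x_\lambda, g, v)$. The function names, the bound `B`, and the toy constraints in the usage example are our assumptions.

```python
import numpy as np

def agd(grad, x_start, alpha, beta, T):
    """Nesterov's AGD (Alg. 3) for an alpha-strongly convex, beta-smooth F."""
    kappa = beta / alpha
    momentum = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    x, y = x_start.copy(), x_start.copy()
    for _ in range(T):
        y_next = x - grad(x) / beta                     # gradient step
        x = (1.0 + momentum) * y_next - momentum * y    # momentum step
        y = y_next
    return y

def dual_oracles(lam, x0, h_funcs, grad_h_funcs, L, eps_tilde, B=10.0):
    """Approximate gradient/value oracles for d(lam) (Alg. 2).

    Runs AGD on L(x, lam) = ||x - x0||^2 + lam^T h(x), which is 2-strongly
    convex and (2 + ||lam||_1 L)-smooth, and returns (x_lam, g, v) with
    g ~ grad d(lam) = h(x_lam) and v ~ d(lam)."""
    lam = np.asarray(lam, dtype=float)
    h = lambda x: np.array([hi(x) for hi in h_funcs])
    def grad_L(x):
        return 2.0 * (x - x0) + sum(l * gh(x) for l, gh in zip(lam, grad_h_funcs))
    alpha, beta = 2.0, 2.0 + np.sum(np.abs(lam)) * L
    T = int(np.ceil(np.sqrt(beta) * np.log(beta * B / eps_tilde))) + 1
    x_lam = agd(grad_L, x0, alpha, beta, T)
    g = h(x_lam)                                        # approximate gradient of d
    v = np.dot(x_lam - x0, x_lam - x0) + lam @ g        # approximate value of d
    return x_lam, g, v

# Toy usage with two quadratic constraints h_i(x) = ||x - c_i||^2 - 1:
n = 20
rng = np.random.default_rng(1)
c1, c2 = rng.standard_normal(n), rng.standard_normal(n)
h_funcs = [lambda x, c=c1: (x - c) @ (x - c) - 1.0,
           lambda x, c=c2: (x - c) @ (x - c) - 1.0]
grad_h_funcs = [lambda x, c=c1: 2.0 * (x - c), lambda x, c=c2: 2.0 * (x - c)]
x_lam, g, v = dual_oracles([0.5, 0.5], rng.standard_normal(n),
                           h_funcs, grad_h_funcs, L=2.0, eps_tilde=1e-6)
```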
Next we discuss the role of the primal solution oracle $O_x$ and connect it to our overall projection scheme. Recall that the cutting plane method used above finds $\bar\lambda$, an approximate solution to the dual problem. To extract a primal solution from the dual solution $\bar\lambda$, it makes sense to approximately solve $\min_{x \in \mathbb{R}^n} L(x,\bar\lambda)$, and this is exactly what the oracle $O_x$ provides (see Alg. 2). Next we state the guarantees of the above scheme.

Theorem 3.2. Let $\varepsilon > 0$, and consider the projection problem (P1) and its dual formulation in Eq. (6). Then, upon invoking the scheme in Alg. 4 with $\tilde\varepsilon = \frac{\varepsilon^4}{256(mRG)^6}$ and $T = O\big(\frac{m}{\theta}\log(mR/\varepsilon)\big)$, it outputs $\tilde x$ such that for every $x \in K := \{x : h_i(x) \le 0,\ i \in [m]\}$,
$$\|\tilde x - x_0\|^2 \le \|x - x_0\|^2 + 6\varepsilon, \quad\text{and}\quad h_i(\tilde x) \le \varepsilon,\ \forall i \in [m].$$
Moreover, the total runtime of our method is $O\big(nm^{2.5}\theta^{-1}\log^2(m/\varepsilon) + \tau_{CP}(m)\,m\,\theta^{-1}\log(mR/\varepsilon)\big)$, where $\theta$ is the rate of the cutting plane method (Def. 3.1), and $\tau_{CP}(m)$ is the extra runtime required by the cutting plane method for updating the sets $M_t$ beyond calling the gradient and value oracles.

Let us discuss two choices of the cutting plane method. Ellipsoid method: in this case $\theta = O(1/m)$ and $\tau_{CP}(m) = O(m^2)$, so the total runtime of our scheme is $O\big(nm^{3.5}\log(m/\varepsilon) + m^4\log(m/\varepsilon)\big)$. Vaidya's method: in this case $\theta = O(1)$ and $\tau_{CP}(m) = O(m^{2.5})$, so the total runtime is $O\big(nm^{2.5}\log(m/\varepsilon) + m^{3.5}\log(m/\varepsilon)\big)$.

Proof of Thm. 3.2. First notice that we may apply the cutting plane method of Alg. 1, since $D$ is an $\ell_\infty$-ball of diameter $R$, so we can set $M_1 := D$ and $\lambda_1$ as its center. Moreover, according to Corollary 3.1, for any $\bar\varepsilon \ge 0$ there exists $r$ on the order of $\bar\varepsilon/m$ such that an $\ell_\infty$-ball of radius $r$ is contained in the set of $\bar\varepsilon$-optimal solutions to the dual problem in $D$. Let us denote $\bar\varepsilon := \big(\frac{\varepsilon}{4mRG}\big)^2$, and notice that we can then write $\tilde\varepsilon = \big(\frac{\bar\varepsilon}{mRG}\big)^2$. Now, setting $\tilde\varepsilon$ as the accuracy parameter of Alg. 2, it follows from Lemma 3.3 that it generates gradient and value oracles with accuracies
$$\varepsilon_g \le \sqrt m\,G\sqrt{\tilde\varepsilon} = \frac{\bar\varepsilon}{R\sqrt m}, \qquad \varepsilon_v \le \tilde\varepsilon \le \bar\varepsilon.$$
Applying Lemma 3.1 with these accuracies implies that within $T = \frac{m}{\theta}\log(mR/\varepsilon)$ calls to these approximate oracles, the method outputs a solution $\bar\lambda$ such that $d(\lambda^*) - d(\bar\lambda) \le 4\bar\varepsilon$. Next we show that this guarantee on the dual translates into a guarantee for $\tilde x$ with respect to the original primal problem (P1). We will require the following lemma, proved in Appendix D.4.

Lemma 3.4. Let $F : \mathbb{R}^m \to \mathbb{R}$ be an $L$-smooth and concave function, let $\lambda^* = \arg\max_{\lambda \in D} F(\lambda)$, and let $D$ be a convex subset of $\mathbb{R}^m$. Then,
$$\|\nabla F(\lambda) - \nabla F(\lambda^*)\|^2 \le 2L\,\big(F(\lambda^*) - F(\lambda)\big), \quad \forall \lambda \in D.$$

Using the above lemma together with the $mG^2$-smoothness of $d(\cdot)$ (Lemma 3.2) implies
$$\|\nabla d(\bar\lambda) - \nabla d(\lambda^*)\| \le \sqrt{8mG^2\bar\varepsilon}. \qquad (9)$$
Now, using $g := h(\tilde x)$ (Alg. 2), $\|g - \nabla d(\bar\lambda)\| \le \sqrt m\,G\sqrt{\tilde\varepsilon}$ (Lemma 3.3), and $\nabla d(\lambda^*) = h(x^*)$ (Lemma 3.2), we conclude from Eq. (9) that
$$\|h(\tilde x) - h(x^*)\| \le \|g - \nabla d(\bar\lambda)\| + \|\nabla d(\bar\lambda) - \nabla d(\lambda^*)\| \le \sqrt{16mG^2\bar\varepsilon}, \qquad (10)$$
where we used $\tilde\varepsilon \le \bar\varepsilon$. The above implies that for all $i \in [m]$,
$$h_i(\tilde x) = h_i(x^*) + \big(h_i(\tilde x) - h_i(x^*)\big) \le h_i(x^*) + |h_i(\tilde x) - h_i(x^*)| \le 0 + \|h(\tilde x) - h(x^*)\| \le \sqrt m\,G\sqrt{16\bar\varepsilon} \le \varepsilon,$$
where the second inequality uses the feasibility of $x^*$, and the last inequality uses the definition of $\bar\varepsilon$ (recall that we assume $R \ge 1$). This concludes the first part of the proof. Moreover, from Eq. (10) we also get
$$-\bar\lambda^\top h(\tilde x) = -\bar\lambda^\top\big(h(\tilde x) - h(x^*)\big) - (\bar\lambda - \lambda^*)^\top h(x^*) - (\lambda^*)^\top h(x^*) \le \|\bar\lambda\|\sqrt{16mG^2\bar\varepsilon} + \nabla d(\lambda^*)^\top(\lambda^* - \bar\lambda) + 0 \le 4mRG\sqrt{\bar\varepsilon} + d(\lambda^*) - d(\bar\lambda) \le \varepsilon + 4\bar\varepsilon \le 5\varepsilon, \qquad (11)$$
where the first inequality uses Eq. (10) as well as $h(x^*) = \nabla d(\lambda^*)$ (Lemma 3.2) and complementary slackness, which implies $(\lambda^*)^\top h(x^*) = 0$; the second inequality uses $\|\bar\lambda\| \le \sqrt m\,R$ (since $\bar\lambda \in D$) and the concavity of $d(\cdot)$, which implies $d(\lambda^*) - d(\bar\lambda) \ge \nabla d(\lambda^*)^\top(\lambda^* - \bar\lambda)$; and the last line uses the definition of $\bar\varepsilon$, the dual guarantee $d(\lambda^*) - d(\bar\lambda) \le 4\bar\varepsilon$, and $\bar\varepsilon \le \varepsilon$. Using Eq. (11) together with the $\tilde\varepsilon$-optimality of $\tilde x$ with respect to $L(\cdot,\bar\lambda)$ (Alg. 2) implies that for every $x \in K := \{x : h_i(x) \le 0,\ i \in [m]\}$ we have
$$\|x - x_0\|^2 \ge \|x - x_0\|^2 + \bar\lambda^\top h(x) \ge \|\tilde x - x_0\|^2 + \bar\lambda^\top h(\tilde x) - \tilde\varepsilon \ge \|\tilde x - x_0\|^2 - 6\varepsilon,$$
where we used $\tilde\varepsilon \le \varepsilon$, $\bar\lambda \ge 0$, and $h(x) \le 0$; equivalently, $\|\tilde x - x_0\|^2 \le \|x - x_0\|^2 + 6\varepsilon$. This concludes the proof.

Runtime: for a single $t \in [T]$ we invoke Alg. 2, whose runtime is $O(nm^{1.5}\log(m/\varepsilon))$ (Lemma 3.3), plus an additional $\tau_{CP}$ for updating $M_t$. Multiplying this by $T$ we get a runtime of $O\big(nm^{2.5}\theta^{-1}\log^2(m/\varepsilon) + \tau_{CP}\,m\,\theta^{-1}\log(m/\varepsilon)\big)$. Also, every call to the separation oracle for $D$ takes $O(m)$, which is negligible compared to computing the gradient and value oracles.

Note that inside our algorithm we could use not only Vaidya's and the Ellipsoid method, but any other cutting plane scheme. For example, the faster cutting plane methods proposed by Lee et al. (2015) and Jiang et al. (2020) can be used as well.

## 4. EXPERIMENTAL EVALUATION

### 4.1. Synthetic Problem

We first demonstrate the performance of our approach on synthetic problems of projection onto randomly generated quadratic sets and onto their intersection:
$$\min_{x \in \mathbb{R}^n}\ \|x - x_p\|^2 \quad \text{subject to}\quad (x - x_i)^\top A_i(x - x_i) - 1 \le 0,\ i = 1,\dots,m. \qquad (12)$$
The matrices $A_i$ are generated randomly in such a way that they are positive definite and have norm equal to 1. We compare our approach with the Interior Point Method (IPM) from the MOSEK solver, as well as with SLSQP from the scipy.optimize.minimize package. For Algorithm 2, we solve the primal subproblems with the AGD method as described before. We select the smoothness parameter $L$ based on the norms of the matrices $A_i$, and tune the parameter $R$ empirically using the doubling trick. The run-times are shown in Table 1; they are averaged over 5 runs of the method on random inputs, and the accuracy is fixed to $10^{-4}$. The results demonstrate a substantial performance improvement of our fast projection approach as the dimensionality increases. The runtime in seconds is not a perfect performance measure, but it is the most reasonable measure we could think of: comparing the number of iterations hides the complexity of each iteration, which might be huge for interior point methods.
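For concreteness, one way to generate such synthetic instances is sketched below; the unit-norm normalization follows the description above, while the choice of centers and the right-hand side of the constraints are our assumptions.

```python
import numpy as np

def random_quadratic_constraints(n, m, seed=0):
    """Generate m random positive-definite matrices A_i with unit spectral norm
    and centers x_i, defining h_i(x) = (x - x_i)^T A_i (x - x_i) - 1 <= 0."""
    rng = np.random.default_rng(seed)
    constraints = []
    for _ in range(m):
        M = rng.standard_normal((n, n))
        A = M @ M.T + n * np.eye(n)            # positive definite
        A /= np.linalg.norm(A, 2)              # normalize: ||A_i||_2 = 1
        x_i = 0.1 * rng.standard_normal(n)     # centers near the origin so the
                                               # ellipsoids intersect (our choice)
        h = lambda x, A=A, c=x_i: (x - c) @ A @ (x - c) - 1.0
        grad_h = lambda x, A=A, c=x_i: 2.0 * A @ (x - c)
        constraints.append((h, grad_h))
    return constraints

# The point to project, x_p, is drawn outside the intersection:
n, m = 1000, 2
cons = random_quadratic_constraints(n, m)
x_p = 5.0 * np.random.default_rng(1).standard_normal(n)
print([h(x_p) for h, _ in cons])               # typically > 0, i.e. x_p is infeasible
```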
### 4.2. Learning the Kernel Matrix in Discriminant Analysis via QCQP (Kim et al., 2006; Ye et al., 2007; Basu et al., 2017)

We next consider an application in multiple kernel learning. Consider a standard binary classification setup where $\mathcal X$, a subset of $\mathbb{R}^n$, denotes the input space, and $\mathcal Y = \{-1,+1\}$ denotes the output (class label) space. We assume that the examples are drawn independently from a fixed unknown probability distribution over $\mathcal X \times \mathcal Y$. We model our data with positive definite kernel functions (Schölkopf et al., 2018). In particular, for any $x_1,\dots,x_n \in \mathcal X$, the Gram matrix, defined by $G_{jk} = K(x_j, x_k)$, is positive semidefinite. Let $X = [x_1^+,\dots,x_{n_+}^+, x_1^-,\dots,x_{n_-}^-]$ be a data matrix of size $n = n_+ + n_-$, where $\{x_1^+,\dots,x_{n_+}^+\}$ and $\{x_1^-,\dots,x_{n_-}^-\}$ are the data points from the positive and negative classes. For binary classification, the problem of kernel learning for discriminant analysis seeks, given a set of $p$ kernel matrices $G_i = K_i(x_j, x_k)$, $x_j, x_k \in X$, $i \in [p]$, $G_i \in \mathbb{R}^{n\times n}$, to learn an optimal linear combination $G \in \mathcal G = \big\{G \mid G = \sum_{i=1}^p \theta_i G_i,\ \sum_{i=1}^p \theta_i = 1,\ \theta_i \ge 0\big\}$.

This problem was introduced by Fung et al. (2004), reformulated as an SDP by Kim et al. (2006), and as a much more tractable QCQP by Ye et al. (2007). The latter approach learns an optimal kernel matrix $G \in \mathcal G = \big\{G \mid G = \sum_{i=1}^p \theta_i \tilde G_i,\ \sum_{i=1}^p \theta_i r_i = 1,\ \theta_i \ge 0\big\}$, where $\tilde G_i = G_i P G_i$, $r_i = \mathrm{Trace}(\tilde G_i)$, $P = I - \frac1n 1_n 1_n^\top$, and $1_n$ is the vector of all ones of size $n$, by solving the following convex QCQP:
$$\max_{\beta, t}\ -\tfrac14\beta^\top\beta + \beta^\top a - \lambda t \qquad (13)$$
$$\text{subject to}\quad t \ge \tfrac{1}{r_i}\beta^\top\tilde G_i\beta,\quad i = 1,\dots,p, \qquad (14)$$
where $a = [1/n_+,\dots,1/n_+,\,-1/n_-,\dots,-1/n_-] \in \mathbb{R}^n$ and $\beta \in \mathbb{R}^n$. Hereby $\lambda$ is a regularization parameter that we set to $\lambda = 10^{-4}$. The optimal $\theta$ corresponds to the dual solution of the above problem (13). Note that in this application the number of data points $n$ is much larger than the number of constraints (i.e., the number of kernel matrices), making it ideally suited for our approach.

We run our algorithm for this problem over $\beta$ with fixed $t = 5\cdot 10^{-8}$. The problem then becomes strongly convex:
$$\arg\max_\beta\ -\tfrac14\beta^\top\beta + \beta^\top a - \lambda t = \arg\max_\beta\ -\tfrac14\big(\beta^\top\beta - 4\beta^\top a + 4a^\top a\big) = \arg\max_\beta\ -\tfrac14\|\beta - 2a\|_2^2.$$
We use the cod-rna dataset (Uzilov et al., 2006) from LIBSVM with $n = 4000, 10000, 11000$ data points and compare the results and the running time with the IPM. We focus on learning a convex combination of $m$ Gaussian kernels $K(x,z) = \sum_{i=1}^m \theta_i e^{-\|x - z\|^2/\sigma_i^2}$ with different bandwidth parameters $\sigma_i$, chosen uniformly on the logarithmic scale over the interval $[10^{-1}, 10^2]$, as in (Kim et al., 2006; Ye et al., 2007). Results are shown in Table 2 below. Moreover, for the kernel learning problem with $\tilde\varepsilon = 500\,\varepsilon^2$, we present the results for the IPM and Fast Projection algorithms for $m = 3$, $n = 11000$ as a function of the target accuracy in Table 3.

Note that quadratic constraints do not satisfy the Lipschitz continuity assumption on the whole of $\mathbb{R}^n$. However, Lipschitz continuity holds on any compact subset of $\mathbb{R}^n$, and since the AGD algorithm keeps its iterates in a compact set, this suffices. Moreover, the Lipschitz constant $G$ itself is needed only to specify the accuracy $\tilde\varepsilon$ for AGD, and it affects the runtime of AGD only logarithmically. The parameter $H$ need not be known, since it only influences the upper bound on the runtime of the ellipsoid method.²

² The experiments were run on a machine with an Intel Core i7-7700K CPU and 64Gb RAM.

## 5. CONCLUSION

We proposed a novel method for fast projection onto smooth convex constraints. We employ a primal-dual approach and combine cutting plane schemes with Nesterov's accelerated gradient descent. We analyze its performance and prove its effectiveness in high-dimensional settings with a small number of constraints. The results generalize to any strongly convex objective with smooth convex constraints. Our work demonstrates the applicability of cutting plane algorithms in the field of machine learning and can potentially improve the efficiency of solving high-dimensional constrained optimization problems. Enforcing constraints can be of crucial importance for ensuring the reliability and safety of machine learning systems.

## Acknowledgements

We thank the reviewers for the helpful comments. This project received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 815943), the Swiss National Science Foundation under the grant SNSF 200021_172781 and under NCCR Automation (grant agreement 51NF40_180545), as well as the Israel Science Foundation (grant No. 447/20).

Table 1. Run-times (in seconds). Hereby, n is the dimensionality of the problem; the number of constraints is 2.
| m = 2, n: | 10 | 100 | 500 | 1000 | 2000 | 5000 | 8000 | 10000 | 12000 |
|---|---|---|---|---|---|---|---|---|---|
| SLSQP | 0.011 | 0.384 | 3.481 | 9.757 | 47.143 | 573.980 | - | - | - |
| IPM | 0.059 | 0.073 | 0.577 | 2.427 | 11.850 | 118.414 | 408.137 | 751.216 | 901.878 |
| Fast Proj | 0.416 | 2.429 | 3.746 | 15.504 | 22.482 | 141.704 | 350.681 | 547.240 | 666.231 |
| ADMM | 23.761 | 92.836 | 285.383 | - | - | - | - | - | - |

Table 2. Run-times (in seconds). Hereby, n is the number of data points (dimensionality); the number of kernels (number of constraints) is 3. For large problems, our approach outperforms the IPM.

| m = 3, n: | 4000 | 10000 | 11000 |
|---|---|---|---|
| Fast Proj | 230.281 | 768.086 | 1216.9440 |
| IPM | 75.133 | 906.631 | 1302.088 |

Table 3. Run-times (in seconds). Hereby, ε is the target accuracy in objective value; the number of kernels (number of constraints) is 3. For large problems and smaller accuracies, our approach outperforms the IPM.

| ε | $10^{-6}$ | $10^{-7}$ | $10^{-8}$ |
|---|---|---|---|
| Fast Proj | 83.2381 | 519.7166 | 1216.9440 |
| IPM | 1011.5410 | 1070.3363 | 1302.0882 |

## References

- Aholt, C., Agarwal, S., and Thomas, R. (2012). A QCQP approach to triangulation. In European Conference on Computer Vision, pages 654–667. Springer.
- Allen-Zhu, Z., Hazan, E., Hu, W., and Li, Y. (2017). Linear convergence of a Frank-Wolfe type algorithm over trace-norm balls. In Advances in Neural Information Processing Systems, pages 6192–6201.
- Altman, E. (1999). Constrained Markov decision processes.
- Arora, S., Hazan, E., and Kale, S. (2005). Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05), pages 339–348. IEEE.
- Basu, K., Saha, A., and Chatterjee, S. (2017). Large-scale quadratically constrained quadratic program via low-discrepancy sequences. In Advances in Neural Information Processing Systems, pages 2297–2307.
- Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
- Condat, L. (2016). Fast projection onto the simplex and the ℓ1 ball. Mathematical Programming, 158(1):575–585.
- Eckstein, J. and Yao, W. (2012). Augmented Lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results. RUTCOR Research Reports, 32(3).
- Frank, M. and Wolfe, P. (1956). An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110.
- Fung, G., Dundar, M., Bi, J., and Rao, B. (2004). A fast iterative algorithm for Fisher discriminant using heterogeneous kernels. In Proceedings of the Twenty-First International Conference on Machine Learning, page 40.
- Garber, D. (2016). Faster projection-free convex optimization over the spectrahedron. In Advances in Neural Information Processing Systems, pages 874–882.
- Garber, D. and Hazan, E. (2013). Playing non-linear games with linear oracles. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 420–428. IEEE.
- Garber, D. and Hazan, E. (2015). Faster rates for the Frank-Wolfe method over strongly-convex sets. In International Conference on Machine Learning, pages 541–549.
- Garber, D. and Meshi, O. (2016). Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In Advances in Neural Information Processing Systems, pages 1001–1009.
- Giselsson, P. and Boyd, S. P. (2014). Metric selection in Douglas-Rachford splitting and ADMM. arXiv: Optimization and Control.
- Goldstein, T., O'Donoghue, B., Setzer, S., and Baraniuk, R. (2014). Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588–1623.
- Gustavo, C., Wohlberg, B., and Rodriguez, P. (2018). Fast projection onto the ℓ∞,ℓ1-mixed norm ball using Steffensen root search. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4694–4698. IEEE.
- He, B. and Yuan, X. (2012). On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709.
- Hong, M. and Luo, Z. (2017). On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 162:165–199.
- Huang, Y. and Palomar, D. P. (2014). Randomized algorithms for optimal solutions of double-sided QCQP with applications in signal processing. IEEE Transactions on Signal Processing, 62(5):1093–1108.
- Iudin, D. and Nemirovskii, A. (1977). Evaluation of the informational complexity of mathematical programming problems. Matekon, 13(2):3–25.
- Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML (1), pages 427–435.
- Jiang, H., Lee, Y. T., Song, Z., and Wong, S. C.-w. (2020). An improved cutting plane method for convex optimization, convex-concave games, and its applications. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 944–953.
- Jin, Y. and Sidford, A. (2020). Efficiently solving MDPs with stochastic mirror descent. In International Conference on Machine Learning, pages 4890–4900. PMLR.
- Juditsky, A. (2015). Convex optimization II: Algorithms. https://ljk.imag.fr/membres/Anatoli.Iouditski/cours/convex/chapitre_22.pdf.
- Karmarkar, N. (1984). A new polynomial-time algorithm for linear programming. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pages 302–311.
- Kim, S.-J., Magnani, A., and Boyd, S. (2006). Optimal kernel selection in kernel Fisher discriminant analysis. In Proceedings of the 23rd International Conference on Machine Learning, pages 465–472.
- Lacoste-Julien, S. and Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pages 496–504.
- Lan, G., Pokutta, S., Zhou, Y., and Zink, D. (2017). Conditional accelerated lazy stochastic gradient descent. In International Conference on Machine Learning, pages 1965–1974.
- Lan, G. and Zhou, Y. (2016). Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 26(2):1379–1409.
- Lee, Y. T., Sidford, A., and Wong, S. C.-w. (2015). A faster cutting plane method and its implications for combinatorial and convex optimization. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 1049–1065. IEEE.
- Levin, A. Y. (1965). An algorithm for minimizing convex functions. In Doklady Akademii Nauk, volume 160, pages 1244–1247. Russian Academy of Sciences.
- Levy, K. Y. and Krause, A. (2019). Projection free online learning over smooth sets. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS).
- Li, Q. and Li, X. (2020). Fast projection onto the ordered weighted ℓ1 norm ball. arXiv preprint arXiv:2002.05004.
- Li, Y., Cao, X., and Chen, H. (2020). Fully projection-free proximal stochastic gradient method with optimal convergence rates. IEEE Access, 8:165904–165912.
- Mahdavi, M., Yang, T., Jin, R., Zhu, S., and Yi, J. (2012). Stochastic gradient descent with only one projection. In Advances in Neural Information Processing Systems, pages 494–502.
- Nemirovski, A. S. and Todd, M. J. (2008). Interior-point methods for optimization. Acta Numerica, 17:191–234.
- Nesterov, Y. (1998). Introductory Lectures on Convex Programming, Volume I: Basic Course.
- Newman, D. J. (1965). Location of the maximum on unimodal surfaces. Journal of the ACM (JACM), 12(3):395–398.
- Nishihara, R., Lessard, L., Recht, B., Packard, A., and Jordan, M. I. (2015). A general analysis of the convergence of ADMM. arXiv preprint arXiv:1502.02009.
- Peters, B. and Herrmann, F. J. (2019). Algorithms and software for projections onto intersections of convex and non-convex sets with applications to inverse problems. arXiv preprint arXiv:1902.09699.
- Plotkin, S. A., Shmoys, D. B., and Tardos, É. (1995). Fast approximation algorithms for fractional packing and covering problems.
- Schmidt, M., Berg, E., Friedlander, M., and Murphy, K. (2009). Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In Artificial Intelligence and Statistics, pages 456–463.
- Schölkopf, B., Smola, A. J., and Bach, F. (2018). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press.
- Shor, N. Z. (1977). Cut-off method with space extension in convex programming problems. Cybernetics, 13(1):94–96.
- Uzilov, A. V., Keegan, J. M., and Mathews, D. H. (2006). Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics, 7(173).
- Vaidya, P. M. (1989). A new algorithm for minimizing convex functions over convex sets. In 30th Annual Symposium on Foundations of Computer Science, pages 338–343. IEEE.
- Xu, Z., Taylor, G., Li, H., Figueiredo, M. A. T., Yuan, X., and Goldstein, T. (2017). Adaptive consensus ADMM for distributed optimization. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3841–3850. PMLR.
- Yang, T., Lin, Q., and Zhang, L. (2017). A richer theory of convex constrained optimization with reduced projections and improved rates. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3901–3910. JMLR.org.
- Ye, J., Ji, S., and Chen, J. (2007). Learning the kernel matrix in discriminant analysis via quadratically constrained quadratic programming. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 854–863.
- Zhu, X., Kandola, J., Lafferty, J., and Ghahramani, Z. (2006). Graph kernels by spectral transforms. Semi-supervised Learning, pages 277–291.