# Dueling Convex Optimization

Aadirupa Saha¹, Tomer Koren², Yishay Mansour²

¹Microsoft Research, New York City. ²Blavatnik School of Computer Science, Tel Aviv University, and Google Research Tel Aviv. Correspondence to: Aadirupa Saha.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

## Abstract

We address the problem of convex optimization with preference (dueling) feedback. As in the traditional optimization objective, the goal is to find the optimal point with the least possible query complexity, however without the luxury of even zeroth-order feedback: the learner observes only a single noisy bit, the win-loss outcome of a duel between a pair of queried points, determined by their function values. The problem is of great practical relevance, since in many real-world scenarios, such as recommender systems or learning from customer preferences, the system feedback is often restricted to one-bit preference information. We consider the problem of online convex optimization (OCO) solely by actively querying $\{0,1\}$ noisy-comparison feedback on pairs of decision points, with the objective of finding a near-optimal point (function minimizer) with the least possible number of queries. For the non-stationary OCO setup, where the underlying convex function may change over time, we prove an impossibility result for this objective. We then focus on the stationary OCO problem, and our main contribution lies in designing a normalized-gradient-descent-based algorithm for finding an $\epsilon$-optimal point. The algorithm is shown to yield a convergence rate of $O\big(\tfrac{d\beta}{\epsilon\nu^2}\big)$ ($\nu$ being the noise parameter) when the underlying function is $\beta$-smooth. Further, we show an improved convergence rate of just $O\big(\tfrac{d\beta}{\alpha\nu^2}\log\tfrac{1}{\epsilon}\big)$ when the function is additionally $\alpha$-strongly convex.

## 1. Introduction

Online convex optimization is a very well studied field of research where the goal is to optimize a convex function by sequentially accessing the function information at actively queried points (Flaxman et al., 2005; Ghadimi & Lan, 2013; Bubeck et al., 2017). Due to its great practical relevance in diverse real-world scenarios, over the years the problem has been studied under several setups, e.g., optimization with gradient or Hessian information (first- and second-order optimization) (Zinkevich, 2003; Hazan et al., 2007; Hazan & Kale, 2014), function-value oracles (zeroth-order feedback) (Flaxman et al., 2005; Hazan & Li, 2016; Saha & Tewari, 2011; Yang & Mohri, 2016; Bubeck et al., 2017), multi-point feedback (Ghadimi & Lan, 2013; Shamir, 2017; Agarwal et al., 2010), projection-free algorithms (Chen et al., 2018; Sahu et al., 2018), etc., in both adversarial (Abernethy et al., 2008; Dekel et al., 2015; Bach & Perchet, 2016; Bubeck et al., 2015) and stochastic (Agarwal et al., 2011; Wang et al., 2017) settings (Bubeck, 2014; Hazan, 2019; Shalev-Shwartz et al., 2011). Almost all of the existing online optimization literature assumes first-order (gradient) or at least zeroth-order (function value at the queried point) oracle access for solving the convex optimization problem.
However, in reality many practical applications of online optimization only allow a $\{0,1\}$-comparison feedback, indicating a noisy preference between two (or more) queried points depending on their function values, instead of revealing any information about the gradients, or even the function values, at those points. Such examples are prevalent in real-world systems that collect user preferences instead of absolute ratings, e.g., recommender systems, online merchandising, search engine optimization, crowdsourcing, drug testing, tournament ranking, and social surveys. This is precisely why there has been a massive surge of interest in the bandit community in learning from preference feedback, famously studied as the dueling bandit problem (Komiyama et al., 2015; Ailon et al., 2014; Busa-Fekete & Hüllermeier, 2014; Wu & Liu, 2016). It explicitly models this relative preference information structure, usually in the setting of finite action spaces: the learner repeatedly selects a pair of items to be compared to each other in a duel, observes a binary stochastic preference feedback revealing the winner of the duel, and aims to identify the most rewarding items in hindsight according to a specific score function. Over the last two decades the bandit literature has studied various problems with dueling feedback, but mostly in the stochastic setting with finite arm (decision) spaces. This is because almost all existing dueling bandit techniques rely on estimating the $K \times K$ preference matrix, and hence the regret scales as $O(K)$, which becomes impractical for large (or infinite) action spaces of size $K$; herein lies another primary motivation for our work.

On the other hand, from the point of view of the optimization literature, the main reason for the lack of techniques in the comparison-based optimization framework is that almost all traditional optimization algorithms require knowledge (magnitude and direction) of the function gradients at the queried points, or at least a noisy estimate thereof. However, the gradient-magnitude information is impossible to obtain from a binary 0/1 preference feedback, and thus the problem is arguably harder than the conventional bandit-feedback-based optimization framework. Overall, the challenge of working with $\{0,1\}$-preference feedback lies in the inherent disconnect between the feedback observed by the learner and her payoff at any given round; while this disparity already exists in standard dueling bandits with finite decision spaces, the setting gets even more challenging for infinite decision spaces, and additionally so with noisy comparison oracles.

Problem Formulation (informal). We address the problem of online convex optimization solely from comparison-based oracles. Precisely, given an underlying convex function $f: \mathcal{D} \to [0,1]$¹ over a convex decision space $\mathcal{D} \subseteq \mathbb{R}^d$, the goal is to find a near-optimal (probably approximately correct) point $x \in \mathcal{D}$ satisfying $\Pr(f(x) - f(x^*) < \epsilon) \ge 1 - \delta$, for any prespecified $\epsilon, \delta \in (0,1)$, by actively querying binary (noisy) comparison feedback of the form $\mathbf{1}(f(x_t) > f(y_t))$ on a sequence of decision-point pairs (aka duels) $\{(x_t, y_t)\}_{t=1}^T$. By a noisy comparison oracle we mean a general setup where, at each round $t$, the true sign feedback $\mathbf{1}(f(x_t) > f(y_t))$ may be flipped with probability $(1/2 - \nu)$, $\nu \in (0, 1/2]$ being an unknown noise parameter.

¹Note that one can always scale the range of $f$ into $[0,1]$ as long as $f$ has a bounded range.
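This feedback model is the only interface the learner has to $f$. The following minimal Python sketch (written for this note, not part of the paper; the quadratic $f$ and the noise level are illustrative choices) shows how such a noisy dueling oracle can be simulated; the algorithms in later sections interact with $f$ exclusively through calls like `oracle(x, y)`.

```python
import numpy as np

def make_comparison_oracle(f, nu, rng=None):
    """Return a noisy dueling oracle for an (unknown) function f.

    The oracle answers a duel (x, y) with the bit 1(f(x) < f(y)) -- i.e. 1 when
    the first point wins -- flipped with probability 1/2 - nu, so the answer is
    correct with probability 1/2 + nu.
    """
    rng = np.random.default_rng() if rng is None else rng

    def oracle(x, y):
        true_bit = int(f(x) < f(y))
        flip = rng.random() < (0.5 - nu)       # flip with probability 1/2 - nu
        return true_bit ^ int(flip)            # observed noisy preference bit

    return oracle

# Example: a smooth convex quadratic and a mildly noisy oracle (nu = 0.3).
f = lambda x: float(np.sum((x - 1.0) ** 2))
oracle = make_comparison_oracle(f, nu=0.3)
print(oracle(np.zeros(3), np.ones(3)))  # 0 w.p. 0.8, since f(0) > f(1): the first point loses
```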
Given a fixed $\epsilon, \delta \in (0,1)$, the learner's goal is to find such a near-optimal point $x$ with the least possible pairwise-query complexity $T$.

The only work that closely addresses a similar problem is (Jamieson et al., 2012). They proposed a coordinate-descent-based technique to learn from comparison feedback with a provably optimal convergence rate (up to polylogarithmic factors). However, their analysis is limited to strongly convex functions, which is a restrictive assumption on the function class; indeed, this is precisely why a simple line-search-based coordinate descent algorithm works in their case, and such algorithms are known to fail without strong convexity. We assume the more general class of smooth convex functions, which need not be strongly convex, so coordinate-descent algorithms do not work in our setup. We instead propose a normalized-gradient-descent-based algorithm, which is arguably simpler both in implementation and analysis; moreover, our convergence rate for strongly convex and smooth functions is order-wise better, as their sample complexity incurs extra polylogarithmic factors in $d$, $\epsilon$, $\nu$, which we get rid of.

Our contributions. The main contributions of this paper can be summarized as follows:

- We formulate the problem of online convex optimization from comparison-based oracles (Sec. 2).
- We analyze the hardness of the setup under different notions of non-stationarity of the underlying function sequence $(f_1, f_2, \ldots)$ (Thm. 1, Sec. 3).
- Towards designing an optimization algorithm, we observe that estimating only a descent direction (the direction of the gradient at the queried point, aka the normalized gradient) is enough for our objective. Moreover, when the underlying function is $\beta$-smooth, we propose a method to estimate normalized gradients from (noisy) comparison oracles (see Thm. 3 and Lem. 9).
- We propose a normalized-gradient-descent-based algorithm for online smooth convex optimization (Alg. 1) with noisy comparison oracles. The convergence rate of our algorithm is shown to be $\tilde O(d\beta/\epsilon\nu^2)$² (Thm. 5), which matches the convergence rates of non-accelerated optimization routines for value-based (zeroth-order) feedback oracles, arguably an easier problem setup than ours (Rem. 2).
- Further, when the function is additionally $\alpha$-strongly convex, we show an improved convergence rate of just $\tilde O\big(\tfrac{d\beta}{\alpha\nu^2}\log\tfrac{1}{\epsilon}\big)$ (Thm. 7), which is provably optimal in terms of $d$, $\nu$ and $\epsilon$ (Rem. 3).

Overall, we are the first to apply normalized-gradient-descent-based techniques to optimization with preference feedback; consequently our proof techniques are new.

²$\tilde O(\cdot)$ hides logarithmic dependencies.

Related work. As motivated above, there has been very little work on convex optimization with preference feedback. (Yue & Joachims, 2009) first addressed the regret-minimization problem for fixed functions $f$ (arm rewards) with preference feedback. However, one major difference is that their optimization objective is defined in terms of the preferences (which are directly observable and hence easier to optimize), as opposed to defining it w.r.t. $f$ as we do. Further, their techniques are largely restricted to the class of nice, smooth and differentiable preference functions that allow gradient estimation.
This is why their proposed technique could appeal to classical gradient-descent-based optimization algorithms (which are widely studied for online convex optimization with zeroth- and first-order oracles (Flaxman et al., 2005; Hazan & Li, 2016; Saha & Tewari, 2011; Yang & Mohri, 2016; Bubeck et al., 2017)), yielding a regret guarantee of $O(\sqrt{dL}\,T^{3/4})$ with a standard analysis ($L$ being the Lipschitz parameter of their preference map). Clearly, their algorithm (in fact, any gradient-descent-based technique) fails in our case: using only a 1-bit comparison/sign feedback it is impossible to estimate gradient magnitudes, which is the main bottleneck of our setup. Consequently, we design normalized-gradient-descent-based optimization algorithms instead. The only follow-up of (Yue & Joachims, 2009) is (Kumagai, 2017), which considers an almost identical problem setup and relies on the same gradient-descent-based ideas to show an improved $O(\sqrt T)$ regret guarantee, but at the cost of even stricter assumptions on the preference models: precisely, it requires the underlying function $f$ to be twice continuously differentiable, $L$-Lipschitz, strongly convex and smooth, and the preference map to be thrice differentiable and rotation-symmetric. Their setting is hence far from our optimization objective and does not apply to our 1-bit noisy comparison feedback model, which is not even once differentiable. As described above, the work most closely related to ours is (Jamieson et al., 2012), which however only applies to the specific class of $\alpha$-strongly convex, $\beta$-smooth functions. In contrast, our algorithm applies to the more general class of $\beta$-smooth convex functions, is simpler to implement, requires new insights for the performance analysis, and also enjoys better convergence rates (see Rem. 3 for details).

## 2. Preliminaries and Problem Statement

Notation. Let $[n] = \{1, 2, \ldots, n\}$ for any $n \in \mathbb{N}$. Given a set $S$ and any two items $x, y \in S$, we denote by $x \succ y$ the event that $x$ is preferred over $y$. For any $r > 0$, let $\mathcal{B}_d(r)$ and $\mathcal{S}_d(r)$ denote the ball and the surface of the sphere of radius $r$ in $d$ dimensions, respectively. $I_d$ denotes the $d \times d$ identity matrix. For any vector $x \in \mathbb{R}^d$, $\|x\|_2$ denotes its $\ell_2$ norm. $\mathbf{1}(\varphi)$ denotes an indicator variable that takes the value 1 if the predicate $\varphi$ is true and 0 otherwise. $\mathrm{sign}(x) = +1$ if $x \ge 0$ and $-1$ otherwise, for $x \in \mathbb{R}$. $\mathrm{Unif}(S)$ denotes the uniform distribution over a set $S$.

Problem Setup. We solve the online convex optimization problem with 0/1 preference feedback. In the most general setup, the environment chooses a sequence of convex functions $f_1, f_2, \ldots$ (unknown to the learner), with $f_t: \mathbb{R}^d \to \mathbb{R}$ for all $t$. At each iteration the learner picks a pair of points $(x_t, y_t)$, upon which it gets to see a binary 0/1-bit noisy comparison feedback $o_t$ such that

$$\Pr\big(o_t = \mathbf{1}(f_t(y_t) > f_t(x_t))\big) = 1/2 + \nu,$$

where $\nu \in (0, 1/2]$ is the (unknown) noise parameter. Based on the variability of the function sequence $f_1, f_2, \ldots$, one can categorize the problem into the following cases:

1. Stationary: The functions are fixed across time, i.e., $f_t = f$ for some unknown but fixed function $f: \mathbb{R}^d \to \mathbb{R}$.
2. Stochastic: The functions are drawn from some (unknown) distribution $P$ over a given (known) class of convex functions $\mathcal{F} \subseteq \{f: \mathbb{R}^d \to \mathbb{R}\}$, i.e., at any time $t \in [T]$, $f_t \sim P$.
3. Adversarial: The most general setup, with an arbitrary sequence of convex functions $\{f_t\}_{t=1}^T$. Note this case can be seen as a special case of the Stochastic setup where $\mathcal{F} = \{f_t\}_{t=1}^T$ and $\Pr(f_t = f_i) = 0$ for all $i \neq t$.
Objective. We consider the standard $(\epsilon, \delta)$-probably-approximately-correct function optimization objective: given any fixed choice of $\epsilon, \delta \in (0,1)$, the objective of the learner is to find a decision point $x \in \mathbb{R}^d$, with the minimum possible pairwise-query complexity $\{(x_t, y_t)\}_{t=1}^T$, such that the final point $x$ satisfies

$$\Pr\big(\bar f(x) - \bar f(x^*) < \epsilon\big) \ge 1 - \delta,$$

where $\bar f(x) := \frac{1}{T}\sum_{t=1}^T \mathbb{E}_{f_t \sim P}[f_t(x)]$, $x \in \mathbb{R}^d$, denotes the average function value, and $x^* := \arg\min_{z \in \mathbb{R}^d} \bar f(z)$ denotes the minimizer of $\bar f$.

## 3. Non-stationarity Leads to Impossibility

We first analyze the fundamental performance limit of the problem for non-stationary sequences of functions, i.e., the Stochastic and Adversarial setups. The main result of this section (Thm. 1) shows that the optimization problem is in fact impossible to solve in both cases. This is precisely because a comparison-based oracle does not take into account the scale of the function values, nor does the learner have a way to estimate this scale from the comparison feedback. Thus, in the non-stationary setup it is always possible to construct problem instances in which a decision point appears to be the best in terms of the observed preferences, whereas in reality it is not even an $\epsilon$-optimal point compared to $x^*$. The proof of Thm. 1 gives a formal argument.

Theorem 1. In the Stochastic and Adversarial setups it is always possible to construct problem instances for which the optimization problem becomes infeasible (it is impossible to identify the true minimizer $x^*$).

Proof. Instance $I$: Consider a simple problem instance $I$ of the Stochastic setup with decision set $\mathcal{D} = \{x_1, x_2\}$ and function class $\mathcal{F} = \{f_1, f_2\}$, and assume $P$ is such that

$$f_t = \begin{cases} f_1, & \text{with probability } 0.99 \\ f_2, & \text{otherwise,} \end{cases}$$

where $f_1$ and $f_2$ are respectively defined as

$$f_1(x) = \begin{cases} 0.01, & \text{if } x = x_1 \\ 0, & \text{otherwise,} \end{cases} \qquad f_2(x) = \begin{cases} 0, & \text{if } x = x_1 \\ 1, & \text{otherwise.} \end{cases}$$

Then note that at any round $t$,

$$\Pr_{f_t \sim P}(x_2 \succ x_1) = \Pr(f_t = f_1)\Pr(x_2 \succ x_1 \mid f_t = f_1) + \Pr(f_t = f_2)\Pr(x_2 \succ x_1 \mid f_t = f_2) = 0.99 \cdot 1 + 0.01 \cdot 0 = 0.99.$$

Thus, in expectation, $x_2$ wins over $x_1$ almost always (99 times out of 100 duels). On the other hand, in terms of the function values, $x_1$ is the optimal point (the minimizer of $\bar f$; see Objective in Sec. 2). This can be easily inferred by noting that

$$\mathbb{E}_{f_t \sim P}[f_t(x_1)] = 0.99 \cdot 0.01 + 0.01 \cdot 0 = 0.0099 < 0.01 = 0.99 \cdot 0 + 0.01 \cdot 1 = \mathbb{E}_{f_t \sim P}[f_t(x_2)].$$

So for instance $I$ we have $x^* = x_1$.

Instance $I'$: Now consider a slightly tweaked version of instance $I$, say $I'$, which has exactly the same $\mathcal{D} = \{x_1, x_2\}$ and a slightly different function class $\mathcal{F} = \{f_1, f_2'\}$, where $f_2'$ is defined as

$$f_2'(x) = \begin{cases} 0, & \text{if } x = x_1 \\ 0.1, & \text{otherwise.} \end{cases}$$

Note that for $I'$, at any round $t$, we still have $\Pr_{f_t \sim P}(x_2 \succ x_1) = 0.99 \cdot 1 + 0.01 \cdot 0 = 0.99$. The interesting point, however, is that in this case $x_2$ is the optimal point in terms of the function values (i.e., $x^* = x_2$). This follows by noting that

$$\mathbb{E}_{f_t \sim P}[f_t(x_1)] = 0.99 \cdot 0.01 + 0.01 \cdot 0 = 0.0099 > 0.001 = 0.01 \cdot 0.1 = \mathbb{E}_{f_t \sim P}[f_t(x_2)].$$

So for instance $I'$ we have $x^* = x_2$.

Now consider any $\epsilon < \min(0.01 - 0.0099,\; 0.0099 - 0.001) = 0.0001$. The only $\epsilon$-optimal arm for $I$ is $x_1$, whereas for $I'$ it is $x_2$. But in both cases the learner observes exactly the same preference feedback, i.e., $\Pr(x_2 \succ x_1) = 0.99$, so it has no way to distinguish $I$ from $I'$ and consequently cannot identify the $\epsilon$-optimal point of the true underlying instance.

Note that the difficulty arises because the binary comparison-based preference feedback reveals no information about the magnitude of the underlying function values: any learning algorithm would observe $x_2$ beat $x_1$ almost always (with probability 0.99), irrespective of whether the true optimal arm is $x_1$ (true instance $I$) or $x_2$ (true instance $I'$).

Remark 1. The above example also proves the impossibility result for the Adversarial setup, since the Stochastic setup is just a simple and special case of the former.
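The two instances above can also be checked numerically. The following sketch (illustrative, not part of the paper) recomputes the duel-win probability and the average function values for $I$ and $I'$, confirming that the preference-feedback distributions coincide while the minimizers differ.

```python
# Numeric check of the construction in the proof of Theorem 1 (values taken
# from the instances I and I' above; the code itself is only illustrative).
instances = {
    "I":  {"f1": {"x1": 0.01, "x2": 0.0}, "f2": {"x1": 0.0, "x2": 1.0}},
    "I'": {"f1": {"x1": 0.01, "x2": 0.0}, "f2": {"x1": 0.0, "x2": 0.1}},
}
p_f1 = 0.99  # Pr(f_t = f1); f2 (or f2') is drawn otherwise

for name, fs in instances.items():
    # Probability that x2 wins a duel against x1 (noiseless comparisons).
    win_x2 = p_f1 * (fs["f1"]["x2"] < fs["f1"]["x1"]) \
             + (1 - p_f1) * (fs["f2"]["x2"] < fs["f2"]["x1"])
    mean = {x: p_f1 * fs["f1"][x] + (1 - p_f1) * fs["f2"][x] for x in ("x1", "x2")}
    best = min(mean, key=mean.get)
    print(f"{name}: Pr(x2 beats x1) = {win_x2:.2f}, mean values = {mean}, minimizer = {best}")
# Both instances print Pr(x2 beats x1) = 0.99, yet the minimizer differs (x1 vs. x2).
```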
## 4. Estimating the Descent Direction (Normalized Gradient) using a Comparison Oracle

As motivated in Sec. 1, one of our main contributions lies in the analysis showing how to obtain a nearly unbiased estimate of the normalized gradient at any desired point $x \in \mathbb{R}^d$ from 1-bit comparison feedback. Thm. 3 gives the main result of this section, but before that we find it useful to introduce the following key lemma, which shows how one can obtain an unbiased estimate of the direction of any vector $g \in \mathbb{R}^d$, i.e., its normalized version $\frac{g}{\|g\|_2}$, using only a 1-bit comparison (sign) oracle. This is a general result whose scope lies beyond our specific problem setup.

Lemma 2. For a given vector $g \in \mathbb{R}^d$ and a random unit vector $u$ drawn uniformly from $\mathcal{S}_d(1)$, we have $\mathbb{E}[\mathrm{sign}(g^\top u)\, u] = c\, \frac{g}{\|g\|_2}$ for some $c \in \big[\tfrac{1}{20\sqrt d}, \tfrac{1}{\sqrt d}\big]$.

Proof sketch. Without loss of generality we can assume $\|g\|_2 = 1$, since normalizing $g$ does not affect the left-hand side. First, let us show that $\mathbb{E}[\mathrm{sign}(g^\top u)\, u] = \gamma g$ for some $\gamma \in \mathbb{R}$. Consider the reflection matrix along $g$ given by $P = 2gg^\top - I$, and examine the random vector $u' = Pu$. Observe that

$$\mathrm{sign}(g^\top u') = \mathrm{sign}\big(2\|g\|_2^2\, g^\top u - g^\top u\big) = \mathrm{sign}(g^\top u).$$

Since $u'$ is also a uniformly random vector on the unit sphere, we have

$$\mathbb{E}[\mathrm{sign}(g^\top u)\, u] = \tfrac12\mathbb{E}[\mathrm{sign}(g^\top u)\, u] + \tfrac12\mathbb{E}[\mathrm{sign}(g^\top u')\, u'] = \tfrac12\mathbb{E}[\mathrm{sign}(g^\top u)\, u] + \tfrac12\mathbb{E}[\mathrm{sign}(g^\top u)(2gg^\top - I)u] = \mathbb{E}\big[(g^\top u)\,\mathrm{sign}(g^\top u)\big]\, g.$$

Thus, $\mathbb{E}[\mathrm{sign}(g^\top u)\, u] = \gamma g$ for $\gamma = \mathbb{E}[|g^\top u|]$. It remains to bound $\gamma$, which by rotation invariance equals $\mathbb{E}[|u_1|]$. For an upper bound, observe that by symmetry $\mathbb{E}[u_1^2] = \frac1d \mathbb{E}\big[\sum_{i=1}^d u_i^2\big] = \frac1d$, and thus $\mathbb{E}[|u_1|] \le \sqrt{\mathbb{E}[u_1^2]} = \frac{1}{\sqrt d}$. We turn to prove a lower bound on $\gamma$. If $u$ were a Gaussian random vector with i.i.d. entries $u_i \sim \mathcal{N}(0, 1/d)$, then from standard properties of the (truncated) Gaussian distribution we would have gotten $\mathbb{E}[|u_1|] = \sqrt{2/(\pi d)}$. For $u$ uniformly distributed on the unit sphere, $u_1$ is distributed as $v_1/\|v\|$, where $v$ is Gaussian with i.i.d. entries $\mathcal{N}(0, 1/d)$. Since $\sqrt d\, v_1$ is a standard normal, we have $\Pr\big(|v_1| \le \tfrac{1}{\sqrt d}\big) = 2\Phi(1) - 1 \le 0.7$, and since $\mathbb{E}[\|v\|^2] = 1$, an application of Markov's inequality gives $\Pr\big(\|v\| \ge \tfrac1\epsilon\big) \le \epsilon^2\, \mathbb{E}[\|v\|^2] = \epsilon^2$. For $\epsilon = \tfrac14$ this implies $\Pr\big(|v_1| \ge \tfrac{1}{\sqrt d} \text{ and } \|v\| \le 4\big) \ge 1 - 0.7 - \tfrac{1}{16} \ge \tfrac15$, whence $\gamma = \mathbb{E}[|u_1|] \ge \tfrac{1}{4\sqrt d}\cdot\tfrac15 = \tfrac{1}{20\sqrt d}$.
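Lemma 2 is easy to verify empirically. The following minimal sketch (illustrative; the dimension, the vector $g$, and the sample size are arbitrary choices) draws uniform unit vectors, averages $\mathrm{sign}(g^\top u)\,u$, and confirms that the result points along $g/\|g\|_2$ with magnitude of order $1/\sqrt d$.

```python
import numpy as np

# Monte-Carlo check of Lemma 2: E[sign(g.u) u] is proportional to g/||g||_2,
# with a proportionality constant between 1/(20 sqrt(d)) and 1/sqrt(d).
rng = np.random.default_rng(0)
d = 50
g = rng.normal(size=d)                               # vector whose direction we estimate

u = rng.normal(size=(100_000, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)        # u ~ Unif(S_d(1))
est = np.mean(np.sign(u @ g)[:, None] * u, axis=0)   # empirical E[sign(g.u) u]

direction = g / np.linalg.norm(g)
print("cosine with g/||g||:", direction @ est / np.linalg.norm(est))  # close to 1
print("scale * sqrt(d):    ", np.linalg.norm(est) * np.sqrt(d))       # lies in [1/20, 1]
```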
Using the above result we obtain the main result of this section; the complete proof is deferred to Appendix A.

Theorem 3. If $f$ is $\beta$-smooth, then for $u \sim \mathrm{Unif}(\mathcal{S}_d(1))$, any $\gamma \in (0, 1)$ and any vector $b \in \mathcal{S}_d(1)$,

$$\mathbb{E}_u\big[\mathrm{sign}\big(f(x + \gamma u) - f(x - \gamma u)\big)\, u^\top b\big] \;\ge\; \frac{c}{\sqrt d}\, \frac{\nabla f(x)^\top b}{\|\nabla f(x)\|_2} \;-\; 2\lambda,$$

for some universal constant $c \in [\tfrac{1}{20}, 1]$, where $\lambda$ is as in Lem. 4 below.

Proof sketch. The proof mainly relies on the following lemma, which shows how the comparison feedback for two nearby points $x + \gamma u$ and $x - \gamma u$ can be used to recover directional information about the gradient of $f$ at $x$.

Lemma 4. If $f$ is $\beta$-smooth, then for $u \sim \mathrm{Unif}(\mathcal{S}_d(1))$, any $\gamma \in (0, 1)$ and any $\gamma' > 0$, with probability at least $1 - \lambda$, where $\lambda = O\big(\gamma' + \tfrac{\beta\gamma}{\|\nabla f(x)\|_2}\sqrt{d \log(1/\gamma')}\big)$ (see Appendix A for the exact expression), we have

$$\mathrm{sign}\big(f(x + \gamma u) - f(x - \gamma u)\big)\, u = \mathrm{sign}\big(\nabla f(x)^\top u\big)\, u.$$

Consequently, for any vector $b \in \mathcal{S}_d(1)$ we have

$$\mathbb{E}_u\big[\mathrm{sign}\big(f(x+\gamma u) - f(x-\gamma u)\big)\, u^\top b\big] \;\ge\; \mathbb{E}_u\big[\mathrm{sign}\big(\nabla f(x)^\top u\big)\, u^\top b\big] - 2\lambda.$$

We give a brief outline of the proof; the complete analysis can be found in Appendix A. From smoothness we have

$$\big|f(x + \gamma u) - f(x - \gamma u) - 2\gamma\, u^\top \nabla f(x)\big| \le \beta\gamma^2.$$

Therefore, if $\beta\gamma^2 \le \gamma\,|u^\top \nabla f(x)|$, we will have $\mathrm{sign}(f(x + \gamma u) - f(x - \gamma u)) = \mathrm{sign}(u^\top \nabla f(x))$, so it suffices to bound $\Pr_u\big(\beta\gamma \ge |u^\top \nabla f(x)|\big)$. We know that for $v \sim \mathcal{N}(0_d, I_d)$, $u := v/\|v\|_2$ is uniformly distributed on the unit sphere; using a standard bound on $\|v\|_2$ that fails with probability at most $\gamma'$, one can write

$$\Pr_u\big(|u^\top \nabla f(x)| \le \beta\gamma\big) \;\le\; \Pr_v\Big(|v^\top \nabla f(x)| \le O\big(\beta\gamma\sqrt{d\log(1/\gamma')}\big)\Big) + \gamma'.$$

Again, since $v^\top \nabla f(x) \sim \mathcal{N}\big(0, \|\nabla f(x)\|_2^2\big)$, for any threshold $a > 0$ we have $\Pr\big(|v^\top \nabla f(x)| \le a\big) \le \sqrt{\tfrac{2}{\pi}}\,\tfrac{a}{\|\nabla f(x)\|_2}$. Combining the inequalities, we have that $\mathrm{sign}(f(x+\gamma u) - f(x-\gamma u)) = \mathrm{sign}(u^\top \nabla f(x))$ except with probability at most $\lambda$. As for the claim about the expectation, note that for any vector $b \in \mathcal{S}_d(1)$,

$$\mathbb{E}_u\big[\mathrm{sign}\big(f(x+\gamma u) - f(x-\gamma u)\big)\, u^\top b\big] \;\ge\; \mathbb{E}_u\big[\mathrm{sign}\big(\nabla f(x)^\top u\big)\, u^\top b\big] - 2\lambda,$$

since with probability $1 - \lambda$ the two quantities inside the expectations are identical, and otherwise they differ by at most 2. This concludes the proof of Lem. 4. The result of Thm. 3 now simply follows by combining the guarantees of Lem. 2 and Lem. 4.
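Before moving on, here is a small numeric illustration of the estimator underlying Thm. 3 (not from the paper; the quadratic $f$, the point $x$, and $\gamma$ are arbitrary choices): averaging $\mathrm{sign}(f(x+\gamma u) - f(x-\gamma u))\,u$ over uniform unit vectors $u$ produces a vector closely aligned with $\nabla f(x)/\|\nabla f(x)\|_2$, even though only comparisons of function values are used.

```python
import numpy as np

# The one-bit estimate h = sign(f(x + gamma*u) - f(x - gamma*u)) * u, averaged
# over random unit vectors u, aligns with the gradient direction of a smooth f.
rng = np.random.default_rng(1)
d, gamma = 20, 1e-3
A = rng.normal(size=(d, d)); A = A @ A.T / d           # f(x) = x^T A x is smooth and convex
f = lambda z: float(z @ A @ z)
x = rng.normal(size=d)
grad_dir = 2 * A @ x
grad_dir /= np.linalg.norm(grad_dir)                   # true normalized gradient at x

u = rng.normal(size=(50_000, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)          # u ~ Unif(S_d(1))
signs = np.sign([f(x + gamma * ui) - f(x - gamma * ui) for ui in u])
h_mean = (signs[:, None] * u).mean(axis=0)             # empirical E[h]

print("cosine(E[h], gradient direction):", grad_dir @ h_mean / np.linalg.norm(h_mean))
```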
## 5. Noiseless Case: Analysis of Sign Feedback with Normalized Gradient Descent

In this section we analyse the no-noise case, i.e., $\nu = 1/2$ (see the comparison feedback model in Sec. 2), so that upon querying any duel $(x, y)$ the learner gets access to the true sign feedback $\mathbf{1}(f(x) > f(y))$. We start by presenting our main algorithm (Alg. 1) for $\beta$-smooth convex optimization, which is based on the technique of normalized gradient descent and builds on the results derived in Sec. 4. We then analyse its rate of convergence, which respects a PAC sample complexity bound of $O\big(\tfrac{d\beta\,\|x_1 - x^*\|_2^2}{\epsilon}\big)$ (Thm. 5), $x_1$ being the initial point used by the algorithm. Following this, we also address the setup of combined $\alpha$-strongly convex, $\beta$-smooth functions and show that in this case one can achieve an improved convergence rate by simply using a blackbox routine for $\beta$-smooth optimization (say, our Alg. 1) iteratively. Precisely, the resulting algorithm (Alg. 2) is shown to yield a convergence rate of $O\big(\tfrac{d\beta}{\alpha}\log\tfrac{1}{\epsilon}\big)$ (Thm. 7), which respects the optimal convergence rate in terms of $\epsilon$ and $d$ (Rem. 3).

### 5.1. $f$ is $\beta$-smooth convex

Algorithmic ideas. We start by recalling that, from a single 0/1-bit comparison feedback, one cannot hope to recover gradient estimates, since a comparison oracle does not reveal any information about the scale (magnitude) of the function values (see Sec. 3 for a motivating example). So traditional gradient-descent-based techniques are bound to fail in our case. However, in Sec. 4 we showed that, if not the entire gradient, we can almost recover the direction of the gradient (aka the normalized gradient) at any desired point (see Thm. 3). We therefore appeal to the technique of normalized gradient descent. Starting from an initial point $x_1$, the algorithm sequentially computes a nearly unbiased normalized-gradient estimate $h_t := \mathrm{sign}\big(f(x'_t) - f(y'_t)\big)\, u_t$,³ where $u_t \sim \mathrm{Unif}(\mathcal{S}_d(1))$ is a random unit-norm $d$-dimensional vector, and takes $\eta$-sized steps along the negative direction of the estimated normalized gradient. Also note that at each time $t$ we keep track of the so-far-minimum point $\tilde x_t$ (current best); essentially, $\tilde x_t$ traces $\arg\min_{\tau \le t} f(x_\tau)$. This is unavoidable, since without knowledge of the gradient magnitude the algorithm has no way to gauge whether $x_t$ is very close to, or too far from, the optimal point $x^*$.

³We take cues from the result of Thm. 3, or more precisely Lem. 4, to arrive at the functional form of $h_t$.

Algorithm 1 $\beta$-NGD$(x_1, \eta, \gamma, T_\epsilon)$
1: Input: Initial point $x_1 \in \mathbb{R}^d$ such that $D \ge \|x_1 - x^*\|_2^2$ (assumed known), learning rate $\eta$, perturbation parameter $\gamma$, query budget $T_\epsilon$ (recall the desired error tolerance is $\epsilon > 0$)
2: Initialize current minimum $\tilde x_1 \leftarrow x_1$
3: for $t = 1, 2, 3, \ldots, T_\epsilon$ do
4:   Sample $u_t \sim \mathrm{Unif}(\mathcal{S}_d(1))$
5:   $x'_t := x_t + \gamma u_t$
6:   $y'_t := x_t - \gamma u_t$
7:   Play the duel $(x'_t, y'_t)$ and receive the binary preference feedback $o_t = \mathbf{1}(f(x'_t) < f(y'_t))$. Set $o'_t = 2o_t - 1$
8:   Update $x_{t+1} \leftarrow x_t - \eta h_t$, where $h_t = -o'_t u_t$ $\big(= \mathrm{sign}(f(x'_t) - f(y'_t))\, u_t\big)$
9:   Query the pair $(x_{t+1}, \tilde x_t)$
10:  Update $\tilde x_{t+1} \leftarrow x_{t+1}$ if the feedback indicates $f(x_{t+1}) < f(\tilde x_t)$, and $\tilde x_{t+1} \leftarrow \tilde x_t$ otherwise
11: end for
12: Return $\tilde x_{T_\epsilon + 1}$

Theorem 5 (Noiseless optimization, $\beta$-smooth functions; Alg. 1). Consider $f$ to be $\beta$-smooth. Suppose Alg. 1 is run with $\eta = \tfrac{\sqrt{\epsilon}}{20\sqrt{d\beta}}$, $\gamma = \tfrac{(\epsilon/\beta)^{3/2}}{2\sqrt d\,(D + \eta T_\epsilon)^2 \log\big(480\sqrt{\beta d}\,(D + \eta T_\epsilon)/\sqrt{2\epsilon}\big)}$ and $T_\epsilon = O\big(\tfrac{d\beta D}{\epsilon}\big)$, where $D \ge \|x_1 - x^*\|_2^2$ is an assumed known upper bound. Then Alg. 1 returns a point $\tilde x_{T_\epsilon+1}$ with $\mathbb{E}[f(\tilde x_{T_\epsilon+1})] - f(x^*) \le \epsilon$, with sample complexity $2T_\epsilon$.

Remark 2. The convergence rate in Thm. 5 is the same as that achieved by non-accelerated optimization algorithms for smooth convex optimization with zeroth-order feedback, which reveals the true function value at the queried points (e.g., Nesterov & Spokoiny, 2017). That is arguably a richer feedback model than our 1-bit comparison oracle: one can always simulate the comparison feedback from a zeroth-order oracle, but not the other way around. So comparison-based optimization is always at least as hard as zeroth-order optimization, since any algorithm for the former also solves the latter given such a black box.

Proof sketch (Thm. 5). Consider the following two cases.

Case 1 ($f(x_1) - f(x^*) \le \epsilon$): In this case the initial point $x_1$ is already good enough (close enough to $x^*$).

Case 2 ($f(x_1) - f(x^*) > \epsilon$): In this case we appeal to Lem. 6, which ensures the existence of a round $t \in [T_\epsilon]$ such that $\mathbb{E}[f(x_{t+1})] - f(x^*) \le \epsilon$.

The bound of Thm. 5 now follows by noting that, by construction, $\tilde x_{T_\epsilon+1}$ is the best among $x_1, \ldots, x_{T_\epsilon+1}$, so as long as there is some $t \in [T_\epsilon]$ with $\mathbb{E}[f(x_{t+1})] - f(x^*) \le \epsilon$, we have $\mathbb{E}[f(\tilde x_{T_\epsilon+1})] - f(x^*) \le \epsilon$. Finally, the sample complexity bound follows since Alg. 1 takes $T_\epsilon$ as input and for each $t \in [T_\epsilon]$ makes two pairwise comparisons, namely $(x'_t, y'_t)$ and $(x_{t+1}, \tilde x_t)$, making the total query complexity $2T_\epsilon$. In the remainder we briefly sketch the proof of the following main lemma.

Lemma 6. Consider $f$ to be $\beta$-smooth. In Alg. 1, suppose the initial point $x_1$ satisfies $f(x_1) - f(x^*) > \epsilon$, and for any given $\epsilon > 0$ let the parameters $T_\epsilon$, $\gamma$ and $\eta$ be as defined in Thm. 5. Then there exists at least one $t \in [T_\epsilon]$ such that $\mathbb{E}[f(x_{t+1})] - f(x^*) \le \epsilon$, i.e., $\min_{t \in [T_\epsilon]} \mathbb{E}[f(x_{t+1})] - f(x^*) \le \epsilon$.

Proof sketch (Lem. 6). Consider any $t = 1, 2, \ldots, T$ such that $f(x_\tau) > f(x^*) + \epsilon$ for all $\tau \in [t]$, and denote by $n_t = \frac{\nabla f(x_t)}{\|\nabla f(x_t)\|_2}$ the normalized gradient at $x_t$. Let $y_t := x^* + \sqrt{2\epsilon/\beta}\; n_t$. Then, using the $\beta$-smoothness of $f$ (and $\nabla f(x^*) = 0$),

$$f(y_t) \le f(x^*) + \nabla f(x^*)^\top (y_t - x^*) + \tfrac{\beta}{2}\,\|y_t - x^*\|^2 = f(x^*) + \epsilon,$$

and thus we conclude $f(y_t) < f(x_t)$. We record two observations:

Observation 1. For any $t \in [T]$, $f(x_t) > f(x^*) + \epsilon$ implies $\|x_t - x^*\| > \sqrt{2\epsilon/\beta}$ (since from the $\beta$-smoothness of $f$ we know $f(x_t) - f(x^*) \le \tfrac{\beta}{2}\|x_t - x^*\|^2$).

Observation 2. Since $f(y_t) < f(x_t)$, from the properties of convex functions (see Lem. 14 in Appendix B) we get $n_t^\top (y_t - x_t) \le 0$, which implies $n_t^\top (x_t - x^*) \ge \sqrt{2\epsilon/\beta}$.

Denote by $H_t := \{x_\tau, u_\tau\}_{\tau=1}^{t-1} \cup \{x_t\}$ the history up to time $t$. By the update rule of $x_{t+1}$ (and since $\|h_t\| = 1$),

$$\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2 - 2\eta\, h_t^\top (x_t - x^*) + \eta^2.$$

Starting from this inequality, a series of inference steps based on Observations 1 and 2, Thm. 3, standard properties of convex functions, and the choices of $\gamma$ and $\eta$ shows that

$$\mathbb{E}_{H_t}\big[\mathbb{E}_{u_t}[\|x_{t+1} - x^*\|^2 \mid H_t]\big] \le \mathbb{E}_{H_t}\big[\|x_t - x^*\|^2\big] - \frac{(\sqrt2 - 1)\,\epsilon}{400\, d\beta}.$$

Summing over $t = 1, \ldots, T$ and applying the law of iterated expectations, this yields

$$\mathbb{E}_{H_T}\big[\|x_{T+1} - x^*\|^2\big] \le \|x_1 - x^*\|^2 - \frac{(\sqrt2 - 1)\,\epsilon\, T}{400\, d\beta}.$$

Thus, if $f(x_\tau) - f(x^*) > \epsilon$ were to hold for all $\tau = 1, 2, \ldots, T$, then $\mathbb{E}[\|x_{T+1} - x^*\|^2] \le 0$ for $T \ge \tfrac{400\, d\beta}{(\sqrt2 - 1)\epsilon}\,\|x_1 - x^*\|^2$, which would imply $x_{T+1} = x^*$ (i.e., $f(x_{T+1}) = f(x^*)$), a contradiction. Hence there must exist a time $t \in [T]$ such that $f(x_t) - f(x^*) \le \epsilon$, which proves the claim with $T_\epsilon = T$. This concludes the proof of Lem. 6, and hence also of Thm. 5, which was proved above assuming Lem. 6.
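For concreteness, here is a minimal Python sketch of Alg. 1 under the noiseless sign oracle (an illustration written for this note, not the authors' code). The oracle follows the convention that `oracle(x, y) = 1` when $f(x) < f(y)$; the noisy case of Sec. 6 would replace each oracle call with the de-noised query of Alg. 3.

```python
import numpy as np

def beta_ngd(oracle, x1, eta, gamma, T, rng=None):
    """Sketch of Algorithm 1 (beta-NGD) with noiseless comparison feedback.

    `oracle(x, y)` returns 1 if f(x) < f(y) (the first point wins) and 0
    otherwise; eta is the step size, gamma the perturbation radius, T the
    query budget. The Thm. 5 hyperparameter values are not hard-coded here.
    """
    rng = np.random.default_rng() if rng is None else rng
    x, x_best = np.array(x1, dtype=float), np.array(x1, dtype=float)
    for _ in range(T):
        u = rng.normal(size=x.shape)
        u /= np.linalg.norm(u)                      # u_t ~ Unif(S_d(1))
        o = oracle(x + gamma * u, x - gamma * u)    # duel (x'_t, y'_t)
        h = -(2 * o - 1) * u                        # h_t = sign(f(x'_t) - f(y'_t)) u_t
        x = x - eta * h                             # normalized-gradient step
        if oracle(x, x_best):                       # duel (x_{t+1}, current best)
            x_best = x.copy()
    return x_best
```

For example, `beta_ngd(make_comparison_oracle(f, nu=0.5), x1, eta, gamma, T)` runs the noiseless case of this section, since $\nu = 1/2$ means the comparison bit is never flipped.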
### 5.2. $f$ is $\alpha$-strongly convex and $\beta$-smooth

We next show a better convergence rate for $\alpha$-strongly convex and $\beta$-smooth functions. For this case, we note that one can simply reuse any optimal optimization algorithm for $\beta$-smooth convex functions (we use our Alg. 1) as a blackbox to design an optimal algorithm for $\alpha$-strongly convex, $\beta$-smooth functions. Our proposed method, $(\alpha, \beta)$-NGD (Alg. 2), adopts a phase-wise iterative optimization approach: inside each phase $k$ we use Alg. 1 as a blackbox to locate an $\epsilon_k$-optimal point for that phase, with exponentially improving accuracy across phases, and warm-start the $(k+1)$-th phase from the point returned by $\beta$-NGD in the $k$-th phase. The method works precisely because of the nice property of strongly convex functions that nearness in function values implies nearness to the optimizer $x^*$ in $\ell_2$-norm (see Lem. 8 and the proof of Thm. 7). The algorithm is described in Alg. 2.

Algorithm 2 $(\alpha, \beta)$-NGD$(\epsilon)$
1: Input: Error tolerance $\epsilon > 0$
2: Initialize: initial point $x_1 \in \mathbb{R}^d$ such that $D \ge \|x_1 - x^*\|_2^2$ (assumed known); phase count $k_\epsilon := \lceil \log_2 \tfrac{\alpha}{\epsilon} \rceil$; $t \leftarrow \tfrac{800\, d\beta}{(\sqrt2 - 1)\alpha}$; $t_1 \leftarrow t D$, $\epsilon_1 \leftarrow \tfrac{400\, d\beta D}{(\sqrt2 - 1)\, t_1}$, and $\eta_1, \gamma_1$ as prescribed in Thm. 5 for tolerance $\epsilon_1$ and budget $t_1$
3: Update $x_2 \leftarrow \beta$-NGD$(x_1, \eta_1, \gamma_1, t_1)$
4: for $k = 2, 3, \ldots, k_\epsilon$ do
5:   $t_k \leftarrow 2t$, $\epsilon_k \leftarrow \tfrac{400\, d\beta}{(\sqrt2 - 1)\, t_k}$, and $\eta_k, \gamma_k$ as prescribed in Thm. 5 for tolerance $\epsilon_k$ and budget $t_k$ (with $D$ replaced by $1$)
6:   Update $x_{k+1} \leftarrow \beta$-NGD$(x_k, \eta_k, \gamma_k, t_k)$
7: end for
8: Return $\bar x = x_{k_\epsilon + 1}$

Theorem 7 (Noiseless optimization, $\alpha$-strongly convex and $\beta$-smooth case; Alg. 2). Consider $f$ to be $\alpha$-strongly convex and $\beta$-smooth. Then Alg. 2 returns a point $\bar x$ with $\mathbb{E}[f(\bar x)] - f(x^*) \le \epsilon$ with sample complexity (number of pairwise comparisons) $O\big(\tfrac{d\beta}{\alpha}\big(\log_2\tfrac{\alpha}{\epsilon} + \|x_1 - x^*\|_2^2\big)\big)$.

Remark 3. The line-search algorithm proposed by (Jamieson et al., 2012) also achieves this convergence rate for strongly convex functions, modulo additional multiplicative polylogarithmic terms (in $d$, $\epsilon$, $\nu$) in the sample complexity bound, which we get rid of. Their lower bound justifies the tightness of the analysis of Thm. 7 in terms of the problem parameters $d$, $\nu$ and $\epsilon$. It is, however, important to note that our algorithm is much simpler in both implementation and analysis. Moreover, our proposed method is more general, in that it also applies to the class of non-strongly convex (smooth) functions (see Thm. 5), where (Jamieson et al., 2012) fails.

Proof sketch (Thm. 7). We start by recalling an important property of strongly convex functions (proof in Appendix B).

Lemma 8. If $f: \mathbb{R}^d \to \mathbb{R}$ is an $\alpha$-strongly convex function with minimizer $x^*$, then for any $x \in \mathbb{R}^d$, $\tfrac{\alpha}{2}\,\|x - x^*\|^2 \le f(x) - f(x^*)$.

Also let $H_k := \{x_{k'}, (x'_t, y'_t, o_t)_{t \le t_{k'}}\}_{k'=1}^{k} \cup \{x_{k+1}\}$ denote the complete history up to the end of phase $k$, for all $k \in [k_\epsilon]$. By Thm. 5, for any fixed $T > 0$, when $\beta$-NGD (Alg. 1) is run from an initial point $x_1$ with $\eta$, $\gamma$ as in Thm. 5 and accuracy $\epsilon = \tfrac{400\, d\beta\, \|x_1 - x^*\|_2^2}{(\sqrt2 - 1)\, T}$, it returns a point $\tilde x_{T+1}$ with $\mathbb{E}[f(\tilde x_{T+1})] - f(x^*) \le \tfrac{400\, d\beta\, \|x_1 - x^*\|_2^2}{(\sqrt2 - 1)\, T}$ at a sample complexity of $2T$. Since in this case $f$ is also $\alpha$-strongly convex, Lem. 8 gives

$$\mathbb{E}\big[\tfrac{\alpha}{2}\,\|\tilde x_{T+1} - x^*\|^2\big] \le \tfrac{400\, d\beta\, \|x_1 - x^*\|_2^2}{(\sqrt2 - 1)\, T}, \quad \text{i.e.,} \quad \mathbb{E}\big[\|\tilde x_{T+1} - x^*\|^2\big] \le \tfrac{800\, d\beta\, \|x_1 - x^*\|_2^2}{(\sqrt2 - 1)\,\alpha\, T}. \qquad (1)$$

For $k = 1$, applying (1) with $T = t_1 = tD \ge t\,\|x_1 - x^*\|_2^2$, we get $\mathbb{E}[\|x_2 - x^*\|^2] \le \tfrac{800\, d\beta\, \|x_1 - x^*\|_2^2}{(\sqrt2 - 1)\,\alpha\, t_1} \le 1$. Then, for any $k = 2, \ldots, k_\epsilon$, given the initial point $x_k$ of the phase, running $\beta$-NGD with $T = 2t = \tfrac{1600\, d\beta}{(\sqrt2 - 1)\alpha}$ gives, from (1),

$$\mathbb{E}_{H_k}\big[\|x_{k+1} - x^*\|^2 \mid H_{k-1}\big] \le \frac{800\, d\beta\, \|x_k - x^*\|^2}{(\sqrt2 - 1)\,\alpha \cdot 2t} = \frac{\|x_k - x^*\|^2}{2},$$

i.e., each phase halves the expected squared distance to $x^*$. This implies that, given the history up to phase $k-1$, using (1) and our choice of $t_k$,

$$\mathbb{E}_{H_k}\big[f(x_{k+1}) - f(x^*) \mid H_{k-1}\big] \le \frac{\alpha}{4}\,\mathbb{E}\big[\|x_k - x^*\|^2 \mid H_{k-1}\big] \le \frac{\alpha}{4}\Big(\frac12\Big)^{k-2}\,\mathbb{E}\big[\|x_2 - x^*\|^2\big] \le \frac{\alpha}{2^{k}}.$$

Thus, to ensure $\mathbb{E}[f(x_{k_\epsilon+1}) - f(x^*)] \le \epsilon$ at $k = k_\epsilon$, it suffices that $\alpha\, 2^{-k_\epsilon} \le \epsilon$, which justifies the choice $k_\epsilon = \lceil \log_2 \tfrac{\alpha}{\epsilon} \rceil$. By Thm. 5, recall that running the subroutine $\beta$-NGD$(x_k, \eta_k, \gamma_k, t_k)$ requires a query complexity of $2t_k$; hence the total query complexity of Alg. 2 is $2t_1 + 4t(k_\epsilon - 1) = O\big(\tfrac{800\, d\beta}{(\sqrt2 - 1)\alpha}\big(\log_2\tfrac{\alpha}{\epsilon} + \|x_1 - x^*\|_2^2\big)\big)$.
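As a structural illustration of the phase-wise restart scheme (reusing the `beta_ngd` and oracle sketches above; the per-phase step size and perturbation radius are simplified stand-ins for the exact settings of Alg. 2 / Thm. 5, so treat this as a sketch of the scheme rather than the paper's tuning):

```python
import numpy as np

def alpha_beta_ngd(oracle, x1, alpha, beta, d, eps, D):
    """Sketch of Algorithm 2 ((alpha,beta)-NGD) with noiseless duels.

    D is a known upper bound on ||x1 - x*||_2^2. Phase 1 drives the squared
    distance to x* below 1; every later phase halves it, warm-starting from
    the previous phase's output.
    """
    t = 800 * d * beta / ((np.sqrt(2) - 1) * alpha)      # base per-phase budget
    k_eps = int(np.ceil(np.log2(alpha / eps)))           # number of phases
    x = np.array(x1, dtype=float)
    budgets = [int(np.ceil(t * D))] + [int(np.ceil(2 * t))] * (k_eps - 1)
    for k, T in enumerate(budgets, start=1):
        dist_sq = D if k == 1 else 1.0                   # bound on ||x_k - x*||^2
        eps_k = 400 * d * beta * dist_sq / ((np.sqrt(2) - 1) * T)
        eta_k = np.sqrt(eps_k) / (20 * np.sqrt(d * beta))    # step size (Thm. 5 scaling)
        gamma_k = (eps_k / beta) ** 1.5                      # perturbation radius (simplified)
        x = beta_ngd(oracle, x, eta_k, gamma_k, T)           # warm-start next phase from x
    return x
```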
## 6. General Case: Noisy Sign Feedback

Recall from Sec. 2 that in the most general setup we consider noisy comparison feedback such that $\Pr\big(o_t = \mathbf{1}(f_t(y_t) > f_t(x_t))\big) = 1/2 + \nu$ for some $\nu \in (0, 0.5]$. Our proposed algorithms of the earlier sections require knowledge of the true sign feedback for every pairwise query $(x_t, y_t)$; in reality, however, the comparison oracle can be noisy and return incorrect signs, in which case those algorithms fail. We resolve the problem by a resampling trick, described below.

### 6.1. sign-recovery: De-noising the Oracle

Main idea of sign-recovery (Alg. 3). We note that, given any pair $(x, y)$, by re-querying it $O\big(\tfrac{1}{\nu^2}\log\tfrac{1}{\nu^2\delta}\big)$ times one can obtain the true sign feedback $\mathbf{1}(f(x) < f(y))$ with high probability.

Lemma 9. For any dueling pair $(x, y)$ and confidence parameter $\delta \in (0, 1]$, with probability at least $(1 - \delta/2)$ the output $o$ of sign-recovery$(x, y, \delta)$ (Alg. 3) equals the true indicator value $\mathbf{1}(f(x) < f(y))$, using at most $t = O\big(\tfrac{1}{\nu^2}\log\tfrac{1}{\nu^2\delta}\big)$ pairwise queries. More precisely,

$$\Pr\Big(o = \mathbf{1}(f(x) < f(y)) \;\text{ and }\; t = O\big(\tfrac{1}{\nu^2}\log\tfrac{1}{\nu^2\delta}\big)\Big) \ge 1 - \tfrac{\delta}{2},$$

where the probability is taken over the randomness of the observed comparison sequence $\{o_\tau\}_{\tau \in [t]}$.

Remark 4. Note that the algorithm does not require knowledge of the noise parameter $\nu$.

Algorithm 3 sign-recovery$(x, y, \delta)$
1: Input: Dueling pair $(x, y)$; desired confidence $\delta \in (0, 1]$. Initialize $w \leftarrow 0$
2: for $t = 1, 2, \ldots$ do
3:   Play $(x, y)$
4:   Receive the noisy preference $o_t$ (a noisy copy of $\mathbf{1}(f(x) < f(y))$)
5:   Update $w \leftarrow w + o_t$, $p_t(x, y) \leftarrow \tfrac{w}{t}$
6:   $\mathrm{conf}_t := \sqrt{\tfrac{\log(8t^2/\delta)}{2t}}$
7:   $l_t(x, y) := p_t(x, y) - \mathrm{conf}_t$
8:   $l_t(y, x) := 1 - p_t(x, y) - \mathrm{conf}_t$
9:   if either $l_t(x, y) > 1/2$ or $l_t(y, x) > 1/2$: Break
10: end for
11: Compute $o \leftarrow 1$ if $l_t(x, y) > 1/2$, and $o \leftarrow 0$ otherwise
12: Return $o$

Proof sketch (Lem. 9). Let Alg. 3 stop at round $\tau$. The proof uses Hoeffding's inequality, which ensures that, with probability at least $(1 - \delta/2)$, at all iterations $t \in [\tau]$ we have $|p_t(x, y) - \Pr(s = 1)| \le \mathrm{conf}_t$, where $s$ denotes the (noisy) outcome of a single duel on $(x, y)$ (see Lems. 16 and 17 in Appendix C). Equivalently, $|p_t(y, x) - \Pr(s = 0)| \le \mathrm{conf}_t$, where $p_t(y, x) = 1 - p_t(x, y)$ and $\Pr(s = 0) = 1 - \Pr(s = 1)$. We now consider the following cases.

Case 1 ($\mathbf{1}(f(x) < f(y)) = 1$): In this case, at any $t \in [\tau]$, $\Pr(o_t = 1) = \Pr(s = 1) = 1/2 + \nu$. Then $l_t(x, y) = p_t(x, y) - \mathrm{conf}_t \ge \Pr(o_t = 1) - 2\,\mathrm{conf}_t = 1/2 + \nu - 2\,\mathrm{conf}_t$. So we get $l_t(x, y) > 1/2$ whenever $2\,\mathrm{conf}_t < \nu$, i.e., whenever $t > \tfrac{2}{\nu^2}\log\tfrac{8t^2}{\delta}$. The latter is satisfied for any $t \ge \tfrac{8}{\nu^2}\log\tfrac{64}{\nu^2\delta}$; this can be easily verified by setting $a = 2/\nu^2$ and $b = 8/\delta$ in Lem. 18. Then, assuming the confidence bounds of Lem. 17 hold at all $t$, we can safely conclude that the algorithm satisfies the stopping criterion (Line 9 of Alg. 3) within $\tau = O\big(\tfrac{1}{\nu^2}\log\tfrac{1}{\nu^2\delta}\big)$ rounds and returns the correct sign. This establishes both the correctness and the sample complexity of sign-recovery.

Case 2 ($\mathbf{1}(f(x) < f(y)) = 0$): In this case $\Pr(o_t = 0) = 1/2 + \nu$. Then $l_t(y, x) = p_t(y, x) - \mathrm{conf}_t \ge \Pr(o_t = 0) - 2\,\mathrm{conf}_t = 1/2 + \nu - 2\,\mathrm{conf}_t$, and the rest follows exactly as in Case 1 (complete proof in Appendix C).
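A minimal sketch of the resampling routine (illustrative; the confidence radius uses the $\sqrt{\log(8t^2/\delta)/2t}$ form reconstructed above, and `max_queries` is a safety cap that is not part of Alg. 3):

```python
import numpy as np

def sign_recovery(oracle, x, y, delta, max_queries=1_000_000):
    """Sketch of Algorithm 3: de-noise a single duel (x, y) by resampling.

    `oracle(x, y)` returns a noisy bit whose mean is 1/2 + nu if f(x) < f(y)
    and 1/2 - nu otherwise; nu itself is never needed.
    """
    wins = 0
    for t in range(1, max_queries + 1):
        wins += oracle(x, y)
        p_xy = wins / t                                  # empirical win rate of x
        conf = np.sqrt(np.log(8 * t * t / delta) / (2 * t))
        if p_xy - conf > 0.5:        # confident that f(x) < f(y)
            return 1
        if (1 - p_xy) - conf > 0.5:  # confident that f(y) < f(x)
            return 0
    return int(p_xy > 0.5)           # fallback if the safety cap is hit
```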
### 6.2. Algorithms and Analysis

$\beta$-smooth convex functions. We can essentially reuse Alg. 1, except that, since we no longer have access to the true comparison of $f(x'_t)$ and $f(y'_t)$ but only a noisy 1-bit version of it, we need to estimate the true sign with high probability. For this we appeal to our sign-recovery subroutine to obtain the true $o_t$ wherever it is required, i.e., in Lines 7 and 9 of Alg. 1. For completeness, the pseudocode of Robust-$\beta$-NGD (Alg. 4) is given in Appendix C.

Theorem 10 (Noisy optimization, smooth convex functions). Consider $f$ to be $\beta$-smooth. Suppose Robust-$\beta$-NGD (Alg. 4) is run with $\eta$, $\gamma$ and $T_\epsilon = O\big(\tfrac{d\beta D}{\epsilon}\big)$ as prescribed in Thm. 5, where $D \ge \|x_1 - x^*\|_2^2$. Then, for any given $\delta \in (0, 1)$, with probability at least $(1 - \delta)$ (over the randomness of the noisy comparison feedback $o_t$), it returns a point $\tilde x_{T_\epsilon+1}$ with $\mathbb{E}[f(\tilde x_{T_\epsilon+1})] - f(x^*) \le \epsilon$ (the expectation is over the randomness of the algorithm), with sample complexity $O\big(\tfrac{d\beta D}{\epsilon\nu^2}\log\tfrac{d\beta D}{\epsilon\nu^2\delta}\big)$.

Proof sketch. The proof follows precisely that of Thm. 5, since Lem. 9 ensures that at every round $o_t$ is assigned the true comparison outcome with sufficiently high probability; this gives the correctness of the algorithm. The sample complexity simply follows by accounting for the additional $\tilde O\big(\tfrac{1}{\nu^2}\big)$ pairwise queries incurred by the sign-recovery subroutine (per dueling pair) to recover the correct comparison feedback at each round.

$\alpha$-strongly convex and $\beta$-smooth functions. In this case we can again reuse Alg. 2, originally proposed for the noiseless setup. To accommodate the noisy comparison oracle, it suffices to use the robust version of the underlying blackbox, i.e., Robust-$\beta$-NGD, in Lines 3 and 6 of Alg. 2. For completeness, the full algorithm (Alg. 5) is given in Appendix C.

Theorem 11 (Noisy optimization, strongly convex and smooth case). Let $f$ be $\alpha$-strongly convex and $\beta$-smooth. Then, given any $\delta \in (0, 1)$, with probability at least $(1 - \delta)$ (over the randomness of the noisy comparison feedback $o_t$), Alg. 5 (i.e., Alg. 2 using Robust-$\beta$-NGD (Alg. 4) as the underlying blackbox) returns a point $\bar x$ with $\mathbb{E}[f(\bar x)] - f(x^*) \le \epsilon$ (the expectation is over the randomness of the algorithm), with sample complexity $O\big(\tfrac{d\beta}{\nu^2\alpha}\big(\log_2\tfrac{\alpha}{\epsilon} + D\big)\log\tfrac{d\beta D \log(\alpha/\epsilon)}{\nu^2\delta}\big)$, where $D \ge \|x_1 - x^*\|_2^2$.
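The appendix pseudocode is not reproduced here, but structurally the robust variant only swaps each raw oracle call of the `beta_ngd` sketch for a call to `sign_recovery`. A minimal illustration, composing the two earlier sketches (the per-round confidence split `delta / (4 * T)` is an assumed union-bound choice for illustration; Alg. 4's exact schedule lives in Appendix C):

```python
def robust_beta_ngd(noisy_oracle, x1, eta, gamma, T, delta):
    """Sketch of Robust-beta-NGD (Alg. 4): Alg. 1 run on de-noised comparisons.

    Each of the 2T duels issued by beta_ngd is answered by sign_recovery, so
    every comparison bit used by the descent loop is correct with high
    probability (assumed per-duel confidence delta / (4 * T)).
    """
    denoised = lambda x, y: sign_recovery(noisy_oracle, x, y, delta / (4 * T))
    return beta_ngd(denoised, x1, eta, gamma, T)
```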
## 7. Conclusion and Perspective

We address the problem of online convex optimization with comparison feedback and design normalized-gradient-descent-based algorithms that yield fast convergence guarantees for smooth convex functions and for strongly convex and smooth functions. Moving forward, there are many open questions to address, including unifying the class of preference maps (which map a duel score to preference feedback), analyzing the regret-minimization objective, understanding information-theoretic performance limits, and generalizing the framework to general subsetwise preference-based learning problems. The setup of optimization from preference feedback is relatively new and almost unexplored, so the scope for potential future directions is vast.

## Acknowledgments

The work of YM has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 882396), from the Israel Science Foundation (grant number 993/17), and from the Yandex Initiative for Machine Learning at Tel Aviv University. TK was supported in part by the Israeli Science Foundation (ISF) grant no. 2549/19, by the Len Blavatnik and the Blavatnik Family Foundation, and by the Yandex Initiative in Machine Learning.

## References

Abernethy, J. D., Hazan, E., and Rakhlin, A. Competing in the dark: An efficient algorithm for bandit linear optimization. In Conference on Learning Theory, 2008.

Agarwal, A., Dekel, O., and Xiao, L. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pp. 28–40, 2010.

Agarwal, A., Foster, D. P., Hsu, D. J., Kakade, S. M., and Rakhlin, A. Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems, pp. 1035–1043, 2011.

Ailon, N., Karnin, Z. S., and Joachims, T. Reducing dueling bandits to cardinal bandits. In ICML, volume 32, pp. 856–864, 2014.

Bach, F. and Perchet, V. Highly-smooth zero-th order online optimization. In Conference on Learning Theory, pp. 257–283, 2016.

Bubeck, S. Convex optimization: Algorithms and complexity. arXiv preprint arXiv:1405.4980, 2014.

Bubeck, S., Dekel, O., Koren, T., and Peres, Y. Bandit convex optimization: √T regret in one dimension. In Conference on Learning Theory, pp. 266–278, 2015.

Bubeck, S., Lee, Y. T., and Eldan, R. Kernel-based methods for bandit convex optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 72–85. ACM, 2017.

Busa-Fekete, R. and Hüllermeier, E. A survey of preference-based online learning with bandit algorithms. In International Conference on Algorithmic Learning Theory, pp. 18–39. Springer, 2014.

Chen, L., Zhang, M., and Karbasi, A. Projection-free bandit convex optimization. arXiv preprint arXiv:1805.07474, 2018.

Dekel, O., Eldan, R., and Koren, T. Bandit smooth convex optimization: Improving the bias-variance tradeoff. In Advances in Neural Information Processing Systems, pp. 2926–2934, 2015.

Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385–394. Society for Industrial and Applied Mathematics, 2005.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Hazan, E. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.

Hazan, E. and Kale, S. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.

Hazan, E. and Li, Y. An optimal algorithm for bandit convex optimization. arXiv preprint arXiv:1603.04350, 2016.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

Jamieson, K. G., Nowak, R., and Recht, B. Query complexity of derivative-free optimization. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2012.

Komiyama, J., Honda, J., Kashima, H., and Nakagawa, H. Regret lower bound and optimal algorithm in dueling bandit problem. In Conference on Learning Theory, pp. 1141–1154, 2015.

Kumagai, W. Regret analysis for continuous dueling bandit. In Advances in Neural Information Processing Systems, pp. 1489–1498, 2017.

Nesterov, Y. and Spokoiny, V. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
Saha, A. and Tewari, A. Improved regret guarantees for online smooth convex optimization with bandit feedback. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 636–642, 2011.

Sahu, A. K., Zaheer, M., and Kar, S. Towards gradient free and projection free stochastic optimization. arXiv preprint arXiv:1810.03233, 2018.

Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

Shamir, O. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18(52):1–11, 2017.

Wang, Y., Du, S., Balakrishnan, S., and Singh, A. Stochastic zeroth-order optimization in high dimensions. arXiv preprint arXiv:1710.10551, 2017.

Wu, H. and Liu, X. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pp. 649–657, 2016.

Yang, S. and Mohri, M. Optimistic bandit convex optimization. In Advances in Neural Information Processing Systems, pp. 2297–2305, 2016.

Yue, Y. and Joachims, T. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1201–1208, 2009.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936, 2003.