# Online Non-Convex Optimization with Imperfect Feedback

Amélie Héliou, Criteo AI Lab, a.heliou@criteo.com
Matthieu Martin, Criteo AI Lab, mat.martin@criteo.com
Panayotis Mertikopoulos, Univ. Grenoble Alpes, CNRS, Inria, LIG & Criteo AI Lab, panayotis.mertikopoulos@imag.fr
Thibaud Rahier, Criteo AI Lab, t.rahier@criteo.com

*34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.*

**Abstract.** We consider the problem of online learning with non-convex losses. In terms of feedback, we assume that the learner observes (or otherwise constructs) an inexact model for the loss function encountered at each stage, and we propose a mixed-strategy learning policy based on dual averaging. In this general context, we derive a series of tight regret minimization guarantees, both for the learner's static (external) regret, as well as the regret incurred against the best dynamic policy in hindsight. Subsequently, we apply this general template to the case where the learner only has access to the actual loss incurred at each stage of the process. This is achieved by means of a kernel-based estimator which generates an inexact model for each round's loss function using only the learner's realized losses as input.

## 1 Introduction

In this paper, we consider the following online learning framework:

1. At each stage $t = 1, 2, \dots$ of a repeated decision process, the learner selects an action $x_t$ from a compact convex subset $\mathcal{K}$ of a Euclidean space $\mathbb{R}^n$.
2. The agent's choice of action triggers a loss $\ell_t(x_t)$ based on an a priori unknown loss function $\ell_t \colon \mathcal{K} \to \mathbb{R}$; subsequently, the process repeats.

If the loss functions $\ell_t$ encountered by the agent are convex, the above framework is the standard online convex optimization setting of Zinkevich [58]; for a survey, see [16, 29, 45] and references therein. In this case, simple first-order methods like online gradient descent (OGD) allow the learner to achieve $O(T^{1/2})$ regret after $T$ rounds [58], a bound which is well known to be min-max optimal in this setting [1, 45]. At the same time, it is also possible to achieve tight regret minimization guarantees against dynamic comparators, such as the regret incurred against the best dynamic policy in hindsight, cf. [11, 21, 23, 30, 33] and references therein.

On the other hand, when the problem's loss functions are not convex, the situation is considerably more difficult. When the losses are generated from a stationary stochastic distribution, the problem can be seen as a version of a continuous-armed bandit in the spirit of Agrawal [3]; in this case, there exist efficient algorithms guaranteeing logarithmic regret by discretizing the problem's search domain and using a UCB-type policy [18, 37, 48]. Otherwise, in an adversarial context, an informed adversary can impose linear regret on any deterministic algorithm employed by the learner [31, 45, 50]; as a result, UCB-type approaches are no longer suitable.

| Feedback | Convex losses: static regret | Convex losses: dynamic regret | Non-convex losses: static regret | Non-convex losses: dynamic regret |
|---|---|---|---|---|
| Exact | $O(T^{1/2})$ [58] | $O(T^{2/3} V_T^{1/3})$ [11] | $O(T^{1/2})$ [38, 50] | $\boldsymbol{O(T^{2/3} V_T^{1/3})}$ |
| Unbiased | $O(T^{1/2})$ [58] | $O(T^{2/3} V_T^{1/3})$ [11] | $\boldsymbol{O(T^{1/2})}$ | $\boldsymbol{O(T^{2/3} V_T^{1/3})}$ |
| Bandit | $O(T^{1/2})$ [17, 19] | $O(T^{4/5} V_T^{1/5})$ [11] | $\boldsymbol{O(T^{(n+2)/(n+3)})}$ | $\boldsymbol{O(T^{(n+3)/(n+4)} V_T^{1/(n+4)})}$ |

Table 1: Overview of related work.
Regarding feedback, an "exact" model means that the learner acquires perfect knowledge of the encountered loss functions; "unbiased" refers to an inexact model that is only accurate on average; finally, "bandit" means that the learner records their incurred loss and has no other information. We only report here the best known bounds in the literature; all bounds derived in this paper are typeset in bold.

In view of this impossibility result, two distinct threads of literature have emerged for online non-convex optimization. One possibility is to examine less demanding measures of regret, like the learner's local regret [31], and focus on first-order methods that minimize it efficiently [28, 31]. Another possibility is to consider randomized algorithms, in which case achieving no regret is possible: Krichene et al. [38] showed that adapting the well-known Hedge (or multiplicative/exponential weights) algorithm to a continuum allows the learner to achieve $O(T^{1/2})$ regret, as in the convex case. This result is echoed in more recent works by Agarwal et al. [2] and Suggala & Netrapalli [50], who analyzed the follow the perturbed leader (FTPL) algorithm of Kalai & Vempala [34] with exponentially distributed perturbations and an offline optimization oracle (exact or approximate); again, the regret achieved by FTPL in this setting is $O(T^{1/2})$, i.e., order-equivalent to that of Hedge in a continuum.

**Our contributions and related work.** A crucial assumption in the above works on randomized algorithms is that, after selecting an action, the learner receives perfect information on the loss function encountered, i.e., an exact model thereof. This is an important limitation for the applicability of these methods, which led to the following question by Krichene et al. [38, p. 8]: "One question is whether one can generalize the Hedge algorithm to a bandit setting, so that sublinear regret can be achieved without the need to explicitly maintain a cover."

To address this open question, we begin by considering a general framework for randomized action selection with imperfect feedback, i.e., with an inexact model of the loss functions encountered at each stage. Our contributions in this regard are as follows:

1. We present a flexible algorithmic template for online non-convex learning based on dual averaging with imperfect feedback [42].
2. We provide tight regret minimization rates, both static and dynamic, under a wide range of different assumptions for the loss models available to the optimizer.
3. We show how this framework can be extended to learning with bandit feedback, i.e., when the learner only observes their realized loss and must construct a loss model from scratch.

Viewed abstractly, the dual averaging (DA) algorithm is an umbrella scheme that contains Hedge as a special case for problems with a simplex-like domain. In the context of online convex optimization, the method is closely related to the well-known follow the regularized leader (FTRL) algorithm of Shalev-Shwartz & Singer [46], the FTPL method of Kalai & Vempala [34], lazy mirror descent (MD) [15, 16, 45], etc. For an appetizer to the vast literature surrounding these methods, we refer the reader to [14, 16, 42, 45, 46, 55, 57] and references therein.

In the non-convex setting, our regret minimization guarantees can be summarized as follows (see also Table 1 above): if the learner has access to inexact loss models that are unbiased and finite in mean square, the DA algorithm achieves in expectation a static regret bound of $O(T^{1/2})$.
Moreover, in terms of the learner's dynamic regret, the algorithm enjoys a bound of $O(T^{2/3} V_T^{1/3})$, where $V_T := \sum_{t=1}^{T} \lVert \ell_{t+1} - \ell_t \rVert_\infty$ denotes the variation of the loss functions encountered over the horizon of play (cf. Section 4 for the details). Importantly, both bounds are order-optimal, even in the context of online convex optimization, cf. [1, 11, 20].

With these general guarantees in hand, we tackle the bandit setting using a kernel smoothing technique in the spirit of Bubeck et al. [19]. This leads to a new algorithm, which we call bandit dual averaging (BDA), and which can be seen as a version of the DA method with biased loss models. The bias of the loss model can be controlled by tuning the radius of the smoothing kernel; however, this comes at the cost of increasing the model's variance, an incarnation of the well-known bias-variance trade-off. By resolving this trade-off, we are finally able to answer the question of Krichene et al. [38] in the positive: BDA enjoys an $O(T^{(n+2)/(n+3)})$ static regret bound and an $O(T^{(n+3)/(n+4)} V_T^{1/(n+4)})$ dynamic regret bound, without requiring an explicit discretization of the problem's search space. This should be contrasted with the case of online convex learning, where it is possible to achieve $O(T^{3/4})$ regret through the use of simultaneous perturbation stochastic approximation (SPSA) techniques [25], or even $O(T^{1/2})$ by means of kernel-based methods [17, 19]. This represents a drastic drop from $O(T^{1/2})$, but it cannot be avoided: the worst-case bound for stochastic non-convex optimization is $\Omega(T^{(n+1)/(n+2)})$ [36, 37], so our static regret bound is nearly optimal in this regard (i.e., up to $O(T^{1/[(n+2)(n+3)]})$, a term which is insignificant for horizons $T \lesssim 10^{12}$). Correspondingly, in the case of dynamic regret minimization, the best known upper bound is $O(T^{4/5} V_T^{1/5})$ for online convex problems [11, 24]. We are likewise not aware of any comparable dynamic regret bounds for online non-convex problems; to the best of our knowledge, our paper is the first to derive dynamic regret guarantees for online non-convex learning with bandit feedback.

We should stress here that, as is often the case for methods based on lifting, much of the computational cost is hidden in the sampling step. This is also the case for the proposed DA method which, like [38], implicitly assumes access to a sampling oracle. Estimating (and minimizing) the per-iteration cost of sampling is an important research direction, but one that lies beyond the scope of the current paper, so we do not address it here.

## 2 Setup and preliminaries

### 2.1. The model.

Throughout the sequel, our only blanket assumption will be as follows:

**Assumption 1.** The stream of loss functions encountered is uniformly bounded Lipschitz, i.e., there exist constants $R, L > 0$ such that, for all $t = 1, 2, \dots$, we have:

1. $|\ell_t(x)| \le R$ for all $x \in \mathcal{K}$; more succinctly, $\lVert \ell_t \rVert_\infty \le R$.
2. $|\ell_t(x') - \ell_t(x)| \le L\, \lVert x' - x \rVert$ for all $x, x' \in \mathcal{K}$.

Other than this meager regularity requirement, we make no structural assumptions on $\ell_t$ (such as convexity, unimodality, or otherwise). In this light, the framework under consideration is akin to the online non-convex setting of Krichene et al. [38], Hazan et al. [31], and Suggala & Netrapalli [50]. The main difference with the setting of Krichene et al. [38] is that the problem's domain $\mathcal{K}$ is assumed convex; this is done for convenience only, to avoid technical subtleties involving uniform fatness conditions and the like.
In terms of playing the game, we will assume that the learner can employ mixed strategies to randomize their choice of action at each stage; however, because this mixing occurs over a continuous domain, defining this randomization requires some care. To that end, let $\mathcal{M} \equiv \mathcal{M}(\mathcal{K})$ denote the space of all finite signed Radon measures on $\mathcal{K}$. Then, a mixed strategy is defined as an element $\pi$ of the set $\Delta \equiv \Delta(\mathcal{K}) \subseteq \mathcal{M}(\mathcal{K})$ of Radon probability measures on $\mathcal{K}$, and the player's expected loss under $\pi$ when facing a bounded loss function $\ell \in L^\infty(\mathcal{K})$ will be denoted as

$$\langle \ell, \pi \rangle := \mathbb{E}_\pi[\ell] = \int_{\mathcal{K}} \ell(x)\, d\pi(x). \tag{1}$$

**Remark 1.** We should note here that $\Delta$ contains a vast array of strategies, including atomic and singular distributions that do not admit a density. For this reason, we will write $\Delta_c$ for the set of strategies that are absolutely continuous relative to the Lebesgue measure $\lambda$ on $\mathcal{K}$, and $\Delta_s$ for the set of singular strategies (which are not); by Lebesgue's decomposition theorem [26], we have $\Delta = \Delta_c \oplus \Delta_s$. By construction, $\Delta$ contains the player's pure strategies, i.e., Dirac point masses $\delta_x$ that select $x \in \mathcal{K}$ with probability 1; however, it also contains pathological strategies that admit neither a density nor a point mass function, such as the Cantor distribution [26]. By contrast, the Radon-Nikodym (RN) derivative $p := d\pi/d\lambda$ of $\pi$ exists for all $\pi \in \Delta_c$, so we will sometimes refer to elements of $\Delta_c$ as "Radon-Nikodym strategies"; in particular, if $\pi \in \Delta_c$, we will not distinguish between $\pi$ and $p$ unless absolutely necessary to avoid confusion.

Much of our analysis will focus on strategies $\chi$ with a piecewise constant density on $\mathcal{K}$, i.e., $\chi = \sum_{i=1}^{k} \alpha_i \mathbb{1}_{C_i}$ for a collection of weights $\alpha_i \ge 0$ and measurable subsets $C_i \subseteq \mathcal{K}$, $i = 1, \dots, k$, such that $\sum_i \alpha_i \lambda(C_i) = 1$. These strategies will be called simple, and the space of simple strategies on $\mathcal{K}$ will be denoted by $\mathcal{X} \equiv \mathcal{X}(\mathcal{K})$. A key fact regarding simple strategies is that $\mathcal{X}$ is dense in $\Delta$ in the weak topology of $\mathcal{M}$ [26, Chap. 3]; as a result, the learner's expected loss under any mixed strategy $\pi$ can be approximated to within arbitrary accuracy $\varepsilon > 0$ by a simple strategy $\chi \in \mathcal{X}$. In addition, when $k$ (or $n$) is not too large, sampling from simple strategies can be done efficiently; for all these reasons, simple strategies will play a key role in the sequel.

### 2.2. Measures of regret.

With all this in hand, the regret of a learning policy $\pi_t$, $t = 1, 2, \dots$, against a benchmark strategy $\pi'$ is defined as

$$\mathrm{Reg}_{\pi'}(T) = \sum_{t=1}^{T} \big[ \mathbb{E}_{\pi_t}[\ell_t] - \mathbb{E}_{\pi'}[\ell_t] \big] = \sum_{t=1}^{T} \langle \ell_t, \pi_t - \pi' \rangle, \tag{2}$$

i.e., as the difference between the player's mean cumulative loss under $\pi_t$ and $\pi'$ over $T$ rounds. In a slight abuse of notation, we write $\mathrm{Reg}_{p'}(T)$ if $\pi'$ admits a density $p'$, and $\mathrm{Reg}_x(T)$ for the regret incurred against the pure strategy $\delta_x$, $x \in \mathcal{K}$. Then, the player's (static) regret under $\pi_t$ is given by

$$\mathrm{Reg}(T) = \max_{x \in \mathcal{K}} \mathrm{Reg}_x(T) = \sup_{\chi \in \mathcal{X}} \mathrm{Reg}_\chi(T), \tag{3}$$

where the maximum is justified by the compactness of $\mathcal{K}$ and the continuity of each $\ell_t$. The lemma below provides a link between pure comparators and their approximants in the spirit of Krichene et al. [38]; to streamline our discussion, we defer the proof to the supplement:

**Lemma 1.** Let $U$ be a convex neighborhood of $x$ in $\mathcal{K}$ and let $\chi \in \mathcal{X}$ be a simple strategy supported on $U$. Then, $\mathrm{Reg}_x(T) \le \mathrm{Reg}_\chi(T) + L\, \mathrm{diam}(U)\, T$.

This lemma will be used to bound the agent's static regret using bounds obtained for simple strategies $\chi \in \mathcal{X}$.
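To make Lemma 1 concrete, the following minimal sketch (in Python, with an illustrative loss function and neighborhood that are not taken from the paper) estimates the expected loss (1) of a uniform simple strategy supported on a small neighborhood $U$ of a target point, and checks the per-round analogue of the bound, $|\langle \ell, \delta_x - \chi \rangle| \le L\, \mathrm{diam}(U)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative non-convex loss on K = [0, 1]; its derivative is bounded by 3 + 5 = 8,
# so L = 8 is a valid Lipschitz modulus in the sense of Assumption 1.
loss = lambda x: np.sin(3 * x) + np.cos(5 * x)
L = 8.0

x_star, radius = 0.4, 0.05              # target point x and neighborhood U = [x - r, x + r]

# chi = uniform density on U is a simple strategy; estimate <loss, chi> as in eq. (1)
# by Monte Carlo sampling from chi.
samples = rng.uniform(x_star - radius, x_star + radius, size=100_000)
expected_loss_chi = loss(samples).mean()

gap = abs(loss(x_star) - expected_loss_chi)
print(gap, "<=", L * (2 * radius))      # per-round version of the bound in Lemma 1
```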
Going beyond static comparisons of this sort, the learner's dynamic regret is defined as

$$\mathrm{DynReg}(T) = \sum_{t=1}^{T} \Big[ \langle \ell_t, \pi_t \rangle - \min_{\pi \in \Delta} \langle \ell_t, \pi \rangle \Big] = \sum_{t=1}^{T} \langle \ell_t, \pi_t - \pi_t^\ast \rangle, \tag{4}$$

where $\pi_t^\ast \in \arg\min_{\pi \in \Delta} \langle \ell_t, \pi \rangle$ is a best response to $\ell_t$ (that such a strategy exists is a consequence of the compactness of $\mathcal{K}$ and the continuity of each $\ell_t$). Compared to its static counterpart, the agent's dynamic regret is considerably more ambitious, and achieving sublinear dynamic regret is not always possible; we examine this issue in detail in Section 4.

### 2.3. Feedback models.

After choosing an action, the agent is only assumed to observe an inexact model $\hat\ell_t \in L^\infty(\mathcal{K})$ of the $t$-th stage loss function $\ell_t$; for concreteness, we will write

$$\hat\ell_t = \ell_t + e_t, \tag{5}$$

where the observation error $e_t$ captures all sources of uncertainty in the player's model. This uncertainty could be random (zero-mean), systematic (non-zero-mean), or both, so it will be convenient to decompose $e_t$ as

$$e_t = z_t + b_t, \tag{6}$$

where $z_t$ is zero-mean and $b_t$ denotes the mean of $e_t$. To define all this formally, we will write $\mathcal{F}_t = \mathcal{F}(\pi_1, \dots, \pi_t)$ for the history of the player's mixed strategy up to stage $t$ (inclusive). The chosen action $x_t$ and the observed model $\hat\ell_t$ are both generated after the player chooses $\pi_t$ so, by default, they are not $\mathcal{F}_t$-measurable. Accordingly, we will collect all randomness affecting $\hat\ell_t$ in an abstract probability law $\mathbb{P}$, and we will write $b_t = \mathbb{E}[e_t \mid \mathcal{F}_t]$ and $z_t = e_t - b_t$; in this way, $\mathbb{E}[z_t \mid \mathcal{F}_t] = 0$ by definition. In view of all this, we will focus on the following descriptors for $\hat\ell_t$:

a) Bias: $\lVert b_t \rVert_\infty \le B_t$ (7a)
b) Variance: $\mathbb{E}[\lVert z_t \rVert_\infty^2 \mid \mathcal{F}_t] \le \sigma_t^2$ (7b)
c) Mean square: $\mathbb{E}[\lVert \hat\ell_t \rVert_\infty^2 \mid \mathcal{F}_t] \le M_t^2$ (7c)

In the above, $B_t$, $\sigma_t$ and $M_t$ are deterministic constants that are to be construed as bounds on the bias, (conditional) variance, and magnitude of the model $\hat\ell_t$ at time $t$. In obvious terminology, a model with $B_t = 0$ will be called unbiased, and an unbiased model with $\sigma_t = 0$ will be called exact.

**Example 1** (Parametric models). An important application of online optimization is the case where the encountered loss functions are of the form $\ell_t(x) = \ell(x; \theta_t)$ for some sequence of parameter vectors $\theta_t \in \mathbb{R}^m$. In this case, the learner typically observes an estimate $\hat\theta_t$ of $\theta_t$, leading to the inexact model $\hat\ell_t = \ell(\cdot\,; \hat\theta_t)$. Importantly, this means that $\hat\ell_t$ does not require infinite-dimensional feedback to be constructed. Moreover, the dependence of $\ell$ on $\theta$ is often linear, so if $\hat\theta_t$ is an unbiased estimate of $\theta_t$, then so is $\hat\ell_t$.

**Example 2** (Online clique prediction). As a specific incarnation of a parametric model, consider the problem of finding the largest complete subgraph (a maximum clique) of an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. This is a key problem in machine learning with applications to social networks [27], data mining [9], gene clustering [49], feature embedding [56], and many other fields. In the online version of the problem, the learner is asked to predict such a clique in a graph $\mathcal{G}_t$ that evolves over time (e.g., a social network), based on partial historical observations of the graph. Then, by the Motzkin-Straus theorem [13, 41], this boils down to an online quadratic program of the form:

$$\text{maximize } u_t(x) = \sum_{i,j=1}^{n} x_i A_{ij,t} x_j \quad \text{subject to } x_i \ge 0, \ \sum_{i=1}^{n} x_i = 1, \tag{MCP}$$

where $A_t = (A_{ij,t})_{i,j=1}^{n}$ denotes the adjacency matrix of $\mathcal{G}_t$. Typically, $\hat A_t$ is constructed by picking a node $i$ uniformly at random, charting out its neighbors, and letting $\hat A_{ij,t} = |\mathcal{V}|/2$ whenever $j$ is connected to $i$. It is easy to check that $\hat A_t$ is an unbiased estimator of $A_t$; as a result, the function $\hat u_t(x) = x^\top \hat A_t x$ is an unbiased model of $u_t$.
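The unbiasedness claim in Example 2 is easy to verify numerically; the sketch below is a minimal sanity check, under the (assumed, since the text leaves it implicit) convention that the estimate is filled in symmetrically for the sampled node and its neighbors.

```python
import numpy as np

def sample_adjacency_estimate(A, rng):
    """One draw of the estimator of Example 2: pick a node uniformly at random,
    chart out its neighbors, and give each incident edge the weight |V|/2."""
    n = A.shape[0]
    i = rng.integers(n)                     # node sampled uniformly at random
    A_hat = np.zeros_like(A)
    neighbors = np.flatnonzero(A[i])        # neighbors of the sampled node
    A_hat[i, neighbors] = n / 2             # fill the sampled row...
    A_hat[neighbors, i] = n / 2             # ...and symmetrize
    return A_hat

rng = np.random.default_rng(0)
n = 6
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T                                 # random undirected graph, no self-loops

# Edge (i, j) is charted whenever i or j is drawn (probability 2/n), so
# E[A_hat[i, j]] = (2/n) * (n/2) = A[i, j], and u_hat(x) = x^T A_hat x is unbiased.
A_bar = np.mean([sample_adjacency_estimate(A, rng) for _ in range(100_000)], axis=0)
print(np.abs(A_bar - A).max())              # should be close to 0
```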
**Example 3** (Online-to-batch). Consider an empirical risk minimization model of the form

$$f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x), \tag{8}$$

where each $f_i \colon \mathbb{R}^n \to \mathbb{R}$ corresponds to a data point (or "sample"). In the online-to-batch formulation of the problem [45], the optimizer draws uniformly at random a sample $i_t \in \{1, \dots, m\}$ at each stage $t = 1, 2, \dots$, and observes $\hat\ell_t = f_{i_t}$. Typically, each $f_i$ is relatively easy to store in closed form, so $\hat\ell_t$ is an easily available unbiased model of the empirical risk function $f$.

## 3 Prox-strategies and dual averaging

The class of non-convex online learning policies that we will consider is based on the general template of dual averaging (DA) / follow the regularized leader (FTRL) methods. Informally, this scheme can be described as follows: at each stage $t = 1, 2, \dots$, the learner plays a mixed strategy that minimizes their cumulative loss up to round $t - 1$ (inclusive) plus a regularization penalty term (hence the "regularized leader" terminology). In the rest of this section, we provide a detailed construction and description of the method.

### 3.1. Randomizing over discrete vs. continuous sets.

We begin by describing the dual averaging method when the underlying action set is finite, i.e., of the form $\mathcal{A} = \{1, \dots, n\}$. In this case, the space of mixed strategies is the $n$-dimensional simplex $\Delta_n := \Delta(\mathcal{A}) = \{p \in \mathbb{R}_+^n : \sum_{i=1}^{n} p_i = 1\}$ and, at each $t = 1, 2, \dots$, the dual averaging algorithm prescribes the mixed strategy

$$p_t \in \arg\min_{p \in \Delta_n} \Big\{ \eta \sum_{s=1}^{t-1} \langle \hat\ell_s, p \rangle + h(p) \Big\}. \tag{9}$$

In the above, $\eta > 0$ is a learning rate parameter and $h \colon \Delta_n \to \mathbb{R}$ is the method's "regularizer", assumed to be continuous and strongly convex over $\Delta_n$. In this way, the algorithm can be seen as tracking the best choice up to the present, modulo a "day 0" regularization component (the "follow the regularized leader" interpretation).

In our case however, the method is to be applied to the infinite-dimensional set $\Delta(\mathcal{K})$ of the learner's mixed strategies, so the issue becomes considerably more involved. To illustrate the problem, consider one of the prototypical regularizer functions, the negentropy $h(p) = \sum_{i=1}^{n} p_i \log p_i$ on $\Delta_n$. If we naïvely try to extend this definition to the infinite-dimensional space $\Delta(\mathcal{K})$, we immediately run into problems: First, for pure strategies, any expression of the form $\sum_{x' \in \mathcal{K}} \delta_x(x') \log \delta_x(x')$ would be meaningless. Second, even if we focus on Radon-Nikodym strategies $p \in \Delta_c$ and use the integral definition $h(p) = \int_{\mathcal{K}} p \log p$, a density like $p(x) \propto 1/(x (\log x)^2)$ on $\mathcal{K} = [0, 1/2]$ has infinite negentropy, implying that even $\Delta_c$ is too large to serve as a domain.
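For intuition on the finite-dimensional prescription (9): with the negentropy regularizer, (9) is solved in closed form by the exponential-weights (Hedge) strategy $p_{t,i} \propto \exp\big(-\eta \sum_{s<t} \hat\ell_s(i)\big)$. The sketch below (with made-up losses, purely for illustration) spells this out.

```python
import numpy as np

def hedge_strategy(cum_losses, eta):
    """Closed-form solution of (9) for h(p) = sum_i p_i log p_i:
    p_i proportional to exp(-eta * cumulative loss of action i)."""
    w = np.exp(-eta * (cum_losses - cum_losses.min()))   # shift for numerical stability
    return w / w.sum()

rng = np.random.default_rng(0)
n_actions, T, eta = 5, 200, 0.1
cum_losses = np.zeros(n_actions)
for t in range(T):
    p_t = hedge_strategy(cum_losses, eta)   # mixed strategy prescribed at round t
    losses = rng.random(n_actions)          # observed (possibly inexact) loss model
    cum_losses += losses                    # aggregate the scores, as in dual averaging
print(p_t)                                  # more mass on actions with lower cumulative loss
```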
### 3.2. Formal construction of the algorithm.

To overcome the issues identified above, our starting point will be that any mixed-strategy incarnation of the dual averaging algorithm must contain at least the space $\mathcal{X} \equiv \mathcal{X}(\mathcal{K})$ of the player's simple strategies. To that end, let $\mathcal{V}$ be an ambient Banach space which contains the set of simple strategies $\mathcal{X}$ as an embedded subset. For technical reasons, we will also assume that the topology induced on $\mathcal{X}$ by the reference norm $\lVert\cdot\rVert$ of $\mathcal{V}$ is not weaker than the natural topology on $\mathcal{X}$ induced by the total variation norm; formally, $\lVert\cdot\rVert_{\mathrm{TV}} \le \alpha \lVert\cdot\rVert$ for some $\alpha > 0$.¹ For example, $\mathcal{V}$ could be the (Banach) space $\mathcal{M}(\mathcal{K})$ of finite signed measures on $\mathcal{K}$, the (Hilbert) space $L^2(\mathcal{K})$ of square-integrable functions on $\mathcal{K}$ endowed with the $L^2$ norm,² etc. With all this in hand, the notion of a regularizer on $\mathcal{X}$ is defined as follows:

**Definition 1.** A regularizer on $\mathcal{X}$ is a lower semi-continuous (l.s.c.) convex function $h \colon \mathcal{V} \to \mathbb{R} \cup \{\infty\}$ such that:

1. $\mathcal{X}$ is a weakly dense subset of the effective domain $\operatorname{dom} h := \{p : h(p) < \infty\}$ of $h$.
2. The subdifferential $\partial h$ of $h$ admits a continuous selection, i.e., there exists a continuous mapping $\nabla h$ on $\operatorname{dom} \partial h := \{p : \partial h(p) \ne \varnothing\}$ such that $\nabla h(q) \in \partial h(q)$ for all $q \in \operatorname{dom} \partial h$.
3. $h$ is strongly convex, i.e., there exists some $K > 0$ such that $h(p) \ge h(q) + \langle \nabla h(q), p - q \rangle + \frac{K}{2} \lVert p - q \rVert^2$ for all $p \in \operatorname{dom} h$, $q \in \operatorname{dom} \partial h$.

The set $\mathcal{Q} := \operatorname{dom} \partial h$ will be called the prox-domain of $h$; its elements will be called prox-strategies.

**Remark.** For completeness, recall that the subdifferential of $h$ at $q$ is the set $\partial h(q) = \{\psi \in \mathcal{V}^\ast : h(p) \ge h(q) + \langle \psi, p - q \rangle \text{ for all } p \in \mathcal{V}\}$; also, lower semicontinuity means that the sublevel sets $\{h \le c\}$ of $h$ are closed for all $c \in \mathbb{R}$. For more details, we refer the reader to Phelps [44].

Some prototypical examples of this general framework are as follows (with more in the supplement):

**Example 4** ($L^2$ regularization). Let $\mathcal{V} = L^2(\mathcal{K})$ and consider the quadratic regularizer $h(p) = \frac12 \lVert p \rVert_2^2 = \frac12 \int_{\mathcal{K}} p^2$ if $p \in \Delta_c \cap L^2(\mathcal{K})$, and $h(p) = \infty$ otherwise. In this case, $\mathcal{Q} = \operatorname{dom} \partial h = \Delta_c \cap L^2(\mathcal{K})$ and $\nabla h(q) = q$ is a continuous selection of $\partial h$ on $\mathcal{Q}$.

**Example 5** (Entropic regularization). Let $\mathcal{V} = \mathcal{M}(\mathcal{K})$ and consider the entropic regularizer $h(p) = \int_{\mathcal{K}} p \log p$ whenever $p$ is a density with finite entropy, and $h(p) = \infty$ otherwise. By Pinsker's inequality, $h$ is 1-strongly convex relative to the total variation norm $\lVert\cdot\rVert_{\mathrm{TV}}$ on $\mathcal{V}$; moreover, we have $\mathcal{Q} = \{q \in \Delta_c : \operatorname{supp}(q) = \mathcal{K}\} \cap \operatorname{dom} h$ and $\nabla h(q) = 1 + \log q$ on $\mathcal{Q}$. In the finite-dimensional case, this regularizer forms the basis of the well-known Hedge (or multiplicative/exponential weights) algorithm [5, 6, 39, 54]; for the infinite-dimensional case, see [38, 43] (and below).

With all this in hand, the dual averaging algorithm can be described by means of the abstract recursion

$$y_{t+1} = y_t - \hat\ell_t, \qquad p_{t+1} = Q(\eta_{t+1} y_{t+1}), \tag{DA}$$

where (i) $t = 1, 2, \dots$ denotes the stage of the process (with the convention $y_0 = \hat\ell_0 = 0$); (ii) $p_t \in \mathcal{Q}$ is the learner's strategy at stage $t$; (iii) $\hat\ell_t \in L^\infty(\mathcal{K})$ is the inexact model revealed at stage $t$; (iv) $y_t \in L^\infty(\mathcal{K})$ is a score variable that aggregates loss models up to stage $t$; (v) $\eta_t > 0$ is a learning rate sequence; and (vi) $Q \colon L^\infty(\mathcal{K}) \to \mathcal{Q}$ is the method's mirror map, viz.

$$Q(\psi) = \arg\max_{p \in \mathcal{V}} \{ \langle \psi, p \rangle - h(p) \}. \tag{10}$$

For a pseudocode implementation, see Alg. 1 below. In the paper's supplement we also show that the method is well-posed, i.e., the arg max in (10) is attained at a valid prox-strategy $p_t \in \mathcal{Q}$. We illustrate this with an example:

**Example 6** (Logit choice). Suppose that $h(p) = \int_{\mathcal{K}} p \log p$ is the entropic regularizer of Example 5. Then, the corresponding mirror map is given in closed form by the logit choice model:

$$\Lambda(\phi) = \frac{\exp(\phi)}{\int_{\mathcal{K}} \exp(\phi)} \quad \text{for all } \phi \in L^\infty(\mathcal{K}). \tag{11}$$

This derivation builds on a series of well-established arguments that we defer to the supplement. Clearly, $\int_{\mathcal{K}} \Lambda(\phi) = 1$ and $\Lambda(\phi) > 0$ as a function on $\mathcal{K}$, so $\Lambda(\phi)$ is a valid prox-strategy.

¹ Since the dual space of $\mathcal{M}(\mathcal{K})$ contains $L^\infty(\mathcal{K})$, we will also view $L^\infty(\mathcal{K})$ as an embedded subset of $\mathcal{V}^\ast$.
² In this case, $\alpha = \sqrt{\lambda(\mathcal{K})}$: this is because $\lVert p \rVert_{\mathrm{TV}}^2 = \big( \int_{\mathcal{K}} |p| \cdot 1 \big)^2 \le \lambda(\mathcal{K}) \int_{\mathcal{K}} p^2 = \lambda(\mathcal{K})\, \lVert p \rVert_2^2$ for $p \in \Delta_c$ (by Cauchy-Schwarz).

**Algorithm 1:** Dual averaging with imperfect feedback [Hedge variant: $Q \leftarrow \Lambda$]
Require: mirror map $Q \colon L^\infty(\mathcal{K}) \to \mathcal{Q}$; learning rate $\eta_t > 0$; initialize $y_1 \leftarrow 0$.
1: for $t = 1, 2, \dots$ do
2:   set $p_t \leftarrow Q(\eta_t y_t)$  [$p_t \leftarrow \Lambda(\eta_t y_t)$ for Hedge]  # update mixed strategy
3:   play $x_t \sim p_t$  # choose action
4:   observe $\hat\ell_t$  # model revealed
5:   set $y_{t+1} \leftarrow y_t - \hat\ell_t$  # update scores
6: end for
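To fix ideas, here is a self-contained numerical sketch of the Hedge variant of Alg. 1 on $\mathcal{K} = [0, 1]$. The logit choice map (11) is evaluated on a discretization grid purely for sampling convenience; the grid, the learning rate schedule, and the noisy loss models are illustrative choices and not part of the algorithm's specification.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 513)           # numerical stand-in for K = [0, 1]
dx = grid[1] - grid[0]
y = np.zeros_like(grid)                     # score variable y_t (aggregated loss models)

def logit_choice(phi):
    """Logit choice map (11): Lambda(phi) = exp(phi) / integral of exp(phi)."""
    w = np.exp(phi - phi.max())             # subtract the max for numerical stability
    return w / (w.sum() * dx)               # density w.r.t. Lebesgue measure on [0, 1]

T = 500
for t in range(1, T + 1):
    eta_t = 1.0 / np.sqrt(t)                # eta_t ~ 1/sqrt(t), cf. Corollary 1 below
    p_t = logit_choice(eta_t * y)           # step 2: p_t = Lambda(eta_t * y_t)
    cdf = np.cumsum(p_t) * dx
    x_t = np.interp(rng.random(), cdf / cdf[-1], grid)        # step 3: sample x_t ~ p_t
    ell_t = 0.5 + 0.5 * np.sin(8 * grid)                      # stage loss (illustrative)
    hat_ell_t = ell_t + 0.1 * rng.standard_normal(grid.shape) # unbiased inexact model
    y = y - hat_ell_t                       # step 5: y_{t+1} = y_t - hat_ell_t

print(grid[np.argmax(p_t)])   # ~0.59: the strategy concentrates near the loss minimizer
```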
## 4 General regret bounds

### 4.1. Static regret guarantees.

We are now in a position to state our first result for (DA):

**Proposition 1.** For any simple strategy $\chi \in \mathcal{X}$, Alg. 1 enjoys the bound

$$\mathrm{Reg}_\chi(T) \le \frac{h(\chi) - \min h}{\eta_{T+1}} + \sum_{t=1}^{T} \langle e_t, \chi - p_t \rangle + \frac{1}{2K} \sum_{t=1}^{T} \eta_t \lVert \hat\ell_t \rVert^2. \tag{12}$$

Proposition 1 is a template bound that we will use to extract static and dynamic regret guarantees in the sequel. Its proof relies on a suitable "energy" function measuring the match between the learner's aggregate model $y_t$ and the comparator $\chi$. The main difficulty is that these variables live in completely different spaces ($L^\infty(\mathcal{K})$ vs. $\mathcal{X}$ respectively), so there is no clear distance metric connecting them. However, since bounded functions $\psi \in L^\infty(\mathcal{K})$ and simple strategies $\chi \in \mathcal{X}$ are naturally paired via duality, they are indirectly connected via the Fenchel-Young inequality

$$\langle \psi, \chi \rangle \le h(\chi) + h^\ast(\psi), \quad \text{where } h^\ast(\psi) = \max_{p \in \mathcal{V}} \{ \langle \psi, p \rangle - h(p) \} \tag{13}$$

denotes the convex conjugate of $h$. We will thus consider the energy function

$$E_t := \eta_t^{-1} \big[ h(\chi) + h^\ast(\eta_t y_t) - \langle \eta_t y_t, \chi \rangle \big]. \tag{14}$$

By construction, $E_t \ge 0$ for all $t$, and $E_t = 0$ if and only if $p_t = Q(\eta_t y_t) = \chi$. More to the point, the defining property of $E_t$ is the following recursive bound (which we prove in the supplement):

**Lemma 2.** For all $\chi \in \mathcal{X}$, we have:

$$E_{t+1} \le E_t + \langle \hat\ell_t, \chi - p_t \rangle + \big( \eta_{t+1}^{-1} - \eta_t^{-1} \big) [h(\chi) - \min h] + \frac{\eta_t}{2K} \lVert \hat\ell_t \rVert^2. \tag{15}$$

Proposition 1 is obtained by telescoping (15); subsequently, to obtain a regret bound for Alg. 1, we must relate $\mathrm{Reg}_x(T)$ to $\mathrm{Reg}_\chi(T)$. This can be achieved by invoking Lemma 1, but the resulting expressions are much simpler when $h$ is decomposable, i.e., $h(p) = \int_{\mathcal{K}} \theta(p(x))\, dx$ for some $C^2$ function $\theta \colon [0, \infty) \to \mathbb{R}$ with $\theta'' > 0$. In this more explicit setting, we have:

**Theorem 1.** Fix $x \in \mathcal{K}$, let $C$ be a convex neighborhood of $x$ in $\mathcal{K}$, and suppose that Alg. 1 is run with a decomposable regularizer $h(p) = \int_{\mathcal{K}} \theta \circ p$. Then, letting $\phi(z) = z\, \theta(1/z)$ for $z > 0$, we have:

$$\mathbb{E}[\mathrm{Reg}_x(T)] \le \frac{\phi(\lambda(C)) - \phi(\lambda(\mathcal{K}))}{\eta_{T+1}} + L\, \mathrm{diam}(C)\, T + 2 \sum_{t=1}^{T} B_t + \frac{\alpha^2}{2K} \sum_{t=1}^{T} \eta_t M_t^2. \tag{16}$$

In particular, if Alg. 1 is run with learning rate $\eta_t \propto 1/t^\rho$, $\rho \in (0, 1)$, and inexact models such that $B_t = O(1/t^\beta)$ and $M_t^2 = O(t^{2\mu})$ for some $\beta, \mu \ge 0$, we have:

$$\mathbb{E}[\mathrm{Reg}(T)] = O\big( \phi(T^{-n\kappa})\, T^\rho + T^{1-\kappa} + T^{1-\beta} + T^{1+2\mu-\rho} \big) \quad \text{for all } \kappa \ge 0. \tag{17}$$

**Corollary 1.** If the learner's feedback is unbiased and bounded in mean square (i.e., $B_t = 0$ and $\sup_t M_t < \infty$), running Alg. 1 with learning rate $\eta_t \propto 1/t^\rho$ guarantees

$$\mathbb{E}[\mathrm{Reg}(T)] = O\big( \phi(T^{-n\rho})\, T^\rho + T^{1-\rho} \big). \tag{18}$$

In particular, for the regularizers of Examples 4 and 5, we have:

1. For $\theta(z) = \frac12 z^2$, Alg. 1 with $\eta_t \propto t^{-1/(n+2)}$ guarantees $\mathbb{E}[\mathrm{Reg}(T)] = O(T^{(n+1)/(n+2)})$.
2. For $\theta(z) = z \log z$, Alg. 1 with $\eta_t \propto t^{-1/2}$ guarantees $\mathbb{E}[\mathrm{Reg}(T)] = O(T^{1/2})$.

**Remark 2.** Here and in the sequel, logarithmic factors are ignored in the Landau $O(\cdot)$ notation. We should also stress that the role of $C$ in Theorem 1 only has to do with the analysis of the algorithm, not with the derived bounds (which are obtained by picking a suitable $C$).

First, in online convex optimization, dual averaging with stochastic gradient feedback achieves $O(\sqrt{T})$ regret irrespective of the choice of regularizer, and this bound is tight [1, 16, 45]. By contrast, in the non-convex setting, the choice of regularizer has a visible impact on the regret because it affects the exponent of $T$: in particular, $L^2$ regularization carries a much worse dependence on $T$ relative to the Hedge variant of Alg. 1. This is due to the term $O(\phi(T^{-n\kappa})\, T^\rho)$ that appears in (17) and is in turn linked to the choice of the enclosure set $C$ having $\lambda(C) \propto T^{-n\kappa}$ for some $\kappa \ge 0$. The negentropy regularizer (and any other regularizer with quasi-linear growth at infinity; see the supplement for additional examples) only incurs a logarithmic dependence on $\lambda(C)$. Instead, the quadratic growth of the $L^2$ regularizer induces an $O(1/\lambda(C))$ term in the algorithm's regret, which is ultimately responsible for the catastrophic dependence on the dimension of $\mathcal{K}$. Seeing as the bounds achieved by the Hedge variant of Alg. 1 are optimal in this regard, we will concentrate on this specific instance in the sequel.
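The difference between the two regularizers in Corollary 1 can be read off directly from the function $\phi(z) = z\, \theta(1/z)$ entering the first term of (16); the quick numerical illustration below (with arbitrarily chosen volumes) contrasts the two behaviors.

```python
import numpy as np

# phi(z) = z * theta(1/z): the quantity entering the first term of the bound (16).
phi_entropy = lambda z: z * ((1.0 / z) * np.log(1.0 / z))   # theta(z) = z log z  ->  log(1/z)
phi_l2      = lambda z: z * (0.5 * (1.0 / z) ** 2)          # theta(z) = z^2 / 2  ->  1 / (2z)

for vol in [1e-2, 1e-4, 1e-8]:       # shrinking enclosure volumes lambda(C)
    print(f"lambda(C) = {vol:.0e}:  entropic phi = {phi_entropy(vol):8.2f},  "
          f"L2 phi = {phi_l2(vol):.2e}")

# The entropic phi grows only logarithmically as lambda(C) -> 0, whereas the quadratic
# one blows up like 1/lambda(C); this is the source of the gap between the O(T^{1/2})
# and O(T^{(n+1)/(n+2)}) rates in Corollary 1.
```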
### 4.2. Dynamic regret guarantees.

We now turn to the dynamic regret minimization guarantees of Alg. 1. In this regard, we note first that, in complete generality, dynamic regret minimization is not possible because an informed adversary can always impose a uniformly positive loss at each stage [45]. Because of this, dynamic regret guarantees are often stated in terms of the variation of the loss functions encountered, namely

$$V_T := \sum_{t=1}^{T} \lVert \ell_{t+1} - \ell_t \rVert_\infty = \sum_{t=1}^{T} \max_{x \in \mathcal{K}} |\ell_{t+1}(x) - \ell_t(x)|, \tag{19}$$

with the convention $\ell_{t+1} = \ell_t$ for $t = T$.³ We then have:

**Theorem 2.** Suppose that the Hedge variant of Alg. 1 is run with learning rate $\eta_t \propto 1/t^\rho$ and inexact models with $B_t = O(1/t^\beta)$ and $M_t^2 = O(t^{2\mu})$ for some $\beta, \mu \ge 0$. Then:

$$\mathbb{E}[\mathrm{DynReg}(T)] = O\big( T^{1+2\mu-\rho} + T^{1-\beta} + T^{2\rho-2\mu} V_T \big). \tag{20}$$

In particular, if $V_T = O(T^\nu)$ for some $\nu < 1$ and the learner's feedback is unbiased and bounded in mean square (i.e., $B_t = 0$ and $\sup_t M_t < \infty$), the choice $\rho = (1 - \nu)/3$ guarantees

$$\mathbb{E}[\mathrm{DynReg}(T)] = O\big( T^{(2+\nu)/3} \big). \tag{21}$$

To the best of our knowledge, Theorem 2 provides the first dynamic regret guarantee for online non-convex problems. The main idea behind its proof is to examine the evolution of play over a series of "windows" of length $O(T^\gamma)$ for some $\gamma > 0$. In so doing, Theorem 1 can be used to obtain a bound for the learner's regret relative to the best action $x \in \mathcal{K}$ within each window. Obviously, if the length of the window is chosen sufficiently small, aggregating the learner's regret per window will be a reasonable approximation of the learner's dynamic regret. At the same time, if the window is taken too small, the number of such windows required to cover $T$ will be $\Theta(T)$, so this approximation becomes meaningless. As a result, to obtain a meaningful regret bound, this window-by-window examination of the algorithm must be carefully aligned with the variation $V_T$ of the loss functions encountered by the learner. Albeit intuitive, the details required to make this argument precise are fairly subtle, so we relegate the proof of Theorem 2 to the paper's supplement.

We should also observe here that the $O(T^{(2+\nu)/3})$ bound of Theorem 2 is, in general, unimprovable, even if the losses are linear. Specifically, Besbes et al. [11] showed that, if the learner is facing a stream of linear losses with stochastic gradient feedback (i.e., an inexact linear model), an informed adversary can still impose $\mathrm{DynReg}(T) = \Omega(T^{2/3} V_T^{1/3})$. Besbes et al. [11] further proposed a scheme to achieve this bound by means of a periodic restart meta-principle that partitions the horizon of play into batches of size $(T/V_T)^{2/3}$ and then runs an algorithm achieving $(T/V_T)^{1/3}$ regret per batch. Theorem 2 differs from the results of Besbes et al. [11] in two key aspects: (a) Alg. 1 does not require a periodic restart schedule (so the learner does not "forget" the information accrued up to a given stage); and (b) more importantly, it applies to general online optimization problems, without a convex structure or any other structural assumptions (though with a different feedback structure).
³ This notion is due to Besbes et al. [11]. Other notions of variation have also been considered [11, 21, 23], as well as other measures of regret, cf. [30, 32]; for a survey, see [20].

## 5 Applications to online non-convex learning with bandit feedback

As an application of the inexact model framework of the previous sections, we proceed to consider the case where the learner only observes their realized loss $\ell_t(x_t)$ and has no other information. In this "bandit setting", an inexact model is not available and must instead be constructed on the fly. When $\mathcal{K}$ is a finite set, $\ell_t$ is a $|\mathcal{K}|$-dimensional vector, and an unbiased estimator for $\ell_t$ can be constructed by setting $\hat\ell_t(x) = [\mathbb{1}\{x = x_t\} / \mathbb{P}(x = x_t)]\, \ell_t(x_t)$ for all $x \in \mathcal{K}$. This importance-weighted estimator is the basis for the EXP3 variant of the Hedge algorithm, which is known to achieve $O(T^{1/2})$ regret [8]. However, in the case of continuous action spaces, there is a key obstacle: if the indicator $\mathbb{1}\{x = x_t\}$ is replaced by a Dirac point mass $\delta_{x_t}(x)$, the resulting loss model $\hat\ell_t \propto \delta_{x_t}$ would no longer be a function but a generalized (singular) distribution, so the dual averaging framework of Alg. 1 no longer applies.

To counter this, we will take a smoothing approach in the spirit of [19] and consider the estimator

$$\hat\ell_t(x) = K_t(x_t, x)\, \ell_t(x_t) / p_t(x_t), \tag{22}$$

where $K_t \colon \mathcal{K} \times \mathcal{K} \to \mathbb{R}$ is a (time-varying) smoothing kernel, i.e., $\int_{\mathcal{K}} K_t(x, x')\, dx' = 1$ for all $x \in \mathcal{K}$. For concreteness (and sampling efficiency), we will assume that losses now take values in $[0, 1]$, and we will focus on simple kernels that are supported on a neighborhood $U_\delta(x) = B_\delta(x) \cap \mathcal{K}$ of $x$ in $\mathcal{K}$ and are constant therein, i.e., $K_\delta(x, x') = [\lambda(U_\delta(x))]^{-1}\, \mathbb{1}\{\lVert x' - x \rVert \le \delta\}$.

The smoothing radius $\delta$ in the definition of $K_\delta$ will play a key role in the choice of loss model being fed to Alg. 1. If $\delta$ is taken too small, $K_\delta$ will approach a point mass, so it will have low estimation error but very high variance; at the other end of the spectrum, if $\delta$ is taken too large, the variance of the induced estimator will be low, but so will its accuracy. In view of this, we will consider a flexible smoothing schedule of the form $\delta_t = 1/t^\mu$ which gradually sharpens the estimator over time as more information comes in. Then, to further protect the algorithm from getting stuck in local minima, we will also incorporate in $p_t$ an explicit exploration term of the form $\varepsilon_t / \lambda(\mathcal{K})$. Putting all this together, we obtain the bandit dual averaging (BDA) algorithm presented in pseudocode form as Alg. 2 below.

**Algorithm 2:** Bandit dual averaging [Hedge variant: $Q \leftarrow \Lambda$]
Require: mirror map $Q \colon L^\infty(\mathcal{K}) \to \mathcal{Q}$; parameters $\eta_t, \delta_t, \varepsilon_t > 0$; initialize $y_1 \leftarrow 0$.
1: for $t = 1, 2, \dots$ do
2:   set $p_t \leftarrow (1 - \varepsilon_t)\, Q(\eta_t y_t) + \varepsilon_t / \lambda(\mathcal{K})$  [$Q \leftarrow \Lambda$ for Hedge]  # mixed strategy
3:   play $x_t \sim p_t$  # choose action
4:   set $\hat\ell_t \leftarrow K_{\delta_t}(x_t, \cdot)\, \ell_t(x_t) / p_t(x_t)$  # payoff model
5:   set $y_{t+1} \leftarrow y_t - \hat\ell_t$  # update scores
6: end for

By employing a slight variation of the analysis presented in Section 4 (basically amounting to a tighter bound in Lemma 2), we obtain the following guarantees for Alg. 2:

**Proposition 2.** Suppose that the Hedge variant of Alg. 2 is run with learning rate $\eta_t \propto 1/t^\rho$ and smoothing/exploration schedules $\delta_t \propto 1/t^\mu$, $\varepsilon_t \propto 1/t^\beta$ respectively. Then, the learner enjoys the bound

$$\mathbb{E}[\mathrm{Reg}(T)] = O\big( T^\rho + T^{1-\mu} + T^{1-\beta} + T^{1+n\mu+\beta-\rho} \big). \tag{23}$$

In particular, if the algorithm is run with $\rho = (n+2)/(n+3)$ and $\mu = \beta = 1/(n+3)$, we obtain the bound $\mathbb{E}[\mathrm{Reg}(T)] = O(T^{(n+2)/(n+3)})$.
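To illustrate the model construction in line 4 of Alg. 2, the sketch below builds the kernel estimate $\hat\ell_t = K_{\delta_t}(x_t, \cdot)\, \ell_t(x_t) / p_t(x_t)$ on a one-dimensional grid and checks, by averaging over draws $x_t \sim p_t$, that it recovers a $\delta$-smoothed version of the true loss. The grid, the loss, and the uniform strategy are illustrative choices; only the single evaluation $\ell_t(x_t)$ is treated as observable.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 513)               # numerical stand-in for K = [0, 1]
dx = grid[1] - grid[0]

def bandit_loss_model(x_t, observed_loss, p_t, delta):
    """Line 4 of Alg. 2: hat_ell_t = K_delta(x_t, .) * ell_t(x_t) / p_t(x_t)."""
    in_window = np.abs(grid - x_t) <= delta      # U_delta(x_t) = B_delta(x_t) within K
    kernel = in_window / (in_window.sum() * dx)  # constant kernel, integrates to 1
    p_at_x = np.interp(x_t, grid, p_t)           # density of the played strategy at x_t
    return kernel * observed_loss / p_at_x

ell_t = lambda x: 0.5 + 0.5 * np.sin(6 * x)      # losses in [0, 1] (illustrative)
p_t = np.ones_like(grid)                         # uniform strategy on [0, 1]
delta = 0.05

# Averaging the model over many draws x_t ~ p_t recovers a delta-smoothed version of
# ell_t: the 1/p_t(x_t) factor exactly compensates for the sampling probability.
avg_model = np.zeros_like(grid)
n_draws = 20_000
for _ in range(n_draws):
    x = rng.random()                             # x_t sampled from the uniform strategy
    avg_model += bandit_loss_model(x, ell_t(x), p_t, delta) / n_draws

interior = (grid > 2 * delta) & (grid < 1 - 2 * delta)
gap = np.abs(avg_model - ell_t(grid))[interior].max()
print(gap)   # small: of order L * delta (smoothing bias) plus Monte Carlo noise
```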
**Proposition 3.** Suppose that the Hedge variant of Alg. 2 is run with parameters as in Proposition 2 against a stream of loss functions with variation $V_T = O(T^\nu)$. Then, the learner enjoys

$$\mathbb{E}[\mathrm{DynReg}(T)] = O\big( T^{1+n\mu+\beta-\rho} + T^{1-\beta} + T^{1-\mu} + T^{\nu+2\rho-n\mu-\beta} \big). \tag{24}$$

In particular, if the algorithm is run with $\rho = (1-\nu)(n+2)/(n+4)$ and $\mu = \beta = (1-\nu)/(n+4)$, we obtain the optimized bound $\mathbb{E}[\mathrm{DynReg}(T)] = O(T^{(n+3+\nu)/(n+4)})$.

To the best of our knowledge, Proposition 3 is the first result of its kind for dynamic regret minimization in online non-convex problems with bandit feedback. We conjecture that the bounds of Propositions 2 and 3 can be tightened further to $O(T^{(n+1)/(n+2)})$ and $O(T^{(n+2+\nu)/(n+3)})$ by dropping the explicit exploration term; we defer this fine-tuning to future work.

## Acknowledgments

P. Mertikopoulos is grateful for financial support by the French National Research Agency (ANR) in the framework of the "Investissements d'avenir" program (ANR-15-IDEX-02), the LabEx PERSYVAL (ANR-11-LABX-0025-01), and MIAI@Grenoble Alpes (ANR-19-P3IA-0003). This research was also supported by the COST Action CA16228 "European Network for Game Theory" (GAMENET).

## Broader Impact

This is a theoretical work which does not present any foreseeable societal consequence.

## References

[1] Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In COLT '08: Proceedings of the 21st Annual Conference on Learning Theory, 2008.
[2] Naman Agarwal, Alon Gonen, and Elad Hazan. Learning in non-convex games with an optimization oracle. In COLT '19: Proceedings of the 32nd Annual Conference on Learning Theory, 2019.
[3] Rajeev Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054-1078, December 1995.
[4] Felipe Alvarez, Jérôme Bolte, and Olivier Brahic. Hessian Riemannian gradient flows in convex programming. SIAM Journal on Control and Optimization, 43(2):477-501, 2004.
[5] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 8(1):121-164, 2012.
[6] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 1995.
[7] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.
[8] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
[9] Balabhaskar Balasundaram, Sergiy Butenko, and Illya V. Hicks. Clique relaxations in social network analysis: The maximum k-plex problem. Operations Research, 59(1):133-142, January 2011.
[10] Claude Berge. Topological Spaces. Dover, New York, 1997.
[11] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227-1244, October 2015.
[12] Avrim Blum and Adam Tauman Kalai. Universal portfolios with and without transaction costs. Machine Learning, 35(3):193-205, 1999.
[13] Immanuel M. Bomze, Marco Budinich, Panos M. Pardalos, and Marcello Pelillo. The maximum clique problem. In Handbook of Combinatorial Optimization. Springer, 1999.
[14] Mario Bravo and Panayotis Mertikopoulos. On the robustness of learning in games with stochastically perturbed payoff observations. Games and Economic Behavior, 103 (John Nash Memorial issue):41-66, May 2017.
[15] Sébastien Bubeck. Convex optimization: Algorithms and complexity.
Foundations and Trends in Machine Learning, 8(3-4):231-358, 2015.
[16] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.
[17] Sébastien Bubeck and Ronen Eldan. Multi-scale exploration of convex functions and bandit convex optimization. In COLT '16: Proceedings of the 29th Annual Conference on Learning Theory, 2016.
[18] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12:1655-1695, 2011.
[19] Sébastien Bubeck, Yin Tat Lee, and Ronen Eldan. Kernel-based methods for bandit convex optimization. In STOC '17: Proceedings of the 49th Annual ACM SIGACT Symposium on the Theory of Computing, 2017.
[20] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[21] Nicolò Cesa-Bianchi, Pierre Gaillard, Gábor Lugosi, and Gilles Stoltz. Mirror descent meets fixed share (and feels no regret). In Advances in Neural Information Processing Systems, volume 25, pp. 989-997, 2012.
[22] Gong Chen and Marc Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538-543, August 1993.
[23] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In COLT '12: Proceedings of the 25th Annual Conference on Learning Theory, 2012.
[24] Benoit Duvocelle, Panayotis Mertikopoulos, Mathias Staudigl, and Dries Vermeulen. Learning in time-varying games. https://arxiv.org/abs/1809.03066, 2018.
[25] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA '05: Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385-394, 2005.
[26] Gerald B. Folland. Real Analysis. Wiley-Interscience, 2nd edition, 1999.
[27] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75-174, 2010.
[28] Nadav Hallak, Panayotis Mertikopoulos, and Volkan Cevher. Regret minimization in stochastic non-convex learning via a proximal-gradient approach. https://arxiv.org/abs/2010.06250, 2020.
[29] Elad Hazan. A survey: The convex optimization approach to regret minimization. In Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright (eds.), Optimization for Machine Learning, pp. 287-304. MIT Press, 2012.
[30] Elad Hazan and Comandur Seshadhri. Efficient learning algorithms for changing environments. In ICML '09: Proceedings of the 26th International Conference on Machine Learning, 2009.
[31] Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games. In ICML '17: Proceedings of the 34th International Conference on Machine Learning, 2017.
[32] Mark Herbster and Manfred K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151-178, 1998.
[33] Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In AISTATS '15: Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.
[34] Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291-307, October 2005.
[35] Narendra Karmarkar. Riemannian geometry underlying interior point methods for linear programming.
In Mathematical Developments Arising from Linear Programming, number 114 in Contemporary Mathematics. American Mathematical Society, 1990.
[36] Robert David Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In NIPS '04: Proceedings of the 18th Annual Conference on Neural Information Processing Systems, 2004.
[37] Robert David Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In STOC '08: Proceedings of the 40th Annual ACM Symposium on the Theory of Computing, 2008.
[38] Walid Krichene, Maximilian Balandat, Claire Tomlin, and Alexandre Bayen. The Hedge algorithm on a continuum. In ICML '15: Proceedings of the 32nd International Conference on Machine Learning, 2015.
[39] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.
[40] Panayotis Mertikopoulos and Zhengyuan Zhou. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1-2):465-507, January 2019.
[41] Theodore S. Motzkin and Ernst G. Straus. Maxima for graphs and a new proof of a theorem of Turán. Canadian Journal of Mathematics, 1965.
[42] Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009.
[43] Steven Perkins, Panayotis Mertikopoulos, and David S. Leslie. Mixed-strategy learning with continuous action sets. IEEE Transactions on Automatic Control, 62(1):379-384, January 2017.
[44] Robert Ralph Phelps. Convex Functions, Monotone Operators and Differentiability. Lecture Notes in Mathematics. Springer-Verlag, 2nd edition, 1993.
[45] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
[46] Shai Shalev-Shwartz and Yoram Singer. Convex repeated games and Fenchel duality. In Advances in Neural Information Processing Systems 19, pp. 1265-1272. MIT Press, 2007.
[47] Aleksandrs Slivkins. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, 2019.
[48] Aleksandrs Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1-286, November 2019.
[49] Victor Spirin and Leonid A. Mirny. Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Sciences, 2003.
[50] Arun Sai Suggala and Praneeth Netrapalli. Online non-convex learning: Following the perturbed leader is optimal. In ALT '20: Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020.
[51] Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479-487, 1988.
[52] Paul Tseng. Convergence properties of Dikin's affine scaling algorithm for nonconvex quadratic minimization. Journal of Global Optimization, 30(2):285-300, 2004.
[53] Robert J. Vanderbei, Marc S. Meketon, and Barry A. Freedman. A modification of Karmarkar's linear programming algorithm. Algorithmica, 1(1):395-407, November 1986.
[54] Vladimir G. Vovk. Aggregating strategies. In COLT '90: Proceedings of the 3rd Workshop on Computational Learning Theory, pp. 371-383, 1990.
[55] Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543-2596, October 2010.
[56] Zhisheng Zhong, Tiancheng Shen, Yibo Yang, Chao Zhang, and Zhouchen Lin. Joint sub-bands learning with clique structures for wavelet domain super-resolution.
In NeurIPS '18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
[57] Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Stephen P. Boyd, and Peter W. Glynn. On the convergence of mirror descent beyond stochastic convex programming. SIAM Journal on Optimization, 30(1):687-716, 2020.
[58] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML '03: Proceedings of the 20th International Conference on Machine Learning, pp. 928-936, 2003.