# No Internal Regret with Non-convex Loss Functions

Dravyansh Sharma, Carnegie Mellon University, dravyans@cs.cmu.edu

**Abstract.** Internal regret is a performance measure for an online learning algorithm that captures the change in performance obtained by substituting every occurrence of a given action i with an alternative action j. Algorithms for minimizing internal regret are known for the finite experts setting, including a general reduction to the problem of minimizing external regret for this case. The reduction, however, crucially depends on the finiteness of the action space. In this work we approach the problem of minimizing internal regret for a continuous action space. For the full information setting, we show how to obtain $O(\sqrt{T})$ internal regret for the class of Lipschitz functions, as well as for non-Lipschitz dispersed functions, i.e., where the non-Lipschitzness may not concentrate in a small region of the action space. We also consider extensions to partial feedback settings, and again obtain sublinear internal regret. Finally, we discuss applications of internal regret minimization over continuous spaces to correlated equilibria in pricing problems and auction design, as well as to data-driven hyperparameter tuning.

## Introduction

We consider the problem of repeatedly making decisions in an uncertain environment, using the standard online learning framework. The algorithm or learner has a continuous space C of actions, and the following game is played for T rounds. At each round or time step t, the learner chooses an action (possibly probabilistically), the environment makes its move by choosing a loss function over the space of actions, and the learner then incurs the loss of its chosen action. The most popular metric for measuring the performance of the algorithm is its (external, as it is unrelated to the learner's choices) regret with respect to the best fixed action from C played across all rounds, in hindsight. However, this may not be useful in common cases where no single fixed action works well over the entire course of the algorithm's interaction with its environment. An alternative approach to determine whether the learner performed well in choosing its actions is to compute the regret of substituting the chosen actions with alternative actions. This gives rise to the notions of internal regret and swap regret (Cesa-Bianchi and Lugosi 2003; Stoltz 2005; Blum and Mansour 2007). While external regret has been studied in a wide variety of general settings owing to its popularity, the study of internal and swap regret has largely been limited to the case of a finite action space, also known as the finite experts setting. In this work, we consider the problem of minimizing internal regret in the presence of a continuum of (infinitely many) experts and non-convex loss functions, in both the full information (the learner observes the loss for all experts) and partial feedback (the learner only observes its incurred loss) settings. We particularly focus on the case of Lipschitz losses and piecewise-constant losses, motivated by known applications. Internal regret is closely connected to correlated equilibria in multi-player repeated games (Foster and Vohra 1999; Blum and Mansour 2011).

## Related Work

**Internal regret.** The notion of internal regret was first introduced by (Foster and Vohra 1998) in the context of calibrated forecasting.
Several low internal regret algorithms have been developed, from initial algorithms with convergence guarantees without regret upper bounds (Foster and Vohra 1999; Hart and Mas-Colell 2000), to general potential-based algorithms with low regret bounds (Cesa-Bianchi and Lugosi 2003), as well as algorithms based on reduction to external regret (Blum and Mansour 2007; Stoltz and Lugosi 2005, 2007). Note that these algorithms, as well as the reductions to external regret due to (Stoltz 2005; Blum and Mansour 2007), were obtained for the finite experts setting. A lower bound of $\Omega(\sqrt{NT})$ on the related notion of swap regret for any randomized algorithm was given by (Blum and Mansour 2007), and was improved in the recent work of (Ito 2020) to $\Omega(\sqrt{NT\log N})$.

**Bandit setting.** For the bandit setting, (Stoltz 2005) gave an algorithm with $O(N\sqrt{T\log N})$ regret which runs in exponential time. (Blum and Mansour 2007) obtained a polynomial time algorithm with a slightly worse regret bound of $O(N\sqrt{NT\log N})$ via a reduction argument. Recent work of (Ito 2020) modifies their approach to give a polynomial time algorithm with an $O(N\sqrt{T})$ regret bound, resulting in a more efficient reduction to external regret without needing first-order bounds.

**Other related notions of regret.** (Mohri and Yang 2014) consider a generalization of swap regret called conditional swap regret, which considers all possible modifications of a learner's action sequence that depend on some fixed bounded history. They further generalize it to the notion of transductive regret in (Mohri and Yang 2017), which generalizes (1) external regret, (2) internal regret, (3) swap regret, and (4) conditional swap regret. Sleeping experts is another popular setting, introduced by (Freund et al. 1997), where the set of actions available to the decision algorithm varies over time. Algorithms with low regret are known for this problem (Blum and Mansour 2007; Kleinberg, Niculescu-Mizil, and Sharma 2010) and have applications to calendar tracking (Blum 1997), text categorization (Cohen and Singer 1999), formulating web-search engine queries (Cohen and Singer 1996) and subgroup fairness (Blum and Lykouris 2020). A related notion is wide-range regret (Lehrer 2003; Blum and Mansour 2007; Mohri and Yang 2017).

**Correlated equilibrium.** There is a tight connection between swap regret and correlated equilibrium (Aumann 1974). For a general-sum game with any finite number of players, if every player has zero internal regret when playing a distribution Q over the joint action space, then Q is a correlated equilibrium. In a repeated game setting, if each player uses an action selection algorithm whose swap regret is sublinear in T, then the empirical distribution of the players' actions converges to a correlated equilibrium (Hart and Mas-Colell 2000); in fact, the benefit of a deviation from the resulting correlated equilibrium is bounded by R/T, where R is the largest swap regret of any of the players.

**Data-driven algorithm design.** (Gupta and Roughgarden 2016) define a formal learning framework for selecting algorithms from a family of heuristics or setting hyperparameters. It has been successfully applied to parameter tuning for several combinatorial problems like integer programming, clustering and low-rank approximation (Balcan, Dick, and Vitercik 2018; Bartlett, Indyk, and Wagner 2022).
Prior work on online data-driven parameter selection has focused on external regret against a fixed or dynamic choice of actions (Balcan, Dick, and Vitercik 2018; Sharma, Balcan, and Dick 2020). Our work obtains the first internal regret bounds for online parameter configuration.

## Summary of Results

We define internal regret on a continuum as a limit, and obtain no-internal-regret algorithms in the presence of non-convex loss functions. Specifically:

- In the full information setting, for the case of L-Lipschitz loss functions, we obtain an algorithm which achieves regret $O(\sqrt{dT\log(RLT)})$, where d is the dimension of the Euclidean action space and R is the diameter of the action space. Further, for the case of one-dimensional piecewise constant functions, we obtain an algorithm which achieves regret $O(\sqrt{T\log(KT)})$ under a mild smoothness assumption, where K is a bound on the number of pieces in each loss function. We show that our bounds are near-optimal by providing an $\Omega(\sqrt{T})$ lower bound on the internal regret for our loss functions.
- We extend our results to the partial feedback setting, and again obtain sublinear regret $\tilde{O}\!\big(T^{\frac{d+1}{d+2}}\big)$, where the soft-O notation suppresses factors other than T as well as logarithmic terms in T.
- We provide applications of our results to designing strategies for achieving correlated equilibrium in multiplayer repeated games, and to data-driven hyperparameter tuning via a concrete instantiation for a combinatorial parameter selection problem.

## Notation and Terminology

We assume an adversarial online learning model with a continuum of available actions given by a closed and compact set C. At each time step t, an online learner (or algorithm) A selects a distribution $p_t$ over the action space C. After that, the environment (or adversary) selects a loss function $\ell_t : C \to [0,1]$, where $\ell_t(a)$ is the loss of action $a \in C$ at time t. In the full information setting, the online algorithm receives the complete loss function $\ell_t$ and experiences the loss $\ell^A_t = \int_C \ell_t(a)\,p_t(a)\,da$. In the partial information (or bandit) setting, the online algorithm only receives the loss $\ell_t(a_t)$, where $a_t$ is distributed according to $p_t$. The loss of an action a during the first T time steps is $L_T(a) = \sum_{t=1}^T \ell_t(a)$, and the loss of learner A is $L^A_T = \sum_{t=1}^T \ell^A_t$. The goal in the external regret setting is to design an online algorithm whose loss is close to that of the best fixed action, that is, close to $L^*_T = \min_{a\in C} L_T(a)$. Formally, one wants to minimize the external regret $R_T = L^A_T - L^*_T$.

We define internal regret on continuous action spaces as follows. We assume the action space C is a metric space under some norm $\|\cdot\|$. For $\epsilon > 0$ and actions $a, b \in C$, let $p^{(a,b,\epsilon)}_t$ be the distribution on C formed by removing the probability mass in the ball $B(a,\epsilon)$ and adding it uniformly to the points in $B(b,\epsilon)$, i.e.,

$$p^{(a,b,\epsilon)}_t(x) = \begin{cases} 0 & \text{if } x\in B(a,\epsilon),\ a\notin B(b,\epsilon),\\ p_t(x) + p^{a\to b}_t(x) & \text{if } x\in B(b,\epsilon),\ a\notin B(b,\epsilon),\\ p_t(x) & \text{otherwise},\end{cases}$$

where $p^{a\to b}_t(x) := \frac{\int_{B(a,\epsilon)} p_t(y)\,dy}{\mathrm{vol}(B(b,\epsilon))}$ and $\mathrm{vol}(B(b,\epsilon)) = \int_{B(b,\epsilon)} dy$. For technical simplicity, we define $p^{(a,b,\epsilon)}_t(x) = p_t(x)$ for all $x\in C$ if $a\in B(b,\epsilon)$ (i.e., swapping with almost the same action does not change the distribution). The internal regret is then defined as

$$R^i_T = \max_{a,b\in C}\ \lim_{\epsilon\to 0}\ \sum_{t=1}^{T}\int_C \big(p_t(x) - p^{(a,b,\epsilon)}_t(x)\big)\,\ell_t(x)\,dx,$$

where $\ell_t : C\to[0,1]$ is the loss function at time t. We illustrate this definition with an example.
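For intuition, the following minimal sketch (all losses and probabilities are hypothetical toy values, and `swap_gain` is a helper name introduced here, not notation from the paper) evaluates the limiting swap gain for a discrete distribution supported on a few points of C = [0, 1]: for small enough ϵ, removing the mass in B(a, ϵ) and re-adding it at b reduces to swapping the finite expert a for b, which is exactly the reduction that Lemma 1 below makes precise.

```python
import numpy as np

# Hypothetical toy data: a discrete distribution p_t on three actions in C = [0, 1]
# and a loss function evaluated at those actions (and at candidate targets b).
support = np.array([0.1, 0.5, 0.9])   # C' = support of p_t
p_t = np.array([0.2, 0.5, 0.3])       # probabilities on the support
loss = lambda x: (x - 0.6) ** 2       # hypothetical loss l_t : C -> [0, 1]

def swap_gain(a, b):
    """Limiting gain (as eps -> 0) of moving the mass in B(a, eps) onto b,
    i.e. one round's contribution to the internal-regret definition."""
    if a not in support:
        return 0.0                     # no mass in B(a, eps) for small enough eps
    p_a = p_t[np.where(support == a)[0][0]]
    # the mass p_a stops paying loss(a) and pays loss(b) instead (concentrated at b)
    return p_a * (loss(a) - loss(b))

# Best single-round swap gain over source actions a and a grid of target actions b.
candidates = np.linspace(0, 1, 101)
best = max(swap_gain(a, b) for a in support for b in candidates)
print(f"best single-round swap gain: {best:.4f}")
```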
**Lemma 1.** If $p_t$ is a discrete distribution over a finite subset $C' \subseteq C$ for all $t\in[T]$, then $R^i_T = \max_{b\in C} \hat R^i_T(C'\cup\{b\})$, where $\hat R^i_T(A)$ denotes the standard notion of internal regret over the finite set of experts A (Blum and Mansour 2011).

*Proof.* Since $C'$ is finite, we have $\epsilon_0 = \frac{1}{2}\min_{a\ne b;\ a,b\in C'}\|a-b\| > 0$. Clearly, for any $a\in C$ and $0 < \epsilon < \epsilon_0$, the ball $B(a,\epsilon)$ does not contain any $b\in C'$ with $b\ne a$. Thus $\int_{B(a,\epsilon)} p_t(y)\,dy = p_t(a)$ if $a\in C'$. As above, let $0<\epsilon<\epsilon_0$. For $a,b\in C$, if $a\in B(b,\epsilon)$ the swap gain is 0 by definition. So assume $a\notin B(b,\epsilon)$. If $B(a,\epsilon)\cap C' = \emptyset$, it is easy to see that $p^{(a,b,\epsilon)}_t(x) = p_t(x)$ for all $x\in C$ and again the swap gain is 0 (since $p_t$ is assumed to have no mass outside of $C'$); and this holds for all sufficiently small ϵ unless $a\in C'$. In the remaining case $a\in C'$, intuitively, the probability mass on a is uniformly redistributed to points in $B(b,\epsilon)$, with the mass effectively concentrating at b as $\epsilon\to 0$. Specifically, the limit corresponds to a discrete distribution over $C'\cup\{b\}$, and the result follows from the definitions.

We use [n] to denote the set {1, ..., n}, and $\mathbb{I}[\cdot]$ to denote the 0-1 valued indicator function.

## A Potential Function Analysis

In this section, we describe the potential function based approach due to Cesa-Bianchi and Lugosi (2003) for finite experts. The problem is parametrized by a decision space X, an outcome space Y, and a convex, twice differentiable potential function $\phi : \mathbb{R}^N \to \mathbb{R}_+$. At each step t = 1, 2, ..., the current state is represented by a point $R_{t-1}\in\mathbb{R}^N$, where $R_0 = 0$. The decision maker observes a vector-valued drift function $r_t : X\times Y\to\mathbb{R}^N$ and selects an element $x_t$ of the decision space X. In return, an outcome $y_t\in Y$ is received, and the new state of the problem is the drifted point $R_t = R_{t-1} + r_t(x_t, y_t)$. The goal of the decision maker is to minimize the potential $\phi(R_t)$ for a given t (which might be known or unknown to the decision maker).

To illustrate the above abstraction, consider standard online learning with external regret expressed in this framework. Here, the decision maker is a learner whose goal is to forecast a hidden sequence $y_1, y_2, \dots$ of elements of the outcome space Y. At each time t, the learner computes its guess $x_t \in X$ for the next outcome $y_t$. This guess is based on the advice $f_{1,t},\dots,f_{N,t}\in X$ of N experts from a fixed pool. The guesses of the learner and the experts are then individually scored using a loss function $l : X\times Y\to\mathbb{R}$. The learner's goal is to keep as small as possible the cumulative regret with respect to each expert, which can be modeled within the above abstract decision problem by associating a coordinate with each expert and defining the components $r_{i,t}$ of the drift function $r_t$ by $r_{i,t}(x_t, y_t) = l(x_t, y_t) - l(f_{i,t}, y_t)$ for $i=1,\dots,N$.

The role of the potential function in this framework is to provide a generalized way to measure the size (or distance from the origin) of the cumulative regret, which is captured by the state $R_t$. To obtain the general results, two assumptions are needed on the potential function.

**Assumption 1** (Generalized Blackwell's condition). At each time t, there exists a decision $x_t\in X$ such that
$$\sup_{y_t\in Y}\ \nabla\phi(R_{t-1})\cdot r_t(x_t, y_t) \le 0.$$

Strategies satisfying the above condition tend to keep the point $R_t$ as close as possible to the minimum of the potential, by forcing the drift vector to point away from the gradient of the current potential.
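To make the abstraction concrete, here is a minimal sketch (hypothetical losses; the exponential potential is one illustrative choice, and the sampling rule is the standard way to satisfy the condition in expectation, not necessarily the exact strategy analyzed in the paper) of the external-regret instantiation: the learner keeps the state $R_t$ and samples an expert with probability proportional to the potential gradient, which makes the expected drift orthogonal to the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, eta = 5, 1000, 0.1
R = np.zeros(N)                                  # regret state R_t, one coordinate per expert

for t in range(T):
    expert_losses = rng.uniform(0, 1, size=N)    # hypothetical losses l(f_{i,t}, y_t)
    grad = eta * np.exp(eta * R)                 # gradient of the exponential potential
    p = grad / grad.sum()                        # play expert i with prob. proportional to phi'_i(R_i)
    learner_loss = p @ expert_losses             # expected loss of the randomized learner
    drift = learner_loss - expert_losses         # r_{i,t} = l(x_t, y_t) - l(f_{i,t}, y_t)
    assert grad @ drift <= 1e-9                  # Blackwell condition: grad . drift <= 0 (here = 0)
    R += drift

print("final external regrets per expert:", np.round(R, 2))
```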
This gradient descent approach to sequential decision problems is not new; the prominent eponymous example of a decision strategy of this type is the one used by Blackwell to prove his celebrated approachability theorem (Blackwell 1956), generalizing von Neumann's minimax theorem to vector-valued payoffs. We also need a second assumption, about additivity of the potential function.

**Assumption 2.** The potential $\phi$ can be written as $\phi(u) = \sum_i \phi_i(u_i)$ for all $u = (u_1,\dots,u_N)\in\mathbb{R}^N$, where $\phi_i : \mathbb{R}\to\mathbb{R}_+$ is a nonnegative function of one variable. Typically, $\phi_i$ will be monotonically increasing and convex on $\mathbb{R}$.

Under these assumptions, the following general theorem is given by (Cesa-Bianchi and Lugosi 2003).

**Theorem 2.** Let $\phi$ be a twice differentiable additive potential function and let $r_1, r_2, \dots \in\mathbb{R}^N$ be such that $\nabla\phi(R_{t-1})\cdot r_t \le 0$ for all $t\ge 1$, where $R_t = r_1 + \dots + r_t$. Let $f:\mathbb{R}_+\to\mathbb{R}_+$ be an increasing, concave, and twice differentiable auxiliary function such that, for all $t = 1, 2, \dots$,
$$\sup_{u\in\mathbb{R}^N}\ f'(\phi(u))\sum_{i=1}^{N}\phi_i''(u_i)\,r_{i,t}^2 \le C(r_t)$$
for some non-negative function $C : \mathbb{R}^N\to\mathbb{R}_+$. Then, for all $t = 1, 2, \dots$,
$$f(\phi(R_t)) \le f(\phi(0)) + \frac{1}{2}\sum_{s=1}^{t} C(r_s).$$

In order to apply the above theorem, one must choose a potential function $\phi_i$ and the functions f and C. For minimizing internal regret, one choice is the exponential potential $\phi_i(u) = e^{\eta u}$, with $f(x) = \frac{1}{\eta}\ln x$ and $C(r_t) = \eta\max_i r_{i,t}^2$. (Cesa-Bianchi and Lugosi 2003), Theorem 3, show the existence of a randomized learner which satisfies the Blackwell condition and, by setting the step size parameter $\eta = \sqrt{4\ln N/T}$, conclude that the internal regret satisfies $R^i_T \le 2\sqrt{T\ln N}$.

## No Internal Regret on a Continuum

We first discuss the challenges in extending known approaches for finite experts, i.e., for an action space $C = \{1, 2, \dots, N\}$ instead of a closed compact set. The key idea of the reduction-based approach of (Blum and Mansour 2007) is to design a meta-algorithm which runs N base external regret algorithms and maintains a meta-distribution p over the distributions over actions Q output by the base algorithms, such that p can be computed as the stationary distribution of a finite Markov chain corresponding to the matrix Q. While approaches are known to approximate the stationary distribution of a countably infinite Markov chain (Seneta 1980), it is not clear whether this approach can be extended to the uncountably infinite setting that we consider. Moreover, the regret bounds obtained in (Blum and Mansour 2007) correspond to the stronger notion of swap regret, but the bound is polynomial in N. This is reasonable for small N, but handling the case of large N efficiently is noted as an open problem in (Blum and Mansour 2011). The above approach of (Cesa-Bianchi and Lugosi 2003) provides a computationally inefficient alternative, and bounds the weaker notion of internal regret.

### Lipschitz-continuous Loss Functions

We first consider the case when the loss function $\ell_t : C\to[0,1]$ is an L-Lipschitz continuous function. We assume that the action space $C\subseteq\mathbb{R}^d$ is closed, compact and Euclidean. The key idea is to discretize the action space, i.e., obtain a cover for it using sufficiently many experts such that any point in the action space is ϵ-close to some point in the cover. We then apply the results of (Cesa-Bianchi and Lugosi 2003) to obtain algorithms with low internal regret over this large but finite cover.
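The discretization step itself is the standard greedy cover construction; a minimal sketch follows (the unit square and ϵ = 0.1 are placeholder choices, and the finite candidate grid stands in for the continuous action space). The covering guarantee is formalized in Lemma 3 below.

```python
import numpy as np

def greedy_cover(candidates, eps):
    """Greedily add points until every candidate is within eps of the cover."""
    cover = [candidates[0]]
    for c in candidates:
        if min(np.linalg.norm(c - c2) for c2 in cover) > eps:
            cover.append(c)
    return np.array(cover)

# Placeholder action space: a fine grid over the unit square, covered at scale eps = 0.1.
grid = np.stack(np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50)), -1).reshape(-1, 2)
C_eps = greedy_cover(grid, eps=0.1)
print(len(C_eps), "cover points")   # at most (3R/eps)^d points by the packing argument
```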
To formalize things, we will use the following well-known discretization lemma.

**Lemma 3.** Let $C\subseteq\mathbb{R}^d$ be contained in a ball of radius R. Then the greedy procedure of adding points until no point of C is more than ϵ away from all points in the collection produces an ϵ-cover $C_\epsilon\subseteq C$ such that $|C_\epsilon|\le(3R/\epsilon)^d$ and, for every $a\in C$, there exists $a'\in C_\epsilon$ with $\|a-a'\|_2\le\epsilon$.

*Proof.* The greedy procedure terminates after adding at most $(3R/\epsilon)^d$ points by a standard covering argument.

Our main strategy is to apply the algorithm of (Cesa-Bianchi and Lugosi 2003) over an ϵ-cover $C_\epsilon\subseteq C$ of the action space C. Crucially, the logarithmic dependence of the regret bound on the number of experts allows us to choose a sufficiently small ϵ (even as small as 1/T) so as not to incur significant approximation regret for the Lipschitz function. The algorithm, called the Discretized Exponential Potential Minimizer (DEPoM), is given formally as Algorithm 1.

**Algorithm 1: DEPoM(η, ϵ)**
1: Input: step size $\eta\in[0,1]$, discretization parameter ϵ.
2: Discretize the parameter space C to get $C_\epsilon$ with $|C_\epsilon|\le(3R/\epsilon)^d$ as follows:
3: Select any point $c\in C$ and add it to $C_\epsilon$.
4: while there is a point $c\in C$ with $\min_{c'\in C_\epsilon}\|c - c'\| > \epsilon$ do
5:   add c to $C_\epsilon$.
6: for t = 1, ..., T do
7:   Set $p_t$ as the solution of the linear system (in $N=|C_\epsilon|$ variables): for every $k\in C_\epsilon$, $\sum_{(i,j)}\mathbb{I}[j=k]\,p_{t,j}\,\phi'(R_{(i,j),t-1}) = \sum_{(i,j)}\mathbb{I}[i=k]\,p_{t,j}\,\phi'(R_{(i,j),t-1})$, where the indices range over $(i,j)\in C_\epsilon\times C_\epsilon$.
8:   Play according to $p_t$ and suffer loss $\ell_t(\cdot)$.
9:   Compute the drift $r_{(i,j),t} = p_{j,t}\big(\ell_t(j) - \ell_t(i)\big)$.
10:  Update $R_{(i,j),t} = R_{(i,j),t-1} + r_{(i,j),t}$.

Our main result in this section is the following theorem about the performance of Algorithm 1.

**Theorem 4.** Consider an online learner over an action space $C\subseteq\mathbb{R}^d$ such that C is compact, closed and contained in a ball of radius R. If the learner is faced with a sequence of L-Lipschitz losses $\ell_1,\dots,\ell_T : C\to[0,1]$ and uses Algorithm 1 with $\eta=\sqrt{\frac{4d\ln(3RLT)}{T}}$, $\phi(x)=e^{\eta x}$ and $\epsilon=\frac{1}{LT}$, then the expected internal regret of the learner is $O\big(\sqrt{dT\ln(RLT)}\big)$.

*Proof.* By Lemma 3, the greedy procedure of Algorithm 1 produces an ϵ-cover $C_\epsilon$ with $|C_\epsilon|\le(3R/\epsilon)^d$. Now by Corollary 8 of (Cesa-Bianchi and Lugosi 2003), the internal regret of Algorithm 1 when substituting actions over the ϵ-cover is at most $2\sqrt{T\ln N}$. Substituting $N\le(3R/\epsilon)^d$, we get that the internal regret w.r.t. $C_\epsilon$ is at most $2\sqrt{T\ln\big((3R/\epsilon)^d\big)} = O\big(\sqrt{dT\ln(3R/\epsilon)}\big)$. By the L-Lipschitzness of the loss function, if the substitution is made using some point c in C instead of points in $C_\epsilon$, the learner may incur an additional regret of at most $\epsilon L$ in any round t, relative to the point of $C_\epsilon$ at distance at most ϵ from c. Summing up over all rounds, we get that the internal regret of Algorithm 1 w.r.t. points in C is at most $O\big(\sqrt{dT\ln(3R/\epsilon)}\big) + \epsilon L T$. Plugging in $\epsilon=\frac{1}{LT}$ completes the proof.

**Remark 1.** Note that for the above discretization technique to yield good regret, it was crucial to use an algorithm over finite experts with regret sub-logarithmic in N, the number of experts. For if the regret were, say, $\sqrt{N}$, or otherwise polynomial in N (e.g., the swap regret bound in (Blum and Mansour 2007)), substituting $\epsilon=\frac{1}{LT}$ no longer works. For an appropriate choice of ϵ, that approach yields $O\big(T^{\frac{d+1}{d+2}}\,\mathrm{poly}(d,\log T)\big)$ regret, which has much slower convergence for any d.
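One way to read the linear system in step 7 of Algorithm 1 is as a flow-balance (stationary distribution) condition, which is also how the analogous system is solved in the finite-experts reductions discussed above. The sketch below follows that reading, with power iteration as an assumed solver and hypothetical pairwise states; it is an illustration of the fixed-point step, not necessarily the implementation used in the paper.

```python
import numpy as np

def fixed_point_distribution(R_pairs, eta, iters=1000):
    """Solve p_k * sum_i D[i,k] = sum_j p_j * D[k,j] for all k, where
    D[i,j] = phi'(R_{(i,j)}) = eta * exp(eta * R_{(i,j)}).  This is the stationary
    distribution of a Markov chain whose transition rate from j to k is D[k,j]."""
    D = eta * np.exp(eta * R_pairs)
    np.fill_diagonal(D, 0.0)
    c = D.sum(axis=0).max()                                  # common normalizer
    Q = D.T / c                                              # Q[j,k]: move from j to k
    np.fill_diagonal(Q, 1.0 - Q.sum(axis=1) + np.diag(Q))    # self-loops make rows sum to 1
    p = np.full(len(D), 1.0 / len(D))
    for _ in range(iters):                                   # power iteration
        p = p @ Q
    return p / p.sum()

# Hypothetical pairwise regret states for 4 cover points.
rng = np.random.default_rng(1)
p_t = fixed_point_distribution(rng.normal(size=(4, 4)), eta=0.5)
print(np.round(p_t, 3))
```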
### Dispersed Non-Lipschitz Loss Functions

We now go beyond the case of Lipschitz-continuous functions. If we make no further assumption on the loss functions, then linear swap regret is unavoidable. This follows from a well-known halving adversary, which implies an $\Omega(T)$ lower bound on the external regret using piecewise constant functions. The conclusion follows by recalling that swap regret is always at least as large as external regret. An interesting question is whether the lower bound also extends to internal regret (i.e., when only one action is swapped/rewired by the modification rules). In fact, for swap regret a general lower bound of $\Omega(\sqrt{NT\log N})$ for finite experts was recently developed by (Ito 2020). More specialized lower bounds for internal regret (which is a weaker notion than swap regret) remain an open question even in the finite experts regime.

To circumvent the typical halving adversaries in lower bounds, (Balcan, Dick, and Vitercik 2018) present a necessary and sufficient condition called dispersion for learning piecewise-Lipschitz functions. Roughly speaking, this corresponds to the assumption that the discontinuities (or violations of L-Lipschitzness) of the non-Lipschitz loss functions are not allowed to concentrate in any small region of the action domain. Formally, this is given as follows.

**Definition 1** (Dispersion; Balcan, Dick, and Vitercik 2018). The sequence of random loss functions $\ell_1,\dots,\ell_T$ is β-dispersed for the Lipschitz constant L if, for all T and for all $\epsilon\ge T^{-\beta}$, we have that, in expectation, at most $\tilde O(\epsilon T)$ functions (the soft-O notation suppresses dependence on quantities besides ϵ, T and β, as well as logarithmic terms) are not L-Lipschitz for any pair of points at distance ϵ in the domain C. That is, for all T and for all $\epsilon\ge T^{-\beta}$,
$$\mathbb{E}\Bigg[\max_{\substack{\rho,\rho'\in C\\ \|\rho-\rho'\|_2\le\epsilon}}\ \Big|\big\{\,t\in[T]\ :\ |\ell_t(\rho)-\ell_t(\rho')| > L\|\rho-\rho'\|_2\,\big\}\Big|\Bigg] \le \tilde O(\epsilon T).$$

For simplicity we focus on piecewise constant functions in one dimension with a bounded number of discontinuities in any round. Suppose there are at most K discontinuities in any piecewise constant $\ell_t(\cdot)$, and the probability $p_\epsilon$ that a discontinuity is located in any given interval of width ϵ satisfies $p_\epsilon\le\kappa\epsilon$ for some $\kappa>0$ and all $\epsilon\ge 0$; then the above definition is readily verified. Indeed, in any interval of width $\epsilon\ge T^{-\beta}$, the expected number of discontinuities is $O(\kappa\epsilon)$ in any round, or $O(\kappa\epsilon T)$ across T rounds by linearity of expectation. Now by a VC-dimension based argument (see e.g. Lemma 1 of Balcan, Dick, and Vitercik 2018), the maximum number of discontinuities in any ϵ-interval is at most $O(\kappa\epsilon T) + O\big(\sqrt{T\log(1/\delta)}\big)$ with probability at least $1-\delta$, so the expected maximum number of discontinuities is $O\big(\kappa\epsilon T + \sqrt{T\log(1/\delta)} + \delta T\big)$. Choosing $\delta=\frac{1}{T}$ shows that Definition 1 is satisfied for $\beta=\frac{1}{2}$. We formalize this as the following lemma.

**Lemma 5.** Let $\ell_1,\dots,\ell_T : C\to[0,1]$ be a sequence of piecewise constant functions with $C\subseteq\mathbb{R}$, such that the discontinuities of the loss functions have a κ-bounded distribution (a distribution is κ-bounded if its probability density f satisfies $\sup_x f(x)\le\kappa$; for example, the Gaussian $N(\mu,\sigma^2)$ is $\frac{1}{\sqrt{2\pi}\sigma}$-bounded). Then the sequence of loss functions is 1/2-dispersed in the sense of Definition 1.
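The counting argument behind Lemma 5 is easy to probe numerically. The sketch below uses assumed toy values of T, K and ϵ and a uniform (hence 1-bounded) density for the breakpoint locations, and estimates the maximum number of rounds whose discontinuities land in any single ϵ-interval, the quantity that Definition 1 requires to be $\tilde O(\epsilon T)$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, eps = 10_000, 5, 1 / np.sqrt(10_000)   # beta = 1/2, so eps = T^(-1/2)

# Each round t has K breakpoints drawn from a kappa-bounded density on [0, 1]
# (the uniform density, kappa = 1, is the simplest such choice).
breakpoints = rng.uniform(0, 1, size=(T, K))

# Slide an eps-window over [0, 1] and count, per window, how many rounds have
# at least one breakpoint inside it; dispersion asks this max to be O~(eps * T).
windows = np.arange(0, 1 - eps, eps / 4)
worst = max(
    np.any((breakpoints >= lo) & (breakpoints < lo + eps), axis=1).sum()
    for lo in windows
)
print(f"max rounds hit by one eps-window: {worst}, eps*T = {eps * T:.0f}")
```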
Now the key insight is that, for the one-dimensional piecewise-constant case with at most K discontinuities in any function $\ell_t$, we can effectively reduce the problem to one with $N = O(KT)$ experts, one corresponding to each piece of the (also piecewise-constant) cumulative loss function $\sum_{t=1}^T \ell_t$. The main modification needed relative to Algorithm 1 is in the drift computation: the losses used in the drift are scaled by the width of the constant-loss piece corresponding to each expert. Our main theorem in this section gives an upper bound on the internal regret of the Continuous Exponential Potential Minimizer (CEPoM, Algorithm 2).

**Algorithm 2: CEPoM(η)**
1: Input: step size $\eta\in[0,1]$, experts schedule $C_t\in 2^C$.
2: Initialize $p_1$ as the uniform distribution over $C_1$.
3: Play $a_1$ according to $p_1$.
4: for t = 2, ..., T do
5:   Set $p_t$ as the solution of the linear system (in $N=O(KT)$ variables): for every $k\in C_t$, $\sum_{(i,j)}\mathbb{I}[j=k]\,p_{t,j}\,\phi'(R_{(i,j),t-1}) = \sum_{(i,j)}\mathbb{I}[i=k]\,p_{t,j}\,\phi'(R_{(i,j),t-1})$, where the indices range over $(i,j)\in C_t\times C_t$.
6:   Play according to $p_t$ and suffer loss $\ell_t(\cdot)$.
7:   Set $r_{(i,j),t} = p_{j,t}\big(\ell_t(j)/w_t(j) - \ell_t(i)/w_t(j)\big)$, where $w_t(k)$ is the width of the constant piece corresponding to expert k.
8:   Update $R_{(i,j),t} = R_{(i,j),t-1} + r_{(i,j),t}$.

**Theorem 6.** Consider an online learner over an action space $C\subseteq\mathbb{R}$ such that C is compact and closed. If the learner is faced with a sequence of piecewise-constant losses $\ell_1,\dots,\ell_T : C\to[0,1]$, each with at most K pieces and with κ-bounded discontinuities, and the learner uses Algorithm 2 with $\eta=\sqrt{\frac{4\ln(KT)}{T}}$ and $\phi(x)=e^{\eta x}$, then there exists an experts schedule $C_t$ for which the expected internal regret of the learner is $O\big(\sqrt{T\ln(KT)}\big)$.

**Remark 2.** A computationally efficient implementation with per-iteration time $O(K\log(KT))$ for sampling can be achieved using the interval-tree based algorithm of (Cohen-Addad and Kanade 2017).

**Remark 3.** Extensions beyond the one-dimensional piecewise constant case are possible using techniques from (Balcan, Dick, and Vitercik 2018). The piecewise constant case is the simplest with respect to both the analysis and a computationally efficient implementation.

Finally, we provide a lower bound showing that our results are tight up to logarithmic factors.

**Theorem 7.** Consider an online learner over a compact and closed action space $C\subseteq\mathbb{R}$. There exists a sequence of random piecewise-constant losses $\ell_1,\dots,\ell_T : C\to[0,1]$ such that any online learner has expected internal regret at least $\Omega(\sqrt{T})$ on the sequence.

## Bandit Feedback

For a continuous action space C, the loss function over the entire space may be difficult for a learner to even represent, let alone compute (or observe). Therefore it is more realistic to assume that the learner only observes partial feedback, for example only the loss value $\ell_t(a_t)$ of the played action $a_t\in C$ instead of the entire function $\ell_t(\cdot)$. The typical approach to minimizing external regret under partial feedback is to estimate the losses, form a probability distribution over the experts based on these estimates, and add some uniform exploration to this distribution (in order to ensure reasonable loss estimates for all experts). Our approach to minimizing internal regret uses the same basic ingredients, but additionally adjusts its probability distribution to satisfy a fixed-point constraint (similar to the linear systems in the full information algorithms above). In more detail, at each round t, the learner computes a probability distribution $p_t$ over some subsets of the domain, given by a schedule $C_t$. We use the unbiased loss estimates
$$\hat\ell_t(a) = \frac{\ell_t(a)\,\mathbb{I}[a\in c_t]}{p_t(c_t)},$$
where $c_t\in C_t$ is the set from the experts schedule $C_t$ sampled at round t according to $p_t$. Moreover, for each pair $c_i, c_j\in C_t$ we define the cumulative loss estimate
$$\hat L^{(i,j)}_t = \sum_{s\le t}\ \sum_{c\in C_s} p^{(i,j)}_s(c)\int_{a\in c}\hat\ell_s(a)\,da,$$
where $p^{(i,j)}_s(c)$ agrees with $p_s(c)$ except that $p^{(i,j)}_s(c_i)=0$ and $p^{(i,j)}_s(c_j)=p_s(c_i)+p_s(c_j)$. For any $c_{t,k}\in C_t$, we define $\omega(c_{t,k})$ as the probability mass of the uniform distribution on C over $c_{t,k}$ (as before). The full algorithm is presented as Algorithm 3.

**Algorithm 3: Internal EXP3($\eta_t$, $\gamma_t$)**
1: Input: step-size schedule $\eta_t\in[0,1]$, exploration schedule $\gamma_t\in[0,1]$, experts schedule $C_t\in 2^C$.
2: Initialize $p_1$ as the uniform distribution over $C_1$.
3: Play $a_1$ according to $p_1$.
4: for t = 2, ..., T do
5:   Set $q^{(i,j)}_t = \frac{\exp(-\eta_t \hat L^{(i,j)}_{t-1})}{\sum_{k\ne l}\exp(-\eta_t \hat L^{(k,l)}_{t-1})}$.
6:   Set $\hat p_t$ as the solution of the linear system (in $|C_t|$ variables) $\hat p_{t,k} = \sum_{i\ne j} q^{(i,j)}_t\,\hat p^{(i,j)}_{t,k}$, where $(i,j,k)\in C_t^3$.
7:   Set $p_{t,k} = (1-\gamma_t)\,\hat p_{t,k} + \gamma_t\,\omega(c_{t,k})$.
8:   Play according to $p_t$ and suffer loss $\ell_t(\cdot)$.
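The fixed point in step 6 can again be read as a stationary-distribution computation: the swap weights $q^{(i,j)}_t$ define a stochastic matrix over the experts, $\hat p_t$ is a stationary distribution of that matrix, and exploration is mixed in as in step 7. The sketch below follows that reading for one round, with hypothetical cumulative estimates, uniform interval widths, and power iteration as an assumed solver; it is an illustration, not the paper's implementation.

```python
import numpy as np

def bandit_round(L_hat, eta, gamma, widths, iters=500):
    """One round of the fixed-point-plus-exploration computation.
    L_hat: cumulative loss estimates, one per ordered pair (i, j); diagonal ignored.
    widths: Lebesgue mass of each expert interval, for the uniform exploration term."""
    N = L_hat.shape[0]
    q = np.exp(-eta * L_hat)
    np.fill_diagonal(q, 0.0)
    q /= q.sum()                                   # swap weights q^{(i,j)}, summing to 1
    M = q.copy()                                   # M[i, k]: probability of rewiring i -> k
    np.fill_diagonal(M, 1.0 - M.sum(axis=1))       # stay put with the remaining mass
    p_hat = np.full(N, 1.0 / N)
    for _ in range(iters):                         # stationary distribution of M
        p_hat = p_hat @ M
    omega = widths / widths.sum()                  # uniform distribution over C, per interval
    return (1 - gamma) * p_hat + gamma * omega     # mix in exploration (step 7)

def loss_estimate(observed_loss, played_idx, p, N):
    """Unbiased importance-weighted estimate: nonzero only on the played interval."""
    ell_hat = np.zeros(N)
    ell_hat[played_idx] = observed_loss / p[played_idx]
    return ell_hat

rng = np.random.default_rng(2)
p_t = bandit_round(rng.uniform(0, 5, size=(6, 6)), eta=0.3, gamma=0.05,
                   widths=np.full(6, 1 / 6))
played = rng.choice(6, p=p_t)
print(np.round(p_t, 3))
print("loss estimate:", np.round(loss_estimate(0.4, played, p_t, 6), 2))
```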
As before, we instantiate our algorithm for the continuous Lipschitz setting and the dispersed piecewise constant setting, and obtain the respective regret bounds.

**Theorem 8.** Consider an online learner over an action space $C\subseteq\mathbb{R}^d$ such that C is compact, closed and contained in a ball of radius R. Given a sequence of losses $\ell_1,\dots,\ell_T : C\to[0,1]$ with bandit feedback, Algorithm 3 with $C_t = C_\epsilon$, $\epsilon = \Big(\frac{(3R)^d}{L^2 T}\Big)^{\frac{1}{d+2}}$, and step-size and exploration schedules $\eta_t,\gamma_t$ tuned in terms of the cover size $(3R/\epsilon)^d$ and $\ln(3R/\epsilon)$, achieves an expected internal regret of $O\big(T^{\frac{d+1}{d+2}}\big(\sqrt{dR^d}+L\big)\,\mathrm{polylog}(T)\big)$.

*Proof sketch.* We use discretization to cover the parameter space. In this case, the regret has an unavoidable polynomial dependence on the number of experts. Therefore, we need to choose a coarser cover, with $\epsilon\approx T^{-\frac{1}{d+2}}$ instead of $\epsilon\approx T^{-1}$, in order to get a sublinear regret bound.

For the one-dimensional piecewise-constant loss setting with at most K pieces in any $\ell_t$, we use a fixed schedule $C_t$, unlike in the full information setting, and obtain $O\big(T^{2/3}\,\mathrm{poly}(K,\ln T)\big)$ expected regret. The key difference is that we do not observe the actual intervals on which the loss is constant. We use uniform intervals of carefully tuned width to ensure that discontinuities of the losses do not fall within our intervals with high probability, without blowing up the estimated losses $\hat\ell_t$ (which vary inversely with the probability mass inside the interval).

**Theorem 9.** Consider an online learner over an action space $C\subseteq\mathbb{R}$ such that C is compact and closed. Given a sequence of losses $\ell_1,\dots,\ell_T : C\to[0,1]$ with bandit feedback which are all piecewise constant with at most K pieces, there exists a parameter setting $\eta_t,\gamma_t,C_t$ for Algorithm 3 such that the expected internal regret of the learner is $O\big(T^{2/3}\,\mathrm{poly}(K,\ln T)\big)$.

We conclude this section with a couple of remarks.

**Remark 4.** Our results can be adapted to obtain high probability bounds on the internal regret, using martingale inequalities along the lines of (Auer et al. 2002).

**Remark 5.** We have worked with the d-dimensional Euclidean space, and our loss sequence may be adversarial (up to dispersion constraints). In contrast, (Kleinberg, Slivkins, and Upfal 2008) consider (external) regret bounds for a general metric space but for stochastic losses, and obtain asymptotically tight results that depend on the so-called zooming dimension of the problem.

## Applications

We discuss below two main applications of our results.

### Correlated Equilibria

The relationship between correlated equilibria and internal regret is well known for games with finite action spaces (Foster and Vohra 1999; Blum and Mansour 2011). Here we define and establish the connection in the more general continuous action space setting.

**Definition 2.** A game $G = \langle M, (A_i), (s_i)\rangle$ has a finite set M of m players. Player i has a continuous set $A_i\subseteq C_i$ of actions and a loss function $s_i : A_i\times\big(\times_{j\ne i}A_j\big)\to[0,1]$ that maps the action of player i, for any combination of actions of the other players, to a bounded real number. The goal of each player is to minimize its loss.
A correlated equilibrium is a distribution Q over the joint action space $A_1\times\dots\times A_M$ with the following property. If a vector of actions a is drawn from the distribution Q, player i is given the action $a_i$ from a (but no information regarding the other players' actions). The probability distribution Q is a correlated equilibrium if, for each player i, playing the suggested action is a best response provided that the other players do not deviate. We now formally define an ϵ-correlated equilibrium.

**Definition 3.** A joint probability distribution Q over the joint action space $\times_i A_i$ is an ϵ-correlated equilibrium w.r.t. deviations $(\mathcal{F}_k)_{k\in[M]}$ if for every player i and for any function $F : A_i\to A_i$ with $F\in\mathcal{F}_i$, we have
$$\mathbb{E}_{a\sim Q}[s_i(a_i, a_{-i})] \le \mathbb{E}_{a\sim Q}[s_i(F(a_i), a_{-i})] + \epsilon,$$
where $a_{-i}$ denotes the joint actions of the other players besides player i.

In other words, Q is an ϵ-correlated equilibrium if the expected incentive to deviate is at most ϵ for every player. By choosing the class $\mathcal{F}_k$ to be one that allows swaps of single actions in $A_k$, we obtain a direct connection with internal regret. The following theorem relates the empirical distribution of the actions performed by each player, their internal regret, and the distance from a correlated equilibrium (it generalizes the finite action space versions of Foster and Vohra 1998; Hart and Mas-Colell 2000).

**Theorem 10.** Let $G = \langle M, (A_i), (s_i)\rangle$ be a game and assume that for T time steps each player follows a strategy that has internal regret at most $R_T$. Then the empirical distribution $\hat Q$ of the joint actions played by the players is an $(R_T/T)$-correlated equilibrium, and the loss of each player equals its expected loss on $\hat Q$.

The above theorem states that the payoff of each player is their payoff in some approximate correlated equilibrium. In addition, the theorem relates the internal regret to the distance from a correlated equilibrium. Note that if the average internal regret vanishes, then the game converges in the limit to the set of correlated equilibria. Our internal regret algorithms therefore give strategies for achieving correlated equilibrium in games with L-Lipschitz losses (Azrieli and Shmaya 2013; Deligkas, Fearnley, and Spirakis 2020) or piecewise-constant losses (for example, item pricing and auction design; Morgenstern and Roughgarden 2016; Syrgkanis 2017; Balcan, Sandholm, and Vitercik 2018).
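The quantity bounded in Theorem 10 is easy to estimate from play data. The following sketch uses a hypothetical two-player game with discretized action grids, toy loss tables and toy play counts; `deviation_benefit` is a helper name introduced here. It measures the largest expected gain any player could obtain from a single-action rewiring of the empirical joint distribution, which is the ϵ of Definition 3 for singleton-swap deviations.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 8, 8                       # discretized action grids for players 1 and 2
s1 = rng.uniform(0, 1, (n1, n2))    # hypothetical loss tables s_i(a_1, a_2)
s2 = rng.uniform(0, 1, (n1, n2))

# Empirical joint distribution of play over T rounds (here: hypothetical play counts).
counts = rng.integers(0, 20, (n1, n2)).astype(float)
Q = counts / counts.sum()

def deviation_benefit(Q, s, player):
    """Largest expected decrease in this player's loss from rewiring one action a -> b,
    i.e. the epsilon for which Q is an eps-correlated equilibrium w.r.t. single swaps."""
    best = 0.0
    n = Q.shape[player]
    for a in range(n):
        for b in range(n):
            if player == 0:
                gain = np.sum(Q[a, :] * (s[a, :] - s[b, :]))
            else:
                gain = np.sum(Q[:, a] * (s[:, a] - s[:, b]))
            best = max(best, gain)
    return best

eps = max(deviation_benefit(Q, s1, 0), deviation_benefit(Q, s2, 1))
print(f"empirical play is an {eps:.3f}-correlated equilibrium w.r.t. single swaps")
```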
### Data-driven Hyperparameter Tuning

Data-driven algorithm selection, or parameter tuning, is an approach to tuning continuous hyperparameters of an algorithm by learning over multiple input instances of the problem (Gupta and Roughgarden 2016). The approach has found impactful applications to several fundamental problems, including semi-supervised learning, clustering, linear regression, adversarial robustness and simulated annealing (Balcan and Sharma 2021; Balcan et al. 2017, 2022, 2023; Balcan, Nguyen, and Sharma 2023; Blum, Dan, and Seddighin 2021). The loss functions (as functions of the parameter) for combinatorial optimization algorithms like clustering and greedy knapsack are typically piecewise constant, as the algorithms make different (discrete) choices across the breakpoints (Balcan 2020). Prior work has shown bounded external and tracking regret for online hyperparameter tuning of these algorithms, also assuming the dispersion condition is satisfied (Balcan, Dick, and Vitercik 2018; Sharma, Balcan, and Dick 2020). Our work achieves vanishing internal regret for this problem under the same assumptions.

Formally, for a given algorithmic problem (say clustering, or knapsack), let Π denote the set of problem instances of interest. We also fix a (potentially infinite) family of algorithms $\mathcal{A}$, parameterized by a set $\mathcal{P}\subseteq\mathbb{R}^d$. Let $A_\rho$ denote the algorithm in the family $\mathcal{A}$ parameterized by $\rho\in\mathcal{P}$. The performance of any algorithm on any problem instance is given by a utility function $u : \Pi\times\mathcal{P}\to[0,H]$, i.e., $u(x,\rho)$ measures the performance of algorithm $A_\rho\in\mathcal{A}$ on problem instance $x\in\Pi$. The utility of a fixed algorithm $A_\rho$ from the family is given by $u_\rho : \Pi\to[0,H]$, with $u_\rho(x)=u(x,\rho)$. In our online learning setup, the learner receives the dual class function $u_x : \mathcal{P}\to[0,H]$, with $u_x(\rho)=u_\rho(x)$, which measures the performance of all algorithms of the family on a fixed problem instance $x\in\Pi$ (in particular, the dual function for the problem instance $x_t$ in round t). In the bandit feedback setting, one only receives $u_{x_t}(\rho_t)$ for the parameter $\rho_t$ played by the learner. For many parameterized algorithms, the dual class functions are piecewise constant (Balcan 2020). We instantiate our results for the knapsack problem.

**Greedy Knapsack.** Knapsack is a well-known NP-complete problem. We are given a knapsack with capacity cap and items $i\in[m]$ with sizes $w_i$ and values $v_i$. The goal is to select a subset S of items to add to the knapsack such that $\sum_{i\in S} w_i \le \mathrm{cap}$, while maximizing the total value $\sum_{i\in S} v_i$ of the selected items. The greedy heuristic that adds items in decreasing order of $v_i/w_i$ gives a 2-approximation. We consider a generalization which uses the score $v_i/w_i^\rho$, proposed by (Gupta and Roughgarden 2016), for $\rho\in[0,10]$. For example, for the value-weight pairs $\{(0.99, 1), (0.99, 1), (1.01, 1.01)\}$ and capacity cap = 2, the classic heuristic ρ = 1 gives value 1.01, but using ρ = 3 gives the optimal value 1.98. We can tune the parameter ρ with vanishing internal regret; a sketch of the parameterized heuristic follows below.
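The parameterized heuristic itself is only a few lines; the sketch below (`greedy_knapsack_value` is a helper name introduced here) re-runs the example instance from the text and reproduces the gap between ρ = 1 and ρ = 3. The resulting map ρ ↦ value is exactly the kind of piecewise-constant dual function $u_x(\rho)$ that our online algorithms tune.

```python
def greedy_knapsack_value(values, weights, cap, rho):
    """Total value packed when items are added greedily by the score v_i / w_i^rho."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i] ** rho, reverse=True)
    total_value, used = 0.0, 0.0
    for i in order:
        if used + weights[i] <= cap:        # skip items that no longer fit
            used += weights[i]
            total_value += values[i]
    return total_value

# The example instance from the text: rho = 1 packs only the single big item (value 1.01),
# while rho = 3 prefers the two small items (value 1.98).
values, weights, cap = [0.99, 0.99, 1.01], [1.0, 1.0, 1.01], 2.0
for rho in (1.0, 3.0):
    print(rho, greedy_knapsack_value(values, weights, cap, rho))
```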
**Theorem 11.** Consider instances of the knapsack problem given by bounded weights $w_t\in[1,C]$ and independent values $v_t\in[0,1]$ drawn from some bounded-density distribution (which may change with t), for $t\in[T]$. Then there is an algorithm for learning the parameter ρ of the greedy heuristic family above with expected internal regret $\tilde O(\sqrt{T})$.

## Conclusion

We provide a novel extension of the notion of internal regret to continuous action spaces, and algorithms that achieve no-regret guarantees in the full and bandit information settings. Internal regret based strategies lead to correlated equilibria, which we expect to be impactful in automated mechanism design, and in data-driven algorithm design more broadly. Our research raises several interesting questions for future work, including extensions to other notions of regret (e.g., swap/transductive/dynamic regret), lower bounds in the bandit setting, and computational efficiency for large action space dimension d.

## Acknowledgments

The direction was inspired by a comment from Nina Balcan during a guest lecture by Avrim Blum. We appreciate feedback from Keegan Harris and anonymous reviewers.

## References

Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2002. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1): 48-77.
Aumann, R. J. 1974. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1(1): 67-96.
Azrieli, Y.; and Shmaya, E. 2013. Lipschitz games. Mathematics of Operations Research, 38(2): 350-357.
Balcan, M.-F. 2020. Data-Driven Algorithm Design. In Roughgarden, T., ed., Beyond Worst Case Analysis of Algorithms. Cambridge University Press.
Balcan, M.-F.; Blum, A.; Sharma, D.; and Zhang, H. 2023. An analysis of robustness of non-Lipschitz networks. Journal of Machine Learning Research (JMLR), 24(98): 1-43.
Balcan, M.-F.; Dick, T.; and Vitercik, E. 2018. Dispersion for data-driven algorithm design, online learning, and private optimization. In Symposium on Foundations of Computer Science (FOCS), 603-614. IEEE.
Balcan, M.-F.; Khodak, M.; Sharma, D.; and Talwalkar, A. 2022. Provably tuning the Elastic Net across instances. Advances in Neural Information Processing Systems (NeurIPS), 35: 27769-27782.
Balcan, M.-F.; Nagarajan, V.; Vitercik, E.; and White, C. 2017. Learning-theoretic foundations of algorithm configuration for combinatorial partitioning problems. In Conference on Learning Theory (COLT), 213-274. PMLR.
Balcan, M.-F.; Sandholm, T.; and Vitercik, E. 2018. A general theory of sample complexity for multi-item profit maximization. In Conference on Economics and Computation (EC), 173-174.
Balcan, M.-F.; and Sharma, D. 2021. Data driven semi-supervised learning. Advances in Neural Information Processing Systems (NeurIPS), 34: 14782-14794.
Balcan, N.; Nguyen, A. T.; and Sharma, D. 2023. New Bounds for Hyperparameter Tuning of Regression Problems Across Instances. In Conference on Neural Information Processing Systems (NeurIPS).
Bartlett, P.; Indyk, P.; and Wagner, T. 2022. Generalization bounds for data-driven numerical linear algebra. In Conference on Learning Theory (COLT), 2013-2040. PMLR.
Blackwell, D. 1956. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1): 1-8.
Blum, A. 1997. Empirical support for winnow and weighted-majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26(1): 5-23.
Blum, A.; Dan, C.; and Seddighin, S. 2021. Learning complexity of simulated annealing. In International Conference on Artificial Intelligence and Statistics (AISTATS), 1540-1548. PMLR.
Blum, A.; and Lykouris, T. 2020. Advancing Subgroup Fairness via Sleeping Experts. In Innovations in Theoretical Computer Science Conference (ITCS), volume 11.
Blum, A.; and Mansour, Y. 2007. From external to internal regret. Journal of Machine Learning Research (JMLR), 8(6).
Blum, A.; and Mansour, Y. 2011. Learning, regret minimization, and equilibria. In Algorithmic Game Theory (Chapter 4).
Cesa-Bianchi, N.; and Lugosi, G. 2003. Potential-based algorithms in on-line prediction and game theory. Machine Learning, 51(3): 239-261.
Cohen, W. W.; and Singer, Y. 1996. Learning to query the web. In AAAI Workshop on Internet-Based Information Systems, 16-25.
Cohen, W. W.; and Singer, Y. 1999. Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems (TOIS), 17(2): 141-173.
Cohen-Addad, V.; and Kanade, V. 2017. Online optimization of smoothed piecewise constant functions. In Artificial Intelligence and Statistics (AISTATS), 412-420. PMLR.
Deligkas, A.; Fearnley, J.; and Spirakis, P. 2020. Lipschitz continuity and approximate equilibria. Algorithmica, 82(10): 2927-2954.
Foster, D. P.; and Vohra, R. 1999. Regret in the on-line decision problem. Games and Economic Behavior, 29(1-2): 7-35.
Foster, D. P.; and Vohra, R. V. 1998. Asymptotic calibration. Biometrika, 85(2): 379-390.
Freund, Y.; Schapire, R. E.; Singer, Y.; and Warmuth, M. K. 1997. Using and combining predictors that specialize. In Symposium on Theory of Computing (STOC), 334-343.
Gupta, R.; and Roughgarden, T. 2016. A PAC approach to application-specific algorithm selection. In Innovations in Theoretical Computer Science (ITCS), 123-134.
Hart, S.; and Mas-Colell, A. 2000. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5): 1127-1150.
Ito, S. 2020. A tight lower bound and efficient reduction for swap regret. Advances in Neural Information Processing Systems (NeurIPS), 33: 18550-18559.
Kleinberg, R.; Niculescu-Mizil, A.; and Sharma, Y. 2010. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2): 245-272.
Kleinberg, R.; Slivkins, A.; and Upfal, E. 2008. Multi-armed bandits in metric spaces. In ACM Symposium on Theory of Computing (STOC), 681-690.
Lehrer, E. 2003. A wide range no-regret theorem. Games and Economic Behavior, 42(1): 101-115.
Mohri, M.; and Yang, S. 2014. Conditional swap regret and conditional correlated equilibrium. Advances in Neural Information Processing Systems (NeurIPS), 27.
Mohri, M.; and Yang, S. 2017. Online learning with transductive regret. Advances in Neural Information Processing Systems (NeurIPS), 30.
Morgenstern, J.; and Roughgarden, T. 2016. Learning simple auctions. In Conference on Learning Theory (COLT), 1298-1318. PMLR.
Seneta, E. 1980. Computing the stationary distribution for infinite Markov chains. Linear Algebra and Its Applications, 34: 259-267.
Sharma, D.; Balcan, M.-F.; and Dick, T. 2020. Learning piecewise Lipschitz functions in changing environments. In International Conference on Artificial Intelligence and Statistics (AISTATS), 3567-3577. PMLR.
Stoltz, G. 2005. Incomplete information and internal regret in prediction of individual sequences. Ph.D. thesis, Université Paris Sud-Paris XI.
Stoltz, G.; and Lugosi, G. 2005. Internal regret in on-line portfolio selection. Machine Learning, 59(1): 125-159.
Stoltz, G.; and Lugosi, G. 2007. Learning correlated equilibria in games with compact sets of strategies. Games and Economic Behavior, 59(1): 187-208.
Syrgkanis, V. 2017. A sample complexity measure with applications to learning optimal auctions. Advances in Neural Information Processing Systems (NeurIPS), 30.