Online Learning under Adversarial Nonlinear Constraints

Pavel Kolev¹, Georg Martius¹,², and Michael Muehlebach¹
¹Max Planck Institute for Intelligent Systems, Tübingen, Germany
²University of Tübingen, Tübingen, Germany
{pavel.kolev, georg.martius, michael.muehlebach}@tuebingen.mpg.de

In many applications, learning systems are required to process continuous non-stationary data streams. We study this problem in an online learning framework and propose an algorithm that can deal with adversarial time-varying and nonlinear constraints. As we show in our work, the algorithm called Constraint Violation Velocity Projection (CVV-Pro) achieves $\sqrt{T}$ regret and converges to the feasible set at a rate of $1/\sqrt{T}$, despite the fact that the feasible set is slowly time-varying and a priori unknown to the learner. CVV-Pro only relies on local sparse linear approximations of the feasible set and therefore avoids optimizing over the entire set at each iteration, which is in sharp contrast to projected gradient or Frank-Wolfe methods. We also empirically evaluate our algorithm on two-player games, where the players are subjected to a shared constraint.

1 Introduction

Today's machine learning systems are able to combine computation, data, and algorithms at unprecedented scales, which opens up new and exciting avenues in many domains, such as computer vision, computer graphics, speech and text recognition, and robotics [Jordan and Mitchell, 2015]. One of the leading principles that has enabled this progress is the focus on relatively simple pattern recognition and empirical risk minimization approaches, which mostly rely on offline gradient-based optimization and stipulate that the training, validation, and test data are independent and identically distributed.

Somewhat overlooked in these developments is the role of non-stationarity and constraints [Jordan, 2019]. Indeed, emerging machine learning problems involve decision-making in the real world, which typically includes interactions with physical, social, or biological systems. These systems are not only time-varying and affected by past interactions, but their behavior is often characterized via fundamental constraints. Examples include cyber-physical systems where constraints are imposed by the laws of physics, multi-agent systems that are subjected to a shared resource constraint, or a reinforcement learning agent that is subjected to safety and reliability constraints. In particular, in their seminal work, Auer et al. [2002] gave a reduction from the multi-armed bandit setting to the full information online optimization setting, by employing the multiplicative weights framework [Littlestone and Warmuth, 1994]. This classical reduction was recently extended by Sun et al. [2017] to the contextual bandit setting with sequential (time-varying) risk constraints. This motivates our work, which is in line with a recent trend in the machine learning community towards online learning, adaptive decision-making, and online optimization. More precisely, we study an online problem with slowly time-varying constraints, governed by the following interaction protocol (see Assumption 1.2).
In each time step $t$, the learner commits to a decision $x_t$ and then, in addition to a loss value $f_t(x_t)$ with its gradient $\nabla f_t(x_t)$, receives partial information about the current feasible set $C_t := \{x \in \mathbb{R}^n \mid g_t(x) \geq 0\}$, where the constraint function is defined as $g_t(x) := [g_{t,1}(x), \ldots, g_{t,m}(x)]^\top$. The quality of the learner's decision making is measured, for every $T \geq 1$, by comparing to the best decision in hindsight $x^*_T \in \arg\min_{x \in C_T} \sum_{t=1}^{T} f_t(x)$, that is,

$$\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*_T) \quad \text{subject to} \quad g_T(x_T) \geq -\frac{c}{\sqrt{T}}, \qquad (1)$$

which will be shown to be bounded by $O(\sqrt{T})$ for our algorithm. The functions $f_t$ and $g_t$ are restricted to $f_t \in \mathcal{F}$ and $g_t \in \mathcal{G}$ (as defined in Assumption 1.1) and $c > 0$ is an explicit constant.

It is important to note that our performance objective (1) is symmetric in the sense that the constraint $x \in C_T$ applies to both the learner's decision $x_T$ and the benchmark $x^*_T$. This contrasts prior work by Neely and Yu [2017], Yu et al. [2017], Sun et al. [2017], Chen et al. [2017], Cao and Liu [2019], and Liu et al. [2022], where a different notion of constraint violation, $\sum_{t=1}^{T} g_t(x_t) \geq -c_0\sqrt{T}$, is used for the learner, while either a single benchmark $x^*_{1:T}$ satisfies $g_t(x^*_{1:T}) \geq 0$ for all $t \in \{1, \ldots, T\}$ or multiple benchmarks $\{x^*_t\}_{t=1}^{T}$ satisfy $x^*_t \in \arg\min_{x \in C_t} f_t(x)$. Unlike (1), different requirements are imposed on the learner and the benchmark(s), which leads to asymmetric regret formulations: $\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*_{1:T})$ and $\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*_t)$, respectively. Furthermore, as our bound $g_T(x_T) \geq -c/\sqrt{T}$ applies for all $T \geq 1$, it implies the cumulative constraint violation bound in Neely and Yu [2017] up to a constant factor: $\sum_{t=1}^{T} g_t(x_t) \geq -c \sum_{t=1}^{T} 1/\sqrt{t} \geq -2c\sqrt{T}$.

Even more intriguing is the fact that our algorithm is unaware of the feasible sets a priori, and obtains, at each iteration, only a local sparse approximation of $C_t$ based on the first-order information of the violated constraints. The indices of all violated constraints at $x_t$ are captured by the index set $I(x_t) := \{i \in \{1, \ldots, m\} \mid g_{t,i}(x_t) \leq 0\}$, while $G(x_t) := [\nabla g_{t,i}(x_t)]_{i \in I(x_t)}$ denotes the matrix whose columns store the corresponding gradients. In order to guarantee a regret of $O(\sqrt{T})$ in (1) we require the following assumptions.

Assumption 1.1. There exist $R, L_F, L_G > 0$ such that: 1) $\mathcal{F}$ is a class of convex functions, where every $f \in \mathcal{F}$ satisfies $\|\nabla f(x)\| \leq L_F$ for all $x \in B_{4R}$, with $\|\cdot\|$ the $\ell_2$ norm and $B_R$ the hypersphere of radius $R$ centered at the origin; 2) $\mathcal{G}$ is a class of concave $\beta_G$-smooth functions, where every $g \in \mathcal{G}$ satisfies $\|\nabla g(x)\| \leq L_G$ for all $x \in B_{4R}$; 3) the feasible set $C_t$ is non-empty and contained in $B_R$ for all $t$.

We note that these assumptions are standard in online optimization [Hazan, 2016, Ch. 3]. The learner's task is nontrivial even in the case where the feasible set is time-invariant. If the feasible set is time-varying, additional assumptions are required that restrict the amount by which the feasible set is allowed to change. These two assumptions, see Part 2 i) and ii) below, are described by the following interaction protocol between the learner and the environment:

Assumption 1.2 (Interaction protocol). At each time step $t \in \{1, \ldots, T\}$: 1) the learner chooses $x_t$; 2) the environment chooses $f_t \in \mathcal{F}$ and $g_t \in \mathcal{G}$ such that i) $\|g_t(x) - g_{t-1}(x)\| = O(1/t)$, uniformly for all $x \in B_{4R}$, with $\|\cdot\|$ the $\ell_\infty$ norm, and ii) $C_t$ is contained in $Q_t := \cap_{\ell=0}^{t-1} S_\ell$, where $S_t := \{x \in \mathbb{R}^n \mid G(x_t)^\top (x - x_t) \geq 0\}$ is a cone centered at $x_t$ for $t \geq 1$ and $S_0 = \mathbb{R}^n$ (the situation is illustrated in Figure 1; more details are presented in Appendix A); 3) the environment reveals to the learner partial information on the cost, $f_t(x_t)$ and $\nabla f_t(x_t)$, and on all violated constraints, $g_{t,i}(x_t)$ and $\nabla g_{t,i}(x_t)$ for $i \in I(x_t)$.

Figure 1: At each time step, the feasible set $C_t$, contained in a polyhedral intersection $Q_t$, changes slightly and is only partially revealed.

The requirements i) $\|g_t - g_{t-1}\|_\infty = O(1/t)$ and ii) $C_t \subseteq Q_t$ restrict the feasible sets that the environment can choose. We note that despite the fact that $\|g_t - g_{t-1}\|_\infty = O(1/t)$, we have $\|g_1 - g_t\|_\infty = \Theta(\ln t)$, which means that the sequence of functions $g_t$ that defines $C_t$ does not converge in general. As a result, $C_t$ may evolve in such a way that the initial iterates $x_1, x_2, \ldots, x_{t_0}$ achieve a large cost compared to $\min_{x \in C_T} \sum_{t=1}^{T} f_t(x)$, as these are constrained by the sets $C_1, C_2, \ldots, C_{t_0}$, which may be far away from $C_T$. The second requirement, ii) $C_t \subseteq Q_t$, avoids this situation and is therefore key for obtaining an $O(\sqrt{T})$ regret bound.

Our setup differs from traditional online convex optimization [Zinkevich, 2003] in the following two important ways: i) the environment chooses not only the functions $f_t$ but also the nonlinear constraint functions $g_t$; ii) even if $g_t$ is time-invariant, i.e., $g_t = g$ for all $t$, the learner only has access to local information about the feasible set. That is, the information about the feasible set is only revealed piece-by-piece and needs to be acquired by the agent through repeated queries of a constraint violation oracle. We propose an online algorithm that, despite the lack of information about the feasible set, achieves $O(\sqrt{T})$ regret, and we derive explicit non-asymptotic bounds for the regret and the convergence to $C_T$. We thus conclude that our algorithm matches the performance of traditional online projected gradient or Frank-Wolfe schemes, while requiring substantially less information about the feasible set and allowing it to be time-varying. Perhaps equally important is the fact that instead of performing projections onto the full feasible set at each iteration, our algorithm only optimizes over a local sparse linear approximation. If constraints are nonlinear, which includes norm constraints or constraints on the eigenvalues of a matrix, optimizing over the full feasible set at each iteration can be computationally challenging.
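To make the feedback model concrete, the following minimal Python sketch (our own illustration with our own naming, not code from the paper) shows what the constraint violation oracle returns at a query point: only the values and gradients of the currently violated constraints, i.e., the local sparse linear approximation that CVV-Pro works with.

```python
import numpy as np

def violation_oracle(x, g_funcs, g_grads):
    """Partial feedback at x: values and gradients of violated constraints only.

    g_funcs[i](x) evaluates g_{t,i}(x); g_grads[i](x) returns its gradient.
    A constraint is violated when g_{t,i}(x) <= 0 (the feasible set is {g >= 0}).
    """
    I = [i for i in range(len(g_funcs)) if g_funcs[i](x) <= 0.0]
    g_vals = np.array([g_funcs[i](x) for i in I])             # g_{t,i}(x), i in I(x)
    G = np.column_stack([g_grads[i](x) for i in I]) if I else np.zeros((x.size, 0))
    return I, g_vals, G                                        # G's columns define S_t

# Example: two halfspace constraints g_i(x) = b_i - a_i^T x in R^2.
a = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
b = [1.0, 1.0]
g_funcs = [lambda x, i=i: b[i] - a[i] @ x for i in range(2)]
g_grads = [lambda x, i=i: -a[i] for i in range(2)]
print(violation_oracle(np.array([2.0, 0.5]), g_funcs, g_grads))  # only the first constraint is violated
```

Note that nothing about the satisfied constraints, and nothing global about $C_t$, is ever revealed to the learner.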
1.1 Related Work

Online learning has its roots in online or recursive implementations of algorithms, where due to the piece-by-piece availability of data, algorithms are often analyzed in a non-i.i.d. setting. A central algorithm is the multiplicative weights scheme [Freund and Schapire, 1997], where a decider repeatedly chooses between a finite or countable number of options with the aim of minimizing regret. This online learning model not only offers a unifying framework for many classical algorithms [Blum, 1998], but represents a starting point for online convex optimization [Hazan, 2016] and adversarial bandits [Lattimore and Szepesvári, 2020]. Our approach extends this line of work by allowing the environment to not only choose the objective functions $f_t$, but also the constraints $g_t$.

Due to the fact that our learner only obtains local information about the feasible set, our work is somewhat related to Levy and Krause [2019], Lu et al. [2023], Garber and Kretzu [2022], and Mhammedi [2022], where the aim is to reduce the computational effort of performing online projected gradient steps or Frank-Wolfe updates. More precisely, Levy and Krause [2019] propose an algorithm that directly approximates projections, while requiring multiple queries of the constraint functions and their gradients. A slightly different constraint violation oracle is assumed in Garber and Kretzu [2022], where the learner can query separating hyperplanes between a given infeasible point and the feasible set. Algorithmically, both Garber and Kretzu [2022] and Levy and Krause [2019] depart from online gradient descent, where the latter computes projections via an approximate Frank-Wolfe-type scheme. An alternative is provided by Mhammedi [2022] and Lu et al. [2023], where optimizations over the entire feasible set are simplified by querying only a set membership oracle based on the Minkowski functional. While our approach also avoids projections or optimizations over the entire feasible set, we introduce a different constraint violation oracle that returns a local sparse linear approximation of the feasible set. We call the constraint violation oracle only once per iteration and do not require a two-step procedure that involves multiple oracle calls. In addition, we also allow for adversarial time-varying constraints.

Furthermore, there has been important recent work that develops online optimization algorithms with constraints. In contrast to the primal formulation of our algorithm, these works are based on primal-dual formulations, where the algorithm is required to satisfy constraints on average, so-called long-term constraints. The research can be divided into two lines of work, Mahdavi et al. [2012], Jenatton et al. [2016], Yu and Neely [2020] and Yuan and Lamperski [2018], Yi et al. [2021], that use a set of weaker and stricter definitions for constraint violations, respectively, and investigate time-invariant constraints, which contrasts our formulation that includes time-varying constraints. A third line of work by Mannor et al. [2009], Chen et al. [2017], Neely and Yu [2017], Yu et al. [2017], Sun et al. [2017], Cao and Liu [2019], and Liu et al. [2022] focuses on time-varying constraints, where, however, the following weaker notion of constraint violation is used: $\sum_{t=1}^{T} g_t(x_t) \geq -c\sqrt{T}$, where $t$ refers to time and $x_t$ to the learner's decision. This metric allows constraint violations for many iterations, as long as these are compensated by strictly feasible iterates (in the worst case even by a single feasible iterate with a large margin). In contrast, our algorithm satisfies $g_t(x_t) \geq -c/\sqrt{t}$ for all iterations $t \in \{1, \ldots, T\}$, where $c$ is an explicit constant independent of the dimension of the decision variable and the number of constraints. This means that we can explicitly bound the constraint violation at every iteration, whereas infeasible and strictly feasible iterates cannot compensate each other. An important distinction to Neely and Yu [2017] is given by our performance metric (see also the discussion in Neely and Yu [2017] and Liu et al. [2022]).

On the one hand, the work by Chen et al. [2017], Cao and Liu [2019], and Liu et al. [2022] uses $\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*_t)$ as a performance measure, where the iterates $x_t$ are required to satisfy $\sum_{t=1}^{T} g_t(x_t) \geq -c\sqrt{T}$ and the optimal solutions $x^*_t$ satisfy $x^*_t \in \arg\min_{x \in C_t} f_t(x)$. On the other hand, the work by Neely and Yu [2017], Yu et al. [2017], and Sun et al. [2017] uses $\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*_{1:T})$ as a performance measure, where the iterates $x_t$ are required to satisfy $\sum_{t=1}^{T} g_t(x_t) \geq -c\sqrt{T}$ and the optimal solution $x^*_{1:T}$ satisfies $g_t(x^*_{1:T}) \geq 0$ for all $t \in \{1, \ldots, T\}$. This leads to a major asymmetry in the way regret is measured: while the iterates of the online algorithm only need to satisfy a cumulative measure of constraint violation, the benchmark $x^*_{1:T}$, which represents the best fixed decision in hindsight, is required to satisfy all constraints $g_t(x^*_{1:T}) \geq 0$ for $t \in \{1, \ldots, T\}$. In contrast, the performance metric introduced in (1) is symmetric and imposes the same constraint $x \in C_T$ (approximately) on the learner's decision $x_T$ and (exactly) on the benchmark $x^*_T$. These features make our algorithm a valuable addition to the algorithmic toolkit of online constrained optimization.

Castiglioni et al. [2022] studied the following asymmetric setting with an adversarial environment: the benchmark $x^*_T$ belongs to $\arg\min_{x \in X} \sum_{t=1}^{T} f_t(x)$ subject to $\frac{1}{T}\sum_{t=1}^{T} g_t(x) \geq 0$, the online iterates $x_t$ satisfy $\frac{1}{T}\sum_{t=1}^{T} g_t(x_t) \geq -O(1/\sqrt{T})$, and the regret satisfies $\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*_T) \leq O(\sqrt{T})$. Their benchmark and regret formulation can be obtained as a special case of our formulation with time-averaged constraints, that is, when our $g_T(x)$ is chosen as $\frac{1}{T}\sum_{t=1}^{T} g_t(x)$. In contrast, our iterate $x_T$ satisfies $\frac{1}{T}\sum_{t=1}^{T} g_t(x_T) \geq -O(1/\sqrt{T})$, a constraint that is asymptotically the same as the one satisfied by the benchmark $x^*_T$. We further note that they introduced a parameter $\rho = \sup_{x \in X} \min_{t \in [T]} \min_{i \in [m]} g_{t,i}(x)$, which is required to be positive and known to the algorithm for achieving $O(\sqrt{T})$ regret. Notably, $\rho > 0$ implies that the intersection of all feasible sets is non-empty, which is a strong assumption (as is knowledge of the parameter $\rho$). In our formulation with time-averaged constraints, Assumption 1.2 reduces to the requirement that the feasible set $C_t$ belongs to a polyhedral intersection $Q_t$, which does not require a non-empty intersection of all $C_t$; it has a geometrical interpretation, and the assumption $\|g_t - g_{t-1}\|_\infty = O(1/t)$ is automatically satisfied. Thus, there are situations where the regret bound from Castiglioni et al. [2022] becomes vacuous (for $\rho = 0$), while our method still provably achieves $O(\sqrt{T})$ regret. Additional differences are that Castiglioni et al. [2022] consider primal-dual methods and assume that all constraints are revealed after every iteration, whereas our method is primal-only and has only partial information about all violated constraints. The latter point reduces computation and simplifies projections onto the velocity polyhedron, but requires a nontrivial inductive argument for establishing $O(\sqrt{T})$ regret.

Other relevant related studies have investigated online learning problems with supply/budget constraints. In these settings, the decision maker must choose a sequence of actions that maximizes their expected reward while ensuring that a set of resource constraints is not violated. The process terminates either after a pre-specified time horizon has been reached or when the total consumption of some resource exceeds its budget. Badanidiyuru et al. [2018] introduced the bandits with knapsacks framework, which considers bandit feedback and stochastic objective and constraint functions.
They proposed an optimal algorithm for this problem, which was later improved by Agrawal and Devanur [2014, 2019] and Immorlica et al. [2022]. Immorlica et al. [2022] introduced the adversarial bandits with knapsacks setting and showed that an appropriate benchmark for this setting is the best fixed distribution over arms. Since no-regret is no longer possible under this benchmark, they provide no-α-regret guarantees for their algorithm.

An important special case of our online learning model arises when the environment is represented by an adversarial player that competes with the learner. This corresponds to a repeated generalized Nash game, due to the constraint that couples the decisions of the learner and its adversary. If the adversary plays best response, the resulting equilibria are characterized by quasi-variational inequalities [Facchinei and Kanzow, 2007], and there has been important recent work, for example by Jordan et al. [2023], Kim [2021], and Facchinei and Kanzow [2010], that proposes different gradient and penalty methods for solving these inequalities. Our approach adopts a different perspective, rooted in online learning, which allows us to derive non-asymptotic convergence results for a first-order gradient-based algorithm that can be implemented in a straightforward manner. Our approach is also inspired by the recent work of Muehlebach and Jordan [2022], who propose a similar algorithm for the offline setting.

1.2 Main Contributions

We give an online optimization scheme under unknown nonlinear constraints that achieves an optimal $O(\sqrt{T})$ regret and converges to the latest feasible set at a rate of $O(1/\sqrt{T})$. There are two variants of our problem formulation: the first deals with situations where constraints are unknown but fixed; the second allows constraints to be chosen in a time-varying and adversarial manner. Our algorithm, named Constraint Violation Velocity Projection (CVV-Pro), has the following features:

1. It assumes access to a new type of oracle, which on input $x_t$ returns partial information on all currently violated constraints, namely the value $g_{t,i}(x_t)$ and the gradient $\nabla g_{t,i}(x_t)$ for all $i \in I(x_t)$.
2. It projects an adversarially generated negative cost gradient $-\nabla f_t(x_t)$ onto a velocity polyhedron $V_\alpha(x_t) := \{v \in \mathbb{R}^n \mid [\nabla g_i(x_t)]^\top v \geq -\alpha g_i(x_t),\ i \in I(x_t)\}$. Due to the linear and local structure of $V_\alpha(x_t)$, the projection can be computed efficiently.
3. In contrast to standard online methods that project a candidate decision onto the feasible set in each round, our method trades off feasibility for efficiency. In particular, it produces a sequence of decisions that converges at a rate of $O(1/\sqrt{T})$ to the latest feasible set.
4. Our method handles time-varying adversarial constraints $g_t$, provided a decreasing rate of change $\|g_{t+1} - g_t\|_\infty \leq O(1/t)$ and that each feasible set $C_t$ belongs to $Q_t$ (see Assumption 1.2). As we show in Section 3.1, an important special case where the assumption of decreasing rate of change is satisfied is given by $\bar g_t = \frac{1}{t}\sum_{j=1}^{t} g_j$, i.e., when $\bar g_t$ represents an average of the constraints $g_j$ over time; a one-line calculation illustrating this is given below.
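Specifically, the time-averaged constraints satisfy the decay condition because of a telescoping identity (our sketch of the calculation behind Lemma 3.2 in Section 3.1):

$$\bar g_{t+1} - \bar g_t = \frac{1}{t+1}\sum_{j=1}^{t+1} g_j - \frac{1}{t}\sum_{j=1}^{t} g_j = \frac{g_{t+1} - \bar g_t}{t+1}, \quad\text{so}\quad \|\bar g_{t+1}(x) - \bar g_t(x)\|_\infty \leq \frac{\|g_{t+1}(x)\|_\infty + \|\bar g_t(x)\|_\infty}{t+1} = O(1/t),$$

where the uniform bound on $\|g_j(x)\|_\infty$ over $B_{4R}$ follows from the Lipschitz constant $L_G$ in Assumption 1.1 once each $g_{j,i}$ is normalized to vanish somewhere in $B_R$, as required in Lemma 3.2.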
1.3 Outline

Section 2 describes our algorithm and considers the situation where $g_t$ is time-invariant. This sets the stage for our main results in Section 3, which provide regret guarantees for our new online convex optimization setting with non-stationary, nonlinear, and unknown constraints. An important and interesting application of our algorithm are generalized Nash equilibrium problems, as will be illustrated with a numerical experiment in Section 4. The experiment will also highlight that the numerical results agree with the theoretical predictions.

2 Online learning under unknown, time-invariant, and nonlinear constraints

2.1 Online Gradient Descent

Online gradient descent [Hazan, 2016, Ch. 3.1] is a classical and perhaps the simplest algorithm that achieves optimal $O(\sqrt{T})$ regret for the setting of a compact, convex, time-invariant, and a priori known feasible set. It consists of the following two operations: i) $y_{t+1} = x_t - \eta_t \nabla f_t(x_t)$ takes a step from the previous point in the direction of the negative cost gradient; and ii) $x_{t+1} = \mathrm{Proj}_C(y_{t+1})$ projects $y_{t+1}$ back to the feasible set $C$, as $y_{t+1}$ may be infeasible. In this section, we generalize the online gradient descent algorithm to the setting where the feasible set is unknown a priori and has to be learned through repeated queries of a constraint violation oracle that only reveals local information.

2.2 Overview

In Section 2.3, we present the pseudocode of our algorithm. In Section 2.4, we give a structural result showing that Algorithm 1, under Assumption 1.1 and a bounded iterate assumption, guarantees an optimal $O(\sqrt{T})$ regret and converges to the feasible set at a rate of $O(1/\sqrt{T})$. In Appendix E, we show that the bounded iterate assumption can be enforced algorithmically, by introducing an additional hypersphere constraint that attracts the sequence $\{x_t\}_{t \geq 1}$ to a fixed compact set.

2.3 Constraint Violation Velocity Projection (CVV-Pro)

We present below the pseudocode of Algorithm 1 for a fixed horizon length $T$, as is standard in the literature [Hazan, 2016]. However, we note that our algorithm is oblivious to the horizon length $T$, i.e., it can run for any number of iterations without knowing $T$ a priori.

Figure 2: Illustration of the proposed CVV-Pro algorithm. Left: the constraint $g_{t,j}$ is violated by the current solution $x_t$. The negative cost gradient $-\nabla f_t(x_t)$ is projected onto the hyperplane (shifted by $\alpha g_{t,j}(x_t)$) with normal vector $\nabla g_{t,j}(x_t)$. This yields $r_t$ (see Section 2.5) and results in the velocity projection $v_t$ ($\eta = 1$ for clarity). Right: next iteration with updated $x$, where both $f$ and $g$ have changed. Then the procedure is applied recursively.

Algorithm 1 Constraint Violation Velocity Projection (CVV-Pro)
1: Requirements: see Assumption 1.1
2: Input: $\alpha > 0$
3: Initialization: step sizes $\eta_t = \frac{1}{\alpha\sqrt{t}}$
4: for $t = 1$ to $T$ do
5:   Play $x_t$ and observe:
6:   cost information $f_t(x_t)$, $\nabla f_t(x_t)$ and constraint information $\{g_i(x_t), \nabla g_i(x_t)\}_{i \in I(x_t)}$
7:   Construct the velocity polyhedron $V_\alpha(x_t) := \{v \in \mathbb{R}^n \mid [\nabla g_i(x_t)]^\top v \geq -\alpha g_i(x_t),\ i \in I(x_t)\}$
8:   Solve the velocity projection problem $v_t = \arg\min_{v \in V_\alpha(x_t)} \frac{1}{2}\|v + \nabla f_t(x_t)\|^2$
9:   Update $x_{t+1} = x_t + \eta_t v_t$
10: end for

Let $x \in C$ be an arbitrary decision. We show in Claim 2.2 that $\alpha(x - x_t) \in V_\alpha(x_t)$. Hence, the velocity polyhedron $V_\alpha(x_t)$ is always non-empty and the projection is well defined.
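The velocity projection in Step 8 is a small convex quadratic program over only the currently violated constraints. The following sketch (our illustration using the off-the-shelf solver cvxpy, not the authors' implementation) shows one CVV-Pro iteration, consuming the values and gradients returned by the constraint violation oracle:

```python
import cvxpy as cp
import numpy as np

def cvv_pro_step(x, grad_f, g_vals, G, alpha, eta):
    """One CVV-Pro iteration: project -grad_f onto the velocity polyhedron
    V_alpha(x) = {v : G^T v >= -alpha * g_vals} and take the step x + eta * v.

    g_vals and G hold the values and gradients (as columns) of the violated
    constraints; with no violations the projection reduces to v = -grad_f.
    """
    v = cp.Variable(x.size)
    constraints = [G.T @ v >= -alpha * g_vals] if g_vals.size else []
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(v + grad_f)), constraints).solve()
    return x + eta * v.value

# Toy usage: one violated halfspace constraint g(x) = 1 - x_1 at x = (2, 0).
x = np.array([2.0, 0.0])
grad_f = np.array([1.0, 1.0])       # adversarial cost gradient
g_vals = np.array([1.0 - x[0]])     # g(x) = -1 <= 0: violated
G = np.array([[-1.0], [0.0]])       # column: gradient of g at x
print(cvv_pro_step(x, grad_f, g_vals, G, alpha=1.0, eta=0.5))
```

Only the $|I(x_t)|$ violated rows enter the program, which is why the per-iteration cost stays small when few constraints are violated, regardless of how complicated the full feasible set is.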
2.4 Structural Result

Here, we show that Algorithm 1, under Assumption 1.1 and a bounded iterate assumption, guarantees an optimal $O(\sqrt{T})$ regret and converges to the feasible set at a rate of $O(1/\sqrt{T})$. The bounded iterate assumption will be removed subsequently, which, however, will require a more complex analysis.

Theorem 2.1 (Structural). Suppose Assumption 1.1 holds and in addition $x_t \in B_R$ for all $t \in \{1, \ldots, T\}$. Then, on input $\alpha = L_F/R$, Algorithm 1 with step sizes $\eta_t = \frac{1}{\alpha\sqrt{t}}$ guarantees the following for all $T \geq 1$:

(regret) $\sum_{t=1}^{T} f_t(x_t) - \min_{x \in C} \sum_{t=1}^{T} f_t(x) \leq 18\, L_F R \sqrt{T}$;

(feasibility) $g_i(x_t) \geq -8\left[\frac{L_G}{R} + 2\beta_G\right]\frac{R^2}{\sqrt{t}}$, for all $t \in \{1, \ldots, T\}$ and $i \in \{1, \ldots, m\}$.

2.5 Proof Sketch of Theorem 2.1

Our analysis establishes, in two steps, an important geometric property that connects the convex costs and the concave constraints via the velocity polyhedron $V_\alpha(x_t)$. More precisely, we show that the inner product $r_t^\top (x^*_T - x_t) \geq 0$ for all $t \in \{1, \ldots, T\}$. This property will be crucial for deriving the regret and feasibility bounds. In the first step, we leverage the concavity of the constraints and show that the vector $\alpha(x^*_T - x_t)$ belongs to the velocity polyhedron $V_\alpha(x_t)$.

Claim 2.2. Suppose $g_i$ is concave for every $i \in \{1, \ldots, m\}$. Then $\alpha(x - x_t) \in V_\alpha(x_t)$ for all $x \in C$. In addition, $x_t \notin \mathrm{int}(C)$ implies $[\nabla g_i(x_t)]^\top [x - x_t] \geq 0$ for all $x \in C$ and all $i \in I(x_t)$.

Proof. Let $x \in C$ be an arbitrary feasible decision, satisfying $g_i(x) \geq 0$ for all $i \in \{1, \ldots, m\}$. Since $g_i$ is concave, we have $g_i(x_t) + [\nabla g_i(x_t)]^\top [x - x_t] \geq g_i(x) \geq 0$ and thus $[\nabla g_i(x_t)]^\top [x - x_t] \geq -g_i(x_t)$. The second conclusion follows since $i \in I(x_t)$ implies $g_i(x_t) \leq 0$.

In the second step, we show that $r_t^\top (x_t - x^*) \leq 0$, where $r_t = v_t + \nabla f_t(x_t)$ is such that $-r_t$ belongs to the normal cone $N_{V_\alpha(x_t)}(v_t)$ of the velocity polyhedron $V_\alpha(x_t)$, evaluated at the projection $v_t$.

Lemma 2.3 (Main). Let $v_t$ be the projection of $-\nabla f_t(x_t)$ onto the polyhedron $V_\alpha(x_t)$, so that $v_t = r_t - \nabla f_t(x_t) \in V_\alpha(x_t)$, where $-r_t \in N_{V_\alpha(x_t)}(v_t)$. Then, $r_t^\top (x - x_t) \geq 0$ for all $x \in C$.

Proof. By definition, the normal cone $N_{V_\alpha(x_t)}(v_t)$ is given by $\{u \in \mathbb{R}^n \mid u^\top (v - v_t) \leq 0,\ \forall v \in V_\alpha(x_t)\}$. Then, by construction $-r_t \in N_{V_\alpha(x_t)}(v_t)$, and thus it holds for every $v \in V_\alpha(x_t)$ that $r_t^\top [v - v_t] \geq 0$. The proof proceeds by case distinction.

Case 1. Suppose $x_t$ is in the interior of $C$. Then $I(x_t) = \emptyset$, which implies $-\nabla f_t(x_t) \in V_\alpha(x_t) = \mathbb{R}^n$ and thus $r_t = 0$.

Case 2. Suppose $x_t$ is on the boundary or outside of $C$, i.e., $I(x_t) \neq \emptyset$. By Claim 2.2, we have $[\nabla g_i(x_t)]^\top [x - x_t] \geq 0$ for all $x \in C$. By construction, $v_t \in V_\alpha(x_t)$ and thus $v(x) = v_t + x - x_t \in V_\alpha(x_t)$. The statement follows by applying $v = v(x)$ to $r_t^\top [v - v_t] \geq 0$.

Regret. To establish the first conclusion of Theorem 2.1 (regret), we combine the preceding geometric property with the analysis of online gradient descent. Since $f_t \in \mathcal{F}$ is convex, we upper bound the regret in terms of the gradient of $f_t$, namely $\sum_{t=1}^{T} f_t(x_t) - f_t(x^*_T) \leq \sum_{t=1}^{T} [\nabla f_t(x_t)]^\top (x_t - x^*_T)$, and then we show that the following inequality holds:

$$[\nabla f_t(x_t)]^\top (x_t - x^*_T) = r_t^\top (x_t - x^*_T) + \frac{\|x_t - x^*_T\|^2 - \|x_{t+1} - x^*_T\|^2}{2\eta_t} + \frac{\eta_t}{2}\|v_t\|^2 \leq \frac{\|x_t - x^*_T\|^2 - \|x_{t+1} - x^*_T\|^2}{2\eta_t} + \frac{\eta_t}{2}\|v_t\|^2. \qquad (2)$$

Moreover, in Appendix D (see Lemma D.2), we upper bound the velocity: $\|v_t\| \leq \alpha\|x^*_T - x_t\| + 2\|\nabla f_t(x_t)\|$. Combining Assumption 1.1 and $x_t \in B_R$ yields a uniform bound $\|v_t\| \leq V_\alpha$, where for $\alpha = L_F/R$ we set $V_\alpha := 4L_F$. The desired regret follows by a telescoping argument and by convexity of the cost functions $f_t \in \mathcal{F}$.

Feasibility. For the second conclusion of Theorem 2.1 (convergence to the feasible set at a rate of $1/\sqrt{T}$), we develop a nontrivial inductive argument that proceeds in two steps. In Appendix D (see Claim D.6), we give a structural result that bounds the constraint functions from below. In particular, for every $i \in I(x_t)$ we have $g_i(x_{t+1}) \geq (1 - \alpha\eta_t)\, g_i(x_t) - \eta_t^2 V_\alpha^2 \beta_G$, and for every $i \notin I(x_t)$ it holds that $g_i(x_{t+1}) \geq -\eta_{t+1} V_\alpha [2L_G + V_\alpha\beta_G/\alpha]$. Using an inductive argument, we establish in Appendix D (see Lemma D.5) the following lower bound: $g_i(x_t) \geq -c\,\eta_t$, where $c = 2V_\alpha(L_G + V_\alpha\beta_G/\alpha)$. Choosing $\alpha = L_F/R$ implies that $V_\alpha = 4L_F$, and the desired convergence rate to the feasible set then follows for the step size $\eta_t = \frac{1}{\alpha\sqrt{t}}$, which yields the bound stated in Theorem 2.1.
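To see where the explicit constant in the regret bound comes from, here is our reconstruction of the routine telescoping step: summing (2) over $t$, dropping the non-positive terms $r_t^\top(x_t - x^*_T)$, and using the standard rearrangement of the telescoping differences (valid since $1/\eta_t$ is increasing), we obtain

$$\sum_{t=1}^{T} [\nabla f_t(x_t)]^\top (x_t - x^*_T) \;\leq\; \frac{\max_{t}\|x_t - x^*_T\|^2}{2\eta_T} + \frac{V_\alpha^2}{2}\sum_{t=1}^{T}\eta_t \;\leq\; \frac{(2R)^2\,\alpha\sqrt{T}}{2} + \frac{V_\alpha^2}{2}\cdot\frac{2\sqrt{T}}{\alpha} \;=\; (2 + 16)\, L_F R\sqrt{T} \;=\; 18\, L_F R\sqrt{T},$$

using $\|x_t - x^*_T\| \leq 2R$ (both points lie in $B_R$), $\eta_T = \frac{1}{\alpha\sqrt{T}}$, $\sum_{t=1}^{T} 1/\sqrt{t} \leq 2\sqrt{T}$, $V_\alpha = 4L_F$, and $\alpha = L_F/R$, which recovers exactly the constant of Theorem 2.1.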
3 Online Learning under Adversarial Nonlinear Constraints

3.1 Problem Formulation

In this section, we consider an online optimization problem with adversarially generated time-varying constraints. More precisely, at each time step $t$, the learner receives partial information on the current cost $f_t$ and feasible set $C_t$, and seeks to minimize (1). To make this problem well posed, we restrict the environment such that each feasible set $C_t$ is contained in $Q_t$ (see Section 1) and the rate of change between consecutive time-varying constraints decreases over time. We quantify a sufficient rate of decay with the following assumption.

Assumption 3.1 (TVC Decay Rate). We assume that the adversarially generated sequence $\{g_t\}_{t \geq 1}$ of time-varying constraints is such that for every $x \in B_{4R}$ and all $t \geq 1$, the following holds: $\|g_{t+1}(x) - g_t(x)\|_\infty \leq \frac{98}{t+16}\left[L_G R + 3\beta_G R^2\right]$.

We note that Assumption 3.1 essentially only requires $\|g_{t+1}(x) - g_t(x)\|_\infty \leq O(1/t)$, as $R$ can be chosen large enough such that the bound is satisfied. Of course, $R$ will appear in our regret and feasibility bounds, but it will not affect the dependence on $t$ or $T$ (up to constant factors). An important special case where Assumption 3.1 is satisfied is summarized in the following lemma. The proof is included in Appendix F (see Lemma F.7 and Lemma F.8).

Lemma 3.2. Suppose the functions $g_{t,i}$ satisfy Assumption 1.1 and in addition there is a decision $x_{t,i} \in B_R$ such that $g_{t,i}(x_{t,i}) = 0$ for every $t \geq 1$ and $i \in \{1, \ldots, m\}$. Then the time-averaged constraints $\bar g_{t,i}(x) := \frac{1}{t}\sum_{\ell=1}^{t} g_{\ell,i}(x)$ satisfy Assumption 1.1 and Assumption 3.1.

3.2 Velocity Projection with Attractive Hypersphere Constraint

We show in Appendix E that the second assumption in Theorem 2.1, namely $x_t \in B_R$ for all $t \geq 1$, can be enforced algorithmically. We achieve this in two steps. 1) Algorithmically, we introduce an additional hypersphere constraint $g_{m+1}(x_t) = \frac{1}{2}\left[R^2 - \|x_t\|^2\right]$ that attracts the decision sequence $\{x_t\}_{t \geq 1}$ to the hypersphere $B_R$ and guarantees that it always stays inside the hypersphere $B_{4R}$ with a slightly larger radius. More precisely, we augment the velocity polyhedron in Step 7 of Algorithm 1 as follows: $V'_\alpha(x_t) = V_\alpha(x_t)$ if $\|x_t\| \leq R$; otherwise $V'_\alpha(x_t) = \{v \in V_\alpha(x_t) \mid [\nabla g_{m+1}(x_t)]^\top v \geq -\alpha g_{m+1}(x_t)\}$. 2) Analytically, we give a refined inductive argument in Appendix E (see Lemma E.5), showing that $g_{m+1}(x_t) \geq -27R^2/\sqrt{t+15}$, $\|x_t\| \leq 4R$, and $\|v_t\| \leq 7L_F$, for all $t \geq 1$.

3.3 Main Contribution

Our main contribution is to show that Algorithm 1 with the augmented velocity polyhedron $V'_\alpha(x_t)$ achieves optimal $O(\sqrt{T})$ regret and satisfies the feasibility convergence rate $g_T(x_T) \geq -O(1/\sqrt{T})$. Due to space limitations, we defer the proof to Appendix F.

Theorem 3.3 (Time-Varying Constraints). Suppose the functions $\{f_t, g_t\}_{t \geq 1}$ satisfy Assumptions 1.1, 1.2, and 3.1. Then, on input $R, L_F > 0$ and $x_1 \in B_R$, Algorithm 1 applied with $\alpha = L_F/R$, augmented velocity polyhedron $V'_\alpha(\cdot)$, and step sizes $\eta_t = \frac{1}{\alpha\sqrt{t+15}}$ guarantees the following for all $T \geq 1$:

(regret) $\sum_{t=1}^{T} f_t(x_t) - \min_{x \in C_T} \sum_{t=1}^{T} f_t(x) \leq 246\, L_F R\sqrt{T}$;

(feasibility) $g_{t,i}(x_t) \geq -265\left[\frac{L_G}{R} + 4\beta_G\right]\frac{R^2}{\sqrt{t+15}}$, for all $t \in \{1, \ldots, T\}$ and $i \in \{1, \ldots, m\}$;

(attraction) $g_{m+1}(x_t) \geq -27\frac{R^2}{\sqrt{t+15}}$, for all $t \in \{1, \ldots, T\}$.

Our regret analysis in Theorem 3.3 builds upon the following key structural result that generalizes Lemma 2.3 to time-varying constraints. In particular, in Appendix F (see Lemma F.3), we show that given the feasible set $C_T \subseteq Q_T$, it holds for every $x \in C_T$ that $r_t^\top (x - x_t) \geq 0$ for all $t \in \{1, \ldots, T\}$. As a result, a similar argument as in (2) shows that the regret is bounded by $O(\sqrt{T})$.

Moreover, we note that the linear and quadratic dependence on $R$ in Theorem 3.3 is consistent in length units. Let the radius $R$ have units of length $\ell$; then the Lipschitz constant $L_F$, which can be viewed as the supremum of the $\ell_2$ norm of the gradient, has units of $1/\ell$, and the $\beta_G$ smoothness constant (associated with the Hessian) has units of $1/\ell^2$. This means that the regret bound in Theorem 3.3 has the same units as $f_t$, while the feasibility bound has the same units as $g_t$.
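Returning to the augmentation of Section 3.2: the attractive hypersphere constraint adds at most one extra row to the projection program, and only when the iterate has left $B_R$. A minimal sketch (our illustration; the variable names are ours) that appends this row to the oracle output before the velocity projection:

```python
import numpy as np

def augment_with_hypersphere(x, g_vals, G, R):
    """Append the attractive hypersphere constraint g_{m+1}(x) = (R^2 - ||x||^2)/2,
    with gradient -x, to the violated-constraint data whenever x lies outside B_R."""
    if np.linalg.norm(x) <= R:
        return g_vals, G                      # V'_alpha(x) = V_alpha(x)
    g_sphere = 0.5 * (R**2 - x @ x)           # negative outside B_R
    g_vals = np.append(g_vals, g_sphere)
    G = np.column_stack([G, -x])              # extra column: grad g_{m+1}(x) = -x
    return g_vals, G
```

The resulting rows feed directly into the projection step sketched after Section 2.3.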
4 Simulation examples

Two-player games with shared resources are an excellent example for demonstrating the effectiveness and importance of our online learning framework. We apply our algorithm and show numerical experiments that support our theoretical findings. We choose random instances of a two-player game with linear utility and constraints. In particular, we consider the following optimization problem

$$\min_{x \in \Delta_n} \max_{y \in \Delta_n} x^\top A y \quad \text{subject to} \quad C_x x + C_y y \leq 1, \qquad (3)$$

where $\Delta_n = \{x \in \mathbb{R}^n \mid \sum_{i=1}^{n} x_i = 1,\ x \geq 0\}$ is the probability simplex. Each component of the utility matrix $A \in \mathbb{R}^{n \times n}$ is sampled from the normal distribution, and the constraint matrices $C_x, C_y \in [0,1]^{m \times n}$ have each of their components sampled uniformly at random from $[0,1]$.

Figure 3: (Setup) The CVV-Pro algorithm is executed on five random instances of the two-player game with shared resources (Section 4.1). The thick line is the mean and the thin lines indicate the minimum and the maximum over the five runs. (a) The regret follows the predicted $O(\sqrt{T})$ slope. (b) The CVV-Pro algorithm achieves a convergence rate of $O(1/\sqrt{t})$ for the averaged decisions $\bar x_t := \frac{1}{t}\sum_{\ell=1}^{t} x_\ell$ towards $x^*$, measured in the two-norm. In our experiment, we set $x^* := \bar x_{10000}$. Similar behavior is reported for the averaged decisions $\bar y_t$ of the adversary. (c) The maximal constraint violation, expressed by $\min_{i \in I(x_t)} g_{t,i}(x_t)$, converges at a rate of $O(1/\sqrt{t})$, as predicted by our theoretical results. (Compute) For the implementation of CVV-Pro we used the MATLAB R2019a numerical computing software. The computation of the experiment takes about 4 hours on a machine with CPU: Intel(R) i7-6800K 3.40 GHz with 6 cores, GPU: NVIDIA GeForce GTX 1080, and RAM: 32 GB.

4.1 Online Formulation

The problem in (3) can be modeled with our online learning framework (1) by choosing costs $f_t(x) := x^\top A y_t$ and time-averaged resource constraints $g_T(x) := \frac{1}{T}\sum_{t=1}^{T} \tilde g_t(x)$, where the function $\tilde g_t(x) := 1 - C_x x - C_y y_t$. Thus, the constraint in (3) is included as an average over the past iterations of $y_t$. The strategy for choosing $y_t$ will be described below and, as we will see, the average of $y_t$ over the past iterations converges. This ensures that the feasible set $C_t$ (defined in (1)) is slowly time-varying, while the averages of $x_t$ and $y_t$ over past iterates converge to equilibria of (3). Further, by a refined version of Lemma 3.2 (see Lemma F.6 in Appendix F), the time-averaged constraints $g_T(x)$ satisfy Assumption 3.1.

In each iteration, Algorithm 1 seeks to minimize the online problem and commits to a decision $x_t$. The adversary computes the best response $\hat y_t$ with respect to the decision $x_t$ by solving $\arg\max_{y \in \Delta_n} x_t^\top A y$. To make the dynamics more interesting, the adversary then mixes the best response with a random decision, committing to $y_t = 0.8\,\hat y_t + 0.2\,\xi_t$, where the random variable $\xi_t$ is sampled uniformly at random from the probability simplex $\Delta_n$. As both players optimize over the probability simplex ($x, y \in \Delta_n$), the sequence of decisions $\{x_t\}_{t \geq 1}$ is automatically bounded. Thus, we can apply Theorem 3.3 with the original velocity polyhedron, as discussed in Appendix E. We implemented our algorithm with $\eta_t = 1/(\alpha\sqrt{t})$ and $\alpha = 100$.
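For concreteness, a small Python sketch of how such a game instance and the adversary's mixed best response can be generated (our reconstruction for illustration; the paper's experiments used a MATLAB implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 10
A = rng.standard_normal((n, n))        # utility matrix
Cx = rng.uniform(0.0, 1.0, (m, n))     # shared resource constraints:
Cy = rng.uniform(0.0, 1.0, (m, n))     # Cx @ x + Cy @ y <= 1

def best_response(x):
    """For a linear utility, max_{y in simplex} x^T A y is attained at a vertex."""
    y = np.zeros(n)
    y[np.argmax(x @ A)] = 1.0
    return y

def adversary(x):
    """Mix the best response with a uniformly random point of the simplex."""
    xi = rng.dirichlet(np.ones(n))     # uniform distribution on the simplex
    return 0.8 * best_response(x) + 0.2 * xi

def g_tilde(x, y):
    """Instantaneous resource constraint 1 - Cx x - Cy y (feasible when >= 0)."""
    return 1.0 - Cx @ x - Cy @ y
```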
4.2 Experimental Results

We report results from numerical simulations with decision dimension $n = 100$, $m = 10$ shared resource constraints, $T = 4000$ iterations, and five independently sampled instances of the two-player game. The learner's regret, depicted in Figure 3a, shows a clear correspondence with the theoretical prediction of $O(\sqrt{T})$. Figure 3c presents the maximal constraint violation $\min_{i \in I(x_T)} \frac{1}{T}\sum_{t=1}^{T} \tilde g_{t,i}(x_T)$, which follows the predicted $O(1/\sqrt{T})$ convergence rate. We also conclude from Figure 3b that the learner's averaged decisions $\bar x_T = \frac{1}{T}\sum_{t=1}^{T} x_t$ converge at a rate of $O(1/\sqrt{T})$. Similarly, the averaged decisions $\bar y_T$ of the adversary also converge at a rate of $O(1/\sqrt{T})$. We note that there is little variability in the results despite the different realizations of the matrices $A$, $C_x$, $C_y$.

Contrasting CVV-Pro and Online Gradient Descent. In Appendix C, we show that our CVV-Pro algorithm outperforms the standard online gradient descent algorithm in the two-player game from above. In particular, our algorithm achieves a lower regret and a runtime improvement of about 60%. Further, the percentage of violated constraints decreases rapidly and plateaus at 20%. The amount of improvement in execution time is likely to be greater for higher-dimensional problems, where fewer constraints tend to be active at each iteration. Moreover, when the constraints are nonlinear, which includes $\ell_p$ norm or spectral constraints, optimizing over the full feasible set can be computationally challenging. In contrast, the velocity projection step in CVV-Pro is always a convex quadratic program with linear constraints, regardless of the underlying feasible set.

5 Broader Impact

It is important to emphasize that our work is theoretical, and the main contribution is to design and analyze a novel algorithm that combines techniques from the seemingly distant fields of online convex optimization (online gradient descent) and non-smooth mechanics (velocity space; see Muehlebach and Jordan [2022]). Nevertheless, the list of potential applications includes, but is not limited to: adversarial contextual bandits with sequential risk constraints [Sun et al., 2017], network resource allocation [Chen et al., 2017], logistic regression [Cao and Liu, 2019, Liu et al., 2022], ridge regression and job scheduling [Liu et al., 2022], two-player games with resource constraints (Section 4), and system identification and optimal control (Appendix B).

6 Conclusion

We propose an online algorithm that, despite the lack of information about the feasible set, achieves $O(\sqrt{T})$ regret. We further ensure convergence of the violated constraints, $\min\{g_T(x_T), 0\}$, at a rate of $O(1/\sqrt{T})$ and derive explicit constants for all our bounds that hold for all $T \geq 1$. We thus conclude that our algorithm matches the performance of traditional online projected gradient or Frank-Wolfe schemes, while requiring substantially less information about the feasible set and allowing the feasible set to be time-varying. Perhaps equally important is the fact that instead of performing projections onto the full feasible set at each iteration, our algorithm only optimizes over a local sparse linear approximation. We show the applicability of our algorithm in numerical simulations of random two-player games with shared resources.
Acknowledgements

We acknowledge the support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039B). Georg Martius is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1, Project number 390727645. Pavel Kolev was supported by the Cyber Valley Research Fund and the Volkswagen Stiftung (No 98 571). Michael Muehlebach thanks the German Research Foundation and the Branco Weiss Fellowship, administered by ETH Zurich, for the support. We thank the anonymous reviewers for comments, which helped improve the presentation of the paper.

References

Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and convex knapsacks. In ACM Conference on Economics and Computation, 2014.

Shipra Agrawal and Nikhil R. Devanur. Bandits with global convex constraints and objective. Operations Research, 67(5):1486–1502, 2019.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. Journal of the ACM, 65(3):1–55, 2018.

Avrim L. Blum. On-line algorithms in machine learning. Online Algorithms: The State of the Art, pages 306–325, 1998.

Xuanyu Cao and K. J. Ray Liu. Online convex optimization with time-varying constraints and bandit feedback. IEEE Transactions on Automatic Control, 64(7):2665–2680, 2019.

Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Giulia Romano, and Nicola Gatti. A unifying framework for online optimization with long-term constraints. In Advances in Neural Information Processing Systems 36, 2022.

Tianyi Chen, Qing Ling, and Georgios B. Giannakis. An online convex optimization approach to proactive network resource allocation. IEEE Transactions on Signal Processing, 65(24):6350–6364, 2017.

Francisco Facchinei and Christian Kanzow. Generalized Nash equilibrium problems. 4OR: A Quarterly Journal of Operations Research, 5:173–210, 2007.

Francisco Facchinei and Christian Kanzow. Penalty methods for the solution of generalized Nash equilibrium problems. SIAM Journal on Optimization, 20(5):2228–2253, 2010.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Dan Garber and Ben Kretzu. New projection-free algorithms for online convex optimization with adaptive regret guarantees. In Conference on Learning Theory, 2022.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

Nicole Immorlica, Karthik Abinav Sankararaman, Robert E. Schapire, and Aleksandrs Slivkins. Adversarial bandits with knapsacks. Journal of the ACM, 69(6):40:1–40:47, 2022.

Rodolphe Jenatton, Jim Huang, and Cedric Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In 33rd International Conference on Machine Learning, 2016.

M. I. Jordan. Artificial intelligence: the revolution hasn't happened yet. Harvard Data Science Review, 1(1):1–8, 2019.

M. I. Jordan and T. M. Mitchell. Machine learning: Trends, perspectives, and prospects. Science, 349(6245):255–260, 2015.

Michael I. Jordan, Tianyi Lin, and Manolis Zampetakis. First-order algorithms for nonlinear generalized Nash equilibrium problems. Journal of Machine Learning Research, 24:38:1–38:46, 2023.

Jong Gwang Kim. Equilibrium computation of generalized Nash games: A new Lagrangian-based approach. In 22nd Conference on Economics and Computation, 2021.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Kfir Y. Levy and Andreas Krause. Projection free online learning over smooth sets. In 22nd International Conference on Artificial Intelligence and Statistics, 2019.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

Qingsong Liu, Wenfei Wu, Longbo Huang, and Zhixuan Fang. Simultaneously achieving sublinear regret and constraint violations for online convex optimization with time-varying constraints. Performance Evaluation Review, 49(3):4–5, 2022.

Zhou Lu, Nataly Brukhim, Paula Gradu, and Elad Hazan. Projection-free adaptive regret with membership oracles. In International Conference on Algorithmic Learning Theory, 2023.

Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: online convex optimization with long term constraints. Journal of Machine Learning Research, 13(81):2503–2528, 2012.

Shie Mannor, John N. Tsitsiklis, and Jia Yuan Yu. Online learning with sample path constraints. Journal of Machine Learning Research, 10(20):569–590, 2009.

Zakaria Mhammedi. Efficient projection-free online convex optimization with membership oracle. In Conference on Learning Theory, 2022.

Michael Muehlebach and Michael I. Jordan. On constraints in first-order optimization: A view from non-smooth dynamical systems. Journal of Machine Learning Research, 23(256):1–47, 2022.

Michael J. Neely and Hao Yu. Online convex optimization with time-varying constraints. arXiv preprint arXiv:1702.04783, 2017.

Wen Sun, Debadeepta Dey, and Ashish Kapoor. Safety-aware algorithms for adversarial contextual bandit. In 34th International Conference on Machine Learning, 2017.

Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Tianyou Chai, and Karl H. Johansson. Regret and cumulative constraint violation analysis for online convex optimization with long term constraints. In 38th International Conference on Machine Learning, 2021.

Hao Yu and Michael J. Neely. A low complexity algorithm with $O(\sqrt{T})$ regret and $O(1)$ constraint violations for online convex optimization with long term constraints. Journal of Machine Learning Research, 21(1):1–24, 2020.

Hao Yu, Michael Neely, and Xiaohan Wei. Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems 30, 2017.

Jianjun Yuan and Andrew Lamperski. Online convex optimization for cumulative constraints. In Advances in Neural Information Processing Systems 31, 2018.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In 20th International Conference on Machine Learning, 2003.