# online_learning_with_unknown_constraints__c40ead05.pdf Online Learning with Unknown Constraints Karthik Sridharan 1 Seung Won Wilson Yoo 1 We consider the problem of online learning where the sequence of actions played by the learner must adhere to an unknown safety constraint at every round. The goal is to minimize regret with respect to the best safe action in hindsight while simultaneously satisfying the safety constraint with high probability on each round. We provide a general meta-algorithm that leverages an online regression oracle to estimate the unknown safety constraint, and converts the predictions of an online learning oracle to predictions that adhere to the unknown safety constraint. On the theoretical side, our algorithm s regret can be bounded by the regret of the online regression and online learning oracles, the eluder dimension of the model class containing the unknown safety constraint, and a novel complexity measure that characterizes the difficulty of safe learning. We complement our result with an asymptotic lower bound that shows that the aforementioned complexity measure is necessary. When the constraints are linear, we instantiate our result to provide a concrete algorithm with T regret using a scaling transformation that balances optimistic exploration with pessimistic constraint satisfaction. 1. Introduction Online learning is a key tool for many sequential decision making paradigms. From a practical view point, it is often the case that either due to safety concerns (Dobbe et al., 2020), to guarantee fairness or privacy properties (Zafar et al., 2019), (Levy et al., 2021), or in many cases, simply due to physical restrictions in the real world (Atawnih et al., 2016), the agent or learner often must pick actions that are not only effective but also strictly adhering to some constraints on every round. Often, the safety constraint 1Department of Computer Science, Cornell University, Ithaca NY, United States. Correspondence to: Seung Won Wilson Yoo . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). is determined by parameters of the environment that are unknown to the learner. For example individual fairness constraints may be defined by an unknown similarity metric (Gillen et al., 2018), or in robotics applications, safety may hinge on uncertainties such as an unknown payload weight (Brunke et al., 2022). Thus in such situations, the learner must learn the unknown parameters that characterize the safety constraint. In this work, we study the general problem of online learning with unknown constraints, where the learner only observes noisy feedback of the safety constraints. We consider arbitrary decision spaces and loss functions. Our goal is to design algorithms that can simultaneously minimize regret while strictly adhering to the safety constraint at all time steps. Naturally, regret is measured w.r.t. the best decision in hindsight that also satisfies the constraint on every round. The learner only has access to an initial safe-set of actions/decisions to begin, and must gain more information about the safety constraint. To solve this problem, we assume access to a general online learning oracle that has low regret (without explicit regard to safety) and a general online regression oracle that provides us with increasingly accurate estimations of the unknown constraint function. The key technical insight in this work is in exploring what complexity or geometry allows us to play actions within guaranteed safe-sets while expanding the safe-sets and simultaneously ensuring regret is small. We introduce a complexity measure that precisely characterizes this inherent per-step tension between regret minimization and information acquisition with respect to the safety constraint (with the key challenge of remaining within the safe set). We complement our results with a lower bound that shows that asymptotically, whenever this complexity measure is large, regret is also large. Our results yield an analysis that non-constructively shows the existence of algorithms for the general setting with arbitrary decision sets, loss functions, and classes of safety constraints. Furthermore, we instantiate these results explicitly for various settings, and give explicit algorithms for unknown linear constraints and online linear optimization - obtaining the first algorithm with O( Online Learning with Unknown Constraints Key Contributions We introduce a novel complexity measure Vκ(t) defined in eq. 5, that characterizes the difficulty of safe learning. On a high level, it captures the per-timestep trade off between loss minimization and information gain w.r.t. unknown constraint with κ serving as the weight placed on information gain. We provide a new safe learning algorithm under an unknown constraint (Algorithm 1) that utilizes an online regression oracle w.r.t. F, where F is the model class to which the unknown safety constraint belongs, and an online learning oracle that guarantees good performance w.r.t. Π, where Π is a benchmark policy class. Notably, our algorithm is able to handle adversarial contexts drawn from X, arbitrary action set A, safety model class F, benchmark class Π, operates under general modeling assumptions, and enjoys the following regret bound: t=1 Vκ(t) + κ inf α αT + Reg OR(T, δ, F)E(F, α) + Reg OL(T, δ, Π) where Reg OR(T, δ, F) denotes the regret bound guaranteed by the online regression oracle on F, E(F, α) denotes the eluder dimension of F, and Reg OL(T, δ, Π) denotes the regret bound guaranteed by the online learning algorithm w.r.t. Π. Via a lower bound, we show that asymptotically, if there exists a κ for which t=1 Vκ(t) c0 > 0 , no safe algorithm is able to obtain diminishing regret. If per-timestep constraint satisfaction is relaxed to long term constraint satisfaction, we show a modified version of our main algorithm yields bounds without Vκ(t). For several settings including finite action spaces, linear & generalized linear, and polytopic settings we give an instantiation of κ = κ that satisfies Vκ(t) 0 and, Regret T κ inf α αT + Reg OR(T, δ, F)E(F, α) + Reg OL(T, δ, Π) For linear settings we instantiate this result to give an algorithm with O( T) regret. We extend our main algorithm to handle vector-valued constraints 2. Related Works Online Convex Optimization and Long Term Constraints: (Mahdavi et al., 2012) initiated the problem of online convex optimization with long term constraints, a variant of online convex optimization where the learner is given a set of functional constraints {fi( ) 0}m i=1 and is required to ensure that the sum of constraint violations PT t=1 Pm i=1 fi(xt) is bounded rather than ensuring that the constraints are satisfied at every time step. For the case of known and fixed constraints, (Mahdavi et al., 2012) obtain O(T 1/2) regret and O(T 3/4) constraint violation. This was recently improved in (Yu & Neely, 2020) to be O(T 1/2) regret and O(1) constraint violation. Furthermore, (Neely & Yu, 2017) study a variant with time varying constraints {fi,t( ) 0}m i=1 and and achieve O(T 1/2) regret and long term constraint violation. (Sun et al., 2017), (Jenatton et al., 2016), and (Yi et al., 2020) study variations of this problem. Bandits with Unknown Linear Constraint: The area of work most similar to ours is the study of safe bandits with unknown linear constraints. Initiated by (Moradipari et al., 2021), this line of works studies a linear bandit setting, where a linear constraint is imposed on every action at of the form of f, at b 0 with unknown f and known b. Similar settings involving linear bandit problems with uncertain and per-round constraints have been studied by (Amani & Thrampoulidis, 2021), (Pacchiano et al., 2021), (Hutchinson et al., 2024). (Pacchiano et al., 2021) study the a setting where constraints are satisfied in expectation, and (Pacchiano et al., 2024) and (Hutchinson et al., 2024) improves this to high probability satisfaction. Safe Convex Optimization with Unknown Constraint(s): Safe convex optimization with unknown linear constraints and noisy feedback was studied in (Usmanova et al., 2019). (Fereydounian et al., 2020) seeks to optimize a fixed convex function given unknown linear constraints, and focuses on sample complexity. Closest to our work is that of (Chaudhary & Kalathil, 2022), where the authors study time varying cost functions and achieve O(T 2/3) regret. A recent concurrent work due to (Hutchinson et al., 2025) improves this to O( T). (Hutchinson & Alizadeh, 2024a) studies a variant with non-linear constraints and first order feedback, and (Hutchinson & Alizadeh, 2024b) studies a variant with d + 1 points of feedback. Per Timestep Tradeoff Between Loss Minimization and Constraint Information Gain : The SO-PGD algorithm due to (Chaudhary & Kalathil, 2022) adopts an explorefirst then exploit strategy which results in a O(T 2/3) regret bound, whereas the ROFUL algorithm due to (Hutchinson et al., 2024) strikes a better balance between regret minimization and conservative exploration of the constraint set. The Decision Estimation Coefficient (DEC) due to (Foster et al., 2023), (Foster et al., 2022) explicitly strikes a balance loss minimization and the information gained due to obser- Online Learning with Unknown Constraints vation. Our proposed algorithm seeks to similarly balance loss minimization and exploration of the constraint set. 3. Setup and Preliminary We consider the problem of online learning with unknown constraints imposed on actions the learner is allowed to play from. The learning problem proceeds for T rounds as follows. For t = 1, . . . , T: Nature selects a context xt X The learner selects a safe action at A Nature selects an outcome yt Y The learner receives safety feedback zt Z where X is the context space, A is the action space, Y is the outcome space and Z is the safety feedback space. Here, the context chosen by nature x X, action chosen by learner a A, and outcome chosen by nature y Y define the loss suffered by the learner ℓ(a, x, y), for a bounded loss function ℓ: A X Y [0, 1]. We assume a full information feedback setting with respect to the losses - the learner observes yt Y. Given a context x X, an action chosen by the learner a A is considered safe if f (a, x) 0 for an unknown function f F A X [ 1, 1]. While f is unknown to the learner, the safety function class F is known and encodes prior knowledge about the safety constraint. F may be parameterized by linear models, neural networks, or other rich function approximators depending on the problem setting. At every timestep t, the learner receives feedback on unknown safety function f through safety feedback zt psignal(f (at, xt)), where psignal : [ 1, 1] Ω(Z) and Ω(Z) denotes the set of distributions over Z. We highlight the fact that given at and xt, zt is conditionally independent of the history. This captures a wide variety of feedback settings, including those where Z = R and zt = f (at, xt)+ξt for a noise variable ξt, and those where Z = {0, 1} and zt is the result of a coin flip with success probability determined by f (at, xt). Whenever the feedback function is not made explicit, we assume that the feedback is given by zt = f (at, xt) + ξt. Throughout the paper, we will compare the performance of our algorithm against a known policy class Π X A. The goal of the learner is to minimize the regret defined as: Regret T := t=1 ℓ(at, xt, yt) min π Π: t f (π(xt),xt) 0 t=1 ℓ(π(xt), xt, yt) which is the regret with respect to the optimal policy π in hindsight that is safe on every round, i.e. t [T], f (π(xt), xt) 0. The learner in turn is also only allowed to take actions at s.t. f (at, xt) 0. Since we are interested in making no constraint violations while taking our actions, learning is impossible unless we are at least given an initial set of safe actions. For any context x X, we assume that we are given a non-empty set A0(x) A that is guaranteed to be safe. For settings with no context (i.e. X = ), we denote the initial safe set as A0. 3.1. Additional Notation We use the notation ΠS(x) to denote the projection of a vector x Rd onto some convex set S Rd. For a positive definite matrix M Rd d and vector x Rd we denote the norm induced by M as x M := x Mx. We denote the convex hull of a set S as Conv(S). Let {xs}t s=1 := x1, . . . , xt be shorthand for a sequence. For a scalar-valued function class F A X [0, 1], we denote F(a, x) := supf,f F f(a, x) f (a, x). For a vector-valued function class F A X [ 1, 1]m, denote F (a, x) := supf,f F f(a, x) f (a, x) , where is the ℓ norm. For a set G, we denote Ω(G) as the set of distributions over G. For a vector x Rd and i [d], let x[i] be the i th coordinate of x. We adopt a non-asymptotic big-oh notation: for functions f, g : X R+, f = O(g) if there exists some constant C > 0 such that f(x) Cg(x) for all x X. We write f = o(g) if for every constant c, there exists a x0 such that f(x) cg(x) for all x x0. 3.2. Online Regression Oracles and Signal Functions Similar to prior works concerned with estimation of function classes (Foster et al., 2018a), (Foster & Rakhlin, 2020), (Foster et al., 2021), (Sekhari et al., 2023a), (Sekhari et al., 2023b), we assume our algorithms have access to an online regression oracle, Oracle OR. However, unlike these prior works that assume that the provided online regression oracle enjoys a sublinear regret bound, we require that our oracle satisfy a slightly weaker condition that allows for algorithms geared towards realizability. Assumption 3.1 (Online Regression Oracle). The algorithm Oracle OR guarantees that for any (possibly adversarially chosen) sequence {at, xt, zt}T t=1, for any δ (0, 1), with probability at least 1 δ, generates predictions {ˆzt}T t=1 satisfying: t=1 (ˆzt f (at, xt))2 Reg OR(T, δ, F) Assumption 3.1 is closely linked with the model psignal that produces feedback about constraint value, and any regretminimizing oracle for strongly convex losses can be converted into an oracle that satisfies the assumption with high probability. We formalize this in Lemma C.6 in the appendix. For instance, if zt psignal(f (at, xt)) is given Online Learning with Unknown Constraints by zt = f (at, xt) + ξt where ξt is any sub-gaussian distributed random variable, then any online square loss regression algorithm on class F that uses zt as corresponding outcomes will satisfy Assumption 3.1. Similarly, if zt {0, 1} is drawn as P(zt = 1|f (at, xt)) exp(f (at, xt)) = psignal(f (at, xt)) (ie. the Boltzman distribution), then one can show that Assumption 3.1 is satisfied by running any online logistic regression algorithm over class F with zt as labels. (Rakhlin & Sridharan, 2014) characterized the minimax rates for online square loss regression in terms of the offset sequential Rademacher complexity for arbitrary F. This for example, leads to regret bounds of the form, Reg OR(T, F) = O(log |F|) for finite function classes F, and Reg OR(T, F) = O(d log(T)) when F is a ddimensional linear class. More examples can be found in (Rakhlin & Sridharan, 2014) (Section 4) and efficient implementations can be found in (Krishnamurthy et al., 2017) and (Foster et al., 2018a). 3.3. Online Learning Oracles with Given Constraints Next, we consider the following online learning problem with given constraints, which will be instrumental in solving our problem of online learning with unknown constraints. On every round t, nature first announces a context xt X and constraint set corresponding to this context At(xt) A. Because the context is fixed on every timestep, we drop the dependence on xt and abbreviate the constraint set as At. The learner responds with a distribution pt Ω(At) and draws at pt. Nature then produces a loss function yt Y. The learner suffers loss ℓ(at, yt). Let Π X A be some given policy class the learner is benchmarked against. A policy π Π satisfies the given constraint at time t if π(xt) At. Let us denote the set of policies that satisfy all given constraints (in hindsight) as ΠT := {π Π : t, π(xt) At}. The goal of the learner is to minimize regret w.r.t. the best policy in ΠT . We assume access to an online learning oracle, Oracle OL that has bounded regret for the problem of online learning with given constraints: Assumption 3.2 (Online Learning Oracle). For any sequence of adversarially chosen sets, contexts and outcomes {At, xt, yt}T t=1 and any δ (0, 1) with probability at least 1 δ, the algorithm Oracle OL produces a sequence of distributions {pt}T t=1 satisfying pt Ω(At) for all t [T] with expected regret bounded as: t=1 E at pt [ℓ(at, xt, yt)] min π ΠT t=1 ℓ(π(xt), xt, yt) Reg OL(T, δ, Π) In our application to online learning with unknown con- straints, we will pass increasingly accurate approximations of the true set of safe actions for a given context xt (i.e. {a A | f (a, xt) 0}) in place of At to Oracle OL. The complexity of the unconstrained variant of Assumption 3.2 is well understood ((Rakhlin et al., 2010)). On a first glance, it may appear that the adversarially chosen constraint sets At may pose a significant challenge to learnability. We show that this is not the case using online symmetrization arguments developed in ((Rakhlin et al., 2010)) and a minimax analysis: Proposition 3.3. There exists an algorithm satisfying Assumption 3.2 with Reg OL(T, δ, Π) 2 Radseq ℓ Π(T) where Radseq ℓ Π(T) is the sequential Rademacher complexity of the loss class, defined as: Radseq ℓ Π(T) := t=1 ϵtℓ(π(xt(ϵ1:t 1)), xt(ϵ1:t 1), yt(ϵ1:t 1)) where in the above supremum over y and x are taken over all mappings of the form y : ST 1 t=0 { 1}t 7 Y and x : ST 1 t=0 { 1}t 7 X respectively. Various properties and examples of bounds on Radseq ℓ Π(T) can be found in (Rakhlin et al., 2010). Notably, this result is non-constructive and only guarantees the existence of such regret minimizing oracles. In section 5 we provide a constructive gradient-descent based algorithm in the online linear optimization setting. 3.4. Eluder Dimension Before delving into our main results, we recall the following definition of ϵ-dependence and eluder dimension (Russo & Roy, 2013), (Foster et al., 2021), (Foster et al., 2023). Definition 3.4. An action, context pair (a, x) A X is ϵ-dependent on {ai, xi}t i=1 A X w.r.t. F if every f, f F satisfying q Pt i=1(f(ai, xi) f (ai, xi))2 ϵ also satisfies f(a, x) f (a, x) ϵ. (a, x) is ϵ-independent w.r.t. F if a is not ϵ-dependent on {(ai, xi)}t i=1. The eluder dimension E(F, ϵ) is the length of the longest sequence of pairs in A X such that for some ϵ > ϵ, each pair is ϵ -independent of its predecessors. The eluder dimension is bounded for a variety of function classes. For example, when F is finite, E(F, ϵ) |F| 1, and when F the class of linear functions, E(F, ϵ) O(d log(1/ϵ)). The eluder dimension of function class F will be a component of our regret bounds. Online Learning with Unknown Constraints 4. Main Results We first propose our complexity measure Vκ( ) that captures the per timestep trade-off between loss minimization and information gain w.r.t safety function set F. We show that this complexity characterizes the difficulty of online learning with unknown constraints by showing an upper bound featuring Vκ( ) and a lower bound that demonstrates a bounded Vκ( ) is asymptotically necessary. Finally, we supplement our results by slightly modifying our algorithm to attain long term constraint satisfaction allowing us to contribute to the literature on online learning with long term constraints. Notably, Vκ( ) does not appear in this bound, suggesting this complexity measure is inherent to per-timestep constraint satisfaction. 4.1. Complexity Measure and Main Theorem Given the oracle assumptions, we introduce a few building blocks, define our complexity measure Vκ, and provide our main theorem. We first introduce sets of functions likely to contain our true safety function f , and from these sets of functions, build a pair of sets approximating the true safe action set. Suppose up until some timestep t we have played actions {as}t 1 s=1 and seen contexts and noisy observations {xs, zs}t 1 s=1 so far. Given the history {as, xs, zs}t 1 s=1 suppose Oracle OR produces a sequence {ˆzt}t 1 s=1. Let us define a version space Ft F0 defined as: s=1 (f(as, xs) ˆzs)2 Reg OR(T, δ, F0)} (1) By Assumption 3.1, the sum of squares deviations P t(ˆzt f (at, xt))2 Reg OR(T, δ, F0) with high probability. Therefore, the version spaces Ft contain f with high probability. Now suppose we have some set G F containing the true safety function f . Given a context x X, we define the optimistic action set O(G, x), and pessimistic action set P(G, x) as: O(G, x) = {a A | f G, f(a, x) 0} P(G, x) = {a A | f G, f(a, x) 0} (2) As shorthand, we denote Ot := O(Ft, xt), Pt := P(Ft, xt). The optimistic set represents the set of all actions that could be safe, and the pessimistic set represents the set of all actions that are guaranteed to be safe. We capture this notion in the following proposition: Proposition 4.1. Suppose f G F. Then: P(G, x) {a A : f (a, x) 0} O(G, x) Intuitively, this proposition suggests an algorithm playing from P(Ft, xt) is always safe, and competitiveness with respect to O(Ft, xt) implies competitiveness with all safe actions. We will delve further into this intuition later in this section. For a set G F, we recall the definition of the width of G with respect to action a and context x introduced in (Russo & Roy, 2013): G(a, x) := sup g,g G g(a, x) g (a, x) (3) For any action a O(G, x) and g G, g(a, x) G(a, x) - hence the width captures a notion of how far an optimistic action is from being pessimistic. Simultaneously if G(a, x) is small, all functions in G have similar values on a, indicating that a provides little information if our goal is to differentiate members of G. These two facts will be instrumental in our analysis. We now utilize the regret-minimizing properties of Oracle OL to be competitive w.r.t. Ot = O(Ft, xt). Suppose we pass given context xt and optimistic set Ot as inputs to the online learning oracle Oracle OL. Denote pt Ω(Ot) as the distribution recommended by Oracle OL. Because Ot contains all constraint-satisfying actions with high probability, combined with the regret guarantee of Oracle OL in Assumption 3.2, this guarantees that actions drawn from pt will have good performance in regret. As Proposition 4.1 suggests, playing actions from Pt ensures constraint satisfaction. Motivated by this fact, we aim to play actions drawn from Ω(Pt) optimizing a certain objective striking a balance between matching the performance of (potentially unsafe) pt Ω(Ot) and information acquisition captured by width. To this end, let M be any mapping Ω(Ot) Ω(Pt) parameterized by the function class Ft and context xt. 1 We will show that when we draw actions at pt = M( pt; Ft, xt), we can bound regret by quantities involving the eluder dimension, Reg OR, Reg OL, and a novel complexity measure VM κ defined as: VM κ ( pt, Ft, xt) := E at M( pt;Ft,xt) [ℓ(at, xt, y)] E at pt [ℓ( at, xt, y)] κ E at M( pt;Ft,xt) [ Ft(at, xt)] Consequently for a given κ, the optimal mapping M is to optimize for this complexity measure through a saddle-point optimization: M κ( pt, Ft, xt) := argmin pt Ω(Pt) sup y Y E at pt [ℓ(at, xt, y)] E at pt [ℓ( at, xt, y)] κ E at pt [ Ft(at, xt)] 1Wherever it is clear what the arguments to the mapping are we may drop them e.g. in a non-contextual setting, the context argument is dropped. Online Learning with Unknown Constraints We note that the above optimization can be (inefficiently) solved to desired accuracy through standard techniques by treating it as a two-player game where the min-player chooses p and the max-player choses f, f , y and we form an ϵ-net over the set of actions. Later in section 5, we will give examples of efficient mappings that can be used instead of this optimal mapping. Let us define the complexity measure Vκ( ) := VM κ ( ) as the value attained by the optimal mapping: Vκ( pt, Ft, xt) := inf pt Ω(Pt) sup y Y E at pt [ℓ(at, xt, y)] E at pt [ℓ( at, xt, y)] κ E at pt [ Ft(at, xt)] Vκ is similar in spirit to the Decision Estimation Coefficient (DEC) of (Foster et al., 2023) in that it balances a notion of worst-case loss with one of information gain. Specifically, Vκ captures the per-step tension between loss minimization with respect to at pt produced by Oracle OL and information gain with respect to safety function set Ft, measured by the width of action at on Ft. We now present our main theorem: Theorem 4.2. For any δ (0, 1) there exists an algorithm that with probability at least 1 3δ, produces a sequence of actions {at}T t=1 that are safe and enjoys the following bound on regret: Regret T inf κ>0 t=1 Vκ( pt, Ft, xt) +κ inf α αT + 20 α Reg OR(T, δ, F0)E(F0, α) o + Reg OL(T, δ, Π) + q Moreover, there exists a problem setting such that if there exists κ > 0 satisfying t=1 Vκ( pt, Ft, x) > 0, then, no safe algorithm (with high probability) can ensure that Regret T = o(T). Theorem 4.2 follows by combining the upper bound (Theorem 4.3) with the lower bound (Theorem 4.4), presented in the following subsections. 4.2. Algorithm and Upper Bound We now present an algorithm that attains the upper bound in Theorem 4.2. While using the optimal mapping M gives us the best upper bound, solving the optimization involved (eq. 4) may be difficult. To address this, we allow the use of any (potentially efficient) mapping M. Algorithm 1 General Constrained Online Learning 1: Input: Oracle OL, Oracle OR, A0( ), δ (0, 1), κ, M 2: F0 = {f F : x X, a A0(x), f(a, x) 0} 3: for t = 1, . . . , T do 4: Receive context xt Ft = {f F0 : s=1 (f(as, xs) ˆzs)2 Reg OR(T, δ, F0)} 5: Ot = O(Ft, xt) , Pt = P(Ft, xt) 6: pt = Oracle OLt(xt, Ot) 7: pt = M( pt; Ft, xt) 8: Draw at pt 9: Receive noisy feedback zt 10: Update ˆzt = Oracle ORt(at, xt, zt) 11: Play at and receive yt 12: end for Theorem 4.3. For any δ (0, 1) with probability at least 1 3δ, Algorithm 1 when using mapping M produces a sequence of actions {at}T t=1 that are safe and enjoys the following bound on regret: Regret T inf κ>0 t=1 VM κ ( pt, Ft, xt) +κ inf α αT + 20 α Reg OR(T, δ, F0)E(F0, α) o + Reg OL(T, δ, Π) + q Furthermore, using κ = κ (F, M) defined as sup ˆ F F,x X, p (O( ˆ F,x)),y Y Ea M( p; ˆ F,x)[ℓ(a,x,y)] E a p[ℓ( a,x,y)] Ea M( p; ˆ F,x)[ ˆ F (a,x)] Regret T κ inf α αT + 20Reg OR(T, δ, F0)E(F0, α) + Reg OL(T, δ, Π) + p The proof follows by leveraging the low-regret guarantee of Oracle OL, manipulating the definition of VM κ on a pertimestep basis, and bounding the sum of information gain terms Ft(at, xt) by an extension of techniques first presented in (Russo & Roy, 2013). In the next section, we give concrete examples of efficient M which yield bounded κ for a few important settings. 4.3. Lower Bound Assuming the existence of Oracle OL with bounded Reg OL is reasonable as f could potentially be the always-safe zero function f0 : A 7 0, reducing the problem down to exactly the unconstrainted online learning problem as Ot = A. Being able to perform regression on F using a Oracle OR with bounded Reg OR and assumption that eluder dimension Online Learning with Unknown Constraints of F is well behaved are also commonplace. However, the reader may question if the sum PT t=1 Vκ( pt, Ft, xt) must necessarily be small for safe learnability. We show that at least asymptotically, PT t=1 Vκ( pt, Ft, xt) indeed must be small to guarantee sublinear regret. We capture this notion in the following theorem: Theorem 4.4. Suppose we are given some A0, X = {}, Y = {}, f F and losses ℓ: A 7 R are fixed such that for any a A satisfying f (a) > 0, ℓ(a) = mina A:f (a ) 0 ℓ(a ). Furthermore suppose ϵ, E(F, ϵ) is finite. Then, if there exists κ > 0 such that t=1 Vκ( pt, Ft) > 0, then, safe learning is impossible, i.e. no algorithm that is safe on every round (with high probability) can ensure that Regret T = o(T). On a high level, we show that if κ is such that lim T 1 T PT t=1 Vκ( pt, Ft) c > 0, then there must be a P A0 that is is guaranteed to be safe and a P we can estimate f (a) to arbitrary accuracy. Simultaneously we show that this P has the property that all actions within it are sub-optimal in loss by at least c when compared to best safe action. Further, this set P is non-expandable, meaning that we cannot find more actions that are guaranteed to be safe based on playing a P . Once we show the existence of P satisfying the aforementioned properties, we can simply announce to any learning algorithm A0 = P . Because the set is non-expandable, and since the learning algorithm must be safe, we conclude any algorithm is doomed to only play actions within P - which is known to be c sub-optimal. 4.4. Per-Timestep Constraints vs Long Term Constraints Suppose we are only interested in ensuring that the sum of constraint violations PT t=1 f (at, xt) is o(T), as is the goal in the line of works studying online learning with long term constraints (Mahdavi et al., 2012), (Yu & Neely, 2020), (Sun et al., 2017). We show that this can be done without use of the mapping M present in the main algorithm - the idea will be to simply play the output of the online learning algorithm given sets {Ot}T t=1. Algorithm 2 defined in the appendix guarantees: Lemma 4.5. For any δ (0, 1) with probability at least 1 2δ, Algorithm 2 produces a sequence of actions {at}T t=1 that satisfies: Regret T Reg OL(T, δ, Π) and t=1 f (at, xt) inf α αT + 20Reg OR(T, δ, F0)E(F0, α) Notably, Vκ( ) does not appear in either bound. This motivates the question: is assuming access to an online learning oracle and online regression oracle and that the eluder dimension of F is small enough for us to create algorithms that make no constraint violations with high probability? Unfortunately the answer is no. We need more assumptions on the initial safe set - which is what the mapping is utilizes. Specifically consider the case where ℓ(a, x, y) = y a, the constraint set is F = {(a, x) 7 f a : f 2 1} and suppose A0 = {0}. In this case E(F, ϵ) = d log(1/ϵ), and both the online learning oracle and online regression oracle are readily available and satisfy Assumptions 3.1 and 3.2 (e.g. use gradient descent and online linear regression algorithm). However, since A0 = {0} the initial pessimistic set is P1 = {0}. However since we are forced to play in this set, we don t gain any information about f and hence in the subsequent rounds Ft = F and Pt = P1. Thus we cannot hope to play anything other than the single safe choice a0 = 0 which prevents us from achieving low regret. 5. Examples In this section, we will give examples of function classes F where we can construct mappings M that have bounded κ (F, M). Consequently, this results in concrete regret bounds from Theorem 4.2 if these mappings are used in Algorithm 1. 5.1. Finite Action Spaces We first consider the setting of finite action spaces, where A = [K], X is arbitrary, FFAS A X [ 1, 1], and losses are functions ℓ: A X Y [0, 1]. Suppose we make the following assumption that promises some separation between function values: Assumption 5.1. F FFAS s.t. f F and P(F, x) = O(F, x), maxa P (F,x) F(a, x) 0 > 0 This assumption is in fact necessary for safe learning for large T. To see this, suppose F FFAS satisfies P(F , x) = O(F , x) but a P(F , x), F (a, x) = 0. Then, we have no hope of shrinking F , and consequently expanding P(F , x). If the adversarial losses have values of 1 for all a P(F , x), and values of 0 for all a / P(F , x), we d suffer constant loss for all subsequent rounds. Now, for a set of functions F FFAS and context x, let a (F, x) := argmaxa P (F,x) F(a, x) be the widthmaximizing action in the pessimistic set. We define a mapping for this setting as: MFAS( p; F, x) := ( ea (F,x) if P(F, x) = O(F, x) p otherwise Where ea is the distribution that places all its mass on a. The Online Learning with Unknown Constraints above mapping leads to an explore then exploit algorithm with respect to FFAS that plays the maximum width safe action until no uncertainty remains w.r.t FFAS. We show that MFAS as defined above has bounded κ (FFAS, MFAS). Lemma 5.2. Suppose Assumption 5.1 holds. Then, κ (FFAS, MFAS) 2 0 . Remark 5.3. We note that we only require the existence of 0 and not knowledge of it. 0 is only required to set κ = κ (FFAS, MFAS). In appendix section C.1, we describe a procedure of adaptively selecting time-varing κt at every timestep such that Vκt(t) < 0 and maxt κt 2κ . 5.2. Linear Constraints Next, we consider the setting of linear constraints where A Rd, X = , Π = { 7 a | a A} and losses are functions ℓ: A Y [0, 1]. We show a randomized variant of the scaling-based doubly-optimistic method presented in (Hutchinson et al., 2024) (Hutchinson et al., 2025) has bounded κ . Suppose we make the following assumption on A and ℓ: Assumption 5.4. The action set A Rd is convex, compact, and bounded, maxa A a 2 Da. The losses are lipschitz with constant Dℓ, y Y, a, a A, |ℓ(a, y) ℓ(a , y)| Dℓ a a . The constants Da, Dℓare known to the learner. Let us consider the following function class: FLin = {(a, x) 7 f, a b|f Rd} where b > 0 is fixed, and the unknown constraint is f , a b 0. Suppose the initial safe given to the learner is A0 = {a A : a b}. Then, any F0 = {f Rd : f 1}, since f with f > 1 has f, b f ||f|| b > 0, yet b f ||f|| A0 violating the promised initial safe set. Remark 5.5. Suppose that A0 is the ℓ1 ball of diameter b instead of the ℓ2 ball of diameter b. Then F0 becomes the unit ℓ ball - and the eluder dimension E(F0, ϵ) increases by a factor of log(d). Now, for a set of functions F and a O(F), let γ( a; F) := max {γ [0, 1] : γ a P(F)}. Define MLin( p; F) as the distribution induced by drawing a p and outputting γ( a; F) a. In other words, we scale down each action to ensure that it is safe. We show that this scaling-based mapping MLin has bounded κ (FLin, MLin). Lemma 5.6. Suppose Assumption 5.4 holds. Then, κ (FLin, MLin) DℓDa Using the mapping MLin, we can get concrete algorithms for the setting of linear constraints, no contexts, and linear losses. Specifically, we set Oracle OR to be the Vovk-Azoury-Warmuth forecaster (Vovk, 1997), (Azoury & Warmuth, 2001) which satisfies Assumption 3.1 with Reg OR(T, δ, F) O(d log( T dδ)). In the case of linear losses, where ℓ(at, , yt) = ℓt, at , ℓt Rd we provide an online gradient descent based algorithm satisfying Assumption 3.2 (Algorithm 3). It is a randomized algorithm that plays elements of the convex hull of O(Ft) and is stated in the appendix. Lemma 5.7. Suppose Assumption 5.4 holds. Then Algorithm 3 satisfies Assumption 3.2 with: Reg OL(T, δ, Π) 4DℓDa p Finally, (Russo & Roy, 2013) show that the Eluder dimension of the linear function class is E(FLinear, ϵ) = O(d log(1/ϵ)). Combining these facts with our main regret bound, we have: Corollary 5.8. In linear case, for any δ > 0, with probability at least 1 δ Algorithm 1 satisfies: Regret T = O d b log( T 5.3. Composing an Activation Function Suppose we know that some function class G A X [ 1, 1] has bounded κ (G, MG) for some mapping MG. Consider a fixed activation function σ : [ 1, 1] [ 1, 1] satisfying the following: Assumption 5.9. σ : [ 1, 1] [ 1, 1] is a fixed, differentiable non-decreasing function such that σ(0) = 0. Furthermore, there exists c such that for all x [ 1, 1], 0 < c σ (x), where σ is the derivative of σ. Consider the following function class formed by composing G with σ: σ(G) = {(a, x) 7 σ (g(a, x)) |g G} The unknown constraint is then σ(g (a, x)) 0 for some g G. We show that if κ (G, MG) is bounded for some mapping MG, κ (σ(G), MG) is also bounded. As a consequence of Lemma 5.6 for FLin and the below lemma for compositions with activation functions, generalized linear function classes ((Kakade et al., 2011)) have bounded κ . Lemma 5.10. Let G be a function class with bounded κ (G, MG) for some mapping MG. Suppose assumption 5.4 holds. Then κ (σ(G), MG) κ (G,MG) 5.4. Vector-Valued Constraints Our algorithms naturally generalize to settings where the learner must satisfy vector-valued constraints. Specifically, given a context x X, an action a A is now considered safe if f(a, x) 0 for unknown vector-valued function f F A X [ 1, 1]m. We formalize this notion in the appendix subsection C.4 - it is a straightforward extension of our main results. Notably, for the setting of m linear constraints, with probability at least 1 δ, regret is O md T log(δ 1) . Online Learning with Unknown Constraints Interestingly, we can interpret each of the m coordinates as an independent safety feature, all of which must adhere to safety. Consequently, | f(a)[i]| denotes how unsafe action a is on safety feature i, and the constraint is that it cannot exceed b. In this scenario, suppose the safety feedback space is Z = [m] and let the safety feedback function psignal( f(a)) be given by: P(z = i | a) = exp(| f(a)[i]|) P j [m] exp(| f(a)[j]|) as per the Boltzmann distribution on | f(a)[1]| . . . | f(a)[j]|. This is a natural form of feedback known as the Bradley-Luce Shepherd rule ((Christiano et al., 2017), (Bradley & Terry, 1952)) that outputs which constraint is most likely to be violated - when given m options of varying magnitudes, human-generated feedback has been shown to follow such a distribution ((Ghosal et al., 2023)). In this setting, we can use the logistic regression oracle w.r.t F as Oracle OR. When learning linear predictors, (Foster et al., 2018b) provide an efficient algorithm with Reg OR d log T . 5.4.1. POLYTOPIC CONSTRAINTS WITH SCALAR FEEDBACK Suppose the safety function class is polytopic: FPoly = {(a, x) 7 Fa b|F Rd m} Furthermore, suppose the unknown constraint is f(a) 0 R, for some f FPoly. In this setting, MLin as defined in Lemma 5.6 has bounded κ for FPoly: Lemma 5.11. Suppose Assumption 5.4 holds. Then, κ (FPoly, MLin) DℓDa Acknowledgments KS acknowledges support from Linked In-Cornell grant. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Agarwal, A. Selective sampling algorithms for costsensitive multiclass prediction. In Proceedings of the 30th International Conference on Machine Learning, pp. 1220 1228. PMLR, 2013. Amani, S. and Thrampoulidis, C. Decentralized multi-agent linear bandits with safety constraints. Proceedings of the AAAI Conference on Artificial Intelligence, pp. 6627 6635, 2021. Atawnih, A., Papageorgiou, D., and Doulgeri, Z. Kinematic control of redundant robots with guaranteed joint limit avoidance. Robotics and Autonomous Systems, pp. 122 131, 2016. Azoury, K. S. and Warmuth, M. K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, pp. 211 246, 2001. Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, pp. 324 345, 1952. Brunke, L., Greeff, M., Hall, A. W., Yuan, Z., Zhou, S., Panerati, J., and Schoellig, A. P. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annu. Rev. Control. Robotics Auton. Syst., pp. 411 444, 2022. Chaudhary, S. and Kalathil, D. M. Safe online convex optimization with unknown linear safety constraints. In Thirty-Sixth AAAI Conference on Artificial Intelligence, pp. 6175 6182. AAAI Press, 2022. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30, 2017. Dobbe, R., Hidalgo-Gonzalez, P., Karagiannopoulos, S., Henriquez-Auba, R., Hug, G., Callaway, D., and Tomlin, C. Learning to control in power systems: Design and analysis guidelines for concrete safety problems. Electric Power Systems Research, pp. 106615, 2020. Fereydounian, M., Shen, Z., Mokhtari, A., Karbasi, A., and Hassani, H. Safe learning under uncertain objectives and constraints. ar Xiv preprint ar Xiv:2006.13326, 2020. Foster, D. J. and Rakhlin, A. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. In Proceedings of the 37th International Conference on Machine Learning, pp. 3199 3210. PMLR, 2020. Foster, D. J., Agarwal, A., Dud ık, M., Luo, H., and Schapire, R. E. Practical contextual bandits with regression oracles. In Proceedings of the 35th International Conference on Machine Learning, pp. 1534 1543. JMLR.org, 2018a. Foster, D. J., Kale, S., Luo, H., Mohri, M., and Sridharan, K. Logistic regression: The importance of being improper. In Conference On Learning Theory, pp. 167 208, 2018b. Foster, D. J., Rakhlin, A., Simchi-Levi, D., and Xu, Y. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. In Conference on Learning Theory, pp. 2059. PMLR, 2021. Online Learning with Unknown Constraints Foster, D. J., Rakhlin, A., Sekhari, A., and Sridharan, K. On the complexity of adversarial decision making. In Advances in Neural Information Processing Systems 35, 2022. Foster, D. J., Kakade, S. M., Qian, J., and Rakhlin, A. The statistical complexity of interactive decision making, 2023. Ghosal, G. R., Zurek, M., Brown, D. S., and Dragan, A. D. The effect of modeling human rationality level on learning rewards from multiple feedback types. In Thirty Seventh AAAI Conference on Artificial Intelligence, pp. 5983 5992, 2023. Gillen, S., Jung, C., Kearns, M. J., and Roth, A. Online learning with an unknown fairness metric. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, 3-8 December 2018, Montr eal, Canada, pp. 2605 2614, 2018. Hutchinson, S. and Alizadeh, M. Safe online convex optimization with first-order feedback. In American Control Conference, pp. 1 7. IEEE, 2024a. Hutchinson, S. and Alizadeh, M. Safe online convex optimization with multi-point feedback. In Learning for Dynamics and Control Conference, pp. 168 180. PMLR, 2024b. Hutchinson, S., Turan, B., and Alizadeh, M. Directional optimism for safe linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 658 666. PMLR, 2024. Hutchinson, S., Chen, T., and Alizadeh, M. Optimistic safety for online convex optimization with unknown linear constraints. In International Conference on Artificial Intelligence and Statistics, 2025. Jenatton, R., Huang, J., and Archambeau, C. Adaptive algorithms for online convex optimization with long-term constraints. In Proceedings of the 33nd International Conference on Machine Learning, pp. 402 411. JMLR.org, 2016. Kakade, S. M., Kalai, A., Kanade, V., and Shamir, O. Efficient learning of generalized linear and single index models with isotonic regression. In Neural Information Processing Systems 2011, pp. 927 935, 2011. Krishnamurthy, A., Agarwal, A., Huang, T.-K., Daum e, III, H., and Langford, J. Active learning for cost-sensitive classification. In International Conference on Machine Learning, pp. 1915 1924. PMLR, 2017. Levy, D., Sun, Z., Amin, K., Kale, S., Kulesza, A., Mohri, M., and Suresh, A. T. Learning with user-level privacy. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 12466 12479, 2021. Mahdavi, M., Jin, R., and Yang, T. Trading regret for efficiency: online convex optimization with long term constraints. Journal of Machine Learning Research, pp. 2503 2528, 2012. Moradipari, A., Amani, S., Alizadeh, M., and Thrampoulidis, C. Safe linear thompson sampling with side information. IEEE Transactions on Signal Processing, pp. 3755 3767, 2021. Neely, M. J. and Yu, H. Online convex optimization with time-varying constraints, 2017. Pacchiano, A., Ghavamzadeh, M., Bartlett, P. L., and Jiang, H. Stochastic bandits with linear constraints. In The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, pp. 2827 2835. PMLR, 2021. Pacchiano, A., Ghavamzadeh, M., and Bartlett, P. Contextual bandits with stage-wise constraints, 2024. Rakhlin, A. and Sridharan, K. Online non-parametric regression. In Proceedings of The 27th Conference on Learning Theory, pp. 1232 1264. PMLR, 2014. Rakhlin, A., Sridharan, K., and Tewari, A. Online learning: Random averages, combinatorial parameters, and learnability. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2010. Russo, D. and Roy, B. V. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pp. 2256 2264, 2013. Sekhari, A., Sridharan, K., Sun, W., and Wu, R. Contextual bandits and imitation learning via preference-based active queries, 2023a. Sekhari, A., Sridharan, K., Sun, W., and Wu, R. Selective sampling and imitation learning via online regression. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. Sun, W., Dey, D., and Kapoor, A. Safety-aware algorithms for adversarial contextual bandit. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 3280 3288. JMLR.org, 2017. Online Learning with Unknown Constraints Usmanova, I., Krause, A., and Kamgarpour, M. Safe convex learning under uncertain constraints. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pp. 2106 2114. PMLR, 2019. Vovk, V. Competitive on-line linear regression. In Advances in Neural Information Processing Systems 10, [NIPS Conference, Denver, Colorado, USA, 1997], pp. 364 370. The MIT Press, 1997. Yi, X., Li, X., Xie, L., and Johansson, K. Distributed online convex optimization with time-varying coupled inequality constraints. IEEE Transactions on Signal Processing, pp. 1 1, 2020. Yu, H. and Neely, M. J. A low complexity algorithm with O( T) regret and O(1) constraint violations for online convex optimization with long term constraints. Journal of Machine Learning Research, 2020. Zafar, M. B., Valera, I., Gomez-Rodriguez, M., and Gummadi, K. P. Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research, 2019. Online Learning with Unknown Constraints A. Proofs from Section 3: Setup Definition A.1. A function Φ : [ 1, 1] R is λ-strongly convex if for all z, z [ 1, 1], it satisfies λ 2 (z z)2 Φ(z ) Φ(z) + φ(z)(z z )s where φ( ) is the derivative of Φ. The following formulation of link functions is standard in the literature (Sekhari et al., 2023a) Definition A.2. For a link function φ that is the derivative of a λ-strongly convex function Φ, we define the associated loss: ℓφ(z, z ) := Φ(z) z(z + 1) Assumption A.3 (Online Regression Oracle, Regret Version). The algorithm Oracle OR guarantees that for any (possibly adversarially chosen) sequence {at, xt}T t=1 generates predictions {ˆzt}T t=1 satisfying: t=1 ℓφ(ˆzt, zt) inf f F t=1 ℓφ(f(at, xt), zt) Regφ OR(T, F) where zt φ(f (at, xt)). The following lemma is adapted from (Sekhari et al., 2023a), Lemma 9 and related to (Agarwal, 2013), Lemma 2 Lemma A.4. Suppose that zt is generated with a link function φ that is λ-strongly convex. Suppose that the regression oracle satisfies assumption A.3. Then for any δ (0, 1) and T 3, with probability at least 1 δ, the regression oracle satisfies assumption 3.1 with: Reg OR(T, δ, F) 4 λRegφ OR(T, F) + 16 + 24λ λ2 log 4δ 1 log(T) . Proof. The proof is an application of (Sekhari et al., 2023a) Lemma 9. Proposition (Proposition 3.3 restated). There exists an algorithm satisfying Assumption 3.2 with Reg OL(T, δ, Π) 2 Radseq ℓ Π(T) where Radseq ℓ Π(T) is the sequential Rademacher complexity of the loss class, defined as: Radseq ℓ Π(T) := sup y,x Eϵ t=1 ϵtℓ(π(xt(ϵ1:t 1)), xt(ϵ1:t 1), yt(ϵ1:t 1)) where in the above supremum over y and x are taken over all mappings of the form y : ST 1 t=0 { 1}t 7 Y and x : ST 1 t=0 { 1}t 7 X respectively. Proof. We show that expected regret is bounded by 2Radseq ℓ Π(T) through a minimax analysis with sequential symmetrization techniques that are now standard from (Rakhlin et al., 2010). We use the notation Operatort T t=1 [A] to denote Operator1{Operator2{. . . Operator T {A} . . .}}. Furthermore, we denote the set of safe policies in hindsight as ΠT := {π Π : t, π(xt) At}. We view our online learning setting as a repeated game between adversary and learner where on each round t adversary picks a context and a set At learner picks a (randomized) action from this set and finally adversary picks yt for that round. The value of this game can we written as: Val T = sup xt,At inf pt Ω(At) sup yt Y Eat pt t=1 ℓ(at, xt, yt) min π ΠT t=1 ℓ(π(xt), xt, yt) sup xt,At sup qt Ω(Y) inf at At Eyt qt t=1 ℓ(at, xt, yt) min π ΠT t=1 ℓ(π(xt), xt, yt) Online Learning with Unknown Constraints sup xt,At sup qt Ω(Y) Eyt qt t=1 inf at At Eyt qt [ℓ(at, xt, yt)] min π ΠT t=1 ℓ(π(xt), xt, yt) sup xt,At sup qt Ω(Y) Eyt qt t=1 Eyt qt [ℓ(π(xt), xt, yt)] ℓ(π(xt), xt, yt) sup xt,At sup qt Ω(Y) Eyt,y t qt t=1 (ℓ(π(xt), xt, y t) ℓ(π(xt), xt, yt)) sup xt,At sup qt Ω(Y) Eyt,y t qt Eϵt t=1 ϵt (ℓ(π(xt), xt, y t) ℓ(π(xt), xt, yt)) sup xt,At sup yt,y t Y Eϵt t=1 ϵt (ℓ(π(xt), xt, y t) ℓ(π(xt), xt, yt)) sup xt,At sup yt,y t Y Eϵt t=1 ϵt (ℓ(π(xt), xt, y t) ℓ(π(xt), xt, yt)) where first line is obtained using repeated application of minimax theorem (which holds with minor assumptions on action sets and context set etc. that can be found in (Rakhlin et al., 2010)). Second line is a rearrangement. The next line is by noting that each loss-minimizing action at At has smaller losses than π(xt) At. The rest of the steps above are standard sequential symmetrization arguments. The key step is the last inequality above where we use the fact that ΠT Π. But once this is done, the inner terms are devoid of At s and so we drop them in the supremums and this results in two times the sequential Rademacher complexity of the loss class yielding: Val T 2 Radseq ℓ Π(T) Since the minmax value Val T is bounded, there exists a regret minimizing algorithm with required bound and this concludes the proof. B. Proofs from Section 4: Main Results Proposition (Proposition 4.1 restated). Suppose f G F. Then: P(G, x) {a A : f (a, x) 0} O(G, x) Proof. First suppose a is such that f (a, xt) 0. Since f G, a O(G, x). Hence, {a A : f (a, xt) 0} O(G, xt). Now suppose a P(F, xt). Then, f G, f(a, xt) 0. Since f G, f (a, xt) 0, and hence P(G, xt) {a A : f (a, xt) 0} Theorem (Theorem 4.2 restated). For any δ (0, 1) there exists an algorithm that with probability at least 1 3δ, produces a sequence of actions {at}T t=1 that are safe and enjoys the following bound on regret: Regret T inf κ>0 t=1 Vκ( pt, Ft, xt) + κ inf α αT + 20 α Reg OR(T, δ, F0)E(F0, α) ) + Reg OL(T, δ, Π) + q Moreover, there exists a problem setting such that if there exists κ > 0 satisfying t=1 Vκ( pt, Ft, x) > 0, then, safe learning is impossible, i.e. no algorithm that is safe on every round (with high probability) can ensure that Regret T = o(T). Proof. The first statement follows by applying Theorem 4.3 using the optimal mapping M = M (defined in eq. 4). The second statement follows from Theorem 4.4. Online Learning with Unknown Constraints B.1. Proofs of Upper Bounds Lemma B.1. With probability at least 1 δ, for all t [T], f Ft. Proof. This follows immediately from Assumption 3.1 which guarantees with probability at least 1 δ. s=1 (f (as, xs) ˆzs)2 Reg OR(T, δ, F0) and hence for any t [T], s=1 (f (as, xs) ˆzs)2 Reg OR(T, δ, F0) which shows f Ft. Proposition B.2. Let {Ft}T t=1 be the sequence of version spaces generated by Algorithm 1. For any δ (0, 1) with probability at least (1 δ) we have for any t [T] P(Ft, xt) {a A : f (a, xt) 0} O(Ft, xt) Proof. By Lemma B.1, we have with probability at least 1 δ that f Ft simultaneously for all t [T]. Applying Proposition 4.1 immediately yields the result. The following lemma bounds the number of times the width of the set Ft can exceed some threshold, and is a variant of Proposition 3 of (Russo & Roy, 2013). It is slightly different as our Ft are constructed around the predictions produced by Oracle OR. We state it for completeness. Lemma B.3. Let the sequence {Ft, at, ˆzt}T t=1 be generated by Algorithm 1. Then, for any sequence of adversarial contexts {xt}T t=1, and ϵ > 0, it holds that t=1 1{ Ft(at, xt) > ϵ} 4Reg OR(T, δ, F0) ϵ2 + 1 E(F0, ϵ) Proof. First we claim that for t [T] if Ft(at, xt) ϵ, then (at, xt) must be ϵ-dependent on at most 4Reg OR(T,δ,F0) ϵ2 disjoint subsequences of (a1, x1) (at 1, xt 1). Since Ft(at, xt) > ϵ, there must exist two functions f, f Ft satisfying f(at, xt) f (at, xt) > ϵ. By the definition of ϵ-dependence, if (at, xt) is ϵ-dependent on a sequence (ai1, xi1) (aiτ , xiτ ) of its predecessors, we must have Pτ j=1(f(aij, xij) f (aij, xij))2 > ϵ2. Therefore, if (at, xt) is ϵ-dependent on N such subsequences it follows that Pt 1 j=1(f(aj, xj) f (aj, xj))2 > Nϵ2. Therefore: j=1 (f(aj, xj) f (aj, xj))2 j=1 (f(aj, xj) ˆzj + ˆzj f (aj, xj))2 j=1 (f(aj, xj) ˆzj)2 + 2 j=1 (f (aj, xj) ˆzj)2 4Reg OR(T, δ, F0) where the second inequality follows from (a + b)2 2a2 + 2b2 and the third follows from f, f Ft. Online Learning with Unknown Constraints Second, we claim that for any k [T] and any sequence (a1, x1) (ak, xk), there must be a j k such that (aj, xj) is ϵ-dependent on at least N = k/E(F0, ϵ) 1 disjoint subsequences of its predecessors. We will show an iterative process of finding such an index j. Let S1 SN be N subsequences initialized as Si = {(ai, xi)} for i [N]. For j [N + 1, k] first check if (aj, xj) is ϵ-dependent of all Si, i [N]. If it is, we have found the index j satisfying our condition. Otherwise, pick a Si such that xj is ϵ-independent of Sj, and add xj to that Si. By the definition of eluder dimension, the maximum size of each Si, i [N] is E(F0, ϵ), and because N E(F0, ϵ) k 1, the process will terminate. Now, let (ai1, xi1) (aik, xik) be the subsequence such that for j [k], Ft(aij, xij) > ϵ. By the first claim, each element of this subsequence is ϵ-dependent on at most 4Reg OR(T,δ,F0) ϵ2 disjoint subsequences. By the second claim, there is some element that is ϵ-dependent on at least (k 1)/E(F0, ϵ) disjoint subsequences. It follows that (k/E(F0, ϵ) 1 4Reg OR(T,δ,F0) ϵ2 , and hence k 4Reg OR(T,δ,F0) ϵ2 + 1 E(F0, ϵ) The following Lemma utilizes Lemma B.3 to upper bound the sum of Ft. It is similar in spirit to Lemma 2 of (Russo & Roy, 2013), but our analysis is different and captures a trade-off between T and Reg OR(T, δ, F0)E(F0, ). Lemma B.4. Let the sequence {Ft, pt}T t=1 be generated by Algorithm 1. Then, for any sequence of adversarial contexts {xt}T t=1, t=1 E at pt [ Ft(at, xt)] inf α αT + 20Reg OR(T, δ, F0)E(F0, α) Proof. For a run of Algorithm 1, let {at}T t=1 be any sequence of actions drawn at pt for all t [T]. Furthermore, to simplify the notation, let us denote t := Ft(at, xt). Let us consider some arbitrary α > 0. Then, for this sequence of actions and contexts, t=1 Ft(at, xt) := log(2/α) 1 X t:2iα< t 2i+1α t log(2/α) 1 X t:2iα< t 2i+1α 2i+1α log(2/α) 1 X 2i+1α 4Reg OR(T, δ, F0) 22iα2 + 1 E(F0, 2iα) log(2/α) 1 X 2i+1α 5Reg OR(T, δ, F0) log(2/α) 1 X 10Reg OR(T, δ, F0) (iv) αT + E(F, α) 10Reg OR(T, δ, F0) (v) αT + 20Reg OR(T, δ, F0) In (i) we set the upper bound to the sum as log(2/α) 1 since all functions f F map to [ 1, 1], hence t 2 so it is enough to consider i : 2i+1α 2 and (ii) follows from Lemma B.3, (iii) follows from the fact that 1 Reg OR(T,δ,F) (2iα)2 for i [log(2/α) 1] if T > 1, (iv) follows from the fact that E(F, ) is nonincreasing in its second argument, and (v) is an Online Learning with Unknown Constraints upper bound from the sum of an infinite series. Therefore, for any sequence {Ft, at, xt}T t=1 generated by Algorithm 1 we have t=1 Ft(at, xt) αT + 20Reg OR(T, δ, F0) Now, since this holds for any sequence {Ft, at}T t=1 generated by the algorithm and adversarial contexts {xt}T t=1, it holds in expectation over the algorithm s draws. Theorem (Theorem 4.2 restated). For any δ (0, 1) with probability at least 1 3δ, Algorithm 1 when using mapping M produces a sequence of actions {at}T t=1 that are safe and enjoys the following bound on regret: Regret T inf κ>0 t=1 VM κ ( pt, Ft, xt) + κ inf α αT + 20 α Reg OR(T, δ, F0)E(F0, α) ) + Reg OL(T, δ, Π) + q VM κ ( pt; Ft, xt) = sup y Y E at M( pt;Ft,xt) [ℓ(at, xt, y)] E at pt [ℓ( at, xt, y)] κ E at M( pt;Ft,xt) [ Ft(at, xt)] Furthermore, using κ = κ (F, M) := sup ˆ F F,x X, p (O( ˆ F,x)),y Y Ea M( p; ˆ F,x)[ℓ(a,x,y)] E a p[ℓ( a,x,y)] Ea M( p; ˆ F,x)[ ˆ F(a,x)] : Regret T κ inf α αT + 20Reg OR(T, δ, F0)E(F0, α) + Reg OL(T, δ, Π) + p Proof. By Proposition B.2, with probability at least 1 δ, if we play actions from Pt, we can guarantee the all the constraints are satisfied. On the other hand, to bound the regret of our algorithm w.r.t. the optimal action in hindsight that also satisfies constraint on every round, note that Regret T := t=1 ℓ(at, xt, yt) min π Π: t f (π(xt),xt) 0 t=1 ℓ(π(xt), xt, yt) t=1 ℓ(at, xt, yt) min π Π: t π(xt) Ot t=1 ℓ(a, xt, yt) ℓ(at, xt, yt) E at pt [ℓ( at, xt, yt)] + t=1 E at pt [ℓ( at, xt, yt)] min π Π: t π(xt) Ot ℓ(a, xt, yt) ℓ(at, xt, yt) E at pt [ℓ( at, xt, yt)] + Reg OL(T, δ, Π) E at pt [ℓ(at, xt, yt)] E at pt [ℓ( at, xt, yt)] + Reg OL(T, δ, Π) + p t=1 VM κ ( pt; Pt, Ft, xt) + κ t=1 E at pt [ Ft(at, xt)] + Reg OL(T, δ, Π) + p where (i) follows from the fact that by Proposition B.2, a policy π satisfying t, f (π(xt), xt) 0 satisfies t, π(xt) Ot, (ii) is an application of Hoeffding Azuma to bound PT t=1 ℓ(at, xt, yt) PT t=1 Eat pt [ℓ(at, xt, yt)] and: VM κ ( pt; Ft, xt) = sup y Y E at M( pt;Ft,xt) [ℓ(at, xt, y)] E at pt [ℓ( at, xt, y)] κ E at M( pt;Ft,xt) [ Ft(at, xt)] Online Learning with Unknown Constraints by Lemma B.4 we can bound the PT t=1 κ Eat pt [ Ft(at, xt)] term, hence, Regret T inf κ>0 t=1 VM κ ( pt; Ft, xt) + κ inf α αT + 20Reg OR(T, δ, F0)E(F0, α) + Reg OL(T, δ, Π) + p This concludes the first bound - which holds with probability at least 1 3δ as we take a union bound over the online regression oracle guarantee, the online learning oracle guarantee, and the application of Hoeffding Azuma. To conclude the second part of the statement, we need to show that for κ = κ (F, M) := sup ˆ F F,x X, p (O( ˆ F,x)),y Y Ea M( p; ˆ F,x) [ℓ(a, x, y)] E a p [ℓ( a, x, y)] Ea M( p; ˆ F,x) ˆ F(a, x) we have that VM κ ( pt; Ft, xt) 0. To this end, note that VM κ ( pt; Ft, xt) E at M( pt;Ft,xt) [ℓ(at, xt, y)] E at pt [ℓ( at, xt, y)] κ E at M( pt;Ft,xt) [ Ft(at, xt)] E at M( pt;Ft,xt) [ℓ(at, xt, y)] E at pt [ℓ( at, xt, y)] sup ˆ F F,x X, p (O( ˆ F,x)),y Y Ea M( p; ˆ F,x) [ℓ(a, x, y)] E a p [ℓ( a, x, y)] Ea M( p; ˆ F,x) ˆ F(a, x) E at M( pt;Ft,xt) [ Ft(at, xt)] E at M( pt;Ft,xt) [ℓ(at, xt, y)] E at pt [ℓ( at, xt, y)] supy Y{Eat M( pt;Ft,xt)[ℓ(at,xt,y)] E at pt[ℓ( at,xt,y)]} Eat M( pt;Ft,xt)[ Ft(at,xt)] E at M( pt;Pt,Ft,xt) [ Ft(at, xt)] B.2. Proofs of Lower Bounds We formalize the notion of P A0 with the properties described in subsection 4.3 with the following lemma, and show its existence. Lemma B.5. Assume that we have a fixed loss function ℓ: A 7 R such that for any a A satisfying f (a) > 0, ℓ(a) = mina A:f (a ) 0 ℓ(a ). Furthermore, assume that the eluder dimension of F at any scale ϵ > 0, (with input space A) is bounded. If for some c > 0, κ 0, any regret minimizing oracles Oracle OL and Oracle OR (assuming regret in both cases is o(T)) lim T 1 T PT t=1 Vκ( pt, Ft) c then, there exists a set P A0 with the following properties, 1. Set P satisfies constraints, i.e. a P , f (a) 0 2. Define F = {f : a P , f(a) = f (a)}. For every action a A \ P , f F such that f(a) > 0. That is, P cannot be expanded to a larger set guaranteed to satisfy constraint. 3. P is such that infa P ℓ(a) infa A:f (a) 0 ℓ(a) c Proof. First, since loss is fixed and using the property of the loss assumed, any online learning oracle that minimizes regret would have to return distributions over actions pt s such that lim T 1 T PT t=1 E at pt [ℓ( at)] = infa A ℓ(a). We have from the premise that lim T 1 T PT t=1 Vκ( pt, Ft) c and by the definition of Vκ( ) = inf M VM κ ( ), every mapping M satisfies Online Learning with Unknown Constraints T PT t=1 VM κ ( pt; Ft) c . Hence this means that for any mapping giving us distributions pt, we have that t=1 E at pt [ℓ(at)] inf a A:f (a) 0 ℓ(a) c since Ea pt [ Ft(at)] 0. Further, note that since the loss is fixed, if at some point we are able to find distribution pt such that Eat pt [ℓ(at)] infa A:f (a) ℓ(a) < c then by returning this distribution we would violate the premise that lim T 1 T PT t=1 Vκ( pt, Ft) c . Hence we have that for any mapping, and any t, Eat M( pt;Pt,Ft) [ℓ(at)] infa A:f (a) 0 ℓ(a) c . Since this holds for all mappings, let us consider the following mapping M( pt; Pt, Ft) = δ argmina Pt ℓ(a) if mina Pt ℓ(a) < infa A:f (a) 0 ℓ(a) + c δ argmaxa Pt Ft(a) otherwise where δ( ) is the point mass distribution. In the above we assume the argmin and argmax exists otherwise we can do a limiting argument. The above mapping is a valid one since loss is fixed and given. Now note that since we already showed that any valid mapping satisfies Eat M( pt;Pt,Ft) [ℓ(at)] infa A:f (a) 0 ℓ(a) c we can conclude mina Pt ℓ(a) infa A:f (a) 0 ℓ(a) + c . Now define the set Since Pt s are all guaranteed to be safe, we have that P is also safe, satisfying property 1. Second, since for every t, mina Pt ℓ(a) infa A:f (a) 0 ℓ(a) + c we have that infa P ℓ(a) infa A:f (a) 0 ℓ(a) c . Thus, P satisfies property 3 as well. Finally, to prove property 2, we use the assumption that eluder dimension for any scale ϵ is finite and that the regression oracle ensures that regret is sub-linear. Specifically, assume that online regression oracle guarantees an anytime regret guarantee of φδ(t) with probability 1 δ for any t rounds. In this case, using Lemma B.3 we have that with probability at least 1 δ, for all T 1 and all ϵ > 0, t=1 1{ Ft(at) > ϵ} 4φδ(T) ϵ2 + 1 E(F, ϵ) for at s produced by the above mapping. However, since we are picking at s that maximize Ft(at) on every round and because the Pt s are nested, the indicators are in descending order. Hence, with probability at least 1 δ, for any ϵ (0, 1], let Tϵ be the smallest integer such that Tϵ φδ(Tϵ) > 5E(F, ϵ) This is where the condition that regret bound φδ(Tϵ) is o(Tϵ) is needed so that the above yields a valid lower bound on Tϵ. We have that for any t, for every action a Pt, Ft+Tϵ is such that supf Ft+Tϵ |f(a) f (a)| ϵ. The reason we take Ft+Tϵ is because t is the first round in which actions in Pt not in earlier sets come into consideration and so we need Tϵ more rounds to ensure that for all actions in this set, estimation error is smaller than ϵ. Thus if we consider the set T t 1 Ft, this set corresponds to the set F = {f : a P , f(a) = f (a)}. Further, by definition of Pt s we have that if there were some action a such that f F f(a) 0, then this action would be contained in P . Thus we conclude that every action not in P is such that it evaluates to a positive number for some function f F . Thus we have shown property 2 as well. Once we have the existence of P A0 with the properties described in subsection 4.3 we show that learning is impossible. Proposition B.6. If there exists a set P that has the following properties, 1. Set P satisfies constraints, i.e. a P , f (a) 0 2. Define F = {f : a P , f(a) = f (a)}. For every action a A \ P , f F such that f(a) > 0. That is, P cannot be expanded to a larger set guaranteed to satisfy constraint. 3. P is such that infa P ℓ(a) infa A:f (a) 0 ℓ(a) c Online Learning with Unknown Constraints Then, safe learning is impossible, and any learning algorithm that is guaranteed to satisfy constraints on every round (with high probability) has a regret lower bounded by Regret T Tc . Proof. By property 1, we are guaranteed that P is safe so we can start any algorithm with initial safe set A0 = P . Since any safe algorithm must play actions that it can guarantee are safe with high probability, initially any algorithm initialized with P has to play from within this set till it can verify some action outside of this set is safe. However by property 3, any action within P is at least c suboptimal. Any feedback zt we obtain in the process of playing actions at P would certainly help us evaluate f (at) more accurately. However, property 2 implies that even if we were given the values of f for every action in the set P , we still would not be able to find another action outside of this set that we can conclude is safe unless we make further assumptions on f . This is because, each probe/feedback by playing action at yields value of f (at) + ξt. Since the noise ξt is a standard normal variable, at best we might be able to learn only f (a) for every a P . However, even if we had this information, the best we could conclude is that f is one of the functions in F . However, property 2 ensures that for every a A \ P , there is a function in f F that matches the value of f on on every action in P but has f(a) > 0. Since we have no information about which f F is the true f , no learning algorithm will be able to safely try any action outside of P and so any safe learning algorithm will suffer a sub-optimality of at least c on every round and thus Regret T Tc Theorem (Theorem 4.4 Restated). Suppose we are given some A0, X = {}, Y = {}, f F and losses ℓ: A 7 R are fixed such that for any a A satisfying f (a) > 0, ℓ(a) = mina A:f (a ) 0 ℓ(a ). Furthermore suppose ϵ, E(F, ϵ) is finite. Then, if there exists κ > 0 such that t=1 Vκ( pt, Ft) > 0, then, safe learning is impossible, i.e. no algorithm that is safe on every round (with high probability) can ensure that Regret T = o(T). Proof. Combining Lemma B.5 and Proposition B.6 trivially yields the statement of the theorem. B.3. Proofs of Per-Timestep Constraints vs Long Term Constraints Lemma (Lemma 4.5 restated). For any δ (0, 1), there exists an algorithm that with probability at least 1 2δ produces a sequence of actions {at}T t=1 that satisfies: Regret T Reg OL(T, δ, Π) and t=1 f (at, xt) inf α αT + 20Reg OR(T, δ, F0)E(F0, α) We provide a modified version of Algorithm 1, stated in Algorithm 2, where we do not maintain an pessimistic set, and directly play the output of the Oracle OL.We claim that Algorithm 2 satisfies the guarantee from Lemma 4.5. Algorithm 2 Online Learning with Long Term Constraints 1: Input: Oracle OL, Oracle OR, A0( ), δ (0, 1) 2: Initialize F0 = {f F : x X, a A0(x), f(a, x) 0} 3: for t = 1, , T do 4: Receive context xt 5: Ft = {f F0 : Pt 1 s=1(f(as, xs) ˆzs)2 Reg OR(T, δ, F0)} 6: Ot = O(Ft, xt) // Optimistic set; cf. eq (2) 7: pt = Oracle OLt(xt, Ot) 8: Draw at pt 9: Receivenoisy feedback zt 10: Update ˆzt = Oracle ORt(at, xt, zt) 11: Play at and receive yt 12: end for Online Learning with Unknown Constraints Proof. By Lemma B.1, we know that with probability at least 1 δ, for every T simultaneously, f Ft. Now for a given timestep t [T] consider the action played by the algorithm at. Because at was generated by Oracle OL, at Ot. For this action, let ft := argminf ˆ Ft f(at, xt). Then ft(at, xt) 0 ft(at, xt) ft(at, xt) + f (at, xt) Ft(at, xt) f (at, xt) Ft(at, xt) Summing up all terms, we get t=1 f (at, xt) t=1 Ft(at, xt) Now, using Lemma B.4 with each pt defined as point distributions putting all its mass on at, we can further bound the above as: t=1 f (at, xt) inf α αT + 20Reg OR(T, δ, F0)E(F0, α) Finally, since we are just playing the output of our oracle Oracle OL, the regret bound is simply Regret T := t=1 ℓ(at, xt, yt) min π Π: t f (π(xt),xt) 0 t=1 ℓ(π(xt), xt, yt) t=1 ℓ(at, xt, yt) min π Π: t π(xt) Ot t=1 ℓ(a, xt, yt) Reg OL(T, δ, Π) where the second inequality follows from the fact that by Proposition B.2, a policy π satisfying t, f (π(xt), xt) 0 satisfies t, π(xt) Ot. The theorem statement holds with probability at least 1 2δ as we apply a union bound over the online regression oracle guarantee and the online learning oracle guarantee C. Proofs from Section 5: Examples C.1. Proofs for Finite Action Spaces Procedure for selecting adaptive κt : Let us define κ : κ (FFAS, MFAS) (defined in Theorem 4.3 statement). κ need to be known - but we will show we can attain similar as or potentially much better results than κ nonetheless. For timestep t, given context xt and Ft, Pt, pt generated by the algorithm, let Vt(κ) be the value of the saddle-point optimization in eq 5, which is monotonically decreasing in κ. Observe that since infα n αT + 20Reg OR(T,δ,F0)E(F0,α) T, κ ranges from T] in order for the regret bounds to be o(T). Therefore, starting with κt = T, repeatedly halve κt while Vt(κt) < 0. Set κt to be the last κt satisfying Vt(κt) < 0, and play pt as the minimizer of Vt(κt). By definition, κ satisfies Vt(κ ) < 0, and therefore at worst κt < 2κ while simultaneously κt may be be much smaller as we are adapting to xt, Ft, Pt, pt. Consequently, our final regret bound would depend on maxt κt 2κ . Thus we can easily obtain the result competing with κ corresponding to the best mapping. As a further note, we can relax the condition Vt(κt) < 0 by Vt(κt) < T p for some appropriate p to get more general rates. Lemma (Lemma 5.2 restated). Suppose Assumption 5.1 holds. Then, κ (FFAS, MFAS) 2 0 . Proof. First suppose Pt = Ot. Then we have MFAS( p; F, x) = ea (F,x). Let F be the maximizer of κ (FFAS, MFAS). Then: κ (FFAS, MFAS) = sup x X, p (O(F ,x)),y Y Ea MFAS( p;F ,x) [ℓ(a, x, y)] E a p [ℓ( a, x, y)] Ea MFAS( p;F ,x) [ F (a, x)] Online Learning with Unknown Constraints = sup x X, p (O(F ,x)),ℓ [0,1]K ℓ, ea (F ,x) p F (a (F , x), x) sup x X, p (O(F ,x)),ℓ [0,1]K ℓ ea (F ,x) p 1 where the first inequality follows from the fact that given a context x, a (F , x) maximizes F ( , x), and hence by Assumption 5.1 F (a (F , x), x) 0. Now if Pt = Ot, MFAS( pt, Pt, Ft) = pt. Clearly, by definition, κ (FFAS, MFAS) = 0. C.2. Proofs for Linear Constraints Lemma (Lemma 5.6 restated). Suppose Assumption 5.4 holds. Then, κ (FLin, MLin) DℓDa We first introduce a lemma that lower bounds γ( a; F). Lemma C.1. Let F be an arbitrary subset of FLin and consider some a O(F). γ( a; F) := max {γ [0, 1] : γ a P(F)} is lower bounded as: γ( a; F) b b + F( a) Proof. Recall that F FLin := {(a, x) 7 f, a b|f Rd}. Define f := argminf F f( a), and let f be some arbitrary function in F. From the definition of a O(F), we have f( a) 0. Then: f( a) + b b f( a) + f( a) f( a) + b b + F( a) f( a) + b b + F( a) where the third inequality follows from the definition of F( ). Let α = b b+ F( a). Notice that f( a) + b is a linear function w, a for some w Rd. It follows that α(f( a) + b) = α w, a = w, α a = f(α a) + b. Using this fact, f( a) + b b + F( a) f( a) + b α(b + F( a)) f(α a) + b b Since f was an arbitrary function in F, this shows that α a P(F). Since we defined γ( a; F) = max {γ [0, 1] : γ a P(F)} γ( a; F) α = b b + F( a) We now prove Lemma 5.6. Online Learning with Unknown Constraints Proof. Let F be the maximizer of κ (FLin, MLin). Using the definition of κ , κ (FLin, MLin) = sup x X, p (O(F ,x)),y Y Ea MLin( p;F ,x) [ℓ(a, x, y)] E a p [ℓ( a, x, y)] Ea MLin( p;F ,x) [ F (a, x)] = sup a O(F ),y Y ℓ(MLin( a; F ), y) ℓ( a, y) F (MLin( a; F )) = sup a O(F ),y Y ℓ(γ( a; F ) a, y) ℓ( a, y) F (γ( a; F ) a) = sup a O(F ),y Y Dℓ γ( a; F ) a a F (γ( a; F ) a) = sup a O(F ),y Y DℓDa(1 γ( a; F )) γ( a; F ) F ( a) Now Lemma C.1 implies 1 γ( a; F ) γ( a; F ) 1 b b + F ( a) b = F ( a) b + F ( a) κ (FLin, MLin) DℓDa C.2.1. PROOF OF LEMMA 5.7 We present a constructive online learning oracle Oracle OL for the case of linear cost functions. Recall that Oracle OL must satisfy Assumption 3.2 restated below: Assumption (Online Learning Oracle). For any sequence of adversarially chosen sets {At}T t=1 and any δ (0, 1) with probability at least 1 δ, the algorithm Oracle OL produces a sequence of distributions {pt}T t=1 satisfying pt Ω(At) for all t [T] with expected regret bounded as: t=1 E at pt [ℓ(at, xt, yt)] min a T t=1At t=1 ℓ(a, xt, yt) Reg OL(T, δ, Π) Algorithm 3 is stated below, and it is a projected online gradient descent based algorithm. Let the losses encountered at timesteps t [T] be denoted by ℓ(at, , yt) = ℓt, at . Algorithm 3 Oracle OL for Linear Losses 1: Input: A, Da, Dℓ, δ (0, 1), η 2: for timesteps t = 1, , T do 3: Receive At = Ot 4: at ΠConv(At) ( at 1 ηℓt 1) 5: Decompose at = Pd+1 i=1 pt,iat,i, i, at,i At 6: Sample at pt 7: Receive t = ℓt 8: end for We briefly describe the the steps in Algorithm 3. Convex Hulls of At (line 4) Online Learning with Unknown Constraints Because the action sets At = Ot are sublevel sets of a minimum of affine functions, they are not necessarily convex, making them incompatible with projection based online learning algorithms. In order to address this, we take the convex hull of At, Conv(At), as our projection set. Projected Online Gradient Descent (line 4) Our algorithm then performs projected online gradient descent in sets Conv(At), generating a sequence of vectors { a1 a T } produced by at = ΠConv(At) ( at 1 ηℓt 1). We note that while we use projected online gradient descent, because the vectors { a1 a T } are maintained and updated independently, we could alternatively use a projected variant of any other online convex optimization algorithm that guarantees low regret instead. Sampling a Point in At (line 5) Due to Carath eodory s theorem, we know that can write any at Conv(At) as a linear combination of at most d + 1 vectors in At, at = Pd+1 i=1 pt,iat,i, i, at,i At. In line 4, we perform this decomposition, and in line 5, we sample the vector at according to this distribution pt. Notably, the point at satisfies E[ at] = at. Lemma (Lemma 5.7 restated). Suppose Assumption 5.4 holds. Then Algorithm 3 satisfies Assumption 3.2 with: Reg OL(T, δ, Π) 4DℓDa p Proof. At every timestep t [T], Algorithm 3 receives a set At = Ot, and produces a at by: at ΠConv(At) ( at 1 ηℓt 1) then, it decomposes each at as: i=1 pt,iat,i i, at,i At and then at is produced by sampling: at pt. We analyze the regret of Algorithm 3 by decomposing it into two terms: Reg OL(T, δ, Π) = t=1 ℓt, at min a TT t=1 At ℓt, a t=1 ℓt, at ℓt, at | {z } Term I t=1 ℓt, at min a TT t=1 at ℓt, a | {z } Term II Bounding Term I We show that Term I is a difference between a bounded random variable and its expectation, and use Hoeffding s inequality to bound it. Let ST := PT t=1 ℓt, at . Then: E[ST ] = E[ t=1 ℓt, at ] = t=1 ℓt, E[ at] = where the second equality follows by linearity of expectation. Note that each summand in ST satisfies | ℓt, at | ℓt at DℓDa. Hence, by Hoeffding s inequality, with probability at least 1 δ, t=1 ℓt, at ℓt, at |ST E[ST ]| q 2TD2 ℓD2a log(2/δ) (6) Bounding Term II Term II captures the performance of the online gradient descent portion, line 4, of Algorithm 3. A difference is that the projection set is time-varying - yet this does not pose a problem for us since we only need to guarantee performance w.r.t. a a in the intersection of all the sets. Let a := argmina T t=1At ℓt, a . with this, t=1 ℓt, at min a TT t=1 at ℓt, a = t=1 ℓt, at a Online Learning with Unknown Constraints Therefore, it is sufficient to bound the terms ℓt, at a . For any timestep t, we have: at+1 a 2 2 = ΠAt ( at ηℓt) a 2 2 at ηℓt a 2 2 = at a 2 2 + η2 ℓt 2 2 2η ℓt, at a where the inequality follows from the fact that a T t=1At At, so projection to At only decreases the distance. Rearranging, at a 2 2 at+1 a 2 2 + η Summing up the terms t [T], we get: t=1 ℓt, at a 1 at a 2 2 at+1 a 2 2 + η 2η ( a1 a 2 2 a T +1 a 2 2) + η 4D2 a 2η + η Setting η = 2Da Dℓ t=1 ℓt, at a 2Da Dℓ Combining the bounds from equations 6 and 7, we get t=1 ℓt, at min a TT t=1 at ℓt, a = Term I + Term II 4Da Dℓ p C.3. Proofs for Composing an Activation Function Lemma (Lemma 5.10 restated). Let G be a function class with bounded κ (G, MG) for some mapping MG. Suppose assumption 5.4 holds. Then κ (σ(G), MG) κ (G,MG) Proof. For any set F σ(G) let GF be the set for which F = σ(GF). Since σ(0) = 0 and σ is non decreasing, for any x X we have P(F, x) = P(GF, x) and O(F, x) = O(GF, x). Given a context x X, notice that MG is a mapping from distributions over optimistic sets to pessimistic sets. It follows that MG is a valid mapping satisfying MG( ; GF, x) = MG( ; F, x). Let F be the maximizer of κ (σ(G), MG). Using the definition of κ , κ (σ(G), MG) = sup x X, p (O(F ,x)),y Y Ea MG( p;F ,x) [ℓ(a, x, y)] E a p [ℓ( a, x, y)] Ea MG( p;F ,x) [ F (a, x)] = sup x X, a O(F ,x),y Y ℓ(MG( a; F , x), x, y) ℓ( a, x, y) F (MG( a; F , x)) = sup x X, a O(F ,x),y Y ℓ(MG( a; GF , x), x, y) ℓ( a, x, y) F (MG( a; F , x)) = sup x X, a O(F ,x),y Y GF (MG( a; GF , x)) F (MG( a; F , x)) ℓ(MG( a; GF , x), x, y) ℓ( a, x, y) GF (MG( a; GF , x)) Online Learning with Unknown Constraints κ (G, MG) GF (MG( a; GF , x)) F (MG( a; F , x)) Now, for any a A, x X and F let g := argmaxg GF g(a, x) and g := argming GF g(a, x) F(a) = g(a, x) g(a, x) σ(g(a, x)) σ(g(a, x)) g(a, x) g(a, x) c(g(a, x) g(a, x)) where the second inequality follows from σ(0) = 0 and for any x [ 1, 1], σ (x) is bounded below by c. Therefore κ (σ(G), MG) κ (G,MG) C.4. Proofs for Vector-Valued Constraints Recall that given a context x X, an action a A is considered safe if f(a, x) 0 for unknown vector-valued function f F A X [ 1, 1]m. Notably, Proposition 3.2 continues hold - guaranteeing the existence of an oracle Oracle OL. We recall that for a vector-valued function class F A X [ 1, 1]m, we denote F (a, x) := supf,f F f(a, x) f (a, x) , where is the ℓ norm. We define the following vector valued analogues of optimistic and pessimistic sets: O(G, x) = {a A | f G, f(a, x) 0} P(G, x) = {a A | f G, f(a, x) 0} (8) Algorithm 4 General Online Learning with Vector-Valued Constraints 1: Input: Oracle OL, Oracle OR, Initial safe set A0, δ (0, 1) 2: Fi 0 = {(a, x) 7 f(a, x)[i] : f F : a A0, x X, |f(a, x)[i]| 0} 3: F0 = F1 0 . . . Fm 0 4: for t = 1, . . . , T do 5: Receive context xt 6: Ft = {f F0 : Pt 1 s=1( f(as, xs) ˆzs )2 Reg OR(T, δ, F0)} 7: Ot = O(Ft, xt) , Pt = P(Ft, xt) // Optimistic/Pessimistic; cf. eq (8) 8: pt = Oracle OLt(xt, Ot) 9: pt = M( pt; Ft, xt) 10: Draw at pt 11: Receive noisy feedback zt 12: Update ˆzt = Oracle ORt(at, xt, zt) 13: Play at and receive yt 14: end for The following assumption is a variant of Assumption 3.1 that applies to multiple constraints. Assumption C.2 (Online Regression Oracle with Vector Valued Constraints). The algorithm Oracle OR guarantees that for any (possibly adversarially chosen) sequence {at, xt}T t=1, for any δ (0, 1), with probability at least 1 δ, generates predictions {ˆzt}T t=1 with each ˆzt [ 1, 1]m satisfying: t=1 ˆzt f(at, xt) 2 Reg OR(T, δ, F) The following formulation of link functions is standard in the literature (Sekhari et al., 2023b). Online Learning with Unknown Constraints Definition C.3. A function Φ : [ 1, 1]m R is λ-strongly convex if for all v, v [ 1, 1]m, it satisfies λ 2 v v 2 2 Φ(v ) Φ(v) + φ(v), v v where φ( ) = Φ( ). Definition C.4. For a link function φ( ) = Φ( ) for a λ-strongly convex function Φ, we define the associated loss: ℓφ(v, z) := Φ(v) v[z] Assumption C.5 (Online Regression Oracle with Vector Valued Constraints, Regret Version). The algorithm Oracle OR guarantees that for any (possibly adversarially chosen) sequence {at, xt}T t=1 generates predictions {ˆzt}T t=1 satisfying: t=1 ℓφ(ˆzt, zt) inf f F t=1 ℓφ( f(at, xt), zt) Regφ OR(T, F) where zt φ( f(at, xt)). The following lemma is adapted from (Sekhari et al., 2023b), Lemma 5 and related to (Agarwal, 2013), Lemma 2 Lemma C.6. Suppose that zt is generated with a link function φ that is the gradient of a λ-strongly convex function. Suppose that the regression oracle satisfies assumption C.5. Then for any δ (0, 1) and T 3, with probability at least 1 δ, the regression oracle satisfies assumption C.2 with: Reg OR(T, δ, F) 4 λRegφ OR(T, F) + 112 λ2 log 4δ 1 log(T) . Proof. The proof is an application of (Sekhari et al., 2023b) Lemma 5. Lemma C.7 (Variant of Lemma B.1 for Vector Valued Constraints). With probability at least 1 δ, for all t [T], f Ft. Proof. First, notice that as f is safe for all actions in A0, f F0. Assumption C.2 guarantees with probability at least 1 δ. s=1 f(as, xs) ˆzs 2 Reg OR(T, δ, F0) and hence for any t [T], s=1 f(as, xs) ˆzs 2 Reg OR(T, δ, F0) which shows f Ft. Proposition C.8 (Variant of Proposition B.2 for Vector Valued Constraints). Let {Ft}T t=1 be the sequence of version spaces generated by Algorithm 4. For any δ (0, 1) with probability at least (1 δ) we have for any t [T] P(Ft, xt) {a A : f(a, xt) 0} O(Ft, xt) Proof. By Lemma C.7, we have with probability at least 1 δ that f Ft simultaneously for all t [T]. Take some arbitrary t [T]. First suppose a is such that f(a, xt) 0. Since f Ft, a O(Ft, x). Hence, {a A : f(a, xt) 0} O(Ft, xt). Now suppose a P(F, xt). Then, f Ft, f(a, xt) 0. Since f Ft, f(a, xt) 0, and hence P(Ft, xt) {a A : f(a, xt) 0} Lemma C.9 (Variant of Lemma B.3 for Vector Valued Constraints). Let the sequence {Ft, at, ˆzt}T t=1 be generated by Algorithm 4. Then, for any sequence of adversarial contexts {xt}T t=1, and ϵ > 0, it holds that t=1 1 Ft(at, xt) > ϵ 4Reg OR(T, δ, F0) i [m] E(Fi 0, ϵ) Online Learning with Unknown Constraints Proof. For all i [m], let Fi t := {f i Fi 0 | Pt 1 s=1(f i(as, xs) ˆzi s)2 Reg OR(T, δ, F0)} We appeal directly to Lemma B.3. t=1 1 Ft(at, xt) > ϵ = t=1 1 max i m Fi t (at, xt) > ϵ t=1 1 n Fi t (at, xt) > ϵ o 4Reg OR(T, δ, F0) ϵ2 + 1 E(Fi 0, ϵ) 4Reg OR(T, δ, F0) i [m] E(Fi 0, ϵ) where the third line follows from Lemma B.3. The following lemma is similar to Lemma B.4 - we state it for completeness. Lemma C.10 (Variant of Lemma B.4 for Vector Valued Constraints). Let the sequence {Ft, pt}T t=1 be generated by Algorithm 4. Then, for any sequence of adversarial contexts {xt}T t=1, t=1 E at pt Ft(at, xt) inf α αT + 20Reg OR(T, δ, F0) Pm i=1 E(Fi 0, α) α Proof. For a run of Algorithm 1, let {at}T t=1 be any sequence of actions drawn at pt for all t [T]. Furthermore, to simplify the notation, let us denote t := Ft(at, xt). Let us consider some arbitrary α > 0. Then, for this sequence of actions and contexts, t=1 Ft(at, xt) := log(2/α) 1 X t:2iα< t 2i+1α t log(2/α) 1 X t:2iα< t 2i+1α 2i+1α log(2/α) 1 X 2i+1α 4Reg OR(T, δ, F0) j [m] E(Fj 0, 2iα) log(2/α) 1 X 2i+1α 5Reg OR(T, δ, F0) j [m] E(Fj 0, 2iα) log(2/α) 1 X 10Reg OR(T, δ, F0) j [m] E(Fj 0, 2iα) j [m] E(Fj 0, α) 10Reg OR(T, δ, F0) (v) αT + 20Reg OR(T, δ, F0) P j [m] E(Fj 0, α) Online Learning with Unknown Constraints In (i) we set the upper bound to the sum as log(2/α) 1 since all functions f F map to [ 1, 1], hence t 2 so it is enough to consider i : 2i+1α 2 and (ii) follows from Lemma B.3, (iii) follows from the fact that 1 Reg OR(T,δ,F) (2iα)2 for i [log(2/α) 1] if T > 1, (iv) follows from the fact that E(F, ) is nonincreasing in its second argument, and (v) is an upper bound from the sum of an infinite series. Therefore, for any sequence {Ft, at, xt}T t=1 generated by Algorithm 1 we have t=1 Ft(at, xt)αT + 20Reg OR(T, δ, F0) P j [m] E(Fj 0, α) Now, since this holds for any sequence {Ft, at}T t=1 generated by the algorithm and adversarial contexts {xt}T t=1, it holds in expectation over the algorithm s draws. The following is an analogue of Theorem 4.2 for multiple constraints. Corollary C.11. For any δ (0, 1) with probability at least 1 3δ, Algorithm 4 produces a sequence of actions {at}T t=1 that are safe, and enjoys the following bound on regret: Regret T inf κ>0 t=1 VM κ ( pt; Ft, xt) + κ inf α αT + 20Reg OR(T, δ, F0) P i [m] E(Fi 0, α) + Reg OL(T, δ, Π) + p 2T log(δ 1) VM κ ( pt; Ft, xt) = sup y Y E at M( pt;Ft,xt) [ℓ(at, xt, y)] E at pt [ℓ( at, xt, y)] κ E at M( pt;Ft,xt) Further, if we use κ = κ (F, M) := sup ˆ F F,x X, p (O( ˆ F,x)),y Y Ea M( p; ˆ F,x)[ℓ(a,x,y)] E a p[ℓ( a,x,y)] Ea M( p; ˆ F,x)[ ˆ F (a,x)] , then in the above, VM κ ( pt; Ft, xt) 0 and so we can conclude that: Regret T κ inf α αT + 20Reg OR(T, δ, F0) P i [m] E(Fi 0, α) + Reg OL(T, δ, Π) + p Proof. By Proposition C.8, with probability at least 1 δ, if we play actions from Pt, we can guarantee the all the constraints are satisfied. On the other hand, to bound the regret of our algorithm w.r.t. the optimal action in hindsight that also satisfies constraint on every round, note that Regret T := t=1 ℓ(at, xt, yt) min π Π: t f(π(xt),xt) 0 t=1 ℓ(π(xt), xt, yt) t=1 ℓ(at, xt, yt) min π Π: t π(xt) Ot t=1 ℓ(a, xt, yt) ℓ(at, xt, yt) E at pt [ℓ( at, xt, yt)] + t=1 E at pt [ℓ( at, xt, yt)] min π Π: t π(xt) Ot ℓ(a, xt, yt) ℓ(at, xt, yt) E at pt [ℓ( at, xt, yt)] + Reg OL(T, δ, Π) E at pt [ℓ(at, xt, yt)] E at pt [ℓ( at, xt, yt)] + Reg OL(T, δ, Π) + p Online Learning with Unknown Constraints t=1 VM κ ( pt; Ft, xt) + κ t=1 E at pt Ft(at, xt) ) + Reg OL(T, δ, Π) + p where (i) follows from the fact that by Proposition C.8, a policy π satisfying t, f(π(xt), xt) 0 satisfies t, π(xt) Ot, (ii) is an application of Hoeffding Azuma to bound PT t=1 ℓ(at, xt, yt) PT t=1 Eat pt [ℓ(at, xt, yt)] and: VM κ ( pt; Ft, xt) = sup y Y E at M( pt;Ft,xt) [ℓ(at, xt, y)] E at pt [ℓ( at, xt, y)] κ E at M( pt;Ft,xt) by Lemma C.10 we can bound the PT t=1 κ Eat pt [ Ft(at, xt)] term, hence, Regret T inf κ>0 t=1 VM κ ( pt; Ft, xt) + κ inf α αT + 20Reg OR(T, δ, F0) Pm i=1 E(Fi 0, α) α + Reg OL(T, δ, Π) + p This concludes the first bound - which holds with probability at least 1 3δ as we take a union bound over the online regression oracle guarantee, the online learning oracle guarantee, and the application of Hoeffding Azuma. To conclude the second part of the statement, we need to show that for κ = κ (F, M) := sup ˆ F F,x X, p (O( ˆ F,x)),y Y Ea M( p; ˆ F,x) [ℓ(a, x, y)] E a p [ℓ( a, x, y)] Ea M( p; ˆ F,x) h ˆ F (a, x) i we have that VM κ ( pt; Ft, xt) 0. To this end, note that VM κ ( pt; Ft, xt) E at M( pt;Ft,xt) [ℓ(at, xt, y)] E at pt [ℓ( at, xt, y)] κ E at M( pt;Ft,xt) E at M( pt;Ft,xt) [ℓ(at, xt, y)] E at pt [ℓ( at, xt, y)] sup ˆ F F,x X, p (O( ˆ F,x)),y Y Ea M( p; ˆ F,x) [ℓ(a, x, y)] E a p [ℓ( a, x, y)] Ea M( p; ˆ F,x) h ˆ F (a, x) i E at M( pt;Ft,xt) E at M( pt;Ft,xt) [ℓ(at, xt, y)] E at pt [ℓ( at, xt, y)] supy Y{Eat M( pt;Ft,xt)[ℓ(at,xt,y)] E at pt[ℓ( at,xt,y)]} Eat M( pt;Ft,xt)[ Ft(at,xt)] E at M( pt;Pt,Ft,xt) C.4.1. PROOFS FOR POLYTOPIC CONSTRAINTS WITH SCALAR FEEDBACK Recall that MLin( p; F) is defined as the distribution induced by drawing a p and outputting γ( a; F) a, where γ( a; F) := max {γ [0, 1] : γ a P(F)}. First we show a lemma that lower bounds γ( a; F). Lemma C.12. Let F be an arbitrary subset of FPoly and consider some a O(F). γ( a; F) := max {γ [0, 1] : γ a P(F)} is lower bounded as: γ( a; F) b b + F( a) Online Learning with Unknown Constraints Proof. From the definition of a O(F), we know f( a) b for some f F. Let f be some arbitrary function in F. Let α = b b+ F( a). Then: f(α a) = f(α a) + f(α a) f(α a) f(α a) + f(α a) f(α a) α(b + F( a)) This shows that α a P(F) as f was arbitrary. Since we defined γ( a; F) = max {γ [0, 1] : γ a P(F)}, it follows that γ( a; F) α = b b + F( a) Lemma. (Lemma 5.11 Restated) Suppose Assumption 5.4 holds. Then, κ (FPoly, MLin) DℓDa Proof. Given Lemma C.12, the proof is analogous to that of Lemma 5.6.