# Handling Heterogeneous Curvatures in Bandit LQR Control

Yu-Hu Yan 1 2, Jing Wang 1 2, Peng Zhao 1 2

1 National Key Laboratory for Novel Software Technology, Nanjing University, China. 2 School of Artificial Intelligence, Nanjing University, China. Correspondence to: Peng Zhao.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

We investigate online Linear Quadratic Regulator (LQR) with bandit feedback and semi-adversarial disturbances. Previous works assume costs with homogeneous curvatures (i.e., with a uniform strong-convexity lower bound), which can be hard to satisfy in many real scenarios and prohibits adapting to the true curvatures for better performance. In this paper, we initiate the study of bandit LQR control with heterogeneous cost curvatures, aiming to strengthen the algorithm's adaptivity. To achieve this, we reduce the problem to bandit convex optimization with memory via a with-history reduction to avoid hard-to-control truncation errors. Then we provide a novel analysis for an important stability term that appears in both the regret and the memory, using the Newton decrement developed in interior-point methods. The analysis enables us to control the memory-related terms introduced in the reduction and also provides a simplified analysis for handling heterogeneous curvatures in bandit convex optimization. Finally, we achieve interpolated guarantees that can not only recover existing bounds for convex and quadratic costs but also attain new implications for the cases of corrupted and decaying quadraticity.

1. Introduction

There have been extensive studies on the Linear Quadratic Regulator (LQR) and the more general Linear Quadratic Gaussian (LQG) (Bellman, 1954; Kalman, 1960; Abbasi-Yadkori and Szepesvári, 2011; Lewis et al., 2012; Cohen et al., 2018; Dean et al., 2018; Mania et al., 2019). Specifically, LQR considers controlling the following linear dynamical system:

$$x_{t+1} = A x_t + B u_t + \xi_t,$$

where $x_t, u_t$ denote the state and action, $A, B$ are the system transition matrices, and $\xi_t$ is the disturbance. The system evolves for $T$ rounds, and at the $t$-th round, the learner selects an action $u_t$ and observes the next state $x_{t+1}$. The goal of LQR is to minimize the following cumulative cost:

$$\sum_{t=1}^{T} c_t(x_t^\pi, u_t^\pi) = \sum_{t=1}^{T} (x_t^\pi)^\top Q_t\, x_t^\pi + \sum_{t=1}^{T} (u_t^\pi)^\top R_t\, u_t^\pi,$$

where $c_t(\cdot,\cdot)$ represents the quadratic cost, parameterized by the time-varying, positive semi-definite matrices $Q_t, R_t$. Here, $x_t^\pi, u_t^\pi$ denote the state and action generated by a specific policy $\pi$, chosen from a policy class $\Pi$. In the online version of LQR, the learner aims to minimize the game-theoretic policy regret (Dekel et al., 2012), which depicts the excess cumulative cost against the best policy in hindsight:

$$\sum_{t=1}^{T} c_t(x_t, u_t) - \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t(x_t^\pi, u_t^\pi), \qquad (1.1)$$

where $x_t, u_t$ are produced by the online algorithm.

In this work, we investigate online LQR in the case where the system transition matrices $A, B$ are known, the learner only obtains a scalar cost in each round (also known as bandit feedback in online learning), and the disturbances are semi-adversarial, i.e., a mixture of stochastic and adversarial parts. In this setup, Sun et al. (2023) obtained an optimal $\tilde{O}(\sqrt{T})$ regret bound, up to logarithmic factors. Moreover, when the cost functions are generally convex, this problem intersects with recent studies in online non-stochastic control (Hazan and Singh, 2022).
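To make the dynamics and the cumulative cost in (1.1) concrete, the following minimal simulation sketch rolls out the system under a fixed linear policy $u_t = -K x_t$. The matrices, gain, horizon, and noise are illustrative placeholders, not the disturbance-response policies analyzed in this paper.

```python
import numpy as np

def rollout_cost(A, B, K, Q_seq, R_seq, xi_seq, x0):
    """Roll out x_{t+1} = A x_t + B u_t + xi_t under the fixed policy u_t = -K x_t
    and accumulate the quadratic cost sum_t x_t' Q_t x_t + u_t' R_t u_t."""
    x, total = x0, 0.0
    for Q, R, xi in zip(Q_seq, R_seq, xi_seq):
        u = -K @ x
        total += float(x @ Q @ x + u @ R @ u)  # the scalar (bandit) cost revealed at round t
        x = A @ x + B @ u + xi                 # linear dynamics with disturbance
    return total

# Toy usage: a stable 2-dimensional system with time-invariant costs.
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B, K = np.eye(2), 0.1 * np.eye(2)
T = 100
Q_seq, R_seq = [np.eye(2)] * T, [np.eye(2)] * T
xi_seq = [0.01 * rng.standard_normal(2) for _ in range(T)]
print(rollout_cost(A, B, K, Q_seq, R_seq, xi_seq, x0=np.zeros(2)))
```

The online and bandit versions studied in this paper replace the fixed gain above with the disturbance-response policies of Section 3 and reveal only the scalar cost in each round.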
Notably, the pioneering work of Agarwal et al. (2019) attained an $\tilde{O}(\sqrt{T})$ regret bound through a novel reduction to online convex optimization with memory (Anava et al., 2015). The most related studies to ours, particularly in the bandit setup, are those of Gradu et al. (2020) and Cassel and Koren (2020). They obtained an $\tilde{O}(T^{3/4})$ regret bound for Lipschitz and convex cost functions with adversarial disturbances, with Cassel and Koren (2020) further achieving an improved $\tilde{O}(T^{2/3})$ result for smooth cost functions.

Table 1: A summary of our results for bandit LQR control with heterogeneous curvatures. Below, $\alpha_t$ is the curvature of the $t$-th quadratic cost (i.e., $Q_t, R_t \succeq \alpha_t I$). Our general results are given in Theorem 3, and we provide four special cases: pure convexity, pure quadraticity, corrupted quadraticity, and decaying quadraticity, to demonstrate the adaptivity and robustness of our results. We use $\mathbb{1}[\cdot]$ to denote the indicator function and $\mathcal{T} \subseteq [T]$ for the rounds with no quadraticity. Notably, for either Lipschitz or smooth cost functions, the results in the four special cases are achieved by one single algorithm. Previous results and our results are listed in separate rows.

|  | Convexity ($\alpha_t = 0$) | Quadraticity ($\alpha_t = \alpha > 0$) | Corrupted Quadraticity ($\alpha_t = \alpha \cdot \mathbb{1}[t \notin \mathcal{T}]$) | Decaying Quadraticity ($\alpha_t = t^{-\gamma}$) |
| --- | --- | --- | --- | --- |
| Lipschitz functions, previous | $\tilde{O}(T^{3/4})$ (Gradu et al., 2020) | N/A | N/A | N/A |
| Lipschitz functions, ours | $\tilde{O}(T^{3/4})$ [Corollary 1] | $\tilde{O}(T^{2/3})$ [Corollary 1] | $\tilde{O}(T^{2/3})$ when $|\mathcal{T}| = O(T^{8/9})$ [Corollary 3] | $\tilde{O}(T^{2/3+\gamma/3})$ for $\gamma \in [0, 1/4]$; $\tilde{O}(T^{3/4})$ for $\gamma \in (1/4, 1]$ [Corollary 5] |
| Smooth functions, previous | $\tilde{O}(T^{2/3})$ (Cassel and Koren, 2020) | $\tilde{O}(\sqrt{T})$ (Sun et al., 2023) | N/A | N/A |
| Smooth functions, ours | $\tilde{O}(T^{2/3})$ [Corollary 2] | $\tilde{O}(\sqrt{T})$ [Corollary 2] | $\tilde{O}(\sqrt{T})$ when $|\mathcal{T}| = O(T^{3/4})$ [Corollary 4] | $\tilde{O}(T^{1/2+\gamma/2})$ for $\gamma \in [0, 1/3]$; $\tilde{O}(T^{2/3})$ for $\gamma \in (1/3, 1]$ [Corollary 6] |

The latest work on bandit LQR control (Sun et al., 2023) achieved an $\tilde{O}(\alpha_{\min}^{-1}\sqrt{T})$ regret, where $\alpha_{\min}$ is the minimum cost curvature, i.e., $Q_t, R_t \succeq \alpha_t I$ and $\alpha_{\min} = \min_t \alpha_t$. While optimal in $T$, this result presents unfavorable curvature considerations for two reasons: (i) the learner must have access to the curvature lower bound $\alpha_{\min}$, which can be hard to obtain before an algorithm initializes. This requirement restricts the algorithm's adaptivity to the true curvatures (i.e., $\alpha_t \geq \alpha_{\min}$) for better performance; and (ii) when there are corruptions in the cost functions, e.g., when generally convex costs appear in the function sequence, algorithms designed for quadratic costs become infeasible since $\alpha_{\min} = 0$, and algorithms designed for convex costs significantly degrade the performance, e.g., the regret worsens from $\tilde{O}(\sqrt{T})$ to $\tilde{O}(T^{2/3})$ for smooth costs. Therefore, a natural question arises:

Is it possible to design an algorithm that is adaptive to heterogeneous curvatures for better performance and robust to (possibly) corrupted cost functions?

Motivated by the question above, we focus on heterogeneous curvatures in bandit LQR control. As far as we know, the study of heterogeneous curvatures in control has been largely unexplored, with the exception of the work of Muthirayan et al. (2022). Specifically, they study the full-information feedback setting, where the learner attains complete knowledge of the costs. By contrast, we focus on the more challenging bandit feedback, which is common in many real-world applications where obtaining adequate feedback is hard.

Results. In this work, we achieve interpolated results that are adaptive to the true curvatures of costs.
Our results can recover existing bounds and imply new ones in certain cases. Specifically, for Lipschitz costs, we obtain interpolated bounds between e O(T 3/4) for convex functions (Gradu et al., 2020; Cassel and Koren, 2020) and e O(T 2/3) for quadratic functions, a novel result in bandit LQR control. For smooth costs, we achieve interpolated results between e O(T 2/3) (Cassel and Koren, 2020) and e O( T) (Sun et al., 2023). Moreover, our results also imply meaningful guarantees in the intermediate cases where corrupted or decaying quadraticity exists. For instance, our results can maintain the desired T) bound for smooth costs even when O(T 3/4) in T functions are not quadratic. And for Lipschitz costs, the desired e O(T 2/3) bound is attainable even when O(T 8/9) of the cost functions are not quadratic, thereby greatly enhancing our method s robustness. Table 1 summarizes our results. Techniques. Many online decision-making tasks can be reduced to online learning with memory, an online model capturing the impact of past decisions in the present. Prior studies used a truncation-based reduction with easy-to-control truncation errors. However, heterogeneous curvatures with bandit feedback requires the usage of self-concordant barriers (Nesterov and Nemirovskii, 1994) as the regularizer in online update, which are inherently unbounded near the domain boundary and will thus make truncation errors hardto-control. To avoid it, we adopt a with-history reduction scheme proposed by recent studies (Sun et al., 2023), which admits a lossless reduction. Consequently, we address the reduced problem within a non-oblivious adversary setup. In both the regret and memory analysis, a key stability term, which captures the switching of the decisions based on certain local measures, is important. Luo et al. (2022) conducted initial studies on this term in Bandit Convex Optimization (BCO) with heterogeneous curvatures, using the proof argument by contradiction and some local stability analysis of self-concordant barriers. In this work, we further identify the importance of this term in the memory analysis and provide a simple analysis for it using Newton decrement developed in interior-point methods (Nesterov and Nemirovskii, 1994). This enables us to also provide a simplified regret analysis for BCO with heterogeneous curvatures. Our main technical finding is given in Theorem 1. Organization. The rest of the paper is structured as follows. Section 2 provides preliminaries of the problem setup Handling Heterogeneous Curvatures in Bandit LQR Control and previous works on handling heterogeneous curvatures. Section 3 presents our method and key analysis for bandit LQR control with heterogeneous curvatures. Finally, Section 4 concludes the work. Due to page limits, most proofs are deferred to appendices. 2. Preliminaries In this section, we introduce some preliminary knowledge, including our problem setup and assumptions in Section 2.1, and the latest progress on handling heterogeneous curvatures in online convex optimization in Section 2.2. 2.1. Problem Setup In this work, we investigate online Linear Quadratic Regulator (LQR) control with bandit feedback, partial observations (sometimes referred to as LQG control in literature), and semi-adversarial disturbances. Concretely, we consider controlling the following linear dynamical system: xt+1 = Axt + But + ξt, yt = Cxt + et, (2.1) where A Rdx dx, B Rdx du, C Rdy dx are known system transition matrices. 
Here, xt Rdx and ut Rdu represent the state and action respectively, while yt Rdy is a partial observation of the state. The learner can only observe a bandit feedback, i.e., a scalar value of ct(yt, ut), without access to the full function. Notably, same as Cassel and Koren (2020); Gradu et al. (2020); Sun et al. (2023), we consider the problem within an oblivious setup the cost functions and disturbances are chosen by the environments in advance before the algorithm starts. In the following, we list the assumptions used in this work. Assumption 1 (Stablility). The system is stable, i.e., the spectral radius ρ(A) < 1. Assumption 2 (Disturbance). The disturbances {ξt, et}T t=1 are semi-adversarial: ξt = ξadv t +ξsto t and et = eadv t +esto t , where E[ξsto t ] = E[esto t ] = 0, E[ξsto t ξsto t ] Varξ I, E[esto t esto t ] Vare I. Besides, ξt 2 , et 2 W. Assumption 3 (Cost). The cost ct( , ) is quadratic: ct(y, u) = y Qty + u Rtu, non-negative, αt-strongly convex and βc-smooth: Qt, Rt αt I, 2ct( , ) βc I, and Lipschitz: for all (y, u), (y , u ) Rdy+du |ct(y, u) ct(y , u )| Lc Rc (y y , u u ) 2, where Rc max{ (y, u) 2, 1}. The assumptions are common in the literature. Specifically, Assumption 1 can be extended to strongly stabilizable systems, which are unstable but can be stabilized by a linear controller, due to the reduction proposed in Appendix A of Cassel et al. (2022). And Assumption 2 is typically considered when dealing with strongly convex cost functions (Simchowitz et al., 2020; Sun et al., 2023). In Assumption 3, while we provide a detailed list of various functional properties, not all of them are invoked for each result. The related assumptions will be specified as needed. Notations. For clarity, we use bold symbols (e.g., x) within the control context and italic symbols (e.g., w or w) within the online learning context. For simplicity, we define xa:b Pb i=a xi and x[a:b] (xa, . . . , xb) for variable x. A function is described as α-quadratic when it is quadratic with Qt, Rt αI. We denote by w A w Aw a local norm for any w and positive semi-definite matrix A. 2.2. Handling Heterogeneous Curvature In online convex optimization (Hazan, 2016), the learner submits a decision wt inside a convex compact set W Rd, aiming to minimize the regret (Cesa-Bianchi and Lugosi, 2006) defined on convex loss functions ht : W 7 R, i.e., t=1 ht(wt) min w W t=1 ht(w). (2.2) The problem of handling heterogeneous curvatures can be traced back to the seminal work of Adaptive Online Gradient Descent (AOGD) (Bartlett et al., 2007), which studies σtstrongly convex loss function ht( ) and achieves adaptive results to heterogeneous curvatures of loss functions. The key idea of AOGD is implementing online gradient descent (Zinkevich, 2003) on regularized functions {eht( )}T t=1, where eht(w) ht(w) + λt 2 w 2 2 for any w W, with λt as the regularization coefficient. Intuitively, AOGD chooses a larger λt to add more curvatures to the functions for faster rates. As a price, this method requires a delicate balance between the regularization term, dominated by O(1/(σ1:t + λ1:t)), and a bias term of O(λt). As a consequence, AOGD interpolates between the O( T) regret for convex functions (Zinkevich, 2003) and the O(log T) guarantee for strongly convex functions (Hazan et al., 2007). Addressing heterogeneous curvatures with bandit feedback is more challenging and was recently solved by Luo et al. (2022). Similar to AOGD, their method also optimizes the regularized loss functions. 
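The regularized-loss construction shared by both approaches is $\tilde{h}_t(w) \triangleq h_t(w) + \frac{\lambda_t}{2}\|w\|_2^2$. Below is a minimal full-information sketch of the AOGD-style update on these surrogates; the loss sequence, coefficient schedule, and domain are illustrative placeholders rather than the tuned choices from either paper.

```python
import numpy as np

def aogd_style(grad_fns, sigmas, lambdas, dim, radius=1.0):
    """Online gradient descent on regularized surrogates
    h~_t(w) = h_t(w) + (lambda_t / 2) * ||w||^2,
    with the curvature-adaptive step size eta_t = 1 / (sigma_{1:t} + lambda_{1:t})."""
    w = np.zeros(dim)
    sig_sum, lam_sum, iterates = 0.0, 0.0, []
    for grad_fn, sigma, lam in zip(grad_fns, sigmas, lambdas):
        iterates.append(w.copy())
        sig_sum, lam_sum = sig_sum + sigma, lam_sum + lam
        eta = 1.0 / (sig_sum + lam_sum)       # more accumulated curvature -> smaller steps
        w = w - eta * (grad_fn(w) + lam * w)  # gradient of the regularized surrogate
        nrm = np.linalg.norm(w)               # project back onto the Euclidean ball
        if nrm > radius:
            w *= radius / nrm
    return iterates

# Toy usage: some rounds are strongly convex (sigma_t = 1), others only convex (sigma_t = 0).
rng = np.random.default_rng(1)
T, d = 50, 3
zs = [rng.standard_normal(d) for _ in range(T)]
sigmas = [0.0 if t % 5 == 0 else 1.0 for t in range(T)]
grad_fns = [(lambda w, z=z, s=s: s * (w - z) if s > 0 else z) for z, s in zip(zs, sigmas)]
lambdas = [0.1 / np.sqrt(t + 1) for t in range(T)]
iterates = aogd_style(grad_fns, sigmas, lambdas, d)
```

As described above, the coefficient $\lambda_t$ trades off the bias term of $O(\lambda_t)$ against the regularization term of $O(1/(\sigma_{1:t} + \lambda_{1:t}))$, which is exactly what produces the interpolation between the convex and strongly convex rates.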
Differently, to handle the bandit feedback, they use Follow-the-Regularized-Leader (FTRL) (Abernethy et al., 2008) with self-concordant barriers and a gradient estimator with shrinking sampling, following Hazan and Levy (2014). The algorithmic details will be illuminated again in Section 3.2.1. In their work, Handling Heterogeneous Curvatures in Bandit LQR Control the key challenge comes from a hard-to-analyze local-norm term of ψ(w) 2ψ(w), where ψ( ) is the FTRL regularizer. To handle this, they operate in a lifted domain W {w = (w, 1) | w W} and construct a corresponding normal barrier in W. This method enables bounding the above term through the advantageous properties of the normal barrier, inspired by Lee et al. (2020). Self-concordant barriers, which becomes infinity near the domain boundary, are commonly used in interior-point methods for convex optimization (Nesterov and Nemirovskii, 1994). For any closed convex set, there always exists a corresponding self-concordant barrier. And a normal barrier in the lifted domain can always be constructed from a self-concordant barrier of the original domain. Interested readers can refer to Appendix A for formal definitions and properties about self-concordant and normal barriers. To conclude, Luo et al. (2022) achieved interpolated results between e O(T 2/3) (Saha and Tewari, 2011) for convex functions and e O( T) (Hazan and Levy, 2014) for strongly convex ones when functions are smooth, as well as interpolation between e O(T 3/4) (Flaxman et al., 2005) and e O(T 2/3) for Lipschitz functions. Furthermore, they investigated cases where the cost functions are mixtures of convex and strongly convex functions or have a decaying curvature coefficient, and achieved meaningful guarantees therein. Our work falls under the topic of universal online learning, where the learner aims to design a single method capable of handling online functions with potentially different properties. Existing studies consider two kinds of universality: the curvatures are known but heterogeneous (Bartlett et al., 2007; Luo et al., 2022), or homogeneous but unknown (van Erven and Koolen, 2016; Zhang et al., 2022; Yan et al., 2023). While there are extensive studies on both threads with full information feedback, existing research with bandit feedback has only made progress on known but heterogeneous curvatures due to limited information. Our problem also falls in this category. Exploring homogeneous but unknown curvatures with bandit feedback remains an important future direction for investigation. 3. Our Method This section proposes our method for handling heterogeneous curvatures in bandit LQR control. Section 3.1 reduces the problem to BCO with memory (switching cost). Section 3.2 handles heterogeneous curvatures in the reduced online learning problem. Finally, Section 3.3 applies the results back to the control setup. 3.1. With-History Reduction to BCO with Memory In this part, we reduce bandit LQR control to BCO with memory via a with-history reduction scheme, where the memory can be further transformed into the stability analysis between consecutive decisions. The reduction builds on existing progress in handling partial observations and bandit feedback in the control problem (Cassel and Koren, 2020; Simchowitz et al., 2020; Sun et al., 2023). Initially, we introduce the notion of Nature s y following Simchowitz et al. (2020), which is intuitively an external observation of the cumulative impact of disturbances. Definition 1. 
Nature s y (denoted by ynat) is the observation without any action on the system. In system (2.1), in the t-th round, given disturbances ξ1:t, e1:t, Nature s y is defined as ynat t et + Pt 1 i=1 CAi 1ξt i. With Nature s y, we introduce the Disturbance-Response Policy (DRP) (Simchowitz et al., 2020), which is effective for handling partial observations. Definition 2. Given Nature s ynat t m+1:t, a disturbanceresponse policy, parameterized by an m-length tuple of matrices M = (M [0], . . . , M [m 1]), chooses the action as ut(M) = Pm 1 i=0 M [i]ynat t i. The DRP policy class is defined as M {M | Pm 1 i=0 M [i] op R}. Next we reduce the problem to BCO with memory. Importantly, due to the property of Nature s y, it holds that yt = ynat t + Pt 1 i=1 G[i]ut i(Mt i), where G[i] CAi 1B is the Markov operator of the system (2.1). Consequently, the cost ct(yt, ut) can be reinterpreted as a function Ft of the policies, i.e., Ft(M[1:t]). This reformulation leads to a problem with unbounded memory, which has a lower bound of Ω(T) (Cesa-Bianchi et al., 2013) and thus hard to handle. To address this, previous works (Agarwal et al., 2019; Simchowitz et al., 2020) proposed a truncation-based reduction, which artificially erases the effect of actions of more than H rounds before. The reduction imports a truncation error, which can be easily handled in previous setups. For illustration, we decompose observation yt as follows: yt = ynat t + i=1 G[i]ut i(Mt i) | {z } HIST-I i=H+1 G[i]ut i(Mt i) | {z } HIST-II where HIST-I includes the most recent H actions, i.e., M[t H:t 1], and HIST-II comprises actions that are more than H steps away, i.e., M[1:t H 1]. Given the system s stability, previous works focus only on HIST-I, truncate HISTII, define a truncated observation eyt ynat t + (HIST-I), and finally obtain a loss function with bounded history, i.e., ft(M[t H:t]) ct(eyt, ut), with an ignorable truncation error overhead. By doing this, they can reduce the problem to BCO with bounded memory. However, this truncated-based reduction will fail in our problem. This is mainly because the truncation error, i.e., Handling Heterogeneous Curvatures in Bandit LQR Control ft(M[t H:t]) = ct(yt, ut), can be greatly enlarged when incorporating self-concordant barriers (necessary for handling heterogeneous curvatures with bandit feedback), a concept in interior-point methods (Nesterov and Nemirovskii, 1994). Barrier functions are inherently unbounded near the domain boundary and will make the bias hard to control, thus making the truncation-based reduction fail for our purposes. In this work, to avoid the hard-to-control truncation errors, we leverage a with-history reduction scheme (Sun et al., 2023), which allows a lossless reduction to BCO with memory. Intuitively, instead of truncating HIST-II, the method integrates it into the cost function to define a with-history function. A formal definition is provided below. Definition 3. Given the Markov operator G of system (2.1) and a cost function ct( , ), the with-history loss function ft : MH+1 7 R is defined as ft(N0, . . . , NH) i=1 G[i]ut i(NH i) + (HIST-II), ut(NH) Accordingly, we define eft(N) ft(N, . . . , N). The with-history function enjoys a notable lossless property: ft(M[t H:t]) = ct(yt, ut), effectively avoiding the hard-tocontrol truncation errors mentioned earlier. However, the lossless property comes with a price: the with-history function ft( ) can depend on past decisions M[1:t H 1], making the reduced online learning problem a non-oblivious setup. 
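For readability, the two displays above (the with-history loss of Definition 3 and the regret decomposition) can be written out as follows. This is a reconstruction from the surrounding text, so the indexing conventions may differ slightly from the original presentation.

```latex
% With-history loss (Definition 3): HIST-II carries the already-fixed effect of
% decisions made more than H rounds ago, so no truncation error is incurred.
f_t(N_0, \ldots, N_H) \;\triangleq\;
  c_t\Big( y^{\mathrm{nat}}_t + \sum_{i=1}^{H} G^{[i]}\, u_{t-i}(N_{H-i}) + \text{HIST-II},\; u_t(N_H) \Big),
\qquad
\tilde f_t(N) \;\triangleq\; f_t(N, \ldots, N).

% Lossless property and the regret decomposition into a unary part and a memory part
% (in the main text the window length is the effective memory \bar H \triangleq H + m):
f_t(M_{[t-H:t]}) = c_t(y_t, u_t),
\qquad
\sum_{t} f_t(M_{[t-H:t]}) - \sum_{t} \tilde f_t(M^\star)
= \underbrace{\sum_{t} \big( \tilde f_t(M_t) - \tilde f_t(M^\star) \big)}_{\textsc{Unary-Reg}}
+ \underbrace{\sum_{t} \big( f_t(M_{[t-H:t]}) - \tilde f_t(M_t) \big)}_{\textsc{Memory}}.
```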
Handling a non-oblivious adversary is typically challenging in BCO (Flaxman et al., 2005). The key difficulty is that the comparator w arg minw W P t [T ] ht(w) becomes a random variable as ht( ) may depend on the algorithm s past decisions, requiring analysis specially designed for the nonoblivious setup. Fortunately, in bandit control, the compared policy M arg min M M PT t=1 ct(yt(M), ut(M)) is determined by the cost functions, hence deterministic since the costs are obliviously chosen. This characteristic avoids the undesired randomness and enables us to handle the reduced non-oblivious online learning problem. Consequently, we reduce the problem to BCO with memory, up to constant errors. The regret over the with-history losses is defined and further decomposed as t=H ft(M[t H:t]) t=H eft(M ) eft(Mt) eft(M ) | {z } UNARY-REG ft(M[t H:t]) eft(Mt) | {z } MEMORY where H H + m is the effective memory length. In the last line, the first term (the unary regret) is the standard regret defined on the unary function { eft( )}T t=1. And the second term (the memory term) is essential in control problems and other online decision-making tasks, e.g., in online Markov decision process (Even-Dar et al., 2009; Zhao et al., 2022). Next we analyze these two terms respectively. Optimizing the unary regret necessitates eft(Mt), which can be estimated using the value of eft(Mt) (Flaxman et al., 2005). However, the learner only obtains a scalar value of ft(M[t H:t]), i.e., ct(yt, ut), which leads to a mismatch between (Mt H, . . . , Mt) and (Mt, . . . , Mt). To this end, we adopt a lazy-update method (Cassel and Koren, 2020). Specifically, at the t-th round, the learner draws a random bit bt Bern(1/H), where Bern(1/H) is a Bernoulli distribution with parameter 1/H. Once bt QH 1 i=1 (1 bt i) = 1, indicating that Mt H = = Mt, the learner estimates the gradient using ft(M[t H:t]). While the updates occur over a smaller time horizon rather than the entire one, the unary regret will be only 3H times larger. The typical choice of H = Θ(log T) will not affect the final bound. It can be verified that the Lipschitzness and smoothness of cost functions can be inherited to the with-history functions, which is formally stated in Lemma 8 in Appendix B. When the cost function ct( , ) is Lc-Lipschitz, ft( ) will be coordinate-wise Lf-Lipschitz, and the memory term can be bounded in the following way: t S (ϑt + 2δt), (3.1) where Mt = E[Mt] denotes the policy without randomness, ϑt Mt+1 Mt F measures the switching costs, and δt Mt Mt F represents the variance. When eft( ) is Lf-Lipschitz and βf-smooth, the upper bound becomes t S (Lfϑt + βfϑ2 t + 6βfδ2 t ). (3.2) To conclude, we reduce the problem to BCO with Switching Cost (BCO-SC), by upper-bounding the memory with the switching of policies (Cassel and Koren, 2020, Theorem 9). 3.2. Heterogeneous Curvature with Switching Cost In this part, we address the BCO-SC (switching cost) problem with heterogeneous curvatures. For clarity, we frame the problem within the online convex optimization setup, as introduced in Section 2.2. Denoting by E[REGT ] the expectation of (2.2), we investigate the following regret of t=2 wt wt 2 + t=2 wt wt 1 2 (3.3) Handling Heterogeneous Curvatures in Bandit LQR Control Algorithm 1 Subroutine of Luo et al. 
(2022) Input: bandit value ht(wt), curvature σt, last-round decision wt, step size ηt+1, regularization coefficient λt 1: Estimate gradient gt d(ht(wt) + λt 2 wt 2 2)H 1/2 t st 2: Update as wt+1 = arg minw W Ft+1(w) 3: Compute Ht+1 2Ψ( wt+1) + ηt+1(σ1:t + λ0:t)I 4: Draw st+1 randomly from Sd+1 (H 1/2 t+1 ed+1) 5: Perturb wt+1 = (wt+1, 1) = wt+1 + H 1/2 t+1 st+1 for Lipschitz functions, and for smooth functions, we study t=2 wt wt 2 2 t=2 wt wt 1 2 + t=2 wt wt 1 2 2, (3.4) according to the reduction in (3.1) and (3.2). For clarity, we use italic symbols (e.g., w) in the original domain and bold italic symbols (e.g., w) in the lifted domain.1 Next we review the method of Luo et al. (2022) in Section 3.2.1 and provide our key and simple analysis in Section 3.2.2. 3.2.1. REVIEW OF LUO ET AL. (2022) This work studies heterogeneous curvatures in the BCO setup. The algorithm runs in a lifted domain W {w = (w, 1) | w W Rd}. We describe the subroutine of the t-th round in Algorithm 1. Specifically, after submitting decision wt and receiving bandit feedback ht(wt) as well as the curvature σt, the learner constructs an unbiased gradient estimator in Line 1, where (ht(wt) + λt 2 wt 2 2) is an observation of the regularized function, st is a uniformly unit vector for exploration, and Ht is a perturbation matrix. In Line 2, an FTRL algorithm updates the decision as wt+1 = arg minw W Ft+1(w), where 2 w ws 2 2 (APPROX) 2 w ws 2 2 + λ0 2 w 2 2 (REGLR-I) + 1 ηt+1 Ψ(w). (REGLR-II) Here, APPROX is the approximation of the original function ht( ) using the gradient estimator gt, REGLR-I is a regularization term following the same spirit of AOGD (Bartlett et al., 2007), and REGLR-II is the FTRL regularizer, where ηt+1 is a non-increasing step size and Ψ is a normal barrier 1wt and wt (in online learning context) correspond to Mt and Mt (in control context) respectively. in the lifted domain. In Line 3, the learner constructs the perturbation matrix Ht+1, which intuitively uses the curvature of the domain and the function for effective exploration. In Line 4, w represents space orthogonal to w. Finally, in Line 5, the learner perturbs the updated decision. Finally, we illuminate the parameter configurations of Algorithm 1 (step sizes {ηt}T t=1 and regularization coefficients {λt}T t=1). Specifically, for Lipschitz functions, we follow Algorithm 2 of Luo et al. (2022) and set them as follows: ηt = d 4/3(Lf + 1) 2/3 1 σ1:t 1 + λ0:t 1 + 1 λt = d 2/3(Lf + 1) 2/3 (σ1:t + λ0:t) 1/3 , (3.5) where Lf denotes the coordinate-wise Lipschitzness of ft( ). And for smooth functions, we follow Algorithm 1 of Luo et al. (2022) and set ηt and λt as follows: ηt = 1 2d q βf +1 σ1:t 1+λ0:t 1 + 1 βf +1 σ1:t+λ0:t , (3.6) where βf represents the smoothness of ft( ). 3.2.2. OUR ANALYSIS In this part, we consider BCO-SC with heterogeneous curvatures for our purpose. To this end, we provide a simple half-page proof for an essential stability term in both regret and memory analysis, using Newton decrement developed in interior-point methods (Nesterov and Nemirovskii, 1994). Notably, Luo et al. (2022) has conducted some initial analysis on this term in the regret part using the proof argument by contradiction and some local stability analysis of selfconcordant barriers. Our analysis can significantly simplify their two-page proof (please refer to Lemma 17 therein). To better illuminate our key technique, we first give some basic knowledge of Newton decrement. Definition 4. 
For a self-concordant function f defined on W,2 for any w int(W), the Newton decrement is defined as λ(w, f) f(w) 2f(w). Newton decrement vanishes exactly at the minimizer w of f in the interior of W, denoted by int(W), and can be considered as an observable measure of the proximity of any w to w . Specifically, it exhibits the following property. Proposition 1. For a self-concordant barrier f, if the Newton decrement λ(w, f) 1/2, then it follows that w w 2f(w) 2λ(w, f), where w = arg minw f(w). In the following, we highlight a key stability term of wt wt+1 Ht, which is essential in both the regret and 2The self-concordant function is a more general notion than self-concordant barriers self-concordant barrier are selfconcordant functions, but not vice versa. Handling Heterogeneous Curvatures in Bandit LQR Control memory analysis: (i) in the standard FTRL regret analysis, e.g., in Chapter 7 of Orabona (2019), an important term is gt, wt wt+1 . Given the local boundedness of the gradient estimator, i.e., gt H 1 t O(d), it is crucial to analyze wt wt+1 Ht; (ii) the switching cost terms wt wt+1 2 and wt wt+1 2 2 in (3.3) and (3.4) are closely related to wt wt+1 Ht, with the only difference being the kinds of norms, showing that the stability term is also essential in the memory analysis. Below, we present our main technical contribution, a novel stability analysis for wt wt+1 Ht, with a simple halfpage proof using Newton decrement. Theorem 1 (Key Technical Result). Using {ηt}T t=1 as step sizes, as long as λ( wt, Ft+1) 1/2 is satisfied, Algorithm 1 guarantees the stability term as wt wt+1 Ht 4dηt + 2η1 where ν denotes the normal barrier coefficient. Proof. To begin with, we observe that 2Ft+1(w) 1 ηt+1 2Ψt(w) 1 η1 2Ψt(w) (3.7) for any w W, where Ψt(w) Ψ(w) + ηtλ0 2 w 2 2 + ηt Pt 1 s=1 σs+λs 2 w ws 2 2 such that Ft+1 can be rewritten as Ft+1(w) = Pt s=1 g s w + 1 ηt+1 Ψt+1(w) for any w W. The first step is due to the definitions of Ft+1 and Ψt and the second step is because the step sizes are nonincreasing (i.e., η1 . . . ηt). With (3.7), it holds that wt wt+1 Ht = wt wt+1 2Ψt( wt) ηt wt wt+1 2Ft+1( wt) 2 ηtλ( wt, Ft+1), (3.8) where the first step is due to Ht 2Ψ( wt) + ηt(σ1:t 1 + λ0:t 1)I = 2Ψt( wt), the second step is by (3.7), and the last step is because of the aforementioned property of Newton decrement, as long as λ( wt, Ft+1) 1/2. This condition will be verified in Lemma 1. Next, Newton decrement λ( wt, Ft+1) can be bounded by λ( wt, Ft+1) Ft+1( wt) 2Ft+1( wt) gt 2Ft+1( wt) | {z } Ψ( wt) 2Ft+1( wt) | {z } which is due to Ft+1( wt) = Ft( wt) + gt + (1/ηt+1 1/ηt) Ψ( wt) and Ft( wt) = 0 since wt minimizes Ft. Then TERM (A) can be bounded as TERM (A) gt 2Ft+1( wt) (3.7) ηt gt 2Ψt( wt) = ηt gt H 1 t 2d ηt, where the second step is due to 2Ft+1(w) 1 ηt 2Ψt(w), the third step is because of the definition of Ht, and the last step holds due to the property of the gradient estimator. As for TERM (B), it holds that Ψ( wt) 2Ft+1( wt) η1 Ψ( wt) 2Ψ( wt) η1 ν, where the first step is due to 2Ft+1(w) 1 η1 2Ψ(w) for any w W, and the last step is due to the property of the normal barrier, whose coefficient is denoted by ν. Putting everything together finishes the proof. The intuition of the proof is relating the local norm Ht to 2Ft+1( ), where Ft+1 denotes the function that wt+1 minimizes. By doing so, we upper-bound it by the Newton decrement λ( wt, Ft+1) using Proposition 1, which exhibits some nice properties making the analysis much easier. 
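As a small numerical illustration of Proposition 1 (stated for the standard Newton decrement $\lambda(w, f) = \|\nabla f(w)\|_{[\nabla^2 f(w)]^{-1}}$), the sketch below checks the bound $\|w - w^\star\|_{\nabla^2 f(w)} \le 2\lambda(w, f)$ on a simple self-concordant function over the box $(0,1)^d$. This toy function is only for illustration; it is not the FTRL objective $F_{t+1}$ from the proof above.

```python
import numpy as np

# f(w) = c.w - sum(log w_i) - sum(log(1 - w_i)) is self-concordant on the open box (0, 1)^d.
def grad(w, c):
    return c - 1.0 / w + 1.0 / (1.0 - w)

def hess(w, c):
    return np.diag(1.0 / w**2 + 1.0 / (1.0 - w)**2)

def newton_decrement(w, c):
    g, H = grad(w, c), hess(w, c)
    return float(np.sqrt(g @ np.linalg.solve(H, g)))  # ||grad f(w)|| in the inverse-Hessian norm

c = np.array([0.5, -0.3, 0.2])

# Locate the minimizer w* with damped Newton steps (standard for self-concordant functions).
w_star = np.full(3, 0.5)
for _ in range(50):
    g, H = grad(w_star, c), hess(w_star, c)
    w_star -= np.linalg.solve(H, g) / (1.0 + newton_decrement(w_star, c))

w = np.array([0.45, 0.55, 0.48])                      # a nearby interior point
lam = newton_decrement(w, c)                          # here lam < 1/2, so Proposition 1 applies
dist = float(np.sqrt((w - w_star) @ hess(w, c) @ (w - w_star)))
print(f"lambda = {lam:.4f}, ||w - w*||_hess = {dist:.4f} <= 2*lambda = {2 * lam:.4f}")
```

In the proof above, the same proposition is applied with $f = F_{t+1}$ and $w = \bar{w}_t$, once the condition $\lambda(\bar{w}_t, F_{t+1}) \le 1/2$ has been verified (Lemma 1).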
By aggregating Theorem 1 over T rounds, we get P t wt wt+1 Ht O(η 1 T +1 + P t ηt), a standard result in FTRL regret analysis, e.g., in Chapter 7.3 of Orabona (2019). This can be easily handled, and thus leads to a simplified regret analysis for BCO with heterogeneous curvatures. Due to space constraints, we defer the regret guarantees and proofs to Lemma 9 and Lemma 10 in Appendix C.1. Notably, Theorem 1 hinges on the condition λ( wt, Ft+1) 1/2. In Lemma 1 below, we show that this condition can be satisfied by a proper initialization of the regularization coefficient λ0 and requiring the time horizon T to be larger than a certain constant. The proof is deferred to Appendix C.2. Lemma 1. Denoting by ν the normal barrier coefficient, λ( wt, Ft+1) 1/2 holds under the following conditions: (i) ht( ) is Lf-Lipschitz, the time horizon T T0 = 213d4(Lf + 1)2(1 + ν)6, and λ0 = T0; (ii) ht( ) is βf-smooth, the time horizon T T0 = 128d2(1 + ν)4, and λ0 = (βf + 1)T0. Consequently, we obtain wt wt+1 Ht η1 = O(1). Upon bounding wt wt+1 Ht, we analyze the switchingcost terms (i.e., wt wt+1 2 and wt wt+1 2 2) by relating Ht to the identity matrix I. Consequently, we find that the switching costs can be perfectly absorbed into the regret analysis, validating the stability of Algorithm 1. By doing this, we achieve adaptive guarantees to heterogeneous curvatures in BCO-SC. Theorem 2 concludes the results, and the corresponding proof is deferred to Appendix C.3. Theorem 2. Suppose the function ht( ) is σt-strongly convex, and denote by {λ t }T t=1 the optimal regularization coefficients. When ht( ) is Lf-Lipschitz continuous, Algorithm 1 can upper-bound the BCO-SC regret (3.3) by inf λ 1,...,λ T T 1/3 + λ 1:T 1 + t=1 (σ1:t + λ 0:t) 1/3 )! Handling Heterogeneous Curvatures in Bandit LQR Control When ht( ) is βf-smooth, Algorithm 1 can upper-bound the BCO-SC regret (3.4) by inf λ 1,...,λ T T + λ 1:T 1 + t=1 (σ1:t + λ 0:t) 1/2 )! Notably, the optimal regularization coefficients {λ t }T t=1 only exist in analysis. Therefore, we can plug in any feasible {λ t}T t=1 to achieve different guarantees in various cases. Specifically, by doing this, our results can recover existing regret guarantees, be robust to corrupted quadratic properties, and consider decaying quadraticity in bandit LQR control. The details of these implications will be shown in Section 3.3. At the end of this part, we provide several remarks on the results and techniques in this work. Remark 1. We have proved that Algorithm 1, designed for BCO with heterogeneous curvatures, can also handle switching cost terms, as shown in Theorem 2. Our contribution lies mainly in the technical aspect achieving a simple analysis for the stability term of wt wt+1 Ht, which enables us to control the switching costs and provide a simplified analysis for BCO with heterogeneous curvatures. Remark 2. We clarify that the techniques in this work (with-history reduction and stability analysis) are general beyond the specific problem studied here. Specifically, the with-history reduction (Sun et al., 2023) can be used in other bandit non-stochastic control problems to avoid the undesired truncation error emerging in previous reduction methods. Our stability analysis is also not restricted to this specific context, as it can be used in other problems requiring FTRL with time-varying step sizes and self-concordant barriers (or normal barriers by domain lifting). Remark 3. 
In the control-reduced BCO-SC problem, we need to handle a non-oblivious setup due to the with-history losses. We emphasize again that the obliviousness inherited from the control part makes our inner online learning algorithm still feasible, as explained in Section 3.1. 3.3. Back to Bandit LQR Control In this part, we apply the results to the bandit LQR control problem, and obtain interpolated theoretical guarantees that are adaptive to the true curvatures of cost functions. Our results can thus recover existing results and imply new bounds in certain cases, as summarized in Table 1. We adapt the method and analysis in Section 3.2 and obtain our method, which is summarized in Algorithm 2. In summary, we adopt DRP policy (Definition 2) and define with-history functions (Definition 3). We utilize the lazyupdate method of Cassel and Koren (2020) to realize effective gradient estimation on a smaller time horizon. When the algorithm is scheduled to update, we follow the same Algorithm 2 Heterogeneous Bandit LQR Control Input: Initial regularization coefficient λ0, lifted domain M, normal barrier Ψ( ) on M Initialize: M1 = arg min M M Ψ(M), dc = mdydu, S = (record the time steps when the algorithm updates) for t = H , . . . , T do Receive an observation yt Compute Nature s ynat t = yt Pt 1 i=1 G[i]ut i Submit ut(Mt) = Pm 1 i=0 M [i] t ynat t i via DRP policy Receive ct(yt, ut) and its curvature αt Compute curvature σt of the with-history loss Draw a random bit bt Bern(1/H) if bt QH 1 i=1 (1 bt i) = 1 then Update S = S {t} Compute λt and ηt+1 using (3.5) for Lipschitz functions or (3.6) for smooth functions σ1:t and λ0:t are computed only on S Send (ct(yt, ut), σt, Mt, ηt+1, λt) to Algorithm 1 and obtain Mt+1 else Maintain the current policy Mt+1 = Mt end end method in the online learning context, that is, using Algorithm 1 as a subroutine. The algorithms for Lipschitz and smooth functions vary only in the setups of regularization coefficients {λt}T t=1 and step sizes {ηt}T t=1. We defer these configurations to (3.5) and (3.6) due to page constraints. We provide our regret bounds for bandit LQR control in Theorem 3, which is analogous to the online learning context (Theorem 2). The proof can be found in Appendix D.1. Theorem 3. Denoting by {λ t }T t=1 the optimal regularization coefficients, for Lipschitz and αt-quadratic costs, under Assumptions 1-2, by setting the regularization coefficients {λt}T t=0 and step sizes {ηt}T t=1 as (3.5), Algorithm 2 can bound the expected policy regret (i.e., E[REGC T ]) as inf λ 1,...,λ T T 1/3 + λ 1:T 1 + t=1 (α1:t + λ 0:t) 1/3 )! If the cost functions are additionally smooth, by choosing the regularization coefficients {λt}T t=0 and step sizes {ηt}T t=1 as (3.6), Algorithm 2 bounds the expected policy regret as inf λ 1,...,λ T T + λ 1:T 1 + t=1 (α1:t + λ 0:t) 1/2 )! As noted in Section 3.2.2, the optimal regularization coefficients {λ t }T t=1 only exist in analysis. Therefore, we can plug in any feasible {λ t}T t=1 to achieve different guarantees Handling Heterogeneous Curvatures in Bandit LQR Control in various cases. Specifically, below we provide direct corollaries considering the cases of pure convexity / quadraticity, corrupted quadraticity, and decaying quadraticity. The corresponding proofs are deferred to Appendix D.2. 1 Pure Convexity/Quadraticity. Corollary 1 recovers the e O(T 3/4) regret for convex costs (Gradu et al., 2020; Cassel and Koren, 2020). For quadratic costs, we obtain a new e O(α 1/3T 2/3) bound. 
For smooth costs, Corollary 2 recovers the e O(T 2/3) regret for convex costs (Cassel and Koren, 2020). Our e O(α 1/2 T) for quadratic functions matches e O(α 1 T) of Sun et al. (2023) in the dependence of T, with a tighter dependence on the curvature α. Corollary 1. For Lipschitz costs, under Assumptions 1, 2, when {ct( , )}T t=1 are convex, Algorithm 2 with configuration (3.5) achieves an e O(T 3/4) regret. When {ct( , )}T t=1 are α-quadratic, Algorithm 2 achieves e O(α 1/3T 2/3). Corollary 2. For Lipschitz and smooth costs, under Assumptions 1, 2, when {ct( , )}T t=1 are convex, Algorithm 2 with (3.6) achieves an e O(T 2/3) regret. When {ct( , )}T t=1 are α-quadratic, Algorithm 2 achieves e O(α 1/2 We note that an e O( T) bound is also obtained in Theorem 8 of Cassel and Koren (2020) for strongly convex and smooth functions. However, a caveat is that they allow an improper learning, i.e., allowing policies in a larger domain outside M with the compared policy still in M. This eliminates the use of self-concordant barriers, hence avoiding the associated hard-to-control truncation errors. T) bound is near-optimal compared to the Ω( T) lower bound (Shamir, 2013), as also stated by Sun et al. (2023). Although the other results are sub-optimal in this sense, they have matched the state-of-the-art bounds attainable by efficient algorithms (Flaxman et al., 2005; Saha and Tewari, 2011). While some recent breakthroughs (Bubeck et al., 2017; Lattimore, 2020) achieve the optimal e O( T), they are computationally expensive and not efficiently implementable in practice. Achieving the optimal regret with efficient methods in BCO is extremely challenging and is still open in the community. 2 Corrupted Quadraticity. A na ıve solution is to discard the non-quadratic (i.e., convex) ones, where the optimal e O(T 2/3) can be maintained when M = O(T 2/3) for Lipschitz costs. Corollary 3 shows that in the worst case (the first bound), when M = O(T 8/9) functions are convex, the algorithm can still maintain the same regret bound. In the best case (the second bound), the desired bound is attainable even when M = Θ(T). Similarly, for smooth costs, a na ıve solution can maintain the e O( T) bound when M = O( T). Corollary 4 shows that the same bound is attainable when M = O(T 3/4) in the worst case (the first bound), and even when M = Θ(T) in the best case (the second bound). Corollary 3. For Lipschitz costs, under Assumptions 1, 2, when M functions in {ct( , )}T t=1 are convex and the rest are α-quadratic, Algorithm 2 with configuration (3.5) guarantees e O(M 3/4 + α 1/3(T M) 2/3). If the quadratic functions appear in the first (T M) rounds, the bound can be further improved to e O(α 1/3T(T M) 1/3). Corollary 4. For Lipschitz and smooth costs, under Assumptions 1, 2, when M functions in {ct( , )}T t=1 are convex and the rest are α-quadratic, Algorithm 2 with configuration (3.6) guarantees e O( T +M 2/3 +α 1/2(T M) 1/2). If the quadratic functions appear in the first (T M) rounds, it can be further improved to e O( T + α 1/2T(T M) 1/2). 3 Decaying Quadraticity. Corollaries 5-6 study cost functions with decaying quadraticity. The results show that when αt = t γ with γ > 0, the bounds will degenerate. And when γ exceeds a specific threshold, the results become the same as optimizing on purely convex costs, even if the costs still exhibit quadratic properties. Corollary 5. 
For Lipschitz costs, under Assumptions 1, 2, when ct( , ) is αt-quadratic with αt = t γ for some γ [0, 1], Algorithm 2 with configuration (3.5) achieves an e O(T 2/3+γ/3) regret if γ [0, 1/4], and e O(T 3/4) otherwise. Corollary 6. For Lipschitz and smooth costs, under Assumptions 1, 2, when ct( , ) is αt-quadratic with αt = t γ for some γ [0, 1], Algorithm 2 with (3.6) achieves an e O(T 1/2+γ/2) regret if γ [0, 1/3], and e O(T 2/3) otherwise. Remark 4. As opposed to the online setting, the Lipschitzness assumption is still required for smooth costs in bandit control. This is mainly caused by the reduction from bandit control to BCO-M (and BCO-SC). The results for BCO-SC do not require Lipschitzness for smooth loss functions. 4. Conclusion and Future Directions In this paper, we study bandit LQR control with heterogeneous curvatures. We obtain interpolated guarantees that are adaptive to the true curvatures of costs. This is done via a lossless with-history reduction scheme and a simple analysis of a local-norm stability term essential in both regret and switching cost analysis, using Newton decrement. As discusses at the end of Section 2, studies in universal online learning with bandit feedback remains largely unexplored. This work, along with that of Luo et al. (2022), has made a first step in addressing this problem by considering known but heterogeneous curvatures. An important future direction would be to investigate homogeneous but unknown curvatures, a more challenging setup in universal online learning, with bandit feedback. This problem is difficult even within bandit convex optimization, considering only convex or strongly convex online functions. Handling Heterogeneous Curvatures in Bandit LQR Control Acknowledgements This research was supported by NSFC (62361146852, 62206125, 61921006) and Jiangsu SF (BK20220776). The authors thank Jennifer Sun for the discussions on nonstochastic control, especially about the with-history reduction to BCO with memory. Peng Zhao thanks Tomer Koren for helpful discussions. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Yasin Abbasi-Yadkori and Csaba Szepesv ari. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference Computational Learning Theory (COLT), pages 1 26, 2011. Jacob D. Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference Computational Learning Theory (COLT), pages 263 274, 2008. Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, and Karan Singh. Online control with adversarial disturbances. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 111 119, 2019. Oren Anava, Elad Hazan, and Shie Mannor. Online learning for adversaries with memory: Price of past mistakes. In Advances in Neural Information Processing Systems 28 (NIPS), pages 784 792, 2015. Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. In Advances in Neural Information Processing Systems 20 (NIPS), pages 65 72, 2007. Richard Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6): 503 515, 1954. S ebastien Bubeck, Yin Tat Lee, and Ronen Eldan. Kernelbased methods for bandit convex optimization. 
In Proceedings of the 49th Annual ACM Symposium on Theory of Computing (STOC), pages 72 85, 2017. Asaf Cassel and Tomer Koren. Bandit linear control. In Advances in Neural Information Processing Systems 33 (Neur IPS), pages 8872 8882, 2020. Asaf Cassel, Alon Cohen, and Tomer Koren. Rate-optimal online convex optimization in adaptive linear control. In Advances in Neural Information Processing Systems 35 (Neur IPS), pages 7410 7422, 2022. Nicol o Cesa-Bianchi and G abor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. Nicol o Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. In Advances in Neural Information Processing Systems 26 (NIPS), 2013. Alon Cohen, Avinatan Hassidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 1028 1037, 2018. Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems 31 (Neur IPS), pages 4192 4201, 2018. Ofer Dekel, Ambuj Tewari, and Raman Arora. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1747 1754, 2012. Eyal Even-Dar, Sham. M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, pages 726 736, 2009. Abraham Flaxman, Adam Tauman Kalai, and H. Brendan Mc Mahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 385 394, 2005. Paula Gradu, John Hallman, and Elad Hazan. Nonstochastic control with bandit feedback. In Advances in Neural Information Processing Systems 33 (Neur IPS), pages 10764 10774, 2020. Elad Hazan. Introduction to Online Convex Optimization. Foundations and Trends in Optimization, 2(3-4):157 325, 2016. Elad Hazan and Kfir Y. Levy. Bandit convex optimization: Towards tight bounds. In Advances in Neural Information Processing Systems 27 (NIPS), pages 784 792, 2014. Elad Hazan and Karan Singh. Introduction to online nonstochastic control. Ar Xiv preprint, ar Xiv:2211.09619, 2022. Handling Heterogeneous Curvatures in Bandit LQR Control Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169 192, 2007. Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. 1960. Tor Lattimore. Improved regret for zeroth-order adversarial bandit convex optimisation. Mathematical Statistics and Learning, 2(3):311 334, 2020. Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang. Bias no more: High-probability data-dependent regret bounds for adversarial bandits and MDPs. In Advances in Neural Information Processing Systems 33 (Neur IPS), pages 15522 15533, 2020. Frank L Lewis, Draguna Vrabie, and Vassilis L Syrmos. Optimal Control. John Wiley & Sons, 2012. Haipeng Luo, Mengxiao Zhang, and Peng Zhao. Adaptive bandit convex optimization with heterogeneous curvature. In Proceedings of the 35th Conference on Learning Theory (COLT), pages 1576 1612, 2022. Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. 
In Advances in Neural Information Processing Systems 32 (Neur IPS), pages 10154 10164, 2019. Deepan Muthirayan, Jianjun Yuan, and Pramod P. Khargonekar. Adaptive gradient online control. In Proceedings of the 2022 American Control Conference (ACC), pages 2136 2141, 2022. Yurii E. Nesterov and Arkadii Nemirovskii. Interior-point Polynomial Algorithms in Convex Programming. SIAM, 1994. Francesco Orabona. A modern introduction to online learning. Ar Xiv preprint, ar Xiv:1912.13213, 2019. Ankan Saha and Ambuj Tewari. Improved regret guarantees for online smooth convex optimization with bandit feedback. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 636 642, 2011. Ohad Shamir. On the complexity of bandit and derivativefree stochastic convex optimization. In Proceedings of the 26th Annual Conference Computational Learning Theory (COLT), pages 3 24, 2013. Max Simchowitz. Making non-stochastic control (almost) as easy as stochastic. In Advances in Neural Information Processing Systems 33 (Neur IPS), pages 18318 18329, 2020. Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. In Proceedings of 33rd Conference on Learning Theory (COLT), pages 3320 3436, 2020. Y. Jennifer Sun, Stephen Newman, and Elad Hazan. Optimal rates for bandit nonstochastic control. In Advances in Neural Information Processing Systems 36 (Neur IPS), pages 21908 21919, 2023. Tim van Erven and Wouter M. Koolen. Metagrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29 (NIPS), pages 3666 3674, 2016. Yu-Hu Yan, Peng Zhao, and Zhi-Hua Zhou. Universal online learning with gradual variations: A multi-layer online ensemble approach. In Advances in Neural Information Processing Systems 36 (Neur IPS), pages 37682 37715, 2023. Lijun Zhang, Guanghui Wang, Jinfeng Yi, and Tianbao Yang. A simple yet universal strategy for online convex optimization. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 26605 26623, 2022. Peng Zhao, Long-Fei Li, and Zhi-Hua Zhou. Dynamic regret of online Markov decision processes. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 26865 26894, 2022. Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 928 936, 2003. Handling Heterogeneous Curvatures in Bandit LQR Control A. Self-concordant Barrier and Normal Barrier In this section, we provide some background knowledge about self-concordant barriers and normal barriers. Definition 5. A function ψ : int(W) R is a self-concordant function if: (i) ψ is three times continuously differentiable and convex, and becomes infinity along any sequence of points approaching the boundary of the domain; and (ii) For every x Rd and w int(W), it holds that | 3ψ(w)[x, x, x]| 2| 2ψ(w)[x, x]| 3/2. Definition 6. If | ψ(w)[x]| ν 1/2(| 2ψ(w)[x, x]|) 1/2, a self-concordant function ψ is a ν-self-concordant barrier. Lemma 2 (Theorem 2.5.1 of Nesterov and Nemirovskii (1994)). For each closed convex body W Rd, there always exists an O(d)-self-concordant barrier on W. Lemma 3 (Proposition 2.3.4 of Nesterov and Nemirovskii (1994)). If Ψ is a ν-normal barrier on W Rd, then for any w1, w2 int(W), it holds that: (i) w 2 2Ψ(w) = w 2Ψ(w)w = ν; (ii) 2Ψ(w)w = Ψ(w); (iii) Ψ(w1) Ψ(w2) ν ln Ψ(w2),w1 ν ; and (iv) Ψ(w) 2 2Ψ(w) = ν. 
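As a concrete sanity check of Lemma 3, recall that the log-barrier $\Psi(w) = -\sum_i \log w_i$ on the positive orthant (a cone) is a standard example of a $\nu$-normal, i.e., logarithmically homogeneous, barrier with $\nu = d$; property (ii) is the homogeneity identity $\nabla^2\Psi(w)\,w = -\nabla\Psi(w)$. The snippet below verifies properties (i), (ii), and (iv) numerically at a random interior point; it is purely illustrative and unrelated to the lifted barrier used by Algorithm 1.

```python
import numpy as np

# Psi(w) = -sum(log w_i): a nu-normal (logarithmically homogeneous) barrier
# on the positive orthant, with barrier parameter nu = d.
def grad(w):
    return -1.0 / w

def hess(w):
    return np.diag(1.0 / w**2)

d = 4
rng = np.random.default_rng(0)
w = rng.uniform(0.5, 2.0, size=d)        # an interior point of the cone
H, g = hess(w), grad(w)

print(w @ H @ w)                          # (i)  w' Hess(w) w = nu  (prints 4.0)
print(np.allclose(H @ w, -g))             # (ii) Hess(w) w = -grad(w)  (prints True)
print(g @ np.linalg.solve(H, g))          # (iv) squared norm of grad(w) in the inverse-Hessian norm = nu
```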
Lemma 4 (Proposition 5.1.4 of Nesterov and Nemirovskii (1994)). If ψ is a ν-self-concordant barrier on W Rd, then the function ψ(w, b) 400(ψ(w/b) 2ν ln b) is a ν-normal barrier on con(W) with ν = 800ν, where con(W) = {0} {(w, b) | w/b W, w Rd, b > 0} is the conic hull of W lifted to Rd+1. Lemma 5 (Proposition 2.3.2 of Nesterov and Nemirovskii (1994)). Let ψ : int(W) 7 R be a ν-self-concordant barrier over the closed convex set W, then for any w, w int(W), we have ψ(w ) ψ(w) ν log 1 1 πw(w ), where πw(w ) arg inft 0{w + t 1(w w) W} is the Minkowski function of W whose pole is on w, which is always in [0, 1]. B. Omitted Details of Section 3.1 (Reduction to Online Learning) In this section, we give some omitted details of Section 3.1, including the truncation lemma (Lemma 6), the relationship of regret bounds between the whole time horizon and the lazy-update horizon (Lemma 7), and the preservation of function properties between the control cost and the truncated function (Lemma 8). Lemma 6 (Appendix D.3 of Sun et al. (2023)). For Lipschitz costs, choosing memory length H = Θ(log T) and m = Θ(log T) in the definition of DRP (Definition 2), for any fixed DRP policy M M, the cumulative gap between the with-history loss function eft(M) (Definition 3) and the costs ct(yt(M), ut(M)) can be bounded by t=1 ct(yt(M), ut(M)) O(1). Interested readers can refer to Appendix D.3 of Sun et al. (2023) for the proof. Lemma 7 (Lemma 11 of Cassel and Koren (2020)). Suppose the lazy-update method chooses random bits b1:T from Bern(1/H) independently, then for any decision sequence MH :T and fixed comparator M, it holds that t=H eft(Mt) t S eft(Mt) X where H denote the effective memory length. The Lipschitzness, strong convexity, and smoothness of the cost function can be preserved to the with-history loss functions, as summarized in the following lemma. Lemma 8 (Function Properties Connection (Sun et al., 2023, Lemma D.6)). The functional relationship between the cost function ct and the with-history loss function ft (in Definition 3) or its unary version eft is as follows: (i) For Lc-Lipschitz costs, ft is Lf-Lipschitz with Lf 2Lc p (1 + RRG)2 + R2RGR2 nat H, where R appears in the definition of DRP policy class M (Definition 2). (ii) For αt-quadratic costs, Et H 1[ eft] is σt-strongly convex with σt αt Vare + Varξ σmin(C) 1+ A 2 2 , where the expectation Et H 1[ ] is taken on the randomness up to the (t H 1)-th round, due to Definition 3. (iii) For βc-smooth costs, eft is βf-smooth with βf 4βc R2 nat R2 GH. In above results, Rnat denotes the upper bound for ynat t 2 and RG is the upper bound for G ℓ1,op = P i=1 G[i] op. Handling Heterogeneous Curvatures in Bandit LQR Control C. Omitted Proofs of Section 3.2 (BCO with Switching Cost) In this section, we provide the proofs for Section 3.2, including simplified regret analyses for BCO with heterogeneous curvatures, Lemma 1 and Theorem 2. C.1. Simplified Regret Analysis for BCO with Heterogeneous Curvatures In this part, we provide the regret analysis for BCO with heterogeneous curvatures (Luo et al., 2022) in Lemma 9 and Lemma 10, with simplified proofs using our Theorem 1 for the crucial term of wt wt 1 Ht. Lemma 9 (Theorem 21 of Luo et al. (2022)). If the loss functions {ht}T t=1 is Lf-Lipschitz continuous and σt-strongly convex, Algorithm 2 of Luo et al. (2022) ensures with any regularization coefficients {λt}T t=1 (0, 1): t=1 ht(wt) min w W T 1/3 + λ1:T + t=1 (σ1:t 1 + λ0:t 1) 1/3 ! Lemma 10 (Lemma 19 of Luo et al. (2022)). 
If the loss functions {ht}T t=1 is βf-smooth and σt-strongly convex, Algorithm 1 of Luo et al. (2022) ensures with any regularization coefficients {λt}T t=1 (0, 1): t=1 ht(wt) min w W t=1 (σ1:t 1 + λ0:t 1) 1/2 ! Proof of Lemma 9. The proof mainly follows the Theorem 12 and Lemma 20 of Luo et al. (2022), and with our novel analysis (i.e., Theorem 1) for the crucial stability term wt wt 1 Ht. To begin with, the regret can be decomposed into the following parts: t=1 ht( wt) | {z } 1 EXPLORATION t=1 ht( wt) t=1 eht( wt) | {z } 2 REGULARIZATION-I t=1 eht( wt) t=1 bht( wt) | {z } 3 SMOOTH-I t=1 bht( wt) t=1 bht( e w) | {z } 5 REG-TERM t=1 bht( e w) t=1 eht( e w) | {z } 3 SMOOTH-II t=1 eht( e w) t=1 ht( e w) | {z } 2 REGULARIZATION-II t=1 ht( e w) | {z } 4 COMPARATOR-BIAS where w arg minw W PT t=1 ht(w), w = (w , 1). ht is the lifted version of ht( ). eht(w) ht(w) + λt 2 w(1:d) 2 2 is a regularized version of ht( ). bht(w) Eb Bd+1[eht(w + H 1/2 t b)] is the smoothed version of eht( ). e w (1 1/T) w + 1/T w1, where w1 is the minimizer of Ψ( ), i.e., w1 = arg minw W Ψ(w). 1 EXPLORATION term, using the Lf-Lipschitzness of ht( ), can be bounded by t=1 ht( wt) Lf t=1 wt wt 2 = Lf t=1 H 1/2 t st 2 t=1 (σ1:t 1 + λ0:t 1) 1/3 ! 2 REGULARIZATION-I and REGULARIZATION-II can be bounded using the definition of eht( ). For any w W, t=1 ht(w) O = O(λ1:T ). (C.2) Handling Heterogeneous Curvatures in Bandit LQR Control 3 SMOOTH-I and SMOOTH-II, using Lipschitzness of eht( ) and the definition of bht( ), can be bounded as t=1 eht(w) = Eb Bd+1 t=1 eht(w + H 1/2 t b) t=1 (Lf + λt) H 1/2 t b 2 ηt(σ1:t 1 + λ0:t 1) t=1 (σ1:t 1 + λ0:t 1) 1/3 ! for any w W, where the last step is due to the step size setup (3.5). 4 COMPARATOR-BIAS term, using the definition of e w, i.e., e w (1 1/T) w + 1/T w1, can be bounded as t=1 ht( e w) t=1 ht(w ) = t=1 ht( w1) 1 t=1 ht(w ) O(1), (C.3) which can be ignored. In the end, we analyze the most important 5 REG-TERM. Since bht( ) is (σt + λt)-strongly convex, t=1 bht( wt) t=1 bht( e w) bht( wt) ( wt e w) σt + λt 2 wt e w 2 2 g t ( wt e w) σt + λt 2 wt e w 2 2 t=1 ℓt( wt) t=1 ℓt( e w) where the second step is due to the unbiased gradient estimator, and the last step defines ℓt(w) g t w + σt+λt 2 w wt 2 2. Consequently, the FTRL update rule wt+1 = arg minw W Ft+1(w) can be rewritten as wt+1 = arg min w W Ft+1(w) = arg min w W s=1 ℓs(w) + Rt+1(w) = arg min w W s=1 ℓs(w) + R t+1(w) where Rt+1(w) λ0 2 w 2 2 + 1 ηt+1 Ψ(w), and R t+1(w) Rt+1(w) 1 ηt+1 Ψ( w1) = λ0 2 w 2 2 + 1 ηt+1 (Ψ(w) Ψ( w1)). Using the regret guarantee of FTRL (Lemma 12), it holds that t=1 ℓt( wt) t=1 ℓt( e w) R T +1( e w) R 1( w1) + t=1 ℓt( wt) ( wt wt+1) + t=1 (R t( wt+1) R t+1( wt+1)) R T +1( e w) R 1( w1) + t=1 g t ( wt wt+1) Ψ( e w) Ψ( w1) t=1 g t ( wt wt+1) + O(1) t=1 g t ( wt wt+1) ν log T t=1 wt wt+1 Ht, (C.4) where the second step is due to ℓt( wt) = gt and R t(w) R t+1(w) for any w W since the step size sequence {ηt}T t=1 is non-increasing. The third step is because of R 1( w1) = λ0 2 w1 2 2 0 and λ0 2 e w 2 2 O(1). The fourth step is due to Lemma 5 with the fact that e w (1 1/T) w + 1/T w1, and omits the constant term O(1). The fifth step is due to the upper bound of the gradient estimator: gt 2 H 1 t = d2 ht(wt) + λt 2 H 1/2 t st 2 H 1 t d2 1 + λt 2 4d2, (C.5) which implies gt H 1 t 2d. The second step requires ht( ) 1, which can be assumed without loss of generality. 
Consequently, the term $\sum_{t \in [T]} \|\bar{w}_t - \bar{w}_{t+1}\|_{H_t}$, using Theorem 1, can be bounded as
$$\sum_{t=1}^{T} \|\bar{w}_t - \bar{w}_{t+1}\|_{H_t} \le 4d\sum_{t=1}^{T} \eta_t + 2\eta_1.$$
Plugging the above upper bound back into (C.4), we can upper-bound the ⑤ reg-term as
$$\text{⑤ reg-term} \le \tilde{\mathcal{O}}\bigg(\frac{\log T}{\eta_{T+1}} + \sum_{t=1}^{T}(\sigma_{1:t-1} + \lambda_{0:t-1})^{-1/3}\bigg).$$
Combining all parts finishes the proof.

Proof of Lemma 10. We decompose the regret in the same way as (C.1) in Lemma 9. To begin with, ① exploration, using the $\beta_f$-smoothness of $h_t(\cdot)$, can be bounded by
$$\mathbb{E}\bigg[\sum_{t=1}^{T}\big(h_t(w_t) - \bar{h}_t(\bar{w}_t)\big)\bigg] \le \beta_f \sum_{t=1}^{T} \|w_t - \bar{w}_t\|_2^2 = \beta_f \sum_{t=1}^{T} \big\|H_t^{-1/2} s_t\big\|_2^2 \le \tilde{\mathcal{O}}\bigg(\sum_{t=1}^{T}(\sigma_{1:t-1} + \lambda_{0:t-1})^{-1/2}\bigg).$$
② regularization-I and regularization-II can be bounded by $\mathcal{O}(\lambda_{1:T})$ as in (C.2). For ③ smooth-I and smooth-II, using the smoothness of $\tilde{h}_t(\cdot)$ and the definition of $\hat{h}_t(\cdot)$, for any $w \in \bar{\mathcal{W}}$ it holds that
$$\bigg|\sum_{t=1}^{T} \hat{h}_t(w) - \sum_{t=1}^{T} \tilde{h}_t(w)\bigg| = \bigg|\mathbb{E}_{b \sim \mathbb{B}^{d+1}}\bigg[\sum_{t=1}^{T}\big(\tilde{h}_t(w + H_t^{-1/2} b) - \tilde{h}_t(w)\big)\bigg]\bigg| \le \sum_{t=1}^{T}(\beta_f + \lambda_t)\big\|H_t^{-1/2} b\big\|_2^2 \le \sum_{t=1}^{T} \frac{\beta_f + \lambda_t}{\eta_t(\sigma_{1:t-1} + \lambda_{0:t-1})} \le \tilde{\mathcal{O}}\bigg(\sum_{t=1}^{T}(\sigma_{1:t-1} + \lambda_{0:t-1})^{-1/2}\bigg),$$
where the last step is due to the step size setup (3.6). ④ comparator-bias can be bounded by $\mathcal{O}(1)$ as in (C.3). It remains to handle the most important ⑤ reg-term. Specifically, following the same proof as in Lemma 9, we obtain
$$\text{⑤ reg-term} \le \tilde{\mathcal{O}}\bigg(\frac{\log T}{\eta_{T+1}} + \sum_{t=1}^{T}(\sigma_{1:t-1} + \lambda_{0:t-1})^{-1/2}\bigg).$$
Combining all parts finishes the proof.

C.2. Proof of Lemma 1

Proof. From the proof of Theorem 1, we can first upper-bound the concerned Newton decrement $\lambda(\bar{w}_t, F_{t+1})$ in terms of $2d\sqrt{\eta_t}$ and the step-size drift $\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t}$. In the following, we discuss the Lipschitzness and smoothness cases respectively.

Lipschitzness Case. Due to the step size configuration (3.5), it holds that
$$\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \le d^{4/3}(L_f+1)^{2/3}\Big[(\sigma_{1:t} + \lambda_{0:t} + 1)^{1/3} - (\sigma_{1:t-1} + \lambda_{0:t-1} + 1)^{1/3}\Big] \le d^{4/3}(L_f+1)^{2/3}(\sigma_t + \lambda_t)^{1/3} \le d^{4/3}(L_f+1)^{2/3}(4L_f + 1)^{1/3} \le 2^{2/3} d^{4/3}(L_f+1),$$
where the second-to-last step is due to $\sigma_t \le 4L_f$ by Lemma 11. Thus it suffices to choose $\eta_1$ such that $\lambda(\bar{w}_t, F_{t+1}) \le 2d^{4/3}\sqrt{\eta_1}(1 + \sqrt{\nu}) \le 1/2$, i.e., $\eta_1 \le \frac{1}{16 d^{8/3}(1+\sqrt{\nu})^2}$. Consequently, requiring $T \ge \lambda_0$, it suffices to require
$$\eta_1 = d^{-4/3}(L_f+1)^{-2/3}(1 + \lambda_0)^{-1/3} \le d^{-4/3}(L_f+1)^{-2/3}\lambda_0^{-1/3} \le \frac{1}{16 d^{8/3}(1+\sqrt{\nu})^2},$$
which can be satisfied by setting $\lambda_0 = 2^{13} d^4 (L_f+1)^2 (1+\sqrt{\nu})^6$ and assuming $T \ge \lambda_0$.

Smoothness Case. Due to the step size configuration (3.6), it holds that
$$\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \le \frac{2d}{\sqrt{\beta_f + 1}}\Big(\sqrt{\sigma_{1:t} + \lambda_{0:t}} - \sqrt{\sigma_{1:t-1} + \lambda_{0:t-1}}\Big) \le 2d\sqrt{\frac{\sigma_t + \lambda_t}{\beta_f + 1}} \le 2d,$$
where the last step is due to $\lambda_t \in (0, 1)$ and $\sigma_t \le \beta_f$. Thus it suffices to choose $\eta_1$ such that $\lambda(\bar{w}_t, F_{t+1}) \le 2d\sqrt{\eta_1}(1 + \sqrt{\nu}) \le 1/2$, i.e., $\eta_1 \le \frac{1}{16 d^2(1+\sqrt{\nu})^2}$. Consequently, requiring $1/T \le (\beta_f + 1)/\lambda_0$, it suffices to require $\frac{\beta_f + 1}{\lambda_0} \le \frac{1}{16 d^2(1+\sqrt{\nu})^2}$, which can be satisfied by setting $\lambda_0 = 128 d^2(\beta_f + 1)(1+\sqrt{\nu})^4$ and assuming $T \ge 128 d^2(1+\sqrt{\nu})^4$.

Notably, given $\lambda(\bar{w}_t, F_{t+1}) \le 1/2$, we directly obtain $\|\bar{w}_t - \bar{w}_{t+1}\|_{H_t} \le \eta_1 = \mathcal{O}(1)$ from (3.8), which is useful for the switching cost analysis. The proof is finished.
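Lemma 1 works by keeping the Newton decrement of the FTRL objective at most $1/2$, in which regime one damped Newton step essentially reaches the new minimizer and the stability term stays $\mathcal{O}(1)$. The sketch below is only an illustration of that mechanism on a toy objective of the same shape (quadratic surrogate losses plus $\frac{\lambda_0}{2}\|w\|_2^2$ plus a log-barrier for the unit ball); it is not the paper's algorithm, and the barrier, step size, and domain are stand-ins we chose for simplicity.

```python
import numpy as np

def ftrl_barrier_update(grads, anchors, strengths, lam0, eta, w0, tol=1e-8, max_iter=50):
    """Minimize sum_s [g_s.w + c_s/2 ||w - w_s||^2] + lam0/2 ||w||^2 + (1/eta) Psi(w)
    over the unit ball, where Psi(w) = -log(1 - ||w||^2), via damped Newton steps.
    The Newton decrement doubles as the stopping criterion."""
    d = w0.shape[0]
    w = w0.copy()
    for _ in range(max_iter):
        r = w @ w
        grad = lam0 * w + (2.0 / eta) * w / (1.0 - r)
        hess = (lam0 + (2.0 / eta) / (1.0 - r)) * np.eye(d) \
               + (4.0 / eta) * np.outer(w, w) / (1.0 - r) ** 2
        for g_s, w_s, c_s in zip(grads, anchors, strengths):
            grad += g_s + c_s * (w - w_s)
            hess += c_s * np.eye(d)
        step = np.linalg.solve(hess, grad)
        decrement = np.sqrt(grad @ step)           # Newton decrement of the objective at w
        if decrement < tol:
            break
        w = w - step / (1.0 + decrement)           # damped step keeps w strictly inside the ball
    return w

# toy usage: two past rounds with heterogeneous curvatures (c_s = sigma_s + lambda_s)
rng = np.random.default_rng(0)
d = 3
grads = [rng.normal(size=d), rng.normal(size=d)]
anchors = [np.zeros(d), 0.1 * np.ones(d)]
strengths = [0.0, 1.0]
w_next = ftrl_barrier_update(grads, anchors, strengths, lam0=1.0, eta=0.5, w0=np.zeros(d))
print(w_next, np.linalg.norm(w_next) < 1.0)
```

Because the damped step length is at most $\lambda/(1+\lambda) < 1$ in the local norm, a decrement below $1/2$ immediately gives the $\mathcal{O}(1)$ control of $\|\bar{w}_t - \bar{w}_{t+1}\|_{H_t}$ used above.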
C.3. Proof of Theorem 2

Proof. In this theorem, we analyze the regret of BCO with switching cost. To begin with, we define a sequence of functions in the lifted domain $\bar{h}_t : \bar{\mathcal{W}} \mapsto \mathbb{R}$ such that $\bar{h}_t(w) \triangleq h_t(w^{(1:d)})$ for any $w \in \bar{\mathcal{W}}$, where $w^{(1:d)}$ denotes the first $d$ entries of $w$. It is easy to verify that $\bar{h}_t$ still inherits the $\sigma_t$-strong convexity, $L_f$-Lipschitzness, and $\beta_f$-smoothness of the original function $h_t$. We investigate the following regrets with switching costs in the Lipschitzness and smoothness cases respectively, where some problem-dependent constants are omitted. Specifically, for Lipschitz functions,
$$\mathrm{Reg}^{\textsc{sc-lip}}_T \triangleq \mathbb{E}\bigg[\sum_{t=1}^{T} h_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} h_t(w) + \sum_{t=2}^{T}\|w_t - \bar{w}_t\|_2 + \sum_{t=2}^{T}\|\bar{w}_t - \bar{w}_{t-1}\|_2\bigg]. \qquad \text{(C.6)}$$
And for smooth functions, we study
$$\mathrm{Reg}^{\textsc{sc-sm}}_T \triangleq \mathbb{E}\bigg[\sum_{t=1}^{T} h_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} h_t(w) + \sum_{t=2}^{T}\|w_t - \bar{w}_t\|_2^2 + \sum_{t=2}^{T}\|\bar{w}_t - \bar{w}_{t-1}\|_2 + \sum_{t=2}^{T}\|\bar{w}_t - \bar{w}_{t-1}\|_2^2\bigg]. \qquad \text{(C.7)}$$
In the following, we discuss the Lipschitzness and smoothness cases respectively.

Lipschitzness Case. To begin with, the unary regret part can be upper-bounded using the guarantee of Luo et al. (2022) for Lipschitz functions. We restate it in Lemma 9 with the corresponding proof in Appendix C.1 for self-containedness. In the following, we focus on the switching cost terms. Since $w_t = \bar{w}_t + H_t^{-1/2} b$, $\|w_t - \bar{w}_t\|_2$ can be bounded as
$$\|w_t - \bar{w}_t\|_2 = \big\|H_t^{-1/2} b\big\|_2 \le \mathcal{O}\big((\sigma_{1:t-1} + \lambda_{0:t-1})^{-1/3}\big),$$
where the last step is because $H_t = \nabla^2\Psi(\bar{w}_t) + \eta_t(\sigma_{1:t-1} + \lambda_{0:t-1}) I$ and the step size setup (3.5). Since $H_t \succeq \eta_t(\sigma_{1:t-1} + \lambda_{0:t-1}) I$, we obtain
$$\|\bar{w}_t - \bar{w}_{t-1}\|_2 \le \frac{\|\bar{w}_t - \bar{w}_{t-1}\|_{H_t}}{\eta_t(\sigma_{1:t-1} + \lambda_{0:t-1})} \le \frac{\eta_1}{\eta_t(\sigma_{1:t-1} + \lambda_{0:t-1})} \le \mathcal{O}\big((\sigma_{1:t-1} + \lambda_{0:t-1})^{-2/3}\big),$$
where the second step is due to Lemma 1 and the last step is due to the step size setup (3.5) and $\eta_1 = \mathcal{O}(1)$. In the following, we show that the above term can be absorbed by $\mathcal{O}((\sigma_{1:t-1} + \lambda_{0:t-1})^{-1/3})$. Specifically, due to (3.5) and $\lambda_t \in (0, 1)$, it holds that $d^{2/3}(L_f+1)^{2/3}(\sigma_{1:t} + \lambda_{0:t})^{-1/3} < 1$, which implies $(\sigma_{1:t} + \lambda_{0:t})^{1/3} > d^{2/3}(L_f+1)^{2/3} > 1$. As a result, $\sigma_{1:t} + \lambda_{0:t} > 1$, and thus $\mathcal{O}((\sigma_{1:t-1} + \lambda_{0:t-1})^{-2/3}) \le \mathcal{O}((\sigma_{1:t-1} + \lambda_{0:t-1})^{-1/3})$. Combining all terms, the regret with switching cost (C.6) can be bounded by
$$\mathrm{Reg}^{\textsc{sc-lip}}_T \le \tilde{\mathcal{O}}\bigg(T^{1/3} + \lambda_{1:T} + \sum_{t=1}^{T}(\sigma_{1:t-1} + \lambda_{0:t-1})^{-1/3}\bigg) \le \tilde{\mathcal{O}}\bigg(T^{1/3} + \lambda_{1:T-1} + \sum_{t=1}^{T}(\sigma_{1:t} + \lambda_{0:t})^{-1/3}\bigg)$$
for any regularization coefficients $\{\lambda_t\}_{t=1}^{T} \subseteq (0, 1)$, where the last step is due to $\lambda_T \in (0, 1)$ and $(\sigma_{1:0} + \lambda_0)^{-1/3} \le \lambda_0^{-1/3} \le 1$ ($\lambda_0 \ge 1$ because of Lemma 1). Later, following Lemma 22 of Luo et al. (2022), by setting the regularization coefficient as $\lambda_t \propto (\sigma_{1:t} + \lambda_{0:t})^{-1/3}$, the regret of BCO with switching cost for Lipschitz functions can be bounded in terms of the optimal regularization coefficients $\{\lambda^\star_t\}_{t=1}^{T}$:
$$\mathrm{Reg}^{\textsc{sc-lip}}_T \le \tilde{\mathcal{O}}\bigg(\inf_{\lambda^\star_1, \ldots, \lambda^\star_T}\bigg\{T^{1/3} + \lambda^\star_{1:T-1} + \sum_{t=1}^{T}(\sigma_{1:t} + \lambda^\star_{0:t})^{-1/3}\bigg\}\bigg),$$
which finishes the proof of the Lipschitzness case.

Smoothness Case. To begin with, the unary regret part can be upper-bounded using the guarantee of Luo et al. (2022) for smooth functions. We restate it in Lemma 10 with the corresponding proof in Appendix C.1 for self-containedness. In the following, we focus on the switching cost terms. Since $w_t = \bar{w}_t + H_t^{-1/2} b$, $\|w_t - \bar{w}_t\|_2^2$ can be bounded as
$$\|w_t - \bar{w}_t\|_2^2 = \big\|H_t^{-1/2} b\big\|_2^2 \le \mathcal{O}\big((\sigma_{1:t-1} + \lambda_{0:t-1})^{-1/2}\big),$$
where the last step is because $H_t \succeq \nabla^2\Psi(\bar{w}_t) + \eta_t(\sigma_{1:t-1} + \lambda_{0:t-1}) I$ and the step size setup (3.6). Since $H_t \succeq \eta_t(\sigma_{1:t-1} + \lambda_{0:t-1}) I$, we obtain
$$\|\bar{w}_t - \bar{w}_{t-1}\|_2 \le \frac{\|\bar{w}_t - \bar{w}_{t-1}\|_{H_t}}{\eta_t(\sigma_{1:t-1} + \lambda_{0:t-1})} \le \frac{\eta_1}{\eta_t(\sigma_{1:t-1} + \lambda_{0:t-1})} \le \mathcal{O}\big((\sigma_{1:t-1} + \lambda_{0:t-1})^{-1/2}\big),$$
where the second step is due to Lemma 1 and the last step is due to the step size setup (3.6) and $\eta_1 = \mathcal{O}(1)$. Accordingly, $\|\bar{w}_t - \bar{w}_{t-1}\|_2^2 \le \mathcal{O}((\sigma_{1:t-1} + \lambda_{0:t-1})^{-1})$. In the following, we show that the above term can be absorbed by $\mathcal{O}((\sigma_{1:t-1} + \lambda_{0:t-1})^{-1/2})$. Specifically, due to (3.6) and $\lambda_t \in (0, 1)$, it holds that $d\sqrt{\beta_f + 1}/\sqrt{\sigma_{1:t} + \lambda_{0:t}} < 1$, which implies $\sqrt{\sigma_{1:t} + \lambda_{0:t}} > d\sqrt{\beta_f + 1} > 1$. As a result, $\sigma_{1:t} + \lambda_{0:t} > 1$, and thus $\mathcal{O}((\sigma_{1:t-1} + \lambda_{0:t-1})^{-1}) \le \mathcal{O}((\sigma_{1:t-1} + \lambda_{0:t-1})^{-1/2})$. Combining all terms, the regret with switching cost (C.7) can be bounded by
$$\mathrm{Reg}^{\textsc{sc-sm}}_T \le \tilde{\mathcal{O}}\bigg(\sqrt{T} + \lambda_{1:T} + \sum_{t=1}^{T}(\sigma_{1:t-1} + \lambda_{0:t-1})^{-1/2}\bigg) \le \tilde{\mathcal{O}}\bigg(\sqrt{T} + \lambda_{1:T-1} + \sum_{t=1}^{T}(\sigma_{1:t} + \lambda_{0:t})^{-1/2}\bigg)$$
for any regularization coefficients $\{\lambda_t\}_{t=1}^{T} \subseteq (0, 1)$, where the last step is due to $\lambda_T \in (0, 1)$ and $(\sigma_{1:0} + \lambda_0)^{-1/2} \le \lambda_0^{-1/2} \le 1$ ($\lambda_0 \ge 1$ because of Lemma 1). Later, following Lemma 6 of Luo et al. (2022), by setting the regularization coefficient as $\lambda_t \propto (\sigma_{1:t} + \lambda_{0:t})^{-1/2}$, the regret of BCO with switching cost for smooth functions can be bounded in terms of the optimal regularization coefficients $\{\lambda^\star_t\}_{t=1}^{T}$:
$$\mathrm{Reg}^{\textsc{sc-sm}}_T \le \tilde{\mathcal{O}}\bigg(\inf_{\lambda^\star_1, \ldots, \lambda^\star_T}\bigg\{\sqrt{T} + \lambda^\star_{1:T-1} + \sum_{t=1}^{T}(\sigma_{1:t} + \lambda^\star_{0:t})^{-1/2}\bigg\}\bigg),$$
which finishes the proof.
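A practical remark on the tuning above: since $\lambda_t$ appears inside its own prefix sum $\lambda_{0:t}$, the rule $\lambda_t \propto (\sigma_{1:t} + \lambda_{0:t})^{-1/3}$ is a per-round scalar fixed-point equation. The sketch below (our illustration only; the constant $c$, the clipping to $(0,1)$, and the function name are assumptions, and it is not the tuning procedure of Luo et al. (2022)) computes such a sequence by a few fixed-point iterations per round:

```python
import numpy as np

def self_tuned_lambdas(sigmas, lam0, power=1 / 3, c=1.0, iters=20):
    """Compute lambda_t ~ c * (sigma_{1:t} + lambda_{0:t})^(-power), round by round.

    Each round solves the 1-D fixed point x = c * (S_t + L_{t-1} + x)^(-power)
    by iteration, where S_t = sigma_{1:t} and L_{t-1} = lambda_{0:t-1}."""
    lambdas = []
    S, L = 0.0, lam0
    for sigma_t in sigmas:
        S += sigma_t
        x = 1.0
        for _ in range(iters):                  # contraction for power < 1
            x = c * (S + L + x) ** (-power)
        x = min(max(x, 1e-12), 1.0 - 1e-12)     # keep lambda_t inside (0, 1)
        L += x
        lambdas.append(x)
    return np.array(lambdas)

# e.g., corrupted quadraticity: first 1000 rounds convex, then curvature alpha = 1
sig = np.concatenate([np.zeros(1000), np.ones(9000)])
lam = self_tuned_lambdas(sig, lam0=1.0)
print(lam[:3], lam[-3:])
```

The printed values shrink once curvature accumulates, matching the intuition that regularization is only needed while the observed curvature is small.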
D. Omitted Proofs of Section 3.3 (Bandit LQR Control)

In this section, we provide the omitted proofs of Section 3.3, including Theorem 3 and the corollaries covering cost functions with pure convexity and pure quadraticity (Corollaries 1 and 2), corrupted quadraticity (Corollaries 3 and 4), and decaying quadraticity (Corollaries 5 and 6).

D.1. Proof of Theorem 3

Proof. To begin with, we decompose the regret in the following way:
$$\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] = \mathbb{E}\bigg[\sum_{t=1}^{T} c_t(y_t, u_t)\bigg] - \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t(y^\pi_t, u^\pi_t) \le \mathbb{E}\bigg[\sum_{t=1}^{T} c_t(y_t, u_t)\bigg] - \sum_{t=1}^{T} c_t\big(y_t(M^\star), u_t(M^\star)\big)$$
$$\le \underbrace{\mathbb{E}\bigg[\sum_{t=1}^{m+H} c_t(y_t, u_t)\bigg]}_{\textsc{burn-in}} + \underbrace{\mathbb{E}\bigg[\sum_{t=m+H+1}^{T}\big(c_t(y_t, u_t) - f_t(M_{[t-H:t]})\big)\bigg]}_{\textsc{term (i)}} + \underbrace{\mathbb{E}\bigg[\sum_{t=m+H+1}^{T}\big(\tilde{f}_t(M^\star) - c_t(y_t(M^\star), u_t(M^\star))\big)\bigg]}_{\textsc{term (ii)}} + \underbrace{\mathbb{E}\bigg[\sum_{t=m+H+1}^{T}\big(f_t(M_{[t-H:t]}) - \tilde{f}_t(M^\star)\big)\bigg]}_{\textsc{regret}},$$
where the first inequality is due to the definition of the optimal fixed policy $M^\star \in \arg\min_{M \in \mathcal{M}} \sum_{t=1}^{T} c_t(y_t(M), u_t(M))$, which is fixed because the cost functions $\{c_t(\cdot, \cdot)\}_{t=1}^{T}$ are chosen by an oblivious adversary, and the second inequality drops the comparator's non-negative costs in the first $m + H$ rounds. For Lipschitz costs, by choosing the memory length $H = \Theta(\log T)$, the burn-in cost can be bounded by $\sum_{t=1}^{m+H} \mathbb{E}[c_t(y_t, u_t)] \le \tilde{\mathcal{O}}(1)$. Term (i) is exactly zero due to the property of the with-history loss functions, i.e., $f_t(M_{[t-H:t]}) = c_t(y_t, u_t)$. Term (ii) is the truncation error, which is at most $\tilde{\mathcal{O}}(1)$ due to Lemma 6.

Finally, it remains to deal with the Regret term. We first decompose it further:
$$\mathbb{E}\bigg[\sum_{t=m+H+1}^{T}\big(f_t(M_{[t-H:t]}) - \tilde{f}_t(M^\star)\big)\bigg] = \underbrace{\mathbb{E}\bigg[\sum_{t=m+H+1}^{T}\big(f_t(M_{[t-H:t]}) - \tilde{\bar{f}}_t(\bar{M}_t)\big)\bigg]}_{\textsc{memory}} + \underbrace{\mathbb{E}\bigg[\sum_{t=m+H+1}^{T}\big(\tilde{\bar{f}}_t(\bar{M}_t) - \tilde{\bar{f}}_t(\bar{M}^\star)\big)\bigg]}_{\textsc{unary-reg}},$$
where $\bar{f}_t : \bar{\mathcal{M}}^{H+1} \mapsto \mathbb{R}$ denotes the loss function in the lifted domain such that $\bar{f}_t(\bar{M}_1, \ldots, \bar{M}_{H+1}) \triangleq f_t(M_1, \ldots, M_{H+1})$ for any $\bar{M}_i = (M_i, 1)$. Its unary version is correspondingly defined as $\tilde{\bar{f}}_t(\bar{M}) \triangleq \bar{f}_t(\bar{M}, \ldots, \bar{M})$ for any $\bar{M} \in \bar{\mathcal{M}}$. It can be verified that $\tilde{\bar{f}}_t(\cdot)$ still inherits the $\sigma_t$-strong convexity, $L_f$-Lipschitzness, and $\beta_f$-smoothness of the original $\tilde{f}_t(\cdot)$. Denoting by $S$ the set of rounds in which the algorithm updates, the unary regret only becomes $3\bar{H}$ times larger (Lemma 7):
$$\textsc{unary-reg} \le 3\bar{H} \cdot \mathbb{E}\bigg[\sum_{t \in S} \tilde{\bar{f}}_t(\bar{M}_t) - \sum_{t \in S} \tilde{\bar{f}}_t(\bar{M}^\star)\bigg].$$
We note that, since the comparator $M^\star$ (and thus $\bar{M}^\star$) is fixed, we can easily extend Lemma 7 to a non-oblivious adversary.

To further deal with the unary regret, we denote by $f_{t;H}(M) \triangleq \mathbb{E}_{t-H-1}[\tilde{f}_t(M)]$ the expected version of $\tilde{f}_t(\cdot)$ for any $M \in \mathcal{M}$. Due to Lemma 8, $f_{t;H}(\cdot)$ is $\sigma_t$-strongly convex. It is easy to verify that $\bar{f}_{t;H}(\bar{M}) \triangleq \mathbb{E}_{t-H-1}[\tilde{\bar{f}}_t(\bar{M})]$ is also $\sigma_t$-strongly convex for any $\bar{M} = (M, 1) \in \bar{\mathcal{M}}$. As a result, the $\mathbb{E}[\tilde{\bar{f}}_t(\bar{M}_t) - \tilde{\bar{f}}_t(\bar{M}^\star)]$ term in the unary regret can be further transformed into
$$\mathbb{E}\big[\tilde{\bar{f}}_t(\bar{M}_t) - \tilde{\bar{f}}_t(\bar{M}^\star)\big] = \mathbb{E}\Big[\mathbb{E}_{t-H-1}\big[\tilde{\bar{f}}_t(\bar{M}_t) - \tilde{\bar{f}}_t(\bar{M}^\star)\big]\Big] = \mathbb{E}\big[\bar{f}_{t;H}(\bar{M}_t) - \bar{f}_{t;H}(\bar{M}^\star)\big] \le \mathbb{E}\Big[\big\langle\nabla \bar{f}_{t;H}(\bar{M}_t), \bar{M}_t - \bar{M}^\star\big\rangle - \frac{\sigma_t}{2}\|\bar{M}_t - \bar{M}^\star\|_{\mathrm{F}}^2\Big]$$
$$= \underbrace{\mathbb{E}\Big[\big\langle g_t, \bar{M}_t - \bar{M}^\star\big\rangle - \frac{\sigma_t}{2}\|\bar{M}_t - \bar{M}^\star\|_{\mathrm{F}}^2\Big]}_{\textsc{opt-term}} + \underbrace{\mathbb{E}\big[\big\langle\nabla \bar{f}_{t;H}(\bar{M}_t) - g_t, \bar{M}_t - \bar{M}^\star\big\rangle\big]}_{\textsc{bias-term}},$$
where the first line is due to the definition of $\bar{f}_{t;H}(\cdot)$ and the inequality follows from the strong convexity of $\bar{f}_{t;H}(\cdot)$. The Opt-Term can be optimized in a deterministic way following the same analysis as in Lemma 9 and Lemma 10.
To handle the Bias-Term, we give it a further decomposition:
$$\textsc{bias-term} = \mathbb{E}\big[\big\langle\nabla \bar{f}_{t;H}(\bar{M}_t) - \nabla\tilde{\bar{f}}_t(\bar{M}_t), \bar{M}_t - \bar{M}^\star\big\rangle\big] + \mathbb{E}\big[\big\langle\nabla\tilde{\bar{f}}_t(\bar{M}_t) - g_t, \bar{M}_t - \bar{M}^\star\big\rangle\big],$$
where the first term satisfies
$$\mathbb{E}\big[\big\langle\nabla \bar{f}_{t;H}(\bar{M}_t) - \nabla\tilde{\bar{f}}_t(\bar{M}_t), \bar{M}_t - \bar{M}^\star\big\rangle\big] = \mathbb{E}\big[\big\langle\mathbb{E}_t\big[\nabla \bar{f}_{t;H}(\bar{M}_t) - \nabla\tilde{\bar{f}}_t(\bar{M}_t)\big], \bar{M}_t - \bar{M}^\star\big\rangle\big] = 0 \qquad \text{(D.1)}$$
because of the definition of $\bar{f}_{t;H}(\cdot)$, where $\mathbb{E}_t[\cdot]$ is taken over the randomness up to the $t$-th round. The second term satisfies
$$\mathbb{E}\big[\big\langle\nabla\tilde{\bar{f}}_t(\bar{M}_t) - g_t, \bar{M}_t - \bar{M}^\star\big\rangle\big] = \mathbb{E}\big[\big\langle\mathbb{E}_t\big[\nabla\tilde{\bar{f}}_t(\bar{M}_t) - g_t\big], \bar{M}_t - \bar{M}^\star\big\rangle\big] = 0 \qquad \text{(D.2)}$$
because the gradient estimator $g_t$ is unbiased for the true gradient $\nabla\tilde{\bar{f}}_t(\bar{M}_t)$ (strictly speaking, $g_t$ is an unbiased estimator of the gradient of the smoothed version of $\tilde{\bar{f}}_t$ at $\bar{M}_t$). Note that in the first step of both (D.1) and (D.2), given the randomness up to the $t$-th round, the variable $\bar{M}_t - \bar{M}^\star$ is deterministic, mainly because $\bar{M}^\star$ is fixed as discussed before. Otherwise, when facing a general non-oblivious adversary, this step would not hold due to the randomness of the comparator.

As for the memory part, since $\hat{\bar{M}}_t = \bar{M}_t + H_t^{-1/2} s_t$, i.e., $\mathcal{R}[\hat{\bar{M}}_t] = \hat{M}_t$, if the costs are Lipschitz, which implies the $L_f$-Lipschitzness of $\bar{f}_t$, the memory part can be bounded by
$$\tilde{\mathcal{O}}\bigg(\sum_{t \in S}\Big(\|\bar{M}_{t+1} - \bar{M}_t\|_{\mathrm{F}} + 2\|\hat{\bar{M}}_t - \bar{M}_t\|_{\mathrm{F}}\Big)\bigg).$$
If the costs are smooth, which implies the $\beta_f$-smoothness of $\tilde{\bar{f}}_t$, we obtain
$$\tilde{\mathcal{O}}\bigg(\sum_{t \in S}\Big(L_f\|\bar{M}_{t+1} - \bar{M}_t\|_{\mathrm{F}} + \beta_f\|\bar{M}_{t+1} - \bar{M}_t\|_{\mathrm{F}}^2 + 6\beta_f\|\hat{\bar{M}}_t - \bar{M}_t\|_{\mathrm{F}}^2\Big)\bigg).$$
Thus we successfully reduce the problem to bandit convex optimization with switching cost (BCO-SC), so we can directly use the parameter configurations of Theorem 2 and obtain the same regret guarantees therein. Finally, noticing that the strong convexity parameter $\alpha_t$ of the cost function $c_t(\cdot, \cdot)$ enters linearly in the strong convexity parameter $\sigma_t$ of the with-history loss function $\tilde{f}_t(\cdot)$, applying the second part of Lemma 8 finishes the proof.

D.2. Proofs of Corollaries 1-6

Proof of Corollary 1. Since Theorem 3 holds for any non-negative sequence $\{\lambda^\star_t\}_{t=1}^{T}$, when $\{c_t(\cdot)\}_{t=1}^{T}$ are convex, choosing $\lambda^\star_1 = T^{3/4}$ and $\lambda^\star_t = 0$ for $t \ge 2$ yields $\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}(T^{3/4})$. If $\{c_t(\cdot)\}_{t=1}^{T}$ are $\alpha$-quadratic, choosing $\lambda^\star_t = 0$ for all $t \in [T]$ yields $\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}(\alpha^{-1/3} T^{2/3})$. The proof is finished.

Proof of Corollary 2. Since Theorem 3 holds for any non-negative sequence $\{\lambda^\star_t\}_{t=1}^{T}$, when $\{c_t(\cdot)\}_{t=1}^{T}$ are convex, choosing $\lambda^\star_1 = T^{2/3}$ and $\lambda^\star_t = 0$ for $t \ge 2$ yields $\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}(T^{2/3})$. If $\{c_t(\cdot)\}_{t=1}^{T}$ are $\alpha$-quadratic, choosing $\lambda^\star_t = 0$ for all $t \in [T]$ yields $\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}(\sqrt{T/\alpha})$. The proof is finished.

Proof of Corollary 3. When the first $M$ online functions are convex and the remaining ones are $\alpha$-quadratic, the regret upper bound in Theorem 3 is the largest. By choosing $\lambda^\star_t = 0$ for $t \ge 2$, it holds that
$$\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}\bigg(T^{1/3} + \lambda^\star_{1:T-1} + \sum_{t=1}^{T}(\alpha_{1:t} + \lambda^\star_{0:t})^{-1/3}\bigg) \le \tilde{\mathcal{O}}\Big(\lambda^\star_1 + M(\lambda^\star_1)^{-1/3} + \alpha^{-1/3}(T - M)^{2/3}\Big),$$
where the last step omits the lower-order term $\tilde{\mathcal{O}}(T^{1/3})$. Choosing $\lambda^\star_1 = M^{3/4}$ finishes the proof of the first part. When the $\alpha$-quadratic functions appear in the first $(T - M)$ rounds, the above guarantee can be strengthened. Specifically, by choosing $\lambda^\star_t = 0$ for all $t \in [T]$, we obtain
$$\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}\bigg(T^{1/3} + \lambda^\star_{1:T-1} + \sum_{t=1}^{T}(\alpha_{1:t} + \lambda^\star_{0:t})^{-1/3}\bigg) \le \tilde{\mathcal{O}}\Big(\alpha^{-1/3}(T - M)^{2/3} + \alpha^{-1/3} M (T - M)^{-1/3}\Big),$$
where the first term represents the regret over the first $(T - M)$ rounds with quadratic functions and the second term is the regret over the last $M$ rounds with convex functions, finishing the proof.
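As a quick sanity check on the arithmetic behind Corollary 3 (purely illustrative; the dropped constants, the $\lambda^\star_0$ term, and all logarithmic factors are our simplifications), one can evaluate the Theorem 3 bound numerically under the corrupted pattern and compare it with the claimed scaling:

```python
import numpy as np

def thm3_lipschitz_bound(alphas, lam1):
    """T^(1/3) + lambda_1 + sum_t (alpha_{1:t} + lambda_1)^(-1/3), with lambda_t = 0 for t >= 2
    (constants and log factors dropped)."""
    T = len(alphas)
    prefix = np.cumsum(alphas)                                       # alpha_{1:t}
    return T ** (1 / 3) + lam1 + np.sum((prefix + lam1) ** (-1 / 3))

T, M, alpha = 100_000, 10_000, 1.0
alphas = np.concatenate([np.zeros(M), alpha * np.ones(T - M)])       # first M rounds convex
bound = thm3_lipschitz_bound(alphas, lam1=M ** 0.75)
claim = M ** 0.75 + alpha ** (-1 / 3) * (T - M) ** (2 / 3)
print(f"evaluated bound ~ {bound:.0f},  claimed scaling M^(3/4) + alpha^(-1/3)(T-M)^(2/3) ~ {claim:.0f}")
```

Both quantities are of the same order, matching the choice $\lambda^\star_1 = M^{3/4}$ made in the proof above.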
Proof of Corollary 4. When the first $M$ online functions are convex and the remaining ones are $\alpha$-quadratic, the regret upper bound in Theorem 3 is the largest. By choosing $\lambda^\star_t = 0$ for $t \ge 2$, it holds that
$$\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}\bigg(\sqrt{T} + \lambda^\star_{1:T-1} + \sum_{t=1}^{T}(\alpha_{1:t} + \lambda^\star_{0:t})^{-1/2}\bigg) \le \tilde{\mathcal{O}}\Big(\sqrt{T} + \lambda^\star_1 + M(\lambda^\star_1)^{-1/2} + \alpha^{-1/2}(T - M)^{1/2}\Big).$$
Choosing $\lambda^\star_1 = M^{2/3}$ finishes the proof of the first part. When the $\alpha$-quadratic functions appear in the first $(T - M)$ rounds, the above guarantee can be strengthened. Specifically, by choosing $\lambda^\star_t = 0$ for all $t \in [T]$, we obtain
$$\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}\bigg(\sqrt{T} + \lambda^\star_{1:T-1} + \sum_{t=1}^{T}(\alpha_{1:t} + \lambda^\star_{0:t})^{-1/2}\bigg) \le \tilde{\mathcal{O}}\Big(\sqrt{T} + \alpha^{-1/2}(T - M)^{1/2} + \alpha^{-1/2} M (T - M)^{-1/2}\Big),$$
where the $\alpha^{-1/2}(T - M)^{1/2}$ term represents the regret over the first $(T - M)$ rounds with quadratic functions and the $\alpha^{-1/2} M (T - M)^{-1/2}$ term is the regret over the last $M$ rounds with convex functions, finishing the proof.

Proof of Corollary 5. To begin with, choosing $\lambda^\star_1 = T^{b}$ and $\lambda^\star_t = 0$ for $t \ge 2$, from Theorem 3 we obtain
$$\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}\bigg(T^{b} + \sum_{t=1}^{T}\big(t^{1-\gamma} + \lambda^\star_1\big)^{-1/3}\bigg) \le \tilde{\mathcal{O}}\Big(T^{b} + \min\big\{T^{2/3 + \gamma/3},\, T^{1 - b/3}\big\}\Big),$$
where the first step omits the lower-order term $\tilde{\mathcal{O}}(T^{1/3})$ and uses $\alpha_{1:t} = \sum_{s=1}^{t} s^{-\gamma} = \mathcal{O}(t^{1-\gamma})$. In the following, we discuss the above upper bound case by case. If $2/3 + \gamma/3 \le 1 - b/3$, i.e., $\gamma + b \le 1$, it holds that $\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}(T^{b} + T^{2/3 + \gamma/3})$. To minimize the upper bound, we set $b = 2/3 + \gamma/3$ and achieve $\tilde{\mathcal{O}}(T^{2/3 + \gamma/3})$. Combining $\gamma + b \le 1$ and $b = 2/3 + \gamma/3$ gives the constraint $\gamma \le 1/4$. Otherwise, if $2/3 + \gamma/3 > 1 - b/3$, i.e., $\gamma + b > 1$, we obtain $\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}(T^{b} + T^{1 - b/3})$. Choosing $b = 3/4$ gives an $\tilde{\mathcal{O}}(T^{3/4})$ regret bound. Combining $\gamma + b > 1$ and $b = 3/4$ gives the constraint $\gamma > 1/4$.

Proof of Corollary 6. To begin with, choosing $\lambda^\star_1 = T^{b}$ and $\lambda^\star_t = 0$ for $t \ge 2$, from Theorem 3 we obtain
$$\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}\bigg(\sqrt{T} + T^{b} + \sum_{t=1}^{T}\big(t^{1-\gamma} + \lambda^\star_1\big)^{-1/2}\bigg) \le \tilde{\mathcal{O}}\Big(\sqrt{T} + T^{b} + \min\big\{T^{1/2 + \gamma/2},\, T^{1 - b/2}\big\}\Big),$$
where the first step uses $\alpha_{1:t} = \sum_{s=1}^{t} s^{-\gamma} = \mathcal{O}(t^{1-\gamma})$. In the following, we discuss the above upper bound case by case. If $1/2 + \gamma/2 \le 1 - b/2$, i.e., $\gamma + b \le 1$, it holds that $\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}(T^{b} + T^{1/2 + \gamma/2})$. To minimize the upper bound, we set $b = 1/2 + \gamma/2$ and achieve $\tilde{\mathcal{O}}(T^{1/2 + \gamma/2})$. Combining $\gamma + b \le 1$ and $b = 1/2 + \gamma/2$ gives the constraint $\gamma \le 1/3$. Otherwise, if $1/2 + \gamma/2 > 1 - b/2$, i.e., $\gamma + b > 1$, we obtain $\mathbb{E}[\mathrm{Reg}^{\mathrm{C}}_T] \le \tilde{\mathcal{O}}(\sqrt{T} + T^{b} + T^{1 - b/2})$. Choosing $b = 2/3$ gives an $\tilde{\mathcal{O}}(T^{2/3})$ regret bound. Combining $\gamma + b > 1$ and $b = 2/3$ gives the constraint $\gamma > 1/3$.

E. Technical Lemmas

In this section, we provide technical lemmas on the relationship between Lipschitzness and strong convexity (Lemma 11) and a basic regret guarantee for FTRL (Lemma 12).

Lemma 11 (Lemma 31 of Luo et al. (2022)). If a convex function $f : \mathcal{W} \mapsto \mathbb{R}$ is $L$-Lipschitz and $\sigma$-strongly convex, and the domain has bounded diameter $\max_{w_1, w_2 \in \mathcal{W}} \|w_1 - w_2\|_2 \le D$, then it holds that $\sigma \le 4L/D$.

Lemma 12. Let $\mathcal{W} \subseteq \mathbb{R}^d$ be a closed and convex feasible set, and denote by $\psi_t : \mathcal{W} \mapsto \mathbb{R}$ the convex regularizers and by $h_t : \mathcal{W} \mapsto \mathbb{R}$ the convex online functions. Denoting $F_t(w) = \sum_{s=1}^{t-1} h_s(w) + \psi_t(w)$, if the FTRL update rule is specified as $w_t \in \arg\min_{w \in \mathcal{W}} F_t(w)$, then for any $w \in \mathcal{W}$, we have
$$\sum_{t=1}^{T} h_t(w_t) - \sum_{t=1}^{T} h_t(w) \le \psi_{T+1}(w) - \psi_1(w_1) + \sum_{t=1}^{T} \nabla h_t(w_t)^\top(w_t - w_{t+1}) + \sum_{t=1}^{T}\big(\psi_t(w_{t+1}) - \psi_{t+1}(w_{t+1})\big).$$
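As an illustration of Lemma 12 (our own numerical sanity check, not part of the paper's analysis; the losses, regularizers, and domain below are assumptions chosen for simplicity), the following snippet runs FTRL with a time-varying quadratic regularizer on one-dimensional quadratic losses over $[-1, 1]$ and verifies that the stated inequality holds along the run:

```python
import numpy as np

rng = np.random.default_rng(1)
T, lo, hi = 200, -1.0, 1.0                       # decision set W = [-1, 1]
centers = rng.uniform(lo, hi, size=T)            # h_t(w) = (w - c_t)^2
etas = 1.0 / np.sqrt(np.arange(1, T + 2))        # psi_t(w) = w^2 / (2 eta_t)

def h(t, w):      return (w - centers[t]) ** 2
def grad_h(t, w): return 2.0 * (w - centers[t])
def psi(t, w):    return w ** 2 / (2.0 * etas[t])

# FTRL: w_t = argmin_{w in [lo, hi]} sum_{s < t} h_s(w) + psi_t(w); closed form, then clip
ws = np.zeros(T + 1)
for t in range(T + 1):
    num, den = 2.0 * centers[:t].sum(), 2.0 * t + 1.0 / etas[t]
    ws[t] = np.clip(num / den, lo, hi)

w_cmp = np.clip(centers.mean(), lo, hi)          # any fixed comparator works
lhs = sum(h(t, ws[t]) - h(t, w_cmp) for t in range(T))
rhs = psi(T, w_cmp) - psi(0, ws[0]) \
      + sum(grad_h(t, ws[t]) * (ws[t] - ws[t + 1]) for t in range(T)) \
      + sum(psi(t, ws[t + 1]) - psi(t + 1, ws[t + 1]) for t in range(T))
print(f"regret = {lhs:.3f} <= bound = {rhs:.3f}: {lhs <= rhs + 1e-9}")
```

The bound holds for any comparator in the domain; the check above simply picks the empirical loss minimizer for concreteness.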